Compare commits

...

2422 Commits

Author SHA1 Message Date
copilot-swe-agent[bot]
6ce6c33b65 test/alternator: fix copyright year to 2026
Co-authored-by: mykaul <4655593+mykaul@users.noreply.github.com>
2026-03-03 16:28:02 +00:00
copilot-swe-agent[bot]
fe77675455 test/alternator: rename fixture, split x/y attrs, reorder tests, fix index naming check
Co-authored-by: nyh <584227+nyh@users.noreply.github.com>
2026-03-03 15:46:23 +00:00
copilot-swe-agent[bot]
2fd5383bd0 test/alternator: add GSI/LSI naming and index key encoding tests
Co-authored-by: nyh <584227+nyh@users.noreply.github.com>
2026-03-03 15:28:34 +00:00
copilot-swe-agent[bot]
c7969f7a46 test/alternator: minor cleanup in test_encoding.py per review
Co-authored-by: nyh <584227+nyh@users.noreply.github.com>
2026-03-03 15:16:11 +00:00
copilot-swe-agent[bot]
2c17c90825 test/alternator: address review feedback on test_encoding.py
Co-authored-by: nyh <584227+nyh@users.noreply.github.com>
2026-03-03 15:00:56 +00:00
copilot-swe-agent[bot]
e00dbfa334 test/alternator: add test_encoding.py to test Alternator's on-disk data encoding
Co-authored-by: nyh <584227+nyh@users.noreply.github.com>
2026-03-03 11:07:10 +00:00
copilot-swe-agent[bot]
744034eec6 Initial plan 2026-03-03 10:51:13 +00:00
Calle Wilund
69f8e722bf table::snapshot_table_on_all_shards: Use set to keep track of tablets in manifest
Fixes: SCYLLADB-828

Avoid iterating linear set of tablets when building manifest. Reduces complexity.

Closes scylladb/scylladb#28851
2026-03-03 08:09:33 +02:00
Karol Nowacki
30487e8854 index: fix vector index with filtering target column
The secondary index mechanism is currently used to determine the target column.
This mechanism works incorrectly for vector indexes with filtering because
it returns the last specified column as the target (vectors) column.
However, the syntax for a vector index requires the first column to be the target:
```
CREATE CUSTOM INDEX ON t(vectors, users) USING 'vector_index';
```

This discrepancy eventually leads to the following exception when performing an
ANN search on a vector index with filtering columns:
````
ANN ordering by vector requires the column to be indexed using 'vector_index'
````

This commit fixes the issue by introducing dedicated logic for vector indexes
to correctly identify the target(vectors) column.

Fixes: SCYLLADB-635

Closes scylladb/scylladb#28740
2026-03-02 18:47:58 +02:00
Sergey Zolotukhin
33923578eb Update DROP INDEX statement documentation
Clarify behavior of DROP INDEX during ongoing builds.

Closes scylladb/scylladb#28659
2026-03-02 17:31:23 +02:00
Botond Dénes
ab532882db tools/scylla-sstable: introduce scylla sstable split
Split input sstable(s) into multiple output sstables based on the provided
token boundaries. The input sstable(s) are divided according to the specified
split tokens, creating one output sstable per token range.

Fixes: SCYLLADB-10

Closes scylladb/scylladb#28741
2026-03-02 15:19:17 +01:00
Botond Dénes
bf3edaf220 tools/scylla-sstable: filter_operation(): use deferred_close() to close reader
Manual closing is bypassed with exceptions, promoting an exception to a
crash due to unclosed reader.

Closes scylladb/scylladb#28797
2026-03-02 14:16:08 +01:00
Marcin Maliszkiewicz
6bf706ef1b Merge 'scylla-sstable: query: handle nested UDTs' from Botond Dénes
The query (and in certain modes the write) operations uses virtual table facility inside `cql_test_env`. The schema of the sstable is created as a table in `cql_test_env`. This involves registering all UDTs with the keyspace, so they are available for lookups.
This was done with a flat loop over all column types, but this is not enough. UDTs might be nested in other types, like collections. One has to do a traversal of the type tree and register every UDT on the way.
This PR changes the flat loop to a recursive traversal of the type tree. The query operation now works with UDTs, no matter how deeply nested they are.

Backport: Implements missing functionality of a tool, no backport.

Closes scylladb/scylladb#28798

* github.com:scylladb/scylladb:
  tools/scylla-sstable: create_table_in_cql_env(): register UDTs recursively
  tools/scylla-sstable: generalize dump_if_user_type
  tools/scylla-sstable: move dump_if_user_type() definition
2026-03-02 14:14:43 +01:00
dependabot[bot]
f5fa77ac9d build(deps): bump sphinx-multiversion-scylla in /docs
Bumps [sphinx-multiversion-scylla](https://holzhaus.github.io/sphinx-multiversion/) from 0.3.4 to 0.3.7.

---
updated-dependencies:
- dependency-name: sphinx-multiversion-scylla
  dependency-version: 0.3.7
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Closes scylladb/scylladb#28833
2026-03-02 14:13:03 +01:00
Marcin Maliszkiewicz
a83ee6cf66 Merge 'db/batchlog_manager: re-add v1 support for mixed clusters' from Botond Dénes
3f7ee3ce5d introduced system.batchlog_v2, with a schema designed to speed up batchlog replays and make post-replay cleanups much more effective.
It did not introduce a cluster feature for the new table, because it is node local table, so the cluster can switch to the new table gradually, one node at a time.
However, https://github.com/scylladb/scylladb/issues/27886 showed that the switching causes timeouts during upgrades, in mixed clusters. Furthermore, switching to the new table unconditionally  on upgrades nodes, means that on rollback, the batches saved into the v2 table are lost.
This PR introduces re-introduces v1 (`system.batchlog`) support and guards the use of the v2 table with a cluster feature, so mixed clusters keep using v1 and thus be rollback-compatible.
The re-introduced v1 support doesn't support post-replay cleanups for simplicity. The cleanup in v1 was never particularly effective anyway and we ended up disabling it for heavy batchlog users, so I don't think the lack of support for cleanup is a problem.

Fixes: https://github.com/scylladb/scylladb/issues/27886

Needs backport to 2026.1, to fix upgrades for clusters using batches

Closes scylladb/scylladb#28736

* github.com:scylladb/scylladb:
  test/boost/batchlog_manager_test: add tests for v1 batchlog
  test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2
  test/boost/batchlog_manager_test: fix indentation
  test/boost/batchlog_manager_test: extract prepare_batches() method
  test/lib/cql_assertions: is_rows(): add dump parameter
  tools/scylla-sstable: extract query result printers
  tools/scylla-sstable: add std::ostream& arg to query result printers
  repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log
  db/batchlog_manager: re-add v1 support
  db/batchlog_manager: return all_replayed from process_batch()
  db/batchlog_manager: process_bath() fix indentation
  db/batchlog_manager: make batch() a standalone function
  db/batchlog_manager: make structs stats public
  db/batchlog_manager: allocate limiter on the stack
  db/batchlog_manager: add feature_service dependency
  gms/feature_service: add batchlog_v2 feature
2026-03-02 12:09:10 +01:00
Patryk Jędrzejczak
ba7f314cdc test: test_full_shutdown_during_replace: retry replace after the replacing node is removed from gossip
The test is currently flaky with `reuse_ip = True`. The issue is that the
test retries replace before the first replace is rolled back and the
first replacing node is removed from gossip. The second replacing node
can see the entry of the first replacing node in gossip. This entry has
a newer generation than the entry of the node being replaced, and both
replacing nodes have the same IP as the node being replaced. Therefore,
the second replacing node incorrectly considers this entry as the entry
of the node being replaced. This entry is missing rack and DC, so the
second replace fails with
```
ERROR 2026-02-24 21:19:03,420 [shard 0:main] init - Startup failed:
std::runtime_error (Cannot replace node
8762a9d2-3b30-4e66-83a1-98d16c5dd007/127.61.127.1 with a node on
a different data center or rack.
Current location=UNKNOWN_DC/UNKNOWN_RACK, new location=dc1/rack2)
```

Fixes SCYLLADB-805

Closes scylladb/scylladb#28829
2026-03-02 10:26:57 +02:00
Yaron Kaikov
ab02486ce8 .github/workflows/trigger-scylla-ci.yaml: fix org membership check in trigger-scylla-ci workflow
Following becb48b586 it seems we have a regression with trigger CI logic
The Verify Org Membership step used gh api /orgs/scylladb/members/$AUTHOR
with GITHUB_TOKEN to check if the user is an org member. However,
GITHUB_TOKEN does not have read:org scope, so the API call fails for all
users — even actual scylladb org members — causing CI triggers to be
silently skipped.

Replace the API call with the author_association field from the GitHub
event payload, which is set by GitHub itself and requires no special
token permissions. This allows any scylladb org member (MEMBER or OWNER)
to trigger CI via comment, regardless of whether they authored the PR.

Closes scylladb/scylladb#28837
2026-03-02 10:23:14 +02:00
Michael Litvak
8c4bc33e51 test: remove test_view_building_with_tablet_move
remove the test since it's not relevant anymore, it's not testing what
it's supposed to test and it's unstable.

the purpose of the test was to reproduce an issue in the legacy view
builder where a view starts to build at token T2 and then all tokens
[T1, end) with T1<T2 migrate to another node while it's still building,
exposing an issue when the view builder wraparounds the token ring.

this is not relevant anymore because now view building with tablets is
done via the view building coordinator for tablets, and all views start
to build from the first token with no wraparound.

besides, the test is unstable due to relying too much on specific
timing, which was useful for investigating and fixing the original issue
but not anymore.

Fixes SCYLLADB-842

Closes scylladb/scylladb#28842
2026-03-02 07:42:08 +01:00
Marcin Maliszkiewicz
8c2da76fde test/cqlpy: remove xfail from test_constant_function_parameter
The issue was fixed by commit cc03f5c89d
("cql3: support literals and bind variables in selectors"), so the
xfail marker is no longer needed.

Closes scylladb/scylladb#28776
2026-03-01 20:03:42 +02:00
Jenkins Promoter
fb6eebc383 Update pgo profiles - aarch64 2026-03-01 05:17:44 +02:00
Jenkins Promoter
8edd532c05 Update pgo profiles - x86_64 2026-03-01 04:31:57 +02:00
Botond Dénes
1f09fcfb26 Merge 'Use standard ks/cf/data creation methods in test_restore_with_streaming_scopes' from Pavel Emelyanov
The test uses create_dataset helper duplicating the existing code that does the same. This PR patches basic tests to use standard facilities.

Also the PR simplifies the 3-level nested loops used to combine several sets of restoration parameters by using itertools.product facility.

Continuation of #28600.

Cleaning tests, not backporting

Closes scylladb/scylladb#28608

* github.com:scylladb/scylladb:
  test/object_store: Use itertools.product() for deeply nested loops
  test/object_store: Replace dataset creation usage with standard methods
  test/object_store: Shift indentation right for test_restore_with_streaming_scopes
2026-02-27 16:15:55 +02:00
Avi Kivity
450a09b152 test: tools: restrict embedded perf tests from taking over host
The perf-simple-query tests were not restricted on CPU count,
so on a 96-CPU machine, they would run on 96 CPUs, and time out
in debug mode.

All restrict memory usage and add --overprovisioned so that
pinning is disabled. Apply that to all tests.

Closes scylladb/scylladb#28821
2026-02-27 16:06:22 +02:00
Botond Dénes
d3a3921487 Merge 'Re-use and improve the take_snapshot() helper in backup tests' from Pavel Emelyanov
The helper is very simple yet generic -- it takes a snapshot of a keyspace on all servers and collects the resulting sstables from workdirs. Re-using it in all test cases saves some lines of code. Also, the method is "sequential", making it "parallel" reduces the waiting time a bit.

Will help generalizing existing backup/restore tests to support clustered snapshot/backup/restore API (see #28525) later.

Cleaning up tests, not backporting.

Closes scylladb/scylladb#28660

* github.com:scylladb/scylladb:
  test/backup: Run keyspace flush and snapshot taking API in parallel
  test/backup: Re-use take_snapshot() helper in do_abort_restore()
  test/backup: Move take_snapshot() helper up
2026-02-27 15:58:18 +02:00
Łukasz Paszkowski
fb40d1e411 compaction_manager: improve readability of maybe_wait_for_sstable_count_reduction()
Split the chained inject_parameter().value_or() expression into two
separate named variables for clarity.

Use condition_variable::when() instead of wait(). when() is the
coroutine-native API, designed specifically for co_await contexts —
it avoids the heap allocation of a promise_waiter that wait() uses,
and instead uses a stack-based awaiter.

Closes scylladb/scylladb#28824
2026-02-27 15:51:30 +02:00
Patryk Jędrzejczak
9a9202c909 Merge 'Remove gossiper topology code' from Gleb Natapov
The PR removes most of the code that assumes that group0 and raft topology is not enabled. It also makes sure that joining a cluster in no raft mode or upgrading a node in a cluster that not yet uses raft topology to this version will fail.

Refs #15422

No backport needed since this removes functionality.

Closes scylladb/scylladb#28514

* https://github.com/scylladb/scylladb:
  group0: fix indentation after previous patch
  raft_group0: simplify get_group0_upgrade_state function since no upgrade can happen any more
  raft_group0: move service::group0_upgrade_state to use fmt::formatter instead of iostream
  raft_group0: remove unused code from raft_group0
  node_ops: remove topology over node ops code
  topology: fix indentation after the previous patch
  topology: drop topology_change_enabled parameter from raft_group0 code
  storage_service: remove unused handle_state_* functions
  gossiper: drop wait_for_gossip_to_settle and deprecate correspondent option
  storage_service: fix indentation after the last patch
  storage_service: remove gossiper bootstrapping code
  storage_service: drop get_group_server_if_raft_topolgy_enabled
  storage_service: drop is_topology_coordinator_enabled and its uses
  storage_service: drop run_with_api_lock_in_gossiper_mode_only
  topology: remove code that assumes raft_topology_change_enabled() may return false
  test: schema_change_test: make test_schema_digest_does_not_change_with_disabled_features tests run in raft mode
  test: schema_change_test: drop schema tests relevant for no raft mode only
  topology: remove upgrade to raft topology code
  group0: remove upgrade to group0 code
  group0: refuse to boot if a cluster is still is not in a raft topology mode
  storage_service: refuse to join a cluster in legacy mode
2026-02-27 14:43:41 +01:00
Anna Stuchlik
dfd46ad3fb doc: add the upgrade guide from 2025.x to 2026.1
This commit adds the upgrade guide for version 2026.1.

According to the new upgrade policy, the user can now upgrade to the major version (2026.1)
from any previous minor version.
So instead of adding a separate guide form 2025.4 to 2026.1, we need a guide from 2025.x to 2026.1.

In addition, this commit:
- Updates the upgrade policy for reflect the above change.
- Removes the upgrade guides for the previous version.

Fixes https://github.com/scylladb/scylladb/issues/28533

Fixes  https://github.com/scylladb/scylladb/issues/28532

Closes scylladb/scylladb#28789
2026-02-27 15:36:34 +02:00
Botond Dénes
fcc570c697 Merge 'Exorcise assertions from Alternator, using a new throwing_assert() macro' from Nadav Har'El
assert(), and SCYLLA_ASSERT() are evil (Refs #7871) because they can cause the entire Scylla cluster to crash mysteriously instead of cleanly failing the specific request that encountered a serious problem of failed pre-requisite.

In this two-patch series, in the first patch we introduce a new macro throwing_assert(), a convenient drop-in replacement for SCYLLA_ASSERT() but which has all the benefits of on_internal_error() instead of the dangers of SCYLLA_ASSERT().
In the second patch we use the new function to replace every call to SCYLLA_ASSERT() in Alternator by the new throwing_assert().

Here is an example from the second patch to demonstrate the power of this approach: The Alternator code uses the attrs_column() function to retrieve the ":attrs" column of a schema. Since every Alternator table always has an ":attrs" column in its schema, we felt safe to SCYLLA_ASSERT() that this column exists. However, imagine that one day because of a bug, one Alternator table is missing this column. Or maybe not a bug - maybe a malicious user on a shared cluster found a way to deliberately delete this column (e.g, with a CQL command!) and this check fails. Before this patch, the entire Scylla node will crash. If the same request is sent to all nodes - the entire cluster will crash. The user might not even know which request caused this crash. In contrast, after this patch, the specific operation - e.g., PutItem - will get an exception. Only this operation, and nothing else, will be aborted, and the user who sent this request will even get an "Internal Server Error" with the assertion-failure message, alerting them that this specific query is causing problems, while other queries might work normally.

There's no need to backport this patch - unless it becomes annoying that other branches don't have the throwing_assert() function and we want it to ease other backports.

Fixes #28308.

Closes scylladb/scylladb#28445

* github.com:scylladb/scylladb:
  alternator: replace SCYLLA_ASSERT with throwing_assert
  utils: introduce throwing_assert(), a safe replacement for assert
2026-02-27 15:35:36 +02:00
Roy Dahan
822c1597c9 install.sh: fix REST API paths for nonroot installations
In nonroot installations, the install.sh script was hardcoding the
api_ui_dir and api_doc_dir paths to /opt/scylladb/ in scylla.yaml,
even though the actual files were installed to a different location
(typically ~/scylladb). This caused REST API endpoints like
/api-doc/failure_detector/ to fail with "transfer closed with
outstanding read data remaining" error because Scylla couldn't find
the API documentation files at the configured paths.

Fix this by using the $prefix variable instead of hardcoded
/opt/scylladb/ paths. This ensures that:
- In regular installations: $prefix = /opt/scylladb (no change)
- In nonroot installations: $prefix = ~/scylladb (paths now correct)

Fixes: SCYLLADB-721

Backport: The hardcoded paths in install.sh have been present since
the nonroot installation feature was introduced, making REST API
endpoints non-functional in all nonroot installations across all
live versions of Scylla.

Closes scylladb/scylladb#28805
2026-02-27 15:32:54 +02:00
Botond Dénes
9521a51e4c Merge 'generic_server: scale connection concurrency semaphore by listener count' from Marcin Maliszkiewicz
The concurrency semaphore gates uninitialized connections across all
do_accepts loops, but was initialized to a fixed value regardless of
how many listeners exist. With multiple listeners competing for the
same units, each effectively gets less than the configured concurrency.

Initialize the semaphore to concurrency - 1 and signal 1 per listen()
call, so total capacity is concurrency - 1 + nr_listeners. This
guarantees each listener's accept loop can have at least one unit
available.

It mainly fixes problem when setting uninitialized_connections_semaphore_cpu_concurrency
config value to 1 would result in not being able to process connections, as only 1 out of 2
listeners got the semaphore.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-762
Backport: no, it's a minor problem

Closes scylladb/scylladb#28747

* github.com:scylladb/scylladb:
  test: add test_uninitialized_conns_semaphore
  generic_server: fix waiters count in shed log
  generic_server: scale connection concurrency semaphore by listener count
2026-02-27 15:06:50 +02:00
Łukasz Paszkowski
bb57b0f3b7 compaction_manager: fix maybe_wait_for_sstable_count_reduction() hanging forever
The futurization refactoring in 9d3755f276 ("replica: Futurize
retrieval of sstable sets in compaction_group_view") changed
maybe_wait_for_sstable_count_reduction() from a single predicated
wait:
```
    co_await cstate.compaction_done.wait([..] {
        return num_runs_for_compaction() <= threshold
            || !can_perform_regular_compaction(t);
    });
```
to a while loop with a predicated wait:
```
    while (can_perform_regular_compaction(t)
           && co_await num_runs_for_compaction() > threshold) {
        co_await cstate.compaction_done.wait([this, &t] {
            return !can_perform_regular_compaction(t);
        });
    }
```

This was necessary because num_runs_for_compaction() became a
coroutine (returns future<size_t>) and can no longer be called
inside a condition_variable predicate (which must be synchronous).

However, the inner wait's predicate — !can_perform_regular_compaction(t)
— only returns true when compaction is disabled or the table is being
removed. During normal operation, every signal() from compaction_done
wakes the waiter, the predicate returns false, and the waiter
immediately goes back to sleep without ever re-checking the outer
while loop's num_runs_for_compaction() condition.

This causes memtable flushes to hang forever in
maybe_wait_for_sstable_count_reduction() whenever the sstable run
count exceeds the threshold, because completed compactions signal
compaction_done but the signal is swallowed by the predicate.

Fix by replacing the predicated wait with a bare wait(), so that
any signal (including from completed compactions) causes the outer
while loop to re-evaluate num_runs_for_compaction().

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-610

Closes scylladb/scylladb#28801
2026-02-26 20:13:50 +02:00
Marcin Maliszkiewicz
a03ebe1a29 Merge 'cql: implement a new per-row TTL feature' from Nadav Har'El
This series implements a new per-row TTL feature for CQL. The per-row TTL feature was requested in issue #13000. It is a feature that does not exist in Cassandra, and was inspired by DynamoDB's TTL feature - and under the hood uses the same implementation that we used in Alternator to implement this DynamoDB feature.

The new per-row TTL feature is completely separate from CQL's existing per-write (and per-cell) TTL, and both will be available to users.

In the per-row TTL feature, one column in the table is designated as the "TTL" column, and its value for a row is the expiration time for that row. The TTL column can be designated at table creation time, e.g.:

```cql
CREATE TABLE tab (
    id int PRIMARY KEY,
    t text,
    expiration timestamp TTL
);
```

Or after the table already exists with:

```cql
ALTER TABLE tab TTL expiration
```

Expiration can also be disabled, with:

```cql
ALTER TABLE tab TTL NULL
```

The new per-row TTL feature has two features that users have been asking for:

1. A user can change the value of just the TTL column - without rewriting the entire row - to change the expiration time of the entire row.
2. When an expired row is finally deleted, a CDC event about this deletion appears in the CDC log (if CDC is enabled), including - if a preimage is enabled - the content of the deleted row.

To achieve the second goal (CDC events), a row is not guaranteed to disappear at exactly its expiration time (as CQL's original TTL feature guarantees). Rather, the row is deleted some time later, depending on `alternator_ttl_period_in_seconds`; Until the actual deletion, the row is still readable (and even writable). But we are guaranteed that when the row is finally deleted, the CDC event will come too.

The implementation uses the same background thread used by Alternator to periodically scan for expired items and delete them.

The expiration thread keeps the same metrics as it did for Alternator:
* `scylla_expiration_scan_passes`
* `scylla_expiration_scan_table`
* `scylla_expiration_items_deleted`
* `scylla_expiration_secondary_ranges_scanned`

The series begins with a few small preparation patches, followed by the main part of the feature (which isn't big, since we are just enabling the pre-existing Alternator expiration machinary for CQL) and finally 30 tests (single-node and multi-node tests) and documentation.

This series is a new feature, so traditionally would not be backported. However, I wouldn't be surprised if we will be requested to backport it so that customers will not need to wait for a new major release.

Fixes #13000

Closes scylladb/scylladb#28320

* github.com:scylladb/scylladb:
  test/cqlpy: verify that a column can't be both STATIC and PRIMARY KEY
  docs/cql: document the new CQL per-row TTL feature
  test/cluster: tests for the new CQL per-row TTL feature
  test/cqlpy: tests for the new CQL per-row TTL feature
  test: set low alternator_ttl_period_in_seconds in CQL tests
  cql ttl: fix ALTER TABLE to disable TTL if column is dropped
  cql ttl: add setting/unsetting of TTL column to ALTER TABLE
  cql ttl: add TTL column support to CREATE TABLE and DESC TABLE
  ttl: add CQL support to Alternator's TTL expiration service
  alternator ttl: move TTL_TAG_KEY to a header file
  alternator ttl: remove unnecessary check of feature flag
  cql: add "cql_row_ttl" cluster feature
  alternator: fix error message if UpdateTimeToLive is not supported
2026-02-26 15:29:12 +01:00
Marcin Maliszkiewicz
4d0f1bf5c9 conf: improve rf_rack_valid_keyspaces documentation is scylla.yaml
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-761

Closes scylladb/scylladb#28738
2026-02-26 14:34:28 +01:00
Yaniv Michael Kaul
ead9961783 cql: vector: fix vector dimension type
Switch vector dimension handling to fixed-width `uint32_t` type,
update parsing/validation, and add boundary tests.

The dimension is parsed as `unsigned long` at first which is guaranteed
to be **at least** 32-bit long, which is safe to downcast to `uint32_t`.

Move `MAX_VECTOR_DIMENSION` from `cql3_type::raw_vector` to `cql3_type`
to ensure public visibility for checks outside the class.

Add tests to verify the type boundaries.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-223

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Co-authored-by: Dawid Pawlik <dawid.pawlik@scylladb.com>

Closes scylladb/scylladb#28762
2026-02-26 14:46:53 +02:00
Michael Litvak
4a60ee28a2 test/cqlpy/test_materialized_view.py: increase view build timeout
The test test_build_view_with_large_row creates a materialized view and
expects the view to be built with a timeout of 5 seconds. It was
observed to fail because the timeout is too short on slow machines.

Increase the timeout to 60 seconds to make the test less flaky on slow
machines. Similarly for the other tests in the file that have a timeout
for view build, increase the timeout to 60 seconds to be consistent and
safer.

Fixes SCYLLADB-769

Closes scylladb/scylladb#28817
2026-02-26 11:27:51 +02:00
Michał Hudobski
579ed6f19f secondary_index_manager: fix double registration bug
We have observed a bug that caused Scylla to crash due to metrics double
registration. This bug is really difficult to reproduce and was seen
only once in the wild. We think that it may be caused by a request
in-flight keeping a reference to the stats object, making it not
deregister when the index is dropped, which casues a double registration
when we recreate the index, however we are not 100% sure.

This patch makes it so the metrics always get deregistered when we drop the index, which should fix the double registration bug.

Fixes: #27252

Closes scylladb/scylladb#28655
2026-02-26 09:39:53 +01:00
Marcin Maliszkiewicz
30f18a91fd Merge 'dtest: wait_for speedup' from Dario Mirovic
Audit tests have been slow. They rely on wait_for function. This function first sleeps for the duration of the time step specified, and then calls the given function. The audit tests need 0.02-0.03 seconds for the given function, but the operation lasts around 1.02-1.03 seconds, since step is 1 second.

This patch modifies wait_for dtest function so it first executes the given function, and afterwards calls time.sleep(step). This reduces time needed for the given function from 1.03 to 0.03 seconds.

Total audit tests suite speedup is 3x. On the developer machine the time is reduced from 13+ minutes to 4 minutes.

This patch also improves performance of some alternator tests that use the same wait_for dtest function.

`wait_for` in dtest framework has default time step reduced to make the environment more responsive and test execution faster.

Refs SCYLLADB-573

This is a performance improvement of testing framework. No need to backport.

Closes scylladb/scylladb#28590

* github.com:scylladb/scylladb:
  dtest: shorten default sleep step in wait_for
  dtest: wait_for speedup
2026-02-26 09:33:38 +01:00
Yaron Kaikov
b211590bc0 .github/workflows: enable automatic backport PR creation with Jira sub-issue integration
This workflow calls the reusable backport-with-jira workflow from
scylladb/github-automation to enable automatic backport PR creation with
Jira sub-issue integration.

The workflow triggers on:
- Push to master/next-*/branch-* branches (for promotion events)
- PR labeled with backport/X.X pattern (for manual backport requests)
- PR closed/merged on version branches (for chain backport processing)

Features enabled by calling the shared workflow:
- Creates Jira sub-issues under the main issue for each backport version
- Sorts versions descending (highest first: 2025.4 -> 2025.3 -> 2025.2)
- Cherry-picks from previous version branch to avoid repeated conflicts
- On Jira API failure: adds comment to main issue, applies 'jira-sub-issue-creation-failed' label, continues with PR

Closes scylladb/scylladb#28804
2026-02-25 16:39:17 +02:00
Nadav Har'El
1d265e7d6d test/cqlpy: verify that a column can't be both STATIC and PRIMARY KEY
While adding the new syntax "TTL" to CREATE TABLE, I noticed that the
parser actually allows a column to be defined as "STATIC PRIMARY KEY".

So I add here a small test to verify that this is not really allowed:
The syntax "c int STATIC PRIMARY KEY" is accepted, but then rejected
by a later check. The syntax "c int PRIMARY KEY STATIC" is rejected
as a syntax error.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:45 +02:00
Nadav Har'El
34c0c64d9d docs/cql: document the new CQL per-row TTL feature
Add user-facing documentation for the new CQL per-row TTL feature,
in docs/cql/cql-extensions.md.

Also mention (and link) the new alternative TTL feature in a few
relevant documents about the old (per-write) TTL,  about CDC,
and about the CREATE TABLE and ALTER TABLE commands.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:44 +02:00
Nadav Har'El
23ad0be034 test/cluster: tests for the new CQL per-row TTL feature
The previous patch added single-node functional tests (in test/cqlpy)
for everything which was possible to test on a single node. In this
patch we add four tests that we couldn't test on a single node, using
the test/cluster test framework:

1. Test that the TTL expiration work - both the scanning threads and
   the actual deletion work on all nodes - happens on the "streaming"
   scheduling group.

2. Test that even if one of the cluster's nodes is down, still all
   the items get expired - another node "takes over" the dead node's
   work.

3. Test that rolling upgrade works as designed for the CQL per-row TTL
   feature: Before every single node in the cluster is upgraded to
   support this feature, a TTL column cannot be enabled on a table.
   And as soon as the last node of the cluster is upgraded, the TTL
   feature begins to work completely (you don't need to reboot all
   the nodes again).

4. Test that expiration works correctly on a multi-DC setup. The test
   doesn't check the efficiency of this process - i.e., that today each
   DC scans part of the data, reading with LOCAL_QUORUM, and writing
   the deletions across the entire cluster. Rather, the test only
   verifies the correctness - that expired rows do get deleted -
   for the usual case the data across the DCs is consistent.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:44 +02:00
Nadav Har'El
7a1351c6cf test/cqlpy: tests for the new CQL per-row TTL feature
This patch contains 27 functional tests (in the test/cqlpy framework)
for the new CQL per-row TTL feature. The tests cover the TTL column
configuration statements (CREATE TABLE, ALTER TABLE) as well as the
actual item expiration or non-expiration depending on the value of
the expiration-time column - and also CDC events generated on expiration
and the metrics generated by the expiration process.

These tests were written together with the code, as in "test-driven
development", so they aim to cover every corner case considered during
the development, and they reproduce every bug and misstep seen during
the development process. As a result, they hopefully achieve very high
code coverage - but since we don't have a working code-coverage tool,
I can't report any specific code coverage numbers.

These tests check everything which we can check on single-node cluster.
The next patch will add additional multi-node tests for things we can't
check here with a single node - such as the scheduling group used by the
distributed work, the effect of dead nodes on the TTL functionality, and
the process of rolling upgrade.

The tests in this patch do NOT try to stress the background expiration
scanning threads, or to check how they handle topology changes, large
amounts of data or clusters spanning multiple DCs. These tests also don't
test the performance impact of these scanning threads. Because the
expiration scanning thread is identical to the one already used by
Alternator TTL, we assume that many of these aspects were already tested
for Alternator TTL and did not change when the same implementation is
used for the new CQL feature.

All new tests pass on ScyllaDB. Because the per-row TTL feature is
a new ScyllaDB feature that does not exist on Cassandra, all these
tests are skipped on Cassandra.

Because some of these tests involve waiting for expiration, they can't
be very quick. Still, because we set alternator_ttl_period_in_seconds
to 0.5 seconds in the test framework, all 27 tests running sequentially
finish in roughly 6 seconds total, which we consider acceptable.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:44 +02:00
Nadav Har'El
154cecda71 test: set low alternator_ttl_period_in_seconds in CQL tests
In test/alternator/run we set alternator_ttl_period_in_seconds to a very
low number (0.5 seconds) to allow TTL tests to expire items very quickly
and finish quickly.

Until now, we didn't need to do this for CQL tests, because they weren't
using this Alternator-only feature. Now that CQL uses the same expiration
feature with its original configuration parameter, we need to set it in
CQL tests too.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:43 +02:00
Nadav Har'El
eebd7b0fbc cql ttl: fix ALTER TABLE to disable TTL if column is dropped
If "ALTER TABLE tab DROP x" is done to delete column x, and column x was
the designated TTL column, then the per-row TTL feature should be disabled
on this table.

If we don't do this, the expiration scanner will continue to scan the
table trying to read the dropped column - which will be wasteful or
worse.

A test for this case is also included in test/cqlpy/test_ttl_row.py
in a later patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:43 +02:00
Nadav Har'El
acbdf637b6 cql ttl: add setting/unsetting of TTL column to ALTER TABLE
The previous patch added the ability in CREATE TABLE to designate one
of the regular columns as a "TTL column", to be used by the per-row TTL
feature (Refs #13000). In this patch we add to ALTER TABLE the ability
to enable per-row TTL on an existing table with a given column as the
TTL column:

     ALTER TABLE tab TTL colname

and also the ability to disable per-row TTL with

     ALTER TABLE tab TTL NULL

as in CREATE TABLE, the designated TTL column must be a regular column
(it can't be a primary key column or a static column), and must have
the types timestamp, bigint or int.

You can't enable per-row TTL if already enabled, or disable it if
already disabled. To change the TTL column on an existing table,
you must first disable TTL, and then re-enable it with the new column.

A large collection of functional tests (in test/cqlpy), for every detail
of this patch, will come in a later patch in this series.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:43 +02:00
Nadav Har'El
22c79b6af8 cql ttl: add TTL column support to CREATE TABLE and DESC TABLE
This patch enables the per-row TTL feature in CQL (Refs #13000).

This patch allows the user to create a new table with one of its columns
designated as the TTL column with a syntax like:

    CREATE TABLE tab (
        id int PRIMARY KEY,
        t text,
        expiration timestamp TTL
    );

The column marked "TTL" must have the "timestamp", "bigint" or "int"
types (the choice of these types was explained in the previous patch),
and there can only be one such column. We decided not to allow a column
to be both a primary key column and a TTL column - although it would
have worked (it's supported in Alternator), I considered this non-useful
and confusing, and decided not to allow it in CQL. A TTL column also
can't be a static column.

We save the information of which column is the TTL column in a tag which
is read by the "expiration service" - originally a part of Alternator's
TTL implementation. After the previous patch, the expiration service is
running and knows how to understand CQL tables, so the CQL per-row TTL
feature will start to work.

This patch also implements DESC TABLE, printing the word "TTL" in the
right place of the output.

This patch doesn't yet implement ALTER TABLE that should allow enabling
or disabling the TTL column setting on an existing table - we'll do that
in the next patch.

A large collection of functional tests (in test/cqlpy), for every detail
of this feature will be added in a later patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:42 +02:00
Nadav Har'El
e636bc39ad ttl: add CQL support to Alternator's TTL expiration service
The Alternator TTL feature uses an "expiration service", a background
thread on each shard which periodically scans for expired items and
deletes them. When writing the expiration service, we already
anticipated that the day will come that we'll want to use it for CQL
too. Well, now that we want to use it for CQL, we only need to make
two changes:

1. Before this patch, the expiration service was only started if
   Alternator was enabled. Now we need to start it unconditionally,
   as both Alternator and CQL will need to use it.
   The performance impact of the new background threads, when not
   needed, should be minimal: These threads will wake up every
   alternator_ttl_period_in_seconds (by default - once a day) and
   just check if any table has per-row TTL enabled, and if not, do
   nothing.

2. Before this patch, the expiration-time column had to be of type
   "decimal" - a variable-precision floating-point type. This made
   sense in Alternator - where all numbers are of this type, but CQL
   offers better and more efficient types for this purpose. In this
   patch we add support for two additional types for the expiration
   time column: The "timestamp" type (which uses millisecond precision,
   which our implementation truncates to whole seconds) and for the
   "bigint" type storing a number of seconds since the UNIX epoch.
   We also support the smaller "int" type for compatibility with
   existing data, but it is not recommended because a signed
   32-bit integer counting time from 1970 will break in 2038.

After this patch, the expiration service supports CQL tables, but there
is nothing yet that can enable it on CQL tables - i.e., nothing that
sets the appropriate tag on the table to tell the expiration service
which column is the expiration-time column. We'll add new syntax to
do this in the next patch.

At the moment, we leave the expiration service implementation in
its existing location - alternator/ttl.cc. This is despite the fact
that we now start it and use it also for CQL. For better modularity,
we should probably later move the expiration service implementation
to a separate module (directory).

Similarly, the expiration service's period is still configured via
alternator_ttl_period_in_seconds, which is now a misnomer because it
also affects CQL. Later we can rename this configuration parameter,
or alternatively, consider different scan periods for different tables
and table types, and have separate configuration for Alternator TTL
and CQL per-row TTL.

The metrics kept by the expiration service are the same metrics existing
for Alternator TTL, and fortunately do not have the name "alternator" in
their name:

   * scylla_expiration_scan_passes
   * scylla_expiration_scan_table
   * scylla_expiration_items_deleted
   * scylla_expiration_secondary_ranges_scanned

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:42 +02:00
Nadav Har'El
2823780557 alternator ttl: move TTL_TAG_KEY to a header file
TTL_TAG_KEY stores the name of the tag in which we store the name of the
table's expiration-time column, for Alternator's TTL feature.

We already need this name in two source files, and soon we'll need it
in more files - as we want to use the same implementation also for for
a new per-row TTL feature in CQL. So it's time to move the declaration
of this variable to a new header file - alternator/ttl_tag.hh.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:42 +02:00
Nadav Har'El
5e16c59312 alternator ttl: remove unnecessary check of feature flag
Every node that supports the Alternator TTL feature should start its
background expiration-checking thread, *without* checking if other
nodes support this feature. This patch removes the unnecessary check.

Indeed, until all other nodes enable this feature, the background thread
will have nothing to do. but when finally all nodes have this feature -
we need this thread to already be on - without requiring another reboot
of all nodes to start this thread.

In practice, this change won't change anything on modern installations
because this feature is already three years old and always enabled on
modern clusters. But I don't want to repeat the same mistake for the new
CQL per-row TTL feature, so better fix it in Alternator too.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:41 +02:00
Nadav Har'El
4f4e93b695 cql: add "cql_row_ttl" cluster feature
This patch adds a new cluster feature "CQL_ROW_TTL", for the new CQL
per-row TTL feature.

With this patch, this node reports supporting this feature, but the CQL
per-row TTL feature can only be used once all the nodes in the cluster
supports the feature. In other words, user requests to enable per-row TTL
on a table should check this feature flag (on the whole cluster) before
proceeding.

This is needed because the implementation of the per-row-TTL expiration
requires the cooperation of all nodes to participate in scanning for
expired items, so the feature can't be trusted until all nodes participate
in it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:41 +02:00
Nadav Har'El
0d6b7a6211 alternator: fix error message if UpdateTimeToLive is not supported
Since commit 2dedb5ea75, the Alternator
TTL feature is no longer experimental. It is still a "cluster feature"
meaning it cannot be used on a partially-upgraded cluster until the entire
cluster supports this feature.

The error message we printed when the cluster doesn't support this
feature was outdated, referring to the no-longer-existing experimental
feature. So this patch fixes the error message.

Since this feature is already three years old, nobody is likely to ever
see this error message (it can be seen only by someone upgrading an
even older cluster, during the rolling upgrade), but better not have
wrong error messages in the code, even if it's not seen by users.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:41 +02:00
Nadav Har'El
b78bb914d7 alternator: replace SCYLLA_ASSERT with throwing_assert
Replace all calls to SCYLLA_ASSSERT() in Alternator by the better and
safer throwing_assert() introduced in the previous patch.

As a result of this patch, if one of the call sites for these asserts
is buggy and ever fails, only the involved operation will be killed
by an exception, instead of crashing the whole server - and often the
entire cluster (as the same buggy request reaches all nodes and
crashes them all).

Additionally, this patch replaces a few existing uses in Alternator
of on_internal_error() with a non-interesting message with a
more-or-less equivalent, but shorter, throwing_assert(). The idea is
to convert the verbose idiom:

       if (!condition) {
          on_internal_error(logger, "some error message")
       }

With the shorter

       throwing_assert(condition)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:58:47 +02:00
Nadav Har'El
d876e7cd0a utils: introduce throwing_assert(), a safe replacement for assert
This patch introduces throwing_assert(cond), a better and safer
replacement for assert(cond) or SCYLLA_ASSERT(cond). It aims to
eventually replace all assertions in Scylla and provide a real solution
to issue #7871 ("exorcise assertions from Scylla").

throwing_assert() is based on the existing on_internal_error() and
inherits all its benefits, but brings with it the *convenience* of
assert() and SCYLLA_ASSERT(): No need for a separate if(), new strings,
etc.  For example, you can do write just one line of throwing_assert():

    throwing_assert(p != nullptr);

Instead of much more verbose on_internal_error:

    if (p == nullptr) {
        utils::on_internal_error("assertion failed: p != nullptr")
    }

Like assert() and SCYLLA_ASSERT(), in our tests throwing_assert() dumps
core on failure. But its advantage over the other assertion functions
like becomes clear in production:

* assert() is compiled-out in release builds. This means that the
  condition is not checked, and the code after the failed condition
  continues to run normally, potentially to disasterous consequences.

  In contrast, throwing_assert() continues to check the condition even in
  release builds, and if the condition is false it throws an exception.
  This ensures that the code following the condition doesn't run.

* SCYLLA_ASSERT() in release builds checks the condition and *crashes*
  Scylla if the condition is not met.

  In contrast, throwing_assert() doesn't crash, but throws an exception.
  This means that the specific operation that encountered the error
  is aborted, instead of the entire server. It often also means that
  the user of this operation will see this error somehow and know
  which operation failed - instead of encountering a mysterious
  server (or even whole-cluster crash) without any indication which
  operation caused it.

Another benefit of throwing_assert() is that it logs the error message
(and also a backtrace!) to Scylla's usual logging mechanisms - not to
stderr like assert and SCYLLA_ASSERT write, where users sometimes can't
see what is written.

Fixes #28308.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:58:47 +02:00
Botond Dénes
99244179f7 Merge 'CQL transport: Add histogram-based request/response size tracking' from Amnon Heiman
This series closes a gap in how CQL request and response sizes are reported.

Previously, request_size and response_size were tracked as simple counters,
providing only cumulative totals per shard. This made it difficult to understand
the distribution of message sizes and identify potential issues with very large
or very small requests.

After this series, the CQL transport reports detailed histogram metrics showing
the distribution of request and response sizes. These histograms are tracked
per-instance, per-type (per ops), and per-scheduling-group, providing
much better visibility into CQL traffic patterns.

The histograms are collected for QUERY, EXECUTE, and BATCH operations, which are
the primary data path operations where message size distribution is most relevant.
This data can help identify:
- Clients sending unexpectedly large requests
- Operations with oversized result sets
- Scheduling group differences in traffic patterns

To support this, the series extends the approx_exponential_histogram template to
handle accurate sum, adds a bytes_histogram type alias optimized for byte-range measurements (1KB to 1GB).

The existing per-shard counter metrics are maintained for backward compatibility.
Metrics example:
```
scylla_transport_cql_request_bytes{kind="BATCH",scheduling_group_name="sl:default",shard="0"} 129808
scylla_transport_cql_request_bytes{kind="EXECUTE",scheduling_group_name="sl:default",shard="0"} 227409
scylla_transport_cql_request_bytes{kind="PREPARE",scheduling_group_name="sl:default",shard="0"} 631
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:default",shard="0"} 2809
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:driver",shard="0"} 4079
scylla_transport_cql_request_bytes{kind="REGISTER",scheduling_group_name="sl:default",shard="0"} 98
scylla_transport_cql_request_bytes{kind="STARTUP",scheduling_group_name="sl:driver",shard="0"} 432
scylla_transport_cql_request_histogram_bytes_sum{kind="QUERY",scheduling_group_name="sl:driver"} 4079
scylla_transport_cql_request_histogram_bytes_count{kind="QUERY",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1024.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2048.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4096.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8192.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16384.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="32768.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="65536.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="131072.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="262144.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="524288.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1048576.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2097152.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4194304.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8388608.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16777216.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="33554432.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="67108864.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="134217728.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="268435456.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="536870912.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1073741824.000000",scheduling_group_name="sl:driver"} 57
```
**The field sees it as an important issue**

Fixes #14850

Closes scylladb/scylladb#28419

* github.com:scylladb/scylladb:
  test/boost/estimated_histogram_test.cc: Switch to real Sum
  transport/server: to bytes_histogram
  approx_exponential_histogram: Add sum() method for accurate value tracking
  utils/estimated_histogram.hh: Add bytes_histogram
2026-02-25 13:05:18 +02:00
Yaron Kaikov
98494e08eb ci: harden trigger-scylla-ci workflow against credential leaks and untrusted PRs
refs: https://github.com/scylladb/scylladb/security/advisories/GHSA-wrqg-xx2q-r3fv

- Remove -v and -i flags from curl to prevent credentials from being
  logged in workflow output
- Move PR_NUMBER and PR_REPO_NAME into the env block with proper quoting
  to prevent shell injection via crafted PR metadata
- Add org membership verification step for pull_request_target events so
  that only PRs from scylladb org members can trigger Jenkins CI

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-796

Closes scylladb/scylladb#28785
2026-02-25 12:42:18 +02:00
Avi Kivity
511fab1f28 gossiper: exit failure detector sleep faster
When running unit tests, there's a visible ~1-second sleep
when gossip exits the failure detector loop.

Improve this by adding a condition variable for exiting the loop
and signaling it when any of the exit conditions are satisfied:
the abort_source is pulled, the gossiper is shut down, or the sleep
is complete. We can't just use the abort_source because gossip can be
shut down independently of the rest of the system.

To see the improvement, I ran cql_query_test in dev mode:

Before:

$ time ./build/dev/test/boost/combined_tests -t cql_query_test -- --smp 2  > /dev/null 2>&1

real	2m26.904s
user	0m24.307s
sys	0m13.402s

After:

$ time ./build/dev/test/boost/combined_tests -t cql_query_test -- --smp 2  > /dev/null 2>&1

real	0m26.579s
user	0m24.671s
sys	0m13.636s

Two minutes of real-time saved.

Real-life improvement in test.py will be lower, because of the overhead
of launching pytest for each test case.

Closes scylladb/scylladb#28649
2026-02-25 11:41:02 +02:00
Avi Kivity
5baf16005f build: install antlr3 from maven + source, not rpm packages
Fedora removed the C++ backend from antlr3 [1], citing incompatible
license. The license in question (the Unicode license) is fine for us.

To be able to continue using antlr3, build it ourselves. The main
executable can be used as is from Maven, since we don't need any
patches for the parser. The runtime needs to be patched, so we download
the source and patch it.

Regenerated frozen toolchain with optimized clang from

  https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-aarch64.tar.gz
  https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-x86_64.tar.gz

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-773

Closes scylladb/scylladb#28765
2026-02-25 11:03:19 +02:00
Andrei Chekun
729bad77b1 test.py: add possibility to run downloaded Scylla binary
Add possibility to run Scylla binary that is stored or download the
relocatable package with Scylla.

Closes scylladb/scylladb#28787
2026-02-25 10:23:19 +02:00
Łukasz Paszkowski
9ade0b23da reader_concurrency_semaphore: set _ex in on_preemptive_abort()
When a permit is preemptively aborted, store the corresponding
exception in permit's member: `reader_permit::impl::_ex`.

This makes preemptively-aborted permits consistently report aborted()
and prevents them from being treated as eligible for inactive
registration in `register_inactive_read()`, avoiding assertion
failures on unexpected permit state.

Closes scylladb/scylladb#28591
2026-02-25 10:20:06 +02:00
Grzegorz Burzyński
b4f0eb666f packaging: add systemctl command to dependencies
scylladb/scylla container image doesn't include systemctl binary, while it
is used by perftune.py script shipped within the same image.

Scylla Operator runs this script to tune Scylla nodes/containers,
expecting its all dependencies to be available in the container's PATH.
Without systemctl, the script fails on systems that run irqbalance
(e.g., on EKS nodes) as the script tries to reconfigure irqbalance and
restart it via systemctl afterwards.

Fixes: scylladb/scylla-operator#3080

Closes scylladb/scylladb#28567
2026-02-25 10:19:32 +02:00
Botond Dénes
56cc7bbeec Merge 'Allow "global" snapshot using topology coordinator + add tablet metadata to manifest' from Calle Wilund
Refs: SCYLLADB-193

Adds a "snapshot_table" topology operation and associated data structure/table columns to support dispatching a snapshot operation as a topo coordinator op.

Logic is similar, and thus broken out and semi-shared with, truncation.

Also adds optional tablet metadata to manifest, listing all tablets present in a given snapshot, as well as
tablet sstable ownership, repair status, and token ranges.

As per description in SCYLLADB-193, the alternative snapshot mechanism is in
a separate namespace under 'tablets', which while dubious is the desired destination.

The API is accessed via `nodetool cluster snapshot`, which more or less mirrors `nodetool snapshot`, but using topo op.

TTL is added to message propagation as a separate patch here, since it is not (yet) used from API (or nodetool).
Requires a syntax for both API and command line.

Closes scylladb/scylladb#28525

* github.com:scylladb/scylladb:
  topology::snapshot: Add expiry (ttl) to RPC/topo op
  test_snapshot_with_tablets: Extend test to check manifest content
  table::manifest: Add tablet info to manifest.json
  test::test_snapshot_with_tablets: Add small test for topo coordinated snapshot
  scylla-nodetool: Add "cluster snapshot" command
  api::storage_service: Add tablets/snapshots command for cluster level snapshot
  db::snapshot-ctl: Add method to do snapshot using topo coordinator
  storage_proxy: Add snapshot_keyspace method
  topology_coordinator: Add handler for snapshot_tables
  storage_proxy: Add handler for SNAPSHOT_WITH_TABLETS
  messaging_service: Add SNAPSHOT_WITH_TABLETS verb
  feature_service: Add SNAPSHOT_AS_TOPOLOGY_OPERATION feature
  topology_mutation: Add setter for snapshot part of row
  system_keyspace::topology_requests_entry: Add snapshot info to table
  topology_state_machine: Add snapshot_tables operation
  topology_coordinator: Break out logic from handle_truncate_table
  storage_proxy: Break out logic from request_truncate_with_tablets
  test/object_store: Remove create_ks_and_cf() helper
  test/object_store: Replace create_ks_and_cf() usage with standard methods
  test/object_store: Shift indentation right for test cases
2026-02-25 10:17:53 +02:00
Botond Dénes
166e245097 Merge 'test.py: Topology test pytest integration' from Andrei Chekun
Migrate cluster tests directory to be handled by pytest. This is the next step in process of unification of the tests and migration to the pytest.
With this PR cluster test will be executed with the full path to the file instead of `suite/test` paradigm.

Backport is not needed because it framework enhancement.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-46

Closes scylladb/scylladb#27618

* github.com:scylladb/scylladb:
  test.py: remove setsid from the framework
  test.py: rename suite.yaml to test_config.yaml
  test.py: add cluster tests to be executed by pytest
  test.py: add random seed for topology tests reproducibility
  test.py: add explicit default values to pytest options
  test.py: replace SCYLLA env var with build_mode fixture
2026-02-25 10:17:20 +02:00
Botond Dénes
9dff9752b4 Merge 'Fix regression in Alternator TTL with tablets and node going down' from Nadav Har'El
Recently we suffered a regression on how Alternator TTL behaves when a node goes down when tablets are used.

Usually, expiration of data in a particular tablet are handled by this tablet's "primary replica". However, if that node is down, we want another node to perform these expiration until the primary replica goes back online. We created a function `tablet_map::get_secondary_replica()` to select that "other node". We don't care too much what the "secondary replica" means, but we do care that it's different from the primary replica - if it's the same the expiration of that tablet will never be done.

It turns out that recently, in commits 817fdad and d88036d, the implementation of get_primary_replica() changed without a corresponding change to get_secondary_replica(). After those changes, the two functions are mismatched, and sometimes return the same node for both primary and secondary replica.

Unfortunately, although we had a dtest for the handling of a dead node in Alternator TTL, it failed to reproduce this bug, so this regression was missed - nothing else besides Alternator TTL ever used the get_secondary_replica() function.

So this series, in addition to fixing the bug, we add two tests that reproduce this bug (fail before the fix, pass with the fix):

1. A unit test that checks that get_secondary_replica() always returns a different node from get_primary_replica()
2. A cluster test based on the original dtest, which does reproduce this bug in Alternator TTL where some of the data was never expired (but only failed in release build, for an unknown reason).

Fixes SCYLLADB-777.

Closes scylladb/scylladb#28771

* github.com:scylladb/scylladb:
  test: add unit test for tablet_map::get_secondary_replica()
  test, alternator: add test for TTL expiration with a node down
  locator: fix get_secondary_replica() to match get_primary_replica()
2026-02-25 10:13:55 +02:00
Gleb Natapov
0f8cdd81f3 group0: fix indentation after previous patch 2026-02-25 10:08:32 +02:00
Gleb Natapov
7d7cbae763 raft_group0: simplify get_group0_upgrade_state function since no upgrade can happen any more
No need for locking any more so the function may just return a value and
be synchronous.
2026-02-25 10:08:32 +02:00
Gleb Natapov
0689fb5ab2 raft_group0: move service::group0_upgrade_state to use fmt::formatter instead of iostream 2026-02-25 10:08:32 +02:00
Gleb Natapov
cd76604c79 raft_group0: remove unused code from raft_group0
Also do not pass raft_replace_info into setup_group0 since it is not
used there for a long time now.
2026-02-25 10:08:32 +02:00
Gleb Natapov
6173ea476b node_ops: remove topology over node ops code
The code is no longer called.
2026-02-25 10:08:32 +02:00
Gleb Natapov
758d1c9c39 topology: fix indentation after the previous patch 2026-02-25 10:08:31 +02:00
Gleb Natapov
67cd5755b2 topology: drop topology_change_enabled parameter from raft_group0 code
Since the parameter is always true there is no point to pass it
everywhere. Just assume it is true at the point of use.
2026-02-25 10:08:31 +02:00
Gleb Natapov
50da142e77 storage_service: remove unused handle_state_* functions
The handler are no longer called.
2026-02-25 10:08:31 +02:00
Gleb Natapov
1a57f2b22d gossiper: drop wait_for_gossip_to_settle and deprecate correspondent option
The function is unused now and the option that allows to skip the wait
is no longer needed as well.
2026-02-25 10:08:31 +02:00
Gleb Natapov
aa0f103eb9 storage_service: fix indentation after the last patch 2026-02-25 10:08:31 +02:00
Gleb Natapov
be6cced978 storage_service: remove gossiper bootstrapping code
Remove code that is responsible for bootstrapping a node in gossiper
mode since the mode is no longer supported.
2026-02-25 10:08:31 +02:00
Gleb Natapov
8776d00c44 storage_service: drop get_group_server_if_raft_topolgy_enabled
Raft topology is always enabled now so the function does not make sense
any longer.
2026-02-25 10:08:30 +02:00
Gleb Natapov
5fa4f5b08f storage_service: drop is_topology_coordinator_enabled and its uses
The code can now assume that topology coordinator is enabled.
2026-02-25 10:08:30 +02:00
Gleb Natapov
1e8d4436c7 storage_service: drop run_with_api_lock_in_gossiper_mode_only
Since gossiper mode is no longer exists the function is no longer
needed.
2026-02-25 10:08:30 +02:00
Gleb Natapov
a8a167623a topology: remove code that assumes raft_topology_change_enabled() may return false
The path removes the code protected by !raft_topology_change_enabled()
since it is no longer reachable. Drop test_lwt_for_tablets_is_not_supported_without_raft
since not raft mode is no longer supported.
2026-02-25 10:08:30 +02:00
Botond Dénes
8dbcd8a0b3 tools/scylla-sstable: create_table_in_cql_env(): register UDTs recursively
It is not enough to go over all column types and register the UDTs. UDTs
might be nested in other types, like collections. One has to do a
traversal of the type tree and register every UDT on the way. That is
what this patch does.
This function is used by the query and write operations, which should
now both work with nested UDTs.

Add a test which fails before and passes after this patch.
2026-02-25 08:51:25 +02:00
Botond Dénes
cf39a5e610 tools/scylla-sstable: generalize dump_if_user_type
Rename to invoke_on_user_type() and make the action taken on user types
a function parameter. Enables reuse of the traverse logic by other code.
2026-02-25 08:51:25 +02:00
Botond Dénes
80049c88e9 tools/scylla-sstable: move dump_if_user_type() definition
So it can be used by create_table_in_cql_env() code.
2026-02-25 08:51:25 +02:00
Dario Mirovic
3222a1a559 dtest: shorten default sleep step in wait_for
Default sleep step of 1s is too long. Reduce it to make the test
environment more responsive and faster.

Refs SCYLLADB-573
2026-02-25 03:17:47 +01:00
Dario Mirovic
51e7c2f8d9 dtest: wait_for speedup
Audit tests have been slow. They rely on wait_for function.
This function first sleeps for the duration of the time step
specified, and then calls the given function. The audit tests
need 0.02-0.03 seconds for the given function, but the operation
lasts around 1.02-1.03 seconds, since step is 1 second.

This patch modifies wait_for dtest function so it first executes
the given function, and afterwards calls time.sleep(step). This
reduces time needed for the given function from 1.03 to 0.03 seconds.

Total audit tests suite speedup is 3x. On the developer machine
the time is reduced from 13+ minutes to 4 minutes.

This patch also improves performance of some alternator tests that
use the same wait_for dtest function.

Refs SCYLLADB-573
2026-02-25 03:17:46 +01:00
Andrei Chekun
1b92b140ee test.py: improve stdout output for boost test
The current way of checking the boost's stdout can have a race
condition when pytest will try to read the file before it was really
flushed. So this PR should eliminate this possibility.

Closes scylladb/scylladb#28783
2026-02-25 00:50:25 +01:00
Ferenc Szili
f70ca9a406 load_stats: implement the optimized sum of tablet sizes
PR #28703 was merged into master but not with the latest version of the
changes. This patch is an incremental fix for this.

Currently, the elements of the tablet_sizes_per_shard vector are
incremented in separate shards. This is prone to false sharing of cache
lines, and ping-pong of memory, which leads to reduced performance.

In this patch, in order to avoid cache line collisions while updating
the sum of tablet sizes per shard, we align the counter to 64 bytes.

Fixes: SCYLLADB-678

Closes scylladb/scylladb#28757
2026-02-24 22:19:25 +01:00
Marcin Maliszkiewicz
aa7816882e test: add test_uninitialized_conns_semaphore
Runtime in dev mode: 2s
2026-02-24 17:28:51 +01:00
Alex
5557770b59 test_mv_build_during_shutdown started two async CREATE MATERIALIZED VIEW operations and never awaited them (asyncio.gather(...) without await).
This pr adds await for each one of the tasks to wait for the MV schema to be added successfully
and then to start the server shutdown
With this change we dont need will not get the shutdown races.

Closes scylladb/scylladb#28774
2026-02-24 17:25:05 +01:00
Anna Stuchlik
64b1798513 doc: remove reduntant Java-related information
This commit removes:
- Instructions to install scylla-jmx (and all references)
- The Java 11 requirement for Ubuntu.

Fixes https://github.com/scylladb/scylladb/issues/28249
Fixes https://github.com/scylladb/scylladb/issues/28252

Closes scylladb/scylladb#28254
2026-02-24 14:37:39 +01:00
Anna Stuchlik
e2333a57ad doc: remove the tablets limitation for Alternator
This commit removes the information that Alternator doesn't support tablets.
The limitation is no longer valid.

Fixes SCYLLADB-778

Closes scylladb/scylladb#28781
2026-02-24 14:24:30 +02:00
Andrzej Jackowski
cd4caed3d3 test: fix configuration of test_autoretrain_dict
`test_autoretrain_dict` sporadically fails because the default
compression algorithm was changed after the test was written.

`9ffa62a986815709d0a09c705d2d0caf64776249` was an attempt to fix it by
changing the compression configuration during node startup. However,
the configuration change had an incorrect YAML format and was
ignored by ScyllaDB. This commit fixes it.

Fixes: scylladb/scylladb#28204

Closes scylladb/scylladb#28746
2026-02-24 12:08:44 +01:00
Botond Dénes
067bb5f888 test/scylla_gdb: skip coroutine tests if coroutine frame is not found
For a while, we have seen coroutine related tests (those that use the
coroutine_task fixture) fail occasionally, because no coroutine frame is
found. Multiple attempts were made to make this problem self-diagnosing
and dump enough information to be able to debug this post-mortem. To no
avail so far. A lot of time was invested into this this benign issue:
See the long discussion at https://github.com/scylladb/scylladb/issues/22501.

It is not known if the bug is in gdb, or the gdb script trying to find
the coroutine frame. In any case, both are only used for debugging, so
we can tolerate occasional failures -- we are forced to do so when
working with gdb anyway.
Instead of piling on more effor there, just skip these tests when the
problem occurs. This solves the CI flakyness.

Fixes: #22501

Closes scylladb/scylladb#28745
2026-02-24 10:12:03 +01:00
Marcin Maliszkiewicz
d5684b98c8 test: cluster: add continue-after-error to perf tool tests
Add --continue-after-error true to perf-cql-raw and perf-alternator
tests, and --stop-on-error false to perf-simple-query test, so that
tests don't abort on the first error.

Reason for this is that tests are flaky with example failure:
Perf test failed: std::runtime_error (server returned ERROR to EXECUTE)

When CPU is starved on CI we can return timeouts and/or other errors.

The change should make tests more robust on the expense of smaller test
scope. But those tests were written mostly to test startup sequence
as it differs from Scylla's starup.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-759

Closes scylladb/scylladb#28767
2026-02-24 11:08:34 +02:00
Andrei Chekun
d9ce2db1a3 test.py: remove setsid from the framework
With previous architecture, scylla servers were handled by test.py and
if pytest fails, test.py was responsible for stopping scylla processes.
Now with only pytest handling, there is no such mechanism, that's why
I'm removing the setsid, so when the parent pytest process closes it
will automatically close all child including any started process during
testing. This will allow to not leave any scylla process in case pytest
was killed.
2026-02-24 09:48:38 +01:00
Andrei Chekun
d3f5f7468c test.py: rename suite.yaml to test_config.yaml
Switch of discovery of the tests by test.py
2026-02-24 09:48:38 +01:00
Andrei Chekun
e439ec9d67 test.py: add cluster tests to be executed by pytest
cluster tests are now executed by pytest also
Run pytest in an executor to avoid blocking the event loop, allowing
resource monitoring to run concurrently
Logic for passing the arguments to pytest changed due to the fact that
almost all directories now executed by pytest and directories that are
not handled excluded in pytest.ini
Modify the threads count for debug mode, because with the default
logic CI agents die
2026-02-24 09:48:38 +01:00
Andrei Chekun
edf7154fee test.py: add random seed for topology tests reproducibility
Set TOPOLOGY_RANDOM_FAILURES_TEST_SHUFFLE_SEED environment variable
during pytest configuration to enable to ensure that all xdist workers will
discover the same scope of the tests. This is a known limitation of the
xdist plugi where the discovered tests should be consistenta across
master and workers.
2026-02-24 09:48:38 +01:00
Andrei Chekun
4a7d8cd99d test.py: add explicit default values to pytest options
Add explicit default values to pytest command line options to prevent
issues when running tests with pytest's parallel execution where
options are not present on upper conftest, so they're just not set at all.
2026-02-24 09:48:38 +01:00
Andrei Chekun
99234f0a83 test.py: replace SCYLLA env var with build_mode fixture
Replace direct usage of SCYLLA environment variable with the build_mode
pytest fixture and path_to helper function. This makes tests more
flexible and consistent with the test framework. Also this allows to use
tests with xdist, where environment variable can be left in the master
process and will not be set in the workers
Add using the fixture to get the scylla binary from the suite, this will
align with getting relocatable Scylla exe.
2026-02-24 09:48:38 +01:00
Avi Kivity
0add130b9d lua: avoid undefined behavior when converting doubles to integers
Lua doesn't have separate integer and floating point numbers,
so we check if a number can fit in an integer and if so convert
it to an integer.

The conversion routine invokes undefined behavior (and even
acknowledges it!). More recent compilers changed their behavior
when casting infinities, breaking test_user_function_double_return
which tests this conversion.

Fix by tightening the conversion to not invoke undefined behavior.

Closes scylladb/scylladb#28503
2026-02-24 10:41:21 +02:00
Botond Dénes
1d5b8cc562 Merge 'Fix use after free in auth cache' from Marcin Maliszkiewicz
This patchset:
- ensures the loading semaphore is acquired in cross-shard callbacks
- fixes iterator invalidation problem when reloading all cached permissions

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-780
Backport: no, affected code not released yet

Closes scylladb/scylladb#28766

* github.com:scylladb/scylladb:
  auth: cache: fix permissions iterator invalidation in reload_all_permissions
  auth/cache: acquire _loading_sem in cross-shard callbacks
2026-02-24 10:35:46 +02:00
Pavel Emelyanov
5a5eb67144 vector_search/dns: Use newer seastar get_host_by_name API
The hostent::addr_list is deprecated in favor of address_entry::addr
field that contains the very same addresses.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28565
2026-02-23 21:28:43 +02:00
Pavel Emelyanov
6b02b50e3d Merge 'object_storage: add retryable machinery to object storage' from Ernest Zaslavsky
- add an overload to the rest http client to accept retry strategy instance as an argument
- remove hand rolled error handling from object storage client and replace with common machinery that supports handling and retrying when appropriate

No backport neede since it is only refactoring

Closes scylladb/scylladb#28161

* github.com:scylladb/scylladb:
  object_storage: add retryable machinery to object storage
  rest_client: add `simple_send` overload
2026-02-23 21:28:51 +03:00
Nadav Har'El
e463d528fe test: add unit test for tablet_map::get_secondary_replica()
This patch adds a unit test for tablet_map::get_secondary_replica().
It was never officially defined how the "primary" and "secondary"
replicas were chosen, and their implementation changed over time,
but the one invariant that this test verifies is that the secondary
replica and the primary replica must be a different node.

This test reproduces issue SCYLLADB-777, where we discovered that
the get_primary_replica() changed without a corresponding change to
get_primary_replica(). So before the previous patch, this test failed,
and after the previous patch - it passes.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-23 16:19:43 +02:00
Nadav Har'El
0c7f499750 test, alternator: add test for TTL expiration with a node down
We have many single-node functional tests for Alternator TTL in
test/alternator/test_ttl.py. This patch adds a multi-node test in
test/cluster/test_alternator.py. The new test verifies that:

 1. Even though Alternator TTL splits the work of scanning and expiring
    items between nodes, all the items get correctly expired.
 2. When one node is down, all the items still expire because the
    "secondary" owner of each token range takes over expiring the
   items in this range while the "primary" owner is down.

This new test is actually a port of a test we already had in dtest
(alternator_ttl_tests.py::test_multinode_expiration). This port is
faster and smaller then the original (fewer nodes, fewer rows), but it
still found a regression (SCYLLADB-777) that dtest missed - the new test
failed when running with tablets and in release build mode.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-23 16:19:43 +02:00
Nadav Har'El
9ab3d5b946 locator: fix get_secondary_replica() to match get_primary_replica()
The function tablet_map::get_secondary_replica() is used by Alternator
TTL to choose a node different from get_primary_replica(). Unfortunately,
recently (commits 817fdad and d88037d) the implementation of the latter
function changed, without changing the former. So this patch changes
the former to match.

The next two patches will have two tests that fail before this patch,
and pass with it:

1. A unit test that checks that get_secondary_replica() returns a
   different node than get_primary_replica().

2. An Alternator TTL test that checks that when a node is down,
   expirations still happen because the secondary replica takes over
   the primary replica's work.

Fixes SCYLLADB-777

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-23 16:19:30 +02:00
Botond Dénes
dcd8de86ee Merge 'docs: update a documentation of adding/removing DC and rebuilding a node' from Aleksandra Martyniuk
Describe a procedure to convert tablet keyspace replication factor
to rack list. Update the procedures of adding and removing a node
to consider tablet keyspaces.

Fixes: [SCYLLADB-398](https://scylladb.atlassian.net/browse/SCYLLADB-398)
Fixes: https://github.com/scylladb/scylladb/issues/28306.
Fixes: https://github.com/scylladb/scylladb/issues/28307.
Fixes: https://github.com/scylladb/scylladb/issues/28270.

Needs backport to all live branches as they all include tablets.

[SCYLLADB-398]: https://scylladb.atlassian.net/browse/SCYLLADB-398?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#28521

* github.com:scylladb/scylladb:
  docs: update nodetool rebuild docs
  docs: update a procedure of decommissioning a DC
  docs: update a procedure of adding a DC
  docs: describe upgrade to enforce_rack_list option
  docs: describe conversion to rack-list RF
2026-02-23 16:15:16 +02:00
Andrei Chekun
6ae58c6fa6 test.py: move storage tests to cluster subdirectory
Move the storage test suite from test/storage/ to test/cluster/storage/
to consolidate related cluster-based tests.This removes the standalone
test/storage/suite.yaml as the tests will use the cluster's test configuration.
Initially these tests were in cluster, but to use unshare at first
iteration they were moved outside. Now they are using another way to
handle volumes without unshare, they should be in cluster

Closes scylladb/scylladb#28634
2026-02-23 16:14:15 +02:00
Gleb Natapov
e23af998e1 test: schema_change_test: make test_schema_digest_does_not_change_with_disabled_features tests run in raft mode
They were running in recovery to reuse existing system tables without
group0 id, but since we want to remove recovery mode we need to
re-generate the tables.
2026-02-23 14:54:24 +02:00
Gleb Natapov
f589740a39 test: schema_change_test: drop schema tests relevant for no raft mode only
They were running in no longer supported recovery mode to force gossip
topology.
2026-02-23 14:54:24 +02:00
Gleb Natapov
92049c3205 topology: remove upgrade to raft topology code
We do no longer need this code since we expect that cluster to be
upgraded before moving to this version.
2026-02-23 14:54:24 +02:00
Gleb Natapov
4a9cf687cc group0: remove upgrade to group0 code
This patch removes ability of a cluster to upgrade from not having
group0 to having one. This ability is used in gossiper based recovery
procedure that is deprecated and removed in this version. Also remove
tests that uses the procedure.
2026-02-23 14:54:24 +02:00
Gleb Natapov
dcafb5c083 group0: refuse to boot if a cluster is still is not in a raft topology mode
We are going to drop legacy topology mode (gossiper mode) and no longer
allow ScyllaDB to start in this mode. This patch refuses to boot if a
cluster is not in raft topology mode yet. It may happen if a node of a
cluster that is not yet in a raft topology is upgraded to a newer
version. If this happens the node has to be downgraded. Raft topology
has to be enabled on a cluster and then the node can be upgraded again.
2026-02-23 14:54:24 +02:00
Gleb Natapov
ed52d1a292 storage_service: refuse to join a cluster in legacy mode
We are going to drop legacy topology mode (gossiper mode) and no longer
allow ScyllaDB to start in this mode. This patch disallows a node to
join a cluster that is still in legacy mode. A cluster needs to enable
raft mode first.
2026-02-23 14:54:24 +02:00
Marcin Maliszkiewicz
c5dc086baf Merge 'vector_search: return NaN for similarity_cosine with all-zero vectors' from Dawid Pawlik
The ANN vector queries with all-zero vectors are allowed even on vector indexes with similarity function set to cosine.
When enabling the rescoring option, those queries would fail as the rescoring calls `similarity_cosine` function underneath, causing an `InvalidRequest` exception as all-zero vectors were not allowed matching Cassandra's behaviour.

To eliminate the discrepancy we want the all-zero vector `similarity_cosine` calls to pass, but return the NaN as the cosine similarity for zero vectors is mathematically incorrect. We decided not to use arbitrary values contrary to USearch, for which the distance (not to be confused with similarity) is defined as cos(0, 0) = 0, cos(0, x) = 1 while supporting the range of values [0, 2].
If we wanted to convert that to similarity, that would mean sim_cos(0, x) = 0.5, which does not support mathematical reasoning why that would be more similar than for example vectors marking obtuse angles.
It's safe to assume that all-zero vectors for cosine similarity shouldn't make any impact, therefore we return NaN and eliminate them from best results.

Adjusted the tests accordingly to check both proper Cassandra and Scylla's behaviour.

Fixes: SCYLLADB-456

Backport to 2026.1 needed, as it fixes the bug for ANN vector queries using rescoring introduced there.

Closes scylladb/scylladb#28609

* github.com:scylladb/scylladb:
  test/vector_search: add reproducer for rescoring with zero vectors
  vector_search: return NaN for similarity_cosine with all-zero vectors
2026-02-23 13:10:44 +01:00
Aleksandra Martyniuk
9ccc95808f docs: update nodetool rebuild docs
Update nodetool rebuild docs to mention that the command does not
work for tablet keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/28270.
2026-02-23 12:45:01 +01:00
Aleksandra Martyniuk
e4c42acd8f docs: update a procedure of decommissioning a DC
Update a procedure of decommissioning a DC for tablet keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/28307.
2026-02-23 12:45:01 +01:00
Aleksandra Martyniuk
1c764cf6ea docs: update a procedure of adding a DC
Update a procedure of adding a DC for tablet keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/28306.
2026-02-23 12:45:01 +01:00
Aleksandra Martyniuk
e08ac60161 docs: describe upgrade to enforce_rack_list option 2026-02-23 12:44:57 +01:00
Aleksandra Martyniuk
eefe66b2b2 docs: describe conversion to rack-list RF
Fixes: SCYLLADB-398
2026-02-23 12:41:33 +01:00
Marcin Maliszkiewicz
54dca90e8c Merge 'test: move dtest/guardrails_test.py to test_guardrails.py' from Andrzej Jackowski
This patch series moves `test/cluster/dtest/guardrails_test.py`
to `test/cluster/test_guardrails.py`, and migrates it from `cluster/dtest/`
to `cluster/` framework.

There are two motivations for moving the test:
 - Execution time reduction (from 12s to 9s in 'dev' in my env)
 - Facilitate adding new tests to the `guardrails_test.py` file

No backport, `dtest/guardrails_test.py` is only on master

Closes scylladb/scylladb#28737

* github.com:scylladb/scylladb:
  test: move dtest/guardrails_test.py to test_guardrails.py
  test: prepare guardrails_test.py to be moved to test/cluster/
2026-02-23 12:34:43 +01:00
Marcin Maliszkiewicz
1293b94039 auth: cache: fix permissions iterator invalidation in reload_all_permissions
The inner loops in reload_all_permissions iterate role's permissions
and _anonymous_permissions maps across yield points. Concurrent
load_permissions calls (which don't hold _loading_sem) can emplace
into those same maps during a yield, potentially triggering a rehash
that invalidates the active iterator.

We want to avoid adding semaphore acquire in load_permissions
because it's on a common path (get_permissions).

Fixing by snapshotting the keys into a vector before iterating with
yields, so no long-lived map iterator is held across suspension
points.
2026-02-23 12:14:22 +01:00
Calle Wilund
fec7df7cbb topology::snapshot: Add expiry (ttl) to RPC/topo op
Not set yet, but includes it in messages so it can be properly
set in calling code. Will add entry to manifest.
2026-02-23 11:37:17 +01:00
Calle Wilund
cc60d014ed test_snapshot_with_tablets: Extend test to check manifest content
Verifies we have the expected tablet info in manifest.
2026-02-23 11:37:17 +01:00
Calle Wilund
f7aa2aacfc table::manifest: Add tablet info to manifest.json
If using tablets, will add info for each tablet present in
snapshot, with repair time and token ranges + map each
sstable to its owning tablet
2026-02-23 11:37:17 +01:00
Calle Wilund
ae10b5a897 test::test_snapshot_with_tablets: Add small test for topo coordinated snapshot 2026-02-23 11:37:16 +01:00
Calle Wilund
bac81df20f scylla-nodetool: Add "cluster snapshot" command
Similar to "normal" snapshot, but will use the cluster-wide,
topolgy coordinated snapshot API and path.
2026-02-23 11:37:16 +01:00
Calle Wilund
b0604d9840 api::storage_service: Add tablets/snapshots command for cluster level snapshot
Calls the topology_coordinator path for snapshots.
2026-02-23 11:37:16 +01:00
Piotr Dulikowski
a4c389413c Merge 'Hardens MV shutdown behavior by fixing lifecycle tracking for detached view-builder callbacks' from Alex Dathskovsky
This series hardens MV shutdown behavior by fixing lifecycle tracking for detached view-builder callbacks and aligning update handling with the same async dispatch style used by create/drop.

Patch 1 refactors on_update_view to use a dedicated coroutine dispatcher (dispatch_update_view), keeping update logic serialized under the existing view-builder lock and consistent with the callback architecture already used for create/drop paths.

Patch 2 adds explicit callback lifetime coordination in view_builder:

  - introduce a seastar::gate member
  - acquire _ops_gate.hold() when launching detached create/update/drop dispatch futures
  - keep the hold alive until each detached future resolves
  - close the gate during view_builder::drain() so shutdown waits for in-flight callback work before final teardown

Together, these changes reduce shutdown race exposure in MV event handling while preserving existing behavior for normal operation.

Testing:
  - pytest --test-py-init test/cluster/mv (47 passed, 7 skipped)

backport: not required started happening in master

fixes: SCYLLADB-687

Closes scylladb/scylladb#28648

* github.com:scylladb/scylladb:
  db/view: gate detached view-builder callbacks during shutdown
  db:view: refactor on_update_view to use coroutine dispatcher
2026-02-23 11:28:37 +01:00
Calle Wilund
9680541144 db::snapshot-ctl: Add method to do snapshot using topo coordinator
Separated from "local" snapshot.
2026-02-23 11:27:15 +01:00
Calle Wilund
425d6b4441 storage_proxy: Add snapshot_keyspace method
Takes set of ks->tables tuples and issues snapshot for each.
If feature is enabled, keyspace is non-local, and uses tablets,
will issue topo coordinator call across cluster.

Keyspaces not fitting the above will just go to "normal" (node
local) snapshot.
2026-02-23 11:27:15 +01:00
Calle Wilund
012a065364 topology_coordinator: Add handler for snapshot_tables
Similar to truncate, translates the topo op into RPC call
to relevant replicas and waits.
2026-02-23 11:27:15 +01:00
Calle Wilund
2bc633c3bd storage_proxy: Add handler for SNAPSHOT_WITH_TABLETS 2026-02-23 10:44:42 +01:00
Calle Wilund
d1b06482f0 messaging_service: Add SNAPSHOT_WITH_TABLETS verb 2026-02-23 10:44:42 +01:00
Calle Wilund
3075311f21 feature_service: Add SNAPSHOT_AS_TOPOLOGY_OPERATION feature
To detect if cluster can do coordinated snapshot
2026-02-23 10:44:41 +01:00
Calle Wilund
988c5238cf topology_mutation: Add setter for snapshot part of row 2026-02-23 10:44:41 +01:00
Calle Wilund
8bb81f00f8 system_keyspace::topology_requests_entry: Add snapshot info to table
Adds required info to communicate snapshot requests via topology
coordinator.
2026-02-23 10:44:38 +01:00
Calle Wilund
642aa44937 topology_state_machine: Add snapshot_tables operation
Note: placed after "noop" op, to not confuse ops in a mixed
(upgrading) cluster
2026-02-23 10:43:28 +01:00
Calle Wilund
e3d4493bf6 topology_coordinator: Break out logic from handle_truncate_table
Makes handle_truncate_table use shared logic, because we would
like to reuse some of it for other, coming ops. I.e. snapshot.
2026-02-23 10:43:28 +01:00
Calle Wilund
6e39c3bb83 storage_proxy: Break out logic from request_truncate_with_tablets
Makes request_truncate_with_tablets use a parameterized helper,
because eventually we will want to use almost identical logic
for other ops, like snapshot.
2026-02-23 10:43:28 +01:00
Pavel Emelyanov
ad0c2de0d1 test/object_store: Remove create_ks_and_cf() helper
Now all test cases use standard facilities to create data they test

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-23 10:43:28 +01:00
Pavel Emelyanov
6711afd73b test/object_store: Replace create_ks_and_cf() usage with standard methods
To create a keyspace theres new_test_keyspace helper
Table is created with a single cql.run_async with explicit schema
Dataset is populated with a single parallel INSERT as well

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-23 10:43:28 +01:00
Pavel Emelyanov
ed3a326637 test/object_store: Shift indentation right for test cases
This is preparational patch. Next will need to replace

  foo()
  bar()

with

  with something() as s:
      foo()
      bar()

Effectively -- only add the `with something()` line. Not to shift the
whole file right together with that future change, do it here.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-23 10:43:28 +01:00
Marcin Maliszkiewicz
75d4bc26d3 auth/cache: acquire _loading_sem in cross-shard callbacks
distribute_role() modifies _roles on non-zero shards via
invoke_on_others() without holding _loading_sem. Similarly, load_all()'s
invoke_on_others() callback calls prune_all() without the semaphore.
When these run concurrently with reload_all_permissions(), which
iterates _roles across yield points, an insertion can trigger
absl::flat_hash_map::resize(), freeing the backing storage while
an iterator still references it.

Fix by acquiring _loading_sem on the target shard in both
distribute_role()'s and load_all()'s invoke_on_others callbacks,
serializing all _roles mutations with coroutines that iterate
the map.
2026-02-23 10:30:03 +01:00
Pavel Emelyanov
3d07633300 test/object_store: Use itertools.product() for deeply nested loops
The test_restore_with_streaming_scopes want to run some loop body for
all (almost) combinations of scope, primary-replica-only and min tablet
count. For that three nested loops are used. Using itertools.product()
makes the code shorter, less indented and more explicit.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-23 12:28:53 +03:00
Pavel Emelyanov
a9a82f89ac test/object_store: Replace dataset creation usage with standard methods
Two places are fixed

1. The call to create_dataset() is replaced with three "library"
   methods. This makes it explicit which options and schema are used
   for that. Eventually, the large and bulky create_dataset will be
   removed

2. The part that restores data into a fresh new table calls some CQLs by
   hand, and partially re-uses variables obtained from previous call to
   create_dataset(). Using the same "library" methods to re-create an
   empty table makes this part much simpler

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-23 12:27:41 +03:00
Pavel Emelyanov
988606ac7f test/object_store: Shift indentation right for test_restore_with_streaming_scopes
This is preparational patch. Next will need to replace

  foo()
  bar()

with

  with something() as s:
      foo()
      bar()

Effectively -- only add the `with something()` line. Not to shift the
whole file right together with that future change, do it here.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-23 12:27:09 +03:00
Pavel Emelyanov
5161aeee95 test/backup: Run keyspace flush and snapshot taking API in parallel
The take_snapshot() helper runs these API sequentially for every server.
Running them with asyncio.gather() slightly reduces the wait-time thus
improving the total runtime.

Before:
    CPU utilization: 2.1%
    real	0m33,871s
    user	0m22,500s
    sys	        0m13,207s

After:
    CPU utilization: 2.4%
    real	0m29,532s
    user	0m22,351s
    sys	        0m12,890s

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-23 12:20:36 +03:00
Pavel Emelyanov
21752a43fe test/backup: Re-use take_snapshot() helper in do_abort_restore()
The test in question does _exactly_ what this helper does, but in a
longer way. The only difference is that it uses server_id as key to dict
with sstable components, but it's easy to tune.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-23 12:20:35 +03:00
Pavel Emelyanov
818a99810c test/backup: Move take_snapshot() helper up
So that it's not in the middle of tests themselves, but near other
"helper" functions in the .py file

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-23 12:20:35 +03:00
Ernest Zaslavsky
321d4caf0c object_storage: add retryable machinery to object storage
remove hand rolled error handling from object storage client
and replace with common machinery that supports exception
handling and retrying when appropriate
2026-02-22 14:00:44 +02:00
Ernest Zaslavsky
24972da26d rest_client: add simple_send overload
add an overload to rest client `simple_send` to accept a retry_strategy for http's make_request
2026-02-22 14:00:44 +02:00
Marcin Maliszkiewicz
aba5a8c37f generic_server: fix waiters count in shed log
Capture semaphore waiters count when blocking starts,
not after the wait completes.
2026-02-20 17:04:23 +01:00
Marcin Maliszkiewicz
23bed55170 generic_server: scale connection concurrency semaphore by listener count
The concurrency semaphore gates uninitialized connections across all
do_accepts loops, but was initialized to a fixed value regardless of
how many listeners exist. With multiple listeners competing for the
same units, each effectively gets less than the configured concurrency.

Initialize the semaphore to concurrency - 1 and signal 1 per listen()
call, so total capacity is concurrency - 1 + nr_listeners. This
guarantees each listener's accept loop can have at least one unit
available.
2026-02-20 16:59:19 +01:00
Patryk Jędrzejczak
e8efcae991 Merge 'Use standard ks/cf/data creation methods in object_store/test_basic.py test' from Pavel Emelyanov
The test uses create_ks_and_cf helper duplicating the existing code that does the same. This PR patches basic tests to use standard facilities. Also it prepares the ground for testing keyspace storage options with rf=3

Cleaning tests, not backporting

Closes scylladb/scylladb#28600

* https://github.com/scylladb/scylladb:
  test/object_store: Remove create_ks_and_cf() helper
  test/object_store: Replace create_ks_and_cf() usage with standard methods
  test/object_store: Shift indentation right for test cases
2026-02-20 15:53:38 +01:00
Nadav Har'El
d01915131a test/cqlpy: make test_indexing_paging_and_aggregation much faster
Currently, test_secondary_index.py::test_indexing_paging_and_aggregation
is very slow, and the slowest test in the test/cqlpy framework: It takes
around 13 seconds on dev build, and because it is CPU-bound (doesn't sleep),
it is much slower on debug builds. The reason for this slowness is that it
needs to set up and read over 10,000 rows which is the default
select_internal_page_size.

But after the patches in pull request (#25368), we can configure
select_internal_page_size, so in this patch we change the test to
temporarily reduce this option to just 50, and then the test can reach
the same code paths with just 142 rows instead of 20120 rows before this
patch.

As a result, the test should now be 140 times faster than it was before.
In practice, because of some fixed overheads (the test creates several
tables and indexes), in dev build mode the test run speedup is "only"
26-fold (to around half a second).

I verified that removing the code added in bb08af7 indeed makes the new
shorter test fail - and this is the only test in test_secondary_index.py
that starts to fail besides test_index_paging_group_by which is also
related (so my revert didn't just break secondary indexing completely).
So the shorter test is still a good regression test.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28268
2026-02-20 15:44:53 +02:00
Avi Kivity
92bc5568c5 tools: toolchain: build sanitizers for future toolchain
The future toolchain did not build the sanitizers, so debug
executables did not link. Fix by not disabling the sanitizers.

Closes scylladb/scylladb#28733
2026-02-20 15:44:24 +02:00
Botond Dénes
6c04e02f66 Merge 'Fix restoration test's validation of streaming directions' from Pavel Emelyanov
The test_restore_with_streaming_scopes among other things checks how data streams flow while restoring. Whether or not to check the streams is decided based on the min tablet count value, which is compared with a hardcoded 512. This value of 512 matched the tablet count used by this test until it was "optimized" by #27839, where this number changed to 5 and streaming checks became off.

Good news is that the very same checks are still performed by test_refresh_with_streaming_scopes. But it's better to have a working restoration test anyway.

Minor test fix, not backporting

Closes scylladb/scylladb#28607

* github.com:scylladb/scylladb:
  test: Fix the condition for streaming directions validation
  test: Split test_backup.py::check_data_is_back() into two
2026-02-20 15:42:10 +02:00
Botond Dénes
6f88c0dbd3 Merge ' test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance' from Tomasz Grabiec
Currently, the test assumes that when
'topology_coordinator_pause_before_processing_backlog: waiting' is
logged, the task for decommission must be there. This was based on the
assumption that topology coordinator is idle and decommission request
wakes it up. But if the server is slow enough, it may still be running
the load balancer in reaction to table creation, and block on that
injection point before decommission request was added.

Fix by waiting for the task to appear rather than the injection.

Fixes SCYLLADB-715

Only 2026.1 vulnerable.

Closes scylladb/scylladb#28688

* github.com:scylladb/scylladb:
  test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance
  test: cluster: task_manager_client: Introduce wait_task_appears()
  tests: pylib: util: Add exponential backoff to wait_for
2026-02-20 15:05:36 +02:00
Pavel Emelyanov
c96420c015 tests: Re-use manager.get_server_exe()
There's a bunch of incremental repair tests that want to call scylla
sstable command. For that they try to find where scylla binary by
scanning /proc directory (see local_process_id and get_scylla_path
helpers).

There's shorter way -- just call manager.get_server_exe().

Same for backup-restore test.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28676
2026-02-20 14:59:30 +02:00
Pavel Emelyanov
a4a0d75eee test/object_store: Parametrize test_simple_backup_and_restore()
There are three tests and a function with a pair of boolean parameters
called by those. It's less code if the function becomes a test with
parameters.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28677
2026-02-20 14:57:30 +02:00
Pavel Emelyanov
a2e1293f86 test/object_store: Squash two simple-backup tests together
The test_backup_simple creates a ks/cf, takes a snapshot, backs it up,
then checks that the files were uploaded. The test_backup_move does the
same, but also plays with 'move_files' parameter to be true/false.

In fact, the "move" test was the copy of "simple" one that dropepd check
for scheduling group being "streaming" (backup with --move-files can
check the same, it's not bad), and check for destination bucket to
contain needed files (same here -- checking that files arrived to bucket
after --move-files is good).

In the end of the day, after the change backup test is run two times,
instead of three, and performs extra checks for --move-files case.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28606
2026-02-20 14:49:30 +02:00
Botond Dénes
7e90ed657c Merge 'Fix client_options docs' from Karol Baryła
https://github.com/scylladb/scylladb/pull/25746 added a new column to `system.clients`: `client_options frozen<map<text, text>>`. This column stores all options sent by the client in the `STARTUP` message.
This PR also added `CLIENT_OPTIONS` to the list of values sent in `SUPPORTED` message, and documented that drivers can send their configuration (as JSON) in `STARTUP` under this key.

Documentation for the new column was not added to the description of `system.clients` table, and documentation about the new `STARTUP` key was added in `protocol-extensions.md`, but in the section about shard awareness extension.

This PR adds missing `system.clients` column description, moves the documentation of `CLIENT_OPTIONS` into its own section, and expands it a bit.

Backport: none, because this fixes internal documentation.

Closes scylladb/scylladb#28126

* github.com:scylladb/scylladb:
  protocol-extensions.md: Fix client_options docs
  system_keyspace.md: Add client_options column
  system_keyspace.md: Fix order in system.clients
2026-02-20 14:23:34 +02:00
Pavel Emelyanov
525cb5b3eb table: Use fmt::to_string() to stringify compation group ID
Doing it with format("{}", foo) is correct, but to_string is
a bit more lightweight.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28630
2026-02-20 14:13:15 +02:00
Patryk Jędrzejczak
d399a197f5 Merge 'raft: Await instead of returning future in wait_for_state_change' from Dawid Mędrek
The `try-catch` expression is pretty much useless in its current form. If we return the future, the awaiting will only be performed by the caller, completely circumventing the exception handling.

As a result, instead of handling `raft::request_aborted` with a proper error message, the user will face `seastar::abort_requested_exception` whose message is cryptic at best. It doesn't even point to the root of the problem.

Fixes SCYLLADB-665

Backport: This is a small improvement and may help when debugging, so let's backport it to all supported versions.

Closes scylladb/scylladb#28624

* https://github.com/scylladb/scylladb:
  test: raft: Add test_aborting_wait_for_state_change
  raft: Describe exception types for wait_for_state_change and wait_for_leader
  raft: Await instead of returning future in wait_for_state_change
2026-02-20 12:17:22 +01:00
Andrzej Jackowski
eb5a564df2 test: move dtest/guardrails_test.py to test_guardrails.py
This commit moves `guardrails_test.py`, prepared in the previous
commit of this patch series, to `test/cluster/test_guardrails.py`.
It also cleans up `suite.yaml`.
2026-02-20 11:39:52 +01:00
Andrzej Jackowski
9df426d2ae test: prepare guardrails_test.py to be moved to test/cluster/
Disable `test/cluster/dtest/guardrails_test.py` in `suite.yaml` and
make it compatible with the `test/cluster/` framework. This will
allow moving this file from `test/cluster/dtest/` to `test/cluster/`
in the next commit of this patch series.

There are two motivations for moving the test:
 - Execution time reduction (from 12s to 9s in 'dev' in my env)
 - Facilitate adding new tests to the `guardrails_test.py` file
2026-02-20 11:39:43 +01:00
Raphael S. Carvalho
f33f324f77 mutation_compactor: Fix tombstone GC metrics to account for only expired
There are 3 metrics (that goes in every compaction_history entry):
total_tombstone_purge_attempt
total_tombstone_purge_failure_due_to_overlapping_with_memtable
total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable

When a tombstone is not expired (e.g. doesn't satisfy "gc_before" or
grace period), it can be currently accounted as failure due to
overlapping with either memtable or uncompacting sstable.
So those 2 last metrics have noise of *unexpired* tombstones.

What we should do is to only account for expired tombstones in all
those 3 metrics. We lose the info of knowing the amount of tombstones
processed by compaction, now we'll only know about the expired ones.
But those metrics were primarily added for explaining why expired
tombstones cannot be removed.
We could have alternatively added a new field
purge_failure_due_to_being_unexpired or something, but
it requires adding a new field to compaction_history.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-737.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#28669
2026-02-20 10:43:58 +02:00
Botond Dénes
0bf4c68af5 Merge 'docs: fix link to docker build readme in the README.MD' from Marcin Szopa
Links were pointing to the `debian` subdirectory. However, there docker build was refactored to use `redhat`: 1abf981a73, see https://github.com/scylladb/scylladb/pull/22910

No backport, just a README link fixes.

Closes scylladb/scylladb#28699

* github.com:scylladb/scylladb:
  docs: fix path to the build_docker.sh which was moved from debian to redhat subdirectory
  docs: fix link to docker build README.MD
2026-02-20 08:21:46 +02:00
Botond Dénes
51a25c8af3 test/boost/batchlog_manager_test: add tests for v1 batchlog
The v1 table is used while upgrading from a pre-v2 version. We need
tests to ensure it still works.
2026-02-20 07:03:46 +02:00
Botond Dénes
83344dacbd test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2
Make the actual table name a parameter and add logic to adapt to the
variant used.
Also add dump_to_log::yes to is_rows() invokation to help debuging
tests.
2026-02-20 07:03:46 +02:00
Botond Dénes
2956714e19 test/boost/batchlog_manager_test: fix indentation 2026-02-20 07:03:46 +02:00
Botond Dénes
23732227fe test/boost/batchlog_manager_test: extract prepare_batches() method
To be shared between multiple tests in future commits.
Indentation is left broken.
2026-02-20 07:03:46 +02:00
Botond Dénes
af26956bb4 test/lib/cql_assertions: is_rows(): add dump parameter
When set to true, the query results will be logged by the testlog logger
with  debug level. A huge help when debugging failures around cql
assertions: seeing the actual query result is often enough to
immediately understand why the test failed.
2026-02-20 07:03:46 +02:00
Botond Dénes
48e9b3d668 tools/scylla-sstable: extract query result printers
To cql3/query_result_printer.hh. Allowing for other users, outside of
tools.
2026-02-20 07:03:46 +02:00
Botond Dénes
978627c4e1 tools/scylla-sstable: add std::ostream& arg to query result printers
Make them more general-purpose, in preparation to extracting them to
their own header.
2026-02-20 07:03:46 +02:00
Botond Dénes
0549b61d55 repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log
Provides visibility into whether batchlog replay was successful or not.
2026-02-20 07:03:46 +02:00
Botond Dénes
dd50bd9bd4 db/batchlog_manager: re-add v1 support
system.batchlog will still have to be used while the cluster is
upgrading from an older version, which doesn't know v2 yet.
Re-add support for replaying v1 batchlogs. The switch to v2 will happen
after the BATCHLOG_V2 cluster feature is enabled.

The only external user -- storage_proxy -- only needs a minor
adjustment: switch between the table names. The rest is handled
transparently by the db/batchlog.hh interface and the batchlog_manager.
2026-02-20 07:03:46 +02:00
Botond Dénes
8ffa3d32c0 db/batchlog_manager: return all_replayed from process_batch()
process_batch() currently returns stop_iteration::no from all control
paths. This is not useful. Return the all_replayed output param instead.
This requires making the batch() lambda a coroutine, but considering the
amount of work process_batch() does (send multiple writes), this should
be inconsequential.
2026-02-20 07:03:46 +02:00
Botond Dénes
091b43f54b db/batchlog_manager: process_bath() fix indentation 2026-02-20 07:03:46 +02:00
Botond Dénes
ef2b8b4819 db/batchlog_manager: make batch() a standalone function
Currently it is a huge lambda. Deserves to be a standalone function, to
make the replay_all_failed_batches() easier to read and modify.
2026-02-20 07:03:46 +02:00
Botond Dénes
ca2bbbad97 db/batchlog_manager: make structs stats public
Need to rename stats() -> get_stats() because it shadows the now
exported type name.
2026-02-20 07:03:46 +02:00
Botond Dénes
f8bfaedb6e db/batchlog_manager: allocate limiter on the stack
Now that replay_all_failed_batches() is a coroutine, there is no need to
make it a shared pointer anymore.
2026-02-20 07:03:46 +02:00
Botond Dénes
ac059dadc6 db/batchlog_manager: add feature_service dependency
Will be needed to check for batchlog_v2 feature.
2026-02-20 07:03:46 +02:00
Botond Dénes
c901ab53d2 gms/feature_service: add batchlog_v2 feature 2026-02-20 07:03:45 +02:00
Avi Kivity
66bef0ed36 lua, tools: adjust for lua 5.5 lua_newstate seed parameter
Lua 5.5 adds a seed parameter to lua_newstate(), provide it
with a strong random seed.

Closes scylladb/scylladb#28734
2026-02-20 06:52:37 +02:00
Avi Kivity
27a5502f14 Merge 'Reapply "main: test: add future and abort_source to after_init_func"' from Marcin Maliszkiewicz
The patchset fixes abort_source implementation for perf-alternator and perf-cql-raw. It moves
run_standalone function to common code in perf.hh with necessary templating.

We also add extensive testing so that it's more difficult to break the tooling in the future.

Fixes SCYLLADB-560
Backport: no, internal tooling improvement

Closes scylladb/scylladb#28541

* github.com:scylladb/scylladb:
  test: cluster: add tests for perf tools
  test: perf: fix port race condition on startup in connect workload
  test: perf: prepare benchmarks to bind to custom host
  test: perf: make perf-alterantor remote port configurable
  test: perf: fix ASAN leak warnings in perf-alternator
  Reapply "main: test: add future and abort_source to after_init_func"
2026-02-19 19:12:46 +02:00
Dawid Mędrek
c9d192c684 Merge 'raft ropology: prevent crashes of multiple nodes' from Patryk Jędrzejczak
Some assertions in the Raft-based topology are likely to cause crashes of
multiple nodes due to the consistent nature of the Raft-based code. If the
failing assertion is executed in the code run by each follower (e.g., the code
reloading the in-memory topology state machine), then all nodes can crash. If
the failing assertion is executed only by the leader (e.g., the topology
coordinator fiber), then multiple consecutive group0 leaders will chain-crash
until there is no group0 majority.

Crashing multiple nodes is much more severe than necessary. It's enough to
prevent the topology state machine from making more progress. This will
naturally happen after throwing a runtime error. The problematic fiber will be
killed or will keep failing in a loop. Note that it should be safe to block
the topology state machine, but not the whole group0, as the topology state
machine is mostly isolated from the rest of group0.

We replace some occurrences of `on_fatal_internal_error` and `SCYLLA_ASSERT`
with `on_internal_error`. These are not all occurrences, as some fatal
assertions make sense, for example, in the bootstrap procedure.

We also raise an internal error to prevent a segmentation fault in a few places.

Fixes #27987

Backporting this PR is not required, but we can consider it at least for 2026.1
because:
- it is LTS,
- the changes are low-risk,
- there shouldn't be many conflicts.

Closes scylladb/scylladb#28558

* github.com:scylladb/scylladb:
  raft topology: prevent accessing nullptr returned by topology::find
  raft topology: make some assertions non-crashing
2026-02-19 16:50:03 +01:00
Marcin Maliszkiewicz
22c3d8d609 Merge 'db/config: enable table audit by default' from Piotr Smaron
In https://github.com/scylladb/scylladb/pull/27262 table audit has been
re-enabled by default in `scylla.yaml`, logging certain categories to a table,
which should make new Scylla deployments have audit enabled.
Now, in the next release, we also want to enable audit in `db/config.cc`,
which should enable audit for all deployments, which don't explicitly configure
audit otherwise in `scylla.yaml` (or via cmd line).
BTW. Because this commit aligns audit's default config values in `db/config.cc`
to those of `scylla.yaml`, `docs/reference/configuration-parameters.rst`, which
is based on `db/config.cc` will start showing that table audit is the default.

Refs: https://github.com/scylladb/scylladb/issues/28355
Refs: https://scylladb.atlassian.net/browse/SCYLLADB-222

No backport: table audit has been enabled in 2026.1 in `scylla.yaml`,
and should be always on starting from the next release,
which is the release we're currently merging to (2026.2).

Closes scylladb/scylladb#28376

* github.com:scylladb/scylladb:
  docs: decommission: note audit ks may require ALTERing
  docs: mention table audit enabled by default
  audit: disable DDL by default
  db/config: enable table audit by default
  test/cluster: fix `test_table_desc_read_barrier` assertion
  test/cluster: adjust audit in tests involving decommissioning its ks
  audit_test: fix incorrect config in `test_audit_type_none`
2026-02-19 16:30:11 +01:00
Pavel Emelyanov
b4b9b547ce replica: Remove unused sched groups from keyspace and table configs
Compaction and statement groups are carried over on those configs, but
are in fact unused. Drop both.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28540
2026-02-19 15:47:31 +01:00
Patryk Jędrzejczak
45115415fb Merge 'Parametrize and merge several restoration test cases' from Pavel Emelyanov
There are four tests that check how restore with primary-replica-only option works in various scopes and topologies. Cases that check same-racks and same-datacenters are very very similar, so are those that check different-racks and different-datacenters. Parametrizing them and merging saves lots of code (+30 lines, -116 lines)

It's probably worth merging the resulting same-domain with different-domain tests, because the similarity is still large in both, but the result becomes too if-y, so not done here. Maybe later.

Improving tests, not backporting

Closes scylladb/scylladb#28569

* https://github.com/scylladb/scylladb:
  test: Merge test_restore_primary_replica_different_... tests
  test: Merge test_restore_primary_replica_same_... tests
  test: Don't specify expected_replicas in test_restore_primary_replica_different_dc_scope_all
  test: Remove local r_servers variable from test_restore_primary_replica_different_dc_scope_all
2026-02-19 15:42:55 +01:00
Pavel Emelyanov
26372e65df Merge 's3_perf: Fix the s3 perf test' from Ernest Zaslavsky
Fix the build of the test and the upload operation flow

No need to backport since it is only a test we barely use

Closes scylladb/scylladb#28595

* github.com:scylladb/scylladb:
  s3_perf: fix upload operation flow
  s3_perf: fix the CMake build
2026-02-19 15:31:43 +02:00
Avi Kivity
7ec710c250 Merge 'tablets: Reduce per-shard migration concurrency to 2' from Tomasz Grabiec
Tablet migration keeps sstable snapshot during streaming, which may
cause temporary increase in disk utilization if compaction is running
concurrently. SSTables compacted away are kept on disk until streaming
is done with them. The more tablets we allow to migrate concurrently,
the higher disk space can rise. When the target tablet size is
configured correcly, every tablet should own about 1% of disk
space. So concurrency of 4 shouldn't put us at risk. But target tablet
size is not chosen dynamically yet, and it may not be aligned with
disk capacity.

Also, tablet sizes can temporarily grow above the target, up to 2x
before the split starts, and some more because splits take a while to
complete.

To reduce the impact from this, reduce concurrency of
migration. Concurrency of 2 should still be enough to saturate
resources on the leaving shard.

Also, reducing concurrency means that load balancing is more
responsive to preemption. There will be less bandwidth sharing, so
scheduled migrations complete faster. This is important for scale-out,
where we bootstrap a node and want to start migrations to that new
node as soon as possible.

Refs scylladb/siren#15317

Closes scylladb/scylladb#28563

* github.com:scylladb/scylladb:
  tablets, config: Reduce migration concurrency to 2
  tablets: load_balancer: Always accept migration if the load is 0
  config, tablets: Make tablet migration concurrency configurable
2026-02-19 15:31:43 +02:00
Dawid Mędrek
fae71f79c2 test: raft: Add test_aborting_wait_for_state_change 2026-02-19 14:21:01 +01:00
Dawid Mędrek
e4f2b62019 raft: Describe exception types for wait_for_state_change and wait_for_leader
The methods of `raft::server` are abortable and if the passed
`abort_source` is triggered, they throw `raft::request_aborted`.
We document that.

Although `raft::server` is an interface, this is consistent with
the descriptions of its other methods.
2026-02-19 12:47:14 +01:00
Dawid Mędrek
c36623baad raft: Await instead of returning future in wait_for_state_change
The `try-catch` expression is pretty much useless in its current form.
If we return the future, the awaiting will only be performed by the
caller, completely circumventing the exception handling.

As a result, instead of handling `raft::request_aborted` with a proper
error message, the user will face `seastar::abort_requested_exception`
whose message is cryptic at best. It doesn't even point to the root
of the problem.

Fixes SCYLLADB-665
2026-02-19 12:47:14 +01:00
Marcin Maliszkiewicz
de4e5e10af test: perf: fix prepared statements logic in perf-simple-query
Due to lack of checks present in process_execute_internal from
transport/server.cc needs_authorization bool was always set to true
doing some extra work (check_access()) for each request.

We mirror the logic in this patch in test env which perf-simple-query
uses. This can also potentially improve runtime of unittests (marginally).

Note that bug is only in perf tool not scylla itself, the fix
decreases insns/op by around 10%:
Before: 41065 insns/op
After: 37452 insns/op
Command: ./build/release/scylla perf-simple-query --duration 5 --smp 1

Fixes https://github.com/scylladb/scylladb/issues/27941

Closes scylladb/scylladb#28704
2026-02-19 12:42:07 +02:00
Avi Kivity
58a662b9db dist: refresh container base image (ubi9-minimal)
Using an outdated image can cause problems when `microdnf update`
runs, if the distribution doesn't maintain good update hygiene.
Although, I suspect that when update failures happen they're really
caused by propagation delay of packages to mirrors.

Fix by using --pull=always to get a fresh image.

Ref https://scylladb.atlassian.net/browse/SCYLLADB-714

Closes scylladb/scylladb#28680
2026-02-19 12:42:43 +03:00
Ferenc Szili
f1bc17bd4c load_stats: fix race condition when computing sum_tablet_sizes
In storage_service::load_stats_for_tablet_based_tables(), we are passing
a reference to sum_tablet_sizes to the lambda which increments this value
on each shard via map_reduce0(). This means we could have a race
condition because this is executed on separate threads/CPUs.

This patch fixed the problem by collecting the sums by shard into a
vector, then summing those up.

Refs: SCYLLADB-678

Closes scylladb/scylladb#28703
2026-02-19 12:29:25 +03:00
Avi Kivity
dee868b71a interval: avoid clang 23 warning on throw statement in potentially noexcept function
interval_data's move constructor is conditionally noexcept. It
contains a throw statemnt for the case that the underlying type's
move constructor can throw; that throw statemnt is never executed
if we're in the noexept branch. Clang 23 however doesn't understand
that, and warns about throwing in a noexcept function.

Fix that by rewriting the logic using seastar::defer(). In the
noexcept case, the optimizer should eliminate it as dead code.

Closes scylladb/scylladb#28710
2026-02-19 12:24:20 +03:00
Ernest Zaslavsky
45d824e0fe s3_perf: fix upload operation flow
Correct the upload operation logic. The previous flow incorrectly
checked for the test file on S3 even when performing operations that do
not download the file, such as uploads.
2026-02-19 11:14:59 +02:00
Botond Dénes
b637e17b19 db/config: don't use RBNO for scaling
Remove bootstrap and decomission from allowed_repair_based_node_ops.
Using RBNO over streaming for these operations has no benefits, as they
are not exposed to the out-of-date replica problem that replace,
removenode and rebuild are.
On top of that, RBNO is known to have problems with empty user tables.
Using streaming for boostrap and decomission is safe and faster
than RBNO in all condition, especially when the table is small.

One test needs adjustment as it relies on RBNO being used for all node
ops.

Fixes: SCYLLADB-105

Closes scylladb/scylladb#28080
2026-02-19 09:51:09 +01:00
Calle Wilund
8e71a6f52a gcp: Add handling of 429 (too many requests) to exponential backoff
Fixes: SCYLLADB-611

Adds http error code 429 to codes handled by exponential backoff.

Closes scylladb/scylladb#28588
2026-02-19 09:42:39 +01:00
Marcin Maliszkiewicz
3417d50add test: cluster: add tests for perf tools
It checks if all workloads can be properly
executed with succesfull startup and teardown.

Especially testing alternator in remote mode is important
because it's invoked like this during pgo training in pgo.py.

Test runtime:
Release - 24s
Debug - 1m 15s

Test time consists mostly of Scylla startup in various modes.
2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz
c69534504c test: perf: fix port race condition on startup in connect workload
Other workloads at startup call prepopulate() which connects
with retry loop therefore it waits until cql port is open.

This commit adds a single place where we will wait for port
for all workloads.

Timeout is set to 5 minutes so that even slowest machines
are able to start.
2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz
828f2fbdb1 test: perf: prepare benchmarks to bind to custom host
This is usefull for tests where we use local networks
like 127.5.5.5 to avoid port and host collisions.
2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz
9f2b97bef4 test: perf: make perf-alterantor remote port configurable
It could be a usefull option to have.
2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz
f5a212e91e test: perf: fix ASAN leak warnings in perf-alternator
Those were intentional as test process is short lived
but when we add automated tests in the following commits
we expect clean exit, with 0 exit code.
2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz
0c76c73e34 Reapply "main: test: add future and abort_source to after_init_func"
This reverts commit ceec703bb7.

The commit was fixed with abort source handling for alternator
standalone path so it's safe to reapply.
2026-02-19 09:33:10 +01:00
Piotr Dulikowski
7d6f734a51 dictionary compression: add missing co_awaits on get_units
There is a handful of places in the code related to dictionary
compression which calls get_units to acquire semaphore units but the
returned future is not awaited, seemingly by mistake. The result of
get_units is assigned to a variable - which is reasonable at a glance
because the semaphore units need to be assigned to a variable in order
to control their scope - but at the same time if co_await is mistakenly
omitted, like here, doing so will silence the nodiscard check of
seastar::future and, effectively, the get_units call will be nearly
useless. Unfortunately, this is an easy mistake to make.

Fix the places in the code that acquire semaphore units via get_units
but never await the future returned by it. I found them by manual code
inspection, so I hope that I didn't miss any.

Closes scylladb/scylladb#28581
2026-02-18 16:40:40 +01:00
Ernest Zaslavsky
4026b54a5e s3_perf: fix the CMake build
Fix the CMake build of the perf_s3_client by adding the
necessary linkage with the jsoncpp library.
2026-02-18 17:12:08 +02:00
Piotr Smaron
797c5cd401 docs: decommission: note audit ks may require ALTERing
With audit feature enabled, it's not immediately obvious that its
pseudo-system keyspace `audit` may require adjusting its RF across DCs
before decommissioning a node, and this should be documented.
2026-02-18 15:14:57 +01:00
Piotr Smaron
65eec6d8e7 docs: mention table audit enabled by default
Also align the documentation with the current audit settings.
2026-02-18 15:14:57 +01:00
Piotr Smaron
c30607d80b audit: disable DDL by default
DDL audit category doesn't make sense if its enabled by default on its
own, as no DDL statements are going to be audited if audit_keyspaces/audit_tables
setting is empty. This may be counter-intuitive to our users, who may
expect to actually see these statements logged if we're enabling this by
default. Also, it doesn't make sense to enable a setting by default if
it has no effect.
Additionally, listed all possible audit categories for user's
convenience.
2026-02-18 15:14:57 +01:00
Piotr Smaron
08dc1008ba db/config: enable table audit by default
In https://github.com/scylladb/scylladb/pull/27262 table audit has been
re-enabled by default in `scylla.yaml`, logging certain categories to a table,
which should make new Scylla deployments have audit enabled.
Now, in the next release, we also want to enable audit in `db/config.cc`,
which should enable audit for all deployments, which don't explicitly configure
audit otherwise in `scylla.yaml` (or via cmd line).
BTW. Because this commit aligns audit's default config values in `db/config.cc`
to those of `scylla.yaml`, `docs/reference/configuration-parameters.rst`, which
is based on `db/config.cc` will start showing that table audit is the default.

Refs: https://github.com/scylladb/scylladb/issues/28355
Refs: https://scylladb.atlassian.net/browse/SCYLLADB-222
2026-02-18 15:14:57 +01:00
Piotr Smaron
95ee4a562c test/cluster: fix test_table_desc_read_barrier assertion
The test `assertion desc_schema[0] == desc_schema[1]` does a direct
list comparison, which is order-sensitive. Before enabling audit by default,
both nodes would return only the test keyspace/table, so the order
didn't matter. With audit enabled, there will be multiple keyspaces,
and they can be returned in different order by different nodes.
2026-02-18 15:14:57 +01:00
Piotr Smaron
2e12b83366 test/cluster: adjust audit in tests involving decommissioning its ks
When table audit is enabled, Scylla creates the "audit" ks with
NetworkTopologyStrategy and RF=3. During node decommission, streaming can fail
for the audit ks with "zero replica after the removal" when all nodes from a DC
are removed, and so we have to ALTER audit ks to either zero the number of its
replicas, to allow for a clear decommission, or have them in the 2nd DC.

BTW. https://github.com/scylladb/scylladb/issues/27395 is the same change, but
in dtests repository.
2026-02-18 15:14:55 +01:00
Piotr Smaron
0cf20fa15a audit_test: fix incorrect config in test_audit_type_none
Passing Python `None` to setup is incorrect, because config updates are sent
as a dict and `None` is treated as "unset" - meaning: use Scylla's default.
Using the explicit string "none" to guarantee that audit is disabled.
2026-02-18 15:12:26 +01:00
Asias He
1be80c9e86 repair: Skip auto repair for tables using RF one
There is no point running repair for tables using RF one. Row level
repair will skip it but the auto repair scheduler will keep scheduling
such repairs since repair_time could not be updated.

Skip such repairs at the scheduler level for auto repair.

If the request is issued by user, we will have to schedule such
repair otherwise the user request will never be finished.

Fixes SCYLLADB-561

Closes scylladb/scylladb#28640
2026-02-18 14:32:50 +02:00
Andrzej Jackowski
4221d9bbfd docs: improve examples in Handling Audit Failures section
This commit introduces four changes:
 - In the `table` example, singular forms (node, partition) are changed to
   plural forms (nodes, partitions). Currently, the default `table`
   audit configuration is RF=3 and writes use CL=ONE. Therefore,
   a `table` audit log write failure should not be caused by a single
   node unavailability, and plural forms are more adequate.
 - In the `table` example, unreachability due to network issues is
   mentioned because with RF=3, audit failure due to network problems
   is more likely to happen than a simultaneous failure of three
   nodes (such network failures happened in SCYLLADB-706).
 - In the `syslog` example, a slash `/` is changed to `or`, so `table`
   and `syslog` examples have similar structure.
 - As the `syslog` line is already being changed, I also change `unix`
   to `Unix`, as the capitalized form is the correct one.

Refs SCYLLADB-706

Closes scylladb/scylladb#28702
2026-02-18 13:10:01 +01:00
Botond Dénes
3bfd47da4b Merge 'transport: fix connection code to consume only initially taken semaphore units' from Marcin Maliszkiewicz
The connection's `cpu_concurrency_t` struct tracks the state of a connection
to manage the admission of new requests and prevent CPU overload during
connection storms. When a connection holds units (allowed only 0 or 1), it is
considered to be in the "CPU state" and contributes to the concurrency limits
used when accepting new connections.

The bug stems from the fact that `counted_data_source_impl::get` and
`counted_data_sink_impl::put` calls can interleave during execution. This
occurs because of `should_parallelize` and `_ready_to_respond`, the latter being
a future chain that can run in the background while requests are being read.
Consequently, while reading request (N), the system may concurrently be
writing the response for request (N-1) on the same connection.

This interleaving allows `return_all()` to be called twice before the
subsequent `consume_units()` is invoked. While the second `return_all()` call
correctly returns 0 units, the matching `consume_units()` call would
mistakenly take an extra unit from the semaphore. Over time, a connection
blocked on a read operation could end up holding an unreturned semaphore
unit. If this pattern repeats across multiple connections, the semaphore
units are eventually depleted, preventing the server from accepting any
new connections.

The fix ensures that we always consume the exact number of units that were
previously returned. With this change, interleaved operations behave as
follows:

get() return_all     — returns 1 unit
put() return_all     — returns 0 units
get() consume_units  — takes back 1 unit
put() consume_units  — takes back 0 units

Logically, the networking phase ends when the first network operation
concludes. But more importantly, when a network operation
starts, we no longer hold any units.

Other solutions are possible but the chosen one seems to be the
simplest and safest to backport.

Fixes SCYLLADB-485
Backport: all supported affected versions, bug introduced with initial feature implementation in: ed3e4f33fd

Closes scylladb/scylladb#28530

* github.com:scylladb/scylladb:
  test: auth_cluster: add test for hanged AUTHENTICATING connections
  transport: fix connection code to consume only initially taken semaphore units
2026-02-18 13:48:49 +02:00
Marcin Szopa
9217f85e99 docs: fix path to the build_docker.sh which was moved from debian to redhat subdirectory 2026-02-18 12:19:27 +01:00
Marcin Szopa
e66bf4a6f5 docs: fix link to docker build README.MD
Link was pointing to the old place of the README. It was moved in the 1abf981a73
2026-02-18 12:12:46 +01:00
Piotr Dulikowski
b9db3c9c75 Merge 'Add consistent permissions cache' from Marcin Maliszkiewicz
This patchset replaces permissions cache based on loading_cache with a new unified (permissions and roles), full, coherent auth cache.

Reason for the change is that we want to improve scenarios under stress and simplify operation manuals. New cache doesn't require any tweaking. And it behaves particularly better in scenarios with lots of schema entities (e.g. tables) combined with unprepared queries. Old cache can generate few thousands of extra internal tps due to cache refresh.

Benchmark of unprepared statements (just to populate the cache) with 1000 tables shows 3k tps of internal reads reduction and 9.1% reduction of median instructions per op. So many tables were used to show resource impact, cache could be filled with other resource types to show the same improvement.

Backport: no, it's a new feature.
Fixes https://github.com/scylladb/scylladb/issues/7397
Fixes https://github.com/scylladb/scylladb/issues/3693
Fixes https://github.com/scylladb/scylladb/issues/2589
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-147

Closes scylladb/scylladb#28078

* github.com:scylladb/scylladb:
  test: boost: add auth cache tests
  auth: add cache size metrics
  docs: conf: update permissions cache documentation
  auth: remove old permissions cache
  auth: use unified cache for permissions
  auth: ldap: add permissions reload to unified cache
  auth: add permissions cache to auth/cache
  auth: add service::revoke_all as main entry point
  auth: explicitly life-extend resource in auth_migration_listener
2026-02-18 12:03:20 +01:00
Tomasz Grabiec
af0b5d0894 Merge 'tablets global barrier: acknowledge barrier_and_drain from all nodes' from Petr Gusev
Before this series, the `global_barrier` used during tablet migration did not guarantee that `barrier_and_drain` was acknowledged by tablet replicas. As a result, if a request coordinator was fenced out, stale requests from previous topology versions could still execute on replicas in parallel with new requests from incompatible topology versions. For example, stale requests from `tablet_transition_stage::streaming` could run concurrently with new requests from `tablet_transition_stage::use_new`. This caused several issues, including [#26864](https://github.com/scylladb/scylladb/issues/26864) and [#26375](https://github.com/scylladb/scylladb/issues/26375).

This PR fixes the problem in two steps:
* Replicas now hold an erm strong pointer while handling RPCs from coordinators.
* The tablet barrier is updated to require `barrier_and_drain` acknowledgments from all nodes.

A description of alternative solutions and various tradeoffs can be found in [this document](https://docs.google.com/document/d/1tpDtPOsrGaZGBYkdwOKApQv4eMzrBydMM1GaYYmaPgg/edit?pli=1&tab=t.0#heading=h.vidfy0hrz5j7).

[A previous attempt on this changes](https://github.com/scylladb/scylladb/pull/27185).

Fixes [scylladb/scylladb#26864](https://github.com/scylladb/scylladb/issues/26864)
Fixes [scylladb/scylladb#26375](https://github.com/scylladb/scylladb/issues/26375)

backport: needs backport to 2025.4 (fixes #26864 for tablets LWT)

Closes scylladb/scylladb#27492

* github.com:scylladb/scylladb:
  tests: extract get_topology_version helper
  global tablets barrier: require all nodes to ack barrier_and_drain
  topology_coordinator: pass raft_topology_cmd by value
  storage_proxy: hold erms in replica handlers
  token_metadata: improve stale versions diagnostics
2026-02-18 11:45:56 +01:00
Pavel Emelyanov
0c443d5764 gms: Use newer seastar get_host_by_name API
The hostent::addr_list is deprecated in favor of address_entry::addr
field that contains the very same addresses.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28566
2026-02-18 12:24:35 +02:00
Pavel Emelyanov
5b740afe9a database: Remove streaming sched group getter
All users of it had been updated to get the streaming group elsewhere,
so this getter is no longer needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28527
2026-02-18 12:23:35 +02:00
Avi Kivity
c5a1f44731 tools: toolchain: switch from ccache to sccache
sccache combines the functions of ccache and distcc, and
promises to support C++20 modules in the future. Switch
to sccache in anticipation of modules support.

The documentation is adjusted since cache will be
persistent for sccache without further work.

Closes scylladb/scylladb#28524
2026-02-18 12:23:12 +02:00
Botond Dénes
36167a155e Merge 'Remove map_to_key_value() helpers from API' from Pavel Emelyanov
There are some places that get `map<foo, bar>` and return it to the caller as `"key": string(foo), "value": string(bar)` json. For that there's `map_to_key_value()` helper in api.hh that re-formats the map into a vector of json elements and returns it, letting seastar json-ize that vector.

Recently in seastar there appeared stream_range_as_array() helper that helps streaming any range without converting it into intermediate collection. Some of the hottest users of `map_to_key_value()` had been converted, this PR converts few remainders and removes the helper in question to encourage further usage of the stream_range_as_array().

Code cleanup, not backporting

Closes scylladb/scylladb#28491

* github.com:scylladb/scylladb:
  api: Remove map_to_key_value() helpers
  api: Streamify view_build_statuses handler
  api: Streamify few more storage_service/ handlers
  api: Add map_to_json() helper
  api: Coroutinize view_build_statuses handler
2026-02-18 12:22:00 +02:00
Ernest Zaslavsky
196f7cad93 nodetool: fix handling of "--primary-replica-only" argument
The "--primary-replica-only" ("-pro") flag was previously ignored by
the `restore` operation. This patch ensures the argument is parsed and
applied correctly.

Closes scylladb/scylladb#28490
2026-02-18 12:21:27 +02:00
Pavel Emelyanov
bce43c6b20 api: Remove unused (lost) local variable
Lost when the get_range_to_endpoint_map hander was implemented for real (48c3c94aa6)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28489
2026-02-18 12:20:30 +02:00
Alex
c44ad31d44 db/view: gate detached view-builder callbacks during shutdown
Detached migration callbacks (on_create_view, on_update_view, on_drop_view)
  can race with view_builder::drain() teardown.

  Add a lifetime gate to view_builder and wire callback launches through
  _ops_gate.hold() so each detached dispatch future is tracked until it
  completes (finally keeps the hold alive). During shutdown, drain()
  now waits for all tracked callback work with _ops_gate.close().

  This ensures drain does not proceed past callback lifetime while shutdown is in
  progress, and ignores only gate_closed_exception at callback entry as the
  expected shutdown path.
2026-02-18 11:56:41 +02:00
Pavel Emelyanov
b01adf643c Merge 'init: fix infinite loop on npos wrap with updated Seastar' from Emil Maskovsky
Fixes parsing of comma-separated seed lists in "init.cc" and "cql_test_env.cc" to use the standard `split_comma_separated_list` utility, avoiding manual `npos` arithmetic. The previous code relied on `npos` being `uint32_t(-1)`, which would not overflow in `uint64_t` target and exit the loop as expected. With Seastar's upcoming change to make `npos` `size_t(-1)`, this would wrap around to zero and cause an infinite loop.

Switch to `split_comma_separated_list` standardized way of tokenization that is also used in other places in the code. Empty tokens are handled as before. This prevents startup hangs and test failures when Seastar is updated.

The other commit also removes the unnecessary creation of temporary `gms::inet_address()` objects when calling `std::set<gms::inet_address>::emplace()`.

Refs: https://github.com/scylladb/seastar/pull/3236

No backport: The problem will only appear in master after the Seastar will be upgraded. The old code works with the Seastar before https://github.com/scylladb/seastar/pull/3236 (although by accident because of different integer bitsizes).

Closes scylladb/scylladb#28573

* github.com:scylladb/scylladb:
  init: fix infinite loop on npos wrap with updated Seastar
  init: remove unnecessary object creation in emplace calls
2026-02-18 11:46:26 +03:00
Aleksandra Martyniuk
100ccd61f8 tasks: increase tasks_vt_get_children timeout
test_node_ops_tasks.py::test_get_children fails due to timeout of
tasks_vt_get_children injection in debug mode. Compared to a successful
run, no clear root cause stands out.

Extend the message timeout of tasks_vt_get_children from 10s to 60s.

Fixes: #28295.

Closes scylladb/scylladb#28599
2026-02-18 11:39:19 +03:00
Dani Tweig
aac0f57836 .github/workflows: add SMI to milestone sync Jira project keys
What changed
Updated .github/workflows/call_sync_milestone_to_jira.yml to include SMI in jira_project_keys

Why (Requirements Summary)
Adding SMI to create releases in the SMI Jira project based on new milestones from scylladb.git.
This will create a new release in the SMI Jira project when a milestone is added to scylladb.git.

Fixes:PM-190

Closes scylladb/scylladb#28585
2026-02-18 09:35:37 +02:00
Nadav Har'El
a1475dbeb9 test/cqlpy: make test testMapWithLargePartition faster
Right now the slowest test in the test/cqlpy directory is

   cassandra_tests/validation/entities/collections_test.py::
      testMapWithLargePartition

This test (translated from Cassandra's unit test), just wants to verify
that we can write and flush a partition with a single large map - with
200 items totalling around 2MB in size.

200 items totalling 2MB is large, but not huge, and is not the reason
why this test was so so slow (around 9 seconds). It turns out that most
of the test time was spent in Python code, preparing a 2MB random string
the slowest possible way. But there is no need for this string to be
random at all - we only care about the large size of the value, not the
specific characters in it!

Making the characters written in this text constant instead of random
made it 20 times fast - it now takes less than half a second.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28271
2026-02-18 10:12:16 +03:00
Raphael S. Carvalho
5b550e94a6 streaming: Release space incrementally during file streaming
File streaming only releases the file descriptors of a tablet being
streamed in the very streaming end. Which means that if the streaming
tablet has compaction on largest tier finished after streaming
started, there will be always ~2x space amplification for that
single tablet. Since there can be up to 4 tablets being migrated
away, it can add up to a significant amount, since nodes are pushed
to a substantial usage of available space (~90%).

We want to optimize this by dropping reference to a sstable after
it was fully streamed. This way, we reduce the chances of hitting
2x space amplification for a given tablet.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#28505
2026-02-18 10:10:40 +03:00
Avi Kivity
f3cbd76d93 build: install cassandra-stress RPM with no signature check
Fedora 45 tightened the default installation checks [1]. As a result
the cassandra-stress rpm we provide no longer installs.

Install it with --no-gpgchecks as a workaround. It's our own package
so we trust it. Later we'll sign it properly.

We install its dependencies via the normal methods so they're still
checked.

[1] https://fedoraproject.org/wiki/Changes/Enforcing_signature_checking_by_default

Closes scylladb/scylladb#28687
2026-02-18 10:08:13 +03:00
Pavel Emelyanov
89d8ae5cb6 Merge 'http: prepare http clients retry machinery refactoring' from Ernest Zaslavsky
Today S3 client has well established and well testes (hopefully) http request retry strategy, in the rest of clients it looks like we are trying to achieve the same writing the same code over and over again and of course missing corner cases that already been addressed in the S3 client.
This PR aims to extract the code that could assist other clients to detect the retryability of an error originating from the http client, reuse the built in seastar http client retryability and to minimize the boilerplate of http client exception handling

No backport needed since it is only refactoring of the existing code

Closes scylladb/scylladb#28250

* github.com:scylladb/scylladb:
  exceptions: add helper to build a chain of error handlers
  http: extract error classification code
  aws_error: extract `retryable` from aws_error
2026-02-18 10:06:37 +03:00
Pavel Emelyanov
2f10fd93be Merge 's3_client: Fix s3 part size and number of parts calculation' from Ernest Zaslavsky
- Correct `calc_part_size` function since it could return more than 10k parts
- Add tests
- Add more checks in `calc_part_size` to comply with S3 limits

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-640
Must be ported back to 2025.3/4 and 2026.1 since we may encounter this bug in production clusters

Closes scylladb/scylladb#28592

* github.com:scylladb/scylladb:
  s3_client: add more constrains to the calc_part_size
  s3_client: add tests for calc_part_size
  s3_client: correct multipart part-size logic to respect 10k limit
2026-02-18 10:04:53 +03:00
Tomasz Grabiec
d33d38139f test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance
Currently, the test assumes that when
'topology_coordinator_pause_before_processing_backlog: waiting' is
logged, the task for decommission must be there. This was based on the
assumption that topology coordinator is idle and decommission request
wakes it up. But if the server is slow enough, it may still be running
the load balancer in reaction to table creation, and block on that
injection point before decommission request was added.

Fix by waiting for the task to appear rather than the injection.

Fixes SCYLLADB-715
2026-02-18 01:02:50 +01:00
Tomasz Grabiec
2454de4f8f test: cluster: task_manager_client: Introduce wait_task_appears() 2026-02-18 01:02:44 +01:00
Tomasz Grabiec
e14eca46af tests: pylib: util: Add exponential backoff to wait_for
Allows balancing the trade-off between fast execution in case the
condition is satisfied quickly and not adding load when it's not.
2026-02-18 01:02:19 +01:00
Szymon Malewski
668d6fe019 vector: Improve similarity functions performance
Improves performance of deserialization of vector data for calculating similarity functions.
Instead of deserializing vector data into a std::vector<data_value>, we deserialize directly into a std::vector<float>
and then pass it to similarity functions as a std::span<const float>.
This avoids overhead of data_value allocations and conversions.
Example QPS of `SELECT id, similarity_cosine({vector<float, 1536>}, {vector<float, 1536>}) ...`:
client concurrency 1: before: ~135 QPS, after: ~1005 QPS
client concurrency 20: before: ~280 QPS, after: ~2097 QPS
Measured using https://github.com/zilliztech/VectorDBBench (modified to call above query without ANN search)

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-471

Closes scylladb/scylladb#28615
2026-02-18 00:33:34 +02:00
Calle Wilund
ab4e4a8ac7 commitlog: Always abort replenish queue on loop exit
Fixes #28678

If replenish loop exits the sleep condition, with an empty queue,
when "_shutdown" is already set, a waiter might get stuck, unsignalled
waiting for segments, even though we are exiting.

Simply move queue abort to always be done on loop exit.

Closes scylladb/scylladb#28679
2026-02-17 23:46:47 +02:00
Emil Maskovsky
6b98f44485 init: fix infinite loop on npos wrap with updated Seastar
Fixes parsing of comma-separated seed lists in "init.cc" and
"cql_test_env.cc" to use the standard `split_comma_separated_list`
utility, avoiding manual `npos` arithmetic. The previous code relied on
`npos` being `uint32_t(-1)`, which would not overflow in `uint64_t`
target and exit the loop as expected. With Seastar's upcoming change
to make `npos` `size_t(-1)`, this would wrap around to zero and cause
an infinite loop.

Switch to `split_comma_separated_list` standardized way of tokenization
that is also used in other places in the code. Empty tokens are handled
as before. This prevents startup hangs and test failures when Seastar is
updated.

Refs: scylladb/seastar#3236
2026-02-17 17:57:13 +00:00
Emil Maskovsky
bda0fc9d93 init: remove unnecessary object creation in emplace calls
Simplifies code by directly passing constructor arguments to emplace,
avoiding redundant temporary gms::inet_address() object creation.
Improves clarity and potentially performance in affected areas.
2026-02-17 17:57:12 +00:00
Marcin Maliszkiewicz
741969cf4c test: boost: add auth cache tests
The cache is covered already with general auth
dtests but some cases are more tricky and easier
to express directly as calls to cache class.
For such tests boost test file was added.
2026-02-17 18:18:40 +01:00
Marcin Maliszkiewicz
c11eb73a59 auth: add cache size metrics 2026-02-17 18:18:40 +01:00
Marcin Maliszkiewicz
a059798de9 docs: conf: update permissions cache documentation 2026-02-17 18:18:40 +01:00
Marcin Maliszkiewicz
a23e503e7b auth: remove old permissions cache 2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz
9d9184e5b7 auth: use unified cache for permissions 2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz
7eedf50c12 auth: ldap: add permissions reload to unified cache
The LDAP server may change role-chain assignments without notifying
Scylla. As a result, effective permissions can change, so some form of
polling is required.

Currently, this is handled via cache expiration. However, the unified
cache is designed to be consistent and does not support expiration.
To provide an equivalent mechanism for LDAP, we will periodically
reload the permissions portion of the new cache at intervals matching
the previously configured expiration time.
2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz
10996bd0fb auth: add permissions cache to auth/cache
We want to get rid of loading cache because its periodic
refresh logic generates a lot of internal load when there
is many entries. Also our operation procedures involve tweaking
the config while new unified cache is supposed to work out
of the box.
2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz
03c4e4bb10 auth: add service::revoke_all as main entry point
In the following commit we'll need to add some
cache related logic (removing resource permissions).
This logic doesn't depend on authorizer so it should
be managed by the service itself.
2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz
070d0bfc4c auth: explicitly life-extend resource in auth_migration_listener
Otherwise it's easy to trigger use-after-free when code slightly changes.
2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz
3b98451776 test: auth_cluster: add test for hanged AUTHENTICATING connections
Test runtime:
Release - 2s
Debug - 5s
2026-02-17 17:55:48 +01:00
Marcin Maliszkiewicz
0376d16ad3 transport: fix connection code to consume only initially taken semaphore units
The connection's cpu_concurrency_t struct tracks the state of a connection
to manage the admission of new requests and prevent CPU overload during
connection storms. When a connection holds units (allowed only 0 or 1), it is
considered to be in the "CPU state" and contributes to the concurrency limits
used when accepting new connections.

The bug stems from the fact that `counted_data_source_impl::get` and
`counted_data_sink_impl::put` calls can interleave during execution. This
occurs because of `should_parallelize` and `_ready_to_respond`, the latter being
a future chain that can run in the background while requests are being read.
Consequently, while reading request (N), the system may concurrently be
writing the response for request (N-1) on the same connection.

This interleaving allows `return_all()` to be called twice before the
subsequent `consume_units()` is invoked. While the second `return_all()` call
correctly returns 0 units, the matching `consume_units()` call would
mistakenly take an extra unit from the semaphore. Over time, a connection
blocked on a read operation could end up holding an unreturned semaphore
unit. If this pattern repeats across multiple connections, the semaphore
units are eventually depleted, preventing the server from accepting any
new connections.

The fix ensures that we always consume the exact number of units that were
previously returned. With this change, interleaved operations behave as
follows:

get() return_all     — returns 1 unit
put() return_all     — returns 0 units
get() consume_units  — takes back 1 unit
put() consume_units  — takes back 0 units

Logically, the networking phase ends when the first network operation
concludes. But more importantly, when a network operation
starts, we no longer hold any units.

Other solutions are possible but the chosen one seems to be the
simplest and safest to backport.

Fixes SCYLLADB-485
2026-02-17 17:55:48 +01:00
Dani Tweig
5dc06647e9 .github: add workflow to auto-close issues from ScyllaDB associates
Added .github/workflows/close_issue_for_scylla_employee.yml workflow file to automatically close issues opened by ScyllaDB associates

We want to allow external users to open issues in the scylladb repo, but for ScyllaDB associates, we would like them to open issues in Jira instead. If a ScyllaDB associates opens by mistake an issue in scylladb.git repo, the issue will be closed automatically with an appropriate comment explaining that the issue should be opened in Jira.

This is a new github action, and does not require any code backport.

Fixes: PM-64

Closes scylladb/scylladb#28212
2026-02-17 17:18:32 +02:00
Dani Tweig
bb8a2c3a26 .github/workflow/:Add milestone sync to Jira based on GitHub Action
What changed
Added new workflow file .github/workflows/call_jira_sync_pr_milestone.yml

Why (Requirements Summary)
Adds a GitHub Action that will be triggered when a milestone is set or removed from a PR
When milestone is added (milestoned event), calls main_jira_sync_pr_milestone_set.yml from github-automation.git, which will add the version to the 'Fix Versions' field in the relevant linked Jira issue
When milestone is removed (demilestoned event), calls main_jira_sync_pr_milestone_removed.yml from github-automation.git, which will remove the version from the 'Fix Versions' field in the relevant linked Jira issue
Testing was performed in staging.git and the STAG Jira project.

Fixes:PM-177

Closes scylladb/scylladb#28575
2026-02-17 16:41:03 +02:00
Botond Dénes
2e087882fa Merge 'GCS object storage. Fix incompatibilty issues with "real" GCS' from Calle Wilund
Fixes #28398
Fixes #28399

When used as path elements in google storage paths, the object names need to be URL encoded. Due to

a.) tests not really using prefixes including non-url valid chars (i.e. / etc)
and
b.) the mock server used for most testing not enforcing this particular aspect,

this was missed.

Modified unit tests to use prefixing for all names, so when running real GS, any errors like this will show.

"Real" GCS also behaves a bit different when listing with pager, compared to mock;
The former will not give a pager token for last page, only penultimate.
 Adds handling for this.

Needs backport to the releases that have (though might not really use) the feature, as it is technically possible to use google storage for backup and whatnot there, and it should work as expected.

Closes scylladb/scylladb#28400

* github.com:scylladb/scylladb:
  utils/gcp/object_storage: URL-encode object names in URL:s
  utils::gcp::object_storage: Fix list object pager end condition detection
2026-02-17 16:40:02 +02:00
Andrei Chekun
1b5789cd63 test.py: refactor manager fixture
The current manager flow have a flaw. It will trigger pytest.fail when
it found errors on teardown regardless if the test was already failed.
This will create an additional record in JUnit report with the same name
and Jenkins will not be able to show the logs correctly. So to avoid
this, this PR changes logic slightly.
Now manager will check that test failed or not to avoid two fails for
the same test in the report.
If test passed, manager will check the cluster status and fail if
something wrong with a status of it. There is no need to check the
cluster status in case of test fail.
If test passed, and cluster status if OK, but there are unexpected
errors in the logs, test will fail as well. But this check will gather
all information about the errors and potential stacktraces and will only
fail the test if it's not yet failed to avoid double entry in report.

Closes scylladb/scylladb#28633
2026-02-17 14:35:18 +01:00
Dawid Mędrek
5b5222d72f Merge 'test: make test_different_group0_ids work with the Raft-based topology' from Patryk Jędrzejczak
The test was marked with xfail in #28383, as it needed to be updated to
work with the Raft-based topology. We are doing that in this patch.

With the Raft-based topology, there is no reason to check that nodes with
different group0 IDs cannot merge their topology/token_metadata. That is
clearly impossible, as doing any topology change requires being in the
same group0. So, the original regression test doesn't make sense.

We can still test that nodes with different group0 IDs cannot gossip with
each other, so we keep the test. It's very fast anyway.

No backport, test update.

Closes scylladb/scylladb#28571

* github.com:scylladb/scylladb:
  test: run test_different_group0_ids in all modes
  test: make test_different_group0_ids work with the Raft-based topology
2026-02-17 13:56:41 +01:00
Dawid Mędrek
1b80f6982b Merge 'test: make the load balancer simulator tablet size aware' from Ferenc Szili
Currently, the load balancing simulator computes node, shard and tablet load based on tablet count.

This patch changes the load balancing simulator to be tablet size aware. It generates random tablet sizes with a normal distribution, and a mean value of `default_target_tablet_size`, and reports the computed load for nodes and tables based on tablet size sum, instead of tablet count.

This is the last patch in the size based load balancing series. It is the last PR in the Size Based Load Balancing series:

- First part for tablet size collection via load_stats: scylladb/scylladb#26035
- Second part reconcile load_stats: scylladb/scylladb#26152
- The third part for load_sketch changes: scylladb/scylladb#26153
- The fourth part which performs tablet load balancing based on tablet size: scylladb/scylladb#26254
- The fifth part changes the load balancing simulator: scylladb/scylladb#26438

This is a new feature and backport is not needed.

Closes scylladb/scylladb#26438

* github.com:scylladb/scylladb:
  test, simulator: compute load based on tablet size instead of count
  test, simulator: generate tablet sizes and update load_stats
  test, simulator: postpone creation of load_stats_ptr
2026-02-17 13:29:37 +01:00
Avi Kivity
ffde2414e8 cql3: grammar: remove special case for vector similarity functions in selectors
In b03d520aff ("cql3: introduce similarity functions syntax") we
added vector similarity functions to the grammar. The grammar had to
be modified because we wanted to support literals as vector similarity
function arguments, and the general function syntax in selectors
did not allow that.

In cc03f5c89d ("cql3: support literals and bind variables in
selectors") we extended the selector function call grammar to allow
literals as function arguments.

Here, we remove the special case for vector similarity functions as
the general case in function calls covers all the possibilities the
special case does.

As a side effect, the vector similarity function names are no longer
reserved.

Note: the grammar change fixes an inconsistency with how the vector
similarity functions were evaluated: typically, when a USE statement
is in effect, an unqualified function is first matched against functions
in the keyspace, and only if there is no match is the system keyspace
checked. But with the previous implementation vector similarity functions
ignored the USE keyspace and always matched only the system keyspace.

This small inconsistency doesn't matter in practice because user defined
functions are still experimental, and no one would name a UDF to conflict
with a system function, but it is still good to fix it.

Closes scylladb/scylladb#28481
2026-02-17 12:40:21 +01:00
Ernest Zaslavsky
30699ed84b api: report restore params
report restore params once the API's call for restore is invoked

Closes scylladb/scylladb#28431
2026-02-17 14:27:21 +03:00
Andrei Chekun
767789304e test.py: improve C++ fail summary in pytest
Currently, if the test fail, pytest will output only some basic information
about the fail. With this change, it will output the last 300 lines of the
boost/seastar test output.
Also add capturing the output of the failed tests to JUnit report, so it
will be present in the report on Jenkins.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-449

Closes scylladb/scylladb#28535
2026-02-17 14:25:28 +03:00
Pavel Emelyanov
6d4af84846 Merge 'test: increase open file limit for sstable tests' from Avi Kivity
In ebda2fd4db ("test: cql_test_env: increase file descriptor limit"),
we raised the open file limit for cql_test_env. Here, we raise it for sstables::test_env
as well, to fix a couple of twcs resharding tests failing outside dbuild. These tests
open 256 sstables, and with 2 files/sstable + resharding work it is understandable
that they overflow the 1024 limit.

No backport: this is a quality of life improvement for developers running outside dbuild, but they can use dbuild for branches.

Closes scylladb/scylladb#28646

* github.com:scylladb/scylladb:
  test: sstables::test_env: adjust file open limit
  test: extract cql_test_env's adjust_rlimit() for reuse
2026-02-17 14:19:43 +03:00
Avi Kivity
41925083dc test: minio: tune sync setting
Disable O_DSYNC in minio to avoid unnecessary slowdown in S3
tests.

Closes scylladb/scylladb#28579
2026-02-17 14:19:27 +03:00
Avi Kivity
f03491b589 Update seastar submodule
* seastar f55dc7eb...d2953d2a (13):
  > io_tester: Revive IO bandwidth configuration
  > Merge 'io_tester: add vectorized I/O support' from Travis Downs
    doc: add vectorized I/O options to io-tester.md
    io_tester: add vectorized I/O support
  > Merge 'Remove global scheduling group ID bitmap' from Pavel Emelyanov
    reactor: Drop sched group IDs bitmap
    reactor: Allocate scheduling group on shard-0 first
    reactor: Detach init_scheduling_group_specific_data()
    reactor: Coroutinize create_scheduling_group()
  > set_iterator: increase compatibility with C++ ranges
  > test: fix race condition in test_connection_statistics
  > Add Claude Code project instructions
  > reactor: Unfriend pollable_fd via pollable_fd_state::make()
  > Merge 'rpc_tester: introduce rpc_streaming job based on streaming API' from Jakub Czyszczoń
    apps: rpc_tester: Add STREAM_UNIDIRECTIONAL job We introduce an unidirectional streaming to the rpc_streaming job.
    apps: rpc_tester: Add STREAM_BIDIRECTIONAL job This commit extends the rpc_tester with rpc_streaming job that uses rpc::sink<> and rpc::source<> to stream data between the client and the server.
  > treewide: remove remnants of SEASTAR_MODULE
  > test: Tune abort-accept test to use more readable async()
  > build: support sccache as a compiler cache (#3205)
  > posix-stack: Reuse parent class _reuseport from child
  > Merge 'reactor_backend: Fix another busy spin bug in the epoll backend' from Stephan Dollberg
    tests: Add unit test for epoll busy spin bug
    reactor_backend: Fix another busy spin bug in epoll

Closes scylladb/scylladb#28513
2026-02-17 13:13:22 +02:00
Jakub Smolar
189b056605 scylla_gdb: use run_ctx to nahdle Scylla exe and remove pexpect
Previous implementation of Scylla lifecycle brought flakiness to the test.
This change leaves lifecycle management up to PythonTest.run_ctx,
which implements more stability logic for setup/teardown.

Replace pexpect-driven GDB interaction with GDB batch mode:
- Avoids DeprecationWarning: "This process is multi-threaded, use of forkpty()
may lead to deadlocks in the child.", which ultimately caused CI deadlocks.
- Removes timeout-driven flakiness on slow systems - no interactive waits/timeouts.
- Produces cleaner, more direct assertions around command execution and output.
- Trade-off: batch mode adds ~10s per command per test,
but with --dist=worksteal this is ~10% overall runtime increase across the suite.

Closes scylladb/scylladb#28484
2026-02-17 11:36:20 +01:00
Łukasz Paszkowski
f45465b9f6 test_out_of_space_prevention.py: Lower the critical disk utilization threshold
After PR https://github.com/scylladb/scylladb/pull/28396 reduced
the test volumes to 20MiB to speed up test_out_of_space_prevention.py,
keeping the original 0.8 critical disk utilization threshold can make
the tests flaky: transient disk usage (e.g. commitlog segment churn)
can push the node into ENOSPC during the run.

These tests do not write much data, so reduce the critical disk
utilization threshold to 0.5. With 20MiB volumes this leaves ~10MiB
of headroom for temporary growth during the test.

Fixes: https://github.com/scylladb/scylladb/issues/28463

Closes scylladb/scylladb#28593
2026-02-16 15:10:18 +02:00
Andrei Chekun
e26cf0b2d6 test/cluster: fix two flaky tests
test_maintenance_socket with new way of running is flaky. Looks like the
driver tries to reconnect with an old maintenance socket from previous
driver and fails. This PR adds white list for connection that stabilize
the test
test_no_removed_node_event_on_ip_change was flaky on CI, while the issue
never reproduced locally. The assumption that under load we have race
condition and trying to check the logs before message is arrived. Small
for loop to retry added to avoid such situation.

Closes scylladb/scylladb#28635
2026-02-16 14:50:54 +02:00
Patryk Jędrzejczak
0693091aff test: test_restart_leaving_replica_during_cleanup: reconnect driver after restart
The test can currently fail like this:
```
>           await cql.run_async(f"ALTER TABLE {ks}.test WITH tablets = {{'min_tablet_count': 1}}")
E           cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.158.27.9:9042 datacenter1>: <Error from server: code=0000 [Server error] message="Failed to apply group 0 change due to concurrent modification">})
```
The following happens:
- node A is restarted and becomes the group0 leader,
- the driver sends the ALTER TABLE request to node B,
- the request hits group 0 concurrent modification error 10 times and fails
  because node A performs tablet migrations at the the same time.

What is unexpected is that even though the driver session uses the default
retry policy, the driver doesn't retry the request on node A. The request
is guaranteed to succeed on node A because it's the only node adding group0
entries.

The driver doesn't retry the request on node A because of a missing
`wait_for_cql_and_get_hosts` call. We add it in this commit. We also reconnect
the driver just in case to prevent hitting scylladb/python-driver#295.

Moreover, we can revert the workaround from
4c9efc08d8, as the fix from this commit also
prevents DROP KEYSPACE failures.

The commit has been tested in byo with `_concurrent_ddl_retries{0}` to
verify that node A really can't hit group 0 concurrent modification error
and always receives the ALTER TABLE request from the driver. All 300 runs in
each build mode passed.

Fixes #25938

Closes scylladb/scylladb#28632
2026-02-16 12:56:18 +01:00
Marcin Maliszkiewicz
6a4aef28ae Merge 'test: explicitly set compression algorithm in test_autoretrain_dict' from Andrzej Jackowski
When `test_autoretrain_dict` was originally written, the default
`sstable_compression_user_table_options` was `LZ4Compressor`. The
test assumed (correctly) that initially the compression doesn't use
a trained dictionary, and later in the test scenario, it changed
the algorithm to one with a dictionary.

However, the default `sstable_compression_user_table_options` is now
`LZ4WithDictsCompressor`, so the old assumption is no longer correct.
As a result, the assertion that data is initially not compressed well
may or may not fail depending on dictionary training timing.

To fix this, this commit explicitly sets `ZstdCompressor`
as the initial `sstable_compression_user_table_options`, ensuring that
the assumption that initial compression is without a dictionary
is always met.

Note: `ZstdCompressor` differs from the former default `LZ4Compressor`.
However, it's a better choice — the test aims to show the benefit of
using a dictionary, not the benefit of Zstd over LZ4 (and the test uses
ZstdWithDictsCompressor as the algorithm with the dictionary).

Fixes: https://github.com/scylladb/scylladb/issues/28204
Backport: 2025.4, as test already failed there (and also backport to 2026.1 to make everything consistent).

Closes scylladb/scylladb#28625

* github.com:scylladb/scylladb:
  test: explicitly set compression algorithm in test_autoretrain_dict
  test: remove unneeded semicolons from python test
2026-02-16 11:38:24 +01:00
Ernest Zaslavsky
034c6fbd87 s3_client: limit multipart upload concurrency
Prevent launching hundreds or thousands of fibers during multipart uploads
by capping concurrent part submissions to 16.

Closes scylladb/scylladb#28554
2026-02-16 13:32:58 +03:00
Botond Dénes
9f57d6285b Merge 'test: improve error reporting and retries in get_scylla_2025_1_executable' from Marcin Maliszkiewicz
Harden get_scylla_2025_1_executable() by improving error reporting when subprocesses fail,
increasing curl's retry count for more resilient downloads, and enabling --retry-all-errors to retry on all failures.

Fixes https://github.com/scylladb/scylladb/issues/27745
Backport: no, it's not a bug fix

Closes scylladb/scylladb#28628

* github.com:scylladb/scylladb:
  test: pylib: retry on all errors in get_scylla_2025_1_executable curl's call
  test: pylib: increase curl's number of retries when downloading scylla
  test: pylib: improve error reporting in get_scylla_2025_1_executable
2026-02-16 10:09:17 +02:00
Petr Gusev
c785d242a7 tests: extract get_topology_version helper
This is a refactoring commit. We need to load the cluster version
for a host in several places, so extract a helper for this.
2026-02-16 08:57:42 +01:00
Petr Gusev
ffe3262e8d global tablets barrier: require all nodes to ack barrier_and_drain
Previously, global_tablet_token_metadata_barrier() could proceed with
fencing even if some nodes did not acknowledge the barrier_and_drain.
This could cause problems:
* In scylladb/scylladb#26864, replica locks did not provide mutual
exclusion, because “fenced out” requests from old topology versions
could run in parallel with requests using newer versions.
* In scylladb/scylladb#26375, the barrier could succeed even though we
did not wait for closed sessions to become unused. This could leave
aborted repair or streaming tasks running concurrently after a tablet
transition was aborted, and thus running concurrently with the next
transition.

In this commit we add a parameter drain_all_nodes: bool to
the global_token_metadata_barrier function. If this parameter is set,
the barrier waits for all nodes to acknowledge the barrier_and_drain
round of RPCs. If any of the nodes are not accessible or throw an error,
such errors are rethrown to the caller. We set this parameter only in
global_tablet_token_metadata_barrier since for topology migrations
the old behavior should be preserved. In case of errors, the tablet
migration is blocked until the problem goes away by itself or the
problematic node is added to the ignore_nodes list.

The test_fenced_out_on_tablet_migration_while_handling_paxos_verb is
removed: with tablets, we now drain all nodes, so after a successful
barrier_and_drain round there can be no coordinators with an old
topology version. The fence_token check after executing a request on
a replica is therefore unnecessary for tablets, but still required for
vnodes, where topology changes do not wait for all nodes.
Topology fencing is covered by test_fence_lwt_during_bootstrap.

Fixes scylladb/scylladb#26864
Fixes scylladb/scylladb#26375
2026-02-16 08:57:42 +01:00
Petr Gusev
06f88b43e5 topology_coordinator: pass raft_topology_cmd by value
It's just a single enum. Passing by reference risks
use-after-free if a temporary command is created on
the stack in a non-coroutine function.
2026-02-16 08:57:42 +01:00
Petr Gusev
df73f723a6 storage_proxy: hold erms in replica handlers
Add explicit erm-holding variables in all replica-side RPC handlers.
This is required to ensure that tablet migration waits for in-flight
replica requests even if a non-replica coordinator has been fenced out.

Holding erms on the replica side may increase the global-barrier wait
time, since the barrier must drain these requests. We believe this
is acceptable because:
* We already hold erms during replica-side request execution, but in
an ad-hoc, non-systemic way in lower layers of storage_proxy
(e.g. in sp::mutate_locally and do_query_tablets).
* Replica requests are bounded by replica-side timeouts, so the
global-barrier wait time cannot exceed the maximum of these timeouts.

For Paxos verbs, we use token_metadata_guard, which wraps the ERM and
automatically refreshes it when tablet migration does not affect the
current token; see the token_metadata_guard comments for details.
We use this guard only for Paxos verbs because regular reads and writes
already hold raw erms in storage_proxy and on the coordinators.

The erms must be held in all RPC handlers that support fencing — that
is, those with a fencing_token parameter in storage_proxy.idl.
Counter updates already hold erms in
mutate_counter_on_leader_and_replicate.

Fix test_tablets2::test_timed_out_reader_after_cleanup: the tablets
barrier now waits for all nodes. As a result, the replica read
is expected to finish, rather than fail due to the tablet having
moved as it did previously. The test is renamed to
test_tablets_barrier_waits_for_replica_erms to better reflect its
purpose.

Refs scylladb/scylladb#26864
2026-02-16 08:57:42 +01:00
Petr Gusev
e39f4b399c token_metadata: improve stale versions diagnostics
Before waiting on stale_versions_in_use(), we log the stale versions
the barrier_and_drain handler will wait for, along with the number of
token_metadata references representing each version.
To achieve this, we store a pointer to token_metadata in
version_tracker, traverse the _trackers list, and output all items
with a version smaller than the latest. Since token_metadata
contains the version_tracker instance, it is guaranteed to remain
alive during traversal. To count references, token_metadata now
inherits from enable_lw_shared_from_this.

This helps diagnose tablet migration stalls and allows more
deterministic tests: when a barrier is expected to block, we can
verify that the log contains the expected stale versions rather
than checking that the barrier_and_drain is blocked on
stale_versions_in_use() for a fixed amount of time.
2026-02-16 08:57:42 +01:00
Andrei Chekun
8c5c1096c2 test: ensure that that table used it cqlpy/test_tools have at least 3 pk
One of the tests check that amount of the PK should be more than 2, but
the method that creates it can return table with less keys. This leads
to flakiness and to avoid it, this PR ensures that table will have at
least 3 PK

Closes scylladb/scylladb#28636
2026-02-16 09:50:58 +02:00
Anna Mikhlin
33cf97d688 .github/workflows: ignore quoted comments for trigger CI
prevent CI from being triggered when trigger-ci command appears inside
quoted (>) comment text

Fixes: https://scylladb.atlassian.net/browse/RELENG-271

Closes scylladb/scylladb#28604
2026-02-16 09:33:16 +02:00
Andrei Chekun
e144d5b0bb test.py: fix JUnit double test case records
Move the hook for overwriting the XML reporter to be the first, to
avoid double records.

Closes scylladb/scylladb#28627
2026-02-15 19:02:24 +02:00
Alex
75e25493c1 db:view: refactor on_update_view to use coroutine dispatcher
on_update_view() currently runs its serialized logic inline via with_semaphore()
  from a detached callback path, while create/drop already use dedicated async
  dispatchers.

  Refactor update handling to follow the same pattern:

  - add dispatch_update_view(sstring ks_name, sstring view_name)
  - move update logic into that coroutine
  - acquire the existing view-builder lock via get_or_adopt_view_builder_lock()
  - keep existing behavior for missing base/view state
  - keep background invocation semantics from on_update_view()

  This aligns update/create/drop flow and keeps async lifecycle handling and a first step to fix shutdown issue.
2026-02-15 18:50:32 +02:00
Avi Kivity
a365e2deaa test: sstables::test_env: adjust file open limit
The twcs compaction tests open more than 1024 files (not
so good), and will fail in a user session with the default
soft limit (1024).

Attempt to raise the limit so the tests pass. On a modern
systemd installation the hard limit is >500,000, so this
will work.

There's no problem in dbuild since it raises the file limit
globally.
2026-02-15 14:27:37 +02:00
Avi Kivity
bab3afab88 test: extract cql_test_env's adjust_rlimit() for reuse
The sstable-oriented sstable::test_env would also like to use
it, so extract it into a neutral place.
2026-02-15 14:26:46 +02:00
Jenkins Promoter
69249671a7 Update pgo profiles - aarch64 2026-02-15 05:22:17 +02:00
Jenkins Promoter
27aaafb8aa Update pgo profiles - x86_64 2026-02-15 04:26:36 +02:00
Piotr Dulikowski
9c1e310b0d Merge 'vector_search: Fix flaky vector_store_client_https_rewrite_ca_cert' from Karol Nowacki
Most likely, the root cause of the flaky test was that the TLS handshake hung for an extended period (60s). This caused
the test case to fail because the ANN request duration exceeded the test case timeout.

The PR introduces two changes:

* Mitigation of the hanging TLS handshake: This issue likely occurred because the test performed certificate rewrites
simultaneously with ANN requests that utilize those certificates.
* Production code fix: This addresses a bug where the TLS handshake itself was not covered by the connection timeout.
Since tls::connect does not perform the handshake immediately, the handshake only occurs during the first write
operation, potentially bypassing connect timeout.

Fixes: #28012

Backport to 2026.01 and 2025.04 is needed, as these branches are also affected and may experience CI flakiness due to this test.

Closes scylladb/scylladb#28617

* github.com:scylladb/scylladb:
  vector_search: Fix missing timeout on TLS handshake
  vector_search: test: Fix flaky cert rewrite test
2026-02-13 19:03:50 +01:00
Patryk Jędrzejczak
aebc108b1b test: run test_different_group0_ids in all modes
CI currently fails in release and debug modes if the PR only changes
a test run only in dev mode. There is no reason to wait for the CI fix,
as there is no reason to run this test only in dev mode in the first
place. The test is very fast.
2026-02-13 13:30:29 +01:00
Patryk Jędrzejczak
59746ea035 test: make test_different_group0_ids work with the Raft-based topology
The test was marked with xfail in #28383, as it needed to be updated to
work with the Raft-based topology. We are doing that in this patch.

With the Raft-based topology, there is no reason to check that nodes with
different group0 IDs cannot merge their topology/token_metadata. That is
clearly impossible, as doing any topology change requires being in the
same group0. So, the original regression test doesn't make sense.

We can still test that nodes with different group0 IDs cannot gossip with
each other, so we keep the test. It's very fast anyway.
2026-02-13 13:30:28 +01:00
Marcin Maliszkiewicz
1b0a68d1de test: pylib: retry on all errors in get_scylla_2025_1_executable curl's call
It's difficult to say if our download backend would always return
transient error correctly so that the curl could retry. Instead it's
more robust to always retry on error.
2026-02-12 16:18:52 +01:00
Marcin Maliszkiewicz
8ca834d4a4 test: pylib: increase curl's number of retries when downloading scylla
By default curl does exponential backoff, and we want to keep that
but there is time cap of 10 minutes, so with 40 retries we'd wait
long time, instead we set the cap to 60 seconds.

Total waiting time (excluding receiving request time):
before - 17m
after - 35m
2026-02-12 16:18:52 +01:00
Marcin Maliszkiewicz
70366168aa test: pylib: improve error reporting in get_scylla_2025_1_executable
Curl or other tools this function calls will now log error
in the place they fail instead of doing plain assert.
2026-02-12 16:18:52 +01:00
Andrzej Jackowski
9ffa62a986 test: explicitly set compression algorithm in test_autoretrain_dict
When `test_autoretrain_dict` was originally written, the default
`sstable_compression_user_table_options` was `LZ4Compressor`. The
test assumed (correctly) that initially the compression doesn't use
a trained dictionary, and later in the test scenario, it changed
the algorithm to one with a dictionary.

However, the default `sstable_compression_user_table_options` is now
`LZ4WithDictsCompressor`, so the old assumption is no longer correct.
As a result, the assertion that data is initially not compressed well
may or may not fail depending on dictionary training timing.

To fix this, this commit explicitly sets `ZstdCompressor`
as the initial `sstable_compression_user_table_options`, ensuring that
the assumption that initial compression is without a dictionary
is always met.

Note: `ZstdCompressor` differs from the former default `LZ4Compressor`.
However, it's a better choice — the test aims to show the benefit of
using a dictionary, not the benefit of Zstd over LZ4 (and the test uses
ZstdWithDictsCompressor as the algorithm with the dictionary).

Fixes: scylladb/scylladb#28204
2026-02-12 14:58:39 +01:00
Andrzej Jackowski
e63cfc38b3 test: remove unneeded semicolons from python test 2026-02-12 14:49:17 +01:00
Patryk Jędrzejczak
8e9c7397c5 raft topology: prevent accessing nullptr returned by topology::find
It's better to raise an internal error than cause a segmentation fault on
possibly multiple nodes.
2026-02-12 13:10:04 +01:00
Patryk Jędrzejczak
e21ecf69de raft topology: make some assertions non-crashing
Some assertions in the Raft-based topology are likely to cause crashes of
multiple nodes due to the consistent nature of the Raft-based code. If the
failing assertion is executed in the code run by each follower (e.g., the code
reloading the in-memory topology state machine), then all nodes can crash. If
the failing assertion is executed only by the leader (e.g., the topology
coordinator fiber), then multiple consecutive group0 leaders will chain-crash
until there is no group0 majority.

Crashing multiple nodes is much more severe than necessary. It's enough to
prevent the topology state machine from making more progress. This will
naturally happen after throwing a runtime error. The problematic fiber will be
killed or will keep failing in a loop. Note that it should be safe to block
the topology state machine, but not the whole group0, as the topology state
machine is mostly isolated from the rest of group0.

We replace some occurrences of `on_fatal_internal_error` and `SCYLLA_ASSERT`
with `on_internal_error`. These are not all occurrences, as some fatal
assertions make sense, for example, in the bootstrap procedure.
2026-02-12 13:10:03 +01:00
Ferenc Szili
d7cfaf3f84 test, simulator: compute load based on tablet size instead of count
This patch changes the load balancing simulator so that it computes
table load based on tablet sizes instead of tablet count.

best_shard_overcommit measured minimal allowed overcommit in cases
where the number of tablets can not be evenly distributed across
all the available shards. This is still the case, but instead of
computing it as an integer div_ceil() of the average shard load,
it is now computed by allocating the tablet sizes using the
largest-tablet-first method. From these, we can get the lowest
overcommit for the given set of nodes, shards and tablet sizes.
2026-02-12 12:54:55 +01:00
Ferenc Szili
216443c050 test, simulator: generate tablet sizes and update load_stats
This change adds a random tablet size generator. The tablet sizes are
created in load_stats.

Further changes to the load balance simulator:

- apply_plan() updates the load_stats after a migration plan is issued by the
load balancer,

- adds the option to set a command line option which controls the tablet size
deviation factor.
2026-02-12 12:54:55 +01:00
Ferenc Szili
e31870a02d test, simulator: postpone creation of load_stats_ptr
With size based load balancing, we will have to move the tablet size in
load_stats after each internode migration issued by balance_tablets().
This will be done in a subsequent commit in apply_plan() which is
called from rebalance_tablets().

Currently, rebalance_tablets() is passed a load_stats_ptr which is
defined as:

using load_stats_ptr = lw_shared_ptr<const load_stats>;

Because this is a pointer to const, apply_plan() can't modify it.

So, we pass a reference to load_stats to rebalance_tablets() and create
a load_stats_ptr from it for each call to balance_tablets().
2026-02-12 12:54:55 +01:00
Aleksandra Martyniuk
f955a90309 test: fix test_remove_node_violating_rf_rack_with_rack_list
test_remove_node_violating_rf_rack_with_rack_list creates a cluster
with four nodes. One of the nodes is excluded, then another one is
stopped, excluded, and removed. If the two stopped nodes were both
voters, the majority is lost and the cluster loses its raft leader.
As a result, the node cannot be removed and the operation times out.

Add the 5th node to the cluster. This way the majority is always up.

Fixes: https://github.com/scylladb/scylladb/issues/28596.

Closes scylladb/scylladb#28610
2026-02-12 12:58:48 +02:00
Ferenc Szili
4ca40929ef test: add read barrier to test_balance_empty_tablets
The test creates a single node cluster, then creates 3 tables which
remain empty. Then it adds another node with half the disk capacity of
the first one, and then it waits for the balancer to migrate tablets to
the newly added node by calling the quiesce topology API. The number of
tablets on the smaller node should be exactly half the number of tablets
on the larger node.

After waiting for quiesce topology, we could have a situation where we
query the number of tablets from the node which still hasn't processed
the last tablet migrations and updated system.tablets.

This patch adds a read barrier so that both nodes see the same tablets
metadata before we query the number of tablets.

Fixes: SCYLLADB-603

Closes scylladb/scylladb#28598
2026-02-12 11:16:34 +02:00
Karol Nowacki
079fe17e8b vector_search: Fix missing timeout on TLS handshake
Currently the TLS handshake in the vector search client does not have a timeout.
This is because tls::connect does not perform handshake itself; the handshake
is deferred until the first read/write operation is performed. This can lead to long
hangs on ANN requests.

This commit calls tls::check_session_is_resumed() after tls::connect
to force the handshake to happen immediately and to run under with_timeout.
2026-02-12 10:08:37 +01:00
Karol Nowacki
aef5ff7491 vector_search: test: Fix flaky cert rewrite test
The test is flaky most likely because when TLS certificate rewrite
happens simultaneously with an ANN request, the handshake can hang for a
long time (~60s). This leads to a timeout in the test case.

This change introduces a checkpoint in the test so that it will
wait for the certificate rewrite to happen before sending an ANN request,
which should prevent the handshake from hanging and make the test more reliable.

Fixes: #28012
2026-02-12 09:58:54 +01:00
Piotr Dulikowski
38c4a14a5b Merge 'test: cluster: Fix test_sync_point' from Dawid Mędrek
The test `test_sync_point` had a few shortcomings that made it flaky
or simply wrong:

1. We were verifying that hints were written by checking the size of
   in-flight hints. However, that could potentially lead to problems
   in rare situations.

   For instance, if all of the hints failed to be written to disk, the
   size of in-flight hints would drop to zero, but creating a sync point
   would correspond to the empty state.

   In such a situation, we should fail immediately and indicate what
   the cause was.

2. A sync point corresponds to the hints that have already been written
   to disk. The number of those is tracked by the metric `written`.
   It's a much more reliable way to make sure that hints have been
   written to the commitlog. That ensures that the sync point we'll
   create will really correspond to those hints.

3. The auxiliary function `wait_for` used in the test works like this:
   it executes the passed callback and looks at the result. If it's
   `None`, it retries it. Otherwise, the callback is deemed to have
   finished its execution and no further retries will be attempted.

   Before this commit, we simply returned a bool, and so the code was
   wrong. We improve it.

---

Note that this fixes scylladb/scylladb#28203, which was a manifestation
of scylladb/scylladb#25879. We created a sync point that corresponded
to the empty state, and so it immediately resolved, even when node 3
was still dead.

As a bonus, we rewrite the auxiliary code responsible for fetching
metrics and manipulating sync points. Now it's asynchronous and
uses the existing standard mechanisms available to developers.

Furthermore, we reduce the time needed for executing
`test_sync_point` by 27 seconds.

---

The total difference in time needed to execute the whole test file
(on my local machine, in dev mode):

Before:

    CPU utilization: 0.9%

    real    2m7.811s
    user    0m25.446s
    sys     0m16.733s

After:

    CPU utilization: 1.1%

    real    1m40.288s
    user    0m25.218s
    sys     0m16.566s

---

Refs scylladb/scylladb#25879
Fixes scylladb/scylladb#28203

Backport: This improves the stability of our CI, so let's
          backport it to all supported versions.

Closes scylladb/scylladb#28602

* github.com:scylladb/scylladb:
  test: cluster: Reduce wait time in test_sync_point
  test: cluster: Fix test_sync_point
  test: cluster: Await sync points asynchronously
  test: cluster: Create sync points asynchronously
  test: cluster: Fetch hint metrics asynchronously
2026-02-12 09:34:09 +01:00
Dawid Pawlik
4e32502bb3 test/vector_search: add reproducer for rescoring with zero vectors
Add reproducer for the SCYLLADB-456 issue following exception
on ANN vector queries with rescoring with similarity cosine.
2026-02-11 13:41:09 +01:00
Dawid Pawlik
af0889d194 vector_search: return NaN for similarity_cosine with all-zero vectors
The ANN vector queries with all-zero vectors are allowed even on vector
indexes with similarity function set to cosine.
When enabling the rescoring option, those queries would fail as the rescoring
calls `similarity_cosine` function underneath, causing an `InvalidRequest` exception
as all-zero vectors were not allowed matching Cassandra's behaviour.

To eliminate the discrepancy we want the all-zero vector `similarity_cosine` calls to pass,
but return the NaN as the cosine similarity for zero vectors is mathematically incorrect.
We decided not to use arbitrary values contrary to USearch, for which the distance
(not to be confused with similarity) is defined as cos(0, 0) = 0, cos(0, x) = 1 while
supporting the range of values [0, 2].
If we wanted to convert that to similarity, that would mean sim_cos(0, x) = 0.5,
which does not support mathematical reasoning why that would be more similar than
for example vectors marking obtuse angles.
It's safe to assume that all-zero vectors for cosine similarity shouldn't make any impact,
therefore we return NaN and eliminate them from best results.

Adjusted the tests accordingly to check both proper Cassandra and Scylla's behaviour.

Fixes: SCYLLADB-456
2026-02-11 12:31:47 +01:00
Pavel Emelyanov
2a3a56850c test: Fix the condition for streaming directions validation
Commit ea8a661119 tried to reduce the dataset for restoration tests.
While doing it effectively disabled part of itself -- the checks for
streaming directions were never ran after this change. The thing is that
this check only runs if restored tablet count matches some hardcoded one
of 512. This was the real dataset size of the test before the
aforementioned commit, but after it it had changed to over values, and
the comparison with 512 became always False.

Fix it with a local variable to prevent such mistakes in the future.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-11 12:55:27 +03:00
Pavel Emelyanov
f187dceb1a test: Split test_backup.py::check_data_is_back() into two
This method does two things -- checks that the data is indeed back, and
validates streaming directions. The latter is not quite about "data is
back", so better to have it as explicit dedicated method.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-11 12:54:20 +03:00
Dawid Mędrek
f83f911bae test: cluster: Reduce wait time in test_sync_point
If everything is OK, the sync point will not resolve with node 3 dead.
As a result, the waiting will use all of the time we allocate for it,
i.e. 30 seconds. That's a lot of time.

There's no easy way to verify that the sync point will NOT resolve, but
let's at least reduce the waiting to 3 seconds. If there's a bug, it
should be enough to trigger it at some point, while reducing the average
time needed for CI.
2026-02-10 17:05:02 +01:00
Dawid Mędrek
a256ba7de0 test: cluster: Fix test_sync_point
The test had a few shortcomings that made it flaky or simply wrong:

1. We were verifying that hints were written by checking the size of
   in-flight hints. However, that could potentially lead to problems
   in rare situations.

   For instance, if all of the hints failed to be written to disk, the
   size of in-flight hints would drop to zero, but creating a sync point
   would correspond to the empty state.

   In such a situation, we should fail immediately and indicate what
   the cause was.

2. A sync point corresponds to the hints that have already been written
   to disk. The number of those is tracked by the metric `written`.
   It's a much more reliable way to make sure that hints have been
   written to the commitlog. That ensures that the sync point we'll
   create will really correspond to those hints.

3. The auxiliary function `wait_for` used in the test works like this:
   it executes the passed callback and looks at the result. If it's
   `None`, it retries it. Otherwise, the callback is deemed to have
   finished its execution and no further retries will be attempted.

   Before this commit, we simply returned a bool, and so the code was
   wrong. We improve it.

Note that this fixes scylladb/scylladb#28203, which was a manifestation
of scylladb/scylladb#25879. We created a sync point that corresponded
to the empty state, and so it immediately resolved, even when node 3
was still dead.

Refs scylladb/scylladb#25879
Fixes scylladb/scylladb#28203
2026-02-10 17:05:02 +01:00
Dawid Mędrek
c5239edf2a test: cluster: Await sync points asynchronously
There's a dedicated HTTP API for communicating with the cluster, so
let's use it instead of yet another custom solution.
2026-02-10 17:05:02 +01:00
Dawid Mędrek
ac4af5f461 test: cluster: Create sync points asynchronously
There's a dedicated HTTP API for communicating with the nodes, so let's
use it instead of yet another custom solution.
2026-02-10 17:05:01 +01:00
Dawid Mędrek
628e74f157 test: cluster: Fetch hint metrics asynchronously
There's a dedicated API for fetching metrics now. Let's use it instead
of developing yet another solution that's also worse.
2026-02-10 17:04:59 +01:00
Pavel Emelyanov
875fd03882 test/object_store: Remove create_ks_and_cf() helper
Now all test cases use standard facilities to create data they test

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-10 15:59:05 +03:00
Pavel Emelyanov
94176f7477 test/object_store: Replace create_ks_and_cf() usage with standard methods
To create a keyspace theres new_test_keyspace helper
Table is created with a single cql.run_async with explicit schema
Dataset is populated with a single parallel INSERT as well

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-10 15:58:05 +03:00
Pavel Emelyanov
6665cda23f test/object_store: Shift indentation right for test cases
This is preparational patch. Next will need to replace

  foo()
  bar()

with

  with something() as s:
      foo()
      bar()

Effectively -- only add the `with something()` line. Not to shift the
whole file right together with that future change, do it here.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-10 15:56:27 +03:00
Ernest Zaslavsky
960adbb439 s3_client: add more constrains to the calc_part_size
Enforce more checks on part size and object size as defined in
"Amazon S3 multipart upload limits", see
https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html and
https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingObjects.html
2026-02-10 13:15:07 +02:00
Ernest Zaslavsky
6280cb91ca s3_client: add tests for calc_part_size
Introduce tests that validate the corrected multipart part-size
calculation, including boundary conditions and error cases.
2026-02-10 13:13:26 +02:00
Ernest Zaslavsky
289e910cec s3_client: correct multipart part-size logic to respect 10k limit
The previous calculation could produce more than 10,000 parts for large
uploads because we mixed values in bytes and MiB when determining the
part size. This could result in selecting a part size that still
exceeded the AWS multipart upload limit. The updated logic now ensures
the number of parts never exceeds the allowed maximum.

This change also aligns the implementation with the code comment: we
prefer a 50 MiB part size because it provides the best performance, and
we use it whenever it fits within the 10,000-part limit. If it does not,
we increase the part size (in bytes, aligned to MiB) to stay within the
limit.
2026-02-10 13:13:25 +02:00
Ernest Zaslavsky
7142b1a08d exceptions: add helper to build a chain of error handlers
Generalize error handling by creating exception dispatcher which allows to write error handlers by sequentially applying handlers the same way one would write `catch ()` blocks
2026-02-09 08:48:41 +02:00
Ernest Zaslavsky
7fd62f042e http: extract error classification code
move http client related error classification code to a common location for future reuse
2026-02-09 08:48:41 +02:00
Ernest Zaslavsky
5beb7a2814 aws_error: extract retryable from aws_error
Move aws::retryable to common location to reuse it later in other http based clients
2026-02-09 08:48:41 +02:00
Pavel Emelyanov
44715a2d45 api: Remove map_to_key_value() helpers
All the callers had already been patched to stream their results
directly as json.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-09 08:52:50 +03:00
Pavel Emelyanov
dcbb5cb45b api: Streamify view_build_statuses handler
Similarly to previous patch, the handler can stream the map of build
statuses. Unlike previous patch, it doesn't need to fmt::format() key
and value, as these are strings already.

It could be a map_to_json<string, string> partial specialization, but
there's so far only one caller, so probably not worth it yet.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-09 08:52:50 +03:00
Pavel Emelyanov
73512a59ff api: Streamify few more storage_service/ handlers
Like get_token_endpoint one streams the map that it got from storage
service, the get_ownership and get_effective_ownership can do the same.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-09 08:52:49 +03:00
Pavel Emelyanov
a4bd9037b3 api: Add map_to_json() helper
The get_token_endpoint handler converts iterator of std::map into
generated maplist_mapper type. Next patch will do the same for more
handlers, so it's good to have a helper converter for it.

As a nice side effect, it's possible to avoid multiline lambda argument
to stream_range_as_array().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-09 08:52:49 +03:00
Pavel Emelyanov
63cafab56c api: Coroutinize view_build_statuses handler
Further patching will be nicer if this handler is a coroutine

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-09 08:52:49 +03:00
Pawel Pery
81d11a23ce Revert "Merge 'vector_search: add validator tests' from Pawel Pery"
This reverts commit bcd1758911, reversing
changes made to b2c2a99741.

There is a design decision to not introduce additional test
orchestration tool for scylladb.git (see comments for #27499). One
commit has already been reverted in 55c7bc7. Last CI runs made validator
test flaky, so it is a time to remove all remaining validator tests.

It needs a backport to 2026.1 to remove remaining validator tests from there.

Fixes: VECTOR-497

Closes scylladb/scylladb#28568
2026-02-08 16:29:58 +02:00
Avi Kivity
bb99bfe815 test: scylla_gdb: tighten check for Error output from gdb
When running a gdb command, we check that the string 'Error'
does not appear within the output. However, if the command output
includes the string 'Error' as part of its normal operation, this
generates a false positive. In fact the task_histogram can include
the string 'error::Error' from the Rust core::error module.

Allow for that and only match 'Error' that isn't 'error::Error'.

Fixes #28516.

Closes scylladb/scylladb#28574
2026-02-08 09:48:23 +02:00
Pavel Emelyanov
1b1aae8a0d test: Merge test_restore_primary_replica_different_... tests
The difference is very small:

@@ -1,18 +1,18 @@
 @pytest.mark.asyncio
 async def test_restore_primary_replica_different_...(manager: ManagerClient, object_storage):
     ''' Comment '''

-    topology = topo(rf = 2, nodes = 2, racks = 2, dcs = 1)
-    scope = "dc"
+    topology = topo(rf = 1, nodes = 2, racks = 1, dcs = 2)
+    scope = "all"
     ks = 'ks'
     cf = 'cf'

-    servers, host_ids = await create_cluster(topology, True, manager, logger, object_storage)
+    servers, host_ids = await create_cluster(topology, False, manager, logger, object_storage)

     await manager.disable_tablet_balancing()
     cql = manager.get_cql()
@@ -41,7 +41,6 @@ async def test_restore_primary_replica_d
         log = await manager.server_open_log(s.server_id)
         res = await log.grep(r'INFO.*sstables_loader - load_and_stream:.*target_node=([0-9a-z-]+),.*num_bytes_sent=([0-9]+)')
         streamed_to = set([ r[1].group(1) for r in res ])
-        logger.info(f'{s.ip_addr} {host_ids[s.server_id]} streamed to {streamed_to}')
+        logger.info(f'{s.ip_addr} {host_ids[s.server_id]} streamed to {streamed_to}, expected {servers}')
         assert len(streamed_to) == 2

The (removed in the above example) test description comments differ only
in their usage of "rack" and "dc" words.

Squashing them into one parametrized test makes perfect sense.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-06 14:14:43 +03:00
Pavel Emelyanov
70988f9b61 test: Merge test_restore_primary_replica_same_... tests
The difference is very tiny:

@@ -1,12 +1,12 @@
 @pytest.mark.asyncio
 async def test_restore_primary_replica_same_...(manager: ManagerClient, object_storage):
     ''' comment '''

-    topology = topo(rf = 4, nodes = 8, racks = 2, dcs = 1)
-    scope = "rack"
+    topology = topo(rf = 4, nodes = 8, racks = 2, dcs = 2)
+    scope = "dc"
     ks = 'ks'
     cf = 'cf'

@@ -42,7 +42,7 @@ async def test_restore_primary_replica_s
         for r in res:
             nodes_by_operation[r[1].group(1)].append(r[1].group(2))

-        scope_nodes = set([ str(host_ids[s.server_id]) for s in servers if s.rack == servers[i].rack ])
+        scope_nodes = set([ str(host_ids[s.server_id]) for s in servers if s.datacenter == servers[i].datacenter ])
         for op, nodes in nodes_by_operation.items():
             logger.info(f'Operation {op} streamed to nodes {nodes}')
             assert len(nodes) == 1, "Each streaming operation should stream to exactly one primary replica"

The (removed in the above example) test description comments differ only
in their usage of "rack" and "dc" words.

Squashing them into one parametrized test makes perfect sense.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-06 14:12:23 +03:00
Pavel Emelyanov
98b4092153 test: Don't specify expected_replicas in test_restore_primary_replica_different_dc_scope_all
If not specified, the call would use dcs * rf default, which match this
teat parameters perfectly

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-06 14:11:47 +03:00
Pavel Emelyanov
1ac1e90b16 test: Remove local r_servers variable from test_restore_primary_replica_different_dc_scope_all
It merely copies the `servers` one, no need in it

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-06 14:11:22 +03:00
Anna Stuchlik
dc8f7c9d62 doc: replace the OS Support page with a link to the new location
We've moved that page to another place; see https://github.com/scylladb/scylladb/issues/28561.
This commit replaces the page with the link to the new location
and adds a redirection.

Fixes https://github.com/scylladb/scylladb/issues/28561

Closes scylladb/scylladb#28562
2026-02-06 11:38:21 +02:00
Tomasz Grabiec
41930c0176 tablets, config: Reduce migration concurrency to 2
Tablet migration keeps sstable snapshot during streaming, which may
cause temporary increase in disk utilization if compaction is running
concurrently. SStables compacted away are kept on disk until streaming
is done with them. The more tablets we allow to migrate concurrently,
the higher disk space can rise. When the target tablet size is
configured correcly, every tablet should own about 1% of disk
space. So concurrency of 4 shouldn't put us at risk. But target tablet
size is not chosen dynamically yet, and it may not be aligned with
disk capacity.

Also, tablet sizes can temporary grow above the target, up to 2x
before the split starts, and some more because splits take a while to
complete.

The reduce impact from this, reduce concurrency of
migation. Concurrency of 2 should still be enough to saturate
resources on the leaving shard.

Also, reducing concurrency means that load balancing is more
responsive to preemption. There will be less bandwidth sharing, so
scheduled migrations complete faster. This is important for scale-out,
where we bootstrap a node and want to start migrations to that new
node as soon as possible.

Refs scylladb/siren#15317
2026-02-06 00:42:19 +01:00
Tomasz Grabiec
56e40e90c9 tablets: load_balancer: Always accept migration if the load is 0
Different transitions have different weights, and limits are
configurable.  We don't want a situation where a high-cost migration
is cut off by limits and the system can make no progress.

For example, repair uses weight 2 for read concurrency.  Migrating
co-located tablets scales the cost by the number of co-located
tablets.
2026-02-06 00:42:18 +01:00
Tomasz Grabiec
39492596c2 config, tablets: Make tablet migration concurrency configurable
We're about to reduce it. It's better to not have it hard-coded in
case we change our mings again.
2026-02-06 00:42:18 +01:00
Avi Kivity
7a3ce5f91e test: minio: disable web console
minio starts a web console on a random port. This was seen to interfere
with the nodetool tests when the web console port clashed with the mock
API port.

Fix by disabling the web console.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-496

Closes scylladb/scylladb#28492
2026-02-05 20:11:32 +02:00
Nikos Dragazis
5d1e6243af test/cluster: Remove short_tablet_stats_refresh_interval injection
The test `test_size_based_load_balancing.py::test_balance_empty_tablets`
waits for tablet load stats to be refreshed and uses the
`short_tablet_stats_refresh_interval` injection to speed up the refresh
interval.

This injection has no effect; it was replaced by the
`tablet_load_stats_refresh_interval_in_seconds` config option (patch: 1d6808aec4),
so the test currently waits for 60 seconds (default refresh interval).

Use the config option. This reduces the execution time to ~8 seconds.

Fixes SCYLLADB-556.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#28536
2026-02-05 20:11:32 +02:00
Pavel Emelyanov
10c278fff7 database: Remove _flush_sg member from replica::database
This field is only used to initialize the following _memtable_controller
one. It's simpler just to do the initialization with whatever value the
field itself is initialized and drop the field itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28539
2026-02-05 13:02:35 +02:00
Petr Hála
a04dbac369 open-coredump: Change to use new backtrace
* This is a breaking change, which removes compatibility with the old backtrace
    - See https://staging.backtrace.scylladb.com/api/docs#/default/search_by_build_id_search_build_id_post for the APIDoc
* Add timestamp field to log
* Tested locally

Closes scylladb/scylladb#28325
2026-02-05 11:50:47 +02:00
Marcin Maliszkiewicz
0753d9fae5 Merge 'test: remove xfail marker from a few passing tests' from Nadav Har'El
This patch fixes the few remaining cases of XPASS in test/cqlpy and test/alternator.
These are tests which, when written, reproduced a bug and therefore were marked "xfail", but some time later the bug was fixed and we either did not notice it was ever fixed, or just forgot to remove the xfail marker.

Removing the no-longer-needed xfail markers is good for test hygiene, but more importantly is needed to avoid regressions in those already-fixed areas (if a test is already marked xfail, it can start to fail in a new way and we wouldn't notice).

Backport not needed, xpass doesn't bother anyone.

Closes scylladb/scylladb#28441

* github.com:scylladb/scylladb:
  test/cqlpy: remove xfail from tests for fixed issue 7972
  test/cqlpy: remove xfail from tests for fixed issue 10358
  test/cqlpy: remove xfail from passing test testInvalidNonFrozenUDTRelation
  test/alternator: remove xfail from passing test_update_item_increases_metrics_for_new_item_size_only
2026-02-05 10:10:43 +01:00
Marcin Maliszkiewicz
6eca74b7bb Merge 'More Alternator tests for BatchWriteItem' from Nadav Har'El
The goal of this small pull request is to reproduce issue #28439, which found a bug in the Alternator Streams output when BatchWriteItem is called to write multiple items in the same partition, and always_use_lwt write isolation mode is used.

* The first patch reproduces this specific bug in Alternator Streams.
* The second patch adds missing (Fixes #28171) tests for BatchWriteItem in different write modes, and shows that BatchWriteItem itself works correctly - the bug is just in Alternator Streams' reporting of this write.

Closes scylladb/scylladb#28528

* github.com:scylladb/scylladb:
  test/alternator: add test for BatchWriteItem with different write isolations
  test/alternator: reproducer for Alternator Streams bug
2026-02-05 10:07:29 +01:00
Yaron Kaikov
b30ecb72d5 ci: fix PR number extraction for unlabeled events
When the workflow is triggered by removing the 'conflicts' label
(pull_request_target unlabeled event), github.event.issue.number is
not available. Use github.event.pull_request.number as fallback.

Fixes: https://scylladb.atlassian.net/browse/RELENG-245

Closes scylladb/scylladb#28543
2026-02-05 08:41:43 +02:00
Michał Hudobski
6b9fcc6ca3 auth: add CDC streams and timestamps to vector search permissions
It turns out that the cdc driver requires permissions to two additional system tables. This patch adds them to VECTOR_SEARCH_INDEXING and modifies the unit tests. The integration with vector store was tested manually, integration tests will be added in vector-store repository in a follow up PR.

Fixes: SCYLLADB-522

Closes scylladb/scylladb#28519
2026-02-04 09:10:08 +01:00
Nadav Har'El
47e827262f test/alternator: add test for BatchWriteItem with different write isolations
Alternator's various write operations have different code paths for the
different write isolation modes. Because most of the test suite runs in
only a single write mode (currently - only_rmw_uses_lwt), we already
introduced a test file test/alternator/test_write_isolation.py for
checking the different write operations in *all* four write isolation
modes.

But we missed testing one write operation - BatchWriteItem. This
operation isn't very "interesting" because it doesn't support *any*
read-modify-option option (it doesn't support UpdateExpression,
ConditionExpression or ReturnValues), but even without those, the
pure write code still has different code paths with and without LWT,
and should be tested. So we add the missing test here - and it passes.

In issue #28439 we discovered a bug that can be seen in Alternator
Streams in the case of BatchWriteItem with multiple writes to the
same partition and always_use_lwt mode. The fact that the test added
here passes shows that the bug is NOT in BatchWriteItem itself, which
works correctly in this case - but only in the Alternator Streams layer.

Fixes #28171

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-04 09:24:29 +02:00
Nadav Har'El
c63f43975f test/alternator: reproducer for Alternator Streams bug
This patch adds a reproducer for an Alternator Streams bug described in
issue #28439, where the stream returns the wrong events (and fewer of
them) in the following specific combination of the following circumstances:

1. A BatchWriteItem operation writing multiple items to the *same*
   partition.

2. The "always_use_lwt" write isolation mode is used. (the bug doesn't
   occur in other write isolation modes).

We didn't catch this bug earlier because the Alternator Streams test
we had for BatchWriteItem had multiple items in multiple partitions,
and we missed the multiple-items-in-one-partition case. Moreover,
today we run all the tests in only_rmw_uses_lwt mode (in the past,
we did use always_use_lwt, but changed recently in commit e7257b1393
following commit 76a766c that changed test.py).

As issue #28439 explains, the underlying cause of the bug is that the
always_use_lwt causes the multiple items to be written with the same
timestamp, which confused the Alternator Streams code reading the CDC
log. The bug is not in BatchWriteItem itself, or in ScyllaDB CDC, but
just in the Alternator Streams layer.

The test in this patch is parameterized to run on each of the four
write isolation modes, and currently fails (and so marked xfail) just
for the one mode 'always_use_lwt'. The test is scylla_only, as its
purpose is to checks the different write isolation mode - which don't
exist in AWS DynamoDB.

Refs #28439

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-04 09:17:48 +02:00
Radosław Cybulski
03ff091bee alternator: improve events output when test failed
Improve events printing, when test in test_streams.py failed.
New code will print both expected and received events (keys, previous
image, new image and type).
New code will explicitly mark, at which output event comparison failed.

Fixes #28455

Closes scylladb/scylladb#28476
2026-02-03 21:55:07 +02:00
Anna Stuchlik
a427ad3bf9 doc: remove the link to the Open Source blog post
Fixes https://github.com/scylladb/scylladb/issues/28486

Closes scylladb/scylladb#28518
2026-02-03 14:15:16 +01:00
Botond Dénes
3adf8b58c4 Merge 'test: pylib: scylla_cluster: set shutdown_announce_in_ms to 0' from Patryk Jędrzejczak
The usual Scylla shutdown in a cluster test takes ~2.1s. 2s come from
```
co_await sleep(std::chrono::milliseconds(_gcfg.shutdown_announce_ms));
```
as the default value of `shutdown_announce_in_ms` is 2000. This sleep
makes every `server_stop_gracefully` call 2s slower. There are ~300 such
calls in cluster tests (note that some come from `rolling_restart`). So,
it looks like this sleep makes cluster tests 300 * 2s = 10min slower.
Indeed, `./test.py --mode=dev cluster` takes 61min instead of 71min
on the potwor machine (the one in the Warsaw office) without it.

We set `shutdown_announce_in_ms` to 0 for all cluster tests to make them
faster.

The sleep is completely unnecessary in tests. Removing it could introduce
flakiness, but if that's the case, then the test for which it happens is
incorrect in the first place. Tests shouldn't assume that all nodes
receive and handle the shutdown message in 2s. They should use functions
like `server_not_sees_other_server` instead, which are faster and more
reliable.

Improvement of the tests running time, so no backport. The fix of
`test_tablets_parallel_decommission` may have to be backported to
2026.1, but it can be done manually.

Closes scylladb/scylladb#28464

* github.com:scylladb/scylladb:
  test: pylib: scylla_cluster: set shutdown_announce_in_ms to 0
  test: test_tablets_parallel_decommission: prevent group0 majority loss
  test: delete test_service_levels_work_during_recovery
2026-02-03 08:19:05 +02:00
Pavel Emelyanov
19ea05692c view_build_worker: Do not switch scheduling groups inside work_on_view_building_tasks
The handler appeared back in c9e710dca3. In this commit it performed the
"core" part of the task -- the do_build_range() method -- inside the
streaming sched group. The setup code looks seemingly was copied from the
view_builder::do_build_step() method and got the explicit switch of the
scheduling group.

The switch looks both -- justified and not. On one hand, it makes it
explict that the activity runs in the streaming scheduling group. On the
other hand, the verb already uses RPC index on 1, which is negotiated to
be run in streaming group anyway. On the "third hand", even though being
explicit the switch happens too late, as there exists a lot of other
activities performed by the handler that seems to also belong to the
same scheduling group, but which is not switched into explicitly.

By and large, it seems better to avoid the explicit switch and rely on
the RPC-level negotiation-based sched group switching.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28397
2026-02-03 07:00:32 +02:00
Anna Stuchlik
77480c9d8f doc: fix the links on the repair-related pages
This is a follow-up to https://github.com/scylladb/scylladb/pull/28199.

This commit fixes the syntax of the internal links.

Fixes https://github.com/scylladb/scylladb/issues/28486

Closes scylladb/scylladb#28487
2026-02-03 06:54:08 +02:00
Botond Dénes
64b38a2d0a Merge 'Use gossiper scheduling group where needed' from Pavel Emelyanov
This is the continuation of #28363 , this time about getting gossiper scheduling group via database.
Several places that do it already have gossiper at hand and should better get the group from it.
Eventually, this will allow to get rid of database::get_gossip_scheduling_group().

Refining inter-components API, not backporting

Closes scylladb/scylladb#28412

* github.com:scylladb/scylladb:
  gossiper: Export its scheduling group for those who need it
  migration_manager: Reorder members
2026-02-03 06:51:31 +02:00
Nadav Har'El
48b01e72fa test/alternator: add test verifying that keys only allow S/B/N type
Recently we had a question whether key columns can have any supported
type. I knew that actually - they can't, that key columns can have only
the types S(tring), B(inary) or N(umber), and that is all. But it turns
out we never had a test that confirms this understanding is true.

We did have a test for it for GSI key types already,
test_gsi.py::test_gsi_invalid_key_types, but we didn't have one for the
base table. So in this patch we add this missing test, and confirm that,
indeed, both DynamoDB and Alternator refuse a key attribute with any
type other than S, B or N.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28479
2026-02-03 06:49:02 +02:00
Andrei Chekun
ed9a96fdb7 test.py: modify logic for adding function_path in JUnit
Current way is checking only fail during the test phase, and it will
miss the cases when fail happens on another phase. This PR eliminate
this, so every phase will have modified node reporter to enrich the
JUnit XML report with custom attribute function_path.

Closes scylladb/scylladb#28462
2026-02-03 06:42:18 +02:00
Andrei Chekun
3a422e82b4 test.py: fix the file name in test summary
Current way is always assumed that the error happened in the test file,
but that not always true. This PR will show the error from the boost
logger where actually error is happened.

Closes scylladb/scylladb#28429
2026-02-03 06:38:21 +02:00
Benny Halevy
84caa94340 gossiper: add_expire_time_for_endpoint: replace fmt::localtime with gmtime in log printout
1. fmt::localtime is deprecated.
2. We should really print times in UTC, especially on the cloud.
3. The current log message does not print the timezone so it'd unclear
   to anyone reading the lof message if the expiration time is in the
   local timezone or in GMT/UTC.

Fixes the following warning:
```
gms/gossiper.cc:2428:28: warning: 'localtime' is deprecated [-Wdeprecated-declarations]
 2428 |             endpoint, fmt::localtime(clk::to_time_t(expire_time)), expire_time.time_since_epoch().count(),
      |                            ^
/usr/include/fmt/chrono.h:538:1: note: 'localtime' has been explicitly marked deprecated here
  538 | FMT_DEPRECATED inline auto localtime(std::time_t time) -> std::tm {
      | ^
/usr/include/fmt/base.h:207:28: note: expanded from macro 'FMT_DEPRECATED'
  207 | #  define FMT_DEPRECATED [[deprecated]]
      |                            ^
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#28434
2026-02-03 06:36:53 +02:00
Pavel Emelyanov
8c42704c72 storage_service: Check raft rpc scheduling group from debug namespace
Some storage_service rpc verbs may checks that a handler is executed
inside gossiper scheduling group. For that, the expected group is
grabbed from database.

This patch puts the gossiper sched group into debug namespace and makes
this check use it from there. It removes one more place that uses
database as config provider.

Refs #28410

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28427
2026-02-03 06:34:03 +02:00
Asias He
b5c3587588 repair: Add request type in the tablet repair log
So we can know if the repair is an auto repair or a user repair.

Fixes SCYLLADB-395

Closes scylladb/scylladb#28425
2026-02-03 06:26:58 +02:00
Nadav Har'El
a63ad48b0f test/cqlpy: remove xfail from tests for fixed issue 7972
The test test_to_json_double used to fail due to #7972, but this issue
was already fixed in Scylla 5.1 and we didn't notice.
So remove the xfail marker from this test, and also update another test
which still xfails but no longer due to this issue.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-02 23:49:32 +02:00
Nadav Har'El
10b81c1e97 test/cqlpy: remove xfail from tests for fixed issue 10358
The tests testWithUnsetValues and testFilteringWithoutIndices used to fail
due to #10358, but this issue was already fixed three years ago, when the
UNSET-checking code was cleaned up, and the test is now passing.
So remove the xfail marker from these tests.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-02 23:49:31 +02:00
Nadav Har'El
508bb97089 test/cqlpy: remove xfail from passing test testInvalidNonFrozenUDTRelation
The test testInvalidNonFrozenUDTRelation used to fail due to #10632
(an incorrectly-printed column name in an error message) and was marked
"xfail". But this issue has already been fixed two years ago, and
the test is now passing. So remove the xfail marker.
2026-02-02 23:49:31 +02:00
Nadav Har'El
3682c06157 test/alternator: remove xfail from passing test_update_item_increases_metrics_for_new_item_size_only
The test test_metrics.py::test_update_item_increases_metrics_for_new_item_size_only
tests whether the Alternator metrics report the exactly-DynamoDB-compatible
WCU number. It is parameterized with two cases - one that uses
alternator_force_read_before_write and one which doesn't.

The case that uses alternator_force_read_before_write is expected to
measure the "accurate" WCU, and currently it doesn't, so the test
rightly xfails.
But the case that doesn't use alternator_force_read_before_write is not
expected to measure the "accurate" WCU and has a different expectation,
so this case actually passes. But because the entire test is marked
xfail, it is reported as "XPASS" - unexpected pass.

Fix this by marking only the "True" case with xfail, while the "False"
case is not marked. After this pass, the True case continues to XFAIL
and the False case passes normally, instead of XPASS.

Also removed a sentence promising that the failing case will be solved
"by the next PR". Clearly this didn't happen. Maybe we even have such
a PR open (?), but it won't the "the next PR" even if merged today.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-02 23:49:31 +02:00
Nadav Har'El
df69dbec2a Merge ' cql3/statements/describe_statement: hide paxos state tables ' from Michał Jadwiszczak
Paxos state tables are internal tables fully managed by Scylla
and they shouldn't be exposed to the user nor they shouldn't be backed up.

This commit hides those kind of tables from all listings and if such table
is directly described with `DESC ks."tbl$paxos"`, the description is generated
withing a comment and a note for the user is added.

Fixes https://github.com/scylladb/scylladb/issues/28183

LWT on tablets and paxos state tables are present in 2025.4, so the patch should be backported to this version.

Closes scylladb/scylladb#28230

* github.com:scylladb/scylladb:
  test/cqlpy: add reproducer for hidden Paxos table being shown by DESC
  cql3/statements/describe_statement: hide paxos state tables
2026-02-02 21:22:59 +02:00
Nadav Har'El
f23e796e76 alternator: fix typos in comments and variable names
Copilot found these typos in comments and variable name in alternator/,
so might as well fix them.

There are no functional changes in this patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28447
2026-02-02 19:16:43 +03:00
Marcin Maliszkiewicz
88c4ca3697 Merge 'test: migrate guardrails_test.py from scylla-dtest' from Andrzej Jackowski
This patch series copies `guardrails_test.py` from scylla-dtest, fix it and enables it.

The motivation is to unify the test execution of guardrails test, as some tests (`cqlpy/test_guardrail_...`) were already in scylladb repo, and some were in `scylla-dtest`.

Fixes: SCYLLADB-255

No backport, just test migration

Closes scylladb/scylladb#28454

* github.com:scylladb/scylladb:
  test: refactor test_all_rf_limits in guardrails_test.py
  test: specify exceptions being caught in guardrails_test.py
  test: enable guardrails_test.py
  test: add wait_other_notice to test_default_rf in guardrails_test.py
  test: copy guardrails_test.py from scylla-dtest
2026-02-02 16:54:13 +01:00
Avi Kivity
acc54cf304 tools: toolchain: adapt future toolchain to loss of toxiproxy in Fedora
Next Fedora will likely not have toxiproxy packaged [1]. Adapt
by installing it directly. To avoid changing the current toolchain,
add a ./install-dependencies --future option. This will allow us
to easily go back to the packages if the Fedora bug is fixed.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2426954

Closes scylladb/scylladb#28444
2026-02-02 17:02:19 +02:00
Avi Kivity
419636ca8f test: ldap: regularize toxiproxy command-line options
Modern toxiproxy interprets `-h` as help and requires the subcommand
subject (e.g. the proxy name) to be after the subcommand switches.
Arrange the command line in the way it likes, and spell out the
subcommands to be more comprehensible.

Closes scylladb/scylladb#28442
2026-02-02 17:00:58 +02:00
Botond Dénes
2b3f3d9ba7 Merge 'test.py: support boost labels in test.py' from Artsiom Mishuta
related PR: https://github.com/scylladb/scylladb/pull/27527

This PR changes test.py logic of parsing boost test cases to use -- --list_json_content
and pass boost labels as pytests markers

using  -- --list_json_content is not ideal and currenly require to implement severall [workarounds](https://github.com/scylladb/scylladb/pull/27527#issuecomment-3765499812), but having the ability to support boost labels in pytest is worth it. because now we can apply the tiering mechanism for the boost tests as well

Fixes SCYLLADB-246

Closes scylladb/scylladb#28232

* github.com:scylladb/scylladb:
  test: add nightly label
  test.py: support boost labels in test.py
2026-02-02 16:55:29 +02:00
Dawid Mędrek
68981cc90b Merge 'raft topology: generate notification about released nodes only once' from Piotr Dulikowski
Hints destined for some other node can only be drained after the other node is no longer a replica of any vnode or tablet. In case when tablets are present, a node might still technically be a replica of some tablets after it moved to left state. When it no longer is a replica of any tablet, it becomes "released" and storage service generates a notification about it. Hinted handoff listens to this notification and kicks off draining hints after getting it.

The current implementation of the "released" notification would trigger every time raft topology state is reloaded and a left node without any tokens is present in the raft topology. Although draining hints is idempotent, generating duplicate notifications is wasteful and recently became very noisy after in 44de563 verbosity of the draining-related log messages have been increased. The verbosity increase itself makes sense as draining is supposed to be a rare operation, but the duplicate notification bug now needs to be addressed.

Fix the duplicate notification problem by passing the list of previously released nodes to the `storage_service::raft_topology_update_ip` function and filtering based on it. If this function processes the topology state for the first time, it will not produce any notifications. This is fine as hinted handoff is prepared to detect "released" nodes during the startup sequence in main.cc and start draining the hints there, if needed.

Fixes: scylladb/scylladb#28301
Refs: scylladb/scylladb#25031

The log messages added in 44de563 cause a lot of noise during topology operations and tablet migrations, so the fix should be backported to all affected versions (2025.4 and 2026.1).

Closes scylladb/scylladb#28367

* github.com:scylladb/scylladb:
  storage_service: fix indentation after previous patch
  raft topology: generate notification about released nodes only once
  raft topology: extract "released" nodes calculation to external function
2026-02-02 15:39:15 +01:00
Jenkins Promoter
c907fc6789 Update pgo profiles - aarch64 2026-02-02 14:56:49 +02:00
Dawid Mędrek
b0afd3aa63 Merge 'storage_service: set up topology properly in maintenance mode' from Patryk Jędrzejczak
We currently make the local node the only token owner (that owns the
whole ring) in maintenance mode, but we don't update the topology properly.
The node is present in the topology, but in the `none` state. That's how
it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in
`scylla_main`. As a result, the node started in maintenance mode crashes
in the following way in the presence of a vnodes-based keyspace with the
NetworkTopologyStrategy:
```
scylla: locator/network_topology_strategy.cc:207:
    locator::natural_endpoints_tracker::natural_endpoints_tracker(
    const token_metadata &, const network_topology_strategy::dc_rep_factor_map &):
    Assertion `!_token_owners.empty() && !_racks.empty()' failed.
```
Both `_token_owners` and `_racks` are empty. The reason is that
`_tm.get_datacenter_token_owners()` and
`_tm.get_datacenter_racks_token_owners()` called above filter out nodes
in the `none` state.

This bug basically made maintenance mode unusable in customer clusters.

We fix it by changing the node state to `normal`.

We also extend `test_maintenance_mode` to provide a reproducer for

Fixes #27988

This PR must be backported to all branches, as maintenance mode is
currently unusable everywhere.

Closes scylladb/scylladb#28322

* github.com:scylladb/scylladb:
  test: test_maintenance_mode: enable maintenance mode properly
  test: test_maintenance_mode: shutdown cluster connections
  test: test_maintenance_mode: run with different keyspace options
  test: test_maintenance_mode: check that group0 is disabled by creating a keyspace
  test: test_maintenance_mode: get rid of the conditional skip
  test: test_maintenance_mode: remove the redundant value from the query result
  storage_proxy: skip validate_read_replica in maintenance mode
  storage_service: set up topology properly in maintenance mode
2026-02-02 13:28:19 +01:00
Andrzej Jackowski
298aca7da8 test: refactor test_all_rf_limits in guardrails_test.py
Before this commit, `test_all_rf_limits` was implemented in a
repetitive manner, making it harder to understand how the guardrails
were tested. This commit refactors the test to reduce code redundancy
and verify the guardrails more explicitly.
2026-02-02 10:49:12 +01:00
Andrzej Jackowski
136db260ca test: specify exceptions being caught in guardrails_test.py
Before this commit, the test caught a broad `Exception`. This change
specifies the expected exceptions to avoid a situation where the product
or test is broken and it goes undetected.
2026-02-02 10:48:07 +01:00
Patryk Jędrzejczak
ec2f99b3d1 test: pylib: scylla_cluster: set shutdown_announce_in_ms to 0
The usual Scylla shutdown in a cluster test takes ~2.1s. 2s come from
```
co_await sleep(std::chrono::milliseconds(_gcfg.shutdown_announce_ms));
```
as the default value of `shutdown_announce_in_ms` is 2000. This sleep
makes every `server_stop_gracefully` call 2s slower. There are ~300 such
calls in cluster tests (note that some come from `rolling_restart`). So,
it looks like this sleep makes cluster tests 300 * 2s = 10min slower.
Indeed, `./test.py --mode=dev cluster` takes 61min instead of 71min
on the potwor machine (the one in the Warsaw office) without it.

We set `shutdown_announce_in_ms` to 0 for all cluster tests to make them
faster.

The sleep is completely unnecessary in tests. Removing it could introduce
flakiness, but if that's the case, then the test for which it happens is
incorrect in the first place. Tests shouldn't assume that all nodes
receive and handle the shutdown message in 2s. They should use functions
like `server_not_sees_other_server` instead, which are faster and more
reliable.
2026-02-02 10:39:55 +01:00
Patryk Jędrzejczak
1f28a55448 test: test_tablets_parallel_decommission: prevent group0 majority loss
Both of the changed test cases stop two out of four nodes when there are
three group0 voters in the cluster. If one of the two live nodes is
a non-voter (node 1, specifically, as node 0 is the leader), a temporary
majority loss occurs, which can cause the following operations to fail.
In the case of `test_tablets_are_rebuilt_in_parallel`, the `exclude_node`
API can fail. In the case of `test_remove_is_canceled_if_there_is_node_down`,
removenode can fail with an unexpected error message:
```
"service::raft_operation_timeout_error (group
[46dd9cf1-fe21-11f0-baa0-03429f562ff5] raft operation [read_barrier] timed out)"
```

Somehow, these test cases are currently not flaky, but they become flaky in
the following commit.

We can consider backporting this commit to 2026.1 to prevent flakiness.
2026-02-02 10:39:55 +01:00
Patryk Jędrzejczak
bcf0114e90 test: delete test_service_levels_work_during_recovery
The test becomes flaky in one of the following commits. However, there is
no need to fix it, as we should delete it anyway. We are in the process of
removing the gossip-based topology from the code base, which includes the
recovery mode. We don't have to rewrite the test to use the new Raft-based
recovery procedure, as there is nothing interesting to test (no regression
to legacy service levels).
2026-02-02 10:39:54 +01:00
Artsiom Mishuta
af2d7a146f test: add nightly label
add nightly label for test
test_foreign_reader_as_mutation_source
as an example of usinf boost labels pytest as markers

command to test :
./tools/toolchain/dbuild  pytest --test-py-init --collect-only -q -m=nightly test/boost

output:
boost/mutation_reader_test.cc::test_foreign_reader_as_mutation_source.debug.1
boost/mutation_reader_test.cc::test_foreign_reader_as_mutation_source.release.1
boost/mutation_reader_test.cc::test_foreign_reader_as_mutation_source.dev.1
2026-02-02 10:30:38 +01:00
Gleb Natapov
08268eee3f topology: disable force-gossip-topology-changes option
The patch marks force-gossip-topology-changes as deprecated and removes
tests that use it. There is one test (test_different_group0_ids) which
is marked as xfail instead since it looks like gossiper mode was used
there as a way to easily achieve a certain state, so more investigation
is needed if the tests can be fixed to use raft mode instead.

Closes scylladb/scylladb#28383
2026-02-02 09:56:32 +01:00
Avi Kivity
ceec703bb7 Revert "main: test: add future and abort_source to after_init_func"
This reverts commit 7bf7ff785a. The commit
tried to add clean shutdown to `scylla perf` paths, but forgot at least
`scylla perf-alternator --workload wr` which now crashes on uninitialized
`c.as`.

Fixes #28473

Closes scylladb/scylladb#28478
2026-02-02 09:22:24 +01:00
Avi Kivity
cc03f5c89d cql3: support literals and bind variables in selectors
Add support for literals in the SELECT clause. This allows
SELECT fn(column, 4) or SELECT fn(column, ?).

Note, "SELECT 7 FROM tab" becomes valid in the grammar, but is still
not accepted because of failed type inference - we cannot infer the
type of 7, and don't have a favored type for literals (like C favors
int). We might relax this later.

In the WHERE clause, and Cassandra in the SELECT clause, type hints
can also resolve type ambiguity: (bigint)7 or (text)?. But this is
deferred to a later patch.

A few changes to the grammar are needed on top of adding a `value`
alternative to `unaliasedSelector`:

 - vectorSimilarityArg gained access to `value` via `unaliasedSelector`,
   so it loses that alternate to avoid ambiguity. We may drop
   `vectorSimilarityArg` later.
 - COUNT(1) became ambiguous via the general function path (since
   function arguments can now be literals), so we remove this case
   from the COUNT special cases, remaining with count(*).
 - SELECT JSON and SELECT DISTINCT became "ambiguous enough" for
   ANTLR to complain, though as far as I can tell `value` does not
   add real ambiguity. The solution is to commit early (via "=>") to
   a parsing path.

Due to the loss of count(1) recognition in the parser, we have to
special-case it in prepare. We may relax it to count any expression
later, like modern Cassandra and SQL.

Testing is awkward because of the type inference problem in top-level.
We test via the set_intersection() function and via lua functions.

Example:

```
cqlsh> CREATE FUNCTION ks.sum(a int, b int) RETURNS NULL ON NULL INPUT RETURNS int  LANGUAGE LUA AS 'return a + b';
cqlsh> SELECT ks.sum(1, 2) FROM system.local;

 ks.sum(1, 2)
--------------
            3

(1 rows)
cqlsh>
```

(There are no suitable system functions!)

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-296

Closes scylladb/scylladb#28256
2026-02-02 00:06:13 +02:00
Patryk Jędrzejczak
68b105b21c db: virtual tables: add the rack column to cluster_status
`system.cluster_status` is missing the rack info compared to `nodetool status`
that is supposed to be equivalent. It has probably been an omission.

Closes scylladb/scylladb#28457
2026-02-01 20:36:53 +01:00
Pavel Emelyanov
6f3f30ee07 storage_service: Use stream_manager group for streaming
The hander of raft_topology_cmd::command::stream_ranges switches to
streaming scheduling group to perform data streaming in it. It grabs the
group from database db_config, which's not great. There's streaming
manager at hand in storage service handlers, since it's using its
functionality, it should use _its_ scheduling group.

This will help splitting the streaming scheduling group into more
elaborated groups under the maintenance supergroup: SCYLLADB-351

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28363
2026-02-01 20:42:37 +02:00
Marcin Maliszkiewicz
b8c75673c8 main: remove confusing duplicated auth start message
Before we could observe two exactly the same
"starting auth service" messages in the log.
One from checkpoint() the other from notify().
We remove the second one to stay consistent
with other services.

Closes scylladb/scylladb#28349
2026-02-01 13:57:53 +02:00
Avi Kivity
6676953555 Merge 'test: perf: add option to write results to json in perf-cql-raw and perf-alternator' from Marcin Maliszkiewicz
Adds --json-result option to perf-cql-raw and perf-alternator, the same as perf-simple-query has.
It is useful for automating test runs.

Related: https://scylladb.atlassian.net/browse/SCYLLADB-434

Bacport: no, original benchmark is not backported

Closes scylladb/scylladb#28451

* github.com:scylladb/scylladb:
  test: perf: add example commands to perf-alternator and perf-cql-raw
  test: perf: add option to write results to json in perf-cql-raw
  test: perf: add option to write results to json in perf-alternator
  test: perf: move write_json_result to a common file
2026-02-01 13:57:10 +02:00
Artsiom Mishuta
e216504113 test.py: support boost labels in test.py
related PR: https://github.com/scylladb/scylladb/pull/27527

This PR changes test.py logic of parsing boost test cases to use -- --list_json_content
and pass boost labels as pytests markers

fixes: https://github.com/scylladb/scylladb/issues/25415
2026-02-01 11:31:26 +01:00
Tomasz Grabiec
b93472d595 Merge 'load_stats: fix problem with load_stats refresh throwing no_such_column_family' from Ferenc Szili
When the topology coordinator refreshes load_stats, it caches load_stats for every node. In case the node becomes unresponsive, and fresh load_stats can not be read from the node, the cached version of load_stats will be used. This is to allow the load balancer to have at least some information about the table sizes and disk capacities of the host.

During load_stats refresh, we aggregate the table sizes from all the nodes. This procedure calls db.find_column_family() for each table_id found in load_stats. This function will throw if the table is not found. This will cause load_stats refresh to fail.

It is also possible for a table to have been dropped between the time load_stats has been prepared on the host, and the time it is processed on the topology coordinator. This would also cause an exception in the refresh procedure.

This fixes this problem by checking if the table still exists.

Fixes: #28359

Closes scylladb/scylladb#28440

* github.com:scylladb/scylladb:
  test: add test and reproducer for load_stats refresh exception
  load_stats: handle dropped tables when refreshing load_stats
2026-01-31 21:12:19 +01:00
Ferenc Szili
92dbde54a5 test: add test and reproducer for load_stats refresh exception
This patch adds a test and reproducer for the issue where the load_stats
refresh procedure throws exceptions if any of the tables have been
dropped since load_stats was produced.
2026-01-30 15:11:29 +01:00
Patryk Jędrzejczak
7e7b9977c5 test: test_maintenance_mode: enable maintenance mode properly
The same issue as the one fixed in
394207fd69.
This one didn't cause real problems, but it's still cleaner to fix it.
2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak
6c547e1692 test: test_maintenance_mode: shutdown cluster connections
Leaked connections are known to cause inter-test issues.
2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak
867a1ca346 test: test_maintenance_mode: run with different keyspace options
We extend the test to provide a reproducer for #27988 and to avoid
similar bugs in the future.

The test slows down from ~14s to ~19s on my local machine in dev
mode. It seems reasonable.
2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak
53f58b85b7 test: test_maintenance_mode: check that group0 is disabled by creating a keyspace
In the following commit, we make the rest run with multiple keyspaces,
and the old check becomes inconvenient. We also move it below to the
part of the code that won't be executed for each keyspace.

Additionally, we check if the error message is as expected.
2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak
408c6ea3ee test: test_maintenance_mode: get rid of the conditional skip
This skip has already caused trouble.
After 0668c642a2, the skip was always hit, and
the test was silently doing nothing. This made us miss #26816 for a long
time. The test was fixed in 222eab45f8, but we
should get rid of the skip anyway.

We increase the number of writes from 256 to 1000 to make the chance of not
finding the key on server A even lower. If that still happens, it must be
due to a bug, so we fail the test. We also make the test insert rows until
server A is a replica of one row. The expected number of inserted rows is
a small constant, so it should, in theory, make the test faster and cleaner
(we need one row on server A, so we insert exactly one such row).

It's possible to make the test fully deterministic, by e.g., hardcoding
the key and tokens of all nodes via `initial_token`, but I'm afraid it would
make the test "too deterministic" and could hide a bug.
2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak
c92962ca45 test: test_maintenance_mode: remove the redundant value from the query result 2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak
9d4a5ade08 storage_proxy: skip validate_read_replica in maintenance mode
In maintenance mode, the local node adds only itself to the topology. However,
the effective replication map of a keyspace with tablets enabled contains all
tablet replicas. It gets them from the tablets map, not the topology. Hence,
`network_topology_strategy::sanity_check_read_replicas` hits
```
throw std::runtime_error(format("Requested location for node {} not in topology. backtrace {}", id, lazy_backtrace()));
```
for tablet replicas other than the local node.

As a result, all requests to a keyspace with tablets enabled and RF > 1 fail
in debug mode (`validate_read_replica` does nothing in other modes). We don't
want to skip maintenance mode tests in debug mode, so we skip the check in
maintenance mode.

We move the `is_debug_build()` check because:
- `validate_read_replicas` is a static function with no access to the config,
- we want the `!_db.local().get_config().maintenance_mode()` check to be
  dropped by the compiler in non-debug builds.

We also suppress `-Wunneeded-internal-declaration` with `[[maybe_unused]]`.
2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak
a08c53ae4b storage_service: set up topology properly in maintenance mode
We currently make the local node the only token owner (that owns the
whole ring) in maintenance mode, but we don't update the topology properly.
The node is present in the topology, but in the `none` state. That's how
it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in
`scylla_main`. As a result, the node started in maintenance mode crashes
in the following way in the presence of a vnodes-based keyspace with the
NetworkTopologyStrategy:
```
scylla: locator/network_topology_strategy.cc:207:
    locator::natural_endpoints_tracker::natural_endpoints_tracker(
    const token_metadata &, const network_topology_strategy::dc_rep_factor_map &):
    Assertion `!_token_owners.empty() && !_racks.empty()' failed.
```
Both `_token_owners` and `_racks` are empty. The reason is that
`_tm.get_datacenter_token_owners()` and
`_tm.get_datacenter_racks_token_owners()` called above filter out nodes
in the `none` state.

This bug basically made maintenance mode unusable in customer clusters.

We fix it by changing the node state to `normal`. We also update its
rack, datacenter, and shards count. Rack and datacenter are present in the
topology somehow, but there is nothing wrong with updating them again.
The shard count is also missing, so we better update it to avoid other
issues.

Fixes #27988
2026-01-30 12:55:16 +01:00
Andrzej Jackowski
625f292417 test: enable guardrails_test.py
After guardrails_test.py has been migrated to test.py and fixed in
previous commits of this patch series, it can finally be enabled.

Fixes: SCYLLADB-255
2026-01-30 11:51:46 +01:00
Andrzej Jackowski
576ad29ddb test: add wait_other_notice to test_default_rf in guardrails_test.py
This commit adds `wait_other_notice=True` to `cluster.populate` in
`guardrails_test.py`. Without this, `test_default_rf` sometimes fails
because `NetworkTopologyStrategy` setting fails before
the node knows about all other DCs.

Refs: SCYLLADB-255
2026-01-30 11:51:46 +01:00
Andrzej Jackowski
64c774c23a test: copy guardrails_test.py from scylla-dtest
This commit copies guardrails_test.py from dtest repository and
(temporarily) disables it, as it requires improvement in following
commits of this patch series before being enabled.

Refs: SCYLLADB-255
2026-01-30 11:51:40 +01:00
Marcin Maliszkiewicz
e18b519692 cql3: remove find_schema call from select check_access
Schema is already a member of select statement, avoiding
the call saves around 400 cpu instructions on a select
request hot path.

Closes scylladb/scylladb#28328
2026-01-30 11:49:09 +01:00
Ferenc Szili
71be10b8d6 load_stats: handle dropped tables when refreshing load_stats
When the topology coordinator refreshes load_stats, it caches load_stats
for every node. In case the node becomes unresponsive, and fresh
load_stats can not be read from the node, the cached version of
load_stats will be used. This is to allow the load balancer to
have at least some information about the table sizes and disk capacities
of the host.

During load_stats refresh, we aggregate the table sizes from all the
nodes. This procedure calls db.find_column_family() for each table_id
found in load_stats. This function will throw if the table is not found.
This will cause load_stats refresh to fail.

It is also possible for a table to have been dropped between the time
load_stats has been prepared on the host, and the time it is processed
on the topology coordinator. This would also cause an exception in the
refresh procedure.

This patch fixes this problem by checking if the table still exists.
2026-01-30 09:48:59 +01:00
Marcin Maliszkiewicz
80e627c64b test: perf: add example commands to perf-alternator and perf-cql-raw 2026-01-30 08:48:19 +01:00
Pawel Pery
f49c9e896a vector_search: allow full secondary indexes syntax while creating the vector index
Vector Search feature needs to support creating vector indexes with additional
filtering column. There will be two types of indexes: global which indexes
vectors per table, and local which indexes vectors per partition key. The new
syntaxes are based on ScyllaDB's Global Secondary Index and Local Secondary
Index. Vector indexes don't use secondary indexes functionalities in any way -
all indexing, filtering and processing data will be done on Vector Store side.

This patch allows creating vector indexes using this CQL syntax:

```
CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  discussion_board_id int,
  country text,
  lang text,
  PRIMARY KEY ((commenter, discussion_board_id), created_at)
);

CREATE CUSTOM INDEX IF NOT EXISTS global_ann_index
  ON cycling.comments_vs(comment_vector, country, lang) USING 'vector_index'
  WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };

CREATE CUSTOM INDEX IF NOT EXISTS local_ann_index
  ON cycling.comments_vs((commenter, discussion_board_id), comment_vector, country, lang)
  USING 'vector_index'
  WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };
```

Currently, if we run these queries to create indexes we will receive such errors:

```
InvalidRequest: Error from server: code=2200 [Invalid query] message="Vector index can only be created on a single column"
InvalidRequest: Error from server: code=2200 [Invalid query] message="Local index definition must contain full partition key only. Redundant column: XYZ"
```

This commit refactors `vector_index::check_target` to correctly validate
columns building the index. Vector-store currently support filtering by native
types, so the type of columns is checked. The first column from the list must
be a vector (to build index based on these vectors), so it is also checked.

Allowed types for columns are native types without counter (it is not possible
to create a table with counter and vector) and without duration (it is not
possible to correctly compare durations, this type is even not allowed in
secondary indexes).

This commits adds cqlpy test to check errors while creating indexes.

Fixes: SCYLLADB-298

This needs to be backported to version 2026.1 as this is a fix for filtering support.

Closes scylladb/scylladb#28366
2026-01-30 01:14:31 +02:00
Avi Kivity
3d1558be7e test: remove xfail markers from SELECT JSON count(*) tests
These were marked xfail due to #8077 (the column name was wrong),
but it was fixed long ago for 5.4 (exact commit not known).

Remove the xfail markers to prevent regressions.

Closes scylladb/scylladb#28432
2026-01-29 21:56:00 +02:00
Piotr Dulikowski
f150629948 Merge 'auth: switch find_record to use cache' from Marcin Maliszkiewicz
This series optimizes role lookup by moving find_record into standard_role_manager and switching it to use the auth cache. This allows reverting can_login to its original simpler form, ensuring hot paths are properly cached while maintaining consistency via group0_guard.

Backport: no, it's not a bug fix.

Closes scylladb/scylladb#28329

* github.com:scylladb/scylladb:
  auth: bring back previous version of standard_role_manager::can_login
  auth: switch find_record to use cache
  auth: make find_record and callers standard_role_manager members
2026-01-29 17:25:42 +01:00
Avi Kivity
7984925059 Merge 'Use coroutine::switch_to() in table::try_flush_memtable_to_sstable' from Pavel Emelyanov
The method was coroutinized by 6df07f7ff7. Back then thecoroutine::switch_to() wasn't available, and the code used with_scheduling_group() to call coroutinized lambdas. Those lambdas were implemented as on-stack variables to solve the capture list lifetime problems. As a result, the code looks like

```
auto flush = [] {
    ... // do the flushing
    auto post_flush = [] {
        ... // do the post-flushing
    }
    co_return co_await with_scheduling_group(group_b, post_flush);
};
co_return co_await with_scheduling_group(group_a, flush);
```

which is a bit clumsy. Now we have switch_to() and can make the code flow of this method more readable, like this

```
co_await switch_to(group_a);
... // do the flushing
co_await switch_to(group_b);
... // do the post-flushing
```

Code cleanup, not backporting

Closes scylladb/scylladb#28430

* github.com:scylladb/scylladb:
  table: Fix indentation after previous patch
  table: Use coroutine::switch_to() in try_flush_memtable_to_sstable()
2026-01-29 18:12:35 +02:00
Nadav Har'El
a6fdda86b5 Merge 'test: test_alternator_proxy_protocol: fix race between node startup and test start' from Avi Kivity
test_alternator_proxy_protocol starts a node and connects via the alternator ports.
Starting a node, by default, waits until the CQL ports are up. This does not guarantee
that the alternator ports are up (they will be up very soon after this), so there is a short
window where a connection to the alternator ports will fail.

Fix by adding a ServerUpState=SERVING mode, which waits for the node to report
to its supervisor (systemd, which we are pretending to be) that its ports are open.
The test is then adjusted to request this new ServerUpState.

Fixes #28210
Fixes #28211

Flaky tests are only in master and branch-2026.1, so backporting there.

Closes scylladb/scylladb#28291

* github.com:scylladb/scylladb:
  test: test_alternator_proxy_protocol: wait for the node to report itself as serving
  test: cluster_manager: add ability to wait for supervisor STATUS=serving
2026-01-29 16:18:26 +02:00
Pavel Emelyanov
56e212ea8d table: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-29 15:02:33 +03:00
Pavel Emelyanov
258a1a03e3 table: Use coroutine::switch_to() in try_flush_memtable_to_sstable()
It allows dropping the local lambdas passed into with_scheduling_group()
calls. Overall the code flow becomes more readable.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-29 15:01:27 +03:00
Marcin Maliszkiewicz
ea29e4963e test: perf: add option to write results to json in perf-cql-raw 2026-01-29 10:56:03 +01:00
Marcin Maliszkiewicz
d974ee1e21 test: perf: add option to write results to json in perf-alternator 2026-01-29 10:55:52 +01:00
Marcin Maliszkiewicz
a74b442c65 test: perf: move write_json_result to a common file
The implementation is going to be shared with
perf-alternator and perf-cql-raw.
2026-01-29 10:54:11 +01:00
Botond Dénes
3158e9b017 doc: reorganize properties in config.cc and config.hh
This commit moves the "Ungrouped properties" category to the end of the
properties list. The properties are now published in the documentation,
and it doesn't look good if the list starts with ungrouped properties.

This patch was taken over from Anna Stuchlik <anna.stuchlik@scylladb.com>.

Closes scylladb/scylladb#28343
2026-01-29 11:27:42 +03:00
Pavel Emelyanov
937d008d3c Merge 'Clean up partition_snapshot_reader' from Botond Dénes
Move to `replica/`, drop `flat` from name and drop unused usages as well as unused includes.

Code cleanup, no backport

Closes scylladb/scylladb#28353

* github.com:scylladb/scylladb:
  replica/partition_snapshot_reader: remove unused includes
  partition_snapshot_reader: remove "flat" from name
  mv partition_snapshot_reader.hh -> replica/
2026-01-29 11:22:15 +03:00
Botond Dénes
f6d7f606aa memtable_test: disable flushing_rate_is_reduced_if_compaction_doesnt_keep_up for debug
This test case was observed to take over 2 minutes to run on CI
machines, contributing to already bloated CI run times.
Disable this test in debug mode. This test checks for memtable flush
being slowed down when compaction can't keep up. So this test needs to
overwhelm the CPU by definition. On the other hand, this is not a
correctness test, there are such tests for the memtable and compaction
already, so it is not critical to run this in debug mode, it is not
expected to catch any use-after-free and such.

Closes scylladb/scylladb#28407
2026-01-29 11:13:22 +03:00
Jakub Smolar
e978cc2a80 scylla_gdb: use persistent GDB - decrease test execution time
This commit replaces the previous approach of running pytest inside
GDB’s Python interpreter. Instead, tests are executed by driving a
persistent GDB process externally using pexpect.

- pexpect: Python library for controlling interactive programs
  (used here to send commands to GDB and capture its output)
- persistent GDB: keep one GDB session alive across multiple tests
  instead of starting a new process for each test

Tests can now be executed via `./test.py gdb` or with
`pytest test/scylla_gdb`. This improves performance and
makes failures easier to debug since pytest no longer runs
hidden inside GDB subprocesses.

Closes scylladb/scylladb#24804
2026-01-29 10:01:39 +02:00
Avi Kivity
347c69b7e2 build: add clang-tools-extra (for clang-include-cleaner) to frozen toolchain
clang-include-cleaner is used in the iwyu.yaml github workflow (include-
what-you-use). Add it to the frozen toolchain so it can be made part
of the regular build process.

The corresponding install command is removed from iwyu.yaml.

Regenerated frozen toolchain with optimized clang from

  https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-aarch64.tar.gz
  https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-x86_64.tar.gz

Closes scylladb/scylladb#28413
2026-01-29 08:44:49 +02:00
Botond Dénes
482ffe06fd Merge 'Improve load shedding on the replica side' from Łukasz Paszkowski
When reads arrive, they have to wait for admission on the reader
concurrency semaphore. If the node is overloaded, the reads will
be queued. They can time out while in the queue, but will not time
out once admitted.

Once the shard is sufficiently loaded, it is possible that most
queued reads will time out, because the average time it takes to
for a queued read to be admitted is around that of the timeout.

If a read times out, any work we already did, or are about to do
on it is wasted effort. Therefore, the patch tries to prevent it
by checking if an admitted read has a chance to complete in time
and abort it if not. It uses the following criteria:

if read's remaining time <= read's timeout when arrived to the semaphore * live updateable preemptive_abort_factor;
the read is rejected and the next one from the wait list is considered.

Fixes https://github.com/scylladb/scylladb/issues/14909
Fixes: SCYLLADB-353

Backport is not needed. Better to first observe its impact.

Closes scylladb/scylladb#21649

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: Check during admission if read may timeout
  permit_reader::impl: Replace break with return after evicting inactive permit on timeout
  reader_concurrency_semaphore: Add preemptive_abort_factor to constructors
  config: Add parameters to control reads' preemptive_abort_factor
  permit_reader: Add a new state: preemptive_aborted
  reader_concurrency_semaphore: validate waiters counter when dequeueing a waiting permit
  reader_concurrency_semaphore: Remove cpu_concurrency's default value
2026-01-29 08:27:22 +02:00
Botond Dénes
a8767f36da Merge 'Improve load balancer logging and other minor cleanups' from Tomasz Grabiec
Contains various improvements to tablet load balancer. Batched together to save on the bill for CI.

Most notably:
 - Make plan summary more concise, and print info only about present elements.
 - Print rack name in addition to DC name when making a per-rack plan
 - Print "Not possible to achieve balance" only when this is the final plan with no active migrations
 - Print per-node stats when "Not possible to achieve balance" is printed
 - amortize metrics lookup cost
 - avoid spamming logs with per-node "Node {} does not have complete tablet stats, ignoring"

Backport to 2026.1: since the changes enhance debuggability and are relatively low risk

Fixes #28423
Fixes #28422

Closes scylladb/scylladb#28337

* github.com:scylladb/scylladb:
  tablets: tablet_allocator.cc: Convert tabs to spaces
  tablets: load_balancer: Warn about incomplete stats once for all offending nodes
  tablets: load_balancer: Improve node stats printout
  tablets: load_balancer: Warn about imbalance only when there are no more active migrations
  tablets: load_balancer: Extract print_node_stats()
  tablet: load_balancer: Use empty() instead of size() where applicable
  tablets: Fix redundancy in migration_plan::empty()
  tablets: Cache pointer to stats during plan-making
  tablets: load_balancer: Print rack in addition to DC when giving context
  tablets: load_balancer: Make plan summary concise
  tablets: load_balancer: Move "tablet_migration_bypass" injection point to make_plan()
2026-01-29 08:25:17 +02:00
Amnon Heiman
f2e142ac6e test/boost/estimated_histogram_test.cc: Switch to real Sum
Now that the sum function in the histogram uses true values instead of
an estimate, the test should reflect that.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-01-28 23:19:00 +02:00
Piotr Dulikowski
ec6a2661de Merge 'Keep view_builder background fiber in maintenance scheduling group' from Pavel Emelyanov
In fact, it's partially there already. When view_builder::start() is called is first calls initialization code (the start_in_background() method), then kicks do_build_step() that runs a background fiber to perform build steps. The starting code inherits scheduling group from main(). And the step fiber code needs to run itself in a maintenance scheduling group, so it explicitly grabs one via database->db_config.

This PR mainly gets rid of the call to database::get_streaming_scheduling_group() from do_build_step() as preparation to splitting the streaming scheduling group into parts (see SCYLLADB-351). To make it happen the do_build_step() is patched to inherit its scheduling group from view_builder::start() and the start() itself is called by main from maintenance scheduling group (like for other view building services).

New feature (nested scheduling group), not backporting

Closes scylladb/scylladb#28386

* github.com:scylladb/scylladb:
  view_builder: Start background in maintenance group
  view_builder: Wake-up step fiber with condition variable
2026-01-28 20:49:19 +01:00
Pavel Emelyanov
cb1d05d65a streaming: Get streaming sched group from debug:: namespace
In a lambda returned from make_streaming_consumer() there's a check for
current scheudling group being streaming one. It came from #17090 where
streaming code was launched in wrong sched group thus affecting user
groups in a bad way.

The check is nice and useful, but it abuses replica::database by getting
unrelated information from it.

To preserve the check and to stop using database as provider of configs,
keep the streaming scheduling group handle in the debug namespace. This
emphasises that this global variable is purely for debugging purposes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28410
2026-01-28 19:14:59 +02:00
Marcin Maliszkiewicz
5d4e2ec522 Merge 'docs: add documentation for automatic repair' from Botond Dénes
Explain what automatic repair is and how to configure it. While at it, improve the existing repair documentation a bit.

Fixes: SCYLLADB-130

This PR missed the 2026.1 branch date, so it needs backport to 2026.1, where the auto repair feature debuts.

Closes scylladb/scylladb#28199

* github.com:scylladb/scylladb:
  docs: add feature page for automatic repair
  docs: inter-link incremental-repair and repair documents
  docs: incremental-repair: fix curl example
2026-01-28 17:46:53 +01:00
Nadav Har'El
1454228a05 test/cqlpy: fix "assertion rewriting" in translated Cassandra tests
One of the best features of the pytest framework is "assertion
rewriting": If your test does for example "assert a + 1 == b", the
assertion is "rewritten" so that if it fails it tells you not only
that "a+1" and "b" are not equal, what the non-equal values are,
how they are not equal (e.g., find different elements of arrays) and
how each side of the equality was calculated.

But pytest can only "rewrite" assertion that it sees. If you call a
utility function checksomething() from another module and that utility
function calls assert, it will not be able to rewrite it, and you'll
get ugly, hard-to-debug, assertion failures.

This problem is especially noticable in tests we translated from
Cassandra, in test/cqlpy/cassandra_tests. Those tests use a bunch of
assertion-performing utility functions like assertRows() et al.
Those utility functions are defined in a separate source file,
porting.py, so by default do not get their assertions rewritten.

We had a solution for this: test/cqlpy/cassandra_test/__init__.py had:

    pytest.register_assert_rewrite("cassandra_tests.porting")

This tells pytest to rewrite assertions in porting.py the first time
that it is imported.

It used to work well, but recently it stopped working. This is because
we change the module paths recently, and it should be written as
test.cqlpy.cassandra_tests.porting.

I verified by editing one of the cassandra_tests to make a bad check
that indeed this statement stopped working, and fixing the module
path in this way solves it, and makes assertion rewriting work
again.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28411
2026-01-28 18:34:57 +02:00
Pavel Emelyanov
3ebd02513a view_builder: Start background in maintenance group
Currently view_builder::start() is called in default scheduling group.
Once it initializes itself, it wakes up the step fiber that explicitly
switches to maintenance scheduling group.

This explicit switch made sence before previous patch, when the fiber
was implemented as a serialized action. Now the fiber starts directly
from .start() method and can inherit scheduling group from it.

Said that, main code calls view_builder::start() in maintenance
scheduling group killing two birds with one stone. First, the step fiber
no longer needs borrow its scheduling group indirectly via database.
Second, the start_in_background() code itself runs in a more suitable
scheduling group.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-28 18:34:59 +03:00
Pavel Emelyanov
2439d27b60 view_builder: Wake-up step fiber with condition variable
View builder runs a background fiber that perform build steps. To kick
the fiber it uses serizlized action, but it's an overkill -- nobody
waits for the action to finish, but on stop, when it's joined.

This patch uses condition variable to kick the fiber, and starts it
instantly, in the place where serialized action was first kicked.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-28 18:34:58 +03:00
Pavel Emelyanov
5ce12f2404 gossiper: Export its scheduling group for those who need it
There are several places in the code that need to explicitly switch into
gossiper scheduling group. For that they currently call database to
provide the group, but it's better to get gossiper sched group from
gossiper itself, all the more so all those places have gossiper at hand.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-28 18:29:33 +03:00
Pavel Emelyanov
0da1a222fc migration_manager: Reorder members
This is to initialize dependency references, in particular gossiper&,
before _group0_barrier. The latter will need to access this->_gossiper
in the next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-28 18:29:33 +03:00
Botond Dénes
1713d75c0d docs: add feature page for automatic repair
Explain what the feature is and how to confiture it.
Inter-link all the repair related pages, so one can discover all about
repair, regardless of which page they land on.
2026-01-28 16:45:57 +02:00
Łukasz Paszkowski
7e1bbbd937 reader_concurrency_semaphore: Check during admission if read may timeout
When a shard on a replica is overloaded, it breaks down completely,
throughput collapses, latencies go through the roof and the
node/shard can even become completely unresponsive to new connection
attempts.

When reads arrive, they have to wait for admission on the reader
concurrency semaphore. If the node is overloaded, the reads will
be queued and thus they can time out while being in the queue or during
the execution. In the latter case, the timeout does not always
result in the read being aborted.

Once the shard is sufficiently loaded, it is possible that most
queued reads will time out, because the average time it takes
for a queued read to be admitted is around that of the timeout.

If a read times out, any work we already did, or are about to do
on it is wasted effort. Therefore, the patch tries to prevent it
by checking if an admitted read has a chance to complete in time
and abort it if not. It uses the following cryteria:

if read's remaining time <= read's timeout when arrived to the semaphore * preemptive factor;
the read is rejected and the next one from the wait list is
considered.
2026-01-28 14:24:45 +01:00
Łukasz Paszkowski
8a613960af permit_reader::impl: Replace break with return after evicting inactive permit on timeout
Evicting an inactive permit destroyes the permit object when the
reader is closed, making any further member access invalid. Switch
from break to an early return to prevent any possible use-after-free
after evict() in the state::inactive timeout path.
2026-01-28 14:24:33 +01:00
Łukasz Paszkowski
fde09fd136 reader_concurrency_semaphore: Add preemptive_abort_factor to constructors
The new parameter parametrizes the factor used to reject a read
during admission. Its value shall be between 0.0 and 1.0 where
  + 0.0 means a read will never get rejected during admission
  + 1.0 means a read will immediatelly get rejected during admission

Although passing values outside the interaval is possible, they
will have the exact same effects as they were clamped to [0.0, 1.0].
2026-01-28 14:20:01 +01:00
Łukasz Paszkowski
21348050e8 config: Add parameters to control reads' preemptive_abort_factor 2026-01-28 14:20:01 +01:00
Łukasz Paszkowski
2d3a40e023 permit_reader: Add a new state: preemptive_aborted
A permit gets into the preemptive_aborted state when:
- times out;
- gets rejected from execution due to high chance its execution would
  not finalize on time;

Being in this state means a permit was removed from the wait list,
its internal timer was canceled and semaphore's statistic
`total_reads_shed_due_to_overload` increased.
2026-01-28 14:20:01 +01:00
Łukasz Paszkowski
5a7cea00d0 reader_concurrency_semaphore: validate waiters counter when dequeueing a waiting permit
Add a defensive check in dequeue_permit() to avoid underflowing
_stats.waiters and report an internal error if the stats are already
inconsistent.
2026-01-28 14:19:53 +01:00
Tomasz Grabiec
df949dc506 Merge 'topology_coordinator: make cleanup reliable on barrier failures' from Łukasz Paszkowski
Fix a subtle but damaging failure mode in the tablet migration state machine: when a barrier fails, the follow-up barrier is triggered asynchronously, and cleanup can get skipped for that iteration. On the next loop, the original failure may no longer be visible (because the failing node got excluded), so the tablet can incorrectly move forward instead of entering `cleanup_target`.

To make cleanup reliable this PR:

Adds an additional “fallback cleanup” stage

- `write_both_read_old_fallback_cleanup`

that does not modify read/write selectors. This stage is safe to enter immediately after a barrier failure, and it funnels the tablet into cleanup with the required barriers.

Avoids changing both read and write selectors in a single step transitioning from `write_both_read_new` to `cleanup_target`. The fallback path updates selectors in a safe order: read first, then write.

Allows a direct no-barrier transition from `allow_write_both_read_old` to `cleanup_target` after failure, because in that specific case `cleanup_target` doesn’t change selectors and the hop is safe.

No need for backport. It's an improvement. Currently, tablets transition to `cleanup_target` eventually via failed streaming.

Closes scylladb/scylladb#28169

* github.com:scylladb/scylladb:
  topology_coordinator: add write_both_read_old_fallback_cleanup state
  topology_coordinator: allow cleanup_target transition from streaming/rebuild_repair without barrier
  topology_coordinator: allow cleanup_target transition without barrier after failure in write_both_read_old
  topology_coordinator: allow cleanup_target transition without barrier after failure in allow_write_both_read_old
2026-01-28 13:33:39 +01:00
Amnon Heiman
3175540e87 transport/server: to bytes_histogram
This patch replaces simple counters with bytes_histogram for tracking
CQL request and response sizes, enabling better visibility into message
size distribution.

Changes:
- Replace request_size and response_size metrics with bytes_histogram in
  cql_sg_stats::request_kind_stats
- Per-shard metrics continue to be reported as before
- QUERY, EXECUTE, and BATCH operations now report per-node, per-scheduling-group
  histograms of bytes sent and received, providing detailed insight into these
  operations

Other CQL operations (e.g., PREPARE, OPTIONS) are not included in per-node
histogram reporting as they are less performance-critical, but can be added
in the future if proven useful.

Metrics example:

```
 # HELP scylla_transport_cql_request_bytes Counts the total number of received bytes in CQL messages of a specific kind.
 # TYPE scylla_transport_cql_request_bytes counter
scylla_transport_cql_request_bytes{kind="BATCH",scheduling_group_name="sl:default",shard="0"} 129808
scylla_transport_cql_request_bytes{kind="EXECUTE",scheduling_group_name="sl:default",shard="0"} 227409
scylla_transport_cql_request_bytes{kind="PREPARE",scheduling_group_name="sl:default",shard="0"} 631
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:default",shard="0"} 2809
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:driver",shard="0"} 4079
scylla_transport_cql_request_bytes{kind="REGISTER",scheduling_group_name="sl:default",shard="0"} 98
scylla_transport_cql_request_bytes{kind="STARTUP",scheduling_group_name="sl:driver",shard="0"} 432
 # HELP scylla_transport_cql_request_histogram_bytes A histogram of received bytes in CQL messages of a specific kind and specific scheduling group.
 # TYPE scylla_transport_cql_request_histogram_bytes histogram
scylla_transport_cql_request_histogram_bytes_sum{kind="QUERY",scheduling_group_name="sl:driver"} 4079
scylla_transport_cql_request_histogram_bytes_count{kind="QUERY",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1024.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2048.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4096.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8192.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16384.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="32768.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="65536.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="131072.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="262144.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="524288.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1048576.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2097152.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4194304.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8388608.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16777216.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="33554432.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="67108864.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="134217728.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="268435456.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="536870912.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1073741824.000000",scheduling_group_name="sl:driver"} 57
```
2026-01-28 13:53:47 +02:00
Amnon Heiman
5875bcca23 approx_exponential_histogram: Add sum() method for accurate value tracking
Previously, histogram sums were estimated by multiplying bucket offsets
by their counts, which produces inaccurate results - typically too high
when using upper limits or too low when using lower limits.

This patch adds accurate sum tracking to approx_exponential_histogram:
- Adds a _sum member variable to track the actual sum of all values
- Implements sum() method to return the accumulated total
- Updates add() to increment _sum for each value
- Modifies to_metrics_histogram() helper to use the new sum() method

This change is important as histograms will be used instead of counters for
byte statistics, where accurate totals are essential for metrics reporting.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-01-28 13:39:46 +02:00
Amnon Heiman
2fd453f4ec utils/estimated_histogram.hh: Add bytes_histogram
For various use cases, we need to report byte histograms, such as for
request and reply message sizes.

This patch introduce bytes_histogram as a type alias for
approx_exponential_histogram configured to track byte values from 1KB to
1GB with power-of-2 buckets (Precision=1).

This provides a convenient, performance-efficient histogram for
measuring message sizes, payload sizes, and other byte-based metrics.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-01-28 13:31:39 +02:00
Botond Dénes
ee631f31a0 Merge 'Do not export system keyspace from raft_group0_client' from Pavel Emelyanov
There are few places that use raft_group0_client as a way to get to system_keyspace. Mostly they can live without it -- either the needed reference is already at hand, or it's (ab)used to get to the database reference. The only place that really needs the system keyspace is the state merger code that needs last state ID. For that, the explicit helper method is added to group0_client.

Refining API between components, not backporting

Closes scylladb/scylladb#28387

* github.com:scylladb/scylladb:
  raft_group0_client: Dont export system keyspace
  raft_group0_client: Add and use get_last_group0_state_id()
  group0_state_machine: Call ensure_group0_sched() with data_dictionary
  view_building_worker: Use its own system_keyspace& reference
2026-01-28 13:24:32 +02:00
Yaron Kaikov
7c49711906 test/cqlpy: Remove redundant pytest.register_assert_rewrite call
During test.py run, noticed this warning:
```
10:38:22  test/cqlpy/cassandra_tests/validation/operations/insert_update_if_condition_test.py:14: 32 warnings
10:38:22    /jenkins/workspace/releng-testing/scylla-ci/scylla/test/cqlpy/cassandra_tests/validation/operations/insert_update_if_condition_test.py:14: PytestAssertRewriteWarning: Module already imported so cannot be rewritten: test.cqlpy.cassandra_tests.porting
10:38:22      pytest.register_assert_rewrite('test.cqlpy.cassandra_tests.porting')
```

The insert_update_if_condition_test.py was calling
pytest.register_assert_rewrite() for the porting module, but this
registration is already handled by cassandra_tests/__init__.py which
is automatically loaded before any test runs.

Closes scylladb/scylladb#28409
2026-01-28 13:17:05 +02:00
Avi Kivity
42fdea7410 github: fix iwyu workflow permissions
The include-what-you-use workflow fails with

```
Invalid workflow file: .github/workflows/iwyu.yaml#L25
The workflow is not valid. .github/workflows/iwyu.yaml (Line: 25, Col: 3): Error calling workflow 'scylladb/scylladb/.github/workflows/read-toolchain.yaml@257054deffbef0bde95f0428dc01ad10d7b30093'. The nested job 'read-toolchain' is requesting 'contents: read', but is only allowed 'contents: none'.
```

Fix by adding the correct permissions.

Closes scylladb/scylladb#28390
2026-01-28 12:38:54 +02:00
Jakub Smolar
e1f623dd69 skip_mode: Allow multiple build modes in pytest skip_mode marker
Enhance the skip_mode marker to accept either a single mode string
or a list of modes, allowing tests to be skipped across multiple
build configurations with a single marker.

Before:
  @pytest.mark.skip_mode("dev", reason="...")
  @pytest.mark.skip_mode("debug", reason="...")

After:
  @pytest.mark.skip_mode(["dev", "debug"], reason="...")

This reduces duplication when the same skip condition applies
to multiple build modes.

Closes scylladb/scylladb#28406
2026-01-28 12:27:41 +02:00
Patryk Jędrzejczak
a2c1569e04 test: test_gossiper_orphan_remover: get host ID of the bootstrapping node before it crashes
The test is currently flaky. It tries to get the host ID of the bootstrapping
node via the REST API after the node crashes. This can obviously fail. The
test usually doesn't fail, though, as it relies on the host ID being saved
in `ScyllaServer._host_id` at this point by `ScyllaServer.try_get_host_id()`
repeatedly called in `ScyllaServer.start()`. However, with a very fast crash
and unlucky timings, no such call may succeed.

We deflake the test by getting the host ID before the crash. Note that at this
point, the bootstrapping node must be serving the REST API requests because
`await log.wait_for("finished do_send_ack2_msg")` above guarantees that the
node has started the gossip shadow round, which happens after starting the REST
API.

Fixes #28385

Closes scylladb/scylladb#28388
2026-01-28 10:54:22 +02:00
Avi Kivity
8d2689d1b5 build: avoid sccache by default for Rust targets
A bug[1] in sccache prevents correct distributed compilation of wasmtime.

Disable it by default for now, but allow users to enable it.

[1] https://github.com/mozilla/sccache/issues/2575

Closes scylladb/scylladb#28389
2026-01-28 10:36:49 +02:00
Pavel Emelyanov
2ffe5b7d80 tablet_allocator: Have its own explicit background scheduling group
Currently, tablet_allocator switches to streaming scheduling group that
it gets from database. It's not nice to use database as provider of
configs/scheduling_groups.

This patch adds a background scheduling group for tablet allocator
configured via its config and sets it to streaming group in main.cc
code.

This will help splitting the streaming scheduling group into more
elaborated groups under the maintenance supergroup: SCYLLADB-351

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28356
2026-01-28 10:34:28 +02:00
Avi Kivity
47315c63dc treewide: include Seastar headers with angle brackets
Seastar is a "system" library from our point of view, so
should be included with angle brackets.

Closes scylladb/scylladb#28395
2026-01-28 10:33:06 +02:00
Botond Dénes
b7dccdbe93 Merge 'test/storage: speed up out-of-space prevention tests' from Łukasz Paszkowski
This PR reduces the runtime of `test_out_of_space_prevention.py` by addressing two main sources of overhead: slow “critical utilization” setup and delayed tablet load stats propagation. Combined, these changes cut the module’s total execution time from 324s to 185s.

Improvements. No backup is required.

Closes scylladb/scylladb#28396

* github.com:scylladb/scylladb:
  test/storage: speed up out-of-space prevention tests by using smaller volumes
  test/storage: reduce tablet load stats refresh interval to speed up OOS prevention tests
2026-01-28 10:28:20 +02:00
Marcin Maliszkiewicz
931a38de6e service: remove unused has_schema_access
It became unused after we dropped support for thrift
in ad649be1bf

Closes scylladb/scylladb#28341
2026-01-28 10:18:26 +02:00
Pavel Emelyanov
834921251b test: Replace memory_data_source with seastar::util::as_input_stream
The existing test-only implementation is a simplified version of the
generic one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28339
2026-01-28 10:15:03 +02:00
Andrei Chekun
335e81cdf7 test.py: migrate nodetool to run by pytest
As a next step of migration to the pytest runner, this PR moves
responsibility for nodetool tests execution solely to the pytest.

Closes scylladb/scylladb#28348
2026-01-28 09:49:59 +02:00
Tomasz Grabiec
8e831a7b6d tablets: tablet_allocator.cc: Convert tabs to spaces 2026-01-28 01:32:01 +01:00
Tomasz Grabiec
9715965d0c tablets: load_balancer: Warn about incomplete stats once for all offending nodes
To reduce log spamming when all nodes are missing stats.
2026-01-28 01:32:01 +01:00
Tomasz Grabiec
ef0e9ad34a tablets: load_balancer: Improve node stats printout
Make it more concise:
- reduce precision for load to 6 fractional digits
- reduce precision for tablets/shard to 3 fractional digits
- print "dc1/rack1" instead of "dc=dc1 rack=rack1", like in other places
- print "rd=0 wr=0" instead of "stream_read=0 stream_write=0"

Example:

 load_balancer - Node 477569c0-f937-11f0-ab6f-541ce4a00601: dc10/rack10c load=170.666667 tablets=1 shards=12 tablets/shard=0.083 state=normal cap=64424509440 stream: rd=0 wr=0
 load_balancer - Node 47678711-f937-11f0-ab6f-541ce4a00601: dc10/rack10c load=0.000000 tablets=0 shards=12 tablets/shard=0.000 state=normal cap=64424509440 stream: rd=0 wr=0
 load_balancer - Node 47832560-f937-11f0-ab6f-541ce4a00601: dc10/rack10c load=0.000000 tablets=0 shards=12 tablets/shard=0.000 state=normal cap=64424509440 stream: rd=0 wr=0
2026-01-28 01:32:01 +01:00
Tomasz Grabiec
4a161bff2d tablets: load_balancer: Warn about imbalance only when there are no more active migrations
Otherwise, it may be only a temporary situation due to lack of
candidates, and may be unnecessarily alerting.

Also, print node stats to allow assessing how bad the situation is on
the spot. Those stats can hint to a cause of imbalance, if balancing
is per-DC and racks have different capacity.
2026-01-28 01:32:00 +01:00
Tomasz Grabiec
7228bd1502 tablets: load_balancer: Extract print_node_stats() 2026-01-28 01:32:00 +01:00
Tomasz Grabiec
615b86e88b tablet: load_balancer: Use empty() instead of size() where applicable 2026-01-28 01:32:00 +01:00
Tomasz Grabiec
12fdd205d6 tablets: Fix redundancy in migration_plan::empty() 2026-01-28 01:32:00 +01:00
Tomasz Grabiec
0d090aa47b tablets: Cache pointer to stats during plan-making
Saves on lookup cost, esp. for candidate evaluation. This showed up in
perf profile in the past.

Also, lays the ground for splitting stats per rack.
2026-01-28 01:32:00 +01:00
Tomasz Grabiec
f2b0146f0f tablets: load_balancer: Print rack in addition to DC when giving context
Load-balancing can be now per-rack instead of per-DC. So just printing
"in DC" is confusing. If we're balancing a rack, we should print which
rack is that.
2026-01-28 01:32:00 +01:00
Tomasz Grabiec
df32318f66 tablets: load_balancer: Make plan summary concise
Before:

  load_balancer - Prepared 1 migration plans, out of which there were 1 tablet migration(s) and 0 resize decision(s) and 0 tablet repair(s) and 0 rack-list colocation(s)

After:

  load_balancer - Prepared plan: migrations: 1

We print only stats about elements which are present.
2026-01-28 01:32:00 +01:00
Emil Maskovsky
834961c308 db/view: add missing include for coroutine::all to fix build without precompiled headers
When building with `--disable-precompiled-header`, view.cc failed to
compile due to missing <seastar/coroutine/all.hh> include, which provides
`coroutine::all`.

The problem doesn't manifest when precompiled headers are used, which is
the default. So that's likely why it was missed by the CI.

Adding the explicit include fixes the build.

Fixes: scylladb/scylladb#28378
Ref: scylladb/scylladb#28093

No backport: This problem is only present in master.

Closes scylladb/scylladb#28379
2026-01-27 18:56:56 +01:00
Calle Wilund
87aa6c8387 utils/gcp/object_storage: URL-encode object names in URL:s
Fixes #28398

When used as path elements in google storage paths, the object names
need to be URL encoded. Due to a.) tests not really using prefixes including
non-url valid chars (i.e. / etc) and the mock server used for most
testing not enforcing this particular aspect, this was missed.

Modified unit tests to use prefixing for all names, so when run
in real GS, any errors like this will show.
2026-01-27 18:01:21 +01:00
Calle Wilund
a896d8d5e3 utils::gcp::object_storage: Fix list object pager end condition detection
Fixes #28399

When iterating with pager, the mock server and real GCS behaves differently.
The latter will not give a pager token for last page, only penultimate.

Need to handle.
2026-01-27 17:57:17 +01:00
Pavel Emelyanov
02af292869 Merge 'Introduce TTL and retries to address resolution' from Ernest Zaslavsky
In production environments, we observed cases where the S3 client would repeatedly fail to connect due to DNS entries becoming stale. Because the existing logic only attempted the first resolved address and lacked a way to refresh DNS state, the client could get stuck in a failure loop.

Introduce RR TTL and connection failure retry to
- re-resolve the RR in a timely manner
- forcefully reset and re-resolve addresses
- add a special case when the TTL is 0 and the record must be resolved for every request

Fixes: CUSTOMER-96
Fixes: CUSTOMER-139

Should be backported to 2025.3/4 and 2026.1 since we already encountered it in the production clusters for 2025.3

Closes scylladb/scylladb#27891

* github.com:scylladb/scylladb:
  connection_factory: includes cleanup
  dns_connection_factory: refine the move constructor
  connection_factory: retry on failure
  connection_factory: introduce TTL timer
  connection_factory: get rid of shared_future in dns_connection_factory
  connection_factory: extract connection logic into a member
  connection_factory: remove unnecessary `else`
  connection_factory: use all resolved DNS addresses
  s3_test: remove client double-close
2026-01-27 18:45:43 +03:00
Avi Kivity
59f2a3ce72 test: test_alternator_proxy_protocol: wait for the node to report itself as serving
Use the new ServerUpState=SERVING mechanism to wait to the alternator
ports to be up, rather than relying on the default waiting for CQL,
which happens earlier and therefore opens a window where a connection to
the alternator ports will fail.
2026-01-27 17:25:59 +02:00
Avi Kivity
ebac810c4e test: cluster_manager: add ability to wait for supervisor STATUS=serving
When running under systemd, ScyllaDB sends a STATUS=serving message
to systemd. Co-opt this mechanism by setting up NOTIFY_SOCKET, thus
making the cluster manager pretend it is systemd. Users of the cluster
manager can now wait for the node to report itself up, rather than
having to parse log files or retry connections.
2026-01-27 17:24:55 +02:00
Botond Dénes
7ac32097da docs/cql/ddl.rst: Tombstones GC: explain propagation delay
This parameter was not mentioned at all anywhere in the documentation.
Add an explanation of this parameter: why we need it, what is the
default and how it can be changed.

Closes scylladb/scylladb#28132
2026-01-27 16:05:52 +01:00
Tomasz Grabiec
32b336e062 tablets: load_balancer: Move "tablet_migration_bypass" injection point to make_plan()
Just a cleanup. After this, we don't have a new scope in the outmost
make_plan() just for injection handling.
2026-01-27 16:01:36 +01:00
Piotr Dulikowski
29da20744a storage_service: fix indentation after previous patch 2026-01-27 15:49:01 +01:00
Piotr Dulikowski
d28c841fa9 raft topology: generate notification about released nodes only once
Hints destined for some other node can only be drained after the other
node is no longer a replica of any vnode or tablet. In case when tablets
are present, a node might still technically be a replica of some tablets
after it moved to left state. When it no longer is a replica of any
tablet, it becomes "released" and storage service generates a
notification about it. Hinted handoff listens to this notification and
kicks off draining hints after getting it.

The current implementation of the "released" notification would trigger
every time raft topology state is reloaded and a left node without any
tokens is present in the raft topology. Although draining hints is
idempotent, generating duplicate notifications is wasteful and recently
became very noisy after in 44de563 verbosity of the draining-related log
messages have been increased. The verbosity increase itself makes sense
as draining is supposed to be a rare operation, but the duplicate
notification bug now needs to be addressed.

Fix the duplicate notification problem by passing the list of previously
released nodes to the `storage_service::raft_topology_update_ip`
function and filtering based on it. If this function processes the
topology state for the first time, it will not produce any
notifications. This is fine as hinted handoff is prepared to detect
"released" nodes during the startup sequence in main.cc and start
draining the hints there, if needed.

Fixes: #28301
Refs: #25031
2026-01-27 15:48:05 +01:00
Łukasz Paszkowski
8829098e90 reader_concurrency_semaphore: Remove cpu_concurrency's default value
The commit 59faa6d, introduces a new parameter called cpu_concurrency
and sets its default value to 1 which violates the commit fbb83dd that
removes all default values from constructors but one used by the unit
tests.

The patch removes the default value of the cpu_concurrency parameter
and alters tests to use the test dedicated reader_concurrency_semaphore
constructor wherever possible.
2026-01-27 15:40:11 +01:00
Łukasz Paszkowski
3ef594f9eb test/storage: speed up out-of-space prevention tests by using smaller volumes
Tests in test_out_of_space_prevention.py spend a large fraction of
time creating a random “blob” file to cross the 0.8 critical disk
utilization threshold. With 100MB volumes this requires writing
~70–80MB of data, which is slow inside Docker/Podman-backed volumes.

Most tests only use ~11MB of data, so large volumes are unnecessary.
Reduce the test volume size to 20MB so the critical threshold is
reached at ~16MB and the blob file is much smaller.

This cuts ~5–6s per test.
2026-01-27 15:28:59 +01:00
Łukasz Paszkowski
0f86fc680c test/storage: reduce tablet load stats refresh interval to speed up OOS prevention tests
Set `--tablet-load-stats-refresh-interval-in-seconds=1` for this module’s
clusters applicable to all tests. This significantly reduces runtime
for the slowest cases:
- test_reject_split_compaction: 75.62s -> 23.04s
- test_split_compaction_not_triggered: 69.36s -> 22.98s
2026-01-27 15:28:59 +01:00
Piotr Dulikowski
10e9672852 raft topology: extract "released" nodes calculation to external function
In the following commits we will need to compare the set of released
nodes before and after reload of raft topology state. Moving the logic
that calculates such a set to a separate function will make it easier to
do.
2026-01-27 14:37:43 +01:00
Pavel Emelyanov
87920d16d8 raft_group0_client: Dont export system keyspace
Now system_keyspace reference is used internally by the client code
itself, no need to encourage other services abuse it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-27 14:51:40 +03:00
Pavel Emelyanov
966119ce30 raft_group0_client: Add and use get_last_group0_state_id()
There are several places that want to get last state id and for that
they make raft_group0_client() export system_keyspace reference.

This patch adds a helper method to provide the needed ID.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-27 14:50:25 +03:00
Pavel Emelyanov
dded1feeb7 group0_state_machine: Call ensure_group0_sched() with data_dictionary
There's a validation for tables being used by group0 commands are marked
with the respective prop. For it the caller code needs to provide
database reference and it gets one from client -> system_keyspace chain.

There's more explicit way -- get the data_dictionary via proxy.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-27 14:48:22 +03:00
Pavel Emelyanov
20a2b944df view_building_worker: Use its own system_keyspace& reference
Some code in the worker need to mess with system_keyspace&. While
there's a reference on it from the worker object, it gets one via
group0 -> group0_client, which is a bit an overkill.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-27 14:46:48 +03:00
Avi Kivity
16b56c2451 Merge 'Audit: avoid dynamic_cast on a hot path' from Marcin Maliszkiewicz
This patch set eliminates special audit info guard used before for batch statements
and simplifies audit::inspect function by returning quickly if audit is not needed.
It saves around 300 instructions on a request's hot path.

Related: https://github.com/scylladb/scylladb/issues/27941
Backport: no, not a bug

Closes scylladb/scylladb#28326

* github.com:scylladb/scylladb:
  audit: replace batch dynamic_cast with static_cast
  audit: eliminate dynamic_cast to batch_statement in inspect
  audit: cql: remove create_no_audit_info
  audit: add batch bool to audit_info class
2026-01-27 12:54:16 +02:00
Pavel Emelyanov
c61d855250 hints: Provide explicit scheduling group for hint_sender
Currently it grabs one from database, but it's not nice to use database
as config/sched-groups provider.

This PR passes the scheduling group to use for sending hints via manager
which, in turn, gets one from proxy via its config (proxy config already
carries configuration for hints manager). The group is initialized in
main.cc code and is set to the maintenance one (nowadays it's the same
as streaming group).

This will help splitting the streaming scheduling group into more
elaborated groups under the maintenance supergroup: SCYLLADB-351

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28358
2026-01-27 12:50:11 +02:00
Gleb Natapov
9daa109d2c test: get rid of consistent_cluster_management usage in test
consistent_cluster_management is deprecated since scylla-5.2 and no
longer used by Scylladb, so it should not be used by test either.

Closes scylladb/scylladb#28340
2026-01-27 11:31:30 +01:00
Avi Kivity
fa5ed619e8 Merge 'test: perf: add perf-cql-raw benchmarking tool' from Marcin Maliszkiewicz
The tool supports:
- auth or no auth modes
- simple read and write workloads
- connection pool or connection per request modes
- in-process or remote modes, remote may be usefull to assess tool's overhead or use it as bigger scale benchmark
- multi table mode
- non superuser mode

It could support in the future:
- TLS mode
- different workloads
- shard awareness

Example usage:
> build/release/scylla perf-cql-raw --workdir /tmp/scylla-data --smp 2
--cpus 0,1 \
--developer-mode 1 --workload read --duration 5 2> /dev/null

> Running test with config: {workload=read, partitions=10000, concurrency=100, duration=5, ops_per_shard=0}
Pre-populated 10000 partitions
97438.42 tps (269.2 allocs/op,   1.1 logallocs/op,  35.2 tasks/op,  113325 insns/op,   80572 cycles/op,        0 errors)
102460.77 tps (261.1 allocs/op,   0.0 logallocs/op,  31.7 tasks/op,  108222 insns/op,   75447 cycles/op,        0 errors)
95707.93 tps (261.0 allocs/op,   0.0 logallocs/op,  31.7 tasks/op,  108443 insns/op,   75320 cycles/op,        0 errors)
102487.87 tps (261.0 allocs/op,   0.0 logallocs/op,  31.7 tasks/op,  107956 insns/op,   74320 cycles/op,        0 errors)
100409.60 tps (261.0 allocs/op,   0.0 logallocs/op,  31.7 tasks/op,  108337 insns/op,   75262 cycles/op,        0 errors)
throughput:
        mean=   99700.92 standard-deviation=3039.28
        median= 100409.60 median-absolute-deviation=2759.85
        maximum=102487.87 minimum=95707.93
instructions_per_op:
        mean=   109256.53 standard-deviation=2281.39
        median= 108337.37 median-absolute-deviation=1034.83
        maximum=113324.69 minimum=107955.97
cpu_cycles_per_op:
        mean=   76184.36 standard-deviation=2493.46
        median= 75320.20 median-absolute-deviation=922.09
        maximum=80572.19 minimum=74320.00

Backports: no, new tool

Closes scylladb/scylladb#25990

* github.com:scylladb/scylladb:
  test: perf: reuse stream id
  main: test: add future and abort_source to after_init_func
  test: perf: add option to stress multiple tables in perf-cql-raw
  test: perf: add perf-cql-raw benchmarking tool
  test: perf: move cut_arg helper func to common code
2026-01-27 12:23:25 +02:00
Yaron Kaikov
3f10f44232 .github/workflows/backport-pr-fixes-validation: support Atlassian URL format in backport PR fixes validation
Add support for matching full Atlassian JIRA URLs in the format
https://scylladb.atlassian.net/browse/SCYLLADB-400 in addition to
the bare JIRA key format (SCYLLADB-400).

This makes the validation more flexible by accepting both formats
that developers commonly use when referencing JIRA issues.

Fixes: https://github.com/scylladb/scylladb/issues/28373

Closes scylladb/scylladb#28374
2026-01-27 10:59:21 +02:00
Avi Kivity
f1c6094150 Merge 'Remove buffer_input_stream and limiting_input_stream from core code' from Pavel Emelyanov
These two streams mostly play together. The former provides an input_stream from read from in-memory temporary buffers, the latter wraps it to limit the size of provided temporary buffers. Both are used to test contiguous data consumer, also the buffer_input_stream has a caller in sstables reversing reader.

This PR removes the buffer_input_stream in favor of seastar memory_data_source, and moves the limiting_input_stream into test/lib.

Enanching testing code, not backporting

Closes scylladb/scylladb#28352

* github.com:scylladb/scylladb:
  code: Move limiting data source to test/lib
  util: Simplify limiting_data_source API
  util: Remove buffer_input_stream
  test: Use seastar::util::temporary_buffer_data_source in data consumer test
  sstables: Use seastar::util::as_input_stream() in mx reader
2026-01-26 22:05:59 +02:00
Raphael S. Carvalho
0e07c6556d test: Remove useless compaction group testing in database_test
This compaction group testing is useless because the machinery for it
to work was removed. This was useful in the early tablet days, where
we wanted to test compaction groups directly. Today groups are stressed
and tested on every tablet test.

I see a ~40% reduction time after this patch, since database_test is
one of the most (if not the most) time consuming in boost suite.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#28324
2026-01-26 19:16:27 +02:00
Marcin Maliszkiewicz
19af46d83a audit: replace batch dynamic_cast with static_cast
Since we know already it's a batch we can use static
cast now.
2026-01-26 18:14:38 +01:00
Anna Stuchlik
edc291961b doc: update the GPG keys
Update the keys in the installation instructions (linux packages).

Fixes https://github.com/scylladb/scylladb/issues/28330

Closes scylladb/scylladb#28357
2026-01-26 19:13:18 +02:00
Piotr Dulikowski
5d5e829107 Merge 'db: view: refactor usage and building of semaphore in create and drop views plus change continuation to co routine style' from Alex Dathskovsky
db: view: refactor semaphore usage in create/drop view paths
Refactor the construction and usage of semaphore units in the create and drop view flows.
The previous semaphore handling was hard to follow (as noted while working on https://github.com/scylladb/scylladb/pull/27929), so this change restructures unit creation and movement to follow a clearer and symmetric pattern across shards.

The semaphore usage model is now documented with a detailed in-code comment to make the intended behavior and invariants explicit.

As part of the refactor, the control flow is modernized by replacing continuation-based logic with coroutine-style code, improving readability and maintainability.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-250

backport: not required, this is a refactor

Closes scylladb/scylladb#28093

* github.com:scylladb/scylladb:
  db: view: extend try/catch scope in handle_create_view_local The try/catch region is extended to cover step functions and inner helpers, which may throw or abort during view creation. This change is safe because we are just swolowing more parts that may throw due to semaphore abortion or any other abortion request, and doesnt change the logic
  db: view: refine create/drop coroutine signatures Refactor the create/drop coroutine interfaces to accept parameters as const references, enabling a clearer workflow and safer data flow.
  db: view: switch from continuations to coroutines Refactor the flow and style of create and drop view to use coroutines instead of continuations. This simplifies the logic, improves readability, and makes the code easier to maintain and extend. This commit also utilizes the get_view_builder_units function that was added in the previous commit. this commit also introduces a new alisasing for optional unit type for simpler and more readable functions that use this type
  db: view: introduce helper to acquire or reuse semaphore units Introduce a small helper that acquires semaphore units when needed or reuses units provided by the caller. This centralizes semaphore handling, simplifies the current logic, and enables refactoring the view create/drop path to a coroutine-based implementation instead of continuation-style code.
  db: view: add detailed comments on semaphore bookkeeping and serialized create/drop on shard 0
2026-01-26 17:16:01 +01:00
Marcin Maliszkiewicz
55d246ce76 auth: bring back previous version of standard_role_manager::can_login
Previously, we wanted to make minimal changes with regards to the new
unified auth cache. However, as a result, some calls on the hot path
were missed. Now we have switched the underlying find_record call
to use the cache. Since caching is now at a lower level, we bring
back the original code.
2026-01-26 16:04:11 +01:00
Marcin Maliszkiewicz
3483452d9f auth: switch find_record to use cache
Since every write-type auth statement takes group0_guard at the beginning,
we hold read_apply_mutex and cannot have a running raft apply during our
operation. Therefore, the auth cache and internal CQL reads return the same,
consistent results. This makes it safe to read via cache instead of internal
CQL.

LDAP is an exception, but it is eventually consistent anyway.
2026-01-26 16:04:11 +01:00
Botond Dénes
755e8319ee replica/partition_snapshot_reader: remove unused includes 2026-01-26 16:52:46 +02:00
Botond Dénes
756837c5b4 partition_snapshot_reader: remove "flat" from name
The "flat" migration is long done, this distinction is no longer
meaningful.
2026-01-26 16:52:46 +02:00
Botond Dénes
9d1933492a mv partition_snapshot_reader.hh -> replica/
The partition snapshot lives in mutation/, however mutation/ is a lower
level concept than a mutation reader. The next best place for this
reader is the replica/ directory, where the memtable, its main user,
also lives.

Also move the code to the replica namespace.

test/boost/mvcc_test.cc includes this header but doesn't use anything
from it. Instead of updating the include path, just drop the unused
include.
2026-01-26 16:52:08 +02:00
Avi Kivity
32cc593558 Merge 'tools/scylla-sstable: introduce filter command' from Botond Dénes
Filter the content of sstable(s), including or excluding the specified partitions. Partitions can be provided on the command line via `--partition`, or in a file via `--partitions-file`. Produces one output sstable per input sstable -- if the filter selects at least one partition in the respective input sstable. Output sstables are placed in the path provided via `--oputput-dir`. Use `--merge` to filter all input sstables combined, producing one output sstable.

Fixes: #13076

New functionality, no backport.

Closes scylladb/scylladb#27836

* github.com:scylladb/scylladb:
  tools/scylla-sstable: introduce filter command
  tools/scylla-sstable: remove --unsafe-accept-nonempty-output-dir
  tools/scylla-sstable: make partition_set ordered
  tools/scylla-stable: remove unused boost/algorithm/string.hpp include
2026-01-26 16:32:38 +02:00
Ernest Zaslavsky
912c48a806 connection_factory: includes cleanup 2026-01-26 15:15:21 +02:00
Ernest Zaslavsky
3a31380b2c dns_connection_factory: refine the move constructor
Clean up the awkward move constructor that was declared in the header
but defaulted in a separate compilation unit, improving clarity and
consistency.
2026-01-26 15:15:15 +02:00
Ernest Zaslavsky
a05a4593a6 connection_factory: retry on failure
If connecting to a provided address throws, renew the address list and
retry once (and only once) before giving up.
2026-01-26 15:14:18 +02:00
Ernest Zaslavsky
6eb7dba352 connection_factory: introduce TTL timer
Add a TTL-based timer to connection_factory to automatically refresh
resolved host name addresses when they expire.
2026-01-26 15:11:49 +02:00
Nadav Har'El
25ff4bec2a Merge 'Refactor streams' from Radosław Cybulski
Refactor streams.cc - turn `.then` calls into coroutines.

Reduces amount of clutter, lambdas and referenced variables.

Fixes #28342

Closes scylladb/scylladb#28185

* github.com:scylladb/scylladb:
  alternator: refactor streams.cc
  alternator: refactor streams.cc
2026-01-26 14:31:15 +02:00
Andrei Chekun
3d3fabf5fb test.py: change the name of the test in failed directory
Generally square brackets are non allowed in URI, while pytest uses it
the test name to show that there were additional parameters for the same
test. When such a test fail it shows the directory correctly in Jenkins,
however attempt to download only this will fail, because of the square
brackets in URI. This change substitute the square brackets with round
brackets.

Closes scylladb/scylladb#28226
2026-01-26 13:29:45 +01:00
Łukasz Paszkowski
f06094aa95 topology_coordinator: add write_both_read_old_fallback_cleanup state
Yet another barrier-failure scenario exists in the `write_both_read_new`
state. When the barrier fails, the tablet is expected to transition
to `cleanup_target`, but because barrier execution is asynchronous,
the cleanup transition can be skipped entirely and the tablet may
continue forward instead.

Both `write_both_read_new` and `cleanup_target` modify read and write
selectors. In this situation, a barrier is required, and transitioning
directly between these states without one is unsafe.

Introduce an intermediate `write_both_read_old_fallback_cleanup`
state that modifies only a read selector and can be entered without
a barrier (there is no need to wait for all nodes to start using the
"new" read selector). From there, the tablet can proceed to `cleanup_target`,
where the required barriers are enforced.

This also avoids changing both selectors in a single step. A direct
transition from `write_both_read_new` to `cleanup_target` updates
both selectors at once, which can leave coordinators using the old
selector for writes and the new selector for reads, causing reads to
miss preceding writes.

By routing through the fallback state, selectors are updated in
order—read first, then write—preserving read-after-write correctness.
2026-01-26 13:14:37 +01:00
Łukasz Paszkowski
0a058e53c7 topology_coordinator: allow cleanup_target transition from streaming/rebuild_repair without barrier
In both `streaming` and `rebuild_repair` stages, the read/write
selectors are unchanged compared to the preceding stage. Because
entry into these stages is already fenced by a barrier from
`write_both_read_old`, and the `cleanup_target` itself requires
barrier, rolling back directly to `cleanup_target` is safe without
an additional barrier.
2026-01-26 13:14:36 +01:00
Łukasz Paszkowski
b12f46babd topology_coordinator: allow cleanup_target transition without barrier after failure in write_both_read_old
A similar barrier-failure scenario exists in the `write_both_read_old`
state. If the barrier fails, the tablet is expected to transition to
`cleanup_target`, but due to the barrier being evaluated asynchronously
the cleanup path can be skipped and the tablet may continue forward
instead.

In `write_both_read_old`, we already switched group0 writes from old
to both, while the barrier may not have executed yet. As a result,
nodes can be at most one step apart (some still use old, others use
both).

Transitioning to `cleanup_target` reverts the write selector back to
old. Nodes still differ by at most one step (old vs both), so the
transition is safe without an additional barrier.

This prevents cleanup from being skipped while keeping selector semantics
and barrier guarantees intact.
2026-01-26 13:14:36 +01:00
Łukasz Paszkowski
7c331b7319 topology_coordinator: allow cleanup_target transition without barrier after failure in allow_write_both_read_old
When a tablet is in `allow_write_both_read_old`, progressing normally
requires a barrier. If this first barrier fails, the tablet is supposed
to transition to `cleanup_target` on the next iteration:
```
case locator::tablet_transition_stage::allow_write_both_read_old:
    if (action_failed(tablet_state.barriers[trinfo.stage])) {
        if (check_excluded_replicas()) {
            transition_to_with_barrier(locator::tablet_transition_stage::cleanup_target);
            break;
        }
    }
    if (do_barrier()) {
        ...
    }
    break;
```
That transition itself requires a barrier, which is executed asynchronously.
Because the barrier runs in the background, the cleanup logic is skipped in
that iteration.

On the following iteration, `action_failed(barriers[stage])` no longer
returns true, since the node that caused the original barrier failure
has been excluded. The barrier is therefore observed as successful,
and the tablet incorrectly proceeds to the next stage instead of entering
`cleanup_target`.

Since `cleanup_target` does not modify read/write selectors, the transition
can be done safely without a barrier, simplifying the state machine and
ensuring cleanup is not skipped.

Without it, the tablet would still eventually reach `cleanup_target` via
`write_both_read_old` and `streaming`, but that path is unnecessary.
2026-01-26 13:14:36 +01:00
Marcin Maliszkiewicz
440590f823 auth: make find_record and callers standard_role_manager members
It will be usefull for following commit. Those methods are anyway class specific
2026-01-26 13:10:11 +01:00
Anna Stuchlik
84281f900f doc: remove the troubleshooting section on upgrades from OSS
This commit removes a document originally created to troubleshoot
upgrades from Open Source to Enterprise.

Since we no longer support Open Source, this document is now redundant.

Fixes https://github.com/scylladb/scylladb/issues/28246

Closes scylladb/scylladb#28248
2026-01-26 14:02:53 +02:00
Anna Stuchlik
c25b770342 doc: add the version name to the Install pages
This is a follow-up to https://github.com/scylladb/scylladb/pull/28022
It adds the version name to more install pages.

Closes scylladb/scylladb#28289
2026-01-26 13:11:14 +02:00
Alex
954d18903e db: view: extend try/catch scope in handle_create_view_local
The try/catch region is extended to cover step functions and inner helpers,
which may throw or abort during view creation.
This change is safe because we are just swolowing more parts that may throw due to semaphore abortion
or any other abortion request, and doesnt change the logic
2026-01-26 13:10:37 +02:00
Alex
2c3ab8490c db: view: refine create/drop coroutine signatures
Refactor the create/drop coroutine interfaces to accept parameters as
const references, enabling a clearer workflow and safer data flow.
2026-01-26 13:10:34 +02:00
Alex
114f88cb9b db: view: switch from continuations to coroutines
Refactor the flow and style of create and drop view to use coroutines instead of continuations.
This simplifies the logic, improves readability, and makes the code
easier to maintain and extend. This commit also utilizes the get_view_builder_units function that was added in the previous commit.
this commit also introduces a new alisasing for optional unit type for simpler and more readable functions that use this type
2026-01-26 13:08:24 +02:00
Alex
87c1c6f40f db: view: introduce helper to acquire or reuse semaphore units
Introduce a small helper that acquires semaphore units when needed or
reuses units provided by the caller.
This centralizes semaphore handling, simplifies the current logic, and
enables refactoring the view create/drop path to a coroutine-based
implementation instead of continuation-style code.
2026-01-26 13:03:26 +02:00
Avi Kivity
ec70cea2a1 test/cqlpy: restore LWT tests marked XFAIL for tablets
Commit 0156e97560 ("storage_proxy: cas: reject for
tablets-enabled tables") marked a bunch of LWT tests as
XFAIL with tablets enabled, pending resolution of #18066.
But since that event is now in the past, we undo the XFAIL
markings (or in some cases, use an any-keyspace fixture
instead of a vnodes-only fixture).

Ref #18066.

Closes scylladb/scylladb#28336
2026-01-26 12:27:19 +02:00
Pavel Emelyanov
77435206b9 code: Move limiting data source to test/lib
Only two tests use it now -- the limit-data-source-test iself and a test
that validates continuous_data_consumer template.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-26 12:49:42 +03:00
Pavel Emelyanov
111b376d0d util: Simplify limiting_data_source API
The source maintains "limit generator" -- a function that returns the
maximum size of bytes to return from the next buffer.

Currently all callers just return constant numbers from it. Passing a
function that returns non-constant one can, probably, be used for a
fuzzy test, but even the limiting-data-source-test itself doesn't do it,
so what's the point...

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-26 12:46:37 +03:00
Pavel Emelyanov
e297ed0b88 util: Remove buffer_input_stream
It's now unused. All the users had been patched to use seastar memory
data source implementation.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-26 12:46:10 +03:00
Pavel Emelyanov
4639681907 test: Use seastar::util::temporary_buffer_data_source in data consumer test
The test creates buffer_data_source_impl and wraps it with limiting data
source. The former data_source duplicates the functionality of the
existing seastar temporary_buffer_data_source.

This patch makes the test code use seastar facility. The
buffer_data_source_impl will be removed soon.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-26 12:44:33 +03:00
Pavel Emelyanov
2399bb8995 sstables: Use seastar::util::as_input_stream() in mx reader
Right now the code uses make_buffer_input_stream() helper that creates
an input stream with buffer_data_source_impl inside which, in turn,
provides the data_source_impl API over a single temporary_buffer.

Seastar has the very same facility, it's better to use it. Eventually
the buffer_data_source_impl will be removed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-26 12:43:17 +03:00
Marcin Maliszkiewicz
6f32290756 audit: eliminate dynamic_cast to batch_statement in inspect
This is costly and not needed we can use a simple
bool flag for such check. It burns around 300 cpu
instructions on a hot request's path.
2026-01-26 10:18:38 +01:00
Marcin Maliszkiewicz
a93ad3838f audit: cql: remove create_no_audit_info
We don't need a special guard value, it's
only being filled for batch statements for
which we can simply ignore the value.

Not having special value allows us to return
fast when audit is not enabled.
2026-01-26 10:18:38 +01:00
Marcin Maliszkiewicz
02d59a0529 audit: add batch bool to audit_info class
In the following commit we'll use this field
instead of costly dynamic_cast when emitting
audit log.
2026-01-26 10:18:38 +01:00
Botond Dénes
57b2cd2c16 docs: inter-link incremental-repair and repair documents
The user can now discover the general explanatio of repair when reading
about incremental repair, useful if they don't know what repair is.
The user can now discover incremental repair while reading the generic
repair procedure document.
2026-01-26 09:55:54 +02:00
Botond Dénes
a84b1b8b78 docs: incremental-repair: fix curl example
Currently it is regular text, make it a code block so it is easier to
read and copy+paste.
2026-01-26 09:55:54 +02:00
Pavel Emelyanov
1796997ace Merge 'restore: Enable direct download of fully contained SSTables' from Ernest Zaslavsky
This PR refactors the streaming subsystem to support direct download of fully contained sstables. Instead of streaming these files, they are downloaded and attached directly to their corresponding tables. This approach reduces overhead, simplifies logic, and improves efficiency. Expected node scope restore performance improvement: ~4 times faster in best case scenario when all sstables are fully contained.

1. Add storage options field to sstable Introduce a data member to store storage options, enabling distinction between local and object storage types.
2. Add method to create component source Extend the storage interface with a public method to create a data_source for any sstable component.
3. Inline streamer instance creation Remove make_sstable_streamer and inline its usage to allow different sets of arguments at call sites.
4. Skip streaming empty sstable sets Avoid unnecessary streaming calls when the sstable set is empty.
5. Enable direct download of contained sstables Replace streaming of fully contained sstables with direct download, attaching them to their corresponding table.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-200
Refs: https://github.com/scylladb/scylladb/issues/23908

No need to backport as this code targets 2026.2 release (for tablet-aware restore)

Closes scylladb/scylladb#26834

* github.com:scylladb/scylladb:
  tests: reuse test_backup_broken_streaming
  streaming: enable direct download of contained sstables
  storage: add method to create component source
  streaming: keep sharded database reference on tablet_sstable_streamer
  streaming: skip streaming empty sstable sets
  streaming: inline streamer instance creation
  tests: fix incorrect backup/restore test flow
2026-01-26 10:22:34 +03:00
Ernest Zaslavsky
cb2aa85cf5 aws_error: handle all restartable nested exception types
Previously we only inspected std::system_error inside
std::nested_exception to support a specific TLS-related failure
mode. However, nested exceptions may contain any type, including
other restartable (retryable) errors. This change unwraps one
nested exception per iteration and re-applies all known handlers
until a match is found or the chain is exhausted.

Closes scylladb/scylladb#28240
2026-01-26 10:19:57 +03:00
Avi Kivity
55422593a7 Merge 'test/lib: Fix bugs in boost_test_tree_lister.cc' from Dawid Mędrek
In this PR, we fix two bugs present in `boost_test_tree_lister` that
affected the output of `--list_json_content` added in
scylladb/scylladb@afde5f668a:

* The labels test units use were duplicated in the output.
* If a test suite or a test file didn't contain any tests, it wasn't
  listed in the output.

Refs scylladb/scylladb#25415

Backport: not needed. The code hasn't been used anywhere yet.

Closes scylladb/scylladb#28255

* github.com:scylladb/scylladb:
  test/lib/boost_test_tree_lister.cc: Record empty test suites
  test/lib/boost_test_tree_lister.cc: Deduplicate labels
2026-01-25 21:34:32 +02:00
Andrei Chekun
cc5ac75d73 test.py: remove deprecated skip_mode decorator
Finishing the deprecation of the skip_mode function in favor of
pytest.mark.skip_mode. This PR is only cleaning and migrating leftover tests
that are still used and old way of skip_mode.

Closes scylladb/scylladb#28299
2026-01-25 18:17:27 +02:00
Ernest Zaslavsky
66a33619da connection_factory: get rid of shared_future in dns_connection_factory
Move state management from dns_connection_factory into state class
itself to encapsulate its internal state and stop managing it from the
`dns_connection_factory`
2026-01-25 16:12:29 +02:00
Ernest Zaslavsky
5b3e513cba connection_factory: extract connection logic into a member
extract connection logic into a private member function to make it reusable
2026-01-25 15:42:48 +02:00
Ernest Zaslavsky
ce0c7b5896 connection_factory: remove unnecessary else 2026-01-25 15:42:48 +02:00
Ernest Zaslavsky
359d0b7a3e connection_factory: use all resolved DNS addresses
Improve dns_connection_factory to iterate over all resolved
addresses instead of using only the first one.
2026-01-25 15:42:48 +02:00
Ernest Zaslavsky
bd9d5ad75b s3_test: remove client double-close
`test_chunked_download_data_source_with_delays` was calling `close()` on a client twice, remove the unnecessary call
2026-01-25 15:42:48 +02:00
Alex
1aadedc596 db: view: add detailed comments on semaphore bookkeeping and serialized create/drop on shard 0 2026-01-25 14:29:09 +02:00
Ernest Zaslavsky
70f5bc1a50 tests: reuse test_backup_broken_streaming
reuse the `test_backup_broken_streaming` test to check for direct
sstable download
2026-01-25 13:27:44 +02:00
Ernest Zaslavsky
13fb605edb streaming: enable direct download of contained sstables
Instead of streaming fully contained sstables, download them directly
and attach them to their corresponding table. This simplifies the
process and avoids unnecessary streaming overhead.
2026-01-25 13:27:44 +02:00
Yaron Kaikov
3dea15bc9d Update ScyllaDB version to: 2026.2.0-dev 2026-01-25 11:09:17 +02:00
Botond Dénes
edda66886e Merge 'Replace sstable_stream_source_impl's internal sink and source with seastar utilities' from Pavel Emelyanov
The class in question has internal implementations of in-memory data_sink_impl and data_source_impl. In seastar there's generic implementation of the same facilities. From the "code re-use" perspective it makes sense to use both. TODO-s in Scylla code supports that.

Using newer seastar facilities, not backporting.

Closes scylladb/scylladb#28321

* github.com:scylladb/scylladb:
  sstable: Replace buffer_data_sink_impl with seastar::util::basic_memory_data_sink
  sstables: Use seastar::util::as_input_stream() and remove buffer_data_source_impl
2026-01-23 16:55:18 +02:00
Pavel Emelyanov
3e09d3cc97 test: Keep test_gossiper_live_endpoints checks togethger
There are two checks for live endpoints performed in test_gossiper.py,
but one of those sits in test_gossiper_unreachable_endpoints somehow.
This patch moves live endpoints check into live endpoints test.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28224
2026-01-23 16:53:48 +02:00
Piotr Dulikowski
3ec4f67407 Merge 'vector_index: Implement rescoring' from Szymon Malewski
This series implements rescoring algorithm.

Index options allowing to enable this functionality were introduced in earlier PR https://github.com/scylladb/scylladb/pull/28165.

When Vector Index has enabled quantization, Vector Store uses reduced vector representation to save memory, but it may degrade correctness of ANN queries. For quantized index we can enable rescoring algorithm, which recalculates similarity score from full vector representation stored in Scylla and reorder returned result set.
It works also with oversampling - we fetch more candidates from Vector Store, rescore them at Scylla and return only requested number of results.

Example:

Creating a Vector Index with Rescoring

```sql
-- Create a table with a vector column
CREATE TABLE ks.products (
    id int PRIMARY KEY,
    embedding vector<float, 128>
);

-- Create a vector index with rescoring enabled
CREATE INDEX products_embedding_idx ON ks.products (embedding)
    USING 'vector_index'
    WITH OPTIONS = {
        'similarity_function': 'cosine',
        'quantization': 'i8',
        'oversampling': '2.0',
        'rescoring': 'true'
    };
```

1. **Quantization** (`i8`) compresses vectors in the index, reducing memory usage but introducing precision loss in distance calculations
2. **Oversampling** (`2.0`) retrieves 2× more candidates than requested from the vector store (e.g., `LIMIT 10` fetches 20 candidates)
3. **Rescoring** (`true`) recalculates similarity scores using full-precision (`f32`) vectors from the base table and re-ranks results

Query example:

```sql
-- Find 10 most similar products
SELECT id, similarity_cosine(embedding, [0.1, 0.2, ...]) AS score
FROM ks.products
ORDER BY embedding ANN OF [0.1, 0.2, ...]
LIMIT 10;
```

With rescoring enabled, the query:
1. Fetches 20 candidates from the quantized index (due to oversampling=2.0)
2. Reads full-precision embeddings from the base table
3. Recalculates similarity scores with full precision
4. Re-ranks and returns the top 10 results

In this implementation we use CQL similarity function implementation to calculate new score values and use them in post query ordering. We add that column manually to selection, but it has to be removed from the final response.

Follow-up https://github.com/scylladb/scylladb/pull/28165
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-83

New feature - doesn't need backport.

Closes scylladb/scylladb#27769

* github.com:scylladb/scylladb:
  vector_index: rescoring: Fetch oversampled rows
  vector_index: rescoring: Sort by similarity column
  select_statement: Modify `needs_post_query_ordering` condition
  vector_index: rescoring: Add hidden similarity score column
  vector_index: Refactor extracting ANN query information
2026-01-23 15:20:10 +01:00
Patryk Jędrzejczak
a41d7a9240 Merge 'test_lwt_shutdown: fix flakiness by removing storage_proxy::stop injection' from Petr Gusev
The storage_proxy::stop() is not called by main (it is commented out due to #293), so the corresponding message injection is never hit. When the test releases paxos_state_learn_after_mutate, shutdown may already be in progress or even completed by the time we try to trigger the storage_proxy::stop injection, which makes the test flaky.

Fix this by completely removing the storage_proxy::stop injection. The injection is not required for test correctness. Shutdown must wait for the background LWT learn to finish, which is released via the paxos_state_learn_after_mutate injection. The shutdown process blocks on in-flight HTTP requests through seastar::httpd::http_server::stop and its _task_gate, so the HTTP request that releases paxos_state_learn_after_mutate is guaranteed to complete before the node is shut down.

Fixes scylladb/scylladb#28260

backport: 2025.4, the `test_lwt_shutdown` test was introduced in this version

Closes scylladb/scylladb#28315

* https://github.com/scylladb/scylladb:
  storage_proxy: drop stop() method
  test_lwt_shutdown: fix flakiness by removing storage_proxy::stop injection
2026-01-23 15:18:17 +01:00
Avi Kivity
30d6f3b8e0 test: test_proxy_protocol: bump timeout
It was observed twice that the test times out in debug mode.
Fix by increasing the timeout.

The test never expects a timeout, so increasing it won't increase
the test duration.

Fixes #28028

Closes scylladb/scylladb#28272
2026-01-23 15:37:00 +02:00
Łukasz Paszkowski
09fde82a33 test/scylla_gdb: fix coro_task request usage, rename duplicate test
- Pass pytest request fixture into coro_task (used for scylla_tmp_dir
  and core dump path)
- Rename duplicate `test_sstable_summary` that runs sstable-index-cache
  to `test_sstable_index_cache` so both tests are collected

Refs https://github.com/scylladb/scylladb/issues/22501

Closes scylladb/scylladb#28286
2026-01-23 15:25:58 +02:00
Pavel Emelyanov
4a307d931a sstable: Replace buffer_data_sink_impl with seastar::util::basic_memory_data_sink
The former accumulates sstable writer writes into a vector of temporary
buffers. In seastar there's a generic memory data sink that provides a
sink to accumulate stream of bytes into any container.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-23 14:22:22 +03:00
Pavel Emelyanov
97b1340a68 sstables: Use seastar::util::as_input_stream() and remove buffer_data_source_impl
The latter is used to wrap vector of buffers into an input_stream.
Seastar already provides the very same functionality with the
convenience as_input_stream() helper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-23 14:21:14 +03:00
Piotr Dulikowski
fe9237fdc9 Merge 'alternator: don't require rf_rack flag for indexes, validate instead' from Michael Litvak
In 8df61f6d99 we changed the requirements for creating materialized
views and MV-based indexes - instead of requiring the
rf_rack_valid_keyspaces flag to be set, we now require the keyspace to
be RF-rack-valid at the time of creation, and it is enforced to remain
RF-rack-valid while the MV exists. This validation is done in the cql
create view/index statements.

The same should be done also for alternator - when creating a table with
GSI or LSI, or when adding a GSI to an existing table, previously we
required the flag rf_rack_valid_keyspaces to be set. Now we change it to
instead check if the keyspace is RF-rack-valid, and if not the operation
fails with an appropriate error.

Fixes https://github.com/scylladb/scylladb/issues/28214

backport to 2025.4 to add RF-rack-valid enforcements in alternator

Closes scylladb/scylladb#28154

* github.com:scylladb/scylladb:
  locator: document the exception type of assert_rf_rack_valid_keyspace
  alternator: don't require rf_rack flag for indexes, validate instead
2026-01-23 11:49:02 +01:00
Botond Dénes
c4c2f87be7 Merge 'db: fail reads and writes with local consistency level to a DC with RF=0' from null
When read or write operations are performed on a DC with RF=0 with LOCAL_QUORUM
or LOCAL_ONE consistency level, Cassandra throws `Unavailable` exception.
Scylla allowed such read operations and failed write operations with a cryptic:
"broken promise" error. This occured because the initial availability
check passed (quorum of 0 requires 0 replicas), but execution failed
later when no replicas existed to process the mutation.

This patch adds an explicit RF=0 validation for LOCAL_ONE and LOCAL_QUORUM that
throws before attempting operation execution.

The change also requires `test_query_dc_with_rf_0_does_not_crash_db` to be
upgraded. This testcase was asserting somewhat similar scenario, but wasn't
taking into account the whole matrix of combinations:
- scenarios: successful vs unsuccesful operation outcome
- local consistency levels: LOCAL_QUORUM & LOCAL_ONE
- operations: SELECT (read) & INSERT (write)

and so it's been extended to cover both the pre-existing and the current issues
and the whole matrix of combinations.

Fixes: scylladb/scylladb#27893

A minor change, no need to backport.

Closes scylladb/scylladb#27894

* github.com:scylladb/scylladb:
  db: fail reads and writes with local consistencty level to a DC with RF=0
  db: consistency_level: split `local_quorum_for()`
  db: consistency_level: fix nrs -> nts abbreviation
2026-01-23 12:36:20 +02:00
Petr Gusev
c45244b235 storage_proxy: drop stop() method
It's not called by main.cc and can be confusing.
2026-01-23 11:22:03 +01:00
Petr Gusev
f5ed3e9fea test_lwt_shutdown: fix flakiness by removing storage_proxy::stop injection
storage_proxy::stop() is not called by main (it is commented out due to #293),
so the corresponding message injection is never hit. When the test releases
paxos_state_learn_after_mutate, shutdown may already be in progress or even
completed by the time we try to trigger the storage_proxy::stop injection,
which makes the test flaky.

Fix this by completely removing the storage_proxy::stop injection.
The injection is not required for test correctness. Shutdown must wait for the
background LWT learn to finish, which is released via the
paxos_state_learn_after_mutate injection.

The shutdown process blocks on in-flight api HTTP requests through
seastar::httpd::http_server::stop and its _task_gate, so the
shutdown will not prevent the HTTP request that released the
paxos_state_learn_after_mutate from completing successfully.

Fixes scylladb/scylladb#28260
2026-01-23 11:20:36 +01:00
Botond Dénes
35c9a00275 Merge 'test.py: pass correctly extra cmd line arguments' from Andrei Chekun
During rewrite --extra-scylla-cmdline-options was missed and it was not passed to the tests that are using pytest. The issue that there were no possibility to pass these parameters via cmd to the Scylla, while tests were not affected because they were using the parameters from the yaml file.
This PR fixes this issue so it will be easier to modify the Scylla start parameters without modifying code.

No backport needed, only framework enhancement.

Closes scylladb/scylladb#28156

* github.com:scylladb/scylladb:
  test.py: do not crash when there is no boost log
  test.py: pass correctly extra cmd line arguments
2026-01-23 11:26:01 +02:00
Andrzej Jackowski
c493a66668 test: check cql_requests_count instead of tasks_processed in SL
Before this change, the test function `_verify_tasks_processed_metrics`
verified that after service level reconfiguration, a given number of
`scylla_scheduler_tasks_processed` were processed by a given scheduling
group. Moreover, the check verified that another scheduling group
didn't process a high number of requests. The second check was vulnerable
to flakiness, because sometimes additional load caused extensive work
in the second scheduling group (e.g. password hashing in `sl:driver`
due to new connections being created).

To avoid test failures, this commit changes which metric is verified:
instead of `scylla_scheduler_tasks_processed`, the metric
`scylla_transport_cql_requests_count` is checked. This prevents similar
problems, because there is no reason for a high number of
requests to be processed by the second scheduling group. Moreover,
it allows decreasing the number of requests that are sent for
verification, and thus speeds up the test.

Fixes: scylladb/scylladb#27715

Closes scylladb/scylladb#28318
2026-01-23 10:19:29 +01:00
Patryk Jędrzejczak
4e984139b2 Merge 'strongly consistent tables: basic implementation' from Petr Gusev
In this PR we add a basic implementation of the strongly-consistent tables:
* generate raft group id when a strongly-consistent table is created
* persist it into system.tables table
* start raft groups on replicas when a strongly-consistent tablet_map reaches them
* add strongly-consistent version of the storage_proxy, with the `query` and `mutate` methods
* the `mutate` method submits a command to the tablets raft group, the query method reads the data with `raft.read_barrier()`
* strongly-consistent versions of the `select_statement` and `modification_statement` are added
* a basic `test_strong_consistency.py/test_basic_write_read` is added which to check that we can write and read data in a strongly consistent fashion.

Limitations:
* for now the strongly consistent tables can have tablets only on shard zero. This is because we (ab/re) use the existing raft system tables which live only on shard0. In the next PRs we'll create separate tables for the new tablets raft groups.
* No Scylla-side proxying - the test has to figure out who is the leader and submit the command to the right node. This will be fixed separately.
* No tablet balancing -- migration/split/merges require separate complicated code.

The new behavior is hidden behind `STRONGLY_CONSISTENT_TABLES` feature, which is enabled when the `STRONGLY_CONSISTENT_TABLES` experimental feature flag is set.

Requirements, specs and general overview of the feature can be found [here](https://scylladb.atlassian.net/wiki/spaces/RND/pages/91422722/Strong+Consistency). Short term implementation plan is [here](https://docs.google.com/document/d/1afKeeHaCkKxER7IThHkaAQlh2JWpbqhFLIQ3CzmiXhI/edit?tab=t.0#heading=h.thkorgfek290)

One can check the strongly consistent writes and reads locally via cqlsh:
scylla.yaml:
```
experimental_features:
  - strongly-consistent-tables
```

cqlsh:
```
CREATE KEYSPACE IF NOT EXISTS my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'initial': 1} AND consistency = 'local';
CREATE TABLE my_ks.test (pk int PRIMARY KEY, c int);
INSERT INTO my_ks.test (pk, c) VALUES (10, 20);
SELECT * FROM my_ks.test WHERE pk = 10;
```

Fixes SCYLLADB-34
Fixes SCYLLADB-32
Fixes SCYLLADB-31
Fixes SCYLLADB-33
Fixes SCYLLADB-56

backport: no need

Closes scylladb/scylladb#27614

* https://github.com/scylladb/scylladb:
  test_encryption: capture stderr
  test/cluster: add test_strong_consistency.py
  raft_group_registry: disable metrics for non-0 groups
  strong consistency: implement select_statement::do_execute()
  cql: add select_statement.cc
  strong consistency: implement coordinator::query()
  cql: add modification_statement
  cql: add statement_helpers
  strong consistency: implement coordinator::mutate()
  raft.hh: make server::wait_for_leader() public
  strong_consistency: add coordinator
  modification_statement: make get_timeout public
  strong_consistency: add groups_manager
  strong_consistency: add state_machine and raft_command
  table: add get_max_timestamp_for_tablet
  tablets: generate raft group_id-s for new table
  tablet_replication_strategy: add consistency field
  tablets: add raft_group_id
  modification_statement: remove virtual where it's not needed
  modification_statement: inline prepare_statement()
  system_keyspace: disable tablet_balancing for strongly_consistent_tables
  cql: rename strongly_consistent statements to broadcast statements
2026-01-23 09:52:33 +01:00
Anna Stuchlik
c681b3363d doc: remove an outdated KB
This page was created for very outdated versions of ScyllaDB (5.1 (ScyllaDB Open Source) and 2022.2 (ScyllaDB Enterprise) to ensure smooth upgrade to that versions.

We no longer need that document.

Fixes https://github.com/scylladb/scylladb/issues/28264

Closes scylladb/scylladb#28266
2026-01-22 18:27:14 +01:00
Szymon Wasik
927aebef37 Add vector search documentation links to CQL docs
This patch adds links to the Vector Search documentation that is hosted
together with Scylla Cloud docs to the CQL documentation.
It also make the note about supported capabilities consistent and
removes the experimental label as the feature is GAed.

Fixes: SCYLLADB-371

Closes scylladb/scylladb#28312
2026-01-22 16:46:38 +01:00
Botond Dénes
f375288b58 tools/scylla-sstable: introduce filter command
Filter the content of sstable(s), including or excluding the specified
partitions. Partitions can be provided on the command line via
`--partition`, or in a file via `--partitions-file`.
Produces one output sstable per input sstable -- if the filter selects
at least one partition in the respective input sstable.
Output sstables are placed in the path provided via `--oputput-dir`.
Use `--merge` to filter all input sstables combined, producing one
output sstable.
2026-01-22 17:20:07 +02:00
Michael Litvak
d5009882c6 locator: document the exception type of assert_rf_rack_valid_keyspace
The function assert_rf_rack_valid_keyspace uses the exception type
std::invalid_argument when the RF-rack validation fails. Document it and
change all callers to catch this specific exception type when checking
for RF-rack validation failures, so that other exception types can be
propagated properly.
2026-01-22 16:11:35 +01:00
Michael Litvak
1f7a65904e alternator: don't require rf_rack flag for indexes, validate instead
In 8df61f6d99 we changed the requirements for creating materialized
views and MV-based indexes - instead of requiring the
rf_rack_valid_keyspaces flag to be set, we now require the keyspace to
be RF-rack-valid at the time of creation, and it is enforced to remain
RF-rack-valid while the MV exists. This validation is done in the cql
create view/index statements.

The same should be done also for alternator - when creating a table with
GSI or LSI, or when adding a GSI to an existing table, previously we
required the flag rf_rack_valid_keyspaces to be set. Now we change it to
instead check if the keyspace is RF-rack-valid, and if not the operation
fails with an appropriate error.
2026-01-22 16:11:35 +01:00
Szymon Malewski
29d090845a vector_index: rescoring: Fetch oversampled rows
So far with oversampling the extended set of keys was returned from VS,
but query to the base table was still limited by the query `limit`.
Now for rescoring we want to fetch rows for all the keys returned from VS.
However later we need to restore the command limit, to trim result_set accordingly.
For non-rescoring scenarios we trim directly keys set returned from VS if it happens to exceed query limit.

With this change rescoring validation tests (except `no_nulls_in_rescored_results`) pass fully.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-83
2026-01-22 15:38:44 +01:00
Szymon Malewski
0bc95bcf87 vector_index: rescoring: Sort by similarity column
This patch implements second part of rescoring - ordering results by similarity column added in earlier patch.
For this purpose in this patch we define `_ordering_comparator`, which enables pre-existing post-query ordering functionality.

However, none additional test passes yet, as they include ovesampling, which will be the subject of following patches.
2026-01-22 15:38:44 +01:00
Szymon Malewski
57e7a4fa4f select_statement: Modify needs_post_query_ordering condition
Our plan for rescoring is to use the existing post-query ordering mechanism to sort (and trim) result_set by similarity column.
For general SELECT case this ordering is permitted only for queries with IN on the partition key and an ORDER BY, which is checked in `needs_post_query_ordering`.
Recently this check was overriden for ANN queries in https://github.com/scylladb/scylladb/pull/28109 to enable IN queries handled by VS without excessive post-processing.

In this patch we revert that change - ANN case will be handled by general check.
However we change the condition - we will enable post processing anytime `_ordering_comparator` is set.

In current implementation `_ordering_comparator` is created only in `select_statement::prepare` with `get_ordering_comparator`,
only for the same conditions as were checked in `needs_post_query_ordering`, so this change should be transparent for general SELECT.
For ANN query it is also not set (yet), so it will not influence ANN filtering, but we confirm that this functionality still works
by adding filtering test: `test/vector_search/filter_test.cc::vector_store_client_test_filtering_ann_cql`.

Rescoring ordering for ANN queries will be enabled when we add `_ordering_comparator` in following patch.
2026-01-22 15:38:44 +01:00
Szymon Malewski
c89957b725 vector_index: rescoring: Add hidden similarity score column
Rescoring consist of recalculating similarity score and reordering results based on it.
In this patch we add calculation of similarity score as a hidden (non-serialized) column and following patch will add reordering.
Normal ordering uses `add_column_for_post_processing`, however this works only for regular columns, not function.
So we create it together with user requested columns (this also forces the use of `selection_with_processing`) and hide the column later.
This also requires special handling for 'SELECT *' case - we need to manually add all columns before adding similarity column.

In case user already asks for similarity score in the SELECT clause, this value will be calculated twice - is should be optimized in future patches.
2026-01-22 15:38:40 +01:00
Anna Stuchlik
0aa881f190 doc: add the info about Alternator ports to the Admin Guide
Fixes https://github.com/scylladb/scylladb/issues/23706

Closes scylladb/scylladb#27724
2026-01-22 16:10:58 +03:00
Israel Fruchter
6b3ce5fdcc dist: scylla_coredump_setup: force unmount /var/lib/systemd/coredump before setup
When setting up coredump handling, if there are old mounts in a deleted state (e.g. from an older installation),
systemd might fail to activate the new `.mount` unit properly because it assumes the path is already mounted.

Explicitly unmount `/var/lib/systemd/coredump` before proceeding with the setup to ensure a clean state.

Fix: scylladb/scylla-enterprise#5692

Closes scylladb/scylladb#28300
2026-01-22 14:35:26 +02:00
Pavel Emelyanov
cb6ee05391 Merge 'Extend snapshot manifest.json with tablet-aware metadata' from Benny Halevy
This series extends the json manifest file we create when taking snapshots.
It adds the following metadata:
- manifesr version and scope
- snapshot name
- created_at and expires_at timestamps (#24061)
- node metadata (host_id, dc, rack)
- keyspace and table metadat
- tablet_count (#26352)
- per-sstable metadata (#26352)

Fixes [SCYLLADB-189](https://scylladb.atlassian.net/browse/SCYLLADB-189)
Fixes [SCYLLADB-195](https://scylladb.atlassian.net/browse/SCYLLADB-195)
Fixes [SCYLLADB-196](https://scylladb.atlassian.net/browse/SCYLLADB-196)

* Enhancement, no backport needed

[SCYLLADB-189]: https://scylladb.atlassian.net/browse/SCYLLADB-189?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[SCYLLADB-195]: https://scylladb.atlassian.net/browse/SCYLLADB-195?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[SCYLLADB-196]: https://scylladb.atlassian.net/browse/SCYLLADB-196?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#27945

* github.com:scylladb/scylladb:
  snapshot: keep per-sstable metadata in manifest.json
  snapshot: add table info and tablet_count to manifest.json
  snapshot: add basic support for snapshot ttl in manifest.json
  table: snapshot_on_all_shards: take snapshot_options
  db: snapshot_ctl: move skip_flush to struct snapshot_options
  snapshot: add snapshot name in manifest.json
  test: lib: cql_test_env: apply db::config::tablets_mode_for_new_keyspaces
  snapshot: add node info to manifest.json
  snapshot: add manifest info to manifest.json
  test: database_test: snapshot_works: add validate_manifest
2026-01-22 15:19:11 +03:00
Patryk Jędrzejczak
67045b5f17 Merge 'raft_topology, tablets: Drain tablets in parallel with other topology operations' from Tomasz Grabiec
Allows other topology operations to execute while tablets are being
drained on decommission. In particular, bootstrap on scale-out. This
is important for elasticity.

Allows multiple decommission/removenode to happen in parallel, which
is important for efficiency.

Flow of decommission/removenode request:
  1) pending and paused, has tablet replicas on target node.
     Tablet scheduler will start draining tablets.
  2) No tablets on target node, request is pending but not paused
  3) Request is scheduled, node is in transition
  4) Request is done

Nodes are considered draining as soon as there is a leave or remove
request on them. If there are tablet replicas present on the target
node, the request is in a paused state and will not be picked by
topology coordinator. The paused state is computed from topology state
automatically on reload.

When request is not paused, its execution starts in
write_both_read_old state. The old tablet_draining state is not
entered (it's deprecated now).

Tablet load balancing will yield the state machine as soon as some
request is no longer paused and ready to be scheduled, based on
standard preemption mechanics.

Fixes #21452

Closes scylladb/scylladb#24129

* https://github.com/scylladb/scylladb:
  docs: Document parallel decommission and removenode and relevant task API
  test: Add tests for parallel decommission/removenode
  test: util: Introduce ensure_group0_leader_on()
  test: tablets: Check that there are no migrations scheduled on draining nodes
  test: lib: topology_builder: Introduce add_draining_request()
  topology_coordinator, tablets: Fail draining operations when tablet migration fails due to critical disk utilization
  tablets: topology_coordinator: Refactor to propagate reason for migration rollback
  tablet_allocator: Skip co-location on draining nodes
  node_ops: task_manager_module: Populate entity field also for active requests
  tasks: node_ops: Put node id in the entity field
  tasks, node_ops: Unify setting of task_stats in get_status() and get_stats()
  topology: Protect against empty cancelation reason
  tasks, topology: Make pending node operations abortable
  doc: topology-over-raft.md: Fix diagram for replacing, tablet_draining is not engaged
  raft_topology, tablets: Drain tablets in parallel with other topology operations
  virtual_tables: Show draining and excluded fields in system.cluster_status and system.load_by_node
  locator: topology: Add "draining" flag to a node
  topology_coordinator: Extract generate_cancel_request_update()
  storage_service: Drop dependency in topology_state_machine.hh in the header
  locator: Extract common code in assert_rf_rack_valid_keyspace()
  topology_coordinator, storage_service: Validate node removal/decommission at request submission time
2026-01-22 13:06:53 +01:00
Botond Dénes
21900c55eb tools/scylla-sstable: remove --unsafe-accept-nonempty-output-dir
This flag was added to operations which have an --output-dir
command-line arguments. These operations write sstables and need a
directory where to write them. Back in the numeric-generation world this
posed a problem: if the directory contained any sstable, generation
clash was almost guaranteed, because each scylla-sstable command
invokation would start output generations from 1. To avoid this, empty
output directory was a requirement, with the
--unsafe-accept-nonempty-output-dir allowing for a force-override.

Now in the timeuuid generation days, all this is not necessary anymore:
generations are unique, so it is not a problem if the output directory
already contains sstables: the probability of generation clash is almost
0. Even if it happens, the tool will just simply fail to write the new
sstable with the clashing generation.

Remove this historic relic of a flag and the related logic, it is just a
pointless nuissance nowadays.
2026-01-22 13:55:59 +02:00
Botond Dénes
a1ed73820f tools/scylla-sstable: make partition_set ordered
Next patch will want partitions to be ordered. Remove unused
partition_map type.
2026-01-22 13:55:59 +02:00
Botond Dénes
d228e6eda6 tools/scylla-stable: remove unused boost/algorithm/string.hpp include 2026-01-22 13:55:59 +02:00
Piotr Smaron
d4c28690e1 db: fail reads and writes with local consistencty level to a DC with RF=0
When read or write operations are performed on a DC with RF=0 with LOCAL_QUORUM
or LOCAL_ONE consistency level, Cassandra throws `Unavailable` exception.
Scylla allowed such read operations and failed write operations with a cryptic:
"broken promise" error. This occured because the initial availability
check passed (quorum of 0 requires 0 replicas), but execution failed
later when no replicas existed to process the mutation.

This patch adds an explicit RF=0 validation for LOCAL_ONE and LOCAL_QUORUM that
throws before attempting operation execution.

The change also requires `test_query_dc_with_rf_0_does_not_crash_db` to be
upgraded. This testcase was asserting somewhat similar scenario, but wasn't
taking into account the whole matrix of combinations:
- scenarios: successful vs unsuccesful operation outcome
- local consistency levels: LOCAL_QUORUM & LOCAL_ONE
- operations: SELECT (read) & INSERT (write)

and so it's been extended to cover both the pre-existing and the current issues
and the whole matrix of combinations.

Fixes: scylladb/scylladb#27893
2026-01-22 12:49:45 +01:00
Piotr Smaron
9475659ae8 db: consistency_level: split local_quorum_for()
The core of `local_quorum_for()` has been extracted to
`get_replication_factor_for_dc()`, which is going to be used later,
while `local_quorum_for()` itself has been recreated using the exracted
part.
2026-01-22 12:49:23 +01:00
Piotr Smaron
0b3ee197b6 db: consistency_level: fix nrs -> nts abbreviation
`network_topology_strategy` was abbreviated with `nrs`, and not `nts`. I
think someone incorrectly assumed it's 'network Replication strategy', hence
nrs.
2026-01-22 12:48:37 +01:00
Marcin Maliszkiewicz
32543625fc test: perf: reuse stream id
When one request is super slow and req/s high
in theory we have a collision on id, this patch
avoids that by reusing id and aborting when there
is no free one (unlikely).
2026-01-22 12:26:50 +01:00
Marcin Maliszkiewicz
7bf7ff785a main: test: add future and abort_source to after_init_func
This commit avoids leaking seastar::async future from two benchmark
tools: perf-alternator and perf-cql-raw. Additionally it adds
abort_source for fast and clean shutdown.
2026-01-22 12:26:50 +01:00
Marcin Maliszkiewicz
0d20300313 test: perf: add option to stress multiple tables in perf-cql-raw 2026-01-22 12:26:50 +01:00
Marcin Maliszkiewicz
a033b70704 test: perf: add perf-cql-raw benchmarking tool
The tool supports:
- auth or no auth modes
- simple read and write workloads
- connection pool or connection per request modes
- in-process or remote modes, remote may be usefull
to assess tool's overhead or use it as bigger scale benchmark
- uses prepared statements by default
- connection only mode, for testing storms

It could support in the future:
- TLS mode
- different workloads
- shard awareness

Example usage:
> build/release/scylla perf-cql-raw --workdir /tmp/scylla-data --smp 2
--cpus 0,1 \
--developer-mode 1 --workload read --duration 5 2> /dev/null

Running test with config: {workload=read, partitions=10000, concurrency=100, duration=5, ops_per_shard=0}
Pre-populated 10000 partitions
97438.42 tps (269.2 allocs/op,   1.1 logallocs/op,  35.2 tasks/op,  113325 insns/op,   80572 cycles/op,        0 errors)
102460.77 tps (261.1 allocs/op,   0.0 logallocs/op,  31.7 tasks/op,  108222 insns/op,   75447 cycles/op,        0 errors)
95707.93 tps (261.0 allocs/op,   0.0 logallocs/op,  31.7 tasks/op,  108443 insns/op,   75320 cycles/op,        0 errors)
102487.87 tps (261.0 allocs/op,   0.0 logallocs/op,  31.7 tasks/op,  107956 insns/op,   74320 cycles/op,        0 errors)
100409.60 tps (261.0 allocs/op,   0.0 logallocs/op,  31.7 tasks/op,  108337 insns/op,   75262 cycles/op,        0 errors)
throughput:
	mean=   99700.92 standard-deviation=3039.28
	median= 100409.60 median-absolute-deviation=2759.85
	maximum=102487.87 minimum=95707.93
instructions_per_op:
	mean=   109256.53 standard-deviation=2281.39
	median= 108337.37 median-absolute-deviation=1034.83
	maximum=113324.69 minimum=107955.97
cpu_cycles_per_op:
	mean=   76184.36 standard-deviation=2493.46
	median= 75320.20 median-absolute-deviation=922.09
	maximum=80572.19 minimum=74320.00
2026-01-22 12:26:50 +01:00
Patryk Jędrzejczak
e11450ccca test: test_raft_recovery_user_data: prevent repeated ALTER KEYSPACE request
The test is currently flaky. With `remove_dead_nodes_with == "remove"`,
it sends several ALTER KEYSPACE requests. The request performed just
after adding 3 new nodes can unexpectedly be sent twice to two
different nodes by the driver. The second receiver rejects the request
through the new guardrail added in 2e7ba1f8ce,
and the test fails.

This has been acknowledged as a bug in the Python driver. It shouldn't
retry non-idempotent requests with the default retry policy. There could
be one more bug in the driver, as it looks like the driver decides to
resend the request after it disconnects from the first receiver. The
first receiver has just bootstrapped, so the driver shouldn't disconnect.

We deflake the test by reconnecting the driver before performing the
problematic ALTER KEYSPACE request.

The change has been tested in byo, as the failure reproduces only in CI.
Without the change, the test fails once in ~250 runs in dev mode. With
the change, more than 1000 runs passed.

Fixes #27862

No backport needed as 2e7ba1f8ce is only
in master.

Closes scylladb/scylladb#28290
2026-01-22 14:13:42 +03:00
Nadav Har'El
9baaddb613 test/cqlpy: add reproducer for hidden Paxos table being shown by DESC
This patch adds a reproducer test showing issue #28183 - that when LWT
is used, hidden tables "...$paxos" are created but they are unexpectedly
shown by DESC TABLES, DESC SCHEMA and DESC KEYSPACE.

The new test was failing (in three places) on Scylla, as those internal
(and illegally-named) tables are listed, and passes on Cassandra
(which doesn't add hidden tables for LWT).

The commit also contains another test, which verifies if direct
description of paxos state table is wrapped in comment.

Refs #28183.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-01-22 10:43:59 +01:00
Szymon Malewski
e0cc6ca7e6 vector_index: Refactor extracting ANN query information
For the purpose of rescoring we will need information if the query is an ANN query
and the access to index option earlier in the `select_statement::prepare` than it happened before.
This patch refactors extracting this information to new helper structure `ann_ordering_info`
and is consistently using it.
2026-01-22 10:00:47 +01:00
Botond Dénes
7d2e6c0170 Merge 'config: add enforce_rack_list option' from Aleksandra Martyniuk
Add enforce_rack_list option. When the option is set to true,
all tablet keyspaces have rack list replication factor.

When the option is on:
- CREATE STATEMENT always auto-extends rf to rack lists;
- ALTER STATEMENT fails when there is numeric rf in any DC.

The flag is set to false by default and a node needs to be restarted
in order to change its value. Starting a node with enforce_rack_list
option will fail, if there are any tablet keyspaces with numeric rf
in any DC.

enforce_rack_list is a per-node option and a user needs to ensure
that no tablet keyspace is altered or created while nodes in
the cluster don't have the consistent value.

Mark rf_rack_valid_keyspaces as deprecated.

Fixes: https://github.com/scylladb/scylladb/issues/26399.

New feature; no backport needed

Closes scylladb/scylladb#28084

* github.com:scylladb/scylladb:
  test: add test for enforce_rack_list option
  db: mark rf_rack_valid_keyspaces as deprecated
  config: add enforce_rack_list option
  Revert "alternator: require rf_rack_valid_keyspaces when creating index"
2026-01-22 10:27:35 +02:00
Benny Halevy
d6557764b9 snapshot: keep per-sstable metadata in manifest.json
Adds a "sstables" array member to manifest.json.
For each sstables, keep the following metadata:
id - a uuid for the sstable (the sstable identifier
if the use-sstable-identifier option was used, otherwise
the sstable uuid generation)
toc_name - the name of the TOC.txt file
data_size and index_size - in bytes
first_token and last_token - of the sstable first and last keys.

Fixes: SCYLLADB-196

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:42:52 +02:00
Benny Halevy
dc9093303d snapshot: add table info and tablet_count to manifest.json
Add a table member to manifest.json with the keyspace_name,
table_name, table_id, tablets_type, and, for tablets-enabled tables, get
tablet_count on each shard and write the minimum to manifest.json.
For vnodes-based tables, tablet_count=0.

For now, `tablets_type` may be either `none` for vnodes tables, or
`powof2` for tablets tables.  In the future, when we support arbitrary
tablt boundaries, this will be reflected here, and it is likely we
would backup the whole tablets map sperately to get all tablet boundaries.

Fixes SCYLLADB-195

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:36:52 +02:00
Benny Halevy
91df129e21 snapshot: add basic support for snapshot ttl in manifest.json
Store the snapshot `created_at` time and an optional
`expires_at` time.

Fixes SCYLLADB-189

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
5e90fbb9d2 table: snapshot_on_all_shards: take snapshot_options
And keep the options for now in the local_snapshot_writer.
The options will be used by following patches to pass
extra metadata like the snapshot creation time, expiration time, etc.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
49a3e0914d db: snapshot_ctl: move skip_flush to struct snapshot_options
So we can easily extend it and add more options.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
d9fc3b1c11 snapshot: add snapshot name in manifest.json
Store the snapshot tag in the manifest file.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
0f014e2916 test: lib: cql_test_env: apply db::config::tablets_mode_for_new_keyspaces
If tablets are enabled via db::config add the `tablet = {'enabled': true}'
option when creating a keyspace, even if `cql_test_config.initial_tablets`
is disengaged.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
0d82e56078 snapshot: add node info to manifest.json
Add metadata about the node: host_id, datacenter, and rack.
This enables dc- or rack- aware restore.
Today this information is "encoded" into the snapshot hierarchy
prefixes, but if all manifest files would be stored in a flat
directory, we'd need to encode that metadata in the object name,
but it'd be better for the manifest contents to be self descriptive.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
24040efc54 snapshot: add manifest info to manifest.json
Add metadata about the manifest itself:
A version and the manifest scope (currently "node",
but in the future, may also be "shard", or "tablet")

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
9e0f5410ae test: database_test: snapshot_works: add validate_manifest
Validate the manifest.json format by loading it using rjson::parse
and then validate its contents to ensure it lists exactly the
SSTables present in the snapshot directory.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Patryk Jędrzejczak
e503340efc test: test_raft_recovery_during_join: get host ID of the bootstrapping node before it crashes
The test is currently flaky. It tries to get the host ID of the bootstrapping
node via the REST API after the node crashes. This can obviously fail. The
test usually doesn't fail, though, as it relies on the host ID being saved
in `ScyllaServer._host_id` at this point by `ScyllaServer.try_get_host_id()`
repeatedly called in `ScyllaServer.start()`. However, with a very fast crash
and unlucky timings, no such call may succeed.

We deflake the test by getting the host ID before the crash. Note that at this
point, the bootstrapping node must be serving the REST API requests because
`await coordinator_log.wait_for("delay_node_bootstrap: waiting for message")`
above guarantees that the node has submitted the join topology request, which
happens after starting the REST API.

Fixes #28227

Closes scylladb/scylladb#28233
2026-01-22 06:55:16 +02:00
Botond Dénes
4281d18c2e Merge 'schema: Apply sstable_compression_user_table_options to CQL aux and Alternator tables' from Nikos Dragazis
In PR 5b6570be52 we introduced the config option `sstable_compression_user_table_options` to allow adjusting the default compression settings for user tables. However, the new option was hooked into the CQL layer and applied only to CQL base tables, not to the whole spectrum of user tables: CQL auxiliary tables (materialized views, secondary indexes, CDC log tables), Alternator base tables, Alternator auxiliary tables (GSIs, LSIs, Streams).

This gap also led to inconsistent default compression algorithms after we changed the option’s default algorithm from LZ4 to LZ4WithDicts (adf9c426c2).

This series introduces a general “schema initializer” mechanism in `schema_builder` and uses it to apply the default compression settings uniformly across all user tables. This ensures that all base and aux tables take their default compression settings from config.

Fixes #26914.

Backport justification: LZ4WithDicts is the new default since 2025.4, but the config option exists since 2025.2. Based on severity, I suggest we backport only to 2025.4 to maintain consistency of the defaults.

Closes scylladb/scylladb#27204

* github.com:scylladb/scylladb:
  db/config: Update sstable_compression_user_table_options description
  schema: Add initializer for compression defaults
  schema: Generalize static configurators into schema initializers
  schema: Initialize static properties eagerly
  db: config: Add accessor for sstable_compression_user_table_options
  test: Check that CQL and Alternator tables respect compression config
2026-01-22 06:50:48 +02:00
Nadav Har'El
5c1e525618 Merge 'vector_search: cache restrictions JSON at prepare time ' from Dawid Pawlik
Add `prepared_filter` class which handles the preparation, construction, and caching of Vector Search filtering compatible JSON object.

If no bind markers found in primary key restrictions, the JSON object will be built once at prepare time and cached for use during execution calls.

Additionally, this patch moves the filter functions from `cql3::restrictions` to `vector_search` namespace and does some renaming to make the purpose of those functions clear.

Follow-up: https://github.com/scylladb/scylladb/pull/28109
Fixes: [SCYLLADB-299](https://scylladb.atlassian.net/browse/SCYLLADB-299)

[SCYLLADB-299]: https://scylladb.atlassian.net/browse/SCYLLADB-299?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#28276

* github.com:scylladb/scylladb:
  test/vector_search: add filter tests with bind variables
  vector_search: cache restrictions JSON at prepare time
  refactor: vector_search: move filter logic to vector_search namespace
2026-01-21 19:03:35 +02:00
Ernest Zaslavsky
51285785fa storage: add method to create component source
Extend the storage interface with a public method to create a
`data_source` for any sstable component.
2026-01-21 16:40:12 +02:00
Ernest Zaslavsky
757e9d0f52 streaming: keep sharded database reference on tablet_sstable_streamer 2026-01-21 16:40:12 +02:00
Ernest Zaslavsky
4ffa070715 streaming: skip streaming empty sstable sets
Avoid invoking streaming when the sstable set is empty. This prevents
unnecessary calls and simplifies the streaming logic.
2026-01-21 16:40:12 +02:00
Ernest Zaslavsky
0fcd369ef2 streaming: inline streamer instance creation
Remove the `make_sstable_streamer` function and inline its usage where
needed. This change allows passing different sets of arguments
directly at the call sites.
2026-01-21 16:40:12 +02:00
Ernest Zaslavsky
32173ccfe1 tests: fix incorrect backup/restore test flow
When working directly with sstable components, the provided name should
be only the file name without path prefixes. Any prefixing tokens
belong in the 'prefix' argument, as the name suggests.
2026-01-21 16:40:12 +02:00
Patryk Jędrzejczak
1f0f694c9e test: test_zero_token_nodes_multidc: properly handle reads with CL=LOCAL_ONE
The test is currently flaky. It incorrectly assumes that a read with
CL=LOCAL_ONE will see the data inserted by a preceding write with
CL=LOCAL_ONE in the same datacenter with RF=2.

The same issue has already been fixed for CL=ONE in
21edec1ace. The difference is that
for CL=LOCAL_ONE, only dc1 is problematic, as dc2 has RF=1.

We fix the issue for CL=LOCAL_ONE by skipping the check for dc1.

Fixes #28253

The fix addresses CI flakiness and only changes the test, so it
should be backported.

Closes scylladb/scylladb#28274
2026-01-21 15:17:42 +01:00
Petr Gusev
843da2bbb8 test_encryption: capture stderr
The check_output function doesn't capture stderr by default, making
it impossible to debug if something goes wrong.
2026-01-21 14:56:01 +01:00
Petr Gusev
925d86fefc test/cluster: add test_strong_consistency.py
Add a basic test that creates a strongly consistent keyspace and table,
writes some data, and verifies that the same data can be read back.

Since Scylla-side request proxying is not yet implemented, writes are
handled only on the leader node. The test uses the existing
`/raft/leader_host` REST endpoint to determine the leader of the tablets
Raft group.
2026-01-21 14:56:01 +01:00
Petr Gusev
59a876cebb raft_group_registry: disable metrics for non-0 groups
The `raft::server` registers metrics using the `server_id` label. When
both a group0 Raft server and the tablets Raft server are created on
the same node/shard, duplicate metrics cause conflicts.

This commit temporarily disables metrics for non-0 groups. A proper fix
will likely require adding a `group_id` label in the future.
2026-01-21 14:56:01 +01:00
Petr Gusev
1f170d2566 strong consistency: implement select_statement::do_execute() 2026-01-21 14:56:01 +01:00
Petr Gusev
a5d611866e cql: add select_statement.cc 2026-01-21 14:56:01 +01:00
Petr Gusev
493ebe2add strong consistency: implement coordinator::query() 2026-01-21 14:56:01 +01:00
Petr Gusev
ccf90cfde8 cql: add modification_statement
We use decoration instead of inheritance, since inheritance already
serves to differentiate statement types (modification_statement has
update_statement and delete_statement as descendants). A better
solution would likely involve refactoring modification_statement and
extracting the mutation-generation logic into a reusable component
shared by both eventual and strongly consistent statements.
2026-01-21 14:56:01 +01:00
Petr Gusev
989566e8a3 cql: add statement_helpers
Introduce two helper methods that will be used for strongly consistent
select_statement and modification_statement.

redirect_statement() forwards the request to another shard or node.
Currently, only shard forwarding is implemented; node-level proxying
will be added in follow-up PRs.

is_strongly_consistent() will be used in the prepare() method of raw
statements to determine whether a strongly consistent statement should
be created for the given CQL statement.
2026-01-21 14:56:01 +01:00
Petr Gusev
b94fd11939 strong consistency: implement coordinator::mutate()
To guarantee monotonic mutation timestamps, we compute the maximum
timestamp used so far for the current tablet. This is done by calling
read_barrier() on the tablet’s Raft group server and extracting the
maximum timestamp from the local database via
table::get_max_timestamp_for_tablet().

Because read_barrier() may take a while, we perform it proactively in a
dedicated fiber, leader_info_updater, rather than during the mutation
request. This fiber is started when the Raft group server starts for a
tablet. It reacts to wait_for_state_change(), computes the maximum
timestamp, and stores it per term.

The new groups_manager::begin_mutate() function checks whether the
maximum timestamp has already been computed for the current term. If
not, it asks the client to wait. This two-step interface (synchronous
begin_mutate() + asynchronous wait on the need_wait_for_leader future)
is needed because the term can change at any asynchronous point.
If begin_mutate() were asynchronous, the client would need to recheck
the term after `co_await begin_mutate()`.

We currently do not handle raft::commit_status_unknown. We rethrow it to
the CQL client, which must check whether the command was applied and
retry if necessary. Handling this inside Scylla would require persisting
a deduplication key after applying the mutation, which introduces write
amplification. Additionally, connection breaks between Scylla and the
driver can always occur, so the client must be prepared to verify the
command status regardless.
2026-01-21 14:56:01 +01:00
Petr Gusev
801bf82a34 raft.hh: make server::wait_for_leader() public
When a strongly consistent request arrives at a node, we
need to know which replica is the leader, since such requests
are generally executed only on the leader. If a leader has
not yet been elected, we must wait. This commit exposes
wait_for_leader() so it can be used for that purpose.

We cannot rely solely on wait_for_state_change(), because it does not
trigger when some other node becomes a leader.
2026-01-21 14:56:01 +01:00
Petr Gusev
7d111f2396 strong_consistency: add coordinator
Add the `coordinator` class, which will be responsible for coordinating
reads and writes to strongly consistent tables. This commit includes
only the boilerplate; the methods will be implemented in separate
commits.
2026-01-21 14:56:01 +01:00
Petr Gusev
4413142f25 modification_statement: make get_timeout public
We'll need to access this method in a new
strong_consistency/modification_statement class.
2026-01-21 14:56:00 +01:00
Petr Gusev
4902186ede strong_consistency: add groups_manager
This class is reponsible for managing raft groups for
strongly-consistent tablets.
2026-01-21 14:56:00 +01:00
Petr Gusev
4eee5bc273 strong_consistency: add state_machine and raft_command
These commands will be used by strongly consistent tablets to submit
mutations to Raft. A simple state_machine implementation is introduced
to apply these commands.

We apply commands in batches to reduce commitlog I/O overhead. The
batched variant of database::apply has known atomicity issues. For
example, it does not guarantee atomicity under memory pressure: some
mutations may be published to the memtable while others are blocked in
run_when_memory_available. We will address these issues later.
2026-01-21 14:56:00 +01:00
Petr Gusev
a8350b274e table: add get_max_timestamp_for_tablet
Strongly consistent writes require knowing the maximum timestamp of
locally applied mutations to guarantee monotonically increasing
timestamps for subsequent writes.

This commit adds a function that returns the maximum timestamp for a
given tablet.

Why it is safe to use this function with deleted cells:
* Tombstones are included in memtable.get_max_timestamp() calculations.
* The maximum timestamp of a memtable is used to initialize the maximum
  timestamp of the resulting sstable.
* During compaction, a new sstable’s maximum timestamp is initialized as
  the maximum of the contributing sstables.
2026-01-21 14:56:00 +01:00
Petr Gusev
dc4896b69b tablets: generate raft group_id-s for new table
We assign shard to zero because raft is currently only supported on the
zero shard.
2026-01-21 14:56:00 +01:00
Petr Gusev
6dd74be684 tablet_replication_strategy: add consistency field
This commit adds a `consistency` field to `tablet_replication_strategy`.
In upcoming commits we'll use this field to determine if a
`raft_group_id` should be generated for a new table.
2026-01-21 14:56:00 +01:00
Petr Gusev
53f93eb830 tablets: add raft_group_id
Add a `raft_group_id` column to `system.tablets` and to the `tablet_map`
class. The column is populated only when the
`strongly_consistent_tables` feature is enabled.

This feature is currently disabled by default and is enabled only when
the user sets the `STRONGLY_CONSISTENT_TABLES` experimental flag.

The `raft_group_id` column is added to `system.tablets` only when this
flag is set. This allows the schema to evolve freely while the feature
is experimental, without requiring complex migrations.
2026-01-21 14:56:00 +01:00
Petr Gusev
cab3e1eea5 modification_statement: remove virtual where it's not needed
This is a refactoring/simplification commit.
2026-01-21 14:56:00 +01:00
Petr Gusev
9015bed794 modification_statement: inline prepare_statement()
This is a refactoring/simplification commit.

There are many 'prepare' functions in this class that don't
meaningfully differ from each other. The prepare_statement() adds
accidental complexity by adding a level of indirection -- the reader
has to jump between the call site and the function body to reconstruct
the full picture.
2026-01-21 14:56:00 +01:00
Petr Gusev
998ee5b7fb system_keyspace: disable tablet_balancing for strongly_consistent_tables
We don't yet support tablet balancing for strongly consistent tables.
2026-01-21 14:56:00 +01:00
Petr Gusev
6b0d757f28 cql: rename strongly_consistent statements to broadcast statements
In preparation for upcoming work on strongly consistent queries in
Scylla, this commit renames the existing `strongly_consistent`
statements to `broadcast_statements` to avoid confusion.

The old code paths are kept temporarily, as they may be useful for
reference or for copying parts during the implementation of the new
strongly consistent statements.
2026-01-21 14:56:00 +01:00
Pavel Emelyanov
18b5a49b0c Populate all sl:* groups into dedicated top-level supergroup
This patch changes the layout of user-facing scheduling groups from

/
`- statement
`- sl:default
`- sl:*
`- other groups (compaction, streaming, etc.)

into

/
`- user (supergroup)
   `- statement
   `- sl:default
   `- sl:*
`- other groups (compaction, streaming, etc.)

The new supergroup has 1000 static shares and is name-less, in a sense
that it only have a variable in the code to refer to and is not exported
via metrics (should be fixed in seastar if we want to).

The moved groups don't change their names or shares, only move inside
the scheduling hierarchy.

The goal of the change is to improve resource consumption of sl:*
groups. Right now activities in low-shares service levels are scheduled
on-par with e.g. streaming activity, which is considered to be low-prio
one. By moving all sl:* groups into their own supergroup with 1000
shares changes the meaning of sl:* shares. From now on these shares
values describe preirities of service level between each-other, and the
user activities compete with the rest of the system with 1000 shares,
regardless of how many service levels are there.

Unit tests keep their user groups under root supergroup (for simplicity)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28235
2026-01-21 14:14:48 +02:00
Botond Dénes
cf4133c62d Merge 'service: pass topology guard to RBNO' from Aleksandra Martyniuk
Currently, raft-based node operations with streaming use topology guards, but repair-based don't.

Topology guards ensure that if a respective session is closed (the operation has finished), each leftover operation being a part of this session fails. Thanks to that we won't incorrectly assume that e.g. the old rpc received late belongs to the newly started operation. This is especially important if the operation involves writes.

Pass a topology_guard down from raft_topology_cmd_handler to repair tasks. Repair tasks already support topology guards.

Fixes: https://github.com/scylladb/scylladb/issues/27759

No topology_guard in any version; needs backport to all versions

Closes scylladb/scylladb#27839

* github.com:scylladb/scylladb:
  service: use session variable for streaming
  service: pass topology guard to RBNO
2026-01-21 12:47:28 +02:00
Robert Bindar
ea8a661119 reduce test_backup.py and test_refresh.py datasets
backup and restore tests. This made the testing times explode
with both cluster/object_store/test_backup.py and
cluster/test_refresh.py taking more than an hour each to complete
under test.py and around 14min under pytest directly.
This was painful especially in CI because it runs tests under test.py which
suffers from the issue of not being able to run test cases from within
the same file in parallel (a fix is attempted in 27618).

This patch reduces the dataset of these tests to the minimum and
gets rid of one of the tested topology as it was redundant.
The test times are reduced to 2min under pytest and 14 mins under
test.py.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#28280
2026-01-21 10:47:36 +02:00
Nadav Har'El
8962093d90 Merge 'vector_index: Introduce rescoring index option' from Szymon Malewski
This series introduces `rescoring` index option.
There is no rescoring algorithm implementation yet.
This series prepares it by:
- adding new index option
- adding documentation
- adding tests for option handling
- adding tests for rescoring implementation - at this point they report errors and are marked that this is expected, because rescoring is not implemented.

Follow-up https://github.com/scylladb/scylladb/pull/27677
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-293
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-294

No backporting - it is a new feature.

Closes scylladb/scylladb#28165

* github.com:scylladb/scylladb:
  vector_search: Add more rescoring validation tests
  vector_search: Add rescoring validation test
  vector_search: doc: Document new index option
  vector_search: test: Add `rescoring` index option test
  vector_index: introduce rescoring option
  vector_index: improve options validation
2026-01-21 10:46:22 +02:00
Avi Kivity
ecb6fb00f0 streamed_mutation_freezer: use chunked_vector instead of std::deque for clustering rows
The streamed_mutation_freezer class uses a deque to avoid large
allocations, but fails as seen in the referenced issue when the
vector backing the deque grows too large. This may be a problem
in itself, but the issue doesn't provide enough information to tell.

Fix the immediate problem by switching to chunked_vector, which
is better in avoiding large allocations. We do lose some early-free
in serialize_mutation_fragments(), but since most of the memory should
be in the clustering row itself, not in the deque/chunked_vector holding
it, it should not be a problem.

Fixes #28275

Closes scylladb/scylladb#28281
2026-01-21 10:13:44 +02:00
Botond Dénes
09d3b6c98b Merge 'auth: use paged internal queries during migration' from Piotr Smaron
Auth v2 migration uses non-paged queries via `execute_internal` API.
This commit changes it to use `query_internal` instead, which uses
paging under the hood.

Fixes: https://github.com/scylladb/scylladb/issues/27577

A minor enhancement, no need to backport.

Closes scylladb/scylladb#25395

* github.com:scylladb/scylladb:
  auth: use paged internal queries during migration
  auth: move some code in migrate_to_auth_v2 up
  auth: re-align pieces of migrate_to_auth_v2
  cql: extend `query_internal` with `query_state` param
2026-01-21 09:58:06 +02:00
Raphael S. Carvalho
d16f9c821d Revert "api: storage_service/tablets/repair: disable incremental repair by default"
This reverts commit c8cff94a5a.

Re-enabling incremental repair on master with "Aborting on shard 0 during
scaleout + repair #26041" and "Failure to attach sstables in streaming consumer
leaves sealed sstables on disk #27414" fixed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#28120
2026-01-21 08:50:13 +02:00
Dawid Pawlik
4a7e20953a test/cqlpy: remove the xfail mark from already passing tests
Since #28109 was merged, those tests started to pass as we allow
the filtering on primary key columns within ANN vector queries.

Closes scylladb/scylladb#28231
2026-01-21 08:47:20 +02:00
Botond Dénes
7a3f51b304 Update seastar submodule
* seastar e00f1513...f55dc7eb (8):
  > build: detect io_uring correctly
  > ioinfo: report actual I/O latency goal instead of configured value
  > core/loop: repeater,repeat_until_value: don't rethrow exceptions
  > seastar::test: Add exception interception to extract nested info
  > bool_class: add operator<=>
  > Merge 'Perftune.py: generic virtual device with lower-level slave network devices support' from Vladislav Zolotarov
    scripts/perftune.py: NetPerfTuner: rename "hardware interface" methods to "tunable interface" methods
    scripts/perftune.py: NetPerfTuner: refactor NIC handling to support any virtual interface with slaves
    scripts/perftune.py: NetPerfTuner: rename methods to private for better encapsulation
  > scripts/perftune.py: add fast path queues IRQ filtering and indexing for Microsoft Azure Network Adapter (MANA) NICs
  > file: Finally override dma_read_bulk() in posix_file_impl

Closes scylladb/scylladb#28262
2026-01-21 08:44:20 +02:00
Szymon Malewski
d6226500f6 vector_search: Add more rescoring validation tests
Adding tests for specific cases of rescoring processing:
- wildcard selection - "SELECT * ..." is a case with slightly different path of rescoring processing. We want to confirm that it is handled correctly.
- calculating similarity with other vectors in SELECT clause should not influence ANN ordering.
- NULL handling - results that for any reason have NULL in a score should be filtered out.

As rescoring is not implemented yet, the tests use boost::unit_test::expected_failures
to indicate that the test reports errors.
2026-01-20 21:01:45 +01:00
Karol Nowacki
376c70be75 vector_search: Add rescoring validation test
Verify that vector store results will be correctly rescored and reordered
according to the rescoring algorithm.
As rescoring is not implemented yet, the tests use `boost::unit_test::expected_failures`
to indicate that they report errors.

First test checks rescoring with a simple selection list.
Second makes sure that rescoring is not triggered for quantization=f32 - full representation of vectors.
Third repeats the first one, but adds to it returning of similarity score value.
2026-01-20 21:01:45 +01:00
Karol Nowacki
41dcf12463 vector_search: doc: Document new index option
Adds documentation for the rescoring option for vector search indexes.
2026-01-20 21:01:45 +01:00
Karol Nowacki
b268eda67e vector_search: test: Add rescoring index option test
Add tests to validate `rescoring` index options.
It also improves tests for related `oversampling` option validation.
2026-01-20 21:01:45 +01:00
Szymon Malewski
c5945b1ef4 vector_index: introduce rescoring option
This patch adds vector index option allowing to enable rescoring - recalculation of similarity metric and re-ranking of quantized VS candidates.
Quantization is a necessary condition to run rescoring - checked in convenience function `is_rescoring_enabled`.
Rescoring itself is not implemented - it will come in following patches.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-294
2026-01-20 21:01:45 +01:00
Szymon Malewski
262a8cef0b vector_index: improve options validation
In this patch we enhance validation of option by:
- giving context (option name) in error messages
- listing supported values in error messages of enumerated options
- avoiding using templates

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-293
Follow-up: https://github.com/scylladb/scylladb/pull/27677
2026-01-20 21:01:41 +01:00
Dawid Pawlik
f27ef79d0d test/vector_search: add filter tests with bind variables
Add tests that check if preparation of the filter  does work
and not produce cache when the restrictions consist of bind variables.
2026-01-20 18:17:46 +01:00
Dawid Pawlik
e62cb29b7d vector_search: cache restrictions JSON at prepare time
Add `prepared_filter` class which handles the preparation, construction
and caching of Vector Search filtering compatible JSON object.

If no bind markers found in SELECT statement, the JSON object will be built
once at prepare time and cached for use during execution calls.

Adjust tests accordingly to use prepared filters.

Follow-up: #28109
Fixes: SCYLLADB-299
2026-01-20 17:15:52 +01:00
Michał Jadwiszczak
f89a8c4ec4 cql3/statements/describe_statement: hide paxos state tables
Paxos state tables are internal tables fully managed by Scylla
and they shouldn't be exposed to the user nor they shouldn't be backed up.

This commit hides those kind of tables from all listings and if such table
is directly described with `DESC ks."tbl$paxos"`, the description is generated
withing a comment and a note for the user is added.

Fixes scylladb/scylladb#28183
2026-01-20 15:58:08 +01:00
Andrei Chekun
91e2b027ce test.py: do not crash when there is no boost log
Small fix to not crash the whole process when boost tests failed to
start and do not produced the log file that can be parsed.
2026-01-20 15:52:40 +01:00
Andrei Chekun
58d3052ad4 test.py: pass correctly extra cmd line arguments
During rewrite --extra-scylla-cmdline-options was missed and it was not
passed to the tests that are using pytest. The issue that there were no
possibility to pass these parameters via cmd to the Scylla, while tests
were not affected because they were using the parameters from the yaml
file. This PR fixes this issue so it will be easier to modify the Scylla
start parameters without modifying code.
2026-01-20 15:52:40 +01:00
Anna Stuchlik
84e9b94503 doc: fix the default compaction strategy for Materialized Views
Fixes https://github.com/scylladb/scylladb/issues/24483

Closes scylladb/scylladb#27725
2026-01-20 15:41:03 +01:00
Dawid Pawlik
f54a4010c0 refactor: vector_search: move filter logic to vector_search namespace
Move Vector Search filter functions from `cql3::restrictions` to
`vector_search` namespace as it's a better place according to
it's purpose.
The effective name has now changed from `cql3::restrictions::to_json`
to `vector_search::to_json` which clearly mentions that the JSON
object will be used for Vector Search.

Rename the auxilary functions to use `to_json` suffix instead of
variety of verbs as those functions logic focuses on building JSON
object from different structures. The late naming emphasized too
much distinction between those functions, while they do pretty much
the same thing.

Follow-up: #28109
2026-01-20 13:13:43 +01:00
Botond Dénes
a53f989d2f db/row_cache: make_nonpopulating_reader(): pass cache tracker to snapshot
The API contract in partition_version.hh states that when dealing with
evictable entries, a real cache tracker pointer has to be passed to all
methods that ask for it. The nonpopulating reader violates this, passing
a nullptr to the snapshot. This was observed to cause a crash when a
concurrent cache read accessed the snapshot with the null tracker.

A reproducer is included which fails before and passes after the fix.

Fixes: #26847

Closes scylladb/scylladb#28163
2026-01-20 12:34:37 +01:00
Avi Kivity
c7dda5500c database: simplify apply_counter_update exception handling
Use coroutine::try_future to exit the coroutine immediately on
error instead of explict checks.

Closes scylladb/scylladb#28257
2026-01-20 11:13:49 +02:00
Wojciech Mitros
fc2aecea69 idl: don't redefine bound_weight and partition_region in paging_state.idl.hh
Bound_weight and partition_region are defined in both paging_state.idl.hh and
position_in_partition.idl.hh. This isn't currently causing any issues, but if
a future RPC uses both the paging_state and position_in_partition, after
including both files we'll get a duplicate error.
In this patch we prevent this by removing the definitions from paging_state.idl.hh
and including position_in_partition.idl.hh in their place.

Closes scylladb/scylladb#28228
2026-01-20 11:12:47 +02:00
Aleksandra Martyniuk
2be5ee9f9d service: use session variable for streaming
Use session that was retrieved at the beginning of the handler for
node operations with streaming to ensure that the session id won't
change in between.
2026-01-20 10:06:34 +01:00
Aleksandra Martyniuk
3fe596d556 service: pass topology guard to RBNO
Currently, raft-based node operations with streaming use topology
guards, but repair-based don't.

Topology guards ensure that if a respective session is closed
(the operation has finished), each leftover operation being a part
of this session fails. Thanks to that we won't incorrectly assume
that e.g. the old rpc received late belongs to the newly started
operation. This is especially important if the operation involves
writes.

Pass a topology_guard down from raft_topology_cmd_handler to repair
tasks. Repair tasks already support topology guards.

Fixes: https://github.com/scylladb/scylladb/issues/27759
2026-01-20 10:06:34 +01:00
Aleksandra Martyniuk
f0dbf6135d test: add test for enforce_rack_list option 2026-01-20 10:01:15 +01:00
Aleksandra Martyniuk
7dc371f312 db: mark rf_rack_valid_keyspaces as deprecated
Mark rf_rack_valid_keyspaces option as deprecated. User should
use enforce_rack_list option instead.

The option can still be used and it does not change it's behavior.

Docs is updated accordingly.
2026-01-20 09:58:57 +01:00
Aleksandra Martyniuk
761ace4f05 config: add enforce_rack_list option
Add enforce_rack_list option. When the option is set to true,
all tablet keyspaces have rack list replication factor.

When the option is on:
- CREATE STATEMENT always auto-extends rf to rack lists;
- ALTER STATEMENT fails when there is numeric rf in any DC.

The flag is set to false by default and a node needs to be restarted
in order to change its value. Starting a node with enforce_rack_list
option will fail, if there are any tablet keyspaces with numeric rf
in any DC.

enforce_rack_list is a per-node option and a user needs to ensure
that no tablet keyspace is altered or created while nodes in
the cluster don't have the consistent value.
2026-01-20 09:58:51 +01:00
Michael Litvak
e7ec87382e Revert "alternator: require rf_rack_valid_keyspaces when creating index"
This reverts commit 4b26a86cb0.

The rf_rack_valid_keyspaces option is now not required for creating MVs.
2026-01-20 09:56:48 +01:00
Pavel Emelyanov
8ecd4d73ac test: Update cluster/object_store/ tests to use new S3 config format
Currently the suite generates config in old format, and only a single
test validates that using new format "works".

This change updates the suite (mainly the MinioServer::create_conf()
method) to generate endpoint confit in new format.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28113
2026-01-20 10:53:34 +02:00
Pavel Emelyanov
5e369c0439 audit: Stop using deprecated seastar UDP sending API
The datagram_channel::send() method that sends net::packet-s is
deprecated in favor of using span<temporary_buffer> one. Auditing code
still uses the former one -- it constructs a packet by using formatted
string by copying the string into the packet's fragment, then sends it.
This patch releases string into temporary_buffer and then passes
one-element span to send().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28198
2026-01-20 10:51:23 +02:00
Avi Kivity
874322f95e multishard_query: simplify do_query() coroutine/continuation complexity
do_query() is a coroutine but uses some continuations to take
advantage of exceptions being propagated via future::then() without
being thrown. We can accomplish the same thing with a nested coroutine
and coroutine::try_future(), simplifying the code.

While this area isn't performance intensive, we're not adding allocations.
The coroutine frame may add an allocation, but since read_page()
certainly does not return immediately, the following then() will allocate
as well. Since we eliminated that then(), the change is at least neutral
allocation-wise.

Closes scylladb/scylladb#28258
2026-01-20 10:45:10 +02:00
Łukasz Paszkowski
e07fe2536e test/pylib/util.py: Add retries and additional logging to start_writes()
Consider the following scenario:
1. Let nodes A,B,C form a cluster with RF=3
2. Write query with CL=QUORUM is submitted and is acknowledged by
   nodes B,C
3. Follow-up read query with CL=QUORUM is sent to verify the write
   from the previous step
4. Coordinator sends data/digest requests to the nodes A,B. Since the
   node A is missing data, digest mismatches and data reconciliation
   is triggered
5. The node A or B fails, becomes unavailable, etc
6. During reconciliation, data requests are sent to node A,B and fail
   failing the entire read query

When the above scenario happens, the tests using `start_writes()` fail
with the following stacktrace:
```
...

>           await finish_writes()

test/cluster/test_tablets_migration.py:259:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/pylib/util.py:241: in finish
    await asyncio.gather(*tasks)
test/pylib/util.py:227: in do_writes
    raise e
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

worker_id = 1

...

>                   rows = await cql.run_async(rd_stmt, [pk])
E                   cassandra.ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for test_1767777001181_bmsvk.test - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1}
```

Note that when a node failure happens before/during a read query,
there is no test failure as the speculative retries are enabled
by default. Hence an additional data/digest read is sent to the third
remaining node.

However, the same speculative read is cancelled the moment, the read
query reaches CL which may trigger a read-repair.

This change:
- Retries the verification read in start_writes() on failure to mitigate
  races between reads and node failures
- Adds additional logging to correlate Python exceptions with Scylla logs

Fixes https://github.com/scylladb/scylladb/issues/27478
Fixes https://github.com/scylladb/scylladb/issues/27974
Fixes https://github.com/scylladb/scylladb/issues/27494
Fixes https://github.com/scylladb/scylladb/issues/23529

Note that this change test flakiness observed during tablet transitions.
However, it serves as a workaround for a higher-level issue
https://github.com/scylladb/scylladb/issues/28125

Closes scylladb/scylladb#28140
2026-01-20 10:38:20 +02:00
Piotr Smaron
6084f250ae auth: use paged internal queries during migration
Auth v2 migration uses non-paged queries via `execute_internal` API.
This commit changes it to use `query_internal` instead, which uses
paging under the hood.

Fixes: scylladb/scylladb#27577
2026-01-20 09:32:21 +01:00
Piotr Smaron
345458c2d8 auth: move some code in migrate_to_auth_v2 up
Just move the touched code above so the next commit is more readable.
But this has a drawback: previously, if the returned rows were empty,
this code was not executed, but now is independently of the query
results. This shouldn't be a big deal, though, as auth shouldn't be
empty.
2026-01-20 09:15:53 +01:00
Piotr Smaron
97fe3f2a2c auth: re-align pieces of migrate_to_auth_v2
This is needed to make next commits more readable.
Just moved the touched code 4 spaces to the right.
2026-01-20 09:15:53 +01:00
Piotr Smaron
3ca6b59f80 cql: extend query_internal with query_state param
This later is going to be used to pass a query timeout via `qs` to
`query_internal`.
2026-01-20 09:15:48 +01:00
Nadav Har'El
70b3cd0540 Merge 'vector_index: introduce quantization and oversampling options' from Szymon Malewski
This patch adds vector index options allowing to enable quantization and oversampling.
Specific quantization value will be used internally by vector store.

In the current implementation, get_oversampling allows us to decide how many times more candidates
to retrieve from vector store - final response is still trimmed to the given limit.
It is a first step to allow rescoring - recalculation of similarity metric and re-ranking.
Without rescoring oversampling will be also further optimized to happen internally in vector store.

`test/vector_search/rescoring_test.cc` implements basic tests of added functionality.
New options are documented in `docs/cql/secondary-indexes.rst`.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-82
Ref https://scylladb.atlassian.net/browse/SCYLLADB-83

New feature - no backporting

Closes scylladb/scylladb#27677

* github.com:scylladb/scylladb:
  vector_search: doc: Document new index options
  vector_search: test: Test oversampling
  vector_search: test: Add rescoring index options test
  vector_search: test: Extract Configure utility to shared header
  vector_index: introduce `quantization` and `oversampling` options
2026-01-20 08:50:46 +02:00
Avi Kivity
36347c3ce9 Merge 'db/system_keyspace: remove namespace v3' from Botond Dénes
Cassandra changed their system tables in 3.0. We migrated to the new system table layout in 2017, in ScyllaDB 2.0.
System tables introduced in Cassandra 3.0, as well as the 3.0 variant of pre-existing system tables were added to the db::system_table::v3 namespace.
We ended up adding some new ScyllaDB-only system tables to this namespace as well.

As the dust settled, most of the v3 system tables ended up being either simple aliases to non-v3 tables, or new tables.
Either way, the codebase uses just one variant of each table for a long time now the v3:: distinction is pointless.

Remove the v3 namespace and unify the table listing under the top-level db::system_keyspace scope.

Code cleanup, no backport

Closes scylladb/scylladb#28146

* github.com:scylladb/scylladb:
  db/system_keyspace: move remining tables out of v3 keyspace
  db/system_keyspace: relocate truncated() and commitlog_cleanups()
  db/system_keyspace: drop v3::local()
  db/system_keyspace: remove duplicate table names from v3
2026-01-19 20:54:38 +02:00
Dawid Mędrek
3b8bf85fbc test/lib/boost_test_tree_lister.cc: Record empty test suites
Before this commit, if a test file or a test suite didn't include
any actual test cases, it was ignored by `boost_test_tree_lister`.

However, this information is useful; for example, it allows us to tell
if the test file the user wants to run doesn't exist or simply doesn't
contain any tests. The kind of error we would return to them should be
different depending on which situation we're dealing with.

We start including those empty suites and files in the output of
`--list_json_content`.

---

Examples (with additional formatting):

* Consider the following test file, `test/boost/dummy_test.cc` [1]:

  ```
  BOOST_AUTO_TEST_SUITE(dummy_suite1)
  BOOST_AUTO_TEST_SUITE(dummy_suite2)
  BOOST_AUTO_TEST_SUITE_END()
  BOOST_AUTO_TEST_SUITE_END()

  BOOST_AUTO_TEST_SUITE(dummy_suite3)
  BOOST_AUTO_TEST_SUITE_END()
  ```

  Before this commit:

  ```
  $ ./build/debug/test/boost/dummy_test -- --list_json_content
  [{"file": "test/boost/dummy_test.cc", "content": {"suites": [], "tests": []}}]
  ```

  After this commit:

  ```
  $ ./build/debug/test/boost/dummy_test -- --list_json_content
  [{"file":"test/boost/dummy_test.cc", "content": {"suites": [
    {"name": "dummy_suite1", "suites": [
       {"name": "dummy_suite2", "suites": [], "tests": []}
    ], "tests": []},
    {"name": "dummy_suite3", "suites": [], "tests": []}
  ], "tests": []}}]
  ```

* Consider the same test file as in Example 1, but also assume it's compiled
  into `test/boost/combined_tests`.

  Before this commit:

  ```
  $ ./build/debug/test/boost/combined_tests -- --list_json_content | grep dummy
  $
  ```

  After this commit:

  ```
  $ ./build/debug/test/boost/combined_tests -- --list_json_content
  [..., {"file": "test/boost/dummy_test.cc", "content": {"suites": [
    {"name": "dummy_suite1", "suites":
      [{"name": "dummy_suite2", "suites": [], "tests": []}],
    "tests": []},
    {"name": "dummy_suite3", "suites": [], "tests": []}],
  "tests":[]}}, ...]
  ```

[1] Note that the example is simplified. As of now, it's not possible to use
    `--list_json_content` with a file without any Boost tests. That will
    result in the following error: `Test setup error: test tree is empty`.

Refs scylladb/scylladb#25415
2026-01-19 18:03:24 +01:00
Dawid Mędrek
1129599df8 test/lib/boost_test_tree_lister.cc: Deduplicate labels
In scylladb/scylladb@afde5f668a, we
implemented custom collection of information about Boost tests
in the repository. The solution boiled down to traversing through
the test tree via callbacks provided by Boost.Test and calling that
code from a global fixture. This way, the code is called automatically
by the framework.

Unfortunately, for an unknown reason, this leads to labels of test units
being duplicated. We haven't found the root cause yet and so we
deduplicate the labels manually.

---

Example (with additional formatting):

Consider the following test in the file `test/boost/dummy_test.cc`:

```
SEASTAR_TEST_CASE(dummy_case, *boost::unit_test::label("mylabel1")) {
    return make_ready_future();
}
```

Before this commit:

```
$ ./build/dev/test/boost/dummy_test -- --list_json_content
[{"file": "test/boost/dummy_test.cc", "content": {"suites": [],
  "tests": [{"name": "dummy_case", "labels": "mylabel1,mylabel1"}]}
}]
```

After this commit:

```
$ ./build/dev/test/boost/dummy_test -- --list_json_content
[{"file": "test/boost/dummy_test.cc", "content": {"suites": [],
  "tests": [{"name": "dummy_case", "labels": "mylabel1"}]}
}]
```

Refs scylladb/scylladb#25415
2026-01-19 18:01:14 +01:00
Nikos Dragazis
4cde34f6f2 storage_service: Remove redundant yields
The loops in `ongoing_rf_change()` perform explicit yields, but they
also perform coroutine operations which can yield implicitly. The
explicit yields are redundant.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-19 16:18:49 +02:00
Marcin Maliszkiewicz
1318ff5a0d test: perf: move cut_arg helper func to common code
It will be reused later.
2026-01-19 14:33:10 +01:00
Tomasz Grabiec
7977c97694 Merge 'db: add effective_capacity to load_per_node virtual table' from Ferenc Szili
`effective_capacity` is a value used in size based load balancing. It contains the sum of available disk space of a node and all the tablet sizes.
This change adds this value to the virtual table `system.load_per_node`. This can be useful for debugging size based load balancing.

Size based load balancing is currently only on master, so no backport is needed.

Closes scylladb/scylladb#28220

* github.com:scylladb/scylladb:
  docs: add effective_capacity to system keyspace docs
  virtual_table: add effective_capacity to load_per_node
2026-01-19 13:17:28 +01:00
Marcin Maliszkiewicz
be8a30230b Merge 'test/cluster/dtest: import scrub_test.py' from Botond Dénes
This test has to be adjusted in lock-step with scylladb.git, due to changes in https://github.com/scylladb/scylladb/pull/27836. It is simpler to just take the time and import it, so https://github.com/scylladb/scylladb/pull/27836 can patch all the affected tests, including this one.
All code is imported verbatim, then patched later, such that the series remains bisectable.

dtest import, no backport needed

Closes scylladb/scylladb#28085

* github.com:scylladb/scylladb:
  test/cluster/dtest: remove is_win() and users
  test/cluster/dtest/scrub_test.py: add license blurb
  test/cluster/dtest: import scrub_test.py
  test/cluster/dtest/ccmlib: scylla_node.py: adapt run_scylla_sstable() at al
  test/cluster/dtest/ccmlib: scylla_node.py: import run_scylla_sstable()
2026-01-19 12:14:08 +01:00
Botond Dénes
2e4d0e42f0 test/cluster/dtest: remove is_win() and users
ScyllaDB and its tests never run on windows, this function is not
needed, patch it out.
2026-01-19 12:56:57 +02:00
Botond Dénes
8953a143e5 test/cluster/dtest/scrub_test.py: add license blurb
The original scrub test was done by the Cassandra project, hence there
is two Licenses notices: one for the original work by Cassandra
(2015) and one for our modifications on top (2021).
2026-01-19 12:55:59 +02:00
Botond Dénes
d2c266eb47 test/cluster/dtest: import scrub_test.py
Import the test verbatim. Requires adding is_win() to ccmlib/common.py,
with a dummy implementation.
2026-01-19 12:52:44 +02:00
Botond Dénes
99e8a92aef test/cluster/dtest/ccmlib: scylla_node.py: adapt run_scylla_sstable() at al
To work in the local test.py context.
2026-01-19 12:52:44 +02:00
Botond Dénes
807da53583 test/cluster/dtest/ccmlib: scylla_node.py: import run_scylla_sstable()
And dependencies: get_sstables() and __gather_sstables().
Code is importend verbatim, but doesn't work yet (no users yet either).
Will be patched to work in the next commit.
2026-01-19 12:52:44 +02:00
Botond Dénes
e01041d3ee db/system_keyspace: move remining tables out of v3 keyspace
The last remining tables in the v3 keyspace are those that are genuinely
distinct -- added by Cassandra 3.0 or >= ScyllaDB 2.0.
Move these out of the v3 keyspace too, with this the v3 keyspace is
defunct and removed.
2026-01-19 12:32:21 +02:00
Botond Dénes
ce57ef94bd db/system_keyspace: relocate truncated() and commitlog_cleanups()
The name variables of these tables is outside the v3 namespace but the
method defining their schema is in the v3 namespace. Relocate the
methods out from the v3 namespace, to the scope where the name variables
live.
The methods are moved to the private: part of system_keyspace, as they
don't have external users currently.
2026-01-19 12:32:21 +02:00
Botond Dénes
2ccb8ff666 db/system_keyspace: drop v3::local()
It is unused, the non-v3 variant is used instead.
2026-01-19 12:32:21 +02:00
Botond Dénes
b52a3f3a43 db/system_keyspace: remove duplicate table names from v3
Those table names that are effectively just an alias of the their
counterpart outside of the v3 namespace (struct).

scylla_local() is made public. Currently it is private, but it has
external users, working around the private designation by using the
public v3::scylla_local() alias. This change just makes the existing
status clear.
2026-01-19 12:32:21 +02:00
Karol Nowacki
324b829263 vector_search: doc: Document new index options
Adds documentation for the `quantization` and `oversampling`
options for vector search indexes.
2026-01-19 10:28:46 +01:00
Karol Nowacki
bca17290f4 vector_search: test: Test oversampling
Add test to verify that Scylla correctly oversamples the limit
according to the oversampling option.
2026-01-19 10:28:46 +01:00
Karol Nowacki
e347f6d0d4 vector_search: test: Add rescoring index options test
Add tests to validate quantization and oversampling index options.
2026-01-19 10:28:44 +01:00
Karol Nowacki
24b037e8e3 vector_search: test: Extract Configure utility to shared header
Move Configure test utility to dedicated file for reuse across test suites.
2026-01-19 10:21:44 +01:00
Szymon Malewski
b8e91ee6ae vector_index: introduce quantization and oversampling options
This patch adds vector index options allowing to enable quantization and oversampling.
Specific quantization value will be used internally by vector store.

In the current implementation, `get_oversampling` allows us to decide how many times more candidates
to retrieve from vector store - final response is still trimmed to the given limit.
It is a first step to allow rescoring - recalculation of similarity metric and re-ranking.
Without rescoring oversampling will be also further optimized to happen internally in vector store.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-82
Ref https://scylladb.atlassian.net/browse/SCYLLADB-83
2026-01-19 10:21:43 +01:00
Tomasz Grabiec
dd0fc35c63 lsa: Export metrics for reclaim/evict/compact time
Currently, we only know about long reclaims from lsa-timing stall
reports. Shorter reclaims can go under the radar.

Those metrics will help to asses increase in LSA activity, which
translates to higher CPU cost of a workload.

reclaim tracks memory which goes to the standard allocator, e.g. when
entering and allocating_section or in the background reclaimer.

evict/compact count activity towrads building LSA reserve, in
allocating_section entry, or naked LSA allocation.

Closes scylladb/scylladb#27774
2026-01-19 12:08:16 +03:00
Nadav Har'El
3e270a49f7 test/cqlpy: remove test_describe.py from cluster reuse blacklist
The way that test.py runs test/cqlpy tests requires that tests end their
session with all keyspaces deleted. If we forget to delete a keyspace,
test.py suspects some test fails and reports a failure. As reported in
issue #26291, the test file test/cqlpy/test_describe.py caused this check
to trigger, so this file was added to the blacklist "dirties_cluster"
in suite.yaml to force test.py to ignore this problem.

I believe the cause of the problem was as follows: test_describe.py
didn't really leave any undeleted keyspace. Rather, test_describe.py had
one test which used "USE" and this broke DESC KEYSPACES (Refs #26334) -
which test.py used to see which keyspaces remained.

We solved this problem not just once, but twice:
1. In pull request #26345, I fixed the test not to use "USE" on the main
   CQL session.
2. In pull request #27971, I fixed DESC KEYSPACES implementation so even
   if "USE" was in effect, it will return the correct results.

I checked manually, and after removing test_describe.py from the
dirties_cluster blacklist, all cqlpy tests now pass, without
spurious failures in the test following test_describe.py. So it's time
to remove it from the blacklist.

Fixes #26291

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27973
2026-01-19 12:02:00 +03:00
Dario Mirovic
823d1b9c03 audit: fix start_audit init sequence placement
Commit d54d409 (audit: write out to both table and syslog) unified
create_audit and start_audit, which moved the audit service creation later
in the startup sequence. This broke startup when audit is enabled because
view_builder prepares CQL queries before start_audit runs, and
query preparation calls audit_instance().local_is_initialized()
which crashes on the non-existent sharded service.

Move start_audit to run before view_builder::start() and other components
that may prepare CQL queries during their initialization.

Fixes SCYLLADB-252

Closes scylladb/scylladb#28139
2026-01-19 11:57:39 +03:00
Botond Dénes
6f5f42305a docs: make the glossary more tablet inclusive
Our glossary is stuck in the past, still discussing token ownership in
terms of vnodes and cluster synchronization in terms of gossip.
This patch tries to improve this a bit, although much more work needs to
be done.
The term `Tablet` is added and the definition of `Token` and `Token
Range` is rephrased to be tablet inclusive.
The term `Cluster` is changed to mention raft as the synchronization
mechanism instead of gossip.

One oustanding problem is that our general architecture page describing
the ring acrhitecture is still Vnode only. We have a seprate Tablets
page, but the two don't link to each other and most documentation refer
only to the former. A casual reader might be able to spend a a lot of
time on our documentation page, without even seeing the word: tablets.

Closes scylladb/scylladb#28170
2026-01-19 11:50:13 +03:00
Ernest Zaslavsky
829bd9b598 aws_error: fix nested exception handling
The loop that unwraps nested exception, rethrows nested exception and saves pointer to the temporary std::exception& inner on stack, then continues. This pointer is, thus, pointing to a released temporary

Closes scylladb/scylladb#28143
2026-01-19 11:41:47 +03:00
Botond Dénes
b7bc48e7b7 reader_concurrency_semaphore: improve handling of base resources
reader_permit::release_base_resources() is a soft evict for the permit:
it releases the resources aquired during admission. This is used in
cases where a single process owns multiple permits, creating a risk for
deadlock, like it is the case for repair. In this case,
release_base_resources() acts as a manual eviction mechanism to prevent
permits blockings each other from admission.

Recently we found a bad interaction between release_base_resources() and
permit eviction. Repair uses both mechanism: it marks its permits as
inactive and later it also uses release_base_resources(). This partice
might be worth reconsidering, but the fact remains that there is a bug
in the reader permit which causes the base resources to be released
twice when release_base_resources() is called on an already evicted
permit. This is incorrect and is fixed in this patch.

Improve release_base_resources():
* make _base_resources const
* move signal call into the if (_base_resources_consumed()) { }
* use reader_permit::impl::signal() instead of
  reader_concurrency_semaphore::signal()
* all places where base resources are released now call
  release_base_resources()

A reproducer unit test is added, which fails before and passes after the
fix.

Fixes: #28083

Closes scylladb/scylladb#28155
2026-01-19 11:37:51 +03:00
Nadav Har'El
d86d5b33aa test/cqlpy: translate Cassandra's unit tests for LWT
This is a translation of Cassandra's CQL unit test source file
validation/operations/InsertUpdateIfConditionTest.java into our cqlpy
framework.

This test file checks various LWT conditional updates. After that
file became too big, the Cassandra developers split parts from it -
moving tests for LWT with collections, UDTs, and static columns to
separate test files - which I already translated (pull request #13663).
This patch translates the remaining, main, LWT tests.

Strangely, this test file also has, in the middle of the file, several
tests for conditional schema changes, like CREATE KEYSPACE IF NOT EXISTS,
a feature which has *nothing* to do with LWT so really didn't belong in
this file. But I translated those as well.

These new tests all pass on both ScyllaDB and Cassandra, and have not
uncovered any new bug.

However these tests do demonstrate yet again something that users and
developers of ScyllaDB's LWT must be aware of: Whereas usually
ScyllaDB's goal has been compatiblity with Cassandra's CQL, in LWT
this has *not* been the case: ScyllaDB deviated from Cassandra's
behavior in its LWT implementation in several places. These intentional
deviations were documented in docs/kb/lwt-differences.rst.

Accordingly, the tests here include almost a hundred (!) modificatons
(search for "if is_scylla") to allow the same test to pass on both
ScyllaDB and Cassandra, as well as many comments explaining the types
of differences we're seeing.

Although these deviations from Cassandra compatibility are known and
intentional, it's worth listing here the ones re-discovered by these
new tests:

1. On a successful conditional write, Cassandra returns just true, Scylla
   also returns the old contents of the row.

2. Similarly, in an IF EXISTS write that failed (the row did not exist),
   Cassandra returns just false, Scylla also returns extra null values for
   each and every column of the row.

3. Cassandra allows in "IF v IN (?, ?)" to bind individual values to
   UNSET_VALUE and skips them, Scylla treats this as an error. Refs #13659.

4. When there are static columns, Scylla's LWT response returns the static
   column first, Cassandra returns the modified column first. Since both
   also say which columns they return, neither is more correct than the other,
   a normally users will address specific columns by name, not by position.

5. docs/kb/lwt-differences.rst explains that "the returned result set
   contains an old row for every conditional statement in the batch".
   Beyond this different, actually non-conditional updates in the batch will
   also get a row in Scylla's result. Refs #27955.

6. For batch statement, ScyllaDB allows mixing `IF EXISTS`, `IF NOT EXISTS`,
   and other conditions for the same row. Cassandra doesn't, so checks that
   these combinations are not allowed were commented out.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27961
2026-01-19 09:46:04 +02:00
Botond Dénes
19efd7f6f9 Merge 'The system_replicated_keys should be mark as a system keyspace' from Amnon Heiman
This PR marks system_replicated_keys as a system keyspace.
It was missing when the keyspace was added.

A side effect of that is that metrics that are not supposed to be reported are.
Fixes #27903

Closes scylladb/scylladb#27954

* github.com:scylladb/scylladb:
  distributed_loader: system_replicated_keys as system keyspace
  replicated_key_provider: make KSNAME public
2026-01-19 09:37:41 +02:00
Aleksandra Martyniuk
65cba0c3e7 service: node_ops: remove coroutine::lambda wrappers
In storage_service::raft_topology_cmd_handler we pass a lambda
wrapped in coroutine::lambda to a function that creates streaming_task_impl.
The lambda is kept in streaming_task_impl that invokes it in its run
method.

The lambda captures may be destroyed before the lambda is called, leading
to use after free.

Do not wrap a lambda passed to streaming_task_impl into coroutine::lambda.
Use this auto dissociate the lambda lifetime from the calling statement.

Fixes: https://github.com/scylladb/scylladb/issues/28200.

Closes scylladb/scylladb#28201
2026-01-19 09:19:53 +02:00
Botond Dénes
c8811387e1 Merge 'service: do not change the schema while pausing the rf change ' from Aleksandra Martyniuk
Currently, if a rf change request is paused, it immediately changes
the system_schema.keyspaces to use rack list for this keyspace.
If the request is aborted, the co-location might not be finished.
Hence, we can end up with inconsistent schema and tablet replica state.

Update the system_schema.keyspaces only after the co-location is done (and
not when it's started).

Fixes: https://github.com/scylladb/scylladb/issues/28167

No backport needed; changes that introduced a bug are only on master

Closes scylladb/scylladb#28168

* github.com:scylladb/scylladb:
  service: fin indentation
  test: add test_numeric_rf_to_rack_list_conversion_abort
  service: tasks: fix type of global_topology_request_virtual_task
  service: do not change the schema while pausing the rf change
2026-01-19 09:15:20 +02:00
Botond Dénes
7d637b14e8 erge 'test/cluster/test_internode_compression: Transpose test from dtest' from Calle Wilund
Refs #27429

Re-implement the dtest with same name as a scylla pytest, using a python level network proxy instead of tcpdump etc.
Both to avoid sudo and also to ensure we don't race.

Juggles different listen_address and broadcast_address values to insert a proxy measuring RPC traffic.

Note: the measuring relies on python network IO not splitting data chunks, since we don't really have packet-level view of the connections.

Note that a scylla change is required to make the ip address magic work, otherwise topology mechanism gets
confused. This should maybe at some point be looked into more, since we should be more resilient against various services in scylla binding to different addresses.

When this test is merged, we can drop the flaky test from dtest. And hope no new flakiness comes from this one...

Closes scylladb/scylladb#28133

* github.com:scylladb/scylladb:
  test/cluster/test_internode_compression: Transpose test from dtest
  gossiper/main: Extend special treatment of node ID resolve for rpc_address
2026-01-19 08:34:31 +02:00
Ferenc Szili
1136a3f398 docs: add effective_capacity to system keyspace docs
This adds the description of effective_capacity to the documentation
of the system keyspace.
2026-01-18 16:57:08 +01:00
Ferenc Szili
3e0362ec67 virtual_table: add effective_capacity to load_per_node
This change adds effective_capacity to the virtual table load_per_node.
This value can be useful for debugging size based load balancing.
2026-01-18 16:52:13 +01:00
Nadav Har'El
3e138a2685 test/cqlpy: Add our copyright/license to translated Cassandra tests
All the tests under test/cqlpy/cassandra_tests/ were translated from
Cassandra's unit tests originally written in Java into our own test
framework, and accordinly carry a clear mention of their origin and
original license.

However, we did modify these original tests - even if the modification
was slight and mostly straightforward. Therefore I was asked to also
mention our own copyright (and license) for these modifications.

So this patch adds to every file in test/cqlpy/cassandra_tests/ text like:

   # Modifications: Copyright 2026-present ScyllaDB
   # SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0

with the appropriate year instead of 2026.

Fixes #28215

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28216
2026-01-18 16:25:28 +01:00
Tomasz Grabiec
3b0df29ceb docs: Document parallel decommission and removenode and relevant task API 2026-01-18 15:36:08 +01:00
Tomasz Grabiec
85140cdf7e test: Add tests for parallel decommission/removenode 2026-01-18 15:36:08 +01:00
Tomasz Grabiec
5c93e12373 test: util: Introduce ensure_group0_leader_on()
Many tests want to assume that group0 leader runs on a particualr
server, typically the first server in the list.

And they cannot be easily made to work with arbitrary leader, becuase
they setup a particular topology and then stop particular nodes, and
want to assume the leader is stable. They open leader's log and
expect things to appear in that log.

It's much easier to ensure the leader, than to prepare tests to
handle failovers.
2026-01-18 15:36:07 +01:00
Tomasz Grabiec
478b8f09df test: tablets: Check that there are no migrations scheduled on draining nodes
In case of decommission, it's not desirable because it's less urgent.

In case of removenode, it leads to failure of removenode operation
because scheduled co-locating migration will fail if the destination
is on the excluded node, and this failure will be interpreted as drain
failure and coordinator will cancel the request.

Not a problem before "parallel decommission" because this failure is
only a streaming failure, not a barrier failure, so exception doesn't
escape into the catch clause in transition stage handler, and the
migration is simply rolled back. Once draining happens in the tablet
migration track, streaming failure will be interpreted as drain
failure and cancel the request.
2026-01-18 15:36:07 +01:00
Tomasz Grabiec
e082e32cc7 test: lib: topology_builder: Introduce add_draining_request() 2026-01-18 15:36:07 +01:00
Tomasz Grabiec
baea12c9cb topology_coordinator, tablets: Fail draining operations when tablet migration fails due to critical disk utilization
Reaching critical disk utilization on destination means the draining
either caused it, or at least works against reliveing it. So it's
better to cancel those requests. In case of decommission, if critical
disk utilization was caused by it due to not enough capacity, aborting
decomission will bring capacity back to the system and rebalancing
will relieve critical disk utlization.
2026-01-18 15:36:07 +01:00
Tomasz Grabiec
1b784e98f3 tablets: topology_coordinator: Refactor to propagate reason for migration rollback
Will be easier to implement later change to cancel topology request,
where we need to give a reason for doing so.
2026-01-18 15:36:07 +01:00
Tomasz Grabiec
2d954f4b19 tablet_allocator: Skip co-location on draining nodes
In case of decommission, it's not desirable because it's less
urgent.

In case of removenode, it leads to failure of removenode operation
because scheduled co-locating migration will fail if the destination
is on the excluded node, and this failure will be interpreted as drain
failure and coordinator will cancel the request.

Not a problem before "parallel decommission" because this failure is
only a streaming failure, not a barrier failure, so exception doesn't
escape into the catch clause in transition stage handler, and the
migration is simply rolled back. Once draining happens in the tablet
migration track, streaming failure will be interpreted as drain
failure and cancel the request.
2026-01-18 15:36:06 +01:00
Tomasz Grabiec
d9e1a6006f node_ops: task_manager_module: Populate entity field also for active requests 2026-01-18 15:36:06 +01:00
Tomasz Grabiec
bbd293d440 tasks: node_ops: Put node id in the entity field
If we have many node requests active at a time, it's useful to know
which requets works on which node.

Fixes #27208
2026-01-18 15:36:06 +01:00
Tomasz Grabiec
576ebcdd30 tasks, node_ops: Unify setting of task_stats in get_status() and get_stats()
They should return the same, so extract the common logic.
2026-01-18 15:36:05 +01:00
Tomasz Grabiec
629d6d98fa topology: Protect against empty cancelation reason
Request would be deemed successful, which is counter to the intention
of cancelation and effect on the system.
2026-01-18 15:36:05 +01:00
Tomasz Grabiec
7446eb7e8d tasks, topology: Make pending node operations abortable
We want to be able to cancel decommission when it's still in the
tablet draining phase. Such a request is in a pending and paused
state, and can be safely canceled. We set the node's "draining" flag
back to false.
2026-01-18 15:36:05 +01:00
Tomasz Grabiec
091ed4d54b doc: topology-over-raft.md: Fix diagram for replacing, tablet_draining is not engaged
Since 288e75fe22
2026-01-18 15:36:05 +01:00
Tomasz Grabiec
a009644c7d raft_topology, tablets: Drain tablets in parallel with other topology operations
Allows other topology operations to execute while tablets are being
drained on decommission. In particular, bootstrap on scale-out. This
is important for elasticity.

Allows multiple decommission/removenode to happen in parallel, which
is important for efficiency.

Flow of decommission/removenode request:
  1) pending and paused, has tablet replicas on target node.
     Tablet scheduler will start draining tablets.
  2) No tablets on target node, request is pending but not paused
  3) Request is scheduled, node is in transition
  4) Request is done

Nodes are considered draining as soon as there is a leave or remove
request on them. If there are tablet replicas present on the target
node, the request is in a paused state and will not be picked by
topology coordinator. The paused state is computed from topology state
automatically on reload.

When request is not paused, its execution starts in
write_both_read_old state. The old tablet_draining state is not
entered (it's deprecated now).

Tablet load balancing will yield the state machine as soon as some
request is no longer paused and ready to be scheduled, based on
standard preemption mechanics.

The test case test_explicit_tablet_movement_during_decommission is
removed. It verifies that tablet move API works during tablet draining
transition. After this PR, we no longer enter this transition, so the
test doesn't work. It loses its purpose, because movement during
normal tablet balancing is not special and tested elsewhere.
2026-01-18 15:36:05 +01:00
Tomasz Grabiec
e38ee160fc virtual_tables: Show draining and excluded fields in system.cluster_status and system.load_by_node
It gives a more accurate picture of what happens in the cluster.
2026-01-18 15:36:04 +01:00
Tomasz Grabiec
1c2e47e059 locator: topology: Add "draining" flag to a node
They are being drained of tablet replicas, tablet scheduler works to
move replicas away from such nodes. This state is set at the
beginning of decommission and removenode operations.
2026-01-18 15:36:04 +01:00
Tomasz Grabiec
a37b1ce832 topology_coordinator: Extract generate_cancel_request_update() 2026-01-18 15:36:04 +01:00
Tomasz Grabiec
77bd00bf9f storage_service: Drop dependency in topology_state_machine.hh in the header
To reduce compilation time.
2026-01-18 15:36:04 +01:00
Tomasz Grabiec
a24c3fc229 locator: Extract common code in assert_rf_rack_valid_keyspace() 2026-01-18 15:36:04 +01:00
Tomasz Grabiec
d3ee82ea51 topology_coordinator, storage_service: Validate node removal/decommission at request submission time
After parallel tablet draining, the validation at the time request
starts executing is too late, tablets will be already drained.

This trips tests which expect validation failure, but get tablet
draining failure instead.

Also, in case of decommission, it's a waste to go through draining
only to discover that the operation has to be rolled back due to
validation.

So avoid submitting a request altogether if it's invalid.
The validation at request execution start remains, for extra sefety.

validate_removing_node() was extracted out of topology_coordinator,
so that it can be called by storage_service on non-coordinator.

Some tests need adjusting for the fact that after failed removenode
the node may still not be marked as excluded, so we need to explicitly
exclude it or add to the list of ignored nodes in the next removenode
operation.
2026-01-18 15:36:04 +01:00
Nadav Har'El
34d28475d9 Merge 'Implement Vector Search filtering API' from Dawid Pawlik
Since Vector Store service filtering API has been implemented (scylladb/vector-store#334), there is a need for the implementation of Scylla side part.
This patch should implement a `statement_restrictions` parsing into Vector Store filtering API compatible JSON objects.
Those objects should be added to ANN query vector POST requests as `filter` object.

After this patch, the subset of all operations ([Vector Search Filtering Milestone 1](https://scylladb.atlassian.net/wiki/spaces/RND/pages/156729450/Vector+Search+Filtering+Design+Document#Milestone-1)) happy path should be completed, allowing users to filter on primary key columns with single column `=` and `IN` or multiple column `()=()` and `() IN ()`.
The restrictions for other operations should be implemented in a PR on Vector Store service side.

---

This PR implements parsing the `statement_restrictions` into Vector Store filtering API compatible JSON objects.
The JSON objects are created and used in ANN vector queries with filtering.
It closes the Scylla side implementation of Vector Search filtering milestone 1.

Unit tests for `statement_restrictions` parsing are added. Integration tests will be added on Vector Store service side PR.

---

Fixes: SCYLLADB-249

New feature, should land into 2026.1

Closes scylladb/scylladb#28109

* github.com:scylladb/scylladb:
  docs: update documentation on filtering with vector queries
  test/vector_search: add test for filtered ANN with VS mock
  test/vector_search: add restriction to JSON conversion unit tests
  vector_search: cql: construct and use filter in ANN vector queries
  select_statement: do not require post query ordering for vector queries
  vector_search: add `statement_restrictions` to JSON parsing
2026-01-18 16:11:29 +02:00
Ernest Zaslavsky
eb76858369 Update seastar submodule
seastar dd46b6f..e00f1513
```
e00f1513 Merge 'net: Add DNS TTL to the net::hostent' from Ernest Zaslavsky
8a69e1f4 net: extract common implementation of inet_address::find_all
cb469fd1 net: deprecate the addr_list in hostent
1d59c0ca net: expose DNS TTL via net::hostent
3c6d919f http: add virtual close() to connection_factory
bbd0001a Revert "net: expose DNS TTL via net::hostent"
```

Closes scylladb/scylladb#28147
2026-01-18 15:00:48 +02:00
Andrzej Jackowski
6eca7e4ff6 transport: unify lambda capture lifetime for control connections
Workload prioritization was added in scylladb/scylladb#22031.
The functionality of updating service levels was implemented as
a lambda coroutine, leaving room for the lambda coroutine fiasco.

The problem was noticed and addressed in scylladb/scylladb#26404.
There are currently three functions that call switch_tenant:
 - update_user_scheduling_group_v1 and update_user_scheduling_group_v2
   use the deducing this (this auto self) to ensure the proper
   lifecycle of the lambda capture.
 - update_control_connection_scheduling_group doesn’t use the deducing
   this, but the lambda captures only `this`, which is used before
   the first possible coroutine preemption. Therefore, it doesn’t seem
   that any memory corruption or undefined behavior is possible here.

Nevertheless, it seems better to start using the deducing this in
update_control_connection_scheduling_group as well, to avoid problems
in the future if someone modifies the code and forgets to add it.

Fixes: SCYLLADB-284

Closes scylladb/scylladb#28158
2026-01-17 20:36:31 +02:00
Nikos Dragazis
8aca7b0eb9 test: database_test: Fix serialization of partition key
The `make_key` lambda erroneously allocates a fixed 8-byte buffer
(`sizeof(s.size())`) for variable-length strings, potentially causing
uninitialized bytes to be included. If such bytes exist and they are
not valid UTF-8 characters, deserialization fails:

```
ERROR 2026-01-16 08:18:26,062 [shard 0:main] testlog - snapshot_list_contains_dropped_tables: cql env callback failed, error: exceptions::invalid_request_exception (Exception while binding column p1: marshaling error: Validation failed - non-UTF8 character in a UTF8 string, at byte offset 7)
```

Fixes #28195.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#28197
2026-01-17 20:32:06 +02:00
Botond Dénes
1e09a34686 replica: add abort polling to memtable and cache readers
Continuing the read once it is aborted (e.g. due to timeout) is a waste
of resources, as the produced results will be discarded.
Poll the permit's abort exception in the memtable and cache reader's
fill_buffer(). This results in one poll per buffer filled (8KB of data).

We already have similar poll for sstable readers, as disk reads are
usually much heavier and therefore it is more important to stop them
ASAP after abort. Cache and memtable reads are usually quick but not
always, hence it is important to also have polling in the cache and
memtable readers.

Refs: #11469
Fixes: #28148

Closes scylladb/scylladb#28149
2026-01-16 18:03:04 +01:00
Ferenc Szili
0aebc17c4c docs: correct spelling errors in size based balancing docs
0ede8d154b introduced the dev doc for size
based load balancing, but also added spelling errors.
This PR fixed these errors.

Closes scylladb/scylladb#28196
2026-01-16 17:41:57 +02:00
Aleksandra Martyniuk
ad2381923f service: fin indentation 2026-01-16 11:38:10 +01:00
Aleksandra Martyniuk
504290902c test: add test_numeric_rf_to_rack_list_conversion_abort
Add regression test that checks whether aborted rf change leaves
the system_schema.keyspaces unchanged.
2026-01-16 11:36:21 +01:00
Aleksandra Martyniuk
3ed8701301 service: tasks: fix type of global_topology_request_virtual_task
Currently, the type of global_topology_request_virtual_task isn't
taken out of std::variant before printing, which results with
a task of type variant(actual_type).

Retrieve the type from the variant before passing it to task type.
2026-01-16 11:36:21 +01:00
Aleksandra Martyniuk
580dfd63e5 service: do not change the schema while pausing the rf change
Currently, if a rf change request is paused, it immediately changes
the system_schema.keyspaces to use rack list for this keyspace.
If the request is aborted, the co-location might not be finished.
Hence, we can end up with inconsistent schema and tablet replica state.

Update the system_schema.keyspaces only after the co-location is done (and
not when it's started).
2026-01-16 11:36:15 +01:00
Patryk Jędrzejczak
eb7be9010d Merge 'topology_coordinator: Refresh load stats after table is created or altered' from Tomasz Grabiec
We switched to the size-based load balancing, which now has more
strict requirements for load stats. We no longer need only per-node
stats, but also per-tablet stats.

Bootstrapping a node triggers stats refresh, but allocating tablets on
table creation didn't. So after creating a table, load balancer
couldn't make progress for up to 60s (stats refresh period).

This makes tests take longer, and can even cause failures if tests are
using a low-enough timeout.

Fixes https://github.com/scylladb/scylladb/issues/27921

No backport becuse only master is vulnerable (size-based load balancing).

Closes scylladb/scylladb#27926

* https://github.com/scylladb/scylladb:
  test: cluster: Add reproducer for missed notification in topology coordinator
  topology_coordinator: Wake up the state machine after stats refresh
  topology_coordinator: Move tablet_load_stats_refresh_before_rebalancing injection earlier
  topology_coordinator: Fix potential missed notification
  topology_coordinator: Refresh load stats after table is created or altered
  tablets: Do a group0 read barrier on tablet load stats refresh
  topology_coordinator: Ensure stats are refreshed in the gossip scheduling group
  test: Use ManagerClient.{disable,enable}_tablet_balancing()
  test: Add missing calls to disable_tablet_balancing() in tests which use move_tablet() API
  test: pylib: Introduce ManagerClient.{disable,enable}_tablet_balancing()
2026-01-16 11:34:57 +01:00
Dawid Pawlik
383f9e6e56 docs: update documentation on filtering with vector queries
Add a description of available filtering options with ANN vector queries.
Provide an example of such query and a reference to `WHERE` clause restrictions.
2026-01-16 11:18:23 +01:00
Dawid Pawlik
67d3454d2b test/vector_search: add test for filtered ANN with VS mock
Implement a test using Vector Store mock to check if end-to-end
integration works with filtered ANN query.
2026-01-16 11:18:23 +01:00
Dawid Pawlik
a54be82536 test/vector_search: add restriction to JSON conversion unit tests
Add unit tests for conversion of CQL restrictions to Vector Store filtering API
compatible JSON objects. The tests include:
- empty restriction
- `ALLOW FILTERING` in restriction
- single column restrictions
    - `=`, `<`, `>`, `<=`, `>=`, `IN`
- multiple column restrictions
    - `()=()`, `()<()`, `()>()`, `()<=()`, `()>=()`, `() IN ()`
- multiple restrictions conjunction
- `TEXT` and `BOOLEAN` column restrictions
2026-01-16 11:18:23 +01:00
Dawid Pawlik
2a38794b8e vector_search: cql: construct and use filter in ANN vector queries
Add `filter` option in `ann()` function to write the filter JSON
object as the POST request in ANN vector queries.

Adjust existing `vector_store_client_test` tests accordingly.
2026-01-16 11:18:23 +01:00
Dawid Pawlik
304e908e3b select_statement: do not require post query ordering for vector queries
As there is only one `ORDER BY` clause with `ANN OF` ordering supported
in ANN vector queries, there is no need to require post query ordering
for the ANN vector queries. The standard ordering is not allowed here.
In fact the ordering is done on the Vector Store service side within
the ANN search, so that the returned primary keys are already sorted
accordingly.

If left unchanged, the filtering with `IN` clauses would cause
a `bad_function_call` server error as the filtering with `IN` clauses
require the post query ordering in a standard case.
2026-01-16 11:18:23 +01:00
Dawid Pawlik
a84d1361db vector_search: add statement_restrictions to JSON parsing
Add a module parsing the statement restrictions into Vector Store
filtering API compatible JSON objects.

The API was defined in: scylladb/vector-store#334

Examplary JSON object compatible with the API:
```
{
 "restrictions": [
     { "type": "==", "lhs": "pk", "rhs": 1 },
     { "type": "IN", "lhs": "pk", "rhs": [2, 3] },
     { "type": "<", "lhs": "ck", "rhs": 4 },
     { "type": "<=", "lhs": "ck", "rhs": 5 },
     { "type": ">", "lhs": "pk", "rhs": 6 },
     { "type": ">=", "lhs": "pk", "rhs": 7 },
     { "type": "()==()", "lhs": ["pk", "ck"], "rhs": [10, 20] },
     { "type": "()IN()", "lhs": ["pk", "ck"], "rhs": [[100, 200], [300, 400]] },
     { "type": "()<()", "lhs": ["pk", "ck"], "rhs": [30, 40] },
     { "type": "()<=()", "lhs": ["pk", "ck"], "rhs": [50, 60] },
     { "type": "()>()", "lhs": ["pk", "ck"], "rhs": [70, 80] },
     { "type": "()>=()", "lhs": ["pk", "ck"], "rhs": [90, 0] }
 ],
 "allow_filtering": true
}
```
2026-01-16 11:18:23 +01:00
Tomasz Grabiec
3fb7719277 topology_coordinator: Update load stats in case rebuilding with no live replica
Such rebuild has no read_from replica, but we know the tablet size will be 0.
If we don't, stats will be incomplete until the next refresh.

This is important for test cases which do removenode or replace while
all replicas are down. So for example test_replace from
test_tablets_removenode.py, which uses RF=1 and replaces a node.

Without this, the test waits for 60s needlessly after the first round
of rebuilding migrations before scheduling more migrations. This can
cause the test to time out.

Fixes #28115

Closes scylladb/scylladb#28121
2026-01-16 11:19:01 +02:00
Sergey Zolotukhin
799d837295 test: disable test_start_bootstrapped_with_invalid_seed
The test intermittently fails when an invalid DNS name is resolved,
likely due to ISP DNS error hijacking (see scylladb/scylladb#28153).

Disable this test to unblock CI.

Fixes scylladb/scylladb#28153

Closes scylladb/scylladb#28162
2026-01-15 10:25:45 +01:00
Jenkins Promoter
51d61f809e Update pgo profiles - aarch64 2026-01-15 05:13:03 +02:00
Jenkins Promoter
eed1e7fa23 Update pgo profiles - x86_64 2026-01-15 04:33:43 +02:00
Tomasz Grabiec
eef798d84f Merge 'Distribute data evenly among primary replicas during restore' from Robert Bindar
Most likely 817fdad uncovered the fact that our choice of primary replica was resonating with tablet allocation and we were ending up picking the same replica as primary within a scope instead of rotating primaryship among all replicas in the scope.
This created situations where for instance, restoring into a 9 nodes with primary_replica_only=true would put all data into 3 nodes, leaving the other 6 unused. The balancing of the dataset was performed by the subsequent repair step.

This PR fixes this by changing the formula for picking up the primary replica out of a set of eligible replicas from within the passed scope.
The PR also extends the testing scenarios in `test_backup.py` so we get to run restore for a set of topologies, for all combinations of scope, primary_replica_only and min_tablet_counts.
Most of the work was done by @bhalevy [here](https://github.com/scylladb/scylladb/compare/master...bhalevy:scylla:load-balance-primary-replica), this PR just splitted it and did touchups here and there.

Fixes #27281

Closes scylladb/scylladb#27397

* github.com:scylladb/scylladb:
  test: reduce dataset and number of test cases or debug builds
  test: bump repair timeout up, it's sometimes not enough in CI
  test: refactor test_refresh.py to match test_restore_with_streaming_scopes.
  test: extend test_restore_with_streaming_scopes
  test: Adjust test_restore_primary_replica_different_dc_scope_all
  test: Refactor restoring code in test_backup to match SM pattern
  test: add check_mutation_replicas calls after fresh creation of dataset
  test: extend create_dataset to accept consistency_level
  test: refactor check_mutation_replicas so it's more readable
  test: make create_dataset async and refactor so it's configurable
  test: use defaultdict in collect_mutations
  test: add log marks to facilitate reusing server for restore
  locator: tablets: Distribute data evenly among primary replicas during restore
2026-01-14 18:57:55 +01:00
Avi Kivity
bd08b6e5b2 Merge 'Unify configuration of object storage endpoints (take 2)' from Pavel Emelyanov
To configure S3 storage, one needs to do

```
object_storage_endpoints:
  - name: s3.us-east-1.amazonaws.com
    port: 443
    https: true
    aws_region: us-east-1
```

and for GCS it's

```
object_storage_endpoints:
  - name: https://storage.googleapis.com:433
    type: gs
    credentials_file: <gcp account credentials json file>
```

This PR updates the S3 part to look like

```
object_storage_endpoints:
  - name: https://s3.us-east-1.amazonaws.com:443
    aws_region: us-east-1
```

fixes: #26570

This is 2nd attempt, previous one (#27360) was reverted because it reported endpoint configs in new format via API and CQL always, even if the endpoint was configured in the old way. This "broke" scylla manager and some dtests. This version has this bug fixed, and endpoints are reported in the same format as they were configured with.

About correctness of the changes.

No modifications to existing tests are made here, so old format is respected correctly (as far as it's covered by tests). To prove the new format works the the test_get_object_store_endpoints is extended to validate both options. Some preparations to this test to make this happen come on their own with the PR #28111  to show that they are valid and pass before changing the core code.

Enhancing the way configuration is made, likely no need to backport.

Closes scylladb/scylladb#28112

* github.com:scylladb/scylladb:
  test: Validate S3 endpoints new format works
  docs: Update docs according to new endpoints config option format
  object_storage: Create s3 client with "extended" endpoint name
  s3/storage: Tune config updating
  sstable: Shuffle args for s3_client_wrapper
  test: Rename badconf variable into objconf
  test: Split the object_store/test_get_object_store_endpoints test
2026-01-14 18:29:03 +02:00
Yaniv Michael Kaul
d919aacc69 storage_proxy: mark write_timeouts metric for counter write timeouts
When a counter write times out (due to rpc::timeout_error or timed_out_error),
the code was throwing mutation_write_timeout_exception but not marking the
write_timeouts metric. This resulted in counter write timeouts not being
counted in the scylla_storage_proxy_coordinator_write_timeouts metric.

Regular writes go through mutate_internal -> mutate_end, which catches
mutation_write_timeout_exception and marks the metric. However, counter
writes use a separate code path (mutate_counters) that has its own
exception handling but was missing the metric update.

This fix adds get_stats().write_timeouts.mark() before throwing the
timeout exception in the counter write path, consistent with how the
CAS path handles cas_write_timeouts.

Refs: https://scylladb.atlassian.net/browse/SCYLLADB-245

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#28019
2026-01-14 17:50:46 +02:00
Gleb Natapov
bee5f63cb6 topology coordinator: complete pending operation for a replaced node
A replaced node may have pending operation on it. The replace operation
will move the node into the 'left' state and the request will never be
completed. More over the code does not expect left node to have a
request. It will try to process the request and will crash because the
node for the request will not be found.

The patch checks is the replaced node has peening request and completes
it with failure. It also changes topology loading code to skip requests
for nodes that are in a left state. This is not strictly needed, but
makes the code more robust.

Fixes #27990

Closes scylladb/scylladb#28009
2026-01-14 13:11:27 +01:00
Botond Dénes
551eecab63 Merge 'EAR: deprecate the replicated key provider' from Calle Wilund
Refs #22733.

Adds runtime warning and docs info that replicated provider is deprecated and will be removed.

Fixes #27292

Closes scylladb/scylladb#27270

* github.com:scylladb/scylladb:
  docs::encryption: Add warning that replicated provider is deprecated
  ent::encryption: Switch default key provider from replicated to local
  replicated_key_provider: Add deprecation warning on usage
2026-01-14 13:47:23 +02:00
Calle Wilund
2fd6ca4c46 test/cluster/test_internode_compression: Transpose test from dtest
Refs #27429

re-implement the dtest with same name as a scylla pytest, using
a python level network proxy instead of tcpdump etc. Both to avoid
sudo and also to ensure we don't race.

v2:
* Included positive test (mode=all)
2026-01-14 10:53:34 +01:00
Patryk Jędrzejczak
6b5923c64e test: test_group0_schema_versioning: wait for schema sync in system.local
`test_schema_versioning_with_recovery` is currently flaky. It performs
a write with CL=ALL and then checks if the schema version is the same on
all nodes by calling `verify_table_versions_synced`. All nodes are expected
to sync their schema before handling the replica write. The node in
RECOVERY mode should do it through a schema pull, and other nodes should do
it through a group 0 read barrier.

The problem is in `verify_local_schema_versions_synced` that compares the
schema versions in `system.local`. The node in RECOVERY mode updates the
schema version in `system.local` after it acknowledges the replica write
as completed. Hence, the check can fail.

We fix the problem by making the function wait until the schema versions
match.

Note that RECOVERY mode is about to be retired together with the whole
gossip-based topology in 2026.2. So, this test is about to be deleted.
However, we still want to fix it, so that it doesn't bother us in older
branches.

Fixes #23803

Closes scylladb/scylladb#28114
2026-01-14 09:55:45 +01:00
Jakub Smolar
aefd815194 test.py: add pexpect to the dependencies
Use pexpect to control a presistent GDB process with pattern reads and timeouts. This makes 'scylla_gdb' tests faster and less flaky.
Added python3-pexpect in 'install-dependencies.sh'.

Closes scylladb/scylladb#26419

[avi:
  build optimized clang 21.1.8
  regenerated frozen toolchain with optimized clang from

    https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-aarch64.tar.gz
    https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-x86_64.tar.gz
]

Closes scylladb/scylladb#28134
2026-01-14 10:17:37 +02:00
Botond Dénes
122b7847e5 Merge 'index: Accept view properties in CREATE INDEX' from Dawid Mędrek
Problem
-------
Secondary indexes are implemented via materialized views under the
hood. The way an index behaves is determined by the configuration
of the view. Currently, it can be modified by performing the CQL
statement `ALTER MATERIALIZED VIEW` on it. However, that raises some
concerns.

Consider, for instance, the following scenario:

1. The user creates a secondary index on a table.
2. In parallel, the user performs writes to the base table.
3. The user modifies the underlying materialized view, e.g. by setting
   the `synchronous_updates` to `true` [1].

Some of the writes that happened before step 3 used the default value
of the property (which is `false`). That had an actual consequence
on what happened later on: the view updates were performed
asynchronously. Only after step 3 had finished did it change.

Unfortunately, as of now, there is no way to avoid a situation like
that. Whenever the user wants to configure a secondary index they're
creating, they need to do it in another schema change. Since it's
not always possible to control how the database is manipulated in
the meantime, it leads to problems like the one described.

That's not all, though. The fact that it's not possible to configure
secondary indexes is inconsistent with other schema entities. When
it comes to tables or materialized views, the user always have a means
to set some or even all of the properties during their creation.

Solution
--------
The solution to this problem is extending the `CREATE INDEX` CQL
statement by view properties. The syntax is of form:

```
> CREATE INDEX <index name>
> .. ON <keyspace>.<table> (<columns>)
> .. WITH <properties>
```

where `<properties>` corresponds to both index-specific and view
properties [2, 3]. View properties can only be used with indexes
implemented with materialized views; for example, it will be impossible
to create a vector index when specifying any view property (see
examples below).

When a view property is provided, it will be applied when creating the
underlying materialized view. The behavior should be similar to how
other CQL statements responsible for creating schema entities work.

High-level implementation strategy
----------------------------------
1. Make auxiliary changes.
2. Introduce data structures representing the new set of index
   properties: both index-specific and those corresponding to the
   underlying view.
3. Extend `CREATE INDEX` to accept view properties.
4. Extend `DESCRIBE INDEX` and other `DESCRIBE` statements to include
   view properties in their output.

User documentation is also updated at the steps to reflect the
corresponding changes.

Implementation considerations
-----------------------------
There are a number of schema properties that are now obsolete. They're
accepted by other CQL statements, but they have no effect. They
include:

* `index_interval`
* `replicate_on_write`
* `populate_io_cache_on_flush`
* `read_repair_chance`
* `dclocal_read_repair_chance`

If the user tries to create a secondary index specifying any of those
keywords, the statement will fail with an appropriate error (see
examples below).

Unlike materialized views, we forbid specifying the clustering order
when creating a secondary index [4]. This limitation may be lifted
later on, but it's a detail that may or may not prove troublesome. It's
better to postpone covering it to when we have a better perspective on
the consequences it would bring.

Examples
--------
Good examples
```
> CREATE INDEX idx ON ks.t (v);
> CREATE INDEX idx ON ks.t (v) WITH comment = 'ok view property';
> CREATE INDEX idx ON ks.t (v)
  .. WITH comment = 'multiple view properties are ok'
  .. AND synchronous_updates = true;
> CREATE INDEX idx ON ks.t (v)
  .. WITH comment = 'default value ok'
  .. AND synchronous_updates = false;
```

Bad examples
```
> CREATE INDEX idx ON ks.t (v) WITH replicate_on_write = true;

SyntaxException: Unknown property 'replicate_on_write'

> CREATE INDEX idx ON ks.t (v)
  .. WITH OPTIONS = {'option1': 'value1'}
  .. AND comment = 'some text';

InvalidRequest: Error from server: code=2200 [Invalid query]
  message="Cannot specify options for a non-CUSTOM index"

> CREATE CUSTOM INDEX idx ON ks.t (v)
  .. WITH OPTIONS = {'option1': 'value1'}
  .. AND comment = 'some text';

InvalidRequest: Error from server: code=2200 [Invalid query]
  message="CUSTOM index requires specifying the index class"

> CREATE CUSTOM INDEX idx ON ks.t (v)
  .. USING 'vector_index'
  .. WITH OPTIONS = {'option1': 'value1'}
  .. AND comment = 'some text';

InvalidRequest: Error from server: code=2200 [Invalid query]
  message="You cannot use view properties with a vector index"

> CREATE INDEX idx ON ks.t (v) WITH CLUSTERING ORDER BY (v ASC);

InvalidRequest: Error from server: code=2200 [Invalid query]
  message="Indexes do not allow for specifying the clustering order"
```

and so on. For more examples, see the relevant tests.

References:
[1] https://docs.scylladb.com/manual/branch-2025.4/cql/cql-extensions.html#synchronous-materialized-views
[2] https://docs.scylladb.com/manual/branch-2025.4/cql/secondary-indexes.html#create-index
[3] https://docs.scylladb.com/manual/branch-2025.4/cql/mv.html#mv-options
[4] https://docs.scylladb.com/manual/branch-2025.4/cql/dml/select.html#ordering-clause

Fixes scylladb/scylladb#16454

Backport: not needed. This is an enhancement.

Closes scylladb/scylladb#24977

* github.com:scylladb/scylladb:
  cql3: Extend DESC INDEX by view properties
  cql3: Forbid using CLUSTERING ORDER BY when creating index
  cql3: Extend CREATE INDEX by MV properties
  cql3/statements/create_index_statement: Allow for view options
  cql3/statements/create_index_statement: Rename member
  cql3/statements/index_prop_defs: Re-introduce index_prop_defs
  cql3/statements/property_definitions: Add extract_property()
  cql3/statements/index_prop_defs.cc: Add namespace
  cql3/statements/index_prop_defs.hh: Rename type
  cql3/statements/view_prop_defs.cc: Move validation logic into file
  cql3/statements: Introduce view_prop_defs.{hh,cc}
  cql3/statements/create_view_statement.cc: Move validation of ID
  schema/schema.hh: Do not include index_prop_defs.hh
2026-01-14 09:54:27 +02:00
Pavel Emelyanov
e57ee84662 util: Re-use seastar::util::memory_data_sink
A data_sink that stores buffers into an in-memory collection had
appeared in seastar recently. In Scylla there's similar thing that uses
memory_data_sink_buffer as a container, so it's possible to drop the
data_sink_impl iself in favor of seastar implementation.

For that to work there should be append_buffers() overload for the
aforementioned container. For its nice implementation the container, in
turn, needs to get push_back() method and value_type trait. The method
already exists, but is called put(), so just rename it. There's one more
user of it this method in S3 client, and it can enjoy the added
append_buffers() helper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28124
2026-01-14 08:54:00 +02:00
Avi Kivity
3fc5a32136 tools: toolchain: update instructions for building optimized clang with version information
The instructions for building optimized clang neglected to mention
that the clang version to be built must be specified. Correct that.

Closes scylladb/scylladb#28135
2026-01-14 06:46:20 +02:00
Botond Dénes
6e17bf5c1a tools/scylla-nodetool: migrate to std::localtime
fmt::localtime() is now deprecated, users should migrate to equivalents
from the standard libraries.
std::localtime is not thread safe, so a local wrapper is introduced,
based on the thread-safe localtime_r() (from libc).

Closes scylladb/scylladb#27821
2026-01-13 20:46:31 +02:00
Nikos Dragazis
7fa1f87355 db/config: Update sstable_compression_user_table_options description
Clarify what "user table" means.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 20:45:59 +02:00
Nikos Dragazis
1e37781d86 schema: Add initializer for compression defaults
In PR 5b6570be52 we introduced the config option
`sstable_compression_user_table_options` to allow adjusting the default
compression settings for user tables. However, the new option was hooked
into the CQL layer and applied only to CQL base tables, not to the whole
spectrum of user tables: CQL auxiliary tables (materialized views,
secondary indexes, CDC log tables), Alternator base tables, Alternator
auxiliary tables (GSIs, LSIs, Streams).

Fix this by moving the logic into the `schema_builder` via a schema
initializer. This ensures that the default compression settings are
applied uniformly regardless of how the table is created, while also
keeping the logic in a central place.

Register the initializer at startup in all executables where schemas are
being used (`scylla_main()`, `scylla_sstable_main()`, `cql_test_env`).

Finally, remove the ad-hoc logic from `create_table_statement`
(redundant as of this patch), remove the xfail markers from the relevant
tests and adjust `test_describe_cdc_log_table_create_statement` to
expect LZ4WithDicts as the default compressor.

Fixes #26914.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 20:45:59 +02:00
Nikos Dragazis
d5ec66bc0c schema: Generalize static configurators into schema initializers
Extend the `static_configurator` mechanism to support initialization of
arbitrary schema properties, not only static ones, by passing a
`schema_builder` reference to the configurator interface.

As part of this change, rename `static_configurator` to
`schema_initializer` to better reflect its broader responsibility.

Add a checkpoint/restore mechanism to allow de-registering an
initializer (useful for testing; will be used in the next patch).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 20:45:59 +02:00
Nikos Dragazis
5b4aa4b6a6 schema: Initialize static properties eagerly
Schemas maintain a set of so-called "static properties". These are not
user-visible schema properties; they are internal values carried by
in-memory `schema` objects for convenience (349bc1a9b6,
https://github.com/scylladb/scylladb/pull/13170#issuecomment-1469848086).

Currently, the initialization of these properties happens when a
`schema_builder` builds a schema (`schema_builder::build()`), by
invoking all registered "static configurators".

This patch moves the initialization of static properties into the
`schema_builder` constructor. With this change, the builder initializes
the properties once, stores them in a data member, and reuses them for
all schema objects that it builds. This doesn't affect correctness as
the values produced by static configurators are "static" by
nature; they do not depend on runtime state.

In the next patch, we will replace the "static configurator" pattern
with a more general pattern that also supports initialization of regular
schema properties, not just static ones. Regular properties cannot be
initialized in `build()` because users may have already explicitly set
values via setters, and there is no way to distinguish between default
values and explicitly assigned ones.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 20:45:55 +02:00
Nikos Dragazis
76b2d0f961 db: config: Add accessor for sstable_compression_user_table_options
The `sstable_compression_user_table_options` config option determines
the default compression settings for user tables.

In patch 2fc812a1b9, the default value of this option was changed from
LZ4 to LZ4WithDicts and a fallback logic was introduced during startup
to temporarily revert the option to LZ4 until the dictionary compression
feature is enabled.

Replace this fallback logic with an accessor that returns the correct
settings depending on the feature flag. This is cleaner and more
consistent with the way we handle the `sstable_format` option, where the
same problem appears (see `get_preferred_sstable_version()`).

As a consequence, the configuration option must always be accessed
through this accessor. Add a comment to point this out.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 18:30:38 +02:00
Nikos Dragazis
4ec7a064a9 test: Check that CQL and Alternator tables respect compression config
In patches 11f6a25d44 and 7b9428d8d7 we added tests to verify that
auxiliary tables for both CQL and Alternator have the same default
compression settings as their base tables. These tests do not check
where these defaults originate from; they just verify that they are
consistent.

Add some more tests to verify the actual source of the defaults, which
is expected to be the `sstable_compression_user_table_options`
from the configuration. Unlike the previous tests, these tests require
dedicated Scylla instances with custom configuration, so they must be
placed under `test/cluster/`.

Mark them as xfail-ing. The marker will be removed later in this series.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 18:16:08 +02:00
Avi Kivity
489d1a0fbc Merge 'replica: don't throw exceptions for read timeout' from Botond Dénes
Read timeouts are a common occurence and they typically occur when the replica is overloaded. So throwing exceptions for read timeouts is very harmful. Be careful not to thow exceptions while propagating them up the future chain. Add a test to enfore and detect regressions.

Fixes: scylladb/scylladb#25062

Improvement, normally not a backport candidate, but we may decide to backport if customer(s) are found to suffer from this.

Closes scylladb/scylladb#25068

* github.com:scylladb/scylladb:
  reader_permit: remove check_abort()
  test/boost/database_test: add test for read timeout exceptions
  sstables/mx/reader: don't throw exceptions on the read-path
  readers/multishard: don't throw exceptions on the read-path
  replica/table: don't throw exceptions on the read-path
  multishard_mutation_query: fix indentation
  multishard_mutation_query: don't throw exceptions on the read-path
  service/storage_proxy: don't throw exceptions on the full-scan path
  cql3/query_processor: don't throw exceptions on the read-path
  reader_permit: add get_abort_exception()
2026-01-13 16:17:41 +02:00
Calle Wilund
da17e8b18b gossiper/main: Extend special treatment of node ID resolve for rpc_address
Refs #27429

If running with broadcast_address != listen/cql/rpc address, topology
gets confused about the varying addresses. Need to special case
resolve both addresses as "self". I.e. extend broadcast_address
treatment to cql_address as well.

Added export of this via gossiper for symmetry.
2026-01-13 14:12:19 +01:00
Avi Kivity
c6dfae5661 treewide: #include Seastar headers with angle brackets
Seastar is an external library from the point of view of
ScyllaDB, so should be included with angle brackets.

Closes scylladb/scylladb#27947
2026-01-13 14:56:15 +02:00
Tomasz Grabiec
63b9a7e2b5 test: pylib: log_browsing: Grep logs without considering newly appended lines
At the end of the test case, the framework greps logs for errors and
backtraces. The servers are still running at this point. Some test
cases enable debug-level logging. If servers manage to produce new
lines between the python script processes them, the grep will never
return.

Protect against this by grepping over a file snapshot.

Fixes #28086

Closes scylladb/scylladb#28088
2026-01-13 14:41:02 +02:00
Nadav Har'El
fc6fff61d1 docs/alternator: add document on reducing Alternator network costs
This patch adds a new document, docs/alternator/network.md,
explaining the various mechanisms that can be used to reduce
network usage in Alternator. It explains compression of requests
and responses, header reduction, rack-aware routing, and RPC compression.

Many of these topics - especially support in the client libraries -
are work in progress, so some details are still missing in the new
document. Still, I think it is a good start that can be improved
later.

Fixes #27915.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27927
2026-01-13 14:29:01 +02:00
Radosław Cybulski
7b1060fad3 alternator: refactor streams.cc
Fix indentation levels from previous commit.
2026-01-13 12:04:13 +01:00
Radosław Cybulski
ef63fe400a alternator: refactor streams.cc
Refactor streams.cc - turn `.then` calls into coroutines.
Reduces amount of clutter, lambdas and referenced variables.
Note - the code is kept at the same indentation level to ease review,
the next commit will fix this.
2026-01-13 12:03:54 +01:00
Karol Baryła
2c471ec57a protocol-extensions.md: Fix client_options docs
When this column and relevant SUPPORTED key were added, the
documentation was mistakenly put in the section about shard awareness
extension. This commit moves the documentation into a dedicated section.

I also expended it to describe both the new column and the new SUPPORTED
key.
2026-01-13 11:49:00 +01:00
Karol Baryła
30d4d3248d system_keyspace.md: Add client_options column
It was recently introduced, but the documentation was not updated.
2026-01-13 11:35:52 +01:00
Karol Baryła
a0a6140436 system_keyspace.md: Fix order in system.clients
scheduling_group column is places after protocol_version in the current
version.
2026-01-13 11:33:34 +01:00
Pavel Emelyanov
9ffd22491f test: Validate S3 endpoints new format works
Extend the test_get_object_store_endpoints() test to configure S3
endpoints in full-url format and check that they are rendered properly
via API/CQL.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-13 13:24:18 +03:00
Pavel Emelyanov
bd225784bd docs: Update docs according to new endpoints config option format
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-13 13:24:06 +03:00
Pavel Emelyanov
f227de24b2 object_storage: Create s3 client with "extended" endpoint name
For this, add the s3::client::make(endpoint, ...) overload that accepts
endpoint in proto://host:port format. Then it parses the provided url
and calls the legacy one, that accepts raw host string and config with
port, https bit, etc.

The generic object_storage_endpoint_param no longer needs to carry the
internal s3::endpoint_config, the config option parsing changes
respectively.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-13 13:24:06 +03:00
Pavel Emelyanov
8f97e6b3de s3/storage: Tune config updating
Don't prepare s3::endpoint_config from generic code, jut pass the region
and iam_role_arn (those that can potentially change) to the callback.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-13 13:24:06 +03:00
Pavel Emelyanov
bee3564564 sstable: Shuffle args for s3_client_wrapper
Make it construct like gs_client_wrapper -- with generic endpoint param
reference and make the storage-specific casts/gets/whatever internally.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-13 13:24:06 +03:00
Pavel Emelyanov
83e88d206c test: Rename badconf variable into objconf
It's not actually a "bad" config, it's just some config the test works
with.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-13 13:23:20 +03:00
Pavel Emelyanov
9c627bc44a test: Split the object_store/test_get_object_store_endpoints test
It tests two things -- the way object storage config is represented via
API and CQL (from sytem.config) and that updating config affects CREATE
KEYSPACE CQL (with keyspace storage options)

It's better to split the test, as its former part is going to be
extented to validate old/new config formats (see #26570)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-13 13:23:03 +03:00
Robert Bindar
dfcabb5fa4 test: reduce dataset and number of test cases or debug builds
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:51 +02:00
Robert Bindar
ca3c57e821 test: bump repair timeout up, it's sometimes not enough in CI
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:49 +02:00
Robert Bindar
6f5e58e718 test: refactor test_refresh.py to match test_restore_with_streaming_scopes.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:48 +02:00
Robert Bindar
6e636a4231 test: extend test_restore_with_streaming_scopes
to test restoring with a different min_tablet_count
than the schema was originally created with.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:46 +02:00
Robert Bindar
92cd1ddec3 test: Adjust test_restore_primary_replica_different_dc_scope_all
to match the new topology arhitecture

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:44 +02:00
Robert Bindar
db13ece9a0 test: Refactor restoring code in test_backup to match SM pattern
This patch refactors the restoring code in cluster/test_backup.py
so it matches better the way SM works.
The patch also refactors test_restore_with_streaming_scopes so to
facilitate running restore scenarios under all supported scopes
with or w/o primary_replica_only enabled by reusing the servers
and backups for a topology. This allows us to test a lot more scenarios
without making the test impossibly slow.

split from bhalevy/load-balance-primary-replica

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:43 +02:00
Robert Bindar
ba01589f53 test: add check_mutation_replicas calls after fresh creation of dataset
to validate that mutation assertions are sane

split from bhalevy/load-balance-primary-replica

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:41 +02:00
Robert Bindar
b835d32cb0 test: extend create_dataset to accept consistency_level
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:35 +02:00
Robert Bindar
e7d44356d9 test: refactor check_mutation_replicas so it's more readable
split from bhalevy/load-balance-primary-replica

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:31 +02:00
Robert Bindar
733b4dbbb7 test: make create_dataset async and refactor so it's configurable
with num_keys and min_tablet_count

split from bhalevy/load-balance-primary-replica

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:20 +02:00
Robert Bindar
f2c8949e4a test: use defaultdict in collect_mutations
split from bhalevy/load-balance-primary-replica

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:45:03 +02:00
Robert Bindar
45faeba97d test: add log marks to facilitate reusing server for restore
split from bhalevy/load-balance-primary-replica

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:44:48 +02:00
Robert Bindar
d88036db48 locator: tablets: Distribute data evenly among primary replicas during restore
Most likely 817fdad uncovered the fact that our choice of
primary replica was resonating with tablet allocation and we were ending up
picking the same replica as primary within a scope instead of rotating
primaryship among all replicas in the scope.
This created situations where for instance, restoring into a 9 nodes cluster
with primary_replica_only=true would put all data into 3 nodes, leaving
the other 6 unused. The balancing of the dataset was performed by the
subsequent repair step.

split from bhalevy/load-balance-primary-replica

Fixes #27281

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:44:20 +02:00
Nadav Har'El
2a831ad373 Merge 'Address CodeQL Errors' from Botond Dénes
Address all errors reported by CodeQL as reported on https://github.com/scylladb/scylladb/security/quality.
This is a mixed bag, with some harmless issues, while others are severe problems which will result in the code breaking (if it is even run). I suspect some of the more severe problems were found in dead code that is not used at all -- hence nobody noticed.
Still, these issues are good to fix, so we can reduce noise in the reports and improve the maintainability of the code.

Code cleanup, no backport

Closes scylladb/scylladb#27838

* github.com:scylladb/scylladb:
  pgo/pgo.py: don't mutate input params
  test/pylib/coverage_utils.py: profdata_to_lcov: don't mutate defaulted param
  test/cluster/dtest/tools/misc.py: add type annotations to list_to_hashed_dict()
  idl-compiler.py: raise TypeError instead of raw str
  test/pylib/lcov_utils.py: don't call set when iterating over it
  configure.py: move away from .format(**locals())
  test/cluster/object_store/conftest.py: add missing call to parent constructor
  idl-compiler.py: add missing call to parent class constructor
  tools/scyllatop/fake.py: pass correct number of args to _add_metric
2026-01-13 11:43:57 +02:00
Nadav Har'El
609b283d98 test/cqlpy: add another reproducer for known issue
This patch adds a second reproducer for issue #25839, which is about
scanning a secondary index which returns partial results. The new test
uses count(*) without requesting the row themselves, but still has the
same problem of counting only part of the rows. This is the problem that
a user reported in issue #28026.

Unlike the previous test, this test works correctly on older versions
of Scylla - by using larger data, like on Cassandra - without changing
a configuration variable that did not yet exist. So with this test we
can confirm that this bug is a Scylla 5.2 regression:

  test/cqlpy/run --release 5.1 test_secondary_index.py::test_short_count

passes, while

  test/cqlpy/run --release 5.2 test_secondary_index.py::test_short_count

fails. It also fails on master, so the new test is marked "xfail".

Refs #25839
Refs #28026

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28108
2026-01-13 11:15:27 +02:00
Botond Dénes
7b562bb185 Merge 'system.clients: Address SSL refactor review comments' from Piotr Smaron and Copilot
Addresses outstanding review comments from PR #22961 where SSL field
collection was refactored into generic_server::connection base class.
This patch consists of minor cosmetic enhancements for increased
readability, mainly, with some minor fixups explained in specific
commits.

Cosmetic changes, no need to backport.

Closes scylladb/scylladb#27575

* github.com:scylladb/scylladb:
  test_ssl: fix indentation
  generic_server: improve logging broken TLS connection
  test_ssl: improve timeout and readability
  alternator/server: update SSL comment
2026-01-13 11:00:26 +02:00
Botond Dénes
354c805e6a reader_permit: remove check_abort()
This method can cause performance regressions if used in the wrong place
-- namely if it is used to abort reads by throwing the abort exception.
Exceptions should be propagated during reads without throwing them,
otherwise they cause extra CPU load, making a bad situation worse.
Remove this method, so it doesn't accidentally get more users, migrate
remaining users to get_abort_exception().
2026-01-13 10:47:57 +02:00
Botond Dénes
a0ddac655d test/boost/database_test: add test for read timeout exceptions
Read timeouts shouldn't trigger exceptions thrown, exceptions should be
solely propagated via futures, otherwise they put extra strain on the
system at the worst possible time: when it is overload already enough
that reads started to time out.
The test covers both single partition reads and full scans, with two
scenarios:
* timeout while the read is queued
* timeout when the read is already ongoing
2026-01-13 10:47:57 +02:00
Botond Dénes
6801611b94 sstables/mx/reader: don't throw exceptions on the read-path
If the read is aborted via the permit (due to timeout) don't throw the
abort exception, instead propagate it via the future chain.
Also, use try_catch<> instead of try ... catch to decorate
malformed_sstable_exception with the file name.
2026-01-13 10:47:57 +02:00
Botond Dénes
fa6ffe9d20 readers/multishard: don't throw exceptions on the read-path
Use coroutine::try_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
2026-01-13 10:47:57 +02:00
Botond Dénes
1f6f7ceb68 replica/table: don't throw exceptions on the read-path
Use coroutine::as_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.

Not using coroutine::try_future(), because on the error path, the
querier has to be closed.
2026-01-13 10:47:57 +02:00
Botond Dénes
131489fe48 multishard_mutation_query: fix indentation
Left broken by previous patch.
2026-01-13 10:47:57 +02:00
Botond Dénes
7eeb7fcfba multishard_mutation_query: don't throw exceptions on the read-path
Use coroutine::try_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
2026-01-13 10:47:57 +02:00
Botond Dénes
9bea842c01 service/storage_proxy: don't throw exceptions on the full-scan path
Use coroutine::try_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
2026-01-13 10:47:57 +02:00
Botond Dénes
404f5b0808 cql3/query_processor: don't throw exceptions on the read-path
Use coroutine::try_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
2026-01-13 10:47:57 +02:00
Botond Dénes
5cd237fec5 reader_permit: add get_abort_exception()
Will replace check_abort(). The latter throws an exception which is
something we want to avoid when a read is aborted, in particular when it
times out.
Also add a convenience get_abort_exception() method to mutation_reader.
2026-01-13 10:47:57 +02:00
Michał Hudobski
c8aa49b196 vector search, paging: add test for paging warnings
We add a test that validates that indexed queries
do not throw a warning related to vector search paging

Fixes: SCYLLADB-248

Closes scylladb/scylladb#28077
2026-01-13 10:33:36 +02:00
Nadav Har'El
34191d8fd4 alternator: fix signature checking of headers with multiple spaces
We have a test in test_compressed_response.py that reproduces a bug
where in Alternator's signature checking code, if a header had multiple
consecutive spaces its signature isn't checked correctly.

This patch fixes this and that xfailing test begins to pass.

But it turns out that the handling of multiple consecutive spaces in
headers when calculating the authentication signature is just one example
of "header canonization" that the AWS Signature V4 specification requires
us to do. There are additional types of header canonization that Alternator
must do, and this patch also adds new tests in test_authorization.py for
checking *all* the types of canonization.

Fortunately, for all other types of canonizations, we already handled
them correctly - Alternator already lowercases header names, sorts them
alphabetically and removes leading and trailing spaces before calculating
the signature. So most of the new tests added pass also without this patch,
and only one of them, test_canonization_middle_whitespace, needs this
patch to pass. As usual, all the new tests also pass on DynamoDB.

Fixes #27775

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28102
2026-01-13 10:29:13 +02:00
Andrei Chekun
3f14d45d6e test.py: fix the link for the failed_test directory
With new UI Jenkins escaping the HTML tags during rendering to prevent
XSS. This will show just link without custom name as a string that can
be copied and then pasted to navigate to the failed directory.

Closes scylladb/scylladb#28062
2026-01-13 10:22:38 +02:00
Yaniv Kaul
4e3aa53f8b test/rest_api/test_gossiper.py: fix for Variable defined multiple times
To fix the problem, we need to remove the first, redundant definition of
test_gossiper_unreachable_endpoints (lines 19-24). The second definition
(lines 25-40) should be retained as it has more substantial test logic.
No other code changes or imports are needed, as the test logic is
preserved fully in the retained definition.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27632
2026-01-13 10:17:29 +02:00
Yaniv Kaul
1722e35a7e .github/workflows/docs-validate-metrics.yml: add workflow permissions
Potential fix for code scanning alert no. 167: Workflow does not contain permissions

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27819
2026-01-13 10:16:35 +02:00
Yaniv Kaul
39c26794ec .github/workflows/docs-pr.yaml: add workflow permissions
Potential fix for code scanning alert no. 145: Workflow does not contain permissions

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27808
2026-01-13 10:15:45 +02:00
Yaniv Michael Kaul
d9cce7ccbf .github/copilot-instructions.md: add test philosophy
Try to teach CoPi a bit about how we'd like to see it implement tests, according to this repo best practices.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#28032
2026-01-13 10:05:13 +02:00
Anna Stuchlik
8141283262 doc: add the version name to the Install ScyllaDB page
Fixes https://github.com/scylladb/scylladb/issues/28021

Closes scylladb/scylladb#28022
2026-01-13 10:01:48 +02:00
Avi Kivity
66aee0fb5e alternator: add optional listeners for proxy protocol v2
Following 954f2cbd2f, which added proxy protocol v2 listeners
for CQL, we do the same for alternator. We add two optional ports
for plain and TLS-wrapped HTTP.

We test each new port, that the old ports still work, and that
mixing up a port with no proxy protocol and a connection with proxy
protocol (or the opposite) fails. The latter serves to show
that the testing strategy is valid and doesn't just pass whatever
happens. We also verify that the correct addresses (and TLS mode)
show up in system.clients.

Closes scylladb/scylladb#27889
2026-01-13 09:59:24 +02:00
Nadav Har'El
e7df03127b alternator: support "deflate" encoding in request compression
Currently Alternator supports compressed requests in the gzip format
with "Content-Encoding: gzip". We did not support any other compression
formats.

It turns out that DynamoDB also supports the "deflate" encoding.
The "deflate" format is just a small variant of gzip and also supported
by the same zlib library that we already use, so it is very easy
to add support for it as well. So this patch adds it.

Beyond compatibility with DynamoDB, another benefit of this patch is
symmetry with our response compression support (PR #27454), where
we supported both gzip and deflate compression of responses - so
we should support the same for requests.

This patch also adds tests for Content-Encoding: deflate, which pass
on DynamoDB (proving that "deflate" is indeed supported there).
On Alternator the new tests failed before this patch and pass with
this patch.

Refs #27243 (which asks to support more compression formats).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27917
2026-01-13 09:58:12 +02:00
Nadav Har'El
a1f198d453 test/cqlpy: translate Cassandra's test InsertInvalidateSizedRecordsTest
This is a translation of Cassandra's CQL unit test source file
validation/operations/InsertInvalidateSizedRecordsTest.java into our
cqlpy framework.

This is one of the tests added to Cassandra as part of the vector
search work, but actually has nothing to do with vector search -
it checks what happens when key columns of different types exceeed
their maximum size (64KB).

Unfortunately, each one of the tests added here *fail* on ScyllaDB,
providing more reproducers for two already known issues (which
already had plenty of reproducers...):

Refs  #8627 Cleanly reject updates with indexed values where value > 64k
Refs #12247 Better error reporting for oversized keys during INSERT

One of the tests also fails on Cassandra, due to CASSANDRA-19270.
It is not clear to me how this unit test actually passed on Cassandra,
I can only guess that the Python driver somehow makes the request
differently than what the Java unit tests use to make requests to
Cassandra.

One of the tests in the original Cassandra source file I did not
translate, readingEmptyStringsForDifferentTypes, because it tests
cqlsh, not pure CQL.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27944
2026-01-13 08:59:36 +02:00
Yaniv Kaul
2c26ca3651 .github/workflows/read-toolchain.yaml: hardening for code scanning alert no. 146: Workflow does not contain permissions #27807
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27807
2026-01-13 08:57:53 +02:00
tomek7667
19313d67e3 docs/cql/ddl.rst: fix formatting of deprecated initial sub-option
Closes scylladb/scylladb#26852
2026-01-13 08:55:24 +02:00
Anna Stuchlik
14cadcbc18 doc: remove references to Open Source
Fixes https://github.com/scylladb/scylladb/issues/28118

Closes scylladb/scylladb#28119
2026-01-13 08:43:26 +02:00
Botond Dénes
25db8f6a70 pgo/pgo.py: don't mutate input params
It is considered a dangerous practice with possible unintended
side-effects, affecting later calls to the same function.
Found by CodeQL "Modification of parameter with default".
2026-01-13 08:33:17 +02:00
Botond Dénes
4ec3d76b87 test/pylib/coverage_utils.py: profdata_to_lcov: don't mutate defaulted param
It is considered a dangerous practice as it creates a side-effect for
later calls to the same function.
Create a new variable instead and mutate that. Also remove the unused
update_known_ids parameter, which defaults to True and no caller changes
it. Passing False to this param also seem to have no effect. Instead of
trying to guess what the desired effect of passing False is and fixing
it, just remove this unused param.

Found by CodeQL "Modification of parameter with default".
2026-01-13 08:33:17 +02:00
Botond Dénes
157fe2b80e test/cluster/dtest/tools/misc.py: add type annotations to list_to_hashed_dict()
To hopefully shut up CodeQL "Iterable can be either a string or a sequence".
This change makes the code more readable anyway, so it is more than just
a gratuitous change to make some code-scanner happy.
2026-01-13 08:33:17 +02:00
Botond Dénes
849332c5c2 idl-compiler.py: raise TypeError instead of raw str
Unlike in C++, in Python one can only throw objects which inherit from
Exception. The message complains about wrong type so wrap it in
TypeError before passing to raise.
Found by CodeQL "Illegal raise".
2026-01-13 08:33:17 +02:00
Botond Dénes
3c6e9637a0 test/pylib/lcov_utils.py: don't call set when iterating over it
Probably a typo. Found by CodeQL "Non-callable called".
2026-01-13 08:33:17 +02:00
Botond Dénes
eb4ee5a126 configure.py: move away from .format(**locals())
Use f strings instead, they are just as convenient with the added bonus
of editors providing syntax highighting for it.

Additionally, this shuts up CodeQL complaint about "Suspicious unused
loop iteration variable" in loops where the loop variable was passed to
format indirectly via **locals().
2026-01-13 08:33:17 +02:00
Botond Dénes
d2db84714e test/cluster/object_store/conftest.py: add missing call to parent constructor
Replace manual init of parent fields.
Found by CodeQL: "Missing call to superclass `__init__` during object
initialization".

The secret_key is not initialized to server.secret_key, instead of
server.access_key. This probably fixes a (benign) bug.
2026-01-13 08:33:17 +02:00
Botond Dénes
26e978cf08 idl-compiler.py: add missing call to parent class constructor
Replace manual init of parent fields.
Found by CodeQL "Missing call to superclass `__init__` during object
initialization".
2026-01-13 08:33:17 +02:00
Botond Dénes
20d0782dc4 tools/scyllatop/fake.py: pass correct number of args to _add_metric
Discovered by CodeQL "Wrong number of arguments in call".
2026-01-13 08:33:17 +02:00
Michał Jadwiszczak
649efd198f docs/dev/service_levels: update docs to service levels on raft
Since Scylla 6.0, service levels are manged by Raft group0.
This patch updates table name used by service levels and adds a
paragraph describing service levels on raft.

Fixes scylladb/scylladb#18177

Closes scylladb/scylladb#26556
2026-01-13 06:49:18 +02:00
Botond Dénes
725e99e263 Merge 'test.py: Fix boost logs' from Andrei Chekun
Write the boost logs into stdout in HRF format and in XML to the file. The XML file will be used for parsing and providing the error information in the summary section of the fail.

Fixes: https://github.com/scylladb/scylladb/issues/28045

Framework enhancements, no need to backport.

Closes scylladb/scylladb#28107

* github.com:scylladb/scylladb:
  test.py: remove XML log from fail summary
  test.py: fix truncated boost output to stdout file
2026-01-13 06:19:05 +02:00
Anna Stuchlik
791ab4ed02 doc: clarify the information about SSTable version support
Fixes https://github.com/scylladb/scylladb/issues/27765

Closes scylladb/scylladb#27835
2026-01-13 06:17:37 +02:00
Tomasz Grabiec
8dc20e6aaf test: cluster: Add reproducer for missed notification in topology coordinator 2026-01-13 00:40:23 +01:00
Tomasz Grabiec
7a04dd2d22 topology_coordinator: Wake up the state machine after stats refresh
Otherwise, coordinator may not react to changing stats after explicit
calls to trigger_load_stats_refresh() done on node replace or table
creation, if stats take longer to refresh than it takes the
coordinator to go idle.

The periodic refresh does wake up the topology coordinator, so the
issue is not dramatic in production, but it's annoying in tests, which
take longer because of that.

Fixes #25163
2026-01-13 00:40:23 +01:00
Tomasz Grabiec
d910e6ea63 topology_coordinator: Move tablet_load_stats_refresh_before_rebalancing injection earlier
Refreshing stats will signal _topo_sm.event, so do it before waiting
for the event, to avoiding busy looping in the coordinator.

This will produce lots of logs in test cases which enable debug-level
logging in the raft logger.

Refs #28086
2026-01-13 00:40:16 +01:00
Tomasz Grabiec
e5dee2aab8 topology_coordinator: Fix potential missed notification
Checking for work is not atomic, so there is room for missed
notification. Especially that notifications are not always triggered
from fibers which take the group0 guard.

Fix by subscribing for the event before checking for work.

Fixes #27958
2026-01-13 00:39:01 +01:00
Tomasz Grabiec
2b7aa3211d topology_coordinator: Refresh load stats after table is created or altered
We switched to the size-based load balancing, which now has more
strict requirements for load stats. We no longer need only per-node
stats (capacity), but also per-tablet stats.

Bootstrapping a node triggers stats refresh, but allocating tablets on
table creation didn't. So after creating a table, load balancer
couldn't make progress for up to 60s (stats refresh period).

This makes tests take longer, and can even cause failures if tests are
using a low-enough timeout.

Fixes #27921
2026-01-13 00:38:59 +01:00
Tomasz Grabiec
663831ebd7 tablets: Do a group0 read barrier on tablet load stats refresh
Stats refresh will be triggered on topology coordinator by events like
allocating new tablets on table creation. For refresh to be effective,
all replicas must see the new tablets, otherwise stats will be
incomplete.
2026-01-13 00:38:00 +01:00
Tomasz Grabiec
c4c5ed5aba topology_coordinator: Ensure stats are refreshed in the gossip scheduling group
Refresh can be triggered from different places, but it should run in
the gossip scheduling group, like group0 operations.
2026-01-13 00:38:00 +01:00
Tomasz Grabiec
5e6935f276 test: Use ManagerClient.{disable,enable}_tablet_balancing() 2026-01-13 00:38:00 +01:00
Tomasz Grabiec
6936704677 test: Add missing calls to disable_tablet_balancing() in tests which use move_tablet() API
If a test tries to move a tablet, it assumes the tablets are stable.
This fixes flakiness exposed by size-based load-balancing and a later
change to refresh stats sooner.
2026-01-13 00:38:00 +01:00
Tomasz Grabiec
c8098e07c9 test: pylib: Introduce ManagerClient.{disable,enable}_tablet_balancing()
It's a global operation, so we can use any server.

It's not only convenient. The call via api.disable_tablet_balancing()
confuse people to think that it's a per-server operation. This leads
to proliferation of code which does it needlessly on all servers.
2026-01-13 00:38:00 +01:00
Andrei Chekun
dfa6a61721 test.py: remove XML log from fail summary
Remove XML log from fail summary.
Add text from the first error in the XML file to the fail summary
2026-01-12 14:26:58 +01:00
Andrei Chekun
d96a50481a test.py: fix truncated boost output to stdout file
Change the behavior of the catching the boost log output. With this
change boost will output it's logging to stdour with HRF format and to
the tempfile in XML format. This will help for easier debuggint when all
messages will be in the output file and still in the fail summary.
2026-01-12 14:15:23 +01:00
Botond Dénes
6bcc18e5c6 erge 'test.py: integrate python tests to be executed with pytest runner' from Andrei Chekun
This will move responsibility for running tests with pytest in the same manner as it was done with boost tests. From this commit, test.py is not responsible anymore for running python tests and relies completely on pytest.
This is another step for unification of test execution.
Convert skip_mode function to `pytest.mark` to be able to use to annotate the whole module instead of each test explicitly.

NOTE: this is a breaking change. From this commit, several directories with tests will require a path to the file to launch the test. Affected directories
test/alternator
test/broadcast_tables
test/cql
test/cqlpy
test/rest_api

Changes only in framework, so no backport.

This PR will increase the amount of the tests by 30 test, due to the fact that how test.py and pytest discover tests. test.py count a file as a test, and when skip used in suite.yaml it will exclude the tests from discovery completely.
While the pytest count test funstion as a test and uses skip_mode mark and will discover the tests, but it will skip them during execution, hence the difference
test.py output before PR:
```bash
> ./test.py --mode=release  rest_api/test_compaction_task rest_api/test_task_manager --list --no-gather-metrics

```
test.py output in this PR:
```bash
> ./test.py --mode=release  test/rest_api/test_compaction_task.py test/rest_api/test_task_manager.py --list

rest_api/test_compaction_task.py::test_global_major_keyspace_compaction_task.release.1
rest_api/test_compaction_task.py::test_major_keyspace_compaction_task.release.1
rest_api/test_compaction_task.py::test_cleanup_keyspace_compaction_task.release.1
rest_api/test_compaction_task.py::test_offstrategy_keyspace_compaction_task.release.1
rest_api/test_compaction_task.py::test_rewrite_sstables_keyspace_compaction_task.release.1
rest_api/test_compaction_task.py::test_reshaping_compaction_task.release.1
rest_api/test_compaction_task.py::test_resharding_compaction_task.release.1
rest_api/test_compaction_task.py::test_regular_compaction_task.release.1
rest_api/test_compaction_task.py::test_compaction_task_abort.release.1
rest_api/test_compaction_task.py::test_major_keyspace_compaction_task_async.release.1
rest_api/test_compaction_task.py::test_cleanup_keyspace_compaction_task_async.release.1
rest_api/test_compaction_task.py::test_offstrategy_keyspace_compaction_task_async.release.1
rest_api/test_compaction_task.py::test_rewrite_sstables_keyspace_compaction_task_async.release.1
rest_api/test_compaction_task.py::test_compaction_progress[major_keyspace_compaction_task_impl_run_fail].release.1
rest_api/test_compaction_task.py::test_compaction_progress[shard_major_keyspace_compaction_task_impl_run_fail].release.1
rest_api/test_compaction_task.py::test_compaction_progress[table_major_keyspace_compaction_task_impl_run_fail].release.1
rest_api/test_task_manager.py::test_task_manager_modules.release.1
rest_api/test_task_manager.py::test_task_manager_tasks.release.1
rest_api/test_task_manager.py::test_task_manager_status_running.release.1
rest_api/test_task_manager.py::test_task_manager_status_done.release.1
rest_api/test_task_manager.py::test_task_manager_status_failed.release.1
rest_api/test_task_manager.py::test_task_manager_not_abortable.release.1
rest_api/test_task_manager.py::test_task_manager_wait.release.1
rest_api/test_task_manager.py::test_task_manager_ttl.release.1
rest_api/test_task_manager.py::test_task_manager_user_ttl.release.1
rest_api/test_task_manager.py::test_task_manager_sequence_number.release.1
rest_api/test_task_manager.py::test_task_manager_recursive_status.release.1
rest_api/test_task_manager.py::test_module_not_exists.release.1
rest_api/test_task_manager.py::test_task_folding.release.1
rest_api/test_task_manager.py::test_abort_on_unregistered_task.release.1
```

Fixes: https://github.com/scylladb/scylladb/issues/27716

Closes scylladb/scylladb#26395

* github.com:scylladb/scylladb:
  test.py: fix test_vector_similarity.py
  docs: add directories excluded from test.py
  test.py: prevent file descriptors leaking
  test.py: capture print inside the test
  test.py: do not print header for collection with test.py
  test.py: remove not supported functionality
  test.py: switch of execution of several test directories by test.py runner
  test.py: integrate python tests to be executed with pytest runner
  test.py: fix test/vector_search_validator to be able to run with pytest
  test.py: prepare base class for migration
  test.py: move environment preparation to one method
  test.py: introduce new environment variable TESTPY_PREPARED_ENVIRONMENT
2026-01-12 14:17:19 +02:00
Botond Dénes
04b8f72946 Merge 'repair: Implement auto repair for tablet repair' from Asias He
repair: Implement auto repair for tablet repair

This patch implements the basic auto repair support for tablet repair.

It was decided to add no per table configuration for the initial
implementation, so two scylla yaml config options are introduced to set
the default auto repair configs for all the tablet tables.

- auto_repair_enabled_default

Set true to enable auto repair for tablet tables by default. The value
will be overridden by the per keyspace or per table configuration which
is not implemented yet.

- auto_repair_threshold_default_in_seconds

Set the default time in seconds for the auto repair threshold for tablet
tables. If the time since last repair is bigger than the configured
time, the tablet is eligible for auto repair. The value will be
overridden by the per keyspace or per table configuration which is not
implemented yet.

The following metrcis are added:

- auto_repair_needs_repair_nr

The number of tablets with auto repair enabled that needs repair

- auto_repair_enabled_nr

The number of tablets with auto repair enabled

The metrics are useful to tell if auto repair is falling behind.

In the future, more auto repair scheduling will be added, e.g.,
scheduling based on the repaired and unrepaired sstable set size,
tombstone ratio and so on, in addition to the time based scheduling.

Fixes SCYLLADB-99

New feature. No backport.

Closes scylladb/scylladb#27534

* github.com:scylladb/scylladb:
  topology_coordinator: Add metrics for tablet repair
  repair: Implement auto repair for tablet repair
2026-01-12 14:16:01 +02:00
Yaniv Kaul
f1c9eda49e Potential fix for code scanning alert no. 144: Workflow does not contain permissions
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27809
2026-01-12 12:21:35 +02:00
Marcin Maliszkiewicz
3c9f52e709 Merge 'doc: update the Web Installer instructions' from Anna Stuchlik
This PR:
- Replaces a fixed version name with the variable for the current version in the instructions for installing a non-default version with Web Installer. This will make using the installer more user-friendly.
- Removes the instruction for Open Source from the Web Installer docs.

Fixes https://github.com/scylladb/scylladb/issues/28005
Fixes https://github.com/scylladb/scylladb/issues/28079

Closes scylladb/scylladb#28046

* github.com:scylladb/scylladb:
  doc: remove the instruction for Open Source from the Web Installer docs
  doc: add the version variable to the Web Installer instructions
2026-01-12 11:10:04 +01:00
Petr Gusev
889d7782ed treewide: use coroutine::maybe_yield in coroutines
It's more efficient since coroutine::maybe_yield returns
a lightweight struct (awaitable), not the future.

Closes scylladb/scylladb#28101
2026-01-12 10:38:47 +01:00
Marcin Maliszkiewicz
09af3828ab auth: remove confusing deprecation msg from hash_with_salt
Closes scylladb/scylladb#27705
2026-01-12 10:12:54 +01:00
Asias He
7980890029 topology_coordinator: Add metrics for tablet repair
- scylla_tablet_ops_failed

Number of failed tablet {auto, user} repair

- scylla_tablet_ops_succeeded

Number of succeeded tablet {auto, user} repair

Currently auto_repair and user_repair tablet task are added. We can add
more tablet tasks later, e.g., rebuild, migration.
2026-01-12 15:26:05 +08:00
Alex
e430065c92 db: views: serialize create/drop view operations via shard 0
Create and drop view operations are currently performed on all shards, and their execution is not fully serialized. On slower processors this can lead to interleavings that leave stale entries in `system.scylla_views_build`

A problematic sequence looks like this:

* `on_create_view()` runs on shard 0 → entries for shard 0 and shard 1 are created
* `on_drop_view()` runs on shard 0 → entry for shard 0 is removed
* `on_create_view()` runs on shard 1 → entries for shard 0 and shard 1 are created again
* `on_drop_view()` runs on shard 1 → entry for shard 1 is removed, while the shard 0 entry remains

This results in a leftover row in `system.scylla_views_builds_in_progress`, causing `view_build_test.cc` to get stuck indefinitely in an eventual state and eventually be terminated by CI.

This patch fixes the issue by fully serializing all view create and drop operations through shard 0. Shard 0 becomes the single execution point and notifies other shards to perform their work in order. Requests originating.

new process:
  - view_builder::on_create_view(...) runs only on shard 0 and kicks off dispatch_create_view(...) in the background.
  - dispatch_create_view(...) (shard 0) first checks should_ignore_tablet_keyspace(...) and returns early if needed.
  - dispatch_create_view(...) calls handle_seed_view_build_progress(...) on shard 0. That:
      - writes the global “build progress” row across all shards via _sys_ks.register_view_for_building_for_all_shards(...).
  - After seeding, dispatch_create_view(...) broadcasts to all shards with container().invoke_on_all(...).
  - Each shard runs handle_create_view_local(...), which:
      - waits for pending base writes/streams, flushes the base,
      - resets the reader to the current token and adds the new view,
      - handles errors and triggers _build_step to continue processing.

  Drop view

  - view_builder::on_drop_view(...) runs only on shard 0 and kicks off dispatch_drop_view(...) in the background.
  - dispatch_drop_view(...) (shard 0) first checks should_ignore_tablet_keyspace(...) and returns early if needed.
  - It broadcasts handle_drop_view_local(...) to all shards with invoke_on_all(...).
  - Each shard runs handle_drop_view_local(...), which:
      - removes the view from local build state (_base_to_build_step and _built_views) by scanning existing steps,
      - ignores missing keyspace cases.
  - After all shards finish local cleanup, shard 0 runs handle_drop_view_global_cleanup(...), which:
      - removes global build progress, built‑view state, and view build status in system tables,

  Shutdown

  - drain() waits on _view_notification_sem before _sem so in‑flight dispatches finish before bookkeeping is halted.

In addition, the test is adjusted to remove the long eventual wait (596.52s / 30 iterations) and instead rely on the default wait of 17 iterations (~4.37 minutes), eliminating unnecessary delays while preserving correctness.

Fixes: https://github.com/scylladb/scylladb/issues/27898
Backport: not required as the problem happens on master

Closes scylladb/scylladb#27929
2026-01-12 09:23:22 +02:00
Michał Hudobski
92c988514c vector_search: allow all where clauses in vector search queries
To prepare for implementation of filtering we skip validation
of where clauses in vector search queries. All queries that would
be blocked by the lack of ALLOW FILTERING now will pass through.

Fixes: VECTOR-410

Closes scylladb/scylladb#27758
2026-01-11 12:56:44 +02:00
Marcin Maliszkiewicz
03e0dd0841 Merge 'test/alternator: fix most tests to run on DynamoDB' from Nadav Har'El
We can run Alternator's tests against DynamoDB with `test/alternator/run --aws`, and our intention is that all except a few specially marked should pass on DynamoDB - indicating that the test itself is correct and checks compatibility with DynamoDB and not with some misunderstood spec.

Before this patch series, almost two dozen Alternator's tests failed on DynamoDB. This series fixes most of them.

Refs #26079 (it fixes almost all the problems but probably not all of them so let's keep the issue open for a while longer)

Closes scylladb/scylladb#27995

* github.com:scylladb/scylladb:
  test/alternator: fix some expected error messages to fit DynamoDB
  test/alternator: fix compressed request test on non-us-east1
  test/alternator: fix test's expected error message on DynamoDB
  test/alternator: mark Alternator-only test scylla_only
  test/alternator: fix test on DynamoDB
  test/alternator: increase wait_for_gsi() timeout
  test/alternator: fix test passing a spurious parameter
2026-01-09 18:05:20 +01:00
Botond Dénes
7e1c8776b7 docs: remove sstabledump and sstablemetadata
These tools are deprecated and no longer shipped by ScyllaDB packages.
They no longer support the latest SSTable versions and ScyllaDB-only
features, like encryption and dictionary based compression.

Remove them from the documentation.

Closes scylladb/scylladb#27608
2026-01-09 17:31:54 +01:00
Dawid Mędrek
2385afa1c7 scripts/pull_github_pr.sh: Update instructions for creating token
The interface of Jenkins has changed, and the instructions for creating
a token are out-of-date. This commit updates them.

Closes scylladb/scylladb#28054
2026-01-09 17:45:00 +02:00
Ferenc Szili
0ede8d154b docs: add docs for size based load balancing
This patch updates the documentation for size based load balancing.

Closes scylladb/scylladb#27616
2026-01-09 16:25:25 +02:00
Andrei Chekun
1f60208aa0 test.py: fix test_vector_similarity.py
There is a known limitation of the xdist.
Since it makes discovery in each thread, then compare it with master thread. The discovered lists of test should be the same. Sets are not order guaranteed, so they should not be used for parametrized testing, because discovery of the tests with using xdist will fail.
This PR just converts set to dist, to eliminate issue mentioned above.
2026-01-09 15:08:40 +01:00
Yaniv Michael Kaul
af8eaa9ea5 scripts: fixes flagged by CodeQL/PyLens
Unused imports, unused variables and such.
Initially, there were no functional changes, just to get rid of some standard CodeQL warnings.

I've then broken the CI, as apparently there's a install time(!?) Python script creation for the sole purpose of product
naming. I changed it - we have it in etcdir, as SCYLLA-PRODUCT-FILE.
So added (copied from a different script) a get_product() helper function in scylla_util.py and used it instead.

While at it, also fixed the too broad import from scylla_util, which 'forced' me to also fix other specific imports (such as shutil).

Improvement - no need to backport.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#27883
2026-01-09 15:13:12 +02:00
Anna Stuchlik
396093ff60 doc: remove the instruction for Open Source from the Web Installer docs
Fixes https://github.com/scylladb/scylladb/issues/28079
2026-01-09 14:07:32 +01:00
Botond Dénes
af6cb0d0a4 Merge 'raft topology: preserve IP -> ID mapping of a replacing node on restart' from Patryk Jędrzejczak
We currently do it only for a bootstrapping node, which is a bug. The
missing IP can cause an internal error, for example, in the following
scenario:
- replace fails during streaming,
- all live nodes are shut down before the rollback of replace completes,
- all live nodes are restarted,
- live nodes start hitting internal error in all operations that
  require IP of the replacing node (like client requests or REST API
  requests coming from nodetool).

We fix the bug here, but we do it separately for replace with different
IP and replace with the same IP.

For replace with different IP, we persist the IP -> host ID mapping
in `system.peers` just like for bootstrap. That's necessary, since there
is no other way to determine IP of the replacing node on restart.

For replace with the same IP, we can't do the same. This would require
deleting the row corresponding to the node being replaced from
`system.peers`. That's fine in theory, as that node is permanently
banned, so its IP shouldn't be needed. Unfortunately, we have many
places in the code where we assume that IP of a topology member is always
present in the address map or that a topology member is always present in
the gossiper endpoint set. Examples of such places:
- nodetool operations,
- REST API endpoints,
- `db::hints::manager::store_hint`,
- `group0_voter_handler::update_nodes`.

We could fix all those places and verify that drivers work properly when
they see a node in the token metadata, but not in `system.peers`.
However, that would be too risky to backport.

We take a different approach. We recover IP of the replacing node on
restart based on the state of the topology state machine and
`system.peers` just after loading `system.peers`.

We rely on the fact that group 0 is set up at this point. The only case
where this assumption is incorrect is a restart in the Raft-based
recovery procedure. However, hitting this problem then seems improbable,
and even if it happens, we can restart the node again after ensuring
that no client and REST API requests come before replace is rolled back
on the new topology coordinator. Hence, it's not worth to complicate the
fix (by e.g. looking at the persistent topology state instead of the
in-memory state machine).

Fixes #28057

Backport this PR to all branches as it fixes a problematic bug.

Closes scylladb/scylladb#27435

* github.com:scylladb/scylladb:
  gossiper: add_saved_endpoint: make generations of excluded nodes negative
  test: introduce test_full_shutdown_during_replace
  utils: error_injection: allow aborting wait_for_message
  raft topology: preserve IP -> ID mapping of a replacing node on restart
2026-01-09 14:56:16 +02:00
Calle Wilund
a7cdb602e1 db::commitlog: Fix sanity check error on race between segment flushing and oversized alloc
Fixes #27992

When doing a commit log oversized allocation, we lock out all other writers by grabbing
the _request_controller semaphore fully (max capacity).
We thereafter assert that the semaphore is in fact zero. However, due to how things work
with the bookkeep here, the semaphore can in fact become negative (some paths will not
actually wait for the semaphore, because this could deadlock).

Thus, if, after we grab the semaphore and execution actually returns to us (task schedule),
new_buffer via segment::allocate is called (due to a non-fully-full segment), we might
in fact grab the segment overhead from zero, resulting in a negative semaphore.

The same problem applies later when we try to sanity check the return of our permits.

Fix is trivial, just accept less-than-zero values, and take same possible ltz-value
into account in exit check (returning units)

Added whitebox (special callback interface for sync) unit test that provokes/creates
the race condition explicitly (and reliably).

Closes scylladb/scylladb#27998
2026-01-09 14:06:58 +02:00
Łukasz Paszkowski
7bf26ece4d test_user_writes_rejection: Fix test flakiness caused by typo and non-local CL=ONE reads
The current code:
```
try:
   cql.execute(f"INSERT INTO {cf} (pk, t) VALUES (-1, 'x')", host=host[0], execution_profile=cl_one_profile).result()
except Exception:
   pass
```

contains a typo: `host=host[0]` which throws an exception becase Host
object is not subscriptable. The test does not fail because the except
block is too broad and suppresses all exceptions.

Fixing the typo alone is insufficient. The write still succeeds because
the remaining nodes are UP and the query uses CL=ONE, so no failure
should be expected.

Another source of flakiness is data verification:
```
SELECT * FROM {cf} WHERE pk = 0;
```

Even when a coordinator is explicitly provided, using CL=ONE does not
guarantee a local read. The coordinator may forward the read request to
another replica, causing the verification to fail nondeterministically.

This patch rewrites the tests to address these issues:
- Fix the typo: `host[0]` to `hosts[0]`
- Verify data using `MUTATION_FRAGMENTS({cf})` which guarantees a local
  read on the coordinator node
- Reconnect the driver after node restart

Fixes https://github.com/scylladb/scylladb/issues/27933

Closes scylladb/scylladb#27934
2026-01-09 13:42:05 +02:00
Andrei Chekun
82e81a8664 docs: add directories excluded from test.py
Add new directories that are excluded from the test.py executor and will
be fully managed by pytest
2026-01-09 11:59:25 +01:00
Andrei Chekun
353bae7d66 test.py: prevent file descriptors leaking
With migration to the pytest, file descriptors will be hanged during the whole life of the process. Previously it was not an issue, because test.py was executing only one file with Popen, so descriptors will be freed with process done. With new approach they are blocked. This will allow to eliminate this.
Fix issue when we had issue with getting cluster and then trying to set it dirty while it None.
Put cluster to the pool only if it was created
2026-01-09 11:59:25 +01:00
Andrei Chekun
67c5267053 test.py: capture print inside the test
Capture the printing inside the test case to output it after the test and not directly during the testing process.
2026-01-09 11:59:25 +01:00
Andrei Chekun
594aedd6a5 test.py: do not print header for collection with test.py
Skip printing the default pytest headers when printing list of the tests.
Before:
```
$ ./test.py --mode=dev test/boost/sstable_conforms_to_mutation_source_test.cc --list

Test session starts (platform: linux, Python 3.13.9, pytest 8.3.4, pytest-sugar 1.1.1)
rootdir: /home/xtrey/projects/scylladb/test
configfile: pytest.ini
plugins: xdist-3.8.0, allure-pytest-2.15.0, sugar-1.1.1, anyio-4.8.0, asyncio-0.24.0, timeout-2.3.1
asyncio: mode=Mode.AUTO, default_loop_scope=session
timeout: 24000.0s
timeout method: signal
timeout func_only: False
session timeout: 24000.0s
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_mc_tiny.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_mc_medium.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_mc_large.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_md_tiny.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_md_medium.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_md_large.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_ms_tiny.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_ms_medium.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_ms_large.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_reversing_reader_random_schema.dev.1

Results (0.06s):
```
After:
```
$ ./test.py --mode=dev test/boost/sstable_conforms_to_mutation_source_test.cc --list

boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_mc_tiny.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_mc_medium.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_mc_large.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_md_tiny.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_md_medium.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_md_large.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_ms_tiny.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_ms_medium.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_ms_large.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_reversing_reader_random_schema.dev.1

Results (0.06s):
```
2026-01-09 11:59:25 +01:00
Andrei Chekun
21a1ff3d5c test.py: remove not supported functionality
In the current state pytest do not support the order of execution, so this parameter is removed. There is no big need in this due to the differences what pytest and test.py counted test. pytest run test functions in the threads, while test.py executed test files in the threads. That's why pytest's way is more granular and allows to fill threads better.
Remove skip node, since it already added as a pytest mark for each test in the file.
Remove pool_size, since this is not used by pytest at all. Pytest uses
xdist to set the amount of threads instead of pool_size used by test.py
2026-01-09 11:59:25 +01:00
Andrei Chekun
e8c50a5ad4 test.py: switch of execution of several test directories by test.py runner
With this commit test.py will lose ability to run tests by itself always bypassing execution to the pytest.

NOTE: this is a breaking change. From this commit, several directories
with tests will require a path to the file to launch the test.
Affected directories
test/alternator
test/broadcast_tables
test/cql
test/cqlpy
test/rest_api
2026-01-09 11:59:25 +01:00
Andrei Chekun
61d49525ad test.py: integrate python tests to be executed with pytest runner
With this commit test.py will be bypassing the tests execution to the pytest. However, it will still be able to run test by itself.
With providing test name like `broadcast_tables/test_broadcast_tables` it will execute test with test.py runner, but if the path to the file will be provided like `test/broadcast_tables/test_broadcast_tables.py` it will bypass execution to the pytest.
`--test-py-init` tells to run pytest session in test.py-compatible mode
Update the help text for the name parameter for test.py about changes
how it works and which directory is served by pytest
2026-01-09 11:59:25 +01:00
Andrei Chekun
808b29885f test.py: fix test/vector_search_validator to be able to run with pytest
build_mode fixture have dynamic scope. It depends how the pytest is
executed. When it executed through test.py scope will be session and
since it's broader that package everything work fine. While with pure
pytest it will fail because build_mode will have module scope.
This fix allows to run tests with pure pytest, this needed for migration
test to be executed by pytest runner instead test.py.
2026-01-09 11:59:25 +01:00
Andrei Chekun
8252de7b55 test.py: prepare base class for migration
Since all tests share the same base class and some of the tests executed by test.py and some with pytest, we need to handle two cases where configuration is located: suite.yaml and test_config.yaml
After full migration suite.yaml case will be removed
2026-01-09 11:59:25 +01:00
Andrei Chekun
48ff74b6b2 test.py: move environment preparation to one method
Since anyway these two methods should be called one by one in two different cases: when test.py executes test and pytest executes test, merging them into one. Additionally, set environment variable to show the underneath pytest process that environment was already prepared and there is no need to clean directories or start additional services.
2026-01-09 11:59:25 +01:00
Andrei Chekun
e074e21490 test.py: introduce new environment variable TESTPY_PREPARED_ENVIRONMENT
Introduce the new environment variable that will be used to signalize to the pytest runner that environment war already prepared by test.py. This needed to be able to run the test with pytest and test.py(that actually will run pytest underneath).
2026-01-09 11:59:25 +01:00
copilot-swe-agent[bot]
8c48b82b84 test_ssl: fix indentation 2026-01-09 10:27:17 +01:00
Piotr Smaron
2bcbebe92d generic_server: improve logging broken TLS connection
Preiously we were logging a broken TLS connection and then this has been
logged later again, so now instead of logging we're constructing an
exception with a message extened with TLS info, which later will be
catched with its full message still logged.
2026-01-09 10:24:55 +01:00
Piotr Smaron
7016fc4835 test_ssl: improve timeout and readability
1. With this change the test really waits 10s, previously (in case
   something went wrong), the timeout could take way more than that.
2. Added `else` to above `if` to increase clarity of execution flow -
   it doesn't change logic, but makes it more clear.
2026-01-09 10:22:19 +01:00
Asias He
7ba7b25bdd repair: Implement auto repair for tablet repair
This patch implements the basic auto repair support for tablet repair.

It was decided to add no per table configuration for the initial
implementation, so two scylla yaml config options are introduced to set
the default auto repair configs for all the tablet tables.

- auto_repair_enabled_default

Set true to enable auto repair for tablet tables by default. The value
will be overridden by the per keyspace or per table configuration which
is not implemented yet.

- auto_repair_threshold_default_in_seconds

Set the default time in seconds for the auto repair threshold for tablet
tables. If the time since last repair is bigger than the configured
time, the tablet is eligible for auto repair. The value will be
overridden by the per keyspace or per table configuration which is not
implemented yet.

The following metrcis are added:

- auto_repair_needs_repair_nr

The number of tablets with auto repair enabled that needs repair

- auto_repair_enabled_nr

The number of tablets with auto repair enabled

The metrics are useful to tell if auto repair is falling behind.

In the future, more auto repair scheduling will be added, e.g.,
scheduling based on the repaired and unrepaired sstable set size,
tombstone ratio and so on, in addition to the time based scheduling.

Fixes SCYLLADB-99
2026-01-09 16:11:39 +08:00
Botond Dénes
60570d7114 Merge 'topology coordinator: restrict node join/remove to preserve RF-rack validity' from Michael Litvak
Allow creating materialized views and secondary indexes in a tablets keyspace only if it's RF-rack-valid, and enforce RF-rack-validity while the keyspace has views by restricting some operations:
* Altering a keyspace's RF if it would make the keyspace RF-rack-invalid
* Adding a node in a new rack
* Removing / Decommissioning the last node in a rack

Previously the config option `rf_rack_valid_keyspaces` was required for creating views. We now remove this restriction - it's not needed because we always maintain RF-rack-validity for keyspaces with views.

The restrictions are relevant only for keyspaces with numerical RF. Keyspace with rack-list-based RF are always RF-rack-valid.

Fixes scylladb/scylladb#23345
Fixes https://github.com/scylladb/scylladb/issues/26820

backport to relevant versions for materialized views with tablets since it depends on rf-rack validity

Closes scylladb/scylladb#26354

* github.com:scylladb/scylladb:
  docs: update RF-rack restrictions
  cql3: don't apply RF-rack restrictions on vector indexes
  cql3: add warning when creating mv/index with tablets about rf-rack
  service/tablet_allocator: always allow tablet merge of tables with views
  locator: extend rf-rack validation for rack lists
  test: test rf-rack validity when creating keyspace during node ops
  locator: fix rf-rack validation during node join/remove
  test: test topology restrictions for views with tablets
  test: add test_topology_ops_with_rf_rack_valid
  topology coordinator: restrict node join/remove to preserve RF-rack validity
  topology coordinator: add validation to node remove
  locator: extend rf-rack validation functions
  view: change validate_view_keyspace to allow MVs if RF=Racks
  db: enforce rf-rack-validity for keyspaces with views
  replica/db: add enforce_rf_rack_validity_for_keyspace helper
  db: remove enforce parameter from check_rf_rack_validity
  test: adjust test to not break rf-rack validity
2026-01-09 10:01:23 +02:00
Patryk Jędrzejczak
eee2b6c7af Merge 'tablets: Make balancing disabling RPC preempt tablet transitions' from Tomasz Grabiec
Disabling of balancing waits for topology state machine to become idle, to guarantee that no migrations are happening or will happen after the call returns. But it doesn't interrupt the scheduler, which means the call can take arbitrary amount of time. It may wait for tablet repair to be finished, which can take many hours.

We should do it via topology request, which will interrupt the tablet scheduler.

Enabling of balancing can be immediate.

Fixes https://github.com/scylladb/scylladb/issues/27647
Fixes #27210

Closes scylladb/scylladb#27736

* https://github.com/scylladb/scylladb:
  test: Verify that repair doesn't block disabling of tablet load balancing
  tablets: Make balancing disabling call preempt tablet transitions
2026-01-08 21:55:19 +02:00
Piotr Dulikowski
8e3e39a64a Merge 'service/storage_service: update service levels cache after upgrade to v2' from Michał Jadwiszczak
Service levels cache is empty after upgrade to consistent topology
if no mutations are commited to `system.service_levels_v2` or rolling
restart is not done.

To fix the bug, this patch adds service levels cache reloading after
upgrading the SL data accessor to v2 in `storage_service::topology_state_load()`.

Fixes [SCYLLADB-90](https://scylladb.atlassian.net/browse/SCYLLADB-90)

This fix should be backported to all versions containing service levels on Raft.

[SCYLLADB-90]: https://scylladb.atlassian.net/browse/SCYLLADB-90?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#27585

* github.com:scylladb/scylladb:
  service/storage_service: update service levels cache after upgrade to v2
  service/storage_service: check if service levels were already upgraded before doing migration to raft
2026-01-08 21:55:19 +02:00
Michał Hudobski
e2e479f20d auth: fix cdc vector search indexing permission bug
VECTOR_SEARCH_INDEXING permission didn't work on cdc tables as we mistakenly checked for vector indexes on the cdc table insted of the base.
This patch fixes that and adds a test that validates this behavior.

Fixes: VECTOR-476

Closes scylladb/scylladb#28050
2026-01-08 21:55:19 +02:00
Ernest Zaslavsky
19fe630c0e Update seastar submodule
seastar 4dcd4df..dd46b6fe
```
dd46b6fe net: expose DNS TTL via net::hostent
b94f81b0 test: Extend statat() test to check ENOENT exception reporting
```

Closes scylladb/scylladb#28006
2026-01-08 21:55:19 +02:00
Michael Litvak
8f15c7a874 db/view/view_update_generator: move discover_staging_sstables to start
Call discover_staging_sstables in view_update_generator::start() instead
of in the constructor, because the constructor is called during
initialization before sstables are loaded.

The initialization order was changed in 5d1f74b86a and caused this
regression. It means the view update generator won't discover staging
sstables on startup and view updates won't be generated for them. It
also causes issues in sstable cleanup.

view_update_generator::start() is called in a later stage of the
initialization, after sstable loading, so do the discovery of staging
sstables there.

Fixes scylladb/scylladb#27956

Closes scylladb/scylladb#27970
2026-01-08 21:55:19 +02:00
Botond Dénes
8c72dcc1ec Merge 'database: truncate_table_on_all_shards: consider can_flush on all shards' from Benny Halevy
Currently, database::truncate_table_on_all_shards calls the table::can_flush only on the coordinator shard
and therefore it may miss shards with dirty data if the coordinator shard happens to have empty memtables, leading to clearing the memtables with dirty data rather than flushing them.

This change fixes that by making flush safe to be called, even if the memtable list is empty, and calling it on every shard that can flush (i.e. seal_immediate_fn is engaged).

Also, change database_test::do_with_some_data is use random keys instead of hard-coded key names, to reproduce this issue with `snapshot_list_contains_dropped_tables`.

Fixes #27639

* The issue exists since forever and might cause data loss due to wrongly clearing the memtable, so it needs backport to all live versions

Closes scylladb/scylladb#27643

* github.com:scylladb/scylladb:
  test: database_test: do_with_some_data: randomize keys
  database: truncate_table_on_all_shards: drop outdated TODO comment
  database: truncate_table_on_all_shards: consider can_flush on all shards
  memtable_list: unify can_flush and may_flush
  test: database_test: add test_flush_empty_table_waits_on_outstanding_flush
  replica: table, storage_group, compaction_group: add needs_flush
  test: database_test: do_with_some_data_in_thread: accept void callback function
2026-01-08 21:55:19 +02:00
Avi Kivity
633e6e0037 build: update toolchain generation procedure for optimized clang
Explain where to pick up existing clang archives, and how to upload new ones.

Closes scylladb/scylladb#27690
2026-01-08 21:55:18 +02:00
Evgeniy Naydanov
a9da14be19 test: dtest: reproducer for parallel rebuild failure
2-DC cluster parallel non-RBNO rebuild failure when expanding RF in DC2.

Steps to reproduce:
    1. Provision a cluster with 2 datacenters and at least 2 nodes in the second datacenter.
    2. Let’s assume datacenter names are "dc1" and "dc2".
    3. Create a keyspace ("keyspace1") with RF=0 in dc2.
    4. Populate some data into dc1.
    5. Change keyspace1 replication in dc2 to 2.
    6. On 2 nodes in dc2 run the following command in parallel:
         nodetool rebuild --source-dc dc1

Parallel execution of rebuilds is not possible with RBNO enabled.

This test is the repro for #27804

Closes scylladb/scylladb#27747
2026-01-08 21:55:18 +02:00
Botond Dénes
9b4a7f1d14 Merge 'test: cluster: object_store: test_backup: modernize do_abort_restore' from Benny Halevy
Currently the function uses a regular expression
to check the system log for a specific message.
This is tangential to the ability to cleanly abort the restore task, plus the regular expression has a syntax error:
```
test/cluster/object_store/test_backup.py:534
  /home/bhalevy/dev/scylla/test/cluster/object_store/test_backup.py:534: SyntaxWarning: "\(" is an invalid escape sequence. Such sequences will not work in the future. Did you mean "\\("? A raw string is also an option.
    await wait_for_first_completed([l.wait_for("Failed to handle STREAM_MUTATION_FRAGMENTS \(receive and distribute phase\) for .+: Streaming aborted", timeout=10) for l in logs])
```

Thsi change modernizes the implementation by:
- using auto_dc_rack for manager.servers_add
- using new_test_keyspace to generate and auto delete the keyspace
- using async gatherio and a prepared statement to insert the data
- simplifing the keys and values by NOT using os.urandom (that is notoriously slow)
- inserting fewer keys in debug mode
- removing the log check

With that, the test can be reenabled in all modes.

* No backport needed since the test was disabled

Closes scylladb/scylladb#27892

* github.com:scylladb/scylladb:
  test_backup: do_abort_restore: reduce data footprint
  test_backup: do_abort_restore: use error injection
  test_backup: do_abort_restore: use asyncio for cql
  test_backup: do_abort_restore: use new_test_keyspace
  test_backup: do_abort_restore: use logger rather than print
  test_backup: do_abort_restore: pass auto_rack_dc to servers_add
2026-01-08 21:55:18 +02:00
Anna Stuchlik
f614482e66 doc: add the patch release upgrade procedure for version 2025.4
Adds the patch upgrade guide based on previous upgrade guides.

Fixes https://github.com/scylladb/scylladb/issues/27982

Closes scylladb/scylladb#27985
2026-01-08 21:55:18 +02:00
Asias He
4f77dd058d repair: Add tablet repair progress report support
This patch adds tablet repair progress report support so that the user
could use the /task_manager/task_status API to query the progress.

In order to support this, a new system table is introduced to record the
user request related info, i.e, start of the request and end of the
request.

The progress is accurate when tablet split or merge happens in the
middle of the request, since the tokens of the tablet are recorded when
the request is started and when repair of each tablet is finished. The
original tablet repair is considered as finished when the finished
ranges cover the original tablet token ranges.

After this patch, the /task_manager/task_status API will report correct
progress_total and progress_completed.

Fixes #22564
Fixes #26896

Closes scylladb/scylladb#27679
2026-01-08 21:55:18 +02:00
Anna Stuchlik
3f1c7c70f5 doc: remove the link to the Download Center
... from the OS support page.

Fixes https://github.com/scylladb/scylladb/issues/28047

Closes scylladb/scylladb#28048
2026-01-08 21:55:18 +02:00
Asias He
0aabf51380 repair: Fix sstable_list_to_mark_as_repaired with multishard writer
It was obseved:

```
test_repair_disjoint_row_2nodes_diff_shard_count was spuriously failing due to
segfault.

backtrace pointed to a failure when allocating an object from the chain of
freed objects, which indicates memory corruption.

(gdb) bt
    at ./seastar/include/seastar/core/shared_ptr.hh:275
    at ./seastar/include/seastar/core/shared_ptr.hh:430
Usual suspect is use-after-free, so ran the reproducer in the sanitize mode,
which indicated shared ptr was being copied into another cpu through the
multi shard writer:

seastar - shared_ptr accessed on non-owner cpu, at: ...
--------
seastar::smp_message_queue::async_work_item<mutation_writer::multishard_writer::make_shard_writer...

```

The multishard writer itself was fine, the problem was in the streaming consumer
for repair copying a shared ptr. It could work fine with same smp setting, since
there will be only 1 shard in the consumer path, from rpc handler all the way
to the consumer. But with mixed smp setting, the ptr would be copied into the
cpus involved, and since the shared ptr is not cpu safe, the refcount change
can go wrong, causing double free, use-after-free.

To fix, we pass a generic incremental repair handler to the streaming
consumer. The handler is safe to be copied to different shards. It will
be a no op if incremental repair is not enabled or on a different shard.

A reproducer test is added. The test could reproduce the crash
consistently before the fix and work well after the fix.

Fixes #27666

Closes scylladb/scylladb#27870
2026-01-08 21:55:18 +02:00
Radosław Cybulski
5f48ab3875 storage_proxy: fix invalid assert
Change invalid `assert(true)` into `SCYLLA_ASSERT(false)`, as
the latter was clearly meant.

Closes scylladb/scylladb#27900
2026-01-08 21:55:18 +02:00
Andrei Chekun
c950c2e582 test.py: convert skip_mode function to pytest.mark
Function skip_mode works only on function and only in cluster test. This if OK
when we need to skip one test, but it's not possible to use it with pytestmark
to automatically mark all tests in the file. The goal of this PR is to migrate

skip_mode to be dynamic pytest.mark that can be used as ordinary mark.

Closes scylladb/scylladb#27853

[avi: apply to test/cluster/test_tablets.py::test_table_creation_wakes_up_balancer]
2026-01-08 21:55:16 +02:00
Tomasz Grabiec
a52de4ecdc test: cluster: test_topology_ops[_encrypted]: Fix failures due to background migrations fencing out writes
The test if flaky, with failures in:

        for server in servers:
>           await check_node_log_for_failed_mutations(manager, server)

test/cluster/test_topology_ops_encrypted.py:84:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

manager = <test.pylib.manager_client.ManagerClient object at 0xffff602e8590>
server = ServerInfo(server_id=1769, ip_addr='127.82.127.43', rpc_address='127.82.127.43', datacenter='DEFAULT_DC', rack='DEFAULT_RACK', pid=186578)

    async def check_node_log_for_failed_mutations(manager: ManagerClient, server: ServerInfo):
        logging.info(f"Checking that node {server} had no failed mutations")
        log = await manager.server_open_log(server.server_id)
        occurrences = await log.grep(expr="Failed to apply mutation from", filter_expr="(TRACE|DEBUG|INFO)")
>       assert len(occurrences) == 0
E       AssertionError

test/cluster/util.py:319: AssertionError

As diagnosed by Gleb in https://github.com/scylladb/scylladb/issues/27942#issuecomment-3710013625:

"The fencing errors here look legit given that we do not wait for all
requests to complete while shutting down the storage proxy. The
scenario is this:

Test does writes to rf=3 keyspace with cl=one. One node is shutting
down while there is a tablet migration. Tablet migration executes
barrier and drain which fails on a node that is been shutdown. The
topology coordinator proceeds fencing the old topology, but there
still can be un-handled mutation requests from the shutting down node
on other nodes and they will generate fencing errors like they should.

They way to avoid it (though it is benign) is to wait for all outgoing
storage proxy requests to complete during shutdown, but even then the
error may still happen since a request may timeout before it is
processed by the other side, so it may be completed by a storage proxy
coordinator side, but still not handled by replica side. This what we
have fencing for in the first place."

Fix by diabling background tablet migrations, so that we have no
topology barriers concurrent with node shutdown.

Fixes #27942

Closes scylladb/scylladb#28034
2026-01-08 21:53:47 +02:00
Tomasz Grabiec
34df158605 test: cluster: Fix NoHostAvailable error in test_not_enough_token_owners
The driver must see server_c before we stop server_a, otherwise
there will be no live host in the pool when we attempt to drop
the keyspace:

```
   @pytest.mark.asyncio
    async def test_not_enough_token_owners(manager: ManagerClient):
        """
        Test that:
        - the first node in the cluster cannot be a zero-token node
        - removenode and decommission of the only token owner fail in the presence of zero-token nodes
        - removenode and decommission of a token owner fail in the presence of zero-token nodes if the number of token
          owners would fall below the RF of some keyspace using tablets
        """
        logging.info('Trying to add a zero-token server as the first server in the cluster')
        await manager.server_add(config={'join_ring': False},
                                 property_file={"dc": "dc1", "rack": "rz"},
                                 expected_error='Cannot start the first node in the cluster as zero-token')

        logging.info('Adding the first server')
        server_a = await manager.server_add(property_file={"dc": "dc1", "rack": "r1"})

        logging.info('Adding two zero-token servers')
        # The second server is needed only to preserve the Raft majority.
        server_b = (await manager.servers_add(2, config={'join_ring': False}, property_file={"dc": "dc1", "rack": "rz"}))[0]

        logging.info(f'Trying to decommission the only token owner {server_a}')
        await manager.decommission_node(server_a.server_id,
                                        expected_error='Cannot decommission the last token-owning node in the cluster')

        logging.info(f'Stopping {server_a}')
        await manager.server_stop_gracefully(server_a.server_id)

        logging.info(f'Trying to remove the only token owner {server_a} by {server_b}')
        await manager.remove_node(server_b.server_id, server_a.server_id,
                                  expected_error='cannot be removed because it is the last token-owning node in the cluster')

        logging.info(f'Starting {server_a}')
        await manager.server_start(server_a.server_id)

        logging.info('Adding a normal server')
        await manager.server_add(property_file={"dc": "dc1", "rack": "r2"})

        cql = manager.get_cql()

        await wait_for_cql_and_get_hosts(cql, [server_a], time.time() + 60)

>       async with new_test_keyspace(manager, "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }") as ks_name:

test/cluster/test_not_enough_token_owners.py:57:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/lib64/python3.14/contextlib.py:221: in __aexit__
    await anext(self.gen)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

manager = <test.pylib.manager_client.ManagerClient object at 0x7f37efe00830>
opts = "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }"
host = None

    @asynccontextmanager
    async def new_test_keyspace(manager: ManagerClient, opts, host=None):
        """
        A utility function for creating a new temporary keyspace with given
        options. It can be used in a "async with", as:
            async with new_test_keyspace(ManagerClient, '...') as keyspace:
        """
        keyspace = await create_new_test_keyspace(manager.get_cql(), opts, host)
        try:
            yield keyspace
        except:
            logger.info(f"Error happened while using keyspace '{keyspace}', the keyspace is left in place for investigation")
            raise
        else:
>           await manager.get_cql().run_async("DROP KEYSPACE " + keyspace, host=host)
E           cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.69.108.39:9042 dc1>: ConnectionException('Pool for 127.69.108.39:9042 is shutdown')})

test/cluster/util.py:544: NoHostAvailable
```

Fixes #28011

Closes scylladb/scylladb#28040
2026-01-08 21:53:47 +02:00
Andrei Chekun
ee0bf35615 test.py: add custome exit code for pytest in case maxfail reached
This PR adds custom exit code in case when maxfail reached. This is
needed for easier detection why pytest failed in CI.

Closes scylladb/scylladb#28018
2026-01-08 21:53:47 +02:00
Anna Stuchlik
1b653166f1 doc: add the version variable to the Web Installer instructions
This commit replaces a fixed version name with the variable for the current version
in the instructions for installing a non-default version with Web Installer.
This will make using the installer more user-friendly.

Fixes https://github.com/scylladb/scylladb/issues/28005
2026-01-08 10:12:21 +01:00
Benny Halevy
ebd667a8e0 test: database_test: do_with_some_data: randomize keys
With randomized keys, and since we're inserting only 2 keys,
it is possible that they would end up owned only by a single shard,
reproducing #27639 in snapshot_list_contains_dropped_tables.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-08 09:49:46 +02:00
Benny Halevy
93b827c185 database: truncate_table_on_all_shards: drop outdated TODO comment
The comment was added in 83323e155e
Since then, table::seal_active_memtable was improved to guarantee
waiting on oustanding flushes on success (See d55a2ac762), so
we can remove this TODO comment (it also not covered by any issue
so nobody is planned to ever work on it).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-08 09:49:46 +02:00
Benny Halevy
2a803d2261 database: truncate_table_on_all_shards: consider can_flush on all shards
can_flush might return a different value for each shard
so check it right before deciding whether to flush or clear a memtable
shard.

Note that under normal condition can_flush would always return true
now that it checks only the presence of the seal memtable function
rather than check memtable_list::empty().

Fixes #27639

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-08 09:49:46 +02:00
Benny Halevy
02ee341a03 memtable_list: unify can_flush and may_flush
Now that we have a unit test proving that it's safe to flush an
empty memtable list there is no need to distinguish between
may_flush and can_flush.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-08 09:49:46 +02:00
Benny Halevy
0342a24ee0 test: database_test: add test_flush_empty_table_waits_on_outstanding_flush
Test that table::flush waits on outstanding flushes, even if the active memtable is empty

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-08 09:49:45 +02:00
Benny Halevy
5be6b80936 replica: table, storage_group, compaction_group: add needs_flush
Table needs flush if not all its memtable lists are empty.
To be used in the next patch for a unit test.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-08 09:41:22 +02:00
Benny Halevy
ec4069246d test: database_test: do_with_some_data_in_thread: accept void callback function
Many test cases already assume `func` is being called a seastar
thread and although the function they pass returns a (ready) future,
it serves no purpose other than to conform to the interface.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-08 09:41:22 +02:00
Patryk Jędrzejczak
946a2bb988 storage_service: do not call raft_topology_update_ip for left nodes
This `raft_topology_update_ip` call always returns after `t.find(raft_id)`
returns `nullptr`, so it effectively does nothing. It's not a bug, since
there is no reason to update `system.peers` for left nodes anyway. We
delete the rows corresponding to left nodes in `process_left_node` (called
just above).

Closes scylladb/scylladb#27899
2026-01-07 16:52:13 +01:00
Michał Jadwiszczak
be16e42cb0 service/storage_service: update service levels cache after upgrade to v2
Service levels cache is empty after upgrade to consistent topology
if no mutations are commited to `system.service_levels_v2` or rolling
restart is not done.

To fix the bug, this commit adds service levels cache reloading after
upgrading the SL data accessor to v2 in `storage_service::topology_state_load()`.

Fixes SCYLLADB-90
2026-01-07 14:06:13 +01:00
Michał Jadwiszczak
53d0a2b5dc service/storage_service: check if service levels were already upgraded
before doing migration to raft

There is no need to call `service_level_controller::upgrade_to_v2()`
on every topology state load, we only need to do it once.
2026-01-07 14:06:13 +01:00
Nadav Har'El
f7eae50d98 test/alternator: fix some expected error messages to fit DynamoDB
All tests I am fixing in this patch do pass for me on DynamoDB, but
other developers report that they fail because some DynamoDB servers
apparently use slightly different error messages, with less detail about
the cause of an error. For example, some of our tests currently expect
an error message that looks like:

    An error occurred (ValidationException) when calling the Query
    operation: Invalid operator used in KeyConditionExpression:
    attribute_exists

But some servers don't report the ": attribute_exists" at the end, so
we can't use the word "attribute_exists" it in the test to recognize
the correct error, and needs to use a different word (which both
versions of DynamoDB and Alternator all print).

As another example, the good old DynamoDB error:

    An error occurred (ValidationException) when calling the Query
    operation: 1 validation error detected: Value 'DOG' at
   'conditionalOperator' failed to satisfy constraint: Member must
   satisfy enum value set: [OR, AND]

Got replaced by the following less informative message:

    An error occurred (ValidationException) when calling the Query
    operation: Failed to satisfy constraint: Member must satisfy enum
    value set: [ALL, OR]'

So we need to fix the test to allow it too.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-01-07 14:06:33 +02:00
Nadav Har'El
e97fbc2d65 test/alternator: fix compressed request test on non-us-east1
The test test_compressed_request.py::test_compressed_request coerces
boto3 to send a compressed request, and wrongly used region_name=us-east-1
to set up the connection. Theoretically, this doesn't matter because
we also set the correct URL (for either Alternator or the desired region
in AWS). But in fact it does matter, because region name is part of the
request's signature, and DynamoDB refuses the request if it comes to
a different region than it is signed for. So this test fails when run
on DynamoDB on any other region except us-east-1.

The fix is simple - don't use the constant "us-east-1", but pick up the
correct region name from the original connection.

The functions new_dynamodb_session(), new_dynamodb() and
new_dynamodb_stremas() had the same bug and we fix it too, but it didn't
break any test because the only tests using these functions were
Scylla-only so the AWS region problem didn't apply to them.
2026-01-07 13:33:46 +02:00
Patryk Jędrzejczak
f0d159abb0 Merge 'test/raft: use valid sentinel in liveness check to prevent digest errors' from Emil Maskovsky
Replace -1 with 0 for the liveness check operation to avoid triggering digest validation failures. This prevents rare fatal errors when the cluster is recovering and ensures the test does not violate append_seq invariants.

The value -1 was causing invalid digest results in the append_seq structure, leading to assertion failures. This could happen when the sentinel value was the first (or only) element being appended, resulting in a digest that did not match the expected value.

By using 0 instead, we ensure that the digest calculations remain valid and consistent with the expected behavior of the test.

The specific value of the sentinel is not important, as long as it is a valid elem_t that does not violate the invariants of the append_seq structure. In particular, the sentinel value is typically used only when no valid result is received from any server in the current loop iteration, in which case the loop will retry.

Fixes: scylladb/scylladb#27307

Backporting to active branches - this is a test-only fix (low risk) for a flaky test that exists in older branches (thus affects the CI of active branches).

Closes scylladb/scylladb#28010

* https://github.com/scylladb/scylladb:
  test/raft: use valid sentinel in liveness check to prevent digest errors
  test/raft: improve debugging in randomized_nemesis_test
2026-01-07 12:31:21 +01:00
Nadav Har'El
2c02e463ff test/alternator: fix test's expected error message on DynamoDB
The Alternator test test_tag.py::test_tag_lsi_gsi expects to see an
error - it's not allowed to set a tag on a GSI or LSI - but the error
message that DynamoDB prints recently changed - instead of saying
"ResourceArn" the new error message says "resource arn".

Change the test to allow both forms, so it will pass on both Alternator
(which still uses the word ResourceArn - which is the name of the
parameter) and on DynamoDB (which uses "resource arn").

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-01-07 12:51:10 +02:00
Nadav Har'El
4f3150c282 test/alternator: mark Alternator-only test scylla_only
The test test_batch.py::test_batch_write_item_large_broken_connection
failed on DynamoDB (Refs #26079). It turns out this test has many
problems:

1. This test wrongly assumes a batch write needs to complete in one
   attempt - and this fails on DynamoDB with low WCU capacity where
   the batch needs to be resumed in multiple requests. Using boto3's
   batch_writer() fixes this problem.
2. This test has NOTHING to do with batches - so is mis-named and
   mis-placed. The batch write is just a way to prepare some data
   in the table, and the real test is about Query'ing the data back
   and observing the long response and reproducing issue #14454.
   I did not rename or move the test, but left a comment explaining
   the situation.
3. This test is written to assume the Query's response uses HTTP
   chunked encoding. Which isn't actually true for DynamoDB, at least
   not at the time of this writing. So the test fails on DynamoDB.

For the last reason, I made this test scylla_only. This test can't
really be run on DynamoDB without rewriting it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-01-07 12:51:10 +02:00
Nadav Har'El
df6b347911 test/alternator: fix test on DynamoDB
The test test_batch.py::test_batch_write_item_large often fails when
running on DynamoDB, and this patch fixes it. The test checks that a
large but not over-the-limits large batch works. However, "works" only
means that the batch is not an error - it doesn't guarantee that all the
items in the batch are performed. If the WCU limits of the table are
exceeded DynamoDB may perform only part of the the batch and return the
remaining items as UnprocessedItems. This not only can happen, it
usually does happen on DynamoDB - because a new on-demand-billing table
always start with a very low WCU capacity.

So in this patch we update the test to recognize and perform the
UnprocessedItems, instead of assuming it needs to be empty.

The test continues to pass on Alternator, and finally passes on
DynamoDB.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-01-07 12:51:10 +02:00
Nadav Har'El
9d6a463324 test/alternator: increase wait_for_gsi() timeout
In Alternator tests, the wait_for_gsi() utility function is used in
tests that add a GSI to an existing table, to wait for this new GSI
to become ready. Although this takes a fraction of a second on
Alternator, we noticed that this takes many minutes (!) on DynamoDB
so we used an absurdly high 10 minute timeout to allow tests to also
pass on DynamoDB.

But it turns out that 10 minutes wasn't absurdly high enough, and
tests using it in test_gsi_updatetable.py started to fail on DynamoDB.
Empirically, 10 minutes was enough in the past but it seems that today
adding a GSI to an empty table routinely takes as much as 20 minutes.

So this patch increases the wait_for_gsi() timeout to a whopping 30
minutes. After this patch, the tests in test_gsi_updatetable.py which
used to fail - test_gsi_backfill_with_lsi,
test_gsi_backfill_with_real_column, test_gsi_creates_and_deletes and
test_gsi_backfill_oversized_key now all pass on DynamoDB - but each
takes more than 20 minutes to pass.

To allow the test to fail much more quickly on Alternator (where
creating a GSI takes a fraction of a second), we set a much lower
but still very high timeout when running on Alternator - 60 seconds.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-01-07 12:50:54 +02:00
Łukasz Paszkowski
62313a6264 load_sketch: Allow populating load_sketch with normalized current load
Currently, tablet allocation intentionally ignores current load (
introduced by the commit #1e407ab) which could cause identical shard
selection when allocating a small number of tablets in the same topology.
When a tablet allocator is asked to allocate N tablets (where N is smaller
than the number of shards on a node), it selects the first N lowest shards.
If multiple such tables are created, each allocator run picks the same
shards, leading to tablet imbalance across shards.

This change initializes the load sketch with the current shard load,
scaled into the [0,1] range, ensuring allocation still remains even
while starting from globally least-loaded shards.

Fixes https://github.com/scylladb/scylladb/issues/27620

Closes scylladb/scylladb#27802
2026-01-07 11:49:01 +01:00
Avi Kivity
2642636ada build: avoid ccache masquarading when choosing ccache too
In 12dcf79c60, we avoid the ccache masquarate directory
when choosing sccache, as that would give us a double-caching
effect: first sccache is called, then clang++ is looked up
finding ccache masquarading as clang++. We solved that by
converting the name clang++ to the absolute path /usr/bin/clang++
(or whatever), skipping over the masquarade directory in $PATH.

It turns out that we need to do the same for ccache. That commit
changed the compile command to 'ccache clang++', and ccache will
look up clang++ in $PATH, finding itself in the masquarade directory.

Fix that by avoiding the masquarade directory if a compiler cache is
specified explicitly or is found with --compiler-cache=auto.

Closes scylladb/scylladb#27996
2026-01-06 17:47:09 +02:00
Nadav Har'El
5f79d93102 Merge 'Alternator response compression' from Szymon Malewski
This pull request introduces HTTP response compression to Alternator, allowing responses (both string and chunked) to be compressed using `gzip` or `deflate` when requested by clients and when the response size exceeds configurable thresholds.

* Added new source files `http_compression.cc` and `http_compression.hh` implementing compression logic, including parsing client `Accept-Encoding` headers, selecting compression algorithms, and compressing response bodies using zlib.

* Added two new configuration options to `db::config` (`alternator_response_gzip_compression_level` and `alternator_response_gzip_compression_threshold_in_bytes`) to control compression level (and optionally disable compression with level 0 - no compression) and minimum response size for compression.

* Added tests showing compliance with DynamoDB behavior.

Fixes #27246

New feature - no backporting

Closes scylladb/scylladb#27454

* github.com:scylladb/scylladb:
  alternator/http_compression: Add compression of streamed response
  alternator/http_compression: Add implementation od gzip/deflate of string response
  alternator/http_compression: Add handling of Accept-Encoding header
  test/alternator: add tests for compressed responses
2026-01-06 16:47:11 +02:00
Emil Maskovsky
4ba3e90f33 test/raft: use valid sentinel in liveness check to prevent digest errors
Replace -1 with 0 for the liveness check operation to avoid triggering
digest validation failures. This prevents rare fatal errors when the
cluster is recovering and ensures the test does not violate append_seq
invariants.

The value -1 was causing invalid digest results in the append_seq
structure, leading to assertion failures. This could happen when the
sentinel value was the first (or only) element being appended, resulting
in a digest that did not match the expected value.

By using 0 instead, we ensure that the digest calculations remain valid
and consistent with the expected behavior of the test.

The specific value of the sentinel is not important, as long as it is
a valid elem_t that does not violate the invariants of the append_seq
structure. In particular, the sentinel value is typically used only
when no valid result is received from any server in the current loop
iteration, in which case the loop will retry.

Fixes: scylladb/scylladb#27307
2026-01-06 14:34:02 +01:00
Emil Maskovsky
3af5183633 test/raft: improve debugging in randomized_nemesis_test
Move the post-condition check before the assertion to ensure it is
always executed first. Before, the wrong value could be passed to the
digest_remove assertion, making the pre-check trigger there instead of
the post-check as expected.

Also, add a check in the append_seq constructor to ensure that the
digest value is valid when creating an append_seq object.
2026-01-06 14:32:46 +01:00
Ferenc Szili
a51cb3dad9 test: fix flaky test_update_load_stats_after_migration
Disable load balancing to avoid the balancer moving the tablet from a
node with less to a node with more available disk space. Otherwise, the
move_tablet API can fail (if the tablet is already in transisiton) or
be a no-op (in case the tablet has already been migrated)

Fixes: #27980

Closes scylladb/scylladb#27993
2026-01-06 11:57:35 +02:00
Andrei Chekun
b546315edf test.py: fix race condition in initizlization of cqlpy tests
Fix the race condition when the process finished, while test is trying
to checks its descriptors. Now instead of failing the whole loop, it
will continue to iterate the rest of the process to find the needed
process.

Closes scylladb/scylladb#27994
2026-01-06 10:40:25 +02:00
Avi Kivity
4c9c3aae23 tools: toolchain: add dockerfile for future toolchain
To avoid surprises when libstdc++, clang, or other components
in the toolchain introduce regressions, we introduce a "future
toolchain". This builds on the Fedora version under active
development, and the development branches of gcc and llvm.

The future toolchain is not intended to be frozen. Rather,
periodically we will build the future toolchain, then build
ScyllaDB and run its unit tests under that toolchain, then
discard it. Any problems will then have be be tracked down
by a developer and either reported to the source repository,
or fixed in ScyllaDB.

Closes scylladb/scylladb#27964
2026-01-05 19:38:58 +02:00
Nadav Har'El
384e394ff0 Merge 'Add similarity functions to calculate similarity of given vectors' from Dawid Pawlik
It should be possible to return the similarity of vectors in CQL statements following the [Cassandra compatible syntax](https://cassandra.apache.org/doc/latest/cassandra/getting-started/vector-search-quickstart.html#query-vector-data-with-cql):

```
SELECT comment, similarity_cosine(comment_vector, [0.1, 0.15, 0.3, 0.12, 0.05])
    FROM cycling.comments_vs;
```

Although the calculations are slow, and we already have calculated results returned via Vector Store API,
we need the functionality as it allows us to calculate similarity of vectors not stored in vector indexes.

It will be needed for [quantization and rescoring](https://scylladb.atlassian.net/wiki/spaces/RND/pages/195985800/Quantization+and+Rescoring).

The feature is also a nice-to-have in testing as requested many times by testing and CX teams.

The optimized version utilizing already calculated distances from Vector Store without a need of rescoring will be coming soon after via https://github.com/scylladb/scylladb/pull/27991.

---

The patch adds functions:
- `similarity_cosine(<vector>, <vector>)`,
- `similarity_euclidean(<vector>, <vector>)`,
- `similarity_dot_product(<vector>, <vector>)`

Where `<vector>` is either a column of type `VECTOR<FLOAT, N>` or a vector of floats literal.

These functions can be called with every `SELECT` query, not only ANN vector queries as opposed to https://github.com/scylladb/scylladb/pull/25993.

The similarity calculations are implemented inspired by [USearch's implementation](
a2f1759910/include/usearch/index_plugins.hpp (L1304-L1385)) and made compatible with [Cassandra's documentation](https://cassandra.apache.org/doc/5.0/cassandra/developing/cql/functions.html#vector-similarity-functions).
That would guarantee the results in ScyllaDB are calculated using the exact same algorithms as used in Vector Store indexes.

---

Fixes: SCYLLADB-88
Fixes: SCYLLADB-89

New feature, should land into 2026.1

Closes scylladb/scylladb#27524

* github.com:scylladb/scylladb:
  docs: add vector similarity functions documentation
  test/cqlpy: add similarity functions correctness tests
  test/cqlpy: add similarity functions invalid call tests
  cql3: introduce similarity functions syntax
  vector_similarity_fcts: introduce similarity functions
  vector_similarity_fcts: retrieve similarity function argument types
  vector_similarity_fcts: add calculating similarity between vectors
2026-01-05 18:28:10 +02:00
Tomasz Grabiec
ffa11d6a2d test: Verify that repair doesn't block disabling of tablet load balancing
Refs #27647
2026-01-05 13:22:15 +01:00
Tomasz Grabiec
ccdb301731 tablets: Make balancing disabling call preempt tablet transitions
This patch modifies RESTful API handler which disables tablet
balancing to use topology request to wait for already running tablet
transitions. Before, it was just waiting for topology to be idle, so
it could wait much longer than necessary, also for operations which
are not affected by the flag, like repair. And repair can take hours.

New request type is introduced for this synchronization: noop_request.
It will preempt the tablet scheduler, and when the request executes,
we know all later tablet transitions will respect the "balancing
disabled" flag, and only things which are unuaffected by the flag,
like repair, will be scheduled.

Fixes #27647
2026-01-05 13:22:08 +01:00
Nadav Har'El
5c2ca56adf test/alternator: fix test passing a spurious parameter
The test test_streams.py::test_streams_putitem_new_item_overrides_old_lsi
failed on DynamoDB (Refs #26079) because we passed an unused parameter
NonKeyAttributes to the Projection setting an LSI. NonKeyAttributes is
only allowed when ProjectionType=INCLUDE, but we used ProjectionType=ALL.
DynamoDB refuses to create an LSI with such inconsistent parameters,
and we just need to remove this unnecessary parameter from this test.

The reason why this test didn't fail on Alternator is that Alternator
doesn't yet support or even parse the Projection parameter (Refs #5036).

We also add an xfailing test (passes on DynamoDB, fails on Alternator)
checking that a spurious NonKeyAttributes parameter is rejected. When
we get around to implement the projection feature (#5036), this will
be yet another acceptance test for this feature.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-01-05 13:51:01 +02:00
Botond Dénes
e4da0afb8d reader_concurrency_semaphore: add protection against negative count resource leaks
The semaphore has detection and protection against regular resource
leaks, where some resources go unaccounted for and are not released by
the time the semaphore is destroyed. There is no detection or protection
against negative leaks: where resources are "made up" of thin air. This
kind of leaks looks benign at first sight, a few extra resources won't
hurt anyone so long as this is a small amount. But turns out that even a
single extra count resource can defeat a very important anti-deadlock
protection in can_admit_read(): the special case which admits a new
permit regardless of memory resources, when all original count resources
all available. This check uses ==, so if resource > original, the
protection is defeated indefinitely. Instead of just changing == to >=,
we add detection of such negative leaks to signal(), via
on_internal_error_noexcept().
At this time I still don't now how this negative leak happens (the code
doesn't confess), with this detection, hopefully we'll get a clue from
tests or the field. Note that on_internal_error_noexcept() will not
generate a coredump, unless ScyllaDB is explicitely configured to do so.
In production, it will just generate an error log with a backtrace.
The detection also clams the _resources to _initial_resources, to
prevent any damage from the negativae leak.

I just noticed that there is no unit test for the deadlock protection
described above, so one is added in this PR, even if only loosely
related to the rest of the patch.

Fixes: SCYLLADB-163

Closes scylladb/scylladb#27764
2026-01-05 12:45:15 +02:00
Anna Stuchlik
375479d96c doc: fix the syntax of internal links
Some internal links had the wrong syntax: they were formatted as external links.
As a result, they redirected the user to the outdated Open Source documentation.
This commit fixes that bug.

Fixes https://github.com/scylladb/scylladb/issues/25899

Closes scylladb/scylladb#27905
2026-01-05 10:44:58 +01:00
Szymon Malewski
1f658bb2e2 alternator/http_compression: Add compression of streamed response
This patch adds compression of chunked responses.
It adds intermediate stream to compress chunks of data that are provided to http sink.

Fixes #27246
2026-01-05 10:14:42 +01:00
Szymon Malewski
b8afb173a6 alternator/http_compression: Add implementation od gzip/deflate of string response
Previous commit added means to decide whether client asks for compression and with which algorithm.
This patch adds actual compression of responses based on zlib library.
For now only string (not chunked) responses are compressed.
Several previously defined tests start to pass.
2026-01-05 10:14:42 +01:00
Szymon Malewski
ec329f85b0 alternator/http_compression: Add handling of Accept-Encoding header
This is an initial patch to add support of Alternator's compressed responses.
The actual compression (gzip,deflate) will be added in the following commits.
The main functionality added in this commmit is parsing of Accept-Encoding header,
that indicates compression algorithms supported by the client.
In this commit we add also configuration parameters of response gzip/deflate compression.
They allow to enable/disable compression, set level and a size threshold below which a response is not compressed.
With current implementation it is possible to decide a compression for each response, but it is not used yet.
2026-01-05 10:14:40 +01:00
Szymon Malewski
08386ea959 test/alternator: add tests for compressed responses
Adds set of tests that:
1. Show how DynamoDB handles response compression.
It supports 'gzip' and 'deflate' compression, which can be selected by providing 'Accept-Encoding` header. It only encodes response above 4096B.
- `test_compressed_response`, `test_compressed_response_large` show compression for various response sizes.
- `test_accept_encoding_header` focuses on testing various values of Accept-Encoding header.
- `test_multiple_accept_encoding_headers` verifies behaviour with repeted Accept-Encoding headers.

2. Will confirm implementation of response compression in Alternator (#27246)
Additonally to above test, we check Altenator specific expectations:
- `test_chunked_response_compression` makes sure that compression will work also for chunked responses.
- `test_set_compression_options` checks config options to set response size threshold for compression and compression level

3. `test_signature_trims_accept_encoding_spaces` reveals Alternator's bug in signature verification (#27775)
2026-01-05 10:13:40 +01:00
Avi Kivity
0df85c8ae8 Revert "Merge 'Unify configuration of object storage endpoints' from Pavel Emelyanov"
This reverts commit 1bb897c7ca, reversing
changes made to 954f2cbd2f. It makes
incompatible changes to the object storage configuration format, breaking
tests [1]. It's likely that it doesn't break any production configuration,
but we can't be sure.

Fixes #27966

Closes scylladb/scylladb#27969
2026-01-05 08:53:41 +02:00
Benny Halevy
a8114f9bcc test_backup: do_abort_restore: reduce data footprint
To make the test fast, in particular in debug mode
insert fewer keys and do not rely on os.urandom
which is notoriously slow

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-05 08:04:08 +02:00
Benny Halevy
c0dd662144 test_backup: do_abort_restore: use error injection
Currently the test depends on timing and enough inserted
data to abort the restore tasks at exactly the right time.
This is flaky in nature, so instead, use error injection
to synchronize the abort with mutation streaming.

Note that with that we no longer get the STREAM_MUTATION_FRAGMENTS
log message, so waiting for it is dropped from the test.
The most imporant thing is that some restore tasks must fail.
(We cannot guarantee all would fail unfortunately)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-05 08:03:53 +02:00
Benny Halevy
16dd07c7d4 test_backup: do_abort_restore: use asyncio for cql
Use the more modern asyncio facility to run cql queries
and a prepared statement to insert data into the table.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-05 08:03:53 +02:00
Benny Halevy
f1a583c39c test_backup: do_abort_restore: use new_test_keyspace
For creating a keyspace with a unique name and auto-deleting
it on exit.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-05 08:03:53 +02:00
Benny Halevy
3e8431a3d9 test_backup: do_abort_restore: use logger rather than print
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-05 08:03:53 +02:00
Benny Halevy
acb2c9b045 test_backup: do_abort_restore: pass auto_rack_dc to servers_add
To generate multi-rack cluster, otherwise we get the following error:
```
E   cassandra.protocol.ConfigurationException: <Error from server: code=2300 [Query invalid because of configuration issue] message="Replication factor 3 exceeds the number of racks (1) in dc datacenter1">
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-05 08:03:53 +02:00
Dani Tweig
1ef6ac5439 consolidating jira automation to one workflow file
Closes scylladb/scylladb#27854
2026-01-05 07:09:03 +02:00
copilot-swe-agent[bot]
4e41b6f106 tools/scylla-nodetool: Increase precision of compression ratio from 1 to 2 decimal places
In the tablestats (cfstats) command.
Fixes: https://github.com/scylladb/scylladb/issues/27962

Closes scylladb/scylladb#27965
2026-01-05 07:07:06 +02:00
Avi Kivity
e03d24e3f3 Merge 'Use file_stat with a relative path when listing directories' from Benny Halevy
With the additional file_stat overload introduced in
[Update seastar submodule](3e9b071838),
use the opened directory for more efficient, relative-path based stat.

* Enhancement, no backport needed

Closes scylladb/scylladb#27967

* github.com:scylladb/scylladb:
  table: get_snapshot_details: use relative-path based file_stat
  table: get_snapshot_details: fix warning in exists_in_dir
  table: get_snapshot_details: fix staging dir calculation
  backup: process_snapshot_dir: use relative-path based file_stat
  directory_lister: add ctor with opened directory
2026-01-04 22:06:34 +02:00
Nadav Har'El
c4a9d7eb3e cql: fix DESC KEYSPACES when a "USE" is in effect
If a CQL session USEs a keyspace and then calls DESC TABLES, the user
expects to see only the tables in the chosen keyspace. However, calling
DESC KEYSPACES should still return list all the keyspaces - returning
just the USEd one is not useful - and also not what Cassandra does.
We had an xfailing test test_describe.py::test_keyspaces_with_use which
reproduces this bug (and passes on Cassandra).

In this patch we fix this bug. The fix is simple - USE should affect
DESC statements, but be ignored for DESC KEYSPACES. We can then remove
the xfail marker from the test.

The patch also includes a new test for the DESC TABLES case, where the
USE *does* have an affect. And I wanted to make sure the patch doesn't
break this case. As usual, the new test passes on both Cassandra and
ScyllaDB.

Fixes #26334

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27971
2026-01-04 22:01:12 +02:00
Dawid Mędrek
77a934e5b9 db/hints: Prevent draining hints before hint replay is allowed
Context
-------
The procedure of hint draining boils down to the following steps:

1. Drain a hint sender. That should get rid of all hints stored
   for the corresponding endpoint.
2. Remove the hint directory corresponding to that endpoint.

Obviously, it gets more complex than this high-level perspective.
Without blurring the view, the relevant information is that step 1
in the algorithm above may not be executed.

Breaking it down, it comprises of two calls to
`hint_sender::send_hints_maybe()`. The function is responsible for
sending out hints, but it's not unconditional and will not be performed
if any of the following bullets is not satisfied:

* `hint_sender::replay_allowed()` is not `true`. This can happen when
  hint replay hasn't been turned on yet.
* `hint_sender::can_send()` is not `true`. This can happen if the
  corresponding endpoint is not alive AND it hasn't left the cluster
  AND it's still a normal token owner.

There is one more relevant point: sending hints can be stopped if
replaying hints fails and `hint_sender::send_hints_maybe()` returns
`false`. However, that's not not possible in the case of draining.
In that case, if Scylla comes across any failure, it'll simply delete
the corresponding hint segment. Because of that, we ignore it and
only focus on the two bullets.

---

Why is it a problem?
--------------------
If a hint directory is not purged of all hint segments in it,
any attempt to remove it will fail and we'll observe an error like this:

```
Exception when draining <host ID>: std::filesystem::__cxx11::filesystem_error
(error system:39, filesystem error: remove failed: Directory not empty [<path>])
```

The folder with the remaining hints will also stay on disk, which is, of
course, undesired.

---

When can it happen?
-------------------
As highlighted in the Context section of this commit message, the
key part of the code that can lead to a dangerous situation like that
is `hint_sender::send_hints_maybe()`. The function is called twice when
draining a hint endpoint manager: once to purge all of the existing
hints, and another time after flushing all hints stored in a commitlog
instances, but not listed by `hint_sender` yet. If any of those calls
misbehaves, we may end up with a problem. That's why it's crucial to
ensure that the function always goes through ALL of the hints.

Dangerous situations:

1. We try to drain hints before hint replay is allowed. That will
   violate the first bullet above.
2. The node we're draining is dead, but it hasn't left the cluster,
   and it still possesses some tokens.

---

How do we solve that?
---------------------
Hint replay is turned on in `main.cc`. Once enabled, it cannot be
disabled. So to address the first bullet above, it suffices to ensure
that no draining occurs beforehand. It's perfectly fine to prevent it.
Soon after hint replay is allowed, `main.cc` also asks the hint manager
to drain all of the endpoint managers whose endpoints are no longer
normal token owners (cf. `db::hints::manager::drain_left_nodes()`).

The other bullet is more tricky. It's important here to know that
draining only initiated in three situations:

1. As part of the call to `storage_service::notify_left()`.
2. As part of the call to `storage_service::notify_released()`.
3. As part of the call to `db::hints::manager::drain_left_nodes()`.

The last one is trivially non-problematic. The nodes that it'll try to
drain are no longer normal token owners, so `can_send()` must always
return `true`.

The second situation is similar. As we read in the commit message of
scylladb/scylladb@eb92f50413, which
introduced the notion of released nodes, the nodes are no longer
normal token owners:

> In this patch we postpone the hint draining for the "left" nodes to
> the time when we know that the target nodes no longer hold ownership
> of any tokens - so they're no longer referenced in topology. I'm
> calling such nodes "released".

I suggest reading the full commit message there because the problems
there are somewhat similar these changes try to solve.

Finally, the first situation: unfortunately, it's more tricky. The same
commit message says:

> When a node is being replaced, it enters a "left" state while still
> owning tokens. Before this patch, this is also the time when we start
> draining hints targeted to this node, so the hints may get sent before
> the token ownership gets migrated to another replica, and these hints
> may get lost.

This suggests that `storage_service::notify_left()` may be called when
the corresponding node still has some tokens! That's something that may
prevent properly draining hints.

Fortunately, no hope is lost. We only drain hints via `notify_left()`
when hinted handoff hasn't been upgraded to being host-ID-based yet.
If it has, draining always happens via `notify_released()`.

When I write this commit message, all of the supported versions of
Scylla 2025.1+ use host-ID-based hinted handoff. That means that
problems can only arise when upgrading from an older version of Scylla
(2024.1 downwards). Because of that, we don't cover it. It would most
likely require more extensive changes.

---

Non-issues
----------
There are notions that are closely related to sending hints. One of them
is the host filter that hinted handoff uses. It decides which endpoints
are eligible for receiving hints, and which are not. Fortunately, all
endpoints rejected by the host filter lose their hint endpoint managers
-- they're stopped as part of that procedure. What's more, draining
hints and changing the host filter cannot be happening at the same time,
so it cannot lead to any problems.

The solution
------------
To solve the described issue, we simply prevent draining hints before
hint replay is allowed. No reproducer test is attached because it's not
feasible to write one.

Fixes scylladb/scylladb#27693

Closes scylladb/scylladb#27713
2026-01-04 16:54:05 +02:00
Benny Halevy
4d46674d03 table: get_snapshot_details: use relative-path based file_stat
With the additional file_stat overload introduced in
3e9b071838, use the opened
directory for more efficient, relative-path based stat.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-04 11:05:56 +02:00
Benny Halevy
2d2177d2c9 table: get_snapshot_details: fix warning in exists_in_dir
The functor is called both on the data directory as well
as on the staging directory, so the warning printed if the
found file is not the same inode should print the given path,
not datadir / name (as was copy and pasted).

Refs #27635

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-04 11:05:56 +02:00
Benny Halevy
240b32a87a table: get_snapshot_details: fix staging dir calculation
staging is based off of datadir, not snapshot_dir.

the issue was introduced in f5ca3657e2.

Refs #27635

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-04 11:05:56 +02:00
Benny Halevy
1a08ef2062 backup: process_snapshot_dir: use relative-path based file_stat
With the additional file_stat overload introduced in
3e9b071838, use the opened
directory for more efficient, relative-path based stat.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-04 11:05:56 +02:00
Benny Halevy
8d00266f88 directory_lister: add ctor with opened directory
This ctor allows the caller to open the directory first,
on its own, and pass it down to the directory_lister.

Once all callers use this ctor we can get rid of
the delayed open in the get() method.

Also, in can be used to replace full-path based file_stat calls
on listed entries with file_stat(directory, name) calls
that are based on statat() and a relative path name that is present
in the listed directory entry.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

sq
2026-01-04 11:05:18 +02:00
Amnon Heiman
c6d1c63ddb distributed_loader: system_replicated_keys as system keyspace
This patch adds system_replicated_keys to the list of known system
keyspaces.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-01-02 16:41:47 +02:00
Amnon Heiman
83c1103917 replicated_key_provider: make KSNAME public
Move KSNAME constant from internal static to public member of
replicated_key_provider_factory class.

It will be used to identify it as a system keyspace.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-01-02 16:39:51 +02:00
Dawid Pawlik
c0b06a7fc6 docs: add vector similarity functions documentation
Add documentation in `functions.rst` as the CQL reference
for a vector similarity functions.
This includes the syntax, example usage, and prerequisites
for the parameters.
2026-01-02 13:02:59 +01:00
Dawid Pawlik
115bd51873 test/cqlpy: add similarity functions correctness tests
Add `calculate_similarity` function for testing purposes.

Add tests checking if CQL returned values match the calculated
ones with the precision up to 5th decimal place.

The tests should also be run on Cassandra to check compatibility
with their responses.
2026-01-02 13:02:59 +01:00
Dawid Pawlik
12aa33106f test/cqlpy: add similarity functions invalid call tests
Add tests checking that calling similarity functions with:
- non-vector columns
- non-vector values
- vectors with mismatching dimensions
as arguments fails.
2026-01-02 12:49:22 +01:00
Dawid Pawlik
b03d520aff cql3: introduce similarity functions syntax
The similarity function syntax is:

`similarity_<metric_name>(<vector>, <vector>)`

Where `<metric_name>` is one of `cosine`, `euclidean` and `dot_product`
matching the intended similarity metric to be used within calculations.
Where `<vector>` is either a vector column name or vector literal.

Add `vectorSimilarityArgs` symbol that is an extension of `selectionFunctionArgs`,
but allowing to use the `value` as an argument as well as the `unaliasedSelector`.
This is needed as the similarity function syntax allows both the arguments to be
a vector value, so the grammar needs to recognize the vector literal there as well.

Since we actually support `SELECT`s with constants since this patch,
return true instead of throwing an error while trying to convert the function call
to constant.
2026-01-02 12:48:43 +01:00
Dawid Pawlik
5b2b8d596a vector_similarity_fcts: introduce similarity functions
This patch introduces scalar functions `similarity_cosine()`,
`similarity_euclidean()`, and `similarity_dot_product()`
which should return a float - similarity of the given vectors
calculated according to the function's similarity metric.

The argument types of this function are retrieved with
the `retrieve_vector_arg_types`, but shall be assignable to
`vector<float, N>` where `N` is the same for both arguments.

This patch introduces a dimensionality check during the execusion
of those functions.
2026-01-02 12:48:43 +01:00
Dawid Pawlik
b72df3ae27 vector_similarity_fcts: retrieve similarity function argument types
This patch retrieves the argument types for similarity functions.
Newly introduced `retrieve_vector_arg_types` function checks if
the provided arguments are vectors of floats and if
both the vector values match the same type (dimension).
If so, we know the exact type and set it as the function arguments type.
Otherwise, if the exact type is unkown, but we can assign to vector<float, N>
then the dimensionality check will be done during execution of
the similarity function.
This also takes care of null values and bind variables the same way
as implemented in Cassandra to stay compatible.
Meaning that if we can infer the type from one argument, then the latter
may be unknown (null or ?).

Additionally this patch adds `test_assignment_any_vector` function
which tests the weak assignment to vector<float, N> as mentioned
above.
2026-01-02 12:48:43 +01:00
Dawid Pawlik
2bedefbb85 vector_similarity_fcts: add calculating similarity between vectors
This commit introduces `compute_cosine_similarity`, `compute_euclidean_similarity`,
`compute_dot_product_similarity` functions to calculate the vectors similarity
in respective metric.
The similarity is a float value meaning how similar the vectors are in a range of [0, 1].
Values closer to 1 indicate greater similarity.

The `dot_product` similarity requires L2 normalized vectors as arguments.
The similarity is calculated based on the jVector's implementation used by Cassandra.
f967f1c924/jvector-base/src/main/java/io/github/jbellis/jvector/vector/VectorSimilarityFunction.java (L36-L69)
2026-01-02 12:48:08 +01:00
Nadav Har'El
6c8ddfc018 test/alternator: fix typo in test_returnvalues.py
Different DynamoDB operations have different settings allowed for
their "ReturnValues" argument. In particular, some operations allow
ReturnValues=UPDATED_OLD but the DeleteItem operation *does not*.

We have a test, test_delete_item_returnvalues, aimed to verify this
but it had a typo and didn't actually check "UPDATED_OLD". This patch
fixes this typo.

The test still passes because the code itself (executor.cc,
delete_item_operation's constructor) has the correct check - it was
just the test that was wrong.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27918
2026-01-01 19:33:23 +02:00
Israel Fruchter
40ada3f187 Update tools/cqlsh submodule (v6.0.32)
* tools/cqlsh scylladb/scylla-cqlsh@9e5a91d7...scylladb/scylla-cqlsh@5a1d7842 (9):
  > fix wrong reference in copyutil.py
  > Add GitHub Action workflow to create releases on new tags
  > test_copyutil.py: introdcue test for ImportTask
  > fix(copyutil.py): avoid situatuions file might be move withing multiple processes
  > Fix Unix socket port display in show_host() method
  > Merge pull request #157 from scylladb/alert-autofix-1
    .github/workflows/build-push.yml: Potential fix for code scanning alert no. 1: Workflow does not contain permissions
  > .github/workflows/dockerhub-description.yml: Potential fix for code scanning alert no. 9: Workflow does not contain permissions
  > test_cqlsh_output: skip some cassandra 5.0 table options
  > tests: template compression cql to use `class` insted of `sstable_comprission`
  > Pin Cassandra version to 5.0 for reproducible builds
  > Remove scylla-enterprise integration test and update Cassandra to latest

Closes scylladb/scylladb#27924
2026-01-01 19:30:34 +02:00
Łukasz Paszkowski
76b84b71d1 storage/test_out_of_space_prevention.py: Fix async/await bugs
- Add missing await keywords for async operations on s2_log.wait_for()
  and coord_log.wait_for()
- Fix incorrect regex: "compaction .* Split {cf}" → "compaction.*Split {cf}"
- The commit https://github.com/scylladb/scylladb/commit/f7324a4 demoted
  compaction start/end log messages to debug level. Hence add
  compaction=debug log messages to the following tests:
    test_split_compaction_not_triggered
    test_node_restart_while_tablet_split
    test_repair_failure_on_split_rejection

Fixes https://github.com/scylladb/scylladb/issues/27931

Closes scylladb/scylladb#27932
2026-01-01 14:24:30 +02:00
Anna Stuchlik
624869de86 doc: remove cassandra-stress from installation instructions
The cassandra-stress tool is no longer part of the default package
and cannot be run in the way described.

This commit removes the instruction to run cassandra-stress.

Fixes https://github.com/scylladb/scylladb/issues/24994

Closes scylladb/scylladb#27726
2026-01-01 14:20:58 +02:00
Jenkins Promoter
69d6e63a58 Update pgo profiles - aarch64 2026-01-01 05:10:51 +02:00
Jenkins Promoter
d6e2d3d34c Update pgo profiles - x86_64 2026-01-01 04:27:14 +02:00
Nadav Har'El
e28df9b3d0 test: fix Python warnings in regular expressions
Like C, Python supports some escape sequences in strings such as the
familiar "\n" that converts to a newline character.
Originally, when backslash was used before a random character, for
example, "\.", Python used to just use these literal characters
backslash and dot, in the string - and not make a fuss about it.
This made it ok to use a string like "hi\.there" as a regular expression.
We have a few instances of this in our Python tests.

But recent releases of Python started to produce ugly warnings about
these cases. The error message looks like:

    SyntaxWarning: "\." is an invalid escape sequence. Such sequences
    will not work in the future. Did you mean "\\."? A raw string is
    also an option.

Indeed in most cases the easiest solution is to use a "raw string",
a string literal preceded with r. For example, r"hi\.there". In such
strings Python doesn't replace escape sequences like \n in the string,
and also leaves the \. unchanged for the regular expression to see.

So in this patch we use raw strings in all places in test/ where Python
warns have this problem.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27856
2025-12-31 20:44:01 +02:00
Yaniv Michael Kaul
597d300527 main.cc: remove warning: 'metric_help' is deprecated
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Backport: no, benign issue.

Closes scylladb/scylladb#27680
2025-12-31 18:36:55 +02:00
Avi Kivity
b690ddb9e5 tools: toolchain: dbuild: bind-mount full ~/.cache to container
In afb96b6387, we added support for sccache. As a side effect
it changed the method of invoking ccache from transparent via PATH
(if it contains /usr/lib64/ccache) to explicit, by changing the compiler
command line from 'clang++' (which may or may not resolve the the ccache
binary) to 'ccache /usr/local/bin/clang++', which always invokes ccache.

In the default dbuild configuration, PATH does not contain /usr/lib64/ccache,
so ccache isn't invoked by default. Users can change this via the
SCYLLADB_DBUILD environment variable.

As a result of ccache being suddenly enabled for dbuild builds, ccache
will now attempt to create ~/.cache/ccache. Under docker, this does
not work, because we bind-mount ~/.cache/dbuild. Docker will create the
intermediate ~/.cache, but under the root user, not $USER. The intermediate
directory being root-owned prevents ~/.cache/ccache from being created.

Under podman, this does work, because everything runs under the container's
root user.

The fix is to bind-mount the entire ~/.ccache into the container. This
not only lets ccache create the directory, it will also find an existing
~/.cache/ccache directory and use it, enabling reuse across invocations.

Since ccache will now respect configuration changes without access to
its configuration file (notably, the maximum cache size), we also
bind-mount ~/.config.

Since ~/.ccache and ~/.config are not automatically created, we create
them explicitly so the bind mounts can work. This is for new nodes enlisted
from the cloud; developer machines will have those directories preexisting.

Note that the ccache directory used to be ~/.ccache, but was later changed.
Had the author known, we would have bind-mounted ~/.cache much earlier.

Fixes #27919.

Closes scylladb/scylladb#27920
2025-12-31 14:08:41 +01:00
Asias He
3abda7d15e topology_coordinator: Ensure repair_update_compaction_ctrl is executed
Consider this:

- n1 is a coordinator and schedules tablet repair
- n1 detects tablet repair failed, so it schedules tablet transition to end_repair state
- n1 loses leadership and n2 becomes the new topology coordinator
- n2 runs end_repair on the tablet with session_id=00000000-0000-0000-0000-000000000000
- when a new tablet repair is scheduled, it hangs since the lock is already taken because it was not removed in previous step

To fix, we use the global_tablet_id to index the lock instead of the
session id.

In addition, we retry the repair_update_compaction_ctrl verb in case of
error to ensure the verb is eventually executed. The verb handler is
also updated to check if it is still in end_repair stage.

Fixes #26346

Closes scylladb/scylladb#27740
2025-12-31 13:17:18 +01:00
Benny Halevy
3e9b071838 Update seastar submodule
* seastar f0298e40...4dcd4df5 (29):
  > file: provide a default implementation for file_impl::statat
  > util: Genralize memory_data_sink
  > defer: Replace static_assert() with concept
  > treewide: drop the support of fmtlib < 9.0.0
  > test: Improve resilience of netsed scheduling fairness test
  > Merge 'file: Use query_device_alignment_info in blkdev_alignments ' from Kefu Chai
    file: Put alignment helpers in anonymous namespace
    file: Use query_device_alignment_info in blkdev_alignments
  > Merge 'file: Query physical block size and minimum I/O size' from Kefu Chai
    file: Apply physical_block_size override to filesystem files
    file: Use designated initializers in xfs_alignments
    iotune: Add physical block size detection
    disk_params: Add support for physical_block_size overrides from io_properties.yaml
    block_device: Query alignment requirements separately for memory and I/O
  > Merge 'json: formatter: fix formatting of std:string_view' from Benny Halevy
    json: formatter: fix formatting of std:string_view
    json: formatter: make sure std::string_view conforms to is_string_like

Fixes #27887

  > demos:improve the output of demo_with_io_intent() in file_demo
  > test: Add accept() vs accept_abort() socket test
  > file: Refine posix_file_impl alignments initialization
  > Add file::statat and a corresponding file_stat overload
  > cmake: don't compile memcached app for API < 9
  > Merge 'Revert to ~old lifetime semantics for lvalues passed to then()-alikes' from Travis Downs
    future: adjust lifetime for lvalue continuations
    future: fix value class operator()
  > pollable_fd: Unfriend everything
  > Merge 'file: experimental_list_directory: use buffered generator' from Benny Halevy
    file: experimental_list_directory: use buffered generator
    file: define list_directory_generator_type
  > Merge 'Make datagram API use temporary_buffer<>-s' from Pavel Emelyanov
    net: Deprecate datagram::get_data() returning packet
    memcache: Fix indentation after previous patch
    memcache: Use new datagram::get_buffers() API
    dns: Use new datagram::get_buffers() API
    tests: Use new datagram::get_buffers() API
    demo: Use new datagram::get_buffers() API
    udp: Make datagram implementations return span of temporary_buffer-s
  > Merge 'Remove callback from timer_set::complete()' from Pavel Emelyanov
    reactor: Fix indentation after previous patch
    timers: Remove enabling callback from timer_set::complete()
  > treewide: avoid 'static sstring' in favor of 'constexpr string_view'
  > resource: Hide hwloc from public interface
  > Merge 'Fix handle_exception_type for lvalues' from Travis Downs
    futures_test: compile-time tests
    function_traits: handle reference_wrapper
  > posix_data_sink_impl: Assert to guard put UB
  > treewide: fix build with `SEASTAR_SSTRING` undefined
  > avoid deprecation warnings for json_exception
  > `util/variant_utils`: correct type deduction for `seastar::visit`
  > net/dns: fixed socket concurrent access
  > treewide: add missing headers
  > Merge 'Remove posix file helper file_read_state class' from Pavel Emelyanov
    file: Remove file_read_state
    test: Add a test for posix_file_impl::do_dma_read_bulk()
  > membarrier: simplify locking

Adjust scylla to the following changes in scylla:
- file_stat became polymorphic
  - needs explicit inference in table::snapshot_exists, table::get_snapshot_details
- file::experimental_list_directory now returns list_directory_generator_type

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#27916
2025-12-30 19:37:13 +03:00
Yaniv Kaul
0264ec3c1d test: test_downgrade_after_partial_upgrade: check that feature is disabled on all nodes after partial upgrade
We should check that the test feature is disabled on all nodes after a partial
upgrade. This hardens the test a bit, although the old code wasn't that bad,
since enabled features are a part of the group 0 state shared by all nodes.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27654
2025-12-30 17:34:56 +01:00
Nadav Har'El
ffcce1ffc8 test/boost: fix flaky test node_view_update_backlog
The boost test view_schema_test.cc::node_view_update_backlog can be
flaky if the test machine has a hiccup of 100ms, and this patch fixes
it:

The test is a unit test for db::view::node_update_backlog, which is
supposed to cache the backlog calculation for a given interval. The
test asks to cache the backlog for 100ms, and then without sleeping
at all tries to fetch a value again and expect the unchanged cached
value to be returned. However, if the test run experiences a context
switch of 100ms, it can fail, and it did once as reported in #27876.

The fix is to change the interval in this test from 100ms to something
much larger, like 10 seconds. We don't sleep this amount - we just need
the second fetch to happen *before* 10 seconds has passed, so there's
no harm in using a very large interval.

However, the second half of this test wants to check that after the
interval is over, we do get a new backlog calculation. So for the
second half of this test we can and should use a shorter backlog -
e.g., 10ms. We don't care if the test machine is slow or context switched,
for this half of the test we want to to sleep *more* than 10ms, and
that's easy.

The fixed test is faster than the old one (10ms instead of 100ms) and
more reliable on a shared test machine.

Fixes #27876.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27878
2025-12-30 10:10:42 +01:00
Benny Halevy
c9eab7fbd4 test: test_refresh: add test_refresh_deletes_uploaded_sstables
The refresh api is expected to automatically delete
the sstable files from the uploads/ dir.  Verify that.

The code that does that is currently called by
sstables_loader::load_new_sstables:
```c++
        if (load_and_stream) {
...
                co_await loader.load_and_stream(ks_name, cf_name, table_id, std::move(sstables_on_shards[this_shard_id()]), primary_replica_only(primary), true /* unlink */, scope, {});
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#27586
2025-12-30 10:51:24 +03:00
Nadav Har'El
80e5860a8c docs/alternator: document that Streams needs vnodes
The current state (after PR #26836) is that Alternator tables are
created by default using tablets. But due to issue #23838, Alternator
Streams cannot be enabled on a table that uses tablets... An attempt to
enable Streams on such a table results in a clear error:

    "Streams not yet supported on a table using tablets (issue #23838).
    If you want to use streams, create a table with vnodes by setting
    the tag 'system:initial_tablets' set to 'none'."

But users should be able to learn this fact from the documentation -
not just retroactively from an error message. This is especially important
because a user might create and fill a table using tablets, and only get
this error when attempting to enable Streams on the existing table -
when it is too late to change anything.

So this patch adds a paragraph on this to compatibility.md, where
several other requirements of Alternator Streams are already mentioned.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27000
2025-12-30 10:45:34 +03:00
Avi Kivity
853f3dadda Merge 'treewide: fix some spelling errors' from Piotr Smaron
Irritated by prevailing spellchecker comments attached to every PR, I aim to fix them all.

No need to backport, just cosmetic changes.

Closes scylladb/scylladb#27897

* github.com:scylladb/scylladb:
  treewide: fix some spelling errors
  codespell: ignore `iif` and `tread`
2025-12-29 20:45:31 +02:00
Patryk Jędrzejczak
0fed9f94f8 gossiper: add_saved_endpoint: make generations of excluded nodes negative
The explanation is in the new comment in `gossiper::add_saved_endpoint`.

We add a test for this change. It's "extremely white-box", but it's better
than nothing.
2025-12-29 19:13:55 +01:00
Patryk Jędrzejczak
749b0278e5 test: introduce test_full_shutdown_during_replace 2025-12-29 19:13:55 +01:00
Patryk Jędrzejczak
4526dd93b1 utils: error_injection: allow aborting wait_for_message
The test added in the following commit utilizes it.
2025-12-29 19:13:55 +01:00
Patryk Jędrzejczak
fc4c2df2ce raft topology: preserve IP -> ID mapping of a replacing node on restart
We currently do it only for a bootstrapping node, which is a bug. The
missing IP can cause an internal error, for example, in the following
scenario:
- replace fails during streaming,
- all live nodes are shut down before the rollback of replace completes,
- all live nodes are restarted,
- live nodes start hitting internal error in all operations that
  require IP of the replacing node (like client requests or REST API
  requests coming from nodetool).

We fix the bug here, but we do it separately for replace with different
IP and replace with the same IP.

For replace with different IP, we persist the IP -> host ID mapping
in `system.peers` just like for bootstrap. That's necessary, since there
is no other way to determine IP of the replacing node on restart.

For replace with the same IP, we can't do the same. This would require
deleting the row corresponding to the node being replaced from
`system.peers`. That's fine in theory, as that node is permanently
banned, so its IP shouldn't be needed. Unfortunately, we have many
places in the code where we assume that IP of a topology member is always
present in the address map or that a topology member is always present in
the gossiper endpoint set. Examples of such places:
- nodetool operations,
- REST API endpoints,
- `db::hints::manager::store_hint`,
- `group0_voter_handler::update_nodes`.

We could fix all those places and verify that drivers work properly when
they see a node in the token metadata, but not in `system.peers`.
However, that would be too risky to backport.

We take a different approach. We recover IP of the replacing node on
restart based on the state of the topology state machine and
`system.peers` just after loading `system.peers`.

We rely on the fact that group 0 is set up at this point. The only case
where this assumption is incorrect is a restart in the Raft-based
recovery procedure. However, hitting this problem then seems improbable,
and even if it happens, we can restart the node again after ensuring
that no client and REST API requests come before replace is rolled back
on the new topology coordinator. Hence, it's not worth to complicate the
fix (by e.g. looking at the persistent topology state instead of the
in-memory state machine).
2025-12-29 19:13:53 +01:00
Patryk Jędrzejczak
4e63e74438 messaging: improve the error messages of closed_errors
The default error message of `closed_error` is "connection is closed".
It lacks the host ID and the IP address of the connected node, which
makes debugging harder. Also, it can be more specific when
`closed_error` is thrown due to the local node shutting down.

Fixes #16923

Closes scylladb/scylladb#27699
2025-12-29 18:36:07 +02:00
Avi Kivity
567c28dd0d Merge 'Decouple sstables::storage::snapshot() and ::clone() functionality' from Pavel Emelyanov
The storage::snapshot() is used in two different modes -- one to save sstable as snapshot somewhere, and another one to create a copy of sstable. The latter use-case is "optimized" by snapshotting an sstable under new generation, but it's only true for local storage. Despite for S3 storage snapshot is not implemented, _cloning_ sstable stored on S3 is not necessarily going to be the same as doing a snapshot.

Another sign of snapshot and clone being different is that calling snapshot() for snapshot itself and for clone use two very different sets of arguments -- snapshotting specifies relative name and omits new generation, while cloning doesn't need "name" and instead provides generation. Recently (#26528) cloning got extra "leave_unsealed" tag, that makes no sense for snapshotting.

Having said that, this PR introduces sstables::storage::clone() method and modifies both, callers and implementations, according to the above features of each. As a result, code logic in both methods become much simpler and a bunch of bool classes and "_tag" helper structures goes away.

Improving internal APIs, no need to backport

Closes scylladb/scylladb#27871

* github.com:scylladb/scylladb:
  sstables, storage: Drop unused bool classes and tags
  sstables/storage: Drop create_links_common() overloads
  sstable: Simplify storage::snapshot()
  sstables: Introduce storage::clone()
2025-12-29 17:50:54 +02:00
Avi Kivity
9927c6a3d4 Merge 'Reapply "audit: enable some subset of auditing by default"' from Piotr Smaron
This reverts commit a5edbc7d612df237a1dd9d46fd5cecf251ccfd13.

<h3>Why re-enabling table audit</h3>

Audit has been disabled (scylladb/scylla-enterprise/pull/3094) over many concerns raised against the table implementation, e.g. scylladb/scylla-enterprise/issues/2939 / scylladb/scylla-enterprise/issues/2759 + there's whole outstanding backlog of issues . One of the concerns was also a possible loss of availability, and since then we migrated audit keyspace from SimpleStrategy RF=1 to NetworkTopologyStrategy RF=3 (scylladb/scylla-enterprise/pull/3399) and stopped failing queries when auditing fails (scylladb/scylla-enterprise/pull/3118 & scylladb/scylla-enterprise/pull/3117), which improves the situation but doesn't address all the concerns. Eventually we want to use syslog as audit's sink, but it's not fully ready just yet, and so we'll restore table audit for now to increase the security, but later switch to syslog. BTW. cloud will enable table audit for AUTH category scylladb/sre-ops-automation/issues/2970 separately from this effort.

<h3>Performance considerations</h3>

We are assuming that the events for the enabled categories, i.e. DCL, DDL, AUTH & ADMIN, should appear at about the same, low cadence, with AUTH perhaps having the biggest impact of them all under some workloads. The performance penalty of enabling just the AUTH category [has been measured](https://scylladb.atlassian.net/wiki/spaces/RND/pages/148308005/Audit+performance+impact+test) and while authentication throughput and read/write throughput remain stable, the queries' P99 latency may decrease by a couple of % in the most hardcore scenarios.

Fixes: https://github.com/scylladb/scylladb/issues/26020

Gradually re-enabling audit feature, no need to backport.

Closes scylladb/scylladb#27262

* github.com:scylladb/scylladb:
  doc: audit: set audit as enabled by default
  Reapply "audit: enable some subset of auditing by default"
2025-12-29 16:41:04 +02:00
Tomasz Grabiec
bbf9ce18ef Merge 'load_balancer: compute node load based on tablet sizes' from Ferenc Szili
Currently, the tablet load balancer performs capacity based balancing by collecting the gross disk capacity of the nodes, and computes balance assuming that all tablet sizes are the same.

This change introduces size-based load balancing. The load balancer does not assume identical tablet sizes any more, and computes load based on actual tablet sizes.

The size-based load balancer computes the difference between the most and least loaded nodes in the balancing set (nodes in DC, or nodes in a rack in case of `rf-rack-valid-keyspaces`) and stops further balancing if this difference is bellow the config option `size_based_balance_threshold_percentage`.

This config option does not apply to the absolute load, but instead to the percentage of how much the most loaded node is more loaded than the least loaded node:

`delta = (most_loaded - least_loaded) / most_loaded`

If this delta is smaller then the config threshold, the balancer will consider the nodes balanced.

This PR is a part of a series of PRs which are based on top of each other.

- First part for tablet size collection via load_stats: #26035
- Second part reconcile load_stats: #26152
- The third part for load_sketch changes: #26153
- The fourth part which performs tablet load balancing based on tablet size: #26254
- The fifth part changes the load balancing simulator: #26438

This is a new feature, backport is not needed.

Fixes #26254

Closes scylladb/scylladb#26254

* github.com:scylladb/scylladb:
  test, load balancing: add test for table balance
  load_balancer: add cluster feature for size based balancing
  load_balancer: implement size-based load balancing
  config: add size based load balancing config params
  load_stats: use trinfo to decide how to reconcile tablet size
  load_sketch: use tablet sizes in load computation
  load_stats: add get_tablet_size_in_transition()
2025-12-29 15:01:38 +01:00
Pavel Emelyanov
d892140655 Merge 'Reduce allocations when traversing compaction_groups' from Benny Halevy
- table, storage_group: add compaction_group_count
  - And use to reserve vector capacity before adding an item per compaction_group
- table: reduce allocations by using for_each_compaction_group rather than compaction_groups()
  - compaction_groups() may allocate memory, but when called from a synchronous call site, the caller can use for_each_compaction_group instead.

* Improvement, no backport needed

Closes scylladb/scylladb#27479

* github.com:scylladb/scylladb:
  table: reduce allocations by using for_each_compaction_group rather than compaction_groups()
  replica: storage_group: rename compaction_groups to compaction_groups_immediate
2025-12-29 16:26:33 +03:00
Gleb Natapov
4a5292e815 raft topology: Notify that a node was removed only once
Raft topology goes over all nodes in a 'left' state and triggers 'remove
node' notification in case id/ip mapping is available (meaning the node
left recently), but the problem is that, since the mapping is not removed
immediately, when multiple nodes are removed in succession a notification
for the same node can be sent several times. Fix that by sending
notification only if the node still exists in the peers table. It will
be removed by the first notification and following notification will not
be sent.

Closes scylladb/scylladb#27743
2025-12-29 14:22:34 +01:00
Piotr Smaron
fb4d89f789 treewide: fix some spelling errors 2025-12-29 13:53:56 +01:00
Piotr Smaron
ba5c70d5ab codespell: ignore iif and tread
There are correct:
- iif is a boost's header name
- `tread carefully` is an actual english phrase
2025-12-29 13:53:56 +01:00
Nadav Har'El
8df9cfcde8 Merge 'Add table size bytes to describe table' from Radosław Cybulski
Add table size to DescribeTable's reply in Alternator

Fills DescribeTable's reply with missing field TableSizeBytes.

- add helper class simple_value_with_expiry, which is like std::optional
  but the value put has a timeout.
- add ignore_errors to estimate_total_sstable_volume function - if set
  to true the function will catch errors during RPC and ignore them,
  substituting 0 for missing value.
- add a reference to storage_service to executor class (needed to call
  estimate_total_sstable_volume function).
- add fill_table_description and create_table_on_shard0 as non static
  methods to executor class
- calculate TableSizeBytes value for a given table and return it as
  part of DescribeTable's return value. The value calculated is cached for
  approximately 6 hours (as per DescribeTable's specification).
  The algorithm is as follows:
  - if the requested value is in cache and is still valid it's returned,
    nothing else happens.
  - otherwise:
    - every shard of every node is requested to calculate size of its data
    - if the error happens, the error is ignored and we assume the given
      shard has a size of 0
    - all such values are summed producing total size
    - produced value is returned to a caller
    - on the node the call for a size happened every shard is requested to
      cache produced value with a 6 hour timeout.
    - if the next call comes for a differet shard on the same node that
      doesn't yet have cached value, the shard will request the value to
      be calculated again. The new value will overwrite the old one on
      every shard on this node.
    - if the next call comes to a different node, the process of
      calculation will happen from start, possibly producing different
      value. The value will have it's own timeout, there's no attempt made
      to synchronize value between nodes.
- add a alternator_describe_table_info_timeout_in_seconds parameter, which
  will control, how long DescribeTable's table information are being held
  in cache. Default is 6 hours.
- update test to use parameter
  `alternator_describe_table_info_timeout_in_seconds` - setting it to 0
  and forcing flushing memtables to disk allows checking, that table size
  has grown.

Fixes #7551

Closes scylladb/scylladb#24634

* github.com:scylladb/scylladb:
  alternator: fix invalid rebase
  Update tests
  Update documentation
  Add table size to DescribeTable's output
  Promote fill_table_description and create_table_on_shard0 to methods
  Modify estimate_total_sstable_volume to opt ignore errors
  Add alternator_describe_table_info_cache_validity_in_seconds config option
  Add ref to service::storage_service to executor
  Add simple_value_with_expiry util class
2025-12-29 14:47:36 +02:00
Benny Halevy
f60033db63 db: system_keyspace: get_group0_history: unfreeze_gently
Prevent stall when the group0 history is too long using unfreeze_gently
rather than the synchronous unfreeze() function

Fixes #27872

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#27873
2025-12-29 12:00:54 +02:00
copilot-swe-agent[bot]
d25d295e84 alternator/server: update SSL comment 2025-12-29 09:34:08 +01:00
Radosław Cybulski
df20f178aa alternator: fix invalid rebase
Fix an invalid rebase, that would properly merge code coming
from master, except that code would ignore refactor done in the patch.
2025-12-29 08:33:10 +01:00
Radosław Cybulski
a31c8762ca Update tests 2025-12-29 08:33:09 +01:00
Radosław Cybulski
5e1254eef0 Update documentation 2025-12-29 08:33:08 +01:00
Radosław Cybulski
a86b782d3f Add table size to DescribeTable's output
Add a table size to DescribeTable's output.
2025-12-29 08:33:07 +01:00
Radosław Cybulski
1bd855a650 Promote fill_table_description and create_table_on_shard0 to methods
Promote `executor::fill_table_description` and
`executor::create_table_on_shard0` to methods (from static functions).
2025-12-29 08:33:06 +01:00
Radosław Cybulski
6a26381f4f Modify estimate_total_sstable_volume to opt ignore errors
Modify `storage_service::estimate_total_sstable_volume` function to
optionally ignore errors (instead substitute 0), when `ignore_errors`
parameter is set to `yes`.
2025-12-29 08:33:06 +01:00
Radosław Cybulski
a532fc73bc Add alternator_describe_table_info_cache_validity_in_seconds config option
Add a `alternator_describe_table_info_cache_validity_in_seconds`
configuration option with default value of 6 hours.
2025-12-29 08:33:05 +01:00
Radosław Cybulski
e246abec4d Add ref to service::storage_service to executor
Add a reference to `service::storage_service` to executor object.
2025-12-29 08:33:03 +01:00
Radosław Cybulski
dfa600fb8f Add simple_value_with_expiry util class
Add a `simple_value_with_expiry` utility class, which functions like
a `std::optional` with added timeout. When emplacing a value, user
needs to provide timeout, after which value expires (in which case
the `simple_value_with_expiry` object behaves as if was never set
at all).
Add boost tests for the new class.
2025-12-29 08:32:52 +01:00
Pavel Emelyanov
2e33234e91 util: Remove lister::rmdir()
There's seastar helper that does the same, no need to carry yet another
implementation

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27851
2025-12-28 19:46:19 +02:00
Avi Kivity
63e3a22f2e Merge 'group0_state_machine: don't update in-memory state machine until start' from Piotr Dulikowski
Group0 commands consist of one or more mutations and are supposed to be
atomic - i.e. the data structures that reflect the group0 tables state
are not supposed to be updated while only some mutations of a command
are applied, the logic responsible for that is not supposed to observe
an inconsistent state of group0 tables.

It turns out that this assumption can be broken if a node crashes in the
middle of applying a multi-mutation group0 command. Because these
mutations are, in general, applied separately, only some mutations might
survive a crash and a restart, so the group0 tables might be in an
inconsistent state. The current logic of group0_state_machine will
attempt to read the group0 tables' state as it was left after restart,
so it may observe inconsistent state.

This can confuse the node as it may observe a state that it was not
supposed to observe, or the state will just outright break some
invariants and trigger some sanity checks. One of those was observed in
https://github.com/scylladb/scylladb/issues/26945, where a command from the CDC generation
publisher fiber was partially applied. The fiber, in addition to
publishing generations, it removes old, expired generations as well.
Removal is done by removing data that describes the generation from
cdc_generations_v3 and by removing the generation's ID from the
committed generation list in the topology table. If only the first
mutation gets through but not the other one, on reload the node will see
a committed CDC generation without data, which will trigger an
on_internal_error check.

Fix this by delaying the moment when the in memory data structures are
first loaded. In 579dcf187a, a mechanism was introduced which persists the
commit index before applying commands that are considered committed.
Starting a raft server waits until commands are replayed up to that
point. The fix is to start the group0_state_machine in a mode which only
applies mutations - the aforementioned mechanism will re-apply the
commands which will, thanks to the mutation idempotency, bring the
group0 to a consistent state. After the group0 is known to be in
consistent state (so, after raft::server_impl::start) the in-memory data
structures of group0 are loaded for the first time.

There is an exception, however: schema tables. Information about schema
is actually loaded into memory earlier than the moment when group0 is
started. Applying changes to schema is done through the migration
manager module which compares the persisted state before and after the
schema mutations are applied and acts on that. Refactoring migration
manager is out of scope of this PR. However, this is not a problem
because the migration manager takes care to apply all of the mutations
given in a command in a single commitlog segment, so the initial schema
loading code should not see an inconsistent state due to the state being
partially applied.

The fix is accompanied by a reproducer of scylladb/scylladb#26945.

Fixes: scylladb/scylladb#26945

This is not a regression, so no need to backport.

Closes scylladb/scylladb#27528

* github.com:scylladb/scylladb:
  test: cluster: test for recovery after partial group0 command
  group0_state_machine: remove obsolete comment about group0 consistency
  group0_state_machine: don't update in-memory state machine until start
  group0_state_machine: move reloading out of std::visit
  service: raft: add state machine ref to raft_server_for_group
2025-12-28 13:59:26 +02:00
Pavel Emelyanov
e963a8d603 checked-file: Implement experimental_list_directory()
The method in question returns coroutine generator that co_yields
directory_entry-s. In case the method is not implemented, seastar
creates a fallback generator, that calls existing subscription-based
list_directory() and co_yields them. And since checked file doesn't yet
have it, fallback generator is used, thus skipping the lower file
yielding lister. Not nice.

This patch implements the generator lister for checked file, thus making
full use of lower file generator lister too.

A side note. It's not enough to implement it like

    return do_io_check([] {
        return lower_file->experimental_list_directory();
    });

like list_directory() does, since io-checking will _not_ happen on
directory reading itself, as it's supposed to.

This is the problem of the check_file::list_directory() implementation
-- it only checks for exception when creating the subscription (and it
really never happens), but reading the directory itself happens without
io checks.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27850
2025-12-28 13:37:44 +02:00
Yaron Kaikov
1ee89c9682 Revert "scripts: benign fixes flagged by CodeQL/PyLens"
This reverts commit 377c3ac072.

This breaks all artifact tests and cloud image build process

Closes scylladb/scylladb#27881
2025-12-28 09:49:49 +02:00
Ferenc Szili
6d3c720a08 test, load balancing: add test for table balance
This change adds a boost test which validates the resulting table
balance of size based load balancing. The threshold was set to a
conservative 1.5 overcommit to avoid flakyness.
2025-12-27 11:39:08 +01:00
Ferenc Szili
b7ebd73e53 load_balancer: add cluster feature for size based balancing
This patch adds a cluster feature size_based_load_balancing which, until
enabled, will force capacity based balancing. This is needed because
during rolling upgrades some of the nodes will have incomplete data in
load_stats (missing tablet sizes and effective_capacity) which are
needed for size based balancing to make good decisions and issue correct
migrations.
2025-12-27 11:39:08 +01:00
Ferenc Szili
10eb364821 load_balancer: implement size-based load balancing
This changes introduces tablet size based load balancing. It is an
extension of capacity based balancing with the addition of actual tablet
sizes.

It computes the difference between the most and least loaded nodes in
the DC and stops further balancing if this difference is bellow the
config option size_based_balance_threshold_percentage.

This config option does not apply to the absolute load, but instead to
the percentage of how much the most loaded node is more loaded than the
least loaded node:

delta = (most_loaded - least_loaded) / most_loaded

If this delta is smaller then the config threshold, the balancer will
consider the nodes balanced.
2025-12-27 11:20:20 +01:00
Ferenc Szili
cc9e125f12 config: add size based load balancing config params
This change adds:

- The config paremeter force_capacity_based_balancing which, when
  enabled performs capacity based balancing instead of size based.
- The config parameter size_based_balance_threshold_percentage which
  sets the balance threshold for the size based load balancer.
- The config parameter minimal_tablet_size_for_balancing which sets the
  minimal tablet size for the load balancer.
2025-12-27 10:37:38 +01:00
Ferenc Szili
0c9b93905e load_stats: use trinfo to decide how to reconcile tablet size
This patch corrects the way update_load_stats_on_end_migration() decides
which tablet transition occured, in order to reconcile tablet sizes in
load_stats. Before, the transition kind was inferred from the value of
leaving and pending replicas. This patch changes this to use the value
of trinfo.transition.

In case of a rebuild, and in case there is only one replica, the new
tablet size will be set to 0.
2025-12-27 10:37:38 +01:00
Ferenc Szili
621cb19045 load_sketch: use tablet sizes in load computation
This commit changes load_sketch so that it computes node and shard load
based on tablet sizes instead of tablet count.
2025-12-27 10:37:23 +01:00
Ferenc Szili
1c9ec9a76d load_stats: add get_tablet_size_in_transition()
This patch adds a method to load_stats which searches for the tablet
size during tablet transition. In case of tablet migration, the tablet
will be searched on the leaving replica, and during rebuild we will
return the average tablet size of the pending replicas.
2025-12-27 10:37:23 +01:00
Pavel Emelyanov
bda1709734 Merge 'test: fix infinite loop in python log browsing code triggered from test_orphaned_sstables_on_startup' from Avi Kivity
Recently, test/cluster/test_tablet.py::test_orphaned_sstables_on_startup started
spinning in the log browsing code, part of a the test library that looks into log files
for expected or unexpected patterns. This reproduced somewhat in continuous
integration, and very reliably for me locally.

The test was introduced in fa10b0b390, a year ago.

There are two bugs involved: first, that we're looking for crashes in this test,
since in fact it is expected to crash. The node expectedly fails with an
on_internal_error. Second, the log browsing code contains an infinite loop
if the crash backtrace happens to be the last thing in the log. The series
fixes both bugs.

Fixes #27860.

While the bad code exists in release branches, it doesn't trigger there so far, so best
to only backport it if it starts manifesting there.

Closes scylladb/scylladb#27879

* github.com:scylladb/scylladb:
  test: pylib: log_browsing: fix infinite loop in find_backtraces()
  test: pylib/log_browsing, cluster/test_tablets: don't look for expected crashes
2025-12-26 10:45:56 +03:00
Pavel Emelyanov
712cc8b8f1 sstables, storage: Drop unused bool classes and tags
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-26 09:47:27 +03:00
Pavel Emelyanov
9e189da23a sstables/storage: Drop create_links_common() overloads
There's a bunch of tagged create_links_common() overloads that call the
most generic one with properly crafted arguments and the link_mode.
Callers of those one-liners can craft the args themselves.

As a result, there's only one create_links_common() overload and callers
explicitly specifying what they want from it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-26 09:47:27 +03:00
Pavel Emelyanov
32cf358f44 sstable: Simplify storage::snapshot()
Now there are only two callers left -- sstable::snapshot() and
sstable::seal() that wants to auto-backup the sealed sstable.

The snapshot arguments are:
- relative path, use _base_dir
- no new generation provided
- no leave-unsealed tag

With that, the implementation of filesystem_storage::snapshot() is as
simple as
- prepare full path relative to _base_dir
- touch new directory
- call create_links_common()

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-26 09:47:27 +03:00
Pavel Emelyanov
8e496a2f2f sstables: Introduce storage::clone()
And call it from sstable::clone() instead of storage::snapshot().

The snapshot arguements are:
- target directory is storage::prefix(), that's _dir itself
- new generation is always provided, no need for optional
- leave_unsealed bool flag

With that, the implementation of filesystem_storage::clone() is as
simple as call create_links_common() forwarding args and _dir to it. The
unification of leave_unsealed branches will come a bit later making this
code even shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-26 09:47:27 +03:00
Nadav Har'El
9c50d29a00 test/boost: fix flaky test_inject_future_disabled
The test boost/error_injection_test.cc::test_inject_future_disabled
checks what happens when a sleep injection is *disabled*: The test
has a 10-millisecond-sleep injection and measures how much it takes.
The test expects it to take less than 10 milliseconds - in fact it
should take almost zero. But this is not guaranteed - on a slow debug
build and an overcommitted server this do-nothing injection can take
some time, and in one run (#27798) it took 14 milliseconds - and the
test failed.

The solution is easy - make the sleep-that-doesn't-happen much longer -
e.g., 10 whole seconds. Since this sleep still doesn't happen, we
expect the injection to return in less - much less - than 10 seconds.
This 10 seconds is so ridiculously high we don't expect the do-nothing
injection to take 10 seconds, not even a ridiculously busy test machine.

Fixes #27798

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27874
2025-12-25 20:46:31 +02:00
Avi Kivity
92996ce9fa test: pylib: log_browsing: fix infinite loop in find_backtraces()
The find_backtraces() function uses a very convoluted loop to
read the log file. The loop fails to terminate if the last thing
in the log file is the backtrace, since the loop termination condition
(`not line`) continues to be true.

It's not clear why this did not reliably hit before, but it now
reliably reproduces for me on both x86 and aarch64. Perhaps timing
changed, or perhaps previously we had more text on the log.
2025-12-25 20:22:17 +02:00
Avi Kivity
50a3460441 test: pylib/log_browsing, cluster/test_tablets: don't look for expected crashes
test_tablets.test_orphaned_sstables_on_startup verifies that an
on_internal_error("Unable to load SSTable...") is generated when
an sstable outside a tablet boundary is found on startup.

The test indeed finds the error, but then proceeds to hang in
find_backtraces(), or fail if find_backtraces() is fixed, since
it finds an unexpected (for it) crash.

Fix this by not looking for crashes if a new option expected_crash
is set. Set it for this test.
2025-12-25 20:22:17 +02:00
Avi Kivity
55c7bc746e Revert "vector_search_validator: move high availability tests from vector-store.git"
This reverts commit caa0cbe328. It is
either extremely slow or broken. I was never able to get it to
run on an r8gd.8xlarge (on the NVMe disk). Even when it passes,
it is very slow.

Test script:

```

git submodule update --recursive || exit 125

rm -rf build

d() { ./tools/toolchain/dbuild -it -- "$@"; }

d ./configure.py --mode release || exit 125
d ninja release-build || exit 125
d ./test.py --mode release
```

Ref #27858
Ref #27859
Ref #27860
2025-12-25 12:30:22 +00:00
Botond Dénes
ebb101f8ae scylla-gdb.py: scylla small-objects: make freelist traversal more robust
Traversing the span's freelist is known to generate "Cannot access
memory at address ..." errors, which is especially annoying when it
results in failed CI. Make this loop more robust: catch gdb.error coming
from it and just log a warning that some listed objects in the span may
be free ones.

Fixes: #27681

Closes scylladb/scylladb#27805
2025-12-25 13:26:09 +03:00
Alex
f769e52877 test: boost: Fix flaky test_large_file_upload_s3 by creating induvidual files for testing During CI runs, multiple instances of the same test may execute concurrently. Although the test uses a temporary directory, the downloaded bucket artifacts were written using an identical filename across all instances.
This caused concurrent writers to operate on the same file, leading to file corruption. In some cases, this manifested as test failures and intermittent std::bad_alloc exceptions.

Change Description

This change ensures that each test instance uses a unique filename for downloaded bucket files.
By isolating file writes per test execution, concurrent runs no longer interfere with each other.

Fixes: #27824

backport not required

Closes scylladb/scylladb#27843
2025-12-25 09:40:13 +02:00
Benny Halevy
51433b838a table: reduce allocations by using for_each_compaction_group rather than compaction_groups()
compaction_groups_immediate() may allocate memory, but when called from a
synchronous call site, the caller can use for_each_compaction_group
instead to iterate over the compaction groups with no extra allocations.

Calling compaction_groups_immediate() is still required from an async
context when we want to "sample" the compaction groups
so we can safely iterate over them and yield in the inner loop.

Also, some performance insensitive call sites using
compaction_groups_immediate had been left as they are
to keep them simple.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-24 21:19:28 +02:00
Benny Halevy
0e27ee67d2 replica: storage_group: rename compaction_groups to compaction_groups_immediate
To better reflect that it returns a materialized vector
of compaction_group ptrs.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-24 21:19:26 +02:00
Nadav Har'El
186c91233b Merge 'scylla-gdb.py: improve scylla fiber and scylla read-stats' from Botond Dénes
Improve scylla fiber's ability to traverse through coroutines.
Add --direction command-line parameter to scylla-fiber.
Fix out-of-date premit collection in scylla read-stat and improve the printout.

scylla-gdb.py improvements, no backport needed

Closes scylladb/scylladb#27766

* github.com:scylladb/scylladb:
  scylla-gdb.py: scylla read-stats: include all permit lists
  scylla-gdb.py: scylla fiber: add --direction command-line param
  scylla-gdb.py: scylla fiber: add support for traversing through coroutines backward
2025-12-24 17:49:58 +02:00
Botond Dénes
27bf65e77a db/batchlog_manager: add missing <seastar/coroutine/parallel_for_each.hh> include
Build only fails if `--disable-precompiled-header` is passed to
`configure.py`. Not sure why.

Closes scylladb/scylladb#27721
2025-12-24 16:32:12 +02:00
Botond Dénes
c66275e05c cql3/statements/batch_statement: make size error message more verbose
Mention the type of batch: Logged or Unlogged. The size (warn/fail on
too large size) error has different significance depending on the type.

Refs: #27605

Closes scylladb/scylladb#27664
2025-12-24 15:27:01 +02:00
Piotr Szymaniak
9c5b4e74c3 doc: Correct reference in dev/audit.md
Closes scylladb/scylladb#27832
2025-12-24 15:25:15 +02:00
Botond Dénes
ccc03d0026 test/pylib/runner.py: pytest_configure(): coerce repeat to int
Coerce the return value of config.getoption("--repeat") to int to avoid:

    Traceback (most recent call last):
      File "/usr/bin/pytest", line 8, in <module>
        sys.exit(console_main())
                 ~~~~~~~~~~~~^^
      File "/usr/lib/python3.14/site-packages/_pytest/config/__init__.py", line 201, in console_main
        code = main()
      File "/usr/lib/python3.14/site-packages/_pytest/config/__init__.py", line 175, in main
        ret: ExitCode | int = config.hook.pytest_cmdline_main(config=config)
                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
      File "/usr/lib/python3.14/site-packages/pluggy/_hooks.py", line 512, in __call__
        return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
               ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.14/site-packages/pluggy/_manager.py", line 120, in _hookexec
        return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
               ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.14/site-packages/pluggy/_callers.py", line 167, in _multicall
        raise exception
      File "/usr/lib/python3.14/site-packages/pluggy/_callers.py", line 121, in _multicall
        res = hook_impl.function(*args)
      File "/usr/lib/python3.14/site-packages/_pytest/helpconfig.py", line 154, in pytest_cmdline_main
        config._do_configure()
        ~~~~~~~~~~~~~~~~~~~~^^
      File "/usr/lib/python3.14/site-packages/_pytest/config/__init__.py", line 1118, in _do_configure
        self.hook.pytest_configure.call_historic(kwargs=dict(config=self))
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.14/site-packages/pluggy/_hooks.py", line 534, in call_historic
        res = self._hookexec(self.name, self._hookimpls.copy(), kwargs, False)
      File "/usr/lib/python3.14/site-packages/pluggy/_manager.py", line 120, in _hookexec
          return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
               ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.14/site-packages/pluggy/_callers.py", line 167, in _multicall
        raise exception
      File "/usr/lib/python3.14/site-packages/pluggy/_callers.py", line 121, in _multicall
        res = hook_impl.function(*args)
      File "/home/bdenes/ScyllaDB/scylladb/scylladb/test/pylib/runner.py", line 206, in pytest_configure
        config.run_ids = tuple(range(1, config.getoption("--repeat") + 1))
                                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
    TypeError: can only concatenate str (not "int") to str

Closes scylladb/scylladb#27649
2025-12-24 15:13:02 +02:00
Nadav Har'El
8df5189f9c Merge 'docs: scylla-sstable.rst: extract script API to separate document' from Botond Dénes
The script API is 500+ lines long in an already too long and hard to navigate document. Extract it to a separate document, making both documents shorter and easier to navigate.

Documentation refactoring, no backport needed.

Closes scylladb/scylladb#27609

* github.com:scylladb/scylladb:
  docs: scylla-sstable-script-api.rst: add introduction and title
  docs: scylla-sstable.rst: extract script API to separate document
  docs: scylla-sstable: prepare for script API extract
2025-12-24 15:02:57 +02:00
Botond Dénes
b036a461b7 tools/scylla-sstable: dump-schema: incude UDT description in dump
If the table uses UDTs, include the description of these (CREATE TYPE
statement) in the schema dump. Without these the schema is not useful.

Closes scylladb/scylladb#27559
2025-12-24 14:46:52 +02:00
Botond Dénes
3071ccd54a Merge 'Storage-agnostic table::snapshot_on_all_shards()' from Pavel Emelyanov
The method in question knows that it writes snapshot to local filesystem and uses this actively. This PR relaxes this knowledge and splits the logic into two parts -- one that orchestrates sstables snapshot and collects the necessary metadata, and the code that writes the metadata itself.

Closes scylladb/scylladb#27762

* github.com:scylladb/scylladb:
  table: Move snapshot_file_set to table.cc
  table: Rename and move snapshot_on_all_shards() method
  table: Ditch jsondir variable
  table, sstables: Pass snapshot name to sstable::snapshot()
  table: Use snapshot_writer in write_manifest()
  table: Use snapshot_writer in write_schema_as_cql()
  table: Add snapshot_writer::sync()
  table: Add snapshot_writer::init()
  table: Introduce snapshot_writer
  table: Move final sync and rename seal_snapshot()
  table: Hide write_schema_as_cql()
  table: Hide table::seal_snapshot()
  table: Open-code finalize_snapshot()
  table: Fix indentation after previuous patch
  table: Use smp::invoke_on_all() to populate the vector with filenames
  table: Don't touch dir once more on seal_snapshot()
  table: Open-code table::take_snapshot() into caller lambda
  table: Move parts of table::take_snapshot to sstables_manager
  table: Introduce table::take_snapshot()
  table: Store the result of smp::submit_to in local variable
2025-12-24 13:46:47 +02:00
Nadav Har'El
4ae45eb367 test/alternator: remove unused imports
Remove many unused "import" statements or parts of import statement.
All of them were detected by Copilot, but I verified each one manually
and prepared this patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27676
2025-12-24 13:44:28 +02:00
Nadav Har'El
da00401b7d test/alternator: rename test with duplicate name
The file test/alternator/test_transact.py accidentally had two tests
with the same name, test_transact_get_items_projection_expression.
This means the first of the two tests was ignored and never run.

This patch renames the second of the two to a more appropriate
(and unique...) name.

I verified that after this change the number of tests in this file
grows by one, and that still all tests pass on DynamoDB and fail
(as expected by xfail) on Alternator.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27702
2025-12-24 13:43:43 +02:00
Botond Dénes
95d4c73eb1 Merge 'Make object storage config truly updateable' from Pavel Emelyanov
The db::config::object_storage_endpoints parameter is live-updateable, but when the update really happens, the new endpoints may fail to propagate to non-zero shards because of the way db::config sharding is implemented.

Refs: #7316
Fixes: #26509

Backport to 2025.3 and 2025.4, AFAIK there are set ups with object storage configs for native backup

Closes scylladb/scylladb#27689

* github.com:scylladb/scylladb:
  sstables/storage_manager: Fix configured endpoints observer
  test/object_store: Add test to validate how endpoint config update works
2025-12-24 13:42:44 +02:00
Botond Dénes
12dcf79c60 Merge 'build: support (and prefer) sccache as the compiler cache' from Avi Kivity
Currently, we support ccache as the compiler cache. Since it is transparent, nothing
much is needed to support it.

This series adds support to sccache[1] and prefers it over ccache when it is installed.

sccache brings the following benefits over ccache:
1. Integrated distributed build support similar to distcc, but with automatic toolchain packaging and a scheduler
2. Rust support
3. C++20 modules (upcoming[2])

It is the C++20 modules support that motivates the series. C++20 modules have the potential to reduce
build times, but without a compiler cache and distributed build support, they come with too large
a penalty. This removes the penalty.

The series detects that sccache is installed, selects it if so (and if not overridden
by a new option), enables it for C++ and Rust, and disables ccache transparent
caching if sccache is selected.

Note: this series doesn't add sccache to the frozen toolchain or add dbuild support. That
is left for later.

[1] https://github.com/mozilla/sccache
[2] https://github.com/mozilla/sccache/pull/2516

Toolchain improvement, won't be backported.

Closes scylladb/scylladb#27834

* github.com:scylladb/scylladb:
  build: apply sccache to rust builds too
  build: prevent double caching by compiler cache
  build: allow selecting compiler cache, including sccache
2025-12-24 13:40:02 +02:00
Nadav Har'El
74a57d2872 test/cqlpy: remove unused imports
Remove many unused "import" statements or parts of import statement.
All of them were detected by Copilot, but I verified each one manually
and prepared this patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27675
2025-12-24 13:31:41 +02:00
Andrzej Jackowski
632ff66897 doc: audit: mention double audit sink in Enabling Audit section
Configuration of both table and syslog audit is possible since
scylladb/scylladb#26613 was implemented. However, the "Enabling Audit"
section of the documentation wasn't updated, which can be misleading.

Ref: scylladb/scylladb#26613

Closes scylladb/scylladb#27790
2025-12-24 13:20:03 +02:00
Gleb Natapov
04976875cc topology coordinator: set session id for streaming at the correct time
Commit d3efb3ab6f added streaming session for rebuild, but it set
the session and request submission time. The session should be set when
request starts the execution, so this patch moved it to the correct
place.

Closes scylladb/scylladb#27757
2025-12-24 13:17:53 +02:00
Yaniv Michael Kaul
377c3ac072 scripts: benign fixes flagged by CodeQL/PyLens
Unused imports, unused variables and such.
No functional changes, just to get rid of some standard CodeQL warnings.

Benign - no need to backport.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#27801
2025-12-24 13:08:24 +02:00
Avi Kivity
d6edad4117 test: pylib: resource_gather: don't take ownership of /sys/fs/cgroup under podman
Under podman, we already own /sys/fs/cgroup. Run the chown command only
under docker where the container does not map the host user to the
container root user.

The chown process is sometimes observed to fail with EPERM (see issue).
But it's not needed, so avoid it.

Fixes #27837.

Closes scylladb/scylladb#27842
2025-12-24 10:56:24 +02:00
Marcin Maliszkiewicz
3c1e1f867d raft: auth: add semaphore to auth_cache::load_all
Auth cache loading at startup is racing between
auth service and raft code and it doesn't support
concurrency causing it to crash.

We can't easily remove any of the places as during
raft recovery snapshot is not loaded and it relies
on loading cache via auth service. Therefore we add
semaphore.

Fixes https://github.com/scylladb/scylladb/issues/27540

Closes scylladb/scylladb#27573
2025-12-24 10:56:24 +02:00
Nadav Har'El
f3a4af199f test/cqlpy/test_materialized_view.py: Fix for Commented-out code
This patch was suggested and prepared by copilot, I am writing the commit
message because the original one was worthless.

In commit cf138da, for an an unexplained reason, a loop waiting until the
expected value appears in a materialized view was replaced by a call for
wait_for_view_built(). The old loop code was left behind in a comment,
and this commented-out code is now bothering our AI. So let's delete the
commented-out code.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27646
2025-12-24 10:56:23 +02:00
Botond Dénes
1bb897c7ca Merge 'Unify configuration of object storage endpoints' from Pavel Emelyanov
To configure S3 storage, one needs to do

```
object_storage_endpoints:
  - name: s3.us-east-1.amazonaws.com
    port: 443
    https: true
    aws_region: us-east-1
```

and for GCS it's

```
object_storage_endpoints:
  - name: https://storage.googleapis.com:433
    type: gs
    credentials_file: <gcp account credentials json file>
```

This PR updates the S3 part to look like

```
object_storage_endpoints:
  - name: https://s3.us-east-1.amazonaws.com:443
    aws_region: us-east-1
```

fixes: #26570

Not-yet released feature, no need to backport. Old configs are not accepted any longer. If it's needed, then this decision needs to be revised.

Closes scylladb/scylladb#27360

* github.com:scylladb/scylladb:
  object_storage: Temporarily handle pure endpoint addresses as endpoints
  code: Remove dangling mentions of s3::endpoint_config
  docs: Update docs according to new endpoints config option format
  object_storage: Create s3 client with "extended" endpoint name
  test: Add named constants for test_get_object_store_endpoints endpoint names
  s3/storage: Tune config updating
  sstable: Shuffle args for s3_client_wrapper
2025-12-24 06:59:02 +02:00
Botond Dénes
954f2cbd2f Merge 'config, transport: add listeners for native protocol fronted by proxy protocol v2' from Avi Kivity
For deployments fronted by a reverse proxy (haproxy or privatelink), we want to
use proxy protocol v2 so that client information in system.clients is correct and so
that the shard-aware selection protocol, which depends on the source port, works
correctly. Add proxy-protocol enabled variants of each of the existing native transport
listeners.

Tests are added to verify this works. I also manually tested with haproxy.

New feature, no backport.

Closes scylladb/scylladb#27522

* github.com:scylladb/scylladb:
  test: add proxy protocol tests
  config, transport: support proxy protocol v2 enhanced connections
2025-12-24 06:58:00 +02:00
Nadav Har'El
e75c75f8cd test/cqlpy: fix two tests that couldn't fail because of typo
As noticed by copilot, two tests in test_guardrail_compact_storage.py
could never fail, because they used `pytest.fail` instead of the
correct `pytest.fail()` to fail. Unfortunately, Python has a footgun
where if it sees a bare function name without parenthesis, instead of
complaining it evaluates the function object and then ignores it,
and absolutely nothing happens.

So let's add the missing `()`. The test still passes, but now it at
least has a chance of failing if we have a regression.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27658
2025-12-24 06:49:54 +02:00
Yaron Kaikov
d671ca9f53 fix: remove return from finally block in s3_proxy.py
during any jenkins job that trigger `test.py` we get:
```
/jenkins/workspace/releng-testing/byo/byo_build_tests_dtest/scylla/test/pylib/s3_proxy.py:152: SyntaxWarning: 'return' in a 'finally' block
```

The 'return' statement in the finally block was causing a SyntaxWarning.
Moving the return outside the finally block ensures proper exception
handling while maintaining the intended behavior.

Closes scylladb/scylladb#27823
2025-12-24 06:48:03 +02:00
Avi Kivity
fc81983d42 test: sstable_validation_test: actually test ms version
sstable_validation_test tests the `scylla sstable validate` command
by passing it intentionally corrupted sstables. It uses an sstable
cache to avoid re-creating the same sstables. However, the cache
does not consider the sstable version, so if called twice with the
same inputs for different versions, it will return an sstable with
the original version for both calls. As a results, `ms` sstables
were not tested. Fix this bug by adding the sstable version (and
the schema for good measure) to the cache key.

An additional bug, hidden by the first, was that we corrupted the
sstable by overwriting its Index.db component. But `ms` sstables
don't have an Index.db component, they have a Partitions.db component.
Adjust the corrupting code to take that into account.

With these two fixes, test_scylla_sstable_validate_mismatching_partition_large
fails on `ms` sstables. Disable it for that version. Since it was
previously practically untested, we're not losing any coverage.

Fixing this test unblocks further work on making pytest take charge
of running the tests. pytest exposed this problem, likely by running
it on different runners (and thus reducing the effectiveness of the
cache).

Fixes #27822.

Closes scylladb/scylladb#27825
2025-12-24 06:47:31 +02:00
Botond Dénes
cf70250a5c Update seastar submodule
* seastar 7ec14e83...f0298e40 (8):
  > Merge 'coroutine/try_future: call set_current_task() when resuming the coroutine' from Botond Dénes
    coroutine/try_future: call set_current_task() when resuming the coroutine
    core: move set_current_task() out-of-line
  > stop_signal: stop including reactor.hh
  > cmake: Mark hwloc headers as system includes to suppress warnings
  > build: explicitly enable vptr sanitizer
  > httpd: Add API to set tcp keepalive params
  > Merge 'Make datagram_channel::send() use temporary_buffer-s' from Pavel Emelyanov
    net: Remove no longer used to_iovec() helpers
    net,code: Update callers to use new datagram_channel::send()
    net: Introduce datagram_channel::send(span<temporary_buffer>) method
    posix-stack: Make UDP socket implementation use wrapped_iovec
    posix-stack: Introduce wrapped_iovec
  > code: Move pollable_fd_state::write_all(const char*) from API level 9
  > thread: Remove unused sched_group() helper

configure.py: added -lubsan to DEBUG sanitizer flags

Closes scylladb/scylladb#27511
2025-12-24 06:46:36 +02:00
Nadav Har'El
54f3e69fdc Fix for Statement has no effect
This problem and its fix was suggested by copilot, I'm just writing the
cover letter.
test/nodetool/test_status.py has the silly statement tokens == "?" which
has no effect. Looking around the code suggested to me (and also to
Copilot, nice) that the correct intent was assert tokens == "?" and not,
say, tokens = "?".

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27659
2025-12-24 06:43:26 +02:00
Piotr Dulikowski
9ed820cbf5 test: cluster: test for recovery after partial group0 command
Add a reproducer for scylladb/scylladb#26945. By using error injections,
the test triggers a situation where a command that removes an obsolete
CDC generation is partially applied, then the node is killed an brought
back. Thanks to the fix, restarting the node succeeds and does not
trigger any consistency checks in the group0 reload logic.
2025-12-23 20:50:43 +01:00
Piotr Dulikowski
71bc1886ee group0_state_machine: remove obsolete comment about group0 consistency
The comment is outdated. It is concerned about group0 consistency after
crash, and that re-applying committed commands may require a raft
quorum. First, 579dcf1 was introduced (long ago) which gets rid of the
need for quorum as the node persists the commit index before applying
the commands - so it knows up to which command it should re-apply on
restart. Second, the preceding commits in this PR makes use of this
mechanism for group0.

Remove the comment as the concern was fully addressed. Additionally,
remove a mention of the comment in raft_group0_client.cc - although it
claims that the comment is placed in `group0_state_machine::apply`, it
has been moved to `merge_and_apply` in 96c6e0d (both comments were
originally introduced in 6a00e79).
2025-12-23 20:44:17 +01:00
Piotr Dulikowski
b24001b5e7 group0_state_machine: don't update in-memory state machine until start
Group0 commands consist of one or more mutations and are supposed to be
atomic - i.e. the data structures that reflect the group0 tables state
are not supposed to be updated while only some mutations of a command
are applied, the logic responsible for that is not supposed to observe
an inconsistent state of group0 tables.

It turns out that this assumption can be broken if a node crashes in the
middle of applying a multi-mutation group0 command. Because these
mutations are, in general, applied separately, only some mutations might
survive a crash and a restart, so the group0 tables might be in an
inconsistent state. The current logic of group0_state_machine will
attempt to read the group0 tables' state as it was left after restart,
so it may observe inconsistent state.

This can confuse the node as it may observe a state that it was not
supposed to observe, or the state will just outright break some
invariants and trigger some sanity checks. One of those was observed in
scylladb/scylladb#26945, where a command from the CDC generation
publisher fiber was partially applied. The fiber, in addition to
publishing generations, it removes old, expired generations as well.
Removal is done by removing data that describes the generation from
cdc_generations_v3 and by removing the generation's ID from the
committed generation list in the topology table. If only the first
mutation gets through but not the other one, on reload the node will see
a committed CDC generation without data, which will trigger an
on_internal_error check.

Fix this by delaying the moment when the in memory data structures are
first loaded. In 579dcf1, a mechanism was introduced which persists the
commit index before applying commands that are considered committed.
Starting a raft server waits until commands are replayed up to that
point. The fix is to start the group0_state_machine in a mode which only
applies mutations - the aforementioned mechanism will re-apply the
commands which will, thanks to the mutation idempotency, bring the
group0 to a consistent state. After the group0 is known to be in
consistent state (so, after raft::server_impl::start) the in-memory data
structures of group0 are loaded for the first time.

There is an exception, however: schema tables. Information about schema
is actually loaded into memory earlier than the moment when group0 is
started. Applying changes to schema is done through the migration
manager module which compares the persisted state before and after the
schema mutations are applied and acts on that. Refactoring migration
manager is out of scope of this PR. However, this is not a problem
because the migration manager takes care to apply all of the mutations
given in a command in a single commitlog segment, so the initial schema
loading code should not see an inconsistent state due to the state being
partially applied.

Fixes: scylladb/scylladb#26945
2025-12-23 20:44:16 +01:00
Piotr Dulikowski
f4efdf18a5 group0_state_machine: move reloading out of std::visit
In the next commit, we will adjust the logic so that it only reloads in
memory state only when a flag is set. By moving the reload logic to one
place in `merge_and_apply`, the next commit will be able to reach its
goal by only adding a single `if`.
2025-12-23 20:44:16 +01:00
Piotr Dulikowski
6bdbd91cf7 service: raft: add state machine ref to raft_server_for_group
This reference will be used by the code that starts group0. It will
manually enable the in-memory state machine only after the group0 server
is fully started, which entails replaying the group0 commands that are,
locally, seen as committed - in order to repair any inconsistencies that
might have arisen due to some commands being applied only partially
(e.g. due to a crash).
2025-12-23 20:44:16 +01:00
Michał Hudobski
ce3320a3ff auth: add system table permissions to VECTOR_SEARCH_INDEXING
Due to the recent changes in the vector store service,
the service needs to read two of the system tables
to function correctly. This was not accounted for
when the new permission was added. This patch fixes that
by allowing these tables (group0_history and versions)
to be read with the VECTOR_SEARCH_INDEXING permission.

We also add a test that validates this behavior.

Fixes: SCYLLADB-73

Closes scylladb/scylladb#27546
2025-12-23 15:53:07 +02:00
Pawel Pery
caa0cbe328 vector_search_validator: move high availability tests from vector-store.git
Initially, tests for high availability were implemented in vector-store.git
repository. High availability is currently implemented in scylladb.git
repository so this repository should be the better place to store them. This
commit copies these tests into the scylladb.git.

The commit copies validator-vector-store/src/high_availability.rs (tests logic)
and validator-tests/src/common.rs (utils for tests) into the local crate
validator-scylla. The common.rs should be copied to be able for reviewer to see
common test code and this code most likely be frequent to change - it will be
hard to maintain one common version between two repositories.

The commit updates also README for vector_search_validator; it shortly describe
the validator modules.

The commit updates reference to the latest vector-store.git master.

As a next step on the vector-store.git high_availability.rs would be removed
and common.rs moved from validator-tests into validator-vector-store.

References: VECTOR-394

Closes scylladb/scylladb#27499
2025-12-23 15:53:07 +02:00
Yaron Kaikov
bad2fe72b6 .github/workflows: Add email validator workflow
This workflow validates that all commits in a pull request use email
addresses ending in @scylladb.com. For each commit with an author or
committer email that doesn't match this pattern, the workflow automatically
adds a comment to the pull request with a warning.

This serves two purposes:

1. Alert maintainers when external contributors submit code (which is
   acceptable, but good to be aware of)

2. Help ScyllaDB developers catch cases where they haven't configured
   their git email correctly

When a non-@scylladb.com email is detected, the workflow posts this
comment on the pull request:
```
    ⚠️ Non-@scylladb.com Email Addresses Detected

    Found commit(s) with author or committer emails that don't end with
    @scylladb.com.

    This indicates either:

    - An external contributor (acceptable, but maintainer should be aware)
    - A developer who hasn't configured their git email correctly

    For ScyllaDB developers:

    If you're a ScyllaDB employee, please configure your git email globally:

        git config --global user.email "your.name@scylladb.com"

    If only your most recent commit is invalid, you can amend it:

        git commit --amend --reset-author --no-edit
        git push --force

    If you have multiple invalid commits, you need to rewrite them all:

        git rebase -i <base-commit>
        # Mark each invalid commit as 'edit', then for each:
        git commit --amend --reset-author --no-edit
        git rebase --continue
        # Repeat for each invalid commit
        git push --force
```
Fixes: https://scylladb.atlassian.net/browse/RELENG-35

Closes scylladb/scylladb#27796
2025-12-23 15:53:06 +02:00
Pavel Emelyanov
ec15a1b602 table: Do not move foreign string when writing snapshot
The table::seal_snapshot() accepts a vector of sstables filenames and
writes them into manifest file. For that, it iterates over the vector
and moves all filenames from it into the streamer object.

The problem is that the vector contains foreign pointers on sets with
sstrings. Not only sets are foreign, sstrings in it are foreign too.
It's not correct to std::move() them to local CPU.

The fix is to make streamer object work on string_view-s and populate it
with non-owning references to the sstrings from aforementioned sets.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27755
2025-12-23 15:53:06 +02:00
Pavel Emelyanov
ecef158345 api: Use ranges library to process views in get_built_indexes()
No functional changes, just make the loop shorter and more
self-contained.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27742
2025-12-23 15:53:06 +02:00
Israel Fruchter
53abf93bd8 Update tools/cqlsh submodule
* tools/cqlsh 22401228...9e5a91d7 (7):
  > Add pip retry configuration to handle network timeouts
  > Clean up unwanted build artifacts and update .gitignore
  > test_legacy_auth: update to pytest format
  > Add support for disabling compression via CLI and cqlshrc
  > Update scylla-driver to 3.29.7 (#144)
  > Update scylla-driver version to 3.29.6
  > Revert "Migrate workflows to Blacksmith"

Closes scylladb/scylladb#27567

[avi: build optimized clang 21.1.7
      regenerate frozen toolchain with optimized clang from

      https://devpkg.scylladb.com/clang/clang-21.1.7-Fedora-43-aarch64.tar.gz
      https://devpkg.scylladb.com/clang/clang-21.1.7-Fedora-43-x86_64.tar.gz
]

Closes scylladb/scylladb#27734
2025-12-23 15:53:06 +02:00
Aleksandra Martyniuk
bbe64e0e2a test: rename duplicate tests
There are two test with name test_repair_options_hosts_tablets in
test/nodetool/test_cluster_repair.py and and two test_repair_keyspace
in test/nodetool/test_repair.py. Due to that one of each pair is ignored.

Rename the tests so that they are unique.

Fixes: https://github.com/scylladb/scylladb/issues/27701.

Closes scylladb/scylladb#27720
2025-12-23 15:53:06 +02:00
Anna Stuchlik
7198191aa9 doc: fix the license information on DockerHub
This commit removes the OSS-related information from DockerHub.
It adds the link to the Source Available license.

Fixes https://github.com/scylladb/scylladb/issues/22440

Closes scylladb/scylladb#27706
2025-12-23 15:53:06 +02:00
Calle Wilund
d5f72cd5fc test::pylib::encryption_provider: Push up setting system_key_directory to all providers
Fixes #27694

Unless set by config, the location will default to /etc/scylla, which is not a good
place to write things for tests. Push the config properly and the directory (but
_not_ creation) to all provider basetype.

Closes scylladb/scylladb#27696
2025-12-23 15:53:06 +02:00
Dawid Mędrek
afde5f668a test: Implement describing Boost tests in JSON format
The Boost.Test framework offers a way to describe tests written in it
by running them with the option `--list_content`. It can be
parametrized by either HRF (Human Readable Format) or DOT (the Graphviz
graph format) [1]. Thanks to that, we can learn the test tree structure
and collect additional information about the tests (e.g. labels [2]).

We currently emply that feature of the framework to collect and run
Boost tests in Scylla. Unfortunately, both formats have their
shortcomings:

* HRF: the format is simple to parse, but it doesn't contain all
       relevant information, e.g. labels.
* DOT: the format is designed for creating graphical visualizations,
       and it's relatively difficult to parse.

To amend those problems, we implement a custom extension of the feature.
It produces output in the JSON format and contains more than the most
basic information about the tests; at the same time, it's easy to browse
and parse.

To obtain that output, the user needs to call a Boost.Test executable
with the option `--list_json_content`. For example:

```
$ ./path/to/test/exec -- --list_json_content
```

Note that the argument should be prepended with a `--` to indicate that
it targets user code, not Boost.Test itself.

---

The structure of the new format looks like this (top-level downwards):

- File name
- Test suite(s) & free test cases
- Test cases wrapped in test suites

Note that it's different from the output the default Boost.Test formats
produce: they organize information within test suites, which can
potentially span multiple files [3]. The JSON format makes test files
the primary object of interest and test suites from different files
are always considered distinct.

Example of the output (after applying some formatting):

```
$ ./build/dev/test/boost/canonical_mutation_test -- --list_json_content
[{"file":"test/boost/canonical_mutation_test.cc", "content": {
  "suites": [],
  "tests": [
    {"name": "test_conversion_back_and_forth", "labels": ""},
    {"name": "test_reading_with_different_schemas", "labels": ""}
  ]
}}]
```

---

The implementation may be seen as a bit ugly, and it's effectively
a hack. It's based on registering a global fixture [4] and linking
that code to every Boost.Test executable.

Unfortunately, there doesn't seem to be any better way. That would
require more extensive changes in the test files (e.g. enforcing
going through the same entry point in all of them).

This implementation is a compromise between simplicity and
effectiveness. The changes are kept minimal, while the developers
writing new tests shouldn't need to remember to do anything special.
Everything should work out of the box (at least as long as there's
no non-trivial linking involved).

Fixes scylladb/scylladb#25415

---

References:
[1] https://www.boost.org/doc/libs/1_89_0/libs/test/doc/html/boost_test/utf_reference/rt_param_reference/list_content.html
[2] https://www.boost.org/doc/libs/1_89_0/libs/test/doc/html/boost_test/tests_organization/tests_grouping.html
[3] https://www.boost.org/doc/libs/1_89_0/libs/test/doc/html/boost_test/tests_organization/test_tree/test_suite.html
[4] https://www.boost.org/doc/libs/1_89_0/libs/test/doc/html/boost_test/tests_organization/fixtures/global.html

Closes scylladb/scylladb#27527
2025-12-23 15:53:06 +02:00
Asias He
140858fc22 tablet-mon.py: Add repair support
Add repair support in the tablet monitor.

Fixes #24824

Closes scylladb/scylladb#27400
2025-12-23 15:53:06 +02:00
Pavel Emelyanov
132aa753da sstables/storage_manager: Fix configured endpoints observer
On start the manager creates observer for object_storage_endpoints
config parameter. The goal is to refresh the maintained set of endpoint
parameters and client upon config change. The observer is created on
shard 0 only, and when kicked it calls manager.invoke-on-all to update
manager on all shards.

However, there's a race here. The thing is that db::config values are
implicitly "sharded" under the hood with the help of plain array. When
any code tries to read a value from db::config::something, the reading
code secretly gets the value from this inner array indexed by the
current shard id.

Next, when the config is updated, it first assigns new values to [0]
element of the hidden array, then calls broadcast_to_all_shards() helper
that copies the valaues from zeroth slot to all the others. But the
manager's observer is triggered when the new value is assigned on zero
index, and if the invoke-on-all lambda (mentioned above) happens to be
faster than broadcast_to_all_shards, the non-zero shards will read old
values from db::config's inner array.

The fix is to instantiate observer on all shards and update only local
shard, whenever this update is triggered.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:43:11 +03:00
Pavel Emelyanov
f902eb1632 test/object_store: Add test to validate how endpoint config update works
There's a test for backup with non-existing endpoint/bucket/snapshot. It
checks that API call to backup sstables properly fails in that case.
This patch adds similar test for "unconfigured endpoint", but it adds
the endpoint configuration on-the-fly and expects that backup will
proceed after config update.

Currently the test fails, as config update only affect the config
itself, the storage_manager, that's in charge of maintaining endpoint
clients, is not really updated. Next patch will fix it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:41:38 +03:00
Pavel Emelyanov
e0cddc8c99 table: Move snapshot_file_set to table.cc
It's not used anywhere else now.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:14:36 +03:00
Pavel Emelyanov
e31b72c61f table: Rename and move snapshot_on_all_shards() method
Now it's database::snapshot_table_on_all_shards(). This is symmetric to
database::truncate_table_on_all_shards().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:14:36 +03:00
Pavel Emelyanov
48b1ceefaf table: Ditch jsondir variable
Now the table::snapshot_on_all_shards() is storage-independent and can
stop maintaining the local path variable.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:14:36 +03:00
Pavel Emelyanov
a21aa5bdf6 table, sstables: Pass snapshot name to sstable::snapshot()
Currently sstable::snapshot() is called with directory name where to put
snapshots into. This patch changes it to accept snapshot name instead.
This makes the table-sstable API be unware of snapshot destination
storage type.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:14:36 +03:00
Pavel Emelyanov
d0812c951e table: Use snapshot_writer in write_manifest()
The manifest writing code is self-contained in a sense that it needs
list of sstable files and output_stream to write it too. The
snapshot_writer can provide output_stream for specific component, it can
be re-used for manifest writing code, thus making it independent from
local filesystem.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:13:47 +03:00
Pavel Emelyanov
fe8923bdc7 table: Use snapshot_writer in write_schema_as_cql()
The schema writing code is self-contained in a sense that it needs
schema description and output_stream to write it too. Teach the
snapshot_writer to provide output_stream and make write_schema_as_cql()
be independent from local filesystem.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:06:17 +03:00
Pavel Emelyanov
9fee06d3bc table: Add snapshot_writer::sync()
And move calls to sync_directory() into it too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:06:16 +03:00
Pavel Emelyanov
7a298788c0 table: Add snapshot_writer::init()
And move the call to recursive_touch_directory into it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:06:05 +03:00
Pavel Emelyanov
db6a5aa20b table: Introduce snapshot_writer
It's an abstract class that defines how to write data and metadata with
table snapshot. Currently it just replaces the storage_options checks
done by table::snapshot_on_all_shards(), but it will soon evolve.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:00:26 +03:00
Pavel Emelyanov
8e247b06a2 table: Move final sync and rename seal_snapshot()
The seal_snapshot() syncs directory at the end. Now when the method is
table.cc-local, it doesn't need to be that careful. It looks nicer if
being renamed to write_manifest() and it's caller that syncs directory
after calling it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:26 +03:00
Pavel Emelyanov
37746ba814 table: Hide write_schema_as_cql()
The method only needs schema description from table. The caller can
pre-get it and pass it as argument. This makes it symmetric with
seal_snapshot() (that will be renamed soon) and reduces the class table
API size.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:26 +03:00
Pavel Emelyanov
976bcef5d0 table: Hide table::seal_snapshot()
The method is static and has nothing to do with table. The
snapshot_file_set needs to become public, but it will be moved to
table.cc soon.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:26 +03:00
Pavel Emelyanov
8a4daf3ef1 table: Open-code finalize_snapshot()
It makes is easier to modify this code further.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:26 +03:00
Pavel Emelyanov
6975234d1c table: Fix indentation after previuous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:26 +03:00
Pavel Emelyanov
4a88f15465 table: Use smp::invoke_on_all() to populate the vector with filenames
There's a vector of foreign pointers to sets with sstable filenames
that's populated on all shards. The code does the invoke-on-all by hand
to grow the vector with push-back-s. However, if resizing the vector in
advance, shards will be able to just populate their slots.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:25 +03:00
Pavel Emelyanov
41ed11cdbe table: Don't touch dir once more on seal_snapshot()
The directory had been created way before that, in the caller.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:25 +03:00
Pavel Emelyanov
07708acebd table: Open-code table::take_snapshot() into caller lambda
Now when the logic of take_snapshot() is split between two components
(table and sstables_manager) it's no longer useful

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:25 +03:00
Pavel Emelyanov
ef3b651e1b table: Move parts of table::take_snapshot to sstables_manager
Move the loop over vector of sstables that calls sstable->snapshot()
into sstables manager.

This makes it symmetric with sstables_manager::delete_atomically() and
allows for future changes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:25 +03:00
Pavel Emelyanov
0c45a7df00 table: Introduce table::take_snapshot()
The method returns all sstables vector with a guard that prevents this
list from being modified. Currently this is the part of another existing
table::take_snapshot() method, but the newer, smaller one, is more
atomic and self-contained, next patches will benefit from it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:25 +03:00
Pavel Emelyanov
2e2cd2aa39 table: Store the result of smp::submit_to in local variable
Tossing bits around not to make it in the next, larger, patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:25 +03:00
Botond Dénes
58b5d43538 Merge 'test: multi LWT and counters test during tablets resize and migration' from Yauheni Khatsianevich
This PR extends BaseLWTTester with optional counter-table configuration and
verification, enabling randomized LWT tests over tablets with counters.

And introduces new LWT with counters test durng tablets resize and migration
- Workload: N workers perform CAS updates
- Update counter table each time CAS was successful
- Enable balancing and increase min_tablet_count to force split,
 and lower min_tablet_count to merge.
- Run tablets migrations loop
- Stop workload and verify data consistency

Refs: https://github.com/scylladb/qa-tasks/issues/1918
Refs: https://github.com/scylladb/qa-tasks/issues/1988
Refs https://github.com/scylladb/scylladb/issues/18068

Closes scylladb/scylladb#27170

* github.com:scylladb/scylladb:
  test: new LWT with counters test during tablets migration/resize - Workload: N workers perform CAS updates - Update counter table each time CAS was successful - Enable balancing and increase min_tablet_count to force split,  and lower min_tablet_count to merge. - Run tablets migrations loop - Stop workload and verify data consistency
  test/lwt: add counter-table support to BaseLWTTester
2025-12-23 07:29:35 +02:00
Botond Dénes
bfdd4f7776 Merge 'Synchronize incremental repair and tablet split' from Raphael Raph Carvalho
Split prepare can run concurrently with repair.

Consider this:

1) split prepare starts
2) incremental repair starts
3) split prepare finishes
4) incremental repair produces unsplit sstable
5) split is not happening on sstable produced by repair
        5.1) that sstable is not marked as repaired yet
        5.2) might belong to repairing set (has compaction disabled)
6) split executes
7) repairing or repaired set has unsplit sstable

If split was acked to coordinator (meaning prepare phase finished),
repair must make sure that all sstables produced by it are split.
It's not happening today with incremental repair because it disables
split on sstables belonging to repairing group. And there's a window
where sstables produced by repair belong to that group.

To solve the problem, we want the invariant where all sealed sstables
will be split.
To achieve this, streaming consumers are patched to produce unsealed
sstable, and the new variant add_new_sstable_and_update_cache() will
take care of splitting the sstable while it's unsealed.
If no split is needed, the new sstable will be sealed and attached.

This solution was also needed to interact nicely with out of space
prevention too. If disk usage is critical, split must not happen on
restart, and the invariant aforementioned allows for it, since any
unsplit sstable left unsealed will be discarded on restart.
The streaming consumer will fail if disk usage is critical too.

The reason interposer consumer doesn't fully solve the problem is
because incremental repair can start before split, and the sstable
being produced when split decision was emitted must be split before
attached. So we need a solution which covers both scenarios.

Fixes #26041.
Fixes #27414.

Should be backported to 2025.4 that contains incremental repair

Closes scylladb/scylladb#26528

* github.com:scylladb/scylladb:
  test: Add reproducer for split vs intra-node migration race
  test: Verify split failure on behalf of repair during critical disk utilization
  test: boost: Add failure_when_adding_new_sstable_test
  test: Add reproducer for split vs incremental repair race condition
  compaction: Fail split of new sstable if manager is disabled
  replica: Don't split in do_add_sstable_and_update_cache()
  streaming: Leave sstables unsealed until attached to the table
  replica: Wire add_new_sstables_and_update_cache() into intra-node streaming
  replica: Wire add_new_sstable_and_update_cache() into file streaming consumer
  replica: Wire add_new_sstable_and_update_cache() into streaming consumer
  replica: Document old add_sstable_and_update_cache() variants
  replica: Introduce add_new_sstables_and_update_cache()
  replica: Introduce add_new_sstable_and_update_cache()
  replica: Account for sstables being added before ACKing split
  replica: Remove repair read lock from maybe_split_new_sstable()
  compaction: Preserve state of input sstable in maybe_split_new_sstable()
  Rename maybe_split_sstable() to maybe_split_new_sstable()
  sstables: Allow storage::snapshot() to leave destination sstable unsealed
  sstables: Add option to leave sstable unsealed in the stream sink
  test: Verify unsealed sstable can be compacted
  sstables: Allow unsealed sstable to be loaded
  sstables: Restore sstable_writer_config::leave_unsealed
2025-12-23 07:28:56 +02:00
Botond Dénes
bf9640457e Merge 'test: add crash detection during tests' from Cezar Moise
After tests end, an extra check is performed, looking into node logs for crashes, aborts and similar issues.
The test directory is also scanned for coredumps.
If any of the above are found, the test will fail with an error.

The following checks are made:
- Any log line matching `Assertion.*failed` or containing `AddressSanitizer` is marked as a critical error
- Lines matching `Aborting on shard` will only be marked as a critical error if the paterns in `manager.ignore_cores_log_patterns` are not found in that log
- If any critical error is found, the log is also scanned for backtraces
- Any backtraces found are decoded and saved
- If the test is marked with `@pytest.mark.check_nodes_for_errors`, the logs are checked for any `ERROR` lines
- Any pattern in `manager.ignore_log_patterns` and `manager.ignore_cores_log_patterns` will cause above check to ignore that line
- The `expected_error` value that many methods, like `manager.decommission_node`, have will be automatically appended to `manager.ignore_log_patterns`

refs: https://github.com/scylladb/qa-tasks/issues/1804

---

[Examples](https://jenkins.scylladb.com/job/scylla-staging/job/cezar/job/byo_build_tests_dtest/46/testReport/):
Following examples are run on a separate branch where changes have been made to enable these failures.

`test_unfinished_writes_during_shutdown`
- Errors are found in logs and are not ignored
```
failed on teardown with "Failed:
Server 2096: found 1 error(s) (log: scylla-2096.log)
  ERROR 2025-12-15 14:20:06,563 [shard 0: gms] raft_topology - raft_topology_cmd barrier_and_drain failed with: std::runtime_error (raft topology: command::barrier_and_drain, the version has changed, version 11, current_version 12, the topology change coordinator  had probably migrated to another node)
Server 2101: found 4 error(s) (log: scylla-2101.log)
  ERROR 2025-12-15 14:20:04,674 [shard 0:strm] repair - repair[c434c0c0-68da-472c-ba3e-ed80960ce0d5]: Repair 1 out of 4 ranges, keyspace=system_distributed, table=view_build_status, range=(minimum token,maximum token), peers=[27c027a6-603d-49d0-8766-1b085d8c7d29, b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e], live_peers=[b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e], status=failed: mandatory neighbor=27c027a6-603d-49d0-8766-1b085d8c7d29 is not alive
  ERROR 2025-12-15 14:20:04,674 [shard 1:strm] repair - repair[c434c0c0-68da-472c-ba3e-ed80960ce0d5]: Repair 1 out of 4 ranges, keyspace=system_distributed, table=view_build_status, range=(minimum token,maximum token), peers=[27c027a6-603d-49d0-8766-1b085d8c7d29, b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e], live_peers=[b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e], status=failed: mandatory neighbor=27c027a6-603d-49d0-8766-1b085d8c7d29 is not alive
  ERROR 2025-12-15 14:20:04,675 [shard 0: gms] raft_topology - raft_topology_cmd stream_ranges failed with: std::runtime_error (["shard 0: std::runtime_error (repair[c434c0c0-68da-472c-ba3e-ed80960ce0d5]: 1 out of 4 ranges failed, keyspace=system_distributed, tables=[\"view_build_status\", \"cdc_generation_timestamps\", \"service_levels\", \"cdc_streams_descriptions_v2\"], repair_reason=bootstrap, nodes_down_during_repair={27c027a6-603d-49d0-8766-1b085d8c7d29}, aborted_by_user=false, failed_because=std::runtime_error (Repair mandatory neighbor=27c027a6-603d-49d0-8766-1b085d8c7d29 is not alive, keyspace=system_distributed, mandatory_neighbors=[27c027a6-603d-49d0-8766-1b085d8c7d29, b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e]))", "shard 1: std::runtime_error (repair[c434c0c0-68da-472c-ba3e-ed80960ce0d5]: 1 out of 4 ranges failed, keyspace=system_distributed, tables=[\"view_build_status\", \"cdc_generation_timestamps\", \"service_levels\", \"cdc_streams_descriptions_v2\"], repair_reason=bootstrap, nodes_down_during_repair={27c027a6-603d-49d0-8766-1b085d8c7d29}, aborted_by_user=false, failed_because=std::runtime_error (Repair mandatory neighbor=27c027a6-603d-49d0-8766-1b085d8c7d29 is not alive, keyspace=system_distributed, mandatory_neighbors=[27c027a6-603d-49d0-8766-1b085d8c7d29, b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e]))"])
  ERROR 2025-12-15 14:20:06,812 [shard 0:main] init - Startup failed: std::runtime_error (Bootstrap failed. See earlier errors (Rolled back: Failed stream ranges: std::runtime_error (failed status returned from 9dd942aa-acec-4105-9719-9bda403e8e94)))
Server 2094: found 1 error(s) (log: scylla-2094.log)
  ERROR 2025-12-15 14:20:04,675 [shard 0: gms] raft_topology - send_raft_topology_cmd(stream_ranges) failed with exception (node state is bootstrapping): std::runtime_error (failed status returned from 9dd942aa-acec-4105-9719-9bda403e8e94)"
```

`test_kill_coordinator_during_op`
- aborts caused by injection
- `ignore_cores_log_patterns` is not set
- while there are errors in logs and `ignore_log_patterns` is not set, they are ignored automatically due to the `expected_error` parameter, such as in `await manager.decommission_node(server_id=other_nodes[-1].server_id, expected_error="Decommission failed. See earlier errors")`
```
failed on teardown with "Failed:
Server 1105: found 1 critical error(s), 1 backtrace(s) (log: scylla-1105.log)
  Aborting on shard 0, in scheduling group gossip.
  1 backtrace(s) saved in scylla-1105-backtraces.txt
Server 1106: found 1 critical error(s), 1 backtrace(s) (log: scylla-1106.log)
  Aborting on shard 0, in scheduling group gossip.
  1 backtrace(s) saved in scylla-1106-backtraces.txt
Server 1113: found 1 critical error(s), 1 backtrace(s) (log: scylla-1113.log)
  Aborting on shard 0, in scheduling group gossip.
  1 backtrace(s) saved in scylla-1113-backtraces.txt
Server 1148: found 1 critical error(s), 1 backtrace(s) (log: scylla-1148.log)
  Aborting on shard 0, in scheduling group gossip.
  1 backtrace(s) saved in scylla-1148-backtraces.txt"
```
Decoded backtrace can be found in [failed_test_logs](https://jenkins.scylladb.com/job/scylla-staging/job/cezar/job/byo_build_tests_dtest/46/artifact/testlog/x86_64/dev/failed_test/test_kill_coordinator_during_op.dev.1)

Closes scylladb/scylladb#26177

* github.com:scylladb/scylladb:
  test: add logging to crash_coordinator_before_stream injection
  test: add crash detection during tests
  test.py: add pid to ServerInfo
2025-12-23 07:27:58 +02:00
Pavel Emelyanov
cd2568ad00 test: Merge and parametrize test_backup_to_non_existent_something tests
There are three tests in cluster/object_store suite that check how
backup fails in case either of its parameters doesn't really exists. All
three greatly duplicate each other, it makes sense to merge them into
one larger parametrized test.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27695
2025-12-23 07:02:18 +02:00
Avi Kivity
7586c5ccbd Merge 'system.clients: add client_options map column' from Vladislav Zolotarov
This pull request introduces a new caching mechanism for client options in the Alternator and transport layers, refactors how client metadata is stored and accessed, and extends the `system.clients` virtual table to surface richer client information. The changes improve efficiency by deduplicating commonly used strings (like driver names/versions and client options), and ensure that client data is handled in a way that's safe for cross-shard access. Additionally, the test suite and virtual table schema are updated to reflect the new client options data.

**Caching and client metadata refactoring:**

* The largest and most repeatable items in the connection state before this PR were a `driver_name` and a `driver_version` which were stored as an `sstring` object which means that the corresponding memory consumption was 16 bytes per each such value at least (the smallest size of the `seastar`'s `sstring` object) **per-connection**. In reality the driver name is usually longer than 15 characters, e.g. "ScyllaDB Python Driver" is 23 characters and this is not the longest driver name there is. In such cases the actual memory usage of a corresponding `sstring` object jumps to 8 + 4 + 1 + (string length, 23 in our example) + 1.
So, for "ScyllaDB Python Driver" it would be 37 bytes (in reality it would be a bit more due to natural alignment of other allocations since the `contents` size is not well aligned (13 bytes), but let's ignore this for now).
* These bytes add up quickly as there are more connections and, sometimes we are talking about millions of connections per-shard.
* Using a smart pointer (`lw_shared_ptr`) referencing a corresponding cached value will effectively reduce the per-connection memory usage to be 8 bytes (a size of a pointer on 64-bit CPU platform) for each such value. While storing a corresponding `sstring` value only once.
* This will would reduce the "variable" (per-connection) memory usage by **at least 50%**. And in case of "ScyllaDB Python Driver" driver version - by 78%!
* And all this for a price of a single `loading_shared_values` object **per-shard** (implements a hash table) and a minor overhead for each value **stored** in it.

* Introduced a new cache type (`client_options_cache_type`) for deduplicating and sharing client option strings, and refactored `client_data`, `client_state`, and related classes to use `foreign_ptr<std::unique_ptr<client_data>>` and cached entry types for fields like driver name, driver version, and client options. (`client_data.hh`, `service/client_state.hh`, `alternator/server.hh`, `alternator/controller.hh`, `transport/controller.hh`, `transport/protocol_server.hh`) [[1]](diffhunk://#diff-664a3b19e905481bdf8eb3843fc4d34691067bb97ab11cfd6e652e74aac51d9fR33-R36) [[2]](diffhunk://#diff-664a3b19e905481bdf8eb3843fc4d34691067bb97ab11cfd6e652e74aac51d9fL40-R56) [[3]](diffhunk://#diff-daadce1a2de3667511e59558f3a8f077b5ee30a14bcc6a99d588db90d0fcd2bdL105-R107) [[4]](diffhunk://#diff-daadce1a2de3667511e59558f3a8f077b5ee30a14bcc6a99d588db90d0fcd2bdL154-R182) [[5]](diffhunk://#diff-5fce246edf5abffb2351bd02e2eb1e9850880f7a00607ccaa90c3eee7ef57c6bL91-R92) [[6]](diffhunk://#diff-5fce246edf5abffb2351bd02e2eb1e9850880f7a00607ccaa90c3eee7ef57c6bL110-R111) [[7]](diffhunk://#diff-31730ba8e7374f784a88dc27c1512291cf73b7f24e08768f7466a3c8cfcc7a1aL96-R96) [[8]](diffhunk://#diff-19a97c0247cc08155ee49b277e43859ca32d6ef8cbff0ed7368ec5fa19e0a11eL172-R172) [[9]](diffhunk://#diff-eea7e2db5d799a25e717a72ac8ce5842bd4adb72b694d38d8f47166d9cd926faL356-R356) [[10]](diffhunk://#diff-d0b4ec3a144bbc5dc993866cf0b940850a457ff6156064f7e2b4b10ad0a95fefL80-R80) [[11]](diffhunk://#diff-4293b94c444d9bd5ecd17ce7eda8c00685d35ecf6e07f844efc91a91bbe85be1L46-R48)

* Updated the methods for setting and getting driver name, driver version, and client options in `client_state` to be asynchronous and use the new cache. (`service/client_state.hh`, `service/client_state.cc`) [[1]](diffhunk://#diff-daadce1a2de3667511e59558f3a8f077b5ee30a14bcc6a99d588db90d0fcd2bdL154-R182) [[2]](diffhunk://#diff-99634aae22e2573f38b4e2f050ed2ac4f8173ff27f0ae8b3609d1f0cc1aeb775R347-R362)

**Virtual table and API enhancements:**

* Extended the `system.clients` virtual table schema and implementation to include a new `client_options` column (a map of option key/value pairs), and updated the table population logic to use the new cached types and foreign pointers. (`db/virtual_tables.cc`) [[1]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1R752) [[2]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1L769-R770) [[3]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1L809-R816) [[4]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1L828-R879)

**API and interface changes:**

* Changed the signatures of `get_client_data` methods throughout the codebase to return vectors of `foreign_ptr<std::unique_ptr<client_data>>` instead of plain `client_data` objects, to ensure safe cross-shard access. (`alternator/controller.hh`, `alternator/controller.cc`, `alternator/server.hh`, `alternator/server.cc`, `transport/controller.hh`, `transport/protocol_server.hh`) [[1]](diffhunk://#diff-31730ba8e7374f784a88dc27c1512291cf73b7f24e08768f7466a3c8cfcc7a1aL96-R96) [[2]](diffhunk://#diff-19a97c0247cc08155ee49b277e43859ca32d6ef8cbff0ed7368ec5fa19e0a11eL172-R172) [[3]](diffhunk://#diff-5fce246edf5abffb2351bd02e2eb1e9850880f7a00607ccaa90c3eee7ef57c6bL110-R111) [[4]](diffhunk://#diff-a7e2cda866c03a75afcf3b087de1c1dcd2e7aa996214db67f9a11ed6451e596dL988-R995) [[5]](diffhunk://#diff-eea7e2db5d799a25e717a72ac8ce5842bd4adb72b694d38d8f47166d9cd926faL356-R356) [[6]](diffhunk://#diff-d0b4ec3a144bbc5dc993866cf0b940850a457ff6156064f7e2b4b10ad0a95fefL80-R80) [[7]](diffhunk://#diff-4293b94c444d9bd5ecd17ce7eda8c00685d35ecf6e07f844efc91a91bbe85be1L46-R48)

**Testing and validation:**

* Updated the Python test for the `system.clients` table to verify the new `client_options` column and its contents, ensuring that driver name and version are present in the options map. (`test/cqlpy/test_virtual_tables.py`) [[1]](diffhunk://#diff-6dd8bd4a6a82cd642252a29dc70726f89a46ceefb991c3e63fc67e283f323f03R79) [[2]](diffhunk://#diff-6dd8bd4a6a82cd642252a29dc70726f89a46ceefb991c3e63fc67e283f323f03R88-R90)

Closes scylladb/scylladb#25746

* github.com:scylladb/scylladb:
  transport/server: declare a new "CLIENT_OPTIONS" option as supported
  service/client_state and alternator/server: use cached values for driver_name and driver_version fields
  system.clients: add a client_options column
  controller: update get_client_data to use foreign_ptr for client_data
2025-12-22 20:02:40 +02:00
Emil Maskovsky
d60b908a8e test/raft: improve reporting in the randomized_nemesis_test digest functions
The Boost ASSERTs in the digest functions of the randomized_nemesis_test
were not working well inside the state machine digest functions, leading
to unhelpful boost::execution_exception errors that terminated the apply
fiber, and didn't provide any helpful information.

Replaced by explicit checks with on_fatal_internal_error calls that
provide more context about the failure. Also added validation of the
digest value after appending or removing an element, which allows to
determine which operation resulted in causing the wrong value.

This effectively reverts the changes done in https://github.com/scylladb/scylladb/pull/19282,
but adds improved error reporting.

Refs: scylladb/scylladb#27307
Refs: scylladb/scylladb#17030

Closes scylladb/scylladb#27791
2025-12-22 20:02:40 +02:00
Nikos Dragazis
20ff2fcc18 docs: Amend limitations for keyspace RF changes
The doc about DDL statements claims that an `ALTER KEYSPACE` will fail
in the presence of an ongoing global topology operation.

This limitation was specifically referring to RF changes, which Scylla
implements as global topology requests (`keyspace_rf_change`), and it
was true when it was first introduced (1b913dd880) because there was
no global topology request queue at that time, so only one ongoing
global request was allowed in the cluster.

This limitation was lifted with the introduction of the global topology
request queue (6489308ebc), and it was re-introduced again very
recently (2e7ba1f8ce) in a slightly different form; it now applies only
to RF changes (not to any request type) and only those that affect the
same keyspace. None of these two changes were ever reflected in the doc.

Synchronize the doc with the current state.

Fixes #27776.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#27786
2025-12-22 20:02:40 +02:00
Andrei Chekun
6ffdada0ea test.py: modify JUnit report for easier rerun on CI
This will allow to add custom XML attribute to the JUnit report. In this
case there will be path to the function that can be used to run with
pytest command. Parametrized tests will have path to the function
excluding parameter.

Closes scylladb/scylladb#27707
2025-12-22 20:02:40 +02:00
Anna Stuchlik
4c247a5d08 doc: document support for i8g and i8ge instances
Fixes https://github.com/scylladb/scylladb/issues/27703

Closes scylladb/scylladb#27754
2025-12-22 20:02:40 +02:00
copilot-swe-agent[bot]
288d4b49e9 Skip backtrace in lsa-timing logs for preemptible reclaim
Preemptible reclaim is only done from the background reclaimer,
so backtrace is not useful. It's also normal that it takes a long time.
Skip the backtrace when reclaim is preemptible to reduce log noise.

Fixes the issue where background reclaim was printing unnecessary
backtraces in lsa-timing logs when operations took longer than the
stall detection threshold.

Closes: #27692

Co-authored-by: tgrabiec <283695+tgrabiec@users.noreply.github.com>
2025-12-22 20:02:40 +02:00
Pavel Emelyanov
e304d912b4 Merge 'db/view/view_building_worker: follow-ups' from Michał Jadwiszczak
This patch consists of a few smaller follow-ups to the view building worker:
- catch general execption in staging task registrator
- remove unnecessary CV broadcast
- don't pollute function context with conditionally compiled variable
- avoid creating a copy of tasks map
- fix some typos

Refs https://github.com/scylladb/scylladb/issues/25929
Refs https://github.com/scylladb/scylladb/pull/26897

This PR doesn't fix any bugs but recently we're backporting some PRs to 2025.4, so let's also backport this one to avoid painful conflicts.

Closes scylladb/scylladb#26558

* github.com:scylladb/scylladb:
  docs/dev/view-building-coordinator: fix typos
  db/view/view_building_worker: remove unnnecessary empty lines
  db/view/view_building_worker: fix typo
  db/view/view_building_worker: avoid creating a copy of tasks map
  db/view/view_building_worker: wrap conditionally compiled code in a scope
  db/view/view_building_worker: remove unnecessary CV broadcast
  db/view/view_building_worker: catch general execption in staging task registrator
2025-12-22 20:02:40 +02:00
Botond Dénes
846a6e700b Merge 'get_snapshot_details: process also staging directory' from Benny Halevy
Currently, we determine the live vs. total snapshot size by listing all files in the snapshot directory,
and for each name, look it up in the base table directory and see if it exists there, and if so, if it's the same file
as in the snapshot by looking to the fstat data for the dev id and inode number.

However, we do not look the names in the staging directory so staging sstable
would skew the results as the will falsely contribute to the live size, since they
wouldn't be found in the base directory.

This change processes both the staging directory and base table directory
and keeps the file capacity in a map, indexed by the files inode number, allowing us to easily
detect hard links and be resilient against concurrent move of files from the staging sub-directory
back into the base table directory.

Fixes #27635

* Minor issue, no backport required

Closes scylladb/scylladb#27636

* github.com:scylladb/scylladb:
  table: get_snapshot_details: add FIXME comments
  table: get_snapshot_details: lookup entries also in the staging directory
  table: get_snapshot_details: optimize using the entry number_of_links
  table: get_snapshot_details: continue loop for manifest and schema entries
  table: get_snapshot_details: use directory_lister
2025-12-22 20:02:40 +02:00
Botond Dénes
af5e73def9 Merge 'test/cqlpy: remove unused variables' from Nadav Har'El
These patches fix a bunch of variables defined in test/cqlpy tests, but not used. Besides wasting a few bytes on disk, these unused variables can add confusion for readers who see them and might think they have some use which they are missing.

All these unused variables were found by Copilot's "code quality" scanner, but I considered each of them, and fixed them manually.

Closes scylladb/scylladb#27667

* github.com:scylladb/scylladb:
  test/cqlpy: remove unused variables
  test/cqlpy: use unique partition in test
2025-12-22 20:02:39 +02:00
Avi Kivity
8e462d06be build: apply sccache to rust builds too
sccache works for rust as well as for C++; use it for rust builds
as well.
2025-12-22 15:36:15 +02:00
Avi Kivity
9ac82d63e9 build: prevent double caching by compiler cache
To avoid both sccache and ccache being used simultaneously,
detect if ccache is in $PATH and use whatever compiler would
have been called instead.
2025-12-22 15:36:14 +02:00
Anna Stuchlik
9793a45288 doc: add a Vector Search page under Features
This commit adds a page with an overview of Vector Search under the Features section.
It includes a link to the VS documentation in ScyllaDB Cloud,
as the feature is only available in ScyllaDB Cloud.

The purpose of the page is to raise awareness of the feature.

Fixes https://scylladb.atlassian.net/browse/VECTOR-215

Closes scylladb/scylladb#27787
2025-12-22 15:29:45 +02:00
Avi Kivity
afb96b6387 build: allow selecting compiler cache, including sccache
Add a new configuration option for selecting the compiler
cache. Prefer sccache if found, since it supports rust as
well as C++, has better support for distributed compilation,
and is slated to receive module support soon.

cmake is also supported.
2025-12-22 15:27:15 +02:00
Alex
033579ad6f db: api: service: Fix ClientConnectorError in test_client_routes The bug was caused by capturing local variables by reference in lambdas passed to with_retry(), which is a coroutine. When the coroutine suspends, the lambda frame exits and the referenced locals are destroyed, leading to use-after-lifetime issues. This change fixes the problem by ensuring safe ownership across suspension points and also refactors how route_keys and route_entries are passed from the caller. Previously they were passed as const lvalue references, which cannot be moved and therefore ended up being repeatedly copied across function calls and lambda invocations. The new approach avoids unnecessary copies and makes the lifetime semantics explicit and safe.
Fixes: 27792

no backport needed private link is only in master branch

Closes scylladb/scylladb#27795
2025-12-22 14:52:47 +02:00
Yaniv Michael Kaul
c1da552fa4 test/pylib/scylla_cluster.py:get_scylla_2025_1_executable() - retry curl download of 2025.1
For some reason, we might fail. Retry 10 times, and fail with an error code instead of 404 or whatnot.

Benign, I hope - no need to backport.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#27746
2025-12-22 14:45:06 +02:00
Piotr Smaron
cb3b96b8f4 raft: correct lease->least typo in a comment
Funny, when researching if our raft implementation relies on the 'lease'
mechanism, I noticed this typo.

Closes scylladb/scylladb#27803
2025-12-22 14:39:55 +02:00
Avi Kivity
b105ad8379 build: drop -fexperimental-assignment-tracking clang option
`-fexperimental-assignment-tracking` was added in fdd8b03d4b to
make coroutine debugging work.

However, since then, it became unnecessary, perhaps due to 87c0adb2fe,
or perhaps to a toolchain fix.

Drop it, so we can benefit from assignment tracking (whatever it is),
and to improve compatibility with sccache, which rejects this option.

I verified that the test added in fdd8b03d4b fails without the option
and passes with this patch; in other words we're not introducing a
regression here.

Closes scylladb/scylladb#27763
2025-12-22 14:33:48 +02:00
Michael Litvak
9f8aea21e3 docs: update RF-rack restrictions
Update the documentation about restrictions to tablets keyspaces related
to RF-rack.

* MV/SI require the keyspace to be RF-rack-valid
* topology operations are restricted if a keyspace has views to preserve
  RF-rack-validity
2025-12-22 09:21:07 +01:00
Michael Litvak
75b5285cdf cql3: don't apply RF-rack restrictions on vector indexes
When creating an index we validate that the keyspace is RF-rack-valid
and print a warning that the keyspace must remain RF-rack-valid.

This should apply only to indexes that are based on materialized views
for which there are consistency concerns when the keyspace is not
RF-rack-valid.

vector indexes are not based on materialized views, hence these
restrictions should not apply to them.
2025-12-22 09:21:07 +01:00
Michael Litvak
06343b58a2 cql3: add warning when creating mv/index with tablets about rf-rack
Creating a MV or index in a tablets-based keyspace now forces additional
restrictions on the keyspace. The keyspace must be RF-rack-valid and it
must remain RF-rack-valid while the view exists.

Add a CQL warning about these restrictions.
2025-12-22 09:21:06 +01:00
Michael Litvak
df801d16da service/tablet_allocator: always allow tablet merge of tables with views
allow tablet merge of tables with views even if the
rf_rack_valid_keyspaces option is not set, because now keyspaces that
have views are enforced to always be rf-rack-valid, regardless of the
option value.
2025-12-22 09:14:30 +01:00
Michael Litvak
07d85af433 locator: extend rf-rack validation for rack lists
Extend the RF-rack validation in `assert_rf_rack_valid_keyspace` to
validate rack-list-based replication as well. Previously, validation was
done only for numeric replication.

If the replication is based on a rack list, we validate that all racks
that are required for replication are present in the topology rack map.
If some rack is needed for replication but is missing, or it doesn't
have normal token owner nodes, the validation fails with an error.
2025-12-22 09:14:30 +01:00
Michael Litvak
d40d06c7ad test: test rf-rack validity when creating keyspace during node ops
add tests that attempt to create a keyspace during different stages of
node join or remove, and verify that the rf-rack condition can't be
broken - either creating the keyspace should fail or the node operation
should fail, depending on the stage.
2025-12-22 09:14:30 +01:00
Michael Litvak
a738905a4b locator: fix rf-rack validation during node join/remove
If a keyspace is created while a node is joining or being removed, it could
break the rf-rack invariant. For example:
1. We have 3 nodes in 3 racks, no keyspaces
2. A new node starts to join in a new rack - passes validation because
   there are no keyspaces
3. Create a keyspace with rf=3 - passes validation because the joining
   node is not a normal token owner yet
4. The new node becomes a normal token owner
5. The rf-rack invariant is broken. We have rf=3 and 4 racks

To fix this, we change the rf-rack check to consider a node as a token
owner if it's either a normal token owner or it has bootstrap tokens and
is about to become a normal token owner.

Now the condition can't be broken. Consider keyspace creation at
different stages of adding a node in our example:
* Before the node is assigned bootstrap tokens: the node is not
  considered. We can create a keyspace with rf=3 as if the node doesn't
  exist, and then node join will fail in the group0 operation that
  assigns bootstrap tokens, because during this operation we check
  rf-rack validity.
* Assigning bootstrap tokens is a single group0 operation that is
  serialized with keyspace creation. During this operation we check that
  adding the node as a token owner will maintain rf-rack validity for all
  keyspaces.
* After the node is assigned bootstrap tokens and until it becomes a
  normal token owner: it is considered as a transitioning token owner by
  the rf-rack check and the rack is considered a transitioning rack. We
  can't count the rack as a normal rack because the node join may still
  fail and rollback. Trying to create a keyspace with either rf=3 or
  rf=4 will fail because we can end up with either 3 or 4 racks.

Similarly, when removing a node, we validate that removing the node will
maintain rf-rack validity in the same group0 operation that changes the
node state to removing/decommissioning, after which the node becomes a
leaving endpoint, and it's not considered a normal token owner anymore
for the rf-rack check.
2025-12-22 09:14:30 +01:00
Michael Litvak
9940dcefa7 test: test topology restrictions for views with tablets
Add tests that verify the restrictions on topology operations when there
are keyspaces with tablets and materialized views.

For such keyspaces, RF=Racks must be enforced while they have
materialized views, therefore adding a node in a new rack or removing a
node that would eliminate a rack should be rejected.
2025-12-22 09:14:30 +01:00
Michael Litvak
870aad7f71 test: add test_topology_ops_with_rf_rack_valid
add new tests for testing that RF-rack validity is maintained when doing
topology operations that may break them, such as adding nodes in new
racks or removing nodes.
2025-12-22 09:14:30 +01:00
Michael Litvak
8f1a566be8 topology coordinator: restrict node join/remove to preserve RF-rack validity
when a new node joins or an existing node is removed / decommissioned,
check if the operation would violate the RF-rack-validity of some
keyspace. if so - reject the operation in order to preserve
RF-rack-validity.

Fixes scylladb/scylladb#23345
Fixes scylladb/scylladb#26820
2025-12-22 09:14:30 +01:00
Michael Litvak
aa8db3b8da topology coordinator: add validation to node remove
add validation to node remove / decommission, similar to node validation
when a node joins.

when starting node remove or decommission, the validation function
checks if the operation is valid and can proceed. if not, it's aborted
with an error message.

we change the return type of validate_joining_node so that it will be
similar and consistent with the new validate_removing_node.
2025-12-22 09:14:29 +01:00
Michael Litvak
9e1f78d162 locator: extend rf-rack validation functions
Extend the locator function assert_rf_rack_valid_keyspace to accept
arbitrary topology dc-rack maps and nodes instead of using the current
token metadata.

This allows us to add a new variant of the function that checks rf-rack
validity given a topology change that we want to apply. we will use it
to check that rf-rack validity will be maintained before applying the
topology change.

The possible topology changes for the check are node add and node remove
/ decommission. These operations can change the number of normal racks -
if a new node is added to a new rack, or the last node is removed from a
rack.
2025-12-22 09:14:29 +01:00
Michael Litvak
8df61f6d99 view: change validate_view_keyspace to allow MVs if RF=Racks
The function validate_view_keyspace checks if a keyspace is eligible for
having materialized views, and it is used for validation when creating a
MV or a MV-based index.

Previously, it was required that the rf_rack_valid_keyspaces option is
set in order for tablets-based keyspaces to be considered eligible, and
the RF-rack condition was enforced when the option is set.

Instead of this, we change the validation to allow MVs in a keyspace if
the RF-rack condition is satisfied for the keyspace - regardless of the
config option.

We remove the config validation for views on startup that validates the
option `rf_rack_valid_keyspaces` is set if there are any views with
tablets, since this is not required anymore.

We can do this without worrying about upgrades because this change will
be effective from 2025.4 where MVs with tablets are first out of
experimental phase.

We update the test for MV and index restrictions in tablets keyspaces
according to the new requirements.

* Create MV/index: previously the test checked that it's allowed only if
  the config option `rf_rack_valid_keyspaces` is set. This is changed
  now so it's always allowed to create MV/index if the keyspace is
  RF-rack-valid. Update the test to verify that we can create MV/index
  when the keyspace is RF-rack-valid, even if the rf_rack option is not
  set, and verify that it fails when the keyspace is RF-rack-invalid.
* Alter: Add a new test to verify that while a keyspace has views, it
  can't be altered to become RF-rack-invalid.
2025-12-22 09:14:29 +01:00
Michael Litvak
de1bb84fca db: enforce rf-rack-validity for keyspaces with views
Extend the RF-rack-validity enforcement to keyspaces that have views,
regardless of the option `rf_rack_valid_keyspaces`.

Previously, RF-rack-validity was enforced when `rf_rack_valid_keyspaces`
was set for all keyspaces. Now we want to allow creating MVs in tablet
keyspaces that are RF-rack-valid and enforce the RF-rack-validity even
if the config option is not set.
2025-12-22 09:13:49 +01:00
Michael Litvak
75b269229d replica/db: add enforce_rf_rack_validity_for_keyspace helper
Add the helper function enforce_rf_rack_validity_for_keyspace that
returns true if RF-rack-validity should be enforced for a keyspace, and
use it wherever we need to check this instead of checking the config
option directly.

This is useful because this condition is used in multiple places, and
having it defined in a single helper function will make it easier to
see and change the enforcement conditions.
2025-12-22 09:13:49 +01:00
Michael Litvak
8b0b0c4d80 db: remove enforce parameter from check_rf_rack_validity
simple refactoring: the enforce parameter is always given the value of
the `rf_rack_valid_keyspaces` option. remove the parameter and use the
option value directly from the db config.

this will be useful for a later change to the enforcement conditions.
2025-12-22 09:13:49 +01:00
Michael Litvak
61ae653693 test: adjust test to not break rf-rack validity
the test test_unfinished_writes_during_shutdown starts 3 nodes in 3
racks and creates a keyspace with RF=3, then adds a new node in a 4th
rack. this breaks rf-rack validity for the keyspace.

we change it instead to add the new node in an existing rack. it doesn't
matter for the test - the test only wants to add a new node to trigger
some topology change.
2025-12-22 09:13:48 +01:00
Karol Nowacki
addac8b3f7 vector_search: test: Fix flaky DNS resolution test
The `vector_store_client_test_dns_resolving_repeated` test had race
conditions causing it to be flaky. Two main issues were identified:

1. Race between initial refresh and manual trigger: The test assumes
    a specific resolution sequence, but timing variations between the
    initial DNS refresh (on client creation) and the first manual
    trigger (in the test loop) can cause unexpected delayed scheduling.

2. Extra triggers from resolve_hostname fiber: During the client
    refresh phase, the background DNS fiber clears the client list.
    If resolve_hostname executes in the window after clearing but
    before the update completes, pending triggers are processed,
    incrementing the resolution count unexpectedly. At count 6, the
    mock resolver returns a valid address (count % 3 == 0), causing
    the test to fail.

The fix relaxes test assertions to verify retry behavior and client
clearing on DNS address loss, rather than enforcing exact resolution
counts.

Fixes: #27074

Closes scylladb/scylladb#27685
2025-12-21 20:02:16 +02:00
Vlad Zolotarov
ea95cdaaec transport/server: declare a new "CLIENT_OPTIONS" option as supported
Declare support for a 'CLIENT_OPTIONS' startup key.
This key is meant to be used by drivers for sending client-specific
configurations like request timeouts values, retry policy configuration, etc.
The value of this key can be any string in general (according to the CQL binary protocol),
however, it's expected to be some structured format, e.g. JSON.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2025-12-20 12:26:22 -05:00
Vlad Zolotarov
28cbaef110 service/client_state and alternator/server: use cached values for driver_name and driver_version fields
Optimize memory usage changing types of driver_name and driver_version be
a reference to a cached value instead of an sstring.

These fields very often have the same value among different connections hence
it makes sense to cache these values and use references to them instead of duplicating
such strings in each connection state.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2025-12-20 12:26:22 -05:00
Vlad Zolotarov
85adf6bdb1 system.clients: add a client_options column
This new column is going to contain all OPTIONS sent in the
STARTUP frame of the corresponding CQL session.

The new column has a `frozen<map<text, text>>` type, and
we are also optimizing the amount of required memory for storing
corresponding keys and values by caching them on each shard level.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2025-12-20 12:26:15 -05:00
Vlad Zolotarov
3a54bab193 controller: update get_client_data to use foreign_ptr for client_data
get_client_data() is used to assemble `client_data` objects from each connection
on each CPU in the context of generation of the `system.clients` virtual table data.

After collected, `client_data` objects were std::moved and arranged into a
different structure to match the table's sorting requirements.

This didn't allow having not-cross-shard-movable objects as fields in the `client_data`,
e.g. lw_shared_ptr objects.

Since we are planning to add such fields to `client_data` in following patches this patch
is solving the limitation above by making get_client_data() return `foreign_ptr<std::unique_ptr<client_data>>`
objects instead of naked `client_data` ones.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2025-12-19 11:01:41 -05:00
Anna Stuchlik
f65db4e8eb doc: remove the links to the Download Center
This commit removes the remaining links to the Download Center on the website.
We no longer use it for installation, and we don't want users to infer that
something like that still exists.

Fixes https://github.com/scylladb/scylladb/issues/27753

Closes scylladb/scylladb#27756
2025-12-19 12:53:40 +01:00
Botond Dénes
df2ac0f257 Merge 'test: dtest: schema_management_test.py: migrate from dtest' from Dario Mirovic
This PR migrates schema management tests from dtest to this repository.

One reason is that there is an ongoing effort to migrate tests from dtest to here.

Test `TestLargePartitionAlterSchema.test_large_partition_with_drop_column` failed with timeout error once. The main suspect so far are infra related problems, like infra congestion. The [logs from the test execution](https://jenkins.scylladb.com/job/scylla-master/job/dtest-release/1062/testReport/junit/schema_management_test/TestLargePartitionAlterSchema/Run_Dtest_Parallel_Cloud_Machines___Dtest___full_split001___test_large_partition_with_drop_column/), linked in the issue [test_large_partition_with_drop_column failed on TimeoutError #26932](https://github.com/scylladb/scylladb/issues/26932) show the following:
- `populate` works as intended - it starts, then during populate/insert drop column happened, then an exception is raised and intentionally ignored in the test, so no `Finish populate DB` for 50 x 1490 records - expected
- drop column works as intended - interrupts `populate` and proceeds to flush
- flush **probably** works as intended - logs are consistent with what we expect and what I got in local test runs
- `read` is the only thing that visibly got stuck, all the way until timeout happened, 5 minutes after the start

Migrating the test to this repo will also give us test start and end times on CI machines, in the sql report database. It has start and end timestamp for each test executed. We will be able to see how long does it usually take when the test is successful. It can not be seen from the logs, because logs are not kept for successful tests.

Another thing this PR does is adding a log message at the end of `database::flush_all_tables`. This will let us know if a thread got stuck inside or finished successfully. This addresses the **probably** part of the flush analysis step described above. If the issue reoccurs, we will have more information.

The test `test_large_partition_with_add_column` has not been executing for ~5 years. It was never migrated to pytest. The name was left as `large_partition_with_add_column_test`, and was skipped. Now it is enabled and updated.

Both `test_large_partition_with_add_column` and `test_large_partition_with_drop_column` are improved.
Small performance improvements:
- Regex compilation extracted from the stress function to the module level, to avoid recompilation.
- Do not materialize list in `stress_object` for loop. Use a generator expression.

The tests in `TestLargePartitionAlterSchema` are `test_large_partition_with_add_column`
and `test_large_partition_with_drop_column`.

These tests need to replicate the following conditions that led to a bug before a fix from around 5 years ago.

The scenario in which the problem could have happened has to involve:
- a large partition with many rows, large enough for preemption (every 0.5ms) to happen during the scan of the partition.
- appending writes to the partition (not overwrites)
- scans of the partition
- schema alter of that table. The issue is exposed only by adding or dropping a column, such that the added/dropped
  column lands in the middle (in alphabetical order) of the old column set.

The way the test is set up is:
- fixed number of writes per populate call
- fixed number of reads

This has the following implications:
- if the machine executing the test is fast, all the writes are done before the 10 seconds sleep
- there are too many reads - most of them get executed after the test logic is done

This patch solves these issues in the following way:
- populate lazily generates write data, and stops when instructed by `stop_populating` event
- read, which is done sequentially, stops when instructed by `stop_reading` event
- number of max operations is increased significantly, but the operations are stopped 1 second
  after node flush; this makes sure there are enough operations during the test, but also that
  the test does not take unnecessary time

Test execution time has been reduced severalfold. On dev machine the time the tests take is
reduced from 110 seconds to 34 seconds.

scylla-dtest PR that removes migrated tests:
[schema_management_test.py: remove tests already ported to scylladb repo #6427](https://github.com/scylladb/scylla-dtest/pull/6427)

Fixes #26932

This is a migration of existing tests to this repository. No need for backport.

Closes scylladb/scylladb#27106

* github.com:scylladb/scylladb:
  test: dtest: schema_management_test.py: speed up `TestLargePartitionAlterSchema` tests
  test: dtest: schema_management_test.py: fix large partition add column test
  test: dtest: schema_management_test.py: add `TestSchemaManagement.prepare`
  test: dtest: schema_management_test.py: test enhancements
  test: dtest: schema_management_test.py: make the tests work
  test: dtest: migrate setup and tools from dtest
  test: dtest: copy unmodified schema_management_test.py
  replica: database: flush_all_tables log on completion
2025-12-19 12:30:00 +02:00
Botond Dénes
093e97a539 Merge 'test: increase num of requests in driver_service_level tests' from Andrzej Jackowski
`_verify_tasks_processed_metrics()` is used to check that the correct
service level is used to process requests. It takes two service levels
as arguments and executes numerous requests. After that, the number
of tasks processed by one of the service levels is expected to rise
by at least the number of executed requests. In contrast,
the second service level is expected to process fewer tasks than
the number of requests.

Unfortunately, background noise may cause some tasks to be executed
on the service level that is not supposed to process requests.
This patch increases the number of executed requests to eliminate
the chance of noise causing test failures.

Additionally, this commit extends logging to make future investigation
easier.

Fixes: https://github.com/scylladb/scylladb/issues/27715

No backport, fix for test on master.

Closes scylladb/scylladb#27735

* github.com:scylladb/scylladb:
  test: remove unused `get_processed_tasks_for_group`
  test: increase num of requests in driver_service_level tests
2025-12-19 10:54:14 +02:00
Emil Maskovsky
fa6e5d0754 test/random_failures: fix handling of banned notification
After 39cec4a node join may fail with either "init - Startup failed"
notification or occasionally because it was banned, depending on timing.

The change updates the test to handle both cases.

Fixes: scylladb/scylladb#27697

No backport: This failure is only present in master.

Closes scylladb/scylladb#27768
2025-12-19 09:55:31 +02:00
Emil Maskovsky
08518b2c12 test/raft: fix test_joining_old_node_fails flakiness
When a node without the required feature attempts to join a Raft-based
cluster with the feature enabled, there is a race between the join
rejection response ("Feature check failed") and the ban notification
("received notification of being banned"). Depending on timing, either
message may appear in the joining node's log.

This starts to happen after 39cec4a (which introduced informing the
nodes about being banned).

Updated the test to accept both error messages as valid, making the test
robust against this race condition, which is more likely in debug mode
or under slow execution.

Fixes: scylladb/scylladb#27603

No backport: This failure is only present in master.

Closes scylladb/scylladb#27760
2025-12-19 09:44:09 +02:00
Emil Maskovsky
2a75b1374e test/raft: fix race condition in failure_detector_test
The test had a sporadic failure due to a broken promise exception.
The issue was in `test_pinger::ping()` which captured the promise by
move into the subscription lambda, causing the promise to be destroyed
when the lambda was destroyed during coroutine unwinding.

Simplify `test_pinger::ping()` by replacing manual abort_source/promise
logic with `seastar::sleep_abortable()`.
This removes the risk of promise lifetime/race issues and makes the code
simpler and more robust.

Fixes: scylladb/scylladb#27136

Backport to active branches: This fixes a CI test issue, so it is
beneficial to backport the fix. As this is a test-only fix, it is a low
risk change.

Closes scylladb/scylladb#27737
2025-12-19 09:42:19 +02:00
Łukasz Paszkowski
2cb9bb8f3a test_user_writes_rejection: Disable speculative retries
This test starts a 3-node cluster and creates a large blob file so that one
node reaches critical disk utilization, triggering write rejections on that
node. The test then writes data with CL=QUORUM and validates that the data:
- did not reach the critically utilized node
- did reach the remaining two nodes

By default, tables use speculative retries to determine when coordinators may
query additional replicas.

Since the validation uses CL=ONE, it is possible that an additional request
is sent to satisfy the consistency level. As a result:
- the first check may fail if the additional request is sent to a node that
  already contains data, making it appear as if data reached the critically
  utilized node
- the second check may fail if the additional request is sent to the critically
  utilized node, making it appear as if data did not reach the healthy node

The patch fixes the flakiness by disabling the speculative retries.

Fixes https://github.com/scylladb/scylladb/issues/27212

Closes scylladb/scylladb#27488
2025-12-19 09:39:09 +02:00
Botond Dénes
c899c117c7 scylla-gdb.py: scylla read-stats: include all permit lists
The current code which collects permit stats is out-of-date (by a few
years), as it only iterates through _permit_list. There are 4 additional
lists that permits can be part of now (all intrusive). Include all of
these in the stat collection.
As a bonus, also print the semaphore pointer in the printout, so the
user can hand-examine it, should they wish to.
2025-12-18 18:26:07 +02:00
Botond Dénes
db9f9e1b7e scylla-gdb.py: scylla fiber: add --direction command-line param
Can be "forward", "backward" or "both" (default).
Allows traversing the fiber in just one direction. Useful when scylla
fiber fails to traverse through a task and the user has to locate the
next one in the chain manually. When resuming from this next item, the
user might want to skip the already seen part of the fiber, to save time
on the invokation.
2025-12-18 18:26:07 +02:00
Botond Dénes
4c6e508811 scylla-gdb.py: scylla fiber: add support for traversing through coroutines backward
Traversing through coroutines forward (finding task waiting on this
coroutine) is already supported. This patch adds support for traversing
through coroutines backwards (finding task waited on by coroutine).
Coroutines need special handling: the future<> object is most likely
allocated on the coroutine frame, so we have to search throgh that to
find it. When doing so the first two pointers on the frame have to be
skipped: these are pointers to .resume and .destroy respectively and
will halt the search algorithm if seen.
2025-12-18 18:26:06 +02:00
Dario Mirovic
f1d63d014c test: dtest: schema_management_test.py: speed up TestLargePartitionAlterSchema tests
The tests in `TestLargePartitionAlterSchema` are `test_large_partition_with_add_column`
and `test_large_partition_with_drop_column`.

These tests need to replicate the following conditions that led to a bug before a fix from around 5 years ago.

The scenario in which the problem could have happened has to involve:
- a large partition with many rows, large enough for preemption (every 0.5ms) to happen during the scan of the partition.
- appending writes to the partition (not overwrites)
- scans of the partition
- schema alter of that table. The issue is exposed only by adding or dropping a column, such that the added/dropped
  column lands in the middle (in alphabetical order) of the old column set.

The way the test is set up is:
- fixed number of writes per populate call
- fixed number of reads

This has the following implications:
- if the machine executing the test is fast, all the writes are done before the 10 seconds sleep
- there are too many reads - most of them get executed after the test logic is done

This patch solves these issues in the following way:
- populate lazily generates write data, and stops when instructed by `stop_populating` event
- read, which is done sequentially, stops when instructed by `stop_reading` event
- number of max operations is increased significantly, but the operations are stopped 1 second
  after node flush; this makes sure there are enough operations during the test, but also that
  the test does not take unnecessary time

Test execution time has been reduced severalfold. On dev machine the time the tests take is
reduced from 110 seconds to 34 seconds.

The patch also introduces a few small improvements:
- `cs_run` renamed to `run_stress` for clarity
- Stopped checking if cluster is `ScyllaCluster`, since it is the only one we use
- `case_map` removed from `test_alter_table_in_parallel_to_read_and_write`, used `mixed` param directly
- Added explanation comment on why we do `data[i].append(None)`
- Replaced `alter_table` inner function with its body, for simplicity
- Removed unnecessary `ck_rows` variable in `populate`
- Removed unnecessary `isinstance(self.cluster. ScyllaCluster)`
- Adjusted `ThreadPoolExecutor` size in several places where 5 workers are not needed
- Replaced functional programming style expressions for `new_versions` and `columns_list` with
  comprehension/generator statement python style code, improving readability

Refs #26932

fix
2025-12-18 17:07:27 +01:00
Michael Litvak
33f7bc28da docs: document restrictions of colocated tables
Currently some things are not supported for colocated tables: it's not
possible to repair a colocated table, and due to this it's also not
possible to use the tombstone_gc=repair mode on a colocated table.

Extend the documentation to explain what colocated tables are and
document these restrictions.

Fixes scylladb/scylladb#27261

Closes scylladb/scylladb#27516
2025-12-18 15:38:29 +01:00
Cezar Moise
0ef8ca4c57 test: add logging to crash_coordinator_before_stream injection
In order to have the test ignore crashes caused by the injection,
it needs to log its occurence.
2025-12-18 16:28:13 +02:00
Cezar Moise
95d0782f89 test: add crash detection during tests
After tests end, an extra check if performed, looking into node logs.
By default, it only searches for critical errors and scans for coredumps.
If the test has the fixture `check_nodes_for_errors`, it will search for all errors.
Both checks can be ignored by setting `ignore_cores_log_patterns` and `ignore_log_patterns`.
If any of the above are found, the test will fail with an error.
2025-12-18 16:28:13 +02:00
Dario Mirovic
f831ca5ab5 test: dtest: schema_management_test.py: fix large partition add column test
`large_partition_with_add_column_test` and `large_partition_with_drop_column_test`
were added on August 17th, 2020 in scylladb/scylla-dtest#1569.

Only `large_partition_with_drop_column_test` was migrated to pytest, and renamed
to `test_large_partition_with_drop_column` on March 31st, 2021 in scylladb/scylla-dtest#2051.
Since then this test has not been running.

This patch fixes it - the test is updated and renamed and the testing environment
now properly picks it up.

Refs #26932
2025-12-18 12:54:43 +01:00
Dario Mirovic
1fe0509a9b test: dtest: schema_management_test.py: add TestSchemaManagement.prepare
Extract repeated cluster initialization code in `TestSchemaManagement`
into a separate `prepare` method. It holds all the common code for
cluster preparation, with just the necessary parameters.

Refs #26932
2025-12-18 12:54:43 +01:00
Dario Mirovic
e7d76fd8f3 test: dtest: schema_management_test.py: test enhancements
Extract regex compilation from the stress functions to the module level,
to avoid unnecessary regex compilation repetition.

Add descriptions to the stress functions.

Do not materialize list in `stress_object` for loop. Use a generator expression.

Make `_set_stress_val` an object method.

Refs #26932
2025-12-18 12:54:43 +01:00
Dario Mirovic
700853740d test: dtest: schema_management_test.py: make the tests work
Remove unused function markers.
Add wait_other_notice=True to cluster start method in
TestSchemaHistory.prepare function to make the test stable.

Enable the test in suite.yaml for dev and debug modes.

Fixes #26932
2025-12-18 12:54:43 +01:00
Dario Mirovic
3c5dd5e5ae test: dtest: migrate setup and tools from dtest
Migrate several functionalities from dtest. These will be used by
the schema_management_test.py tests when they are enabled.

Refs #26932
2025-12-18 12:54:43 +01:00
Dario Mirovic
5971b2ad97 test: dtest: copy unmodified schema_management_test.py
Copy schema_management_test.py from scylla-dtest to
test/cluster/dtest/schema_management_test.py.

Add license header.

Disable it for debug, dev, and release mode.

Refs #26932
2025-12-18 12:54:42 +01:00
Dario Mirovic
f89315d02f replica: database: flush_all_tables log on completion
In database::flush_all_tables add log on completion.
This slightly improves the readability of logs when debugging an issue.

Refs #26932
2025-12-18 12:54:42 +01:00
Patryk Jędrzejczak
d5c205194b Merge 'topology: Make removenode use left_token_ring state for global barrier' from Emil Maskovsky
Make the removenode operation go through the `left_token_ring` state, similar to decommission. This ensures that when removenode completes, all nodes in the cluster are aware of the topology change through a global token metadata barrier.

Previously, removenode would skip the `left_token_ring` state and go directly from `write_both_read_new` to `left` state. This meant that when the operation completed, some nodes might not yet know about the topology change, potentially causing issues with subsequent data plane requests.

Key changes:
- Both decommission and removenode now transition to `left_token_ring` state in the `write_both_read_new` handler
- In `left_token_ring` state, only decommissioning nodes receive the shutdown RPC (removed nodes are already dead)
- Updated documentation to reflect that both operations use this state

This change improves consistency guarantees for removenode operations by ensuring cluster-wide awareness before completion.

The change is protected by "REMOVENODE_WITH_LEFT_TOKEN_RING" feature flag to also support mixed clusters during e.g. upgrade.

Fixes:  scylladb/scylladb#25530

No backport: This fixes and issue found in tests. It can theoretically happen in production too, but wasn't reported in any customer issue, so a backport is not needed.

Closes scylladb/scylladb#26931

* https://github.com/scylladb/scylladb:
  topology: make removenode use left_token_ring state for global barrier
  topology: allow removing nodes not having tokens
  features: add feature flag for removenode via left token ring
2025-12-18 09:34:38 +01:00
Andrzej Jackowski
6ad10b141a test: remove unused get_processed_tasks_for_group
The function `get_processed_tasks_for_group` was defined twice in
`test_raft_service_levels.py`. This change removes the unused
definition to avoid confusion and clean up the code.
2025-12-17 20:45:53 +01:00
Andrzej Jackowski
8cf8e6c87d test: increase num of requests in driver_service_level tests
`_verify_tasks_processed_metrics()` is used to check that the correct
service level is used to process requests. It takes two service levels
as arguments and executes numerous requests. After that, the number
of tasks processed by one of the service levels is expected to rise
by at least the number of executed requests. In contrast,
the second service level is expected to process fewer tasks than
the number of requests.

Unfortunately, background noise may cause some tasks to be executed
on the service level that is not supposed to process requests.
This patch increases the number of executed requests to eliminate
the chance of noise causing test failures.

Additionally, this commit extends logging to make future investigation
easier.

Fixes: scylladb/scylladb#27715
2025-12-17 20:45:48 +01:00
Michael Litvak
3a06c32749 schema_registry: fix learning a schema with cdc schema
When learning a schema that has a linked cdc schema, we need to learn
also the cdc schema, and at the end the schema should point to the
learned cdc schema.

This is needed because the linked cdc schema is used for generating cdc
mutations, and when we process the mutations later it is assumed in some
places that the mutation's schema has a schema registry entry.

We fix a scenario where we could end up with a schema that points to a
cdc schema that doesn't have a schema registry entry. This could happen
for example if the schema is loaded before it is learned, so when we
learn it we see that it already has an entry. In that case, we need to
set the cdc schema to the learned cdc schema as well, because it could
have been loaded previously with a cdc schema that was not learned.

Fixes scylladb/scylladb#27610

Closes scylladb/scylladb#27704
2025-12-17 20:01:00 +02:00
Michał Jadwiszczak
74ab5addd3 test/cluster/test_view_building_coordinator: fix flakiness in test_file_streaming
The test generates a staging sstable on a node and verifies whether
the view is correctly populated.

However view updates generated by a staging sstable
(`view_update_generator::generate_and_propagate_view_updates()`) aren't
awaited by sstable consumer.
It's possible that the view building coordinator may see the task as finished
(so the staging sstable was processed) but not all view updates were
writted yet.

This patch fixes the flakiness by waiting until
`scylla_database_view_update_backlog` drops down to 0 on all shards.

Fixes scylladb/scylladb#26683

Closes scylladb/scylladb#27389
2025-12-17 17:29:15 +01:00
Michael Litvak
55f4a2b754 migration_listener: fix deadlock in nested notifications
When calling a migration notification from the context of a notification
callback, this could lead to a deadlock with unregistering a listener:
A: the parent notification is called. it calls thread_for_each, where it
   acquires a read lock on the vector of listeners, and calls the
   callback function for each listener while holding the lock.
B: a listener is unregistered. it calls `remove` and tries to acquire a
   write lock on the vector of listeners. it waits because the lock is
   held.
A: the callback function calls another notification and calls
   thread_for_each which tries to acquire the read lock again. but it
   waits since there is a waiter.

Currently we have such concrete scenario when creating a table, where
the callback of `before_create_column_family` in the tablet allocator
calls `before_allocate_tablet_map`, and this could deadlock with node
shutdown where we unregister listeners.

Fix this by not acquiring the read lock again in the nested
notification. There is no need because the read lock is already held by
the parent notification while the child notification is running. We add
a function `thread_for_each_nested` that is similar to `thread_for_each`
except it assumes the read lock is already held and doesn't acquire it,
and it should be used for nested notifications instead of
`thread_for_each`.

Fixes scylladb/scylladb#27364

Closes scylladb/scylladb#27637
2025-12-17 14:00:28 +01:00
Emil Maskovsky
1642c686c2 topology: make removenode use left_token_ring state for global barrier
Make the removenode operation go through the `left_token_ring` state,
similar to decommission. This ensures that when removenode completes,
all nodes in the cluster are aware of the topology change through a
global token metadata barrier.

Previously, removenode would skip the `left_token_ring` state and go
directly from `write_both_read_new` to `left` state. This meant that
when the operation completed, some nodes might not yet know about the
topology change, potentially causing issues with subsequent data plane
requests.

Key changes:
- Both decommission and removenode now transition to `left_token_ring`
  state in the `write_both_read_new` handler
- In `left_token_ring` state, only decommissioning nodes receive the
  shutdown RPC (removed nodes may already be dead)
- Updated documentation to reflect that both operations use this state

This change improves consistency guarantees for removenode operations
by ensuring cluster-wide awareness before completion.

Fixes:  scylladb/scylladb#25530
2025-12-17 13:31:11 +01:00
Emil Maskovsky
9431826c52 topology: allow removing nodes not having tokens
For the changes to go through the left_token_ring state when
REMOVENODE_WITH_LEFT_TOKEN_RING feature is enabled, we need to allow
removing nodes to not have any tokens (similarly to decommissioning
nodes, which use the same sequence of states).

This means the tests also need to change to allow for this new behavior
- it can temporarily happen that a removing node has no tokens but is
still part of Raft group 0 (so there may be a temporary mismatch between
the token ring and group 0 membership).

Therefore, the `check_token_ring_and_group0_consistency` function is
replaced by `wait_for_token_ring_and_group0_consistency`, which waits
up to 30 seconds for consistency to be reached.
2025-12-17 13:31:11 +01:00
Emil Maskovsky
ba6fabfc88 features: add feature flag for removenode via left token ring
To improve the behavior of the removenode operation, we want to issue
a global topology barrier after the removenode has been applied.
However, this requires changing the topology state machine to add a new
state (left_token_ring) to the removenode flow, which is not supported
by older nodes.

To allow rolling upgrades, we add a feature flag
REMOVENODE_WITH_LEFT_TOKEN_RING that controls whether the new removenode
flow is used.
2025-12-17 13:31:11 +01:00
Avi Kivity
cfd91545f9 test: add proxy protocol tests
Test that the new configuration options work and that we can
connect to them. Use direct connections with an inline implementation
of the proxy protocol and the CQL native protocol, since we want
to maintain direct control over the source port number (for shard-aware
ports). Also test we land on the expected shard.
2025-12-17 14:18:04 +02:00
Avi Kivity
1382b47d45 config, transport: support proxy protocol v2 enhanced connections
We have four native transport ports: two for plain/TLS, and two
more for shard-aware (plain/TLS as well). Add four more that expect
the proxy protocol v2 header. This allows nodes behind a reverse
proxy to record the correct source address and port in system.clients,
and the shard-aware port to see the correct source port selection
made my the client.
2025-12-17 14:18:04 +02:00
Pavel Emelyanov
a6618f225c object_storage_endpoint_param: Make it formattable for real
Currently the formatter converts it to json and then tries to emit into
the output context with the "...{{}}" format string. The intent was to
have the "...{<json text>}" output. However, the double curly brace in
format string means "print a curly brace", so the output of the above
formatting is "...{}", literally.

Fix by keeping a single curly brace. The "<json text>" thing will have
its own surrounding curly braces.

Fixes #27718

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27687
2025-12-17 11:48:39 +01:00
Dani Tweig
0bfd07a268 github: synching scylladb.git new milestones with Jira releases
This action is triggered when a new milestone is created in scylladb.git
It will call the main logic which will create the same milestone as Jira
releases in the SCYLLADB and CUSTOMER Jira projects.

Fixes: PM-100

Closes scylladb/scylladb#27717
2025-12-17 10:05:06 +01:00
Tomasz Grabiec
c077283352 Merge 'service: support conversion of tablet keyspaces to rack-list using ALTER KEYSPACE' from Aleksandra Martyniuk
If a keyspace has a numeric replication factor in a DC and rf < #racks,
then the replicas of tablets in this keyspace can be distributed among
all racks in the DC (different for each tablet). With rack list, we need all
tablet replicas to be placed on the same racks. Hence, the conversion
requires tablet co-location.

After this series, the conversion can be done using ALTER KEYSPACE
statement. The statement that does this conversion in any DC is not
allowed to change a rf in any DC. So, if we have dc1 and dc2 with 3 racks
each and a keyspace ks then with a single ALTER KEYSPACE we can do:
- {dc1 : 2} -> {dc1 : [r1, r2]};
- {dc1 : 2, dc2: 2} ->  {dc1 : [r1, r2], dc2: [r2,r3]};
- {dc1 : 2, dc2: 2} -> {dc1 : [r1, r2], dc2: 2}
- {dc1 : 2} -> {dc1 : 2, dc2 : [r1]}
But we cannot do:
- {dc1 : 2} -> {dc1 : [r1, r2, r3]};
- {dc1 : 1, dc2 : [r1, r2] → dc1: [r1], dc2: [r1].

In order to do the co-locations rf change request is paused. Tablet
load balancer examines the paused rf change requests and schedules
necessary tablet migrations. During the process of co-location, no other
cross-rack migration is allowed.

Load balancer checks whether any paused rf change request is
ready to be resumed. If so, it puts the request back to global topology
request queue.

While an rf change request for a keyspace is running, any other rf change
of this keyspace will fail.

Fixes: #26398.

New feature, no backport

Closes scylladb/scylladb#27279

* github.com:scylladb/scylladb:
  test: add est_rack_list_conversion_with_two_replicas_in_rack
  test: test creating tablet_rack_list_colocation_plan
  test: add test_numeric_rf_to_rack_list_conversion test
  tasks: service: add global_topology_request_virtual_task
  cql3: statements: allow altering from numeric rf to rack list
  service: topology_coordinator: pause keyspace_rf_change request
  service: implement make_rack_list_colocation_plan
  service: add tablet_rack_list_colocation_plan
  cql3: reject concurrent alter of the same keyspace
  test: check paused rf change requests persistence
  db: service: add paused_rf_change_requests to system.topology
  service: pass topology and system_keyspace to load_balancer ctor
  service: tablet_allocator: extract load updates
  service: tablet_allocator: extract ensure_node
  tasks, system_keyspace: Introduce get_topology_request_entry_opt()
  node_ops: Drop get_pending_ids()
  node_ops: Drop redundant get_status_helper()
2025-12-17 10:05:06 +01:00
Anna Stuchlik
7061384a27 doc: replace "Scylla" with "ScyllaDB" in the Alternator docs
Fixes https://github.com/scylladb/scylladb/issues/27708

Closes scylladb/scylladb#27709
2025-12-17 10:05:06 +01:00
Tomasz Grabiec
7bc59e93b2 Fix lambda-coroutine fiasco in hint_endpoint_manager.cc
Found by copilot.

No issue was observed yet.

Fixes #27520

Closes scylladb/scylladb#27477
2025-12-16 20:16:41 +03:00
Łukasz Paszkowski
a61c221902 test/pylib/suite/python.py: Handle extra_cmdline_options correctly
runner.py defines a command-line option `--extra-scylla-cmdline-options`
with the default type=str. However, the function `merge_cmdline_options`,
which consumes this value to merge command-line options from multiple
sources, expects a list of strings.

This mismatch results in the following exception:
```
raise ValueError(f'invalid argument name {name}, all args {args}')
ValueError: invalid argument name o, all args --logger-log-level repair=debug --default-log-level=error
```
when a test is run with pytest using:
`--extra-scylla-cmdline-options='--logger-log-level repair=debug --default-log-level=error'`

Fix this by handling the option consistently and calling `.split()`.
Also change the default value from an empty list to an empty string
to avoid confusion both in runner.py and test.py.

Closes scylladb/scylladb#27523
2025-12-16 20:14:43 +03:00
Benny Halevy
798714183e table: get_snapshot_details: add FIXME comments
Ref https://github.com/scylladb/seastar/pull/3163

We can optimize the stat calls we use here by
using open_directory to open the snapshot,
base, and staging directory once, and using statat
calls for the relative name instead of the full
blown file_stat that needs to traverse the whole
path prefix for every call (the dirents are likely
to be cached, but still why waste cpu cycles on that
over and over again).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-16 18:45:56 +02:00
Benny Halevy
f5ca3657e2 table: get_snapshot_details: lookup entries also in the staging directory
Since the sstables in the snapshot may still be in the staging dir.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-16 18:42:05 +02:00
Benny Halevy
dc00461adf table: get_snapshot_details: optimize using the entry number_of_links
If the number_of_linkes equals 1, we can be sure that
the file exists only in the snapshot directory so there is no need
to look it up in the data directory.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-16 18:42:05 +02:00
Benny Halevy
be6d87648c table: get_snapshot_details: continue loop for manifest and schema entries
Now that we're using a simple loop in the coroutine just continue
the loop for files we want to ignore.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-16 18:42:05 +02:00
Benny Halevy
004c08f525 table: get_snapshot_details: use directory_lister
It is more efficient to use the coroutine generator
to list the directory.

Brewing changes in seastar would make the generator buffered
as well as adding an extended generation that would
return the file stat data for each entry, that would become
useful in the next patch that optimizes the algorithm by
considering the entry's link count.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-16 18:42:05 +02:00
David Garcia
386ec0af4e docs: fix multiversion build
Installs sphinx-scylladb-markdown = "^0.1.4" to fix the multiversion build.

Details in https://github.com/scylladb/sphinx-scylladb-theme/pull/1552

fix: update poetry

Closes scylladb/scylladb#27619
2025-12-16 19:41:03 +03:00
Pavel Emelyanov
c4496dd63c Merge 'test/cqlpy: rename tests with duplicate name' from Nadav Har'El
When translating Cassandra's unit tests, in a couple of places I accidentally used the same name for two tests, resulting in the first of each pair to never running.

Let's fix the name of the second of the each pair to be the real name it had in the original Cassandra test.

Closes scylladb/scylladb#27644

* github.com:scylladb/scylladb:
  test/cqlpy: rename test with duplicate name
  test/cqlpy: rename test with duplicate name
2025-12-16 19:32:20 +03:00
Nadav Har'El
84df5cfaf8 test/alternator: delete unnecessary "pass"
Fixing something that never bothered anyone but our automated "code quality" tool: there's an unnecessary call to "pass" in one of our tests. Just remove it.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27645
2025-12-16 19:29:23 +03:00
Botond Dénes
f06db096bd test/boost/reader_concurrency_semaphore_test: un-flake memory limit engages test
This test was observed to fail multiple times recently in promotion,
because there were successful reads. The failure only reproduces on
arm64, it doesn't reproduce on x86.
The suspected reason is that the data set is too close to the edge,
where all reads fail due to too high memory consumption. Reduce the
number of sstables used by this test to 54 (from 64).

Fixes: #27248

Closes scylladb/scylladb#27650
2025-12-16 19:24:49 +03:00
Pavel Emelyanov
31f90c089c Merge 'test/alternator: remove unused variable assignments and statements' from Nadav Har'El
Copilot found in test/alternator a bunch of places where we unnecessarily assign a variable that we don't use, or had a duplicated statement which doesn't do anything. This patch fixes all of them. AI still doesn't know how to prepare a patch that looks anything close to reasonable, so I did this part manually, and also carefully investigated each and every change (this took **a lot** of human time).

These patches don't change anything in the functionality of any of the tests. It's all cosmetic.

Closes scylladb/scylladb#27655

* github.com:scylladb/scylladb:
  test/alternator: remove unnecessary duplicate statement
  test/alternator: remove unused variable assignments
2025-12-16 19:23:34 +03:00
Nadav Har'El
c58739de6a Fix for Unused local variable
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27665
2025-12-16 19:20:53 +03:00
Benny Halevy
9e18cfbe17 sstable: add _mutate_sem to serialize link/move with components rewrite
We currently have races, like between moving an sstable from staging
using change_state, or when taking a snapshot, to e.g.
rewrite_statistics that replaces one of the sstable component files
when called, for example, from update_repaired_at by incremental repair.

Use a semaphore as a mutex to serialize those functions.
Note that there is no need for rwlock since the operations
are rare and read-only operations like snapshot don't
need to run in parallel.

Fixes #25919

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-16 17:06:45 +02:00
Piotr Dulikowski
7900aa5319 Merge 'server: fix scheduling group update timing in system.clients' from Alex Dathskovsky
Previously, the scheduling_group column was updated during the switch_tenant function, which meant the update occurred only after the tenant change operation completed—updating rows one by one. With this change, the scheduling_group column is now updated before the switch_tenant logic runs, ensuring that the table reflects the correct scheduling groups for all rows as early as possible.

fixes: #26060
fixes: #27295

backport: not required
this is a minor bug fix. Internal logic worked but the user couldnt see the change in the table if they would read the system.clients table

Closes scylladb/scylladb#26404

* github.com:scylladb/scylladb:
  test: cqlpy: Remove test_switch_tenants and add test in cluster testing. The test needs to run twice, in two separate Scylla runs, using two different modes: gossip and raft. The cluster framework supports this setup, while cqlpy only runs against Scylla instances in raft mode. Therefore, the test was moved from cqlpy to the cluster-based framework. This commit both adds the test in cluster/ and removes the old version in cqlpy/.
  server: Refactor update_control_connection_scheduling_group functionality This refactoring moves the logic that retrieves the scheduling group for driver_service_level_name out of switch_tenant. This change is possible because the scheduling group for the driver is retrieved from a map (LOOKUP). The lookup function is fully synchronized, non-coroutine, and returns immediately. For that reason, it’s better to perform this lookup outside of the switch_tenant function.
  server: Refactor scheduling group update functionality. This change generalizes the scheduling-group update functionality and removes some copy-paste code, improving overall readability and maintainability. To achieve this, capturing lambdas were introduced. As a result, self-deducing this was added to those lambdas to avoid coroutine-related issues (“coroutine fiasco”).
  server: Fix switch_tenant problem, When running on a V2 server, service-level data comes from service level cache. Because of this, we can use synchronized function to get the schedualing group. Since we are transitioning to a Raft-based architecture where all servers will be V2, we can safely implement this fix specifically for that case. This change adds get_cached_user_scheduling_group functionality and moves its usage out of switch_tenant function in update_scheduling_group_v2 usage.
  server: Add update_service_level_scheduling_group_v1 functions to create placehholder for functionality that will introduce v2 implementation. The new functionality will allow usage of service level cache
2025-12-16 15:39:49 +01:00
Aleksandra Martyniuk
9d20f0a3d2 test: add est_rack_list_conversion_with_two_replicas_in_rack 2025-12-16 13:31:24 +01:00
Aleksandra Martyniuk
0476e8d272 test: test creating tablet_rack_list_colocation_plan 2025-12-16 13:31:24 +01:00
Aleksandra Martyniuk
e48789cf6c test: add test_numeric_rf_to_rack_list_conversion test 2025-12-16 13:31:24 +01:00
Aleksandra Martyniuk
9039dfa4a5 tasks: service: add global_topology_request_virtual_task
Add a service::topo::global_topology_request_virtual_task, which
covers the replication factor changes.

Currently, the global_topology_request_virtual_task can be aborted
only if it is paused.

The progress of the rf change isn't counted.
2025-12-16 13:31:22 +01:00
Aleksandra Martyniuk
1884e655d6 cql3: statements: allow altering from numeric rf to rack list
Allow altering from numeric replication factor to rack list. Ensure
that a single ALTER KEYSPACE statement doesn't try to both convert
to rack list and change rf.
2025-12-16 13:29:08 +01:00
Aleksandra Martyniuk
640c491388 service: topology_coordinator: pause keyspace_rf_change request
To do the conversion from numeric rf to rack list, we need to co-locate
tablets of the keyspace, so that all of them have replicas on the same racks.
Pause the keyspace_rf_change global topology request, so that the co-location
could be done before the ALTER KEYSPACE changes are applied.

The pause is needed if in any dc rf changes from numeric to rack list
and the co-location is necessary. In this case we don't finish the request.
Instead, we add the request to the paused request vector. No migrations are
started.
2025-12-16 13:29:08 +01:00
Aleksandra Martyniuk
cd83d1d4dc service: implement make_rack_list_colocation_plan
The make_rack_list_colocation_plan consists of two phases.

In the first phase (realized with find_required_rack_list_colocations),
we find the pairs of (replica to be co-located, destination dc and rack).
We skip the pairs related to the tablets that are in transition or for
which the load balancer migration is already planned. We group the pairs
by destination dc and rack. Thanks to that in the second phase we can
calculate the least loaded nodes and shards only once for each rack.

In the second phase, we calculate the load of the nodes in a cluster
based on current transition and previously scheduled migrations.
We utilize the map created in the first phase and choose the least
loaded targets in each rack. We skip the tablets for which the
co-location was already scheduled.

find_required_rack_list_colocations isn't a method of load_balancer,
because in the following changes it is going to be reused by topology
coordinator to determine whether the rf change should be paused.
2025-12-16 13:29:05 +01:00
Aleksandra Martyniuk
bbe0b01b14 service: add tablet_rack_list_colocation_plan
Add tablet_rack_list_colocation_plan. Keep it in migration_plan.
The plan includes a request that is ready to resume. There can be
more than one such request at the time, but we consider them one
by one for clarity of code. Rack list co-locations will be kept together
with normal load balancer migrations.

Consider normal load balancer migrations before rack list co-locations.
During rack list co-location, allow load balancer migrations to happen
only within a single rack. Do not create the merge co-location plan if
there is ongoing rack list co-location (if there are any rf changes paused).

Generate rf change resume based on the plan. Add _request_to_resume back
to global requests queue.

make_rack_list_colocation_plan will be implemented in the following change.
2025-12-16 13:27:50 +01:00
Aleksandra Martyniuk
2e7ba1f8ce cql3: reject concurrent alter of the same keyspace
Reject ALTER KEYSPACE request if there is unfinished (queued, pending,
or paused) alter request of the same keyspace.

This is required as in the following changes, global request queue
will contain rf change requests meant to be resumed.
2025-12-16 13:27:48 +01:00
Aleksandra Martyniuk
b3a0e4c2dc test: check paused rf change requests persistence 2025-12-16 13:25:38 +01:00
Aleksandra Martyniuk
08e5f35527 db: service: add paused_rf_change_requests to system.topology
In the following changes, we allow to alter from numeric rf to rack list.
Before the alter, two tablets of the same keyspace can have replicas
on different racks. To switch to rack list, we need to co-locate
the replicas. It will be achieved by pausing the keyspace_rf_change
and scheduling migrations.

We need to persist the ids of requests that are paused. A new column -
paused_rf_change_requests is added to system.topology table.

In this commit no data is kept in the new column.
2025-12-16 13:25:38 +01:00
Aleksandra Martyniuk
d66a36058b service: pass topology and system_keyspace to load_balancer ctor
Pass a pointer to service::topology and db::system_keyspace to load
balancer. It will be used in the following patches to create
rack_list_colocation plan.
2025-12-16 13:25:38 +01:00
Aleksandra Martyniuk
6681c0f33f service: tablet_allocator: extract load updates
Extract consider_scheduled_load and consider_planned_load so that
they can be reused.
2025-12-16 13:25:38 +01:00
Aleksandra Martyniuk
13e9ee3f6f service: tablet_allocator: extract ensure_node
Extract ensure_node method so that it can be reused.
2025-12-16 13:25:38 +01:00
Tomasz Grabiec
71e6ef90f4 tasks, system_keyspace: Introduce get_topology_request_entry_opt()
It's a cleanup. Better to return std::nullopt than faking an entry
with an id when require_entry == false.
2025-12-16 13:25:34 +01:00
Tomasz Grabiec
902803babd node_ops: Drop get_pending_ids()
Every pending request should also have an entry in
system.topology_requests so it's redundant.

And problematic, because we cannot build a full request entry from
just an id alone, so if we would return those requests, they would
have blank information, and logic which needs more information would
not work.
2025-12-16 13:25:11 +01:00
Dawid Mędrek
df0830044d cql3: Extend DESC INDEX by view properties
We're extending the logic of DESCRIBE INDEX to include properties of the
underlying materialized view. Tests are provided to ensure the
implementation works as intended.
2025-12-16 11:43:38 +01:00
Dawid Mędrek
d0a42852c0 cql3: Forbid using CLUSTERING ORDER BY when creating index
This is a temporary solution as handling this property may require
a bit more attention or at least a bit more focus. For now, let's
forbid using it so it's clear it won't get applied. A simple test
is provided to cover it.

We document the restriction.
2025-12-16 11:43:38 +01:00
Dawid Mędrek
6541362a0b cql3: Extend CREATE INDEX by MV properties
After the previous patch that extended the grammar and provided
basic functionalities to accommodate properties of materialized views
in indexes, this commit takes another step and actually applies them
to the underlying view when it's being created.

We're providing validation tests for each property, with the single
exception of CLUSTERING ORDER BY. That one will be handled separately
in an upcoming commit.

We also update the user documentation.
2025-12-16 11:43:38 +01:00
Dawid Mędrek
e1f1071bc5 cql3/statements/create_index_statement: Allow for view options
We're allowing CREATE INDEX to accept the same set of properties as
materialized views do. Our goal is to give the user an ability to
configure the underlying materialized view of an index directly,
when creating it.

This commit doesn't do anything except for extending the grammar
and passing the right pieces of information to the right destinations.
There's no validation and the options have no effect yet. That will
be done in the following patch.
2025-12-16 11:43:37 +01:00
Dawid Mędrek
2cc01eeffb cql3/statements/create_index_statement: Rename member 2025-12-16 11:43:37 +01:00
Dawid Mędrek
b3099e2f2c cql3/statements/index_prop_defs: Re-introduce index_prop_defs
The type represents a mix of both index-specific and view properties.
Since we cannot easily distinguish which properties belong to which
entity, let's use this abstraction and filter them from the C++ level.

This is a prerequisite for extending the capabilities of CREATE INDEX
by allowing it to configuring the underlying materialized view.
2025-12-16 11:43:37 +01:00
Dawid Mędrek
6dfebc0b6e cql3/statements/property_definitions: Add extract_property()
The method will be useful in an upcoming commit where we'll want to
split a type inherited from `property_definitions` into two separate
entities.
2025-12-16 11:43:37 +01:00
Dawid Mędrek
62a8dd1b7f cql3/statements/index_prop_defs.cc: Add namespace 2025-12-16 11:43:37 +01:00
Dawid Mędrek
dcf2c71204 cql3/statements/index_prop_defs.hh: Rename type
We rename the type `index_prop_defs` to `index_specific_prop_defs`.
The rationale for the change is to distinguish between properties
related directly to a index and properties related to the underlying
view (if applicable).

The type `index_prop_defs` will be re-introduced in an upcomming commit
where it'll encompass both index-related and view-related properties.
This is a prerequisite for it.
2025-12-16 11:43:37 +01:00
Dawid Mędrek
e9108677f7 cql3/statements/view_prop_defs.cc: Move validation logic into file
We're moving the rest of the validation logic that can be moved from
`cql3/statements/{create,alter}_view_statement.cc` to the new file.
2025-12-16 11:43:35 +01:00
Dawid Mędrek
f5f7aeaa0a cql3/statements: Introduce view_prop_defs.{hh,cc}
We're introducing a new type wrapping properties that can be used with
materialized views. Doing that, we achieve the following things:

(1) We can keep validation logic in one place.
(2) We differentiate between properties of a regular table and
    properties of a materialized view.
(3) It provides better modularization and allows for reusing the code.
(4) It gets rid of inconsistencies in the existing code, e.g.
    CREATE MV using one type for properties, while ALTER MV another.

The actual end goal of this commit is to be able to reuse at least part
of the validation logic of MVs in CREATE INDEX and, when it gets added,
ALTER INDEX: we want to endow those statements with an ability to modify
the underlying materialized view without having to modify it directly.

This patch does NOT implement the whole validation logic yet. It will be
done in a following commit.

Refs scylladb/scylladb#16454
2025-12-16 11:43:03 +01:00
Tomasz Grabiec
4ed17c9e88 node_ops: Drop redundant get_status_helper() 2025-12-16 11:09:11 +01:00
Patryk Jędrzejczak
73db5c94de Merge 'db: api: service: introduce system.client_routes table and related API endpoints' from Andrzej Jackowski
`system.client_routes` is a system table that sets the target address and ports for each `host_id`, for one or more connection (e.g., Private Link) represented by `connection_id`. Cloud will write the table via REST, and drivers will read it via CQL to override values obtained from `system.local` and `system.peers`.

This patch series contains:
 - Introduction of `CLIENT_ROUTES` feature flag.
 - Implementation of raft-based `system.client_routes` table
 - Implementation of `v2/client-routes` POST/DELETE/GET endpoints
 - Implementation of new `CLIENT_ROUTES_CHANGE` event that is sent to drivers when `system.client_routes` is changed
 - New tests that verifies the aforementioned features

Ref: scylladb/scylla-enterprise#5699

For now, no automatic backport. However, the changes are planned to be release on `2025.4` either as a backport or a private build.

Closes scylladb/scylladb#27323

* https://github.com/scylladb/scylladb:
  docs: describe CLIENT_ROUTES_CHANGE extension
  test: add test for CLIENT_ROUTES event
  service: transport: add CLIENT_ROUTES_CHANGE event
  test: add cluster tests for client routes
  test: add API tests for client_routes endpoints
  test: add `timeout` parameter to `delete` in RESTClient
  test: allow json_body in send
  api: implement client_routes endpoints
  api: add client_routes.json
  service: main: add client_routes_service
  db: add system.client_routes table
  gms: add CLIENT_ROUTES feature
2025-12-16 10:38:27 +01:00
Botond Dénes
85f05fbe1b Revert "Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk"
This reverts commit 866c96f536, reversing
changes made to 367633270a.

This change caused all longevities to fail, with a crash in parsing
scylla-metadata. The investigation is still ongoing, with no quick fix
in sight yet.

Fixes: #27496

Closes scylladb/scylladb#27518
2025-12-16 11:34:40 +02:00
Botond Dénes
83f46fa7f5 doc: add video link for TTL
Closes: #26210
2025-12-16 10:43:03 +02:00
Anna Stuchlik
ea6f2a21c6 doc: remove references to ScyllaDB versions 4.3 and 4.4
We should never refer to the no longer supported OSS versions.
This is a leftover - other mentions were removed long time ago.

Fixes https://github.com/scylladb/scylladb/issues/19569

Closes scylladb/scylladb#27656
2025-12-16 06:58:53 +02:00
Yaniv Kaul
30c4bc3f96 Fix for __iter__ method returns a non-iterator
To fix this issue, the std_list_iterator class defined within
std_list.__iter__ should implement the full iterator protocol by defining
an __iter__() method that returns self. This change ensures any instance
of std_list_iterator can be used as an iterator in Python for loops and
other iteration contexts, as required. The fix is to add a small method
definition inside the std_list_iterator class, ideally after the __init__
or in a logical place with the other dunder methods.

Only the code inside the std_list class's __iter__ function (lines around
the definition of the inner class and its methods) needs to be edited.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27642
2025-12-16 06:57:19 +02:00
Piotr Smaron
77fa936edc doc: audit: update to present how to enable both syslog and table
Supporting both sinks have been introduced in
https://github.com/scylladb/scylladb/pull/26613, but it missed the docs
changes, so here they are.

Closes scylladb/scylladb#27607
2025-12-16 06:56:39 +02:00
Piotr Smaron
0ec485845b Clarify documentation build instructions
Closes scylladb/scylladb#27606
2025-12-16 06:56:00 +02:00
Botond Dénes
dace39fd6c Merge 'Make commitlog replay handle files with corrupt file header (non-zero) as data loss, not startup failure' from Calle Wilund
Fixes #26744

If a segment to replay is broken such that the main header is not zero, but still broken, we throw header_checksum_error. This was not handled in replayer, which grouped this into the "user error/fundamental problem" category.

However, assuming we allow for "real" disk corruption, this should really be treated same as data corruption, i.e. reported data loss, not failure to start up.

The `test_one_big_mutation_corrupted_on_startup` test accidentally sometimes provoked this issue, by doing random file wrecking, which on rare occasions provoked this, and thus failed test due to scylla not starting up, instead of losing data as expected.

Closes scylladb/scylladb#27556

* github.com:scylladb/scylladb:
  test::cluster::dtest::tools::files: Remove file
  commitlog_replay: Handle fully corrupt files same as partial corruption.
  test::pylib::suite::base: Split options.name test specifier only once
2025-12-16 06:55:42 +02:00
Calle Wilund
5f8f724d78 repair: Don't use off-strategy as repair destination with tablet tables
Fixes #17384

Bypasses enabling off-strategy storage/placement for repair streams
when table repaired is using tablets. Instead, the resulting sstable(s)
will be placed in the "normal" set of sstables, and bypass a post-repair
off-strategy compaction.

v2:
Bypass off-strat for whatever reason iff dest is tablets.

Closes scylladb/scylladb#27500
2025-12-16 06:54:07 +02:00
Michał Chojnowski
df93ea626b test/scylla_gdb: use gcore instead of signal SIGSEGV to generate a coredump on failure
The test fails in CI sometimes, and we want a coredump from a failure
to debug that. We made the test send a `signal SIGSEGV` to Scylla
on failure, but apparently that doesn't work as intended on our CI
hosts. (The CI runner seemingly can't find any coredump afterwards).

We can use gdb's `gcore` command to produce a coredump in a more
predictable way.

Refs scylladb/scylladb#22501

Closes scylladb/scylladb#27498
2025-12-16 06:53:43 +02:00
Botond Dénes
74347625f9 Merge 'test/alternator: add reproducers for more issues' from Nadav Har'El
This series adds an xfailing reproducers for two issue: #8070 and #27037:

27037 is about where even with alternator_streams_increased_compatibility set to true, if an attribute
is set to the same value it had but using a different JSON representation - a Alternator Streams
event is unduly produced.

8070 is about the ability to write malformed values into the database and then fail during read - instead of failing, as expected, during the write. This issue was known for years, but we never really had a reproducer for it - it's not possible to reproduce it using clean boto3 code and we need to build a request manually.

The first two patches are two small cleanups (including fixes #27372)  that I did while preparing the real tests - which are in the final two patches.

Closes scylladb/scylladb#27376

* github.com:scylladb/scylladb:
  test/alternator: add reproducer for bug with storing invalid values
  test/alternator: reproducer for issue 27375
  utils/rjson: fix error messages from rjson::parse()
  test/alternator: extract get_signed_request() to util.py
2025-12-16 06:53:14 +02:00
Andrzej Jackowski
f1fc5cc808 docs: describe CLIENT_ROUTES_CHANGE extension
Ref: scylladb/scylla-enterprise#5699
2025-12-15 18:19:37 +01:00
Andrzej Jackowski
61bbea51ad test: add test for CLIENT_ROUTES event
Ref: scylladb/scylla-enterprise#5699
2025-12-15 18:19:37 +01:00
Andrzej Jackowski
c2b1b10ca0 service: transport: add CLIENT_ROUTES_CHANGE event
Introduce the CLIENT_ROUTES_CHANGE event to let drivers refresh
connections when `system.client_routes` is modified. Some deployments
(e.g., Private Link) require specific address/port mappings that can
change without topology changes and drivers need to adapt promptly
to avoid connectivity issues.

This new EVENT type carries a change indicator plus the affected
`connection_ids` and `host_ids`. The only change value is
`UPDATE_NODES`, meaning one or more client routes were inserted,
updated, or deleted.

Drivers subscribe using the existing events mechanism, so no additional
`cql_protocol_extension` key is required.

Ref: scylladb/scylla-enterprise#5699
2025-12-15 18:19:37 +01:00
Andrzej Jackowski
ec87b92ba1 test: add cluster tests for client routes
Ref: scylladb/scylla-enterprise#5699
2025-12-15 18:17:15 +01:00
Andrzej Jackowski
9c9371511f test: add API tests for client_routes endpoints
Ref: scylladb/scylla-enterprise#5699
2025-12-15 17:46:14 +01:00
Andrzej Jackowski
2e80997630 test: add timeout parameter to delete in RESTClient
The parameter was missing and is needed to implement a test case
later in this patch series.
2025-12-15 17:44:48 +01:00
Andrzej Jackowski
1143acaf5b test: allow json_body in send
Needed to test `/v2/connection_metadata` endpoints that receive
JSON input.

Ref: scylladb/scylla-enterprise#5699
2025-12-15 17:44:48 +01:00
Andrzej Jackowski
e153cc434f api: implement client_routes endpoints
Ref: scylladb/scylla-enterprise#5699
2025-12-15 17:36:47 +01:00
Nadav Har'El
4e106b9820 test/cqlpy: remove unused variables
Copilot detected a few cases of cqlpy tests setting a variable which
they don't use. In all the cases in this patch, we can just remove
the variable. Although the AI found all these unused variables, I
verified each case carefully before changing it in this patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-15 18:11:04 +02:00
Nadav Har'El
64d9c370ee test/alternator: remove unnecessary duplicate statement
copilot noticed that test/alternator/test_scan.py had a duplicate
statement (call to full_scan()). It doesn't break the test, but also
adds nothing but confusion - so let's just remove it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-15 18:07:45 +02:00
Nadav Har'El
a3959fe3db test/alternator: remove unused variable assignments
copilot noticed in that in in many of Alternator tests, we have some
unnecessary assignments. For example, in a few places, we use the idiom:

     with pytest.raises(...):
         ret = ...

The "ret=" part is unnecessary, as this test expects the statement to
fail (hence the raises()), and ret is never assigned. The assignment
was only there because we copied this statement from another place in
the test, which does expect the statement to pass and wants to validate
the returned value.

So we should just drop the "ret=" from these tests.

Another common occurance is that we used the idiom

     response = table.do_something()

Without checking the response and no intention to check it (either we
know it will work, or we just want to check it doesn't throw). So we
can drop the "response=" here too.

All of the unused variables in this patch were discovered by Copilot,
but I reviewed each of them carefully myself and prepared this patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-15 18:07:05 +02:00
Nadav Har'El
4fa4f40712 test/cqlpy: use unique partition in test
It is traditional to use a unique (or random) partition key in cqlpy
tests, to allow multiple tests to share the same table and make the test
suite a bit faster. One of the tests, test_multi_column_relation_desc,
set up a unique key "k", but then forgot to use it and used partition
key 0 instead. Fix the test to use this k.

This problem was spotted by Copilot, who saw the unused variable k.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-15 17:08:51 +02:00
Yauheni Khatsianevich
07867a9a0d test: new LWT with counters test during tablets migration/resize
- Workload: N workers perform CAS updates
- Update counter table each time CAS was successful
- Enable balancing and increase min_tablet_count to force split,
 and lower min_tablet_count to merge.
- Run tablets migrations loop
- Stop workload and verify data consistency
2025-12-15 14:32:30 +01:00
Patryk Jędrzejczak
844545bb74 Merge 'treewide: fix cases of improper re-throwing of std::exception_ptr' from Emil Maskovsky
Fix multiple cases where the captured `std::exception_ptr` has been re-thrown via simple `throw eptr;`, which results in losing the original exception type and details.

Resolved at various places found by clang-tidy:

1. db::schema_applier

When applying schema changes, the previous implementation attempted to handle exceptions by catching and rethrowing them, but did so incorrectly: using `throw ex` with a `std::exception_ptr` loses the original exception type and details.

However, in this case, explicit exception handling is unnecessary. The only reason for catching was to ensure `ap.destroy()` is called before propagating the exception. This can be more cleanly and safely achieved using Seastar's `.finally()` continuation, which guarantees cleanup regardless of success or failure.

2. directories

The `std::exception_ptr()` has been captured for logging and then again re-thrown incorrectly via `throw ex;`. We could use `std::rethrow_exception()` here instead, but it seems to be simpler to just use regular `throw;` to rethrow the original exception, and only use the `std::current_exception()` for logging (which is a pattern used in other places as well).

3. storage_service

Here the exception has been re-thrown incorrectly in a coroutine. There it is best to use the `co_await coroutine::return_exception_ptr` to propagate exception more efficiently in a coroutine-friendly manner.

Fixes: SCYLLADB-94
Refs: scylladb/scylladb#27501

No backport: This fixes an error logging issue, that isn't a production problem by itself (only found in test), therefore not backporting to older branches.

Closes scylladb/scylladb#27613

* https://github.com/scylladb/scylladb:
  db: schema_applier: improve exception-safe cleanup
  directories: fix exception rethrowing
  storage_service: use coroutine-friendly exception propagation in join_node_response_handler
2025-12-15 13:56:45 +01:00
Nadav Har'El
ccacea621f test/cqlpy: fix flaky test test_view_in_system_tables
The cqlpy test test_materialized_view.py::test_view_in_system_tables
checks that the system table "system.built_views" can inform us that
a view has been built. This test was flaky, starting to fail quite
often recently, and this patch fixes the problem in the test.

For historic reasons  this test began by calling a utility function
wait_for_view_built() - which uses a different system table,
system_distributed.view_build_status, to wait until the view was built.
The test then immediately tries to verify that also system.built_views
lists this view.

But there is no real reason why we could assume - or want to assume -
that these two tables are updated in this order, or how much time
passed between the two tables being changed. The authors of this
test already acknowledged there is a problem - they included a hack
purporting to be a "read barrier" that claimed to solve this exact
problem - but it seems it doesn't, or at least no longer does after
recent changes to the view builder's implementation.

The solution is simple - just remove the call to wait_for_view_built()
and the "hack" after it. We should just wait in a loop (until a timeout)
for the system table that we really wanted to check - system.built_views.
It's as simple as that. No need for any other assumptions or hacks.

Fixes #27296

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27626
2025-12-15 15:29:08 +03:00
Nadav Har'El
f287484f4d test/cqlpy: rename test with duplicate name
When translating Cassandra's test validation/operations/CreateTest.java
I accidentally used the same name for two tests, resulting in the first
of them never being run.

Let's fix the name of the second of the two to be the real name it had
in the original Cassandra test.

After this patch pytest reports 16 tests in this file, instead of 15
before this patch. The previously-ignored test was correct, and it
now passes in both Scylla and Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-15 14:19:24 +02:00
Benny Halevy
f4a4671ad6 table: seal_snapshot: avoid oversized allocation when dumping manifest.json
Currently, we first print the json contents into a stringstream buffer
and then we write it as a whole to the manifest.json file output stream.

This is not scalable and may cause large allocation for large enough number
of files.

Fixes #24216

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#27542
2025-12-15 15:19:24 +03:00
Dawid Mędrek
8691d287be cql3/statements/create_view_statement.cc: Move validation of ID
The end goal we have in mind in this commit is to extract the validation
logic of the options used for creating and altering an MV to a separate
place and be able to call from different places in the code.

It will be useful when extending the capabilities of the CREATE INDEX
statement.

In this patch, we move the part of validation responsible for checking
the ID option to keep it close to the other parts of validation of the
options in their "raw" form.
2025-12-15 13:18:48 +01:00
Dawid Mędrek
11c109c623 schema/schema.hh: Do not include index_prop_defs.hh
One of the upcoming commits will lead to a cyclic dependency
of headers because `schema.hh` includes `index_prop_defs.hh`.
To prevent that, we remove the include and replace it with
a manually added alias.

This is not a perfect solution, but doing it properly would
require comprehensive changes. We can do that in a separate
task.
2025-12-15 13:18:48 +01:00
Andrzej Jackowski
70a0418102 api: add client_routes.json
Add the JSON definitions for the POST, GET, and DELETE endpoints used
to modify client routes. These endpoints are intended for Cloud to
update the `system.client_routes` table.

The API is implemented in `/v2/` because the endpoints process arrays
of objects. Handling of such structures was improved between
Swagger 1.2 and 2.0 versions. There are already similar
`get_metrics_config` and `set_metrics_config` endpoints that operate
on similar structures and they are also in /v2/.

The introduced JSON files start with `, ` but it's intended because
the files are concatenated to the existing (metrics) JSON files,
and they need to represent valid JSON after the concatenation.

Ref: scylladb/scylla-enterprise#5699
2025-12-15 13:13:46 +01:00
Andrzej Jackowski
6fcc1ecf94 service: main: add client_routes_service
Introduce `client_routes_service` for managing
`system.client_routes` table.

Ref: scylladb/scylla-enterprise#5699
2025-12-15 13:13:40 +01:00
Andrzej Jackowski
8dde70d04c db: add system.client_routes table
Introduce `system.client_routes`, a system table that sets the target
address and ports for each `host_id`, for one or more connections
(e.g., Private Link) represented by `connection_id`. Cloud will write
the table via REST, and drivers will read it via CQL to override
values obtained from `system.local` and `system.peers`.

The table is Raft-managed to provide consistent replication across
nodes.

Schema overview: each row is identified by `(connection_id, host_id)`
and describes where clients should connect: `address` and one or more of
`port`, `tls_port`, `alternator_port`, `alternator_https_port`.
`host_id` is a UUID (just as in ScyllaDB) but `connection_id` can be
any string to accept formats of all cloud providers. `address` is
also a regular string because it can represent either an IP address or
a domain. Ports are optional in the sense that at least one of
the four must be provided.

Ref: scylladb/scylla-enterprise#5699
2025-12-15 13:08:05 +01:00
Andrzej Jackowski
2e7070d3b7 gms: add CLIENT_ROUTES feature
The feature will be used later in this patch series:
 - To avoid unnecessary operations when the feature is not enabled
 - To guard new API endpoints from being used before the cluster is
   ready to use them.
 - To implement update tests (by disabling/enabling the feature)

Ref: scylladb/scylla-enterprise#5699
2025-12-15 13:08:04 +01:00
Pavel Emelyanov
3f7ee3ce5d Merge 'batchlog: make replay (flush) faster' from Botond Dénes
The batchlog table contains an entry for each logged batch that is processed by the local node as coordinator. These entries are typically very short lived, they are inserted when the batch is processed and deleted immediately after the batch is successfully applied.
When a table has `tombstone_gc = {'mode': 'repair'}` enabled, every repair has to flush all hints and batchlogs, so that we can be certain that there is no live data in any of these, older than the last repair. Since batches can contain member queries from any number of tables, the whole batchlog has to be flushed, even if repair-mode tombstone-gc is enabled for a single table.

Flushing the batchlog table happens by doing a batchlog replay. This involves reading the entire content of this table, and attempting to replay+delete any live entries (that are old enough to be replayed).  Under normal operating circumstances, 99%+ of the content of the batchlog table is partition tombstones.  Because of this, scanning the content of this table has to process thousands to millions of tombstones. This was observed to require up to 20 minutes to finish, causing repairs to slow down to a crawl, as the batchlog-flush has to be repeated at the end of the repair of each token-range.

When trying to address this problem, the first idea was that we should expedite the garbage-collection of these accumulated tombstones. This experiment failed, see https://github.com/scylladb/scylladb/pull/23752. The commitlog proved to be an impossible to bypass barrier, preventing quick garbage-collection of tombstones. So long as a single commit-log segment is alive, holding content from the batchlog table, all tombstones written after are blocked from GC.
The second approach, represented by this PR, is to not rely in tombstone GC to reduce the tombstone amount. Instead restructure the table such that a single higher-order tombstone can be used to shadow and allow for the eviction of the myriads of individual batchlog entry tombstones. This is realized by reorganizing the batchlog table such that individual batches are rows, not partitions.
This new schema is introduced by the new `system.batchlog_v2` table, introduced by this PR:

    CREATE TABLE system.batchlog_v2 (
        version int,
        stage int,
        shard int,
        written_at timestamp,
        id uuid,
        data blob,
        PRIMARY KEY ((version, stage, shard), written_at, id));

The new schema organization has the following goals:
1) Make post-replay batchlog cleanup possible with a simple range-tombstone. This allows dropping the individual dead batchlog entries, as they are shadowed by a higher level tombstone. This enables dropping tombstones without tombstone GC.
2) To make the above possible, introduce the stage key component: batchlog entries that fail the first replay attempt, are moved to the failed_replay stage, so the initial stage can be cleaned up safely.
3) Spread out the data among Scylla shards, via the batchlog shard column.
4) Make batchlog entries ordered by the batchlog create time (id). This allows for selecting batchlogs to replay, without post-filtering of batchlogs that are too young to be replayed.

Fixes: https://github.com/scylladb/scylladb/issues/23358

This is an improvement, normally not a backport-candidate. We might override this and backport to allow wider use of `tombstone_gc: {'mode': 'repair'}`.

Closes scylladb/scylladb#26671

* github.com:scylladb/scylladb:
  db/config: change batchlog_replay_cleanup_after_replays default to 1
  test/boost/batchlog_manager_test: add test for batchlog cleanup
  replica/mutation_dump: always set position weight for clustering positions
  service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/
  test/lib: introduce error_injection.hh
  utils/error_injection: add debug log to disable() and disable_all()
  test/lib/cql_test_env: forward config to batchlog
  test/lib/cql_test_env: add batch type to execute_batch()
  test/lib/cql_assertions: add with_size(predicate) overload
  test/lib/cql_assertions: add source location to fail messages
  test/lib/cql_assertions: columns_assertions: add assert_for_columns_of_each_row()
  test/lib/cql_assertions: rows_assertions::assert_for_columns_of_row(): add index bound check
  test/lib/cql_assertions: columns_assertions: add T* with_typed_column() overload
  db/batchlog_manager: config: s/write_timeout/reply_timeot/
  db,service: switch to system.batchlog_v2
  db/system_keyspace: introduce system.batchlog_v2
  service,db: extract generation of batchlog delete mutation
  service,db: extract get_batchlog_mutation_for() from storage-proxy
  db/batchlog_manager: only consider propagation delay with tombstone-gc=repair
  db/batchlog_manager: don't drop entire batch if one mutations' table was dropped
  data_dictionary: table: add get_truncation_time()
  db/batchlog_manager: batch(): replace map_reduce() with simple loop
  db/batchlog_manager: finish coroutinizing replay_all_failed_batches
  db/batchlog_manager: improve replayAllFailedBatches logs
2025-12-15 15:05:19 +03:00
Nadav Har'El
a9442e6d56 test/cqlpy: rename test with duplicate name
When translating Cassandra's test validation/operations/DeleteTest.java
I accidentally used the same name for two tests, resulting in the first
of them never being run.

Let's fix the name of the second of the two to be the real name it had
in the original Cassandra test.

After this patch pytest reports 52 tests in this file, instead of 51
before this patch. The previously-ignored test was correct, and it
now passes in both Scylla and Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-15 12:02:59 +02:00
Dawid Mędrek
1e14c08eee locator/token_metadata: Remove get_host_id()
The function is declared, but it's not defined or used anywhere.

Closes scylladb/scylladb#27374
2025-12-15 10:36:52 +01:00
Michael Litvak
b9ec1180f5 alternator: require rf_rack_valid_keyspaces when creating index
When creating an alternator table with tablets, if it has an index, LSI
or GSI, require the config option rf_rack_valid_keyspaces to be enabled.

The option is required for materialized views in tablets keyspaces to
function properly and avoid consistency issues that could happen due to
cross-rack migrations and pairing switches when RF-rack validity is not
enforced.

Currently the option is validated when creating a materialized view via
the CQL interface, but it's missing from the alternator interface. Since
alternator indexes are based on materialized views, the same check
should be added there as well.

Fixes scylladb/scylladb#27612

Closes scylladb/scylladb#27622
2025-12-15 10:36:57 +02:00
Michał Hudobski
12483d8c3c vector_search: throw an error when we restrict primary in vector search
We currently allow restrictions on single column primary key,
but we ignore the restriction and return all results.
This can confuse the users. We change it so such a restriction
will throw an error and add a test to validate it.

Fixes: VECTOR-331

Closes scylladb/scylladb#27143
2025-12-15 09:45:56 +02:00
Jenkins Promoter
d5641398f5 Update pgo profiles - aarch64 2025-12-15 05:16:31 +02:00
Alex
d21faab9dc test: cqlpy: Remove test_switch_tenants and add test in cluster testing. The test needs to run twice, in two separate Scylla runs, using two different modes: gossip and raft.
The cluster framework supports this setup, while cqlpy only runs against Scylla instances in raft mode.
Therefore, the test was moved from cqlpy to the cluster-based framework.
This commit both adds the test in cluster/ and removes the old version in cqlpy/.
2025-12-14 18:46:06 +02:00
Alex
30f6a40ae6 server: Refactor update_control_connection_scheduling_group functionality This refactoring moves the logic that retrieves the scheduling group for driver_service_level_name out of switch_tenant.
This change is possible because the scheduling group for the driver is retrieved from a map (LOOKUP).
The lookup function is fully synchronized, non-coroutine, and returns immediately.
For that reason, it’s better to perform this lookup outside of the switch_tenant function.
2025-12-14 18:46:06 +02:00
Alex
5579489c4c server: Refactor scheduling group update functionality. This change generalizes the scheduling-group update functionality and removes some copy-paste code, improving overall readability and maintainability.
To achieve this, capturing lambdas were introduced. As a result, self-deducing this was added to those lambdas to avoid coroutine-related issues (“coroutine fiasco”).
2025-12-14 18:46:05 +02:00
Alex
17c9d640fe server: Fix switch_tenant problem, When running on a V2 server, service-level data comes from service level cache. Because of this, we can use synchronized function to get the schedualing group.
Since we are transitioning to a Raft-based architecture where all servers will be V2, we can safely implement this fix specifically for that case.
This change adds get_cached_user_scheduling_group functionality and moves its usage out of switch_tenant function in update_scheduling_group_v2 usage.
2025-12-14 16:27:40 +02:00
Alex
f98af582a7 server: Add update_service_level_scheduling_group_v1 functions to create placehholder for functionality that will introduce v2 implementation.
The new functionality will allow usage of service level cache
2025-12-14 16:09:18 +02:00
Nadav Har'El
c06e63daed Merge 'auth: start using SHA 512 hashing originated from musl with added yielding' from Andrzej Jackowski
This patch series contains the following changes:
 - Incorporation of `crypt_sha512.c` from musl to out codebase
 - Conversion of `crypt_sha512.c` to C++ and coroutinization
 - Coroutinization of `auth::passwords::check`
 - Enabling use of `__crypt_sha512` orignated from `crypt_sha512.c` for
   computing SHA 512 passwords of length <=255
 - Addition of yielding in the aforementioned hashing implementation.

The alien thread was a solution for reactor stalls caused by indivisible
password‑hashing tasks (https://github.com/scylladb/scylladb/issues/24524).
However, because there is only one alien thread, overall hashing throughput was reduced
(see, e.g., https://github.com/scylladb/scylla-enterprise/issues/5711). To address this,
the alien‑thread solution is reverted, and a hashing implementation
with yielding is introduced in this patch series.

Before this patch series, ScyllaDB used SHA-512 hashing provided
by the `crypt_r` function, which in our case meant using the implementation
from the `libxcrypt` library. Adding yielding to this `libxcrypt`
implementation is problematic, both due to licensing (LGPL) and because the
implementation is split into many functions across multiple files. In
contrast, the SHA-512 implementation from `musl libc` has a more
permissive license and is concise, which makes it easier to incorporate
into the ScyllaDB codebase.

The performance of this solution was compared with the previous
implementation that used one alien thread and the implementation
after the alien thread was reverted. The results (median) of
`perf-cql-raw` with `--connection-per-request 1 --smp 10` parameters
are as follows:
 - Alien thread: 41.5 new connections/s per shard
 - Reverted alien thread: 244.1 new connections/s per shard
 - This commit (yielding in hashing): 198.4 new connections/s per shard

The roughly 20% performance deterioration compared to
the old implementation without the alien thread comes from the fact
that the new hashing algorithm implemented in `utils/crypt_sha512.cc`
performs an expensive self-verification and stack cleanup.

On the other hand, with smp=10 the current implementation achieves
roughly 5x higher throughput than the alien thread. In addition,
due to yielding added in this commit, the algorithm is expected
to provide similar protection from stalls as the alien thread did.
In a test that in parallel started a cassandra-stress workload and
created thousands of new connections using python-driver, the values of
`scylla_reactor_stalls_count` metric were as follows:
 - Alien thread: 109 stalls/shard total
 - Reverted alien thread: 13186 stalls/shard total
 - This commit (yielding in hashing): 149 stalls/shard total

Similarly, the `scylla_scheduler_time_spent_on_task_quota_violations_ms`
values were:
 - Alien thread: 1087 ms/shard total
 - Reverted alien thread: 72839 ms/shard total
 - This commit (yielding in hashing): 1623 ms/shard total

To summarize, yielding during hashing computations achieves similar
throughput to the old solution without the alien thread but also
prevents stalls similarly to the alien thread.

Fixes: scylladb/scylladb#26859
Refs: scylladb/scylla-enterprise#5711

No automatic backport. After this PR is completed, the alien thread should be rather reverted from older branches (2025.2-2025.4 because on 2025.1 it's already removed). Backporting of the other commits needs further discussion.

Closes scylladb/scylladb#26860

* github.com:scylladb/scylladb:
  test/boost: add too_long_password to auth_passwords_test
  test/boost: add same_hashes_as_crypt_r to auth_passwords_test
  auth: utils: add yielding to crypt_sha512
  auth: change return type of passwords::check to future
  auth: remove code duplication in verify_scheme
  test/boost: coroutinize auth_passwords_test
  utils: coroutinize crypt_sha512
  utils: make crypt_sha512.cc to compile
  utils: license: import crypt_sha512.c from musl to the project
  Revert "auth: move passwords::check call to alien thread"
2025-12-14 14:01:01 +02:00
David Garcia
c1c3b2c5bb docs: fix local build
prevents early exits in metrics docs generation to break the local build.

Fixes #27497

Closes scylladb/scylladb#27615
2025-12-14 11:48:48 +02:00
Raphael S. Carvalho
a0a7941eb1 test: Add reproducer for split vs intra-node migration race
This is a problem caught after removing split from
add_sstable_and_update_cache(), which was used by
intra node migration when loading new sstables
into the destination shard.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 17:01:18 -03:00
Raphael S. Carvalho
e3b9abdb30 test: Verify split failure on behalf of repair during critical disk utilization
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 17:01:18 -03:00
Raphael S. Carvalho
bc772b791d test: boost: Add failure_when_adding_new_sstable_test
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 17:01:18 -03:00
Raphael S. Carvalho
77a4f95eb8 test: Add reproducer for split vs incremental repair race condition
Refs #26041.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 17:01:16 -03:00
Raphael S. Carvalho
992bfb9f63 compaction: Fail split of new sstable if manager is disabled
If manager has been disabled due to out of space prevention, it's
important to throw an exception rather than silently not
splitting the new sstable.

Not splitting a sstable when needed can cause correctness issue
when finalizing split later.
It's better to fail the writer (e.g. repair one) which will be
retried than making caller think everything succeeded.

The new replica::table::add_new_sstable_and_update_cache() will
now unlink the new sstable on failure, so the table dir will
not be left with sstables not loaded.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:51 -03:00
Raphael S. Carvalho
ee3a743dc4 replica: Don't split in do_add_sstable_and_update_cache()
Now, only sstable loader on boot and refresh from upload uses this
procedure. The idea is that maybe_split_new_sstable() will throw
when compaction cannot run due to e.g. out of space prevention.
It could fail repair writer, but we don't want it to fail boot.

As for refresh from upload, it's not supposed to work when tablet
map at the time of backup is not the same when restoring.
Even before this, refresh would fail if split already executed,
split would only happen if split was still ongoing. We need
token range stability for local restore. The safe variant will
always be load and stream.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:51 -03:00
Raphael S. Carvalho
48d243f32f streaming: Leave sstables unsealed until attached to the table
We want the invariant that after ACK, all sealed sstables will be split.

This guarantee that on restart, no unsplit sstables will be found
sealed.

The paths that generate unsplit sstables are streaming and file
streaming consumers. It includes intra-node streaming, which
is local but can clone an unsplit sstable into destination.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:51 -03:00
Raphael S. Carvalho
d9d58780e2 replica: Wire add_new_sstables_and_update_cache() into intra-node streaming
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:51 -03:00
Raphael S. Carvalho
ddb27488fa replica: Wire add_new_sstable_and_update_cache() into file streaming consumer
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:51 -03:00
Raphael S. Carvalho
10225ee434 replica: Wire add_new_sstable_and_update_cache() into streaming consumer
After the wiring, failure to attach the new sstable in the streaming
consumer will unlink the sstable automatically.

Fixes #27414.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:51 -03:00
Raphael S. Carvalho
a72025bbf6 replica: Document old add_sstable_and_update_cache() variants
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:51 -03:00
Raphael S. Carvalho
3f8363300a replica: Introduce add_new_sstables_and_update_cache()
Piggyback on new add_new_sstable_and_update_cache(), replacing
the previous add_sstables_and_update_cache().

Will be used by intra-node migration since we want it to be
safe when loading the cloned sstables. An unsplit sstable
can be cloned into destination which already ACKed split,
so we need this variant which splits sstable if needed,
while it's unsealed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
63d1d6c39b replica: Introduce add_new_sstable_and_update_cache()
Failure to load sstable in streaming can leave sealed sstables
on disk since they're not unlinked on failure.

This can result in several problems:
1) Data resurrection: since the sstable may contain deleted data
2) Split issue: since the finalization requires all sstables to be split
3) Disk usage issue: since the sstables hold space and streaming retries
can keep accumulating these files.

This new procedure will be later wired into streaming consumers, in
order to fix those problems.

Another benefit of the interface is that if there's split when adding
the new sstable, the output sstables will be returned to the caller,
allowing them to register the actual loaded sstables into e.g.
the view builder.

Refs #27414.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
27d460758f replica: Account for sstables being added before ACKing split
We want the invariant that after ACK, all sealed sstables will be split.
If check-and-attach is not atomic, this sequence is possible:

1) no split decision set.
2) Unsplit sstable is checked, no need to split, sealed.
3) split decision is set and ACKed
4) unsplit sstable is attached

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
794e03856a replica: Remove repair read lock from maybe_split_new_sstable()
The lock is intended to serialize some maintenance compactions,
such as major, with repair. But maybe_split_new_sstable() is
restricted solely to new sstables that aren't part of the
sstable set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
2dae0a7380 compaction: Preserve state of input sstable in maybe_split_new_sstable()
This is crucial with MVs, since the splitting must preserve the state of
the original sstable. We want the sstable to be in staging dir, so it's
excluded when calculating the diff for performing pushes to view
replicas.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
1fdc410e24 Rename maybe_split_sstable() to maybe_split_new_sstable()
Since the function must only be used on new sstables, it should
be renamed to something describing its usage should be restricted.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
1a077a80f1 sstables: Allow storage::snapshot() to leave destination sstable unsealed
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
c5e840e460 sstables: Add option to leave sstable unsealed in the stream sink
That will be needed for file streaming to leave output sstable unsealed.

we want the invariant where all sealed sstables are split after split
was ACKed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
c10486a5e9 test: Verify unsealed sstable can be compacted
This is crucial for splitting before sealing the sstable produced by
repair. This way, unsplit sstables won't be left on disk sealed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
ab82428228 sstables: Allow unsealed sstable to be loaded
File streaming will have to load an unsealed sstable, so we need
to be able to parse components from temporary TOC instead.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
b1be4ba2fc sstables: Restore sstable_writer_config::leave_unsealed
This option was retired in commit 0959739216, but
it will be again needed in order to implement split before sealing.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Emil Maskovsky
5e7456936e db: schema_applier: improve exception-safe cleanup
When applying schema changes, the previous implementation attempted to
handle exceptions by catching and rethrowing them, but did so
incorrectly: using `throw ex` with a `std::exception_ptr` loses the
original exception type and details. The correct approach is to use
`std::rethrow_exception()`.

However, in this case, explicit exception handling is unnecessary. The
only reason for catching was to ensure `ap.destroy()` is called before
propagating the exception. This can be more cleanly and safely achieved
using Seastar's `.finally()` continuation, which guarantees cleanup
regardless of success or failure.

This change removes the manual try/catch/rethrow and uses `.finally()`
to ensure proper cleanup, letting exceptions propagate naturally and
preserving their type and information.

Fixes: SCYLLADB-94
Refs: scylladb/scylladb#27501
2025-12-12 18:18:31 +01:00
Emil Maskovsky
e6f5f2537e directories: fix exception rethrowing
Fix location identified by clang-tidy where `std::exception_ptr` was
incorrectly rethrown using `throw ep;`. The correct approach is to use
`std::rethrow_exception(ep)`, which preserves the original exception
type and stack trace.

But this can be even further simplified by logging the current exception
with `std::current_exception()` and rethrowing using `throw;` instead of
capturing and rethrowing a `std::exception_ptr`. This matches the
idiomatic pattern used elsewhere in the codebase and improves clarity.

This change ensures proper exception propagation and avoids type slicing
or loss of diagnostic information.
2025-12-12 18:10:20 +01:00
Emil Maskovsky
76aacc00f2 storage_service: use coroutine-friendly exception propagation in join_node_response_handler
Improve exception handling in join_node_response_handler by using
`co_await coroutine::return_exception_ptr` to propagate exceptions.
This replaces the incorrect direct throw of `std::exception_ptr` and
ensures proper coroutine-friendly exception propagation.
2025-12-12 17:59:54 +01:00
Cezar Moise
7c8ab3d3d3 test.py: add pid to ServerInfo
Adding pid info to servers allows matching coredumps with servers

Other improvements:
- When replacing just some fields of ServerInfo, use `_replace` instead of
building a new object. This way it is agnostic to changes to the Object
- When building ServerInfo from a list, the types defined for its fields are
not enforced, so ServerInfo(*list) works fine and does not need to be changed if
fields are added or removed.
2025-12-12 15:11:03 +02:00
Botond Dénes
cb7f2e4953 docs: scylla-sstable-script-api.rst: add introduction and title 2025-12-12 13:50:12 +02:00
Botond Dénes
dd5b6770c8 docs: scylla-sstable.rst: extract script API to separate document
The script API is 500+ lines long in an already too long and hard to
navigate document. Extract it to a separate document, making both
documents shorter and easier to navigate.
2025-12-12 13:44:32 +02:00
Botond Dénes
7e7e378a4b Merge 'Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy"' from null
Reverts commit 8192f45e84.

The merge exposed a critical bug where truncate operations during table drop with auto-snapshot fail, causing Raft applier fiber to stop with unhandled exceptions. This leads to schema inconsistencies across nodes and test failures with "Keyspace does not exist" errors.

**Root Cause**

Commit 19b6207f modified `truncate_table_on_all_shards` to set `use_sstable_identifier = true`:

```cpp
// Before (working)
co_await table::snapshot_on_all_shards(sharded_db, table_shards, name);

// After (broken)
auto opts = db::snapshot_options{.use_sstable_identifier = true};
co_await table::snapshot_on_all_shards(sharded_db, table_shards, name, opts);
```

This triggers exceptions during snapshot that propagate through Raft state machine, causing:
- Raft applier stops: `raft::state_machine_error` at `raft/server.cc:1369`
- Schema changes fail to propagate
- Nodes report non-existent keyspaces for valid schemas

**Changes**

Reverts 15 files (200 deletions, 74 insertions):
- Removes `use_sstable_identifier` from truncate/snapshot code paths
- Reverts `snapshot_options` struct back to simple `skip_flush` boolean
- Removes REST API and nodetool `--use-sstable-identifier` parameter
- Removes feature tests from `test/boost/database_test.cc`

No backport required - the original feature was merged to master only and never released.

<!-- START COPILOT ORIGINAL PROMPT -->

<details>

<summary>Original prompt</summary>

----

*This section details on the original issue you should resolve*

<issue_title>test_table_drop_with_auto_snapshot failed with InvalidRequest</issue_title>
<issue_description>Seen in: https://jenkins.scylladb.com/job/scylla-master/job/next/9968//testReport
Logs: [download](https://downloads.scylladb.com/unstable/scylla/master/testLogs/2025-12-08T15:05:42Z/)

Error message:
```
cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="Keyspace test does not exist"
```

Stacktrace:
```
manager = <test.pylib.manager_client.ManagerClient object at 0xffff734e6c10>

    @pytest.mark.asyncio
    async def test_table_drop_with_auto_snapshot(manager: ManagerClient):
        logger.info("Bootstrapping cluster")
        cfg = { 'auto_snapshot': True }
        servers = await manager.servers_add(3, config = cfg)

        cql = manager.get_cql()

        # Increases the chance of tablet migration concurrent with schema change
        await inject_error_on(manager, "tablet_allocator_shuffle", servers)

        for i in range(3):
            await cql.run_async("DROP KEYSPACE IF EXISTS test;")
            await cql.run_async("CREATE KEYSPACE IF NOT EXISTS test WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'initial': 8 };")
            await cql.run_async("CREATE TABLE IF NOT EXISTS test.tbl_sample_kv (id int, value text, PRIMARY KEY (id));")
>           await cql.run_async("INSERT INTO test.tbl_sample_kv (id, value) VALUES (1, 'ala');")
E           cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="Keyspace test does not exist"

test/cluster/test_tablets2.py:173: InvalidRequest
```
</issue_description>

<agent_instructions>this issue was exposed by commit 8192f45e84, please send a pull request reverting that merge commit and mark it as fixing this github issue.</agent_instructions>

<comments>
<comment_new><author>@yaronkaikov</author><body>
@denesb is this something in your team area? if not , please feel free to delegate it or un-assign yourself :-)</body></comment_new>
<comment_new><author>@nyh</author><body>
This is very strange. Clearly the keyspace `test` does exist at this point, because we created it two lines above and also we ran `CREATE TABLE .. test.tbl_sample_kv` which would have failed if the keyspace `test` didn't exist - so it must exit, no?

In the past, we had a bug where the running `CREATE KEYSPACE IF NOT EXISTS` forgot to set the "schema modified" event in the response so it failed to wait for schema agreement, but 1. we fixed this bug (https://github.com/scylladb/scylladb/pull/18819 by @nuivall ) and 2. this bug didn't happen in this case, where CREATE TABLE deed had work to do.

But I just realized something... Our fix in https://github.com/scylladb/scylladb/pull/18819 only applies to CREATE KEYSPACE / TABLE / VIEW / TYPE statements. It wasn't applied to `DROP KEYSPACE` - and it should have been....

But I don't have a good theory how a bug like https://github.com/scylladb/scylladb/pull/18819 can explain this specific test failure. Different schema operations are already linearized, so if a `CREATE TABLE test.tbl_sample_kv` succeeded, I don't see how there could possibly be any earlier `DROP KEYSPACE test` that suddenly springs to life. Unless we have a serious bug in our raft-based schema operations.</body></comment_new>
<comment_new><author>@nyh</author><body>
Another bug we could have in theory is that the Python driver's async `cql.run_async` might have a bug where it is not waiting for the schema agreement despite being told to wait. If it doesn't wait for schema agreement, this can easily explain this bug:
1. the CREATE KEYSPACE, CREATE TABLE both are sent to node A, but
2. the last INSERT INTO is sent to node B which is not yet aware of this new keyspace and table, and fails.

Copilot claims that **execute_async() does have this bug!**

> For schema-altering statements, schema agreement (meaning all nodes agree on the new schema) is important before running follow-up operations, but this is enforced only by synchronous helpers like Session.execute(), not the asynchronous version.
> If you use execute_async() for schema operations, you are responsible for checking schema agreement yourself, using [Session.check_schema_agreement()](https://docs.datastax.com/en/developer/python-driver/latest/api/cassandra/cluster/#cassandra.cluster.Session.check_schema_agreement) or (in newer code) ResponseFuture.check_schema_agreement.
> According to [a discussion on the DataStax support forum](https://support.datastax.com/s/article/Does-the-Python-Driver-for-Cassandra-Wait-for-Schema-Agreement-after-a-Schema-Change?language=en_US) and the [driver’s source code](7f12a5e1c6/cassandra/cluster.py (L487)), schema agreement is not ch...

</details>

<!-- START COPILOT CODING AGENT SUFFIX -->

- Fixes scylladb/scylladb#27501

<!-- START COPILOT CODING AGENT TIPS -->
---

Closes scylladb/scylladb#27604

* github.com:scylladb/scylladb:
  Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy"
  Initial plan
2025-12-12 13:20:49 +02:00
Botond Dénes
3d73a9781e docs: scylla-sstable: prepare for script API extract
We are about to extract the script API to a separate document. In
preparation convert soon-to-be cross-document references, so they keep
working after the extraction.
2025-12-12 13:15:48 +02:00
Piotr Smaron
982339e73f doc: audit: set audit as enabled by default 2025-12-12 09:18:54 +01:00
Piotr Smaron
d1a04b3913 Reapply "audit: enable some subset of auditing by default"
This reverts commit a5edbc7d612df237a1dd9d46fd5cecf251ccfd13.

Fixes: https://github.com/scylladb/scylladb/issues/26020
2025-12-12 09:18:54 +01:00
copilot-swe-agent[bot]
77ee7f3417 Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy"
This reverts commit 8192f45e84.

The merge exposed a bug where truncate (via drop) fails and causes Raft
errors, leading to schema inconsistencies across nodes. This results in
test_table_drop_with_auto_snapshot failures with 'Keyspace test does not exist'
errors.

The specific problematic change was in commit 19b6207f which modified
truncate_table_on_all_shards to set use_sstable_identifier = true. This
causes exceptions during truncate that are not properly handled, leading
to Raft applier fiber stopping and nodes losing schema synchronization.
2025-12-12 03:55:13 +00:00
copilot-swe-agent[bot]
0ff89a58be Initial plan 2025-12-12 03:48:12 +00:00
Yaron Kaikov
f7ffa395a8 workflows: trigger CI automatically when conflicts label is removed
Add pull_request_target event with unlabeled type to trigger-scylla-ci
workflow. This allows automatic CI triggering when the 'conflicts' label
is removed from a PR, in addition to the existing manual trigger via
comment.

The workflow now runs when:
- A user posts a comment with '@scylladbbot trigger-ci' (existing)
- The 'conflicts' label is removed from a PR (new)

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-84

Closes scylladb/scylladb#27521
2025-12-11 16:48:06 +02:00
Piotr Smaron
3fa3b920de Update CODEOWNERS to remove redundant entries
Removing myself as I have no maintainer's permissions to review the code

Closes scylladb/scylladb#27576
2025-12-11 16:47:08 +02:00
Botond Dénes
e7ca52ee79 Merge 'api: storage_service/tablets/repair: disable incremental repair by default' from Benny Halevy
Change the default incremental_mode to `disabled` due to https://github.com/scylladb/scylladb/issues/26041 and https://github.com/scylladb/scylladb/issues/27414

** Backport to 2025.4 where 611918056a was introduced **

Closes scylladb/scylladb#27530

* github.com:scylladb/scylladb:
  api: storage_service/tablets/repair: disable incremental repair by default
  docs: nodetool-commands: cluster: repair: fix incremental-mode example
2025-12-11 15:23:09 +02:00
Botond Dénes
730eca5dac Merge 'Remove noexcept from storage_group and table functions to allow exception propagation' from null
Fixed a critical bug where `storage_group::for_each_compaction_group()` was incorrectly marked `noexcept`, causing `std::terminate` when actions threw exceptions (e.g., `utils::memory_limit_reached` during memory-constrained reader creation).

**Changes made:**
1. Removed `noexcept` from `storage_group::for_each_compaction_group()` declaration and implementation
2. Removed `noexcept` from `storage_group::compaction_groups()` overloads (they call for_each_compaction_group)
3. Removed `noexcept` from `storage_group::live_disk_space_used()` and `memtable_count()` (they call compaction_groups())
4. Kept `noexcept` on `storage_group::flush()` - it's a coroutine that automatically captures exceptions and returns them as exceptional futures
5. Removed `noexcept` from `table_load_stats()` functions in base class, table, and storage group managers

**Rationale:**

As noted by reviewers, there's no reason to kill the server if these functions throw. For coroutines returning futures, `noexcept` is appropriate because Seastar automatically captures exceptions and returns them as exceptional futures. For other functions, proper exception handling allows the system to recover gracefully instead of terminating.

Fixes #27475

Closes scylladb/scylladb#27476

* github.com:scylladb/scylladb:
  replica: Remove unnecessary noexcept
  replica: Remove noexcept from compaction_groups() functions
  replica: Remove noexcept from storage_group::for_each_compaction_group
2025-12-11 15:17:35 +02:00
Benny Halevy
c8cff94a5a api: storage_service/tablets/repair: disable incremental repair by default
Change the default incremental_mode to `disabled` due to
https://github.com/scylladb/scylladb/issues/26041 and
https://github.com/scylladb/scylladb/issues/27414

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-11 14:25:21 +02:00
Benny Halevy
5fae4cdf80 docs: nodetool-commands: cluster: repair: fix incremental-mode example
There is no 'regular' incremental mode anymore.
The example seems have meant 'disabled'.

Fixes #27587

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-11 14:25:11 +02:00
Marcin Maliszkiewicz
8bbcaacba1 auth: always catch by const reference
This is best practice.

Closes scylladb/scylladb#27525
2025-12-11 12:42:30 +01:00
Yaron Kaikov
3dfa5ebd7f Add JIRA issue validation to backport PR fixes check
Extend the Fixes validation pattern to also accept JIRA issue references
(format: [A-Z]+-\d+) in addition to GitHub issue references. This allows
backport PRs to reference JIRA issues in the format 'Fixes: PROJECT-123'.

Fixes: https://github.com/scylladb/scylladb/issues/27571

Closes scylladb/scylladb#27572
2025-12-11 12:23:16 +02:00
Avi Kivity
24264e24bb Revert "repair: Add tablet repair progress report support"
This reverts commit faad0167d7. It causes
a regression in

test_two_tablets_concurrent_repair_and_migration_repair_writer_level

in debug mode (with ~5%-10% probability).

Fixes #27510.

Closes scylladb/scylladb#27560
2025-12-11 12:18:11 +02:00
Nadav Har'El
0c64e3be9a Merge 'Unify and fix rjson string and string_view conversions' from Marcin Maliszkiewicz
This patch-set consolidates and corrects rjson string conversion handling.
It removes unnecessary string copies, ensures proper length usage and
replaces ad-hoc conversions with consistent helper functions.

Overall, the changes make rjson string handling safer, faster, and more uniform across the codebase.

Backport: no, it's a refactor

Closes scylladb/scylladb#27394

* github.com:scylladb/scylladb:
  fix rjson::value to bytes conversion with missing GetStringLength call
  alternator: change type from string to string_view in should_add_capacity
  fix rjson::value to string_view conversion with missing GetStringLength call
  use rjson::to_string_view when rjson::value gets converted using GetStringLength
  use rjson::to_sstring and rjson::to_string for various string conversions
  utils: use rjson document wrapper in instance_profile_credentials_provider::parse_creds
  utils: move rjson::to_string_view func to string related place
  utils: add to_sstring and to_string rjson helper
2025-12-11 12:05:41 +02:00
Nadav Har'El
b3b0860e7c test/alternator: add reproducer for bug with storing invalid values
This patch adds a reproducer for a long-known bug, #8070, where
Alternator can store invalid values which are just blindly stored as
JSON, and we will only see the failure when reading the item back -
and either the client will fail to parse it, or sometimes even Alternator's
own code (e.g., FilterExpression) will fail to parse it. The right
behavior is to fail the write - not the read.

The included test checks writing different kinds of invalid values using
PutItem, UpdateItem, and BatchWriteItem. The new tests pass on DynamoDB,
but fail on Alternator so marked as "xfail".

Refs #8070.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-11 11:58:22 +02:00
Nadav Har'El
db15c212a6 test/alternator: reproducer for issue 27375
This patch adds a reproducer for issue #27375, where even with
alternator_streams_increased_compatibility set to true, if an attribute
is set to the same value it had but using a different JSON
representation - a Alternator Streams event is unduly produced.
For example, if a map {'dog': 1, 'cat': 2} is changed to
{'cat': 2, 'dog': 1}, this non-change should not be reported.

The new test added in this patch passes on DynamoDB (an event
is not generated) but fails on Alternator (an event is generated),
so the new test is marked with xfail.

Refs #27375.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-11 11:34:19 +02:00
Nadav Har'El
3595941020 utils/rjson: fix error messages from rjson::parse()
rjson::parse() when parsing JSON stored in a chunked_content (a vector
of temporary buffers) failed to initialize its byte counter to 0,
resulting in garbage positions in error messages like:

  Parsing JSON failed: Missing a name for object member. at 1452254

These error messages were most noticable in Alternator, which parses
JSON requests using a chunked_content, and reports these errors back
to the user.

The fix is trivial: add the missing initialization of the counter.

The patch also adds a regression test for this bug - it sends a JSON
corrupt at position 1, and expect to see "at 1" and not some large
random number.

Fixes #27372

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-11 11:17:01 +02:00
Nadav Har'El
102516a787 test/alternator: extract get_signed_request() to util.py
get_signed_request() started in test_manual_requests.py as a way to sign
a manually-created DynamoDB-API request - for sending requests that boto3
can't.

Over time, we started to use this function in additional test files, and
it's about time to move it to util.py - which is more natural to import
from multiple files.

This patch also adds a new function, manual_request(), which combines
get_signed_request() and actually sending the request via
requests.post(). New tests should prefer it, because it's easier to use.
We'll use the new function in tests that we add in the next patches.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-11 11:16:42 +02:00
Marcin Maliszkiewicz
d5b63df46e transport: remove redundant futurize_invoke from counted data sink and source
Closes scylladb/scylladb#27526
2025-12-11 10:32:16 +03:00
Dario Mirovic
f545ed37bc test: dtest: audit_test.py: fix audit error log detection
`test_insert_failure_doesnt_report_success` test in `test/cluster/dtest/audit_test.py`
has an insert statement that is expected to fail. Dtest environment uses
`FlakyRetryPolicy`, which has `max_retries = 5`. 1 initial fail and 5 retry fails
means we expect 6 error audit logs.

The test failed because `create keyspace ks` failed once, then succeeded on retry.
It allowed the test to proceed properly, but the last part of the test that expects
exactly 6 failed queries actually had 7.

The goal of this patch is to make sure there are exactly 6 = 1 + `max_retries` failed
queries, counting only the query expected to fail. If other queries fail with
successful retry, it's fine. If other queries fail without successful retry, the test
will fail, as it should in such situations. They are not related to this expected
failed insert statement.

Fixes #27322

Closes scylladb/scylladb#27378
2025-12-11 10:17:07 +03:00
Benny Halevy
5f13880a91 utils: error_injection: wait_for_message: print injection_name and caller source_location on timeout
When waiting for the condition variable times out
we call on_internal_error, but unfortunately, the backtrace
it generates is obfuscated by
`coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume`.

To make the log more useful, print the error injection name
and the caller's source_location in the timeout error message.

Fixes #27531

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#27532
2025-12-10 23:25:54 +01:00
Calle Wilund
8c4ac457af test::cluster::dtest::tools::files: Remove file
This contained only one routine; `corrupt_file`, which is
highly problematic, and not used. If you want to "corrupt" a
file, it should be done controlled, not at random.
2025-12-10 15:37:04 +01:00
Calle Wilund
e48170ca8e commitlog_replay: Handle fully corrupt files same as partial corruption.
Fixes #26744

If a segment to replay is broken such that the main header is not zero,
but still broken, we throw header_checksum_error. This was not handled in
replayer, which grouped this into the "user error/fundamental problem"
category. However, assuming we allow for "real" disk corruption, this should
really be treated same as data corruption, i.e. reported data loss, not
failure to start up.

The `test_one_big_mutation_corrupted_on_startup` test accidentally sometimes
provoked this issue, by doing random file wrecking, which on rare occasions
provoked this, and thus failed test due to scylla not starting up, instead
of loosing data as expected.

Changed test to consistently cause this exact error instead.
2025-12-10 15:37:04 +01:00
Andrzej Jackowski
11ad32c85e test/boost: add too_long_password to auth_passwords_test
The test documents the current behavior of hashing algorithms that
fail if the passphrase has 512 bytes or more.

Moreover, it documents the behavior of the current bcrypt
implementation that compares only the first 72 bytes of the password.
Although we don't typically use bcrypt for password hashing, it is
possible to insert such a hash using
`CREATE ROLE ... WITH HASHED PASSWORD ...`.

Refs: scylladb/scylladb#26842
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
4c8c9cd548 test/boost: add same_hashes_as_crypt_r to auth_passwords_test
The test verifies that the old and new implementation of SHA-512
hashing returns exactly the same values.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
98f431dd81 auth: utils: add yielding to crypt_sha512
This change allows yielding during hashing computations to prevent
stalls.

The performance of this solution was compared with the previous
implementation that used one alien thread and the implementation
after the alien thread was reverted. The results (median) of
`perf-cql-raw` with `--connection-per-request 1 --smp 10` parameters
are as follows:
 - Alien thread: 41.5 new connections/s per shard
 - Reverted alien thread: 244.1 new connections/s per shard
 - This commit (yielding in hashing): 198.4 new connections/s per shard

The alien thread is limited by a single-core hashing throughput,
which is roughly 400-500 hashes/s in the test environment. Therefore,
with smp=10, the throughput is below 50 hashes/s, and the difference
between the alien thread and other solutions further increases with
higer smp.

The roughly 20% performance deterioration compared to
the old implementation without the alien thread comes from the fact
that the new hashing algorithm implemented in `utils/crypt_sha512.cc`
performs an expensive self-verification and stack cleanup.

On the other hand, with smp=10 the current implementation achieves
roughly 5x higher throughput than the alien thread. In addition,
due to yielding added in this commit, the algorithm is expected
to provide similar protection from stalls as the alien thread did.
In a test that in parallel started a cassandra-stress workload and
created thousands of new connections using python-driver, the values of
`scylla_reactor_stalls_count` metric were as follows:
 - Alien thread: 109 stalls/shard total
 - Reverted alien thread: 13186 stalls/shard total
 - This commit (yielding in hashing): 149 stalls/shard total

Similarly, the `scylla_scheduler_time_spent_on_task_quota_violations_ms`
values were:
 - Alien thread: 1087 ms/shard total
 - Reverted alien thread: 72839 ms/shard total
 - This commit (yielding in hashing): 1623 ms/shard total

To summarize, yielding during hashing computations achieves similar
throughput to the old solution without the alien thread but also
prevents stalls similarly to the alien thread.

Fixes: scylladb/scylladb#26859
Refs: scylladb/scylla-enterprise#5711
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
4ffdb0721f auth: change return type of passwords::check to future
Introduce a new `passwords::hash_with_salt_async` and change the return
type of `passwords::check` to `future<bool>`. This enables yielding
during password computations later in this patch series.

The old method, `hash_with_salt`, is marked as deprecated because
new code should use the new `hash_with_salt_async` function.
We are not removing `hash_with_salt` now to reduce the regression risk
of changing the hashing implementation—at least the methods that change
persistent hashes (CREATE, ALTER) will continue to use the old hashing
method. However, in the future, `hash_with_salt` should be entirely
removed.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
775906d749 auth: remove code duplication in verify_scheme
Refactoring: create a new function `verify_hashing_output` to reuse
code in `hash_with_salt` and `verify_scheme`. The change is introduced
to facilitate verification of hashing output when the implementation
is extended later in this patch series.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
11eca621b0 test/boost: coroutinize auth_passwords_test
This commit prepares `auth_passwords_test` for using coroutines,
because later in this patch series `auth::passwords::check` and other
similar functions will return Seastar futures.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
d7818b56df utils: coroutinize crypt_sha512
Change `sha512crypt` and `__crypt_sha512` to coroutines to allow
yielding during hash computations later in this patch series.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
033fed5734 utils: make crypt_sha512.cc to compile
The purpose of this change is to allow the usage of Seastar futures in
crypt_sha512 later in this patch series.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
c6c30b7d0a utils: license: import crypt_sha512.c from musl to the project
This patch imports the `crypt_sha512.c` file from the musl library.
We need it to incorporate yielding in the `crypt_r` function to avoid
reactor stalls during long hashing computations.

Before this patch series, ScyllaDB used SHA-512 hashing provided
by the `crypt_r` function, which in our case meant using the implementation
from the `libxcrypt` library. Adding yielding to this `libxcrypt`
implementation is problematic, both due to licensing (LGPL) and because the
implementation is split into many functions across multiple files. In
contrast, the SHA-512 implementation from `musl libc` has a more
permissive license and is concise, which makes it easier to incorporate
into the ScyllaDB codebase.

Both `crypt_sha512.c` and musl license are obtained from
git.musl-libc.org:
 - https://git.musl-libc.org/cgit/musl/tree/src/crypt/crypt_sha512.c
 - https://git.musl-libc.org/cgit/musl/tree/COPYRIGHT

Import commit:
  commit 1b76ff0767d01df72f692806ee5adee13c67ef88
  Author: Alex Rønne Petersen <alex@alexrp.com>
  Date:   Sun Oct 12 05:35:19 2025 +0200

  s390x: shuffle register usage in __tls_get_offset to avoid r0 as address

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
5afcec4a3d Revert "auth: move passwords::check call to alien thread"
The alien thread was a solution for reactor stalls caused by indivisible
password‑hashing tasks (scylladb/scylladb#24524). However, because
there is only one alien thread, overall hashing throughput was reduced
(see, e.g., scylladb/scylla-enterprise#5711). To address this,
the alien‑thread solution is reverted, and a hashing implementation
with yielding will be introduced later in this patch series.

This reverts commit 9574513ec1.
2025-12-10 15:36:09 +01:00
Calle Wilund
9b5f3d12a3 test::pylib::suite::base: Split options.name test specifier only once
For some arcane reason, we split optional the test pattern given to
test.py twice across '::' to get the file + case specifiers later given
to pytest etc. This means that for a test with a class group (such as some
migrated dtests), we cannot really specify the exact test to run
(pattern <file>::<class>::test).

Simply splitting only on first '::' fixes this. Should not affect any
other tests.
2025-12-10 15:35:12 +01:00
Tomasz Grabiec
0e51a1f812 replica: Remove unnecessary noexcept
Can potentially lead to unnecessary abort.

compaction_groups() and for_each_compaction_group() can throw.

Co-authored-by: bhalevy <20910904+bhalevy@users.noreply.github.com>
2025-12-10 14:51:35 +01:00
Tomasz Grabiec
8b807b299e replica: Remove noexcept from compaction_groups() functions
They can throw during merge, when the number of compaction groups
is higher than 3.

Callers can deal with that, so we shouldn't abort.
2025-12-10 14:48:23 +01:00
Tomasz Grabiec
07ff659849 replica: Remove noexcept from storage_group::for_each_compaction_group
They don't really have to be noexcept.

And "action" may actually throw, leading to abort.

It was observed to throw when creating memtable readers:

terminate called after throwing an instance of 'utils::memory_limit_reached'
   what():  kill limit triggered on semaphore sl:users by permit xxx
Aborting on shard 4, in scheduling group sl:users.

std::terminate() at ??:0
__clang_call_terminate at main.cc:0
replica::storage_group::for_each_compaction_group(std::function<void (seastar::lw_shared_ptr<replica::compaction_group> const&)>) const at ./replica/table.cc:920
 (inlined by) replica::table::add_memtables_to_reader_list(std::vector<mutation_reader, std::allocator<mutation_reader>>&, seastar::lw_shared_ptr<schema const> const&, reader_permit const&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr const&, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>, std::function<void (unsigned long)>) const at ./replica/table.cc:196
 (inlined by) replica::table::make_reader_v2(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at ./replica/table.cc:243
 (inlined by) replica::table::as_mutation_source() const::$_0::operator()(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at ./replica/table.cc:3673
 (inlined by) mutation_reader std::__invoke_impl<mutation_reader, replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>>(std::__invoke_other, replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>&&, reader_permit&&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr&&, seastar::bool_class<streamed_mutation::forwarding_tag>&&, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>&&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:61
 (inlined by) std::enable_if<is_invocable_r_v<mutation_reader, replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>>, mutation_reader>::type std::__invoke_r<mutation_reader, replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>>(replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>&&, reader_permit&&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr&&, seastar::bool_class<streamed_mutation::forwarding_tag>&&, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>&&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:114
 (inlined by) std::_Function_handler<mutation_reader (seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>), replica::table::as_mutation_source() const::$_0>::_M_invoke(std::_Any_data const&, seastar::lw_shared_ptr<schema const>&&, reader_permit&&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr&&, seastar::bool_class<streamed_mutation::forwarding_tag>&&, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>&&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:290
 (inlined by) std::function<mutation_reader (seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>)>::operator()(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:591
 (inlined by) mutation_source::make_reader_v2(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at ././readers/mutation_source.hh:143
query::querier_base::querier_base(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position>, query::partition_slice, mutation_source const&, tracing::trace_state_ptr, query::querier_base::querier_config) at ././querier.hh:91
 (inlined by) query::querier::querier(mutation_source const&, seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position>, query::partition_slice, tracing::trace_state_ptr, query::querier_base::querier_config) at ././querier.hh:164
 (inlined by) replica::table::query(seastar::lw_shared_ptr<schema const>, reader_permit, query::read_command const&, query::result_options, std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>> const&, tracing::trace_state_ptr, query::result_memory_limiter&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l>>>, std::optional<query::querier>*) at ./replica/table.cc:3583
replica::database::query(seastar::lw_shared_ptr<schema const>, query::read_command const&, query::result_options, std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>> const&, tracing::trace_state_ptr, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l>>>, std::variant<std::monostate, db::per_partition_rate_limit::account_only, db::per_partition_rate_limit::account_and_enforce>)::$_0::operator()(reader_permit) const at ./replica/database.cc:1533
 (inlined by) seastar::noncopyable_function<seastar::future<void> (reader_permit)>::indirect_vtable_for<replica::database::query(seastar::lw_shared_ptr<schema const>, query::read_command const&, query::result_options, std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>> const&, tracing::trace_state_ptr, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l>>>, std::variant<std::monostate, db::per_partition_rate_limit::account_only, db::per_partition_rate_limit::account_and_enforce>)::$_0>::call(seastar::noncopyable_function<seastar::future<void> (reader_permit)> const*, reader_permit) (.llvm.13537529942037499926) at ././seastar/include/seastar/util/noncopyable_function.hh:158
seastar::noncopyable_function<seastar::future<void> (reader_permit)>::operator()(reader_permit) const at ././seastar/include/seastar/util/noncopyable_function.hh:215
 (inlined by) reader_concurrency_semaphore::execution_loop() (.resume) at ./reader_concurrency_semaphore.cc:980
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242
 (inlined by) seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at ./build/release/seastar/./seastar/include/seastar/core/coroutine.hh:122
 (inlined by) seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2627
 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3099
seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3267
seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0::operator()() const at ./build/release/seastar/./seastar/src/core/reactor.cc:4591
 (inlined by) void std::__invoke_impl<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>(std::__invoke_other, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:61
 (inlined by) std::enable_if<is_invocable_r_v<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>, void>::type std::__invoke_r<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>(seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:111
 (inlined by) std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:290
std::function<void ()>::operator()() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:591

Fixes #27475

Co-authored-by: bhalevy <20910904+bhalevy@users.noreply.github.com>
2025-12-10 14:48:11 +01:00
Yaron Kaikov
d3e199984e auto-backport.py: modify instruction for making PR ready for review
Update the comment sent when PR has conflicts with clear instrauctions how to make the PR Ready for review

Fixes: https://scylladb.atlassian.net/browse/RELENG-152

Closes scylladb/scylladb#27547
2025-12-10 14:53:38 +02:00
Pavel Emelyanov
a63508a6f7 object_storage: Temporarily handle pure endpoint addresses as endpoints
To keep backward compatibility, support

- old configs -- where endpoint is just an address and port is separate.
  When it happens, format the "new" endpoint name

- lookup by address-only. If it happens, scan all endpoints and see if
  any one matches the provided address

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-10 15:33:47 +03:00
Pavel Emelyanov
ca4564e41c code: Remove dangling mentions of s3::endpoint_config
Collect some code that's unused after previous changes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-10 15:33:47 +03:00
Pavel Emelyanov
f93cafac51 docs: Update docs according to new endpoints config option format
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-10 15:33:47 +03:00
Pavel Emelyanov
a3ca4fccef object_storage: Create s3 client with "extended" endpoint name
For this, add the s3::client::make(endpoint, ...) overload that accepts
endpoint in proto://host:port format. Then it parses the provided url
and calls the legacy one, that accepts raw host string and config with
port, https bit, etc.

The generic object_storage_endpoint_param no longer needs to carry the
internal s3::endpoint_config, the config option parsing changes
respectively.

Tests, that generate the config files, and docs are updated.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-10 15:33:47 +03:00
Pavel Emelyanov
5953a89822 test: Add named constants for test_get_object_store_endpoints endpoint
names

Instead of hard-coded 'a' and 'b' here and there

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-10 15:33:46 +03:00
Pavel Emelyanov
932b008107 s3/storage: Tune config updating
Don't prepare s3::endpoint_config from generic code, jut pass the region
and iam_role_arn (those that can potentially change) to the callback.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-10 15:33:46 +03:00
Pavel Emelyanov
e47f0c6284 sstable: Shuffle args for s3_client_wrapper
Make it construct like gs_client_wrapper -- with generic endpoint param
reference and make the storage-specific casts/gets/whatever internally.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-10 15:33:46 +03:00
Nadav Har'El
8822c23ad4 Merge 'test: cqlpy: test_protocol_exceptions.py: increase cpp exceptions thr…' from Dario Mirovic
…eshold

The initial problem:

Some of the tests in test_protocol_exceptions.py started failing. The failure is on the condition that no more than `cpp_exception_threshold` happened.

Test logic:

These tests assert that specific code paths do not throw an exception anymore. Initial implementation ran a code path once, and asserted there were 0 exceptions. Sometimes an exception or several can occur, not directly related to the code paths the tests check, but those would fail the tests.

The solution was to run the tests multiple times. If there is a regression, there would be at least as many exceptions thrown as there are test runs. If there is no regression, a few exceptions might happen, up to 10 per 100 test runs. I have arbitrarily chosen `run_count = 100` and `cpp_exception_threshold = 10` values.

Note that the exceptions are counted per shard, not per code path.

The new problem:

The occassional exceptions thrown by some parts of the server now throw a bit more than before. Based on the logs linked on the issues, it is usually 12.

There are possibly multiple ways to resolve the issue. I have considered logging exceptions and parsing them. I would have to filter exception logs only for wanted exceptions. However, if a new, different exception is introduced, it might not be counted.

Another approach is to just increase the threshold a bit. The issue of throwing more exceptions than before in some other server modules should be addressed by a set of tests for that module, just like these tests check protocol exceptions, not caring who used protocol check code paths.

For those reasons, the solution implemented here is to increase `cpp_exception_threshold` to `20`. It will not make the tests unreliable, because, as mentioned, if there is a regression, there would be at least `run_count` exceptions per `run_count` test runs (1 exception per single test run).

Still, to make "background exceptions" occurence a bit more normalized, `run_count` too is doubled, from `100` to `200`. At the first glance this looks like nothing is changed, but actually doubling both run count and exception threshold here implies that the burst does not scale as much as run count, it is just that the "jitter" is bigger than the old threshold.

Also, this patch series enables debug logging for `exception` logger. This will allow us to inspect which exceptions happened if a protocol exceptions test fails again.

Fixes #27247
Fixes #27325

Issue observed on master and branch-2025.4. The tests, in the same form, exist on master, branch-2025.4, branch-2025.3, branch-2025.2, and branch-2025.1. Code change is simple, and no issue is expected with backport automation. Thus, backports for all the aforementioned versions is requested.

Closes scylladb/scylladb#27412

* github.com:scylladb/scylladb:
  test: cqlpy: test_protocol_exceptions.py: enable debug exception logging
  test: cqlpy: test_protocol_exceptions.py: increase cpp exceptions threshold
2025-12-10 10:53:30 +02:00
Marcin Maliszkiewicz
be9992cfb3 fix rjson::value to bytes conversion with missing GetStringLength call 2025-12-09 19:27:22 +01:00
Marcin Maliszkiewicz
daf00a7f24 alternator: change type from string to string_view in should_add_capacity
It avoids allocation.
2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz
62962f33bb fix rjson::value to string_view conversion with missing GetStringLength call
In some cases we unnecessarily convert to string which
causes a copy. In other we convert without calling
GetStringLength which causes iteration to dermine length
which is already known. In some cases we do even both.
This commit fixes that.
2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz
060c2f7c0d use rjson::to_string_view when rjson::value gets converted using GetStringLength
This commit is only cosmetics, changes calls to GetStringLength
into rjson::to_string_view with the same underlying implementation.
2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz
64149b57c3 use rjson::to_sstring and rjson::to_string for various string conversions
In some cases we ommit size checking which is wrong
as according to rapid json documentation strings may
contain \0 byte in the middle.
2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz
4b004fcdfc utils: use rjson document wrapper in instance_profile_credentials_provider::parse_creds
So that we can use our common utility functions.
2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz
5e38b3071b utils: move rjson::to_string_view func to string related place 2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz
225b3351fc utils: add to_sstring and to_string rjson helper
So that conversion code is common and it's easier
to avoid accidental type conversions. Additionally
according to rapid json library size must be checked
explicitly, this also avoids extra iteration in char*
to (s)string conversion.
2025-12-09 19:27:21 +01:00
Avi Kivity
80c6718ea8 build: update toolchain to Fedora 43 with clang 21.1.6
Rebase to Fedora 43 with clang 21.1 and libstdc++ 15.

Fedora container image registry moved to registry.fedoraproject.org as
it seems to be updated more regularly.

Added python3-devel to the dependencies as some packages scylla-cqlsh
depends on aren't yet available in the form of wheels for Python 3.14,
and so have to be built locally. In any case it's better to reduce
dependency on those wheels even if the ones currently missing appear
eventually.

Added libev-devel to the dependencies so that the python driver
builds correctly even if "wheels" are not published. This reduces
our dependency on the python driver's binary release schedule.
Without libev-devel, TLS does not work correctly.

We no long remove the clang and clang-libs packages. Doxygen
started depending on clang-libs, and removing them removes
doxygen, breaking the build when it looks for that. The build
will still pick up the optimized clang, since /usr/local/bin
is earlier in the path. We keep the clang package, since it allows
us to mess a little less with the directory structure.

Optimized clang binaries generates and stored in

  https://devpkg.scylladb.com/clang/clang-21.1.6-Fedora-43-aarch64.tar.gz
  https://devpkg.scylladb.com/clang/clang-21.1.6-Fedora-43-x86_64.tar.gz

With ./scripts/refresh-pgo-profiles.sh, the new compiler shows a small
performance improvement (instructions_per_op) in perf-simple-query:

clang 21:

259353.60 tps ( 64.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   35720 insns/op,   17427 cycles/op,        0 errors)
265940.08 tps ( 64.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   35725 insns/op,   17042 cycles/op,        0 errors)
262650.01 tps ( 64.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   35720 insns/op,   17240 cycles/op,        0 errors)
262881.22 tps ( 64.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   35675 insns/op,   17222 cycles/op,        0 errors)
264898.68 tps ( 64.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   35732 insns/op,   17070 cycles/op,        0 errors)
throughput:
	mean=   263144.72 standard-deviation=2528.69
	median= 262881.22 median-absolute-deviation=1753.96
	maximum=265940.08 minimum=259353.60
instructions_per_op:
	mean=   35714.47 standard-deviation=22.34
	median= 35720.38 median-absolute-deviation=10.20
	maximum=35732.14 minimum=35675.50
cpu_cycles_per_op:
	mean=   17200.12 standard-deviation=154.62
	median= 17221.70 median-absolute-deviation=129.77
	maximum=17427.33 minimum=17041.57

clang 20:

254431.39 tps ( 64.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   35883 insns/op,   17708 cycles/op,        0 errors)
259701.02 tps ( 64.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   35883 insns/op,   17351 cycles/op,        0 errors)
261166.92 tps ( 64.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   35912 insns/op,   17270 cycles/op,        0 errors)
260656.31 tps ( 64.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   35869 insns/op,   17289 cycles/op,        0 errors)
259628.13 tps ( 64.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   35946 insns/op,   17370 cycles/op,        0 errors)
throughput:
	mean=   259116.75 standard-deviation=2698.56
	median= 259701.02 median-absolute-deviation=1539.55
	maximum=261166.92 minimum=254431.39
instructions_per_op:
	mean=   35898.42 standard-deviation=30.69
	median= 35882.97 median-absolute-deviation=15.90
	maximum=35945.63 minimum=35869.02
cpu_cycles_per_op:
	mean=   17397.49 standard-deviation=178.35
	median= 17351.35 median-absolute-deviation=108.79
	maximum=17707.63 minimum=17269.68

Closes scylladb/scylladb#26773
2025-12-09 15:16:31 +02:00
Pavel Emelyanov
855b91ec20 scripts: Make PR merging check more granular
Currently we have 3 explicit checks, and some of them are configurable:
- Jenkins job being stable. Can be disabled with --force
- Whether submodule update is happenning. It's not allowed by default, and
  should be enabled with --allow-submodule option
- Target branch checking (recently merged #27249). Happens unconditionally

This PR unifies all checks in two ways.

First, each restriction can be lifted with --allow-foo options. The existing
--allow-submodule stays and two options are added:

- --allow-unstable to skip jenkins job check (like --force works now)
- --allow-any-branch to skip target branch check

Second, the --force option lifts all the known restrictions.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27294
2025-12-09 13:58:21 +02:00
Nadav Har'El
95e303faf3 Merge 'Refactor get_view_natural_endpoint' from Wojciech Mitros
With the introduction of rack-lists and the reliance of materialized views on them, the `get_view_natural_endpoint` function can be greatly simplified. When using tablets, instead of doing any index-matching, we can now pair base tables with views only in the same rack.
In this series we remove no longer needed code and reorganize the needed code for better clarity.
After the changes, the `get_view_natural_endpoint` function goes down from 245 lines to 85 lines, while the whole pairing-related text goes down from 346 lines to 239 lines.

Fixes https://github.com/scylladb/scylladb/issues/26313

Closes scylladb/scylladb#27383

* github.com:scylladb/scylladb:
  mv: replace the simple/complex rack-aware pairing with exact rack matching
  mv: split out vnode pairing code from get_view_natural_endpoint
  mv: unify self-pairing and rack-aware pairing into one bool
  mv: remove the workaround for left nodes when sending view updates
2025-12-09 13:19:13 +02:00
Nadav Har'El
8ba595e472 Merge 'alternator: fix batch writes during intranode tablet migrations' from Petr Gusev
Scylla implements `LWT` in the` storage_proxy::cas` method. This method expects to be called on a specific shard, represented by the `cas_shard` parameter. Clients must create this object before calling `storage_proxy::cas`, check its `this_shard()` method, and jump to `cas_shard.shard()` if it returns false.

The nuance is that by the time the request reaches the destination shard, the tablet may have already advanced in its migration state machine. For example, a client may acquire a `cas_shard` at the `streaming` tablet state, then submit a request to another shard via `smp::submit_to(cas_shard.shard())`. However, the new `cas_shard` created on that other shard might already be in the `write_both_read_new` state, and its `cas_shard.shard()` would not be equal to `this_shard_id()`. Such broken invariant results in an `on_internal_error` in `storage_proxy::cas`.

Clients of `storage_proxy::cas` are expected to check` cas_shard.this_shard()` and recursively jump to another shard if it returns false. Most calls to `storage_proxy::cas` already implement this logic. The only exception is `executor::do_batch_write`, which currently checks `cas_shard.this_shard()` only once. This can break the invariant if the tablet state changes more than once during the operation.

This PR fixes the issue by implementing recursive `cas_shard.this_shard()` checks in `executor::do_batch_write`. It also adds a test that reproduces the problem.

Fixes: scylladb/scylladb#27353

backport: need to be backported to 2025.4

Closes scylladb/scylladb#27396

* github.com:scylladb/scylladb:
  alternator/executor.cc: eliminate redundant dk copy
  alternator/executor.cc: release cas_shard on the original shard
  alternator/executor.cc: move shard check into cas_write
  alternator/executor.cc: make cas_write a private method
  alternator/executor.cc: make do_batch_write a private method
  alternator/executor.cc: fix indent
  test_alternator: add test_alternator_invalid_shard_for_lwt
2025-12-09 11:25:15 +02:00
Petr Gusev
608eee0357 alternator/executor.cc: eliminate redundant dk copy
A small refactoring/optimization.
2025-12-09 10:21:06 +01:00
Petr Gusev
0bcc2977bb alternator/executor.cc: release cas_shard on the original shard
Before this series, we kept the cas_shard on the original shard to
guard against tablet movements running in parallel with
storage_proxy::cas.

The bug addressed by this PR shows that this approach is flawed:
keeping the cas_shard on the original shard does not guarantee that
a new cas_shard acquired on the target shard won’t require another
jump.

We fixed this in the previous commit by checking cas_shard.this_shard()
on the target shard and continuing to jump to another shard if
necessary. Once cas_shard.this_shard() on the target shard returns
true, the storage_proxy::cas invariants are satisfied, and no other
cas_shard instances need to remain alive except the one passed
into storage_proxy::cas.
2025-12-09 10:21:06 +01:00
Petr Gusev
3a865fe991 alternator/executor.cc: move shard check into cas_write
This change ensures that if cas_shard points to a different shard,
the executor will continue issuing shard jumps until
cas_shard.this_shard() returns true. The commit simply moves the
this_shard() check from the parallel_for_each lambda into cas_write,
with minimal functional changes.

We enable test_alternator_invalid_shard_for_lwt since now it should
pass.

Fixes scylladb/scylladb#27353
2025-12-09 10:21:01 +01:00
Pavel Emelyanov
fb32e1c7ee Merge 'streaming: tablet_sstable_streamer::stream refactoring' from Ernest Zaslavsky
Refactor the way we decide the sstable belong to a tablet, fully or partially to simplify the flow and make it more readable. Also extract the logic and make it testable, add tests to cover changes

The change is purely aesthetic, no need to backport

Closes scylladb/scylladb#27101

* github.com:scylladb/scylladb:
  streaming: remove unnecessary lambda creating sstable token range
  streaming: simplify get_sstables_for_tablets logic
  streaming: switch to range-based for loop
  streaming: drop sstable skip microoptimization in tablet loop
  streaming: replace reverse iterators with reverse view in sstables scan
  streaming: return from get_sstables_for_tablets earlier
  streaming: add get_sstables_by_tablet_range tests
  test,sstables: add helper to set sstable first and last keys
  streaming: refactor get_sstables_for_tablets to make it accessible
  streaming: refactor get_sstables_for_tablets to make it testable
  streaming: refactor tablet_sstable_streamer::stream by extracting SST filtering logic
2025-12-09 10:53:57 +03:00
Patryk Jędrzejczak
b6895f0fa7 test: make test_broken_bootstrap faster
This change makes the test ~20 s faster. It's a forgotten follow-up:
https://github.com/scylladb/scylladb/pull/18927#discussion_r1627331946

Closes scylladb/scylladb#27445
2025-12-09 09:25:42 +02:00
Dario Mirovic
c30b326033 test: cqlpy: test_protocol_exceptions.py: enable debug exception logging
Enable debug logging for "exception" logger inside protocol exception tests.
The exceptions will be logged, and it will be possible to see which ones
occured if a protocol exceptions test fails.

Refs #27272
Refs #27325
2025-12-09 01:35:42 +01:00
Dario Mirovic
807fc68dc5 test: cqlpy: test_protocol_exceptions.py: increase cpp exceptions threshold
The initial problem:

Some of the tests in test_protocol_exceptions.py started failing. The failure is
on the condition that no more than `cpp_exception_threshold` happened.

Test logic:

These tests assert that specific code paths do not throw an exception anymore.
Initial implementation ran a code path once, and asserted there were 0 exceptions.
Sometimes an exception or several can occur, not directly related to the code paths
the tests check, but those would fail the tests.

The solution was to run the tests multiple times. If there is a regression, there
would be at least as many exceptions thrown as there are test runs. If there is no
regression, a few exceptions might happen, up to 10 per 100 test runs.
I have arbitrarily chosen `run_count = 100` and `cpp_exception_threshold = 10` values.

Note that the exceptions are counted per shard, not per code path.

The new problem:

The occassional exceptions thrown by some parts of the server now throw a bit more
than before. Based on the logs linked on the issues, it is usually 12.

There are possibly multiple ways to resolve the issue. I have considered logging
exceptions and parsing them. I would have to filter exception logs only for wanted
exceptions. However, if a new, different exception is introduced, it might not be
counted.

Another approach is to just increase the threshold a bit. The issue of throwing
more exceptions than before in some other server modules should be addressed by
a set of tests for that module, just like these tests check protocol exceptions,
not caring who used protocol check code paths.

For those reasons, the solution implemented here is to increase `cpp_exception_threshold`
to `20`. It will not make the tests unreliable, because, as mentioned, if there is a
regression, there would be at least `run_count` exceptions per `run_count` test runs
(1 exception per single test run).

Still, to make "background exceptions" occurence a bit more normalized, `run_count` too
is doubled, from `100` to `200`. At the first glance this looks like nothing is changed,
but actually doubling both run count and exception threshold here implies that the
exception burst does not scale as much as run count, it is just that the "jitter" is
bigger than the old threshold.

Fixes #27247
Fixes #27325
2025-12-09 01:34:48 +01:00
Michał Jadwiszczak
51843195f7 test/boost/view_build_test: increase number of retires
Default number of retires in `eventually()` in `test_builder_with_concurrent_drop`
sometimes is not enough to observe changes in system tables on aarch64
builds.

This patch increases the number of retries to 30.

Fixes scylladb/scylladb#27370

Closes scylladb/scylladb#27493
2025-12-08 23:14:01 +02:00
Gleb Natapov
7038b8b544 test/scylla_cluster: fix the check that a process failed to start
If the process is running returncode will be Node, otherwise it will
have some value (which can be 0 s well) and the current code treats 0
as if the process is still running.

Closes scylladb/scylladb#27490
2025-12-08 18:23:29 +01:00
Tomasz Grabiec
7df610b73d sstables: Remove host id mismatch warning for sstable streaming
Tablet migration transfers sstable files without changing origin
host-id.  As it should, becuase those sstables were not written on the
destination host, and should be ignored by commit log replay.

So it's a normal situation, and it's confusing to see this warning in
logs.

Fixes #26957

Closes scylladb/scylladb#27433
2025-12-08 18:39:22 +02:00
Piotr Dulikowski
386309d6a0 Merge 'Improve the way distributed-loader constructs storage_options for backup sstables' from Pavel Emelyanov
The distributed_loader::get_sstables_from_object_store() method accepts an endpoint parameter and internally wants to get storage type for that endpoint (s3 or gcs). This is needed to construct storage_options object to create an sstable object.

To get the type, the method scans db::config option, but there's much simpler way to get one.

Code cleanup, no need to backport

Closes scylladb/scylladb#27381

* github.com:scylladb/scylladb:
  sstables_loader: Provide endpoint type for get_sstables_from_object_store()
  storage_manager: Introduce get_endpoint_type() method
  storage_manager: Split get_endpoint_client()
2025-12-08 16:55:20 +01:00
Yauheni Khatsianevich
f12adfc292 test/lwt: add counter-table support to BaseLWTTester
Extend BaseLWTTester with optional counter-table configuration and
verification, enabling randomized LWT tests over tablets
 with counters.
2025-12-08 15:57:33 +01:00
Amnon Heiman
a213e41250 scylla-node-exporter: Add ethtool to node exporter
AWS suggests following multiple network performance metrics:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-network-performance-ena.html#network-performance-metrics

This patch enables the ethtool collector with the specific list of
metrics

Ater this patch the relevant metris looks like:

$ curl http://localhost:9100/metrics |& grep ethtool
node_ethtool_bw_in_allowance_exceeded{device="ens5"} 0
node_ethtool_bw_out_allowance_exceeded{device="ens5"} 0
node_ethtool_conntrack_allowance_available{device="ens5"} 51303
node_ethtool_conntrack_allowance_exceeded{device="ens5"} 0
node_ethtool_info{bus_info="0000:00:05.0",device="ens5",driver="ena",expansion_rom_version="",firmware_version="",version="6.14.0-1015-aws"} 1
node_ethtool_linklocal_allowance_exceeded{device="ens5"} 0
node_scrape_collector_duration_seconds{collector="ethtool"} 0.001091436
node_scrape_collector_success{collector="ethtool"} 1

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes scylladb/scylladb#27358
2025-12-08 14:27:10 +02:00
Dawid Mędrek
58dc414912 test/cluster/mv: Rewrite test_view_building_scheduling_group
We rewrite the test to avoid flakiness. Instead of looking at the
metrics, we make a trade-off and start depending on a less reliable
mechanism -- logs. We grep all relevant messages printed by Scylla
in TRACE mode and make sure that they were all printed from a context
using the streaming scheduling group.

Although it's a "less proper" way of testing, it should be much more
dependable and avoid flakiness.

Fixes scylladb/scylladb#25957

Closes scylladb/scylladb#26656
2025-12-08 14:24:25 +02:00
Ferenc Szili
d883ff2317 test: fix flakyness caused by TRUNCATE retries
The test test_truncate_during_topology_change tests TRUNCATE TABLE while
bootstrapping a new node. With tablets enabled TRUNCATE is a global
topology operation which needs to serialize with boostrap.

When TRUNCATE TABLE is issued, it first checks if there is an already
queued truncate for the same table. This can happen if a previous
TRUNCATE operation has timed out, and the client retried. The newly
issued truncate will only join the queued one if it is waiting to be
processed, and will fail immediatelly if the TRUNCATE is already being
processed.

In this test, TRUNCATE will be retried after a timeout (1 minute) due to
the default retry policy, and will be retried up to 3 times, while the
bootstrap is delayed by 2 minutes. This means that the test can validate
the result of a truncate which was started after bootstrap was
completed.

Because of the way truncate joins existing truncate operations, we can
also have the following scenario:
- TRUNCATE times out after one minute because the new node is being
  bootstrapped
- the client retries the TRUNCATE command which also times out after 1m
- the third attempt is received during TRUNCATE being processed which
  fails the test

This patch changes the retry policy of the TRUNCATE operation to
FallthroughRetryPolicy which guarantees that TRUNCATE will not be
retried on timeout. It also increases the timeout of the TRUNCATE from 1
to 4 minutes. This way the test will actually validate the performance
of the TRUNCATE operation which was issued during bootstrap, instead of
the subsequent, retried TRUNCATEs which could have been issued after the
bootstrap was complete.

Fixes: #26347

Closes scylladb/scylladb#27245
2025-12-08 14:13:26 +02:00
dependabot[bot]
1f777da863 build(deps): bump sphinx-scylladb-theme from 1.8.9 to 1.8.10 in /docs
Bumps [sphinx-scylladb-theme](https://github.com/scylladb/sphinx-scylladb-theme) from 1.8.9 to 1.8.10.
- [Release notes](https://github.com/scylladb/sphinx-scylladb-theme/releases)
- [Commits](https://github.com/scylladb/sphinx-scylladb-theme/commits)

---
updated-dependencies:
- dependency-name: sphinx-scylladb-theme
  dependency-version: 1.8.10
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Closes scylladb/scylladb#27468
2025-12-08 13:40:51 +02:00
Asias He
faad0167d7 repair: Add tablet repair progress report support
This patch adds tablet repair progress report support so that the user
could use the /task_manager/task_status API to query the progress.

In order to support this, a new system table is introduced to record the
user request related info, i.e, start of the request and end of the
request.

The progress is accurate when tablet split or merge happens in the
middle of the request, since the tokens of the tablet are recorded when
the request is started and when repair of each tablet is finished. The
original tablet repair is considered as finished when the finished
ranges cover the original tablet token ranges.

After this patch, the /task_manager/task_status API will report correct
progress_total and progress_completed.

Fixes #22564
Fixes #26896

Closes scylladb/scylladb#26924
2025-12-08 13:35:19 +02:00
Andrei Chekun
0115a21b9a test.py: fail test when timeout reached for boost test
There is a bug in current pytest's boost implementation. When timeout
reached process will be killed, but it was not correctly propagated,
that lead to a false positive result. This will fail test case when
timeout for the process is reached.
This is to prevent issues like this https://github.com/scylladb/scylladb/issues/27237

Closes scylladb/scylladb#27463
2025-12-08 11:49:46 +01:00
Ernest Zaslavsky
71834ce7dd streaming: remove unnecessary lambda creating sstable token range
The `sstable_token_range` lambda was only used once to create a token
range for an SSTable. Inline the construction directly where needed,
removing the extra lambda. This simplifies the code without changing
behavior.
2025-12-08 12:30:24 +02:00
Ernest Zaslavsky
df21112c39 streaming: simplify get_sstables_for_tablets logic
Remove the use of the `overlaps` helper and unnest nested conditionals
in get_sstables_for_tablets. Straightforward `before` and `after` checks
are sufficient to decide how each SSTable should be handled.
2025-12-08 12:30:24 +02:00
Ernest Zaslavsky
bd339cc4d8 streaming: switch to range-based for loop
Replace the explicit iterator loop with a range-based for loop. This
simplifies the code, enforces constness, and avoids the unnecessary
use of postfix increment. The behavior remains unchanged,but
readability and maintainability are improved.
2025-12-08 12:30:23 +02:00
Ernest Zaslavsky
91bf23eea1 streaming: drop sstable skip microoptimization in tablet loop
Remove the microoptimization that advanced over SSTables ending before a
tablet range. This approach is misleading since SSTables are not sorted
by their end token, and the extra logic adds complexity with little to
no benefit. The streaming path here is not performance‑critical, so the
simpler loop is preferable.
2025-12-08 12:30:23 +02:00
Ernest Zaslavsky
f925ed176b streaming: replace reverse iterators with reverse view in sstables scan
Use a reverse view over the SSTables vector instead of reverse iterators.
This avoids awkward rbegin/rend usage and the mental overhead of tracking
inverted sort order. With a view, we can use standard begin/end iteration
while preserving the intended scan direction.
2025-12-08 12:30:23 +02:00
Ernest Zaslavsky
68dcd1b1b2 streaming: return from get_sstables_for_tablets earlier
Check if tablets or sstables list is empty and if so, return immediately
2025-12-08 12:30:23 +02:00
Ernest Zaslavsky
6fd5160947 streaming: add get_sstables_by_tablet_range tests
Add a comprehensive test suite that exercises various combinations of
SSTable containment within tablet ranges. These cases cover boundary
conditions, partial overlaps, and full containment to validate all
recent changes made to `get_sstables_by_tablet_range`.
2025-12-08 12:30:23 +02:00
Ernest Zaslavsky
3fc914ca59 test,sstables: add helper to set sstable first and last keys
Introduce a utility helper to set the first and last decorated keys on
an SSTable. This is intended for testing purposes, making it easier to
construct SSTables with defined boundaries in unit tests.
2025-12-08 12:30:23 +02:00
Ernest Zaslavsky
6ef7ad9b5a streaming: refactor get_sstables_for_tablets to make it accessible
Create `get_sstables_for_tablets_for_tests` friend free function
for testing purposes. Adding this free function allows
direct testing without requiring the full streamer context.
2025-12-08 12:30:23 +02:00
Ernest Zaslavsky
581b8ace83 streaming: refactor get_sstables_for_tablets to make it testable
Make the `get_sstables_for_tablets` member function `static`. This
is a step toward improved testability, allowing the function to be
invoked directly without requiring a full instance of the streamer.
2025-12-08 12:30:23 +02:00
Pavel Emelyanov
8192f45e84 Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy
This change adds a new option to the REST api and correspondingly, to scylla nodetool: use_sstable_identifier.
When set, we use the sstable identifier, if available, to name each sstable in the snapshots directory
and the manifest.json file, rather than using the sstable generation.

This can be used by the user (e.g. Scylla Manager) for global deduplication with tablets, where an sstable
may be migrated across shards or across nodes, and in this case, its generation may change, but its
sstable identifier remains sstable.

Currently, Scylla manager uses the sstable generation to detect sstables that are already backed up to
object storage and exist in previous backed up snapshots.
Historically, the sstable generation was guaranteed to be unique only per table per node,
so the dedup code currently checks for deduplication in the node scope.

However, with tablet migration, sstables are renamed when migrated to a different shard,
i.e. their generation changes, and they may be renamed when migrated to another node,
but even if they are not, the dedup logic still assumes uniqueness only within a node.

To address both cases, we keep the sstable_id stable throughout the sstable life cycle (since 3a12ad96c7).
Given the globally unique sstable identifier, scylla manager can now detect duplicate sstables
in a wider scope.  This can be cluster-wide, but we practically need only rack-wide deduplication
or dc-wide, as tablets are migrated across racks only in rare occasions (like when converting from a
numerical replication factor to a rack list containing a subset of the available racks in a datacenter).

Fixes #27181

* New feature, no backport required

Closes scylladb/scylladb#27184

* github.com:scylladb/scylladb:
  database: truncate_table_on_all_shards: set use_sstable_identifier to true
  nodetool: snapshot: add --use-sstable-identifier option
  api: storage_service: take_snapshot: add use_sstable_identifier option
  test: database_test: add snapshot_use_sstable_identifier_works
  test: database_test: snapshot_works: add validate_manifest
  sstable: write_scylla_metadata: add random_sstable_identifier error injection
  table: snapshot_on_all_shards: take snapshot_options
  sstable: add get_format getter
  sstable: snapshot: add use_sstable_identifier option
  db: snapshot_ctl: snapshot_options: add use_sstable_identifier options
  db: snapshot_ctl: move skip_flush to struct snapshot_options
2025-12-08 12:56:12 +03:00
Petr Gusev
c6eec4eeef alternator/executor.cc: make cas_write a private method
We will need to access executor::_stats field from cas_write. We could
pass it as a paramter, but it seems simpler to just make cas_write
and instance method too.
2025-12-08 10:29:54 +01:00
Petr Gusev
9bef142328 alternator/executor.cc: make do_batch_write a private method
We will need to access executor::_stats field on other shards.
2025-12-08 10:29:54 +01:00
Petr Gusev
74bf24a4a7 alternator/executor.cc: fix indent 2025-12-08 10:29:28 +01:00
Petr Gusev
e60bcd0011 test_alternator: add test_alternator_invalid_shard_for_lwt
This test reproduces scylladb/scylladb#27353 using two injection
points. First, the test triggers an intra-node tablet migration and
suspends it at the streaming stage using the
intranode_migration_streaming_wait injection. Next, it enables the
alternator_executor_batch_write_wait injection, which suspends a
batch write after its cas_shard has already been created.
The test then issues several batch writes and waits until one of them
hits this injection on the destination shard. At this point, the
cas_shard.erm for that write is still in the streaming state,
meaning the executor would need to jump back to the source shard.
The test then resumes the suspended tablet migration, allowing it to
update the ERM on the source shard to write_both_read_new. After that,
the test releases the suspended batch write and expects it to perform
two shard jumps: first from the destination to the source shard, and
then again back to the source shard.

This commit adds the alternator_executor_batch_write_wait injection to
alternator/executor.cc. Coroutines are intentionally avoided in the
parallel_for_each lambda to prevent unnecessary coroutine-frame
allocations.
2025-12-08 10:29:28 +01:00
Avi Kivity
45c16553eb Revert "Update tools/cqlsh submodule"
This reverts commit ff1b212319. In this
commit, the python driver was updated to 3.29.6. That version has a
serious flaw - it rejects compression=None settings [1] which
cqlsh (legitimately) uses in copyutil.py.

The reason this hasn't caused numerous continuous integration failures
is that the submodule update commit did not update the frozen toolchain,
so the build was effectively running with an older version of the driver.

Fix by reverting the change. This allows us to regenerate the frozen
toolchain when we need to.

Reverted changes:

* tools/cqlsh 2240122...6badc99 (2):
  > Update scylla-driver version to 3.29.6
  > Revert "Migrate workflows to Blacksmith"

[1] 78f554236f

Closes scylladb/scylladb#27473
2025-12-08 08:50:52 +02:00
Nadav Har'El
c984f557ef Merge 'alternator: eliminate cross shard ::free for do_batch_write' from Petr Gusev
This is an optimization follow-up [for this PR](https://github.com/scylladb/scylladb/pull/27396#issuecomment-3611410774): avoiding destruction of foreign objects on the wrong shard. Releasing objects allocated on a different shard causes their ::free calls to be executed remotely, which adds unnecessary load to the SMP subsystem.

Before this PR, a `std::vector<put_or_delete_item>` could be moved to another shard. When the vector was eventually destroyed, its ::free had to be marshalled back to the shard where the memory had originally been allocated. This change avoids that overhead by passing the vector by const reference instead.

backport: not needed, this is an optimization

Closes scylladb/scylladb#27432

* github.com:scylladb/scylladb:
  alternator/executor.cc: avoid cross-shard free
  storage_proxy: cas: take cas_request by raw reference
2025-12-07 22:54:36 +02:00
Andrei Chekun
5e83311305 test.py: switch to ThreadPoolExecutor
With python 3.14, the Process fails due to pickling issue with nodes objects.
This will eliminate this issue, so we can bump up the python version.

Closes scylladb/scylladb#27456
2025-12-07 17:37:25 +02:00
Petr Gusev
f00f7976c1 alternator/executor.cc: avoid cross-shard free
This commit is an optimization: avoiding destruction of
foreign objects on the wrong shard. Releasing objects allocated on a
different shard causes their ::free calls to be executed remotely,
which adds unnecessary load to the SMP subsystem.

Before this patch, a std::vector could be moved
to another shard. When the vector was eventually destroyed,
its ::free had to be marshalled back to the shard where the memory had
originally been allocated. This change avoids that overhead by passing
the vector by const reference instead.

The referenced objects lifetime correctness reasoning:
* the put_or_delete_item refs usages in put_or_delete_item_cas_request
are bound to its lifetime
* cas_request lifetime is bound to storage_proxy::cas future
* we don't release put_or_delete_item-s untill all storage_proxy::cas
calls are done.
2025-12-07 16:14:56 +01:00
Petr Gusev
c428645d16 storage_proxy: cas: take cas_request by raw reference
In the next commit we want to add an optimization that relies on
precise control over the lifetime of cas_request. In particular, we
want the implementation of this interface in Alternator to operate on
raw references that are guaranteed to remain valid only until the
cas() future is resolved. We already depend on the same lifetime
assumptions in cas_request when used by modification_statement.
However, these assumptions are not clearly expressed in the current
interface: cas_request is taken by shared_ptr, and nothing prevents
cas() from storing that pointer inside paxos_response_handler, which
may outlive the cas() future.

This commit fixes that by taking cas_request by raw reference. This
makes it explicit that cas() does not assume ownership of the object.
Callers must ensure that the referenced object remains valid until
the returned future is resolved.
2025-12-07 16:14:56 +01:00
Tomasz Grabiec
082342ecad Attach names to allocating sections for better debuggability
Large reserves in allocating_section can cause stalls. We already log
reserve increase, but we don't know which table it belongs to:

  lsa - LSA allocation failure, increasing reserve in section 0x600009f94590 to 128 segments;

Allocating sections used for updating row cache on memtable flush are
notoriously problematic. Each table has its own row_cache, so its own
allocating_section(s). If we attached table name to those sections, we
could identify which table is causing problems. In some issues we
suspected system.raft, but we can't be sure.

This patch allows naming allocating_sections for the purpose of
identifying them in such log messages. I use abstract_formatter for
this purpose to avoid the cost of formatting strings on the hot path
(e.g. index_reader). And also to avoid duplicating strings which are
already stored elsewhere.

Fixes #25799

Closes scylladb/scylladb#27470
2025-12-07 14:14:25 +02:00
Avi Kivity
47efbdffbc Merge 'cache, mvcc: Preempt cache update when applying range tombstone from memtable' from Tomasz Grabiec
Range tombstones are represented as entry attributes, which applies to
the interval between entries. So if a range tombstone covers many
rows, to apply it we have to update all covered entries.  In some
workloads that could be many entries, even the whole cache.  Before
the patch, we did this update without preemption, which can cause
reactor stalls in such workloads.

This scenario is already covered by mvcc_tests,
e.g. test_apply_to_incomplete_respects_continuity. And I verified that
the new preemption point is hit in the test.

perf-row-cache-update results show no significant stalls anymore (max
2ms scheduling delay, instead of previous 1.5 s):

    Generated 1124195 rows
    Memtable fill took 4179.457520 [ms], {count: 8295, 99%: 0.654949 [ms], max: 32.817176 [ms]}
    Draining...
    took 0.000616 [ms]
    cache: 2506/2948 [MB], memtable: 781/1024 [MB], alloc/comp: 1051/662 [MB] (amp: 0.630)
    update: 2874.157471 [ms], preemption: {count: 26650, 99%: 1.131752 [ms], max: 2.068762 [ms]}, cache: 3027/3973 [MB], alloc/comp: 3951/2424 [MB] (amp: 0.614), pr/me/dr 1124195/0/0

Fixes #23479
Fixes #2578

Closes scylladb/scylladb#27469

* github.com:scylladb/scylladb:
  cache, mvcc: Preempt cache update when applying range tombstone from memtable
  partition_snapshot_row_cursor: Clarify non-obvious semantic difference of range_tombstone()
  perf-row-cache-update: Add scenario with large tombstone covering many rows
2025-12-07 11:54:15 +02:00
Avi Kivity
d811eeb4ca Merge 'Make direct failure detector verb handler more efficient' from Gleb Natapov
We saw that in large clusters direct failure detector may cause large task queues to be accumulated. The series address this issue and also moves the code into the correct scheduling group.

Fixes https://github.com/scylladb/scylladb/issues/27142

Backport to all version where 60f1053087 was backported to since it should improve performance in large clusters.

Closes scylladb/scylladb#27387

* github.com:scylladb/scylladb:
  direct_failure_detector: run direct failure detector in the gossiper scheduling group
  raft: drop invoke_on from the pinger verb handler
  direct_failure_detector: pass timeout to direct_fd_ping verb
2025-12-07 11:40:26 +02:00
Marcin Maliszkiewicz
4784e39665 auth: fix ctor signature of certificate_authenticator
In b9199e8b24 we
added cache argument to constructor of authenticators
but certificate_authenticator was ommited. Class
registrator sadly only fails in runtime for such
cases.

Fixes https://github.com/scylladb/scylladb/issues/27431

Closes scylladb/scylladb#27434
2025-12-07 11:18:42 +02:00
Tomasz Grabiec
d4014b7970 Drop legacy schema support
We switched to using v3 schema tables (in system_schema keyspace) in
2017, in 9eb91bc30b.

So no system should have the old schema any more.

No need to run legacy_schema_migrator on boot.

Closes scylladb/scylladb#27420
2025-12-07 00:09:13 +02:00
Tomasz Grabiec
92b5e4d63d cache, mvcc: Preempt cache update when applying range tombstone from memtable
Range tombstones are represented as entry attributes, which applies to
the interval between entries. So if a range tombstone covers many
rows, to apply it we have to update all covered entries.  In some
workloads that could be many entries, even the whole cache.  Before
the patch, we did this update without preemption, which can cause
reactor stalls in such workloads.

This scenario is already covered by mvcc_tests,
e.g. test_apply_to_incomplete_respects_continuity. And I verified that
the new preemption point is hit in the test.

perf-row-cache-update results show no significant stalls anymore (max
2ms scheduling delay, instead of previous 1.5 s):

Generated 1124195 rows
Memtable fill took 4179.457520 [ms], {count: 8295, 99%: 0.654949 [ms], max: 32.817176 [ms]}
Draining...
took 0.000616 [ms]
cache: 2506/2948 [MB], memtable: 781/1024 [MB], alloc/comp: 1051/662 [MB] (amp: 0.630)
update: 2874.157471 [ms], preemption: {count: 26650, 99%: 1.131752 [ms], max: 2.068762 [ms]}, cache: 3027/3973 [MB], alloc/comp: 3951/2424 [MB] (amp: 0.614), pr/me/dr 1124195/0/0

Fixes #23479
Fixes #2578
2025-12-06 13:45:35 +01:00
Tomasz Grabiec
e546143fd9 partition_snapshot_row_cursor: Clarify non-obvious semantic difference of range_tombstone() 2025-12-06 01:03:10 +01:00
Tomasz Grabiec
721434054b perf-row-cache-update: Add scenario with large tombstone covering many rows
Fills memtable with rows and a tombstone which deletes all rows which
are already in cache.

Similar to raft log workload, but more extreme.

With -c1 -m4G, observed really bad performance:

update: 1711.976196 [ms], preemption: {count: 22603, 99%: 0.943127 [ms], max: 1494.571776 [ms]}, cache: 2148/2906 [MB], alloc/comp: 1334/869 [MB] (amp: 0.651), pr/me/dr 1062186/0/1062187
cache: 2148/2906 [MB], memtable: 738/1024 [MB], alloc/comp: 993/0 [MB] (amp: 0.000)

Which means that max reactor stall during cache update was 1.5 [s]
0.7 GB memtables. 2.1 GB in cache.
2025-12-06 01:03:09 +01:00
Nadav Har'El
350cbd1d66 alternator: fix typo of BatchWriteItem in comments
The DynamoDB API's "BatchWriteItem" operation is spelled like this, in
singular. Some comments incorrectly referred to as BatchWriteItems - in
plural. This patch fixes those mistakes.

There are no functional changes here or changes to user-facing documents -
these mistakes were only in code comments.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27446
2025-12-05 15:08:58 +02:00
Botond Dénes
866c96f536 Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk
This pull request adds support for calculation and storing CRC32 digests for all SSTable components.
This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure
and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream.
All important SSTable components (Index, Partitions, Rows, Summary, Filter, CompressionInfo, and TOC) are covered.
Several test cases where introduced to verify expected behaviour.

Backport is not required, it is a new feature

Fixes #20100

Closes scylladb/scylladb#27287

* github.com:scylladb/scylladb:
  sstable_test: add verification testcases of SSTable components digests persistance
  sstables: store digest of all sstable components in scylla metadata
  sstables: Add TemporaryScylla metadata component type
  sstables: Extract file writer closing logic into separate methods
  sstables: Add components_digests to scylla metadata components
  sstables: Implement CRC32 digest-only writer
2025-12-05 11:36:50 +02:00
Botond Dénes
367633270a Merge 'EAR: handle IPV6 hosts in KMIP and use shared (improved) http parser in AWS/Azure' from Calle Wilund
Fixes #27367
Fixes #27362
Fixes #27366

Makes http URL parser handle IPv6.
Makes KMIP host setup handle IPv6 hosts + use system trust if no truststore set
Moves Azure/KMS code to use shared http URL parser to avoid same regex everywhere.

Closes scylladb/scylladb#27368

* github.com:scylladb/scylladb:
  ear::kms/ear::azure: Use utils::http URL parsing
  ear::kmip_host: Handle ipv6 hosts + use system trust when not specified
  utils::http: Handle ipv6 numeric host part in URL:s
2025-12-05 10:43:07 +02:00
Asias He
e97a504775 repair: Allow min max range to be updated for repair history
It is observed that:

repair - repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: Failed to update
system.repair_history table of node d27de212-6f32-4649ad76-a9ef1165fdcb:
seastar::rpc::remote_verb_error (repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: range (minimum
token,maximum token) is not in the format of (start, end])

This is because repair checks the end of the range to be repaired needs
to be inclusive. When small_table_optimization is enabled for regular
repair, a (minimum token,maximum token) will be used.

To fix, we can relax the check of (start, end] for the min max range.

Fixes #27220

Closes scylladb/scylladb#27357
2025-12-05 10:41:25 +02:00
Anna Stuchlik
a5c971d21c doc: update the upgrade policy to cover non-consecutive minor upgrades
Fixes https://github.com/scylladb/scylladb/issues/27308

Closes scylladb/scylladb#27319
2025-12-05 10:31:53 +02:00
Guy Shtub
a0809f0032 Update integration-jaeger.rst
Fixing broken link in Jaeger Docs to ScyllaDB

Closes scylladb/scylladb#26406
2025-12-05 10:23:07 +02:00
Piotr Dulikowski
bb6e41f97a index: allow vector indexes without rf_rack_valid_keyspces
The rf_rack_valid_keyspaces option needs to be turned on in order to
allow creating materialized views in tablet keyspaces with numeric RF
per DC. This is also necessary for secondary indexes because they use
materialized views underneath. However, this option is _not_ necessary
for vector store indexes because those use the external vector store
service for querying the list of keys to fetch from the main table, they
do not create a materialized view. The rf_rack_valid_keyspaces was, by
accident, required for vector indexes, too.

Remove the restriction for vector store indexes as it is completely
unnecessary.

Fixes: SCYLLADB-81

Closes scylladb/scylladb#27447
2025-12-05 09:26:26 +02:00
Marcin Maliszkiewicz
4df6b51ac2 auth: fix cache::prune_all roles iteration
During b9199e8b24
reivew it was suggested to use standard for loop
but when erasing element it causes increment on
invalid iterator, as role could have been erased
before.

This change brings back original code.

Fixes: https://github.com/scylladb/scylladb/issues/27422
Backport: no, offending commit not released yet

Closes scylladb/scylladb#27444
2025-12-04 23:35:54 +01:00
Taras Veretilnyk
0c8730ba05 sstable_test: add verification testcases of SSTable components digests persistance
Adds a generic test helper that writes a random SSTable, reloads it, and
verifies that the persisted CRC32 digest for each component matches the
digest computed from disk. Those covers all checksummed components test cases.
2025-12-04 21:09:01 +01:00
Taras Veretilnyk
bc2e83bc1f sstables: store digest of all sstable components in scylla metadata
This change replaces plain file_writer with crc32_digest_file_writer
for all SSTable components that should be checksummed. The resulting component
digests are stored in the sstable structure and later persisted to disk
as part of the Scylla metadata component during writer::consume_end_of_stream.
2025-12-04 21:00:09 +01:00
Patryk Jędrzejczak
f4c3d5c1b7 Merge 'fix test_coordinator_queue_management flakiness' from Gleb Natapov
After 39cec4ae45 node join may fail with either "request canceled" notification or (very rarely) because it was banned. Depend on timing. The series fixes the test to check for both possibilities.

Fixes #27320

No need to backport since the flakiness is in the mater only.

Closes scylladb/scylladb#27408

* https://github.com/scylladb/scylladb:
  test: fix test_coordinator_queue_management flakiness
  test/pylib: allow expected_error in server_start to contain regular expression
2025-12-04 16:08:02 +01:00
Tomasz Grabiec
e54abde3e8 Merge 'main: delay setup of storage_service REST API' from Andrzej Jackowski
The storage_service REST API uses `group0` internally. Before this
patch, it was possible to send an HTTP request before `group0` was
initialized, which resulted in a segmentation fault. Therefore,
this patch delays the setup of the storage_service REST API.

Additionally, `test_rest_api_on_startup` is added to reproduce the problem.

Fixes: https://github.com/scylladb/scylladb/issues/27130

No backport. It's a crash fix but possible only if a request is sent in a very specific phase of a node start.

Closes scylladb/scylladb#27410

* github.com:scylladb/scylladb:
  test: add test_rest_api_on_startup
  main: delay setup of storage_service REST API
2025-12-04 14:56:49 +01:00
Avi Kivity
9696ee64d0 database: fix overflow when computing data distribution over shards
We store the per-shard chunk count in a uint64_t vector
global_offset, and then convert the counts to offsets with
a prefix sum:

```c++
        // [1, 2, 3, 0] --> [0, 1, 3, 6]
        std::exclusive_scan(global_offset.begin(), global_offset.end(), global_offset.begin(), 0, std::plus());
```

However, std::exclusive_scan takes the accumulator type from the
initial value, 0, which is an int, instead of from the range being
iterated, which is of uint64_t.

As a result, the prefix sum is computed as a 32-bit integer value. If
it exceeds 0x8000'0000, it becomes negative. It is then extended to
64 bits and stored. The result is a huge 64-bit number. Later on
we try to find an sstable with this chunk and fail, crashing on
an assertion.

An example of the failure can be seen here: https://godbolt.org/z/6M8aEbo57

The fix is simple: the initial value is passed as uint64_t instead of int.

Fixes https://github.com/scylladb/scylladb/issues/27417

Closes scylladb/scylladb#27418
2025-12-04 14:10:53 +01:00
Michał Jadwiszczak
aa908ba99c docs/dev/view-building-coordinator: fix typos 2025-12-04 12:52:42 +01:00
Michał Jadwiszczak
529cd25c51 db/view/view_building_worker: remove unnnecessary empty lines 2025-12-04 12:52:42 +01:00
Michał Jadwiszczak
4fc5fcaec4 db/view/view_building_worker: fix typo 2025-12-04 12:52:42 +01:00
Michał Jadwiszczak
3253b05ec9 db/view/view_building_worker: avoid creating a copy of tasks map
The loop can be converted to use an iterator and avoid creating a copy
of the tasks map.
2025-12-04 12:52:41 +01:00
Michał Jadwiszczak
597a2ce5f9 db/view/view_building_worker: wrap conditionally compiled code in a scope
The code creates a local variable, so it's better to wrap it in a local
scope, to the conditionally compiled variable doesn't pollute the
external scope.
2025-12-04 12:52:41 +01:00
Michał Jadwiszczak
a5f19af050 db/view/view_building_worker: remove unnecessary CV broadcast
After scylladb/scylladb#26897 was merged, the worker doesn't use the
view building state machine CV to manage lifetime of batches, so the
broadcast is not needed.
2025-12-04 12:52:41 +01:00
Michał Jadwiszczak
b4fe565f07 db/view/view_building_worker: catch general execption in staging task registrator
In case of general exception in `view_building_worker::create_staging_sstable_tasks()`,
catch it, print it with error level and sleep 1s before retrying.
This will allow for the registrator to retry its work in case of failure
and it should be easier to detect any bugs in the method.
2025-12-04 12:52:37 +01:00
Calle Wilund
8dd69f02a8 ear::kms/ear::azure: Use utils::http URL parsing
Fixes #27367

Move to reuse shared code.
2025-12-04 11:38:41 +00:00
Calle Wilund
d000fa3335 ear::kmip_host: Handle ipv6 hosts + use system trust when not specified
Fixes #27362

The KMIP host connector should handle ipv4 connections (named or numeric).
It also should fall back to system trust when truststore is not specified.
2025-12-04 11:38:41 +00:00
Calle Wilund
4e289e8e6a utils::http: Handle ipv6 numeric host part in URL:s
Fixes #27366

A URL with numeric host part formats special in case of ipv6,
to avoid confusion with port part.
The parser should handle this.

I.e.
http://[2001:db8:4006:812::200e]:8080

v2:
* Include scheme agnostic parse + case insensitive scheme matching
2025-12-04 11:38:41 +00:00
Benny Halevy
19b6207f17 database: truncate_table_on_all_shards: set use_sstable_identifier to true
To facilitate global sstable deduplication on backup.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 11:57:39 +02:00
Benny Halevy
ff52550739 nodetool: snapshot: add --use-sstable-identifier option
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 11:57:39 +02:00
Benny Halevy
e654045755 api: storage_service: take_snapshot: add use_sstable_identifier option
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 11:57:39 +02:00
Benny Halevy
07b92a1ee8 test: database_test: add snapshot_use_sstable_identifier_works
Test that taking a snapshot with the use_sstable_identifier
option (and injecting `random_sstable_identifier`) produces
different file names in the snapshot than the original
sstable names and validate te manifest.json file respectively.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 11:57:38 +02:00
Benny Halevy
7504d10d9e test: database_test: snapshot_works: add validate_manifest
Validate the manifest.json format by loading it using rjson::parse
and then validate its contents to ensure it lists exactly the
SSTables present in the snapshot directory.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 11:55:50 +02:00
Benny Halevy
28cb300d0a sstable: write_scylla_metadata: add random_sstable_identifier error injection
To be used by a unit test in the following patch for testing
the snapshot use_sstable_identifier option.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 11:55:50 +02:00
Benny Halevy
9b3fbedc8c table: snapshot_on_all_shards: take snapshot_options
And pass the use_sstable_identifier down the stack
to the sstables layer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 11:55:50 +02:00
Benny Halevy
420fb1fd53 sstable: add get_format getter
To be used by the snapshot code in te following patch
for manufacturing a basename using the sstable_id rather
than its generation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 11:55:50 +02:00
Benny Halevy
7c62417b54 sstable: snapshot: add use_sstable_identifier option
When set to true, use the sstable_identifier as the sstable name
in the snapshot rather than its generation.

sstable::snapshot now returns the generation it used
for the sstable in the snapshot, based on the `use_sstable_identifier`
option, to be used by the upper layer generating the manifest.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 11:53:32 +02:00
Botond Dénes
9d2f7c3f52 Merge 'mv: allow setting concurrency in PRUNE MATERIALIZED VIEW' from Wojciech Mitros
The PRUNE MATERALIZED VIEW statement is performed as follows:
1. Perform a range scan of the view table from the view replicas based
on the ranges specified in the statement.
2. While reading the paged scan above, for each view row perform a read
from all base replicas at the corresponding primary key. If a discrepancy
is detected, delete the row in the view table.

When reading multiple rows, this is very slow because for each view row
we need to performe a single row query on multiple replicas.
In this patch we add an option to speed this up by performing many of the
single base row reads concurrently, at the concurrency specified in the
USING CONCURRENCY clause.

Aside from the unit test, I checked manually on a 3-node cluster with 10M rows, using vnodes. There were actually no ghost rows in the test, but we still had to iterate over all view rows and read the corresponding base rows. And actual ghost rows, if there are any, should be a tiny fraction of all rows. I compared concurrencies 1,2,10,100 and the results were:
* Pruning with concurrency 1 took total 1416 seconds
* Pruning with concurrency 2 took total 731 seconds
* Pruning with concurrency 10 took total 234 seconds
* Pruning with concurrency 100 took total 171 seconds
So after a concurrency of 10 or so we're hitting diminishing returns (at least in this setup). At that point we may be no longer bottlenecked by the reads, but by CPU on the shard that's handling the PRUNE

Fixes https://github.com/scylladb/scylladb/issues/27070

Closes scylladb/scylladb#27097

* github.com:scylladb/scylladb:
  mv: allow setting concurrency in PRUNE MATERIALIZED VIEW
  cql: add CONCURRENCY to the USING clause
2025-12-04 11:47:41 +02:00
Aleksandra Martyniuk
e3e81a9a7a repair: throw if flush failed in get_flush_time
Currently, _flush_time was stored as a std::optional<gc_clock::time_point>
and std::nullopt indicates that the flush was needed but failed. It's confusing
for the caller and does not work as expected since the _flush_time is initialized
with value (not optional).

Change _flush_time type to gc_clock::time_point. If a flush is needed but failed,
get_flush_time() throws an exception.

This was suppose to be a part of https://github.com/scylladb/scylladb/pull/26319
but it was mistakenly overwritten during rebases.

Refs: https://github.com/scylladb/scylladb/issues/24415.

Closes scylladb/scylladb#26794
2025-12-04 11:45:53 +02:00
Gleb Natapov
86dde50c0d direct_failure_detector: run direct failure detector in the gossiper scheduling group
When direct failure detector was introduces the idea was that it will
run on the same connection raft group0 verbs are running, but in
60f1053087 raft verbs were moved to run on the gossiper connection
while DIRECT_FD_PING was left where it was. This patch move it to
gossiper connection as well and fix the pinger code to run in gossiper
scheduling group.
2025-12-04 11:35:43 +02:00
Gleb Natapov
6a6bbbf1a6 raft: drop invoke_on from the pinger verb handler
Currently raft direct pinger verb jumps to shard 0 to check if group0 is
alive before replying. The verb runs relatively often, so it is not very
efficient. The patch distributes group0 liveness information (as it
changes) to all shard instead, so that the handler itself does not need
to jump to shard 0.
2025-12-04 11:35:43 +02:00
Avi Kivity
b82f92b439 main: replace p11-kit hack for trust paths override with gnutls hack
p11-kit has hardcoded paths for the trust paths. Of course, each
Linux distribution hardcodes those paths differently. As a result,
our relocatable gnutls, which uses p11-kit-trust.so to process the
trust paths, needs some overrides to select the right paths.

Currently, we use p11_kit_override_system_files(), a p11-kit API
intended for testing, but which worked well enough for our purpose,
to override the trust module configuration.

Unfortunately, starting (presumably [1]) in gnutls 3.8.11, gnutls
changed how it works with p11-kit and our override is now ignored.

This was likely unintentional, but there appears to be a better way:
instead of letting gnutls auto-load the trust module from a hacked
configuration, we load the modules outselves using
gnutls_pkcs11_init(GNUTLS_PKCS11_FLAG_MANUAL) and
gnutls_pkcs11_add_provider(). These appear to be intended for the purpose.

We communicate the paths to the scylla executable using an environment
variable. This isn't optimal, but is much easier than adding a command
line variable since there are multiple levels of command line parsing due
to the subtool mechanism.

With this, we unlock the possibility to upgrade gnutls to newer versions.

[1] aa5f15a872

Closes scylladb/scylladb#27348
2025-12-04 11:33:51 +02:00
Gleb Natapov
f00e00fde0 test: fix test_coordinator_queue_management flakiness
After 39cec4ae45 node join may fail with either "request canceled"
notification or (very rarely) because it was banned. Depend on timing.
The patch fixes the test to check for both possibilities.
2025-12-04 11:06:20 +02:00
Gleb Natapov
b0727d3f2a test/pylib: allow expected_error in server_start to contain regular expression
Currently expected_error parameter to server_start can only work with
exact matches. Change it to support regular expressions.
2025-12-04 11:06:20 +02:00
Calle Wilund
4169bdb7a6 encryption::gcp_host: Add exponential retry for server errors
Fixes #27242

Similar to AWS, google services may at times simply return a 503,
more or less meaning "busy, please retry". We rely for most cases
higher up layers to handle said retry, but we cannot fully do so,
because both we reach this code sometimes through paths that do
no such thing, and also because it would be slightly inefficient,
since we'd like to for example control the back-off for auth etc.

This simply changes the existing retry loop in gcp_host to
be a little more forgiving, special case 503 errors and extend
the retry to the auth part, as well as re-use the
exponential_backoff_retry primitive.

v2:
* Avoid backoff if refreshing credentials. Should not add latency due to this.
* Only allow re-auth once per (non-service-failure-backoff) try.
* Add abort source to both request and retry
v3:
* Include timeout and other server errors in retry-backoff
v4:
* Reorder error code handling correctly

Closes scylladb/scylladb#27267
2025-12-04 10:13:37 +02:00
Anna Stuchlik
c5580399a8 replace the Driver pages with a link to the new Drivers pages
This commit removes the now redundant driver pages from
the Scylla DB documentation. Instead, the link to the pages
where we moved the diver information is added.
Also, the links are updated across the ScyllaDB manual.

Redirections are added for all the removed pages.

Fixes https://github.com/scylladb/scylladb/issues/26871

Closes scylladb/scylladb#27277
2025-12-04 10:07:27 +02:00
Benny Halevy
1c45ad7cee db: snapshot_ctl: snapshot_options: add use_sstable_identifier options
To be used for naming sstables in the snapshot by their
sstable identifiers rather than their generation, to
facilitate global deduplication of sstables in backup.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 09:46:35 +02:00
Benny Halevy
c18133b6cb db: snapshot_ctl: move skip_flush to struct snapshot_options
Prepare for adding another option: use_sstable_identifer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 09:46:35 +02:00
Tomasz Grabiec
1d42770936 Merge 'topology_coordinator: Add barrier to cleanup_target' from Łukasz Paszkowski
Consider the following scenario:
1. A table has RF=3 and writes use CL=QUORUM
2. One node is down
3. There is a pending tablet migration from the unavailable node
   that is reverted

During the revert, there can be a time window where the pending replica
being cleaned up still accepts writes. This leads to write failures,
as only two nodes (out of four) are able to acknowledge writes.

This patch fixes the issue by adding a barrier to the cleanup_target
tablet transition state, ensuring that the coordinator switches back to
the previous replica set before cleanup is triggered.

Fixes https://github.com/scylladb/scylladb/issues/26512

It's a pre existing issue. Backport is required to all recent 2025.x versions.

Closes scylladb/scylladb#27413

* github.com:scylladb/scylladb:
  topology_coordinator: Fix the indentation for the cleanup_target case
  topology_coordinator: Add barrier to cleanup_target
  test_node_failure_during_tablet_migration: Increase RF from 2 to 3
2025-12-03 23:57:45 +01:00
Taras Veretilnyk
d287b054b9 sstables: Add TemporaryScylla metadata component type
Add TemporaryScylla component type to make atomic updates of SSTable Scylla metadata using temporary files
and atomic rename operations possible. This will be needed in further commit to rewrite metadata together with
the statistics component.
2025-12-03 23:40:10 +01:00
Szymon Wasik
4f803aad22 Improve documentation of vector search configuration parameters.
This patch adds separate group for vector search parameters in the
documentation and fixes small typos and formatting.

Fixes: SCYLLADB-77.

Closes scylladb/scylladb#27385
2025-12-03 21:02:59 +02:00
Karol Nowacki
a54bf50290 vector_search: Fix requests hanging on unreachable nodes
When a vector store node becomes unreachable, a client request sent
before the keep-alive timer fires would hang until the CQL query
timeout was reached.

This occurred because the HTTP request writes to the TCP buffer and then
waits for a response. While data is in the buffer, TCP retransmissions
prevent the keep-alive timer from detecting the dead connection.

This patch resolves the issue by setting the `TCP_USER_TIMEOUT` socket
option, which applies an effective timeout to TCP retransmissions,
allowing the connection to fail faster.

Closes scylladb/scylladb#27388
2025-12-03 21:01:43 +02:00
Nadav Har'El
06dd3b2e64 install-dependencies.sh: add zlib
Scylla uses zlib, through the header <zlib.h>, in sstable compression.
We also want to use it in Alternator for gzip-compressed requests.

We never actually required zlib explicltly in install-dependencies.sh,
we only get it through transitive dependencies. But it's better to
require it explicitly so this is what we do in this patch.

In Fedora, we use the newer, more efficient, zlib-ng which is API-
compatible with the classic zlib. Unfortunately, the Debian zlib-ng
package is *not* drop-in compatible with zlib (you need to include
a different header file <zlib-ng.h>) so we use the classic zlib.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27238
2025-12-03 19:30:36 +02:00
Łukasz Paszkowski
6163fedd2e topology_coordinator: Fix the indentation for the cleanup_target case 2025-12-03 16:37:33 +01:00
Łukasz Paszkowski
67f1c6d36c topology_coordinator: Add barrier to cleanup_target
Consider the following scenario:
1. A table has RF=3 and writes use CL=QUORUM
2. One node is down
3. There is a pending tablet migration from the unavailable node
   that is reverted

During the revert, there can be a time window where the pending replica
being cleaned up still accepts writes. This leads to write failures,
as only two nodes (out of four) are able to acknowledge writes.

This patch fixes the issue by adding a barrier to the cleanup_target
tablet transition state, ensuring that the coordinator switches back to
the previous replica set before cleanup is triggered.

Fixes https://github.com/scylladb/scylladb/issues/26512
2025-12-03 16:19:17 +01:00
Łukasz Paszkowski
669286b1d6 test_node_failure_during_tablet_migration: Increase RF from 2 to 3
The patch prepares the test for additional write workload to be
executed in parallel with node failures. With the original RF=2,
QUORUM is also 2, which causes writes to fail during node outage.

To address it, the third rack with a single node is added and the
replication factor is increased to 3.
2025-12-03 16:00:19 +01:00
Botond Dénes
b9199e8b24 Merge 'auth: use auth cache on login path' from Marcin Maliszkiewicz
Scylla currently has bad resiliency to connection storms. Nodes are easy to overload or impact their latency by unbound concurrency in making new connections on the client side. This can easily happen in bigger deployments where there are thousands of client instances, e.g. pods.

To improve resiliency we are introducing unified auth specialized cache to the system. This patch series is stage 1, where cache is used only on login path.

Dependency diagram:
```
|Authentication Layer|
            |
            v
+--------------------------------+
|          Auth Cache            |
+--------------------------------+
        ^                      |
        |                      |
        |                      v
|Raft Write Logic | | CQL Read Layer|
```

Cache invalidation is based on raft and the cache contains full content of related tables.

Ldap role manager may benefit partially as can_logic function is common  and will be cached,
but it still needs to query roles from external source.

Performance results:

For single shard connection/disconnection scenario insns/conn decreased by *5%*,
allocs/conn decreased by *23%*, tasks/conn decreased by *20%*. Results for 20 shards are very similar.

Raw data before:
```
≡ ◦ ⤖ rm -rf /tmp/scylla-data && build/release/scylla perf-cql-raw --workdir /tmp/scylla-data --smp 1 --developer-mode 1 --username cassandra --password cassandra --connection-per-request true 2> /dev/null
Running test with config: {workload=read, partitions=10000, concurrency=100, duration=5, ops_per_shard=0, auth, connection_per_request}
Pre-populated 10000 partitions
1128.55 tps (599.2 allocs/op,   0.0 logallocs/op, 145.2 tasks/op, 2586610 insns/op, 1350912 cycles/op,        0 errors)
1157.41 tps (601.3 allocs/op,   0.0 logallocs/op, 145.2 tasks/op, 2589046 insns/op, 1356691 cycles/op,        0 errors)
1167.42 tps (603.3 allocs/op,   0.0 logallocs/op, 145.2 tasks/op, 2603234 insns/op, 1360607 cycles/op,        0 errors)
1159.63 tps (605.9 allocs/op,   0.0 logallocs/op, 145.3 tasks/op, 2609977 insns/op, 1363935 cycles/op,        0 errors)
1165.12 tps (608.8 allocs/op,   0.0 logallocs/op, 145.2 tasks/op, 2625804 insns/op, 1365736 cycles/op,        0 errors)
throughput:
	mean=   1155.63 standard-deviation=15.66
	median= 1159.63 median-absolute-deviation=9.49
	maximum=1167.42 minimum=1128.55
instructions_per_op:
	mean=   2602934.31 standard-deviation=16063.01
	median= 2603234.19 median-absolute-deviation=13887.96
	maximum=2625804.05 minimum=2586609.82
cpu_cycles_per_op:
	mean=   1359576.30 standard-deviation=5945.69
	median= 1360607.05 median-absolute-deviation=4358.94
	maximum=1365736.42 minimum=1350912.10
```

Raw data after:
```
≡ ◦ ⤖ rm -rf /tmp/scylla-data && build/release/scylla perf-cql-raw --workdir /tmp/scylla-data --smp 1 --developer-mode 1 --username cassandra --password cassandra --connection-per-request true --duration 10 2> /dev/null
Running test with config: {workload=read, partitions=10000, concurrency=100, duration=10, ops_per_shard=0, auth, connection_per_request}
Pre-populated 10000 partitions
1132.09 tps (457.5 allocs/op,   0.0 logallocs/op, 115.1 tasks/op, 2432485 insns/op, 1270655 cycles/op,        0 errors)
1157.70 tps (458.4 allocs/op,   0.0 logallocs/op, 115.1 tasks/op, 2447779 insns/op, 1283768 cycles/op,        0 errors)
1162.86 tps (459.0 allocs/op,   0.0 logallocs/op, 115.1 tasks/op, 2463225 insns/op, 1291782 cycles/op,        0 errors)
1153.15 tps (460.2 allocs/op,   0.0 logallocs/op, 115.2 tasks/op, 2469230 insns/op, 1296381 cycles/op,        0 errors)
1142.09 tps (460.6 allocs/op,   0.0 logallocs/op, 115.1 tasks/op, 2478900 insns/op, 1299342 cycles/op,        0 errors)
1124.89 tps (462.5 allocs/op,   0.0 logallocs/op, 115.2 tasks/op, 2470962 insns/op, 1305026 cycles/op,        0 errors)
1156.75 tps (464.4 allocs/op,   0.0 logallocs/op, 115.1 tasks/op, 2493823 insns/op, 1305136 cycles/op,        0 errors)
1152.16 tps (466.3 allocs/op,   0.0 logallocs/op, 115.2 tasks/op, 2497246 insns/op, 1309816 cycles/op,        0 errors)
1154.77 tps (469.8 allocs/op,   0.0 logallocs/op, 115.5 tasks/op, 2571954 insns/op, 1345341 cycles/op,        0 errors)
1152.22 tps (472.4 allocs/op,   0.0 logallocs/op, 115.3 tasks/op, 2551954 insns/op, 1334202 cycles/op,        0 errors)
throughput:
	mean=   1148.87 standard-deviation=12.08
	median= 1153.15 median-absolute-deviation=7.88
	maximum=1162.86 minimum=1124.89
instructions_per_op:
	mean=   2487755.88 standard-deviation=43838.23
	median= 2478900.02 median-absolute-deviation=24531.06
	maximum=2571954.26 minimum=2432485.38
cpu_cycles_per_op:
	mean=   1304144.76 standard-deviation=22129.55
	median= 1305025.71 median-absolute-deviation=12363.25
	maximum=1345341.16 minimum=1270655.17
```

Fixes https://github.com/scylladb/scylladb/issues/18891
Backport: no, it's a new feature

Closes scylladb/scylladb#26841

* github.com:scylladb/scylladb:
  auth: use auth cache on login path
  auth: corutinize standard_role_manager::can_login
  main: auth: add auth cache dependency to auth service
  raft: update auth cache when data changes
  auth: storage_service: reload auth cache on v1 to v2 auth migration
  raft: reload auth cache on snapshot application
  service: add auth cache getter to storage service
  main: start auth cache service
  auth: add unified cache implementation
  auth: move table names to common.hh
2025-12-03 16:45:01 +02:00
Andrzej Jackowski
1ff7f5941b test: add test_rest_api_on_startup
This test verifies that REST API requests are handled properly
when a server is started or restarted. It is used to verify
the fix for scylladb/scylladb#27130, where a server failed with a
segmentation fault when `storage_service/raft_topology/reload` was
called too early.

Refs: scylladb/scylladb#27130
2025-12-03 15:35:59 +01:00
Andrzej Jackowski
3b70154f0a main: delay setup of storage_service REST API
The storage_service REST API uses `group0` internally. Before this
patch, it was possible to send an HTTP request before `group0` was
initialized, which resulted in a segmentation fault. Therefore,
this patch delays the setup of the storage_service REST API.

Fixes: scylladb/scylladb#27130
2025-12-03 15:35:54 +01:00
Pavel Emelyanov
6ae72ed134 test: Reuse S3 fixtures facilities in cqlpy/test_tools.py
Creating endpoint conf can be made with the s3_server method
Getting boto3 resource from s3_server itself is also possible

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27380
2025-12-03 16:32:54 +02:00
Michael Litvak
9213a163cb test: fix test flakiness in test_colocated_tables_gc_mode
The test executes a LWT query in order to create a paxos state table and
verify the table properties. However, after executing the LWT query, the
table may not exist on all nodes but only on a quorum of nodes, thus
checking the properties of the table may fail if the table doesn't exist
on the queried node.

To fix that, execute a group0 read barrier to ensure the table is
created on all nodes.

Fixes scylladb/scylladb#27398

Closes scylladb/scylladb#27401
2025-12-03 12:12:24 +01:00
David Garcia
d9593732b1 docs: add strict mode to control metrics validation behavior
The metrics extension now includes validation to detect missing metrics. This validation caused failures during multiversion publication because older versions did not generate all required properties.

Instead of fixing each branch, a strict mode flag was introduced to control when validation should run.

Strict mode is enabled in the workflow that validates pull requests, ensuring that new changes meet the expected metrics.

During multiversion builds, validation errors are now logged but do not raise exceptions, which prevents build failures while still providing visibility into missing data.

docs: verbose mode

docs: verbose mode

Closes scylladb/scylladb#27402
2025-12-03 14:09:08 +03:00
Anna Stuchlik
48cf84064c doc: add the upgrade guide from 2025.x to 2025.4
Fixes https://github.com/scylladb/scylladb/issues/26451

Fixes https://github.com/scylladb/scylladb/issues/26452

Closes scylladb/scylladb#27310
2025-12-03 11:18:10 +03:00
Avi Kivity
a12165761e Update seastar submodule
* seastar b5c76d6b...7ec14e83 (5):
  > Merge 'reactor: coroutinize more file related functions' from Avi Kivity
    reactor: reindent after coroutinization
    reactor: fdatasync: coroutinize
    reactor: touch_directory: coroutinize
    reactor: make_directory: coroutinize
    reactor: open_directory: coroutinize
    reactor: statvfs: coroutinize
    reactor: fstatfs: coroutinize
    reactor: file_system_at: coroutinize
    reactor: file_accessible: coroutinize
    reactor: file_size: coroutinize
    reactor: file_stat: coroutinize
  > reactor: Mark some sched-stats getters const
  > Merge 'coroutine: allocate coroutine frame in a critical section' from Avi Kivity
    coroutine: allocate coroutine frame in a critical section
    memory: add C23 free_sized, free_aligned_sized
  > coroutines: simplify execute_involving_handle_destruction_in_await_suspend()
  > coroutine: introduce try_future

Closes scylladb/scylladb#27369
2025-12-03 10:55:47 +03:00
Nadav Har'El
7dc04b033c test/cluster: fix missing racks in xfailing Alternator test
Since Alternator is now using tablets by default, it's no longer possible
to create an Alternator table on a 3-node cluster with a single rack -
you need to have 3 racks to support RF=3.

Most of the multi-node Alternator tests in test/cluster/test_alternator.py
were already fixed to use a 3-rack cluster, but one test was missed
because it was marked "xfail" so its new failure to create the table was
missed. This patch adds the missing 3-rack setup, so the xfailing test
returns to failing on the real bug - not on the table creation.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27382
2025-12-03 10:54:11 +03:00
Piotr Dulikowski
654ac9099b db/view/view_building_coordinator: skip work if no view is built
Even though that `view_building_coordinator::work_on_view_building` has
an `if` at the very beginning which checks whether the currently
processed base table is set, it only prints a message and continues
executing the rest of the function regardless of the result of the
check. However, some of the logic in the function assumes that the
currently processed base table field is set and tries to access the
value of the field. This can lead to the view building coordinator
accessing a disengaged optional, which is undefined behavior.

Fix the function by adding the clearly missing `co_await` to the check.
A regression test is added which checks that the view building state
observer - a different fiber which used to print a weird message due to
erroneus view building coordinator behavior - does not print a warning.

Fixes: scylladb/scylladb#27363

Closes scylladb/scylladb#27373
2025-12-03 09:44:28 +02:00
Andrzej Jackowski
ff1b212319 Update tools/cqlsh submodule
The motivation for the update is using newer version of scylla-driver
that supports new event type CLIENT_ROUTES_CHANGE.

* tools/cqlsh 22401228...6badc992 (2):
  > Update scylla-driver version to 3.29.6
  > Revert "Migrate workflows to Blacksmith"

Closes scylladb/scylladb#27359
2025-12-02 15:14:26 +02:00
Calle Wilund
4e7ec9333f gcp::object_storage: Include auth in exponential back-off-retry
Fixes #27268
Refs #27268

Includes the auth call in code covered by backoff-retry on server
error, as well as moves the code to use the shared primitive for this
and increase the resilience a bit (increase retry count).

v2:
* Don't do backoff if we need to refresh credentials.
* Use abort source for backoff if avail
v3:
* Include other retryable conditions in auth check

Closes scylladb/scylladb#27269
2025-12-02 15:08:49 +02:00
Gleb Natapov
82f80478b8 direct_failure_detector: pass timeout to direct_fd_ping verb
Currently direct_fd_ping runs without timeout, but the verb is not
waited forever, the wait is canceled after a timeout, this timeout
simply is not passed to the rpc. It may create a situation where the
rpc callback can runs on a destination but it is no longer waited on.
Change the code to pass timeout to rpc as well and return earlier from
the rpc handler if the timeout is reached by the time the callback is
called. This is backwards compatible since timeout is passed as
optional.
2025-12-02 14:55:20 +02:00
Botond Dénes
357f91de52 Revert "Merge 'db/config: enable ms sstable format by default' from Michał Chojnowski"
This reverts commit b0643f8959, reversing
changes made to e8b0f8faa9.

The change forgot to update
sstables_manager::get_highest_supported_format(), which results in
/system/highest_supported_sstable_version still returning me, confusing
and breaking tests.

Fixes: scylladb/scylla-dtest#6435

Closes scylladb/scylladb#27379
2025-12-02 14:38:56 +02:00
Botond Dénes
e762027943 db/config: change batchlog_replay_cleanup_after_replays default to 1
Now that batchlog cleanup is cheap, on account of memtable flush on the
system.batchlog table garbage-collecting tombstones (previous patch), we
can afford to do cleanup on each replay, keeping the memtable size small
and more importantly -- the amount of tombstones in the memtable small.
2025-12-02 14:21:26 +02:00
Botond Dénes
8edd5b80ab test/boost/batchlog_manager_test: add test for batchlog cleanup
Add more tests covering different aspects of batchlog replay, cleanup,
replay timeout and finally v1 -> v2 migration.
2025-12-02 14:21:26 +02:00
Botond Dénes
fb84b30f88 replica/mutation_dump: always set position weight for clustering positions
SELECT * FROM MUTATION_FRAGMENTS() queries have a transformed schema
(mutation-fragment schema), which is a superset of that of the queried
table's. The mutation fragment schema represents position_in_partition
of mutation fragments expressed as clustering columns. This presents
some challenges, as some position_in_partition fields are null for some
positions. This was solved by setting these clustering keys components
to bytes{}. In the process, a mistake was made: when the clustering key
is missing in the position_in_partition, the position_weight is also set
to bytes{}. This is not correct, it is possible for some positions to
have no key but to still have a position_weight. An example is
position_in_partition::before_all_clustered_rows().
Fix this by always filling in the position_weight for positions which
have region() == clustered, instead of the earlier condition on the key
presence.

This is a minor bug affecting range tombstone changes at the two
extremes: position_in_partition::{before,after}_all_clustered_rows(). In
both cases, the position_weight can be deduced by a human looking at the
results, based on the position of the range tombstone change, relative
to other fragments.
2025-12-02 14:21:26 +02:00
Botond Dénes
8545f7eedd service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/
Rename to make it more explicit where the error injection happens.
Also change how the error is injected, use the lambda overload instead
of is_enabled(), the former leaves better trace in logs, which helps
when debugging tests.
2025-12-02 14:21:26 +02:00
Botond Dénes
e52e1f842e test/lib: introduce error_injection.hh
Test-specific helpers for working with error injection. Provides an
RAII object to enable/disable error injection points, on all shards.
2025-12-02 14:21:26 +02:00
Botond Dénes
0a7df4b8ac utils/error_injection: add debug log to disable() and disable_all()
enable() and friends already has debug logs.
2025-12-02 14:21:26 +02:00
Botond Dénes
9bb8156f02 test/lib/cql_test_env: forward config to batchlog
Currently all batchlog config items are hardcoded. Make the two
important ones configurable: replay_timeout and
replay_cleanup_after_replays.
2025-12-02 14:21:26 +02:00
Botond Dénes
d1b796bc43 test/lib/cql_test_env: add batch type to execute_batch()
Allow executing logged batches too. Also add a trace log to the method.
2025-12-02 14:21:26 +02:00
Botond Dénes
1ad64731bc test/lib/cql_assertions: add with_size(predicate) overload
Sometimes, expected size is not a single number. Add predicate overload
to allow expressing more complicated expectations.
2025-12-02 14:21:26 +02:00
Botond Dénes
abadb8ebfb test/lib/cql_assertions: add source location to fail messages
For tests that contain multiple assert_that() invokations, identifying
the one that failed is very challenging. Add source location to fail
messages to allow convenient identification of the call-site.
2025-12-02 14:21:26 +02:00
Botond Dénes
54f16f9019 test/lib/cql_assertions: columns_assertions: add assert_for_columns_of_each_row()
Invoke the passed in function with a columns_assertions instance for the
current row, allowing for sweeping checks across columns of all rows.
2025-12-02 14:21:26 +02:00
Botond Dénes
b584e1e18e test/lib/cql_assertions: rows_assertions::assert_for_columns_of_row(): add index bound check 2025-12-02 14:21:26 +02:00
Botond Dénes
aa1d3f1170 test/lib/cql_assertions: columns_assertions: add T* with_typed_column() overload
To enable assertions on columns which are sometimes null.
One existing user of with_typed_column() needs adjustment, because the
previous version of with_typed_column() covered up silently for null
value, but after this patch this caused a failure.
2025-12-02 14:21:26 +02:00
Botond Dénes
e309b5dbe1 db/batchlog_manager: config: s/write_timeout/reply_timeot/
Although the value of this item is indeed derived from the write timeout
config, the name doesn't reflect what it is used for. Change it to
reflect it better.
2025-12-02 14:21:26 +02:00
Botond Dénes
846b656610 db,service: switch to system.batchlog_v2
New batchlogs are written to the batchlog_v2 table and replay also uses
the v2 table.
The content of system.batchlog is attempted to be migrated to
system.batchlog_v2 after each start of the batchlog_manager service.
The migration is retried on each replay if it fails. This is reduntant
but simple.

Batchlog cleanup now doesn't involve flushing memtables, the only
remaining user of replica/database.hh is gone, so the include is
dropped.
2025-12-02 14:21:26 +02:00
Botond Dénes
ee851266be db/system_keyspace: introduce system.batchlog_v2
Rearranges the system.batchlog schema as follows:

    CREATE TABLE system.batchlog_v2 (
        version int,
        stage int,
        shard int,
        written_at timestamp,
        id uuid,
        data blob,
        PRIMARY KEY ((version, stage, shard), written_at, id));

With the following goals:
1) Make post-replay batchlog cleanup possible with a simple
   range-tombstone. This allows dropping the individual dead batchlog
   entries, as they are shadowed by a higher level tombstone. This
   enables dropping tombstones without tombstone GC.
2) To make the above possible, introduce the stage key component:
   batchlog entries that fail the first replay attempt, are moved to the
   failed_replay stage, so the initial stage can be cleaned up safely.
3) Spread out the data among Scylla shards, via the batchlog shard
   column.
4) Make batchlog entries ordered by the batchlog create time (id). This
   allows for selecting batchlogs to replay, without post-filtering of
   batchlogs that are too young to be replayed.
2025-12-02 14:21:25 +02:00
Botond Dénes
9434ec2fd1 service,db: extract generation of batchlog delete mutation
Don't build batchlog delete mutations in storage-proxy code. Move this
code into db/batchlog_manager.cc, exposed via db/batchlog.hh.
This serves multiple goals:
1) Concentrates low-level batchlog related logic in
   db/batchlog_manager.cc
2) Reduce current and future code duplication.
3) Make future changes to this logic easier.
2025-12-02 14:21:25 +02:00
Botond Dénes
f54602daf0 service,db: extract get_batchlog_mutation_for() from storage-proxy
Don't build batchlog mutations in storage-proxy code. Move this code
into db/batchlog_manager.cc, exposed via db/batchlog.hh.
This serves multiple goals:
1) Concentrates low-level batchlog related logic in
   db/batchlog_manager.cc
2) Reduce current and future code duplication.
2) Make future changes to this logic easier.
2025-12-02 14:21:25 +02:00
Botond Dénes
097c2cd676 db/batchlog_manager: only consider propagation delay with tombstone-gc=repair
The propagation delay has no effect for other tombstone gc strategies,
so ignore it when tombstone-gc != repair.
2025-12-02 14:21:25 +02:00
Botond Dénes
4f30807f01 db/batchlog_manager: don't drop entire batch if one mutations' table was dropped
Just skip the mutation(s) whose tables were dropped instead.
Use the newly introduced data_dictionary::table::get_truncation_time()
to avoid looking up real table object.
2025-12-02 14:21:25 +02:00
Botond Dénes
55704908a0 data_dictionary: table: add get_truncation_time()
So the batchlog manager can avoid looking up the real table and instead
just work with data dictionary.
2025-12-02 14:21:25 +02:00
Taras Veretilnyk
a191503ddf sstables: Extract file writer closing logic into separate methods
Refactor the consume_end_of_stream() method by extracting the inline
file writer closing logic into dedicated methods:
- close_index_writer()
- close_partitions_writer()
- close_rows_writer()
2025-12-02 13:07:41 +01:00
Taras Veretilnyk
619bf3ac4b sstables: Add components_digests to scylla metadata components
Add components_digests struct with optional digest fields for storing CRC32 digests of individual SSTable components in Scylla metadata.
Those includes:
- Data
- Compression
- Filter
- Statistics
- Summary
- Index
- TOC
- Partitions
- Rows
2025-12-02 12:36:34 +01:00
Pawel Pery
b5c85d08bb unittest: fix vector_store_client_test_dns_refresh_aborted hangs
The root cause for the hanging test is a concurrency deadlock.
`vector_store_client` runs dns refresh time and it is waiting for the condition
variable.After aborting dns request the test signals the condition variable.
Stopping the vector_store_client takes time enough to trigger the next dns
refresh - and this time the condition variable won't be signalled - so
vector_store_client will wait forever for finish dns refresh fiber.

The commit fixes the problem by waiting for the condition variable only once.

Fixes: #27237
Fixes: VECTOR-370

Closes scylladb/scylladb#27239
2025-12-02 12:22:44 +01:00
Piotr Dulikowski
3aaab5d5a3 Merge 'vector_search: Fix high availability during timeouts' from Karol Nowacki
This PR introduces two key improvements to the robustness and resource management of vector search:

Proper Abort on CQL Timeout: Previously, when a CQL query involving a vector search timed out
, the underlying ANN query to the vector store was not aborted and would continue to run. This has been fixed by ensuring the abort source is correctly signaled, terminating the ANN request when its parent CQL query expires and preventing unnecessary resource consumption.

Faster Failure Detection: The connection and keep-alive timeouts for vector store nodes were excessively long (2 and 11 minutes, respectively), causing significant delays in detecting and recovering from unreachable nodes. These timeouts are now aligned with the request_timeout_in_ms setting, allowing for much faster failure detection and improving high availability by failing over from unresponsive nodes more quickly.

Fixes: SCYLLADB-76

This issue affects the 2025.4 branch, where similar HA recovery delays have been observed.

Closes scylladb/scylladb#27377

* github.com:scylladb/scylladb:
  vector_search: Fix ANN query abort on CQL timeout
  vector_search: Reduce connection and keep-alive timeouts
2025-12-02 11:14:48 +01:00
Botond Dénes
337f417b13 db/batchlog_manager: batch(): replace map_reduce() with simple loop
The map_reduce achieves no concurrency, both map and reduce are
synchronous. It only achieves two redundant lookups for the table and
hard-to-read code. Convert it into a simple loop. Preserve the
stall-protection by adding a maybe_yield() to the loop.
2025-12-02 12:05:10 +02:00
Wojciech Mitros
6221c58325 mv: replace the simple/complex rack-aware pairing with exact rack matching
When the initial version of rack-aware pairing was introduced, materialized
views with tablets were still experimental. Since then, we decided that
we'll only allow materialized views in clusters where the base table and
the view are replicated on the same racks, with one replica of each tablet
on each rack.
This allows us to remove almost all logic from our base-view pairing. The
only check for the paired view replica is now whether it's in the same
rack as the base replica sending the update.
In this patch we replace the simple and complex rack-aware pairing with
the simple check above.
Because of this, we have to remove a test case from network_topology_strategy_test
which was testing complex pairing. The tested topology is not supported
for views with tablets (or is unlikely to be supported, as it's a random test),
so there's no use keeping the test.
The test case for simple rack aware pairing was kept, but now we only test
the case where each rack has one replica, not multiple.
Additionally, we split finding of an unpaired replica to a separate function
and partially rewrite it without reusing the helper stuctures that were
present when calculating the simple and complex rack-aware pairing.
We only look for an unpaired replica if we couldn't find a paired replica
ourselves or if the number of view replicas didn't match the base replicas.
If an unpaired replica appears while these conditions pass, we won't send
an extra update, but that would be a new bug altogether, because we only
expect the unpaired replica to appear during RF changes, so when these
conditions aren't fulfilled.

Fixes https://github.com/scylladb/scylladb/issues/26313
2025-12-02 10:52:36 +01:00
Ernest Zaslavsky
605f71d074 s3_client: handle additional transient network errors
Add handling for a broader set of transient network-related `std::errc` values in `aws_error::from_system_error`. Treat these conditions as retryable when the client re-creates the socket for each request.

Fixes: https://github.com/scylladb/scylladb/issues/27349

Closes scylladb/scylladb#27350
2025-12-02 11:44:40 +02:00
Botond Dénes
705af2bc16 db/batchlog_manager: finish coroutinizing replay_all_failed_batches
It was coroutinized already but strangely, some continuations also
remained. The `batch` lambda is still left in continuation style.
2025-12-02 10:42:28 +02:00
Botond Dénes
5b5f9120d0 db/batchlog_manager: improve replayAllFailedBatches logs
Add cleanup flag value to start message and drop cpu, it is redundant as
Scylla already adds the shard number to the logs.
Add all_replayed to finish message.
2025-12-02 10:42:28 +02:00
Pavel Emelyanov
6c115c691f sstables_loader: Provide endpoint type for get_sstables_from_object_store()
Currently the method scans db::config to find one. It has some
drawbacks. First, it's not very nice. Second, it needs to handle the
case when the endpoint is missing, while it relally never is. Third, the
type in config entry is not necessarily set.

It's nicer to get the type from storage manager.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-02 11:18:32 +03:00
Pavel Emelyanov
5924c36b50 storage_manager: Introduce get_endpoint_type() method
So that other code (spoiler: see next patch) have simple API to get one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-02 11:18:27 +03:00
Pavel Emelyanov
ad6a73c29b storage_manager: Split get_endpoint_client()
To get the get_endpoint() internal helper for future use.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-02 11:18:23 +03:00
Wojciech Mitros
4ec0fa6eb5 mv: split out vnode pairing code from get_view_natural_endpoint
To avoid repeatedly checking whether we're using tablets and having
to use unnecesarily flexible code fitting both cases, we split out
the base-view pairing code for the case of vnodes to another function.
The get_view_natural_endpoint will now have only common steps,
a call to that function, and steps specific to tablets.
2025-12-02 03:32:36 +01:00
Wojciech Mitros
c313b215e4 mv: unify self-pairing and rack-aware pairing into one bool
We always use "legacy self pairing" when not using tablets, and
the "rack aware pairing" has been enabled in every version where
views with tablets isn't experimental. So in practice, instead
of checking these variables we can just look at whether the
table uses tablets.
2025-12-02 03:32:32 +01:00
Karol Nowacki
086c6992f5 vector_search: Fix ANN query abort on CQL timeout
When a CQL vector search request timed out, the underlying ANN query was
not aborted and continued to run. This happened because the abort source
was not being signaled upon request expiration.
This commit ensures the ANN query is aborted when the CQL request times out
preventing unnecessary resource consumption.
2025-12-02 01:17:01 +01:00
Karol Nowacki
b6afacfc1e vector_search: Reduce connection and keep-alive timeouts
The connection timeout was 2 minutes and the keep-alive
timeout was 11 minutes. If a vector store node became unreachable, these
long timeouts caused significant delays before the system could recover,
negatively impacting high availability.

This change aligns both timeouts with the `request_timeout`
configuration, which defaults to 10 seconds. This allows for much
faster failure detection and recovery, ensuring that unresponsive nodes
are failed over from more quickly.
2025-12-02 01:17:01 +01:00
Łukasz Paszkowski
0ed3452721 service/storage_service: Mark nodes excluded on shard0
Excluding nodes is a group0 operation and as such it needs to be
executed onyl on shard0. In case, the method `mark_excluded` is
invoked on a different shard, redirect the request to shard0.

Fixes https://github.com/scylladb/scylladb/issues/27129

Closes scylladb/scylladb#27167
2025-12-01 17:30:40 +01:00
Jenkins Promoter
c3c0991428 Update pgo profiles - aarch64 2025-12-01 13:47:56 +02:00
Wojciech Mitros
7c612e1789 mv: remove the workaround for left nodes when sending view updates
At one point, the get_view_natural_endpoint was using IP for the
view update (and hint) destinations, but the hint code was using
host_id for the destinations. When a node left, we could no longer
have a mapping for a IP to host_id and when trying to store a hint
for this IP, we'd crash.
We worked around this issue by dropping the view update completely
if the target is in the "left" state.
Since then, we also moved to host_id's in the view update code, so
there's no longer any translation needed when storing the hints.
Additionally, we now drain hints not when entering the "left" state,
but when the node actually stops owning tokens.
Because of that, the workaround is not needed anymore, so we remove
it in this commit.
The existing test_mv_tablets_empty_ip case verifies that indeed, we
do not crash in the original problematic scenario.
2025-12-01 12:27:28 +01:00
Calle Wilund
5f53d7852e docs::encryption: Add warning that replicated provider is deprecated
And will be removed.
2025-12-01 09:44:44 +00:00
Calle Wilund
52cc30e00c ent::encryption: Switch default key provider from replicated to local
Since we are deprecating the replicated provider, it makes little sense to
have it be default.
2025-12-01 09:44:44 +00:00
Calle Wilund
ba38d58539 replicated_key_provider: Add deprecation warning on usage
Warns user utilizing the provider that the provider is deprecated
and will be removed.
2025-12-01 09:44:44 +00:00
Jenkins Promoter
563e5ddd62 Update pgo profiles - x86_64 2025-12-01 04:24:36 +02:00
Ernest Zaslavsky
f0e2941e34 streaming: refactor tablet_sstable_streamer::stream by extracting SST filtering logic
Extract the SST filtering logic into a dedicated member function. This
prepares the code for independent testing without requiring the entire
streamer to be initialized.
2025-11-30 18:27:15 +02:00
Artsiom Mishuta
796205678f test.py: set worksteal distribution
set worksteal disribution for xdist(new sheduler)
Because now it shows better tests distribution that standart(load) in CI

Closes scylladb/scylladb#27354
2025-11-30 18:13:03 +02:00
Emil Maskovsky
902d70d6b2 .github: add Copilot instructions for AI-generated code
Add comprehensive coding guidelines for GitHub Copilot to improve
quality and consistency of AI-generated code. Instructions cover C++
and Python development with language-specific best practices, build
system usage, and testing workflows.

Following GitHub Copilot's standard layout with general instructions
in .github/copilot-instructions.md and language-specific files in
.github/instructions/ directory using *.instructions.md naming.

No backport: This change is only for developers in master, so it doesn't
need to be backported.

Closes scylladb/scylladb#25374
2025-11-30 13:30:05 +02:00
Avi Kivity
ce2a403f18 Merge 'alternator: implement gzip-compressed requests' from Nadav Har'El
In this series we implement Alternator's support for gzip-compressed
requests, i.e., requests with the "Content-Encoding: gzip" header,
other uncompressed header, and a gzip-compressed body.

The server needs to verify the signature of the *compressed* content,
and then uncompress the body before running the request.

We only support gzip compression because this is what DynamoDB supports.
But in the future we can easily add support for other compression
algorithms like lz4 or zstd.

This series Refs #5041 but doesn't "Fixes" it because it only implements
compressed requests (Content-Encoding), *not* compressed responses
(Accept-Encoding).

In addition to the code changes, the series also contains tests for this
feature that make sure it behaves like DynamoDB.

Note that while we will have now support in our server for compressed
requests, just like DynamoDB does, the clients (AWS SDKs) will probably
NOT make use of it because they do not enable request compression by
default. For example, see the tests for some hoops one needs to jump
through in boto3 (the Python SDK) to send compressed requests. However,
we are hoping that in the future Alternator's modified clients will
use compressed requests and enjoy this feature.

Closes scylladb/scylladb#27080

* github.com:scylladb/scylladb:
  test/alternator: enable, and add, tests for gzip'ed requests
  alternator: implement gzip-compressed requests
2025-11-30 13:27:46 +02:00
Avi Kivity
d4be9a058c Update seastar submodule
seastar::compat::source_location (which should not have been used
outside Seastar) is replaced with std::source_location to avoid
deprecation warnings. The relevant header, which was removed, is no
longer included.

* seastar 8c3fba7a...b5c76d6b (3):
  > testing: There can be only one memory_data_sink
  > util: Use std::source_location directly
  > Merge 'net: support proxy protocol v2' from Avi Kivity
    apps: httpd: add --load-balancing-algorithm
    apps: httpd: add /shard endpoint
    test: socket_test: add proxy protocol v2 test suite
    test: socket_test: test load balancer with proxy protocol
    net: posix_connected_socket: specialize for proxied connections
    net: posix_server_socket_impl: implement proxy protocol in server sockets
    net: posix_server_socket_impl: adjust indentation
    net: posix_server_socket_impl: avoid immediately-invoked lambda
    net: conntrack: complete handle nested class special member functions
    net: posix_server_socket_impl: coroutinize accept()

Closes scylladb/scylladb#27316
2025-11-30 12:38:47 +02:00
Piotr Dulikowski
44c605e59c Merge 'Fix the types of change events in Alternator Streams' from Piotr Wieczorek
This patch increases the compatibility with DynamoDB Streams by integrating the DynamoDB's event type rules (described in https://github.com/scylladb/scylladb/issues/6918) into Alternator. The main changes are:
- introduce a new flag `alternator_streams_strict_compatibility`, meant as a guard of performance-intensive operations that increase the compatibility with DynamoDB Streams. If enabled, Alternator always performs a RBW before a data-modifying operation, and propagates its result to CDC. Then, the old item is compared to the new one, to determine the mutation type (INSERT vs MODIFY). This option is a no-op for tables with disabled Alternator Streams,
- reduce splitting of simple Alternator mutations,
- correctly distinguish event types described in #6918, except for item deletes. Deleting a missing item with DeleteItem, BatchWriteItem, or a missing field with UpdateItem still emit REMOVEs.

To summarize, the emitted events of the data manipulation operations should be as follows:
- DeleteItem/BatchWriteItem.DeleteItem of existing item: REMOVE (OK)
- DeleteItem of nonexistent item: nothing (OK)
- BatchWriteItem.DeleteItem of nonexistent item: nothing (OK)
- PutItem/UpdateItem/BatchWriteItem.PutItem of existing and not equal item: MODIFY (OK)
- PutItem/UpdateItem/BatchWriteItem.PutItem of existing and equal item: nothing (OK)
- PutItem/UpdateItem/BatchWriteItem.PutItem of nonexistent item: INSERT (OK)

No backport is necessary.

Refs https://github.com/scylladb/scylladb/pull/26149
Refs https://github.com/scylladb/scylladb/pull/26396
Refs https://github.com/scylladb/scylladb/issues/26382
Fixes https://github.com/scylladb/scylladb/issues/6918

Closes scylladb/scylladb#26121

* github.com:scylladb/scylladb:
  test/alternator: Enable the tests failing because of #6918
  alternator, cdc: Don't emit events for no-op removes
  alternator, cdc: Don't emit an event for equal items
  alternator/streams, cdc: Differentiate item replace and item update in CDC
  alternator: Change the return type of rmw_operation_return
  config: Add alternator_streams_strict_compatibility flag
  cdc: Don't split a row marker away from row cells
2025-11-30 07:20:22 +01:00
Asias He
da5cc13e97 repair: Fix deadlock when topology coordinator steps down in the middle
Consider this:

1) n1 is the topology coordinator
2) n1 schedules and executes a tablet repair with session id s1 for a
tablet on n3 an n4.
3) n3 and n4 take and store the in _rs._repair_compaction_locks[s1]
4) n1 steps down before it executes
locator::tablet_transition_stage::end_repair
5) n2 becomes the new topology coordinator
6) n2 runs locator::tablet_transition_stage::repair again
7) n3 and n4 try to take the lock again and hangs since the lock is
already taken.

To avoid the deadlock, we can throw in step 7 so that n2 will
proceed to end_repair stage and release the lock. After that, the
scheduler could schedule the tablet repair request again.

Fixes #26346

Closes scylladb/scylladb#27163
2025-11-28 15:14:39 +01:00
Radosław Cybulski
b54a9f4613 Fix use-after-free in encode_paging_state in Alternator
Fix unlikely use-after-free in `encode_paging_state`. The function
incorrectly assumes that current position to encode will always have
data for all clustering columns the schema defines. It's possible to
encounter current position having less than all columns specified, for
eample in case of range tombstone. Those don't happen in Alternator
tables as DynamoDB doesn't allow range deletions and clustering key
might be of size at most 1. Alternator api can be used to read
scylla system tables and those do have range tombstones with more
than single clustering column.

The fix is to stop trying to encode columns, that don't have the value -
they are not needed anyway, as there's no possible position with those
values (range tombstone made sure of that).

Fixes #27001
Fixes #27125

Closes scylladb/scylladb#26960
2025-11-28 16:51:15 +03:00
Pavel Emelyanov
d35ce81ff1 Merge 'test: wait for read_barrier in wait_until_driver_service_level_created' from Andrzej Jackowski
Previously, `wait_until_driver_service_level_created` only waited for
the `driver` service level to appear in the output of
`LIST ALL SERVICE_LEVELS`. However, the fact that one node lists
`sl:driver` does not necessarily mean that all other nodes can see
it yet. This caused sporadic test failures, especially in DEBUG builds.

To prevent these failures, this change adds an extra wait for
a `raft/read_barrier` after the `driver` service level first appears.
This ensures the service level is globally visible across the cluster.

Fixes: https://github.com/scylladb/scylladb/issues/27019

Na backport - test fix for `sl:driver` tests, and this that is only available on `master`

Closes scylladb/scylladb#27076

* github.com:scylladb/scylladb:
  test: wait for read_barrier in wait_until_driver_service_level_created
  test: use ManagerClient in wait_until_driver_service_level_created
2025-11-28 16:47:29 +03:00
Dawid Mędrek
b76af2d07f cql3: Improve errors when manipulating default service level
Before this commit, any attempt to create, alter, attach, or drop
the default service level would result in a syntax error whose
error message was unclear:

```
cqlsh> attach service level default to cassandra;
SyntaxException: line 1:21 no viable alternative at input 'default'
```

The error stems from the grammar not being able to parse `default`
as a correct service level name. To fix that, we cover it manually.
This way, the grammar accepts it and we can process it in Scylla.

The reason why we'd like to cover the default service level is that
it's an actual service level that the user should reference. Getting
a syntax error is not what should happen. Hence this fix.

We validate the input and if the given role is really the default
service level, we reject the query and provide an informative error
message.

Two validation tests are provided.

Fixes scylladb/scylladb#26699

Closes scylladb/scylladb#27162
2025-11-28 15:32:37 +03:00
Dawid Mędrek
48a28c24c5 db/commitlog: Include position and alignment information in errors
When we come across a segment truncation, this information may
be helpful to determine when the error occurred exactly and
hint at what code path might've led to it.

Closes scylladb/scylladb#27207
2025-11-28 15:28:08 +03:00
Calle Wilund
59c87025d1 commitlog::read_log_file: Check for eof position on all data reads
Fixes #24346

When reading, we check for each entry and each chunk, if advancing there
will hit EOF of the segment. However, IFF the last chunk being read has
the last entry _exactly_ matching the chunk size, and the chunk ending
at _exactly_ segment size (preset size, typically 32Mb), we did not check
the position, and instead complained about not being able to read.

This has literally _never_ happened in actual commitlog (that was replayed
at least), but has apparently happened more and more in hints replay.

Fix is simple, just check the file position against size when advancing
said position, i.e. when reading (skipping already does).

v2:

* Added unit test

Closes scylladb/scylladb#27236
2025-11-28 15:26:46 +03:00
Ernest Zaslavsky
1d5f60baac streaming:: add more logging
Start logging all missed streaming options like `scope`, `primary_replica` and `skip_reshape` flags

Fixes: https://github.com/scylladb/scylladb/issues/27299

Closes scylladb/scylladb#27311
2025-11-28 12:50:33 +01:00
Emil Maskovsky
37e3dacf33 topology_coordinator: handle seastar::abort_requested_exception alongside raft::request_aborted
In several exception handlers, only raft::request_aborted was being
caught and rethrown, while seastar::abort_requested_exception was
falling through to the generic catch(...) block. This caused the
exception to be incorrectly treated as a failure that triggers
rollback, instead of being recognized as an abort signal.

For example, during tablet draining, the error log showed:
"tablets draining failed with seastar::abort_requested_exception
(abort requested). Aborting the topology operation"

This change adds seastar::abort_requested_exception handling
alongside raft::request_aborted in all places where it was missing.
When rethrown, these exceptions propagate up to the main run() loop
where handle_topology_coordinator_error() recognizes them as normal
abort signals and allows the coordinator to exit gracefully without
triggering unnecessary rollback operations.

Fixes: scylladb/scylladb#27255

No backport: The problem was only seen in tests and not reported in
customer tickets, so it's enough to fix it in the main branch.

Closes scylladb/scylladb#27314
2025-11-28 12:19:21 +01:00
Michael Litvak
97b7c03709 tablet: scheduler: Do not emit conflicting migration in merge colocation
The tablet scheduler should not emit conflicting migrations for the same
tablet. This was addressed initially in scylladb/scylladb#26038 but the
check is missing in the merge colocation plan, so add it there as well.

Without this check, the merge colocation plan could generate a
conflicting migration for a tablet that is already scheduled for
migration, as the test demonstrates.

This can cause correctness problems, because if the load balancer
generates two migrations for a single tablet, both will be written as
mutations, and the resulting mutation could contain mixed cells from
both migrations.

Fixes scylladb/scylladb#27304

Closes scylladb/scylladb#27312
2025-11-28 11:17:12 +01:00
Taras Veretilnyk
62802b119b sstables: Implement CRC32 digest-only writer
Introduce template parameter to checksummed file writer to support
digest-only calculation without storing chunk checksums.
This will be needed for future to calculate digest of other components.
2025-11-27 22:40:07 +01:00
Pavel Emelyanov
54edb44b20 code: Stop using seastar::compat::source_location
And switch to std::source_location.
Upcoming seastar update will deprecate its compatibility layer.

The patch is

  for f in $(git grep -l 'seastar::compat::source_location'); do
    sed -e 's/seastar::compat::source_location/std::source_location/g' -i $f;
  done

and removal of few header includes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27309
2025-11-27 19:10:11 +02:00
Avi Kivity
c85671ce51 scripts: refresh-submodules: don't omit last (first) commit
`git log --format` doesn't add a newline after the last line. This
causes `read` to ignore that line, losing the last line (corresponding
to the first commit).

Use `git log --tformat` instead, which terminates the last line.

Closes scylladb/scylladb#27317
2025-11-27 18:46:27 +02:00
Botond Dénes
9b968dc72c docs: update dependencies
Via make update.

Fixes: scylladb/scylladb#27231

Closes scylladb/scylladb#27263
2025-11-27 15:56:34 +03:00
Andrzej Jackowski
e366030a92 treewide: seastar module update
The reason for this seastar update is to have the fixed handling
of the `integer` type in `seastar-json2code` because it's needed
for further development of ScyllaDB REST API.

The following changes were introduced to ScyllaDB code to ensure it
compiles with the updated seastar:
 - Remove `seastar/util/modules.hh` includes as the file was removed
   from seastar
 - Modified `metrics::impl::labels_type` construction in
   `test/boost/group0_test.cc` because now it requires `escaped_string`

* seastar 340e14a7...8c3fba7a (32):
  > Merge 'Remove net::packet usage from dns.cc' from Pavel Emelyanov
    dns: Optimize packet sending for newer c-ares versions
    dns: Replace net::packet with vector<temporary_buffer>
    dns: Remove unused local variable
    dns: Remove pointless for () loop wrapping
    dns: Introduce do_sendv_tcp() method
    dns: Introduce do_send_udp() method
  > test: Add http rules test of matching order
  > Merge 'Generalize packet_data_source into memory_data_source' from Pavel Emelyanov
    memcached: Patch test to use memory_data_source
    memcached: Use memory_data_source in server
    rpc: Use memory_data_sink without constructing net::packet
    util: Generalize packet_data_source into memory_data_source
  > tests: coroutines: restore "explicit this" tests
  > reactor: remove blocking of SIGILL
  > Merge 'Update compilers in GH actions scripts' from Pavel Emelyanov
    github: Use gcc-14
    github: Use clang-20
  > Merge 'Reinforce DNS reverse resolution test ' from Pavel Emelyanov
    test: Make test_resolve() try several addresses
    test: Coroutinize test_resolve() helper
  > modules: make module support standards-compliant
  > Merge 'Fix incorrect union access in dns resolver' from Pavel Emelyanov
    dns: Squash two if blocks together
    dns: Do not check tcp entry for udp type
  > coroutine: Fix compilation of execute_involving_handle_destruction_in_await_suspend
  > promise: Document that promise is resolved at most once
  > coroutine: exception: workaround broken destroy coroutine handle in await_suspend
  > socket: Return unspecified socket_address for unconnected socket
  > smp: Fix exception safety of invoke_on_... internal copying
  > Merge 'Improve loads evaluation by reactor' from Pavel Emelyanov
    reactor: Keep loads timer on reactor
    reactor: Update loads evaluation loop
  > Merge 'scripts: add 'integer' type to seastar-json2code' from Andrzej Jackowski
    test: extend tests/unit/api.json to use 'integer' type
    scripts: add 'integer' type to seastar-json2code
  > Merge 'Sanitize tls::session::do_put(_one)? overloads' from Pavel Emelyanov
    tls: Rename do_put_one(temporary_buffer) into do_put()
    tls: Fix indentation after previous patch
    tls: Move semaphore grab into iterating do_put()
  > net: tcp: change unsent queue from packets to temporary_buffer:s
  > timer: Enable highres timer based on next timeout value
  > rpc: Add a new constructor in closed_error to accept string argument
  > memcache: Implement own data sink for responses
  > Merge 'file: recursive_remove_directory: general cleanup' from Avi Kivity
    file: do_recursive_remove_directory(): move object when popping from queue
    file: do_recursive_remove_directory(): adjust indentation
    file: do_recursive_remove_directory(): coroutinize
    file: do_recursive_remove_directory(): simplify conditional
    file: do_recursive_remove_directory(): remove wrong const
    file: do_recursive_remove_directory(): clean up work_entry
  > tests: Move thread_context_switch_test into perf/
  > test: Add unit test for append_challenged_posix_file
  > Merge 'Prometheus metrics handler optimization' from Travis Downs
    prometheus: optimize metrics aggregation
    prometheus: move and test aggregate_by helper
    prometheus: various optimizations
    metrics: introduce escaped_string for label values
    metric:value: implement + in terms of +=
    tests: add prometheus text format acceptance tests
    extract memory_data_sink.hh
    metrics_perf: enhance metrics bench
  > demos: Simplify udp_zero_copy_demo's way of preparing the packet
  > metrics: Remove deprecated make_...-ers
  > Merge 'Make slab_test be BOOST kind' from Pavel Emelyanov
    test: Use BOOST_REQUIRE checkers
    test: Replace some SEASTAR_ASSERT-s with static_assert-s
    test: Convert slab test into boost kind
  > Merge 'Coroutinize lister_test' from Pavel Emelyanov
    test: Fix indentation after previuous patch
    test: Coroutinize lister_test lister::report() method
    test: Coroutinize lister_test main code
  > file: recursive_remove_directory(): use a list instead of a deque
  > Merge 'Stop using packets in tls data_sink and session' from Pavel Emelyanov
    tls: Stop using net::packet in session::put()
    tls: Fix indentation after previous patch
    tls: Split session::do_put()
    tls: Mark some session methods private

Closes scylladb/scylladb#27240
2025-11-27 12:34:22 +02:00
Nadav Har'El
32afcdbaf0 test/alternator: enable, and add, tests for gzip'ed requests
After in the previous patch we implemented support in Alternator for
gzip-compressed requests ("Content-Encoding: gzip"), here we enable
an existing xfail-ing test for this feature, and also add more tests
for more cases:

  * A test for longer compressed requests, or a short compressed
    request which expands to a longer request. Since the decompression
    uses small buffers, this test reaches additional code paths.

  * Check for various cases of a malformed gzip'ed request, and also
    an attempt to use an unsupported Content-Encoding. DynamoDB
    returns error 500 for both cases, so we want to test that we
    do to - and not silently ignore such errors.

  * Check that two concatenated gzip'ed streams is a valid request,
    and check that garbage at the end of the gzip - or a missing
    character at the end of the gzip - is recognized as an error.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-11-27 09:42:47 +02:00
Wojciech Mitros
323e5cd171 mv: allow setting concurrency in PRUNE MATERIALIZED VIEW
The PRUNE MATERALIZED VIEW statement is performed as follows:
1. Perform a range scan of the view table from the view replicas based
on the ranges specified in the statement.
2. While reading the paged scan above, for each view row perform a read
from all base replicas at the corresponding primary key. If a discrepancy
is detected, delete the row in the view table.

When reading multiple rows, this is very slow because for each view row
we need to performe a single row query on multiple replicas.
In this patch we add an option to speed this up by performing many of the
single base row reads concurrently, at the concurrency specified in the
USING CONCURRENCY clause.

Fixes https://github.com/scylladb/scylladb/issues/27070
2025-11-27 00:02:28 +01:00
Tomasz Grabiec
d6c14de380 Merge 'locator/node: include _excluded in missing places' from Patryk Jędrzejczak
We currently ignore the `_excluded` field in `node::clone()` and the verbose
formatter of `locator::node`. The first one is a bug that can have
unpredictable consequences on the system. The second one can be a minor
inconvenience during debugging.

We fix both places in this PR.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-72

This PR is a bugfix that should be backported to all supported branches.

Closes scylladb/scylladb#27265

* github.com:scylladb/scylladb:
  locator/node: include _excluded in verbose formatter
  locator/node: preserve _excluded in clone()
2025-11-26 18:29:59 +01:00
Asias He
ab4896dc70 topology_coordinator: Send incremental repair rpc only when the feature is enabled
Otherwise, in a mixed cluster, the handle_tablet_resize_finalization
would fail because of the unknown rpc verb.

Fixes #26309

Closes scylladb/scylladb#27218
2025-11-26 15:25:36 +01:00
Patryk Jędrzejczak
287c9eea65 locator/node: include _excluded in verbose formatter
It can be helpful during debugging.
2025-11-26 13:26:17 +01:00
Patryk Jędrzejczak
4160ae94c1 locator/node: preserve _excluded in clone()
We currently ignore the `_excluded` field in `clone()`. Losing
information about exclusion can have unpredictable consequences. One
observed effect (that led to finding this issue) is that the
`/storage_service/nodes/excluded` API endpoint sometimes misses excluded
nodes.
2025-11-26 13:26:11 +01:00
Patryk Jędrzejczak
cc273e867d Merge 'fix notification about expiring erm held for to long' from Gleb Natapov
Commit 6e4803a750 broke notification about expired erms held for too long since it resets the tracker without calling its destructor (where notification is triggered). Fix the assign operator to call the destructor like it should.

Fixes https://github.com/scylladb/scylladb/issues/27141

Closes scylladb/scylladb#27140

* https://github.com/scylladb/scylladb:
  test: test that expired erm that held for too long triggers notification
  token_metadata: fix notification about expiring erm held for to long
2025-11-26 12:59:00 +01:00
Amnon Heiman
68c7236acb vector_index: require tablets for vector indexes
This patch enforces that vector indexes can only be created on keyspaces
that use tablets. During index validation, `check_uses_tablets()` verifies
the base keyspace configuration and rejects creation otherwise.

To support this, the `custom_index::validate()` API now receives a
`const data_dictionary::database&` parameter, allowing index
implementations to access keyspace-level settings during DDL validation.

Fixes https://scylladb.atlassian.net/browse/VECTOR-322

Closes scylladb/scylladb#26786
2025-11-26 13:30:43 +02:00
Marcin Maliszkiewicz
dd461e0472 auth: use auth cache on login path
This path may become hot during connection storms
that's why we want it to stress the node as little
as possible.
2025-11-26 12:01:33 +01:00
Marcin Maliszkiewicz
0c9b2e5332 auth: corutinize standard_role_manager::can_login
Corutinize so that it's easier to add new logic
in following commit.
2025-11-26 12:01:32 +01:00
Marcin Maliszkiewicz
b29c42adce main: auth: add auth cache dependency to auth service
In the following commit we'll switch some authorizer
and role manager code to use the cache so we're preparing
the dependency.
2025-11-26 12:01:31 +01:00
Marcin Maliszkiewicz
ea3dc0b0de raft: update auth cache when data changes
When applying group0_command we now inspect
whether any auth internal tables were modified,
and reload affected role entries in the cache.

Since one auth DML may change multiple tables,
when iterating over mutations we deduplicate
affected roles across those tables.
2025-11-26 12:00:50 +01:00
Marcin Maliszkiewicz
2a6bef96d6 auth: storage_service: reload auth cache on v1 to v2 auth migration 2025-11-26 12:00:50 +01:00
Marcin Maliszkiewicz
19da1cb656 raft: reload auth cache on snapshot application
Receiving snaphot is a rare event so as a simplification
we'll be reloading the whole cache instead of trying to merge
states, especially that expected size is small, below 100 records.

Reloading is non-disruptive operation, old entries are removed
only after all entries are loaded. If entry is updated, shared
pointer will be atomically replaced in a cache map.
2025-11-26 12:00:50 +01:00
Marcin Maliszkiewicz
2cf1ca43b5 service: add auth cache getter to storage service
Prepare for use in a subsequent commit in group0_state_machine,
where the auth cache will be integrated. This follows the same
pattern as updates to the service-level cache, view-building
state, and CDC streams.
2025-11-26 12:00:50 +01:00
Marcin Maliszkiewicz
642f468c59 main: start auth cache service
The service is not yet used anywhere,
we first build scaffolding.
2025-11-26 12:00:50 +01:00
Marcin Maliszkiewicz
bd7c87731b auth: add unified cache implementation
It combines data from all underlying auth tables.
Supports gentle full load and per role reloads.
Loading is done on shard 0 and then deep copies data
to all shards.
2025-11-26 12:00:50 +01:00
Marcin Maliszkiewicz
4c667e87ec auth: move table names to common.hh
They will be used additionally in cache code, added
in following commits.
2025-11-26 12:00:50 +01:00
Nadav Har'El
f4555be8a5 docs/alternator: list another unimplemented Alternator feature
A new feature was announced this week for Amazon DynamoDB, "multi-
attribute composite keys in global secondary indexes", which allows to
create GSIs with composite keys (multiple columns). This feature already
existed in CQL's materialized views, but didn't exist in DynamoDB until
now.

So this patch adds a paragraph to our docs/alternator/compatibility.md
mentioning that we don't support this DynamoDB feature yet.

See also issue #27182 which we opened to track this unimplemented
feature.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27183
2025-11-26 12:10:37 +02:00
Pavel Emelyanov
943350fd35 scripts: Add target branch checking in PR merging script
Sometimes (though rarely) I call this script on mis-matching PR and
current branch. E.g. trying to merge master PR into stable next, or
2025.X PR into next-2025.Y (X != Y). Typically merge fails, but it's
good to catch it early.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27249
2025-11-26 12:10:16 +02:00
Nadav Har'El
9cde93e3da Merge 'db/view/view_building_coordinator: get rid of task's state in group0' from Michał Jadwiszczak
Previously, the view building coordinator relied on setting each task's state to STARTED and then explicitly removing these state entries once tasks finished, before scheduling new ones. This approach induced a significant number of group0 commits, particularly in large clusters with many nodes and tablets, negatively impacting performance and scalability.

With the update, the coordinator and worker logic has been restructured to operate without maintaining per-task states. Instead, tasks are simply tracked with an aborted boolean flag, which is still essential for certain tablet operations. This change removes much of the coordination complexity, simplifies the view building code, and reduces operational overhead.

In addition, the coordinator now batches reports of finished tasks before making commits. Rather than committing task completions individually, it aggregates them and reports in groups, significantly minimizing the frequency of group0 commits. This new approach is expected to improve efficiency and scalability during materialized view construction, especially in large deployments.

Fixes https://github.com/scylladb/scylladb/issues/26311

This patch needs to be backported to 2025.4.

Closes scylladb/scylladb#26897

* github.com:scylladb/scylladb:
  docs/dev/view-building-coordinator: update the docs after recent changes
  db/view/view_building: send coordinator's term in the RPC
  db/view/view_building_state: replace task's state with `aborted` flag
  db/view/view_building_coordinator: batch finished tasks reporting
  db/view/view_building_worker: change internal implementation
  db/view/view_building_coordinator: change `work_on_tasks` RPC return type
2025-11-26 11:35:44 +02:00
dependabot[bot]
86cd0a4dce build(deps): bump sphinx-multiversion-scylla in /docs
Bumps [sphinx-multiversion-scylla](https://holzhaus.github.io/sphinx-multiversion/) from 0.3.3 to 0.3.4.

---
updated-dependencies:
- dependency-name: sphinx-multiversion-scylla
  dependency-version: 0.3.4
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Closes scylladb/scylladb#27214
2025-11-26 06:57:02 +02:00
tomek7667
9bbdd487b4 docs: insert.rst: Update insert example by removing 'year' column
Closes scylladb/scylladb#26862
2025-11-26 06:55:28 +02:00
tomek7667
2138ab6b0e docs: insert.rst: fix INSERT statement for NerdMovies example
Closes scylladb/scylladb#26863
2025-11-26 06:53:45 +02:00
tomek7667
90a6aa8057 docs: ddl.rst: Fix formatting of null value note
Closes scylladb/scylladb#26853
2025-11-26 06:52:18 +02:00
Botond Dénes
384bffb8da Merge 'compaction: limit the maximum shares allocated to a compaction scheduling class' from Raphael Raph Carvalho
This PR adds support for limiting the maximum shares allocated to a
compaction scheduling class by the compaction controller. It introduces
a new configuration parameter, compaction_max_shares, which, when set
to a non zero value, will cap the shares allocated to compaction jobs.
This PR also exposes the shares computed by the compaction controller
via metrics, for observability purposes.

Fixes https://github.com/scylladb/scylladb/issues/9431

Enhancement. No need to backport.

NOTE: Replaces PR https://github.com/scylladb/scylladb/pull/26696

Ran a test in which the backlog raised the need for max shares (normalized backlog above normalization_factor), and played with different values for new option compaction_max_shares to see it works (500, 1000, 2000, 250, 50)

Closes scylladb/scylladb#27024

* github.com:scylladb/scylladb:
  db/config: introduce new config parameter `compaction_max_shares`
  compaction_manager:config: introduce max_shares
  compaction_controller: add configurable maximum shares
  compaction_controller: introduce `set_max_shares()`
2025-11-26 06:51:30 +02:00
Avi Kivity
1f6e3301e7 dist: systemd: drop deprecated CPU and I/O shares/weight from scylla-server.slice
The BlockIOWeight and CPUShares are deprecated. They are only used on
RHEL 7, which has reached end-of-life. Their replacements, IOWeight
and CPUWeight, are already set in the file.

Remove the deprecated settings to reduce noise in the logs.

Closes scylladb/scylladb#27222
2025-11-26 06:42:11 +02:00
Yaniv Michael Kaul
765a7e9868 gms/gossiper.cc: fix gossip log to show host-id/ip instead of host-id/host-id
Probably a copy-paste error, fixes the log to print host-id/ip.

Backport: no need, benign log issue.

Fixes: https://github.com/scylladb/scylladb/issues/27113
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#27225
2025-11-25 20:56:20 +01:00
Wojciech Mitros
3c376d1b64 alternator: use storage_proxy from the correct shard in executor::delete_table
When we delete a table in alternator, the schema change is performed on shard 0.
However, we actually use the storage_proxy from the shard that is handling the
delete_table command. This can lead to problems because some information is
stored only on shard 0 and using storage_proxy from another shard may make
us miss it.
In this patch we fix this by using the storage_proxy from shard 0 instead.

Fixes https://github.com/scylladb/scylladb/issues/27223

Closes scylladb/scylladb#27224
2025-11-25 18:56:31 +01:00
Botond Dénes
584f4e467e tools/scylla-sstable: introduce the dump-schema command
There is a limited number of ways to obtain the schema of a table:
1) Use DESCRIBE TABLE in cqlsh
2) Find the schema definition in the code (for system tables)
3) Ask support/user to provide schema
4) Piece together the schema definition from the system tables

Option (1) is the most convenient but requires access to live cluster.
(2) is limited to system tables only.
When investigating issues for customers, we have to rely on (3) and this
often adds communication round-trips and delays. (4) requires knowledge
of ScyllaDB internals and access to system tables.

The new dump-schema commands provides a convenient way to obtain the
schema of tables, given that there is access to either an sstable or the
system tables. It can dump the schema of system tables without either.

Closes scylladb/scylladb#26433
2025-11-25 20:32:36 +03:00
Nadav Har'El
4c7c5f4af7 alternator: implement gzip-compressed requests
In this patch we implement Alternator's support for gzip-compressed
requests, i.e., requests with the "Content-Encoding: gzip" header,
other uncompressed headers, and a gzip-compressed body.

The server needs to verify the signature of the *compressed* content,
and then uncompress the body before running the request.

We only support gzip compression because this is what DynamoDB supports.
But in the future we can easily add support for other compression
algorithms like lz4 or zstd.

This patch Refs #5041 but doesn't "Fixes" it because it only implements
compressed requests (Content-Encoding), *not* compressed responses
(Accept-Encoding).

The next patch will enable several tests for this feature and make sure
it behaves like DynamoDB.

Note that while we will have now support in our server for compressed
requests, just like DynamoDB does, the clients (AWS SDKs) will probably
NOT make use of it because they do not enable request compression by
default. For example, see the tests for some hoops one needs to jump
through in boto3 (the Python SDK) to send compressed requests. However,
we are hoping that in the future Alternator's modified clients will
use compressed requests and enjoy this feature.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-11-25 17:46:44 +02:00
Gleb Natapov
5dcdaa6f66 test: test that expired erm that held for too long triggers notification 2025-11-25 17:33:54 +02:00
Piotr Dulikowski
ff5c7bd960 Merge 'topology_coordinator: don't repair colocated tablets' from Michael Litvak
With the introduction of colocated tables, all the tablet transitions
now operate on groups of colocated tablets instead of individual
tablets. such is tablet migration, and also tablet repair.

The tablet repair currently doesn't work on individual tablets due to
the limitations in the tablet map being shared. The way it was
implemented to work on a group of colocated tablets is by repairing all
the colocated tablets together, using a dedicated rpc, and setting a
shared repair_time in the shared tablet map.  It was implemented this
way because we wanted to have some way to repair the tablets of a
colocated table.

However, we want to change this in the next release so that it will be
possible to repair the tablets of a colocated table individually. In
order to simplify and prepare for the future change, we prefer until
then to not repair colocated tables at all. otherwise, we will need to
support both the shared repair and individual repair together for a long
time, and the upgrade will be more complicated.

We change the handling of the tablet 'repair' transition to repair only
the base table's tablets. It means it will not be possible to request
tablet repair for a non-base colocated table such as local MV, CDC and
paxos table. This restriction will be temporary until a later release
where we will suuport repairing colocated tablets.

This is a reasonable restriction because repair for these kind of tables
is not required or as important as for normal tables.

Fixes https://github.com/scylladb/scylladb/issues/27119

backport to 2025.4 since we must change it in the same version it's introduced before it's released

Closes scylladb/scylladb#27120

* github.com:scylladb/scylladb:
  tombstone_gc: don't use 'repair' mode for colocated tables
  Revert "storage service: add repair colocated tablets rpc"
  topology_coordinator: don't repair colocated tablets
2025-11-25 14:58:06 +01:00
David Garcia
64a65cac55 docs: add metrics generation validation
fix: windows gitbash support

fix: new name found with no group vector_search/vector_store_client.cc 343

fix: rm allowmismatch

fix: git bash (windows) compatibility

fix: git bash (windows) compatibility

Closes scylladb/scylladb#26173
2025-11-25 15:39:52 +03:00
Gleb Natapov
9f97c376f1 token_metadata: fix notification about expiring erm held for to long
Commit 6e4803a750 broke notification about expired erms held for too
long since it resets the tracker without calling its destructor (where
notification is triggered). Fix assign operator to call destructor.
2025-11-25 13:35:24 +02:00
Michał Jadwiszczak
fe9581f54c docs/dev/view-building-coordinator: update the docs after recent changes
Remove information about view building task state and explain how
current lifetime of the task.
2025-11-25 12:14:05 +01:00
Michał Jadwiszczak
fb8cbf1615 db/view/view_building: send coordinator's term in the RPC
To avoid case when an old coordinator (which hasn't been stopped yet)
dictates what should be done, add raft term to the `work_on_view_building_tasks`
RPC.
The worker needs to check if the term matches the current term from raft
server, and deny the request when the term is bad.
2025-11-25 12:14:05 +01:00
Michał Jadwiszczak
24d69b4005 db/view/view_building_state: replace task's state with aborted flag
After previous commits, we can drop entire task's state and replace it
with single boolean flag, which determines if a task was aborted.

Once a task was aborted, it cannot get resurrected to a normal state.
2025-11-25 12:14:04 +01:00
Michał Jadwiszczak
eb04af5020 db/view/view_building_coordinator: batch finished tasks reporting
In previous implementation to execute view building tasks, the
coordinator needed to firstly set their states to `STARTED`
and then it needed to remove them before it could start the next ones.
This logic required a lot of group0 commits, especially in large
clusters with higher number of nodes and big tablet count.

After previous commit to the view building worker, the coordinator
can start view building tasks without setting the `STARTED` state
and deleting finished tasks.

This patch adjusts the coordinator to save finished tasks locally,
so it can continue to execute next ones and the finished tasks are
periodically removed from the group0 by `finished_task_gc_fiber()`.
2025-11-25 12:14:04 +01:00
dependabot[bot]
b911a643fd build(deps): bump sphinx-scylladb-theme from 1.8.8 to 1.8.9 in /docs
Bumps [sphinx-scylladb-theme](https://github.com/scylladb/sphinx-scylladb-theme) from 1.8.8 to 1.8.9.
- [Release notes](https://github.com/scylladb/sphinx-scylladb-theme/releases)
- [Commits](https://github.com/scylladb/sphinx-scylladb-theme/commits)

---
updated-dependencies:
- dependency-name: sphinx-scylladb-theme
  dependency-version: 1.8.9
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Closes scylladb/scylladb#27169
2025-11-25 11:01:37 +02:00
Botond Dénes
1263e1de54 Merge 'docs: modify debian/ubutnu installation instructions' from Yaron Kaikov
To support debian13, we need to modify the installation instructions since `apt-key` command is no longer available

Also updated installation instruction to match the latest release

Fixes: https://github.com/scylladb/scylladb/issues/26673

**No need for backport since we added debian13 only in master for now**

Closes scylladb/scylladb#27205

* github.com:scylladb/scylladb:
  install-on-linux.rst: update installation example to supported release
  docs: modify debian/ubutnu installation instructions
2025-11-25 10:53:11 +02:00
Nadav Har'El
bcd1758911 Merge 'vector_search: add validator tests' from Pawel Pery
The vector-search-validator is a binary tool which do functional and integration tests between scylla and vector-store. It is build in Rust mainly in vector-store repository. This patch adds possibility to write tests on scylladb repository side, compile them together with vector-store tests and run them in `test.py` environment.

There are three parts of the change:

- add sources of validator to the `test/vector_search_validator` directory
- add support for building validator and vector-store in `build/vector-search-validator/bin` directory with or without cmake
- add support for `pytest` and `test.py` to run validator test locally and in the CI environment; this part adds also README to the `test/vector_search_validator` directory

Design for validator integration tests: https://scylladb.atlassian.net/wiki/spaces/RND/pages/39518215/Vector+Search+Core+Test+Plan+Document

References: VECTOR-50

No backport needed as this is  a new functionality.

Closes scylladb/scylladb#26653

* github.com:scylladb/scylladb:
  vector_search: add vector-search-validator tests
  vector_search: implement building vector-search-validator
  vector_search: add vector-search-validator sources
2025-11-25 10:34:33 +02:00
Michael Litvak
868ac42a8b tombstone_gc: don't use 'repair' mode for colocated tables
For tables of special types that can be located: MV, CDC, and paxos
table, we should not use tombstone_gc=repair mode because colocated
tablets are never repaired, hence they will not have repair_time set and
will never be GC'd using 'repair' mode.
2025-11-25 09:15:46 +01:00
Michael Litvak
005807ebb8 Revert "storage service: add repair colocated tablets rpc"
This reverts commit 11f045bb7c.

The rpc was added together with colocated tablets in 2025.4 to support a
"shared repair" operation of a group of colocated tablets that repairs
all of them and allows also for special behavior as opposed to repairing
a single specific tablet.

It is not used anymore because we decided to not repair all colocated
tablets in a single shared operation, but to repair only the base table,
and in a later release support repairing colocated tables individually.

We can remove the rpc in 2025.4 because it is introduced in the same
version.
2025-11-25 09:06:48 +01:00
Michael Litvak
273f664496 topology_coordinator: don't repair colocated tablets
With the introduction of colocated tables, all the tablet transitions
now operate on groups of colocated tablets instead of individual
tablets. such is tablet migration, and also tablet repair.

The tablet repair currently doesn't work on individual tablets due to
the limitations in the tablet map being shared. The way it was
implemented to work on a group of colocated tablets is by repairing all
the colocated tablets together, using a dedicated rpc, and setting a
shared repair_time in the shared tablet map.  It was implemented this
way because we wanted to have some way to repair the tablets of a
colocated table.

However, we want to change this in the next release so that it will be
possible to repair the tablets of a colocated table individually. In
order to simplify and prepare for the future change, we prefer until
then to not repair colocated tables at all. otherwise, we will need to
support both the shared repair and individual repair together for a long
time, and the upgrade will be more complicated.

We change the handling of the tablet 'repair' transition to repair only
the base table's tablets. It means it will not be possible to request
tablet repair for a non-base colocated table such as local MV, CDC and
paxos table. This restriction will be temporary until a later release
where we will suuport repairing colocated tablets.

This is a reasonable restriction because repair for these kind of tables
is not required or as important as for normal tables.

Fixes scylladb/scylladb#27119
2025-11-25 09:05:59 +01:00
Amnon Heiman
b2c2a99741 index/vector_index.cc: Don't allow zero as an index option
This patch forces vector_index option value to be real-positive numbers
as zero would make no senese.

Fixes https://scylladb.atlassian.net/browse/VECTOR-249

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes scylladb/scylladb#27191
2025-11-25 10:05:44 +02:00
Karol Nowacki
ca62effdd2 vector_search: Restrict vector index tests to tablets only
Vector indexes are going to be supported only for tablets (see VECTOR-322).
As a result, tests using vector indexes will be failing when run with vnodes.

This change ensures tests using vector indexes run exclusively with tablets.

Fixes: VECTOR-49

Closes scylladb/scylladb#26843
2025-11-25 09:26:16 +02:00
Pawel Pery
9f10aebc66 vector_search: add vector-search-validator tests
The commit adds a functionality for `pytest` and `test.py` to run
`vector-search-validator` in `sudo unshare` environment. There are already two
tests - first parametrized `test_validator.py::test_validator[test-case-name]`
(run validator) and second `test_cargo_toml.py::test_cargo_toml` (check if the
current `Cargo.toml` for validator is correct).

Documentation for these tests are provided in `README.md`.
2025-11-24 17:26:04 +01:00
Pawel Pery
3702e982b9 vector_search: implement building vector-search-validator
The commit adds targets building
`build/vector-search-validator/bin/{vector-store,vector-search-validator}. The
targets must be build for tests. They don't depend on build mode.

The commit adds target in `configure.py` and also in `cmake`.
2025-11-24 17:26:04 +01:00
Pawel Pery
e569a04785 vector_search: add vector-search-validator sources
The commit adds validator sources uses combination of local files and
vector-store's files. In `build-env` there are definition of vector-store git
repository and revision on which validator will be built. `cargo-toml-template`
is script for printing current `Cargo.toml` to the stdout. After updating
`build-env` developer needs to update new configuration with
`./cargo-toml-template > Cargo.toml`. Git revision is used in several places in
`Cargo.toml` and will be used for building `vector-store`, so for better
handling git revision it should be setup only in one place.

The validator is divided into several crates to be able to built it within
scylladb and vector-store repositories. Here we need to create a new validator
crate with simple `main` function and call `validator_engine::main` there. We
provide tests written in scylladb repo in `validator-scylla` crate. The commit
provides empty `cql` test case, which should be filled in the future.
2025-11-24 17:26:04 +01:00
Gleb Natapov
39cec4ae45 topology: let banned node know that it is banned
Currently if a banned node tries to connect to a cluster it fails to
create connections, but has no idea why, so from inside the node it
looks like it has communication problems. This patch adds new rpc
NOTIFY_BANNED which is sent back to the node when its connection is
dropped. On receiving the rpc the node isolates itself and print an
informative message about why it did so.

Closes scylladb/scylladb#26943
2025-11-24 17:12:13 +01:00
Lakshmi Narayanan Sreethar
9cb766f929 db/config: introduce new config parameter compaction_max_shares
Add support for the new configuration parameter `compaction_max_shares`,
and update the compaction manager to pass it down to the compaction
controller when it changes. The shares allocated to compaction jobs will
be limited by this new parameter.

Fixes #9431

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-11-24 12:52:29 -03:00
Lakshmi Narayanan Sreethar
468b800e89 compaction_manager:config: introduce max_shares
Introduce an updateable value `max_shares` to compaction manager's
config. Also add a method `update_max_shares()` that applies the latest
`max_shares` value to the compaction controller’s `max_shares`. This new
variable will be connected to a config parameter in the next patch.

Refs #9431

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-11-24 11:43:38 -03:00
Lakshmi Narayanan Sreethar
f2b0489d8c compaction_controller: add configurable maximum shares
Add a `max_shares` constructor parameter to compaction_controller to
allow configuring the maximum output of the control points at
construction time. The constructor now calls `set_max_shares()` with the
provided max_shares value. The subsequent commits will wire this value
to a new configuration option.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-11-24 11:43:24 -03:00
Lakshmi Narayanan Sreethar
853811be90 compaction_controller: introduce set_max_shares()
Add a method to dynamically adjust the maximum output of control points
in the compaction controller. This is required for supporting runtime
configuration of the maximum shares allocated to the compaction process
by the controller.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-11-24 11:43:20 -03:00
Tomasz Grabiec
d4b77c422f Merge 'load_stats: leaving replica could be std::nullopt' from Ferenc Szili
When migrating tablet size during the end_migration tablet transition stage, we need the pending and leaving replica hosts. The leaving and pending replicas are gathered in objects of type std::optional<tablet_replica> and are not checked if they contain a value before dereferencing which could cause an exception in the topology coordinator.

This patch adds a check for leaving and pending replicas, and only performs the tablet size migration if neither are empty.

This bug was introduced in 10f07fb95a

This change also adds the ability to create a tablet size in load_stats during end_migration stage of a tablet rebuild. We compute the new tablet size from by averaging the tablet sizes of the existing replicas.

This change also adds the virtual table tablet_sizes which contains tablet sizes of all the replicas of all the tablets in the cluster.

A version containing this bug has not yet been released, so a backport is not needed.

Closes scylladb/scylladb#27118

* github.com:scylladb/scylladb:
  test: add tests for tablet size migration during end_migration
  virtual_table: add tablet_sizes virtual table
  load_stats: update tablet sizes after migration or rebuild
2025-11-24 15:31:30 +01:00
Yaron Kaikov
13eca61d41 install-on-linux.rst: update installation example to supported release
Example of installation is out of date, since scylla-5.2 is EOL for long time

upding the example for more recent release (together with packages update)
2025-11-24 16:22:17 +02:00
Anna Stuchlik
724dc1e582 doc: fix the info about object storage
This commit fixes the information about object storage:

- Object storage configuration is no longer marked as experimental.
- Redundant information has been removed from the description.
- Information related to object storage for SStabels has been removed
  as the feature is not working.

Fixes https://github.com/scylladb/scylladb/issues/26985

Closes scylladb/scylladb#26987
2025-11-24 17:16:33 +03:00
Yaron Kaikov
5541f75405 docs: modify debian/ubutnu installation instructions
To support debian13, we need to modify the installation instructions since `apt-key` command is no longer available

Fixes: https://github.com/scylladb/scylladb/issues/26673
2025-11-24 13:33:17 +02:00
Michał Jadwiszczak
08974e1d50 db/view/view_building_worker: change internal implementation
This commit doesn't change the logic behind the view building worker but
it changes how the worker is executing view building tasks.

Previously, the worker had a state only on shard0 and it was reacting to
changes in group0 state. When it noticed some tasks were moved to
`STARTED` state, the worker was creating a batch for it on the shard0
state.
The RPC call was used only to start the batch and to get its result.

Now, the main logic of batch management was moved to the RPC call
handler.
The worker has a local state on each shard and the state
contains:
- unique ptr to the batch
- set of completed tasks
- information for which views the base table was flushed

So currently, each batch lives on a shard where it has its work to do
exclusively. This eliminates a need to do a synchronization between
shard0 and work shard, which was a painful point in previous
implementation.

The worker still reacts to changes in group0 view building state, but
currently it's only used to observe whether any view building tasks was
aborted by setting `ABORTED` state.

To prepare for further changes to drop the view building task state,
the worker ignores `IDLE` and `STARTED` states completely.
2025-11-24 11:12:31 +01:00
Michał Jadwiszczak
6d853c8f11 db/view/view_building_coordinator: change work_on_tasks RPC return type
During the initial implementation of the view builing coordinator,
we decided that if a view building task fails locally on the worker
(example reason: view update's target replica is not available),
the worker will retry this work instead of reporting a failure to the
coordinator.

However, we left return type of the RPC, which was telling if a task was
finished successfully or aborted.
But the worker doesn't need to report that a task was aborted, because
it's the coordinator, who decides to abort a task.

So, this commit changes the return type to list of UUIDs of completed
tasks.
Previously length of the returned vector needed to be the same as length
of the vector sent in the request.
No we can drop this restriction and the RPC handler return list of UUIDs
of completed tasks (subset of vector sent in the request).

This change is required to drop `STARTED` state in next commits.

Since Scylla 2025.4 wasn't released yet and we're going to merge this
patch before releasing, no RPC versioning or cluster feature is needed.
2025-11-24 11:12:29 +01:00
Avi Kivity
eb5e9f728c build: lock cxxbridge-cmd version to the rest of the cxx packages
rust/Cargo.toml locks the cxx packages to version 1.0.83,
but install-dependencies.sh does not lock cxxbridge-cmd, part
of that ecosystem. Since cxx 1.0.189 broke compatibility with
1.0.83 (understandable, as these are all sub-packages of a single
repository), builds with newer cxxbridge-cmd are broken.

Fix by locking cxxbridge-cmd to the same version as the other
cxx subpackages.

Regenerated frozen toolchain with optimized clang from
    https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-aarch64.tar.gz
    https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-x86_64.tar.gz

Probably better done by building cxxbridge-cmd during the build
itself, but that is a deeper change.

Fixes #27176

Closes scylladb/scylladb#27177
2025-11-24 07:04:53 +02:00
Avi Kivity
d6ef5967ef tools: toolchain: prepare: replace 'reg' with 'skopeo'
The prepare scripts uses 'reg' to verify we're not going to
overwrite an existing image. The 'reg' command is not
available in Fedora 43. Use 'skopeo' instead. Skopeo
is part of the podman ecosystem so hopefully will live longer.

Fixes #27178.

Closes scylladb/scylladb#27179
2025-11-24 06:59:34 +02:00
Aleksandra Martyniuk
19a7d8e248 replica: database: change type of tables_metadata::_ks_cf_to_uuid
If there is a lot of tables, a node reports oversized allocation
in _ks_cf_to_uuid of type flat_hash_map.

Change the type to std::unordered_map to prevent oversized allocations.

Fixes: https://github.com/scylladb/scylladb/issues/26787.

Closes scylladb/scylladb#27165
2025-11-24 06:42:40 +02:00
Botond Dénes
296d7b8595 Merge 'Enable digest+checksum verification for file based streaming' from Taras Veretilnyk
This patch enables integrity check in  'create_stream_sources()' by introducing a new 'sstable_data_stream_source_impl' class for handling the Data component of SSTables. The new implementation uses 'sstable::data_stream()' with 'integrity_check::yes' instead of the raw input_stream.

These additional checks require reading the digest and CRC components from disk, which may introduce some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data.
For compressed SSTables - where checksums are already embedded  - the cost comes from reading, calculating and verifying the diges.

New test cases were added to verify that the integrity checks work correctly, detecting both data and digest mismatches.

Backport is not required, since it is a new feature

Fixes #21776

Closes scylladb/scylladb#26702

* github.com:scylladb/scylladb:
  file_stream_test: add sstable file streaming integrity verification test cases
  streaming: prioritize sender-side errors in tablet_stream_files
  sstables: enable integrity check for data file streaming
  sstables: Add compressed raw streaming support
  sstables: Allow to read digest and checksum from user provided file instance
  sstables: add overload of data_stream() to accept custom file_input_stream_options
2025-11-24 06:37:27 +02:00
Aleksandra Martyniuk
76174d1f7a cql3: reject ALTER KEYSPACE if rf of datacenter with tablets is omitted
In ALTER KEYSPACE, when a datacenter name is omitted, its replication
factor is implicitly set to zero with vnodes, while with tablets,
it remains unchanged.

ALTER KEYSPACE should behave the same way for tablets as it does
for vnodes. However, this can be dangerous as we may mistakenly
drop the whole datacenter.

Reject ALTER KEYSPACE if it changes replication factor, but omits
a datacenter that currently contains tablet replicas.

Fixes: https://github.com/scylladb/scylladb/issues/25549.

Closes scylladb/scylladb#25731
2025-11-24 06:36:51 +02:00
Avi Kivity
85db7b1caf Merge 'address_map: Use more efficient and reliable replication method' from Tomasz Grabiec
Primary issue with the old method is that each update is a separate
cross-shard call, and all later updates queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and
system will appear unresponsive for mapping changes on non-zero shards.

This happened in the field when one of the shards was overloaded with
sstables and compaction work, which caused frequent stalls which
delayed polling for ~100ms. A queue of 3k address updates
accumulated, because we update mapping on each change of gossip
states. This made bootstrap impossible because nodes couldn't
learn about the IP mapping for the bootstrapping node and streaming
failed.

To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.

It is also more reliable, if replication fails transiently for some
reason, we don't give up and fail all later updates.

Fixes #26865

Closes scylladb/scylladb#26941

* github.com:scylladb/scylladb:
  address_map: Use barrier() to wait for replication
  address_map: Use more efficient and reliable replication method
  utils: Introduce helper for replicated data structures
2025-11-23 19:15:12 +02:00
Avi Kivity
b0643f8959 Merge 'db/config: enable ms sstable format by default' from Michał Chojnowski
Trie-based sstable indexes are supposed to be (hopefully) a better default than the old BIG indexes.
Make them the new default.

If we change our mind, this change can be reverted later.

New functionality, and this is a drastic change. No backport needed.

Closes scylladb/scylladb#26377

* github.com:scylladb/scylladb:
  db/config: enable `ms` sstable format by default
  cluster/dtest/bypass_cache_test: switch from highest_supported_sstable_format to chosen_sstable_format
  api/system: add /system/chosen_sstable_version
  test/cluster/dtest: reduce num_tokens to 16
2025-11-23 13:52:57 +02:00
Piotr Dulikowski
e8b0f8faa9 Merge 'vector search: Add HTTPS requests support' from Karol Nowacki
vector_search: Add HTTPS support for vector store connections

This commit introduces TLS encryption support for vector store connections.
A new configuration option is added:
- vector_store_encryption_options.truststore: path to the trust store file

To enable secure connections, use the https:// scheme in the
vector_store_primary_uri/vector_store_secondary_uri configuration options.

Fixes: VECTOR-327

Backport to 2025.4 as this feature is expected to be available in 2025.4.

Closes scylladb/scylladb#26935

* github.com:scylladb/scylladb:
  test: vector_search: Ensure all clients are stopped on shutdown
  vector_search: Add HTTPS support for vector store connections
2025-11-22 14:58:06 +01:00
Karol Nowacki
58456455e3 test: vector_search: Ensure all clients are stopped on shutdown
A flaky test revealed that after `clients::stop()` was called,
the `old_clients` collection was sometimes not empty,
indicating that some clients were not being stopped correctly.
This resulted in sanitizer errors when objects went out of scope at the end of the test.

This patch modifies `stop()` to ensure all clients, including those in `old_clients`,
are stopped, guaranteeing a clean shutdown.
2025-11-22 08:18:45 +01:00
Karol Nowacki
c40b3ba4b3 vector_search: Add HTTPS support for vector store connections
This commit introduces TLS encryption support for vector store connections.
A new configuration option is added:
- vector_store_encryption_options.truststore: path to the trust store file

To enable secure connections, use the https:// scheme in the
vector_store_primary_uri/vector_store_secondary_uri configuration options.

Fixes: VECTOR-327
2025-11-22 08:18:45 +01:00
Ferenc Szili
39711920eb test: add tests for tablet size migration during end_migration
This change adds tests for the correctness of tablet size migration
during the end_migrations stage. This size migration can happend for
tablet migrations and for tablet rebuild.
2025-11-21 16:58:11 +01:00
Ferenc Szili
e96863be0c virtual_table: add tablet_sizes virtual table
This change adds the tablet_sizes virtual table. The contents of this
table are gathered from the current load_stats data structure.
2025-11-21 16:53:28 +01:00
Ferenc Szili
cede4f66af load_stats: update tablet sizes after migration or rebuild
When migrating tablet size during the end_migration tablet transition
stage, we need the pending and leaving replica hosts. The leaving and
pending replicas are gathered in objects of type
std::optional<tablet_replica> and are not checked if they contain a
value before dereferencing which could cause an exception in the
topology coordinator.

This patch adds a check for leaving and pending replicas, and only
perfoms the tablet size migration if neither are empty.

This bug was introduced in 10f07fb95a

This change also adds the functionality to add the tablet size to
load_stats after a tablet rebuild. We compute the average tablet size
from the existing replicas, and add the new size to the pending replica.
2025-11-21 16:22:20 +01:00
Botond Dénes
38a1b1032a Merge 'doc: update Cloud Instance Recommendations for GCP' from Anna Stuchlik
This PR:
- Removes n1-highmem instances from Recommended Instances.
- Adds missing support for n2-highmem-96.
- Updates the reference to n2 instances in the Google Cloud docs (fixes a broken link to GCP).
- Adds the missing information about processors for n2-highmem-instance - Ice Lake and Cascade Lake (requested by CX).

Fixes https://github.com/scylladb/scylladb/issues/25946
Fixes https://github.com/scylladb/scylladb/issues/24223
Fixes https://github.com/scylladb/scylladb/issues/23976

No backport needed if this PR is merged before 2025.4 branching.

Closes scylladb/scylladb#26182

* github.com:scylladb/scylladb:
  doc: update information for n2-highmem instances
  doc: remove n1-highmem instances from Recommended Instances
2025-11-21 16:28:54 +02:00
Anna Stuchlik
dab74471cc doc: update information for n2-highmem instances
This commit updates the section for n2-highmem instances
on the Cloud Instance Recommendations page

- Added missing support for n2-highmem-96
- Update the reference to n2 instances in the Google Cloud docs.
- Added the missing information about processors for this instance
  type (Ice Lake and Cascade Lake).
2025-11-21 15:13:36 +01:00
Taras Veretilnyk
3003669c96 file_stream_test: add sstable file streaming integrity verification test cases
Add 'test_sstable_stream' to verify SSTable file streaming integrity check.
The new tests cover both compressed and uncompressed SSTables and includes:
- Checksum mismatch detection verification
- Digest mismatch detection verifivation
2025-11-21 12:52:35 +01:00
Taras Veretilnyk
77dcad9484 streaming: prioritize sender-side errors in tablet_stream_files
When 'send_data_to_peer' throws and
closes the sink, the peer later reports its own error, masking the
original sender failure.
This commit preserves the original sender exception.
If the status-retrieval task throws its own error before sender task rethrows its exception,
we can still propagate the original exception later.
2025-11-21 12:52:31 +01:00
Taras Veretilnyk
c8d2f89de7 sstables: enable integrity check for data file streaming
This patch enables integrity check in  'create_stream_sources()' by introducing a new
'sstable_data_stream_source_impl' class for handling the Data component of
SSTables. The new implementation uses 'sstable::data_stream()' with 'integrity_check::yes' instead
of the raw input_stream.

These additional checks require reading the digest and CRC components from
disk, which may introduce some I/O overhead. For uncompressed SSTables,
this involves loading and computing checksums and digest from the data.
For compressed SSTables - where checksums are already embedded - the
cost comes from reading, calculation and verifying the digest.
2025-11-21 12:52:26 +01:00
Taras Veretilnyk
18e1dbd42e sstables: Add compressed raw streaming support
Implement compressed_raw_file_data_source that streams compressed chunks
without decompression while verifying checksums and calculating digests.
Extends raw_stream enum to support compressed_chunks mode.
This data_source implementation will be used in the next commits
for file based streaming.
2025-11-21 12:52:04 +01:00
Taras Veretilnyk
c32e9e1b54 sstables: Allow to read digest and checksum from user provided file instance
Add overloaded methods to read digest and checksum from user-provided file
handles:
- 'read_digest(file f)'
- 'read_checksum(file f)

This will be useful for tablet file-based streaming to enable integrity verification, as the streaming code uses SSTable snapshots with open files to prevent missing components when SSTables are unlinked.
2025-11-21 12:51:40 +01:00
Michał Chojnowski
da51a30780 db/config: enable ms sstable format by default
Trie-based sstable indexes are supposed to be (hopefully)
a better default than the old BIG indexes.
Make them the new default.

If we change our mind, this change can be reverted later.
2025-11-21 12:39:46 +01:00
Michał Chojnowski
73090c0d27 cluster/dtest/bypass_cache_test: switch from highest_supported_sstable_format to chosen_sstable_format
Trie-based indexes and older indexes have a difference in metrics,
and the test uses the metrics to check for bypass cache.
To choose the right metrics, it uses highest_supported_sstable_format,
which is inappropriate, because the sstable format chosen for writes
by Scylla might be different than highest_supported_sstable_format.

Use chosen_sstable_format instead.
2025-11-21 12:39:46 +01:00
Michał Chojnowski
38e14d9cd5 api/system: add /system/chosen_sstable_version
Returns the sstable version currently chosen for use in for new sstables.

We are adding it because some tests want to know what format they are
writing (tests using upgradesstable, tests which check stats that only
apply to one of the index types, etc).

(Currently they are using `highest_supported_sstable_format` for this
purpose, which is inappropriate, and will become invalid if a non-latest
format is the default).
2025-11-21 12:39:46 +01:00
Wojciech Mitros
aacf883a8b cql: add CONCURRENCY to the USING clause
Currently, the PRUNE MATERIALIZED VIEW statement performs all its
reads and writes in a single, continous sequence. This takes too
much time even for a moderate amount of 'PRUNED' data.
Instead, we want to make it possible to set a concurrency of the
reads and writes performed while processing the PRUNE statement,
so that if the user so desires, it may finish the PRUNING quicker
at the cost of adding more load on the cluster.
In this patch we add the CONCURRENCY setting to the USING clause
in cql. In the next patch, we'll be using it to actually set the
concurrency of PRUNE MATERIALIZED VIEW.
2025-11-21 12:32:52 +01:00
Botond Dénes
5c6813ccd0 test/cluster/test_repair.py: add test_repair_timestamp_difference
Add a test which verifies that if two nodes have the same data, with
different timestamps, repair will detect and fix the diverging
timestamps.

All our repair tests focus on difference in data and I remember writing
this test multiple times in the past to quickly verify whether this
works. Time to upstream this test.

Closes scylladb/scylladb#26900
2025-11-21 14:19:51 +03:00
Botond Dénes
6f79fcf4d5 tools/scylla-nodetool: dump request history on json assert
A JSON assert happens when a JSON member is either missing or has
unexpected type. rapidjson has a very unhelpful "json assert failed"
message for this, with a backtrace (somewhat helpful), with no other
context. To help debug such errors, collect all request sent to the API
and dump them when such errors happen. The backtrace with the full
request history should be enough to debug any such issues.

Refs CUSTOMER-17

Closes scylladb/scylladb#26899
2025-11-21 14:17:53 +03:00
Gautam Menghani
939fcc0603 db/system_keyspace: Remove the FIXME related to caching of large tables
Remove the FIXME comment for re-enabling caching of the large tables
since the tables are used infrequently [1].

[1] : github.com/scylladb/scylladb/pull/26789#issuecomment-3477540364

Fixes #26032

Signed-off-by: Gautam Menghani <gautam.opensource@gmail.com>

Closes scylladb/scylladb#26789
2025-11-21 12:34:34 +02:00
Radosław Cybulski
d589e68642 Add precompiled headers to CMakeLists.txt
Add precompiled header support to CMakeLists.txt and configure.py -
it improves compilation time by approximately 10%.

New header `stdafx.hh` is added, don't include it manually -
the compiler will include it for you. The header contains includes from
external libraries used by Scylla - seastar, standard library,
linux headers and zlib.

The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER`
or configure.py --disable-precompiled-header to disable.

The feature should be disabled, when trying to check headers - otherwise
you might get false negatives on missing includes from seastar / abseil and so on.

Note: following configuration needs to be added to ccache.conf:

    sloppiness = pch_defines,time_macros,include_file_mtime,include_file_ctime

Closes scylladb/scylladb#26617
2025-11-21 12:27:41 +02:00
Nadav Har'El
64a075533b alternator: fix update of stats from wrong shard
In commit 51186b2 (PR #25457) we introduced new statistics for
authentication errors, and among other places we modified
executor::create_table() to update them when necessary.

This function runs its real work (create_table_on_shard0()) on shard
0, but incorrectly updates "_stats" from the original shard. It doesn't
really matter which shard's stats we update - but it does matter that
code running on shard 0 shouldn't touch some other shard's objects.
Since all we do on these stats is to increment an integer, the risk
of updating it on the wrong shard is minimal to non-existant, but it's
still wrong and can cause bigger trouble in the future as the code
continues to evolve.

The fix is simple - we should pass to create_table_on_shard0() the
_stats object from the acutal shard running it (shard 0).

Fixes #26942

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#26944
2025-11-21 11:53:06 +02:00
Calle Wilund
3c4546d839 messaging_service: Add internode_compression=rack as option
Fixes #27085

Adds a "rack" option to enum/config and handles in connection
setup in messaging_service.

Closes scylladb/scylladb#27099
2025-11-21 11:50:55 +02:00
Nadav Har'El
66bd3dc22c test/alternator: tests for request compression
DynamoDB's documentation
https://docs.aws.amazon.com/sdkref/latest/guide/feature-compression.html
suggests that DynamoDB allows request bodies to be compressed (currently
only by gzip).

The purpose of patch is to have a test reproducing this feature. The
test shows us that indeed DynamoDB understands compressed requests using
the "gzip" encoding, but Alternator does *not*, so the new test is xfail.

As you can see in the test code, although the low-level SDK (botocore)
can send compress requests, this is not actually enabled for DynamoDB
and we need to resort to some trickery to send compressed requests.
But the point is that once we do manage to send compressed requests,
the test shows us that they work properly on AWS, but fail on
Alternator.

The failure of the compressed requests on Alternator is reported like:

    An error occurred (ValidationException) when calling the PutItem
    operation: Parsing JSON failed: Invalid value. at 70459088

This error message should probably be improved (what is that high
number?!) but of course even better would be to make it really work.

By enabling tracing on alternator-server (e.g., edit test/cqlpy/run.py
and add `'--logger-log-level', 'alternator-server=trace',`) we can
see exactly what request the SDK sends Alternator. What we can see in
the request is:

1. The request headers are uncompressed (this is expected in HTTP)
2. There is a header "Content-Encoding: gzip"
3. The request's body is binary, a full-fleged gzip output complete with
   a gzip magic in the beginning.

Refs #5041

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27049
2025-11-21 10:48:33 +02:00
Shreyas Ganesh
4488a4fb06 docs: document sstables quarantine subdirectory
Add documentation for the quarantine/ subdirectory that holds SSTables
isolated due to validation failures or corruption. Document the scrub
operation's quarantine_mode parameter options and the drop_quarantined_sstables
API operation.

Also update the directory hierarchy example to include the quarantine directory.

Fixes #10742

Signed-off-by: Shreyas Ganesh <vansi.ganeshs@gmail.com>

Closes scylladb/scylladb#27023
2025-11-21 10:45:33 +02:00
Ernest Zaslavsky
825d81dde2 cmake: dont complain about deprecated builtins
On clang 21.1.4 (Fedora 43) the abseil compilation started to fail with `builtin XXX is deprecated use YYY instead`. Suppress this for abseil compilation only

Closes scylladb/scylladb#27098
2025-11-21 10:31:54 +02:00
Botond Dénes
0cc5208f8e Merge 'Add sstables_manager::config' from Pavel Emelyanov
Currently sstables_manager keeps a reference on global db::config to configure itself. Most of other services use their own specific configs with much less data on-board for the same purposes (e.g. #24841, #19051 and #23705 did same for other services) This PR applies this approach to sstables_manager as well.

Mostly it moves various values from db::config onto newly introduced struct sstables_manager::config, but it also adds specific tracking of sstable_file_io_extensions and patches tools/scylla-sstable not to use sstables_manager as "proxy" object to get db::config from along its calls.

Shuffling components dependencies, no need to backport

Closes scylladb/scylladb#27021

* github.com:scylladb/scylladb:
  sstables_manager: Drop db::config from sstables_manager
  tools/sstable: Make shard_of_with_tablets use db::config argument
  tools/sstable: Add db::config& to all operations
  tools/sstable: Get endpoints from storage manager
  sstables_manager: Hold sstable IO extensions on it
  sstables: Manager helper to grab file io extensions
  sstables_manager: Move default format on config
  sstables_manager: Move enable_sstable_data_integrity_check on config
  sstables_manager: Move data_file_directories on config
  sstables_manager: Move components_memory_reclaim_threshold on config
  sstables_manager: Move column_index_auto_scale_threshold on config
  sstables_manager: Move column_index_size on config
  sstables_manager: Move sstable_summary_ratio on config
  sstables_manager: Move enable_sstable_key_validation on config
  sstables_manager: Move available_memory on config
  code: Introduce sstables_manager::config
  sstables: Patch get_local_directories() to work on vector of paths
  code: Rename sstables_manager::config() into db_config()
2025-11-21 10:21:41 +02:00
Botond Dénes
f89bb68fe2 Merge 'cdc: Preserve properties when reattaching log table' from Dawid Mędrek
When we enable CDC on a table, Scylla creates a log table for it.
It has default properties, but the user may change them later on.
Furthermore, it's possible to detach that log table by simply
disabling CDC on the base table:

```cql
/* Create a table with CDC enabled. The log table is created. */
CREATE TABLE ks.t (pk int PRIMARY KEY) WITH cdc = {'enabled': true};

/* Detach the log table. */
ALTER TABLE ks.t WITH cdc = {'enabled': false};

/* Modify a property of the log table. */
ALTER TABLE ks.t_scylla_cdc_log WITH bloom_filter_fp_chance = 0.13;
```

The log table can also be reattached by enabling CDC on the base table
again:

```cql
/* Reattach the log table */
ALTER TABLE ks.t WITH cdc = {'enabled': true};
```

However, because the process of reattachment goes through the same
code that created it in the first place, the properties of the log
table are rolled back to their default values. This may be confusing
to the user and, if unnoticed, also have other consequences, e.g.
affecting performance.

To prevent that, we ensure that the properties are preserved.

A reproducer test,
`test_log_table_preserves_properties_after_reattachment`, has been
provided to verify that the changes are correct.

Another test, `test_log_table_preserves_id_after_reattachment`, has
also been added because the current implementation sets properties
and the ID separately.

Fixes scylladb/scylladb#25523

Backport: not necessary. Although the behavior may be unexpected,
it's not a bug per se.

Closes scylladb/scylladb#26443

* github.com:scylladb/scylladb:
  cdc: Preserve properties when reattaching log table
  cdc: Extract creating columns in CDC log table to dedicated function
  cdc: Extract default properties of CDC log tables to dedicated function
  schema/schema_builder.hh: Add set_properties
  schema: Add getter for schema::user_properties
  schema: Remove underscores in fields of schema::user_properties
  schema: Extract user properties out of raw_schema
2025-11-21 10:06:05 +02:00
Calle Wilund
03408b185e utils::gcp::object_storage: Fix buffer alignment reordering trailing data
Fixes #26874

Due to certain people (me) not being able to tell forward from backward,
the data alignment to ensure partial uploads adhere to the 256k-align
rule would potentially _reorder_ trailing buffers generated iff the
source buffers input into the sink are small enough. Which, as a fun fact,
they are in backup upload.

Change the unit test to use raw sink IO and add two unit tests (of which
the smaller size provokes the bug) that checks the same 64k buf segmented
upload backup uses.

Closes scylladb/scylladb#26938
2025-11-21 09:36:13 +02:00
Radosław Cybulski
ce8db6e19e Add table name to tracing in alternator
Add a table name to Alternator's tracing output, as some clients would
like to consistently receive this information.

- add missing `tracing::add_table_name` in `executor::scan`
- add emiting tables' names in `trace_state::build_parameters_map`
- update tests, so when tracing is looked for it is filtered by table's
  name, which confirms table is being outputed.
- change `struct one_session_records` declaration to `class one_session_records`,
  as `one_session_records` is later defined as class.

Refs #26618
Fixes #24031

Closes scylladb/scylladb#26634
2025-11-21 09:33:40 +02:00
Michał Chojnowski
3f11a5ed8c test/cluster/dtest: reduce num_tokens to 16
cluster.dtest_alternator_tests.test_slow_query_logging performs
a bootstrap with 768 token ranges.

It works with `me` sstables, which have 2 open file descriptors
per open sstable, but with `ms` sstables, which have 3 open
file descriptors per open sstable, it fails with EMFILE.

To avoid this problem, let's just decrease the number of vnodes
for in the test suite. It's appropriate anyway, because it avoids some
unneeded work without weakening the tests.
(Note: pylib-based have been setting `num_tokens` to 16 for a long time too).

This breaks `bypass_cache_test`, which is written in a way that expects
a certain number of token ranges. We adjust the relevant parameter
accordingly.
2025-11-21 00:38:50 +01:00
Piotr Dulikowski
22f22d183f Merge 'Refine sstables_loader vs database dependency' from Pavel Emelyanov
There are two issues with it.

First, a small RPC helper struct carries database reference on board just to get feature state from it.
Second, sstable_streamer uses database as proxy to feature service.

This PR improves both.

Services dependencies improvement, not need to backport

Closes scylladb/scylladb#26989

* github.com:scylladb/scylladb:
  sstables_loader: Get LOAD_AND_STREAM_ABORT_RPC_MESSAGE from messaging
  sstables_loader: Keep bool on send_meta, not database reference
2025-11-20 16:13:16 +01:00
Asias He
d51b1fea94 tablets: Allow tablet merge when repair tasks exist
Currently we do not allow tablet merge if either of the tablets contain
a tablet repair request. This could block the tablet merge for a very
long time if the repair requests could not be scheduled and executed.

We can actually merge the repair tasks in most of the cases. This is
because most of the time all tablets are requested to be repaired by a
single API request, so they share the same task_id, request_type and
other parameters. We can merge the repair task info and executes the
repair after the merge.  If they do not share the task info, we could
not merge and have to wait for the repair before merge, which is both
rare and ok.

Another case is that one of the tablet has a repair task info (t1) while
the other tablet (t2) does not have, it is possible the t2 has finished
repair by the same repair request or t2 is not requested to be repaired
at all. We allow merge in this case too to avoid blocking the tablet
merge, with the price of reparing a bit more.

Fixes #26844

Closes scylladb/scylladb#26922
2025-11-20 16:01:23 +01:00
Asias He
3cf1225ae6 docs: Add feature page for incremental repair
Adds a new documentation page for the incremental repair feature.

The page covers:
- What incremental repair is and its benefits over the standard repair process.
- How it works at a high level by tracking the repair status of SSTables.
- The prerequisite of using the tablets architecture.
- The different user-configurable modes: 'regular', 'full', and 'disabled'.

Fixes #25600

Closes scylladb/scylladb#26221
2025-11-20 11:58:53 +02:00
Raphael S. Carvalho
74ecedfb5c replica: Fail timed-out single-key read on cleaned up tablet replica
Consider the following:
1) single-key read starts, blocks on replica e.g. waiting for memory.
2) the same replica is migrated away
3) single-key read expires, coordinator abandons it, releases erm.
4) migration advances to cleanup stage, barrier doesn't wait on
   timed-out read
5) compaction group of the replica is deallocated on cleanup
6) that single-key resumes, but doesn't find sstable set (post cleanup)
7) with abort-on-internal-error turned on, node crashes

It's fine for abandoned (= timed out) reads to fail, since the
coordinator is gone.
For active reads (non timed out), the barrier will wait for them
since their coordinator holds erm.
This solution consists of failing reads which underlying tablet
replica has been cleaned up, by just converting internal error
to plain exception.

Fixes #26229.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#27078
2025-11-20 11:44:03 +02:00
Geoff Montee
a0734b8605 Update update-topology-strategy-from-simple-to-network.rst: Multiple clarifications to page and sub-procedures
Fixes #27077

Multiple points can be clarified relating to:

* Names of each sub-procedure could be clearer
* Requirements of each sub-procedure could be clearer
* Clarify which keyspaces are relevant and how to check them
* Fix typos in keyspace name

Closes scylladb/scylladb#26855
2025-11-20 11:33:15 +02:00
Patryk Jędrzejczak
45ad93a52c topology_coordinator: include all transitioning nodes in all global commands
This change makes the code simpler and less vulnerable to regressions.

There is no functional impact because:
- we already include a decommissioning/bootstrapping/replacing node for
  `barrier` and `barrier_and_drain`,
- we never execute global commands in the presence of a rebuilding node,
- removing node always belongs to `exclude_nodes`, so it's filtered out
  anyway,
- we execute global `stream_ranges` only for removenode,
- we execute global `wait_for_ip` only for new nodes when there are no
  transitioning nodes.

Fixes #20272
Fixes #27066

Closes scylladb/scylladb#27102
2025-11-20 11:11:32 +02:00
dependabot[bot]
2ca926f669 build(deps): bump sphinx-multiversion-scylla in /docs
Bumps [sphinx-multiversion-scylla](https://holzhaus.github.io/sphinx-multiversion/) from 0.3.2 to 0.3.3.

---
updated-dependencies:
- dependency-name: sphinx-multiversion-scylla
  dependency-version: 0.3.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Closes scylladb/scylladb#27081
2025-11-20 10:28:34 +02:00
Gleb Natapov
ad3cf2c174 utils: fix get_random_time_UUID_from_micros to generate correct time uuid
According to the IETF spec uuid variant bits should be set to '10'. All
others are either invalid or reserved. The patch change the code to
follow the spec.

Closes scylladb/scylladb#27073
2025-11-20 10:27:29 +02:00
Avi Kivity
5d761373c2 Update tools/cqlsh submodule
* tools/cqlsh 19445a5...2240122 (3):
  > copyutil: adjust multiprocessing method to 'fork'
  > Drop serverless/cloudconf feature
  > Migrate workflows to Blacksmith

Closes scylladb/scylladb#27065
2025-11-20 10:26:43 +02:00
Taras Veretilnyk
e5fbe3d217 docs: improve documentation of the scrub
Update nodetool scrub documentation to include --quarantine-mode and --drop-unfixable-sstables options,
add a section explaining quarantine modes
and provide examples and procedures for handling and removing corrupted SSTables.

Closes scylladb/scylladb#27018
2025-11-20 10:26:07 +02:00
Nadav Har'El
a9cf7d08da test/cqlpy: remove USE from test, and test a USE-related bug
One of the tests in test_describe.py used "USE {test_keyspace}" which
affects the CQL session shared by all tests in an unrecoverable way (there
is no "UNUSE" statement).

As an example of what might happen if the shared CQL session is "polluted"
by a USE, issue #26334 is about a bug we have in DESC KEYSPACES when USE
is active.

So in this patch, we:

1. Fix the test to not use USE on the shared CQL session - it's easy to
   create a separate session to use the "USE" on. With this fix, the test
   no longer leaves the shared CQL session in a "USE" state.

2. Add a new xfailing test to reproduce the DESC KEYSPACES bug.
   Refs #26334

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#26345
2025-11-20 10:25:03 +02:00
Botond Dénes
a084094c18 Merge 'alternator and cql: tests for default sstable compression' from Nadav Har'El
The purpose of this two-patch series is to reproduce a previously unknown bug, Refs #26914.

Recently we saw a lot of patches that change how we create new schemas (keyspaces and tables), sometimes changing various long-time defaults. We started to worry that perhaps some of these defaults were applied only to CQL base tables and perhaps not to Alternator or to CQL's auxiliary tables (materialized views, secondary indexes, or CDC logs). For example, in Refs #26307 we wondered if perhaps the default "speculative_retry" option is different in Alternator than in CQL.

The first patch includes Alternator tests, and the second CQL tests. In both tests we discover that although recently (commit adf9c42, Refs #26610) we changed the default sstable compressor from LZ4Compressor to LZ4WithDictsCompressor, actually this change was **only** applied to CQL base tables. All Alternator tables and all CQL auxiliary tables (views, indexes, CDC) still use the old LZ4Compressor. This is issue #26914.

Closes scylladb/scylladb#26915

* github.com:scylladb/scylladb:
  test/cqlpy: test compression setting for auxiliary table
  test/alternator: tests for schema of Alternator table
2025-11-20 10:24:31 +02:00
Karol Nowacki
104de44a8d vector_search: Add support for secondary vector store clients
This change adds support for secondary vector store clients, typically
located in different availability zones. Secondary clients serve as
fallback targets when all primary clients are unavailable.
New configuration option allows specifying secondary client addresses
and ports.

Fixes: VECTOR-187

Closes scylladb/scylladb#26484
2025-11-20 08:37:18 +01:00
Pavel Emelyanov
1cabc8d9b0 Merge 'streaming: fix loop break condition in tablet_sstable_streamer::stream' from Ernest Zaslavsky
When streaming SSTables by tablet range, the original implementation of tablet_sstable_streamer::stream may break out of the loop too early when encountering a non-overlaping SSTable. As a result, subsequent SSTables that should be classified as partially contained are skipped entirely.

Tablet range: [4, 5]
SSTable ranges:
[0,5]
[0, 3] <--- is considered exhausted, and causes skip to next tablet
[2, 5] <--- is missed for range [4, 5]

The loop uses if (!overlaps) break; semantics, which conflated “no overlap” with “done scanning.” This caused premature termination when an SSTable did not overlapped but the following one did.

Correct logic should be:

before(sst_last) → skip and continue.

after(sst_first) → break (no further SSTables can overlap).

Otherwise → `contains` to classify as full or partial.

Missing SSTables in streaming and potential data loss or incomplete streaming in repair/streaming operations.

1. Correct the loop termination logic that previously caused certain SSTables to be prematurely excluded, resulting in lost mutations. This change ensures all relevant SSTables are properly streamed and their mutations preserved.
2. Refactor the loop to use before() and after() checks explicitly, and only break when the SSTable is entirely after the tablet range
3. Add pytest to cover this case, full streaming flow by means of `restore`
4. Add boost tests to test the new refactored function

This data corruption fix should be ported back to 2024.2, 2025.1, 2025.2, 2025.3 and 2025.4

Fixes: https://github.com/scylladb/scylladb/issues/26979

Closes scylladb/scylladb#26980

* github.com:scylladb/scylladb:
  streaming: fix loop break condition in tablet_sstable_streamer::stream
  streaming: add pytest case to reproduce mutation loss issue
2025-11-20 10:16:17 +03:00
Piotr Dulikowski
dc7944ce5c Merge 'vector_search: Fix error handling and status parsing' from Karol Nowacki
vector_search: Fix error handling and status parsing

This change addresses two issues in the vector search client that caused
validator test failures: incorrect handling of 5xx server errors and
faulty status response parsing.

1.  5xx Error Handling:
    Previously, a 5xx response (e.g., 503 Service Unavailable) from the
    underlying vector store for an `/ann` search request was incorrectly
    interpreted as a node failure. This would cause the node to be marked
    as down, even for transient issues like an index scan being in progress.

    This change ensures that 5xx errors are treated as transient search
    failures, not node failures, preventing nodes from being incorrectly
    marked as down.

2.  Status Response Parsing:
    The logic for parsing status responses from the vector store was
    flawed. This has been corrected to ensure proper parsing.

Fixes: SCYLLADB-50

Backport to 2025.4 as this problem is present on this branch.

Closes scylladb/scylladb#27111

* github.com:scylladb/scylladb:
  vector_search: Don't mark nodes as down on 5xx server errors
  test: vector_search: Move unavailable_server to dedicated file
  vector_search: Fix status response parsing
2025-11-20 08:14:28 +01:00
Botond Dénes
6ee0f1f3a7 Merge 'replica/table: add a metric for hypothetical total file size without compression' from Michał Chojnowski
This patch adds a metric for pre-compression size of sstable files.

This patch adds a per-table metric
`scylla_column_family_total_disk_space_before_compression`,
which measures the hypothetical total size of sstables on disk,
if Data.db was replaced with an uncompressed equivalent.

As for the implementation:
Before the patch, tables and sstable sets are already tracking their total physical file size.
Whenever sstables are added or removed, the size delta is propagated from the sstable up through sstable sets into table_stats.
To implement the new metric, we turn the size delta that is getting passed around from a one-dimensional to a two-dimensional value, which includes both the physical and the pre-compression size.

New functionality, no backport needed.

Closes scylladb/scylladb#26996

* github.com:scylladb/scylladb:
  replica/table: add a metric for hypothetical total file size without compression
  replica/table: keep track of total pre-compression file size
2025-11-20 09:10:38 +02:00
Karol Nowacki
9563d87f74 vector_search: Don't mark nodes as down on 5xx server errors
For an `/ann` search request, a 5xx server response does not
indicate that the node is down. It can signify a transient state, such
as the index full scan being in progress.

Previously, treating a 503 error as a node fault would cause the node
to be incorrectly marked as down, for example, when a new index was
being created. This commit ensures that such errors are treated as
transient search failures, not node failures.
2025-11-20 08:10:20 +01:00
Karol Nowacki
366ecef1b9 test: vector_search: Move unavailable_server to dedicated file
The unavailable_server code will be reused in upcoming client unit tests.
2025-11-20 08:09:21 +01:00
Benny Halevy
8ed36702ae Update seastar submodule
* seastar 63900e03...340e14a7 (19):
  > Merge 'rpc: harden sink_impl::close()' from Benny Halevy
    rpc: sink_impl::close: fixup indentation
    rpc: harden sink_impl::close()
  > http: Document the way "unread body bytes" accounting works
  > net: tighten port load balancing port access
  > coroutine: reimplement generator with buffered variant
  > Merge 'Stop using net::packet in posix data sink' from Pavel Emelyanov
    net/posix-stack: Don't use packet in posix_data_sink_impl
    reactor: Move fragment-vs-iovec static assertion
    reactor: Make backend::sendmsg() calls use std::span<iovec>
    utils: Introduce iovec_trim_front helper
    utils: Spannize iovec_len()
  > Merge 'Generalize memory data sink in tests' from Pavel Emelyanov
    test: Make output_stream_test splitting test case use own sink
    test: Make some output_stream_test cases use memory data sink
    test: Threadify one of output_stream_test test cases
    test: Make json_formatter_test use memory_data_sink
    test: Move memory_data_sink to its own header
  > dns: avoid using deprecated c-ares API
  > reactor: Move read_directory() to posix_file_impl
  > Merge 'rpc: sink_impl: batch sending and deletion of snd_buf:s' from Benny Halevy
    test: rpc_test: add test_rpc_stream_backpressure_across_shards
    reactor: add abort_on_too_long_task_queue option
    rpc: make sink flush and close noexcept
    rpc: sink_impl: batch sending and deletion of snd_buf:s
    rpc: move sink_impl and source_impl into internal namespace
    rpc: sink_impl: extend backpressure until snd_buf destroy
  > configure.py: fix --api-level help
  > Merge 'Close http client connection if handler doesn't consume all chunked-encoded body' from Pavel Emelyanov
    test: Fix indentation after previous patch
    test/http: Extend test for improper client handling of aborted requests
    test/http: Ignore EPIPE exception from server closing connection
    test/http: Split the partial response body read test
    http: Track "remaining bytes" for chunked_source_impl
    http: Switch content_length_source_impl to update remaining bytes
  > metrics: Add default ~config()
  > headers: Remove smp.hh from app-template.hh
  > prometheus: remove hostname and metric_help config
  > rpc: Tune up connection methods visibility
  > perf_tests: Fix build with fmt 12.0.0 by avoiding internal functions
  > doc: Fix some typos in codying style
  > reactor: Remove unused try_sleep() method

directory_lister::get is adjusted in this patch to
use the new experimental::coroutine::generator interface
that was changed in scylladb/seastar@81f2dc9dd9

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#26913
2025-11-20 07:29:47 +03:00
Pavel Emelyanov
53b71018e8 Merge 'Alternator: additional tests for ExclusiveStartKey' from Nadav Har'El
In pull request #26960 there was some discussion about what is the valid form of ExclusiveStartKey, and whether we need to allow some "non-standard" uses of it for scan over system tables (which aren't real Alternator tables and may have multiple key columns, something not possible in normal Altenrator tables). This made me realize our tests for what is allowed - and what is not allowed - in ExclusiveStartKey - are very sparse and don't cover all the cases that are possible in Scan and Query, in base tables and in GSIs.

So this small series attempts to increase the coverage of the tests for ExclusiveStartKey to make sure we are compatible with DynamoDB and also that we don't regress in #26960.

The new tests reproduce a previously unknown error-path issues, #26988, where in some cases DynamoDB considers ExclusiveStartKey to be invalid but Alternator erronously accepts. Fortunately, we didn't find any success-path (correctness) bugs.

Closes scylladb/scylladb#26994

* github.com:scylladb/scylladb:
  test/alternator: tests for ExclusiveStartKey in GSI
  test/alternator: more tests for ExclusiveStartKey in Scan
  test/alternator: more tests for ExclusiveStartKey in Query
2025-11-20 07:21:39 +03:00
Avi Kivity
0d68512b1f stall_free: make variadic dispose_gently sequential
Having variadic dispose_gently() clear inputs concurrently
serves no purpose, since this is a CPU bound operation. It
will just add more tasks for the reactor to process.

Reduce disruption to other work by processing inputs
sequentially.

Closes scylladb/scylladb#26993
2025-11-20 07:16:16 +03:00
Benny Halevy
fd81333181 test/pylib/cpp: increase max-networking-io-control-blocks value
Increase the value of the max-networking-io-control-blocks option
for the cpp tests as it is too low and causes flakiness
as seen in vector_search.vector_store_client_test.vector_store_client_single_status_check_after_concurrent_failures:
```
seastar/src/core/reactor_backend.cc:342: void seastar::aio_general_context::queue(linux_abi::iocb *): Assertion `last < end` failed.
```

See also https://github.com/scylladb/seastar/issues/976

Fixes #27056

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#27117
2025-11-20 04:31:36 +01:00
Ernest Zaslavsky
dedc8bdf71 streaming: fix loop break condition in tablet_sstable_streamer::stream
Correct the loop termination logic that previously caused
certain SSTables to be prematurely excluded, resulting in
lost mutations. This change ensures all relevant SSTables
are properly streamed and their mutations preserved.
2025-11-19 17:32:49 +02:00
Tomasz Grabiec
f83c4ffc68 address_map: Use barrier() to wait for replication
More efficient than 100 pings.

There was one ping in test which was done "so this shard notices the
clock advance". It's not necessary, since obsering completed SMP
call implies that local shard sees the clock advancement done within in.
2025-11-19 15:21:02 +01:00
Tomasz Grabiec
4a85ea8eb2 address_map: Use more efficient and reliable replication method
Primary issue with the old method is that each update is a separate
cross-shard call, and all later updated queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and
system will appear unresponsive for mapping changes on non-zero shards.

This happened in the field when one of the shards was overloaded with
sstables and compaction work, which caused frequent stalls which
delayed polling for ~100ms. A queue of 3k address updates
accumulated. This made bootstrap impossible, since nodes couldn't
learn about the IP mapping for the bootstrapping node and streaming
failed.

To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.

It is also more reliable, if replication fails transiently for some
reason, we don't give up and fail all later updates.

Fixes #26865
Fixes #26835
2025-11-19 15:21:02 +01:00
Tomasz Grabiec
ed8d127457 utils: Introduce helper for replicated data structures
Key goals:
  - efficient (batching updates)
  - reliable (no lost updates)

Will be used in data structures maintained on one designed owning
shard and replicated to other shards.
2025-11-19 15:21:02 +01:00
Michał Chojnowski
d8e299dbb2 sstables/trie/trie_writer: free nodes after they are flushed
Somehow, the line of code responsible for freeing flushed nodes
in `trie_writer` is missing from the implementation.

This effectively means that `trie_writer` keeps the whole index in
memory until the index writer is closed, which for many dataset
is a guaranteed OOM.

Fix that, and add some test that catches this.

Fixes scylladb/scylladb#27082

Closes scylladb/scylladb#27083
2025-11-19 14:54:16 +02:00
Karol Nowacki
05b9cafb57 vector_search: Fix status response parsing
The response was incorrectly parsed as a plain string and compared
directly with C++ string. However, the body contains a JSON string,
which includes escaped quotes that caused comparison failures.
2025-11-19 10:02:05 +01:00
Nadav Har'El
7b9428d8d7 test/cqlpy: test compression setting for auxiliary table
In the previous patch we noticed that although recently (commit adf9c42,
Refs #26610) we changed the default sstable compressor from LZ4Compressor
to LZ4WithDictsCompressor, this change was only applied to CQL, not to
Alternator.

In this patch we add tests that demonstrate that it's even worse - the
new compression only applies to CQL's *base* table - all the "auxiliary"
tables -
        * Materialized views
        * Secondary index's materialized views
        * CDC log tables

all still have the old LZ4Compressor, different from the base table's
default compressor.

The new test fails on Scylla, reproducing #26914, and passes on
Cassandra (on Cassandra, we only compare the materialized view table,
because SI and CDC is implemented differently).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-11-19 09:18:37 +02:00
Nadav Har'El
11f6a25d44 test/alternator: tests for schema of Alternator table
This patch introduces a new test that exposed a previously unknown bug,
Refs #26914:

Recently we saw a lot of patches that change how we create new schemas
(keyspaces and tables), sometimes changing various long-time defaults.
We started to worry that perhaps some of these defaults were applied only
to CQL and not to Alternator. For example, in Refs #26307 we wondered if
perhaps the default "speculative_retry" option is different in Alternator
than in CQL.

This patch includes a new test file test/alternator/test_cql_schema.py,
with tests for verifying how Alternator configures the underlying tables
it creates. This test shows that the "speculative_retry" doesn't have
this suspected bug - it defaults to "99.0PERCENTILE" in both CQL and
Alternator. But unfortunately, we do have this bug with the "compression"
option:

It turns out that recently (commit adf9c42, Refs #26610) we changed the
default sstable compressor from LZ4Compressor to LZ4WithDictsCompressor,
but the change was only applied to CQL, not Alternator. So the test that
"compression" is the same in both fails - and marked "xfails" and
I created a new issue to track it - #26914.

Another test verifies that Alternators "auxiliary" tables - holding
GSIs, LSIs and Streams - have the same default properties as the base
table. This currently seems to hold (there is no bug).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-11-19 09:18:37 +02:00
Pavel Emelyanov
4d5f7a57ea sstables_loader: Get LOAD_AND_STREAM_ABORT_RPC_MESSAGE from messaging
The feature in question is about the way streaming sink-and-source
operate. Since sink-and-source itself are obtained from messaging
service, the feature is better be fetched from it too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-19 09:35:54 +03:00
Pavel Emelyanov
64e099f03b sstables_loader: Keep bool on send_meta, not database reference
The send_meta helper needs database reference to get feature_service
from it (to check some feature state). That's too much, features are
"immutable" throug the loader lifetime, it's enough to keep the boolean
member on send_meta.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-19 09:32:56 +03:00
Patryk Jędrzejczak
e35ba974ce test: test_raft_recovery_stuck: ensure mutual visibility before using driver
Not waiting for nodes to see each other as alive can cause the driver to
fail the request sent in `wait_for_upgrade_state()`.

scylladb/scylladb#19771 has already replaced concurrent restarts with
`ManagerClient.rolling_restart()`, but it has missed this single place,
probably because we do concurrent starts here.

Fixes #27055

Closes scylladb/scylladb#27075
2025-11-19 05:54:12 +01:00
David Garcia
3f2655a351 docs: add liveness::MustRestart support
Closes scylladb/scylladb#27079
2025-11-18 15:28:55 +01:00
Szymon Wasik
f714876eaf Add documentation about lack of returning similarity distances
This patch adds the missing warning about the lack of possibility
to return the similarity distance. This will be added in the next
iteration.

Fixes #27086

It has to be backported to 2025.4 as this is the limitation in 2025.4.

Closes scylladb/scylladb#27096
2025-11-18 13:50:36 +01:00
Ernest Zaslavsky
656ce27e7f streaming: add pytest case to reproduce mutation loss issue
Introduce a test that demonstrates mutation loss caused by premature
loop termination in tablet_sstable_streamer::stream. The code broke
out of the SSTable iteration when encountering a non-overlapping range,
which skipped subsequent SSTables that should have been partially
contained. This test showcases the problem only.

Example:
Tablet range: [4, 5]
SSTable ranges:
[0,5]
[0, 3] <--- is considered exhausted, and causes skip to next tablet
[2, 5] <--- is missed for range [4, 5]
2025-11-18 09:34:41 +02:00
Avi Kivity
f7413a47e4 sstables: writer: avoid recursion in variadic write()
Following 9b6ce030d0 ("sstables: remove quadratic (and possibly
exponential) compile time in parse()"), where we removed recursion
in reading, we do the same here for variadic write. This results
in a small reduction in compile time.

Note the problem isn't very bad here. This is tail-recursion, so likely
removed by the compiler during optimization, and we don't have additional
amplification due to future::then() double-compiling the ready-future
and unready-future paths. Still, better to avoid quadratic compile
times.

Closes scylladb/scylladb#27050
2025-11-18 08:17:17 +02:00
Botond Dénes
2ca66133a4 Revert "db/config: don't use RBNO for scaling"
This reverts commit 43738298be.

This commit causes instability in dtests. Several non-gating dtests
started failing, as well as some gating ones, see #27047.

Closes scylladb/scylladb#27067

Fixes #27047
2025-11-18 08:17:17 +02:00
Botond Dénes
0dbad38eed Merge 'docs/dev/topology-over-raft: make various updates' from Patryk Jędrzejczak
The updates include:
- adding missing parts like topology states and table rows,
- documenting zero-token nodes,
- replacing the old recovery procedure with the new one.

Fixes #26412

Updates of internal docs (usually read on master) don't require
backporting.

Closes scylladb/scylladb#27022

* github.com:scylladb/scylladb:
  docs/dev/topology-over-raft: update the recovery section
  docs/dev/topology-over-raft: document zero-token nodes
  docs/dev/topology-over-raft: clarify the lack of tablet-specific states
  docs/dev/topology-over-raft: add the missing join_group0 state
  docs/dev/topology-over-raft: update the topology columns
2025-11-18 08:17:17 +02:00
Patryk Jędrzejczak
adaa0560d9 Merge 'Automatic cleanup improvements' from Gleb Natapov
This series allows an operator to reset 'cleanup needed' flag if he already cleaned up the node, so that automatic cleanup will not do it again. We also change 'nodetool cleanup' back to run cleanup on one node only (and reset 'cleanup needed' flag in the end), but the new '--global' option allows to run cleanup on all nodes that needed it simultaneously.

Fixes https://github.com/scylladb/scylladb/issues/26866

Backport to all supported version since automatic cleanup behaviour  as it is now may create unexpected by the operator load during cluster resizing.

Closes scylladb/scylladb#26868

* https://github.com/scylladb/scylladb:
  cleanup: introduce "nodetool cluster cleanup" command  to run cleanup on all dirty nodes in the cluster
  cleanup: Add RESTful API to allow reset cleanup needed flag
2025-11-18 08:17:17 +02:00
Pavel Emelyanov
02513ac2b8 alternator: Get feature service from proxy directly
The executor::add_stream_options() obtains local database reference from
proxy just to get feature service from it.

Similar chain is used in executor::update_time_to_live().

It's shorter to get features from proxy itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26973
2025-11-18 08:17:16 +02:00
Botond Dénes
514c1fc719 Merge 'db: batchlog_manager: update _last_replay only if all batches were re…' from Aleksandra Martyniuk
…played

Currently, if flushing hints falls within the repair cache timeout, then the flush_time is set to batchlog_manager::_last_replay. _last_replay is updated on each replay, even if some batches weren't replayed. Due to that, we risk the data resurrection.

Update _last_replay only if all batches were replayed.

Fixes: https://github.com/scylladb/scylladb/issues/24415.

Needs backport to all live versions.

Closes scylladb/scylladb#26793

* github.com:scylladb/scylladb:
  test: extend test_batchlog_replay_failure_during_repair
  db: batchlog_manager: update _last_replay only if all batches were replayed
2025-11-18 08:17:16 +02:00
Nadav Har'El
5b78e1cebe test/alternator: tests for ExclusiveStartKey in GSI
After in the previous patches we added more exhaustive testing for
the ExclusiveStartKey feature of Query and Scan, in this patch we
add tests for this feature in the context of GSIs.

Most interestingly, the ExclusiveStartKey when querying a GSI
isn't just the key of the GSI, but also includes the key columns
of the base - in other words, it is the key that Scylla uses for
its materialized view.

The tests here confirm that paging on GSI works - this paging uses
ExclusiveStartKey of course - but also what is the specific structure
and meaning of the content of ExclusiveStartKey.

We also include two xfailing tests which again, like in the previous
patches, show we don't do enough validation (issue #26988) and
don't recognize wrong values or spurious columns in ExclusiveStartKey.

As usual, all new tests pass on DynamoDB, and all except the xfailing
ones pass on Alternator.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-11-17 22:07:28 +02:00
Nadav Har'El
65b364d94a test/alternator: more tests for ExclusiveStartKey in Scan
In the previous patch we added more tests for ExclusiveStartKey
in the context of the "Query" request. Here we do a similar thing
for "Scan". There are fewer error cases for Scan. In particular,
while it didn't make sense to use ExclusiveStartKey on a Query on a
table without a sort key (since a Query there always returns a single
item), for Scan it's needed - for paging. So we add in this patch
a test (that we didn't have before!) that Scan paging works correctly
also in the case of a table without a sort key.

This patch has one xfailing test reproducing #26988, that we don't
recognize and refuse spurious columns (columns not in the key) in
ExclusiveStartKey.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-11-17 22:07:28 +02:00
Nadav Har'El
c049992a93 test/alternator: more tests for ExclusiveStartKey in Query
We already have in test/alternator/test_query.py a test -
test_query_exclusivestartkey - for one successful uses of
ExclusiveStartKey. But we missed testing quite a few edge cases of
this parameter, so this patch adds more tests for it - see the
comments on each individual test explaining its purpose.

With the new tests, we actually identified three cases where we got
the error handling wrong - cases of ExclusiveStartKey which DynamoDB
refuses, but Alternator allows. So three of the tests included here
pass on DynamoDB but fail on Alternator, so are marked with "xfail".

Refs #26988 - which is a new issue about these three cases of missing
validation.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-11-17 22:07:27 +02:00
Andrzej Jackowski
35fd603acd test: wait for read_barrier in wait_until_driver_service_level_created
Previously, `wait_until_driver_service_level_created` only waited for
the `driver` service level to appear in the output of
`LIST ALL SERVICE_LEVELS`. However, the fact that one node lists
`sl:driver` does not necessarily mean that all other nodes can see
it yet. This caused sporadic test failures, especially in DEBUG builds.

To prevent these failures, this change adds an extra wait for
a `raft/read_barrier` after the `driver` service level first appears.
This ensures the service level is globally visible across the cluster.

Fixes: scylladb/scylladb#27019
2025-11-17 15:21:28 +01:00
Andrzej Jackowski
39bfad48cc test: use ManagerClient in wait_until_driver_service_level_created
Pass a ManagerClient instead of a `cql` session to
`wait_until_driver_service_level_created`. This makes it easier
to add additional functionality to the helper later (e.g. waiting for
a Raft read barrier in a subsequent commit).

Refs: scylladb/scylladb#27019
2025-11-17 14:55:14 +01:00
Botond Dénes
d54d409a52 Merge 'audit: write out to both table and syslog' from Dario Mirovic
This patch adds support for multiple audit log outputs.

If only one audit log output is enabled, the behavior does not change.
If multiple audit log outputs are enabled, then the `audit_composite_storage_helper` class is used. It has a collection
of `storage_helper` objects.

Performance testing shows that read query throughput and auth request throughput are consistent even at high reactor utilization. It can also be observed that read query latency increases a bit.

Read query ops = 60k/s
AUTH ops = 200/s

| Audit Mode | QUERY latency (p99) | Δ% vs none |
|------------|---------------------|------------|
| none | 777 | 0 |
|table| 801 | +3.09% |
|syslog | 803 | +3.35% |
|table,syslog | 818 | +5.28% |

Read query ops = 50k/s
AUTH ops = 200/s

| Audit Mode | QUERY latency (p99) | Δ% vs none |
|------------|---------------------|------------|
| none | 643 | 0 |
|table| 647 | +0.62% |
|syslog | 648 | +0.78% |
|table,syslog | 656 | +2.02% |

Detailed performance results are in the following Confluence document: [Audit performance impact test](https://scylladb.atlassian.net/wiki/spaces/RND/pages/148308005/Audit+performance+impact+test)

Fixes #26022

Backport:

The decision is to not backport for now. After making sure it works on the latest release, and if there is a need, we can do it.

Closes scylladb/scylladb#26613

* github.com:scylladb/scylladb:
  test: dtest: audit_test.py: add AuditBackendComposite
  test: dtest: audit_test.py: group logs in dict per audit mode
  audit: write out to both table and syslog
  audit: move storage helper creation from `audit::start` to `audit::audit`
  audit: fix formatting in `audit::start_audit`
  audit: unify `create_audit` and `start_audit`
2025-11-17 15:04:15 +02:00
Gleb Natapov
0f0ab11311 cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster
97ab3f6622 changed "nodetool cleanup" (without arguments) to run
cleanup on all dirty nodes in the cluster. This was somewhat unexpected,
so this patch changes it back to run cleanup on the target node only (and
reset "cleanup needed" flag afterwards) and it adds "nodetool cluster
cleanup" command that runs the cleanup on all dirty nodes in the
cluster.
2025-11-17 15:00:51 +02:00
Piotr Dulikowski
c29efa2cdb Merge 'vector_search: Improve vector-store health checking' from Karol Nowacki
A Vector Store node is now considered down if it returns an HTTP 500
server error. This can happen, for example, if the node fails to
connect to the database or has not completed its initial full scan.

The logic for marking a node as 'up' is also enhanced. A node is now
only considered up when its status is explicitly 'SERVING'.

Fixes: VECTOR-187

Backport to 2025.4 as this feature is expected to be available in 2025.4.

Closes scylladb/scylladb#26413

* github.com:scylladb/scylladb:
  vector_search: Improve vector-store health checking
  vector_search: Move response_content_to_sstring to utils.hh
  vector_search: Add unit tests for client error handling
  vector_search: Enable mocking of status requests
  vector_search: Extract abort_source_timeout and repeat_until
  vector_search: Move vs_mock_server to dedicated files
2025-11-17 12:16:07 +01:00
Dawid Mędrek
0602afc085 cdc: Preserve properties when reattaching log table
When we enable CDC on a table, Scylla creates a log table for it.
It has default properties, but the user may change them later on.
Furthermore, it's possible to detach that log table by simply
disabling CDC on the base table:

```cql
/* Create a table with CDC enabled. The log table is created. */
CREATE TABLE ks.t (pk int PRIMARY KEY) WITH cdc = {'enabled': true};

/* Detach the log table. */
ALTER TABLE ks.t WITH cdc = {'enabled': false};

/* Modify a property of the log table. */
ALTER TABLE ks.t_scylla_cdc_log WITH bloom_filter_fp_chance = 0.13;
```

The log table can also be reattached by enabling CDC on the base table
again:

```cql
/* Reattach the log table */
ALTER TABLE ks.t WITH cdc = {'enabled': true};
```

However, because the process of reattachment goes through the same
code that created it in the first place, the properties of the log
table are rolled back to their default values. This may be confusing
to the user and, if unnoticed, also have other consequences, e.g.
affecting performance.

To prevent that, we ensure that the properties are preserved.

A reproducer test,
`test_log_table_preserves_properties_after_reattachment`, has been
provided to verify that the changes are correct. It fails before this
commit.

Another test, `test_log_table_preserves_id_after_reattachment`, has
also been added because the current implementation sets properties
and the ID separately.

Fixes scylladb/scylladb#25523
2025-11-17 11:56:30 +01:00
Dawid Mędrek
10975bf65c cdc: Extract creating columns in CDC log table to dedicated function
We extract the portion of the code responsible for creating columns
in a CDC log table to a separate, dedicated function. This should
improve the overall readability of the function (and also making it
very short now).
2025-11-17 11:54:48 +01:00
Dawid Mędrek
8bf09ac6f7 cdc: Extract default properties of CDC log tables to dedicated function
We extract the portion of the code responsible for setting the default
properties of a CDC log table to a separate function. This should
improve the overall readability of the function. Also, it should be
helpful when modifying the code later on in this commit series.
2025-11-17 11:50:35 +01:00
Dawid Mędrek
991c0f6e6d schema/schema_builder.hh: Add set_properties
We add a method used for overwriting the properties of a schema.
It will be used to create a new schema based on another.
2025-11-17 11:46:32 +01:00
Dawid Mędrek
76b21d7a5a schema: Add getter for schema::user_properties
The getter will be used later to access the user properties
and copy them to a fresh `schema_builder`.
2025-11-17 11:46:24 +01:00
Dawid Mędrek
3856c9d376 schema: Remove underscores in fields of schema::user_properties
The fields are public, so according to the style guide, they should
not start with an underscore.
2025-11-17 11:46:15 +01:00
Dawid Mędrek
5a0fddc9ee schema: Extract user properties out of raw_schema
The properties can be directly manipulated by the user
via statements like `ALTER TABLE`. To better organize
the structure of `raw_schema`, we encapsulate that data
in the form of a dedicated struct. This change will be
later used for applying multiple properties to `schema_builder`
in one go.
2025-11-17 11:46:07 +01:00
Patryk Jędrzejczak
b5f38e4590 docs/dev/topology-over-raft: update the recovery section
We have the new recovery procedure now, but this doc hasn't been
updated. It still describes the old recovery procedure.

For comparison, external docs can be found here:
https://docs.scylladb.com/manual/master/troubleshooting/handling-node-failures.html#manual-recovery-procedure

Fixes #26412
2025-11-17 10:40:23 +01:00
Patryk Jędrzejczak
785a3302e6 docs/dev/topology-over-raft: document zero-token nodes
The topology transitions are a bit different for zero-token nodes, which
is worth mentioning.
2025-11-17 10:40:23 +01:00
Patryk Jędrzejczak
d75558e455 docs/dev/topology-over-raft: clarify the lack of tablet-specific states
Tablets are never mentioned before this part of the doc, so it may be
confusing why some topology states are missing.
2025-11-17 10:40:23 +01:00
Patryk Jędrzejczak
c362ea4dcb docs/dev/topology-over-raft: add the missing join_group0 state
This state was added as a part of the join procedure, and we didn't
update this part of the doc.
2025-11-17 10:40:23 +01:00
Patryk Jędrzejczak
182d416949 docs/dev/topology-over-raft: update the topology columns
Some of the columns were added, but the doc wasn't updated.

`upgrade_state` was updated in only one of the two places.

`ignore_nodes` was changed to a static column.
2025-11-17 10:40:20 +01:00
Piotr Dulikowski
f0039381d2 Merge 'db/view/view_building_worker: support staging sstables intra-node migration and tablet merge' from Michał Jadwiszczak
This PR fixes staging stables handling by view building coordinator in case of intra-node tablet migration or tablet merge.

To support tablet merge, the worker stores the sstables grouped only be `table_id`, instead of `(table_id, last_token)` pair.
There shouldn't be that many staging sstables, so selecting relevant for each `process_staging` task is fine.
For the intra-node migration support, the patch adds methods to load migrated sstables on the destination shard and to cleanup them on source shard.

The patch should be backported to 2025.4

Fixes https://github.com/scylladb/scylladb/issues/26244

Closes scylladb/scylladb#26454

* github.com:scylladb/scylladb:
  service/storage_service: migrate staging sstables in view building worker during intra-node migration
  db/view/view_building_worker: support sstables intra-node migration
  db/view_building_worker: fix indent
  db/view/view_building_worker: don't organize staging sstables by last token
2025-11-17 08:53:19 +01:00
Karol Nowacki
7f45f15237 vector_search: Improve vector-store health checking
A Vector Store node is now considered down if it returns an HTTP 5xx status.
This can happen, for example, if the node fails to
connect to the database or has not completed its initial full scan.

The logic for marking a node as 'up' is also enhanced. A node is now
only considered up when its status is 'SERVING'.
2025-11-17 06:21:31 +01:00
Karol Nowacki
5c30994bc5 vector_search: Move response_content_to_sstring to utils.hh
Move the response_content_to_sstring utility function from
vector_store_client.cc to utils.hh to enable reuse across
multiple files.

This refactoring prepares for the upcoming `client.cc` implementation
that will also need this functionality.
2025-11-17 06:21:31 +01:00
Karol Nowacki
4bbba099d7 vector_search: Add unit tests for client error handling
Introduce dedicated unit tests for the client class to verify existing
functionality and serve as regression tests.
These tests ensure that invalid client requests do not cause nodes to
be marked as down.
2025-11-17 06:21:31 +01:00
Karol Nowacki
cb654d2286 vector_search: Enable mocking of status requests
Extend the mock server to allow inspecting incoming status requests and
configuring their responses.

This enables client unit tests to simulate various server behaviors,
such as handling node failures and backoff logic.
2025-11-17 06:21:31 +01:00
Karol Nowacki
f665564537 vector_search: Extract abort_source_timeout and repeat_until
The `abort_source_timeout` and `repeat_until` functions are moved to
the shared utility header `test/vector_search/utils.hh`.

This allows them to be reused by upcoming `client` unit tests, avoiding
code duplication.
2025-11-17 06:21:31 +01:00
Karol Nowacki
ee3b83c9b0 vector_search: Move vs_mock_server to dedicated files
The mock server utility is extracted into its own files so it can be
reused by future `client` unit tests.
2025-11-17 06:21:30 +01:00
Artsiom Mishuta
696596a9ef test.py: shutdown ManagerClient only in current loop In python 3.14 there is stricter policy regarding asyncio loops. This leads that we can not close clients from different loops. This change ensures that we are closing only client in the current loop.
Closes scylladb/scylladb#26911
2025-11-16 19:19:46 +02:00
Jenkins Promoter
3672715211 Update pgo profiles - x86_64 2025-11-16 11:42:41 +02:00
Jenkins Promoter
41933b3f5d Update pgo profiles - aarch64 2025-11-15 05:27:38 +02:00
Pavel Emelyanov
9cb776dee8 sstables_manager: Drop db::config from sstables_manager
Now it has all it needs via its own specific config.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 19:31:50 +03:00
Pavel Emelyanov
d55044b696 tools/sstable: Make shard_of_with_tablets use db::config argument
Its caller, the shard_of_operation, already has it as argument.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 19:31:50 +03:00
Pavel Emelyanov
2ec3303edd tools/sstable: Add db::config& to all operations
It's not extremely elegant, but one tool operation needs db::config --
the "shard of" one. Currently it gets one from sstables_manager, but
manager is going to stop using db::config, and the operation needs to
get it elsehow.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 19:31:50 +03:00
Pavel Emelyanov
0fede18447 tools/sstable: Get endpoints from storage manager
The tool may open sstables on S3. For that it gets configured endpoints
with the help of db::config obtained from sstables_manager.db_config().
However, storage endpoints are maintained by sstables storage manager,
and since tool has this instance, it's better to use storage manager to
get list of endpoints.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 19:31:50 +03:00
Pavel Emelyanov
675eb3be98 sstables_manager: Hold sstable IO extensions on it
Currently manager holds a reference on db::config and when sstables IO
extensions are needed it grabs them from this config. Since db::config
is going to be removed from sstables manager, it should either keep
track of all config extensions, or only those that it needs. This patch
makes the latter choice and keeps reference to sstable_file_io_ext. on
manager. The reference is passed as constructor argument, not via
manager config, but it's a random choice, no specific reason why not
putting it on config itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 19:31:50 +03:00
Pavel Emelyanov
c853197281 sstables: Manager helper to grab file io extensions
Currently all the code that needs to iterate over sstables extensions
get config from manager, extensions from it and then iterate. Add a
helper that returns extensions directly. No real changes, just a helper.
Next patch will change the way the helper works.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 19:31:50 +03:00
Pavel Emelyanov
9868341c73 sstables_manager: Move default format on config
It's explicitly `me` type by default, but places that can write sstables
override it with db::config value: replica::database, tests and scylla
sstable tool.

Live-updateable, so use updateable_value<> type.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 19:31:50 +03:00
Pavel Emelyanov
e6dee8aab5 sstables_manager: Move enable_sstable_data_integrity_check on config
Set its default value to the one from db/config.cc. Only
replica::database may want to re-configure it. Also not live-updateable.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 19:31:50 +03:00
Pavel Emelyanov
78ab31118e sstables_manager: Move data_file_directories on config
Make it a reference, so all the code that configures it is updated to
provide the target.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 19:31:50 +03:00
Pavel Emelyanov
cb1679d299 sstables_manager: Move components_memory_reclaim_threshold on config
Set its default value to the one from db/config.cc. Only the
replica::database and tests may want to re-configure it.

This one is live-updateable, so use updateable_value<> type.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 19:31:42 +03:00
Botond Dénes
8579e20bd1 Merge 'Enable digest+checksum verification for streaming/repair' from Taras Veretilnyk
This PR enables integrity check of both checksum and digest for repair/streaming.
In the past, streaming readers only verified the checksum of compressed SSTables.

This change extends the checks to include the digest and the checksum (CRC) for both compressed and uncompressed SSTables. These additional checks require reading the digest and CRC components from disk, which may cause some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data, while for compressed SSTables - where checksums are already verified inline - the only extra cost is reading and verifying the digest.If the reader range doesn't cover the full SSTable, the digest is not loaded and check is skipped.

To support testing of these changes, a new option was added to the random_mutation_generator that allows disabling compression.
Several new test cases were added to verify that the repair_reader correctly detects corruption. These tests corrupt digest or data component of an SSTable and confirm that the system throws the expected `malformed_sstable_exception`.

Backport is not required, it is an improvement

Refs #21776

Closes scylladb/scylladb#26444

* github.com:scylladb/scylladb:
  boost/repair_test: add repair reader integrity verification test cases
  test/lib: allow to disable compression in random_mutation_generator
  sstables: Skip checksum and digest reads for unlinked SSTables
  table: enable integrity checks for streaming reader
  table: Add integrity option to table::make_sstable_reader()
  sstables: Add integrity option to create_single_key_sstable_reader
2025-11-14 18:00:33 +02:00
Benny Halevy
f9ce98384a scylla-sstable: correctly dump sharding_metadata
This patch fixes 2 issues at one go:

First, Currently sstables::load clears the sharding metadata
(via open_data()), and so scylla-sstable always prints
an empty array for it.

Second, printing token values would generate invalid json
as they are currently printed as binary bytes, and they
should be printed simply as numbers, as we do elsewhere,
for example, for the first and last keys.

Fixes #26982

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#26991
2025-11-14 17:55:41 +02:00
Aleksandra Martyniuk
e3dcb7e827 test: extend test_batchlog_replay_failure_during_repair
Modify test_batchlog_replay_failure_during_repair to also check
that there isn't data resurrection if flushing hints falls within
the repair cache timeout.
2025-11-14 14:18:07 +01:00
Pavel Emelyanov
1c9c4c8c8c Merge 'service: attach storage_service to migration_manager using pluggable' from Marcin Maliszkiewicz
Migration manager depends on storage service. For instance,
it has a reload_schema_in_bg background task which calls
_ss.local() so it expects that storage service is not stopped
before it stops.

To solve this we use permit approach, and during storage_service
stop:
- we ignore *new* code execution in migration_manager which'd use
  storage_service
- but wait with storage_service shutdown until all *existing*
  executions are done

Fixes scylladb/scylladb#26734

Backport: no need, problem existed since very long time, code restructure in https://github.com/scylladb/scylladb/commit/389afcd (and following commits) made
it hitting more often, as _ss was called earlier, but it's not released yet.

Closes scylladb/scylladb#26779

* github.com:scylladb/scylladb:
  service: attach storage_service to migration_manager using pluggabe
  service: migration_manager: corutinize merge_schema_from
  service: migration_manager: corutinize reload_schema
2025-11-14 15:14:28 +03:00
Piotr Dulikowski
2ccc94c496 Merge 'topology_coordinator: include joining node in barrier' from Michael Litvak
Previously, only nodes in the 'normal' state and decommissioning nodes
were included in the set of nodes participating in barrier and
barrier_and_drain commands. Joining nodes are not included because they
don't coordinate requests, given their cql port is closed.

However, joining nodes may receive mutations from other nodes, for which
they may generate and coordinate materialized view updates. If their
group0 state is not synchronized it could cause lost view updates.
For example:

1. On the topology coordinator, the join completes and the joining node
   becomes normal, but the joining node's state lags behind. Since it's
   not synchronized by the barrier, it could be in an old state such as
   `write_both_read_old`.
2. A normal node coordinates a write and sends it to the new node as the
   new replica.
3. The new node applies the base mutation but doesn't generate a view
   update for it, because it calculates the base-view pairing according
   to its own state and replication map, and determines that it doesn't
   participate in the base-view pairing.

Therefore, since the joining node participates as a coordinator for view
updates, it should be included in these barriers as well. This ensures
that before the join completes, the joining node's state is
`write_both_read_new`, where it does generate view updates.

Fixes https://github.com/scylladb/scylladb/issues/26976

backport to previous versions since it fixes a bug in MV with vnodes

Closes scylladb/scylladb#27008

* github.com:scylladb/scylladb:
  test: add mv write during node join test
  topology_coordinator: include joining node in barrier
2025-11-14 12:41:16 +01:00
Pavel Emelyanov
604e5b6727 sstables_manager: Move column_index_auto_scale_threshold on config
Set its default value to the one from db/config.cc. Only the
replica::database may want to re-configure it.

This one is live-updateable, so use updateable_value<> type.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 14:30:49 +03:00
Pavel Emelyanov
8f9f92728e sstables_manager: Move column_index_size on config
Set its default value to the one from db/config.cc. Only
replica::database may want to re-configure it. Also not live-updateable.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 14:30:28 +03:00
Pavel Emelyanov
88bb203c9c sstables_manager: Move sstable_summary_ratio on config
Set its default value to the one from db/config.cc. Only
replica::database may want to re-configure it. Also not live-updateable.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 14:29:34 +03:00
Pavel Emelyanov
1f6918be3f sstables_manager: Move enable_sstable_key_validation on config
Make it OFF by default and update only those callers, that may have it
ON -- the replica::database, tests and scylla-sstable tool.

Also not live-updateable, so plain bool.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 14:28:14 +03:00
Pavel Emelyanov
79d0f93693 sstables_manager: Move available_memory on config
Currently, this parameter is passed to sstables_manager as explicit
constructor argument.

Also, it's not live-updateable, so a plain size_t type for it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 14:27:14 +03:00
Pavel Emelyanov
218916e7c2 code: Introduce sstables_manager::config
This is specific configuration for sstables_manager. All places that
construct sstables manager are updated to provide config to it. For now
the config is empty and exists alongside with db::config. Further
patches will populate the former config with data and the latter config
will be eventually removed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 14:25:18 +03:00
Pavel Emelyanov
004ba32fa5 sstables: Patch get_local_directories() to work on vector of paths
Now it uses db::config. Next patches will eliminate db::config from this
code and the helper in question will need to get datadir names
explicitly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 14:24:04 +03:00
Pavel Emelyanov
1895d85ed2 code: Rename sstables_manager::config() into db_config()
The config() method name is going to return sstables_manager config, so
first need to set this name free.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 14:23:08 +03:00
Patryk Jędrzejczak
1141342c4f Merge 'topology: refactor excluded nodes' from Petr Gusev
This PR refactors excluded nodes handling for tablets and topology. For tablets a dedicated variable `topology::excluded_tablet_nodes` is introduced, for topology operations a method get_excluded_nodes() is inlined into topology_coordinator and renamed to `get_excluded_nodes_for_topology_request`.

The PR improves codes readability and efficiency, no behavior changes.

backport: this is a refactoring/optimization, no need to backport

Closes scylladb/scylladb#26907

* https://github.com/scylladb/scylladb:
  topology_coordinator: drop unused exec_global_command overload
  topology_coordinator: rename get_excluded_nodes -> get_excluded_nodes_for_topology_request
  topology_state_machine: inline get_excluded_nodes
  messaging_service: simplify and optimize ban_host
  storage_service: topology_state_load: extract topology variable
  topology_coordinator: excluded_tablet_nodes -> ignored_nodes
  topology_state_machine: add excluded_tablet_nodes field
2025-11-14 11:52:00 +01:00
Piotr Dulikowski
68407a09ed Merge 'vector_store_client: Add support for failed-node backoff' from Karol Nowacki
vector_search: Add backoff for failed nodes

Introduces logic to mark nodes that fail to answer an ANN request as
"down". Down nodes are omitted from further requests until they
successfully respond to a health check.

Health checks for down nodes are performed in the background using the
`status` endpoint, with an exponential backoff retry policy ranging
from 100ms to 20s.

Client list management is moved to separate files (clients.cc/clients.hh)
to improve code organization and modularity.

References: VECTOR-187.

Backport to 2025.4 as this feature is expected to be available in 2025.4.

Closes scylladb/scylladb#26308

* github.com:scylladb/scylladb:
  vector_search: Set max backoff delay to 2x read request timeout
  vector_search: Report status check exception via on_internal_error_noexcept
  vector_search: Extract client management into dedicated class
  vector_search: Add backoff for failed clients
  vector_search: Make endpoint available
  vector_search: Use std::expected for low-level client errors
  vector_search: Extract client class
2025-11-14 11:49:18 +01:00
Piotr Dulikowski
833b824905 Merge 'service/qos: Fall back to default scheduling group when using maintenance socket' from Dawid Mędrek
The service level controller relies on `auth::service` to collect
information about roles and the relation between them and the service
levels (those attached to them). Unfortunately, the service level
controller is initialized way earlier than `auth::service` and so we
had to prevent potential invalid queries of user service levels
(cf. 46193f5e79).

Unfortunately, that came at a price: it made the maintenance socket
incompatible with the current implementation of the service level
controller. The maintenance socket starts early, before the
`auth::service` is fully initialized and registered, and is exposed
almost immediately. If the user attempts to connect to Scylla within
this time window, via the maintenance socket, one of the things that
will happen is choosing the right service level for the connection.
Since the `auth::service` is not registered, Scylla with fail an
assertion and crash.

A similar scenario occurs when using maintenance mode. The maintenance
socket is how the user communicates with the database, and we're not
prepared for that either.

To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.

Some accesses to `auth::service` are not affected and we do not modify
those.

Fixes scylladb/scylladb#26816

Backport: yes. This is a fix of a regression.

Closes scylladb/scylladb#26856

* github.com:scylladb/scylladb:
  test/cluster/test_maintenance_mode.py: Wait for initialization
  test: Disable maintenance mode correctly in test_maintenance_mode.py
  test: Fix keyspace in test_maintenance_mode.py
  service/qos: Do not crash Scylla if auth_integration absent
2025-11-14 11:12:28 +01:00
Botond Dénes
43738298be db/config: don't use RBNO for scaling
Remove bootstrap and decomission from allowed_repair_based_node_ops.
Using RBNO over streaming for these operations has no benefits, as they
are not exposed to the out-of-date replica problem that replace,
removenode and rebuild are.
On top of that, RBNO is known to have problems with empty user tables.
Using streaming for boostrap and decomission is safe and faster
than RBNO in all condition, especially when the table is small.

One test needs adjustment as it relies on RBNO being used for all node
ops.

Fixes: #24664

Closes scylladb/scylladb#26330
2025-11-14 13:03:50 +03:00
Piotr Dulikowski
43506e5f28 Merge 'db/view: Add backoff when RPC fails' from Dawid Mędrek
The view building coordinator manages the process by sending RPC
requests to all nodes in the cluster, instructing them what to do.
If processing that message fails, the coordinator decides if it
wants to retry it or (temporarily) abandon the work.

An example of the latter scenario could be if one of the target nodes
dies and any attempts to communicate with it would fail.

Unfortunately, the current approach to it is not perfect and may result
in a storm of warnings, effectively clogging the logs. As an example,
take a look at scylladb/scylladb#26686: the gossiper failed to mark
one of the dead nodes as DOWN fast enough, and it resulted in a warning storm.

To prevent situations like that, we implement a form of backoff.
If processing an RPC message fails, we postpone finishing the task for
a second. That should reduce the number of messages in the logs and avoid
retries that are likely to fail as well.

We provide a reproducer test.

Fixes scylladb/scylladb#26686

Backport: impact on the user. We should backport it to 2025.4.

Closes scylladb/scylladb#26729

* github.com:scylladb/scylladb:
  tet/cluster/mv: Clean up test_backoff_when_node_fails_task_rpc
  db/view/view_building_coordinator: Rate limit logging failed RPC
  db/view: Add backoff when RPC fails
2025-11-14 10:17:57 +01:00
Piotr Dulikowski
308c5d0563 Merge 'cdc: set column drop timestamp in the future' from Michael Litvak
When dropping a column from a CDC log table, set the column drop
timestamp several seconds into the future.

If a value is written to a column concurrently with dropping that
column, the value's timestamp may be after the column drop timestamp. If
this value is also flushed to an SSTable, the SSTable would be
corrupted, because it considers the column missing after the drop
timestamp and doesn't allow values for it.

While this issue affects general tables, it especially impacts CDC tables
because this scenario can occur when writing to a table with CDC preimage
enabled while dropping a column from the base table. This happens even if
the base mutation doesn't write to the dropped column, because CDC log
mutations can generate values for a column even if the base mutation doesn't.
For general tables, this issue can be avoided by simply not writing to a
column while dropping it.

We fix this for the more problematic case of CDC log tables by setting
the column drop timestamp several seconds into the future, ensuring that
writes concurrent with column drops are much less likely to have
timestamps greater than the column drop timestamp.

Fixes https://github.com/scylladb/scylladb/issues/26340

the issue affects all previous releases, backport to improve stability

Closes scylladb/scylladb#26533

* github.com:scylladb/scylladb:
  test: test concurrent writes with column drop with cdc preimage
  cdc: check if recreating a column too soon
  cdc: set column drop timestamp in the future
  migration_manager: pass timestamp to pre_create
2025-11-14 08:52:34 +01:00
Marcin Maliszkiewicz
958d04c349 service: attach storage_service to migration_manager using pluggabe
Migration manager depends on storage service. For instance,
it has a reload_schema_in_bg background task which calls
_ss.local() so it expects that storage service is not stopped
before it stops.

To solve this we use permit approach, and during storage_service
stop:
- we ignore *new* code execution in migration_manager which'd use
  storage_service
- but wait with storage_service shutdown until all *existing*
  executions are done

Fixes scylladb/scylladb#26734
2025-11-14 08:50:19 +01:00
Marcin Maliszkiewicz
cf9b2de18b service: migration_manager: corutinize merge_schema_from
It's needed to easily keep-alive pluggable storage_service
permit in a following commit.
2025-11-14 08:50:19 +01:00
Marcin Maliszkiewicz
5241e9476f service: migration_manager: corutinize reload_schema
It's needed to easily keep-alive pluggable storage_service
permit in a following commit.
2025-11-14 08:50:18 +01:00
Tomasz Grabiec
27e74fa567 tools: scylla-sstable: Print filename and tablet ids on error
Since error is not printed to stdout, when working with multiple
files, we don't know whith which sstable the error is associated with.

Closes scylladb/scylladb#27009
2025-11-14 09:47:38 +02:00
Karol Nowacki
1972fb315b vector_search: Set max backoff delay to 2x read request timeout
The maximum backoff delay for status checking now depends on the
`read_request_timeout_in_ms` configuration option. The delay is set
to twice the value of this parameter.
2025-11-14 08:05:21 +01:00
Karol Nowacki
097c0f9592 vector_search: Report status check exception via on_internal_error_noexcept
This exception should only occur due to internal errors, not client or external issues.
If triggered, it indicates an internal problem. Therefore, we notify about this exception
using on_internal_error_noexcept.
2025-11-14 08:05:21 +01:00
Karol Nowacki
940ed239b2 vector_search: Extract client management into dedicated class
Refactor client list management by moving it to separate files
(clients.cc/clients.hh) to improve code organization and modularity.
2025-11-14 08:05:21 +01:00
Karol Nowacki
009d3ea278 vector_search: Add backoff for failed clients
Introduces logic to mark clients that fail to answer an ANN request as
"down". Down clients are omitted from further requests until they
successfully respond to a health check.

Health checks for down clients are performed in the background using the
`status` endpoint, with an exponential backoff retry policy ranging
from 100ms to 20s.
2025-11-14 07:38:01 +01:00
Karol Nowacki
190459aefa vector_search: Make endpoint available
In preparation for a new feature, the tests need the ability to make
an endpoint that was previously unavailable, available again.

This is achieved by adding an `unavailable_server::take_socket` method.
This method allows transferring the listening socket from the
`unavailable_server` to the `mock_vs_server`, ensuring they both
operate on the same endpoint.
2025-11-14 07:23:40 +01:00
Karol Nowacki
49a177b51e vector_search: Use std::expected for low-level client errors
To unify error handling, the low-level client methods now return
`std::expected` instead of throwing exceptions. This allows for
consistent and explicit error propagation from the client up to the
caller.

The relevant error types have been moved to a new `vector_search/error.hh`
header to centralize their definitions.
2025-11-14 07:23:40 +01:00
Karol Nowacki
62f8b26bd7 vector_search: Extract client class
This refactoring extracts low-level client logic into a new, dedicated
`client` class. The new class is responsible for connecting to the
server and serializing requests.

This change prepares for extending the `vector_store_client` to check
node status via the `api/v1/status` endpoint.

`/ann` Response deserialization remains in the `vector_store_client` as it
is schema-dependent.
2025-11-14 07:23:40 +01:00
Lakshmi Narayanan Sreethar
3eba90041f sstables: prevent oversized allocation when parsing summary positions
During sstable summary parsing, the entire header was read into a single
buffer upfront and then parsed to obtain the positions. If the header
was too large, it could trigger oversized allocation warnings.

This commit updates the parse method to read one position at a time from
the input stream instead of reading the entire header at once. Since
`random_access_reader` already maintains an internal buffer of 128 KB,
there is no need to pre read the entire header upfront.

Fixes #24428

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#26846
2025-11-14 06:40:53 +02:00
Dawid Mędrek
393f1ca6e6 tet/cluster/mv: Clean up test_backoff_when_node_fails_task_rpc
After the changes in the test, we clean up its syntax. It boils
down to very simple modifications.
2025-11-13 17:57:33 +01:00
Dawid Mędrek
acd9120181 db/view/view_building_coordinator: Rate limit logging failed RPC
The view building coordinator sends tasks in form of RPC messages
to other nodes in the cluster. If processing that RPC fails, the
coordinator logs the error.

However, since tasks are per replica (so per shard), it may happen
that we end up with a large number of similar messages, e.g. if the
target node has died, because every shard will fail to process its
RPC message. It might become even worse in the case of a network
partition.

To mitigate that, we rate limit the logging by 1 seconds.

We extend the test `test_backoff_when_node_fails_task_rpc` so that
it allows the view building coordinator to have multiple tablet
replica targets. If not for rate limiting the warning messages,
we should start getting more of them, potentially leading to
a test failure.
2025-11-13 17:57:23 +01:00
Dawid Mędrek
4a5b1ab40a db/view: Add backoff when RPC fails
The view building coordinator manages the process of view building
by sending RPC requests to all nodes in the cluster, instructing them
what to do. If processing that message fails, the coordinator decides
if it wants to retry it or (temporarily) abandon the work.

An example of the latter scenario could be if one of the target nodes
dies and any attempts to communicate with it would fail.

Unfortunately, the current approach to it is not perfect and may result
in a storm of warnings, effectively clogging the logs. As an example,
take a look at scylladb/scylladb#26686: the gossiper failed to mark
one of the dead nodes as DOWN fast enough, and it resulted in a warning storm.

To prevent situations like that, we implement a form of backoff.
If processing an RPC message fails, we postpone finishing the task for
a second. That should reduce the number of messages in the logs and avoid
retries that are likely to fail as well.

We provide a reproducer test: it fails before this commit and succeeds
with it.

Fixes scylladb/scylladb#26686
2025-11-13 17:55:41 +01:00
Michał Hudobski
7646dde25b select_statement: add a warning about unsupported paging for vs queries
Currently we do not support paging for vector search queries.
When we get such a query with paging enabled we ignore the paging
and return the entire result. This behavior can be confusing for users,
as there is no warning about paging not working with vector search.
This patch fixes that by adding a warning to the result of ANN queries
with paging enabled.

Closes scylladb/scylladb#26384
2025-11-13 18:47:05 +02:00
Michael Litvak
e85051068d test: test concurrent writes with column drop with cdc preimage
add a test that writes to a table concurrently with dropping a column,
where the table has CDC enabled with preimage.

the test reproduces issue #26340 where this results in a malformed
sstable.
2025-11-13 17:00:08 +01:00
Michael Litvak
039323d889 cdc: check if recreating a column too soon
When we drop a column from a CDC log table, we set the column drop
timestamp a few seconds into the future. This can cause unexpected
problems if a user tries to recreate a CDC column too soon, before
the drop timestamp has passed.

To prevent this issue, when creating a CDC column we check its
creation timestamp against the existing drop timestamp, if any, and
fail with an informative error if the recreation attempt is too soon.
2025-11-13 17:00:07 +01:00
Michael Litvak
48298e38ab cdc: set column drop timestamp in the future
When dropping a column from a CDC log table, set the column drop
timestamp several seconds into the future.

If a value is written to a column concurrently with dropping that
column, the value's timestamp may be after the column drop timestamp. If
this value is also flushed to an SSTable, the SSTable would be
corrupted, because it considers the column missing after the drop
timestamp and doesn't allow values for it.

While this issue affects general tables, it especially impacts CDC tables
because this scenario can occur when writing to a table with CDC preimage
enabled while dropping a column from the base table. This happens even if
the base mutation doesn't write to the dropped column, because CDC log
mutations can generate values for a column even if the base mutation doesn't.
For general tables, this issue can be avoided by simply not writing to a
column while dropping it.

We fix this for the more problematic case of CDC log tables by setting
the column drop timestamp several seconds into the future, ensuring that
writes concurrent with column drops are much less likely to have
timestamps greater than the column drop timestamp.

Fixes scylladb/scylladb#26340
2025-11-13 16:59:43 +01:00
Michael Litvak
eefae4cc4e migration_manager: pass timestamp to pre_create
pass the write timestamp as parameter to the
on_pre_create_column_families notification.
2025-11-13 16:59:43 +01:00
Piotr Dulikowski
7f482c39eb Merge '[schema] Speculative retry rounding fix' from Dario Mirovic
This patch series re-enables support for speculative retry values `0` and `100`. These values have been supported some time ago, before [schema: fix issue 21825: add validation for PERCENTILE values in speculative_retry configuration. #21879
](https://github.com/scylladb/scylladb/pull/21879). When that PR prevented using invalid `101PERCENTILE` values, valid `100PERCENTILE` and `0PERCENTILE` value were prevented too.

Reproduction steps from [[Bug]: drop schema and all tables after apply speculative_retry = '99.99PERCENTILE' #26369](https://github.com/scylladb/scylladb/issues/26369) are unable to reproduce the issue after the fix. A test is added to make sure the inclusive border values `0` and `100` are supported.

Documentation is updated to give more information to the users. It now states that these border values are inclusive, and also that the precision, with automatic rounding, is 1 decimal digit.

Fixes #26369

This is a bug fix. If at any time a client tries to use value >= 99.5 and < 100, the raft error will happen. Backport is needed. The code which introduced inconsistency is introduced in 2025.2, so no backporting to 2025.1.

Closes scylladb/scylladb#26909

* github.com:scylladb/scylladb:
  test: cqlpy: add test case for non-numeric PERCENTILE value
  schema: speculative_retry: update exception type for sstring ops
  docs: cql: ddl.rst: update speculative-retry-options
  test: cqlpy: add test for valid speculative_retry values
  schema: speculative_retry: allow 0 and 100 PERCENTILE values
2025-11-13 15:27:45 +01:00
Petr Gusev
d3bd8c924d topology_coordinator: drop unused exec_global_command overload 2025-11-13 14:19:03 +01:00
Petr Gusev
45d1302066 topology_coordinator: rename get_excluded_nodes -> get_excluded_nodes_for_topology_request
This method is specific to topology requests -- node joining, replacing,
decommissioning etc, everything that goes through
topology::transition_state::write_both_read_old and
raft_topology_cmd::command::stream_ranges. It shouldn't be used in
other contexts -- to handle global topology requests
(e.g. truncate table) or for tablets. Rename the method to make this
more explicit.
2025-11-13 14:19:03 +01:00
Petr Gusev
bf8cc5358b topology_state_machine: inline get_excluded_nodes
The method is specific to topology_coordinator, which already contains
a wrapper for it, so inline the topology method into it.

Also, make the logic of the method more explicit and remove multiple
transition_nodes lookups.
2025-11-13 14:18:46 +01:00
Taras Veretilnyk
e7ceb13c3b boost/repair_test: add repair reader integrity verification test cases
Adds test cases to verify that repair_reader correctly detects SSTable(both comprossed and uncompressed) checksum mismatch.
Digest mismatch verification is not possible as repair readar may skip some sstable data, which automatically disables digest verification.

Each test corrupts the Data component on disk and ensures the reader throws a malformed_sstable_exception with the expected error message.
2025-11-13 14:08:33 +01:00
Taras Veretilnyk
554ce17769 test/lib: allow to disable compression in random_mutation_generator
Adds a compress flag to random_mutation_generator, allowing tests to disable compression in generated mutations.
When set to compress::no, the schema builder uses no_compression() parameters.
2025-11-13 14:08:33 +01:00
Taras Veretilnyk
add60d7576 sstables: Skip checksum and digest reads for unlinked SSTables
Add an _unlinked flag to track SSTable unlink state and check it in
read_digest() and read_checksum() methods to skip file reads for
unlinked SSTables, preventing potential file not found errors.
2025-11-13 14:08:26 +01:00
Michael Litvak
b925e047be test: add mv write during node join test
Add a test that reproduces the issue scylladb/scylladb#26976.

The test adds a new node with delayed group0 apply, and does writes with
MV updates right after the join completes on the coordinator and while
the joining node's state is behind.

The test fails before fixing the issue and passes after.
2025-11-13 12:24:32 +01:00
Michael Litvak
13d94576e5 topology_coordinator: include joining node in barrier
Previously, only nodes in the 'normal' state and decommissioning nodes
were included in the set of nodes participating in barrier and
barrier_and_drain commands. Joining nodes are not included because they
don't coordinate requests, given their cql port is closed.

However, joining nodes may receive mutations from other nodes, for which
they may generate and coordinate materialized view updates. If their
group0 state is not synchronized it could cause lost view updates.
For example:

1. On the topology coordinator, the join completes and the joining node
   becomes normal, but the joining node's state lags behind. Since it's
   not synchronized by the barrier, it could be in an old state such as
   `write_both_read_old`.
2. A normal node coordinates a write and sends it to the new node as the
   new replica.
3. The new node applies the base mutation but doesn't generate a view
   update for it, because it calculates the base-view pairing according
   to its own state and replication map, and determines that it doesn't
   participate in the base-view pairing.

Therefore, since the joining node participates as a coordinator for view
updates, it should be included in these barriers as well. This ensures
that before the join completes, the joining node's state is
`write_both_read_new`, where it does generate view updates.

Fixes scylladb/scylladb#26976
2025-11-13 12:24:31 +01:00
Michał Chojnowski
346e0f64e2 replica/table: add a metric for hypothetical total file size without compression
This patch adds a per-table metric
`scylla_column_family_total_disk_space_before_compression`,
which measures the hypothetical total size of sstables on disk,
if Data.db was replaced with an uncompressed equivalent.
2025-11-13 11:28:19 +01:00
Dawid Mędrek
b357c8278f test/cluster/test_maintenance_mode.py: Wait for initialization
If we try to perform queries too early, before the call to
`storage_service::start_maintenance_mode` has finished, we will
fail with the following error:

```
ERROR 2025-11-12 20:32:27,064 [shard 0:sl:d] token_metadata - sorted_tokens is empty in first_token_index!
```

To avoid that, we should wait until initialization is complete.
2025-11-13 11:07:45 +01:00
Aleksandra Martyniuk
4d0de1126f db: batchlog_manager: update _last_replay only if all batches were replayed
Currently, if flushing hints falls within the repair cache timeout,
then the flush_time is set to batchlog_manager::_last_replay.
_last_replay is updated on each replay, even if some batches weren't
replayed. Due to that, we risk the data resurrection.

Update _last_replay only if all batches were replayed.

Fixes: https://github.com/scylladb/scylladb/issues/24415.
2025-11-13 10:40:19 +01:00
Piotr Dulikowski
2e5eb92f21 Merge 'cdc: use CDC schema that is compatible with the base schema' from Michael Litvak
When generating CDC log mutations for some base mutation, use a CDC schema that is compatible with the base schema.

The compatible CDC schema has for every base column a corresponding CDC column with the same name. If using a non-compatible schema, we may encounter a situation, especially during ALTER, that we have a mutation with a base column set with some value, but the CDC schema doesn't have a column by that name. This would cause the user request to fail with an error.

We add to the schema object a schema_ptr that for CDC-enabled tables points to the schema object of the CDC table that is compatible with the schema. It is set by the schema merge algorithm when creating the schema for a table that is created or altered. We use the fact that a base table and its CDC table are created and altered in the same group0 operation, and this way we can find and set the cdc schema for a base table.

When transporting the base schema as a frozen schema between shards, we transport with it the frozen cdc schema as well.

The patch starts with a series of refactoring commits that make extending the frozen schema easier and cleans up some duplication in the code about the frozen schema. We combine the two types `frozen_schema_with_base_info` and `view_schema_and_base_info` to a single type `extended_frozen_schema` that holds a frozen schema with additional data that is not part of the schema mutations but needs to be transported with it to unfreeze it - base_info, and the frozen cdc schema which is added in a later commit.

Fixes https://github.com/scylladb/scylladb/issues/26405

backport not needed - enhancement

Closes scylladb/scylladb#24960

* github.com:scylladb/scylladb:
  test: cdc: test cdc compatible schema
  cdc: use compatiable cdc schema
  db: schema_applier: create schema with pointer to CDC schema
  db: schema_applier: extract cdc tables
  schema: add pointer to CDC schema
  schema_registry: remove base_info from global_schema_ptr
  schema_registry: use extended_frozen_schema in schema load
  schema_registry: replace frozen_schema+base_info with extended_frozen_schema
  frozen_schema: extract info from schema_ptr in the constructor
  frozen_schema: rename frozen_schema_with_base_info to extended_frozen_schema
2025-11-13 10:11:54 +01:00
Pavel Emelyanov
f47f2db710 Merge 'Support local primary-replica-only for native restore' from Robert Bindar
This PR extends the restore API so that it accepts primary_replica_only as parameter and it combines the concepts of primary-replica-only with scoped streaming so that with:
- `scope=all primary_replica_only=true` The restoring node will stream to the global primary replica only
- `scope=dc primary_replica_only=true` The restoring node will stream to the local primary replica only.
- `scope=rack primary_replica_only=true` The restoring node will stream only to the primary replica from within its own rack (with rf=#racks, the restoring node will stream only to itself)
- `scope=node primary_replica_only=true` is not allowed, the restoring node will always stream only to itself so the primary_replica_only parameter wouldn't make sense.

The PR also adjusts the `nodetool refresh` restriction on running restore with both primary_replica_only and scope, it adds primary_replica_only to `nodetool restore` and it adds cluster tests for primary replica within scope.

Fixes #26584

Closes scylladb/scylladb#26609

* github.com:scylladb/scylladb:
  Add cluster tests for checking scoped primary_replica_only streaming
  Improve choice distribution for primary replica
  Refactor cluster/object_store/test_backup
  nodetool restore: add primary-replica-only option
  nodetool refresh: Enable scope={all,dc,rack} with primary_replica_only
  Enable scoped primary replica only streaming
  Support primary_replica_only for native restore API
2025-11-13 12:11:18 +03:00
Michał Chojnowski
1cfce430f1 replica/table: keep track of total pre-compression file size
Every table and sstable set keeps track of the total file size
of contained sstables.

Due to a feature request, we also want to keep track of the hypothetical
file size if Data files were uncompressed, to add a metric that
shows the compression ratio of sstables.

We achieve this by replacing the relevant `uint_64 bytes_on_disk`
counters everywhere with a struct that contains both the actual
(post-compression) size and the hypothetical pre-compression size.

This patch isn't supposed to change any observable behavior.
In the next patch, we will use these changes to add a new metric.
2025-11-13 00:49:57 +01:00
Tomasz Grabiec
10b893dc27 Merge 'load_stats: fix bug in migrate_tablet_size()' from Ferenc Szili
`topology_cooridinator::migrate_tablet_size()` was introduced in 10f07fb95a. It has a bug where the has_tablet_size() lambda always returns false because of bad comparison of iterators after a table and tablet search:

```
if (auto table_i = tables.find(gid.table); table_i != tables.find(gid.table)) {
    if (auto size_i = table_i->second.find(trange); size_i != table_i->second.find(trange)) {
```

This change also fixes a problem where the `migrate_tablet_size()` would crash with a `std::out_of_range` if the pending node was not present in load_stats.

This change fixes these two problems and moves the functionality into a separate method of `load_stats`. It also adds tests for the new method.

A version containing this bug has not been released yet, so no backport is needed.

Closes scylladb/scylladb#26946

* github.com:scylladb/scylladb:
  load_stats: add test for migrate_tablet_size()
  load_stats: fix problem with tablet size migration
2025-11-12 23:48:37 +01:00
Nadav Har'El
5839574294 Merge 'cql3: Fix std::bad_cast when deserializing vectors of collections' from Karol Nowacki
cql3: Fix std::bad_cast when deserializing vectors of collections

This PR fixes a bug where attempting to INSERT a vector containing collections (e.g., `vector<set<int>,1>`) would fail. On the client side, this manifested as a `ServerError: std::bad_cast`.

The cause was "type slicing" issue in the reserialize_value function. When retrieving the vector's element type, the result was being assigned by value (using auto) instead of by reference.
This "sliced" the polymorphic abstract_type object, stripping it of its actual derived type information. As a result, a subsequent dynamic_cast would fail, even if the underlying type was correct.

To prevent this entire class of bugs from happening again, I've made the polymorphic base class `abstract_type` explicitly uncopyable.

Fixes: #26704

This fix needs to be backported as these releases are affected: `2025.4` , `2025.3`.

Closes scylladb/scylladb#26740

* github.com:scylladb/scylladb:
  cql3: Make abstract_type explicitly noncopyable
  cql3: Fix std::bad_cast when deserializing vectors of collections
2025-11-13 00:24:25 +02:00
Petr Gusev
9fed80c4be messaging_service: simplify and optimize ban_host
We do one cross-shard call for all left+ignored nodes.
2025-11-12 12:27:44 +01:00
Petr Gusev
52cccc999e storage_service: topology_state_load: extract topology variable
It's inconvinient to always write the long expression
_topology_state_machine._topology.
2025-11-12 12:27:44 +01:00
Petr Gusev
66063f202b topology_coordinator: excluded_tablet_nodes -> ignored_nodes
ignored_nodes is sufficient in these cases. excluded_tablet_nodes
also includes left_nodes_rs, which are not needed
here — global_token_metadata_barrier runs the barrier only
on normal and transition nodes, not on left nodes.
2025-11-12 12:27:44 +01:00
Petr Gusev
82da83d0e5 topology_state_machine: add excluded_tablet_nodes field
The topology_coordinator::is_excluded() creates a temporary hash
map for each call. This is probably not a performance problem since
left_nodes_rs contains only those left nodes that are referenced
from tablet replicas, this happens temporarily while e.g. a replaced
node is being rebuilt. On the other hand, why not just have a
dedicated field in the topology_state_machine, then this code wouldn't
look suspicious.
2025-11-12 12:27:43 +01:00
Gleb Natapov
e872f9cb4e cleanup: Add RESTful API to allow reset cleanup needed flag
Cleaning up a node using per keyspace/table interface does not reset cleanup
needed flag in the topology. The assumption was that running cleanup on
already clean node does nothing and completes quickly. But due to
https://github.com/scylladb/scylladb/issues/12215 (which is closed as
WONTFIX) this is not the case. This patch provides the ability to reset
the flag in the topology if operator cleaned up the node manually
already.
2025-11-12 10:56:57 +02:00
Nadav Har'El
4de88a7fdc test/cqlpy: fix run script for materialized views on tablets
Recently we enabled tablets by default, but it is necessary to
enable rf_rack_valid_keyspaces if materialized views are to be used
with tablets, and this option is *not* the default.

We did add this option in test/pylib/scylla_cluster.py which is
used by test.py, but we didn't add it to test/cqlpy/run.py, so
the test/cqlpy/run script is no longer able to run tests with
materialized views. So this patch adds the missing configuration
to run.py.

FIxes #26918

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#26919
2025-11-12 11:56:21 +03:00
Karol Nowacki
77da4517d2 cql3: Make abstract_type explicitly noncopyable
The polymorphic abstract_type class serves as an interface and should not be copied.
To prevent accidental and unsafe copies, make it explicitly uncopyable.
2025-11-12 09:11:56 +01:00
Karol Nowacki
960fe3da60 cql3: Fix std::bad_cast when deserializing vectors of collections
When deserializing a vector whose elements are collections (e.g., set, list),
the operation raises a `std::bad_cast` exception.

This was caused by type slicing due to an incorrect assignment of a
polymorphic type by value instead of by reference. This resulted in a
failed `dynamic_cast` even when the underlying type was correct.
2025-11-12 09:11:56 +01:00
Botond Dénes
6f6ee5581e Merge 'encryption::kms_host: Add exponential backoff-retry for 503 errors' from Calle Wilund
Refs #26822

AWS says to treat 503 errors, at least in the case of ec2 metadata query, as backoff-retry (generally, we do _not_ retry on provider level, but delegate this to higher levels). This patch adds special treatment for 503:s (service unavailable) for both ec2 meta and    actual endpoint, doing exponential backoff.

Note: we do _not_ retry forever.
Not tested as such, since I don't get any errors when testing (doh!). Should try to set up a mock ec2 meta with injected errors maybe.

Closes scylladb/scylladb#26934

* github.com:scylladb/scylladb:
  encryption::kms_host: Add exponential backoff-retry for 503 errors
  encryption::kms_host: Include http error code in kms_error
2025-11-12 08:33:33 +02:00
Yaron Kaikov
3ade3d8f5b auto-backport: Add support for JIRA issue references
- Added support for JIRA issue references in PR body and commit messages
- Supports both short format (PKG-92) and full URL format
- Maintains existing GitHub issue reference support
- JIRA pattern matches https://scylladb.atlassian.net/browse/{PROJECT-ID}
- Allows backporting for PRs that reference JIRA issues with 'fixes' keyword

Fixes: https://github.com/scylladb/scylladb/issues/26955

Closes scylladb/scylladb#26954
2025-11-12 08:15:06 +02:00
Calle Wilund
d22e0acf0b encryption::kms_host: Add exponential backoff-retry for 503 errors
Refs #26822

AWS says to treat 503 errors, at least in the case of ec2 metadata
query, as backoff-retry (generally, we do _not_ retry on provider
level, but delegate this to higher levels). This patch adds special
treatment for 503:s (service unavailable) for both ec2 meta and
actual endpoint, doing exponential backoff.

Note: we do _not_ retry forever.
Not tested as such, since I don't get any errors when testing
(doh!). Should try to set up a mock ec2 meta with injected errors
maybe.

v2:
* Use utils::exponential_backoff_retry
2025-11-11 21:02:32 +00:00
Calle Wilund
190e3666cb encryption::kms_host: Include http error code in kms_error
Keep track of actual HTTP failure.
2025-11-11 21:02:32 +00:00
Ferenc Szili
fcbc239413 load_stats: add test for migrate_tablet_size()
This change adds tests which validate the functionality of
load_stats::migrate_tablet_size()
2025-11-11 14:28:31 +01:00
Ferenc Szili
b77ea1b8e1 load_stats: fix problem with tablet size migration
This patch fixes a bug with tablet size migration in load_stats.
has_tablet_size() lambda in topology_coordinator::migrate_tablet_size()
was returning false in all cases due to incorrect search iterator
comparison after a table and tablet saeach.

This change moves load_stats migrate_tablet_sizes() functionaility
into a separate method of load_stats.
2025-11-11 14:26:09 +01:00
Yehuda Lebi
a05ebbbfbb dist/docker: add configurable blocked-reactor-notify-ms parameter
Add --blocked-reactor-notify-ms argument to allow overriding the default
blocked reactor notification timeout value of 25 ms.

This change provides users the flexibility to customize the reactor
notification timeout as needed.

Fixes: scylladb/scylla-enterprise#5525

Closes scylladb/scylladb#26892
2025-11-11 12:38:40 +02:00
Benny Halevy
a290505239 utils: stall_free: add dispose_gently
dispose_gently consumes the object moved to it,
clearing it gently before it's destroyed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#26356
2025-11-11 12:20:18 +02:00
Yaron Kaikov
c601371b57 install-dependencies.sh: update node_exporter to 1.10.2
Update node exporter to solve CVE-2025-22871

[regenerate frozen toolchain with optimized clang from
	https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-aarch64.tar.gz
	https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-x86_64.tar.gz
]
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-5

Closes scylladb/scylladb#26916
2025-11-11 11:36:13 +02:00
Nadav Har'El
b659dfcbe9 test/cqlpy: comment out Cassandra check that is no longer relevant
In the test translated from Cassandra validation/operations/alter_test.py
we had two lines in the beginning of an unrelated test that verified
that CREATE KEYSPACE is not allowed without replication parameters.
But starting recently, ScyllaDB does have defaults and does allow these
CREATE KEYSPACE. So comment out these two test lines.

We didn't notice that this test started to fail, because it was already
marked xfail, because in the main part of this test, it reproduces a
different issue!

The annoying side-affect of these no-longer-passing checks was that
because the test expected a CREATE KEYSPACE to fail, it didn't bother
to delete this keyspace when it finished, which causes test.py to
report that there's a problem because some keyspaces still exist at the
end of the test. Now that we fixed this problem, we no longer need to
list this test in test/cqlpy/suite.yaml as a test that leaves behind
undeleted keyspaces.

Fixes #26292

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#26341
2025-11-11 10:34:27 +02:00
Nikos Dragazis
56e5dfc14b migration_manager: Add missing validations for schema extensions
The migration manager offers some free functions to prepare mutations
for a new/updated table/view. Most of them include a validation check
for the schema extensions, but in the following ones it's missing:

* `prepare_new_column_family_announcement` (overload with vector as out parameter)
* `prepare_new_column_families_announcement`

Presumably, this was just an omission. It's also not a very important
one since the only extension having validation logic is the
`encryption_schema_extension`, but none of these functions is connected
to user queries where encryption options can be provided in the schema.
User queries go through the other
`prepare_new_column_family_announcement` overload, which does perform a
validation check.

Add validation in the missing places.

Fixes #26470.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#26487
2025-11-11 10:08:58 +02:00
Botond Dénes
042303f0c9 Merge 'Alternator: enable tablets by default - depending on tablets_mode_for_new_keyspaces' from Nadav Har'El
Before this series, Alternator's CreateTable operation defaults to creating a table replicated with vnodes, not tablets. The reasons for this default included missing support for LWT, Materialized Views, Alternator TTL and Alternator Streams if tablets are used. But today, all of these (except the still-experimental Alternator Streams) are now fully available with tablets, so we are finally ready to switch Alternator to use tablets by default in new tables.

We will use the same configuration parameter that CQL uses, tablets_mode_for_new_keyspaces, to determine whether new keyspaces use tablets by default. If set to `enabled`, tablets are used by default on new tables. If set to `disabled`, tablets will not be used by default (i.e., vnodes will be used, as before). A third value, `enforced` is similar to `enabled` but forbids overriding the default to vnodes when creating a table.

As before, the user can set a tag during the CreateTable operation to override the default choice of tablets or vnodes (unless in `enforced` mode). This tag is now named `system:initial_tablets` - whereas before this patch it was called `experimental:initial_tablets`. The rules stay the same as with the earlier, experimental:initial_tablets tag: when supplied with a numeric value, the table will use tablets. When supplied with something else (like a string "none"), the table will use vnodes.

Fixes https://github.com/scylladb/scylladb/issues/22463

Backport to 2025.4, it's important not to delay phasing out vnodes.

Closes scylladb/scylladb#26836

* github.com:scylladb/scylladb:
  test,alternator: use 3-rack clusters in tests
  alternator: improve error in tablets_mode_for_new_keyspaces=enforced
  config: make tablets_mode_for_new_keyspaces live-updatable
  alternator: improve comment about non-hidden system tags
  alternator: Fix test_ttl_expiration_streams()
  alternator: Fix test_scan_paging_missing_limit()
  alternator: Don't require vnodes for TTL tests
  alternator: Remove obsolete test from test_table.py
  alternator: Fix tag name to request vnodes
  alternator: Fix test name clash in test_tablets.py
  alternator: test_tablets.py handles new policy reg. tablets
  alternator: Update doc regarding tablets support
  alternator: Support `tablets_mode_for_new_keyspaces` config flag
  Fix incorrect hint for tablets_mode_for_new_keyspaces
  Fix comment for tablets_mode_for_new_keyspaces
2025-11-11 09:45:29 +02:00
Avi Kivity
bae2654b34 tools: dbuild: avoid test -v incompatibility with MacOS shell
`test -v` isn't present on the MacOS shell. Since dbuild is intended
as a compatibility bridge between the host environment and the build
environment, don't use it there.

Use ${var+text_if_set} expansion as a workaround.

Fixes #26937

Closes scylladb/scylladb#26939
2025-11-11 09:43:14 +02:00
Nikos Dragazis
94c4f651ca test/cqlpy: Test secondary index with short reads
Add a test to check that paged secondary index queries behave correctly
when pages are short. This is currently failing in Scylla, but passes in
Cassandra 5, therefore marked as "xfailing". Refer to the test's
docstring for more details.

The bug is a regression introduced by commit f6f18b1.
`test/cqlpy/run --release ...` shows that the test passes in 5.1 but
fails in 5.2 onwards.

Refs #25839.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#25843
2025-11-11 09:28:45 +02:00
Robert Bindar
a04ebb829c Add cluster tests for checking scoped primary_replica_only streaming
This commits adds a tests checking various scenarios of restoring
via load and stream with primary_replica_only and a scope specified.

The tests check that in a few topologies, a mutation is replicated
a correct amount of times given primary_replica_only and that
streaming happens according to the scope rule passed.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-11-11 09:18:01 +02:00
Robert Bindar
817fdadd49 Improve choice distribution for primary replica
I noticed during tests that `maybe_get_primary_replica`
would not distribute uniformly the choice of primary replica
because `info.replicas` on some shards would have an order whilst
on others it'd be ordered differently, thus making the function choose
a node as primary replica multiple times when it clearly could've
chosen a different nodes.

This patch sorts the replica set before passing it through the
scope filter.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-11-11 09:18:01 +02:00
Robert Bindar
d4e43bd34c Refactor cluster/object_store/test_backup
This PR splits the suppport code from test_backup.py
into multiple functions so less duplicated code is
produced by new tests using it. It also makes it a bit
easier to understand.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-11-11 09:18:01 +02:00
Robert Bindar
c1b3fe30be nodetool restore: add primary-replica-only option
Add --primary-replica-only and update docs page for
nodetool restore.

The relationship with the scope parameter is:
- scope=all primary_replica_only=true gets the global primary replica
- scope=dc primary_replica_only=true gets the local primary replica
- scope=rack primary_replica_only=true is like a noop, it gets the only
  replica in the rack (rf=#racks)
- scope=node primary_replica_only=node is not allowed

Fixes #26584

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-11-11 09:18:01 +02:00
Robert Bindar
83aee954b4 nodetool refresh: Enable scope={all,dc,rack} with primary_replica_only
So far it was not allowed to pass a scope when using
the primary_replica_only option, this patch enables
it because the concepts are now combined so that:
- scope=all primary_replica_only=true gets the global primary replica
- scope=dc primary_replica_only=true gets the local primary replica
- scope=rack primary_replica_only=true is like a noop, it gets the only
  replica in the rack (rf=#racks)
- scope=node primary_replica_only=node is not allowed

Fixes #26584

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-11-11 09:18:01 +02:00
Robert Bindar
136b45d657 Enable scoped primary replica only streaming
This patch removes the restriction for streaming
to primary replica only within a scope.
Node scope streaming to primary replica is dissallowed.

Fixes #26584

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-11-11 09:18:01 +02:00
Robert Bindar
965a16ce6f Support primary_replica_only for native restore API
Current native restore does not support primary_replica_only, it is
hard-coded disabled and this may lead to data amplification issues.

This patch extends the restore REST API to accept a
primary_replica_only parameter and propagates it to
sstables_loader so it gets correctly passed to
load_and_stream.

Fixes #26584

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-11-11 09:17:52 +02:00
Dawid Mędrek
394207fd69 test: Disable maintenance mode correctly in test_maintenance_mode.py
Although setting the value of `maintenance_mode` to the string `"false"`
disables maintenance mode, the testing framework misinterprets the value
and thinks that it's actually enabled. As a result, it might try to
connect to Scylla via the maintenance socket, which we don't want.
2025-11-10 19:22:06 +01:00
Dawid Mędrek
222eab45f8 test: Fix keyspace in test_maintenance_mode.py
The keyspace used in the test is not necessarily called `ks`.
2025-11-10 19:21:58 +01:00
Dawid Mędrek
c0f7622d12 service/qos: Do not crash Scylla if auth_integration absent
If the user connects to Scylla via the maintenance socket, it may happen
that `auth_integration` has not been registered in the service level
controller yet. One example is maintenance mode when that will never
happen; another when the connection occurs before Scylla is fully
initialized.

To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.

In those cases, we completely circumvent any calls to `auth_integration`
and handle them separately. The modified methods are:

* `get_user_scheduling_group`,
* `with_user_service_level`,
* `describe_service_levels`.

For the first two, the new behavior is in line with the previous
implementation of those functions. The last behaves differently now,
but since it's a soft error, crashing the node is not necessary anyway.
We throw an exception instead, whose error message should give the user
a hint of what might be wrong.

The other uses of `auth_integration` within the service level controller
are not problematic:

* `find_effective_service_level`,
* `find_cached_effective_service_level`.

They take the name of a role as their argument. Since the anonymous role
doesn't have a name, it's not possible to call them with it.

Fixes scylladb/scylladb#26816
2025-11-10 19:21:36 +01:00
Yaron Kaikov
850ec2c2b0 Trigger scylla-ci directly from PR instead of scylla-ci-route job
Refactoring scylla-ci to be triggered directly from each PR using GitHub action. This will allow us to skip triggering CI when PR commit message was updated (which will save us un-needed CI runs) Also we can remove `Scylla-CI-route` pipeline which route each PR to the proper CI job under the release (GitHub action will do it automatically), to reduce complexity

Fixes: https://scylladb.atlassian.net/browse/PKG-69

Closes scylladb/scylladb#26799
2025-11-10 15:10:11 +02:00
Pavel Emelyanov
decf86b146 Merge 'Make AWS & Azure KMS boost testing use fixture + include Azure in pytests' from Calle Wilund
* Adds test fixture for AWS KMS
* Adds test fixture for Azure KMS
* Adds key provider proxy for Azure to pytests (ported dtests)
* Make test gather for boost tests handle suites
* Fix GCP test snafu

Fixes #26781
Fixes #26780
Fixes #26776
Fixes #26775

Closes scylladb/scylladb#26785

* github.com:scylladb/scylladb:
  gcp_object_storage_test: Re-enable parallelism.
  test::pylib: Add azure (mock) testing to EAR matrix
  test::boost::encryption_at_rest: Remove redundant azure test indent
  test::boost::encryption_at_rest: Move azure tests to use fixture
  test::lib: Add azure mock/real server fixture
  test::pylib::boost: Fix test gather to handle test suites
  utils::gcp::object_storage: Fix typo in semaphore init
  test::boost::encryption_at_rest_test: Remove redundant indent
  test::boost::test_encryption_at_rest: Move to AWS KMS fixture for kms test
  test::boost::test_encryption_at_rest: Reorder tests and helpers
  ent::encryption: Make text helper routines take std::string
  test::pylib::dockerized_service: Handle docker/podman bind error message
  test::lib::aws_kms_fixture: Add a fixture object to run mock AWS KMS
  test::lib::gcs_fixture: Only set port if running docker image + more retry
2025-11-10 14:35:05 +03:00
Michał Jadwiszczak
9345c33d27 service/storage_service: migrate staging sstables in view building
worker during intra-node migration

Use methods introduces in previous commit and:
- load staging sstables to the view building worker on the target
  shard, at the end of `streaming` stage
- clear migrated staging sstables on source shard in `cleanup` stage

This patch also removes skip mark in `test_staging_sstables_with_tablet_merge`.

Fixes scylladb/scylladb#26244
2025-11-10 10:38:08 +01:00
Michał Jadwiszczak
4bc6361766 db/view/view_building_worker: support sstables intra-node migration
We need to be able to load sstables on the target shard during
intra-node tablet migration and to cleanup migrated sstables on the
source shard.
2025-11-10 10:36:32 +01:00
Michał Jadwiszczak
c99231c4c2 db/view_building_worker: fix indent 2025-11-10 09:02:16 +01:00
Michał Jadwiszczak
2e8c096930 db/view/view_building_worker: don't organize staging sstables by last token
There was a problem with staging sstables after tablet merge.
Let's say there were 2 tablets and tablet 1 (lower last token)
had an staging sstable. Then a tablet merge occured, so there is only
one tablet now (higher last token).
But entries in `_staging_sstables`, which are grouped by last token, are
never adjusted.

Since there shouldn't be thousands of sstables, we can just hold list of
sstables per table and filter necessary entries when doing
`process_staging` view building task.
2025-11-10 09:02:16 +01:00
Nadav Har'El
35f3a8d7db docs/alternator: fix small mistake in compatibility.md
docs/alternator/compatibility.md describes support for global (multi-DC)
tables, and suggests that the CQL command "ALTER TABLE" should be used
to change the replication of an Alternator table. But actually, the
right command is "ALTER KEYSPACE", not "ALTER TABLE". So fix the
document.

Fixes #26737

Closes scylladb/scylladb#26872
2025-11-10 08:48:18 +03:00
Yauheni Khatsianevich
d3e62b15db fix(test): minor typo fix, removing redundant param from logging
Closes scylladb/scylladb#26901
2025-11-10 08:42:11 +03:00
Dario Mirovic
d364904ebe test: dtest: audit_test.py: add AuditBackendComposite
Add `AuditBackendComposite`, a test class which allows testing multiple
audit outputs in a single run, implemented in `audit_composite_storage_helper`
class.

Add two more tests.
`test_composite_audit_type_invalid` tests if an invalid audit mode among
correct ones causes the same error as when it is the only specified audit mode.
`test_composite_audit_empty_settings` tests if `'none'` audit mode, when
specified along other audit modes, properly disables audit logging.

Refs #26022
2025-11-10 00:31:34 +01:00
Dario Mirovic
a8ed607440 test: dtest: audit_test.py: group logs in dict per audit mode
Before this patch audit test could process audit logs from a single
audit output. This patch adds support for multiple audit outputs
in the same run. The change is needed in order to test
 `audit_composite_storage_helper`, which can write to multiple
audit outputs.

Refs #26022
2025-11-10 00:31:34 +01:00
Dario Mirovic
afca230890 audit: write out to both table and syslog
This patch adds support for multiple audit log outputs.
If only one audit log output is enabled, the behavior does not change.
If multiple audit log outputs are enabled, then the
`audit_composite_storage_helper` class is used. It has a collection
of `storage_helper` objects.

Fixes #26022
2025-11-10 00:31:30 +01:00
Dario Mirovic
7ec9e23ee3 test: cqlpy: add test case for non-numeric PERCENTILE value
Add test case for non-numeric PERCENTILE value, which raises an error
different to the out-of-range invalid values. Regex in the test
test_invalid_percentile_speculative_retry_values is expanded.

Refs #26369
2025-11-09 13:59:36 +01:00
Dario Mirovic
85f059c148 schema: speculative_retry: update exception type for sstring ops
Change speculative_retry::to_sstring and speculative_retry::from_sstring
to throw exceptions::configuration_exception instead of std::invalid_argument.
These errors can be triggered by CQL, so appropriate CQL exception should be
used.
Reference: https://github.com/scylladb/scylladb/issues/24748#issuecomment-3025213304

Refs #26369
2025-11-09 13:55:57 +01:00
Dario Mirovic
aba4c006ba docs: cql: ddl.rst: update speculative-retry-options
Clarify how the value of `XPERCENTILE` is handled:
- Values 0 and 100 are supported
- The percentile value is rounded to the nearest 0.1 (1 decimal place)

Refs #26369
2025-11-09 13:23:29 +01:00
Dario Mirovic
5d1913a502 test: cqlpy: add test for valid speculative_retry values
test_valid_percentile_speculative_retry_values is introduced to test that
valid values for speculative_retry are properly accepted.

Some of the values are moved from the
test_invalid_percentile_speculative_retry_values test, because
the previous commit added support for them.

Refs #26369
2025-11-09 13:23:26 +01:00
Dario Mirovic
da2ac90bb6 schema: speculative_retry: allow 0 and 100 PERCENTILE values
This patch allows specifying 0 and 100 PERCENTILE values in speculative_retry.
It was possible to specify these values before #21825. #21825 prevented specifying
invalid values, like -1 and 101, but also prevented using 0 and 100.

On top of that, speculative_retry::to_sstring function did rounding when
formatting the string, which introduced inconsistency.

Fixes #26369
2025-11-09 12:26:27 +01:00
Nadav Har'El
65ed678109 test,alternator: use 3-rack clusters in tests
With tablets enabled, we can't create an Alternator table on a three-
node cluster with a single rack, since Scylla refuses RF=3 with just
one rack and we get the error:

    An error occurred (InternalServerError) when calling the CreateTable
    operation: ... Replication factor 3 exceeds the number of racks (1) in
    dc datacenter1

So in test/cluster/test_alternator.py we need to use the incantation
"auto_rack_dc='dc1'" every time that we create a three-node cluster.

Before this patch, several tests in test/cluster/test_alternator.py
failed on this error, with this patch all of them pass.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-11-09 12:52:29 +02:00
Nadav Har'El
c03081eb12 alternator: improve error in tablets_mode_for_new_keyspaces=enforced
When in tablets_mode_for_new_keyspaces=enforced mode, Alternator is
supposed to fail when CreateTable asks explicitly for vnodes. Before
this patch, this error was an ugly "Internal Server Error" (an
exception thrown from deep inside the implementation), this patch
checks for this case in the right place, to generate a proper
ValidationException with a proper error message.

We also enable the test test_tablets_tag_vs_config which should have
caught this error, but didn't because it was marked xfail because
tablets_mode_for_new_keyspaces had not been live-updatable. Now that
it is, we can enable the test. I also improved the test to be slightly
faster (no need to change the configuration so many times) and also
check the ordinary case - where the schema doesn't choose neither
vnodes nor tablets explicitly and we should just use the default.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-11-09 12:52:29 +02:00
Nadav Har'El
25439127c8 config: make tablets_mode_for_new_keyspaces live-updatable
We have a configuration option "tablets_mode_for_new_keyspaces" which
determines whether new keyspaces should use tablets or vnodes.

For some reason, this configuration parameter was never marked live-
updatable, so in this patch we add flag. No other changes are needed -
the existing code that uses this flag always uses it through the
up-to-date configuration.

In the previous patches we start to honor tablets_mode_for_new_keyspaces
also in Alternator CreateTable, and we wanted to test this but couldn't
do this in test/alternator because the option was not live-updatable.
Now that it will be, we'll be able to test this feature in
test/alternator.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-11-09 12:52:29 +02:00
Nadav Har'El
b34f28dae2 alternator: improve comment about non-hidden system tags
The previous patches added a somewhat misleading comment in front of
system:initial_tablets, which this patch improves.

That tag is NOT where Alternator "stores" table properties like the
existing comment claimed. In fact, the whole point is that it's the
opposite - Alternator never writes to this tag - it's a user-writable
tag which Alternator *reads*, to configure the new table. And this is
why it obviously can't be hidden from the user.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-11-09 12:52:29 +02:00
Piotr Szymaniak
eeb3a40afb alternator: Fix test_ttl_expiration_streams()
The test is now aware of the new name of the
`system:initial_tablets` tag.
2025-11-09 12:52:29 +02:00
Piotr Szymaniak
a659698c6d alternator: Fix test_scan_paging_missing_limit()
With tablets, the test begun failing. The failure was correlated with
the number of initial tablets, which when kept at default, equals
4 tablets per shard in release build and 2 tablets per shard in dev
build.

In this patch we split the test into two - one with a more data in
the table to check the original purpose of this test - that Scan
doesn't return the entire table in one page if "Limit" is missing.
The other test reproduces issue #10327 - that when the table is
small, Scan's page size isn't strictly limited to 1MB as it is in
DynamoDB.

Experimentally, 8000 KB of data (compared to 6000 KB before this patch)
is enough when we have up to 4 initial tablets per shard (so 8 initial
tablets on a two-shard node as we typically run in tests).

Original patch by Piotr Szymaniak <piotr.szymaniak@scylladb.com>
modified by Nadav Har'El <nyh@scylladb.com>
2025-11-09 12:52:29 +02:00
Piotr Szymaniak
345747775b alternator: Don't require vnodes for TTL tests
Since #23662 Alternator supports TTL with tablets too. Let's clear some
leftovers causing Alternator to test TTL with vnodes instead of with
what is default for Alternator (tablets or vnodes).
2025-11-09 12:52:29 +02:00
Piotr Szymaniak
274d0b6d62 alternator: Remove obsolete test from test_table.py
Since Alternator is capable of runnng with tablets according to the
flag in config, remove the obsolete test that is making sure
that Alternator runs with vnodes.
2025-11-09 12:52:29 +02:00
Piotr Szymaniak
63897370cb alternator: Fix tag name to request vnodes
The tag was lately renamed from `experimental:initial_tablets` to
`system::initial_tablets`. This commit fixes both the tests as well as
the exceptions sent to the user instructing how to create table with
vnodes.
2025-11-09 12:52:29 +02:00
Piotr Szymaniak
c7de7e76f4 alternator: Fix test name clash in test_tablets.py 2025-11-09 12:52:28 +02:00
Piotr Szymaniak
7466325028 alternator: test_tablets.py handles new policy reg. tablets
Adjust the tests so they are in-line with the config flag
'tablets_mode_for_new_keyspaces` that the Alternator learned to honour.
2025-11-09 12:52:28 +02:00
Piotr Szymaniak
35216d2f01 alternator: Update doc regarding tablets support
Reflect honouring by Alternator the value of the config flag
`tablets_mode_for_new_keyspaces`, as well as renaming of the tag
`experimental:initial_tablets` into `system:initial_tablets`.
2025-11-09 12:52:28 +02:00
Piotr Szymaniak
376a2f2109 alternator: Support tablets_mode_for_new_keyspaces config flag
Until now, tablets in Alternator were experimental feature enabled only
when a TAG "experimental:initial_tablets" was present when creating a
table and associated with a numeric value.

After this patch, Alternator honours the value of
`tablets_mode_for_new_keyspaces` config flag.

Each table can be overriden to use tablets or not by supplying a new TAG
"system:initial_tablets". The rules stay the same as with the earlier,
experimental tag: when supplied with a numeric value, the table will use
tablets (as long as they are supported). When supplied with something
else (like a string "none"), the table will use vnodes, provided that
tablets are not `enforced` by the config flag.

Fixes #22463
2025-11-09 12:52:17 +02:00
Piotr Szymaniak
af00b59930 Fix incorrect hint for tablets_mode_for_new_keyspaces 2025-11-09 10:49:46 +02:00
Piotr Szymaniak
403068cb3d Fix comment for tablets_mode_for_new_keyspaces
The comment was not listing all the 3 possible values correctly,
despite an explanation just below covers all 3 values.
2025-11-09 10:49:46 +02:00
Botond Dénes
cdba3bebda Merge 'Generalize directory checks in database_test's snapshot test cases' from Pavel Emelyanov
Those test cases use lister::scan_dir() to validate the contents of snapshot directory of a table against this table's base directory. This PR generalizes the listing code making it shorter.

Also, the snapshot_skip_flush_works case is missing the check for "schema.cql" file. Nothing is wrong with it, but the test is more accurate if checking it.

Also, the snapshot_with_quarantine_works case tries to check if one set of names is sub-set of another using lengthy code. Using std::includes improves the test readability a lot.

Also, the PR replaces lister::scan_dir() with directory_lister. The former is going to be removed some day (see also #26586)

Improving existing working test, no backport is needed.

Closes scylladb/scylladb#26693

* github.com:scylladb/scylladb:
  database_test: Simplify snapshot_with_quarantine_works() test
  database_test: Improve snapshot_skip_flush_works test
  database_test: Simplify snapshot_works() tests
  database_test: Use collect_files() to remove files
  database_test: Use collectz_files() to count files in directory
  database_test: Introduce collect_files() helper
2025-11-07 16:04:02 +02:00
Michał Chojnowski
b82c2aec96 sstables/trie: fix an assertion violation in bti_partition_index_writer_impl::write_last_key
_last_key is a multi-fragment buffer.

Some prefix of _last_key (up to _last_key_mismatch) is
unneeded because it's already a part of the trie.
Some suffix of _last_key (after needed_prefix) is unneeded
because _last_key can be differentiated from its neighbors even without it.

The job of write_last_key() is to find the middle fragments,
(containing the range `[_last_key_mismatch, needed_prefix)`)
trim the first and last of the middle fragments appropriately,
and feed them to the trie writer.

But there's an error in the current logic,
in the case where `_last_key_mismatch` falls on a fragment boundary.
To describe it with an example, if the key is fragmented like
`aaa|bbb|ccc`, `_last_key_mismatch == 3`, and `needed_prefix == 7`,
then the intended output to the trie writer is `bbb|c`,
but the actual output is `|bbb|c`. (I.e. the first fragment is empty).

Technically the trie writer could handle empty fragments,
but it has an assertion against them, because they are a questionable thing.

Fix that.

We also extend bti_index_test so that it's able to hit the assert
violation (before the patch). The reason why it wasn't able to do that
before the patch is that the violation requires decorated keys to differ
on the _first_ byte of a partition key column, but the keys generated
by the test only differed on the last byte of the column.
(Because the test was using sequential integers to make the values more
human-readable during debugging). So we modify the key generation
to use random values that can differ on any position.

Fixes scylladb/scylladb#26819

Closes scylladb/scylladb#26839
2025-11-07 11:25:07 +02:00
Abhinav Jha
ab0e0eab90 raft topology: skip non-idempotent steps in decommission path to avoid problems during races
In the present scenario, there are issues in left_token_ring transition state
execution in the decommissioning path. In case of concurrent mutation race
conditions, we enter left_token_ring more than once, and apparently if
we enter left token ring second time, we try to barrier the decommisioned
node, which at this point is no longer possible. That's what causes the errors.

This pr resolves the issue by adding a check right in the start of
left_token_ring to check if the first topology state update, which marks
the request as done is completed. In this case, its confirmed that this
is the second time flow is entering left_token_ring and the steps preceding
the request status update should be skipped. In such cases, all the rest
steps are skipped and topology node status update( which threw error in
previous trial) is executed directly. Node removal status from group0 is
also checked and remove operation is retried if failed last time.

Although these changes are done with regard to the decommission operation
behavior in `left_token_ring` transition state, but since the pr doesn't
interfere with the core logic, it should not derail any rollback specific
logic. The changes just prevent some non-idempotent operations from
re-occuring in case of failures. Rest of the core logic remain intact.

Test is also added to confirm the proper working of the same.

Fixes: scylladb/scylladb#20865

Backport is not needed, since this is not a super critical bug fix.

Closes scylladb/scylladb#26717
2025-11-07 10:07:49 +01:00
Ran Regev
aaf53e9c42 nodetool refresh primary-replica-only
Fixes: #26440

1. Added description to primary-replica-only option
2. Fixed code text to better reflect the constrained cheked in the code
   itself. namely: that both primary replica only and scope must be
applied only if load and steam is applied too, and that they are mutual
exclusive to each other.
Note: when https://github.com/scylladb/scylladb/issues/26584 is
implemented (with #26609) there will be a need to align the docs as
well - namely, primary-replica-only and scope will no longer be
mutual exclusive

Signed-off-by: Ran Regev <ran.regev@scylladb.com>

Closes scylladb/scylladb#26480
2025-11-07 10:59:27 +02:00
Avi Kivity
245173cc33 tools: toolchain: optimized_clang: remove unused variable CLANG_SUFFIX
The variable was unused since cae999c094 ("toolchain: change
optimized clang install method to standard one"), and now causes
the differential shellcheck continuous integration test to fail whenever
it is changed. Remove it.

Closes scylladb/scylladb#26796
2025-11-07 10:08:23 +02:00
Patryk Jędrzejczak
d6c64097ad Merge 'storage_proxy: use gates to track write handlers destruction' from Petr Gusev
In [#26408](https://github.com/scylladb/scylladb/pull/26408) a `write_handler_destroy_promise` class was introduced to wait for `abstract_write_response_handler` instances destruction. We strived to minimize the memory footprint of `abstract_write_response_handler`, with `write_handler_destroy_promise`-es we required only a single additional int. It turned our that in some cases a lot of write handlers can be scheduled for deletion at the same time, in such cases the vector can become big and cause 'oversized allocation' seastar warnings.

Another concern with `write_handler_destroy_promise`-es [was that they were more complicated than it was worth](https://github.com/scylladb/scylladb/pull/26408#pullrequestreview-3361001103).

In this commit we replace `write_handler_destroy_promise` with simple gates. One or more gates can be attached to an `abstract_write_response_handler` to wait for its destruction. We use `utils::small_vector` to store the attached gates. The limit 2 was chosen because we expect two gates at the same time in most cases. One is `storage_proxy::_write_handlers_gate`, which is used to wait for all handlers in `cancel_all_write_response_handlers`. Another one can be attached by a caller of `cancel_write_handlers`. Nothing stops several cancel_write_handlers to be called at the same time, but it should be rare.

The `sizeof(utils::small_vector) == 40`, this is `40.0 / 488 * 100 ~ 8%` increase in `sizeof(abstract_write_response_handler)`, which seems acceptable.

Fixes [scylladb/scylladb#26788](https://github.com/scylladb/scylladb/issues/26788)

backport: need to backport to 2025.4 (LWT for tablets release)

Closes scylladb/scylladb#26827

* https://github.com/scylladb/scylladb:
  storage_proxy: use coroutine::maybe_yield();
  storage_proxy: use gates to track write handlers destruction
2025-11-06 10:17:04 +01:00
Nadav Har'El
b8da623574 Update tools/cqlsh submodule
* tools/cqlsh f852b1f5...19445a5c (2):
  > Update scylla-driver version to 3.29.4

Update tools/cqlsh submodule for scylla-driver 3.29.4

The motivation for this update is to resolve a driver-side serialization bug that was blocking work on #26740. The bug affected vector<collection> types (e.g., vector<set<int>,1>) and is fixed in scylla-driver versions 3.29.2+.

Refs #26704
2025-11-06 10:01:26 +02:00
Asias He
dbeca7c14d repair: Add metric for time spent on tablet repair
It is useful to check time spent on tablet repair. It can be used to
compare incremental repair and non-incremental repair. The time does not
include the time waiting for the tablet scheduler to schedule the tablet
repair task.

Fixes #26505

Closes scylladb/scylladb#26502
2025-11-06 10:00:20 +03:00
Dario Mirovic
c3a673d37f audit: move storage helper creation from audit::start to audit::audit
Extract storage helper creation into `create_storage_helper` function.
Call this function from `audit::audit`. It will be called per shard inside
`sharded<audit>::start` method.

Refs #26022
2025-11-06 03:05:43 +01:00
Dario Mirovic
28c1c0f78d audit: fix formatting in audit::start_audit
Refs #26022
2025-11-06 03:05:17 +01:00
Dario Mirovic
549e6307ec audit: unify create_audit and start_audit
There is no need to have `create_audit` separate from `start_audit`.
`create_audit` just stores the passed parameters, while `start_audit`
does the actual initialization and startup work.

Refs #26022
2025-11-06 03:05:06 +01:00
Calle Wilund
b0061e8c6a gcp_object_storage_test: Re-enable parallelism.
Re-enable parallel execution to get better logs.
Note, this is somewhat wasteful, as we won't re-use test fixture here,
but in the end, it is probably an improvement.
2025-11-05 15:07:26 +00:00
Wojciech Mitros
0a22ac3c9e mv: don't mark the view as built if the reader produced no partitions
When we build a materialized view we read the entire base table from start to
end to generate all required view udpates. If a view is created while another view
is being built on the same base table, this is optimized - we start generating
view udpates for the new view from the base table rows that we're currently
reading, and we read the missed initial range again after the previous view
finishes building.
The view building progress is only updated after generating view updates for
some read partitions. However, there are scenarios where we'll generate no
view updates for the entire read range. If this was not handled we could
end up in an infinite view building loop like we did in https://github.com/scylladb/scylladb/issues/17293
To handle this, we mark the view as built if the reader generated no partitions.
However, this is not always the correct conclusion. Another scenario where
the reader won't encounter any partitions is when view building is interrupted,
and then we perform a reshard. In this scenario, we set the reader for all
shards to the last unbuilt token for an existing partition before the reshard.
However, this partition may not exist on a shard after reshard, and if there
are also no partitions with higher tokens, the reader will generate no partitions
even though it hasn't finished view building.
Additionally, we already have a check that prevents infinite view building loops
without taking the partitions generated by the reader into account. At the end
of stream, before looping back to the start, we advance current_key to the end
of the built range and check for built views in that range. This handles the case
where the entire range is empty - the conditions for a built view are:
1. the "next_token" is no greater than "first_token" (the view building process
looped back, so we've built all tokens above "first_token")
2. the "current_token" is no less than "first_token" (after looping back, we've
built all tokens below "first_token")

If the range is empty, we'll pass these conditions on an empty range after advancing
"current_key" to the end because:
1. after looping back, "next_token" will be set to `dht::minimum_token`
2. "current_key" will be set to `dht::ring_position::max()`

In this patch we remove the check for partitions generated by the reader. This fixes
the issue with resharding and it does not resurrect the issue with infinite view building
that the check was introduced for.

Fixes https://github.com/scylladb/scylladb/issues/26523

Closes scylladb/scylladb#26635
2025-11-05 17:02:32 +02:00
Petr Gusev
5bda226ff6 storage_proxy: use coroutine::maybe_yield();
This is a small "while at it" refactoring -- better to use
coroutine::maybe_yield with co_await-s.
2025-11-05 14:38:19 +01:00
Petr Gusev
4578304b76 storage_proxy: use gates to track write handlers destruction
In #26408 a write_handler_destroy_promise class was introduced to
wait for abstract_write_response_handler instances destruction. We
strived to minimize the memory footprint of
abstract_write_response_handler, with write_handler_destroy_promise-es
we required only a single additional int. It turned our that in some
cases a lot of write handlers can be scheduled for deletion
at the same time, in such cases the
vector<write_handler_destroy_promise> can become big and cause
'oversized allocation' seastar warnings.

Another concern with write_handler_destroy_promise-es was that they
were more complicated than it was worth.

In this commit we replace write_handler_destroy_promise with simple
gates. One or more gates can be attached to an
abstract_write_response_handler to wait for its destruction. We use
utils::small_vector<gate::holder, 2> to store the attached gates.
The limit 2 was chosen because we expect two gates at the same time
in most cases. One is storage_proxy::_write_handlers_gate,
which is used to wait for all handlers in
cancel_all_write_response_handlers. Another one can be attached by
a caller of cancel_write_handlers. Nothing stops several
cancel_write_handlers to be called at the same time, but it should be
rare.

The sizeof(utils::small_vector<gate::holder, 2>) == 40, this is
40.0 / 488 * 100 ~ 8% increase in
sizeof(abstract_write_response_handler), which seems acceptable.

Fixes scylladb/scylladb#26788
2025-11-05 14:37:52 +01:00
Nadav Har'El
8a07b41ae4 test/cqlpy: add test confirming page_size=0 disables paging
In pull request #26384 a discussion started whether page_size=0 really
disables paging, or maybe one needs page_size=-1 to truly disable paging.

The reason for that discussion was commit 08c81427b that started to
use page_size=-1 for internal unpaged queries, and commit 76b31a3 that
incorrectly claimed that page_size>=0 means paging is enabled.

This patch introduces a test that confirms that with page_size=0, paging
is truly disabled - including the size-based (1MB) paging.

The new test is Scylla-only, because Cassandra is anyway missing the
size-based page cutoff (see CASSANDRA-11745).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#26742
2025-11-05 15:52:16 +03:00
Tomasz Grabiec
f8879d797d tablet_allocator: Avoid load balancer failure when replacing the last node in a rack
Introduced in 9ebdeb2

The problem is specific to node replacing and rack-list RF. The
culprit is in the part of the load balancer which determines rack's
shard count. If we're replacing the last node, the rack will contain
no normal nodes, and shards_per_rack will have no entry for the rack,
on which the table still has replicas. This throws std::out_of_range
and fails the tablet draining stage, and node replace is failed.

No backport because the problem exists only on master.

Fixes #26768

Closes scylladb/scylladb#26783
2025-11-05 15:49:51 +03:00
Avi Kivity
8e480110c2 dist: housekeeping: set python.multiprocessing fork mode to "fork"
Python 3.14 changed the multiprocessing fork mode to "forkserver",
presumably for good reasons. However, it conflicts with our
relocatable Python system. "forkserver" forks and execs a Python
process at startup, but it does this without supplying our relocated
ld.so. The system ld.so detects a conflict and crashes.

Fix this by switching back to "fork", which is sufficient for
housekeeping's modest needs.

Closes scylladb/scylladb#26831
2025-11-05 15:47:38 +03:00
Pavel Emelyanov
05d711f221 database_test: Simplify snapshot_with_quarantine_works() test
The test collects Data files from table dir, then _all_ files from
snapshot dir and then checks whether the former is the subset of the
latter. Using std::includes over two sets makes the code much shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-05 15:35:28 +03:00
Pavel Emelyanov
c8492b3562 database_test: Improve snapshot_skip_flush_works test
It has two inaccuracies.

First, when checking the contents of table directory, it uses
pre-populated expected list with "manifest.json" in it. Weird.

Second, when cechking the contents of snapshot directory it doesn't
check if the "schema.cql" is there. It's always there, but if something
breaks in the future it may come unnoticed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-05 15:35:26 +03:00
Pavel Emelyanov
5a25d74b12 database_test: Simplify snapshot_works() tests
No functional changes here, just make use of the new lister to shorten
the code. A small side effect -- if the test fails because contents of
directories changes, it will print the exact difference in logs, not
just that N files are missing/present.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-05 15:34:25 +03:00
Pavel Emelyanov
365044cdbb database_test: Use collect_files() to remove files
Some test cases remove files from table directory to perform some checks
over the taken snapshots. Using collect_files() helper makes the code
easier to read.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-05 15:34:24 +03:00
Pavel Emelyanov
e1f326d133 database_test: Use collectz_files() to count files in directory
Some test cases want to see that there are more than one file in a
directory, so they can just re-use the new helper. Much shorter this
way.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-05 15:32:58 +03:00
Pavel Emelyanov
60d1f78239 database_test: Introduce collect_files() helper
It returns a set of files in a given directoy. Will be used by all next
patches.

Implemented using directory_lister, not lister::scan_dir in order to
help removing the latter one in the future.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-05 15:32:58 +03:00
Calle Wilund
6c6105e72e test::pylib: Add azure (mock) testing to EAR matrix
Fixes #26782

Adds a provider proxy for azure, using the existing mock server,
now as a fixture.
2025-11-05 10:22:23 +00:00
Calle Wilund
b8a6b6dba9 test::boost::encryption_at_rest: Remove redundant azure test indent 2025-11-05 10:22:23 +00:00
Calle Wilund
10e591bd6b test::boost::encryption_at_rest: Move azure tests to use fixture
Fixes #26781

Makes the test independent of wrapping scripts. Note: retains the
split into "real" and "mock" tests. For other tests, we either all
mock, or allow the environment to select mock or real. Here we have
them combined. More expensive, but otoh more thourough.
2025-11-05 10:22:22 +00:00
Calle Wilund
1d37873cba test::lib: Add azure mock/real server fixture
Wraps the real/mock azure server for test in a fixture.
Note: retains the current test setup which explicitly runs
some tests with "real" azure, if avail, and some always mock.
2025-11-05 10:22:22 +00:00
Calle Wilund
10041419dc test::pylib::boost: Fix test gather to handle test suites
Fixes #26775
2025-11-05 10:22:22 +00:00
Calle Wilund
565c701226 utils::gcp::object_storage: Fix typo in semaphore init
Fixes #26776

Semaphore storage is ssize_t, not size_t.
2025-11-05 10:22:22 +00:00
Calle Wilund
2edf6cf325 test::boost::encryption_at_rest_test: Remove redundant indent
Removed empty scope and reindents kms test using fixtures.
2025-11-05 10:22:22 +00:00
Calle Wilund
286a655bc0 test::boost::test_encryption_at_rest: Move to AWS KMS fixture for kms test
Fixes #26780

Uses fake/real CI endpoint for AWS KMS tests, and moves these into a
suite for sharing the mock server.
2025-11-05 10:22:22 +00:00
Calle Wilund
a1cc866f35 test::boost::test_encryption_at_rest: Reorder tests and helpers
No code changes. Just reorders code to organize more by provider etc,
prepping for fixtures and test suites.
2025-11-05 10:22:22 +00:00
Calle Wilund
af85b7f61b ent::encryption: Make text helper routines take std::string
Moving away from custom string type. Pure cosmetics.
2025-11-05 10:22:22 +00:00
Calle Wilund
1b0394762e test::pylib::dockerized_service: Handle docker/podman bind error message
If we run non-dbuild, docker/podman can/will cause first bind error,
we should check these too.
2025-11-05 10:22:22 +00:00
Calle Wilund
0842b2ae55 test::lib::aws_kms_fixture: Add a fixture object to run mock AWS KMS
Runs local-kms mock AWS KMS server unless overridden by env var.
Allows tests to use real or fake AWS KMS endpoint and shared fixture
for quicker execution.
2025-11-05 10:22:21 +00:00
Calle Wilund
98c060232e test::lib::gcs_fixture: Only set port if running docker image + more retry
Our connect can spuriously fail. Just retry.
2025-11-05 10:22:21 +00:00
Wojciech Mitros
977fa91e3d view_building_coordinator: rollback tasks on the leaving tablet replica
When a tablet migration is started, we abort the corresponding view
building tasks (i.e. we change the state of those tasks to "ABORTED").
However, we don't change the host and shard of these tasks until the
migration successfully completes. When for some reason we have to
rollback the migration, that means the migration didn't finish and
the aborted task still has the host and shard of the migration
source. So when we recreate tasks that should no longer be aborted
due to a rolled-back migration, we should look at the aborted tasks
of the source (leaving) replica. But we don't do it and we look at
the aborted tasks of the target replica.
In this patch we adjust the rollback mechanism to recreate tasks
for the migration source instead of destination. We also fix the
test that should have detected this issue - the injection that
the test was using didn't make us rollback, but we simply retried
a stage of the tablet migration. By using one_shot=False and adding
a second injection, we can now guarantee that the migration will
eventually fail and we'll continue to the 'cleanup_target' and
'revert_migration' stages.

Fixes https://github.com/scylladb/scylladb/issues/26691

Closes scylladb/scylladb#26825
2025-11-05 10:44:06 +01:00
Pavel Emelyanov
2cb98fd612 Merge 'api: storage_service: tasks: unify sync and async compaction APIs' from Aleksandra Martyniuk
Currently, all apis that start a compaction have two versions:
synchronous and asynchronous. They share most of the implementation,
but some checks and params have diverged.

Unify the handlers of synchronous and asynchronous cleanup, major
compaction, and upgrade_sstables.

Fixes: https://github.com/scylladb/scylladb/issues/26715.

Requires backports to all live versions

Closes scylladb/scylladb#26746

* github.com:scylladb/scylladb:
  api: storage_service: tasks: unify upgrade_sstable
  api: storage_service: tasks: force_keyspace_cleanup
  api: storage_service: tasks: unify force_keyspace_compaction
2025-11-05 10:47:14 +03:00
Pavel Emelyanov
59019bc9a9 Merge 'Alternator: allow warning on auth errors before enabling enforcement' from Nadav Har'El
An Alternator user was recently "bit" when switching `alternator_enforce_authorization` from "false" to "true": ְְְAfter the configuration change, all application requests suddenly failed because unbeknownst to the user, their application used incorrect secret keys.

This series introduces a solution for users who want to **safely** switch `alternator_enforce_authorization`  from "false" to "true": Before switching from "false" to "true", the user can temporarily switch a new option, `alternator_warn_authorization`, to true. In this "warn" mode, authentication and authorization errors are counted in metrics (`scylla_alternator_authentication_failures` and `scylla_alternator_authorization_failures`) and logged as WARNings, but the user's application continues to work. The user can use these metrics or log messages to learn of errors in their application's setup, fix them, and only do the switch of `alternator_enforce_authorization` when the metrics or log messages show there are no more errors.

The first patch is the implementation of the the feature - the new configuration option, the metrics and the log messages,  the second patch is a test for the new feature, and the third patch is documentation recommending how to use the warn mode and the associated metrics or log messages to safely switch `alternaor_enforce_authorization` from false to true.

Fixes #25308

This is a feature that users need, so it should probably be backported to live branches.

Closes scylladb/scylladb#25457

* github.com:scylladb/scylladb:
  docs/alternator: explain alternator_warn_authorization
  test/alternator: tests for new auth failure metrics and log messages
  alternator: add alternator_warn_authorization config
2025-11-05 10:45:17 +03:00
Pavel Emelyanov
fc37518aff test: Check file existence directly
There's a test that checks if temporary-statistics file is gone at some
point. It does it by listing the directory it expects the file to be in
and then comparing the names met with the temp. stat. file name.

It looks like a single file_exists() call is enough for that purpose.

As a "sanity" check this patch adds a validation that non-temporary
statistics file is there, all the more so this file is removed after the
test.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26743
2025-11-04 19:37:55 +01:00
Avi Kivity
95700c5f7f Merge 'Support counters with tablets' from Michael Litvak
Support the counters feature in tablets keyspaces.

The main change is to fix the counter update during tablets intranode migration.

Counter cell is c = map<host_id, value>. A counter update is applied by doing read-modify-write on a leader replica to retrieve the current host's counter value and transform the mutation to contain the updated value for the host, then apply the mutation and replicate it to other hosts. the read-modify-write is protected against concurrent updates by locking the counter cell.

When the counter is migrated between two shards, it's not enough to lock the counter on the read shard, because in the stage write_both_read_new the read shard is switched, and then we can have concurrent updates reach either the old or the new shard. In order to keep the counter update exclusive we lock both shards when in the stage write_both_read_new.

Also, when applying the transformed mutation we need to respect write_both stages and apply the mutation on both shards. We change it to use `apply_on_shards` similarly to other methods in storage proxy.

The change applies to both tablets and vnodes, they use the same implementation, but for vnodes the behavior should remain equivalent up to some small reordering of the code since it doesn't have intranode migration and reduces to single read shard = write shard.

Fixes https://github.com/scylladb/scylladb/issues/18180

no backport - new feature

Closes scylladb/scylladb#26636

* github.com:scylladb/scylladb:
  docs: counters now work with tablets
  pgo: enable counters with tablets
  test: enable counters tests with tablets
  test: add counters with tablets test
  cql3: remove warning when creating keyspace with tablets
  cql3: allow counters with tablets
  storage_proxy: lock all read shards for counter update
  storage_proxy: apply counter mutation on all write shards
  storage_proxy: move counter update coordination to storage proxy
  storage_proxy: refactor mutate_counter_on_leader
  replica/db: add counter update guard
  replica/db: split counter update helper functions
2025-11-03 22:28:10 +01:00
Raphael S. Carvalho
7f34366b9d sstables_loader: Don't bypass synchronization with busy topology
The patch c543059f86 fixed the synchronization issue between tablet
split and load-and-stream. The synchronization worked only with
raft topology, and therefore was disabled with gossip.
To do the check, storage_service::raft_topology_change_enabled()
but the topology kind is only available/set on shard 0, so it caused
the synchronization to be bypassed when load-and-stream runs on
any shard other than 0.

The reason the reproducer didn't catch it is that it was restricted
to single cpu. It will now run with multi cpu and catch the
problem observed.

Fixes #22707

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#26730
2025-11-03 18:10:08 +01:00
Michael Litvak
8555fd42df docs: counters now work with tablets
Counters are now supported in tablet-enabled keyspaces, so remove
the documentation that listed counters as an unsupported feature
and the note warning users about the limitation.
2025-11-03 16:04:37 +01:00
Michael Litvak
1337f4213f pgo: enable counters with tablets
Now that counters are supported with tablets, update the keyspace
statement for counters to allow it to run with tablets.
2025-11-03 16:04:37 +01:00
Michael Litvak
1dbf53ca29 test: enable counters tests with tablets
Enable all counters-related tests that were disabled for tablets because
counters was not supported with tablets until now.

Some tests were parametrized to run with both vnodes and tablets, and
the tablets case was skipped, in order to not lose coverage. We change
them to run with the default configuration since now counters is
supported with both vnodes and tablets, and the implementation is the
same, so there is no benefit in running them with both configurations.
2025-11-03 16:04:37 +01:00
Michael Litvak
a6c12ed1ef test: add counters with tablets test
add a new test for counters with tablets to test things that are
specific to tablets. test counter updates that are concurrent with
tablet internode and intranode migrations and verify it remains
consistent and no updates are lost.
2025-11-03 16:04:37 +01:00
Michael Litvak
60ac13d75d cql3: remove warning when creating keyspace with tablets
When creating a keyspace with tablets, a warning is shown with all the
unsupported features for tablets, which is only counters currently.

Now that counters is also supported with tablets, we can remove this
warning entirely.
2025-11-03 16:04:37 +01:00
Michael Litvak
9208b2f317 cql3: allow counters with tablets
Now that counters work with tablets, allow to create a table with
counters in a tablets-enabled keyspace, and remove the warning about
counters not being supported when creating a keyspace with tablets.

We allow to use counters with tablets only when all nodes are upgraded
and support counters with tablets. We add a new feature flag to
determine if this is the case.

Fixes scylladb/scylladb#18180
2025-11-03 16:04:37 +01:00
Michael Litvak
296b116ae2 storage_proxy: lock all read shards for counter update
Previously in a counter update we lock the read shard to protect the
counter's read-modify-write against concurrent updates.

This is not sufficient when the counter is migrated between different
shards, because there is a stage where the read shard switches from the
old shard to the new shard, and during that switch there can be
concurrent counter updates on both shards. If each shard takes only its
own lock, the operations will not be exclusive anymore, and this can
cause lost counter updates.

To fix this, we acquire the counter lock on both shards in the stage
write_both_read_new, when both shards can serve reads. This guarantees
that counter updates continue to be exclusive during intranode
migration.
2025-11-03 16:04:35 +01:00
Michael Litvak
de321218bc storage_proxy: apply counter mutation on all write shards
When applying a counter mutation, use apply_on_shards to apply the
mutation on all write shards, similarly to the way other mutations are
applied in the storage proxy. Previously the mutation was applied only
on the current shard which is the read shard.

This is needed to respect the write_both stages of intranode migration
where we need to apply the mutation on both the old and the new shards.
2025-11-03 16:03:29 +01:00
Michael Litvak
c7e7a9e120 storage_proxy: move counter update coordination to storage proxy
Refactor the counter update to split the functions and have them called
by the storage proxy to prepare for a later change.

Previously in mutate_counter the storage proxy calls the replica
function apply_counter_update that does a few things:
1. checks that the operation can be done: check timeout, disk utilization
2. acquire counter locks
3. do read-modify-write and transform the counter mutation
4. apply the mutation in the replica

In this commit we change it so that these functions are split and called
from the storage proxy, so that we have better control from the storage
proxy when we change it later to work across multiple shards. For
example, we will want to acquire locks on multiple shards, transform it
on one shard, and then apply the mutation on multiple shards.

After the change it works as follows in storage proxy:
1. acquire counter locks
2. call replica prepare to check the operation and transform the mutation
3. call replica apply to apply the transformed mutation
2025-11-03 15:59:46 +01:00
Tomasz Grabiec
e878042987 Revert "Revert "tests(lwt): new test for LWT testing during tablet resize""
This reverts commit 6cb14c7793.

The issue causing the previous revert was fixed in 88765f627a.
2025-11-03 10:38:00 +01:00
Michael Litvak
579031cfc8 storage_proxy: refactor mutate_counter_on_leader
Slightly reorganize the mutate counter function to prepare it for a
later change.

Move the code that finds the read shard and invokes the rest of the
function on the read shard to the caller function. This simplifies the
function mutate_counter_on_leader_and_replicate which now runs on the
read shard and will make it easier to extend.
2025-11-03 08:43:11 +01:00
Michael Litvak
7cc6b0d960 replica/db: add counter update guard
Add a RAII guard for counter update that holds the counter locks and the
table operation, and extract the creation of the guard to a separate
function.

This prepares it for a later change where we will want to obtain the
guard externally from the storage proxy.
2025-11-03 08:43:11 +01:00
Michael Litvak
88fd9a34c4 replica/db: split counter update helper functions
Split do_apply_counter_update to a few smaller and simpler functions to
help prepare for a later change.
2025-11-03 08:43:11 +01:00
Avi Kivity
9b6ce030d0 sstables: remove quadratic (and possibly exponential) compile time in parse()
parse() taking a list of elements is quadratic (during compile time) in
that it generates recursive calls to itself, each time with one fewer
parameter. The total size of the parameter lists in all these generated
functions is quadratic in the initial parameter list size.

It's also exponential if we ignore inlining limits, since each .then()
call expands to two branches - a ready future branch and a non-ready
future branch. If the compiler did not give up, we'd have 2^list_len
branches. For sure the compiler does not do so indefinitely, but the effort
getting there is wasted.

Simplify by using a fold expression over the comma operator. Instead
of passing the remaining parameter list in each step, we pass only
the parameter we are processing now, making processing linear, and not
generating unnecessary functions.

It would be better expressed using pack expansion statements, but these
are part of C++26.

The largest offender is probably stats_metadata, with 21 elements.

dev-mode sstables.o:

   text	   data	    bss	    dec	    hex	filename
1760059	   1312	   7673	1769044	 1afe54	sstables.o.before
1745533	   1312	   7673	1754518	 1ac596	sstables.o.after

We save about 15k of text with presumably a corresponding (small)
decrease in compile time.

Closes scylladb/scylladb#26735
2025-11-02 13:09:37 +01:00
Jenkins Promoter
cb30eb2e21 Update pgo profiles - aarch64 2025-11-01 05:23:52 +02:00
Jenkins Promoter
e3a0935482 Update pgo profiles - x86_64 2025-11-01 04:54:49 +02:00
Petr Gusev
88765f627a paxos_state: get_replica_lock: remove shard check
This check is incorrect: the current shard may be looking at
the old version of tablets map:
* an accept RPC comes to replica shard 0, which is already at write_both_read_new
* the new shard is shard 1, so paxos_state::accept is called on shard 1
* shard 1 is still at "streaming" -> shards_ready_for_reads() returns old
shard 0

Fixes scylladb/scylladb#26801

Closes scylladb/scylladb#26809
2025-10-31 21:37:39 +01:00
Avi Kivity
7a72155374 Merge 'Introduce nodetool excludenode' from Tomasz Grabiec
If a node is dead and cannot be brought back, tablet migrations are
stuck, until the node is explicitly marked as "permanently dead" /
"ignored node" / "excluded" (name differs in different contexts).

Currently, this is done during removenode and replace operations but
it should be possible to only mark the node as dead, for the purpose
of unblocking migrations or other topology operations, without doing
the actual removenode, because full removal might be currently
impossible, or not desirable due to lack of capacity or priorities.

This patch introduces this kind of API:

```
  nodetool excludenode <host-id> [ ... <host-id> ]
```

Having this kind of API is an improvement in user experience in
several cases. For example, when we lose a rack, the only viable
option for recovery is to run removenode with an extra
--ignore-dead-nodes option. This removenode will fail in the tablet
draining phase, as there is no live node in the rack to rebuild
replicas in. This is confusing to the operator. But necessary before
ALTER KEYSPACE can proceed in order to change replication options to
drop the rack from RF.

Having this API allows operators to have more unified procedures,
where "nodetool excludenode" is always the first step of recovery,
which unblocks further topology operations, both those which restore
capacity, but also auto-scaling, tablet split/merge, load balancing,
etc.

Fixes #21281

The PR also changes "nodetool status" to show excluded nodes,
they have 'X' in their status instead of 'D'.

Closes scylladb/scylladb#26659

* github.com:scylladb/scylladb:
  nodetool: status: Show excluded nodes as having status 'X'
  test: py: Test scenario involving excludenode API
  nodetool: Introduce excludenode command
2025-10-31 22:14:57 +02:00
Avi Kivity
d458dd41c6 Merge 'Avoid input_/output_stream-s default initialization and move-assignment' from Pavel Emelyanov
Recent seastar update deprecated in/out streams usage pattern when a stream is default constructed early and them move-assigned with the proper one (see scylladb/seastar#3051). This PR fixes few places in Scylla that still use one.

Adopting newer seastar API, no need to backport

Closes scylladb/scylladb#26747

* github.com:scylladb/scylladb:
  commitlog: Remove unused work::r stream variable
  ec2_snitch: Fix indentation after previous patch
  ec2_snitch: Coroutinize the aws_api_call_once()
  sstable: Construct output_stream for data instantly
  test: Don't reuse on-stack input stream
2025-10-31 21:22:41 +02:00
Avi Kivity
adf9c426c2 Merge 'db/config: Change default SSTable compressor to LZ4WithDictsCompressor' from Nikos Dragazis
`sstable_compression_user_table_options` allows configuring a node-global SSTable compression algorithm for user tables via scylla.yaml. The current default is LZ4Compressor (inherited from Cassandra).

Make LZ4WithDictsCompressor the new default. Metrics from real datasets in the field have shown significant improvements in compression ratios.

If the dictionary compression feature is not enabled in the cluster (e.g., during an upgrade), fall back to the `LZ4Compressor`. Once the feature is enabled, flip the default back to the dictionary compressor using with a listener callback.

Fixes #26610.

Closes scylladb/scylladb#26697

* github.com:scylladb/scylladb:
  test/cluster: Add test for default SSTable compressor
  db/config: Change default SSTable compressor to LZ4WithDictsCompressor
  db/config: Deprecate sstable_compression_dictionaries_allow_in_ddl
  boost/cql_query_test: Get expected compressor from config
2025-10-31 21:15:18 +02:00
Lakshmi Narayanan Sreethar
3eb7193458 backlog_controller: compute backlog even when static shares are set
The compaction manager backlog is exposed via metrics, but if static
shares are set, the backlog is never calculated. As a result, there is
no way to determine the backlog and if the static shares need
adjustment. Fix that by calculating backlog even when static shares are
set.

Fixes #26287

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#26778
2025-10-31 18:18:36 +02:00
Michał Hudobski
fd521cee6f test: fix typo in vector_index test
Unfortunately in https://github.com/scylladb/scylladb/pull/26508
a typo than changes a behavior of a test was introduced.
This patch fixes the typo.

Closes scylladb/scylladb#26803
2025-10-31 18:35:02 +03:00
Tomasz Grabiec
284c73d466 scripts: pull_github_pr.sh: Fix auth problem detection
Before the patch, the script printed:

parse error: Invalid numeric literal at line 2, column 0

Closes scylladb/scylladb#26818
2025-10-31 18:32:58 +03:00
Michael Litvak
e7dbccd59e cdc: use chunked_vector instead of vector for stream ids
use utils::chunked_vector instead of std::vector to store cdc stream
sets for tablets.

a cdc stream set usually represents all streams for a specific table and
timestamp, and has a stream id per each tablet of the table. each stream
id is represented by 16 bytes. thus the vector could require quite large
contiguous allocations for a table that has many tablets. change it to
chunked_vector to avoid large contiguous allocations.

Fixes scylladb/scylladb#26791

Closes scylladb/scylladb#26792
2025-10-31 13:02:34 +01:00
Tomasz Grabiec
1c0d847281 Merge 'load_balancer: load_stats reconcile after tablet migration and table resize' from Ferenc Szili
This change adds the ability to move tablets sizes in load_stats after a tablet migration or table resize (split/merge). This is needed because the size based load balancer needs to have tablet size data which is as accurate as possible, in order to work on fresh tablet size distribution and issue correct tablet migrations.

This is the second part of the size based load balancing changes:

- First part for tablet size collection via load_stats: #26035
- Second part reconcile load_stats: #26152
- The third part for load_sketch changes: #26153
- The fourth part which performs tablet load balancing based on tablet size: #26254

This is a new feature and backport is not needed.

Closes scylladb/scylladb#26152

* github.com:scylladb/scylladb:
  load_balancer: load_stats reconcile after tablet migration and table resize
  load_stats: change data structure which contains tablet sizes
2025-10-31 09:58:25 +01:00
Tomasz Grabiec
2bd173da97 nodetool: status: Show excluded nodes as having status 'X'
Example:

$ build/dev/scylla nodetool status
Datacenter: dc1
===============
Status=Up/Down/eXcluded
|/ State=Normal/Leaving/Joining/Moving
-- Address   Load      Tokens Owns Host ID                              Rack
UN 127.0.0.1 783.42 KB 1      ?    753cb7b0-1b90-4614-ae17-2cfe470f5104 rack1
XN 127.0.0.2 785.10 KB 1      ?    92ccdd23-5526-4863-844a-5c8e8906fa55 rack2
UN 127.0.0.3 708.91 KB 1      ?    781646ad-c85b-4d77-b7e3-8d50c34f1f17 rack3
2025-10-31 09:03:20 +01:00
Tomasz Grabiec
87492d3073 test: py: Test scenario involving excludenode API 2025-10-31 09:03:20 +01:00
Tomasz Grabiec
55ecd92feb nodetool: Introduce excludenode command
If a node is dead and cannot be brought back, tablet migrations are
stuck, until the node is explicitly marked as "permanently dead" /
"ignored node" / "excluded" (name differs in different contexts).

Currently, this is done during removenode and replace operations but
it should be possible to only mark the node as dead, for the purpose
of unblocking migrations or other topology operations, without doing
the actual removenode, because full removal might be currently
impossible, or not desirable due to lack of capacity or priorities.

This patch introduces this kind of API:

  nodetool excludenode <host-id> [ ... <host-id> ]

Having this kind of API is an improvement in user experience in
several cases. For example, when we lose a rack, the only viable
option for recovery is to run removenode with an extra
--ignore-dead-nodes option. This removenode will fail in the tablet
draining phase, as there is no live node in the rack to rebuild
replicas in. This is confusing to the operator. But necessary before
ALTER KEYSPACE can proceed in order to change replication options to
drop the rack from RF.

Having this API allows operators to have more unified procedures,
where "nodetool excludenode" is always the first step of recovery,
which unblocks further topology operations, both those which restore
capacity, but also auto-scaling, tablet split/merge, load balancing,
etc.

Fixes #21281
2025-10-31 09:03:20 +01:00
Avi Kivity
04a289cae6 Merge 'Auto expand to rack list' from Tomasz Grabiec
We want to move towards rack-list based replication factor for tablets being the default mode, and in the future the only supported mode. This PR is a step towards that. We auto-expand numeric RF to rack list on keyspace creation and ALTER when rf_rack_valid_keyspaces option is enabled.

The PR is mostly about adjusting tests. The main logic change is in the last patch, which modifies option post-processing in ks_prop_defs.

Fixes #26397

Closes scylladb/scylladb#26692

* github.com:scylladb/scylladb:
  cql3: ks_prop_defs: Expand numeric RF to rack list
  locator: Move rack_list to topology.hh
  alternator: Do not set RF for zero-token DCs
  alternator: Switch keyspace creation to use ks_prop_defs
  test: alternator: Adjust for rack lists
  cql3: Move validation of invalid ALTER KEYSPACE earlier, to ks_prop_defs
  test: cqlpy: Mark tests using rack lists as scylla-only
  test: Switch to rack-list based RF
  test: Generalize tests to work with both numeric RF and rack lists
  test: cluster: test_zero_token_nodes_multidc: Adjust to rack list RF
  test: Prepare for handling errors specific to rack list path
  test: cluster: dtest: alternator: Force RF=1 in test_putitem_contention
  test: Create cluster with multiple racks in multi-dc setups
  test: boost: network_topology_strategy_test: Adjust to rack-list RF
  test: tablets: Adjust to rack list
  test: cluster: test_group0_schema_versioning: Use smaller RF to respect rf-rack-validness
  test: tablets_test: Convert test_per_shard_goal_mixed_dc_rf to be rack-valid
  test: object_store: test_backup: Adjust for rack lists
  test: cluster: tablets: Do not move tablet across racks in test_tablet_transition_sanity
  test: cluster: mv: Do not move tablets across racks
  test: cluster: util: Fix docstring for parse_replication_options()
  tablets, topology_coordinator: Skip tablet draining on replace
2025-10-30 21:54:08 +02:00
Avi Kivity
c0222e4d3c Merge 'replica/table: do not stop major compaction when disabling auto compaction' from Lakshmi Narayanan Sreethar
When auto compaction is disabled, all ongoing compactions, including
major compactions, are stopped. However, major compactions should not be
stopped, since the disable request applies only to regular auto
compactions.

This PR fixes the issue by tagging major compaction tasks with a newly
introduced `compaction_type::Major` enum. Since
`table::disable_auto_compaction()` already requests the compaction
manager to stop only tasks of type `compaction_type::Compaction`, major
compactions will no longer be stopped.

Fixes #24501

PR improves how the compactions are stopped when a disable auto compaction request is executed.
No need to backport

Closes scylladb/scylladb#26288

* github.com:scylladb/scylladb:
  replica/table: do not stop major compaction when disabling auto compaction
  compaction/compaction_descriptor: introduce compaction_type::Major
2025-10-30 21:45:57 +02:00
Nikos Dragazis
a0bf932caa test/cluster: Add test for default SSTable compressor
The previous patch made the default compressor dependent on the
SSTABLE_COMPRESSION_DICTS feature:
* LZ4Compressor if the feature is disabled
* LZ4WithDictsCompressor if the feature is enabled

Add a test to verify that the cluster uses the right default in every
case.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-10-30 15:53:54 +02:00
Nikos Dragazis
2fc812a1b9 db/config: Change default SSTable compressor to LZ4WithDictsCompressor
`sstable_compression_user_table_options` allows configuring a
node-global SSTable compression algorithm for user tables via
scylla.yaml. The current default is `LZ4Compressor` (inherited from
Cassandra).

Make `LZ4WithDictsCompressor` the new default. Metrics from real datasets
in the field have shown significant improvements in compression ratios.

If the dictionary compression feature is not enabled in the cluster
(e.g., during an upgrade), fall back to the `LZ4Compressor`. Once the
feature is enabled, flip the default back to the dictionary compressor
using with a listener callback.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-10-30 15:53:49 +02:00
Pavel Emelyanov
395e275e03 Merge 'test/cluster/random_failures: Adjust to RF-rack-validity' from Dawid Mędrek
We adjust the test to RF-rack-validity and then re-enable
index random events, which requires the configuration option
`rf_rack_valid_keyspaces` to be enabled.

Fixes scylladb/scylladb#26422

Backport: I'd rather not backport these changes. They're almost a hack and poses too much risk for little gain.

Closes scylladb/scylladb#26591

* github.com:scylladb/scylladb:
  test/cluster/random_failures: Re-enable index events
  test/cluster/random_failures: Enable rf_rack_valid_keyspaces
  test/cluster/random_failures: Adjust to RF-rack-validity
2025-10-30 15:39:38 +03:00
Aleksandra Martyniuk
fdd623e6bc api: storage_service: tasks: unify upgrade_sstable
Currently, all apis that start a compaction have two versions:
synchronous and asynchronous. They share most of the implementation,
but some checks and params have diverged.

Unify the handlers of /storage_service/keyspace_upgrade_sstables/{keyspace}
and /tasks/compaction/keyspace_upgrade_sstables/{keyspace}.
2025-10-30 11:42:48 +01:00
Aleksandra Martyniuk
044b001bb4 api: storage_service: tasks: force_keyspace_cleanup
Currently, all apis that start a compaction have two versions:
synchronous and asynchronous. They share most of the implementation,
but some checks and params have diverged.

Unify the handlers of /storage_service/keyspace_cleanup/{keyspace}
and /tasks/compaction/keyspace_cleanup/{keyspace}.
2025-10-30 11:42:47 +01:00
Aleksandra Martyniuk
12dabdec66 api: storage_service: tasks: unify force_keyspace_compaction
Currently, all apis that start a compaction have two versions:
synchronous and asynchronous. They share most of the implementation,
but some checks and params have diverged.

Add consider_only_existing_data parameter to /tasks/compaction/keyspace_compaction/{keyspace},
to match the synchronous version of the api (/storage_service/keyspace_compaction/{keyspace}).

Unify the handlers of both apis.
2025-10-30 11:33:17 +01:00
Tomasz Grabiec
6cb14c7793 Revert "tests(lwt): new test for LWT testing during tablet resize"
This reverts commit 99dc31e71a.

The test is not stable due to #26801
2025-10-30 08:50:40 +01:00
Piotr Wieczorek
0398bc0056 test/alternator: Enable the tests failing because of #6918
The tests pass only with alternator_streams_strict_compatibility flag
enabled, because of a suspected non-negligible performance impact (i.e.
an additional entire-item comparison and type conversions).

Refs https://github.com/scylladb/scylladb/issues/6918
2025-10-30 08:38:31 +01:00
Piotr Wieczorek
66ac66178b alternator, cdc: Don't emit events for no-op removes
Deletes that don't change the state of the database visible to the user
(e.g. an attempt to delete a missing item) shouldn't produce a cdc log.
This commit addresses this DynamoDB compatibility issue if the delete is
a partition delete, a row delete, or a cell delete. It works under the
assumption that the change was produced by Alternator. This means that
it doesn't support range deletes, static row deletes, deletes of
collection cells other than a map, etc. See also its parent commit,
which introduces the methods that this commit extends.

This commit handles the following cases:
- `DeleteItem of nonexistent item: nothing`,
- `BatchWriteItem.DeleteItem of nonexistent item: nothing`.

Refs https://github.com/scylladb/scylladb/pull/26121
2025-10-30 08:38:30 +01:00
Piotr Wieczorek
a32e8091a9 alternator, cdc: Don't emit an event for equal items
This commit adds a function that compares split mutations with the
`row_state`, that was selected as a preimage or propagated through
cdc options by a caller. If the items are equal, the corresponding log
row isn't generated. The result being that creating an item with
BatchWriteItem, PutItem, or UpdateItem doesn't emit an INSERT/MODIFY
event if exactly identical item already exists.

Comparing the items may be costly, so this logic is controlled by
`alternator_streams_compabitiblity` flag.

This commit handles the following cases:
- `PutItem/UpdateItem/BatchWriteItem.PutItem of an existing and equal
  item: nothing`
2025-10-30 08:38:30 +01:00
Piotr Wieczorek
8c2f60f111 alternator/streams, cdc: Differentiate item replace and item update in CDC
This commit improves compatibility with DynamoDB streams by changing the
emitted events when creating/updating an item. Replace/update operations
of an existing item emit a MODIFY, whereas replacing/updating a missing
item results in an INSERT. If the state of the item doesn't change after
applying the operation, no event is emitted.

This commit handles the following cases:
- `PutItem/UpdateItem/BatchWriteItem.PutItem of an existing and not equal item: MODIFY`
- `PutItem/UpdateItem/BatchWriteItem.PutItem of a nonexistent item: INSERT`

Refs https://github.com/scylladb/scylladb/issues/6918
2025-10-30 07:40:31 +01:00
Piotr Wieczorek
4f6aeb7b6b alternator: Change the return type of rmw_operation_return
Change the type from future<executor::request_return_type> to
executor::request_return_type, because the method isn't async and one
out of two callers unwraps the future immediately. This simplifies the
code a little and probably saves a few instructions, since we suspect
that moving a future<X> is more expensive than just moving X.
2025-10-30 07:40:31 +01:00
Piotr Wieczorek
ffdc8d49c7 config: Add alternator_streams_strict_compatibility flag
With this flag enabled, Alternator Streams produces more accurate event
types:
- nop operations (i.e. replacing an item with an identical one, deleting
  a nonexistent item) don't produce an event,
- updates of an existing item produce a MODIFY event, instead of INSERT,
- etc.

This flag affects the internal behaviour of some operations, i.e.
Alternator may select a preimage and propagate it to CDC (in contrary to
CDC making the request), or do extra item comparisons (i.e. compare the
existing item with the new one). These operations may be costly, and
users that don't use Streams won't need them.

This flag is live-updatable. An operation reads this flag once, and uses
its value for the entire operation.
2025-10-30 07:40:31 +01:00
Piotr Wieczorek
e3fde8087a cdc: Don't split a row marker away from row cells
CDC log table records a mutation as a sequence of log rows that record
an atomic change (i.e. a row marker, tombstones, etc.), whereas a
mutation in Alternator Streams always appears as a single log row. The
type of operation is determined based on the type of the last log row in
CDC.

As a result, updates that create a row always appeared to Alternator
Streams as an update (row marker + data), rather than an insert. This
commit makes them a single log row. Its operation type is insert if it
contains a row marker, and an update otherwise, which gives results
consistent with DynamoDB Streams.
2025-10-30 07:40:31 +01:00
Tomasz Grabiec
28f6bdc99b cql3: ks_prop_defs: Expand numeric RF to rack list
Auto-exands numeric RF in CREATE/ALTER KEYSPACE statements for
new DCs specified in the statement.

Doesn't auto-expand existing options, as the rack choice may not be in
line with current replica placement. This requires co-locating tablet
replicas, and tracking of co-location state, which is not implemented yet.

Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>
2025-10-29 23:32:59 +01:00
Tomasz Grabiec
35166809cb locator: Move rack_list to topology.hh
So that we can use it in locator/tablets.hh and avoid circular dependency
between that header and abstract_replication_strategy.hh
2025-10-29 23:32:58 +01:00
Tomasz Grabiec
f6dfea2fb1 alternator: Do not set RF for zero-token DCs
That will fail with tablets because it won't be able to allocate
replicas.
2025-10-29 23:32:58 +01:00
Tomasz Grabiec
21db21af7e alternator: Switch keyspace creation to use ks_prop_defs
So that we get the same validation and option post-processing as
during regular keyspace creation.
RF auto-expansion logic happens in ks_prop_defs, and we want that
for tablets.
2025-10-29 23:32:58 +01:00
Tomasz Grabiec
7f66f67d95 test: alternator: Adjust for rack lists
To achieve RF=3 with tablets and rf_rack_valid_keyspaces, we need 3
racks. So change the test to create 3 racks. Alternator was bypassing
standard keyspace creation path, so it escaped validation. But this
will change, and the test will stop wroking.

Also, after auto-expansion of RF to rack list, not all of 4 nodes
will host replicas. So need to adjust expectations.
2025-10-29 23:32:58 +01:00
Tomasz Grabiec
a88f70ce2c cql3: Move validation of invalid ALTER KEYSPACE earlier, to ks_prop_defs
Tests expect this failure in some scenarios, but later changes make us
fail ealier due to topology constraints.

As a rule, more general validation should come before more specific
validation. So syntax validation before topology validation.
2025-10-29 23:32:58 +01:00
Tomasz Grabiec
8e69c65124 test: cqlpy: Mark tests using rack lists as scylla-only
Those tests are intended to be also run against Cassandra, which
doesn't support rack lists.
2025-10-29 23:32:58 +01:00
Tomasz Grabiec
ba53f41f59 test: Switch to rack-list based RF
Have to do that before we enable auto-expansion of numeric RF to
rack-lists, because those tests alter the replication factor, and
altering from rack-list to numeric will not be allowed.
2025-10-29 23:32:58 +01:00
Tomasz Grabiec
d2e7d6fad2 test: Generalize tests to work with both numeric RF and rack lists 2025-10-29 23:32:58 +01:00
Tomasz Grabiec
aa05f0fad0 test: cluster: test_zero_token_nodes_multidc: Adjust to rack list RF
Two changes here:

1) Allocate nodes in dc2 in separeate racks to make the test stronger
- it invites bugs where RF==nr_racks succeeds despite there being
zero-token nodes, and not simply fail due to rack count.

2) Due to auto-expansion to rack list, scylla throws in keyspace
creation rather than table creation.
2025-10-29 23:32:58 +01:00
Benny Halevy
e8b9f13061 test: Prepare for handling errors specific to rack list path 2025-10-29 23:32:58 +01:00
Tomasz Grabiec
255f429a80 test: cluster: dtest: alternator: Force RF=1 in test_putitem_contention
With rf_rack_valid_keyspaces enabled, RF of alternator tables will be
equal to the number of racks (in this test: nodes). Prior to that, if
number of nodes is smaller than 3, alternator creates the keyspace
with RF=1. Turns out, with RF=2 the test fails with write timeouts due
to contention. Enforce RF=1 by creating the table with one node before
adding the second node.
2025-10-29 23:32:58 +01:00
Tomasz Grabiec
40e7543361 test: Create cluster with multiple racks in multi-dc setups
To allow auto-expansion of numeric RF to rack list. Otherwise,
keyspace creation will be rejected if rf-rack-valid keyspaces are
enforced.
2025-10-29 23:32:57 +01:00
Tomasz Grabiec
723622cf70 test: boost: network_topology_strategy_test: Adjust to rack-list RF 2025-10-29 23:32:57 +01:00
Tomasz Grabiec
19d0beff38 test: tablets: Adjust to rack list
test_decommission_rack_load_failure expects some tablets to land in
the rack which only has the decommissioning node. Since the table uses
RF=1, auto-expansion may choose the other rack and put all tablets
there, and the expected failure will not happen. Force placement by
using rack-list RF.
2025-10-29 23:32:57 +01:00
Tomasz Grabiec
7ccc2a3560 test: cluster: test_group0_schema_versioning: Use smaller RF to respect rf-rack-validness 2025-10-29 23:32:57 +01:00
Tomasz Grabiec
0f38f7185c test: tablets_test: Convert test_per_shard_goal_mixed_dc_rf to be rack-valid 2025-10-29 23:32:57 +01:00
Tomasz Grabiec
5962498983 test: object_store: test_backup: Adjust for rack lists
With rack lists, not all nodes in a rack will receive streams if RF=1.
Adjust expectations.
2025-10-29 23:32:57 +01:00
Tomasz Grabiec
3b8a3823db test: cluster: tablets: Do not move tablet across racks in test_tablet_transition_sanity
Choose old_replica and new_replica so that they're both in rack r1.

After later changes (rack list auto expansion), it's no longer
guaranteed that the first replica will be on r1.
2025-10-29 23:32:57 +01:00
Tomasz Grabiec
5bf7112fe6 test: cluster: mv: Do not move tablets across racks
It's illegal with rf-rack-valid keyspaces.
2025-10-29 23:32:57 +01:00
Tomasz Grabiec
e34548ccdb test: cluster: util: Fix docstring for parse_replication_options()
rack lists are now in replication_v2, which is also parsed with this
function.
2025-10-29 23:32:57 +01:00
Tomasz Grabiec
288e75fe22 tablets, topology_coordinator: Skip tablet draining on replace
Replace doesn't drain (rebuild) tablets during topology change. They
are rebuilt afterwards when the replaced node is in "left" state and
replacing node is in normal state. So there is no point in attempting
to drain, as nothing will be drained.

Not only that, doing so has a risk, because the load balancer is
invoked on a transitional topology state in which we can end up with
no normal nodes in a rack. That's the case if the replaced node was
the last one in the rack. This tripped one of the algorithms which
computes rack's shard count for the purpose of determining ideal
tablet count, it was not prepared to find an empty rack to which a
table is still repliacated. That was fixed separately, but to avoid
this, we better skip tablet draining here.
2025-10-29 23:32:57 +01:00
Taras Veretilnyk
c922256616 sstables: add overload of data_stream() to accept custom file_input_stream_options
This patch introduces a new overload of 'sstable::data_stream()' that allows
callers to provide their own 'file_input_stream_options'.

This change will be useful in the next commit to enable integrity checking
for file streaming.
2025-10-29 22:30:18 +01:00
Nikos Dragazis
96e727d7b9 db/config: Deprecate sstable_compression_dictionaries_allow_in_ddl
The option is a knob that allows to reject dictionary-aware compressors
in the validation stage of CREATE/ALTER statements, and in the
validation of `sstable_compression_user_table_options`. It was
introduced in 7d26d3c7cb to allow the admins of Scylla Cloud to
selectively enable it in certain clusters. For more details, check:
https://github.com/scylladb/scylla-enterprise/issues/5435

As of this series, we want to start offering dictionary compression as
the default option in all clusters, i.e., treat it as a generally
available feature. This makes the knob redundant.

Additionally, making dictionary compression the default choice in
`sstable_compression_user_table_options` creates an awkward dependency
with the knob (disabling the knob should cause
`sstable_compression_user_table_options` to fall back to a non-dict
compressor as default). That may not be very clear to the end user.

For these reasons, mark the option as "Deprecated", remove all relevant
tests, and adjust the business logic as if dictionary compression is
always available.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-10-29 20:13:08 +02:00
Nadav Har'El
aa34f0b875 alternator: fix CDC events for TTL expiration
In commit a3ec6c7d1d we supposedly
implemented the feature of telling TTL experation events from regular
user-sent deletions. However, that implementation did not actually work
at all... It had two bugs:

 1. It created an null rjson::value() instead of an empty dictionary
    rjson::empty_object(), so GetRecords failed every time such a
    TTL expiration event was generated.
 2. In the output, it used lowercase field names "type" and "principalId"
    instead of the uppercase "Type" and "PrincipalId". This is not the
    correct capitalization, and when boto3 recieves such incorrect
    fields it silently deletes them and never passes them to the user's
    get_records() call.

This patch fixes those two bugs, and importantly - enables a test for
this feature. We did already have such a test but it was marked as
"veryslow" so doesn't run in CI and apparently not even run once to
check the new feature. This test is not actually very long on Alternator
when the TTL period is set very low (as we do in our tests), so I replaced
the "veryslow" marker by "waits_for_expiration". The latter marker means
that the test is still very slow - as much as half an hour - on DynamoDB -
but runs quickly on Scylla in our test setup, and enabled in CI by
default.

The enabled test failed badly before this patch (a server error during
GetRecords), and passes with this patch.

Also, the aforementioned commit forgot to remove the paragraph in
Alternator's compatibility.md that claims we don't have that feature yet.
So we do it now.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#26633
2025-10-29 17:08:20 +01:00
Piotr Wieczorek
2812e67f47 cdc: Emit a preimage for non-clustered tables
Until this patch, CDC haven't fetched a preimage for mutations
containing only a partition tombstone. Therefore, single-row deletions
in a table witout a clustering key didn't include a preimage, which was
inconsistent with single-row clustered deletions. This commit addresses
this inconsistency.

Second reason is compatibility with DynamoDB Streams, which doesn't
support entire-partition deletes. Alternator uses partition tombstones
for single-row deletions, though, and in these cases the 'OldImage' was
missing from REMOVE records.

Fixes https://github.com/scylladb/scylladb/issues/26382

Closes scylladb/scylladb#26578
2025-10-29 17:54:58 +02:00
Nadav Har'El
29ed1f3de7 Merge 'cql3: Refactor vector search select impl into a dedicated class' from Karol Nowacki
cql3: Refactor vector search select impl into a dedicated class

The motivation for this change is crash fixed in https://github.com/scylladb/scylladb/pull/25500.

This commit refactors how ANN ordered select statements are handled to prevent a potential null pointer dereference and improve code organization.

Previously, vector search selects were managed by `indexed_table_select_statement`, which unconditionally dereferenced a `view_ptr`. This assumption is invalid for vector search indexes where no view exists, creating a risk of crashes.

To address this, the refactoring introduces the following changes:

- A new `vector_indexed_table_select_statement` class is created to specifically handle ANN-ordered selects. This class operates without a view_ptr, resolving the null pointer risk.
-  The `indexed_table_select_statement` is renamed to `view_indexed_table_select_statement` to more accurately reflect its function with view-based indexes.
- An assertion has been added to `indexed_table_select_statement` constructor to ensure view_ptr is not null, preventing similar issues in the future.

Fixes: VECTOR-162

No backport is needed, as this is refactoring.

Closes scylladb/scylladb#25798

* github.com:scylladb/scylladb:
  cql3: Rename indexed_table_select_statement
  cql3: Move vector search select to dedicated class
2025-10-29 17:49:24 +02:00
Lakshmi Narayanan Sreethar
7eac18229c replica/table: do not stop major compaction when disabling auto compaction
When auto compaction is disabled, all ongoing compactions, including
major compactions, are stopped. However, major compactions should not be
stopped, since the disable request applies only to regular auto
compactions.

This patch fixes the issue by tagging major compaction tasks with the
newly introduced `compaction_type::MajorCompaction`. Since
`table::disable_auto_compaction()` already requests the compaction
manager to stop only tasks of type `compaction_type::Compaction`, major
compactions will no longer be stopped.

Fixes #24501

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-10-29 19:22:07 +05:30
Lakshmi Narayanan Sreethar
4d442f48db compaction/compaction_descriptor: introduce compaction_type::Major
Introduce a new compaction_type enum : `Major`.
This type will be used by the next patches to differentiate between
major compaction and regular compaction (compaction_type::Compaction).

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-10-29 19:21:53 +05:30
Nikos Dragazis
d95ebe7058 boost/cql_query_test: Get expected compressor from config
Since 5b6570be52, the default SSTable compression algorithm for user
tables is no longer hardcoded; it can be configured via the
`sstable_compression_user_table_options.sstable_compression` option in
scylla.yaml.

Modify the `test_table_compression` test to get the expected value from
the configuration.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-10-29 14:52:43 +02:00
Piotr Dulikowski
aba922ea65 Merge 'cdc: improve cdc metadata loading' from Michael Litvak
when loading CDC streams metadata for tablets from the tables, read only
new entries from the history table instead of reading all entries. This
improves the CDC metadata reloading, making it more efficient and
predictable.

the CDC metadata is loaded as part of group0 reload whenever the
internal CDC tables are modified. on tablet split / merge, we create a
new CDC timestamp and streams by writing them to the cdc_streams_history
table by group0 operation, and when it's applied we reload the in-memory
CDC streams map by reading from the tables and constructing the updated map.

Previously, on every update, we would read the entire
cdc_streams_history entries for the changed table, constructing all its
streams and creating a new map from scratch.

We improve this now by reading only new entries from cdc_streams_history
and append them to the existing map. we can do this because we only
append new entries to cdc_streams_history with higher timestamp than all
previous entries.

This makes this reloading more efficient and predictable, because
previously we would read a number of entries that depends on the number
of tablets splits and merges, which increases over time and is
unbounded, whereas now we read only a single stream set on each update.

Fixes https://github.com/scylladb/scylladb/issues/26732

backport to 2025.4 where cdc with tablets is introduced

Closes scylladb/scylladb#26160

* github.com:scylladb/scylladb:
  test: cdc: extend cdc with tablets tests
  cdc: improve cdc metadata loading
2025-10-29 11:07:48 +01:00
Michał Hudobski
46589bc64c secondary_index: disallow multiple vector indexes on the same column
We currently allow creating multiple vector indexes on one column.
This doesn't make much sense as we do not support picking one when
making ann queries.

To make this less confusing and to make our behavior similar
to Cassandra we disallow the creation of multiple vector indexes
on one column.

We also add a test that checks this behavior.

Fixes: VECTOR-254
Fixes: #26672

Closes scylladb/scylladb#26508
2025-10-29 11:55:38 +02:00
Patryk Jędrzejczak
7304afb75a Merge 'vnodes cleanup: renames and code comments fixes' from Petr Gusev
This is a follow-up for https://github.com/scylladb/scylladb/pull/26315. Fixes several review comments that were left unresolved in the original PR.

backport: not needed, this PR contains only renames and code comment fixes

Closes scylladb/scylladb#26745

* https://github.com/scylladb/scylladb:
  test_automatic_cleanup: fix comment
  storage_proxy: remove stale comment
  storage_proxy: improve run_fenceable_write comment
  topology_coordinator: rename start_cleanup_on_dirty_nodes -> start_vnodes_cleanup_on_dirty_nodes
  storage_service: rename is_cleanup_allowed -> is_vnodes_cleanup_allowed
  storage_service: rename do_cluster_cleanup -> do_clusterwide_vnodes_cleanup
2025-10-29 10:38:27 +01:00
Nadav Har'El
492c664fbb docs/alternator: explain alternator_warn_authorization
The previous patches added the ability to set
alternator_warn_authorization. In this patch we add to our
documentation a recommendation that this setting be used as an
intermediate step when wanting to change alternator_enforce_authorization
from "false" to "true". We explain why this is useful and important.

The new documentation is in docs/alternator/compatibility.md, where
we previously explained the alternator_enforce_authorization configuration.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-10-29 11:16:29 +02:00
Nadav Har'El
2dbd1a85a3 test/alternator: tests for new auth failure metrics and log messages
This patch adds to test_metrics.py tests that authentication and
authorization errors increment, respectively, the new metrics

    scylla_alternator_authentication_failures
    scylla_alternator_authorization_failures

This patch also adds in test_logs.py tests that verify that that log
messages are generated on different types of authentication/authorization
failures.

The tests also check how configuring alternator_enforce_authorization
and alternator_warn_authorization changes these behaviors:
  * alternator_enforce_authorization determines whether an auth error
    will cause the request to fail, or the failure is counted but then
    ignored.
  * alternator_warn_authorization determines whether an auth error will
    cause a WARN-level log message to be generated (and also the failure
    is counted.
  * If both configuration flags are false, Alternator doesn't even
    attempt to check authentication or authorization - so errors aren't
    even counted.

Because the new tests live-update the alternator_*_authorization
configuration options, they also serve as a test that live-updating
this option works correctly.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-10-29 11:16:29 +02:00
Nadav Har'El
51186b2f2c alternator: add alternator_warn_authorization config
Before this patch, the configuration alternator_enforce_authorization
is a boolean: true means enforce authentication checks (i.e., each
request is signed by a valid user) and authorization checks (the user
who signed the request is allowed by RBAC to perform this request).

This patch adds a second boolean configuration option,
alternator_warn_authorization. When alternator_enforce_authorization
is false but alternator_warn_authorization is true, authentication and
authorization checks are performed as in enforce mode, but failures
are ignored and counted in two new metrics:

    scylla_alternator_authentication_failures
    scylla_alternator_authorization_failures

additionally,also each authentication or authorization error is logged as
a WARN-level log message. Some users prefer those log messages over
metrics, as the log messages contain additional information about the
failure that can be useful - such as the address of the misconfigured
client, or the username attempted in the request.

All combinations of the two configuration options are allowed:
 * If just "enforce" is true, auth failures cause a request failure.
   The failures are counted, but not logged.
 * If both "enforce" and "warn" are true, auth failures cause a request
   failure. The failures are both counted and logged.
 * If just "warn" is true, auth failures are ignored (the request
   is allowed to compelete) but are counted and logged.
 * If neither "enforce" nor "warn" are true, no authentication or
   authorization check are done at all. So we don't know about failures,
   so naturally we don't count them and don't log them.

This patch is fairly straightforward, doing mainly the following
things:

1. Add an alternator_warn_authorization config parameter.

2. Make sure alternator_enforce_authorization is live-updatable (we'll
   use this in a test in the next patch). It "almost" was, but a typo
   prevented the live update from working properly.

3. Add the two new metrics, and increment them in every type of
   authentication or authorization error.
   Some code that needs to increment these new metrics didn't have
   access to the "stats" object, so we had to pass it around more.

4. Add log messages when alternator_warn_authorization is true.

5. If alternator_enforce_authorization is false, allow the auth check
   to allow the request to proceed (after having counted and/or logged
   the auth error).

A separate patch will follow and add documentation suggesting to users
how to use the new "warn" options to safely switch between non-enforcing
to enforcing mode. Another patch will add tests for the new configuration
options, new metrics and new log messages.

Fixes #25308.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-10-29 11:16:26 +02:00
Dawid Mędrek
48cbf6b37a test/cluster/test_tablets: Migrate dtest
We migrate `tablets_test.py::TestTablets::test_moving_tablets_replica_on_node`
from dtests to the repository of Scylla. We divide the test into two
steps to make testing easier and even possible with RF-rack-valid
keyspaces being enforced.

Closes scylladb/scylladb#26285
2025-10-29 11:09:48 +02:00
Karol Nowacki
9f1fd7f5a0 cql3: Rename indexed_table_select_statement
To align with `vector_indexed_table_select_statement`, this commit renames
`indexed_table_select_statement` to `view_indexed_table_select_statement`
to clarify its usage with materialized views.
2025-10-29 08:37:25 +01:00
Karol Nowacki
357c0a8218 cql3: Move vector search select to dedicated class
The execution of SELECT statements with ANN ordering (vector search) was
previously implemented within `indexed_table_select_statement`. This was
not ideal, as vector search logic is independent of secondary index selects.

This resulted in unnecessary complexity because vector search queries don't
use features like aggregates or paging. More importantly,
`indexed_table_select_statement` assumed a non-null `view_schema` pointer,
which doesn't hold for vector indexes (where `view_ptr` is null).
This caused null pointer dereferences during ANN ordered selects, leading
to crashes (VECTOR-179). Other parts of the class still dereference
`view_schema` without null checks.

Moving the vector search select logic out of
`indexed_table_select_statement` simplifies the code and prevents these
null pointer dereferences.
2025-10-29 08:37:21 +01:00
Taras Veretilnyk
e62ebdb967 table: enable integrity checks for streaming reader
Previously, streaming readers only verified the checksum of compressed SSTables.
This patch extends checks to also include the digest and the uncompressed checksum (CRC).

These additional checks require reading the digest and CRC components from disk,
which may cause some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data,
while for compressed SSTables - where checksums are already verified inline - the only extra cost is reading and verifying the digest.
If the reader range doesn't cover the full SSTable, the digest check is skipped.
2025-10-28 19:27:35 +01:00
Taras Veretilnyk
06e1b47ec6 table: Add integrity option to table::make_sstable_reader() 2025-10-28 19:27:35 +01:00
Taras Veretilnyk
deb8e32e86 sstables: Add integrity option to create_single_key_sstable_reader
Added an sstables::integrity_check parameter to create_single_key_sstable_reader methods across its implementations.
This allows callers to enable SSTable integrity checks during single-key reads.
2025-10-28 19:27:35 +01:00
Petr Gusev
b6bcd062de test_automatic_cleanup: fix comment 2025-10-28 17:55:20 +01:00
Petr Gusev
d49be677d5 storage_proxy: remove stale comment 2025-10-28 17:55:20 +01:00
Petr Gusev
c60223f009 storage_proxy: improve run_fenceable_write comment 2025-10-28 17:55:20 +01:00
Petr Gusev
58d100a0cb topology_coordinator: rename start_cleanup_on_dirty_nodes -> start_vnodes_cleanup_on_dirty_nodes 2025-10-28 17:55:20 +01:00
Petr Gusev
fa9dc71f30 storage_service: rename is_cleanup_allowed -> is_vnodes_cleanup_allowed 2025-10-28 17:55:19 +01:00
Pavel Emelyanov
e99c8eee08 commitlog: Remove unused work::r stream variable
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-10-28 19:46:29 +03:00
Dawid Mędrek
5e03b01107 test/cluster: Add test_simulate_upgrade_legacy_to_raft_listener_registration
We provide a reproducer test of the bug described in
scylladb/scylladb#18049. It should fail before the fix introduced in
scylladb/scylladb@7ea6e1ec0a, and it
should succeed after it.

Refs scylladb/scylladb#18049
Fixes scylladb/scylladb#18071

Closes scylladb/scylladb#26621
2025-10-28 17:32:15 +01:00
Pavel Emelyanov
92462e502f ec2_snitch: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-10-28 19:31:08 +03:00
Pavel Emelyanov
7640ade04d ec2_snitch: Coroutinize the aws_api_call_once()
The method connects a socket, grabs in/out streams from it then writes
HTTP request and reads+parses the response. For that it uses class
variables for socket and streams, but there's no real need for that --
all three actually exists throughput the method "lifetime".

To fix it, coroutinizes the method. The same could be achieved my moving
the connected socket and streams into do_with() context, but coroutine
is better than that.

(indentation is left broken)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-10-28 19:29:25 +03:00
Pavel Emelyanov
5d89816fed sstable: Construct output_stream for data instantly
This changes makes local output_stream variable be constructed in the
declaration statement with the help of ternary operator thus avoiding
both -- default-initialization and move-assignment depending on the
standalone condition checking.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-10-28 19:27:22 +03:00
Pavel Emelyanov
37b9cccc1c test: Don't reuse on-stack input stream
The test consists of several snippets, each creating an input_stream for
some short operation and checking the result. Each snipped over-writes
the local `input_stream in` variable with the new one.

This change wraps each of those snippets into own code block in order to
have own new `input_stream in` variable in each.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-10-28 19:25:07 +03:00
Yauheni Khatsianevich
99dc31e71a tests(lwt): new test for LWT testing during tablet resize
– Workload: N workers perform CAS updates
 UPDATE … SET s{i}=new WHERE pk=? IF (∀j≠i: s{j}>=guard_j) AND s{i}=prev
 at CL=LOCAL_QUORUM / SERIAL=LOCAL_SERIAL. Non-apply without timeout is treated
 as contention; “uncertainty” timeouts are resolved via LOCAL_SERIAL read.
- Enable balancing and increase min_tablet_count to force split,
 flush and lower min_tablet_count to merge.
- “Uncertainty” timeouts (write timeout due to uncertainty) are resolved via a
LOCAL_SERIAL read to determine whether the CAS actually applied.
- Invariants: after the run, for every pk and column s{i}, the stored value
equals the number of confirmed CAS by worker i (no lost or phantom updates)
despite ongoing tablet moves.

Closes scylladb/scylladb#26113
2025-10-28 16:48:57 +01:00
Petr Gusev
d300adc10c storage_service: rename do_cluster_cleanup -> do_clusterwide_vnodes_cleanup
This cleanup is only for vnodes-based tables, reflect this in the function name.
2025-10-28 15:37:28 +01:00
Michael Litvak
4cc0a80b79 test: cdc: extend cdc with tablets tests
extend and improve the tests of virtual tables for cdc with tablets.
split the existing virtual tables test to one test that validates the
virtual tables against the internal cdc tables, and triggering some
tablet splits in order to create entries in the cdc_streams_history
table, and add another test with basic validation of the virtual tables
when there are multiple cdc tables.
2025-10-28 15:06:21 +01:00
Pavel Emelyanov
ae0136792b utils: Make directory_lister use generator lister from seastar
The directory_lister uses utils::lister under the hood which accepts a
callback to put directory_entry-s in. The directory_lister's callback
then puts the entries into a queue and its .get() method pops up entries
from there to return to caller.

This patch simplifies this code by switching the directory_lister to use
experimental generator lister from seastar. With it, the entries to be
returned from .get() are simply co_await-ed from calling the generator
object (wich co_yield-s them).

As a result the directory_lister becomes smaller and drops the need for
utils::lister. Since directory_lister was created as a replacement for
that callback-based lister, the latter can be eventually removed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26586
2025-10-28 15:20:20 +02:00
Pavel Emelyanov
948cefa5f9 test: Extend API consistency test with tokens_endpoint endpoint
Recently (#26231) there was added a test to check that several API
endpoints, that return tokens and corresponding replica nodes, are
consistent with tablet map. This patch adds one more API endpoint to the
validation -- the /storage_service/tokens_endpoint one.

The extention is pretty straightforward, but the new endpoint returns
back a single (primary) replica for a token, so the test check is
slightly modified to account for that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26580
2025-10-28 15:18:09 +02:00
Dawid Mędrek
535d31b588 test/cluster/random_failures: Re-enable index events
We've enabled the configuration option `rf_rack_valid_keyspaces`,
so we can finally re-enable the events creating and dropping secondary
indexes.

Fixes scylladb/scylladb#26422
2025-10-28 14:17:14 +01:00
Dawid Mędrek
b4898e50bf test/cluster/random_failures: Enable rf_rack_valid_keyspaces
Now that the test has been adjusted to work with the configuration
option, we enable it.
2025-10-28 14:17:09 +01:00
Dawid Mędrek
59b2a41c49 test/cluster/random_failures: Adjust to RF-rack-validity
We adjust the test to work with the configuration option
`rf_rack_valid_keyspaces` enabled. For that, we ensure that there is
always at least one node in each of the three racks. This way, all
keyspaces we create and manipulate will remain RF-rack-valid since they
all use RF=3.

------------------------------------------------------------------------

To achieve that, we only need to adjust the following events:

1. `init_tablet_transfer`
   The event creates a new keyspace and table and manually migrates
   a tablet belonging to it. As long as we make sure the migration occurs
   within the same rack, there will be no problem.

   Since RF == #racks, each rack will have exactly one tablet replica,
   so we can migrate the tablet to an arbitrary node in the same rack.

   Note that there must exist a node that's not a replica. If there weren't
   such a node, the test wouldn't have worked before this commit because
   it's not possible to migrate a tablet from one node being its replica to
   another. In other words, we have a guarantee that there are at least 4 nodes
   in the cluster when we try to migrate a tablet replica.

   That said, we check it anyway. If there's no viable node to migrate the
   tablet replica to, we log that information and do nothing. That should be
   an acceptable solution.

2. `add_new_node`
   As long as we add a node to an existing rack, there's no way to
   violate the invariant imposed by the configuration option, so we pick
   a random rack out of the existing three and create a node in it.

3. `decommission_node`
   We need to ensure that the node we'll be trying to decommission is
   not the only one in its rack.

   Following pretty much the same reasoning as in `init_tablet_transfer`,
   we conclude there must be a rack with at least two nodes in it. Otherwise
   we'd end up having to migrate a tablet from one replica node to another,
   which is not possible.

   What's more, decommissioning a node is not possible if any node in
   the cluster is dead, so we can assume that `manager.running_servers`
   returns the whole cluster.

4. `remove_node`
   The same as `decommission_node`. Just note although the node we choose to
   remove must be first stopped, none other node can be dead, so the whole
   cluster must be returned by `manager.running_servers`.

------------------------------------------------------------------------

There's one more important thing to note. The test may sometimes trigger
a sequence of events where a new node is started, but, due to an error
injection, its initialization is not completed. Among other things, the
node may NOT have a host ID recognized by the rest of the nodes in the
cluster, and operations like tablet migration will fail if they target
it.

Thankfully, there seems to be a way to avoid problems stemming from
that. When a new node is added to the cluster, it should appear at the
end of the list returned by `manager.running_servers`. This most likely
stems from how dictionaries work in Python:

"Keys and values are iterated over in insertion order."
-- https://docs.python.org/3/library/stdtypes.html#dict-views

and the fact that we keep track of running servers using a dictionary.

Furthermore, we rely on the assumption that the test currently works
correctly.

Assume, to the contrary, that among the nodes taking part in the operations
listed above, there is at most one node per rack that has its host ID recognized
by the rest of the cluster. Note that only those nodes can store any tablets.
Let's refer to the set of those nodes as X.

Assume that we're dealing with tablet migration, decommissioning, or removing
a node. Since those operations involve tablet migration, at least one tablet
will need to be migrated from the node in question to another node in X.
However, since X consists of at most three nodes, and one of them is losing
its tablet, there is no viable target for the tablet, so the operation fails.

Using those assumptions, an auxiliary function, `select_viable_rack`,
was designed to carefully choose a correct rack, which we'll then pick nodes
from to perform the topological operations. It's simple: we just find the first
rack in the list that has at least two nodes in it. That should ensure that we
perform an operation that doesn't lead to any unforeseen disaster.

------------------------------------------------------------------------

Since the test effectively becomes more complex due to more care for keeping
the topology of the cluster valid, we extend the log messages to make them
more helpful when debugging a failure.
2025-10-28 14:15:57 +01:00
Nadav Har'El
87573197d4 test/alternator: reproducers for missing headers and request limit
This patch adds reproducing tests in test/alternator for issue #23438,
which is about missing checks for the length of headers and the URL
in Alternator requests. These should be limited, because Seastar's
HTTP server, which Scylla uses, reads them into memory so they can OOM
Scylla.

The tests demonstrate that DynamoDB enforces a 16 KB limit on the
headers and the URL of the request, but Scylla doesn't (a code
inspection suggests it does not in fact have any limit).

The two tests pass on DynamoDB and currently xfail on Alternator.

Refs #23438.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23442
2025-10-28 15:12:25 +02:00
Pavel Emelyanov
d9bfbeda9a lister: Fix race between readdir and stat
Sometimes file::list_directory() returns entries without type set. In
thase case lister calls file_type() on the entry name to get it. In case
the call returns disengated type, the code assumes that some error
occurred and resolves into exception.

That's not correct. The file_type() method returns disengated type only
if the file being inspected is missing (i.e. on ENOENT errno). But this
can validly happen if a file is removed bettween readdir and stat. In
that case it's not "some error happened", but a enry should be just
skipped. In "some error happened", then file_type() would resolve into
exceptional future on its own.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26595
2025-10-28 15:10:22 +02:00
Botond Dénes
ac618a53f4 Merge 'db: repair: do not update repair_time if batchlog replay failed' from Aleksandra Martyniuk
Currently, batchlog replay is considered successful even if all batches fail
to be sent (they are replayed later). However, repair requires all batches
to be sent successfully. Currently, if batchlog isn't cleared, the repair never
learns and updates the repair_time. If GC mode is set to "repair", this means
that the tombstones written before the repair_time (minus propagation_delay)
can be GC'd while not all batches were replied.

Consider a scenario:
- Table t has a row with (pk=1, v=0);
- There is an entry in the batchlog that sets (pk=1, v=1) in table t;
- The row with pk=1 is deleted from table t;
- Table t is repaired:
    - batchlog reply fails;
    - repair_time is updated;
- propagation_delay seconds passes and the tombstone of pk=1 is GC'd;
- batchlog is replayed and (pk=1, v=1) inserted - data resurrection!

Do not update repair_time if sending any batch fails. The data is still repaired.
For tablet repair the repair runs, but at the end the exception is passed
to topology coordinator. Thanks to that the repair_time isn't updated.
The repair request isn't removed as well, due to which the repair will need
to rerun.

Apart from that, a batch is removed from the batchlog if its version is invalid
or unknown. The condition on which we consider a batch too fresh to replay
is updated to consider propagation_delay.

Fixes: https://github.com/scylladb/scylladb/issues/24415

Data resurrection fix; needs backport to all versions

Closes scylladb/scylladb#26319

* github.com:scylladb/scylladb:
  db: fix indentation
  test: add reproducer for data resurrection
  repair: fail tablet repair if any batch wasn't sent successfully
  db/batchlog_manager: fix making decision to skip batch replay
  db: repair: throw if replay fails
  db/batchlog_manager: delete batch with incorrect or unknown version
  db/batchlog_manager: coroutinize replay_all_failed_batches
2025-10-28 14:52:59 +02:00
Botond Dénes
f3cec5f11a Merge 'index: Set tombstone_gc when creating underlying view' from Dawid Mędrek
Before this commit, when the underlying materialized view was created,
it didn't have the property `tombstone_gc` set to any value. We fix the
bug in this PR.

Implementation strategy:

1. Move code responsible for producing the schema
   of a secondary index to the file that handles
   `CREATE INDEX`.
2. Set the property when creating the view.
3. Add reproducer tests.

Fixes scylladb/scylladb#26542

Backport: we can discuss it.

Closes scylladb/scylladb#26543

* github.com:scylladb/scylladb:
  index: Set tombstone_gc when creating secondary index
  index: Make `create_view_for_index` method of `create_index_statement`
  index: Move code for creating MV of secondary index to cql3
  db, cql3: Move creation of underlying MV for index
2025-10-28 14:42:42 +02:00
Nadav Har'El
c3593462a4 alternator: improve protection against oversized requests
Following DynamoDB, Alternator also places a 16 MB limit on the size of
a request. Such a limit is necessary to avoid running out of memory -
because the AWS message authentication protocol requires reading the
entire request into memory before its signature can be verified.

Our implementation for this limit used Seastar's HTTP server's
content_length_limit feature. However, this Seastar feature is
incomplete - it only works when the request uses the Content-Length
header, and doesn't do anything if the request doesn't have a
Content-Length (it may use chunked encoding, or have no length at all).
So malicious users can cause Scylla to OOM by sending a huge request
without a Content-Length.

So in this patch we stop using the incomplete Seastar feature, and
implement the length limit in Scylla in a way that works correctly with
or without Content-Length: We read from the input stream and if we go
over 16MB, we generate an error.

Because we dropped Seastar's protection against a long Content-Length,
we also need to fix a piece of code which used Content-Length to reserve
some semaphore units to prevent reading many large requests in parallel.
We fix two problems in the code:
1. If Content-Length is over the limit, we shouldn't attempt to reserve
   semaphore units - this should just be a Payload Too Large error.
2. If Content-Length is missing, the existing code did nothing and had
   a TODO that we should. In this patch we implement what was suggested
   in that TODO: We temporarily reserve the whole 16 MB limit, and
   after reading the actual request, we return part of the reservation
   according to the real request size.

That last fix is important, because typically the largest requests will be
BatchWriteItem where a well-written client would want to use chunked
encoding, not Content-Length, to avoid materializing the entire request
up-front. For such clients, the memory use semaphore did nothing, and
now it does the right thing.

Note that this patch does *not* solve the problem #12166 that existed
with Seastar's length-limiting implementation but still exists in the
new in-Scylla length-limiting implementation: The fact we send an
error response in the middle of the request and then close the
connection, while the client continues to send the request, can lead
to an RST being sent by the server kernel. Usually this will be fine -
well-written client libraries will be able to read the response before
the RST. But even with a well-written library in some rare timings
the client may get the RST before the response, and will miss the
response, and get an empty or partial response or "connection reset
by peer". This issue existed before this patch, and still exists, but
is probably of minor impact.

Fixes #8196

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23434
2025-10-28 15:24:46 +03:00
Lakshmi Narayanan Sreethar
64c1ec99e0 cmake: link crypto lib to utils
The utils library requires OpenSSL's libcrypto for cryptographic
operations and without linking libcrypto, builds fail with undefined
symbol errors. Fix that by linking `crypto` to `utils` library when
compiled with cmake. The build files generated with configure.py already
have `crypto` lib linked, so they do not have this issue.

Fix #26705

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#26707
2025-10-28 14:11:03 +02:00
Ferenc Szili
10f07fb95a load_balancer: load_stats reconcile after tablet migration and table resize
This change adds the ability to move tablets sizes in load_stats after a
tablet migration or table resize (split/merge). This is needed because
the size based load balancer needs to have tablet size data which is as
accurate as possible, in order to issue migrations which improve
load balance.
2025-10-28 12:12:09 +01:00
Anna Stuchlik
6fa342fb18 doc: add OS support for version 2025.4
Fixes https://github.com/scylladb/scylladb/issues/26450

Closes scylladb/scylladb#26616
2025-10-28 13:29:40 +03:00
Radosław Cybulski
ea6b22f461 Add max trace size output configuration variable
In #24031 users complained, that trace message is truncated, namely it's
no longer json parsable and table name might not be part of the output.
This path enables users to configure maximum size of trace message.
In case user wanted `table` name, but didn't care about message size,
 #26634 will help.

- add configuration varable `alternator_max_users_query_size_in_trace_output`
   with default value of 4096 (4 times old default value).
- modify `truncated_content_view` function to use new configuration
  variable for truncation limit
- update `truncated_content_view` to consistently truncate at given
  size, previously trunctation would also happen when data arrived in
  more than one chunk
- update `truncated_content_view` to better handle truncated value
  (limit number of copies)
- fix `scylla_config_read` call - call to `query` for a configuration
  name that is not existing will return `Items` array empty
  (but present) - this would raise array access exception few lines
  below.
- add test

Refs #26634
Refs #24031

Closes scylladb/scylladb#26618
2025-10-28 13:29:15 +03:00
Pavel Emelyanov
ac1d709709 Merge 'tablet_sstable_streamer: defer SSTable unlinking until fully streamed' from Taras Veretilnyk
When streaming SSTables across tablets, a single SSTable may be streamed to multiple tablets. The previous implementation unlinked SSTables immediately after streaming them for the first tablet, potentially making them partially unavailable for subsequent tablets. This patch replaces unlink() with mark_for_deletion() deferring actual unlinking till sstable::close_files.

test_tablets2::test_tablet_load_and_stream  was enhanced to also verify that SSTables are removed after being streamed.

Fixes #26606

Backport is not required, although it is a bug fix, but it isn't visible. This is more of a preparatory fix for https://github.com/scylladb/scylladb/pull/26444.

Closes scylladb/scylladb#26622

* github.com:scylladb/scylladb:
  test_tablets2: verify SSTable cleanup after tablet load and stream
  tablet_sstable_streamer: replace unlink() call with mark_for_deletion()
2025-10-28 13:25:40 +03:00
Patryk Jędrzejczak
5321720853 test: test_raft_recovery_stuck: reconnect driver after rolling restarts
It turns out that #21477 wasn't sufficient to fix the issue. The driver
may still decide to reconnect the connection after `rolling_restart`
returns. One possible explanation is that the driver sometimes handles
the DOWN notification after all nodes consider each other UP.

Reconnecting the driver after restarting nodes seems to be a reliable
workaround that many tests use. We also use it here.

Fixes #19959

Closes scylladb/scylladb#26638
2025-10-28 13:24:23 +03:00
Anna Stuchlik
bd5b966208 doc: add --list-active-releases to Web Installer
Fixes https://github.com/scylladb/scylladb/issues/26688

V2 of https://github.com/scylladb/scylladb/pull/26687

Closes scylladb/scylladb#26689
2025-10-28 13:21:57 +03:00
Pavel Emelyanov
54a117b19d Merge 'retry_strategy: Switch to using seastar's retry_strategy (take two)' from Ernest Zaslavsky
With the recent introduction of retry_strategy to Seastar, the pure virtual class previously defined in ScyllaDB is now redundant. This change allows us to streamline our codebase by directly inheriting from Seastar’s implementation, eliminating duplication in ScyllaDB.

Despite this update is purely a refactoring effort and does not introduce functional changes it should be ported back to 2025.3 and 2025.4 otherwise it will make future backports of bugfixes/improvements related to `s3_client` near to impossible

ref: https://github.com/scylladb/seastar/issues/2803

depends on: https://github.com/scylladb/seastar/pull/2960

Closes scylladb/scylladb#25801

* github.com:scylladb/scylladb:
  s3_client: remove unnecessary `co_await` in `make_request`
  s3 cleanup: remove obsolete retry-related classes
  s3_client: remove unused `filler_exception`
  s3_client: fix indentation
  s3_client: simplify chunked download error handling using `make_request`
  s3_client: reformat `make_request` functions for readability
  s3_client: eliminate duplication in `make_request` by using overload
  s3_client: reformat `make_request` function declarations for readability
  s3_client: reorder `make_request` and helper declarations
  s3_client: add `make_request` override with custom retry and error handler
  s3_client: migrate s3_client to Seastar HTTP client
  s3_client: fix crash in `copy_s3_object` due to dangling stream
  s3_client: coroutinize `copy_s3_object` response callback
  aws_error: handle missing `unexpected_status_error` case
  s3_creds: use Seastar HTTP client with retry strategy
  retry_strategy: add exponential backoff to `default_aws_retry_strategy`
  retry_strategy: introduce Seastar-based retry strategy
  retry_strategy: update CMake and configure.py for new strategy
  retry_strategy: rename `default_retry_strategy` to `default_aws_retry_strategy`
  retry_strategy: fix include
  retry_strategy: Copied utils/s3/retry_strategy.hh to utils/s3/default_aws_retry_strategy.hh
  retry_strategy: Copied utils/s3/retry_strategy.cc to utils/s3/default_aws_retry_strategy.cc
2025-10-28 13:08:42 +03:00
Patryk Jędrzejczak
820c8e7bc4 Merge 'LWT: use shards_ready_for_reads for replica locks' from Petr Gusev
When a tablet is migrated between shards on the same node, during the write_both_read_new state we begin switching reads to the new shard. Until the corresponding global barrier completes, some requests may still use write_both_read_old erm, while others already use the write_both_read_new erm. To ensure mutual exclusion between these two types of requests, we must acquire locks on both the old and new shards. Once the global barrier completes, no requests remain on the old shard, so we can safely switch to acquiring locks only on the new shard.

The idea came from the similar locking problem in the [counters for tablets PR](https://github.com/scylladb/scylladb/pull/26636#discussion_r2463932395).

Fixes scylladb/scylladb#26727

backport: need to backport to 2025.4

Closes scylladb/scylladb#26719

* https://github.com/scylladb/scylladb:
  paxos_state: use shards_ready_for_reads
  paxos_state: inline shards_for_writes into get_replica_lock
2025-10-28 10:37:53 +01:00
Avi Kivity
d81796cae3 Merge 'Limit concurrent view updates from all sources' from Wojciech Mitros
Before this patch, when a base table has many materialized views,
each write to this table can start up to 128 view updates in parallel.
With high client write concurrency, the actual concurrency of writes
executed on the node may grow unexpectedly, which can lead to higher
latency and higher memory usage compared to a sequential approach.
In this patch we add a per-shard, per-service-level semaphore which
limits the number of concurrent view updates processed on the shard
in this service level to a constant value. We take one unit from the
semaphore for each local view update write, and releasing it when it
finishes. The remote view updates do not take units from the semaphore
because they don't consume nearly as much processing power and they
are limited by another semaphore based on their memory usage.

Fixes https://github.com/scylladb/scylladb/issues/25341

Closes scylladb/scylladb#25456

* github.com:scylladb/scylladb:
  mv: limit concurrent view updates from all sources
  database: rename _view_update_concurrency_sem to _view_update_memory_sem
2025-10-28 11:13:24 +02:00
Aleksandra Martyniuk
910cd0918b locator: use get_primary_replica for get_primary_endpoints
Currently, tablet_sstable_streamer::get_primary_endpoints is out of
sync with tablet_map::get_primary_replica. The get_primary_replica
optimizes the choice of the replica so that the work is fairly
distributes among nodes. Meanwhile, get_primary_endpoints always
chooses the first replica.

Use get_primary_replica for get_primary_endpoints.

Fixes: https://github.com/scylladb/scylladb/issues/21883.

Closes scylladb/scylladb#26385
2025-10-28 09:56:08 +02:00
Michael Litvak
8743422241 cdc: improve cdc metadata loading
when loading CDC streams metadata for tablets from the tables, read only
new entries from the history table instead of reading all entries. This
improves the CDC metadata reloading, making it more efficient and
predictable.

the CDC metadata is loaded as part of group0 reload whenever the
internal CDC tables are modified. on tablet split / merge, we create a
new CDC timestamp and streams by writing them to the cdc_streams_history
table by group0 operation, and when it's applied we reload the in-memory
CDC streams map by reading from the tables and constructing the updated map.

Previously, on every update, we would read the entire
cdc_streams_history entries for the changed table, constructing all its
streams and creating a new map from scratch.

We improve this now by reading only new entries from cdc_streams_history
and append them to the existing map. we can do this because we only
append new entries to cdc_streams_history with higher timestamp than all
previous entries.

This makes this reloading more efficient and predictable, because
previously we would read a number of entries that depends on the number
of tablets splits and merges, which increases over time and is
unbounded, whereas now we read only a single stream set on each update.

Fixes scylladb/scylladb#26732
2025-10-28 08:54:09 +01:00
Wojciech Mitros
f07a86d16e mv: limit concurrent view updates from all sources
Before this patch, when a base table has many materialized views,
each write to this table can start up to 128 view updates in parallel.
With high client write concurrency, the actual concurrency of writes
executed on the node may grow unexpectedly, which can lead to higher
latency and higher memory usage compared to a sequential approach.
In this patch we add a per-shard, per-service-level semaphore which
limits the number of concurrent view updates processed on the shard
in this service level to a constant value. We take one unit from the
semaphore for each local view update write, and releasing it when it
finishes. The remote view updates do not take units from the semaphore
because they don't consume nearly as much processing power and they
are limited by another semaphore based on their memory usage.

The effect of this patch can also be observed when writing to a base
table with a large number of materialized views, like in the
materialized_views_test.py::TestMaterializedViews::test_many_mv_concurrent
dtest. In that test, if we perform a full scan in parallel to a write
workload with a concurrency of 100 to a table with 100 views, the scan
would sometimes timeout because it would effectively get 1/10000 of cpu.
With this patch, the cpu concurrency of view updates was limited to 128
(we ran both writes and scan in the same service level), and the scan
no longer timed out.

Fixes https://github.com/scylladb/scylladb/issues/25341
2025-10-27 18:55:41 +01:00
Pavel Emelyanov
81f598225e error_injection: Add template parameter default for in release mode
The std::optional<T> inject_parameter(...) method is a template, and in
dev/debug modes this parameter is defaulted to std::string_view, but for
release mode it's not. This patch makes it symmetrical.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26706
2025-10-27 16:39:22 +01:00
Taras Veretilnyk
1361ae7a0a test_tablets2: verify SSTable cleanup after tablet load and stream
Modify existing test_tablet_load_and_stream testcase to verify
that SSTable files are properly deleted from the upload
directory after streaming.
2025-10-27 16:36:08 +01:00
Taras Veretilnyk
517a4dc4df tablet_sstable_streamer: replace unlink() call with mark_for_deletion()
When streaming SSTables across tablets, a single SSTable may be streamed to multiple tablets.
The previous implementation unlinked SSTables immediately after streaming them for the first tablet,
potentially making them partially unavailable for subsequent tablets.
This patches replaces unlink() call with mark_for_deletion()
2025-10-27 16:30:05 +01:00
Petr Gusev
478f7f545a paxos_state: use shards_ready_for_reads
Acquiring locks on both shards for the entire tablet migration period
is redundant. In most cases, locking only the old shard or only the new
shard is sufficient. Using shards_ready_for_reads reduces the
situations in which we need to lock both shards to:
* intra-node migrations only
* only during the write_both_read_new state
Once the global barrier completes in the write_both_read_new state, no
requests remain on the old shard, so we can safely acquire locks
only on the new shard.

Fixes scylladb/scylladb#26727
2025-10-27 16:22:28 +01:00
Piotr Dulikowski
fd966ec10d Merge 'cdc: garbage collect CDC streams for tablets' from Michael Litvak
introduce helper functions that can be used for garbage collecting old
cdc streams for tablets-based keyspaces.

add a background fiber to the topology coordinator that runs
periodically and checks for old CDC streams for tablets keyspaces that
can be garbage collected.

the garbage collection works by finding the newest cdc timestamp that has been
closed for more than the configured cdc TTL, and removing all information from
the cdc internal tables about cdc timestamps and streams up to this timestamp.

in general it should be safe to remove information about these streams because
they are closed for more than TTL, therefore all rows that were written to these streams
with the configured TTL should be dead.
the exception is if the TTL is altered to a smaller value, and then we may remove information
about streams that still have live rows that were written with the longer ttl.

Fixes https://github.com/scylladb/scylladb/issues/26669

Closes scylladb/scylladb#26410

* github.com:scylladb/scylladb:
  cdc: garbage collect CDC streams periodically
  cdc: helpers for garbage collecting old streams for tablets
2025-10-27 16:16:55 +01:00
Michał Hudobski
541b52cdbf cql: fail with a better error when null vector is passed to ann query
Currently when a null vector is passed to an ANN query we fail with a
quite confusing error ("NoHostAvailable: ('Unable to complete the
operation against any hosts', {<Host: 127.0.0.1:9042 datacenter1>:
<Error from server: code=0000 [Server error] message="to_bytes() called
on raw value that is null">})").

This patch fixes that by throwing an InvalidRequestException with an
appropriate message instead.
We also add a test case that validates this behavior.

Fixes: VECTOR-257

Closes scylladb/scylladb#26510
2025-10-27 16:09:08 +02:00
Botond Dénes
417270b726 Merge 'Port dtest EAR tests to test.py/pytest in scylla CI' from Calle Wilund
Fixes #26641

* Adds shared abstraction for dockerized mock services for out pytests (not using python docker, due to both library and podman)
* Adds test fixtures for our key providers (except GCS KMS, for which we have no mock server) to do local testing
* Ports (and prunes and sharpens) the test cases from dtest::encryption_at_rest_test to our pytest.
* Shared KMIP mock between boost test and pytest and speeds up boost test shutdown.

When merged, the dtest counterpart can be decommissioned.

Closes scylladb/scylladb#26642

* github.com:scylladb/scylladb:
  test::cluster::object_store::conftest: Make GS proxy use shared docker mock server wrapper
  test::cluster::test_encryption: Port dtest EAR tests
  test::cluster::conftest: Add key_provider fixture
  test::pylib::encryption_provider: Port dtest encryption provider classes
  test::pylib::dockerized_service: Add helper for running docker/podman
  test::pylib::kmip_wrapper: Modify to be usable by pytest fixtures
  test::boost::kmip_wrapper: Move python script for PyKMIP to pylib
2025-10-27 15:42:52 +02:00
Patryk Jędrzejczak
e1c3f666c9 Merge 'vnode cleanup: add missing barriers and fix race conditions' from Petr Gusev
Problems addressed by this PR

* Missing barrier before cleanup: If a node was bootstrapped before cleanup, some request coordinators could still be in `write_both_read_new` and send stale requests to replicas being cleaned up.
* Sessions not drained before cleanup: We lacked protection against stale streaming or repair operations.
* `sstable_vnodes_cleanup_fiber()` calling `flush_all_tables()` under group0 lock: This caused SCT test failures (see [this comment](https://github.com/scylladb/scylladb/issues/25333#issuecomment-3298859046) for details).
* Issues with `storage_proxy::start_write()` used by `sstable_vnodes_cleanup_fiber`:
  * The result of `start_write()` was not held during `abstract_write_response_handler::apply_locally`, so coordinator-local writes were not properly awaited.
  * Synchronization was racy — `start_write()` was not atomic with the fence check, allowing stale writes to sneak in if `fence_version` changed in between.
  * It waited for all writes, including local tables and tablet-based tables, which is redundant because `sstable_vnodes_cleanup_fiber` does not apply to them.
  * It also waited for writes with versions greater than the current `fence_version`, which is unnecessary.

Fixes scylladb/scylladb#26150

backport: this PR fixes several issues with the vnodes cleanup procedure, but it doesn't seem they are critical enough to deserve backporting

Closes scylladb/scylladb#26315

* https://github.com/scylladb/scylladb:
  test_automatic_cleanup: add test_cleanup_waits_for_stale_writes
  test_fencing: fix due to new version increment
  test_automatic_cleanup: clean it up
  storage_proxy: wait for closing sessions in sstable cleanup fiber
  storage_proxy: rename await_pending_writes -> await_stale_pending_writes
  storage_proxy: use run_fenceable_write
  storage_proxy: abstract_write_response_handler: apply_locally: extract post fence check
  storage_proxy: introduce run_fenceable_write
  storage_proxy: move update_fence_version from shared_token_metadata
  storage_proxy: fix start_write() operation scope in apply_locally
  storage_proxy: move post fence check into handle_write
  storage_proxy: move fencing into mutate_counter_on_leader_and_replicate
  storage_proxy::handle_read: add fence check before get_schema
  storage_service: rebrand cleanup_fiber to vnodes_cleanup_fiber
  sstable_cleanup_fiber: use coroutine::parallel_for_each
  storage_service: sstable_cleanup_fiber: move flush_all_tables out of the group0 lock
  topology_coordinator: barrier before cleanup
  topology_coordinator: small start_cleanup refactoring
  global_token_metadata_barrier: add fenced flag
2025-10-27 12:35:13 +01:00
Petr Gusev
5ab2db9613 paxos_state: inline shards_for_writes into get_replica_lock
No need to have two functions since both callers of get_replica_lock()
use shards_for_writes() to compute the shards where the locks
must be acquired.

Also while at it, inline the acquire() lambda in get_replica_lock()
and replace it with a loop over shards. This makes the code
more strightforward.
2025-10-27 11:12:29 +01:00
Michael Litvak
6109cb66be cdc: garbage collect CDC streams periodically
add a background fiber to the topology coordinator that runs
periodically and checks for old CDC streams for tablets keyspaces that
can be garbage collected.
2025-10-26 11:01:20 +01:00
Michael Litvak
440caeabcb cdc: helpers for garbage collecting old streams for tablets
introduce helper functions that can be used for garbage collecting old
cdc streams for tablets-based keyspaces.

- get_new_base_for_gc: finds a new base timestamp given a TTL, such that
  all older timestamps and streams can be removed.
- get_cdc_stream_gc_mutations: given new base timestamp and streams,
  builds mutations that update the internal cdc tables and remove the
  older streams.
- garbage_collect_cdc_streams_for_table: combines the two functions
  above to find a new base and build mutations to update it for a
  specific table
- garbage_collect_cdc_streams: builds gc mutations for all cdc tables
2025-10-26 11:01:20 +01:00
Avi Kivity
b843d8bc8b Merge 'scylla-sstable: add cql support to write operation' from Botond Dénes
In theory, scylla-sstable write is an awesome and flexible tool to generate sstables with arbitrary content. This is convenient for tests and could come clutch in a disaster scenario, where certain system table's content need to be manually re-created, system tables that are not writable directly via CQL.
In practice, in its current form this operation is so convoluted to use that even its own author shuns it. This is because the JSON specification of the sstable content is the same as that of the scylla-sstable dump-data: containing every single piece of information on the mutation content. Where this is an advantage for dump-data, allowing users to inspect the data in its entirety -- it is a huge disadvantage for write, because of all these details have to be filled in, down to the last timestamp, to generate an sstable. On top of that, the tool doesn't even support any of the more advanced data types, like collections, UDF and counters.
This PR proposes a new way of generating sstables: based on the success of scylla-sstable query, it introduces CQL support for scylla-sstable write. The content of the sstable can now be specified via standard INSERT, UPDATE and DELETE statements, which are applied to a memtable, then flushed into the sstable.
To avoid boundless memory consumption, the memtable is flushed every time it reaches 1MiB in size, consequently the command can generate multiple output sstables.

The new CQL input-format is made default, this is safe as nobody is using this command anyway. Hopefully this PR will change that.

Fixes: https://github.com/scylladb/scylladb/issues/26506

New feature, no backport.

Closes scylladb/scylladb#26515

* github.com:scylladb/scylladb:
  test/cqlpy/test_tools.py: add test for scylla-sstable write --input-format=cql
  replica/mutation_dump: add support for virtual tables
  tools/scylla-sstable: print_query_results_json(): handle empty value buffer
  tools/scylla-sstable: add cql support to write operation
  tools/scylla-sstable: write_operation(): fix indentation
  tools/scylla-sstable: write_operation(): prepare for a new input-format
  tools/scylla-sstable: generalize query_operation_validate_query()
  tools/scylla-sstable: move query_operation_validate_query()
  tools/scylla-sstable: extract schema transformation from query operation
  replica/table: add virtual write hook to the other apply() overload too
2025-10-24 23:32:40 +03:00
Avi Kivity
997b52440e Merge 'replica/mutation_dump: include empty/dead partitions in the scan results' from Botond Dénes
`select * from mutation_fragment()` queries don't return partitions which are completely empty or only contain tombstones which are all garbage collectible. This is because the underlying `mutation_dump` mechanism has a separate query to discover partitions for scans. This query is a regular mutation scan, which is subject to query compaction and garbage collection. Disable the query compaction for mutation queries executed on behalf of mutation fragment queries, so *all* data is visible in the result, even that which is fully garbage collectible.

Fixes scylladb/scylladb#23707.

Scans for mutation-fragment are very rare, so a backport is not necessary. We can backport on-demand.

Closes scylladb/scylladb#26227

* github.com:scylladb/scylladb:
  replica/mutation_dump: multi_range_partition_generator: disable garbage-collection
  replica: add tombstone_gc_enabled parameter to mutation query methods
  mutation/mutation_compactor: remove _can_gc member
  tombstone_gc: add tombstone_gc_state factory methods for gc_all and no_gc
2025-10-24 23:26:16 +03:00
Patryk Jędrzejczak
5ae1aba107 test: unskip test_raft_recovery_entry_loss
The issue has been fixed in #26612.

Closes scylladb/scylladb#26614
2025-10-24 21:23:41 +03:00
Ferenc Szili
b4ca12b39a load_stats: change data structure which contains tablet sizes
This patch changes the tablet size map in load_stats. Previously, this
data structure was:

std::unordered_map<range_based_tablet_id, uint64_t> tablet_sizes;

and is changed into:

std::unordered_map<table_id, std::unordered_map<dht::token_range, uint64_t>> tablet_sizes;

This allows for improved performance of tablet tablet size reconciliation.
2025-10-24 14:37:00 +02:00
Andrzej Jackowski
8642629e8e test: add test_anonymous_user to test_raft_service_levels
The primary goal of this test is to reproduce scylladb/scylladb#26040
so the fix (278019c328) can be backported
to older branches.

Scenario: connect via CQL as an anonymous user and verify that the
`sl:default` scheduling group is used. Before the fix for #26040
`main` scheduling group was incorrectly used instead of `sl:default`.

Control connections may legitimately use `sl:driver`, so the test
accepts those occurrences while still asserting that regular anonymous
queries use `sl:default`.

This adds explicit coverage on master. After scylladb#24411 was
implemented, some other tests started to fail when scylladb#26040
was unfixed. However, none of the tests asserted this exact behavior.

Refs: scylladb/scylladb#26040
Refs: scylladb/scylladb#26581

Closes scylladb/scylladb#26589
2025-10-24 12:23:34 +02:00
Ernest Zaslavsky
e8ce49dadf s3_client: remove unnecessary co_await in make_request
Eliminates a redundant `co_await` by directly returning the `future`,
simplifying the control flow without affecting behavior.
2025-10-23 15:58:11 +03:00
Ernest Zaslavsky
71ea973ae4 s3 cleanup: remove obsolete retry-related classes
Delete `default_retry_strategy` and `retryable_http_client`, no longer
used in `s3_client` after recent refactors.
2025-10-23 15:58:11 +03:00
Ernest Zaslavsky
d44bbb1b10 s3_client: remove unused filler_exception
Eliminate the now-obsolete `filler_exception`, rendered redundant by
earlier refactors that streamlined error handling in the S3 client.
2025-10-23 15:58:11 +03:00
Ernest Zaslavsky
d3c6338de6 s3_client: fix indentation
Fix indentation in background download fiber in `chunked_download_source`
2025-10-23 15:58:11 +03:00
Ernest Zaslavsky
47704deb1e s3_client: simplify chunked download error handling using make_request
Refactor `chunked_download_source` to eliminate redundant exception
handling by leveraging the new `make_request` override with custom
retry strategy. This streamlines the download fiber logic, improving
readability and maintainability.
2025-10-23 15:58:11 +03:00
Ernest Zaslavsky
2bc9b205b6 s3_client: reformat make_request functions for readability
Reformats `make_request` functions with long argument lists to improve
readability and comply with formatting guidelines.
2025-10-23 15:58:11 +03:00
Ernest Zaslavsky
bf39412f4a s3_client: eliminate duplication in make_request by using overload
Removes redundant code in the `make_request` function by invoking the
appropriate overload, simplifying logic and improving maintainability.
2025-10-23 15:58:11 +03:00
Ernest Zaslavsky
695e70834e s3_client: reformat make_request function declarations for readability
Reformats the `make_request` function declarations to improve readability
due to the large number of arguments. This aligns with our formatting
guidelines and makes the code easier to maintain.
2025-10-23 15:58:11 +03:00
Ernest Zaslavsky
9f01c1f3ff s3_client: reorder make_request and helper declarations
Performs minor reordering of helper functor declarations in the header
file to improve readability and maintain logical grouping.
2025-10-23 15:58:10 +03:00
Ernest Zaslavsky
3d51124cb0 s3_client: add make_request override with custom retry and error
handler

Introduce an override for `make_request` in `s3_client` to support
custom retry strategies and error handlers, enabling flexibility
beyond the default client behavior and improving control over request
handling
2025-10-23 15:58:10 +03:00
Ernest Zaslavsky
bdb3979456 s3_client: migrate s3_client to Seastar HTTP client
Eliminate use of `retryable_http_client` in `s3_client` and adopt
Seastar's native HTTP client.
2025-10-23 15:58:10 +03:00
Ernest Zaslavsky
2025760e75 s3_client: fix crash in copy_s3_object due to dangling stream
In the `copy_part` method, move the `input_stream<char>` argument
into a local variable before use. Failing to do so can lead to a
SIGSEGV or trigger an abort under address sanitizer.
2025-10-23 15:58:10 +03:00
Ernest Zaslavsky
0983c791e9 s3_client: coroutinize copy_s3_object response callback
coroutinize `copy_s3_object` response callback for a bugfix in the following commit to prevent failing on dangling stream
2025-10-23 15:58:10 +03:00
Ernest Zaslavsky
237217c798 aws_error: handle missing unexpected_status_error case
Add a missing `case` clause to the `switch` statement to correctly
handle scenarios where `unexpected_status_error` is thrown. This
fixes overlooked error handling and improves robustness.
2025-10-23 15:58:10 +03:00
Ernest Zaslavsky
4f6384b1a0 s3_creds: use Seastar HTTP client with retry strategy
In AWS credentials providers, replace `retryable_http_client` with
Seastar's native HTTP client. Integrate the newly added
`default_aws_retry_strategy` to handle retries more efficiently and
reduce dependency on external retry logic.
2025-10-23 15:58:07 +03:00
Ernest Zaslavsky
3851ee58d7 retry_strategy: add exponential backoff to default_aws_retry_strategy
Add exponential backoff to `default_aws_retry_strategy` and call it to `sleep` before returning `true`, no-op in case of non-retryable error
2025-10-23 15:49:34 +03:00
Ernest Zaslavsky
524737a579 retry_strategy: introduce Seastar-based retry strategy
Add a new class derived from Seastar's `default_retry_strategy`.
Relocate the `should_retry` implementation from Scylla's
`default_retry_strategy` into the new class to centralize and
standardize retry behavior.
2025-10-23 15:49:34 +03:00
Ernest Zaslavsky
51aadd0ab3 retry_strategy: update CMake and configure.py for new strategy
Include `default_aws_retry_strategy` in the build system by updating
CMake and `configure.py` to ensure it is properly compiled and linked.
2025-10-23 15:49:34 +03:00
Ernest Zaslavsky
5d65b47a15 retry_strategy: rename default_retry_strategy to default_aws_retry_strategy
Renames the `default_retry_strategy` class to `default_aws_retry_strategy`
to clarify its association with the S3 client implementation. This avoids
confusion with the unrelated `seastar::default_retry_strategy` class.
2025-10-23 15:49:34 +03:00
Ernest Zaslavsky
cc200ced67 retry_strategy: fix include
Fix header inclusion in "newly" created file
2025-10-23 15:49:34 +03:00
Ernest Zaslavsky
d679fd514c retry_strategy: Copied utils/s3/retry_strategy.hh to utils/s3/default_aws_retry_strategy.hh 2025-10-23 15:49:34 +03:00
Ernest Zaslavsky
7cd4be4c49 retry_strategy: Copied utils/s3/retry_strategy.cc to utils/s3/default_aws_retry_strategy.cc 2025-10-23 15:49:34 +03:00
Aleksandra Martyniuk
6fc43f27d0 db: fix indentation 2025-10-23 10:39:43 +02:00
Aleksandra Martyniuk
1935268a87 test: add reproducer for data resurrection
Add a reproducer to check that the repair_time isn't updated
if the batchlog replay fails.

If repair_time was updated, tombstones could be GC'd before the
batchlog is replayed. The replay could later cause the data
resurrection.
2025-10-23 10:39:43 +02:00
Aleksandra Martyniuk
d436233209 repair: fail tablet repair if any batch wasn't sent successfully
If any batch replay failed, we cannot update repair_time as we risk the
data resurrection.

If replay of any batch needs to be retried, run the whole repair but
fail at the very end, so that the repair_time for it won't be updated.
2025-10-23 10:39:42 +02:00
Aleksandra Martyniuk
e1b2180092 db/batchlog_manager: fix making decision to skip batch replay
Currently, we skip batch replay if less than batch_log_timeout passed
from the moment the batch was written. batch_log_timeout value can
be configured. If it is large, it won't be replayed for a long time.
If the tombstone will be GC'd before the batch is replayed, then we
risk the data resurrection.

To ensure safety we can skip only the batches that won't be GC'd.
In this patch we skip replay of the batches for which:
    now() < written_at + min(timeout + propagation_delay)

repair_time is set as a start of batchlog replay, so at the moment
of the check we will have:
    repair_time <= now()

So we know that:
    repair_time < written_at + propagation_delay

With this condition we are sure that GC won't happen.
2025-10-23 10:38:31 +02:00
Aleksandra Martyniuk
7f20b66eff db: repair: throw if replay fails
Return a flag determining whether all the batches were sent successfully in
batchlog_manager::replay_all_failed_batches (batches skipped due to being
too fresh are not counted). Throw in repair_flush_hints_batchlog_handler
if not all batches were replayed, to ensure that repair_time isn't updated.
2025-10-23 10:38:31 +02:00
Aleksandra Martyniuk
904183734f db/batchlog_manager: delete batch with incorrect or unknown version
batchlog_manager::replay_all_failed_batches skips batches that have
unknown or incorrect version. Next round will process these batches
again.

Such batches will probably be skipped everytime, so there is no point
in keeping them. Even if at some point the version becomes correct,
we should not replay the batch - it might be old and this may lead
to data resurrection.
2025-10-23 10:38:31 +02:00
Aleksandra Martyniuk
502b03dbc6 db/batchlog_manager: coroutinize replay_all_failed_batches 2025-10-23 10:38:31 +02:00
Ernest Zaslavsky
abd3abc044 cmake: fix the seastar API level
Fix the build to make it compile when using CMake by defining the right Seastar API level

Closes scylladb/scylladb#26690
2025-10-23 11:20:20 +03:00
Botond Dénes
f8b0142983 Merge 'Add --drop-unfixable-sstables flag for scrub in segregate mode' from Taras Veretilnyk
This PR introduces support for a new scrub option: `--drop-unfixable-sstables`, which enables the dropping of corrupted SSTables during scrub only in segregate mode. The patch includes implementation, validation, and  set of tests to ensure correct behavior and error handling.

Fixes #19060

Backport is not required, it is a new feature

Closes scylladb/scylladb#26579

* github.com:scylladb/scylladb:
  sstable_compaction_test: add segregate mode tests for drop-unfixable-sstables option
  test/nodetool: add scrub drop-unfixable-sstables option testcase
  scrub: add support for dropping unfixable sstables in segregate mode
2025-10-23 11:06:19 +03:00
Wojciech Mitros
c0d0f8f85b database: rename _view_update_concurrency_sem to _view_update_memory_sem
In the following commit, we'll introduce a new semaphore for view updates
that limits their concurrency by view update count. To avoid confusion,
we rename the existing semaphore that tracks the memory used by concurrent
view updates and related objects accordingly.
2025-10-23 10:00:15 +02:00
Tomasz Grabiec
564cebd0e6 Merge 'tablet_metadata_guard: fix split/merge handling' from Petr Gusev
The guard should stop refreshing the ERM when the number of tablets changes. Tablet splits or merges invalidate the `tablet_id` field (`_tablet`), which means the guard can no longer correctly protect ongoing operations from tablet migrations.

The problem is specific to LWT, since `tablet_metadata_guard` is used mostly for heavy topology operations, which exclude with split and merge. The guard was used for LWT as an optimization -- we don't need to block topology operations or migrations of unrelated tablets. In the future, we could use the guard for regular reads/writes as well (via the `token_metadata_guard` wrapper).

Fixes [scylladb/scylladb#26437](https://github.com/scylladb/scylladb/issues/26437)

backports: need to backport to 2025.4 since the bug is relevant to LWT over tablets.

Closes scylladb/scylladb#26619

* github.com:scylladb/scylladb:
  test_tablets_lwt: add test_tablets_merge_waits_for_lwt
  test.py: add universalasync_typed_wrap
  tablet_metadata_guard: fix split/merge handling
  tablet_metadata_guard: add debug logs
  paxos_state: shards_for_writes: improve the error message
  storage_service: barrier_and_drain – change log level to info
  topology_coordinator: fix log message
2025-10-22 20:56:21 +02:00
Taras Veretilnyk
60334c6481 sstable_compaction_test: add segregate mode tests for drop-unfixable-sstables option
Added a new test case, sstable_scrub_segregate_mode_drop_unfixable_sstables_test,
which verifies that when the drop-unfixable-sstables flag is enabled in segregate
mode, corrupted SSTables are correctly dropped.
2025-10-22 17:16:55 +02:00
Taras Veretilnyk
11874755a3 test/nodetool: add scrub drop-unfixable-sstables option testcase
This patches introduces the test_scrub_drop_unfixable_sstables_option testcase,
which verifies that correct request is generated when the --drop-unfixable-sstables flag is used.
It also validates that an error is thrown if the drop-unfixable-sstables
flag is enabled and mode is not set to SEGREGATE.

This patch introduces test_scrub_drop_unfixable_sstables_option, which test
2025-10-22 17:16:55 +02:00
Taras Veretilnyk
42da7f1eb6 scrub: add support for dropping unfixable sstables in segregate mode
This patch adds a new flag `drop-unfixable-sstables` to the scrub operation
in segregate mode, allowing to automatically drop SSTables that
cannot be fixed during scrub. It also includes API support of the 'drop_unfixable_sstables'
paramater and validation to ensure this flag is not enabled in other modes rather than segragate.
2025-10-22 17:16:49 +02:00
Radosław Cybulski
621e88ce52 Fix spelling errors
Closes scylladb/scylladb#26652
2025-10-22 16:46:31 +02:00
Petr Gusev
22271b9fe7 test_automatic_cleanup: add test_cleanup_waits_for_stale_writes 2025-10-22 16:31:43 +02:00
Petr Gusev
d1fc111dd7 test_fencing: fix due to new version increment
Topology version is now bumped when a node finishes bootstrapping.
As a result, fence_version == version - 1, and decrementing version
in the test no longer triggers a stale topology exception.

Fix: run cleanup_all to invoke the global barrier, which synchronizes
fence_version := version on all nodes.
2025-10-22 16:31:43 +02:00
Petr Gusev
5bdeb4ec66 test_automatic_cleanup: clean it up
Remove redundant imports and variables. Extract cleanup_all
function. Add logs. Remove pytest.mark.prepare_3_racks_cluster --
the test doesn't actually need a 3 node cluster, one initial
node is enough.
2025-10-22 16:31:43 +02:00
Petr Gusev
f34126aacf storage_proxy: wait for closing sessions in sstable cleanup fiber
Ensure that no stale streaming or repair sessions are active before
proceeding with the cleanup.
2025-10-22 16:31:43 +02:00
Petr Gusev
7e2959a1bf storage_proxy: rename await_pending_writes -> await_stale_pending_writes 2025-10-22 16:31:43 +02:00
Petr Gusev
1dd05f4404 storage_proxy: use run_fenceable_write
Switch local write code sites from start_write() to
run_fenceable_write().
2025-10-22 16:31:43 +02:00
Petr Gusev
d56495fd9c storage_proxy: abstract_write_response_handler: apply_locally: extract post fence check
All mutation_holder::apply_locall() implementations now do the same
post fence chech. In this commit we hoist this check up to
abstract_write_response_handler::apply_locally().
2025-10-22 16:31:43 +02:00
Petr Gusev
24f8962938 storage_proxy: introduce run_fenceable_write
This function is intended to replace start_write() in subsequent
commits. It provides the following benefits:
* Remove duplication: All start_write() call sites must run the fence
check after the operation completes. run_fenceable_write() encapsulates
this pattern.
* Fix a race: To ensure no new stale write operations occur during
cleanup, a fence check before start_write() was previously used.
However, yields in several code paths between the check and
start_write() made it non-atomic, allowing a stale operation to slip in
if the fence_version was updated in between.
* Optimize waiting: We do not need to wait for all operations—only for
vnode-based, non-local tables with versions smaller than the current
fence_version.
2025-10-22 16:31:43 +02:00
Petr Gusev
c5f447224a storage_proxy: move update_fence_version from shared_token_metadata
Future commits will extend update_fence_version, and it is simpler to do
so if the function resides in storage_proxy. Additionally, fence_version
is the only field this function accesses, and it is used solely within
storage_proxy, making this change natural on its own.
2025-10-22 16:31:43 +02:00
Petr Gusev
659c5912e0 storage_proxy: fix start_write() operation scope in apply_locally
The operation must be held during the local write. Before this commit,
its scope ended after returning from apply_locally(), so it
did not actually provide any protection.
2025-10-22 16:31:43 +02:00
Petr Gusev
27915befac storage_proxy: move post fence check into handle_write
handle_write() is invoked from receive_mutation_handler() and
handle_paxos_learn(), and both previously performed a fence check in
apply_fn. This commit hoists the fence check into handle_write() to
reduce code duplication.

Additionally, move start_write() after get_schema_for_write(), since
there is no need to hold the operation while querying the schema.
2025-10-22 16:31:43 +02:00
Petr Gusev
41077138bf storage_proxy: move fencing into mutate_counter_on_leader_and_replicate
As noted in the code comments, start_write() does not need to be held
during counter replication; it is required only while performing local
storage modifications. Move the start_write() call and the fence
check down to mutate_counter_on_leader_and_replicate().

Additionally, mutate_counters_on_leader() is updated to check for
possible stale_topology_exception() and properly package them
in the resulting exception_variant structure.
2025-10-22 16:31:43 +02:00
Petr Gusev
a6208b2d67 storage_proxy::handle_read: add fence check before get_schema
Avoid querying the schema for outdated requests by adding a fence check
at the start of handle_read.
2025-10-22 16:31:43 +02:00
Petr Gusev
263cbef68e storage_service: rebrand cleanup_fiber to vnodes_cleanup_fiber
The function applies only to vnode-based tables. Rename it to
vnodes_cleanup_fiber for clarity.
2025-10-22 16:31:42 +02:00
Petr Gusev
03aa856da3 sstable_cleanup_fiber: use coroutine::parallel_for_each
A refactoring commit -- no need to allocate a dedicated
std::vector<future<>>.
2025-10-22 16:31:42 +02:00
Petr Gusev
4a781b67b5 storage_service: sstable_cleanup_fiber: move flush_all_tables out of the group0 lock
The flush_all_tables() call ensures that no obsolete, cleanup-eligible
writes remain in the commitlog. This does not need to run under the
group0 lock, so move it outside.

Also, run await_pending_writes() before flush_all_tables(), since
pending writes may include data that must be cleaned up.

Finally, add more detailed info-level logs to trace the stages of the
cleanup procedure.
2025-10-22 16:31:42 +02:00
Petr Gusev
a54ebe890b topology_coordinator: barrier before cleanup
Cleanup needs a barrier to make sure that no request coordinators
are sending requests to old replicas/ranges that we're going to cleanup.
For example, during node bootstrap, the cleanup
process on replicas must be protected against coordinators running
write_both_read_new and sending requests to old ranges.

We run a barrier to ensure that most data-plane requests with the old
topology finish before cleanup starts. At the same time, we do not want
to block cleanup if the barrier fails on some replicas. Once the fence
is committed to group0, we can safely proceed, since any late request
with the old topology will be fenced out on the replica.

The test for this case is added in a separate commit
"test_automatic_cleanup: add test_cleanup_waits_for_stale_writes"
2025-10-22 16:31:42 +02:00
Petr Gusev
1b791dacde topology_coordinator: small start_cleanup refactoring
Rename start_cleanup -> start_vnodes_cleanup for clarity.
Pass topology_request and server_id in start_vnodes_cleanup, we will
need them for better logging later.
2025-10-22 16:31:42 +02:00
Petr Gusev
d53e24812f global_token_metadata_barrier: add fenced flag
Cleanup needs a barrier. For example, during node bootstrap, the cleanup
process on replicas must be protected against coordinators running
write_both_read_new and sending requests to old ranges.

We run a barrier to ensure that most data-plane requests with the old
topology finish before cleanup starts. At the same time, we do not want
to block cleanup if the barrier fails on some replicas. Once the fence is
committed to group0, we can safely proceed, since any late request with
the old topology will be fenced out on the replica.

To support this, introduce a "fenced" flag. The client can pass a pointer
to a bool, which will be set to true after the new fenced_version is
committed.
2025-10-22 16:31:42 +02:00
Calle Wilund
c4427f6d4f test::cluster::object_store::conftest: Make GS proxy use shared docker mock server wrapper
Use the shared logic of DockerizedServer to provide the fake-gcs-server
docker helper.
2025-10-22 14:06:30 +00:00
Calle Wilund
1aa8014f8f test::cluster::test_encryption: Port dtest EAR tests
Moves the tests, reexamined and simplified, to unit tests
instead of dtest.
2025-10-22 14:06:30 +00:00
Asias He
5f1febf545 repair: Remove the regular mode name in the tablet repair api
The patch e34deb72f9 (repair: Rename incremental mode name)
missed one place that references the removed regular mode name.

Fixes #26503

Closes scylladb/scylladb#26660
2025-10-22 16:55:55 +03:00
Botond Dénes
1c7f1f16c8 Merge 'raft topology: fix group0 tombstone GC in the Raft-based recovery procedure' from Patryk Jędrzejczak
Group0 tombstone GC considers only the current group 0 members
while computing the group 0 tombstone GC time. It's not enough
because in the Raft-based recovery procedure, there can be nodes
that haven't joined the current group 0 yet, but they have belonged
to a different group 0 and thus have a non-empty group 0 state ID.
The current code can cause a data resurrection in group 0 tables.

We fix this issue in this PR and add a regression test.

This issue was uncovered by `test_raft_recovery_entry_loss`, which
became flaky recently. We skipped this test for now. We will unskip
it in a following PR because it's skipped only on master, while we
want to backport this PR.

Fixes #26534

This PR contains an important bugfix, so we should backport it
to all branches with the Raft-based recovery procedure (2025.2
and newer).

Closes scylladb/scylladb#26612

* github.com:scylladb/scylladb:
  test: test group0 tombstone GC in the Raft-based recovery procedure
  group0_state_id_handler: remove unused group0_server_accessor
  group0_state_id_handler: consider state IDs of all non-ignored topology members
2025-10-22 16:40:11 +03:00
Ernest Zaslavsky
a09ec56e3d cmake: fix s3_test linkage
Fix missing `s3_test` executable linkage with `scylla_encryption`

Closes scylladb/scylladb#26655
2025-10-22 14:14:43 +03:00
Anna Stuchlik
9c0ff7c46b doc: add support for Debian 12
Fixes https://github.com/scylladb/scylladb/issues/26640

Closes scylladb/scylladb#26668
2025-10-22 14:09:13 +03:00
Calle Wilund
93e335f861 test::cluster::conftest: Add key_provider fixture
Iterates test functions across all mockable providers and
provides a key provider instance handling EAR setup.
2025-10-22 10:53:02 +00:00
Calle Wilund
6406879092 test::pylib::encryption_provider: Port dtest encryption provider classes
Adds virtual interface for running scylla with EAR and various providers
we can do mock for. Note: GCP KMS not implemented.
2025-10-22 10:53:02 +00:00
Petr Gusev
03d6829783 test_tablets_lwt: add test_tablets_merge_waits_for_lwt 2025-10-22 11:33:20 +02:00
Petr Gusev
33e9ea4a0f test.py: add universalasync_typed_wrap
The universalasync.wrap function doesn't preserve the
type information, which confuses the VS Code Pylance
plugin and makes code navigation hard.

In this commit we fix the problem by adding a typed
wrapped around universalasync.wrap.

Fixes: scylladb/scylladb#26639
2025-10-22 11:32:37 +02:00
Petr Gusev
b23f2a2425 tablet_metadata_guard: fix split/merge handling
The guard should stop refreshing the ERM when the number of tablets
changes. Tablet splits or merges invalidate the tablet_id field
(_tablet), which means the guard can no longer correctly protect
ongoing operations from tablet migrations.

Fixes scylladb/scylladb#26437
2025-10-22 11:32:37 +02:00
Petr Gusev
ec6fba35aa tablet_metadata_guard: add debug logs 2025-10-22 11:32:37 +02:00
Petr Gusev
64ba427b85 paxos_state: shards_for_writes: improve the error message
Add the current token and tablet info, remove 'this_shard_id'
since it's always written by the logging infrastructure.
2025-10-22 11:32:37 +02:00
Petr Gusev
6f4558ed4b storage_service: barrier_and_drain – change log level to info
Debugging global barrier issues is difficult without these logs.
Since barriers do not occur frequently, increasing the log level should not produce excessive output.
2025-10-22 11:32:37 +02:00
Petr Gusev
e1667afa50 topology_coordinator: fix log message 2025-10-22 11:32:37 +02:00
Nadav Har'El
895d89a1b7 Update seastar submodule
Among other things, the merge includes the patch "http: add "Connection:
close" header to final server response.". This Fixes #26298: A missing
response header meant that a test's client code sometimes didn't notice
that the server closed the connection (since the client didn't need to
use the connection again), which made one test flaky.

* seastar bd74b3fa...63900e03 (6):
  > Merge 'Rework output_stream::slow_write()' from Pavel Emelyanov
    output_stream: Fix indentation of the slow_write() method
    output_stream: Remove pointless else
    output_stream: Replace std::swap with std::exchange
    output_stream: Unify some code-paths of slow_write()
  > Merge 'Deprecate in/out streams move-assignment operator' from Pavel Emelyanov
    iostream: Deprecate input/output stream default constructor and move-assignment operator
    test: Sub-split test-cases
    test: Don't reuse output_stream in file demo
    test: Keep input_/output_stream as optional
    util: Construct file_data_source in with_file_input_stream()
    websocket: Construct in/out in initializer list
    rpc: Wrap socket and buffers
  > scripts/perftune.py: detect corrupted NUMA topology information
  > Merge 'memory, smp: support more than 256 shards' from Avi Kivity
    reactor, smp: allocate smp queues across all shards
    memory: increase maximum shard count
    memory: make cpu_id_shift and related mask dynamic
    resource, memory: move memory limit calculation to memory.cc
    resource: don't error if --overprovisioned and asking for more vcpus than available
  > Merge 'Update perf_test text output, make columns selectable' from Travis Downs
    perf_tests: enhance text output
    perf_test_tests: add some check_output tests
2025-10-22 11:26:40 +03:00
Nadav Har'El
7c9f5ef59e Merge 'alternator/executor: instantly mark view as built when creating it with base table' from Michał Jadwiszczak
`CreateTable` request creates GSI/LSI together with the base table,
the base table is empty and we don't need to actually build the view.

In tablet-based keyspaces we can just don't create view building tasks
and mark the view build status as SUCCESS on all nodes. Then, the view
building worker on each node will mark the view as built in
`system.built_views` (`view_building_worker::update_built_views()`).

Vnode-based keyspaces will use the "old" logic of view builder, which
will process the view and mark it as built.

Fixes scylladb/scylladb#26615

This fix should be backported to 2025.4.

Closes scylladb/scylladb#26657

* github.com:scylladb/scylladb:
  test/alternator/test_tablets: add test for GSI backfill with tablets
  test/alternator/test_tablets: add reproducer for GSI with tablets
  alternator/executor: instantly mark view as built when creating it with base table
2025-10-22 10:44:28 +03:00
Calle Wilund
31cc1160b4 test::pylib::dockerized_service: Add helper for running docker/podman
While there is a docker interface for python, need to deal with
the docker-in-docker issues etc. This uses pure subprocess and
stream parse. Meant to provide enough flexibility for all our
docker mock server needs.
2025-10-21 23:26:50 +00:00
Avi Kivity
ab488fbb3f Merge 'Switch to seastar API level 9 (no more packet-s in output_stream/data_sink API)' from Pavel Emelyanov
Other than patching Scylla sinks to implement new data_sink_impl::put(std::span<temporary_buffer>) overload, the PR changes transport write_response() method to stop using output_stream::write(scattered_message) because it's also gone.

Using newer seastar API, no need to backport

Closes scylladb/scylladb#26592

* github.com:scylladb/scylladb:
  code: Fix indentation after previous patch
  code: Switch to seastar API level 9
  transport: Open-code invoke_with_counting into counting_data_sink::put
  transport: Don't use scattered_message
  utils: Implement memory_data_sink::put(net::packet)
2025-10-22 01:51:43 +03:00
Michał Jadwiszczak
34503f43a1 test/alternator/test_tablets: add test for GSI backfill with tablets
The test should pass without the fix for scylladb/scylladb#26615,
because the `executor::updata_table()` uses
`service::prepare_new_view_announcement()`, which creates view building
tasks for the view.

But it's better to add this test.
2025-10-22 00:34:49 +02:00
Michał Jadwiszczak
bdab455cbb test/alternator/test_tablets: add reproducer for GSI with tablets 2025-10-22 00:34:10 +02:00
Andrei Chekun
24d17c3ce5 test.py: rewrite the wait_for_first_completed
Rewrite wait_for first_completed to return only first completed task guarantee
of awaiting(disappearing) all cancelled and finished tasks
Use wait_for_first_completed to avoid false pass tests in the future and issues
like #26148
Use gather_safely to await tasks and removing warning that coroutine was
not awaited

Closes scylladb/scylladb#26435
2025-10-22 01:13:43 +03:00
Takuya ASADA
eb30594a60 dist: detect corrupted NUMA topology information
There are some environment which has corrupted NUMA topology
information, such as some instance types on AWS EC2 with specific Linux
kernel images.
On such environment, we cannot get HW information correctly from hwloc,
so we cannot proceed optimization on perftune.
To avoid causing script error, check NUMA topology information and skip
running perftune if the information corrupted.

Related scylladb/seastar#2925

Closes scylladb/scylladb#26344
2025-10-22 01:11:14 +03:00
Michał Jadwiszczak
8fbf122277 alternator/executor: instantly mark view as built when creating it with base table
`CreateTable` request creates GSI/LSI together with the base table,
the base table is empty and we don't need to actually build the view.

In tablet-based keyspaces we can just don't create view building tasks
and mark the view build status as SUCCESS on all nodes. Then, the view
building worker on each node will mark the view as built in
`system.built_views` (`view_building_worker::update_built_views()`).

Vnode-based keyspaces will use the "old" logic of view builder, which
will process the view and mark it as built.

Fixes scylladb/scylladb#26615
2025-10-22 00:05:40 +02:00
Avi Kivity
029513bee9 Merge 'storage_proxy: wait for write handlers destruction' from Petr Gusev
`shared_ptr<abstract_write_response_handler>` instances are captured in the `lmutate` and `rmutate` lambdas of `send_to_live_endpoints()`. As a result, an `abstract_write_response_handler` object may outlive its removal from the `storage_proxy::_response_handlers` map -> `cancel_all_write_response_handlers()` doesn't actually wait for requests completion -> `sp::drain_on_shutdown()` doesn't guarantee all requests are drained -> `sp::stop_remote()` completes too early and `paxos_store` is destroyed while LWT local writes might still be in progress. In this PR we introduce a `write_handler_destroy_promise` to wait for such pending instances in `cancel_write_handlers()` and `cancel_all_write_response_handlers()` to prevent the `use-after-free`.

A better long-term solution might be to replace `shared_ptr` with `unique_ptr` for `abstract_write_response_handler` and use a separate gate to track the `lmutate/rmutate` lambdas. We do not actually need to wait for these lambdas to finish before sending a timeout or error response to the client, as we currently do in `~abstract_write_response_handler`.

Fixes scylladb/scylladb#26355

backport: need to be backported to 2025.4 since #26355 is reproduced on LWT over tablets

Closes scylladb/scylladb#26408

* github.com:scylladb/scylladb:
  test_tablets_lwt: add test_lwt_shutdown
  storage_proxy: wait for write handler destruction
  storage_proxy: coroutinize cancel_write_handlers
  storage_proxy: cancel_write_handlers: don't hold a strong pointer to handler
2025-10-22 00:02:08 +03:00
Michał Hudobski
5c957e83cb vector_search: remove dependence on cql3
This patch removes the dependence of vector search module
on the cql3 module by moving the contents of cql3/type_json.hh
to types/json_utils.hh and removing the usage of cql3 primary_key
object in vector_store_client. We also make the needed adjustments
to files that were previously using the afformentioned type_json.hh
file.

This fixes the circular dependency cql3 <-> vector_search.

Closes scylladb/scylladb#26482
2025-10-21 17:41:55 +03:00
Michael Litvak
35711a4400 test: cdc: test cdc compatible schema
Add a simple test verifying our changes for the compatible CDC schema.
The test checks we can write to a table with CDC enabled after ALTER and
after node restart.
2025-10-21 14:14:34 +02:00
Michael Litvak
448e14a3b7 cdc: use compatiable cdc schema
in the CDC log transformer, when augmenting a base mutation, use the CDC
log schema that is compatible with the base schema, if set.

Now that the base schema has a pointer to its CDC schema, we can use it
instead of getting the current schema from the db, which may not be
compatible with the base schema.

The compatible CDC schema may not be set if the cluster is not using
raft mode for schema. In this case, we maintain the previous behavior.
2025-10-21 14:14:33 +02:00
Michael Litvak
6e2513c4d2 db: schema_applier: create schema with pointer to CDC schema
When creating a schema for a non-CDC table in the schema_applier, find
its CDC schema that we created previously in the same operation, if any,
and create the schema with a pointer to the CDC schema.

We use the fact that for a base table with CDC enabled, its CDC schema
is created or altered together in the same group0 operation.

Similarly, in schema_tables, when creating table schemas from the
schema tables, first create all schemas that don't have CDC enabled,
then create schemas that have CDC enabled by extending them with the
pointer to the CDC schema that we created before.

There are few additional cases where we create schemas that we need to
consider how to handle.

When loading a schema from schema tables in the schema_loader we decide
not to set the CDC schema, because this schema is mostly used for tools
and it's not used for generating CDC mutations.

When transporting a schema by RPC in the migration manager, we don't
transport its CDC schema, and we always set it to null. Because we use
raft we expect this shouldn't have any effect, because the schema is
synchronized through raft and not through the RPC.
2025-10-21 14:13:43 +02:00
Michael Litvak
4fe13c04a9 db: schema_applier: extract cdc tables
Previously in the schema applier we have two maps of schema_mutations,
for tables and for views. Now create another map for CDC tables by
extracting them from the non-views tables map.

We maintain the previous behavior by applying each operation that's done
on the tables map, to the CDC map as well.

Later we will want to handle CDC and non-CDC tables differently. We want
to be able to create all CDC schemas first, so when we create the
non-CDC tables we can create them with a pointer to their CDC schemas.
2025-10-21 14:13:43 +02:00
Michael Litvak
ac96e40f13 schema: add pointer to CDC schema
Add to the schema object a member that points to the CDC schema object
that is compatible with this schema, if any.

The compatible CDC schema is created and altered with its base schema in
the same group0 operation.

When generating CDC log mutations for some base mutation we want them to
be created using a compatible schema thas has a CDC column corresponding
to each base column. This change will allow us to find the right CDC
schema given a base mutation.

We also update the relevant structures in the schema registry that are
related to learning about schemas and transporting schemas across
shards or nodes.

When transporting a schema as frozen_schema, we need to transport the
frozen cdc schema as well, and set it again when unfreezing and
reconstructing the schema.

When adding a schema to the registry, we need to ensure its CDC schema
is added to the registry as well.

Currently we always set the CDC schema to nullptr and maintain the
previous behavior. We will change it in a later commit. Until then, we
mark all places where CDC schema is passed clearly so we don't forget
it.
2025-10-21 14:13:43 +02:00
Michael Litvak
60f5c93249 schema_registry: remove base_info from global_schema_ptr
remove the _base_info member from global_schema_ptr, and used the
base_info we have stored in the schema registry entry instead.

Currently when constructing a global_schema_ptr from a schema_ptr it
extracts and stores the base_info from the schema_ptr. Later it uses it
to reconstruct the schema_ptr, together with the frozen schema from the
schema registry entry.

But we can use the base_info that is already stored in the
schema registry entry.
2025-10-21 14:13:43 +02:00
Michael Litvak
085abef05d schema_registry: use extended_frozen_schema in schema load
Change the schema loader type in the schema_registry to return a
extended_frozen_schema instead of view_schema_and_base_info, and
remove view_schema_and_base_info which is not used anymore.

The casting between them is trivial.
2025-10-21 14:13:43 +02:00
Michael Litvak
8c7c1db14b schema_registry: replace frozen_schema+base_info with extended_frozen_schema
The schema_registry_entry holds a frozen_schema and a base_info. The
base_info is extracted from the schema_ptr on load of a schema_ptr, and
it is used when unfreezing the schema.

But this is exactly what extended_frozen_schema is doing, so we can
just store an object of this type in the schema_registry_entry.

This makes the code simpler because the schema registry doesn't need to
be aware of the base_info.
2025-10-21 14:13:43 +02:00
Michael Litvak
278801b2a6 frozen_schema: extract info from schema_ptr in the constructor
Currently we construct a frozen schema with base info in few places, and
the caller is responsible for constructing the frozen schema and extracting
the base info if it's a view table.

We change it to make it simpler and remove the burden from the caller.
The caller can simply pass the schema_ptr, and the constructor for
extended_frozen_schema will construct the frozen schema and extract
the additional info it needs. This will make it easier to add additional
fields, and reduces code duplication.

We also make temporary castings between extended_frozen_schema and
view_schema_and_base_info for the transition, which are trivial, until
they are combined to a single type.
2025-10-21 14:13:42 +02:00
Michael Litvak
154d5c40c8 frozen_schema: rename frozen_schema_with_base_info to extended_frozen_schema
This commit starts a series of refactoring commits of the frozen_schema
to reduce duplication and make it easier to extend.

Currently there are two essentially identical types,
frozen_schema_with_base_info and view_schema_and_base_info in the
schema_registry that hold a frozen_schema together with a base_info for
view schemas.

Their role is to pass around a frozen schema together with additional
info that is extracted from the schema and passed around with it when
transporting it across shards or nodes, and is needed for
reconstructing it, and it is not part of the schema mutations.

Our goal is to combine them to a single type that we will call
extended_frozen_schema.
2025-10-21 14:13:42 +02:00
Emil Maskovsky
cf93820c0a test/cluster: fix missing await in test_group0_tombstone_gc
The recursive call to alter_system_schema() was missing the await
keyword, which meant the coroutine was never actually executed and
the test wasn't doing what it was supposed to do.

Not backporting: Test fix only.

Closes scylladb/scylladb#26623
2025-10-21 11:22:39 +02:00
Calle Wilund
91db8583f8 test::pylib::kmip_wrapper: Modify to be usable by pytest fixtures
Add `serve` impl that does not mess with signals, and shutdown
that does not mess with threads. Also speed up standalone shutdown
to make boost tests less slow.
2025-10-21 09:01:55 +00:00
Calle Wilund
772bd856e2 test::boost::kmip_wrapper: Move python script for PyKMIP to pylib
Prepare for re-use in python tests as well as boost ones.
2025-10-21 09:01:54 +00:00
Avi Kivity
0ed178a01e build: disable the -fextend-variable-liveness clang option
In clang 21, the -fextend-variable-liveness option was made
default [1] with -Og. It helps reduce "optimized out" problems while
debugging.

However, it conflicts [2] with coroutines.

To prevent problems during the upgrade to Clang 21, disable the option.

[1] 36af7345df
[2] https://github.com/llvm/llvm-project/issues/163007

Closes scylladb/scylladb#26573
2025-10-21 10:47:34 +03:00
Botond Dénes
fbceb8c16b Merge 's3_client: handle failures which require http::request updating' from Ernest Zaslavsky
Apply two main changes to the s3_client error handling
1. Add a loop to s3_client's `make_request` for the case whe the retry strategy will not help since the request itself have to be updated. For example, authentication token expiration or timestamp on the request header
2. Refine the way we handle exceptions in the `chunked_download_source` background fiber, now we carry the original `exception_ptr` and also we wrap EVERY exception in `filler_exception` to prevent retry strategy trying to retry the request altogether

Fixes: https://github.com/scylladb/scylladb/issues/26483

Should be ported back to 2025.3 and 2025.4 to prevent deadlocks and failures in these versions

Closes scylladb/scylladb#26527

* github.com:scylladb/scylladb:
  s3_client: tune logging level
  s3_client: add logging
  s3_client: improve exception handling for chunked downloads
  s3_client: fix indentation
  s3_client: add max for client level retries
  s3_client: remove `s3_retry_strategy`
  s3_client: support high-level request retries
  s3_client: just reformat `make_request`
  s3_client: unify `make_request` implementation
2025-10-21 10:40:38 +03:00
Botond Dénes
c543059f86 Merge 'Synchronize tablet split and load-and-stream' from Raphael Raph Carvalho
Load-and-stream is broken when running concurrently to the finalization step of tablet split.

Consider this:
1) split starts
2) split finalization executes barrier and succeed
3) load-and-stream runs now, starts writing sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing sstable
6) sstable cannot be loaded since it spans two tablets

two possible fixes (maybe both):

1) load-and-stream awaits for topology to quiesce
2) perform split compaction on sstable that spans both sibling tablets

This patch implements # 1. By awaiting for topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance coordinator is handling some topology operation like
split finalization.

Fixes https://github.com/scylladb/scylladb/issues/26455.

Closes scylladb/scylladb#26456

* github.com:scylladb/scylladb:
  test: Add reproducer for l-a-s and split synchronization issue
  sstables_loader: Synchronize tablet split and load-and-stream
2025-10-21 09:43:38 +03:00
Tomasz Grabiec
ba692d1805 schema_tables: Keep "replication" column backwards-compatible by expanding rack lists to numeric RF
In 380f243986 we added support for rack
lists in replication options. Drivers which are not prepared to parse
that (as of now, all of them), will not create metadata object for
that keyspace. This breaks, for example, the "copy to/from" cqlsh
command. Potentially other things too.

To fix that, keep the "replication" column in the old format, and
store numeric RF there, which corresponds to the number of
replicas. Accurate options in the new format are put in
"replication_v2".

We set replication_v2 in the schema only when it differs from the old
"replication" so that the new column is not set during upgrade,
otherwise downgrade would fail. Partition tombstone is added to ensure
that pre-alter replication_v2 value is deleted on alters which change
replication to a value which is the same as the post-alter
"replication" value.

Fixes #26415

Closes scylladb/scylladb#26429
2025-10-21 09:11:25 +03:00
Tomasz Grabiec
e4e79be295 Merge 'tablet_allocator: allow merges in base tables if rf-rack-valid=true' from Piotr Dulikowski
Tablet merge of base tables is only safe if there is at most one replica in each rack. For more details on why it is the case please see scylladb/scylladb#17265. If the rf-rack-valid-keyspaces is turned on, this condition is satisfied, so allow it in that case.

Fixes: scylladb/scylladb#26273

Marked for backport to 2025.4 as MVs are getting un-experimentaled there.

Closes scylladb/scylladb#26278

* github.com:scylladb/scylladb:
  test: mv: add a test for tablet merge
  tablet_allocator, tests: remove allow_tablet_merge_with_views injection
  tablet_allocator: allow merges in base tables if rf-rack-valid=true
2025-10-21 00:18:30 +02:00
Raphael S. Carvalho
4654cdc6fd test: Add reproducer for l-a-s and split synchronization issue
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-10-20 19:17:25 -03:00
Raphael S. Carvalho
3abc66da5a sstables_loader: Synchronize tablet split and load-and-stream
Load-and-stream is broken when running concurrently to the
finalization step of tablet split.

Consider this:
1) split starts
2) split finalization executes barrier and succeed
3) load-and-stream runs now, starts writing sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing sstable
6) sstable cannot be loaded since it spans two tablets

two possible fixes (maybe both):

1) load-and-stream awaits for topology to quiesce
2) perform split compaction on sstable that spans both sibling tablets

This patch implements #1. By awaiting for topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance coordinator is handling some topology operation like
split finalization.

Fixes #26455.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-10-20 19:17:22 -03:00
Piotr Dulikowski
f76917956c view_building_worker: access tablet map through erm on sstable discovery
Currently, the data returned by `database::get_tables_metadata()` and
`database::get_token_metadata()` may not be consistent. Specifically,
the tables metadata may contain some tablet-based tables before their
tablet maps appear in the token metadata. This is going to be fixed
after issue scylladb/scylladb#24414 is closed, but for the time being
work around it by accessing the token metadata via
`table`->effective_replication_map() - that token metadata is guaranteed
to have the tablet map of the `table`.

Fixes: scylladb/scylladb#26403

Closes scylladb/scylladb#26588
2025-10-21 00:14:39 +02:00
Petr Gusev
8925f31596 test_tablets_lwt: add test_lwt_shutdown 2025-10-20 20:16:09 +02:00
Petr Gusev
bbcf3f6eff storage_proxy: wait for write handler destruction
shared_ptr<abstract_write_response_handler> instances are captured in
the lmutate/rmutate lambdas of send_to_live_endpoints(). As a result,
an abstract_write_response_handler object may outlive its removal from
the _response_handlers map. We use write_handler_destroy_promise to
wait for such pending instances in cancel_write_handlers() and
cancel_all_write_response_handlers() to prevent use-after-free.

A better long-term solution might be to replace shared_ptr with
unique_ptr for abstract_write_response_handler and use a separate gate
to track the lmutate/rmutate lambdas. We do not actually need to wait
for these lambdas to finish before sending a timeout or error response
to the client, as we currently do in ~abstract_write_response_handler.

Fixes scylladb/scylladb#26355
2025-10-20 20:10:42 +02:00
Petr Gusev
b269f78fa6 storage_proxy: coroutinize cancel_write_handlers
The cancel_write_handlers() method was assumed to be called in a thread
context, likely because it was first used from gossiper events, where a
thread context already existed. Later, this method was reused in
abort_view_writes() and abort_batch_writes(), where threads are created
on the fly and appear redundant.

The drain_on_shutdown() method also used a thread, justified by some
"delicate lifetime issues", but it is unclear what that actually means.
It seems that a straightforward co_await should work just fine.
2025-10-20 19:49:02 +02:00
Petr Gusev
bf2ac7ee8b storage_proxy: cancel_write_handlers: don't hold a strong pointer to handler
A strong pointer was held for the duration of thread::yield(),
preventing abstract_write_response_handler destruction and possibly
delaying the sending of timeout or error responses to the client.

This commit removes the strong pointer. Instead, we compute the
next iterator before calling timeout_cb(), so if the handler is
destroyed inside timeout_cb(), we already have a valid next iterator.
2025-10-20 19:49:02 +02:00
Piotr Wieczorek
a3ec6c7d1d alternator/streams: Support userIdentity field for TTL deletions
UserIdentity is a map of two fields in GetRecords responses, which
always has the same value. It may be missing, or contain a constant
object with value `{"type": "Service", "principalId":
"dynamodb.amazonaws.com"}`. Currently, the latter is set only for
`REMOVE`s triggered by TTL.

This commit introduces two new CDC operation types: `service_row_delete`
and `service_partition_delete`, emitted in place of `row_delete` and
`partition_delete`. Alternator Streams treats them as regular `REMOVE`s,
but in addition adds the `userIdentity` field to the record.

This change may break existing Scylla libraries for reading raw CDC
tables, but we doubt that anybody has this use case.

Refs https://github.com/scylladb/scylladb/pull/26149
Refs https://github.com/scylladb/scylladb/pull/26121
Fixes https://github.com/scylladb/scylladb/issues/11523

Closes scylladb/scylladb#26460
2025-10-20 17:15:59 +02:00
Nadav Har'El
eb06ace944 Merge 'auth: implement vector store authorization' from Michał Hudobski
This patch implements the changes required by the Vector Store authorization, as described in https://scylladb.atlassian.net/wiki/spaces/RND/pages/107085899/Vector+Store+Authentication+And+Authorization+To+ScyllaDB, that is:

- adding a new permission VECTOR_SEARCH_INDEXING, grantable only on ALL KEYSPACES
- allowing users with that permission to perform SELECT queries, but only on tables with a vector index
- increasing the number of scheduling groups by one to allow users to create a service level for a vector store user
- adjusting the tests and documentation

These changes are needed, as the vector indexes are managed by the external service, Vector Store, which needs to read the tables to create the indexes in its memory. We would like to limit the privileges of that service to a minimum to maintain the principle of least privilege, therefore a new permission, one that allows the SELECTs conditional on the existence of a vector_index on the table.

Fixes: VECTOR-201

Backport reasoning:
Backport to 2025.4 required as this can make upgrading clusters more difficult if we add it in 2026.1. As for now Scylla Cloud requires version 2025.4 to enable vector search and permission is set by orchestrator so there is no chance that someone will try to add this permission during upgrade. In 2026.1 it will be more difficult.

Closes scylladb/scylladb#25976

* github.com:scylladb/scylladb:
  docs: adjust docs for VS auth changes
  test: add tests for VECTOR_SEARCH_INDEXING permission
  cql: allow VECTOR_SEARCH_INDEXING users to select
  auth: add possibilty to check for any permission in set
  auth: add a new permission VECTOR_SEARCH_INDEXING
2025-10-20 17:32:00 +03:00
Ernest Zaslavsky
fdd0d66f6e s3_client: tune logging level
Change all logging related to errors in `chunked_download_source` background download fiber to `info` to make it visible right away in logs.
2025-10-20 17:12:59 +03:00
Ernest Zaslavsky
4497325cd6 s3_client: add logging
Add logging for the case when we encounter expired credentials, shouldnt happen but just in case
2025-10-20 17:12:59 +03:00
Ernest Zaslavsky
1d34657b14 s3_client: improve exception handling for chunked downloads
Refactor the wrapping exception used in `chunked_download_source` to
prevent the retry strategy from reattempting failed requests. The new
implementation preserves the original `exception_ptr`, making the root
cause clearer and easier to diagnose.
2025-10-20 17:12:59 +03:00
Ernest Zaslavsky
58a1cff3db s3_client: fix indentation
Reformat `client::make_request` to fix the indentation of `if` block
2025-10-20 17:12:59 +03:00
Ernest Zaslavsky
43acc0d9b9 s3_client: add max for client level retries
To prevent client retrying indefinitely time skew and authentication errors add `max_attempts` to the `client::make_request`
2025-10-20 17:12:59 +03:00
Ernest Zaslavsky
116823a6bc s3_client: remove s3_retry_strategy
It never worked as intended, so the credentials handling is moving to the same place where we handle time skew, since we have to reauthenticate the request
2025-10-20 17:12:59 +03:00
Ernest Zaslavsky
185d5cd0c6 s3_client: support high-level request retries
Add an option to retry S3 requests at the highest level, including
reinitializing headers and reauthenticating. This addresses cases
where retrying the same request fails, such as when the S3 server
rejects a timestamp older than 15 minutes.
2025-10-20 17:12:59 +03:00
Dawid Mędrek
7e201eea1a index: Set tombstone_gc when creating secondary index
Before this commit, when the underlying materialized view was created,
it didn't have the property `tombstone_gc` set to any value. That
was a bug and we fix it now.

Two reproducer tests is added for validation. They reproduce the problem
and don't pass before this commit.

Fixes scylladb/scylladb#26542
2025-10-20 14:04:45 +02:00
Dawid Mędrek
e294b80615 index: Make create_view_for_index method of create_index_statement 2025-10-20 14:04:16 +02:00
Dawid Mędrek
fe00485491 index: Move code for creating MV of secondary index to cql3
We move the code responsible for creating the schema for the underlying
materialized view of a secondary index from `index/` to `cql3/` so that
it's close to that responsible for performing `CREATE INDEX`. That's in
line with how other CQL statements are designed.

Note that the moved method is still a method of `secondary_index_manager`.
We'll make it a method of `create_index_statement` in the following
commit.
2025-10-20 14:04:11 +02:00
Dawid Mędrek
20761b5f13 db, cql3: Move creation of underlying MV for index
The main goal of this patch is to give more control over the creation
of the underlying view on an index to `create_index_statement.cc`.
That goal is in line with how the other statements are executed:
the schema is built in the cql3 module and only the ready schema_ptr
is passed further. That should also make the code cleaner and easier
to understand.

There are a few important things to note here:

* A call to `service::prepare_new_view_announcement` appears out of nowhere.
  Aside from some validation checks and logging, that function does pretty
  much the same as the pre-existing code we remove:

  a. It creates Raft mutations based on the passed `view_ptr`.
  b. It creates Raft mutations responsible for view building tasks.
  c. It notifies about a new column family.

* We seemingly get rid of the code that creates view building tasks. That's not
  true: we still do that via `service::prepare_new_view_announcement`.

That should explain why the change doesn't remove any relevant logic.
On the other hand, it might be more difficult to explain why moving the
code is correct. I'll touch on it below.

Before that, it may also be important to highlight that this commit only
affects the logic responsible for creating an index. There should be no
effect on any other part of how Scylla behaves.

---

Proving the correctness of the solution would take quite a lot of space,
so I'll only summarize it. It relies on a few things:

1. Two schema changes cannot happen in one operation. We allow for more
   but only when those changes are dependent on each other and when
   the additional ones are internal for Scylla, e.g. creating an index
   leads to creating the underlying materialized view.
2. There are no entities or components that rely on indexes.
3. Each index is uniquely defined by the keyspace it belongs to
   and the name of the index.
4. There is a bijection between rows in `system_schema.indexes`
   and the currently existing indexes.
5. The name of an unnamed index depends on the name of the base table
   and the names of the indexed columns. The name of an unnamed index
   may have a number attached to it, but that number only depends on
   the state of the schema at the time of creation of the index, and
   it never changes later on. There are no other things the name of
   an unnamed index depends on.
6. Scylla doesn't allow for changing any column in the base table
   that has an index depending on it.

Based on that, we conclude that every existing index has exactly one
entry in `system_schema.indexes`, and the primary key of that entry
never changes.

The columns of `system_schema.indexes` that are not part of the primary
key are: `kind` and `options`. Both values are only decided at the time
of creation of an index, and currently there's no way to modify them.

That implies that there are only two events when an entry in the system
table can change: when creating an index and when dropping an index.

---

When we consider the previous place of the logic that this commit moves
to `cql3/statements/create_index_statement.cc`, it works like this:

1. We compare the sets of indexes defined on a specific table
   (in the form of a structure called `index_metadata`) before and
   after an operation.
2. We divide the entries into three sets: those present in both sets
   and those present in only one of them.
3. We handle each of those three sets separately.

The structure `index_metadata` is a reflection of entries in
`system_schema.indexes`. It stores one more parameter -- `local` --
but its value depends on the other values of an entry, so we can ignore
it in this reasoning.

Because an index cannot be modified -- it can only be created or dropped
-- there are at most two non-empty sets: the set of new indexes and the
set of dropped indexes. Those sets are only non-empty during an operation
like `CREATE INDEX`, `DROP INDEX`, `DROP TABLE (base table)`,
`DROP KEYSPACE`. Note that it's impossible to drop an index by dropping
the underlying materialized view -- Scylla doesn't allow for that.

However, the code in `migration_manager.cc` we call
(`prepare_column_family_update_announcement`) and the code that we call
in `schema_tables.cc` (`make_update_table_mutations`) is only triggered
by *updates* related to the base table. In the context of `DROP TABLE`
or `DROP KEYSPACE`, we'd call `prepare_column_family_drop_announcement`
instead. In other words, we're only concerned with `CREATE INDEX` and
`DROP INDEX`.

---

A conclusion from this reasoning is that we only need to consider those
two situations when talking about correctness of this change. The impact
of this commit is that we may have potentially reordered mutations in the
resulting vector that will be applied to the Raft log.

The only mutations we may have reordered are the mutations responsible for
creating the underlying view and the mutations responsible for updating
columns in the base table. It's clear then that this commit brings no change
at all: we only give `cql3/statements/create_index_statement.cc` more
control over creating the underlying view.

---

We leave a remnant of the code in `db/schema_tables.cc` responsible
for dropping an index along with its underlying view. It would require
changing a bit more of the logic, and we don't need it for the rest
of this sequence of changes.

Refs scylladb/scylladb#16454
2025-10-20 14:04:06 +02:00
Łukasz Paszkowski
7ec369b900 database: Log message after critical_disk_utilization mode is set
This is a follow-up of the previous fix: https://github.com/scylladb/scylladb/pull/26030

The test test_user_writes_rejection starts a 3-node cluster and
creates a large file on one of the nodes, to trigger the out-of-space
prevention mechanism, which should reject writes on that node.

It waits for the log message 'Setting critical disk utilization mode: true'
and then executes a write expecting the node to reject it.

Currently, the message is logged before the `_critical_disk_utilization`
variable is actually updated. This causes the test to fail sporadically
if it runs quickly enough.

The fix splits the logging into two steps:
1. "Asked to set critical disk utilization mode" - logged before any action
2) "Set critical disk utilization mode" - logged after `_critical_disk_utilization` has been updated

The tests are updated to wait for the second message.

Fixes https://github.com/scylladb/scylladb/issues/26004

Closes scylladb/scylladb#26392
2025-10-20 13:24:10 +03:00
Asias He
33bc1669c4 repair: Fix uuid and nodes_down order in the log
Fixes #26536

Closes scylladb/scylladb#26547
2025-10-20 13:21:59 +03:00
Pavel Emelyanov
44ed3bbb7c Merge 'RFC: Initial GCP storage backend for scylla (sstables + backup)' from Calle Wilund
Integrates GCP object storage as a working storage backend for scylla sstables as well as backup storage.

Adds an abstraction layer (atm very heavily designed around the s3 client interface and usage) to allow the "storage" etc layers of sstable management to pick transparently between "s3" and "gs" providers.

This modifies the scylla config such that endpoints can optionally (through a "type" param) ref a GS backend.
Similarly with storage_options.

Also adds some IO wrapping primitives to make it more feasible to place some logic at a mid level of the implementation stack (such as making networked storage files, ranged reading etc).

Test s3 fixture is replaced (where appropriate) with an `object_storage` fixture that multiplexes the test across both backends.
Unit tests are duplicated and for the GS versions use a boost test fixture for GCS, default local fake.

Fixes #25359
Fixes #26453

Closes scylladb/scylladb#26186

* github.com:scylladb/scylladb:
  docs::dev::object_storage: Add some initial info on GS storage
  docs/dev: Add mention of (nested) docker usage in testing.md
  sstables::object_storage_client: Forward memory limit semaphore to GS instance
  utils::gcp::object_storage: Add optional memory limits to up/download
  sstables::object_storage_client: Add multi-upload support for GS
  utils::gcp::storage: Add merge objects operation
  test_backup/test_basic: Make tests multiplex both s3 and gs backends
  test::cluster::conftest: Add support for multiple object storage backends
  boost::gcs_storage_test: reindent
  boost::gcs_storage_test: Convert to use fixture
  tests::boost: Add GS object storage cases to mirror S3 ones
  tests::lib::gcs_fixture: Add a reusable test fixture for real/fake GS/GCS
  tests::lib::test_utils: Add overloads/helpers for reading and (temp) writing env
  sstables::object_storage_client: Add google storage implementation
  test_services: Allow testing with GS object storage parameters
  utils::gcp::gcp_credentials: Add option to create uninitialized credentials
  utils::gcp::object_storage: Make create_download_source return seekable_data_source
  utils::gcp::object_storage: Add defensive copies of string_view params
  utils::gcp::object_storage: Add missing retry backoff increate
  utils::gcp::object_storage: Add timestamp to object listing
  utils::gcp::object_storage: Add paging support to list_objects
  object_storage_client: Add object_name wrapper type
  utils::gcp::object_storage: Add optional abort_source
  utils::rest::client: Add abort_source support
  sstables: Use object_storage_client for remote storage
  sstables::object_storage_client: Add abstraction layer for OS cliens (s3 initial)
  s3::upload_progress: Promote to general util type
  storage_options: Abstract s3 to "object_storage" and add gs as option
  sstables::file_io_extension: Change "creator" callback to just data_source
  utils::io-wrappers: Add ranged data_source
  utils::io-wrappers: Add file wrapper type for seekable_source
  utils::seekable_source: Add a seekable IO source type
  object_storage_endpoint_param: Add gs storage as option
  config: break out object_storage_endpoint_param preparing for multi storage
2025-10-20 13:14:53 +03:00
Patryk Jędrzejczak
c57f097630 test: test group0 tombstone GC in the Raft-based recovery procedure
We add a regression test for the bug fixed in the previous commits.
2025-10-20 12:05:11 +02:00
Patryk Jędrzejczak
6b2e003994 group0_state_id_handler: remove unused group0_server_accessor
It became unused in the previous commit.
2025-10-20 12:05:11 +02:00
Patryk Jędrzejczak
1d09b9c8d0 group0_state_id_handler: consider state IDs of all non-ignored topology members
It's not enough to consider only the current group 0 members. In the
Raft-based recovery procedure, there can be nodes that haven't joined
the current group 0 yet, but they have belonged to a different group 0
and thus have a non-empty group 0 state ID.

We fix this issue in this commit by considering topology members
instead.

We don't consider ignored nodes as an optimization. When some nodes are
dead, the group 0 state ID handler won't have to wait until all these
nodes leave the cluster. It will only have to wait until all these nodes
are ignored, which happens at the beginning of the first
removenode/replace. As a result, tombstones of group 0 tables will be
purged much sooner.

We don't rename the `group0_members` variable to keep the change
minimal. There seems to be no precise and succinct name for the used set
of nodes anyway.

We use `std::ranges::join_view` in one place because:
- `std::ranges::concat` will become available in C++26,
- `boost::range::join` is not a good option, as there is an ongoing
  effort to minimize external dependencies in Scylla.
2025-10-20 12:05:07 +02:00
Avi Kivity
87c0adb2fe gdb: simplify and future-proof looking up coroutine frame type
llvm recently updated [1] their coroutine debugging instructions.
They now recommend looking up the variable __coro_frame in the coroutine
function rather than constructing the name of the coroutine frame type
from the ramp function plus __coro_frame_ty.

Since the latter method no longer works with Clang 21 (I did not check
why), and since the former method is blessed as being more compatible,
switch to the recommended method. Since it works with both Clang 20 and
Clang 21, it future proofs the script.

[1] 6e784afcb5

Closes scylladb/scylladb#26590
2025-10-20 12:38:53 +03:00
Botond Dénes
1ab697693f Merge 'compaction/twcs: fix use after free issues' from Lakshmi Narayanan Sreethar
The `compaction_strategy_state` class holds strategy specific state via
a `std::variant` containing different state types. When a compaction
strategy performs compaction, it retrieves a reference to its state from
the `compaction_strategy_state` object. If the table's compaction
strategy is ALTERed while a compaction is in progress, the
`compaction_strategy_state` object gets replaced, destroying the old
state. This leaves the ongoing compaction holding a dangling reference,
resulting in a use after free.

Fix this by using `seastar::shared_ptr` for the state variant
alternatives(`leveled_compaction_strategy_state_ptr` and
`time_window_compaction_strategy_state_ptr`). The compaction strategies
now hold a copy of the shared_ptr, ensuring the state remains valid for
the duration of the compaction even if the strategy is altered.

The `compaction_strategy_state` itself is still passed by reference and
only the variant alternatives use shared_ptrs. This allows ongoing
compactions to retain ownership of the state independently of the
wrapper's lifetime.

The method `maybe_wait_for_sstable_count_reduction()`, when retrieving
the list of sstables for a possible compaction, holds a reference to the
compaction strategy. If the strategy is updated during execution, it can
cause a use after free issue. To prevent this, hold a copy of the
compaction strategy so it isn’t yanked away during the method’s
execution.

Fixes #25913

Issue probably started after 9d3755f276, so backport to 2025.4

Closes scylladb/scylladb#26593

* github.com:scylladb/scylladb:
  compaction: fix use after free when strategy is altered during compaction
  compaction/twcs: pass compaction_strategy_state to internal methods
  compaction_manager: hold a copy to compaction strategy in maybe_wait_for_sstable_count_reduction
2025-10-20 10:45:47 +03:00
Ernest Zaslavsky
db1ca8d011 s3_client: just reformat make_request
Just reformat previously changed methods to improve readability
2025-10-20 10:44:37 +03:00
Israel Fruchter
986e8d0052 Update tools/cqlsh submodule (v6.0.27)
* tools/cqlsh ff3f572...f852b1f5 (2):
  > Add LZ4 as a required package - so ScyllaDB Python driver could use LZ4 compression
  > github actions: replace macos-13 with macos-15-intel

Closes scylladb/scylladb#26608
2025-10-20 10:03:31 +03:00
Michael Litvak
b808d84d63 storage_service: improve colocated repair error to show table names
When requesting repair for tablets of a colocated table, the request
fails with an error. Improve the error message to show the table names
instead of table IDs, because the table names are more useful for users.

Fixes scylladb/scylladb#26567

Closes scylladb/scylladb#26568
2025-10-20 10:03:31 +03:00
Piotr Dulikowski
70b0cfb13e Merge 'test: cluster: Replica exceptions tests' from Dario Mirovic
This patch series introduces several tests that check number of exceptions that happens during various replica operations. The goal is to have a set of tests that can catch situations where number of exceptions per operation increases. It makes exception throw regressions easier to catch.

The tests cover apply counter update and apply functionalities in the database layer.

There are more paths that can be checked, like various semaphore wait timeouts located deeper in the code. This set of tests does not cover all code paths.

Fixes #18164

This is an improvement. No backport needed.

Closes scylladb/scylladb#25992

* github.com:scylladb/scylladb:
  test: cluster: test replica write timeout
  database: parameterize apply_counter_update_delay_5s injector value
  test: cluster: test replica exceptions - test rate limit exceptions
2025-10-20 10:03:31 +03:00
Piotr Dulikowski
a716fab125 Merge 'alternator/metrics: Log operation sizes to histograms' from Piotr Wieczorek
This PR adds operation per-table histograms to Alternator with item sizes involved in an operation, for each of the operations: `GetItem`, `PutItem`, `DeleteItem`, `UpdateItem`, `BatchGetItem`, `BatchWriteItem`. If read-before-write wasn't performed (i.e. it was not needed by the operation and the flag `alternator_force_read_before_write` was disabled), then we log sizes of the items that are in the request. Also, `UpdateItem` logs the maximum of the update size and the existing item size. We'll change it in a next PR.

Fixes: #25143

Closes scylladb/scylladb#25529

* github.com:scylladb/scylladb:
  alternator: Add UpdateItem and BatchWriteItem response size metrics
  alternator: Add PutItem and DeleteItem response size metrics
  alternator: Add BatchGetItem response size metrics
  alternator: Add GetItem response size metrics
  alternator/test: Add more context to test_metrics.py asserts
2025-10-20 10:03:31 +03:00
Lakshmi Narayanan Sreethar
18c071c94b compaction: fix use after free when strategy is altered during compaction
The `compaction_strategy_state` class holds strategy specific state via
a `std::variant` containing different state types. When a compaction
strategy performs compaction, it retrieves a reference to its state from
the `compaction_strategy_state` object. If the table's compaction
strategy is ALTERed while a compaction is in progress, the
`compaction_strategy_state` object gets replaced, destroying the old
state. This leaves the ongoing compaction holding a dangling reference,
resulting in a use after free.

Fix this by using `seastar::shared_ptr` for the state variant
alternatives(`leveled_compaction_strategy_state_ptr` and
`time_window_compaction_strategy_state_ptr`). The compaction strategies
now hold a copy of the shared_ptr, ensuring the state remains valid for
the duration of the compaction even if the strategy is altered.

The `compaction_strategy_state` itself is still passed by reference and
only the variant alternatives use shared_ptrs. This allows ongoing
compactions to retain ownership of the state independently of the
wrapper's lifetime.

Fixes #25913

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-10-17 22:57:05 +05:30
Lakshmi Narayanan Sreethar
35159e5b02 compaction/twcs: pass compaction_strategy_state to internal methods
During TWCS compaction, multiple methods independently fetch the
compaction_strategy_state using get_state(). This can lead to
inconsistencies if the compaction strategy is ALTERed while the
compaction is in progress.

This patch fixes a part of this issue by passing down the state to the
lower level methods as parameters instead of fetching it repeatedly.

Refs #25913

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-10-17 21:26:30 +05:30
Lakshmi Narayanan Sreethar
1cd43bce0e compaction_manager: hold a copy to compaction strategy in maybe_wait_for_sstable_count_reduction
The method `maybe_wait_for_sstable_count_reduction()`, when retrieving
the list of sstables for a possible compaction, holds a reference to the
compaction strategy. If the strategy is updated during execution, it can
cause a use after free issue. To prevent this, hold a copy of the
compaction strategy so it isn’t yanked away during the method’s
execution.

Refs #26546
Refs #25913

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-10-17 21:26:30 +05:30
Dario Mirovic
1d93f342f9 test: cluster: test replica write timeout
This patch introduces test `test_replica_database_apply_timeout`.
It tests timeout on database write. The test uses error injection
that returns timeout error if the injection `database_apply_force_timeout`
is enabled.

Refs #18164
2025-10-17 11:52:11 +02:00
Dario Mirovic
ff88fe2d76 database: parameterize apply_counter_update_delay_5s injector value
Parameterize `apply_counter_update_delay_5s` injector value. Instead of
sleeping 5s when the injection is active, read parameter value that
specifies sleep duration. To reflect these changes, it is renamed to
`apply_counter_update_delay_ms` and the sleep duration is specified in
milliseconds.

Refs #18164
2025-10-17 11:52:10 +02:00
Dario Mirovic
7dc0ff2152 test: cluster: test replica exceptions - test rate limit exceptions
This patch introduces two tests for `replica::rate_limit_exception`.
One test is for write/apply limit, the other one for read/query limit.

The tests check the number of rate limit errors reported and the
number of cpp exceptions reported. If somebody adds an exception
throw on the rate limit paths, this test will catch it and fail.

Refs #18164
2025-10-17 10:54:43 +02:00
Botond Dénes
01bcafbe24 Merge 'test: make various improvements in the recovery procedure tests' from Patryk Jędrzejczak
This PR contains various improvements in the recovery procedure
tests, mostly `test_raft_recovery_user_data`:
- decreasing the running time,
- some simplifications,
- making sure group 0 majority is lost when expected.

These are not critical test changes, so no need to backport.

Closes scylladb/scylladb#26442

* github.com:scylladb/scylladb:
  test: assert that majority is lost in some tests of the recovery procedure
  test: rest_client: add timeout support for read_barrier
  test: test_raft_recovery_user_data: lose majority when killing one dc
  test: test_raft_recovery_user_data: shutdown driver sessions
  test: test_raft_recovery_user_data: use a separate driver connection for the write workload
  test: test_raft_recovery_user_data: send ALTER KEYSPACE to any node
  test: test_raft_recovery_user_data: bring failure_detector_timeout_in_ms back to 20 s
  test: test_raft_recovery_user_data: speed up replace operations
  test: stop/start servers concurrently in the recovery procedure tests
2025-10-17 10:54:05 +03:00
Piotr Wieczorek
a2b9d7eed5 alternator: Split update_item_operation::apply into smaller methods
This is a minor refactoring aimed at reducing cognitive complexity of
`update_item_operation::apply`. The logic remains unchanged.

Closes scylladb/scylladb#25887
2025-10-17 09:51:05 +02:00
Taras Veretilnyk
d9be2ea69b docs: improve nodetool getendpoints documentation
Clarified and expanded the documentation for the nodetool getendpoints command,
including detailed explanations of the --key and --key-components options.
Added examples demonstrating usage with simple and composite partition keys.

Closes scylladb/scylladb#26529
2025-10-17 10:40:54 +03:00
Pawel Pery
10208c83ca vector_search: fix flaky dns_refresh_aborted test
The test process like that:
- run long dns refresh process
- request for the resolve hostname with short abort_source timer - result
  should be empty list, because of aborted request

The test sometimes finishes long dns refresh before abort_source fired and the
result list is not empty.

There are two issues. First, as.reset() changes the abort_source timeout. The
patch adds a get() method to the abort_source_timeout class, so there is no
change in the abort_source timeout. Second, a sleep could be not reliable. The
patch changes the long sleep inside a dns refresh lambda into
condition_variable handling, to properly signal the end of the dns refresh
process.

Fixes: #26561
Fixes: VECTOR-268

It needs to be backported to 2025.4

Closes scylladb/scylladb#26566
2025-10-17 09:33:17 +02:00
Pavel Emelyanov
7d0722ba5c code: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-10-17 10:26:50 +03:00
Pavel Emelyanov
a88a36f5b5 code: Switch to seastar API level 9
In the new API the biggest change is to implement the only
data_sink_impl::put(span<temporary_buffer>) overload.

Encrypted file impl and sstables compress sink use fallback_put() helper
that generates a chain of continuations each holding a buffer.

The counting_data_sink in transport had mostly been patched to correct
implementation by the previous patch, the change here is to replace
vector argument with span one.

Most other sinks just re-implement their put(vector<temporary_buffer>)
overload by iterating over span and non-preemptively grabbing buffers
from it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-10-17 10:26:50 +03:00
Pavel Emelyanov
9ece535b5e transport: Open-code invoke_with_counting into counting_data_sink::put
The former helper is implemented like this:

    future<> invoke_with_counting(fn) {
        if (not_needed)
            return fn();

        return futurize_invoke(something).then([fn] {
            return fn()
        }).finally(something_else);
    }

and all put() overloads are like

    future<> put(arg) {
        return invoke_with_counting([this, arg] {
            return lower_sink.put(arg);
        });
    }

The problem is that with seastar API level 9, the put() overload will
have to move the passed buffers into stable storage before preempting.
In its current implementation, when counting is needed the
invoke_with_counting will link lower_sink.put() invocation to the
futurize_invoke(something) future. Despite "something" is
non-preempting, and futurize_invoke() on it returns ready future, in
debug mode ready_future.then() does preempt, and the API level 9 put()
contract will be violated.

To facilitate the switch to new API level, this patch rewrites one of
put() overloads to look like

    future<> put(arg) {
        if (not_needed) {
            return lower_sink.put(arg);
        }

        something;
        return lower_sink(arg).finally(something_else);
    }

Other put()-s will be removed by next patch anyway, but this put() will
be patched and will call lower_sink.put() without preemption.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-10-17 10:26:30 +03:00
Pavel Emelyanov
068d788084 transport: Don't use scattered_message
The API to put scattered_message into output_stream() is gone in seastar
API level 9, transport is the only place in Scylla that still uses it.

The change is to put the response as a sequence of temporary_buffer-s.
This preserves the zero-copy-ness of the reply, but needs few things to
care about.

First, the response header frame needs to be put as zero-copy buffer
too. Despite output_stream() supports semi-mixed mode, where z.c.
buffers can follow the buffered writes, it won't apply here. The socket
is flushed() in batched mode, so even if the first reply populates the
stream with data and flushes it, the next response may happen to start
putting the header frame before delayed flush took place.

Second, because socket is flushed in batch-flush poller, the temporary
buffers that are put into it must hold the foreigh_ptr with the response
object. With scattered message this was implemented with the help of a
delter that was attached to the message, now the deleter is shared
between all buffers.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-10-17 10:17:08 +03:00
Pavel Emelyanov
d9808fafdb utils: Implement memory_data_sink::put(net::packet)
It's going to be removed by next-after-next patch, but the next one
needs this overload implemented properly, so here it is.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-10-17 10:17:08 +03:00
Tomasz Grabiec
c4a87453a2 Merge 'Add experimental feature flag for strongly consistent tables and extend kesypace creation syntax to allow specifying consistency mode.' from Gleb Natapov
The series adds an experimental flag for strongly consistent tables  and extends "CREATE KEYSPACE" ddl with `consistency` option that allows specifying the consistency mode for the keyspace.

Closes scylladb/scylladb#26116

* github.com:scylladb/scylladb:
  schema: Allow configuring consistency setting for a keyspace
  db: experimental consistent-tablets option
2025-10-16 21:48:06 +02:00
Tomasz Grabiec
e6c427953e Merge 'schema_applier: unify handling of token_metadata during schema change' from Marcin Maliszkiewicz
This patchset improves the atomicity and clarity of schema application in
the presence of token metadata updates during schema changes. The primary
focus is to ensure that changes to tablet metadata are applied atomically
as part of the schema commit phase, rather than being replicated to all
cores afterward, which previously violated atomicity guarantees.

Key changes:

- Introduced pending_token_metadata to unify handling of new and existing metadata.
- Split token metadata replication into prepare and commit steps.
- Abstracted schema dependencies in storage_service to support pending schema visibility.
- Applied tablet metadata updates atomically within schema commit phase.

Backport: no, it's a new feature
Fixes: https://github.com/scylladb/scylladb/issues/24414

Closes scylladb/scylladb#25302

* github.com:scylladb/scylladb:
  db: schema_applier: update tablet metadata atomically
  db: replica: move tables_metadata locking to commit
  storage_service: abstract schema dependecies during token metadata update
  storage_service: split replicate_to_all_cores to steps
  db: schema_applier: unify token_metadata loading
  replica: schema_applier: obtain copy of token_metadata at the beginning of schema merge
  service: fix dependencies during migration_manager startup
  db: schema_applier: move pending_token_metadata to locator
  db: always use _tablet_hint as condition for tablet metadata change
  db: refactor new_token_metadata into pending_token_metadata
  db: rename new_token_metadata to pending_token_metadata
  db: schema_applier: move types storage init to merge_types func
  db: schema_applier: make merge functions non-static members
  db: remove unused proxy from create_keyspace_metadata
2025-10-16 21:43:49 +02:00
Piotr Wieczorek
caa522a29d alternator: Add UpdateItem and BatchWriteItem response size metrics
This commit bundle introduces metrics on item sizes for Alternator operations.

The new metrics are:
- `operation_size_kib op=UpdateItem`: Tracks the size of an `UpdateItem`
  operation. This is calculated as the sum of the existing item's size
  plus the estimated size of the updated fields.
- `operation_size_kib op=BatchWriteItem`: Tracks the total size of items
  within a `BatchWriteItem` request, aggregated on a per-table basis. If
  an item already exists, the logged size is the maximum of the old and
  the new item size.

NOTE: Both metrics rely on read-before-write, so if the
`alternator_force_read_before_write` option is disabled, these metrics
may be incomplete and report inaccurate sizes.
2025-10-16 19:17:27 +02:00
Piotr Wieczorek
5ca42b3baf alternator: Add PutItem and DeleteItem response size metrics
This commit bundle introduces metrics on item sizes for Alternator
operations. Specifically, this commit adds `operation_size_kb`
histograms for sizes of items created or replaced by the `PutItem`
operation, and sizes of items deleted by `DeleteItem` requests. The
latter needs a read-before-write, so the metrics may be incomplete if
`alternator_force_read_before_write` is disabled.
2025-10-16 19:17:26 +02:00
Piotr Wieczorek
5c72fd9ea3 alternator: Add BatchGetItem response size metrics
This commit bundle introduces metrics on item sizes for Alternator
operations. Specifically, this commit adds a `operation_size_kb`
per-table histogram, which contains item sizes in BatchGetItem requests.

A size of a BatchGetItem is the sum of the sizes of all items in the
operation grouped by table. In other words, a single BatchGetItem, and
BatchWriteItem for that matter, updates the histograms for each table
that it has items in.
2025-10-16 19:16:57 +02:00
Piotr Wieczorek
1aa3819b57 alternator: Add GetItem response size metrics
This commit bundle introduces metrics on item sizes for Alternator
operations. Specifically, this commit adds a per-table
`operation_size_kb` histogram, recording the sizes of the items
contained in GetItem responses.
2025-10-16 19:04:55 +02:00
Piotr Dulikowski
44257f4961 Merge 'raft topology: disable schema pulls in the Raft-based recovery procedure' from Patryk Jędrzejczak
Schema pulls should always be disabled when group 0 is used. However,
`migration_manager::disable_schema_pulls()` is never called during
a restart with `recovery_leader` set in the Raft-based recovery
procedure, which causes schema pulls to be re-enabled on all live nodes
(excluding the nodes replacing the dead nodes). Moreover, schema pulls
remain enabled on each node until the node is restarted, which could
be a very long time.

We fix this issue and add a regression test in this PR.

Fixes #26569

This is an important bug fix, so it should be backported to all branches
with the Raft-based recovery procedure (2025.2 and newer branches).

Closes scylladb/scylladb#26572

* github.com:scylladb/scylladb:
  test: test_raft_recovery_entry_loss: fix the typo in the test case name
  test: verify that schema pulls are disabled in the Raft-based recovery procedure
  raft topology: disable schema pulls in the Raft-based recovery procedure
2025-10-16 18:48:38 +02:00
Emil Maskovsky
6769c313c2 raft: small fixes for voters code
Minor cleanups and improvements to voter-related code.

No backport: cleanup only, no functional changes.

Closes scylladb/scylladb#26559
2025-10-16 18:41:08 +02:00
Ernest Zaslavsky
55fb2223b6 s3_client: unify make_request implementation
Refactor `make_request` to use a single core implementation that
handles authentication and issues the HTTP request. All overloads now
delegate to this unified method.
2025-10-16 15:51:28 +03:00
Piotr Wieczorek
1559021c4e alternator/test: Add more context to test_metrics.py asserts
This commit adds more information to the assert messages to ease in
debugging. The semantics of the asserts remains the same.
2025-10-16 14:41:19 +02:00
Piotr Dulikowski
a8d92f2abd test: mv: add a test for tablet merge
The test test_mv_tablets_replace verifies that merging tablets of both a
view and its base table is allowed if rf-rack-valid-keyspaces option is
enabled (and it is enabled by default in the test suite).
2025-10-16 14:07:37 +02:00
Piotr Dulikowski
359ed964e3 tablet_allocator, tests: remove allow_tablet_merge_with_views injection
The `allow_tablet_merge_with_views` error injection was previously used
to allow merging tablets in a table which has materialized views
attached to it. Now, the error injection is not needed because this is
allowed under the rf-rack-valid condition, which is enabled by default
in tests.

Remove the error injection from the code and adjust the tests not to use
it.
2025-10-16 14:07:37 +02:00
Piotr Dulikowski
189ad96728 tablet_allocator: allow merges in base tables if rf-rack-valid=true
Tablet merge of base tables is only safe if there is at most one replica
in each rack. For more details on why it is the case please see
scylladb/scylladb#17265. If the rf-rack-valid-keyspaces is turned on,
this condition is satisfied, so allow it in that case.

Fixes: scylladb/scylladb#26273
2025-10-16 13:02:05 +02:00
Gleb Natapov
c255740989 schema: Allow configuring consistency setting for a keyspace
We want to add strongly consistent tables as an option. We will have
two kind of strongly consistent tables: globally consistent and locally
consistent. The former means that requests from all DCs will be globally
linearisable while the later - only requests to the same DCs will be
linearisable.  To allow configuring all the possibilities the patch
adds new parameter to a keyspace definition "consistency" that can be
configured to be `eventual`, `global` or `local`. Non eventual setting
is supported for tablets enabled keyspaces only. Since we want to start
with implementing local consistency configuring global consistency will
result in an error for now.
2025-10-16 13:34:49 +03:00
Avi Kivity
8f1de2a7ad Merge 'test/boost: speed up test test_indexing_paging_and_aggregation by making internal page size configurable' from Nadav Har'El
The C++ test `test_indexing_paging_and_aggregation` is one of the slowest tests in test/boost. The reason for its slowness is that it needs a table with more rows than SELECT's "DEFAULT_COUNT_PAGE_SIZE" which was hard-coded to 10,000, so the test needed to write and read tens of thousands of rows, and did it multiple times.

It turns out the code actually had an ad-hoc mechanism to override DEFAULT_COUNT_PAGE_SIZE in a C++ test, but both this mechanism and the test itself were so opaque I didn't find it until I fixed it in a different way: What I ended up doing in this pull request is the following (each step in a separate patch):

1. Rewrite this test in Python, in the test/cqlpy framework. This was straightforward, as this test only used CQL and not internal interfaces. The reason why this test wasn't written in Python in the first place is that it was written in 2019, a year before cqlpy existed. A added extensive comments to the new tests, and I finally understood what it was doing :-)
2. I replaced the ad-hoc C++-test-only mechanism of overriding DEFAULT_COUNT_PAGE_SIZE by a bona-fide configuration parameter, `select_internal_page_size`.
3. Finally, the Python test can temporarily lower `select_internal_page_size` and use a table with much fewer rows.

After this series, the test `test_indexing_paging_and_aggregation` (which is now in Python instead of C++) takes around half a second, 20 times faster than before. I expect the speedup to be even more dramatic for the debug build.

Closes scylladb/scylladb#25368

* github.com:scylladb/scylladb:
  cql: make SELECT's "internal page size" configurable
  secondary index: translate test_indexing_paging_and_aggregation to Python
2025-10-16 11:58:13 +03:00
Marcin Maliszkiewicz
47dba4203a db: schema_applier: update tablet metadata atomically
Before mutable_token_metadata_ptr containing tablet changes
was replicated to all cores in post_commit phase which violated
atomicy guarantee of schema_applier, now it's incorporated into
per shard commit phase.

It uses service::schema_getter abstraction introduced in earlier
commit to inject "pending" schema which is not yet visible to the
whole system.
2025-10-16 10:56:50 +02:00
Marcin Maliszkiewicz
e5fffa158f db: replica: move tables_metadata locking to commit
This keeps the locking scope minimal, and since
unlocking is done in commit(), locking fits here as well.
2025-10-16 10:56:10 +02:00
Marcin Maliszkiewicz
92cfc3c005 storage_service: abstract schema dependecies during token metadata update
The functions prepare_token_metadata_change and commit_token_metadata_change depend
on the current schema through calls to the database service. However, during an
atomic schema change, the current schema does not yet include the pending changes.
Despite that, we want to apply token metadata changes to those pending schema
elements as well.

Currently, this is achieved by postponing token metadata changes until after the rest
of the schema is committed, but this breaks atomicity. To allow incorporating the
prepare and commit phases into schema_applier, we need to abstract the schema
dependency. This will make it possible to provide, in following commits, an
implementation that includes visibility into pending changes, not just the currently
active schema.
2025-10-16 10:56:09 +02:00
Botond Dénes
5d70450917 replica/mutation_dump: multi_range_partition_generator: disable garbage-collection
Make use of the freshly introduced facility to disable
garbage-collection on a per-query basis for range scans. This is needed
so partitions that only contain garbage-collectible data are not missing
from the partition-list. When using SELECT * FROM MUTATION_FRAGMENTS(),
the user is expecting to see *all* data, even that which is dead and
garbage-collectible.

Include a test which reproduces the issue.
2025-10-16 10:40:28 +03:00
Botond Dénes
734a9934a6 replica: add tombstone_gc_enabled parameter to mutation query methods
Allow disabling tombstone gc on a per-query basis for mutation queries.
This is achieved by a bool flag passed to mutation query variants like
`query_mutations_on_all_shards()` and `database::mutation_query()`,
which is then propagated down to compaction_mutation_state.
The future user (in the next patch) is the SELECT * FROM
MUTATION_FRAGMENTS() statement which wants to see dead partitions
(and rows) when scanning a table. Currently, due to garbage collections,
said statement can miss partitions which only contain
garbage-collectible tombstones.
2025-10-16 10:38:47 +03:00
Botond Dénes
03118a27b8 mutation/mutation_compactor: remove _can_gc member
It is confusing. For query compaction, it initialized to `always_gc`,
for sstable compaction it is initialized to a lambda calling into
`can_gc()`. This makes understanding the purpose of this member very
confusing.
The real use of this member is to bridge
mutation_partition::compact_and_expire() with can_gc(). This patch
ditches the member and creates the lambda near the call sites instead,
just like the other params to `compact_and_expire()` already are.

can_gc() now also respects _tombstone_gc.is_gc_enabled() instead of just
blindly returning true when in query mode.

With this patch, whether tombstones are collected or not in query mode
is now consistent and controlled by the tombstone_gc_state.
2025-10-16 10:38:47 +03:00
Botond Dénes
cb27c3d6e9 tombstone_gc: add tombstone_gc_state factory methods for gc_all and no_gc
Currently, to disable tombstone-gc on-demand completely, one has to pass
down a bool flag along with the already required tombstone_gc_state to
the code which does the compacting.
This is redundant and confusing, the tombstone_gc_state is supposed to
encapsulate all tombstone-gc related logic in a transparent way.

Add dedicated factory methods for no-gc and gc-all, to allow creating a
tombstone_gc_state which transparently gcs for all or no tombstones.
2025-10-16 10:38:47 +03:00
Piotr Wieczorek
15c399ed40 test/alternator: Add more Streams tests for UpdateItem and BatchWriteItem
This commit adds tests to `test_streams.py` (i.e. Alternator Streams)
checking the following cases:
* putting an item with BatchWriteItem shouldn't emit a log if the old
  item and the new item are identical,
* deleting an item with BatchWriteItem shouldn't emit a log if the item
  doesn't exist,
* UpdateItem shouldn't emit a log if the old item and the new item are
  identical.

These cases haven't been tested until this commit.

Refs https://github.com/scylladb/scylladb/issues/6918

Closes scylladb/scylladb#26396
2025-10-16 09:34:12 +03:00
Pavel Emelyanov
dbca0b8126 Update seastar submodule
* seastar 270476e7...bd74b3fa (20):
  > memory: Decay large allocation warning threshold
  > iotune: fix very long warm up duration on systems with high cpu count
  > Add lib info to one line backtrace
  > io: Count and export number of AIO retries
  > io_queue: Destroy priority class data with scheduling group
  > Merge 'Expell net::packet from output_stream API stack' from Pavel Emelyanov
    code: Introduce new API level
    iostream: Remove write()-s of packet/scattered_message from new API level
    iostream: Convert output_stream::_zc_bufs to vector of buffers
    code: Add data_sink_impl::put(std::span<temporary_buffer>) method
    code: Prepare some data_sink_impl::do_put(temporary_buffer) methods
    iostream: Introduce output_stream::write(span<temporary_buffer>) overload
    packet: Add packet(std::span<temporary_buffer>) constructor
    temporary_buffer: Add detach_front() helper
  > cooking: update gnutls to 3.7.11
  > file: Configure DMA alignment from block size
  > util: adapt to fmt 12.0.0 API changes
  > Merge 'Internalize reactor::posix_... API methods' from Pavel Emelyanov
    reactor: Deprecate and internalize posix_connect()
    reactor: Deprecate and internalize posix_listen()
  > cooking: update fmt to modern version
  > Merge 'Add prometheus bench, coroutinize prometheus' from Travis Downs
    prometheus: coroutinize metrics writing
    prometheus_test: add global label test
    introduce metrics_perf bench
  > operator co_await: use rvalue reference
  > futurize::invoke: use std::invoke
  > io_tester: Don't skip 0 position in sequential workflows
  > io_queue: Use own logger for messages
  > .clangd: tell the LSP about seastar's header style
  > docker: Update to plucky
  > Merge 'Convert timer test into seastar test (and a bit more)' from Pavel Emelyanov
    test: Remove OK macro
    test: Fix one failure check
    test: Use boost checkers instead of BUG() macro
    test: Fix indentation after previous patch
    test: Convert timer_test into seastar test(s)

Closes scylladb/scylladb#26560
2025-10-16 07:55:17 +03:00
Nadav Har'El
921d07a26b cql: make SELECT's "internal page size" configurable
In some uses of SELECT, such as aggregation (sum() et al.), GROUP BY or
secondary index, it needs to perform internal scans. It uses an "internal
page size" which before this patch was always DEFAULT_COUNT_PAGE_SIZE = 10000.

There was an ad-hoc and undocumented way to override this default in C++
tests, using functions in test/lib/select_statement_utils.hh, but it
was so non-obvious that the test that most needed to override this
default - the very slow test test_indexing_paging_and_aggregation which
would have been must faster with a lower setting - never used it.

So in this patch we replace the ad-hoc configuration functions by a
bona-fide Scylla configuration option named "select_internal_page_size".

The few C++ tests that used the old configuration functions were
modified to use the new configuration parameters. The slow test
test_indexing_paging_and_aggregation still doesn't use the new
configuration to become faster - we'll do this in the next patch.

Another benefit of having this "internal page size" as a configuration
option is that one day a user might realize that the default choice
10,000 is bad for some reason (which I can't envision right now), so
having it configurable might come it handy.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-10-15 18:42:09 +03:00
Patryk Jędrzejczak
71de01cd41 test: test_raft_recovery_entry_loss: fix the typo in the test case name 2025-10-15 16:58:28 +02:00
Patryk Jędrzejczak
da8748e2b1 test: verify that schema pulls are disabled in the Raft-based recovery procedure
We do this at the end of `test_raft_recovery_entry_loss`. It's not worth
to add a separate regression test, as tests of the recovery procedure
are complicated and have a long running time. Also, we choose
`test_raft_recovery_entry_loss` out of all tests of the recovery
procedure because it does some schema changes.
2025-10-15 16:58:28 +02:00
Patryk Jędrzejczak
ec3a35303d raft topology: disable schema pulls in the Raft-based recovery procedure
Schema pulls should always be disabled when group 0 is used. However,
`migration_manager::disable_schema_pulls()` is never called during
a restart with `recovery_leader` set in the Raft-based recovery
procedure, which causes schema pulls to be re-enabled on all live nodes
(excluding the nodes replacing the dead nodes). Moreover, schema pulls
remain enabled on each node until the node is restarted, which could
be a very long time.

The old gossip-based recovery procedure doesn't have this problem
because we disable schema pulls after completing the upgrade-to-group0
procedure, which is a part of the old recovery procedure.

Fixes #26569
2025-10-15 16:58:24 +02:00
Nadav Har'El
afc5379148 secondary index: translate test_indexing_paging_and_aggregation to Python
The Boost test test_indexing_paging_and_aggregation is one of the slowest
boost tests. But it's hard to understand why it needs to be so slow - the
C++ test code is opaque, and uncommented. The test didn't need to be in
C++ - it only uses CQL, not any internal interfaces - but it was written
in 2019, a year before test/cqlpy was created.

So before we can make this test faster, this patch translates it to
Python and adds significant amount of comments. The new Python test is
functionally identical to the old C++ test - it is not (yet) made
smaller or faster. The new test takes a whopping 9 seconds to run on
my laptop (in dev build mode). We'll reduce that in the next patch.

As usual, the cqlpy test can also be tested on Cassandra, and
unsurprisingly, it passes.

Refs #16134 (which asks to translate more MV and SI tests to Python).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-10-15 17:50:37 +03:00
Piotr Dulikowski
61662bc562 Merge 'alternator: Make CDC use preimages from LWT for Alternator' from Piotr Wieczorek
This patch adds a struct `per_request_options` used to communicate between CDC and upper abstraction layers. We need this for better compatibility with DynamoDB Streams in Alternator (https://github.com/scylladb/scylladb/issues/6918) to change operation types of log rows. This patch also adds a way to conditionally forward the item read by LWT to CDC and use it as a preimage. For now, only Alternator uses this feature.

The main changes are:
- add a struct `cdc::per_request_options` to pass information between CDC and upper abstraction layers,
- add the struct to `cas_request::apply`'s signature,
- add a possibility to provide a preimage fetched by an upper abstraction layer (to propagate a row read by Alternator to CDC's preimage). This reduces the number of reads-before-write by 1 for some **Alternator** requests and it is always safe. It's possible to use this feature also in CQL.

No backport, it's a feature.

Refs https://github.com/scylladb/scylladb/issues/6918
Refs https://github.com/scylladb/scylladb/pull/26121

Closes scylladb/scylladb#26149

* github.com:scylladb/scylladb:
  alternator, cdc: Re-use the row read by LWT as a CDC preimage
  cdc: Support prefetched preimages
  storage: Add cdc options to cas_request::apply
  cdc, storage: Add a struct to pass per-mutation options to CDC
  cdc: Move operations enum to the top of the namespace
2025-10-15 12:30:29 +02:00
Piotr Wieczorek
28eda0203e alternator: Small cleanup, removing unnecessary statements, etc.
Tiny code cleanup to improve readability without changing behavior.

Changes:
- remove unused variables and imports,
- remove redundant whitespaces, and a duplicated `public:` access
  specifier,
- use `is_aws` function to check if running in AWS
  test/alternator/test_metrics.py,
- other trivial changes.

Closes scylladb/scylladb#26423
2025-10-15 12:05:20 +02:00
Pavel Emelyanov
7bd50437ff test: Remove unused operator<<(radix_tree_test::test_data)
It was used while debugging the test

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26458
2025-10-15 11:57:56 +02:00
Marcin Maliszkiewicz
106fd39c6c storage_service: split replicate_to_all_cores to steps
In later commits schema merge code will use those prepare
and commit steps. Rest of the code will continue using
replicate_to_all_cores.
2025-10-15 10:54:24 +02:00
Gleb Natapov
eb9112a4a2 db: experimental consistent-tablets option
The option will be used to hid consistent tablets feature until it is
ready.
2025-10-15 11:27:10 +03:00
Dawid Mędrek
3aa07d7dfe test/cluster/mv: Provide reason why test is skipped
We point to the issue explaining why the test was disabled
and what can be done about it.

Closes scylladb/scylladb#26541
2025-10-15 09:22:39 +02:00
Jenkins Promoter
d731d68e66 Update pgo profiles - aarch64 2025-10-15 05:21:46 +03:00
Jenkins Promoter
b6237d7dd4 Update pgo profiles - x86_64 2025-10-15 04:54:54 +03:00
Piotr Dulikowski
aed166814e test: cluster: skip flaky test_raft_recovery_entry_lose test
Unfortunately, the test became flaky and is blocking promotion. The
cause of the flaky is not known yet but unrelated to other items
currently queued on the `next` branch. The investigation continues on
GitHub issue scylladb/scylladb#26534.

In the meantime, skip the test to unblock other work.

Refs: scylladb/scylladb#26534

Closes scylladb/scylladb#26549
2025-10-14 19:35:44 +02:00
Botond Dénes
d0844abb5c mutation/mutation_compactor: compaction:stats: split partitions
Into total and live. Currently only live (those with live content) are
counted. Report live and total seprately, just like we do for rows. This
allows deducing the count of dead partitions as well, which is
particularly interesting for scans.

Closes scylladb/scylladb#26548
2025-10-14 19:08:47 +03:00
Marcin Maliszkiewicz
b0f11b6d91 db: schema_applier: unify token_metadata loading
Putting it into a single place gives more clarity on
how _pending_token_metadata is made and avoids extra
per shard copy when tablets change.
2025-10-14 10:56:37 +02:00
Marcin Maliszkiewicz
d67632bfe2 replica: schema_applier: obtain copy of token_metadata at the beginning of schema merge
This copy is now used during the whole duration of schema merge.
If it changes due to tablet_hint then it's replicated to all shards as before.
2025-10-14 10:56:36 +02:00
Marcin Maliszkiewicz
389afcdeb6 service: fix dependencies during migration_manager startup
We need to avoid reloading schema early as it goes via
schema_applier which internally depends on storage_service
and on distribued_loader initializing all keyspaces.

Simply moving migration manager startup later in the code is not
easy as some services depend on it being initialized so we just
enable those feature listeners a bit later.
2025-10-14 10:56:26 +02:00
Marcin Maliszkiewicz
46bff28a38 db: schema_applier: move pending_token_metadata to locator
It never belonged to tables and views and its placement stems
from location of _tablet_hint handling code.

In the follwing commits we'll reference it in storage_service.cc.
2025-10-14 10:56:26 +02:00
Marcin Maliszkiewicz
1a539f7151 db: always use _tablet_hint as condition for tablet metadata change
When all schema_applier code uses this condition it's easier
to grep than when we use different, derived conditions.
2025-10-14 10:56:26 +02:00
Marcin Maliszkiewicz
c112916215 db: refactor new_token_metadata into pending_token_metadata
It prepares pending_token_metadata to handle both new and copy
of existing metadata for consistent usage in later commit.

It also adds shared_token_metatada getter so that we don't
need to get it from db.
2025-10-14 10:56:26 +02:00
Marcin Maliszkiewicz
668231d97c db: rename new_token_metadata to pending_token_metadata
Part of the refactor done in following commit.
Separated for easier review.
2025-10-14 10:56:26 +02:00
Marcin Maliszkiewicz
0c4c995c0d db: schema_applier: move types storage init to merge_types func
Merge_types function groups operation related to types,
types storage fits this group.
2025-10-14 10:56:26 +02:00
Marcin Maliszkiewicz
794d68e44c db: schema_applier: make merge functions non-static members
This is mechanical change which simplifies the code. Schema_applier
class is an object which holds schema merging intermediate state
so it's fine that all schema merging functions have access to this state.
2025-10-14 10:56:25 +02:00
Marcin Maliszkiewicz
209563f478 db: remove unused proxy from create_keyspace_metadata 2025-10-14 10:56:25 +02:00
Ernest Zaslavsky
413739824f s3_client: track memory starvation in background filling fiber
Introduce a counter metric to monitor instances where the background
filling fiber is blocked due to insufficient memory in the S3 client.

Closes scylladb/scylladb#26466
2025-10-14 11:22:54 +03:00
Piotr Wieczorek
5ff2d2d6ab alternator, cdc: Re-use the row read by LWT as a CDC preimage
Propagates the row read by CAS to CDC's preimage to save one
read-before-write.

As of now, a preimage in Alternator Streams always contains the entire
item (see previous_item_read_command in executor.cc), so the resulting
preimage should stay the same. In other words, this change should be
transparent to users.
2025-10-14 07:52:40 +02:00
Piotr Wieczorek
d4581cc442 cdc: Support prefetched preimages
This commit adds support to pass a preimage selected by an upper layer
to CDC. The responsibility for the correctness of the preimage (i.e. the
selected columns, whether it's up to date, etc.) lies with the caller.
It may be improved in the future by validating the preimage, e.g. by
"slicing" the received preimage to the necessary columns.

The motivation behind this change was to reduce the number of
read-before-writes and avoid reading the row twice for Alternator
Streams in an increased compatibility mode with DynamoDB. This is to be
added in a following commit. Until now, this commit should be a no-op.
2025-10-14 07:29:07 +02:00
Łukasz Paszkowski
125bf391a7 utils/directories: ignore files when retrieving stats fails
During Scylla startup, directories are created and verified in
`directories::do_verify_owner_and_mode()`. It is possible that while
retrieving file stats, a file might be removed, leading to Scylla
failing to boot.

This is particularly visible in `storage/test_out_of_space.py` tests,
which use FUSE to mount size-limited volumes. When a file that is open
by another process is removed, FUSE renames it to `.fuse_hidden*`.

In `directories::do_verify_owner_and_mode()`, the code performs a
`scan_dir` to list files and retrieves their stats to verify type, mode,
and ownership. If a file is removed while retrieving its stats, we see
errors such as:

```
Failed to get /scylladir/testlog/x86_64/dev/volumes/e0125c60-1e63-4330-bf6f-c0ea3e466919/scylla-0/hints/1/.fuse_hidden0000001800000005
```

This change makes `do_verify_owner_and_mode()` ignore files when
retrieving stats fails, avoiding spurious errors during verification.

Refs: https://github.com/scylladb/scylladb/issues/26314

Closes scylladb/scylladb#26535
2025-10-13 20:41:25 +03:00
Botond Dénes
46af0127e9 test/cqlpy/test_tools.py: add test for scylla-sstable write --input-format=cql
Comprehensive test for the new CQL input format.
2025-10-13 18:10:40 +03:00
Botond Dénes
180bf647f7 replica/mutation_dump: add support for virtual tables
Not supported currently as such tables have no memtables, cache or
sstables, so any select * from mutation_fragments() query will return
empty result.
Detect virtual tables and add return their content with a distinct
'virtual-table' mutation_source designation.
2025-10-13 18:10:40 +03:00
Botond Dénes
64c32ca501 tools/scylla-sstable: print_query_results_json(): handle empty value buffer
Print null, similar to disengaged optional value.
2025-10-13 18:10:40 +03:00
Botond Dénes
e404dd7cf0 tools/scylla-sstable: add cql support to write operation
Add new --input-format command line argument. Possible values are json
(current) and cql (new -- added in this patch).
When --input-format=cql (new default), the input-file is expected to
contain CQL INSERT, UPDATE or DELETE statements, separated by semicolon.
The input file can contain any number of statements, in any order. The
statements will be executed and applied to a memtable, which is then
flushed to create an sstable with the content generated from the
statement. The memtable's size is capped at 1MiB, if it reaches this
size, it is flushed and recreated. Consequently, multiple sstables can
be created from a single scylla-sstable write --input-format=cql
operation.
2025-10-13 18:10:40 +03:00
Dawid Mędrek
7d017748ab db/commitlog: Extend segment truncation error messages
We include more relevant information for debugging purposes:
the remaining bytes and the size. It might be useful to determine
where exactly an error occurred and help reason about it.

Closes scylladb/scylladb#26486
2025-10-13 17:42:31 +03:00
Nadav Har'El
06108ea020 test/alternator: a small cleanup for a test in test_streams.py
This patch makes three small mostly-cosmetic improvements to a test in
test/alternator/test_streams.py:

1. The test is renamed "test_streams_deleteitem_old_image_no_ck" to
   emphasize its focus on the combination of deleteitem, old image,
   and no ck. The "putitem" we had in the name was not relevant, and
   the "old_image" was missing and important.

2. Moreover, using PutItem in this test just to set up the test scenario
   mixed the bug which the test tries to reproduced with a different
   only-recently-fixed bug (that PutItem also generated a spurious
   "REMOVE" event). So I changed the use of PutItem by using UpdateItem,
   to make this test indepedent of the other bug. Test independence is
   important because it allows us - if we want - to backport a fix for
   just one bug independently of the fix to the other bug.

3. Also improved the comment in front of the test to mention where we
   already tested the with-ck case, and also to mention issue 26382
   which this test reproduces (the xfail line also mentions it, but
   the xfail line will be removed when the bug is fixed - but the
   mention in the comment will remain - and should remain.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#26526
2025-10-13 17:42:31 +03:00
Piotr Dulikowski
1cf944577b Merge 'Fix vector store client flaky test' from Karol Nowacki
This series of patches improves test vector_store_client_test stability. The primary issue with flaky connections was discovered while working on PR #26308.

Key Changes:
- Fixes premature connection closures in the mock server:
The mock HTTP server was not consuming request payloads, causing it to close connections immediately after a response. Subsequent tests attempting to reuse these closed connections would fail intermittently, leading to flakiness. The server has been updated to handle payloads correctly.

- Removes a retry workaround:
With the underlying connection issue resolved, the retry logic in the vector_store_client_test_ann_request test is no longer needed and has been removed.

- Mocks the DNS resolver in tests:
The vector_store_client_uri_update_to_invalid test has been corrected to mock DNS lookups, preventing it from making real network requests.

- Corrects request timeout handling:
A bug has been fixed where the request timeout was not being reset between consecutive requests.

- Unifies test timeouts:
Timeouts have been standardized across the test suite for consistency.

Fixes: #26468

It is recommended to backport this series to the 2025.4 branch. Since these changes only affect test code and do not alter any production logic, the backport is safe. Addressing this test flakiness will improve the stability of the CI pipeline and prevent it from blocking unrelated patches.

Closes scylladb/scylladb#26374

* github.com:scylladb/scylladb:
  vector_search: Unify test timeouts
  vector_search: Fix missing timeout reset
  vector_search: Refactor ANN request test
  vector_search: Fix flaky connection in tests
  vector_search: Fix flaky test by mocking DNS queries
2025-10-13 17:42:31 +03:00
Botond Dénes
f4f99ece7d tools/scylla-sstable: write_operation(): fix indentation
Left broken in previous patch for easier review.
2025-10-13 17:35:50 +03:00
Botond Dénes
023aed0813 tools/scylla-sstable: write_operation(): prepare for a new input-format
In the next patches a new input-format will be introduced, which can
produce multiple output format. To prepare for this, consolidate the
code which produces an sstable into a reusable lambda function.
Moves code around, reduces churn in next patches. Indentation is left
broken for easier review.
2025-10-13 17:35:50 +03:00
Botond Dénes
f03cec9574 tools/scylla-sstable: generalize query_operation_validate_query()
Make error messages more generic, so they are not specific to select.
Make it a template on the type of cql statement for the final check. To
avoid templating the whole thing, the function is split into two.
Parametrize the name of the allowed statement types in said check.
Prepares the method to be shared between query operation and write
operation (future change).
While at it, also change query param type to std::string_view to avoid
some copies.
2025-10-13 17:35:50 +03:00
Botond Dénes
61e70d1b11 tools/scylla-sstable: move query_operation_validate_query()
Move it in the source code above write_operation(), as said operation
will soon want to use this method too.
2025-10-13 17:35:50 +03:00
Botond Dénes
dfe4cfc0e2 tools/scylla-sstable: extract schema transformation from query operation
This transformation enables an existing schema to be created as a table
in cql_test_env, to be used to read/write sstables belonging to said
schema.
Extract this into a method, to be shared by a future operation which
will also want to do this.
2025-10-13 17:35:50 +03:00
Botond Dénes
970d4f0dcd replica/table: add virtual write hook to the other apply() overload too
Currently only one has it, which means virtual table can potentially
miss some writes.
2025-10-13 17:35:50 +03:00
Calle Wilund
68c109c2df docs::dev::object_storage: Add some initial info on GS storage
Augments the object storage document with config options etc for
using GS instead of S3.

TODO: add proper gsutil command line examples for manual managing of
GCP storage.
2025-10-13 08:53:28 +00:00
Calle Wilund
54a7d7bd47 docs/dev: Add mention of (nested) docker usage in testing.md
As one (of more?) places to document the fact that we partially
rely on resolving docker images for CI.
2025-10-13 08:53:28 +00:00
Calle Wilund
403247243b sstables::object_storage_client: Forward memory limit semaphore to GS instance
Enforces object storage limits to the GS implementation as well.
2025-10-13 08:53:28 +00:00
Calle Wilund
01f4dfed84 utils::gcp::object_storage: Add optional memory limits to up/download
Adds optional memory semaphore to limit the mem buffer usage in sink/source.
Note that we don't bookkeep exact, to avoid deadlock issues in higher layer.

In upload, we overlease on first buffer put to ensure we can at least fill
the desired 8M of buffers. We try to adjust when going over, but if we
fail, we fail, but at least will initiate upload -> soon release memory.
On next put, we try to grab multiples of 8M again, and so forth. Thus
potentially causing waiting for resources, without ending up not uploading
at least one active sink.

For download (source), we try to get lease for as much as we want to read,
but if we fail, we adjust this down to 256k and download anyway. Since this
will typically be released immediately, we at least don't overrun for long,
and again, avoid fully stopping, throttling rate instead.
2025-10-13 08:53:27 +00:00
Calle Wilund
5e4e5b1f4a sstables::object_storage_client: Add multi-upload support for GS
Uses file splitting + object merge to facilitate parallel, resumable
upload of files with known size.
2025-10-13 08:53:27 +00:00
Calle Wilund
bd1304972c utils::gcp::storage: Add merge objects operation
Allows merging 1-32 smaller files into a destination.
2025-10-13 08:53:27 +00:00
Calle Wilund
e940a1362a test_backup/test_basic: Make tests multiplex both s3 and gs backends
Change fixture used + property/config access to allow running with
arbitrary bucket-object based backend.
2025-10-13 08:53:27 +00:00
Calle Wilund
80c02603a8 test::cluster::conftest: Add support for multiple object storage backends
Adds an `object_storage` fixture with paramterization to iterate through
's3' and 'gs' backends.
For the former, will instansiate the `s3_server` backend (modified to better
handle being actual temp, function level server).
For the latter, will either give back a frontend if env vars indicating
"real" GS buckets and endpoints are used, or launch a docker image for
fake-gcs-server on a free port.

Please read the comment in the code about the management of server output,
as this is less than optimal atm, but I can't figure out the issue with it.

All returned fixture objects will respond to `address`, `bucket` properties,
as well as be able to create endpoint config objects for scylla.
2025-10-13 08:53:27 +00:00
Calle Wilund
da36a9d78e boost::gcs_storage_test: reindent
Remove redundant indentation/moosewings.
2025-10-13 08:53:27 +00:00
Calle Wilund
1356f60c69 boost::gcs_storage_test: Convert to use fixture
Instead of test-local server/endpoint etc, use the gcs test fixture,
with the added bonus of a suite-shared one for additional speed.
2025-10-13 08:53:27 +00:00
Calle Wilund
7c6b4bed97 tests::boost: Add GS object storage cases to mirror S3 ones
I.e. run same remote storage backend unit tests for GS backend
2025-10-13 08:53:27 +00:00
Calle Wilund
af2616d750 tests::lib::gcs_fixture: Add a reusable test fixture for real/fake GS/GCS
A text fixture object for either real google storage or fake-gcs-server
using test local podman.

Copied/transposed from gcp_object_storage_test.
2025-10-13 08:53:26 +00:00
Calle Wilund
a33fdd0b62 tests::lib::test_utils: Add overloads/helpers for reading and (temp) writing env
Move some code to compilation unit + add some overloads.
Add a RAII-object for temporary setting current process env as well.
2025-10-13 08:53:26 +00:00
Calle Wilund
527a6df460 sstables::object_storage_client: Add google storage implementation
Allowing GS config to be honoured.
2025-10-13 08:53:26 +00:00
Calle Wilund
956d26aa34 test_services: Allow testing with GS object storage parameters 2025-10-13 08:53:26 +00:00
Calle Wilund
da7099a56e utils::gcp::gcp_credentials: Add option to create uninitialized credentials
To avoid having to async wait for creating credentials, allow lazy
init (in actual token renew) of credentials. This is not super
pleasant, since it means any error will be late, but it is required
more or less for the code paths into which we intend to place this.
2025-10-13 08:53:26 +00:00
Calle Wilund
fd13ffd95d utils::gcp::object_storage: Make create_download_source return seekable_data_source
Since, given the nature of object storage API:s, it is no more complicated to
provide a reasonable implementation of a seekable, limited, interface,
give this back, which in turn means upper layers can provide easy read-only file
interfaces. Hint hint.
2025-10-13 08:53:26 +00:00
Calle Wilund
cc899d4a86 utils::gcp::object_storage: Add defensive copies of string_view params 2025-10-13 08:53:26 +00:00
Calle Wilund
2093e7457d utils::gcp::object_storage: Add missing retry backoff increate
Ensure we increase wait time on subsequent backoffs
2025-10-13 08:53:26 +00:00
Calle Wilund
74578aaae2 utils::gcp::object_storage: Add timestamp to object listing 2025-10-13 08:53:26 +00:00
Calle Wilund
d0fe001518 utils::gcp::object_storage: Add paging support to list_objects
Allowing listing N entries at a time, with continuation.
2025-10-13 08:53:26 +00:00
Calle Wilund
144b550e4f object_storage_client: Add object_name wrapper type
Remaining S3-centric, but abstracting the object name to
possible implementations not quite formatted the same.
2025-10-13 08:53:25 +00:00
Calle Wilund
9dde8806dd utils::gcp::object_storage: Add optional abort_source
Add forwarded abort_source to lengty ops
2025-10-13 08:53:25 +00:00
Calle Wilund
926177dfb4 utils::rest::client: Add abort_source support
Add optional forwarded abort_source
2025-10-13 08:53:25 +00:00
Calle Wilund
5d4558df3b sstables: Use object_storage_client for remote storage
Replaces direct s3 interfaces with the abstraction layer, and open
for having multiple implentations/backends
2025-10-13 08:53:25 +00:00
Calle Wilund
25e932230a sstables::object_storage_client: Add abstraction layer for OS cliens (s3 initial)
Adds abstraction layer for creating the various ops and IO objects for storing
sstable data on cloud storage
2025-10-13 08:53:25 +00:00
Calle Wilund
a2cd061a5d s3::upload_progress: Promote to general util type 2025-10-13 08:53:25 +00:00
Calle Wilund
868d057aae storage_options: Abstract s3 to "object_storage" and add gs as option
Since both are bucket+prefix oriented, we can basically use same
options for both, only distinguished by actual protocol.
Abstract the types and the helper parse etc routines to handle either.
Use "gs" as term for gcs (google compute storage), since this is the
URL scheme used.
2025-10-13 08:53:25 +00:00
Calle Wilund
ac438c61e6 sstables::file_io_extension: Change "creator" callback to just data_source
Because the concept of pushing reading range does not work for the wrapping
we do (i.e. encryption), there is no point having it here. We need to do
said range handling higher up.
Also, must allow multi-layered wrapping.
2025-10-13 08:53:25 +00:00
Calle Wilund
2e49270da5 utils::io-wrappers: Add ranged data_source
Provides a data_source wrapper to read a specific range of
a source stream.
2025-10-13 08:53:25 +00:00
Calle Wilund
91c0467282 utils::io-wrappers: Add file wrapper type for seekable_source
Provides a read-only file interface for a seekable_source object.
2025-10-13 08:53:25 +00:00
Calle Wilund
e62a18304e utils::seekable_source: Add a seekable IO source type
Extension of data_source, with the ability to
a.) Seek in any direction, i.e. move backwards. Thus not pure stream.
b.) Read a limited number of bytes.

The very transparent reason for the interface is to have a base
abstraction for providing a read-only file layer for networked
resources.
2025-10-13 08:53:24 +00:00
Calle Wilund
14dada350a object_storage_endpoint_param: Add gs storage as option 2025-10-13 08:53:24 +00:00
Calle Wilund
78d9dda060 config: break out object_storage_endpoint_param preparing for multi storage
Moves the config wrapper to own file (to reduce recompilation for modifying)
and refactors to handle extending this parameter to non-s3 endpoint configs.
2025-10-13 08:53:24 +00:00
Avi Kivity
c783f0e539 Merge 'index: Prefer const qualifiers wherever possible' from Dawid Mędrek
We add missing `const`-qualifiers wherever possible in the module.
A few smaller changes were included as a bonus.

Backport: not needed. This is a cleanup.

Closes scylladb/scylladb#26485

* github.com:scylladb/scylladb:
  index/secondary_index_manager: Take std::span instead of std::vector
  index/secondary_index_manager: Add missing const qualifier
  index/vector_index: Add missing const qualifiers
  cql3/statements/index_prop_defs.cc: Remove unused include
  cql3/statements/index_prop_defs.cc: Mark function as TU-local
  cql3/statements/index_prop_defs: Mark methods as const-qualified
2025-10-12 19:47:53 +03:00
Michał Chojnowski
93dac3d773 sstables/compressor: relax a large allocation warning in ZSTD_CDict creation
ZSTD_CDict needs a big contiguous allocation and there's no way around that.
The only thing to do is relax the warning appropriately.

Closes scylladb/scylladb#25393
2025-10-12 18:21:11 +03:00
Botond Dénes
24c6476f73 mutation/mutation_compactor: add tombstone_gc_state to query ctor
So tombstones can be purged correctly based on the tombstone gc mode.
Currently if repair-mode is used, tombstones are not purged at all,
which can lead to purged tombstone being re-replicated to replicas which
already purged them via read-repair.
This is not a correctness problem, tombstones are not included in data
query resutl or digest, these purgable tombstone are only a nuissance
for read repair, where they can create extra differences between
replicas. Note that for the read repair to trigger, some difference
other than in purgable tombstones has to exist, because as mentioned
above, these are not included in digets.

Fixes: scylladb/scylladb#24332

Closes scylladb/scylladb#26351
2025-10-12 17:48:15 +03:00
Botond Dénes
d9c3772e20 service/storage_proxy: send batches with CL=EACH_QUORUM
Batches that fail on the initial send are retired later, until they
succeed. These retires happen with CL=ALL, regardless of what the
original CL of the batch was. This is unnecessarily strict. We tried to
follow Cassandra here, but Cassandra has a big caveat in their use of
CL=ALL for batches. They accept saving just a hint for any/all of the
endpoints, so a batch which was just logged in hints is good enough for
them.
We do not plan on replicating this usage of hints at this time, so as a
middle ground, the CL is changed to EACH_QUORUM.

Fixes: scylladb/scylladb#25432

Closes scylladb/scylladb#26304
2025-10-12 17:18:41 +03:00
Michał Chojnowski
7c6e84e2ec test/boost/sstable_compressor_factory_test: fix thread-unsafe usage of Boost.Test
It turns out that Boost assertions are thread-unsafe,
(and can't be used from multiple threads concurrently).
This causes the test to fail with cryptic log corruptions sometimes.
Fix that by switching to thread-safe checks.

Fixes scylladb/scylladb#24982

Closes scylladb/scylladb#26472
2025-10-12 17:16:51 +03:00
Piotr Wieczorek
8cd9f5d271 test/alternator: Add a Streams test reproducing #26382
This commit adds a test that reproduces an issue, wherein OldImage isn't
included in the REMOVE events produced by Alternator Streams.

Refs https://github.com/scylladb/scylladb/issues/26382

Closes scylladb/scylladb#26383
2025-10-12 11:09:57 +03:00
Piotr Wieczorek
a55c5e9ec7 alternator: Correct RCU undercount in BatchGetItem
The `describe_multi_item` function treated the last reference-captured
argument as the number of used RCU half units. The caller
`batch_get_item`, however, expected this parameter to hold an item size.
This RCU value was then passed to
`rcu_consumed_capacity_counter::get_half_units`, treating the
already-calculated RCU integer as if it were a size in bytes.

This caused a second conversion that undercounted the true RCU. During
conversion, the number of bytes is divided by `RCU_BLOCK_SIZE_LENGTH`
(=4KB), so the double conversion divided the number of bytes by 16 MB.

The fix removes the second conversion in `describe_multi_item` and
changes the API of `describe_multi_item`.

Fixes: https://github.com/scylladb/scylladb/pull/25847

Closes scylladb/scylladb#25842
2025-10-12 10:42:32 +03:00
Karol Nowacki
62deea62a4 vector_search: Unify test timeouts
The test previously used separate timeouts for requests (5s) and the
overall test case (10s).

This change unifies both timeouts to 10 seconds.
2025-10-10 16:49:06 +02:00
Karol Nowacki
0de1fb8706 vector_search: Fix missing timeout reset
The `vector_store_client_test` could be flaky because the request timeout
was not consistently reset in all code paths. This could lead to a
timeout from a previous operation firing prematurely and failing the
test.

The fix ensures `abort_source_timeout` is reset before each request.
The implementation is also simplified by changing
`abort_source_timeout::reset` that combines the reset and arm
operations into a same invocation.
2025-10-10 16:48:54 +02:00
Karol Nowacki
d99a4c3bad vector_search: Refactor ANN request test
Refactor the `vector_store_client_test_ann_request` test to use the
`vs_mock_server` class, unifying the structure of the test cases.

This change also removes retry logic that waited for the server to be ready.
This is no longer necessary because the handler now exists for all index names
and consumes the entire request payload, preventing connection closures.

Previously, the server did not handle requests for unconfigured
indexes, which caused the connection to close. This could lead to a
race condition where the client would attempt to reuse a closed
connection.
2025-10-10 16:48:20 +02:00
Karol Nowacki
2eb752e582 vector_search: Fix flaky connection in tests
The vector store mock server was not reading the ANN request body,
which could cause it to prematurely close the connection.

This could lead to a race condition where the client attempts to reuse a
closed connection from its pool, resulting in a flaky test.

The fix is to always read the request body in the mock server.
2025-10-10 16:48:09 +02:00
Karol Nowacki
ac5e9c34b6 vector_search: Fix flaky test by mocking DNS queries
The `vector_store_client_uri_update_to_invalid` test was flaky because
it performed real DNS lookups, making it dependent on the network
environment.

This commit replaces the live DNS queries with a mock to make the test
hermetic and prevent intermittent failures.

`vector_search_metrics_test` test did not call configure{vs},
as a consequence the test did real DNS queries, which made the test
flaky.

The refreshes counter increment has been moved before the call to the resolver.
In tests, the resolver is mocked leading to lack of increments in production code.
Without this change, there is no way to test DNS counter increments.

The change also simplifies the test making it more readable.
2025-10-10 16:47:03 +02:00
Patryk Jędrzejczak
5f68b9dc6b test: test_raft_no_quorum: test_can_restart: deflake the read barrier call
Expecting the group 0 read barrier to succeed with a timeout of 1s, just
after restarting 3 out of 5 voters, turned out to be flaky. In some
unlikely scenarios, such as multiple vote splits, the Raft leader
election could finish after the read barrier times out.

To deflake the test, we increase the timeout of Raft operations back to
300s for read barriers we expect to succeed.

Fixes #26457

Closes scylladb/scylladb#26489
2025-10-10 15:22:39 +03:00
Asias He
13dd88b010 repair: Rename incremental mode name
Using the name regular as the incremental mode could be confusing, since
regular might be interpreted as the non-incremental repair. It is better
to use incremental directly.

Before:

- regular (standard incremental repair)
- full (full incremental repair)
- disabled (incremental repair disabled)

After:

- incremental (standard incremental repair)
- full (full incremental repair)
- disabled (incremental repair disabled)

Fixes #26503

Closes scylladb/scylladb#26504
2025-10-10 15:21:54 +03:00
Michał Chojnowski
85fd4d23fa test_sstable_compression_dictionaries_basic: reconnect robustly after node reboots
Using `driver_connect()` after a cluster restart isn't enough to ensure
full CQL availability, but the test assumes that it is.

Fix that by making the test wait for CQL availability via `get_ready_cql()`.

Also, replace some manual usages of wait_for_cql_and_get_hosts with
`get_ready_cql()` too.

Fixes scylladb/scylladb#25362

Closes scylladb/scylladb#25366
2025-10-10 14:27:02 +03:00
Piotr Dulikowski
0b800aab17 Merge 'db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground' from Michał Jadwiszczak
db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground
This patch moves `discover_existing_staging_sstables()` to be executed
from main level, instead of running it on the background fiber.

This method need to be run only once during the startup to collect
existing staging sstables, so there is no need to do it in the
background. This change will increase debugability of any further issues
related to it (like https://github.com/scylladb/scylladb/issues/26403).

Fixes https://github.com/scylladb/scylladb/issues/26417

The patch should be backported to 2025.4

Closes scylladb/scylladb#26446

* github.com:scylladb/scylladb:
  db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground
  db/view/view_building_worker: futurize and rename `start_background_fibers()`
2025-10-09 18:24:50 +02:00
Michał Jadwiszczak
8d0d53016c db/view/view_building_worker: update state again if some batch was finished during the update
There was a race between loop in `view_building_worker::run_view_building_state_observer()`
and a moment when a batch was finishing its work (`.finally()` callback
in `view_building_worker::batch::start()`).

State observer waits on `_vb_state_machine.event` CV and when it's
awoken, it takes group0 read apply mutex and updates its state. While
updating the state, the observer looks at `batch::state` field and
reacts to it accordingly.
On the other hand, when a batch finishes its work, it sets `state` field
to `batch_state::finished` and does a broadcast on
`_vb_state_machine.event` CV.
So if the batch will execute the callback in `.finally()` while the
observer is updating its state, the observer may miss the event on the
CV and it will never notice that the batch was finished.

This patch fixes this by adding a `some_batch_finished` flag. Even if
the worker won't see an event on the CV, it will notice that the flag
was set and it will do next iteration.

Fixes scylladb/scylladb#26204

Closes scylladb/scylladb#26289
2025-10-09 18:17:22 +02:00
Avi Kivity
55d4d39ae3 Merge 'transport: service_level_controller: create and use driver service level' from Andrzej Jackowski
This is a cherry-pick of https://github.com/scylladb/scylladb/pull/25412 commits, as the changes were reverted in 364316dd2f2212bbbb446eaa2a4b0bd53d125ad5 due to https://github.com/scylladb/scylladb/issues/26163.
The underlying problem (https://github.com/scylladb/scylladb/issues/26190) was fixed in seastar (https://github.com/scylladb/seastar/pull/2994), so https://github.com/scylladb/scylladb/pull/25412 commits are restored without changes (only rebase conflicts were resolved).

===
This patch series:
 - Increases the number of allowed scheduling groups to allow creation of `sl:driver`
 - Implements `create_driver_service_level` that creates `sl:driver` with shares=200 if it wasn't already created
 - Implements creation of `sl:driver` for new systems and tests in `raft_initialize_discovery_leader`
 - Modifies `topology_coordinator` to use  create `sl:driver` after upgrades.
 - Implements using `sl:driver` for new connections in `transport/server`
 - Adds to `transport/server` recognition of driver's control connections and forcing them to keep using `sl:driver`.
 - Adds tests to verify the new functionality
 - Modifies existing tests to let them pass after `sl:driver` is added
 - Modifies the documentation to contain new `sl:driver`

The changes were evaluated by a test with the following scenario ([test_connections-sl-driver.py](https://github.com/user-attachments/files/22021273/test_connections-sl-driver.py)):
 - Start ScyllaDB with one node
 - Create 1000 keyspaces, 1 table in each keyspace
 - Start `cassandra-stress` (`-rate threads=50  -mode native cql3`)
 - Run connection storm with 1000 session (100 python processes, 10 sessions each)

The maximum latency during connection storm dropped **from 224.94ms to 41.43ms** (those numbers are average from 20 test executions, were max latency was in [140ms, 361ms] before change and [31.4ms, 61.5ms] after).

The snippet of cassandra-stress output from the moment of connection storm:
Before:
```
type       total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
...
total,        789206,   85887,   85887,   85887,     0.6,     0.3,     2.0,     2.0,     2.5,     5.0,    9.0,  0.09679,      0,      0,       0,       0,       0,       0
total,        909322,  120116,  120116,  120116,     0.4,     0.2,     1.9,     2.0,     2.1,     3.1,   10.0,  0.09053,      0,      0,       0,       0,       0,       0
total,        964392,   55070,   55070,   55070,     0.9,     0.4,     2.0,     4.5,     7.7,    18.9,   11.0,  0.09203,      0,      0,       0,       0,       0,       0
total,        975705,   11313,   11313,   11313,     4.4,     3.5,     6.5,    24.5,    82.7,    83.0,   12.0,  0.11713,      0,      0,       0,       0,       0,       0
total,        987548,   11843,   11843,   11843,     4.2,     3.5,     6.5,    33.7,    48.6,    51.5,   13.0,  0.13366,      0,      0,       0,       0,       0,       0
total,        995422,    7874,    7874,    7874,     6.3,     4.0,     7.7,    85.6,   112.9,   113.5,   14.0,  0.14753,      0,      0,       0,       0,       0,       0
total,       1007228,   11806,   11806,   11806,     4.3,     3.5,     6.5,    29.1,    43.8,    87.1,   15.0,  0.15598,      0,      0,       0,       0,       0,       0
total,       1012840,    5612,    5612,    5612,     8.2,     5.0,    11.5,   121.8,   166.6,   170.1,   16.0,  0.16535,      0,      0,       0,       0,       0,       0
total,       1016186,    3346,    3346,    3346,    13.4,     7.4,    20.1,   204.9,   207.6,   210.4,   17.0,  0.17405,      0,      0,       0,       0,       0,       0
total,       1025462,    9276,    9276,    9276,     6.3,     3.9,     9.6,    74.6,   206.8,   210.0,   18.0,  0.17800,      0,      0,       0,       0,       0,       0
total,       1035979,   10517,   10517,   10517,     4.8,     3.5,     6.7,    38.5,    82.6,    83.0,   19.0,  0.18120,      0,      0,       0,       0,       0,       0
total,       1047488,   11509,   11509,   11509,     4.3,     3.5,     6.0,    32.6,    72.3,    74.0,   20.0,  0.18334,      0,      0,       0,       0,       0,       0
total,       1077456,   29968,   29968,   29968,     1.7,     1.6,     2.9,     3.6,     7.0,     8.2,   21.0,  0.17943,      0,      0,       0,       0,       0,       0
total,       1105490,   28034,   28034,   28034,     1.8,     1.8,     3.5,     4.6,     5.3,    13.8,   22.0,  0.17609,      0,      0,       0,       0,       0,       0
total,       1132221,   26731,   26731,   26731,     1.9,     1.8,     3.8,     5.2,     8.4,    11.1,   23.0,  0.17314,      0,      0,       0,       0,       0,       0
total,       1162149,   29928,   29928,   29928,     1.7,     1.7,     3.0,     4.5,     8.0,     9.1,   24.0,  0.16950,      0,      0,       0,       0,       0,       0
...
```

After:
```
type       total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
...
total,        822863,   94379,   94379,   94379,     0.5,     0.3,     2.0,     2.0,     2.1,     3.7,    9.0,  0.06669,      0,      0,       0,       0,       0,       0
total,        937337,  114474,  114474,  114474,     0.4,     0.2,     2.0,     2.0,     2.1,     3.4,   10.0,  0.06301,      0,      0,       0,       0,       0,       0
total,        986630,   49293,   49293,   49293,     1.0,     1.0,     2.0,     2.1,    17.9,    19.0,   11.0,  0.07318,      0,      0,       0,       0,       0,       0
total,       1026734,   40104,   40104,   40104,     1.2,     1.0,     2.0,     2.2,     6.3,     7.1,   12.0,  0.08410,      0,      0,       0,       0,       0,       0
total,       1066124,   39390,   39390,   39390,     1.3,     1.0,     2.0,     2.2,     2.6,     3.4,   13.0,  0.09108,      0,      0,       0,       0,       0,       0
total,       1103082,   36958,   36958,   36958,     1.3,     1.1,     2.1,     2.5,     3.1,     4.2,   14.0,  0.09643,      0,      0,       0,       0,       0,       0
total,       1141987,   38905,   38905,   38905,     1.3,     1.0,     2.0,     2.4,    11.4,    12.7,   15.0,  0.09894,      0,      0,       0,       0,       0,       0
total,       1180023,   38036,   38036,   38036,     1.3,     1.0,     2.0,     3.7,     5.6,     7.1,   16.0,  0.10070,      0,      0,       0,       0,       0,       0
total,       1216481,   36458,   36458,   36458,     1.4,     1.0,     2.1,     3.6,     4.7,     5.0,   17.0,  0.10210,      0,      0,       0,       0,       0,       0
total,       1256819,   40338,   40338,   40338,     1.2,     1.0,     2.0,     2.2,     3.5,     5.4,   18.0,  0.10173,      0,      0,       0,       0,       0,       0
total,       1295122,   38303,   38303,   38303,     1.3,     1.0,     2.0,     2.4,    21.0,    21.1,   19.0,  0.10136,      0,      0,       0,       0,       0,       0
total,       1334743,   39621,   39621,   39621,     1.3,     1.0,     2.0,     2.3,     3.3,     4.0,   20.0,  0.10055,      0,      0,       0,       0,       0,       0
total,       1375579,   40836,   40836,   40836,     1.2,     1.0,     2.0,     2.1,     3.4,     5.7,   21.0,  0.09927,      0,      0,       0,       0,       0,       0
total,       1415576,   39997,   39997,   39997,     1.2,     1.0,     2.0,     2.3,     3.2,     4.1,   22.0,  0.09807,      0,      0,       0,       0,       0,       0
total,       1449268,   33692,   33692,   33692,     1.5,     1.4,     2.5,     3.2,     4.2,     5.6,   23.0,  0.09800,      0,      0,       0,       0,       0,       0
total,       1471873,   22605,   22605,   22605,     2.2,     2.0,     4.8,     5.9,     7.0,     7.9,   24.0,  0.10015,      0,      0,       0,       0,       0,       0
...
```

Fixes: https://github.com/scylladb/scylladb/issues/24411

This is a new feature, so no backport needed.

Closes scylladb/scylladb#26411

* github.com:scylladb/scylladb:
  docs: workload-prioritization: add driver service level
  test: add test to verify use of `sl:driver`
  transport: use `sl:driver` to handle driver's control connections
  transport: whitespace only change in update_scheduling_group
  transport: call update_scheduling_group for non-auth connections
  generic_server: transport: start using `sl:driver` for new connections
  test: add test_desc_* for driver service level
  test: service_levels: add tests for sl:driver creation and removal
  test: add reload_raft_topology_state() to ScyllaRESTAPIClient
  service_level_controller: automatically create `sl:driver`
  service_level_controller: methods to create driver service level
  service_level_controller: handle special sl:driver in DESC output
  topology_coordinator: add service_level_controller reference
  system_keyspace: add service_level_driver_created
  test: add MAX_USER_SERVICE_LEVELS
2025-10-09 17:28:39 +03:00
Dawid Mędrek
ecc955fbe0 index/secondary_index_manager: Take std::span instead of std::vector 2025-10-09 16:17:07 +02:00
Dawid Mędrek
074f0f2e4c index/secondary_index_manager: Add missing const qualifier 2025-10-09 16:06:50 +02:00
Dawid Mędrek
7baf95bc4b index/vector_index: Add missing const qualifiers 2025-10-09 16:06:24 +02:00
Dawid Mędrek
4486ac0891 cql3/statements/index_prop_defs.cc: Remove unused include 2025-10-09 16:01:56 +02:00
Dawid Mędrek
d50c2f7c74 cql3/statements/index_prop_defs.cc: Mark function as TU-local 2025-10-09 16:00:44 +02:00
Dawid Mędrek
89b3d0c582 cql3/statements/index_prop_defs: Mark methods as const-qualified 2025-10-09 15:53:29 +02:00
Avi Kivity
bb02295695 setup: add the lazytime XFS mount option
In f828fe0d59 ("setup: add the lazytime XFS version") we added the
lazytime mount option to /var/lib/scylla, but it was quickly reverted
(8f5e80e61a) as it caused a regression on CentOS 7.

We reinstate it now with a kernel version check. This will avoid
the lazytime mount option on CentOS 7, which is unsupported anyway.

The lazytime option avoids marking the inode as dirty if it's only for the
purpose of updating mtime/ctime. This won't help much while writing sstables
(since the write also updates extent information), but may help a little
with with commitlog writes, since those are pure overwrites.

It likely won't help with the RWF_NOWAIT violations seen in [1], since
those are likely due to in-memory locking, not flushing dirty inodes
to disk.

Tested with an install to Ubuntu 24.04 LTS followed by a scylla_setup run.
The lazytime option was added the the .mount file and showed up in
the live mount.

[1] https://github.com/scylladb/seastar/issues/2974

Closes scylladb/scylladb#26436
Fixes #26002
2025-10-09 15:55:58 +03:00
Ernest Zaslavsky
c2bab430d7 s3_client: fix when condition to prevent infinite locking
Refine condition variable predicate in filling fiber to avoid
indefinite waiting when `close` is invoked.

Closes scylladb/scylladb#26449
2025-10-09 15:55:37 +03:00
Piotr Wieczorek
b54ad9e22f storage: Add cdc options to cas_request::apply 2025-10-09 12:28:10 +02:00
Piotr Wieczorek
2c1e699864 cdc, storage: Add a struct to pass per-mutation options to CDC
This will allow us to communicate with CDC from higher layers. We plan
to use it to reduce the number of read-before-writes with preimages by
passing the row selected in upper layers.
2025-10-09 12:28:10 +02:00
Piotr Wieczorek
66935bedac cdc: Move operations enum to the top of the namespace 2025-10-09 12:28:10 +02:00
Michał Chojnowski
c35b82b860 test/cluster/test_bti_index.py: avoid a race with CQL tracing
The test uses CQL tracing to check which files were read by a query.
This is flaky if the coordinator and the replica are different shards,
because the Python driver only waits for the coordinator, and not
for replicas, to finish writing their traces.
(So it might happen that the Python driver returns a result
with only coordinator events and no replica events).

Let's just dodge the issue by using --smp=1.

Fixes scylladb/scylladb#26432

Closes scylladb/scylladb#26434
2025-10-09 13:22:06 +03:00
Michał Chojnowski
87e3027c81 docs: fix a parameter name in API calls in sstable-dictionary-compression.rst
The correct argument name is `cf`, not `table`.

Fixes scylladb/scylladb#25275

Closes scylladb/scylladb#26447
2025-10-09 13:18:47 +03:00
Robert Bindar
2c74a6981b Make scylla_io_setup detect request size for best write IOPS
We noticed during work on scylladb/seastar#2802 that on i7i family
(later proved that it's valid for i4i family as well),
the disks are reporting the physical sector sizes incorrectly
as 512bytes, whilst we proved we can render much better write IOPS with
4096bytes.

This is not the case on AWS i3en family where the reported 512bytes
physical sector size is also the size we can achieve the best write IOPS.

This patch works around this issue by changing `scylla_io_setup` to parse
the instance type out of `/sys/devices/virtual/dmi/id/product_name`
and run iotune with the correct request size based on the instance type.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#25315
2025-10-08 14:30:52 +03:00
Piotr Dulikowski
fe7ffc5e5d Merge 'service/qos: set long timeout for auth queries on SL cache update' from Michael Litvak
pass an appropriate query state for auth queries called from service
level cache reload. we use the function qos_query_state to select a
query_state based on caller context - for internal queries, we set a
very long timeout.

the service level cache reload is called from group0 reload. we want it
to have a long timeout instead of the default 5 seconds for auth
queries, because we don't have strict latency requirement on the one
hand, and on the other hand a timeout exception is undesired in the
group0 reload logic and can break group0 on the node.

Fixes https://github.com/scylladb/scylladb/issues/25290

backport possible to improve stability

Closes scylladb/scylladb#26180

* github.com:scylladb/scylladb:
  service/qos: set long timeout for auth queries on SL cache update
  auth: add query_state parameter to query functions
  auth: refactor query_all_directly_granted
2025-10-08 12:37:01 +02:00
Michał Jadwiszczak
84e4e34d81 db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground
This patch moves `discover_existing_staging_sstables()` to be executed
from main level, instead of running it on the background fiber.

This method need to be run only once during the startup to collect
existing staging sstables, so there is no need to do it in the
background. This change will increase debugability of any further issues
related to it (like scylladb/scylladb#26403).

Fixes scylladb/scylladb#26417
2025-10-08 11:16:07 +02:00
Michał Jadwiszczak
575dce765e db/view/view_building_worker: futurize and rename start_background_fibers()
Next commit will move `discover_existing_staging_sstables()`
to the foreground, so to prepare for this we need to futurize
`start_background_fibers()` method and change its name to better reflect
its purpose.
2025-10-08 10:19:41 +02:00
Andrzej Jackowski
0072b75541 docs: workload-prioritization: add driver service level
Refs: scylladb/scylladb#24411
2025-10-08 08:25:38 +02:00
Andrzej Jackowski
f720ce0492 test: add test to verify use of sl:driver
`sl:driver` is expected to be used for new and control connections,
but other connections that run user load should not use it after
the user is authenticated.

Refs: scylladb/scylladb#24411
2025-10-08 08:25:33 +02:00
Andrzej Jackowski
f99b8c4a55 transport: use sl:driver to handle driver's control connections
Before `sl:driver` was introduced, service levels were assigned as
follows:
 1. New connections were processed in `main`.
 2. After user authentication was completed, the connection's SL was
    changed to the user's SL (or `sl:default` if the user had no SL).

This commit introduces `service_level_state` to `client_state` and
implements the following logic in `transport/server`:

 1. If `sl:driver` is not present in the system (for example, it was
    removed), service levels behave as described above.
 2. If `sl:driver` is present, the flow is:
   I.   New connections use `sl:driver`.
   II.  After user authentication is completed, the connection's SL is
        changed to the user's SL (or `sl:default`).
   III. If a REGISTER (to events) request is handled, the client is
        processing the control connection. We mark the client_state
        to permanently use `sl:driver`.

The aforementioned state `2.III` is represented by
`_control_connection` flag in `client_state`.

Fixes: scylladb/scylladb#24411
2025-10-08 08:25:28 +02:00
Andrzej Jackowski
fd36bc418a transport: whitespace only change in update_scheduling_group
The indentation is changed because it will be required in the next
commit of this patch series.
2025-10-08 08:25:22 +02:00
Andrzej Jackowski
278019c328 transport: call update_scheduling_group for non-auth connections
Before this change, unauthorized connections stayed in `main`
scheduling group. It is not ideal, in such case, rather `sl:default`
should be used, to have a consistent behavior with a scenario
where users is authenticated but there is no service level assigned
to the user.

This commit adds a call to `update_scheduling_group` at the end of
connection creation for an unauthenticated user, to make sure the
service level is switched to `sl:default`.

Fixes: scylladb/scylladb#26040
2025-10-08 08:25:17 +02:00
Andrzej Jackowski
14081d0727 generic_server: transport: start using sl:driver for new connections
Before this change, new connections were handled in a default
scheduling group (`main`), because before the user is authenticated
we do not know which service level should be used. With the new
`sl:driver` service level, creation of new connections can be moved to
`sl:driver`.

We switch the service level as early as possible, in `do_accepts`.
There is a possibility, that `sl:driver` will not exist yet, for
instance, in specific upgrade cases, or if it was removed. Therefore,
we also switch to `sl:driver` after a connection is accepted.

Refs: scylladb/scylladb#24411
2025-10-08 08:25:12 +02:00
Andrzej Jackowski
b62135f767 test: add test_desc_* for driver service level
Driver service level is a special service level that is created
automatically by the system. Therefore, it requires special handling
in DESC SCHEMA WITH INTERNALS and those test verifies the special
behavior.

Refs: scylladb/scylladb#24411
2025-10-08 08:25:07 +02:00
Andrzej Jackowski
0ddf46c7b4 test: service_levels: add tests for sl:driver creation and removal
Refs: scylladb/scylladb#24411
2025-10-08 08:25:02 +02:00
Andrzej Jackowski
9e9bca9bdb test: add reload_raft_topology_state() to ScyllaRESTAPIClient
To encapsulate `/storage_service/raft_topology/reload` API call
2025-10-08 08:24:57 +02:00
Andrzej Jackowski
c59a7db1c9 service_level_controller: automatically create sl:driver
This commit:
  - Increases the number of allowed scheduling groups to allow the
    creation of `sl:driver`.
  - Adds the `DRIVER_SERVICE_LEVEL` feature, which prevents creating
    `sl:driver` until all nodes have increased the number of
    scheduling groups.
  - Starts using `get_create_driver_service_level_mutations`
    to unconditionally create `sl:driver` on
    `raft_initialize_discovery_leader`. The purpose of this code
    path is ensuring existence of `sl:driver` in new system and tests.
  - Starts using `migrate_to_driver_service_level` to create `sl:driver`
    if it is not already present. The creation of `sl:driver` is
    managed by `topology_coordinator`, similar to other system keyspace
    updates, such as the `view_builder` migration. The purpose of this
    code path is handling upgrades.
  - Modifies related tests to pass after `sl:driver` is added.

Later in this patch series, `sl:driver` will be used by
`transport/server` to handle selected traffic, such as the driver's
schema and topology fetches.

Refs: scylladb/scylladb#24411
2025-10-08 08:24:43 +02:00
Andrzej Jackowski
923559f46a service_level_controller: methods to create driver service level
This commit implements `get_create_driver_service_level_mutations`
and `migrate_to_driver_service_level` in service_level_controller.
Both methods create `sl:driver` with shares=200 and store this fact
in `system.scylla_local`. Both methods will be used later in this
patch series for automatic creation of sl:driver.

Refs: scylladb/scylladb#24411
2025-10-08 08:24:38 +02:00
Andrzej Jackowski
2d296a2f9b service_level_controller: handle special sl:driver in DESC output
Later in this patch series, `sl:driver` will be added as a special
service level created automatically by the system. It needs special
handling in `DESC SCHEMA ...` to ensure that during backup restore:
  1. CREATE SERVICE LEVEL does not fail if `sl:driver` already exists
  2. If `sl:driver` exists, its configuration is fully restored (emit
     ALTER SERVICE LEVEL).
  3. If `sl:driver` was removed, the information is retained (emit
     DROP SERVICE LEVEL instead of CREATE/ALTER).

Refs: scylladb/scylladb#24411
2025-10-08 08:24:33 +02:00
Andrzej Jackowski
1ff605005e topology_coordinator: add service_level_controller reference
This adds a reference to sl_controller so that, later in this patch
series, topology_coordinator can manage creating `sl:driver` once
group0 is fully operational.

Refs: scylladb/scylladb#24411
2025-10-08 08:24:28 +02:00
Andrzej Jackowski
8953f96609 system_keyspace: add service_level_driver_created
This commit extends sytem.scylla_local table with an additional
key/value pair that can be used later in this patch series to
keep an information that `sl:driver` was already created. The purpose
of storing this information is to ensure that `sl:driver` is
not recreated after being intentionally removed.

A new mutation is included in `register_raft_pull_snapshot` to keep
`service_level_driver_created` in state machine shapshot, which is
required for proper propagation of the value when a new node is added
to the cluster.

Refs: scylladb/scylladb#24411
2025-10-08 08:24:23 +02:00
Andrzej Jackowski
7d2db37831 test: add MAX_USER_SERVICE_LEVELS
Previously, tests used the hardcoded value 7 for the maximum number of
user service levels. This commit introduces a named variable that can
be shared across tests to avoid cases where this magic number goes
out of sync.
2025-10-08 08:24:17 +02:00
Patryk Jędrzejczak
d391b9d7d9 test: assert that majority is lost in some tests of the recovery procedure
The voter handler caused `test_raft_recovery_user_data` to stop losing
group 0 majority when expected. We make sure this won't happen again
in this commit.

We don't change `test_raft_recovery_entry_lose` because it has some
checks that would fail with group 0 majority (schema versions would
match).

Note that it's possible to timeout the read barrier quickly without the
`timeout` parameter. See e.g. `test_cannot_add_new_node` in
`test_raft_no_quorum.py`. We don't take this approach here because we
don't want to change the default Raft parameters in the recovery
procedure tests.
2025-10-07 17:48:55 +02:00
Patryk Jędrzejczak
d623844c1c test: rest_client: add timeout support for read_barrier
Scylla already handles the `timeout` parameter, so the change is simple.

We use the `timeout` parameter in the following commit.
2025-10-07 17:48:55 +02:00
Patryk Jędrzejczak
91c8466f47 test: test_raft_recovery_user_data: lose majority when killing one dc
After introducing the voter handler, the test stopped losing group 0
majority when expected because the killed dc contained 2 out of 5
voters. We fix it in this commit. The fix relies on the voter handler
not doing unnecessary work. The first dc should keep its voters and
majority.

The test was functional even though majority wasn't lost when expected.
Stopping the recovery leader before restarting it with `recovery_leader`
caused majority loss in the old group 0. Hence, there is no need to
backport this commit.
2025-10-07 17:48:55 +02:00
Patryk Jędrzejczak
c8a5e7a74e test: test_raft_recovery_user_data: shutdown driver sessions
Shutting down `ccluster_all_nodes` in the previous commit is necessary
to avoid flakiness. It turns out that leaked driver sessions can impact
another run of the test case (with different parameterization). Here,
without shutting down `ccluster_all_nodes`, we could observe the DDL
requests from `start_writes` fail in the second test case run
(where `remove_dead_nodes_with == "replace"`) like this:
```
>       await cql.run_async(f"USE {ks_name}")
E       cassandra.cluster.NoHostAvailable: ('Unable to complete the
            operation against any hosts', {<Host: 127.46.35.70:9042 dc1>:
            ConnectionException('Host has been marked down or removed'),
            <Host: 127.46.35.71:9042 dc1>: ConnectionException('Host has
            been marked down or removed'), <Host: 127.46.35.3:9042 dc1>:
            ConnectionException('Host has been marked down or removed'),
            <Host: 127.46.35.25:9042>: ConnectionException('Host has
            been marked down or removed')})
```
We could also see errors like this on the driver:
```
cassandra.InvalidRequest: Error from server: code=2200 [Invalid query]
    message="Keyspace 'test_1759763911381_oktks' does not exist"
```

It turned out that `test_1759763911381_oktks` was created in the first
test case run (where `remove_dead_nodes_with == "remove"), and somehow
the driver session created in the second test case run was still using
this keyspace in some way. The DDL requests were failing on the Scylla
side with the error above, and after some retries, the driver marked
nodes as down. I didn't try to investigate what exactly the driver was
doing.

In this commit, we shut down other driver sessions used in this test.
They didn't cause problems so far, but we'd better use the Python driver
correctly and be safe.
2025-10-07 17:48:55 +02:00
Patryk Jędrzejczak
a35740cbe8 test: test_raft_recovery_user_data: use a separate driver connection for the write workload
It's simpler than pausing the workload for the `cql` reconnection.

Moreover, the removed `start_writes` call required group 0 majority for
(redundant) CREATE KEYSPACE IF NOT EXISTS and CREATE TABLE IF NOT EXISTS
statements. The test shouldn't have group 0 majority at that point,
which is fixed in one of the following commits.

Using a separate driver connection also allows us to call
`finish_writes()` a bit later, after the `cql` reconnection.
2025-10-07 17:48:55 +02:00
Patryk Jędrzejczak
d1a944251e test: test_raft_recovery_user_data: send ALTER KEYSPACE to any node
We have the global request queue now, so we can't hit "Another global
topology request is ongoing, please retry." anymore.
2025-10-07 17:48:55 +02:00
Patryk Jędrzejczak
9a98febac5 test: test_raft_recovery_user_data: bring failure_detector_timeout_in_ms back to 20 s
It looks like decreasing `failure_detector_timeout_in_ms` doesn't make
the shutdown faster anymore.

We had some changes related to requests during shutdown like #24499
and #24714. They are probably the reason.
2025-10-07 17:48:55 +02:00
Patryk Jędrzejczak
2b1d7f0e83 test: test_raft_recovery_user_data: speed up replace operations 2025-10-07 17:48:55 +02:00
Patryk Jędrzejczak
dbd998bc15 test: stop/start servers concurrently in the recovery procedure tests
This change makes these tests a bit faster.
2025-10-07 17:48:51 +02:00
Dawid Mędrek
a9577e4d52 replica/database: Fix description of validate_tablet_views_indexes
The current description is not accurate: the function doesn't throw
an exception if there's an invalid materialized view. Instead, it
simply logs the keyspaces that violate the requirement.

Furthermore, the experimental feature `views-with-tablets` is no longer
necessary for considering a materialized view as valid. It was dropped
in scylladb/scylladb@b409e85c20. The
replacement for it is the cluster feature `VIEWS_WITH_TABLETS`.

Fixes scylladb/scylladb#26420

Closes scylladb/scylladb#26421
2025-10-07 17:39:43 +02:00
Artsiom Mishuta
99455833bd test.py: reintroducing sudo in resource_gather.py
conditionally reintroducing sudo for resource gathering
when running under docker

related: https://github.com/scylladb/scylladb/pull/26294#issuecomment-3346968097

fixes: https://github.com/scylladb/scylladb/issues/26312

Closes scylladb/scylladb#26401
2025-10-07 14:42:15 +02:00
Piotr Dulikowski
264cf12b66 Merge 'view building coordinator - add missing tests' from Michał Jadwiszczak
This patch adds tests for:
- tablet migration during view building
- tablet merge during view building.

Those tests were missing from the original testing plan.

We want to backport it to 2025.4 to ensure the release is bug-free.

Closes scylladb/scylladb#26414

* github.com:scylladb/scylladb:
  test/cluster/test_view_building_coordinator: add test for tablet merge
  test/cluster/test_view_building_coordinator: add test for tablet migration
2025-10-07 14:25:04 +02:00
Botond Dénes
8b0bfb817e Merge 'Switch REST API server to use content-streaming' from Pavel Emelyanov
Seastar httpd recommended users to stop using contiguous requet.content string and read body they need from request's input_stream instead. However, "official" deprecation of request content had been only made recently.

This PR patches REST API server to turn this feature on and patches few handlers that mess with request bodies to read them from request stream.

Using newer seastar API, no need to backport

Closes scylladb/scylladb#26418

* github.com:scylladb/scylladb:
  api: Switch to request content streaming
  api: Fix indentation after previous patch
  api: Coroutinize set_relabel_config handler
  api: Coroutinize set_error_injection handler
2025-10-07 14:13:47 +03:00
Michał Chojnowski
3cf51cb9e8 sstables: fix some typos in comments
I added those typos recently, and spellcheckers complain.

Closes scylladb/scylladb#26376
2025-10-07 13:20:06 +03:00
Botond Dénes
8beea931be Merge 'Remove system_keyspace from column_family API' from Pavel Emelyanov
This dependency reference is carried into column_family handlers block to make get_built_views handler work. However, the handler in question should live in view_builder block, because it works with v.b. data. This PR moves the handler there, while at it, coroutinizes it, and removes the no longer needed sys.ks. reference from column_family.

API dependencies cleanup work, no need to backport

Closes scylladb/scylladb#26381

* github.com:scylladb/scylladb:
  api: Fix indentation after previous patch
  api: Coroutinize get_built_indexes handler code
  api: Remove system_keyspace ref from column_family API block
  api: Move get_built_indexes from column_family to view_builder
2025-10-07 13:07:46 +03:00
Pavel Emelyanov
ed1c049c3b scripts: Add usage to pull_github_pr script
If mis-used, the script says

   error: unrecognized option: ..., see ./scripts/pull_github_pr.sh -h for usage

but if using the suggested -h option it prints just the same.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26378
2025-10-07 10:28:35 +03:00
Lakshmi Narayanan Sreethar
b8042b66e3 cmake: replace -fvisibility=hidden compiler flag with -fvisibility-inlines-hidden
The PR #26154 dropped the `-fvisibility=hidden` compiler flag and
replaced it with `-fvisibility-inlines-hidden` as the former caused
issues in how the `noncopyable_function::operator bool` method executed
leading to incorrect return values. Apply the same fix to cmake.

Fixes #26391

Closes scylladb/scylladb#26431
2025-10-07 10:10:47 +03:00
Pavel Emelyanov
127afd4da1 api: Switch to request content streaming
There are three handler that need to be patched all at once with the
server itself being marked with set_content_streaming

For two simple handler just get the content string with
read_entire_stream_contiguous helper. This is what httpd server did
anyway.

The "start_restore" handler used the contiguous contents to parse json
from using rjson utility. This handler is patched to use
read_entire_stream() that returns a vector of temporary buffers. The
rjson parser has a helper to pars from that vector, so the change is
also optimization.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-10-06 16:43:26 +03:00
Pavel Emelyanov
2cfccdac5c api: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-10-06 16:43:12 +03:00
Pavel Emelyanov
5668058cb0 api: Coroutinize set_relabel_config handler
Without the invoke_on_all lambda, for simplicity
Also keep indentation "broken" for the ease of review

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-10-06 16:42:30 +03:00
Pavel Emelyanov
5017a25c00 api: Coroutinize set_error_injection handler
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-10-06 16:42:14 +03:00
Patryk Jędrzejczak
67d48a459f raft topology: make the voter handler consider only group 0 members
In the Raft-based recovery procedure, we create a new group 0 and add
live nodes to it one by one. This means that for some time there are
nodes which belong to the topology, but not to the new group 0. The
voter handler running on the recovery leader incorrectly considers these
nodes while choosing voters.

The consequences:
- misleading logs, for example, "making servers {<ID of a non-member>}
  voters", where the non-member won't become a voter anyway,
- increased chance of majority loss during the recovery procedure, for
  example, all 3 nodes that first joined the new group 0 are in the same
  dc and rack, but only one of them becomes a voter because the voter
  handler tries to make non-members in other dcs/racks voters.

Fixes #26321

Closes scylladb/scylladb#26327
2025-10-06 16:27:47 +03:00
Pavel Emelyanov
8002ddf946 code: Use tls_options::bye_timeout instead of deprecated switch
Some code wants its TLS sockets to close immediately without sending BYE
message and waiting for the response. Recent seastar update changed the
way this functionality is requested (scylladb/seastar#2986)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26253
2025-10-06 16:25:35 +03:00
Michał Jadwiszczak
279a8cbba3 test/cluster/test_view_building_coordinator: add test for tablet merge
The test pauses processing of the view building task and triggers tablet
merge.
2025-10-06 15:06:11 +02:00
Michał Jadwiszczak
fc7e5370a1 test/cluster/test_view_building_coordinator: add test for tablet migration
The test pauses processing of the view building task and migrates it to
another node.
2025-10-06 15:02:42 +02:00
Michał Chojnowski
3b338e36c2 utils/config_file: fix a missing allowed_values propagation in one of named_value constructors
In one of the constructors of `named_value`, the `allowed_values`
argument isn't used.

(This means that if some config entry uses this constructor,
the values aren't validated on the config layer,
and might give some lower layer a bad surprise).

Fix that.

Fixes scyllladb/scylladb#26371

Closes scylladb/scylladb#26196
2025-10-06 15:33:11 +03:00
Michał Chojnowski
dbddba0794 sstables/trie: actually apply BYPASS CACHE to index reads
BYPASS CACHE is implemented for `bti_index_reader` by
giving it its own private `cached_file` wrappers over
Partitions.db and Rows.db, instead of passing it
the shared `cached_file` owned by the sstable.

But due to an oversight, the private `cached_file`s aren't
constructed on top of the raw Partitions.db and Rows.db
files, but on top of `cached_file_impl` wrappers around
those files. Which means that BYPASS CACHE doesn't
actually do its job.

Tests based on `scylla_index_page_cache_*` metrics
and on CQL tracing still see the reads from the private
files as "cache misses", but those misses are served
from the shared cached files anyway, so the tests don't see
the problem. In this commit we extend `test_bti_index.py`
with a check that looks at reactor's `io_queue` metrics
instead, and catches the problem.

Fixes scylladb/scylladb#26372

Closes scylladb/scylladb#26373
2025-10-06 15:32:05 +03:00
Andrzej Jackowski
c3dd383e9e test: add reproduction of name reuse bug to service level tests
This commit adds a reproduction test for scylladb/scylladb#26190 to the
service levels test suite. Although the bug was fixed internally in
Seastar, the corner-case service level name reuse scenario should be
covered by tests to prevent regressions.

Refs: https://github.com/scylladb/scylladb/issues/26190

Closes scylladb/scylladb#26379
2025-10-06 14:19:22 +02:00
Piotr Dulikowski
380f243986 Merge ' Support replication factor rack list for tablet-based keyspaces' from Tomasz Grabiec
This change extends the CQL replication options syntax so the replication factor can be stated as a list of rack names.
For example: { 'mydatacenter': [ 'myrack1', 'myrack2', 'myrack4' ] }

Rack-list based RF can coexist with the old numerical RF, even in the same keyspace for different DCs.

Specifying the rack list also allows to add replicas on the specified racks (increasing the replication factor), or decommissioning certain racks from their replicas (by omitting them from the current datacenter rack-list). This will allow us to keep the keyspace rf-rack-valid, maintaining guarantees, while allowing adding/removing racks. In particular, this will allow us to add a new DC, which happens by incrementally increasing RF in that DC to cover existing racks.

Migration from numerical RF to rack-list is not supported yet. Migration from rack-list to numerical RF is not planned to be supported.

New feature, no backport required.

Co-authored with @bhalevy

Fixes https://github.com/scylladb/scylladb/issues/25269
Fixes https://github.com/scylladb/scylladb/issues/23525

Closes scylladb/scylladb#26358

* github.com:scylladb/scylladb:
  tablets: load_balancer: Recognize that tablets are confined to racks when computing desired tablet count
  locator: Make hasher for endpoint_dc_rack globally accessible
  test: tablets: Add test for replica allocation on rack list changes
  test: lib: topology_builder: generate unique rack names
  test: Add tests for rack list RF
  doc: Document rack-list replication factor
  topology_coordinator: Restore formatting
  topology_coordinator: Cancel keyspace alter on broader set of errors
  topology_coordinator: Make keyspace alter process options through as_ks_metadata_update()
  cql3: ks_prop_defs: Preserve old options
  cql3: ks_prop_defs: Introduce flattened()
  locator: Recognize rack list RF as valid in assert_rf_rack_valid_keyspace()
  tablet_allocator: Respect binding replicas to racks
  locator: network_topology_strategy: Respect rack list when reallocating tablets
  cql3: ks_prop_defs: Fail with more information when options are not in expected format
  locator, cql3: Support rack lists in replication options
  cql3: Fail early on vnode/tablet flavor alter
  cql3: Extract convert_property_map() out of Cql.g
  schema: Use definition from the header instead of open-coding it
  locator: Abstract obtaining the number of replicas from replication_strategy_config_option
  cql3, locator: Use type aliases for option maps
  locator: Add debug logging
  locator: Pass topology to replication strategy constructor
  abstract_replication_strategy, network_topology_strategy: add replication_factor_data class
2025-10-06 14:14:09 +02:00
Piotr Dulikowski
e7907b173a Merge 'db/view: Require rf_rack_valid_keyspaces when creating materialized view' from Dawid Mędrek
Materialized views are currently in the experimental phase and using them
in tablet-based keyspaces requires starting Scylla with an experimental feature,
`views-with-tablets`. Any attempts to create a materialized view or secondary
index when it's not enabled will fail with an appropriate error.

After considerable effort, we're drawing close to bringing views out of the
experimental phase, and the experimental feature will no longer be needed.
However, materialized views in tablet-based keyspaces will still be restricted,
and creating them will only be possible after enabling the configuration option
`rf_rack_valid_keyspaces`. That's what we do in this PR.

In this patch, we adjust existing tests in the tree to work with the new
restriction. That shouldn't have been necessary because we've already seemingly
adjusted all of them to work with the configuration option, but some tests hid
well. We fix that mistake now.

After that, we introduce the new restriction. What's more, when starting Scylla,
we verify that there is no materialized view that would violate the contract.
If there are some that do, we list them, notify the user, and refuse to start.

High-level implementation strategy:

1. Name the restrictions in form of a function.
2. Adjust existing tests.
3. Restrict materialized views by both the experimental feature
   and the configuration option. Add validation test.
4. Drop the requirement for the experimental feature. Adjust the added test
   and add a new one.
5. Update the user documentation.

Fixes scylladb/scylladb#23030

Backport: 2025.4, as we are aiming to support materialized views for tablets from that version.

Closes scylladb/scylladb#25802

* github.com:scylladb/scylladb:
  view: Stop requiring experimental feature
  db/view: Verify valid configuration for tablet-based views
  db/view: Require rf_rack_valid_keyspaces when creating view
  test/cluster/random_failures: Skip creating secondary indexes
  test/cluster/mv: Mark test_mv_rf_change as skipped
  test/cluster: Adjust MV tests to RF-rack-validity
  test/boost/schema_loader_test.cc: Explicitly enable rf_rack_valid_keyspaces
  db/view: Name requirement for views with tablets
2025-10-06 12:46:46 +02:00
Pavel Emelyanov
6ad8dc4a44 Merge 'root,replica: mv querier to replica/' from Botond Dénes
The querier object is a confusing one. Based on its name it should be in the query/ module and it is already in the query namespace. The query namespace is used for symbols which span the coordinator and replica, or that are mostly coordinator side. The querier is mainly in this namespace due to its similar name and because at the time it was introduced, namespace replica didn't exist yet. But this is a mistake which confuses people.
The querier is actually a completely replica-side logic, implementing the caching of the readers on the replica. Move it to the replica module and namespace to make this more clear.

Code cleanup, no backport.

Closes scylladb/scylladb#26280

* github.com:scylladb/scylladb:
  replica: move querier code to replica namespace
  root,replica: mv querier to replica/
2025-10-06 08:26:05 +03:00
Pavel Emelyanov
5cf9043d74 Merge 'sstables/sstable_directory: don't forget to delete other components when deleting TemporaryHashes.db' from Michał Chojnowski
TemporaryHashes.db is a temporary sstable component used during ms
sstable writes. It's different from other sstable components in that
it's not included in the TOC. Because of this, it has a special case in
the logic that deletes unfinished sstables on boot.
(After Scylla dies in the middle of a sstable write).

But there's a bug in that special case,
which causes Scylla to forget to delete other components from the same unfinished sstable.

The code intends only to delete the TemporaryHashes.db file from the
`_state->generations_found` multimap, but it accidentally also deletes
the file's sibling components from the multimap. Fix that.

Also, extend a related test so that it would catch the problem before the fix.

Fixes scylladb/scylladb#26393

Bugfix, needs backport to 2025.4.

Closes scylladb/scylladb#26394

* github.com:scylladb/scylladb:
  sstables/sstable_directory: don't forget to delete other components when deleting TemporaryHashes.db
  test/boost/database_test: fix two no-op distributed loader tests
2025-10-06 08:23:03 +03:00
Andrzej Jackowski
3411089f5d treewide: seastar module update
The reason for this seastar update is fixing #26190 - a service
level bug caused by a problem in scheduling group in seastar
implementation (seastar#2992).

* ./seastar 9c07020a...270476e7 (10):
  > core: restore seastar_logger namespace in try_systemwide_memory_barrier
  > Merge 'coroutines: support coroutines that copy their captures into the coroutine frame' from Avi Kivity
    coroutines: advertise lambda-capture-by-value and test it
    future: invoke continuation functions as temporaries
    future: handle lvalue references in future continuations early
  > resource: Tune up some allocate_io_queues() arguments
  > Merge 'Add perf test hooks' from Travis Downs
    perf_tests:add tests to verify pre-run hooks
    per_tests: add pre-run hooks
    perf-tests.md: update on measurement overhead
    perf_tests_perf: a few more test variations
    remove vestigial register_test method
  > Add `touch` command to `rl` file processing
  > Merge 'execution_stage: update stage name on scheduling_group rename' from Andrzej Jackowski
    test: add sg_rename_recreate_with_the_same_name
    test: add test_renaming_execution_stage in metric_test
    test: add test_execution_stage_rename
    execution_stage: update stage name on scheduling_group rename
    execution_stage: reorganize per_group_stage_type
    execution_stage: add concrete_execution_stage_base
    execution_stage: move metrics setup to a separate method
  > iotune: Fix warmup calculation bug and botched rebase
  > Add missing `#pragma once` to ascii.rl
  > iotune: Ignore measurements during warmup period

Fixes: https://github.com/scylladb/scylladb/issues/26190

Closes scylladb/scylladb#26388
2025-10-06 08:13:37 +03:00
Michał Chojnowski
6efb807c1a sstables/sstable_directory: don't forget to delete other components when deleting TemporaryHashes.db
TemporaryHashes.db is a temporary sstable component used during ms
sstable writes. It's different from other sstable components in that
it's not included in the TOC. Because of this, it has a special case in
the logic that deletes unfinished sstables on boot.
(After Scylla dies in the middle of a sstable write).

But there's a bug in that special case,
which causes Scylla to forget to delete other components from the same unfinished sstable.

The code intends only to delete the TemporaryHashes.db file from the
`_state->generations_found` multimap, but it accidentally also deletes
the file's sibling components from the multimap. Fix that.

Fixes scylladb/scylladb#26393
2025-10-04 00:45:55 +02:00
Michał Chojnowski
16cb223d7f test/boost/database_test: fix two no-op distributed loader tests
There are two tests which effectively check nothing.

They intend to check that distributed loader removes "leftover" sstable
files. So they create some incomplete sstables, run the test env
on the directory, and the files disappeared.
But the test env completely clears the test directory before
the distributed loader looks at the files, so the tests succeed trivially.

Fix that by adding a config knob to the test env which instructs it
not to clear the directory before the test.
2025-10-04 00:44:49 +02:00
Michał Hudobski
3db2e67478 docs: adjust docs for VS auth changes
We adjust the documentation to include the new
VECTOR_SEARCH_INDEXING permission and its usage
and also to reflect the changes in the maximal
amount of service levels.
2025-10-03 16:55:57 +02:00
Michał Hudobski
e8fb745965 test: add tests for VECTOR_SEARCH_INDEXING permission
This commit adds tests to verify the expected
behavior of the VECTOR_SEARCH_INDEXING permission,
that is, allowing GRANTing this permission only on
ALL KEYSPACES and allowing SELECT queries only on tables
with vector indexes when the user has this permission
2025-10-03 16:55:57 +02:00
Michał Hudobski
6a69bd770a cql: allow VECTOR_SEARCH_INDEXING users to select
This patch allows users with the VECTOR_SEARCH_INDEXING permission
to perform SELECT queries on tables that have a vector index.
This is needed for the Vector Store service, which
reads the vector-indexed tables, but does not require
the full SELECT permission.
2025-10-03 16:55:57 +02:00
Michał Hudobski
3025a35aa6 auth: add possibilty to check for any permission in set
This commit adds a new version of command_desc struct
that contains a set of permissions instead of a singular
permission. When this struct is passed to ensure/check_has_permission,
we check if the user has any of the included permission on the resource.
2025-10-03 16:55:57 +02:00
Michał Hudobski
ae86bfadac auth: add a new permission VECTOR_SEARCH_INDEXING
This patch adds a new permission: VECTOR_SEARCH_INDEXING,
that is grantable only for ALL KEYSPACES. It will allow selecting
from tables with vector search indexes. It is meant to be used
by the Vector Store service to allow it to build indexes without
having full SELECT permissions on the tables.
2025-10-03 16:36:54 +02:00
Ferenc Szili
20aeed1607 load balancing: extend locator::load_stats to collect tablet sizes
This commit extend the TABLE_LOAD_STATS RPC with data about the tablet
replica sizes and effective disk capacity.
Effective disk capacity of a node is computed as a sum of the sizes of
all tablet replicas on a node and available disk space.

This is the first change in the size based load balancing series.

Closes scylladb/scylladb#26035
2025-10-03 13:37:22 +02:00
Pavel Emelyanov
37f59cef04 Merge 'tools: fix documentation links after change to source-available' from Botond Dénes
Some tools commands have links to online documentation in their help output. These links were left behind in the source-available change, they still point to the old opensource docs. Furthermore, the links in the scylla-sstable help output always point to the latest stable release's documentation, instead of the appropriate one for the branch the tool was built from. Fix both of these.

Fixes: scylladb/scylladb#26320

Broken documentation link fix for the  tool help output, needs backport to all live source-available versions.

Closes scylladb/scylladb#26322

* github.com:scylladb/scylladb:
  tools/scylla-sstable: fix doc links
  release: adjust doc_link() for the post source-available world
  tools/scylla-nodetool: remove trailing " from doc urls
2025-10-03 13:53:19 +03:00
Pavel Emelyanov
7116e7dac6 api: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-10-03 13:51:25 +03:00
Pavel Emelyanov
42657105a3 api: Coroutinize get_built_indexes handler code
"While at it". It looks much simpler this way.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-10-03 13:50:58 +03:00
Pavel Emelyanov
f77f9db96c api: Remove system_keyspace ref from column_family API block
This reference was only needed to facilitate get_built_indexes handler
to work. Now it's gone and the sys.ks. reference is no longer needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-10-03 13:50:22 +03:00
Pavel Emelyanov
95b616d0e5 api: Move get_built_indexes from column_family to view_builder
The handler effectively works with the view_builder and should be
registerd in the block that has this service captured.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-10-03 13:49:33 +03:00
Tomasz Grabiec
9ebdeb261f tablets: load_balancer: Recognize that tablets are confined to racks when computing desired tablet count
The old logic assumes that replicas are spread across whole DC when
determining how many tablets we need to have at least 10 tablets per
shard. If replicas are actually confined to a subset of racks, that
will come up with a too high count and overshoot actual per-shard
count in this rack.

Similar problem happens for scaling-down of tablet count, when we try
to keep per-shard tablet count below the goal. It should be tracked
per-rack rather than per-DC, since racks can differ in how loaded they
are by RF if it's a rack-list.
2025-10-02 19:45:00 +02:00
Tomasz Grabiec
6962464be7 locator: Make hasher for endpoint_dc_rack globally accessible 2025-10-02 19:45:00 +02:00
Tomasz Grabiec
85ddb832b4 test: tablets: Add test for replica allocation on rack list changes 2025-10-02 19:45:00 +02:00
Benny Halevy
4955ca3ddd test: lib: topology_builder: generate unique rack names
Encode the dc identifier into each rack name so each dc will have its
own unique racks.

Just for easier distinction in logs.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-10-02 19:45:00 +02:00
Tomasz Grabiec
5fc617ecf5 test: Add tests for rack list RF 2025-10-02 19:45:00 +02:00
Tomasz Grabiec
1d34614421 doc: Document rack-list replication factor
Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>
2025-10-02 19:45:00 +02:00
Tomasz Grabiec
655f4ffa3c topology_coordinator: Restore formatting 2025-10-02 19:45:00 +02:00
Tomasz Grabiec
a21bbd4773 topology_coordinator: Cancel keyspace alter on broader set of errors
We now include keyspace metadata construction, which can throw if
validation fails. We want to fail the ALTER rather than keep retrying.
2025-10-02 19:44:59 +02:00
Tomasz Grabiec
d02f93e77e topology_coordinator: Make keyspace alter process options through as_ks_metadata_update()
There are several problems with how ALTER execution works with tablets.

1) Currently, option processing bypasses
   ks_prop_defs::prepare_options(), and will pass them directly to
   keyspace_metadata. This deviates from the vnode path, causing
   discrepancy in logic. But also there will be some non-trivial options
   post-processing added there - numeric RF will be replaced with a
   rack list. We should preserve it in the tablet path which alters
   the keyspace, otherwise it will fail when trying to construct
   network_topology_strategy.

2) Option merging happens on the flat version of the map, which won't
   work correctly with extended map which contains lists. We want
   the new list to replace the old list or numeric RF, not get its items
   merged. For example:

    We want:

      {'dc1': 3} + {'dc1': ['rack1', 'rack2']} = {'dc1': ['rack1', 'rack2']}

    If we merge flattened options, we would get incorrect flattened options:

      {'dc1': 3,
       'dc1:0', 'rack1'
       'dc1:1', 'rack2'}

3) We lose atomicity of update. Validation and merging which happens on the CQL
   coordinator is done in a different group0 transaction context than mutation
   generation inside topology coordinator later.

   Fixes https://github.com/scylladb/scylladb/issues/25269
2025-10-02 19:44:29 +02:00
Tomasz Grabiec
849ab5545f cql3: ks_prop_defs: Preserve old options
In 2d9b8f2, semantics of ALTER was changed for tablet-based keyspaces
which makes "replication" assignment act like +=, where replication
options are merged with the old options.

This merging is currently performed in the CQL statement level on
options map, before passing to topology coordinator. This will change
in later commit, so move merging here. Merging options of flattened
level will not be correct because it doesn't recognize nested
collections, like rack lists.

We want:

  {'dc1': 3} + {'dc1': ['rack1', 'rack2']} = {'dc1': ['rack1', 'rack2']}

If we merge flattened options, we would get incorrect flattened options:

  {'dc1': 3,
   'dc1:0', 'rack1'
   'dc1:1', 'rack2'}

Which cannot be parsed back into ks_prop_defs on the topology coordinator.

Refs https://github.com/scylladb/scylladb/pull/20208#issuecomment-3174728061
Refs #25549
2025-10-02 19:42:39 +02:00
Tomasz Grabiec
0d0c06da06 cql3: ks_prop_defs: Introduce flattened() 2025-10-02 19:42:39 +02:00
Tomasz Grabiec
6b7b0cb628 locator: Recognize rack list RF as valid in assert_rf_rack_valid_keyspace() 2025-10-02 19:42:39 +02:00
Tomasz Grabiec
e5b7452af2 tablet_allocator: Respect binding replicas to racks 2025-10-02 19:42:39 +02:00
Tomasz Grabiec
6de342ed3e locator: network_topology_strategy: Respect rack list when reallocating tablets 2025-10-02 19:42:39 +02:00
Tomasz Grabiec
8e9a58b89f cql3: ks_prop_defs: Fail with more information when options are not in expected format
Before, we would throw vague sstring_out_of_range from substr() when
the name doesn't have a nested key separate with ":", e.g "durable_writes"
instead of "durable_writes:durable_writes".
2025-10-02 19:42:39 +02:00
Tomasz Grabiec
66755db062 locator, cql3: Support rack lists in replication options
Allows per-DC replication factor to be either a string, holding a
numerical value, or a list of strings, holding a list of rack names.

The rack list is not respected yet by the tablet allocator, this is
achieved in subsequent commit.

This changes the format of options stored in the flattened map
in system_schema.keyspaces#replication. Values which are rack lists,
are converted into multiple entries, with the list index appended to
the key with ':' as the separator:

For example, this extended map:

   {
      'dc1': '3',
      'dc2': ['rack1', 'rack2']
   }

is stored as a flattened map:

  {
    'dc1': '3',
    'dc2:0': 'rack1',
    'dc2:1': 'rack2'
  }

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>
2025-10-02 19:42:39 +02:00
Botond Dénes
9310d3eff3 scylla-gdb.py: small-objects: don't cast free_object to void*
When walking the free-list of a pool or a span, the small-object code
casts the dereferenced `free_object*` to `void*`. This is unnecessary,
just use the `next` field of the `free_object` to look up the next free
object. I think this monkey business with `void*` was done to speed up
walking the free-list, but recently we've seen small-object --summarize
fail in CI, and it could be related.

Fixes: #25733

Closes scylladb/scylladb#26339
2025-10-02 13:36:49 +03:00
Botond Dénes
9d08a380db Merge 'Fix getendpoints command for compound keys containing ':'' from Taras Veretilnyk
Before, the `nodetool getendpoints` expected the key as one string separated by : (for example 1:val:ue). This caused errors if any part of the key had a colon because it was unclear whether a colon was a separator or part of the key.

This change adds a new API endpoint, `/storage_service/natural_endpoints/v2/{keyspace}`, which accepts composite partition keys as multiple key_component query parameters (e.g., ?key_component=1&key_component=val:ue). The `nodetool getendpoints` command was updated to support a new `--key-components` option, allowing users to pass key components as an array. The client and test infrastructure were extended to support multiple values for a query parameter, and tests were added to verify correct behavior with composite keys.

The previous method of passing partition keys as colon-separated strings is preserved for backward compatibility.

Backport is not required, since this change relies on recent Seastar updates

Fixes #16596

Closes scylladb/scylladb#26169

* github.com:scylladb/scylladb:
  docs: document --key-components option for getendpoints
  test/nodetool/test_getendpoints: add coverage for --key-components param in getendpoints
  nodetool: Introduce new option --key-components to specify compound partition keys as array
  rest_api/test_storage_service: add v2 natural_endpoints test for composite key with multiple components
  api/storage_service: add GET 'natural_endpoints' v2 to support composite keys with ':'
  rest_api_mock: support duplicate query parameters
  test/rest_api: support multiple query values per key in RestApiSession.send()
  nodetool: add support of new seastar query_parameters_type to scylla_rest_client
2025-10-02 09:04:40 +03:00
Aleksandra Martyniuk
0e73ce202e test: wait for cql in test_two_tablets_concurrent_repair_and_migration_repair_writer_level
In test_two_tablets_concurrent_repair_and_migration_repair_writer_level
safe_rolling_restart returns ready cql. However, get_all_tablet_replicas
uses the cql reference from manager that isn't ready. Wait for cql.

Fixes: #26328

Closes scylladb/scylladb#26349
2025-10-02 06:41:36 +03:00
Avi Kivity
7230a04799 dht, sstables: replace vector with chunked_vector when computing sstable shards
sstable::compute_shards_for_this_sstable() has a temporary of type
std::vector<dht::token_range> (aka dht::partition_range_vector), which
allocates a contiguous 300k when loading an sstable from disk. This
causes large allocation warnings (it doesn't really stress the allocator
since this typically happens during startup, but best to clear the warning
anyway).

Fix this by changing the container to by chunked_vector. It is passed
to dht::ring_position_range_vector_sharder, but since we're the only
user, we can change that class to accept the new type.

Fixes #24198.

Closes scylladb/scylladb#26353
2025-10-02 00:47:42 +02:00
Michał Jadwiszczak
d92628e3bd test/cluster/test_view_building_coordinator: skip reproducer instead of xfail
The reproducer for issue scylladb/scylladb#26244 takes some time
and since the test is failing, there is no point in wasting resources on
it.
We can change the xfail mark to skip.

Refs scylladb/scylladb#26244

Closes scylladb/scylladb#26350
2025-10-01 18:33:05 +02:00
Tomasz Grabiec
c5731221c0 cql3: Fail early on vnode/tablet flavor alter
Some tests expect this error. Later, prepare_options() will be changed
in a way which would fail to accept new options in such case before
vnode/tablet flavor change is detected, tripping the tests.
2025-10-01 16:06:52 +02:00
Tomasz Grabiec
11b4a1ab58 cql3: Extract convert_property_map() out of Cql.g
So that complex code is in a .cc file for better IDE assistance.
2025-10-01 16:06:52 +02:00
Tomasz Grabiec
b6df186e54 schema: Use definition from the header instead of open-coding it 2025-10-01 16:06:52 +02:00
Tomasz Grabiec
726548b835 locator: Abstract obtaining the number of replicas from replication_strategy_config_option
It will become more complex when options will contain rack lists.

It's a good change regardless, as it reduces duplication and makes
parsing uniform. We already diverged to use stoi / stol / stoul.

The change in create_keyspace_statement.cc to add a catch clause is
needed because get_replication_factor() now throws
configuration_exception on parsing errors instead of
std::invalid_argument, so the existing catch clause in the outer scope
is not effective. That loop is trying to interpret all options as RF
to run some validations. Not all options are RF, and those are
supposed to be ignored.
2025-10-01 16:06:52 +02:00
Tomasz Grabiec
91e51a5dd1 cql3, locator: Use type aliases for option maps
In preparation for changing their structure.

1) std::map<sstring, sstring> -> replication_strategy_config_options

  Parsed options. Values will become std::variant<sstring, rack_list>

2) std::map<sstring, sstring> -> property_definitions::map_type

  Flattened map of options, as stored system tables.
2025-10-01 16:06:51 +02:00
Tomasz Grabiec
3c31e148c5 locator: Add debug logging 2025-10-01 16:06:28 +02:00
Benny Halevy
da6e2fdb1b locator: Pass topology to replication strategy constructor 2025-10-01 16:06:28 +02:00
Benny Halevy
3965e29075 abstract_replication_strategy, network_topology_strategy: add replication_factor_data class
Prepare for supporting also list of rack names.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-10-01 16:06:27 +02:00
Taras Veretilnyk
6d8224b726 docs: document --key-components option for getendpoints 2025-10-01 15:53:25 +02:00
Taras Veretilnyk
6381c63d65 test/nodetool/test_getendpoints: add coverage for --key-components param in getendpoints
Adds a parameterized test to verify that multiple --key-components arguments
are handled correctly by nodetool's getendpoints command. Ensures the
constructed REST request includes all key_component values in the expected format.
2025-10-01 15:53:25 +02:00
Taras Veretilnyk
78888dd76c nodetool: Introduce new option --key-components to specify compound partition keys as array
Allows getendpoints to accept components of partition key using the --key-components option.
Key components are passed as an array and sent to the new /natural_endpoints/v2/{keyspace} endpoint.
2025-10-01 15:53:25 +02:00
Taras Veretilnyk
2456ebd7c2 rest_api/test_storage_service: add v2 natural_endpoints test for composite key with multiple components
Adds a test case for the `/storage_service/natural_endpoints/v2/{keyspace}` endpoint,
verifying that it correctly resolves natural endpoints for a composite partition key
passed as multiple `key_component` query parameters.
2025-10-01 15:53:25 +02:00
Taras Veretilnyk
89d474ba59 api/storage_service: add GET 'natural_endpoints' v2 to support composite keys with ':'
The original `/storage_service/natural_endpoints` endpoint uses colon-separated strings for composite keys,
which causes ambiguity when key components contained colons.

This commits adds a new `/storage_service/natural_endpoints/v2/{keyspace}` endpoint that accepts partition key components
via repeated `key_component` query parameters to avoid this issue.
2025-10-01 15:53:25 +02:00
Taras Veretilnyk
65ade28a9c rest_api_mock: support duplicate query parameters
Previously, only the last value of a repeated query parameter was captured,
which could cause inaccurate request matching in tests. This update ensures
that all values are preserved by storing duplicates as lists in the `params` dict.
2025-10-01 15:53:25 +02:00
Taras Veretilnyk
b60afeaa46 test/rest_api: support multiple query values per key in RestApiSession.send()
Previously, the send() method in RestApiSession only supported one value per query parameter key.
This patch updates it to support passing lists of values, allowing the same key to appear multiple
times in the query string (e.g. ?key=value1&key=value2).
2025-10-01 15:53:25 +02:00
Taras Veretilnyk
53883958d6 nodetool: add support of new seastar query_parameters_type to scylla_rest_client 2025-10-01 15:52:18 +02:00
Avi Kivity
15fa1c1c7e Merge 'sstables/trie: translate all key cells in one go, not lazily' from Michał Chojnowski
Applying lazy evaluation to the BTI encoding of clustering keys
was probably a bad default.
The possible benefits are dubious (because it's quite likely that the laziness
won't allow us to avoid that much work), but the overhead needed to
implement the laziness is large and immediate.

In this patch we get rid of the laziness.
We rewrite lazy_comparable_bytes_from_clustering_position
and lazy_comparable_bytes_from_ring_position
so that they performs the key translation eagerly,
all components to a single bytes_ostream in one synchronous call.

perf_bti_key_translation (microbenchmark added in this series, 1 iteration is 100 translations of a clustering key with 8 cells of int32_type):
```
Before:
test                            iterations      median         mad         min         max      allocs       tasks        inst      cycles
lcb_mismatch_test.lcb_mismatch        9233   109.930us     0.000ns   109.930us   109.930us    4356.000       0.000   2615394.3    614709.6

After:
test                            iterations      median         mad         min         max      allocs       tasks        inst      cycles
lcb_mismatch_test.lcb_mismatch       50952    19.487us     0.000ns    19.487us    19.487us     198.000       0.000    603120.1    109042.9
```

Enhancement, backport not required.

Closes scylladb/scylladb#26302

* github.com:scylladb/scylladb:
  sstables/trie: BTI-translate the entire partition key at once
  sstables/trie: avoid an unnecessary allocation of std::generator in last_block_offset()
  sstables/trie: perform the BTI-encoding of position_in_partition eagerly
  types/comparable_bytes: add comparable_bytes_from_compound
  test/perf: add perf_bti_key_translation
2025-10-01 14:59:06 +03:00
Dawid Mędrek
b409e85c20 view: Stop requiring experimental feature
We modify the requirements for using materialized views in tablet-based
keyspaces. Before, it was necessary to enable the configuration option
`rf_rack_valid_keyspaces`, having the cluster feature `VIEWS_WITH_TABLETS`
enabled, and using the experimental feature `views-with-tablets`.
We drop the last requirement.

We adjust code to that change and provide a new validation test.
We also update the user documentation to reflect the changes.

Fixes scylladb/scylladb#23030
2025-10-01 09:01:53 +02:00
Dawid Mędrek
288be6c82d db/view: Verify valid configuration for tablet-based views
Creating a materialized view or a secondary index in a tablet-based
keyspace requires that the user enabled two options:

* experimental feature `views-with-tablets`,
* configuration option `rf_rack_vaid_keyspaces`.

Because the latter has only become a necessity recently (in this series),
it's possible that there are already existing materialized views that
violate it.

We add a new check at start-up that iterates over existing views and
makes sure that that is not the case. Otherwise, Scylla notifies the user
of the problem.
2025-10-01 09:01:53 +02:00
Dawid Mędrek
00222070cd db/view: Require rf_rack_valid_keyspaces when creating view
We extend the requirements for being able to create materialized views
and secondary indexes in tablet-based keyspaces. It's now necessary to
enable the configuration option `rf_rack_valid_keyspaces`. This is
a stepping stone towards bringing materialized views and secondary
indexes with tablets out of the experimental phase.

We add a validation test to verify the changes.

Refs scylladb/scylladb#23030
2025-10-01 09:01:50 +02:00
Dawid Mędrek
71606ffdda test/cluster/random_failures: Skip creating secondary indexes
Materialized views are going to require the configuration option
`rf_rack_valid_keyspaces` when being created in tablet-based keyspaces.
Since random-failure tests still haven't been adjusted to work with it,
and because it's not trivial, we skip the cases when we end up creating
or dropping an index.
2025-10-01 09:01:38 +02:00
Dawid Mędrek
6322b5996d test/cluster/mv: Mark test_mv_rf_change as skipped
The test will not work with `rf_rack_valid_keyspaces`. Since the option
is going to become a requirement for using views with tablets, the test
will need to be rewritten to take that into consideration. Since that
adjustment doesn't seem trivial, we mark the test as skipped for the
time being.
2025-10-01 09:01:29 +02:00
Botond Dénes
bdca5600ef Merge 'Prevent stalls due to large tablet mutations' from Benny Halevy
Currently, replica::tablet_map_to_mutation generates a mutation having a row per tablet.
With enough tablets (10s of thousands) in the table we observe reactor stalls when freezing / unfreezing such large mutations, as seen in https://github.com/scylladb/scylladb/pull/18095#issuecomment-2029246954, and I assume we would see similar stalls also when converting those mutation into canonical_mutation and back, as they are similar to frozen_mutation, and bit more expensive since they also save the column mappings.

This series takes a different approach than allowing freeze to yield.
`tablet_map_to_mutation` is changed to `tablet_map_to_mutations`, able to generate multiple split mutations, that when squashed together are equivalent to the previously large mutation.  Those mutations are fed into a `process_mutation` callback function, provided by the caller, which may add those mutation to a vector for further processing, and/or process them inline by freezing or making a canonical mutation.

In addition, split the large mutations would also prevent hitting the commitlog maximum mutation size.

Closes scylladb/scylladb#18162

* github.com:scylladb/scylladb:
  schema_tables: convert_schema_to_mutations: simplify check for system keyspace
  tablets: read_tablet_mutations: use unfreeze_and_split_gently
  storage_service: merge_topology_snapshot: freeze snp.mutations gently
  mutation: async_utils: add unfreeze_and_split_gently
  mutation: add for_each_split_mutation
  tablets: tablet_map_to_mutations: maybe split tablets mutation
  tablets: tablet_map_to_mutations: accept process_func
  perf-tablets: change default tables and tablets-per-table
  perf-tablets: abort on unhandled exception
2025-10-01 07:04:09 +03:00
Ernest Zaslavsky
043d2dfb30 treewide: seastar module update
Seastar module update
```
9c07020a Merge 'http: Introduce retry strategy machinery for http client (take two)' from Ernest Zaslavsky
58404b81 http: check for abort at start of `make_request`
35a9e086 http: support per-call `retry_strategy` in `make_request`
96538b92 http: integrate `retry_strategy` into HTTP client
77c3ba14 http: initial implementation of `retry_strategy`
b9b9e7bf memory: Call finish_allocation() at the end of allocate_aligned()
2052c200 Merge 'file: coroutinize some functions' from Avi Kivity
7b65e50c file: reindent after coroutinization
837f64b5 file: coroutinize dma_read_impl()
9220607b file: coroutinize dma_read_exactly_impl()
d1414541 file: coroutinize set_lifetime_hint_impl()
94d8fd08 file: coroutinize get_lifetime_hint_impl()
392efff4 file: coroutinize maybe_read_eof()
e68a3173 file: do_dma_read_bulk: remove "rstate" local
14ac42cd file: coroutinize do_dma_read_bulk()
5446cbab net: Use future::then_unpack() helper to unpack tuples
9e88c4d8 posix-stack: Initialize unique_ptr-s with new result directly
51fb302e rpc: connection::process() use structured binding
be2c2b54 http: Explicitly deprecate request::content
```

Closes scylladb/scylladb#26342
2025-10-01 06:44:31 +03:00
Jenkins Promoter
f8c02a420d Update pgo profiles - aarch64 2025-10-01 05:32:35 +03:00
Jenkins Promoter
b45a57f65e Update pgo profiles - x86_64 2025-10-01 04:54:14 +03:00
Dawid Mędrek
994f09530f test/cluster: Adjust MV tests to RF-rack-validity
Some of the new tests covering materialized views explicitly disabled
the configuration option `rf_rack_valid_keyspaces`. It's going to become
a new requirement for views with tablets, so we adjust those tests and
enable the option. There is one exception, the test:

`cluster/mv/test_mv_topology_change.py::test_mv_rf_change`

We handle it separately in the following commit.
2025-09-30 20:01:25 +02:00
Luis Freitas
884c584faf Update ScyllaDB version to: 2026.1.0-dev 2025-09-30 18:54:09 +03:00
Benny Halevy
1ceb49f6c1 schema_tables: convert_schema_to_mutations: simplify check for system keyspace
Currently, the function unfreezes each schema mutation partition
and then checks if it's for a system keyspace.
This isn't really needed since we can check the partition key
using the frozen_mutation, skip it if the partition is for a system keyspace.

Note that the constructed partition_key just copies
the frozen partition_key_view, without copying or deserializing the
actual key contents.

Also, reserve `results` capacity using the queried
partitions' size to prevent reallocations of the results
vector.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:15:41 +03:00
Benny Halevy
b17a36c071 tablets: read_tablet_mutations: use unfreeze_and_split_gently
Split the tablets mutations by number of rows, based on
`min_tablets_in_mutation` (currently calibrated to 1024),
similar to the splitting done in
`storage_service::merge_topology_snapshot`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:15:41 +03:00
Benny Halevy
25b68fd211 storage_service: merge_topology_snapshot: freeze snp.mutations gently
We don't need to store all snp.mutations in a vector
and then freeze the whole vector.  They can be frozen
one at a time and collected into a vector, while
maybe yielding between each mutation to prevent
stalls.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:15:41 +03:00
Benny Halevy
fd38cfaf69 mutation: async_utils: add unfreeze_and_split_gently
Unfreeze the frozen_mutation, possibly splitting it
based on max_rows.  The process_mutation function
is called for each split mutation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:15:41 +03:00
Benny Halevy
faa0ee9844 mutation: add for_each_split_mutation
Allows processing of the split mutations one at a time.
This can reduce memory footprint as the caller
won't have to store a vector of the split mutations
and then convert it (e.g. freeze the mutations
or convert them to canonical mutations).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:15:41 +03:00
Benny Halevy
d21984d0cc tablets: tablet_map_to_mutations: maybe split tablets mutation
Split the generated tablets mutation if we run out of task
quota to prevent stalls, both when preparing the mutations
and later on when freezing/unfreezing them or converting
them to canonical_mutation and back.

Note that this will convert large mutation to long
vectors of mutations.  A followup change is considered
to convert std::vector:s of mutations to chunked_vector
to prevent large allocations.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:15:41 +03:00
Benny Halevy
aaddff5211 tablets: tablet_map_to_mutations: accept process_func
Prepare for generating several mutations for the
tablet_map by calling process_func for each generated mutation.

This allows the caller to directly freeze those mutations
one at a time into a vector of frozen mutations or simililarly
convert them into canonical mutations.

Next patch will split large tablet mutations to prevent stalls.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:15:38 +03:00
Avi Kivity
5d1846d783 dist: scylla_raid_setup: don't override XFS block size on modern kernels
In 6977064693 ("dist: scylla_raid_setup:
reduce xfs block size to 1k"), we reduced the XFS block size to 1k when
possible. This is because commitlog wants to write the smallest amount
of padding it can, and older Linux could only write a multiple of the
block size. Modern Linux [1] can O_DIRECT overwrite a range smaller than
a filesystem block.

However, this doesn't play well with some SSDs that have 512 byte
logical sector size and 4096 byte physical sector size - it causes them
to issue read-modify-writes.

To improve the situation, if we detect that the kernel is recent enough,
format the filesystem with its default block size, which should be optimal.

Note that commitlog will still issue sub-4k writes, which can translate
to RMW. There, we believe that the amplification is reduced since
sequential sub-physical-sector writes can be merged, and that the overhead
from commitlog space amplification is worse than the RMW overhead.

Tested on AWS i4i.large. fsqual report:

```
memory DMA alignment:    512
disk DMA alignment:      512
filesystem block size:   4096
context switch per write io (size-changing, append, blocksize 4096, iodepth 1): 0.0003 (GOOD)
context switch per write io (size-changing, append, blocksize 4096, iodepth 3): 0.7961 (BAD)
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0 (GOOD)
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0.0001 (GOOD)
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.125 (BAD)
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per write io (size-changing, append, blocksize 4096, iodepth 1): 0 (GOOD)
context switch per write io (size-changing, append, blocksize 4096, iodepth 3): 0.8006 (BAD)
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0.0001 (GOOD)
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0 (GOOD)
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.125 (BAD)
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per read io (size-changing, append, blocksize 512, iodepth 30): 0 (GOOD)
```

The sub-block overwrite cases are GOOD.

In comparison, the fsqual report for 1k (similar):

```
memory DMA alignment:    512
disk DMA alignment:      512
filesystem block size:   1024
context switch per write io (size-changing, append, blocksize 1024, iodepth 1): 0.0005 (GOOD)
context switch per write io (size-changing, append, blocksize 1024, iodepth 3): 0.7948 (BAD)
context switch per write io (size-unchanging, append, blocksize 1024, iodepth 3): 0.0015 (GOOD)
context switch per write io (size-unchanging, append, blocksize 1024, iodepth 7): 0.0022 (GOOD)
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.4999 (BAD)
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per write io (size-changing, append, blocksize 1024, iodepth 1): 0 (GOOD)
context switch per write io (size-changing, append, blocksize 1024, iodepth 3): 0.798 (BAD)
context switch per write io (size-unchanging, append, blocksize 1024, iodepth 3): 0.0012 (GOOD)
context switch per write io (size-unchanging, append, blocksize 1024, iodepth 7): 0.0019 (GOOD)
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.5 (BAD)
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per read io (size-changing, append, blocksize 512, iodepth 30): 0 (GOOD)
```

Fixes #25441.

[1] ed1128c2d0

Closes scylladb/scylladb#25445
2025-09-30 17:14:36 +03:00
Benny Halevy
3c07e0e877 perf-tablets: change default tables and tablets-per-table
tablets-per-table must be a power of 2, so round up 10000 to 16K.
also, reduce number of tables to have a total of about 100K
tablets, otherwise we hit the maximum commitlog mutation size
limit in save_tablet_metadata.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:07:06 +03:00
Benny Halevy
2c3fb341e9 perf-tablets: abort on unhandled exception
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:07:06 +03:00
Nadav Har'El
926089746b message: move RPC compression from utils/ to message/
The directory utils/ is supposed to contain general-purpose utility
classes and functions, which are either already used across the project,
or are designed to be used across the project.

This patch moves 8 files out of utils/:

    utils/advanced_rpc_compressor.hh
    utils/advanced_rpc_compressor.cc
    utils/advanced_rpc_compressor_protocol.hh
    utils/stream_compressor.hh
    utils/stream_compressor.cc
    utils/dict_trainer.cc
    utils/dict_trainer.hh
    utils/shared_dict.hh

These 8 files together implement the compression feature of RPC.
None of them are used by any other Scylla component (e.g., sstables have
a different compression), or are ready to be used by another component,
so this patch moves all of them into message/, where RPC is implemented.

Theoretically, we may want in the future to use this cluster of classes
for some other component, but even then, we shouldn't just have these
files individually in utils/ - these are not useful stand-alone
utilities. One cannot use "shared_dict.hh" assuming it is some sort of
general-purpose shared hash table or something - it is completely
specific to compression and zstd, and specifically to its use in those
other classes.

Beyond moving these 8 files, this patch also contains changes to:
1. Fix includes to the 5 moved header files (.hh).
2. Fix configure.py, utils/CMakeLists.txt and message/CMakeLists.txt
   for the three moved source files (.cc).
3. In the moved files, change from the "utils::" namespace, to the
   "netw::" namespace used by RPC. Also needed to change a bunch
   of callers for the new namespace. Also, had to add "utils::"
   explicitly in several places which previously assumed the
   current namespace is "utils::".

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25149
2025-09-30 17:03:09 +03:00
Pavel Emelyanov
269aaee1b4 Merge 'test: dtest: test_limits.py: migrate from dtest' from Dario Mirovic
This PR migrates limits tests from dtest to this repository.

One reason is that there is an ongoing effort to migrate tests from dtest to here.

Debug logs are enabled on `test_max_cells` for `lsa-timing` logger, to have more information about memory reclaim operation times and memory chunk sizes. This will allow analysis of their value distributions, which can be helpful with debugging if the issue reoccurs.

Also, scylladb keeps sql files with metrics which, with some modifications, can be used to track metrics over time for some tests. This would show if there are pauses and spikes or the test performance is more or less consistent over time.

scylla-dtest PR that removes migrated tests:
[limits_test.py: remove tests already ported to scylladb repo #6232](https://github.com/scylladb/scylla-dtest/pull/6232)

Fixes #25097

This is a migration of existing tests to this repository. No need for backport.

Closes scylladb/scylladb#26077

* github.com:scylladb/scylladb:
  test: dtest: limits_test.py: test_max_cells log level
  test: dtest: limits_test.py: make the tests work
  test: dtest: test_limits.py: remove test that are not being migrated
  test: dtest: copy unmodified limits_test.py
2025-09-30 16:57:32 +03:00
Piotr Dulikowski
5e5a3c7ec5 view_building_worker.cc: fix spelling (commiting -> committing)
The typo is reported by GitHub action on each PR, so let's fix it to
reduce the noise for everybody.

Closes scylladb/scylladb#26329
2025-09-30 16:47:03 +03:00
Emil Maskovsky
b0de054439 docs: fix typos and spelling errors
Corrected spelling mistakes, typos, and minor wording issues to improve
the developer documentation.

No backport: There is no functional change, and the doc is mostly
relevant to master, so it doesn't need to be backported.

Closes scylladb/scylladb#26332
2025-09-30 13:16:49 +02:00
Botond Dénes
fe73c90df9 tools/scylla-sstable: fix doc links
The doc links in scylla-sstable help output are static, so they always
point to the documentation of the latest stable release, not to the
documentation of the release the tool binary is from. On top of that,
the links point to old open-source documentation, which is now EOL.
Fix both problems: point link at the new source-available documentation
pages and make them version aware.
2025-09-29 17:34:37 +03:00
Botond Dénes
15a4a9936b release: adjust doc_link() for the post source-available world
There is no more separate enterprise product and the doc urls are
slightly different.
2025-09-29 17:02:55 +03:00
Botond Dénes
5a69838d06 tools/scylla-nodetool: remove trailing " from doc urls
They are accidental leftover from a previous way of storing command
descriptions.
2025-09-29 17:02:40 +03:00
Dawid Mędrek
d6fcd18540 test/boost/schema_loader_test.cc: Explicitly enable rf_rack_valid_keyspaces
The test cases in the file aren't run via an existing interface like
`do_with_cql_env`, but they rely on a more direct approach -- calling
one of the schema loader tools. Because of that, they manage the
`db::config` object on their own and don't enable the configuration
option `rf_rack_valid_keyspaces`.

That hasn't been a problem so far since the test doesn't attempt to
create RF-rack-invalid keyspaces anyway. However, in an upcoming commit,
we're going to further restrict views with tablets and require that the
option is enabled.

To prepare for that, we enable the option in all test cases. It's only
necessary in a small subset of them, but it won't hurt the enforce it
everywhere, so let's do that.

Refs scylladb/scylladb#23958
2025-09-29 13:07:08 +02:00
Dawid Mędrek
a1254fb6f3 db/view: Name requirement for views with tablets
We add a named requirement, a function, for materialized views with tablets.
It decides whether we can create views and secondary indexes in a given
keyspace. It's a stepping stone towards modifying the requirements for it.

This way, we keep the code in one place, so it's not possible to forget
to modify it somewhere. It also makes it more organized and concise.
2025-09-29 13:07:08 +02:00
Dario Mirovic
b3347bcf84 test: dtest: limits_test.py: test_max_cells log level
Set `lsa-timing` logger log level to `debug`. This will help with
the analysis of the whole spectrum of memory reclaim operation
times and memory sizes.

Refs #25097
2025-09-29 12:39:53 +02:00
Dario Mirovic
554fd5e801 test: dtest: limits_test.py: make the tests work
Remove unused imports and markers.
Remove Apache license header.

Enable the test in suite.yaml for `dev` and `debug` modes.

Refs #25097
2025-09-29 12:39:53 +02:00
Dario Mirovic
70128fd5c7 test: dtest: test_limits.py: remove test that are not being migrated
Refs #25097
2025-09-29 12:39:52 +02:00
Dario Mirovic
82e9623911 test: dtest: copy unmodified limits_test.py
Copy limits_test.py from scylla-dtest to test/cluster/dtest/limits_test.py.
Add license header.

Disable it for `debug`, `dev`, and `release` mode.

Refs #25097
2025-09-29 12:39:52 +02:00
Botond Dénes
2b4a140610 replica: move querier code to replica namespace
The query namespace is used for symbols which span the coordinator and
replica, or that are mostly coordinator side. The querier is mainly in
this namespace due to its similar name, but this is a mistake which
confuses people. Now that the code was moved to replica/, also fix the
namespace to be namespace replica.
2025-09-29 06:44:52 +03:00
Botond Dénes
ee3d2f5b43 root,replica: mv querier to replica/
The querier object is a confusing one. Based on its name it should be in
the query/ module and it is already in the query namespace. But this is
actually a completely replica-side logic, implementing the caching of
the readers on the replica. Move it to the replica module to make this
more clear.
2025-09-29 06:33:53 +03:00
Michał Chojnowski
ff5add4287 sstables/trie: BTI-translate the entire partition key at once
Delaying the BTI encoding of partition keys is a good idea,
because most of the time they don't have to be encoded.
Usually the token alone is enough for indexing purposes.

But for the translation of the `partition_key` part itself,
there's no good reason to make it lazy,
especially after we made the translation of clustering keys
eager in a previous commit. Let's get rid of the `std::generator`
and convert all cells of the partition key in one go.
2025-09-29 04:10:40 +02:00
Michał Chojnowski
4e35220734 sstables/trie: avoid an unnecessary allocation of std::generator in last_block_offset()
Using `std::generator` could incurs some unnecessary allocation
or confuse the optimizer. Let's replace it with something simpler.
2025-09-29 04:10:40 +02:00
Michał Chojnowski
88c9af3c80 sstables/trie: perform the BTI-encoding of position_in_partition eagerly
Applying lazy evaluation to the BTI encoding of clustering keys
was probably a bad default.
The benefits are dubious (because it's quite likely that the laziness
won't allow us to avoid that much work), but the overhead needed to
implement the laziness is large and immediate.

In this patch we get rid of the laziness.
We rewrite lazy_comparable_bytes_from_clustering_position
so that it performs the translation eagerly,
all components to a single bytes_ostream.

Note: the name *lazy*_comparable_bytes_from_clustering_position
stays, because the interface is still lazy.

perf_bti_key_translation:

Before:
test                            iterations      median         mad         min         max      allocs       tasks        inst      cycles
lcb_mismatch_test.lcb_mismatch        9233   109.930us     0.000ns   109.930us   109.930us    4356.000       0.000   2615394.3    614709.6

After:
test                            iterations      median         mad         min         max      allocs       tasks        inst      cycles
lcb_mismatch_test.lcb_mismatch       50952    19.487us     0.000ns    19.487us    19.487us     198.000       0.000    603120.1    109042.9
2025-09-29 04:10:40 +02:00
Michał Chojnowski
7d57643361 types/comparable_bytes: add comparable_bytes_from_compound
Add a function which converts compound types (keys and key prefixes)
to BTI encoding.

It's almost the same as the existing `lazy_comparable_bytes_from_compound`
(in bti_key_translation.cc), except it eagerly serializes key components
to a bytes_ostream instead of lazily yielding them from a generator.
We will remove `lazy_comparable_bytes_from_compound` in a later commit.
2025-09-29 04:10:38 +02:00
Michał Chojnowski
3703197c4c test/perf: add perf_bti_key_translation
Add a microbenchmark for translating keys to BTI encoding.
2025-09-29 04:08:00 +02:00
Michael Litvak
ad1a5b7e42 service/qos: set long timeout for auth queries on SL cache update
pass an appropriate query state for auth queries called from service
level cache reload. we use the function qos_query_state to select a
query_state based on caller context - for internal queries, we set a
very long timeout.

the service level cache reload is called from group0 reload. we want it
to have a long timeout instead of the default 5 seconds for auth
queries, because we don't have strict latency requirement on the one
hand, and on the other hand a timeout exception is undesired in the
group0 reload logic and can break group0 on the node.

Fixes scylladb/scylladb#25290
2025-09-25 16:55:29 +02:00
Michael Litvak
3c3dd4cf9d auth: add query_state parameter to query functions
add a query_state parameter to several auth functions that execute
internal queries. currently the queries use the
internal_distributed_query_state() query state, and we maintain this as
default, but we want also to be able to pass a query state from the
caller.

in particular, the auth queries currently use a timeout of 5 seconds,
and we will want to set a different timeout when executed in some
different context.
2025-09-25 16:46:50 +02:00
Michael Litvak
a1161c156f auth: refactor query_all_directly_granted
rewrite query_all_directly_granted to use execute_internal instead of
query_internal in a style that is more consistent with the rest of the
module.

This will also be useful for a later change because execute_internal
accepts an additional parameter of query_state.
2025-09-25 16:37:04 +02:00
Anna Stuchlik
b18b052d26 doc: remove n1-highmem instances from Recommended Instances 2025-09-22 12:40:36 +02:00
4756 changed files with 79213 additions and 30561 deletions

9
.github/CODEOWNERS vendored
View File

@@ -1,5 +1,5 @@
# AUTH
auth/* @nuivall @ptrsmrn
auth/* @nuivall
# CACHE
row_cache* @tgrabiec
@@ -25,11 +25,11 @@ compaction/* @raphaelsc
transport/*
# CQL QUERY LANGUAGE
cql3/* @tgrabiec @nuivall @ptrsmrn
cql3/* @tgrabiec @nuivall
# COUNTERS
counters* @nuivall @ptrsmrn
tests/counter_test* @nuivall @ptrsmrn
counters* @nuivall
tests/counter_test* @nuivall
# DOCS
docs/* @annastuchlik @tzach
@@ -57,7 +57,6 @@ repair/* @tgrabiec @asias
# SCHEMA MANAGEMENT
db/schema_tables* @tgrabiec
db/legacy_schema_migrator* @tgrabiec
service/migration* @tgrabiec
schema* @tgrabiec

97
.github/copilot-instructions.md vendored Normal file
View File

@@ -0,0 +1,97 @@
# ScyllaDB Development Instructions
## Project Context
High-performance distributed NoSQL database. Core values: performance, correctness, readability.
## Build System
### Modern Build (configure.py + ninja)
```bash
# Configure (run once per mode, or when switching modes)
./configure.py --mode=<mode> # mode: dev, debug, release, sanitize
# Build everything
ninja <mode>-build # e.g., ninja dev-build
# Build Scylla binary only (sufficient for Python integration tests)
ninja build/<mode>/scylla
# Build specific test
ninja build/<mode>/test/boost/<test_name>
```
## Running Tests
### C++ Unit Tests
```bash
# Run all tests in a file
./test.py --mode=<mode> test/<suite>/<test_name>.cc
# Run a single test case from a file
./test.py --mode=<mode> test/<suite>/<test_name>.cc::<test_case_name>
# Examples
./test.py --mode=dev test/boost/memtable_test.cc
./test.py --mode=dev test/raft/raft_server_test.cc::test_check_abort_on_client_api
```
**Important:**
- Use full path with `.cc` extension (e.g., `test/boost/test_name.cc`, not `boost/test_name`)
- To run a single test case, append `::<test_case_name>` to the file path
- If you encounter permission issues with cgroup metric gathering, add `--no-gather-metrics` flag
**Rebuilding Tests:**
- test.py does NOT automatically rebuild when test source files are modified
- Many tests are part of composite binaries (e.g., `combined_tests` in test/boost contains multiple test files)
- To find which binary contains a test, check `configure.py` in the repository root (primary source) or `test/<suite>/CMakeLists.txt`
- To rebuild a specific test binary: `ninja build/<mode>/test/<suite>/<binary_name>`
- Examples:
- `ninja build/dev/test/boost/combined_tests` (contains group0_voter_calculator_test.cc and others)
- `ninja build/dev/test/raft/replication_test` (standalone Raft test)
### Python Integration Tests
```bash
# Only requires Scylla binary (full build usually not needed)
ninja build/<mode>/scylla
# Run all tests in a file
./test.py --mode=<mode> <test_path>
# Run a single test case from a file
./test.py --mode=<mode> <test_path>::<test_function_name>
# Examples
./test.py --mode=dev alternator/
./test.py --mode=dev cluster/test_raft_voters::test_raft_limited_voters_retain_coordinator
# Optional flags
./test.py --mode=dev cluster/test_raft_no_quorum -v # Verbose output
./test.py --mode=dev cluster/test_raft_no_quorum --repeat 5 # Repeat test 5 times
```
**Important:**
- Use path without `.py` extension (e.g., `cluster/test_raft_no_quorum`, not `cluster/test_raft_no_quorum.py`)
- To run a single test case, append `::<test_function_name>` to the file path
- Add `-v` for verbose output
- Add `--repeat <num>` to repeat a test multiple times
- After modifying C++ source files, only rebuild the Scylla binary for Python tests - building the entire repository is unnecessary
## Code Philosophy
- Performance matters in hot paths (data read/write, inner loops)
- Self-documenting code through clear naming
- Comments explain "why", not "what"
- Prefer standard library over custom implementations
- Strive for simplicity and clarity, add complexity only when clearly justified
- Question requests: don't blindly implement requests - evaluate trade-offs, identify issues, and suggest better alternatives when appropriate
- Consider different approaches, weigh pros and cons, and recommend the best fit for the specific context
## Test Philosophy
- Performance matters. Tests should run as quickly as possible. Sleeps in the code are highly discouraged and should be avoided, to reduce run time and flakiness.
- Stability matters. Tests should be stable. New tests should be executed 100 times at least to ensure they pass 100 out of 100 times. (use --repeat 100 --max-failures 1 when running it)
- Unit tests should ideally test one thing and one thing only.
- Tests for bug fixes should run before the fix - and show the failure and after the fix - and show they now pass.
- Tests for bug fixes should have in their comments which bug fixes (GitHub or JIRA issue) they test.
- Tests in debug are always slower, so if needed, reduce number of iterations, rows, data used, cycles, etc. in debug mode.
- Tests should strive to be repeatable, and not use random input that will make their results unpredictable.
- Tests should consume as little resources as possible. Prefer running tests on a single node if it is sufficient, for example.

115
.github/instructions/cpp.instructions.md vendored Normal file
View File

@@ -0,0 +1,115 @@
---
applyTo: "**/*.{cc,hh}"
---
# C++ Guidelines
**Important:** Always match the style and conventions of existing code in the file and directory.
## Memory Management
- Prefer stack allocation whenever possible
- Use `std::unique_ptr` by default for dynamic allocations
- `new`/`delete` are forbidden (use RAII)
- Use `seastar::lw_shared_ptr` or `seastar::shared_ptr` for shared ownership within same shard
- Use `seastar::foreign_ptr` for cross-shard sharing
- Avoid `std::shared_ptr` except when interfacing with external C++ APIs
- Avoid raw pointers except for non-owning references or C API interop
## Seastar Asynchronous Programming
- Use `seastar::future<T>` for all async operations
- Prefer coroutines (`co_await`, `co_return`) over `.then()` chains for readability
- Coroutines are preferred over `seastar::do_with()` for managing temporary state
- In hot paths where futures are ready, continuations may be more efficient than coroutines
- Chain futures with `.then()`, don't block with `.get()` (unless in `seastar::thread` context)
- All I/O must be asynchronous (no blocking calls)
- Use `seastar::gate` for shutdown coordination
- Use `seastar::semaphore` for resource limiting (not `std::mutex`)
- Break long loops with `maybe_yield()` to avoid reactor stalls
## Coroutines
```cpp
seastar::future<T> func() {
auto result = co_await async_operation();
co_return result;
}
```
## Error Handling
- Throw exceptions for errors (futures propagate them automatically)
- In data path: avoid exceptions, use `std::expected` (or `boost::outcome`) instead
- Use standard exceptions (`std::runtime_error`, `std::invalid_argument`)
- Database-specific: throw appropriate schema/query exceptions
## Performance
- Pass large objects by `const&` or `&&` (move semantics)
- Use `std::string_view` for non-owning string references
- Avoid copies: prefer move semantics
- Use `utils::chunked_vector` instead of `std::vector` for large allocations (>128KB)
- Minimize dynamic allocations in hot paths
## Database-Specific Types
- Use `schema_ptr` for schema references
- Use `mutation` and `mutation_partition` for data modifications
- Use `partition_key` and `clustering_key` for keys
- Use `api::timestamp_type` for database timestamps
- Use `gc_clock` for garbage collection timing
## Style
- C++23 standard (prefer modern features, especially coroutines)
- Use `auto` when type is obvious from RHS
- Avoid `auto` when it obscures the type
- Use range-based for loops: `for (const auto& item : container)`
- Use standard algorithms when they clearly simplify code (e.g., replacing 10-line loops)
- Avoid chaining multiple algorithms if a straightforward loop is clearer
- Mark functions and variables `const` whenever possible
- Use scoped enums: `enum class` (not unscoped `enum`)
## Headers
- Use `#pragma once`
- Include order: own header, C++ std, Seastar, Boost, project headers
- Forward declare when possible
- Never `using namespace` in headers (exception: `using namespace seastar` is globally available via `seastarx.hh`)
## Documentation
- Public APIs require clear documentation
- Implementation details should be self-evident from code
- Use `///` or Doxygen `/** */` for public documentation, `//` for implementation notes - follow the existing style
## Naming
- `snake_case` for most identifiers (classes, functions, variables, namespaces)
- Template parameters: `CamelCase` (e.g., `template<typename ValueType>`)
- Member variables: prefix with `_` (e.g., `int _count;`)
- Structs (value-only): no `_` prefix on members
- Constants and `constexpr`: `snake_case` (e.g., `static constexpr int max_size = 100;`)
- Files: `.hh` for headers, `.cc` for source
## Formatting
- 4 spaces indentation, never tabs
- Opening braces on same line as control structure (except namespaces)
- Space after keywords: `if (`, `while (`, `return `
- Whitespace around operators matches precedence: `*a + *b` not `* a+* b`
- Line length: keep reasonable (<160 chars), use continuation lines with double indent if needed
- Brace all nested scopes, even single statements
- Minimal patches: only format code you modify, never reformat entire files
## Logging
- Use structured logging with appropriate levels: DEBUG, INFO, WARN, ERROR
- Include context in log messages (e.g., request IDs)
- Never log sensitive data (credentials, PII)
## Forbidden
- `malloc`/`free`
- `printf` family (use logging or fmt)
- Raw pointers for ownership
- `using namespace` in headers
- Blocking operations: `std::sleep`, `std::read`, `std::mutex` (use Seastar equivalents)
- `std::atomic` (reserved for very special circumstances only)
- Macros (use `inline`, `constexpr`, or templates instead)
## Testing
When modifying existing code, follow TDD: create/update test first, then implement.
- Examine existing tests for style and structure
- Use Boost.Test framework
- Use `SEASTAR_THREAD_TEST_CASE` for Seastar asynchronous tests
- Aim for high code coverage, especially for new features and bug fixes
- Maintain bisectability: all tests must pass in every commit. Mark failing tests with `BOOST_FAIL()` or similar, then fix in subsequent commit

View File

@@ -0,0 +1,51 @@
---
applyTo: "**/*.py"
---
# Python Guidelines
**Important:** Match existing code style. Some directories (like `test/cqlpy` and `test/alternator`) prefer simplicity over type hints and docstrings.
## Style
- Follow PEP 8
- Use type hints for function signatures (unless directory style omits them)
- Use f-strings for formatting
- Line length: 160 characters max
- 4 spaces for indentation
## Imports
Order: standard library, third-party, local imports
```python
import os
import sys
import pytest
from cassandra.cluster import Cluster
from test.utils import setup_keyspace
```
Never use `from module import *`
## Documentation
All public functions/classes need docstrings (unless the current directory conventions omit them):
```python
def my_function(arg1: str, arg2: int) -> bool:
"""
Brief summary of function purpose.
Args:
arg1: Description of first argument.
arg2: Description of second argument.
Returns:
Description of return value.
"""
pass
```
## Testing Best Practices
- Maintain bisectability: all tests must pass in every commit
- Mark currently-failing tests with `@pytest.mark.xfail`, unmark when fixed
- Use descriptive names that convey intent
- Docstrings/comments should explain what the test verifies and why, and if it reproduces a specific issue or how it fits into the larger test suite

View File

@@ -62,7 +62,7 @@ def create_pull_request(repo, new_branch_name, base_branch_name, pr, backport_pr
if is_draft:
labels_to_add.append("conflicts")
pr_comment = f"@{pr.user.login} - This PR was marked as draft because it has conflicts\n"
pr_comment += "Please resolve them and mark this PR as ready for review"
pr_comment += "Please resolve them and remove the 'conflicts' label. The PR will be made ready for review automatically."
backport_pr.create_issue_comment(pr_comment)
# Apply all labels at once if we have any
@@ -142,20 +142,31 @@ def backport(repo, pr, version, commits, backport_base_branch, is_collaborator):
def with_github_keyword_prefix(repo, pr):
pattern = rf"(?:fix(?:|es|ed))\s*:?\s*(?:(?:(?:{repo.full_name})?#)|https://github\.com/{repo.full_name}/issues/)(\d+)"
match = re.findall(pattern, pr.body, re.IGNORECASE)
if not match:
for commit in pr.get_commits():
match = re.findall(pattern, commit.commit.message, re.IGNORECASE)
if match:
print(f'{pr.number} has a valid close reference in commit message {commit.sha}')
break
if not match:
print(f'No valid close reference for {pr.number}')
return False
else:
# GitHub issue pattern: #123, scylladb/scylladb#123, or full GitHub URLs
github_pattern = rf"(?:fix(?:|es|ed))\s*:?\s*(?:(?:(?:{repo.full_name})?#)|https://github\.com/{repo.full_name}/issues/)(\d+)"
# JIRA issue pattern: PKG-92 or https://scylladb.atlassian.net/browse/PKG-92
jira_pattern = r"(?:fix(?:|es|ed))\s*:?\s*(?:(?:https://scylladb\.atlassian\.net/browse/)?([A-Z]+-\d+))"
# Check PR body for GitHub issues
github_match = re.findall(github_pattern, pr.body, re.IGNORECASE)
# Check PR body for JIRA issues
jira_match = re.findall(jira_pattern, pr.body, re.IGNORECASE)
match = github_match or jira_match
if match:
return True
for commit in pr.get_commits():
github_match = re.findall(github_pattern, commit.commit.message, re.IGNORECASE)
jira_match = re.findall(jira_pattern, commit.commit.message, re.IGNORECASE)
if github_match or jira_match:
print(f'{pr.number} has a valid close reference in commit message {commit.sha}')
return True
print(f'No valid close reference for {pr.number}')
return False
def main():
args = parse_args()

View File

@@ -18,7 +18,7 @@ jobs:
// Regular expression pattern to check for "Fixes" prefix
// Adjusted to dynamically insert the repository full name
const pattern = `Fixes:? (?:#|${repo.replace('/', '\\/')}#|https://github\\.com/${repo.replace('/', '\\/')}/issues/)(\\d+)`;
const pattern = `Fixes:? ((?:#|${repo.replace('/', '\\/')}#|https://github\\.com/${repo.replace('/', '\\/')}/issues/)(\\d+)|(?:https://scylladb\\.atlassian\\.net/browse/)?([A-Z]+-\\d+))`;
const regex = new RegExp(pattern);
if (!regex.test(body)) {

View File

@@ -0,0 +1,53 @@
name: Backport with Jira Integration
on:
push:
branches:
- master
- next-*.*
- branch-*.*
pull_request_target:
types: [labeled, closed]
branches:
- master
- next
- next-*.*
- branch-*.*
jobs:
backport-on-push:
if: github.event_name == 'push'
uses: scylladb/github-automation/.github/workflows/backport-with-jira.yaml@main
with:
event_type: 'push'
base_branch: ${{ github.ref }}
commits: ${{ github.event.before }}..${{ github.sha }}
secrets:
gh_token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
backport-on-label:
if: github.event_name == 'pull_request_target' && github.event.action == 'labeled'
uses: scylladb/github-automation/.github/workflows/backport-with-jira.yaml@main
with:
event_type: 'labeled'
base_branch: refs/heads/${{ github.event.pull_request.base.ref }}
pull_request_number: ${{ github.event.pull_request.number }}
head_commit: ${{ github.event.pull_request.base.sha }}
label_name: ${{ github.event.label.name }}
pr_state: ${{ github.event.pull_request.state }}
secrets:
gh_token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
backport-chain:
if: github.event_name == 'pull_request_target' && github.event.action == 'closed' && github.event.pull_request.merged == true
uses: scylladb/github-automation/.github/workflows/backport-with-jira.yaml@main
with:
event_type: 'chain'
base_branch: refs/heads/${{ github.event.pull_request.base.ref }}
pull_request_number: ${{ github.event.pull_request.number }}
pr_body: ${{ github.event.pull_request.body }}
secrets:
gh_token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

View File

@@ -1,12 +0,0 @@
name: Call Jira Status In Progress
on:
pull_request_target:
types: [opened]
jobs:
call-jira-status-in-progress:
uses: scylladb/github-automation/.github/workflows/main_update_jira_status_to_in_progress.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

View File

@@ -1,12 +0,0 @@
name: Call Jira Status In Review
on:
pull_request_target:
types: [ready_for_review, review_requested]
jobs:
call-jira-status-in-review:
uses: scylladb/github-automation/.github/workflows/main_update_jira_status_to_in_review.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

View File

@@ -1,12 +0,0 @@
name: Call Jira Status Ready For Merge
on:
pull_request_target:
types: [labeled]
jobs:
call-jira-status-update:
uses: scylladb/github-automation/.github/workflows/main_update_jira_status_to_ready_for_merge.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

41
.github/workflows/call_jira_sync.yml vendored Normal file
View File

@@ -0,0 +1,41 @@
name: Sync Jira Based on PR Events
on:
pull_request_target:
types: [opened, ready_for_review, review_requested, labeled, unlabeled, closed]
permissions:
contents: read
pull-requests: write
issues: write
jobs:
jira-sync-pr-opened:
if: github.event.action == 'opened'
uses: scylladb/github-automation/.github/workflows/main_jira_sync_pr_opened.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
jira-sync-in-review:
if: github.event.action == 'ready_for_review' || github.event.action == 'review_requested'
uses: scylladb/github-automation/.github/workflows/main_jira_sync_in_review.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
jira-sync-add-label:
if: github.event.action == 'labeled'
uses: scylladb/github-automation/.github/workflows/main_jira_sync_add_label.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
jira-status-remove-label:
if: github.event.action == 'unlabeled'
uses: scylladb/github-automation/.github/workflows/main_jira_sync_remove_label.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
jira-status-pr-closed:
if: github.event.action == 'closed'
uses: scylladb/github-automation/.github/workflows/main_jira_sync_pr_closed.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

View File

@@ -0,0 +1,22 @@
name: Sync Jira Based on PR Milestone Events
on:
pull_request_target:
types: [milestoned, demilestoned]
permissions:
contents: read
pull-requests: read
jobs:
jira-sync-milestone-set:
if: github.event.action == 'milestoned'
uses: scylladb/github-automation/.github/workflows/main_jira_sync_pr_milestone_set.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
jira-sync-milestone-removed:
if: github.event.action == 'demilestoned'
uses: scylladb/github-automation/.github/workflows/main_jira_sync_pr_milestone_removed.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

View File

@@ -0,0 +1,14 @@
name: Call Jira release creation for new milestone
on:
milestone:
types: [created]
jobs:
sync-milestone-to-jira:
uses: scylladb/github-automation/.github/workflows/main_sync_milestone_to_jira_release.yml@main
with:
# Comma-separated list of Jira project keys
jira_project_keys: "SCYLLADB,CUSTOMER,SMI"
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

View File

@@ -0,0 +1,13 @@
name: validate_pr_author_email
on:
pull_request_target:
types:
- opened
- synchronize
- reopened
jobs:
validate_pr_author_email:
uses: scylladb/github-automation/.github/workflows/validate_pr_author_email.yml@main

View File

@@ -0,0 +1,62 @@
name: Close issues created by Scylla associates
on:
issues:
types: [opened, reopened]
permissions:
issues: write
jobs:
comment-and-close:
runs-on: ubuntu-latest
steps:
- name: Comment and close if author email is scylladb.com
uses: actions/github-script@v7
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
script: |
const issue = context.payload.issue;
const actor = context.actor;
// Get user data (only public email is available)
const { data: user } = await github.rest.users.getByUsername({
username: actor,
});
const email = user.email || "";
console.log(`Actor: ${actor}, public email: ${email || "<none>"}`);
// Only continue if email exists and ends with @scylladb.com
if (!email || !email.toLowerCase().endsWith("@scylladb.com")) {
console.log("User is not a scylladb.com email (or email not public); skipping.");
return;
}
const owner = context.repo.owner;
const repo = context.repo.repo;
const issue_number = issue.number;
const body = "Issues in this repository are closed automatically. Scylla associates should use Jira to manage issues.\nPlease move this issue to Jira https://scylladb.atlassian.net/jira/software/c/projects/SCYLLADB/list";
// Add the comment
await github.rest.issues.createComment({
owner,
repo,
issue_number,
body,
});
console.log(`Comment added to #${issue_number}`);
// Close the issue
await github.rest.issues.update({
owner,
repo,
issue_number,
state: "closed",
state_reason: "not_planned"
});
console.log(`Issue #${issue_number} closed.`);

View File

@@ -13,5 +13,5 @@ jobs:
- uses: codespell-project/actions-codespell@master
with:
only_warn: 1
ignore_words_list: "ans,datas,fo,ser,ue,crate,nd,reenable,strat,stap,te,raison"
ignore_words_list: "ans,datas,fo,ser,ue,crate,nd,reenable,strat,stap,te,raison,iif,tread"
skip: "./.git,./build,./tools,*.js,*.lock,./test,./licenses,./redis/lolwut.cc,*.svg"

View File

@@ -18,6 +18,8 @@ on:
jobs:
release:
permissions:
contents: write
runs-on: ubuntu-latest
steps:
- name: Checkout

View File

@@ -2,6 +2,9 @@ name: "Docs / Build PR"
# For more information,
# see https://sphinx-theme.scylladb.com/stable/deployment/production.html#available-workflows
permissions:
contents: read
env:
FLAG: ${{ github.repository == 'scylladb/scylla-enterprise' && 'enterprise' || 'opensource' }}

View File

@@ -0,0 +1,37 @@
name: Docs / Validate metrics
permissions:
contents: read
on:
pull_request:
branches:
- master
- enterprise
paths:
- '**/*.cc'
- 'scripts/metrics-config.yml'
- 'scripts/get_description.py'
- 'docs/_ext/scylladb_metrics.py'
jobs:
validate-metrics:
runs-on: ubuntu-latest
name: Check metrics documentation coverage
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
submodules: true
- name: Set up Python
uses: actions/setup-python@v6
with:
python-version: '3.10'
- name: Install dependencies
run: pip install PyYAML
- name: Validate metrics
run: python3 scripts/get_description.py --validate -c scripts/metrics-config.yml

View File

@@ -14,7 +14,8 @@ env:
CLEANER_DIRS: test/unit exceptions alternator api auth cdc compaction db dht gms index lang message mutation mutation_writer node_ops raft redis replica service
SEASTAR_BAD_INCLUDE_OUTPUT_PATH: build/seastar-bad-include.log
permissions: {}
permissions:
contents: read
# cancel the in-progress run upon a repush
concurrency:
@@ -34,8 +35,6 @@ jobs:
- uses: actions/checkout@v4
with:
submodules: true
- run: |
sudo dnf -y install clang-tools-extra
- name: Generate compilation database
run: |
cmake \

View File

@@ -10,6 +10,8 @@ on:
jobs:
read-toolchain:
runs-on: ubuntu-latest
permissions:
contents: read
outputs:
image: ${{ steps.read.outputs.image }}
steps:

View File

@@ -3,19 +3,58 @@ name: Trigger Scylla CI Route
on:
issue_comment:
types: [created]
pull_request_target:
types:
- unlabeled
jobs:
trigger-jenkins:
if: github.event.comment.user.login != 'scylladbbot' && contains(github.event.comment.body, '@scylladbbot') && contains(github.event.comment.body, 'trigger-ci')
if: (github.event_name == 'issue_comment' && github.event.comment.user.login != 'scylladbbot') || github.event.label.name == 'conflicts'
runs-on: ubuntu-latest
steps:
- name: Verify Org Membership
id: verify_author
shell: bash
run: |
if [[ "${{ github.event_name }}" == "pull_request_target" ]]; then
AUTHOR="${{ github.event.pull_request.user.login }}"
ASSOCIATION="${{ github.event.pull_request.author_association }}"
else
AUTHOR="${{ github.event.comment.user.login }}"
ASSOCIATION="${{ github.event.comment.author_association }}"
fi
if [[ "$ASSOCIATION" == "MEMBER" || "$ASSOCIATION" == "OWNER" ]]; then
echo "member=true" >> $GITHUB_OUTPUT
else
echo "::warning::${AUTHOR} is not a member of scylladb (association: ${ASSOCIATION}); skipping CI trigger."
echo "member=false" >> $GITHUB_OUTPUT
fi
- name: Validate Comment Trigger
if: github.event_name == 'issue_comment'
id: verify_comment
shell: bash
run: |
BODY=$(cat << 'EOF'
${{ github.event.comment.body }}
EOF
)
CLEAN_BODY=$(echo "$BODY" | grep -v '^[[:space:]]*>')
if echo "$CLEAN_BODY" | grep -qi '@scylladbbot' && echo "$CLEAN_BODY" | grep -qi 'trigger-ci'; then
echo "trigger=true" >> $GITHUB_OUTPUT
else
echo "trigger=false" >> $GITHUB_OUTPUT
fi
- name: Trigger Scylla-CI-Route Jenkins Job
if: steps.verify_author.outputs.member == 'true' && (github.event_name == 'pull_request_target' || steps.verify_comment.outputs.trigger == 'true')
env:
JENKINS_USER: ${{ secrets.JENKINS_USERNAME }}
JENKINS_API_TOKEN: ${{ secrets.JENKINS_TOKEN }}
JENKINS_URL: "https://jenkins.scylladb.com"
PR_NUMBER: "${{ github.event.issue.number || github.event.pull_request.number }}"
PR_REPO_NAME: "${{ github.event.repository.full_name }}"
run: |
PR_NUMBER=${{ github.event.issue.number }}
PR_REPO_NAME=${{ github.event.repository.full_name }}
curl -X POST "$JENKINS_URL/job/releng/job/Scylla-CI-Route/buildWithParameters?PR_NUMBER=$PR_NUMBER&PR_REPO_NAME=$PR_REPO_NAME" \
--user "$JENKINS_USER:$JENKINS_API_TOKEN" --fail -i -v
--user "$JENKINS_USER:$JENKINS_API_TOKEN" --fail

242
.github/workflows/trigger_ci.yaml vendored Normal file
View File

@@ -0,0 +1,242 @@
name: Trigger next gating
on:
pull_request_target:
types: [opened, reopened, synchronize]
issue_comment:
types: [created]
jobs:
trigger-ci:
runs-on: ubuntu-latest
steps:
- name: Dump GitHub context
env:
GITHUB_CONTEXT: ${{ toJson(github) }}
run: echo "$GITHUB_CONTEXT"
- name: Checkout PR code
uses: actions/checkout@v3
with:
fetch-depth: 0 # Needed to access full history
ref: ${{ github.event.pull_request.head.ref }}
- name: Fetch before commit if needed
run: |
if ! git cat-file -e ${{ github.event.before }} 2>/dev/null; then
echo "Fetching before commit ${{ github.event.before }}"
git fetch --depth=1 origin ${{ github.event.before }}
fi
- name: Compare commits for file changes
if: github.action == 'synchronize'
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
echo "Base: ${{ github.event.before }}"
echo "Head: ${{ github.event.after }}"
TREE_BEFORE=$(git show -s --format=%T ${{ github.event.before }})
TREE_AFTER=$(git show -s --format=%T ${{ github.event.after }})
echo "TREE_BEFORE=$TREE_BEFORE" >> $GITHUB_ENV
echo "TREE_AFTER=$TREE_AFTER" >> $GITHUB_ENV
- name: Check if last push has file changes
run: |
if [[ "${{ env.TREE_BEFORE }}" == "${{ env.TREE_AFTER }}" ]]; then
echo "No file changes detected in the last push, only commit message edit."
echo "has_file_changes=false" >> $GITHUB_ENV
else
echo "File changes detected in the last push."
echo "has_file_changes=true" >> $GITHUB_ENV
fi
- name: Rule 1 - Check PR draft or conflict status
run: |
# Check if PR is in draft mode
IS_DRAFT="${{ github.event.pull_request.draft }}"
# Check if PR has 'conflict' label
HAS_CONFLICT_LABEL="false"
LABELS='${{ toJson(github.event.pull_request.labels) }}'
if echo "$LABELS" | jq -r '.[].name' | grep -q "^conflict$"; then
HAS_CONFLICT_LABEL="true"
fi
# Set draft_or_conflict variable
if [[ "$IS_DRAFT" == "true" || "$HAS_CONFLICT_LABEL" == "true" ]]; then
echo "draft_or_conflict=true" >> $GITHUB_ENV
echo "✅ Rule 1: PR is in draft mode or has conflict label - setting draft_or_conflict=true"
else
echo "draft_or_conflict=false" >> $GITHUB_ENV
echo "✅ Rule 1: PR is ready and has no conflict label - setting draft_or_conflict=false"
fi
echo "Draft status: $IS_DRAFT"
echo "Has conflict label: $HAS_CONFLICT_LABEL"
echo "Result: draft_or_conflict = $draft_or_conflict"
- name: Rule 2 - Check labels
run: |
# Check if PR has P0 or P1 labels
HAS_P0_P1_LABEL="false"
LABELS='${{ toJson(github.event.pull_request.labels) }}'
if echo "$LABELS" | jq -r '.[].name' | grep -E "^(P0|P1)$" > /dev/null; then
HAS_P0_P1_LABEL="true"
fi
# Check if PR already has force_on_cloud label
echo "HAS_FORCE_ON_CLOUD_LABEL=false" >> $GITHUB_ENV
if echo "$LABELS" | jq -r '.[].name' | grep -q "^force_on_cloud$"; then
HAS_FORCE_ON_CLOUD_LABEL="true"
echo "HAS_FORCE_ON_CLOUD_LABEL=true" >> $GITHUB_ENV
fi
echo "Has P0/P1 label: $HAS_P0_P1_LABEL"
echo "Has force_on_cloud label: $HAS_FORCE_ON_CLOUD_LABEL"
# Add force_on_cloud label if PR has P0/P1 and doesn't already have force_on_cloud
if [[ "$HAS_P0_P1_LABEL" == "true" && "$HAS_FORCE_ON_CLOUD_LABEL" == "false" ]]; then
echo "✅ Rule 2: PR has P0 or P1 label - adding force_on_cloud label"
curl -X POST \
-H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \
-H "Accept: application/vnd.github.v3+json" \
"https://api.github.com/repos/${{ github.repository }}/issues/${{ github.event.pull_request.number }}/labels" \
-d '{"labels":["force_on_cloud"]}'
elif [[ "$HAS_P0_P1_LABEL" == "true" && "$HAS_FORCE_ON_CLOUD_LABEL" == "true" ]]; then
echo "✅ Rule 2: PR has P0 or P1 label and already has force_on_cloud label - no action needed"
else
echo "✅ Rule 2: PR does not have P0 or P1 label - no force_on_cloud label needed"
fi
SKIP_UNIT_TEST_CUSTOM="false"
if echo "$LABELS" | jq -r '.[].name' | grep -q "^ci/skip_unit-tests_custom$"; then
SKIP_UNIT_TEST_CUSTOM="true"
fi
echo "SKIP_UNIT_TEST_CUSTOM=$SKIP_UNIT_TEST_CUSTOM" >> $GITHUB_ENV
- name: Rule 3 - Analyze changed files and set build requirements
run: |
# Get list of changed files
CHANGED_FILES=$(git diff --name-only ${{ github.event.pull_request.base.sha }} ${{ github.event.pull_request.head.sha }})
echo "Changed files:"
echo "$CHANGED_FILES"
echo ""
# Initialize all requirements to false
REQUIRE_BUILD="false"
REQUIRE_DTEST="false"
REQUIRE_UNITTEST="false"
REQUIRE_ARTIFACTS="false"
REQUIRE_SCYLLA_GDB="false"
# Check each file against patterns
while IFS= read -r file; do
if [[ -n "$file" ]]; then
echo "Checking file: $file"
# Build pattern: ^(?!scripts\/pull_github_pr.sh).*$
# Everything except scripts/pull_github_pr.sh
if [[ "$file" != "scripts/pull_github_pr.sh" ]]; then
REQUIRE_BUILD="true"
echo " ✓ Matches build pattern"
fi
# Dtest pattern: ^(?!test(.py|\/)|dist\/docker\/|dist\/common\/scripts\/).*$
# Everything except test files, dist/docker/, dist/common/scripts/
if [[ ! "$file" =~ ^test\.(py|/).*$ ]] && [[ ! "$file" =~ ^dist/docker/.*$ ]] && [[ ! "$file" =~ ^dist/common/scripts/.*$ ]]; then
REQUIRE_DTEST="true"
echo " ✓ Matches dtest pattern"
fi
# Unittest pattern: ^(?!dist\/docker\/|dist\/common\/scripts).*$
# Everything except dist/docker/, dist/common/scripts/
if [[ ! "$file" =~ ^dist/docker/.*$ ]] && [[ ! "$file" =~ ^dist/common/scripts.*$ ]]; then
REQUIRE_UNITTEST="true"
echo " ✓ Matches unittest pattern"
fi
# Artifacts pattern: ^(?:dist|tools\/toolchain).*$
# Files starting with dist or tools/toolchain
if [[ "$file" =~ ^dist.*$ ]] || [[ "$file" =~ ^tools/toolchain.*$ ]]; then
REQUIRE_ARTIFACTS="true"
echo " ✓ Matches artifacts pattern"
fi
# Scylla GDB pattern: ^(scylla-gdb.py).*$
# Files starting with scylla-gdb.py
if [[ "$file" =~ ^scylla-gdb\.py.*$ ]]; then
REQUIRE_SCYLLA_GDB="true"
echo " ✓ Matches scylla_gdb pattern"
fi
fi
done <<< "$CHANGED_FILES"
# Set environment variables
echo "requireBuild=$REQUIRE_BUILD" >> $GITHUB_ENV
echo "requireDtest=$REQUIRE_DTEST" >> $GITHUB_ENV
echo "requireUnittest=$REQUIRE_UNITTEST" >> $GITHUB_ENV
echo "requireArtifacts=$REQUIRE_ARTIFACTS" >> $GITHUB_ENV
echo "requireScyllaGdb=$REQUIRE_SCYLLA_GDB" >> $GITHUB_ENV
echo ""
echo "✅ Rule 3: File analysis complete"
echo "Build required: $REQUIRE_BUILD"
echo "Dtest required: $REQUIRE_DTEST"
echo "Unittest required: $REQUIRE_UNITTEST"
echo "Artifacts required: $REQUIRE_ARTIFACTS"
echo "Scylla GDB required: $REQUIRE_SCYLLA_GDB"
- name: Determine Jenkins Job Name
run: |
if [[ "${{ github.ref_name }}" == "next" ]]; then
FOLDER_NAME="scylla-master"
elif [[ "${{ github.ref_name }}" == "next-enterprise" ]]; then
FOLDER_NAME="scylla-enterprise"
else
VERSION=$(echo "${{ github.ref_name }}" | awk -F'-' '{print $2}')
if [[ "$VERSION" =~ ^202[0-4]\.[0-9]+$ ]]; then
FOLDER_NAME="enterprise-$VERSION"
elif [[ "$VERSION" =~ ^[0-9]+\.[0-9]+$ ]]; then
FOLDER_NAME="scylla-$VERSION"
fi
fi
echo "JOB_NAME=${FOLDER_NAME}/job/scylla-ci" >> $GITHUB_ENV
- name: Trigger Jenkins Job
if: env.draft_or_conflict == 'false' && env.has_file_changes == 'true' && github.action == 'opened' || github.action == 'reopened'
env:
JENKINS_USER: ${{ secrets.JENKINS_USERNAME }}
JENKINS_API_TOKEN: ${{ secrets.JENKINS_TOKEN }}
JENKINS_URL: "https://jenkins.scylladb.com"
SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
run: |
PR_NUMBER=${{ github.event.issue.number }}
PR_REPO_NAME=${{ github.event.repository.full_name }}
echo "Triggering Jenkins Job: $JOB_NAME"
curl -X POST \
"$JENKINS_URL/job/$JOB_NAME/buildWithParameters? \
PR_NUMBER=$PR_NUMBER& \
RUN_DTEST=$REQUIRE_DTEST& \
RUN_ONLY_SCYLLA_GDB=$REQUIRE_SCYLLA_GDB& \
RUN_UNIT_TEST=$REQUIRE_UNITTEST& \
FORCE_ON_CLOUD=$HAS_FORCE_ON_CLOUD_LABEL& \
SKIP_UNIT_TEST_CUSTOM=$SKIP_UNIT_TEST_CUSTOM& \
RUN_ARTIFACT_TESTS=$REQUIRE_ARTIFACTS" \
--fail \
--user "$JENKINS_USER:$JENKINS_API_TOKEN" \
-i -v
trigger-ci-via-comment:
if: github.event.comment.user.login != 'scylladbbot' && contains(github.event.comment.body, '@scylladbbot') && contains(github.event.comment.body, 'trigger-ci')
runs-on: ubuntu-latest
steps:
- name: Trigger Scylla-CI Jenkins Job
env:
JENKINS_USER: ${{ secrets.JENKINS_USERNAME }}
JENKINS_API_TOKEN: ${{ secrets.JENKINS_TOKEN }}
JENKINS_URL: "https://jenkins.scylladb.com"
run: |
PR_NUMBER=${{ github.event.issue.number }}
PR_REPO_NAME=${{ github.event.repository.full_name }}
curl -X POST "$JENKINS_URL/job/$JOB_NAME/buildWithParameters?PR_NUMBER=$PR_NUMBER" \
--user "$JENKINS_USER:$JENKINS_API_TOKEN" --fail -i -v

View File

@@ -49,7 +49,7 @@ include(limit_jobs)
set(CMAKE_CXX_STANDARD "23" CACHE INTERNAL "")
set(CMAKE_CXX_EXTENSIONS ON CACHE INTERNAL "")
set(CMAKE_CXX_SCAN_FOR_MODULES OFF CACHE INTERNAL "")
set(CMAKE_CXX_VISIBILITY_PRESET hidden)
set(CMAKE_VISIBILITY_INLINES_HIDDEN ON)
if(is_multi_config)
find_package(Seastar)
@@ -90,13 +90,13 @@ if(is_multi_config)
add_dependencies(Seastar::seastar_testing Seastar)
else()
set(Seastar_TESTING ON CACHE BOOL "" FORCE)
set(Seastar_API_LEVEL 8 CACHE STRING "" FORCE)
set(Seastar_API_LEVEL 9 CACHE STRING "" FORCE)
set(Seastar_DEPRECATED_OSTREAM_FORMATTERS OFF CACHE BOOL "" FORCE)
set(Seastar_APPS ON CACHE BOOL "" FORCE)
set(Seastar_EXCLUDE_APPS_FROM_ALL ON CACHE BOOL "" FORCE)
set(Seastar_EXCLUDE_TESTS_FROM_ALL ON CACHE BOOL "" FORCE)
set(Seastar_IO_URING ON CACHE BOOL "" FORCE)
set(Seastar_SCHEDULING_GROUPS_COUNT 20 CACHE STRING "" FORCE)
set(Seastar_SCHEDULING_GROUPS_COUNT 21 CACHE STRING "" FORCE)
set(Seastar_UNUSED_RESULT_ERROR ON CACHE BOOL "" FORCE)
add_subdirectory(seastar)
target_compile_definitions (seastar
@@ -116,6 +116,7 @@ list(APPEND absl_cxx_flags
if(CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
list(APPEND ABSL_GCC_FLAGS ${absl_cxx_flags})
elseif(CMAKE_CXX_COMPILER_ID STREQUAL "Clang")
list(APPEND absl_cxx_flags "-Wno-deprecated-builtins")
list(APPEND ABSL_LLVM_FLAGS ${absl_cxx_flags})
endif()
set(ABSL_DEFAULT_LINKOPTS
@@ -163,7 +164,45 @@ file(MAKE_DIRECTORY "${scylla_gen_build_dir}")
include(add_version_library)
generate_scylla_version()
option(Scylla_USE_PRECOMPILED_HEADER "Use precompiled header for Scylla" ON)
add_library(scylla-precompiled-header STATIC exported_templates.cc)
target_link_libraries(scylla-precompiled-header PRIVATE
absl::headers
absl::btree
absl::hash
absl::raw_hash_set
Seastar::seastar
Snappy::snappy
systemd
ZLIB::ZLIB
lz4::lz4_static
zstd::zstd_static)
if (Scylla_USE_PRECOMPILED_HEADER)
set(Scylla_USE_PRECOMPILED_HEADER_USE ON)
find_program(DISTCC_EXEC NAMES distcc OPTIONAL)
if (DISTCC_EXEC)
if(DEFINED ENV{DISTCC_HOSTS})
set(Scylla_USE_PRECOMPILED_HEADER_USE OFF)
message(STATUS "Disabling precompiled header usage because distcc exists and DISTCC_HOSTS is set, assuming you're using distributed compilation.")
else()
file(REAL_PATH "~/.distcc/hosts" DIST_CC_HOSTS_PATH EXPAND_TILDE)
if (EXISTS ${DIST_CC_HOSTS_PATH})
set(Scylla_USE_PRECOMPILED_HEADER_USE OFF)
message(STATUS "Disabling precompiled header usage because distcc and ~/.distcc/hosts exists, assuming you're using distributed compilation.")
endif()
endif()
endif()
if (Scylla_USE_PRECOMPILED_HEADER_USE)
message(STATUS "Using precompiled header for Scylla - remember to add `sloppiness = pch_defines,time_macros` to ccache.conf, if you're using ccache.")
target_precompile_headers(scylla-precompiled-header PRIVATE "stdafx.hh")
target_compile_definitions(scylla-precompiled-header PRIVATE SCYLLA_USE_PRECOMPILED_HEADER)
endif()
else()
set(Scylla_USE_PRECOMPILED_HEADER_USE OFF)
endif()
add_library(scylla-main STATIC)
target_sources(scylla-main
PRIVATE
absl-flat_hash_map.cc
@@ -178,7 +217,6 @@ target_sources(scylla-main
mutation_query.cc
node_ops/task_manager_module.cc
partition_slice_builder.cc
querier.cc
query/query.cc
query_ranges_to_vnodes.cc
query/query-result-set.cc
@@ -209,6 +247,7 @@ target_link_libraries(scylla-main
ZLIB::ZLIB
lz4::lz4_static
zstd::zstd_static
scylla-precompiled-header
)
option(Scylla_CHECK_HEADERS
@@ -261,7 +300,6 @@ add_subdirectory(locator)
add_subdirectory(message)
add_subdirectory(mutation)
add_subdirectory(mutation_writer)
add_subdirectory(node_ops)
add_subdirectory(readers)
add_subdirectory(replica)
add_subdirectory(raft)

View File

@@ -12,7 +12,7 @@ Please use the [issue tracker](https://github.com/scylladb/scylla/issues/) to re
## Contributing code to Scylla
Before you can contribute code to Scylla for the first time, you should sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send the signed form cla@scylladb.com. You can then submit your changes as patches to the [scylladb-dev mailing list](https://groups.google.com/forum/#!forum/scylladb-dev) or as a pull request to the [Scylla project on github](https://github.com/scylladb/scylla).
Before you can contribute code to Scylla for the first time, you should sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send the signed form to cla@scylladb.com. You can then submit your changes as patches to the [scylladb-dev mailing list](https://groups.google.com/forum/#!forum/scylladb-dev) or as a pull request to the [Scylla project on github](https://github.com/scylladb/scylla).
If you need help formatting or sending patches, [check out these instructions](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches).
The Scylla C++ source code uses the [Seastar coding style](https://github.com/scylladb/seastar/blob/master/coding-style.md) so please adhere to that in your patches. Note that Scylla code is written with `using namespace seastar`, so should not explicitly add the `seastar::` prefix to Seastar symbols. You will usually not need to add `using namespace seastar` to new source files, because most Scylla header files have `#include "seastarx.hh"`, which does this.

View File

@@ -43,7 +43,7 @@ $ ./tools/toolchain/dbuild ninja build/release/scylla
$ ./tools/toolchain/dbuild ./build/release/scylla --developer-mode 1
```
Note: do not mix environemtns - either perform all your work with dbuild, or natively on the host.
Note: do not mix environments - either perform all your work with dbuild, or natively on the host.
Note2: you can get to an interactive shell within dbuild by running it without any parameters:
```bash
$ ./tools/toolchain/dbuild
@@ -91,7 +91,7 @@ You can also specify a single mode. For example
$ ninja-build release
```
Will build everytihng in release mode. The valid modes are
Will build everything in release mode. The valid modes are
* Debug: Enables [AddressSanitizer](https://github.com/google/sanitizers/wiki/AddressSanitizer)
and other sanity checks. It has no optimizations, which allows for debugging with tools like
@@ -361,7 +361,7 @@ avoid that the gold linker can be told to create an index with
More info at https://gcc.gnu.org/wiki/DebugFission.
Both options can be enable by passing `--split-dwarf` to configure.py.
Both options can be enabled by passing `--split-dwarf` to configure.py.
Note that distcc is *not* compatible with it, but icecream
(https://github.com/icecc/icecream) is.
@@ -370,7 +370,7 @@ Note that distcc is *not* compatible with it, but icecream
Sometimes Scylla development is closely tied with a feature being developed in Seastar. It can be useful to compile Scylla with a particular check-out of Seastar.
One way to do this it to create a local remote for the Seastar submodule in the Scylla repository:
One way to do this is to create a local remote for the Seastar submodule in the Scylla repository:
```bash
$ cd $HOME/src/scylla

View File

@@ -18,7 +18,7 @@ Scylla is fairly fussy about its build environment, requiring very recent
versions of the C++23 compiler and of many libraries to build. The document
[HACKING.md](HACKING.md) includes detailed information on building and
developing Scylla, but to get Scylla building quickly on (almost) any build
machine, Scylla offers a [frozen toolchain](tools/toolchain/README.md),
machine, Scylla offers a [frozen toolchain](tools/toolchain/README.md).
This is a pre-configured Docker image which includes recent versions of all
the required compilers, libraries and build tools. Using the frozen toolchain
allows you to avoid changing anything in your build machine to meet Scylla's
@@ -43,7 +43,7 @@ For further information, please see:
[developer documentation]: HACKING.md
[build documentation]: docs/dev/building.md
[docker image build documentation]: dist/docker/debian/README.md
[docker image build documentation]: dist/docker/redhat/README.md
## Running Scylla

View File

@@ -78,7 +78,7 @@ fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=2025.4.0-dev
VERSION=2026.2.0-dev
if test -f version
then

View File

@@ -18,6 +18,7 @@ target_sources(alternator
consumed_capacity.cc
ttl.cc
parsed_expression_cache.cc
http_compression.cc
${cql_grammar_srcs})
target_include_directories(alternator
PUBLIC
@@ -34,5 +35,8 @@ target_link_libraries(alternator
idl
absl::headers)
if (Scylla_USE_PRECOMPILED_HEADER_USE)
target_precompile_headers(alternator REUSE_FROM scylla-precompiled-header)
endif()
check_headers(check-headers alternator
GLOB_RECURSE ${CMAKE_CURRENT_SOURCE_DIR}/*.hh)

View File

@@ -11,7 +11,6 @@
#include "utils/log.hh"
#include <string>
#include <string_view>
#include "bytes.hh"
#include "alternator/auth.hh"
#include <fmt/format.h>
#include "auth/password_authenticator.hh"

View File

@@ -42,7 +42,7 @@ comparison_operator_type get_comparison_operator(const rjson::value& comparison_
if (!comparison_operator.IsString()) {
throw api_error::validation(fmt::format("Invalid comparison operator definition {}", rjson::print(comparison_operator)));
}
std::string op = comparison_operator.GetString();
std::string op = rjson::to_string(comparison_operator);
auto it = ops.find(op);
if (it == ops.end()) {
throw api_error::validation(fmt::format("Unsupported comparison operator {}", op));
@@ -377,8 +377,8 @@ bool check_compare(const rjson::value* v1, const rjson::value& v2, const Compara
return cmp(unwrap_number(*v1, cmp.diagnostic), unwrap_number(v2, cmp.diagnostic));
}
if (kv1.name == "S") {
return cmp(std::string_view(kv1.value.GetString(), kv1.value.GetStringLength()),
std::string_view(kv2.value.GetString(), kv2.value.GetStringLength()));
return cmp(rjson::to_string_view(kv1.value),
rjson::to_string_view(kv2.value));
}
if (kv1.name == "B") {
auto d_kv1 = unwrap_bytes(kv1.value, v1_from_query);
@@ -470,9 +470,9 @@ static bool check_BETWEEN(const rjson::value* v, const rjson::value& lb, const r
return check_BETWEEN(unwrap_number(*v, diag), unwrap_number(lb, diag), unwrap_number(ub, diag), bounds_from_query);
}
if (kv_v.name == "S") {
return check_BETWEEN(std::string_view(kv_v.value.GetString(), kv_v.value.GetStringLength()),
std::string_view(kv_lb.value.GetString(), kv_lb.value.GetStringLength()),
std::string_view(kv_ub.value.GetString(), kv_ub.value.GetStringLength()),
return check_BETWEEN(rjson::to_string_view(kv_v.value),
rjson::to_string_view(kv_lb.value),
rjson::to_string_view(kv_ub.value),
bounds_from_query);
}
if (kv_v.name == "B") {
@@ -618,7 +618,7 @@ conditional_operator_type get_conditional_operator(const rjson::value& req) {
// Check if the existing values of the item (previous_item) match the
// conditions given by the Expected and ConditionalOperator parameters
// (if they exist) in the request (an UpdateItem, PutItem or DeleteItem).
// This function can throw an ValidationException API error if there
// This function can throw a ValidationException API error if there
// are errors in the format of the condition itself.
bool verify_expected(const rjson::value& req, const rjson::value* previous_item) {
const rjson::value* expected = rjson::find(req, "Expected");

View File

@@ -8,6 +8,8 @@
#include "consumed_capacity.hh"
#include "error.hh"
#include "utils/rjson.hh"
#include <fmt/format.h>
namespace alternator {
@@ -32,18 +34,18 @@ bool consumed_capacity_counter::should_add_capacity(const rjson::value& request)
if (!return_consumed->IsString()) {
throw api_error::validation("Non-string ReturnConsumedCapacity field in request");
}
std::string consumed = return_consumed->GetString();
std::string_view consumed = rjson::to_string_view(*return_consumed);
if (consumed == "INDEXES") {
throw api_error::validation("INDEXES consumed capacity is not supported");
}
if (consumed != "TOTAL") {
throw api_error::validation("Unknown consumed capacity "+ consumed);
throw api_error::validation(fmt::format("Unknown consumed capacity {}", consumed));
}
return true;
}
void consumed_capacity_counter::add_consumed_capacity_to_response_if_needed(rjson::value& response) const noexcept {
if (_should_add_to_reponse) {
if (_should_add_to_response) {
auto consumption = rjson::empty_object();
rjson::add(consumption, "CapacityUnits", get_consumed_capacity_units());
rjson::add(response, "ConsumedCapacity", std::move(consumption));

View File

@@ -28,9 +28,9 @@ namespace alternator {
class consumed_capacity_counter {
public:
consumed_capacity_counter() = default;
consumed_capacity_counter(bool should_add_to_reponse) : _should_add_to_reponse(should_add_to_reponse){}
consumed_capacity_counter(bool should_add_to_response) : _should_add_to_response(should_add_to_response){}
bool operator()() const noexcept {
return _should_add_to_reponse;
return _should_add_to_response;
}
consumed_capacity_counter& operator +=(uint64_t bytes);
@@ -44,7 +44,7 @@ public:
uint64_t _total_bytes = 0;
static bool should_add_capacity(const rjson::value& request);
protected:
bool _should_add_to_reponse = false;
bool _should_add_to_response = false;
};
class rcu_consumed_capacity_counter : public consumed_capacity_counter {

View File

@@ -28,6 +28,7 @@ static logging::logger logger("alternator_controller");
controller::controller(
sharded<gms::gossiper>& gossiper,
sharded<service::storage_proxy>& proxy,
sharded<service::storage_service>& ss,
sharded<service::migration_manager>& mm,
sharded<db::system_distributed_keyspace>& sys_dist_ks,
sharded<cdc::generation_service>& cdc_gen_svc,
@@ -39,6 +40,7 @@ controller::controller(
: protocol_server(sg)
, _gossiper(gossiper)
, _proxy(proxy)
, _ss(ss)
, _mm(mm)
, _sys_dist_ks(sys_dist_ks)
, _cdc_gen_svc(cdc_gen_svc)
@@ -89,7 +91,7 @@ future<> controller::start_server() {
auto get_timeout_in_ms = [] (const db::config& cfg) -> utils::updateable_value<uint32_t> {
return cfg.alternator_timeout_in_ms;
};
_executor.start(std::ref(_gossiper), std::ref(_proxy), std::ref(_mm), std::ref(_sys_dist_ks),
_executor.start(std::ref(_gossiper), std::ref(_proxy), std::ref(_ss), std::ref(_mm), std::ref(_sys_dist_ks),
sharded_parameter(get_cdc_metadata, std::ref(_cdc_gen_svc)), _ssg.value(),
sharded_parameter(get_timeout_in_ms, std::ref(_config))).get();
_server.start(std::ref(_executor), std::ref(_proxy), std::ref(_gossiper), std::ref(_auth_service), std::ref(_sl_controller)).get();
@@ -103,11 +105,23 @@ future<> controller::start_server() {
alternator_port = _config.alternator_port();
_listen_addresses.push_back({addr, *alternator_port});
}
std::optional<uint16_t> alternator_port_proxy_protocol;
if (_config.alternator_port_proxy_protocol()) {
alternator_port_proxy_protocol = _config.alternator_port_proxy_protocol();
_listen_addresses.push_back({addr, *alternator_port_proxy_protocol});
}
std::optional<uint16_t> alternator_https_port;
std::optional<uint16_t> alternator_https_port_proxy_protocol;
std::optional<tls::credentials_builder> creds;
if (_config.alternator_https_port()) {
alternator_https_port = _config.alternator_https_port();
_listen_addresses.push_back({addr, *alternator_https_port});
if (_config.alternator_https_port() || _config.alternator_https_port_proxy_protocol()) {
if (_config.alternator_https_port()) {
alternator_https_port = _config.alternator_https_port();
_listen_addresses.push_back({addr, *alternator_https_port});
}
if (_config.alternator_https_port_proxy_protocol()) {
alternator_https_port_proxy_protocol = _config.alternator_https_port_proxy_protocol();
_listen_addresses.push_back({addr, *alternator_https_port_proxy_protocol});
}
creds.emplace();
auto opts = _config.alternator_encryption_options();
if (opts.empty()) {
@@ -133,18 +147,29 @@ future<> controller::start_server() {
}
}
_server.invoke_on_all(
[this, addr, alternator_port, alternator_https_port, creds = std::move(creds)] (server& server) mutable {
return server.init(addr, alternator_port, alternator_https_port, creds,
[this, addr, alternator_port, alternator_https_port, alternator_port_proxy_protocol, alternator_https_port_proxy_protocol, creds = std::move(creds)] (server& server) mutable {
return server.init(addr, alternator_port, alternator_https_port, alternator_port_proxy_protocol, alternator_https_port_proxy_protocol, creds,
_config.alternator_enforce_authorization,
_config.alternator_warn_authorization,
_config.alternator_max_users_query_size_in_trace_output,
&_memory_limiter.local().get_semaphore(),
_config.max_concurrent_requests_per_shard);
}).handle_exception([this, addr, alternator_port, alternator_https_port] (std::exception_ptr ep) {
logger.error("Failed to set up Alternator HTTP server on {} port {}, TLS port {}: {}",
addr, alternator_port ? std::to_string(*alternator_port) : "OFF", alternator_https_port ? std::to_string(*alternator_https_port) : "OFF", ep);
}).handle_exception([this, addr, alternator_port, alternator_https_port, alternator_port_proxy_protocol, alternator_https_port_proxy_protocol] (std::exception_ptr ep) {
logger.error("Failed to set up Alternator HTTP server on {} port {}, TLS port {}, proxy-protocol port {}, TLS proxy-protocol port {}: {}",
addr,
alternator_port ? std::to_string(*alternator_port) : "OFF",
alternator_https_port ? std::to_string(*alternator_https_port) : "OFF",
alternator_port_proxy_protocol ? std::to_string(*alternator_port_proxy_protocol) : "OFF",
alternator_https_port_proxy_protocol ? std::to_string(*alternator_https_port_proxy_protocol) : "OFF",
ep);
return stop_server().then([ep = std::move(ep)] { return make_exception_future<>(ep); });
}).then([addr, alternator_port, alternator_https_port] {
logger.info("Alternator server listening on {}, HTTP port {}, HTTPS port {}",
addr, alternator_port ? std::to_string(*alternator_port) : "OFF", alternator_https_port ? std::to_string(*alternator_https_port) : "OFF");
}).then([addr, alternator_port, alternator_https_port, alternator_port_proxy_protocol, alternator_https_port_proxy_protocol] {
logger.info("Alternator server listening on {}, HTTP port {}, HTTPS port {}, proxy-protocol port {}, TLS proxy-protocol port {}",
addr,
alternator_port ? std::to_string(*alternator_port) : "OFF",
alternator_https_port ? std::to_string(*alternator_https_port) : "OFF",
alternator_port_proxy_protocol ? std::to_string(*alternator_port_proxy_protocol) : "OFF",
alternator_https_port_proxy_protocol ? std::to_string(*alternator_https_port_proxy_protocol) : "OFF");
}).get();
});
}
@@ -167,7 +192,7 @@ future<> controller::request_stop_server() {
});
}
future<utils::chunked_vector<client_data>> controller::get_client_data() {
future<utils::chunked_vector<foreign_ptr<std::unique_ptr<client_data>>>> controller::get_client_data() {
return _server.local().get_client_data();
}

View File

@@ -15,6 +15,7 @@
namespace service {
class storage_proxy;
class storage_service;
class migration_manager;
class memory_limiter;
}
@@ -57,6 +58,7 @@ class server;
class controller : public protocol_server {
sharded<gms::gossiper>& _gossiper;
sharded<service::storage_proxy>& _proxy;
sharded<service::storage_service>& _ss;
sharded<service::migration_manager>& _mm;
sharded<db::system_distributed_keyspace>& _sys_dist_ks;
sharded<cdc::generation_service>& _cdc_gen_svc;
@@ -74,6 +76,7 @@ public:
controller(
sharded<gms::gossiper>& gossiper,
sharded<service::storage_proxy>& proxy,
sharded<service::storage_service>& ss,
sharded<service::migration_manager>& mm,
sharded<db::system_distributed_keyspace>& sys_dist_ks,
sharded<cdc::generation_service>& cdc_gen_svc,
@@ -93,7 +96,7 @@ public:
// This virtual function is called (on each shard separately) when the
// virtual table "system.clients" is read. It is expected to generate a
// list of clients connected to this server (on this shard).
virtual future<utils::chunked_vector<client_data>> get_client_data() override;
virtual future<utils::chunked_vector<foreign_ptr<std::unique_ptr<client_data>>>> get_client_data() override;
};
}

View File

@@ -94,6 +94,9 @@ public:
static api_error internal(std::string msg) {
return api_error("InternalServerError", std::move(msg), http::reply::status_type::internal_server_error);
}
static api_error payload_too_large(std::string msg) {
return api_error("PayloadTooLarge", std::move(msg), status_type::payload_too_large);
}
// Provide the "std::exception" interface, to make it easier to print this
// exception in log messages. Note that this function is *not* used to

File diff suppressed because it is too large Load Diff

View File

@@ -17,11 +17,13 @@
#include "service/client_state.hh"
#include "service_permit.hh"
#include "db/timeout_clock.hh"
#include "db/config.hh"
#include "alternator/error.hh"
#include "stats.hh"
#include "utils/rjson.hh"
#include "utils/updateable_value.hh"
#include "utils/simple_value_with_expiry.hh"
#include "tracing/trace_state.hh"
@@ -40,6 +42,8 @@ namespace cql3::selection {
namespace service {
class storage_proxy;
class cas_shard;
class storage_service;
}
namespace cdc {
@@ -56,7 +60,9 @@ class schema_builder;
namespace alternator {
enum class table_status;
class rmw_operation;
class put_or_delete_item;
schema_ptr get_table(service::storage_proxy& proxy, const rjson::value& request);
bool is_alternator_keyspace(const sstring& ks_name);
@@ -134,17 +140,24 @@ class expression_cache;
class executor : public peering_sharded_service<executor> {
gms::gossiper& _gossiper;
service::storage_service& _ss;
service::storage_proxy& _proxy;
service::migration_manager& _mm;
db::system_distributed_keyspace& _sdks;
cdc::metadata& _cdc_metadata;
utils::updateable_value<bool> _enforce_authorization;
utils::updateable_value<bool> _warn_authorization;
// An smp_service_group to be used for limiting the concurrency when
// forwarding Alternator request between shards - if necessary for LWT.
smp_service_group _ssg;
std::unique_ptr<parsed::expression_cache> _parsed_expression_cache;
struct describe_table_info_manager;
std::unique_ptr<describe_table_info_manager> _describe_table_info_manager;
future<> cache_newly_calculated_size_on_all_shards(schema_ptr schema, std::uint64_t size_in_bytes, std::chrono::nanoseconds ttl);
future<> fill_table_size(rjson::value &table_description, schema_ptr schema, bool deleting);
public:
using client_state = service::client_state;
// request_return_type is the return type of the executor methods, which
@@ -170,6 +183,7 @@ public:
executor(gms::gossiper& gossiper,
service::storage_proxy& proxy,
service::storage_service& ss,
service::migration_manager& mm,
db::system_distributed_keyspace& sdks,
cdc::metadata& cdc_metadata,
@@ -217,6 +231,18 @@ private:
friend class rmw_operation;
static void describe_key_schema(rjson::value& parent, const schema&, std::unordered_map<std::string,std::string> * = nullptr, const std::map<sstring, sstring> *tags = nullptr);
future<rjson::value> fill_table_description(schema_ptr schema, table_status tbl_status, service::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit);
future<executor::request_return_type> create_table_on_shard0(service::client_state&& client_state, tracing::trace_state_ptr trace_state, rjson::value request, bool enforce_authorization, bool warn_authorization, const db::tablets_mode_t::mode tablets_mode);
future<> do_batch_write(
std::vector<std::pair<schema_ptr, put_or_delete_item>> mutation_builders,
service::client_state& client_state,
tracing::trace_state_ptr trace_state,
service_permit permit);
future<> cas_write(schema_ptr schema, service::cas_shard cas_shard, const dht::decorated_key& dk,
const std::vector<put_or_delete_item>& mutation_builders, service::client_state& client_state,
tracing::trace_state_ptr trace_state, service_permit permit);
public:
static void describe_key_schema(rjson::value& parent, const schema& schema, std::unordered_map<std::string,std::string>&, const std::map<sstring, sstring> *tags = nullptr);
@@ -228,12 +254,15 @@ public:
const std::optional<attrs_to_get>&,
uint64_t* = nullptr);
// Converts a multi-row selection result to JSON compatible with DynamoDB.
// For each row, this method calls item_callback, which takes the size of
// the item as the parameter.
static future<std::vector<rjson::value>> describe_multi_item(schema_ptr schema,
const query::partition_slice&& slice,
shared_ptr<cql3::selection::selection> selection,
foreign_ptr<lw_shared_ptr<query::result>> query_result,
shared_ptr<const std::optional<attrs_to_get>> attrs_to_get,
uint64_t& rcu_half_units);
noncopyable_function<void(uint64_t)> item_callback = {});
static void describe_single_item(const cql3::selection::selection&,
const std::vector<managed_bytes_opt>&,
@@ -261,7 +290,7 @@ bool is_big(const rjson::value& val, int big_size = 100'000);
// Check CQL's Role-Based Access Control (RBAC) permission (MODIFY,
// SELECT, DROP, etc.) on the given table. When permission is denied an
// appropriate user-readable api_error::access_denied is thrown.
future<> verify_permission(bool enforce_authorization, const service::client_state&, const schema_ptr&, auth::permission);
future<> verify_permission(bool enforce_authorization, bool warn_authorization, const service::client_state&, const schema_ptr&, auth::permission, alternator::stats& stats);
/**
* Make return type for serializing the object "streamed",

View File

@@ -50,7 +50,7 @@ public:
_operators.emplace_back(i);
check_depth_limit();
}
void add_dot(std::string(name)) {
void add_dot(std::string name) {
_operators.emplace_back(std::move(name));
check_depth_limit();
}
@@ -85,7 +85,7 @@ struct constant {
}
};
// "value" is is a value used in the right hand side of an assignment
// "value" is a value used in the right hand side of an assignment
// expression, "SET a = ...". It can be a constant (a reference to a value
// included in the request, e.g., ":val"), a path to an attribute from the
// existing item (e.g., "a.b[3].c"), or a function of other such values.
@@ -205,7 +205,7 @@ public:
// The supported primitive conditions are:
// 1. Binary operators - v1 OP v2, where OP is =, <>, <, <=, >, or >= and
// v1 and v2 are values - from the item (an attribute path), the query
// (a ":val" reference), or a function of the the above (only the size()
// (a ":val" reference), or a function of the above (only the size()
// function is supported).
// 2. Ternary operator - v1 BETWEEN v2 and v3 (means v1 >= v2 AND v1 <= v3).
// 3. N-ary operator - v1 IN ( v2, v3, ... )

View File

@@ -0,0 +1,301 @@
/*
* Copyright 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "alternator/http_compression.hh"
#include "alternator/server.hh"
#include <seastar/coroutine/maybe_yield.hh>
#include <zlib.h>
static logging::logger slogger("alternator-http-compression");
namespace alternator {
static constexpr size_t compressed_buffer_size = 1024;
class zlib_compressor {
z_stream _zs;
temporary_buffer<char> _output_buf;
noncopyable_function<future<>(temporary_buffer<char>&&)> _write_func;
public:
zlib_compressor(bool gzip, int compression_level, noncopyable_function<future<>(temporary_buffer<char>&&)> write_func)
: _write_func(std::move(write_func)) {
memset(&_zs, 0, sizeof(_zs));
if (deflateInit2(&_zs, std::clamp(compression_level, Z_NO_COMPRESSION, Z_BEST_COMPRESSION), Z_DEFLATED,
(gzip ? 16 : 0) + MAX_WBITS, 8, Z_DEFAULT_STRATEGY) != Z_OK) {
// Should only happen if memory allocation fails
throw std::bad_alloc();
}
}
~zlib_compressor() {
deflateEnd(&_zs);
}
future<> close() {
return compress(nullptr, 0, true);
}
future<> compress(const char* buf, size_t len, bool is_last_chunk = false) {
_zs.next_in = reinterpret_cast<unsigned char*>(const_cast<char*>(buf));
_zs.avail_in = (uInt) len;
int mode = is_last_chunk ? Z_FINISH : Z_NO_FLUSH;
while(_zs.avail_in > 0 || is_last_chunk) {
co_await coroutine::maybe_yield();
if (_output_buf.empty()) {
if (is_last_chunk) {
uint32_t max_buffer_size = 0;
deflatePending(&_zs, &max_buffer_size, nullptr);
max_buffer_size += deflateBound(&_zs, _zs.avail_in) + 1;
_output_buf = temporary_buffer<char>(std::min(compressed_buffer_size, (size_t) max_buffer_size));
} else {
_output_buf = temporary_buffer<char>(compressed_buffer_size);
}
_zs.next_out = reinterpret_cast<unsigned char*>(_output_buf.get_write());
_zs.avail_out = compressed_buffer_size;
}
int e = deflate(&_zs, mode);
if (e < Z_OK) {
throw api_error::internal("Error during compression of response body");
}
if (e == Z_STREAM_END || _zs.avail_out < compressed_buffer_size / 4) {
_output_buf.trim(compressed_buffer_size - _zs.avail_out);
co_await _write_func(std::move(_output_buf));
if (e == Z_STREAM_END) {
break;
}
}
}
}
};
// Helper string_view functions for parsing Accept-Encoding header
struct case_insensitive_cmp_sv {
bool operator()(std::string_view s1, std::string_view s2) const {
return std::equal(s1.begin(), s1.end(), s2.begin(), s2.end(),
[](char a, char b) { return ::tolower(a) == ::tolower(b); });
}
};
static inline std::string_view trim_left(std::string_view sv) {
while (!sv.empty() && std::isspace(static_cast<unsigned char>(sv.front())))
sv.remove_prefix(1);
return sv;
}
static inline std::string_view trim_right(std::string_view sv) {
while (!sv.empty() && std::isspace(static_cast<unsigned char>(sv.back())))
sv.remove_suffix(1);
return sv;
}
static inline std::string_view trim(std::string_view sv) {
return trim_left(trim_right(sv));
}
inline std::vector<std::string_view> split(std::string_view text, char separator) {
std::vector<std::string_view> tokens;
if (text == "") {
return tokens;
}
while (true) {
auto pos = text.find_first_of(separator);
if (pos != std::string_view::npos) {
tokens.emplace_back(text.data(), pos);
text.remove_prefix(pos + 1);
} else {
tokens.emplace_back(text);
break;
}
}
return tokens;
}
constexpr response_compressor::compression_type response_compressor::get_compression_type(std::string_view encoding) {
for (size_t i = 0; i < static_cast<size_t>(compression_type::count); ++i) {
if (case_insensitive_cmp_sv{}(encoding, compression_names[i])) {
return static_cast<compression_type>(i);
}
}
return compression_type::unknown;
}
response_compressor::compression_type response_compressor::find_compression(std::string_view accept_encoding, size_t response_size) {
std::optional<float> ct_q[static_cast<size_t>(compression_type::count)];
ct_q[static_cast<size_t>(compression_type::none)] = std::numeric_limits<float>::min(); // enabled, but lowest priority
compression_type selected_ct = compression_type::none;
std::vector<std::string_view> entries = split(accept_encoding, ',');
for (auto& e : entries) {
std::vector<std::string_view> params = split(e, ';');
if (params.size() == 0) {
continue;
}
compression_type ct = get_compression_type(trim(params[0]));
if (ct == compression_type::unknown) {
continue; // ignore unknown encoding types
}
if (ct_q[static_cast<size_t>(ct)].has_value() && ct_q[static_cast<size_t>(ct)] != 0.0f) {
continue; // already processed this encoding
}
if (response_size < _threshold[static_cast<size_t>(ct)]) {
continue; // below threshold treat as unknown
}
for (size_t i = 1; i < params.size(); ++i) { // find "q=" parameter
auto pos = params[i].find("q=");
if (pos == std::string_view::npos) {
continue;
}
std::string_view param = params[i].substr(pos + 2);
param = trim(param);
// parse quality value
float q_value = 1.0f;
auto [ptr, ec] = std::from_chars(param.data(), param.data() + param.size(), q_value);
if (ec != std::errc() || ptr != param.data() + param.size()) {
continue;
}
if (q_value < 0.0) {
q_value = 0.0;
} else if (q_value > 1.0) {
q_value = 1.0;
}
ct_q[static_cast<size_t>(ct)] = q_value;
break; // we parsed quality value
}
if (!ct_q[static_cast<size_t>(ct)].has_value()) {
ct_q[static_cast<size_t>(ct)] = 1.0f; // default quality value
}
// keep the highest encoding (in the order, unless 'any')
if (selected_ct == compression_type::any) {
if (ct_q[static_cast<size_t>(ct)] >= ct_q[static_cast<size_t>(selected_ct)]) {
selected_ct = ct;
}
} else {
if (ct_q[static_cast<size_t>(ct)] > ct_q[static_cast<size_t>(selected_ct)]) {
selected_ct = ct;
}
}
}
if (selected_ct == compression_type::any) {
// select any not mentioned or highest quality
selected_ct = compression_type::none;
for (size_t i = 0; i < static_cast<size_t>(compression_type::compressions_count); ++i) {
if (!ct_q[i].has_value()) {
return static_cast<compression_type>(i);
}
if (ct_q[i] > ct_q[static_cast<size_t>(selected_ct)]) {
selected_ct = static_cast<compression_type>(i);
}
}
}
return selected_ct;
}
static future<chunked_content> compress(response_compressor::compression_type ct, const db::config& cfg, std::string str) {
chunked_content compressed;
auto write = [&compressed](temporary_buffer<char>&& buf) -> future<> {
compressed.push_back(std::move(buf));
return make_ready_future<>();
};
zlib_compressor compressor(ct != response_compressor::compression_type::deflate,
cfg.alternator_response_gzip_compression_level(), std::move(write));
co_await compressor.compress(str.data(), str.size(), true);
co_return compressed;
}
static sstring flatten(chunked_content&& cc) {
size_t total_size = 0;
for (const auto& chunk : cc) {
total_size += chunk.size();
}
sstring result = sstring{ sstring::initialized_later{}, total_size };
size_t offset = 0;
for (const auto& chunk : cc) {
std::copy(chunk.begin(), chunk.end(), result.begin() + offset);
offset += chunk.size();
}
return result;
}
future<std::unique_ptr<http::reply>> response_compressor::generate_reply(std::unique_ptr<http::reply> rep, sstring accept_encoding, const char* content_type, std::string&& response_body) {
response_compressor::compression_type ct = find_compression(accept_encoding, response_body.size());
if (ct != response_compressor::compression_type::none) {
rep->add_header("Content-Encoding", get_encoding_name(ct));
rep->set_content_type(content_type);
return compress(ct, cfg, std::move(response_body)).then([rep = std::move(rep)] (chunked_content compressed) mutable {
rep->_content = flatten(std::move(compressed));
return make_ready_future<std::unique_ptr<http::reply>>(std::move(rep));
});
} else {
// Note that despite the move, there is a copy here -
// as str is std::string and rep->_content is sstring.
rep->_content = std::move(response_body);
rep->set_content_type(content_type);
}
return make_ready_future<std::unique_ptr<http::reply>>(std::move(rep));
}
template<typename Compressor>
class compressed_data_sink_impl : public data_sink_impl {
output_stream<char> _out;
Compressor _compressor;
public:
template<typename... Args>
compressed_data_sink_impl(output_stream<char>&& out, Args&&... args)
: _out(std::move(out)), _compressor(std::forward<Args>(args)..., [this](temporary_buffer<char>&& buf) {
return _out.write(std::move(buf));
}) { }
future<> put(std::span<temporary_buffer<char>> data) override {
return data_sink_impl::fallback_put(data, [this] (temporary_buffer<char>&& buf) {
return do_put(std::move(buf));
});
}
private:
future<> do_put(temporary_buffer<char> buf) {
co_return co_await _compressor.compress(buf.get(), buf.size());
}
future<> close() override {
return _compressor.close().then([this] {
return _out.close();
});
}
};
executor::body_writer compress(response_compressor::compression_type ct, const db::config& cfg, executor::body_writer&& bw) {
return [bw = std::move(bw), ct, level = cfg.alternator_response_gzip_compression_level()](output_stream<char>&& out) mutable -> future<> {
output_stream_options opts;
opts.trim_to_size = true;
std::unique_ptr<data_sink_impl> data_sink_impl;
switch (ct) {
case response_compressor::compression_type::gzip:
data_sink_impl = std::make_unique<compressed_data_sink_impl<zlib_compressor>>(std::move(out), true, level);
break;
case response_compressor::compression_type::deflate:
data_sink_impl = std::make_unique<compressed_data_sink_impl<zlib_compressor>>(std::move(out), false, level);
break;
case response_compressor::compression_type::none:
case response_compressor::compression_type::any:
case response_compressor::compression_type::unknown:
on_internal_error(slogger,"Compression not selected");
default:
on_internal_error(slogger, "Unsupported compression type for data sink");
}
return bw(output_stream<char>(data_sink(std::move(data_sink_impl)), compressed_buffer_size, opts));
};
}
future<std::unique_ptr<http::reply>> response_compressor::generate_reply(std::unique_ptr<http::reply> rep, sstring accept_encoding, const char* content_type, executor::body_writer&& body_writer) {
response_compressor::compression_type ct = find_compression(accept_encoding, std::numeric_limits<size_t>::max());
if (ct != response_compressor::compression_type::none) {
rep->add_header("Content-Encoding", get_encoding_name(ct));
rep->write_body(content_type, compress(ct, cfg, std::move(body_writer)));
} else {
rep->write_body(content_type, std::move(body_writer));
}
return make_ready_future<std::unique_ptr<http::reply>>(std::move(rep));
}
} // namespace alternator

View File

@@ -0,0 +1,91 @@
/*
* Copyright 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include "alternator/executor.hh"
#include <seastar/http/httpd.hh>
#include "db/config.hh"
namespace alternator {
class response_compressor {
public:
enum class compression_type {
gzip,
deflate,
compressions_count,
any = compressions_count,
none,
count,
unknown = count
};
static constexpr std::string_view compression_names[] = {
"gzip",
"deflate",
"*",
"identity"
};
static sstring get_encoding_name(compression_type ct) {
return sstring(compression_names[static_cast<size_t>(ct)]);
}
static constexpr compression_type get_compression_type(std::string_view encoding);
sstring get_accepted_encoding(const http::request& req) {
if (get_threshold() == 0) {
return "";
}
return req.get_header("Accept-Encoding");
}
compression_type find_compression(std::string_view accept_encoding, size_t response_size);
response_compressor(const db::config& cfg)
: cfg(cfg)
,_gzip_level_observer(
cfg.alternator_response_gzip_compression_level.observe([this](int v) {
update_threshold();
}))
,_gzip_threshold_observer(
cfg.alternator_response_compression_threshold_in_bytes.observe([this](uint32_t v) {
update_threshold();
}))
{
update_threshold();
}
response_compressor(const response_compressor& rhs) : response_compressor(rhs.cfg) {}
private:
const db::config& cfg;
utils::observable<int>::observer _gzip_level_observer;
utils::observable<uint32_t>::observer _gzip_threshold_observer;
uint32_t _threshold[static_cast<size_t>(compression_type::count)];
size_t get_threshold() { return _threshold[static_cast<size_t>(compression_type::any)]; }
void update_threshold() {
_threshold[static_cast<size_t>(compression_type::none)] = std::numeric_limits<uint32_t>::max();
_threshold[static_cast<size_t>(compression_type::any)] = std::numeric_limits<uint32_t>::max();
uint32_t gzip = cfg.alternator_response_gzip_compression_level() <= 0 ? std::numeric_limits<uint32_t>::max()
: cfg.alternator_response_compression_threshold_in_bytes();
_threshold[static_cast<size_t>(compression_type::gzip)] = gzip;
_threshold[static_cast<size_t>(compression_type::deflate)] = gzip;
for (size_t i = 0; i < static_cast<size_t>(compression_type::compressions_count); ++i) {
if (_threshold[i] < _threshold[static_cast<size_t>(compression_type::any)]) {
_threshold[static_cast<size_t>(compression_type::any)] = _threshold[i];
}
}
}
public:
future<std::unique_ptr<http::reply>> generate_reply(std::unique_ptr<http::reply> rep,
sstring accept_encoding, const char* content_type, std::string&& response_body);
future<std::unique_ptr<http::reply>> generate_reply(std::unique_ptr<http::reply> rep,
sstring accept_encoding, const char* content_type, executor::body_writer&& body_writer);
};
}

View File

@@ -8,6 +8,8 @@
#pragma once
#include "cdc/cdc_options.hh"
#include "cdc/log.hh"
#include "seastarx.hh"
#include "service/paxos/cas_request.hh"
#include "service/cas_shard.hh"
@@ -56,7 +58,7 @@ public:
static write_isolation get_write_isolation_for_schema(schema_ptr schema);
static write_isolation default_write_isolation;
public:
static void set_default_write_isolation(std::string_view mode);
protected:
@@ -107,10 +109,11 @@ public:
// violating this). We mark apply() "const" to let the compiler validate
// this for us. The output-only field _return_attributes is marked
// "mutable" above so that apply() can still write to it.
virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) const = 0;
virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts, cdc::per_request_options& cdc_opts) const = 0;
// Convert the above apply() into the signature needed by cas_request:
virtual std::optional<mutation> apply(foreign_ptr<lw_shared_ptr<query::result>> qr, const query::partition_slice& slice, api::timestamp_type ts) override;
virtual std::optional<mutation> apply(foreign_ptr<lw_shared_ptr<query::result>> qr, const query::partition_slice& slice, api::timestamp_type ts, cdc::per_request_options& cdc_opts) override;
virtual ~rmw_operation() = default;
const wcu_consumed_capacity_counter& consumed_capacity() const noexcept { return _consumed_capacity; }
schema_ptr schema() const { return _schema; }
const rjson::value& request() const { return _request; }
rjson::value&& move_request() && { return std::move(_request); }
@@ -124,6 +127,9 @@ public:
stats& per_table_stats,
uint64_t& wcu_total);
std::optional<service::cas_shard> shard_for_execute(bool needs_read_before_write);
private:
inline bool should_fill_preimage() const { return _schema->cdc_options().enabled(); }
};
} // namespace alternator

View File

@@ -12,7 +12,7 @@
#include "serialization.hh"
#include "error.hh"
#include "types/concrete_types.hh"
#include "cql3/type_json.hh"
#include "types/json_utils.hh"
#include "mutation/position_in_partition.hh"
static logging::logger slogger("alternator-serialization");
@@ -282,15 +282,23 @@ std::string type_to_string(data_type type) {
return it->second;
}
bytes get_key_column_value(const rjson::value& item, const column_definition& column) {
std::optional<bytes> try_get_key_column_value(const rjson::value& item, const column_definition& column) {
std::string column_name = column.name_as_text();
const rjson::value* key_typed_value = rjson::find(item, column_name);
if (!key_typed_value) {
throw api_error::validation(fmt::format("Key column {} not found", column_name));
return std::nullopt;
}
return get_key_from_typed_value(*key_typed_value, column);
}
bytes get_key_column_value(const rjson::value& item, const column_definition& column) {
auto value = try_get_key_column_value(item, column);
if (!value) {
throw api_error::validation(fmt::format("Key column {} not found", column.name_as_text()));
}
return std::move(*value);
}
// Parses the JSON encoding for a key value, which is a map with a single
// entry whose key is the type and the value is the encoded value.
// If this type does not match the desired "type_str", an api_error::validation
@@ -380,20 +388,38 @@ clustering_key ck_from_json(const rjson::value& item, schema_ptr schema) {
return clustering_key::make_empty();
}
std::vector<bytes> raw_ck;
// FIXME: this is a loop, but we really allow only one clustering key column.
// Note: it's possible to get more than one clustering column here, as
// Alternator can be used to read scylla internal tables.
for (const column_definition& cdef : schema->clustering_key_columns()) {
bytes raw_value = get_key_column_value(item, cdef);
auto raw_value = get_key_column_value(item, cdef);
raw_ck.push_back(std::move(raw_value));
}
return clustering_key::from_exploded(raw_ck);
}
position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema) {
auto ck = ck_from_json(item, schema);
if (is_alternator_keyspace(schema->ks_name())) {
return position_in_partition::for_key(std::move(ck));
clustering_key_prefix ck_prefix_from_json(const rjson::value& item, schema_ptr schema) {
if (schema->clustering_key_size() == 0) {
return clustering_key_prefix::make_empty();
}
std::vector<bytes> raw_ck;
for (const column_definition& cdef : schema->clustering_key_columns()) {
auto raw_value = try_get_key_column_value(item, cdef);
if (!raw_value) {
break;
}
raw_ck.push_back(std::move(*raw_value));
}
return clustering_key_prefix::from_exploded(raw_ck);
}
position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema) {
const bool is_alternator_ks = is_alternator_keyspace(schema->ks_name());
if (is_alternator_ks) {
return position_in_partition::for_key(ck_from_json(item, schema));
}
const auto region_item = rjson::find(item, scylla_paging_region);
const auto weight_item = rjson::find(item, scylla_paging_weight);
if (bool(region_item) != bool(weight_item)) {
@@ -413,8 +439,9 @@ position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema)
} else {
throw std::runtime_error(fmt::format("Invalid value for weight: {}", weight_view));
}
return position_in_partition(region, weight, region == partition_region::clustered ? std::optional(std::move(ck)) : std::nullopt);
return position_in_partition(region, weight, region == partition_region::clustered ? std::optional(ck_prefix_from_json(item, schema)) : std::nullopt);
}
auto ck = ck_from_json(item, schema);
if (ck.is_empty()) {
return position_in_partition::for_partition_start();
}
@@ -469,7 +496,7 @@ const std::pair<std::string, const rjson::value*> unwrap_set(const rjson::value&
return {"", nullptr};
}
auto it = v.MemberBegin();
const std::string it_key = it->name.GetString();
const std::string it_key = rjson::to_string(it->name);
if (it_key != "SS" && it_key != "BS" && it_key != "NS") {
return {std::move(it_key), nullptr};
}

View File

@@ -55,7 +55,7 @@ partition_key pk_from_json(const rjson::value& item, schema_ptr schema);
clustering_key ck_from_json(const rjson::value& item, schema_ptr schema);
position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema);
// If v encodes a number (i.e., it is a {"N": [...]}, returns an object representing it. Otherwise,
// If v encodes a number (i.e., it is a {"N": [...]}), returns an object representing it. Otherwise,
// raises ValidationException with diagnostic.
big_decimal unwrap_number(const rjson::value& v, std::string_view diagnostic);

View File

@@ -13,6 +13,7 @@
#include <seastar/http/function_handlers.hh>
#include <seastar/http/short_streams.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <seastar/util/defer.hh>
#include <seastar/util/short_streams.hh>
#include "seastarx.hh"
@@ -31,6 +32,9 @@
#include "utils/overloaded_functor.hh"
#include "utils/aws_sigv4.hh"
#include "client_data.hh"
#include "utils/updateable_value.hh"
#include <zlib.h>
#include "alternator/http_compression.hh"
static logging::logger slogger("alternator-server");
@@ -108,9 +112,12 @@ class api_handler : public handler_base {
// type applies to all replies, both success and error.
static constexpr const char* REPLY_CONTENT_TYPE = "application/x-amz-json-1.0";
public:
api_handler(const std::function<future<executor::request_return_type>(std::unique_ptr<request> req)>& _handle) : _f_handle(
api_handler(const std::function<future<executor::request_return_type>(std::unique_ptr<request> req)>& _handle,
const db::config& config) : _response_compressor(config), _f_handle(
[this, _handle](std::unique_ptr<request> req, std::unique_ptr<reply> rep) {
return seastar::futurize_invoke(_handle, std::move(req)).then_wrapped([this, rep = std::move(rep)](future<executor::request_return_type> resf) mutable {
sstring accept_encoding = _response_compressor.get_accepted_encoding(*req);
return seastar::futurize_invoke(_handle, std::move(req)).then_wrapped(
[this, rep = std::move(rep), accept_encoding=std::move(accept_encoding)](future<executor::request_return_type> resf) mutable {
if (resf.failed()) {
// Exceptions of type api_error are wrapped as JSON and
// returned to the client as expected. Other types of
@@ -130,22 +137,20 @@ public:
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
}
auto res = resf.get();
std::visit(overloaded_functor {
return std::visit(overloaded_functor {
[&] (std::string&& str) {
// Note that despite the move, there is a copy here -
// as str is std::string and rep->_content is sstring.
rep->_content = std::move(str);
rep->set_content_type(REPLY_CONTENT_TYPE);
return _response_compressor.generate_reply(std::move(rep), std::move(accept_encoding),
REPLY_CONTENT_TYPE, std::move(str));
},
[&] (executor::body_writer&& body_writer) {
rep->write_body(REPLY_CONTENT_TYPE, std::move(body_writer));
return _response_compressor.generate_reply(std::move(rep), std::move(accept_encoding),
REPLY_CONTENT_TYPE, std::move(body_writer));
},
[&] (const api_error& err) {
generate_error_reply(*rep, err);
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
}
}, std::move(res));
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
});
}) { }
@@ -174,6 +179,7 @@ protected:
slogger.trace("api_handler error case: {}", rep._content);
}
response_compressor _response_compressor;
future_handler_function _f_handle;
};
@@ -270,24 +276,57 @@ protected:
}
};
// This function increments the authentication_failures counter, and may also
// log a warn-level message and/or throw an exception, depending on what
// enforce_authorization and warn_authorization are set to.
// The username and client address are only used for logging purposes -
// they are not included in the error message returned to the client, since
// the client knows who it is.
// Note that if enforce_authorization is false, this function will return
// without throwing. So a caller that doesn't want to continue after an
// authentication_error must explicitly return after calling this function.
template<typename Exception>
static void authentication_error(alternator::stats& stats, bool enforce_authorization, bool warn_authorization, Exception&& e, std::string_view user, gms::inet_address client_address) {
stats.authentication_failures++;
if (enforce_authorization) {
if (warn_authorization) {
slogger.warn("alternator_warn_authorization=true: {} for user {}, client address {}", e.what(), user, client_address);
}
throw std::move(e);
} else {
if (warn_authorization) {
slogger.warn("If you set alternator_enforce_authorization=true the following will be enforced: {} for user {}, client address {}", e.what(), user, client_address);
}
}
}
future<std::string> server::verify_signature(const request& req, const chunked_content& content) {
if (!_enforce_authorization) {
if (!_enforce_authorization.get() && !_warn_authorization.get()) {
slogger.debug("Skipping authorization");
return make_ready_future<std::string>();
}
auto host_it = req._headers.find("Host");
if (host_it == req._headers.end()) {
throw api_error::invalid_signature("Host header is mandatory for signature verification");
authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),
api_error::invalid_signature("Host header is mandatory for signature verification"),
"", req.get_client_address());
return make_ready_future<std::string>();
}
auto authorization_it = req._headers.find("Authorization");
if (authorization_it == req._headers.end()) {
throw api_error::missing_authentication_token("Authorization header is mandatory for signature verification");
authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),
api_error::missing_authentication_token("Authorization header is mandatory for signature verification"),
"", req.get_client_address());
return make_ready_future<std::string>();
}
std::string host = host_it->second;
std::string_view authorization_header = authorization_it->second;
auto pos = authorization_header.find_first_of(' ');
if (pos == std::string_view::npos || authorization_header.substr(0, pos) != "AWS4-HMAC-SHA256") {
throw api_error::invalid_signature(fmt::format("Authorization header must use AWS4-HMAC-SHA256 algorithm: {}", authorization_header));
authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),
api_error::invalid_signature(fmt::format("Authorization header must use AWS4-HMAC-SHA256 algorithm: {}", authorization_header)),
"", req.get_client_address());
return make_ready_future<std::string>();
}
authorization_header.remove_prefix(pos+1);
std::string credential;
@@ -322,7 +361,9 @@ future<std::string> server::verify_signature(const request& req, const chunked_c
std::vector<std::string_view> credential_split = split(credential, '/');
if (credential_split.size() != 5) {
throw api_error::validation(fmt::format("Incorrect credential information format: {}", credential));
authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),
api_error::validation(fmt::format("Incorrect credential information format: {}", credential)), "", req.get_client_address());
return make_ready_future<std::string>();
}
std::string user(credential_split[0]);
std::string datestamp(credential_split[1]);
@@ -333,39 +374,81 @@ future<std::string> server::verify_signature(const request& req, const chunked_c
for (const auto& header : signed_headers) {
signed_headers_map.emplace(header, std::string_view());
}
std::vector<std::string> modified_values;
for (auto& header : req._headers) {
std::string header_str;
header_str.resize(header.first.size());
std::transform(header.first.begin(), header.first.end(), header_str.begin(), ::tolower);
auto it = signed_headers_map.find(header_str);
if (it != signed_headers_map.end()) {
it->second = std::string_view(header.second);
// replace multiple spaces in the header value header.second with
// a single space, as required by AWS SigV4 header canonization.
// If we modify the value, we need to save it in modified_values
// to keep it alive.
std::string value;
value.reserve(header.second.size());
bool prev_space = false;
bool modified = false;
for (char ch : header.second) {
if (ch == ' ') {
if (!prev_space) {
value += ch;
prev_space = true;
} else {
modified = true; // skip a space
}
} else {
value += ch;
prev_space = false;
}
}
if (modified) {
modified_values.emplace_back(std::move(value));
it->second = std::string_view(modified_values.back());
} else {
it->second = std::string_view(header.second);
}
}
}
auto cache_getter = [&proxy = _proxy, &as = _auth_service] (std::string username) {
return get_key_from_roles(proxy, as, std::move(username));
};
return _key_cache.get_ptr(user, cache_getter).then([this, &req, &content,
return _key_cache.get_ptr(user, cache_getter).then_wrapped([this, &req, &content,
user = std::move(user),
host = std::move(host),
datestamp = std::move(datestamp),
signed_headers_str = std::move(signed_headers_str),
signed_headers_map = std::move(signed_headers_map),
modified_values = std::move(modified_values),
region = std::move(region),
service = std::move(service),
user_signature = std::move(user_signature)] (key_cache::value_ptr key_ptr) {
user_signature = std::move(user_signature)] (future<key_cache::value_ptr> key_ptr_fut) {
key_cache::value_ptr key_ptr(nullptr);
try {
key_ptr = key_ptr_fut.get();
} catch (const api_error& e) {
authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),
e, user, req.get_client_address());
return std::string();
}
std::string signature;
try {
signature = utils::aws::get_signature(user, *key_ptr, std::string_view(host), "/", req._method,
datestamp, signed_headers_str, signed_headers_map, &content, region, service, "");
} catch (const std::exception& e) {
throw api_error::invalid_signature(e.what());
authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),
api_error::invalid_signature(fmt::format("invalid signature: {}", e.what())),
user, req.get_client_address());
return std::string();
}
if (signature != std::string_view(user_signature)) {
_key_cache.remove(user);
throw api_error::unrecognized_client("The security token included in the request is invalid.");
authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),
api_error::unrecognized_client("wrong signature"),
user, req.get_client_address());
return std::string();
}
return user;
});
@@ -378,35 +461,82 @@ static tracing::trace_state_ptr create_tracing_session(tracing::tracing& tracing
return tracing_instance.create_session(tracing::trace_type::QUERY, props);
}
// truncated_content_view() prints a potentially long chunked_content for
// debugging purposes. In the common case when the content is not excessively
// long, it just returns a view into the given content, without any copying.
// But when the content is very long, it is truncated after some arbitrary
// max_len (or one chunk, whichever comes first), with "<truncated>" added at
// the end. To do this modification to the string, we need to create a new
// std::string, so the caller must pass us a reference to one, "buf", where
// we can store the content. The returned view is only alive for as long this
// buf is kept alive.
static std::string_view truncated_content_view(const chunked_content& content, std::string& buf) {
constexpr size_t max_len = 1024;
if (content.empty()) {
return std::string_view();
} else if (content.size() == 1 && content.begin()->size() <= max_len) {
return std::string_view(content.begin()->get(), content.begin()->size());
} else {
buf = std::string(content.begin()->get(), std::min(content.begin()->size(), max_len)) + "<truncated>";
return std::string_view(buf);
// A helper class to represent a potentially truncated view of a chunked_content.
// If the content is short enough and single chunked, it just holds a view into the content.
// Otherwise it will be copied into an internal buffer, possibly truncated (depending on maximum allowed size passed in),
// and the view will point into that buffer.
// `as_view()` method will return the view.
// `take_as_sstring()` will either move out the internal buffer (if any), or create a new sstring from the view.
// You should consider `as_view()` valid as long both the original chunked_content and the truncated_content object are alive.
class truncated_content {
std::string_view _view;
sstring _content_maybe;
void copy_from_content(const chunked_content& content) {
size_t offset = 0;
for(auto &tmp : content) {
size_t to_copy = std::min(tmp.size(), _content_maybe.size() - offset);
std::copy(tmp.get(), tmp.get() + to_copy, _content_maybe.data() + offset);
offset += to_copy;
if (offset >= _content_maybe.size()) {
break;
}
}
}
public:
truncated_content(const chunked_content& content, size_t max_len = std::numeric_limits<size_t>::max()) {
if (content.empty()) return;
if (content.size() == 1 && content.begin()->size() <= max_len) {
_view = std::string_view(content.begin()->get(), content.begin()->size());
return;
}
constexpr std::string_view truncated_text = "<truncated>";
size_t content_size = 0;
for(auto &tmp : content) {
content_size += tmp.size();
}
if (content_size <= max_len) {
_content_maybe = sstring{ sstring::initialized_later{}, content_size };
copy_from_content(content);
}
else {
_content_maybe = sstring{ sstring::initialized_later{}, max_len + truncated_text.size() };
copy_from_content(content);
std::copy(truncated_text.begin(), truncated_text.end(), _content_maybe.data() + _content_maybe.size() - truncated_text.size());
}
_view = std::string_view(_content_maybe);
}
std::string_view as_view() const { return _view; }
sstring take_as_sstring() && {
if (_content_maybe.empty() && !_view.empty()) {
return sstring{_view};
}
return std::move(_content_maybe);
}
};
// `truncated_content_view` will produce an object representing a view to a passed content
// possibly truncated at some length. The value returned is used in two ways:
// - to print it in logs (use `as_view()` method for this)
// - to pass it to tracing object, where it will be stored and used later
// (use `take_as_sstring()` method as this produces a copy in form of a sstring)
// `truncated_content` delays constructing `sstring` object until it's actually needed.
// `truncated_content` is valid as long as passed `content` is alive.
// if the content is truncated, `<truncated>` will be appended at the maximum size limit
// and total size will be `max_users_query_size_in_trace_output() + strlen("<truncated>")`.
static truncated_content truncated_content_view(const chunked_content& content, size_t max_size) {
return truncated_content{content, max_size};
}
static tracing::trace_state_ptr maybe_trace_query(service::client_state& client_state, std::string_view username, std::string_view op, const chunked_content& query) {
static tracing::trace_state_ptr maybe_trace_query(service::client_state& client_state, std::string_view username, std::string_view op, const chunked_content& query, size_t max_users_query_size_in_trace_output) {
tracing::trace_state_ptr trace_state;
tracing::tracing& tracing_instance = tracing::tracing::get_local_tracing_instance();
if (tracing_instance.trace_next_query() || tracing_instance.slow_query_tracing_enabled()) {
trace_state = create_tracing_session(tracing_instance);
std::string buf;
tracing::add_session_param(trace_state, "alternator_op", op);
tracing::add_query(trace_state, truncated_content_view(query, buf));
tracing::add_query(trace_state, truncated_content_view(query, max_users_query_size_in_trace_output).take_as_sstring());
tracing::begin(trace_state, seastar::format("Alternator {}", op), client_state.get_client_address());
if (!username.empty()) {
tracing::set_username(trace_state, auth::authenticated_user(username));
@@ -415,37 +545,215 @@ static tracing::trace_state_ptr maybe_trace_query(service::client_state& client_
return trace_state;
}
// This read_entire_stream() is similar to Seastar's read_entire_stream()
// which reads the given content_stream until its end into non-contiguous
// memory. The difference is that this implementation takes an extra length
// limit, and throws an error if we read more than this limit.
// This length-limited variant would not have been needed if Seastar's HTTP
// server's set_content_length_limit() worked in every case, but unfortunately
// it does not - it only works if the request has a Content-Length header (see
// issue #8196). In contrast this function can limit the request's length no
// matter how it's encoded. We need this limit to protect Alternator from
// oversized requests that can deplete memory.
static future<chunked_content>
read_entire_stream(input_stream<char>& inp, size_t length_limit) {
chunked_content ret;
// We try to read length_limit + 1 bytes, so that we can throw an
// exception if we managed to read more than length_limit.
ssize_t remain = length_limit + 1;
do {
temporary_buffer<char> buf = co_await inp.read_up_to(remain);
if (buf.empty()) {
break;
}
remain -= buf.size();
ret.push_back(std::move(buf));
} while (remain > 0);
// If we read the full length_limit + 1 bytes, we went over the limit:
if (remain <= 0) {
// By throwing here an error, we may send a reply (the error message)
// without having read the full request body. Seastar's httpd will
// realize that we have not read the entire content stream, and
// correctly mark the connection unreusable, i.e., close it.
// This means we are currently exposed to issue #12166 caused by
// Seastar issue 1325), where the client may get an RST instead of
// a FIN, and may rarely get a "Connection reset by peer" before
// reading the error we send.
throw api_error::payload_too_large(fmt::format("Request content length limit of {} bytes exceeded", length_limit));
}
co_return ret;
}
// safe_gzip_stream is an exception-safe wrapper for zlib's z_stream.
// The "z_stream" struct is used by zlib to hold state while decompressing a
// stream of data. It allocates memory which must be freed with inflateEnd(),
// which the destructor of this class does.
class safe_gzip_zstream {
z_stream _zs;
public:
// If gzip is true, decode a gzip header (for "Content-Encoding: gzip").
// Otherwise, a zlib header (for "Content-Encoding: deflate").
safe_gzip_zstream(bool gzip = true) {
memset(&_zs, 0, sizeof(_zs));
if (inflateInit2(&_zs, gzip ? 16 + MAX_WBITS : MAX_WBITS) != Z_OK) {
// Should only happen if memory allocation fails
throw std::bad_alloc();
}
}
~safe_gzip_zstream() {
inflateEnd(&_zs);
}
z_stream* operator->() {
return &_zs;
}
z_stream* get() {
return &_zs;
}
void reset() {
inflateReset(&_zs);
}
};
// ungzip() takes a chunked_content of a compressed request body, and returns
// the uncompressed content as a chunked_content. If gzip is true, we expect
// gzip header (for "Content-Encoding: gzip"), if gzip is false, we expect a
// zlib header (for "Content-Encoding: deflate").
// If the uncompressed content exceeds length_limit, an error is thrown.
static future<chunked_content>
ungzip(chunked_content&& compressed_body, size_t length_limit, bool gzip = true) {
chunked_content ret;
// output_buf can be any size - when uncompressing input_buf, it doesn't
// need to fit in a single output_buf, we'll use multiple output_buf for
// a single input_buf if needed.
constexpr size_t OUTPUT_BUF_SIZE = 4096;
temporary_buffer<char> output_buf;
safe_gzip_zstream strm(gzip);
bool complete_stream = false; // empty input is not a valid gzip/deflate
size_t total_out_bytes = 0;
for (const temporary_buffer<char>& input_buf : compressed_body) {
if (input_buf.empty()) {
continue;
}
complete_stream = false;
strm->next_in = (Bytef*) input_buf.get();
strm->avail_in = (uInt) input_buf.size();
do {
co_await coroutine::maybe_yield();
if (output_buf.empty()) {
output_buf = temporary_buffer<char>(OUTPUT_BUF_SIZE);
}
strm->next_out = (Bytef*) output_buf.get();
strm->avail_out = OUTPUT_BUF_SIZE;
int e = inflate(strm.get(), Z_NO_FLUSH);
size_t out_bytes = OUTPUT_BUF_SIZE - strm->avail_out;
if (out_bytes > 0) {
// If output_buf is nearly full, we save it as-is in ret. But
// if it only has little data, better copy to a small buffer.
if (out_bytes > OUTPUT_BUF_SIZE/2) {
ret.push_back(std::move(output_buf).prefix(out_bytes));
// output_buf is now empty. if this loop finds more input,
// we'll allocate a new output buffer.
} else {
ret.push_back(temporary_buffer<char>(output_buf.get(), out_bytes));
}
total_out_bytes += out_bytes;
if (total_out_bytes > length_limit) {
throw api_error::payload_too_large(fmt::format("Request content length limit of {} bytes exceeded", length_limit));
}
}
if (e == Z_STREAM_END) {
// There may be more input after the first gzip stream - in
// either this input_buf or the next one. The additional input
// should be a second concatenated gzip. We need to allow that
// by resetting the gzip stream and continuing the input loop
// until there's no more input.
strm.reset();
if (strm->avail_in == 0) {
complete_stream = true;
break;
}
} else if (e != Z_OK && e != Z_BUF_ERROR) {
// DynamoDB returns an InternalServerError when given a bad
// gzip request body. See test test_broken_gzip_content
throw api_error::internal("Error during gzip decompression of request body");
}
} while (strm->avail_in > 0 || strm->avail_out == 0);
}
if (!complete_stream) {
// The gzip stream was not properly finished with Z_STREAM_END
throw api_error::internal("Truncated gzip in request body");
}
co_return ret;
}
future<executor::request_return_type> server::handle_api_request(std::unique_ptr<request> req) {
_executor._stats.total_operations++;
sstring target = req->get_header("X-Amz-Target");
// target is DynamoDB API version followed by a dot '.' and operation type (e.g. CreateTable)
auto dot = target.find('.');
std::string_view op = (dot == sstring::npos) ? std::string_view() : std::string_view(target).substr(dot+1);
if (req->content_length > request_content_length_limit) {
// If we have a Content-Length header and know the request will be too
// long, we don't need to wait for read_entire_stream() below to
// discover it. And we definitely mustn't try to get_units() below for
// for such a size.
co_return api_error::payload_too_large(fmt::format("Request content length limit of {} bytes exceeded", request_content_length_limit));
}
// JSON parsing can allocate up to roughly 2x the size of the raw
// document, + a couple of bytes for maintenance.
// TODO: consider the case where req->content_length is missing. Maybe
// we need to take the content_length_limit and return some of the units
// when we finish read_content_and_verify_signature?
size_t mem_estimate = req->content_length * 2 + 8000;
// If the Content-Length of the request is not available, we assume
// the largest possible request (request_content_length_limit, i.e., 16 MB)
// and after reading the request we return_units() the excess.
size_t mem_estimate = (req->content_length ? req->content_length : request_content_length_limit) * 2 + 8000;
auto units_fut = get_units(*_memory_limiter, mem_estimate);
if (_memory_limiter->waiters()) {
++_executor._stats.requests_blocked_memory;
}
auto units = co_await std::move(units_fut);
SCYLLA_ASSERT(req->content_stream);
chunked_content content = co_await util::read_entire_stream(*req->content_stream);
throwing_assert(req->content_stream);
chunked_content content = co_await read_entire_stream(*req->content_stream, request_content_length_limit);
// If the request had no Content-Length, we reserved too many units
// so need to return some
if (req->content_length == 0) {
size_t content_length = 0;
for (const auto& chunk : content) {
content_length += chunk.size();
}
size_t new_mem_estimate = content_length * 2 + 8000;
units.return_units(mem_estimate - new_mem_estimate);
}
auto username = co_await verify_signature(*req, content);
// If the request is compressed, uncompress it now, after we checked
// the signature (the signature is computed on the compressed content).
// We apply the request_content_length_limit again to the uncompressed
// content - we don't want to allow a tiny compressed request to
// expand to a huge uncompressed request.
sstring content_encoding = req->get_header("Content-Encoding");
if (content_encoding == "gzip") {
content = co_await ungzip(std::move(content), request_content_length_limit);
} else if (content_encoding == "deflate") {
content = co_await ungzip(std::move(content), request_content_length_limit, false);
} else if (!content_encoding.empty()) {
// DynamoDB returns a 500 error for unsupported Content-Encoding.
// I'm not sure if this is the best error code, but let's do it too.
// See the test test_garbage_content_encoding confirming this case.
co_return api_error::internal("Unsupported Content-Encoding");
}
// As long as the system_clients_entry object is alive, this request will
// be visible in the "system.clients" virtual table. When requested, this
// entry will be formatted by server::ongoing_request::make_client_data().
auto user_agent_header = co_await _connection_options_keys_and_values.get_or_load(req->get_header("User-Agent"), [] (const client_options_cache_key_type&) {
return make_ready_future<options_cache_value_type>(options_cache_value_type{});
});
auto system_clients_entry = _ongoing_requests.emplace(
req->get_client_address(), req->get_header("User-Agent"),
req->get_client_address(), std::move(user_agent_header),
username, current_scheduling_group(),
req->get_protocol_name() == "https");
if (slogger.is_enabled(log_level::trace)) {
std::string buf;
slogger.trace("Request: {} {} {}", op, truncated_content_view(content, buf), req->_headers);
slogger.trace("Request: {} {} {}", op, truncated_content_view(content, _max_users_query_size_in_trace_output).as_view(), req->_headers);
}
auto callback_it = _callbacks.find(op);
if (callback_it == _callbacks.end()) {
@@ -465,7 +773,7 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
}
co_await client_state.maybe_update_per_service_level_params();
tracing::trace_state_ptr trace_state = maybe_trace_query(client_state, username, op, content);
tracing::trace_state_ptr trace_state = maybe_trace_query(client_state, username, op, content, _max_users_query_size_in_trace_output.get());
tracing::trace(trace_state, "{}", op);
auto user = client_state.user();
@@ -485,7 +793,7 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
void server::set_routes(routes& r) {
api_handler* req_handler = new api_handler([this] (std::unique_ptr<request> req) mutable {
return handle_api_request(std::move(req));
});
}, _proxy.data_dictionary().get_config());
r.put(operation_type::POST, "/", req_handler);
r.put(operation_type::GET, "/", new health_handler(_pending_requests));
@@ -516,7 +824,7 @@ server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gos
, _auth_service(auth_service)
, _sl_controller(sl_controller)
, _key_cache(1024, 1min, slogger)
, _enforce_authorization(false)
, _max_users_query_size_in_trace_output(1024)
, _enabled_servers{}
, _pending_requests("alternator::server::pending_requests")
, _timeout_config(_proxy.data_dictionary().get_config())
@@ -596,28 +904,39 @@ server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gos
} {
}
future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
utils::updateable_value<bool> enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests) {
future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port,
std::optional<uint16_t> port_proxy_protocol, std::optional<uint16_t> https_port_proxy_protocol,
std::optional<tls::credentials_builder> creds,
utils::updateable_value<bool> enforce_authorization, utils::updateable_value<bool> warn_authorization, utils::updateable_value<uint64_t> max_users_query_size_in_trace_output,
semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests) {
_memory_limiter = memory_limiter;
_enforce_authorization = std::move(enforce_authorization);
_warn_authorization = std::move(warn_authorization);
_max_concurrent_requests = std::move(max_concurrent_requests);
if (!port && !https_port) {
_max_users_query_size_in_trace_output = std::move(max_users_query_size_in_trace_output);
if (!port && !https_port && !port_proxy_protocol && !https_port_proxy_protocol) {
return make_exception_future<>(std::runtime_error("Either regular port or TLS port"
" must be specified in order to init an alternator HTTP server instance"));
}
return seastar::async([this, addr, port, https_port, creds] {
return seastar::async([this, addr, port, https_port, port_proxy_protocol, https_port_proxy_protocol, creds] {
_executor.start().get();
if (port) {
if (port || port_proxy_protocol) {
set_routes(_http_server._routes);
_http_server.set_content_length_limit(server::content_length_limit);
_http_server.set_content_streaming(true);
_http_server.listen(socket_address{addr, *port}).get();
if (port) {
_http_server.listen(socket_address{addr, *port}).get();
}
if (port_proxy_protocol) {
listen_options lo;
lo.reuse_address = true;
lo.proxy_protocol = true;
_http_server.listen(socket_address{addr, *port_proxy_protocol}, lo).get();
}
_enabled_servers.push_back(std::ref(_http_server));
}
if (https_port) {
if (https_port || https_port_proxy_protocol) {
set_routes(_https_server._routes);
_https_server.set_content_length_limit(server::content_length_limit);
_https_server.set_content_streaming(true);
if (this_shard_id() == 0) {
@@ -636,7 +955,15 @@ future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std:
} else {
_credentials = creds->build_server_credentials();
}
_https_server.listen(socket_address{addr, *https_port}, _credentials).get();
if (https_port) {
_https_server.listen(socket_address{addr, *https_port}, _credentials).get();
}
if (https_port_proxy_protocol) {
listen_options lo;
lo.reuse_address = true;
lo.proxy_protocol = true;
_https_server.listen(socket_address{addr, *https_port_proxy_protocol}, lo, _credentials).get();
}
_enabled_servers.push_back(std::ref(_https_server));
}
});
@@ -709,16 +1036,15 @@ client_data server::ongoing_request::make_client_data() const {
// and keep "driver_version" unset.
cd.driver_name = _user_agent;
// Leave "protocol_version" unset, it has no meaning in Alternator.
// Leave "hostname", "ssl_protocol" and "ssl_cipher_suite" unset.
// As reported in issue #9216, we never set these fields in CQL
// either (see cql_server::connection::make_client_data()).
// Leave "hostname", "ssl_protocol" and "ssl_cipher_suite" unset for Alternator.
// Note: CQL sets ssl_protocol and ssl_cipher_suite via generic_server::connection base class.
return cd;
}
future<utils::chunked_vector<client_data>> server::get_client_data() {
utils::chunked_vector<client_data> ret;
future<utils::chunked_vector<foreign_ptr<std::unique_ptr<client_data>>>> server::get_client_data() {
utils::chunked_vector<foreign_ptr<std::unique_ptr<client_data>>> ret;
co_await _ongoing_requests.for_each_gently([&ret] (const ongoing_request& r) {
ret.emplace_back(r.make_client_data());
ret.emplace_back(make_foreign(std::make_unique<client_data>(r.make_client_data())));
});
co_return ret;
}

View File

@@ -28,7 +28,11 @@ namespace alternator {
using chunked_content = rjson::chunked_content;
class server : public peering_sharded_service<server> {
static constexpr size_t content_length_limit = 16*MB;
// The maximum size of a request body that Alternator will accept,
// in bytes. This is a safety measure to prevent Alternator from
// running out of memory when a client sends a very large request.
// DynamoDB also has the same limit set to 16 MB.
static constexpr size_t request_content_length_limit = 16*MB;
using alternator_callback = std::function<future<executor::request_return_type>(executor&, executor::client_state&,
tracing::trace_state_ptr, service_permit, rjson::value, std::unique_ptr<http::request>)>;
using alternator_callbacks_map = std::unordered_map<std::string_view, alternator_callback>;
@@ -43,12 +47,15 @@ class server : public peering_sharded_service<server> {
key_cache _key_cache;
utils::updateable_value<bool> _enforce_authorization;
utils::updateable_value<bool> _warn_authorization;
utils::updateable_value<uint64_t> _max_users_query_size_in_trace_output;
utils::small_vector<std::reference_wrapper<seastar::httpd::http_server>, 2> _enabled_servers;
named_gate _pending_requests;
// In some places we will need a CQL updateable_timeout_config object even
// though it isn't really relevant for Alternator which defines its own
// timeouts separately. We can create this object only once.
updateable_timeout_config _timeout_config;
client_options_cache_type _connection_options_keys_and_values;
alternator_callbacks_map _callbacks;
@@ -82,7 +89,7 @@ class server : public peering_sharded_service<server> {
// is called when reading the "system.clients" virtual table.
struct ongoing_request {
socket_address _client_address;
sstring _user_agent;
client_options_cache_entry_type _user_agent;
sstring _username;
scheduling_group _scheduling_group;
bool _is_https;
@@ -93,14 +100,17 @@ class server : public peering_sharded_service<server> {
public:
server(executor& executor, service::storage_proxy& proxy, gms::gossiper& gossiper, auth::service& service, qos::service_level_controller& sl_controller);
future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
utils::updateable_value<bool> enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests);
future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port,
std::optional<uint16_t> port_proxy_protocol, std::optional<uint16_t> https_port_proxy_protocol,
std::optional<tls::credentials_builder> creds,
utils::updateable_value<bool> enforce_authorization, utils::updateable_value<bool> warn_authorization, utils::updateable_value<uint64_t> max_users_query_size_in_trace_output,
semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests);
future<> stop();
// get_client_data() is called (on each shard separately) when the virtual
// table "system.clients" is read. It is expected to generate a list of
// clients connected to this server (on this shard). This function is
// called by alternator::controller::get_client_data().
future<utils::chunked_vector<client_data>> get_client_data();
future<utils::chunked_vector<foreign_ptr<std::unique_ptr<client_data>>>> get_client_data();
private:
void set_routes(seastar::httpd::routes& r);
// If verification succeeds, returns the authenticated user's username

View File

@@ -154,6 +154,18 @@ static void register_metrics_with_optional_table(seastar::metrics::metric_groups
[&stats]{ return estimated_histogram_to_metrics(stats.api_operations.batch_get_item_histogram);})(op("BatchGetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("batch_item_count_histogram", seastar::metrics::description("Histogram of the number of items in a batch request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.api_operations.batch_write_item_histogram);})(op("BatchWriteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.get_item_op_size_kb);})(op("GetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.put_item_op_size_kb);})(op("PutItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.delete_item_op_size_kb);})(op("DeleteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.update_item_op_size_kb);})(op("UpdateItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.batch_get_item_op_size_kb);})(op("BatchGetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.batch_write_item_op_size_kb);})(op("BatchWriteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
});
seastar::metrics::label expression_label("expression");
@@ -176,6 +188,16 @@ static void register_metrics_with_optional_table(seastar::metrics::metric_groups
seastar::metrics::make_total_operations("expression_cache_misses", stats.expression_cache.requests[stats::expression_types::PROJECTION_EXPRESSION].misses,
seastar::metrics::description("Counts number of misses of cached expressions"), labels)(expression_label("ProjectionExpression")).aggregate(aggregate_labels).set_skip_when_empty()
});
// Only register the following metrics for the global metrics, not per-table
if (!has_table) {
metrics.add_group("alternator", {
seastar::metrics::make_counter("authentication_failures", stats.authentication_failures,
seastar::metrics::description("total number of authentication failures"), labels).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_counter("authorization_failures", stats.authorization_failures,
seastar::metrics::description("total number of authorization failures"), labels).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
});
}
}
void register_metrics(seastar::metrics::metric_groups& metrics, const stats& stats) {

View File

@@ -79,6 +79,43 @@ public:
utils::estimated_histogram batch_get_item_histogram{22}; // a histogram that covers the range 1 - 100
utils::estimated_histogram batch_write_item_histogram{22}; // a histogram that covers the range 1 - 100
} api_operations;
// Operation size metrics
struct {
// Item size statistics collected per table and aggregated per node.
// Each histogram covers the range 0 - 446. Resolves #25143.
// A size is the retrieved item's size.
utils::estimated_histogram get_item_op_size_kb{30};
// A size is the maximum of the new item's size and the old item's size.
utils::estimated_histogram put_item_op_size_kb{30};
// A size is the deleted item's size. If the deleted item's size is
// unknown (i.e. read-before-write wasn't necessary and it wasn't
// forced by a configuration option), it won't be recorded on the
// histogram.
utils::estimated_histogram delete_item_op_size_kb{30};
// A size is the maximum of existing item's size and the estimated size
// of the update. This will be changed to the maximum of the existing item's
// size and the new item's size in a subsequent PR.
utils::estimated_histogram update_item_op_size_kb{30};
// A size is the sum of the sizes of all items per table. This means
// that a single BatchGetItem / BatchWriteItem updates the histogram
// for each table that it has items in.
// The sizes are the retrieved items' sizes grouped per table.
utils::estimated_histogram batch_get_item_op_size_kb{30};
// The sizes are the the written items' sizes grouped per table.
utils::estimated_histogram batch_write_item_op_size_kb{30};
} operation_sizes;
// Count of authentication and authorization failures, counted if either
// alternator_enforce_authorization or alternator_warn_authorization are
// set to true. If both are false, no authentication or authorization
// checks are performed, so failures are not recognized or counted.
// "authentication" failure means the request was not signed with a valid
// user and key combination. "authorization" failure means the request was
// authenticated to a valid user - but this user did not have permissions
// to perform the operation (considering RBAC settings and the user's
// superuser status).
uint64_t authentication_failures = 0;
uint64_t authorization_failures = 0;
// Miscellaneous event counters
uint64_t total_operations = 0;
uint64_t unsupported_operations = 0;
@@ -126,4 +163,8 @@ struct table_stats {
};
void register_metrics(seastar::metrics::metric_groups& metrics, const stats& stats);
inline uint64_t bytes_to_kb_ceil(uint64_t bytes) {
return (bytes + 1023) / 1024;
}
}

View File

@@ -13,7 +13,6 @@
#include <seastar/json/formatter.hh>
#include "auth/permission.hh"
#include "db/config.hh"
#include "cdc/log.hh"
@@ -127,7 +126,7 @@ public:
}
};
}
} // namespace alternator
template<typename ValueType>
struct rapidjson::internal::TypeHelper<ValueType, alternator::stream_arn>
@@ -297,7 +296,7 @@ sequence_number::sequence_number(std::string_view v)
}())
{}
}
} // namespace alternator
template<typename ValueType>
struct rapidjson::internal::TypeHelper<ValueType, alternator::shard_id>
@@ -357,7 +356,7 @@ static stream_view_type cdc_options_to_steam_view_type(const cdc::options& opts)
return type;
}
}
} // namespace alternator
template<typename ValueType>
struct rapidjson::internal::TypeHelper<ValueType, alternator::stream_view_type>
@@ -476,10 +475,10 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
} else {
status = "ENABLED";
}
}
}
auto ttl = std::chrono::seconds(opts.ttl());
rjson::add(stream_desc, "StreamStatus", rjson::from_string(status));
stream_view_type type = cdc_options_to_steam_view_type(opts);
@@ -492,7 +491,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
if (!opts.enabled()) {
rjson::add(ret, "StreamDescription", std::move(stream_desc));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
co_return rjson::print(std::move(ret));
}
// TODO: label
@@ -503,123 +502,121 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
// filter out cdc generations older than the table or now() - cdc::ttl (typically dynamodb_streams_max_window - 24h)
auto low_ts = std::max(as_timepoint(schema->id()), db_clock::now() - ttl);
return _sdks.cdc_get_versioned_streams(low_ts, { normal_token_owners }).then([db, shard_start, limit, ret = std::move(ret), stream_desc = std::move(stream_desc)] (std::map<db_clock::time_point, cdc::streams_version> topologies) mutable {
std::map<db_clock::time_point, cdc::streams_version> topologies = co_await _sdks.cdc_get_versioned_streams(low_ts, { normal_token_owners });
auto e = topologies.end();
auto prev = e;
auto shards = rjson::empty_array();
auto e = topologies.end();
auto prev = e;
auto shards = rjson::empty_array();
std::optional<shard_id> last;
std::optional<shard_id> last;
auto i = topologies.begin();
// if we're a paged query, skip to the generation where we left of.
if (shard_start) {
i = topologies.find(shard_start->time);
}
auto i = topologies.begin();
// if we're a paged query, skip to the generation where we left of.
if (shard_start) {
i = topologies.find(shard_start->time);
}
// for parent-child stuff we need id:s to be sorted by token
// (see explanation above) since we want to find closest
// token boundary when determining parent.
// #7346 - we processed and searched children/parents in
// stored order, which is not necessarily token order,
// so the finding of "closest" token boundary (using upper bound)
// could give somewhat weird results.
static auto token_cmp = [](const cdc::stream_id& id1, const cdc::stream_id& id2) {
return id1.token() < id2.token();
};
// for parent-child stuff we need id:s to be sorted by token
// (see explanation above) since we want to find closest
// token boundary when determining parent.
// #7346 - we processed and searched children/parents in
// stored order, which is not necessarily token order,
// so the finding of "closest" token boundary (using upper bound)
// could give somewhat weird results.
static auto token_cmp = [](const cdc::stream_id& id1, const cdc::stream_id& id2) {
return id1.token() < id2.token();
};
// #7409 - shards must be returned in lexicographical order,
// normal bytes compare is string_traits<int8_t>::compare.
// thus bytes 0x8000 is less than 0x0000. By doing unsigned
// compare instead we inadvertently will sort in string lexical.
static auto id_cmp = [](const cdc::stream_id& id1, const cdc::stream_id& id2) {
return compare_unsigned(id1.to_bytes(), id2.to_bytes()) < 0;
};
// need a prev even if we are skipping stuff
if (i != topologies.begin()) {
prev = std::prev(i);
}
for (; limit > 0 && i != e; prev = i, ++i) {
auto& [ts, sv] = *i;
last = std::nullopt;
auto lo = sv.streams.begin();
auto end = sv.streams.end();
// #7409 - shards must be returned in lexicographical order,
// normal bytes compare is string_traits<int8_t>::compare.
// thus bytes 0x8000 is less than 0x0000. By doing unsigned
// compare instead we inadvertently will sort in string lexical.
static auto id_cmp = [](const cdc::stream_id& id1, const cdc::stream_id& id2) {
return compare_unsigned(id1.to_bytes(), id2.to_bytes()) < 0;
};
std::sort(lo, end, id_cmp);
// need a prev even if we are skipping stuff
if (i != topologies.begin()) {
prev = std::prev(i);
if (shard_start) {
// find next shard position
lo = std::upper_bound(lo, end, shard_start->id, id_cmp);
shard_start = std::nullopt;
}
for (; limit > 0 && i != e; prev = i, ++i) {
auto& [ts, sv] = *i;
if (lo != end && prev != e) {
// We want older stuff sorted in token order so we can find matching
// token range when determining parent shard.
std::stable_sort(prev->second.streams.begin(), prev->second.streams.end(), token_cmp);
}
auto expired = [&]() -> std::optional<db_clock::time_point> {
auto j = std::next(i);
if (j == e) {
return std::nullopt;
}
// add this so we sort of match potential
// sequence numbers in get_records result.
return j->first + confidence_interval(db);
}();
while (lo != end) {
auto& id = *lo++;
auto shard = rjson::empty_object();
if (prev != e) {
auto& pids = prev->second.streams;
auto pid = std::upper_bound(pids.begin(), pids.end(), id.token(), [](const dht::token& t, const cdc::stream_id& id) {
return t < id.token();
});
if (pid != pids.begin()) {
pid = std::prev(pid);
}
if (pid != pids.end()) {
rjson::add(shard, "ParentShardId", shard_id(prev->first, *pid));
}
}
last.emplace(ts, id);
rjson::add(shard, "ShardId", *last);
auto range = rjson::empty_object();
rjson::add(range, "StartingSequenceNumber", sequence_number(utils::UUID_gen::min_time_UUID(ts.time_since_epoch())));
if (expired) {
rjson::add(range, "EndingSequenceNumber", sequence_number(utils::UUID_gen::min_time_UUID(expired->time_since_epoch())));
}
rjson::add(shard, "SequenceNumberRange", std::move(range));
rjson::push_back(shards, std::move(shard));
if (--limit == 0) {
break;
}
last = std::nullopt;
auto lo = sv.streams.begin();
auto end = sv.streams.end();
// #7409 - shards must be returned in lexicographical order,
std::sort(lo, end, id_cmp);
if (shard_start) {
// find next shard position
lo = std::upper_bound(lo, end, shard_start->id, id_cmp);
shard_start = std::nullopt;
}
if (lo != end && prev != e) {
// We want older stuff sorted in token order so we can find matching
// token range when determining parent shard.
std::stable_sort(prev->second.streams.begin(), prev->second.streams.end(), token_cmp);
}
auto expired = [&]() -> std::optional<db_clock::time_point> {
auto j = std::next(i);
if (j == e) {
return std::nullopt;
}
// add this so we sort of match potential
// sequence numbers in get_records result.
return j->first + confidence_interval(db);
}();
while (lo != end) {
auto& id = *lo++;
auto shard = rjson::empty_object();
if (prev != e) {
auto& pids = prev->second.streams;
auto pid = std::upper_bound(pids.begin(), pids.end(), id.token(), [](const dht::token& t, const cdc::stream_id& id) {
return t < id.token();
});
if (pid != pids.begin()) {
pid = std::prev(pid);
}
if (pid != pids.end()) {
rjson::add(shard, "ParentShardId", shard_id(prev->first, *pid));
}
}
last.emplace(ts, id);
rjson::add(shard, "ShardId", *last);
auto range = rjson::empty_object();
rjson::add(range, "StartingSequenceNumber", sequence_number(utils::UUID_gen::min_time_UUID(ts.time_since_epoch())));
if (expired) {
rjson::add(range, "EndingSequenceNumber", sequence_number(utils::UUID_gen::min_time_UUID(expired->time_since_epoch())));
}
rjson::add(shard, "SequenceNumberRange", std::move(range));
rjson::push_back(shards, std::move(shard));
if (--limit == 0) {
break;
}
last = std::nullopt;
}
}
}
if (last) {
rjson::add(stream_desc, "LastEvaluatedShardId", *last);
}
if (last) {
rjson::add(stream_desc, "LastEvaluatedShardId", *last);
}
rjson::add(stream_desc, "Shards", std::move(shards));
rjson::add(ret, "StreamDescription", std::move(stream_desc));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
});
rjson::add(stream_desc, "Shards", std::move(shards));
rjson::add(ret, "StreamDescription", std::move(stream_desc));
co_return rjson::print(std::move(ret));
}
enum class shard_iterator_type {
@@ -715,7 +712,7 @@ future<executor::request_return_type> executor::get_shard_iterator(client_state&
auto type = rjson::get<shard_iterator_type>(request, "ShardIteratorType");
auto seq_num = rjson::get_opt<sequence_number>(request, "SequenceNumber");
if (type < shard_iterator_type::TRIM_HORIZON && !seq_num) {
throw api_error::validation("Missing required parameter \"SequenceNumber\"");
}
@@ -725,7 +722,7 @@ future<executor::request_return_type> executor::get_shard_iterator(client_state&
auto stream_arn = rjson::get<alternator::stream_arn>(request, "StreamArn");
auto db = _proxy.data_dictionary();
schema_ptr schema = nullptr;
std::optional<shard_id> sid;
@@ -790,7 +787,7 @@ struct event_id {
return os;
}
};
}
} // namespace alternator
template<typename ValueType>
struct rapidjson::internal::TypeHelper<ValueType, alternator::event_id>
@@ -828,7 +825,7 @@ future<executor::request_return_type> executor::get_records(client_state& client
tracing::add_table_name(trace_state, schema->ks_name(), schema->cf_name());
co_await verify_permission(_enforce_authorization, client_state, schema, auth::permission::SELECT);
co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, schema, auth::permission::SELECT, _stats);
db::consistency_level cl = db::consistency_level::LOCAL_QUORUM;
partition_key pk = iter.shard.id.to_partition_key(*schema);
@@ -899,162 +896,169 @@ future<executor::request_return_type> executor::get_records(client_state& client
auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice, _proxy.get_max_result_size(partition_slice),
query::tombstone_limit(_proxy.get_tombstone_limit()), query::row_limit(limit * mul));
co_return co_await _proxy.query(schema, std::move(command), std::move(partition_ranges), cl, service::storage_proxy::coordinator_query_options(default_timeout(), std::move(permit), client_state)).then(
[this, schema, partition_slice = std::move(partition_slice), selection = std::move(selection), start_time = std::move(start_time), limit, key_names = std::move(key_names), attr_names = std::move(attr_names), type, iter, high_ts] (service::storage_proxy::coordinator_query_result qr) mutable {
cql3::selection::result_set_builder builder(*selection, gc_clock::now());
query::result_view::consume(*qr.query_result, partition_slice, cql3::selection::result_set_builder::visitor(builder, *schema, *selection));
service::storage_proxy::coordinator_query_result qr = co_await _proxy.query(schema, std::move(command), std::move(partition_ranges), cl, service::storage_proxy::coordinator_query_options(default_timeout(), std::move(permit), client_state));
cql3::selection::result_set_builder builder(*selection, gc_clock::now());
query::result_view::consume(*qr.query_result, partition_slice, cql3::selection::result_set_builder::visitor(builder, *schema, *selection));
auto result_set = builder.build();
auto records = rjson::empty_array();
auto result_set = builder.build();
auto records = rjson::empty_array();
auto& metadata = result_set->get_metadata();
auto& metadata = result_set->get_metadata();
auto op_index = std::distance(metadata.get_names().begin(),
std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {
return cdef->name->name() == op_column_name;
})
);
auto ts_index = std::distance(metadata.get_names().begin(),
std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {
return cdef->name->name() == timestamp_column_name;
})
);
auto eor_index = std::distance(metadata.get_names().begin(),
std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {
return cdef->name->name() == eor_column_name;
})
);
auto op_index = std::distance(metadata.get_names().begin(),
std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {
return cdef->name->name() == op_column_name;
})
);
auto ts_index = std::distance(metadata.get_names().begin(),
std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {
return cdef->name->name() == timestamp_column_name;
})
);
auto eor_index = std::distance(metadata.get_names().begin(),
std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {
return cdef->name->name() == eor_column_name;
})
);
std::optional<utils::UUID> timestamp;
auto dynamodb = rjson::empty_object();
auto record = rjson::empty_object();
const auto dc_name = _proxy.get_token_metadata_ptr()->get_topology().get_datacenter();
std::optional<utils::UUID> timestamp;
auto dynamodb = rjson::empty_object();
auto record = rjson::empty_object();
const auto dc_name = _proxy.get_token_metadata_ptr()->get_topology().get_datacenter();
using op_utype = std::underlying_type_t<cdc::operation>;
using op_utype = std::underlying_type_t<cdc::operation>;
auto maybe_add_record = [&] {
if (!dynamodb.ObjectEmpty()) {
rjson::add(record, "dynamodb", std::move(dynamodb));
dynamodb = rjson::empty_object();
}
if (!record.ObjectEmpty()) {
rjson::add(record, "awsRegion", rjson::from_string(dc_name));
rjson::add(record, "eventID", event_id(iter.shard.id, *timestamp));
rjson::add(record, "eventSource", "scylladb:alternator");
rjson::add(record, "eventVersion", "1.0");
rjson::push_back(records, std::move(record));
record = rjson::empty_object();
--limit;
}
};
auto maybe_add_record = [&] {
if (!dynamodb.ObjectEmpty()) {
rjson::add(record, "dynamodb", std::move(dynamodb));
dynamodb = rjson::empty_object();
}
if (!record.ObjectEmpty()) {
rjson::add(record, "awsRegion", rjson::from_string(dc_name));
rjson::add(record, "eventID", event_id(iter.shard.id, *timestamp));
rjson::add(record, "eventSource", "scylladb:alternator");
rjson::add(record, "eventVersion", "1.1");
rjson::push_back(records, std::move(record));
record = rjson::empty_object();
--limit;
}
};
for (auto& row : result_set->rows()) {
auto op = static_cast<cdc::operation>(value_cast<op_utype>(data_type_for<op_utype>()->deserialize(*row[op_index])));
auto ts = value_cast<utils::UUID>(data_type_for<utils::UUID>()->deserialize(*row[ts_index]));
auto eor = row[eor_index].has_value() ? value_cast<bool>(boolean_type->deserialize(*row[eor_index])) : false;
for (auto& row : result_set->rows()) {
auto op = static_cast<cdc::operation>(value_cast<op_utype>(data_type_for<op_utype>()->deserialize(*row[op_index])));
auto ts = value_cast<utils::UUID>(data_type_for<utils::UUID>()->deserialize(*row[ts_index]));
auto eor = row[eor_index].has_value() ? value_cast<bool>(boolean_type->deserialize(*row[eor_index])) : false;
if (!dynamodb.HasMember("Keys")) {
auto keys = rjson::empty_object();
describe_single_item(*selection, row, key_names, keys);
rjson::add(dynamodb, "Keys", std::move(keys));
rjson::add(dynamodb, "ApproximateCreationDateTime", utils::UUID_gen::unix_timestamp_in_sec(ts).count());
rjson::add(dynamodb, "SequenceNumber", sequence_number(ts));
rjson::add(dynamodb, "StreamViewType", type);
// TODO: SizeBytes
}
/**
* We merge rows with same timestamp into a single event.
* This is pretty much needed, because a CDC row typically
* encodes ~half the info of an alternator write.
*
* A big, big downside to how alternator records are written
* (i.e. CQL), is that the distinction between INSERT and UPDATE
* is somewhat lost/unmappable to actual eventName.
* A write (currently) always looks like an insert+modify
* regardless whether we wrote existing record or not.
*
* Maybe RMW ops could be done slightly differently so
* we can distinguish them here...
*
* For now, all writes will become MODIFY.
*
* Note: we do not check the current pre/post
* flags on CDC log, instead we use data to
* drive what is returned. This is (afaict)
* consistent with dynamo streams
*/
switch (op) {
case cdc::operation::pre_image:
case cdc::operation::post_image:
{
auto item = rjson::empty_object();
describe_single_item(*selection, row, attr_names, item, nullptr, true);
describe_single_item(*selection, row, key_names, item);
rjson::add(dynamodb, op == cdc::operation::pre_image ? "OldImage" : "NewImage", std::move(item));
break;
}
case cdc::operation::update:
rjson::add(record, "eventName", "MODIFY");
break;
case cdc::operation::insert:
rjson::add(record, "eventName", "INSERT");
break;
default:
rjson::add(record, "eventName", "REMOVE");
break;
}
if (eor) {
maybe_add_record();
timestamp = ts;
if (limit == 0) {
break;
}
}
if (!dynamodb.HasMember("Keys")) {
auto keys = rjson::empty_object();
describe_single_item(*selection, row, key_names, keys);
rjson::add(dynamodb, "Keys", std::move(keys));
rjson::add(dynamodb, "ApproximateCreationDateTime", utils::UUID_gen::unix_timestamp_in_sec(ts).count());
rjson::add(dynamodb, "SequenceNumber", sequence_number(ts));
rjson::add(dynamodb, "StreamViewType", type);
// TODO: SizeBytes
}
auto ret = rjson::empty_object();
auto nrecords = records.Size();
rjson::add(ret, "Records", std::move(records));
if (nrecords != 0) {
// #9642. Set next iterators threshold to > last
shard_iterator next_iter(iter.table, iter.shard, *timestamp, false);
// Note that here we unconditionally return NextShardIterator,
// without checking if maybe we reached the end-of-shard. If the
// shard did end, then the next read will have nrecords == 0 and
// will notice end end of shard and not return NextShardIterator.
rjson::add(ret, "NextShardIterator", next_iter);
_stats.api_operations.get_records_latency.mark(std::chrono::steady_clock::now() - start_time);
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
/**
* We merge rows with same timestamp into a single event.
* This is pretty much needed, because a CDC row typically
* encodes ~half the info of an alternator write.
*
* A big, big downside to how alternator records are written
* (i.e. CQL), is that the distinction between INSERT and UPDATE
* is somewhat lost/unmappable to actual eventName.
* A write (currently) always looks like an insert+modify
* regardless whether we wrote existing record or not.
*
* Maybe RMW ops could be done slightly differently so
* we can distinguish them here...
*
* For now, all writes will become MODIFY.
*
* Note: we do not check the current pre/post
* flags on CDC log, instead we use data to
* drive what is returned. This is (afaict)
* consistent with dynamo streams
*/
switch (op) {
case cdc::operation::pre_image:
case cdc::operation::post_image:
{
auto item = rjson::empty_object();
describe_single_item(*selection, row, attr_names, item, nullptr, true);
describe_single_item(*selection, row, key_names, item);
rjson::add(dynamodb, op == cdc::operation::pre_image ? "OldImage" : "NewImage", std::move(item));
break;
}
// ugh. figure out if we are and end-of-shard
auto normal_token_owners = _proxy.get_token_metadata_ptr()->count_normal_token_owners();
return _sdks.cdc_current_generation_timestamp({ normal_token_owners }).then([this, iter, high_ts, start_time, ret = std::move(ret)](db_clock::time_point ts) mutable {
auto& shard = iter.shard;
if (shard.time < ts && ts < high_ts) {
// The DynamoDB documentation states that when a shard is
// closed, reading it until the end has NextShardIterator
// "set to null". Our test test_streams_closed_read
// confirms that by "null" they meant not set at all.
} else {
// We could have return the same iterator again, but we did
// a search from it until high_ts and found nothing, so we
// can also start the next search from high_ts.
// TODO: but why? It's simpler just to leave the iterator be.
shard_iterator next_iter(iter.table, iter.shard, utils::UUID_gen::min_time_UUID(high_ts.time_since_epoch()), true);
rjson::add(ret, "NextShardIterator", iter);
case cdc::operation::update:
rjson::add(record, "eventName", "MODIFY");
break;
case cdc::operation::insert:
rjson::add(record, "eventName", "INSERT");
break;
case cdc::operation::service_row_delete:
case cdc::operation::service_partition_delete:
{
auto user_identity = rjson::empty_object();
rjson::add(user_identity, "Type", "Service");
rjson::add(user_identity, "PrincipalId", "dynamodb.amazonaws.com");
rjson::add(record, "userIdentity", std::move(user_identity));
rjson::add(record, "eventName", "REMOVE");
break;
}
default:
rjson::add(record, "eventName", "REMOVE");
break;
}
if (eor) {
maybe_add_record();
timestamp = ts;
if (limit == 0) {
break;
}
_stats.api_operations.get_records_latency.mark(std::chrono::steady_clock::now() - start_time);
if (is_big(ret)) {
return make_ready_future<executor::request_return_type>(make_streamed(std::move(ret)));
}
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
});
});
}
}
auto ret = rjson::empty_object();
auto nrecords = records.Size();
rjson::add(ret, "Records", std::move(records));
if (nrecords != 0) {
// #9642. Set next iterators threshold to > last
shard_iterator next_iter(iter.table, iter.shard, *timestamp, false);
// Note that here we unconditionally return NextShardIterator,
// without checking if maybe we reached the end-of-shard. If the
// shard did end, then the next read will have nrecords == 0 and
// will notice end end of shard and not return NextShardIterator.
rjson::add(ret, "NextShardIterator", next_iter);
_stats.api_operations.get_records_latency.mark(std::chrono::steady_clock::now() - start_time);
co_return rjson::print(std::move(ret));
}
// ugh. figure out if we are and end-of-shard
auto normal_token_owners = _proxy.get_token_metadata_ptr()->count_normal_token_owners();
db_clock::time_point ts = co_await _sdks.cdc_current_generation_timestamp({ normal_token_owners });
auto& shard = iter.shard;
if (shard.time < ts && ts < high_ts) {
// The DynamoDB documentation states that when a shard is
// closed, reading it until the end has NextShardIterator
// "set to null". Our test test_streams_closed_read
// confirms that by "null" they meant not set at all.
} else {
// We could have return the same iterator again, but we did
// a search from it until high_ts and found nothing, so we
// can also start the next search from high_ts.
// TODO: but why? It's simpler just to leave the iterator be.
shard_iterator next_iter(iter.table, iter.shard, utils::UUID_gen::min_time_UUID(high_ts.time_since_epoch()), true);
rjson::add(ret, "NextShardIterator", iter);
}
_stats.api_operations.get_records_latency.mark(std::chrono::steady_clock::now() - start_time);
if (is_big(ret)) {
co_return make_streamed(std::move(ret));
}
co_return rjson::print(std::move(ret));
}
bool executor::add_stream_options(const rjson::value& stream_specification, schema_builder& builder, service::storage_proxy& sp) {
@@ -1064,9 +1068,7 @@ bool executor::add_stream_options(const rjson::value& stream_specification, sche
}
if (stream_enabled->GetBool()) {
auto db = sp.data_dictionary();
if (!db.features().alternator_streams) {
if (!sp.features().alternator_streams) {
throw api_error::validation("StreamSpecification: alternator streams feature not enabled in cluster.");
}
@@ -1125,4 +1127,4 @@ void executor::supplement_table_stream_info(rjson::value& descr, const schema& s
}
}
}
} // namespace alternator

View File

@@ -17,6 +17,7 @@
#include <seastar/core/lowres_clock.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include "cdc/log.hh"
#include "exceptions/exceptions.hh"
#include "gms/gossiper.hh"
#include "gms/inet_address.hh"
@@ -45,6 +46,7 @@
#include "alternator/executor.hh"
#include "alternator/controller.hh"
#include "alternator/serialization.hh"
#include "alternator/ttl_tag.hh"
#include "dht/sharder.hh"
#include "db/config.hh"
#include "db/tags/utils.hh"
@@ -56,19 +58,10 @@ static logging::logger tlogger("alternator_ttl");
namespace alternator {
// We write the expiration-time attribute enabled on a table in a
// tag TTL_TAG_KEY.
// Currently, the *value* of this tag is simply the name of the attribute,
// and the expiration scanner interprets it as an Alternator attribute name -
// It can refer to a real column or if that doesn't exist, to a member of
// the ":attrs" map column. Although this is designed for Alternator, it may
// be good enough for CQL as well (there, the ":attrs" column won't exist).
extern const sstring TTL_TAG_KEY;
future<executor::request_return_type> executor::update_time_to_live(client_state& client_state, service_permit permit, rjson::value request) {
_stats.api_operations.update_time_to_live++;
if (!_proxy.data_dictionary().features().alternator_ttl) {
co_return api_error::unknown_operation("UpdateTimeToLive not yet supported. Experimental support is available if the 'alternator-ttl' experimental feature is enabled on all nodes.");
if (!_proxy.features().alternator_ttl) {
co_return api_error::unknown_operation("UpdateTimeToLive not yet supported. Upgrade all nodes to a version that supports it.");
}
schema_ptr schema = get_table(_proxy, request);
@@ -92,9 +85,9 @@ future<executor::request_return_type> executor::update_time_to_live(client_state
if (v->GetStringLength() < 1 || v->GetStringLength() > 255) {
co_return api_error::validation("The length of AttributeName must be between 1 and 255");
}
sstring attribute_name(v->GetString(), v->GetStringLength());
sstring attribute_name = rjson::to_sstring(*v);
co_await verify_permission(_enforce_authorization, client_state, schema, auth::permission::ALTER);
co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, schema, auth::permission::ALTER, _stats);
co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [&](std::map<sstring, sstring>& tags_map) {
if (enabled) {
if (tags_map.contains(TTL_TAG_KEY)) {
@@ -140,7 +133,7 @@ future<executor::request_return_type> executor::describe_time_to_live(client_sta
// expiration_service is a sharded service responsible for cleaning up expired
// items in all tables with per-item expiration enabled. Currently, this means
// Alternator tables with TTL configured via a UpdateTimeToLive request.
// Alternator tables with TTL configured via an UpdateTimeToLive request.
//
// Here is a brief overview of how the expiration service works:
//
@@ -292,7 +285,12 @@ static future<> expire_item(service::storage_proxy& proxy,
db::consistency_level::LOCAL_QUORUM,
executor::default_timeout(), // FIXME - which timeout?
qs.get_trace_state(), qs.get_permit(),
db::allow_per_partition_rate_limit::no);
db::allow_per_partition_rate_limit::no,
false,
cdc::per_request_options{
.is_system_originated = true,
}
);
}
static size_t random_offset(size_t min, size_t max) {
@@ -318,9 +316,7 @@ static future<std::vector<std::pair<dht::token_range, locator::host_id>>> get_se
const auto& tm = *erm->get_token_metadata_ptr();
const auto& sorted_tokens = tm.sorted_tokens();
std::vector<std::pair<dht::token_range, locator::host_id>> ret;
if (sorted_tokens.empty()) {
on_internal_error(tlogger, "Token metadata is empty");
}
throwing_assert(!sorted_tokens.empty());
auto prev_tok = sorted_tokens.back();
for (const auto& tok : sorted_tokens) {
co_await coroutine::maybe_yield();
@@ -557,7 +553,7 @@ static future<> scan_table_ranges(
expiration_service::stats& expiration_stats)
{
const schema_ptr& s = scan_ctx.s;
SCYLLA_ASSERT (partition_ranges.size() == 1); // otherwise issue #9167 will cause incorrect results.
throwing_assert(partition_ranges.size() == 1); // otherwise issue #9167 will cause incorrect results.
auto p = service::pager::query_pagers::pager(proxy, s, scan_ctx.selection, *scan_ctx.query_state_ptr,
*scan_ctx.query_options, scan_ctx.command, std::move(partition_ranges), nullptr);
while (!p->is_exhausted()) {
@@ -587,7 +583,7 @@ static future<> scan_table_ranges(
if (retries >= 10) {
// Don't get stuck forever asking the same page, maybe there's
// a bug or a real problem in several replicas. Give up on
// this scan an retry the scan from a random position later,
// this scan and retry the scan from a random position later,
// in the next scan period.
throw runtime_exception("scanner thread failed after too many timeouts for the same page");
}
@@ -634,13 +630,38 @@ static future<> scan_table_ranges(
}
} else {
// For a real column to contain an expiration time, it
// must be a numeric type.
// FIXME: Currently we only support decimal_type (which is
// what Alternator uses), but other numeric types can be
// supported as well to make this feature more useful in CQL.
// Note that kind::decimal is also checked above.
big_decimal n = value_cast<big_decimal>(v);
expired = is_expired(n, now);
// must be a numeric type. We currently support decimal
// (used by Alternator TTL) as well as bigint, int and
// timestamp (used by CQL per-row TTL).
switch (meta[*expiration_column]->type->get_kind()) {
case abstract_type::kind::decimal:
// Used by Alternator TTL for key columns not stored
// in the map. The value is in seconds, fractional
// part is ignored.
expired = is_expired(value_cast<big_decimal>(v), now);
break;
case abstract_type::kind::long_kind:
// Used by CQL per-row TTL. The value is in seconds.
expired = is_expired(gc_clock::time_point(std::chrono::seconds(value_cast<int64_t>(v))), now);
break;
case abstract_type::kind::int32:
// Used by CQL per-row TTL. The value is in seconds.
// Using int type is not recommended because it will
// overflow in 2038, but we support it to allow users
// to use existing int columns for expiration.
expired = is_expired(gc_clock::time_point(std::chrono::seconds(value_cast<int32_t>(v))), now);
break;
case abstract_type::kind::timestamp:
// Used by CQL per-row TTL. The value is in milliseconds
// but we truncate it to gc_clock's precision (whole seconds).
expired = is_expired(gc_clock::time_point(std::chrono::duration_cast<gc_clock::duration>(value_cast<db_clock::time_point>(v).time_since_epoch())), now);
break;
default:
// Should never happen - we verified the column's type
// before starting the scan.
[[unlikely]]
on_internal_error(tlogger, format("expiration scanner value of unsupported type {} in column {}", meta[*expiration_column]->type->cql3_type_name(), scan_ctx.column_name) );
}
}
if (expired) {
expiration_stats.items_deleted++;
@@ -702,16 +723,12 @@ static future<bool> scan_table(
co_return false;
}
// attribute_name may be one of the schema's columns (in Alternator, this
// means it's a key column), or an element in Alternator's attrs map
// encoded in Alternator's JSON encoding.
// FIXME: To make this less Alternators-specific, we should encode in the
// single key's value three things:
// 1. The name of a column
// 2. Optionally if column is a map, a member in the map
// 3. The deserializer for the value: CQL or Alternator (JSON).
// The deserializer can be guessed: If the given column or map item is
// numeric, it can be used directly. If it is a "bytes" type, it needs to
// be deserialized using Alternator's deserializer.
// means a key column, in CQL it's a regular column), or an element in
// Alternator's attrs map encoded in Alternator's JSON encoding (which we
// decode). If attribute_name is a real column, in Alternator it will have
// the type decimal, counting seconds since the UNIX epoch, while in CQL
// it will one of the types bigint or int (counting seconds) or timestamp
// (counting milliseconds).
bytes column_name = to_bytes(*attribute_name);
const column_definition *cd = s->get_column_definition(column_name);
std::optional<std::string> member;
@@ -730,11 +747,14 @@ static future<bool> scan_table(
data_type column_type = cd->type;
// Verify that the column has the right type: If "member" exists
// the column must be a map, and if it doesn't, the column must
// (currently) be a decimal_type. If the column has the wrong type
// nothing can get expired in this table, and it's pointless to
// scan it.
// be decimal_type (Alternator), bigint, int or timestamp (CQL).
// If the column has the wrong type nothing can get expired in
// this table, and it's pointless to scan it.
if ((member && column_type->get_kind() != abstract_type::kind::map) ||
(!member && column_type->get_kind() != abstract_type::kind::decimal)) {
(!member && column_type->get_kind() != abstract_type::kind::decimal &&
column_type->get_kind() != abstract_type::kind::long_kind &&
column_type->get_kind() != abstract_type::kind::int32 &&
column_type->get_kind() != abstract_type::kind::timestamp)) {
tlogger.info("table {} TTL column has unsupported type, not scanning", s->cf_name());
co_return false;
}
@@ -747,7 +767,7 @@ static future<bool> scan_table(
auto my_host_id = erm->get_topology().my_host_id();
const auto &tablet_map = erm->get_token_metadata().tablets().get_tablet_map(s->id());
for (std::optional tablet = tablet_map.first_tablet(); tablet; tablet = tablet_map.next_tablet(*tablet)) {
auto tablet_primary_replica = tablet_map.get_primary_replica(*tablet);
auto tablet_primary_replica = tablet_map.get_primary_replica(*tablet, erm->get_topology());
// check if this is the primary replica for the current tablet
if (tablet_primary_replica.host == my_host_id && tablet_primary_replica.shard == this_shard_id()) {
co_await scan_tablet(*tablet, proxy, abort_source, page_sem, expiration_stats, scan_ctx, tablet_map);
@@ -761,7 +781,7 @@ static future<bool> scan_table(
// by tasking another node to take over scanning of the dead node's primary
// ranges. What we do here is that this node will also check expiration
// on its *secondary* ranges - but only those whose primary owner is down.
auto tablet_secondary_replica = tablet_map.get_secondary_replica(*tablet); // throws if no secondary replica
auto tablet_secondary_replica = tablet_map.get_secondary_replica(*tablet, erm->get_topology()); // throws if no secondary replica
if (tablet_secondary_replica.host == my_host_id && tablet_secondary_replica.shard == this_shard_id()) {
if (!gossiper.is_alive(tablet_primary_replica.host)) {
co_await scan_tablet(*tablet, proxy, abort_source, page_sem, expiration_stats, scan_ctx, tablet_map);
@@ -872,12 +892,10 @@ future<> expiration_service::run() {
future<> expiration_service::start() {
// Called by main() on each shard to start the expiration-service
// thread. Just runs run() in the background and allows stop().
if (_db.features().alternator_ttl) {
if (!shutting_down()) {
_end = run().handle_exception([] (std::exception_ptr ep) {
tlogger.error("expiration_service failed: {}", ep);
});
}
if (!shutting_down()) {
_end = run().handle_exception([] (std::exception_ptr ep) {
tlogger.error("expiration_service failed: {}", ep);
});
}
return make_ready_future<>();
}

View File

@@ -30,7 +30,7 @@ namespace alternator {
// expiration_service is a sharded service responsible for cleaning up expired
// items in all tables with per-item expiration enabled. Currently, this means
// Alternator tables with TTL configured via a UpdateTimeToLeave request.
// Alternator tables with TTL configured via an UpdateTimeToLive request.
class expiration_service final : public seastar::peering_sharded_service<expiration_service> {
public:
// Object holding per-shard statistics related to the expiration service.
@@ -52,7 +52,7 @@ private:
data_dictionary::database _db;
service::storage_proxy& _proxy;
gms::gossiper& _gossiper;
// _end is set by start(), and resolves when the the background service
// _end is set by start(), and resolves when the background service
// started by it ends. To ask the background service to end, _abort_source
// should be triggered. stop() below uses both _abort_source and _end.
std::optional<future<>> _end;

26
alternator/ttl_tag.hh Normal file
View File

@@ -0,0 +1,26 @@
/*
* Copyright 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include "seastarx.hh"
#include <seastar/core/sstring.hh>
namespace alternator {
// We use the table tag TTL_TAG_KEY ("system:ttl_attribute") to remember
// which attribute was chosen as the expiration-time attribute for
// Alternator's TTL and CQL's per-row TTL features.
// Currently, the *value* of this tag is simply the name of the attribute:
// It can refer to a real column or if that doesn't exist, to a member of
// the ":attrs" map column (which Alternator uses).
extern const sstring TTL_TAG_KEY;
} // namespace alternator
// let users use TTL_TAG_KEY without the "alternator::" prefix,
// to make it easier to move it to a different namespace later.
using alternator::TTL_TAG_KEY;

View File

@@ -31,6 +31,7 @@ set(swagger_files
api-doc/column_family.json
api-doc/commitlog.json
api-doc/compaction_manager.json
api-doc/client_routes.json
api-doc/config.json
api-doc/cql_server_test.json
api-doc/endpoint_snitch_info.json
@@ -68,6 +69,7 @@ target_sources(api
PRIVATE
api.cc
cache_service.cc
client_routes.cc
collectd.cc
column_family.cc
commitlog.cc
@@ -106,5 +108,8 @@ target_link_libraries(api
wasmtime_bindings
absl::headers)
if (Scylla_USE_PRECOMPILED_HEADER_USE)
target_precompile_headers(api REUSE_FROM scylla-precompiled-header)
endif()
check_headers(check-headers api
GLOB_RECURSE ${CMAKE_CURRENT_SOURCE_DIR}/*.hh)

View File

@@ -12,7 +12,7 @@
"operations":[
{
"method":"POST",
"summary":"Reset cache",
"summary":"Resets authorized prepared statements cache",
"type":"void",
"nickname":"authorization_cache_reset",
"produces":[

View File

@@ -0,0 +1,23 @@
, "client_routes_entry": {
"id": "client_routes_entry",
"summary": "An entry storing client routes",
"properties": {
"connection_id": {"type": "string"},
"host_id": {"type": "string", "format": "uuid"},
"address": {"type": "string"},
"port": {"type": "integer"},
"tls_port": {"type": "integer"},
"alternator_port": {"type": "integer"},
"alternator_https_port": {"type": "integer"}
},
"required": ["connection_id", "host_id", "address"]
}
, "client_routes_key": {
"id": "client_routes_key",
"summary": "A key of client_routes_entry",
"properties": {
"connection_id": {"type": "string"},
"host_id": {"type": "string", "format": "uuid"}
}
}

View File

@@ -0,0 +1,74 @@
, "/v2/client-routes":{
"get": {
"description":"List all client route entries",
"operationId":"get_client_routes",
"tags":["client_routes"],
"produces":[
"application/json"
],
"parameters":[],
"responses":{
"200":{
"schema":{
"type":"array",
"items":{ "$ref":"#/definitions/client_routes_entry" }
}
},
"default":{
"description":"unexpected error",
"schema":{"$ref":"#/definitions/ErrorModel"}
}
}
},
"post": {
"description":"Upsert one or more client route entries",
"operationId":"set_client_routes",
"tags":["client_routes"],
"parameters":[
{
"name":"body",
"in":"body",
"required":true,
"schema":{
"type":"array",
"items":{ "$ref":"#/definitions/client_routes_entry" }
}
}
],
"responses":{
"200":{ "description": "OK" },
"default":{
"description":"unexpected error",
"schema":{ "$ref":"#/definitions/ErrorModel" }
}
}
},
"delete": {
"description":"Delete one or more client route entries",
"operationId":"delete_client_routes",
"tags":["client_routes"],
"parameters":[
{
"name":"body",
"in":"body",
"required":true,
"schema":{
"type":"array",
"items":{ "$ref":"#/definitions/client_routes_key" }
}
}
],
"responses":{
"200":{
"description": "OK"
},
"default":{
"description":"unexpected error",
"schema":{
"$ref":"#/definitions/ErrorModel"
}
}
}
}
}

View File

@@ -220,6 +220,25 @@
}
]
},
{
"path":"/storage_service/nodes/excluded",
"operations":[
{
"method":"GET",
"summary":"Retrieve host ids of nodes which are marked as excluded",
"type":"array",
"items":{
"type":"string"
},
"nickname":"get_excluded_nodes",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/storage_service/nodes/joining",
"operations":[
@@ -594,6 +613,50 @@
}
]
},
{
"path": "/storage_service/natural_endpoints/v2/{keyspace}",
"operations": [
{
"method": "GET",
"summary":"This method returns the N endpoints that are responsible for storing the specified key i.e for replication. the endpoint responsible for this key",
"type": "array",
"items": {
"type": "string"
},
"nickname": "get_natural_endpoints_v2",
"produces": [
"application/json"
],
"parameters": [
{
"name": "keyspace",
"description": "The keyspace to query about.",
"required": true,
"allowMultiple": false,
"type": "string",
"paramType": "path"
},
{
"name": "cf",
"description": "Column family name.",
"required": true,
"allowMultiple": false,
"type": "string",
"paramType": "query"
},
{
"name": "key_component",
"description": "Each component of the key for which we need to find the endpoint (e.g. ?key_component=part1&key_component=part2).",
"required": true,
"allowMultiple": true,
"type": "string",
"paramType": "query"
}
]
}
]
},
{
"path":"/storage_service/cdc_streams_check_and_repair",
"operations":[
@@ -898,6 +961,14 @@
"type":"string",
"paramType":"query",
"enum": ["all", "dc", "rack", "node"]
},
{
"name":"primary_replica_only",
"description":"Load the sstables and stream to the primary replica node within the scope, if one is specified. If not, stream to the global primary replica.",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
@@ -984,7 +1055,7 @@
]
},
{
"path":"/storage_service/cleanup_all",
"path":"/storage_service/cleanup_all/",
"operations":[
{
"method":"POST",
@@ -994,6 +1065,30 @@
"produces":[
"application/json"
],
"parameters":[
{
"name":"global",
"description":"true if cleanup of entire cluster is requested",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/mark_node_as_clean",
"operations":[
{
"method":"POST",
"summary":"Mark the node as clean. After that the node will not be considered as needing cleanup during automatic cleanup which is triggered by some topology operations",
"type":"void",
"nickname":"reset_cleanup_needed",
"produces":[
"application/json"
],
"parameters":[]
}
]
@@ -1100,6 +1195,14 @@
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name": "drop_unfixable_sstables",
"description": "When set to true, drop unfixable sstables. Applies only to scrub mode SEGREGATE.",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
@@ -1519,6 +1622,30 @@
}
]
},
{
"path":"/storage_service/exclude_node",
"operations":[
{
"method":"POST",
"summary":"Marks the node as permanently down (excluded).",
"type":"void",
"nickname":"exclude_node",
"produces":[
"application/json"
],
"parameters":[
{
"name":"hosts",
"description":"Comma-separated list of host ids to exclude",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/removal_status",
"operations":[
@@ -2924,7 +3051,7 @@
},
{
"name":"incremental_mode",
"description":"Set the incremental repair mode. Can be 'disabled', 'regular', or 'full'. 'regular': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to regular.",
"description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to incremental mode.",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -2958,6 +3085,48 @@
}
]
},
{
"path":"/storage_service/tablets/snapshots",
"operations":[
{
"method":"POST",
"summary":"Takes the snapshot for the given keyspaces/tables. A snapshot name must be specified.",
"type":"void",
"nickname":"take_cluster_snapshot",
"produces":[
"application/json"
],
"parameters":[
{
"name":"tag",
"description":"the tag given to the snapshot",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"keyspace",
"description":"Keyspace(s) to snapshot. Multiple keyspaces can be provided using a comma-separated list. If omitted, snapshot all keyspaces.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"Table(s) to snapshot. Multiple tables (in a single keyspace) can be provided using a comma-separated list. If omitted, snapshot all tables in the given keyspace(s).",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/quiesce_topology",
"operations":[

View File

@@ -42,6 +42,14 @@
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
},
{
"name":"consider_only_existing_data",
"description":"Set to \"true\" to flush all memtables and force tombstone garbage collection to check only the sstables being compacted (false by default). The memtable, commitlog and other uncompacted sstables will not be checked during tombstone garbage collection.",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}

View File

@@ -37,6 +37,7 @@
#include "raft.hh"
#include "gms/gossip_address_map.hh"
#include "service_levels.hh"
#include "client_routes.hh"
logging::logger apilog("api");
@@ -67,9 +68,11 @@ future<> set_server_init(http_context& ctx) {
rb02->set_api_doc(r);
rb02->register_api_file(r, "swagger20_header");
rb02->register_api_file(r, "metrics");
rb02->register_api_file(r, "client_routes");
rb->register_function(r, "system",
"The system related API");
rb02->add_definitions_file(r, "metrics");
rb02->add_definitions_file(r, "client_routes");
set_system(ctx, r);
rb->register_function(r, "error_injection",
"The error injection API");
@@ -129,6 +132,16 @@ future<> unset_server_storage_service(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_storage_service(ctx, r); });
}
future<> set_server_client_routes(http_context& ctx, sharded<service::client_routes_service>& cr) {
return ctx.http_server.set_routes([&ctx, &cr] (routes& r) {
set_client_routes(ctx, r, cr);
});
}
future<> unset_server_client_routes(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_client_routes(ctx, r); });
}
future<> set_load_meter(http_context& ctx, service::load_meter& lm) {
return ctx.http_server.set_routes([&ctx, &lm] (routes& r) { set_load_meter(ctx, r, lm); });
}
@@ -216,10 +229,10 @@ future<> unset_server_gossip(http_context& ctx) {
});
}
future<> set_server_column_family(http_context& ctx, sharded<replica::database>& db, sharded<db::system_keyspace>& sys_ks) {
future<> set_server_column_family(http_context& ctx, sharded<replica::database>& db) {
co_await register_api(ctx, "column_family",
"The column family API", [&db, &sys_ks] (http_context& ctx, routes& r) {
set_column_family(ctx, r, db, sys_ks);
"The column family API", [&db] (http_context& ctx, routes& r) {
set_column_family(ctx, r, db);
});
co_await register_api(ctx, "cache_service",
"The cache service API", [&db] (http_context& ctx, routes& r) {

View File

@@ -23,31 +23,6 @@
namespace api {
template<class T>
std::vector<T> map_to_key_value(const std::map<sstring, sstring>& map) {
std::vector<T> res;
res.reserve(map.size());
for (const auto& [key, value] : map) {
res.push_back(T());
res.back().key = key;
res.back().value = value;
}
return res;
}
template<class T, class MAP>
std::vector<T>& map_to_key_value(const MAP& map, std::vector<T>& res) {
res.reserve(res.size() + std::size(map));
for (const auto& [key, value] : map) {
T val;
val.key = fmt::to_string(key);
val.value = fmt::to_string(value);
res.push_back(val);
}
return res;
}
template <typename T, typename S = T>
T map_sum(T&& dest, const S& src) {
for (const auto& i : src) {

View File

@@ -29,6 +29,7 @@ class storage_proxy;
class storage_service;
class raft_group0_client;
class raft_group_registry;
class client_routes_service;
} // namespace service
@@ -58,7 +59,6 @@ class sstables_format_selector;
namespace view {
class view_builder;
}
class system_keyspace;
}
namespace netw { class messaging_service; }
class repair_service;
@@ -100,6 +100,8 @@ future<> set_server_snitch(http_context& ctx, sharded<locator::snitch_ptr>& snit
future<> unset_server_snitch(http_context& ctx);
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, service::raft_group0_client&);
future<> unset_server_storage_service(http_context& ctx);
future<> set_server_client_routes(http_context& ctx, sharded<service::client_routes_service>& cr);
future<> unset_server_client_routes(http_context& ctx);
future<> set_server_sstables_loader(http_context& ctx, sharded<sstables_loader>& sst_loader);
future<> unset_server_sstables_loader(http_context& ctx);
future<> set_server_view_builder(http_context& ctx, sharded<db::view::view_builder>& vb, sharded<gms::gossiper>& g);
@@ -118,7 +120,7 @@ future<> set_server_token_metadata(http_context& ctx, sharded<locator::shared_to
future<> unset_server_token_metadata(http_context& ctx);
future<> set_server_gossip(http_context& ctx, sharded<gms::gossiper>& g);
future<> unset_server_gossip(http_context& ctx);
future<> set_server_column_family(http_context& ctx, sharded<replica::database>& db, sharded<db::system_keyspace>& sys_ks);
future<> set_server_column_family(http_context& ctx, sharded<replica::database>& db);
future<> unset_server_column_family(http_context& ctx);
future<> set_server_messaging_service(http_context& ctx, sharded<netw::messaging_service>& ms);
future<> unset_server_messaging_service(http_context& ctx);

176
api/client_routes.cc Normal file
View File

@@ -0,0 +1,176 @@
/*
* Copyright (C) 2025-present ScyllaDB
*
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include <seastar/http/short_streams.hh>
#include "client_routes.hh"
#include "api/api.hh"
#include "service/storage_service.hh"
#include "service/client_routes.hh"
#include "utils/rjson.hh"
#include "api/api-doc/client_routes.json.hh"
using namespace seastar::httpd;
using namespace std::chrono_literals;
using namespace json;
extern logging::logger apilog;
namespace api {
static void validate_client_routes_endpoint(sharded<service::client_routes_service>& cr, sstring endpoint_name) {
if (!cr.local().get_feature_service().client_routes) {
apilog.warn("{}: called before the cluster feature was enabled", endpoint_name);
throw std::runtime_error(fmt::format("{} requires all nodes to support the CLIENT_ROUTES cluster feature", endpoint_name));
}
}
static sstring parse_string(const char* name, rapidjson::Value const& v) {
const auto it = v.FindMember(name);
if (it == v.MemberEnd()) {
throw bad_param_exception(fmt::format("Missing '{}'", name));
}
if (!it->value.IsString()) {
throw bad_param_exception(fmt::format("'{}' must be a string", name));
}
return {it->value.GetString(), it->value.GetStringLength()};
}
static std::optional<uint32_t> parse_port(const char* name, rapidjson::Value const& v) {
const auto it = v.FindMember(name);
if (it == v.MemberEnd()) {
return std::nullopt;
}
if (!it->value.IsInt()) {
throw bad_param_exception(fmt::format("'{}' must be an integer", name));
}
auto port = it->value.GetInt();
if (port < 1 || port > 65535) {
throw bad_param_exception(fmt::format("'{}' value={} is outside the allowed port range", name, port));
}
return port;
}
static std::vector<service::client_routes_service::client_route_entry> parse_set_client_array(const rapidjson::Document& root) {
if (!root.IsArray()) {
throw bad_param_exception("Body must be a JSON array");
}
std::vector<service::client_routes_service::client_route_entry> v;
v.reserve(root.GetArray().Size());
for (const auto& element : root.GetArray()) {
if (!element.IsObject()) { throw bad_param_exception("Each element must be object"); }
const auto port = parse_port("port", element);
const auto tls_port = parse_port("tls_port", element);
const auto alternator_port = parse_port("alternator_port", element);
const auto alternator_https_port = parse_port("alternator_https_port", element);
if (!port.has_value() && !tls_port.has_value() && !alternator_port.has_value() && !alternator_https_port.has_value()) {
throw bad_param_exception("At least one port field ('port', 'tls_port', 'alternator_port', 'alternator_https_port') must be specified");
}
v.emplace_back(
parse_string("connection_id", element),
utils::UUID{parse_string("host_id", element)},
parse_string("address", element),
port,
tls_port,
alternator_port,
alternator_https_port
);
}
return v;
}
static
future<json::json_return_type>
rest_set_client_routes(http_context& ctx, sharded<service::client_routes_service>& cr, std::unique_ptr<http::request> req) {
validate_client_routes_endpoint(cr, "rest_set_client_routes");
rapidjson::Document root;
auto content = co_await util::read_entire_stream_contiguous(*req->content_stream);
root.Parse(content.c_str());
co_await cr.local().set_client_routes(parse_set_client_array(root));
co_return seastar::json::json_void();
}
static std::vector<service::client_routes_service::client_route_key> parse_delete_client_array(const rapidjson::Document& root) {
if (!root.IsArray()) {
throw bad_param_exception("Body must be a JSON array");
}
std::vector<service::client_routes_service::client_route_key> v;
v.reserve(root.GetArray().Size());
for (const auto& element : root.GetArray()) {
v.emplace_back(
parse_string("connection_id", element),
utils::UUID{parse_string("host_id", element)}
);
}
return v;
}
static
future<json::json_return_type>
rest_delete_client_routes(http_context& ctx, sharded<service::client_routes_service>& cr, std::unique_ptr<http::request> req) {
validate_client_routes_endpoint(cr, "delete_client_routes");
rapidjson::Document root;
auto content = co_await util::read_entire_stream_contiguous(*req->content_stream);
root.Parse(content.c_str());
co_await cr.local().delete_client_routes(parse_delete_client_array(root));
co_return seastar::json::json_void();
}
static
future<json::json_return_type>
rest_get_client_routes(http_context& ctx, sharded<service::client_routes_service>& cr, std::unique_ptr<http::request> req) {
validate_client_routes_endpoint(cr, "get_client_routes");
co_return co_await cr.invoke_on(0, [] (service::client_routes_service& cr) -> future<json::json_return_type> {
co_return json::json_return_type(stream_range_as_array(co_await cr.get_client_routes(), [](const service::client_routes_service::client_route_entry & entry) {
seastar::httpd::client_routes_json::client_routes_entry obj;
obj.connection_id = entry.connection_id;
obj.host_id = fmt::to_string(entry.host_id);
obj.address = entry.address;
if (entry.port.has_value()) { obj.port = entry.port.value(); }
if (entry.tls_port.has_value()) { obj.tls_port = entry.tls_port.value(); }
if (entry.alternator_port.has_value()) { obj.alternator_port = entry.alternator_port.value(); }
if (entry.alternator_https_port.has_value()) { obj.alternator_https_port = entry.alternator_https_port.value(); }
return obj;
}));
});
}
void set_client_routes(http_context& ctx, routes& r, sharded<service::client_routes_service>& cr) {
seastar::httpd::client_routes_json::set_client_routes.set(r, [&ctx, &cr] (std::unique_ptr<seastar::http::request> req) {
return rest_set_client_routes(ctx, cr, std::move(req));
});
seastar::httpd::client_routes_json::delete_client_routes.set(r, [&ctx, &cr] (std::unique_ptr<seastar::http::request> req) {
return rest_delete_client_routes(ctx, cr, std::move(req));
});
seastar::httpd::client_routes_json::get_client_routes.set(r, [&ctx, &cr] (std::unique_ptr<seastar::http::request> req) {
return rest_get_client_routes(ctx, cr, std::move(req));
});
}
void unset_client_routes(http_context& ctx, routes& r) {
seastar::httpd::client_routes_json::set_client_routes.unset(r);
seastar::httpd::client_routes_json::delete_client_routes.unset(r);
seastar::httpd::client_routes_json::get_client_routes.unset(r);
}
}

20
api/client_routes.hh Normal file
View File

@@ -0,0 +1,20 @@
/*
* Copyright (C) 2025-present ScyllaDB
*
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include <seastar/core/sharded.hh>
#include <seastar/json/json_elements.hh>
#include "api/api_init.hh"
namespace api {
void set_client_routes(http_context& ctx, httpd::routes& r, sharded<service::client_routes_service>& cr);
void unset_client_routes(http_context& ctx, httpd::routes& r);
}

View File

@@ -18,7 +18,6 @@
#include "utils/assert.hh"
#include "utils/estimated_histogram.hh"
#include <algorithm>
#include "db/system_keyspace.hh"
#include "db/data_listeners.hh"
#include "storage_service.hh"
#include "compaction/compaction_manager.hh"
@@ -67,6 +66,13 @@ static future<json::json_return_type> get_cf_stats(sharded<replica::database>&
}, std::plus<int64_t>());
}
static future<json::json_return_type> get_cf_stats(sharded<replica::database>& db,
std::function<int64_t(const replica::column_family_stats&)> f) {
return map_reduce_cf(db, int64_t(0), [f](const replica::column_family& cf) {
return f(cf.get_stats());
}, std::plus<int64_t>());
}
static future<json::json_return_type> for_tables_on_all_shards(sharded<replica::database>& db, std::vector<table_info> tables, std::function<future<>(replica::table&)> set) {
return do_with(std::move(tables), [&db, set] (const std::vector<table_info>& tables) {
return db.invoke_on_all([&tables, set] (replica::database& db) {
@@ -336,7 +342,7 @@ uint64_t accumulate_on_active_memtables(replica::table& t, noncopyable_function<
return ret;
}
void set_column_family(http_context& ctx, routes& r, sharded<replica::database>& db, sharded<db::system_keyspace>& sys_ks) {
void set_column_family(http_context& ctx, routes& r, sharded<replica::database>& db) {
cf::get_column_family_name.set(r, [&db] (const_req req){
std::vector<sstring> res;
const replica::database::tables_metadata& meta = db.local().get_tables_metadata();
@@ -937,30 +943,6 @@ void set_column_family(http_context& ctx, routes& r, sharded<replica::database>&
return set_tables_tombstone_gc(db, std::move(tables), false);
});
cf::get_built_indexes.set(r, [&db, &sys_ks](std::unique_ptr<http::request> req) {
auto [ks, cf_name] = parse_fully_qualified_cf_name(req->get_path_param("name"));
// Use of load_built_views() as filtering table should be in sync with
// built_indexes_virtual_reader filtering with BUILT_VIEWS table
return sys_ks.local().load_built_views().then([ks, cf_name, &db](const std::vector<db::system_keyspace::view_name>& vb) mutable {
std::set<sstring> vp;
for (auto b : vb) {
if (b.first == ks) {
vp.insert(b.second);
}
}
std::vector<sstring> res;
auto uuid = validate_table(db.local(), ks, cf_name);
replica::column_family& cf = db.local().find_column_family(uuid);
res.reserve(cf.get_index_manager().list_indexes().size());
for (auto&& i : cf.get_index_manager().list_indexes()) {
if (vp.contains(secondary_index::index_table_name(i.metadata().name()))) {
res.emplace_back(i.metadata().name());
}
}
return make_ready_future<json::json_return_type>(res);
});
});
cf::get_compression_metadata_off_heap_memory_used.set(r, [](const_req) {
// FIXME
// Currently there are no information on the compression
@@ -1091,10 +1073,14 @@ void set_column_family(http_context& ctx, routes& r, sharded<replica::database>&
});
ss::get_load.set(r, [&db] (std::unique_ptr<http::request> req) {
return get_cf_stats(db, &replica::column_family_stats::live_disk_space_used);
return get_cf_stats(db, [](const replica::column_family_stats& stats) {
return stats.live_disk_space_used.on_disk;
});
});
ss::get_metrics_load.set(r, [&db] (std::unique_ptr<http::request> req) {
return get_cf_stats(db, &replica::column_family_stats::live_disk_space_used);
return get_cf_stats(db, [](const replica::column_family_stats& stats) {
return stats.live_disk_space_used.on_disk;
});
});
ss::get_keyspaces.set(r, [&db] (const_req req) {
@@ -1215,7 +1201,6 @@ void unset_column_family(http_context& ctx, routes& r) {
cf::disable_tombstone_gc.unset(r);
ss::enable_tombstone_gc.unset(r);
ss::disable_tombstone_gc.unset(r);
cf::get_built_indexes.unset(r);
cf::get_compression_metadata_off_heap_memory_used.unset(r);
cf::get_compression_parameters.unset(r);
cf::get_compression_ratio.unset(r);

View File

@@ -13,13 +13,9 @@
#include <any>
#include "api/api_init.hh"
namespace db {
class system_keyspace;
}
namespace api {
void set_column_family(http_context& ctx, httpd::routes& r, sharded<replica::database>& db, sharded<db::system_keyspace>& sys_ks);
void set_column_family(http_context& ctx, httpd::routes& r, sharded<replica::database>& db);
void unset_column_family(http_context& ctx, httpd::routes& r);
table_info parse_table_info(const sstring& name, const replica::database& db);

View File

@@ -21,10 +21,10 @@ namespace hf = httpd::error_injection_json;
void set_error_injection(http_context& ctx, routes& r) {
hf::enable_injection.set(r, [](std::unique_ptr<request> req) {
hf::enable_injection.set(r, [](std::unique_ptr<request> req) -> future<json::json_return_type> {
sstring injection = req->get_path_param("injection");
bool one_shot = req->get_query_param("one_shot") == "True";
auto params = req->content;
auto params = co_await util::read_entire_stream_contiguous(*req->content_stream);
const size_t max_params_size = 1024 * 1024;
if (params.size() > max_params_size) {
@@ -39,12 +39,11 @@ void set_error_injection(http_context& ctx, routes& r) {
: rjson::parse_to_map<utils::error_injection_parameters>(params);
auto& errinj = utils::get_local_injector();
return errinj.enable_on_all(injection, one_shot, std::move(parameters)).then([] {
return make_ready_future<json::json_return_type>(json::json_void());
});
co_await errinj.enable_on_all(injection, one_shot, std::move(parameters));
} catch (const rjson::error& e) {
throw httpd::bad_param_exception(format("Failed to parse injections parameters: {}", e.what()));
}
co_return json::json_void();
});
hf::get_enabled_injections_on_all.set(r, [](std::unique_ptr<request> req) {

View File

@@ -20,6 +20,7 @@
#include "utils/hash.hh"
#include <optional>
#include <sstream>
#include <stdexcept>
#include <time.h>
#include <algorithm>
#include <functional>
@@ -36,6 +37,7 @@
#include "gms/gossiper.hh"
#include "db/system_keyspace.hh"
#include <seastar/http/exception.hh>
#include <seastar/http/short_streams.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/parallel_for_each.hh>
#include <seastar/coroutine/exception.hh>
@@ -272,6 +274,13 @@ scrub_info parse_scrub_options(const http_context& ctx, std::unique_ptr<http::re
throw httpd::bad_param_exception(fmt::format("Unknown argument for 'quarantine_mode' parameter: {}", quarantine_mode_str));
}
if(req_param<bool>(*req, "drop_unfixable_sstables", false)) {
if(scrub_mode != compaction::compaction_type_options::scrub::mode::segregate) {
throw httpd::bad_param_exception("The 'drop_unfixable_sstables' parameter is only valid when 'scrub_mode' is 'SEGREGATE'");
}
info.opts.drop_unfixable = compaction::compaction_type_options::scrub::drop_unfixable_sstables::yes;
}
return info;
}
@@ -496,17 +505,26 @@ void set_sstables_loader(http_context& ctx, routes& r, sharded<sstables_loader>&
auto bucket = req->get_query_param("bucket");
auto prefix = req->get_query_param("prefix");
auto scope = parse_stream_scope(req->get_query_param("scope"));
auto primary_replica_only = validate_bool_x(req->get_query_param("primary_replica_only"), false);
// TODO: the http_server backing the API does not use content streaming
// should use it for better performance
rjson::value parsed = rjson::parse(req->content);
rjson::chunked_content content = co_await util::read_entire_stream(*req->content_stream);
rjson::value parsed = rjson::parse(std::move(content));
if (!parsed.IsArray()) {
throw httpd::bad_param_exception("malformatted sstables in body");
}
auto sstables = parsed.GetArray() |
std::views::transform([] (const auto& s) { return sstring(rjson::to_string_view(s)); }) |
std::ranges::to<std::vector>();
auto task_id = co_await sst_loader.local().download_new_sstables(keyspace, table, prefix, std::move(sstables), endpoint, bucket, scope);
apilog.info("Restore invoked with following parameters: keyspace={}, table={}, endpoint={}, bucket={}, prefix={}, sstables_count={}, scope={}, primary_replica_only={}",
keyspace,
table,
endpoint,
bucket,
prefix,
sstables.size(),
scope,
primary_replica_only);
auto task_id = co_await sst_loader.local().download_new_sstables(keyspace, table, prefix, std::move(sstables), endpoint, bucket, scope, primary_replica_only);
co_return json::json_return_type(fmt::to_string(task_id));
});
@@ -518,19 +536,42 @@ void unset_sstables_loader(http_context& ctx, routes& r) {
}
void set_view_builder(http_context& ctx, routes& r, sharded<db::view::view_builder>& vb, sharded<gms::gossiper>& g) {
ss::view_build_statuses.set(r, [&ctx, &vb, &g] (std::unique_ptr<http::request> req) {
ss::view_build_statuses.set(r, [&ctx, &vb, &g] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto keyspace = validate_keyspace(ctx, req);
auto view = req->get_path_param("view");
return vb.local().view_build_statuses(std::move(keyspace), std::move(view), g.local()).then([] (std::unordered_map<sstring, sstring> status) {
std::vector<storage_service_json::mapper> res;
return make_ready_future<json::json_return_type>(map_to_key_value(std::move(status), res));
});
co_return json::json_return_type(stream_range_as_array(co_await vb.local().view_build_statuses(std::move(keyspace), std::move(view), g.local()), [] (const auto& i) {
storage_service_json::mapper res;
res.key = i.first;
res.value = i.second;
return res;
}));
});
cf::get_built_indexes.set(r, [&vb](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto [ks, cf_name] = parse_fully_qualified_cf_name(req->get_path_param("name"));
// Use of load_built_views() as filtering table should be in sync with
// built_indexes_virtual_reader filtering with BUILT_VIEWS table
std::vector<db::system_keyspace::view_name> vn = co_await vb.local().get_sys_ks().load_built_views();
std::set<sstring> vp;
for (auto b : vn) {
if (b.first == ks) {
vp.insert(b.second);
}
}
replica::database& db = vb.local().get_db();
auto uuid = validate_table(db, ks, cf_name);
replica::column_family& cf = db.find_column_family(uuid);
co_return cf.get_index_manager().list_indexes()
| std::views::transform([] (const auto& i) { return i.metadata().name(); })
| std::views::filter([&vp] (const auto& n) { return vp.contains(secondary_index::index_table_name(n)); })
| std::ranges::to<std::vector>();
});
}
void unset_view_builder(http_context& ctx, routes& r) {
ss::view_build_statuses.unset(r);
cf::get_built_indexes.unset(r);
}
static future<json::json_return_type> describe_ring_as_json(sharded<service::storage_service>& ss, sstring keyspace) {
@@ -541,6 +582,16 @@ static future<json::json_return_type> describe_ring_as_json_for_table(const shar
co_return json::json_return_type(stream_range_as_array(co_await ss.local().describe_ring_for_table(keyspace, table), token_range_endpoints_to_json));
}
namespace {
template <typename Key, typename Value>
storage_service_json::mapper map_to_json(const std::pair<Key, Value>& i) {
storage_service_json::mapper val;
val.key = fmt::to_string(i.first);
val.value = fmt::to_string(i.second);
return val;
}
}
static
future<json::json_return_type>
rest_get_token_endpoint(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
@@ -558,12 +609,7 @@ rest_get_token_endpoint(http_context& ctx, sharded<service::storage_service>& ss
throw bad_param_exception("Either provide both keyspace and table (for tablet table) or neither (for vnodes)");
}
co_return json::json_return_type(stream_range_as_array(token_endpoints, [](const auto& i) {
storage_service_json::mapper val;
val.key = fmt::to_string(i.first);
val.value = fmt::to_string(i.second);
return val;
}));
co_return json::json_return_type(stream_range_as_array(token_endpoints, &map_to_json<dht::token, gms::inet_address>));
}
static
@@ -647,7 +693,6 @@ rest_get_range_to_endpoint_map(http_context& ctx, sharded<service::storage_servi
table_id = validate_table(ctx.db.local(), keyspace, table);
}
std::vector<ss::maplist_mapper> res;
co_return stream_range_as_array(co_await ss.local().get_range_to_address_map(keyspace, table_id),
[](const std::pair<dht::token_range, inet_address_vector_replica_set>& entry){
ss::maplist_mapper m;
@@ -710,6 +755,14 @@ rest_get_natural_endpoints(http_context& ctx, sharded<service::storage_service>&
return res | std::views::transform([] (auto& ep) { return fmt::to_string(ep); }) | std::ranges::to<std::vector>();
}
static
json::json_return_type
rest_get_natural_endpoints_v2(http_context& ctx, sharded<service::storage_service>& ss, const_req req) {
auto keyspace = validate_keyspace(ctx, req);
auto res = ss.local().get_natural_endpoints(keyspace, req.get_query_param("cf"), req.get_query_param_array("key_component"));
return res | std::views::transform([] (auto& ep) { return fmt::to_string(ep); }) | std::ranges::to<std::vector>();
}
static
future<json::json_return_type>
rest_cdc_streams_check_and_repair(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
@@ -723,25 +776,43 @@ rest_cdc_streams_check_and_repair(sharded<service::storage_service>& ss, std::un
static
future<json::json_return_type>
rest_cleanup_all(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
apilog.info("cleanup_all");
auto done = co_await ss.invoke_on(0, [] (service::storage_service& ss) -> future<bool> {
if (!ss.is_topology_coordinator_enabled()) {
co_return false;
}
co_await ss.do_cluster_cleanup();
co_return true;
});
if (done) {
bool global = true;
if (auto global_param = req->get_query_param("global"); !global_param.empty()) {
global = validate_bool(global_param);
}
apilog.info("cleanup_all global={}", global);
if (global) {
co_await ss.invoke_on(0, [] (service::storage_service& ss) -> future<> {
co_return co_await ss.do_clusterwide_vnodes_cleanup();
});
co_return json::json_return_type(0);
}
// fall back to the local global cleanup if topology coordinator is not enabled
// fall back to the local cleanup if local cleanup is requested
auto& db = ctx.db;
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<compaction::global_cleanup_compaction_task_impl>({}, db);
co_await task->done();
// Mark this node as clean
co_await ss.invoke_on(0, [] (service::storage_service& ss) -> future<> {
co_await ss.reset_cleanup_needed();
});
co_return json::json_return_type(0);
}
static
future<json::json_return_type>
rest_reset_cleanup_needed(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
apilog.info("reset_cleanup_needed");
co_await ss.invoke_on(0, [] (service::storage_service& ss) {
return ss.reset_cleanup_needed();
});
co_return json_void();
}
static
future<json::json_return_type>
rest_force_flush(http_context& ctx, std::unique_ptr<http::request> req) {
@@ -804,6 +875,25 @@ rest_remove_node(sharded<service::storage_service>& ss, std::unique_ptr<http::re
});
}
static
future<json::json_return_type>
rest_exclude_node(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
auto hosts = utils::split_comma_separated_list(req->get_query_param("hosts"))
| std::views::transform([] (const sstring& s) { return locator::host_id(utils::UUID(s)); })
| std::ranges::to<std::vector<locator::host_id>>();
auto& topo = ss.local().get_token_metadata().get_topology();
for (auto host : hosts) {
if (!topo.has_node(host)) {
throw bad_param_exception(fmt::format("Host ID {} does not belong to this cluster", host));
}
}
apilog.info("exclude_node: hosts={}", hosts);
co_await ss.local().mark_excluded(hosts);
co_return json_void();
}
static
future<json::json_return_type>
rest_get_removal_status(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
@@ -1224,10 +1314,7 @@ rest_get_ownership(http_context& ctx, sharded<service::storage_service>& ss, std
throw httpd::bad_param_exception("storage_service/ownership cannot be used when a keyspace uses tablets");
}
return ss.local().get_ownership().then([] (auto&& ownership) {
std::vector<storage_service_json::mapper> res;
return make_ready_future<json::json_return_type>(map_to_key_value(ownership, res));
});
co_return json::json_return_type(stream_range_as_array(co_await ss.local().get_ownership(), &map_to_json<gms::inet_address, float>));
}
static
@@ -1244,10 +1331,7 @@ rest_get_effective_ownership(http_context& ctx, sharded<service::storage_service
}
}
return ss.local().effective_ownership(keyspace_name, table_name).then([] (auto&& ownership) {
std::vector<storage_service_json::mapper> res;
return make_ready_future<json::json_return_type>(map_to_key_value(ownership, res));
});
co_return json::json_return_type(stream_range_as_array(co_await ss.local().effective_ownership(keyspace_name, table_name), &map_to_json<gms::inet_address, float>));
}
static
@@ -1257,7 +1341,7 @@ rest_estimate_compression_ratios(http_context& ctx, sharded<service::storage_ser
apilog.warn("estimate_compression_ratios: called before the cluster feature was enabled");
throw std::runtime_error("estimate_compression_ratios requires all nodes to support the SSTABLE_COMPRESSION_DICTS cluster feature");
}
auto ticket = get_units(ss.local().get_do_sample_sstables_concurrency_limiter(), 1);
auto ticket = co_await get_units(ss.local().get_do_sample_sstables_concurrency_limiter(), 1);
auto ks = api::req_param<sstring>(*req, "keyspace", {}).value;
auto cf = api::req_param<sstring>(*req, "cf", {}).value;
apilog.debug("estimate_compression_ratios: called with ks={} cf={}", ks, cf);
@@ -1323,7 +1407,7 @@ rest_retrain_dict(http_context& ctx, sharded<service::storage_service>& ss, serv
apilog.warn("retrain_dict: called before the cluster feature was enabled");
throw std::runtime_error("retrain_dict requires all nodes to support the SSTABLE_COMPRESSION_DICTS cluster feature");
}
auto ticket = get_units(ss.local().get_do_sample_sstables_concurrency_limiter(), 1);
auto ticket = co_await get_units(ss.local().get_do_sample_sstables_concurrency_limiter(), 1);
auto ks = api::req_param<sstring>(*req, "keyspace", {}).value;
auto cf = api::req_param<sstring>(*req, "cf", {}).value;
apilog.debug("retrain_dict: called with ks={} cf={}", ks, cf);
@@ -1481,16 +1565,7 @@ rest_reload_raft_topology_state(sharded<service::storage_service>& ss, service::
static
future<json::json_return_type>
rest_upgrade_to_raft_topology(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
apilog.info("Requested to schedule upgrade to raft topology");
try {
co_await ss.invoke_on(0, [] (auto& ss) {
return ss.start_upgrade_to_raft_topology();
});
} catch (...) {
auto ex = std::current_exception();
apilog.error("Failed to schedule upgrade to raft topology: {}", ex);
std::rethrow_exception(std::move(ex));
}
apilog.info("Requested to schedule upgrade to raft topology, but this version does not need it since it uses raft topology by default.");
co_return json_void();
}
@@ -1721,13 +1796,16 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::describe_ring.set(r, rest_bind(rest_describe_ring, ctx, ss));
ss::get_current_generation_number.set(r, rest_bind(rest_get_current_generation_number, ss));
ss::get_natural_endpoints.set(r, rest_bind(rest_get_natural_endpoints, ctx, ss));
ss::get_natural_endpoints_v2.set(r, rest_bind(rest_get_natural_endpoints_v2, ctx, ss));
ss::cdc_streams_check_and_repair.set(r, rest_bind(rest_cdc_streams_check_and_repair, ss));
ss::cleanup_all.set(r, rest_bind(rest_cleanup_all, ctx, ss));
ss::reset_cleanup_needed.set(r, rest_bind(rest_reset_cleanup_needed, ctx, ss));
ss::force_flush.set(r, rest_bind(rest_force_flush, ctx));
ss::force_keyspace_flush.set(r, rest_bind(rest_force_keyspace_flush, ctx));
ss::decommission.set(r, rest_bind(rest_decommission, ss));
ss::move.set(r, rest_bind(rest_move, ss));
ss::remove_node.set(r, rest_bind(rest_remove_node, ss));
ss::exclude_node.set(r, rest_bind(rest_exclude_node, ss));
ss::get_removal_status.set(r, rest_bind(rest_get_removal_status, ss));
ss::force_remove_completion.set(r, rest_bind(rest_force_remove_completion, ss));
ss::set_logging_level.set(r, rest_bind(rest_set_logging_level));
@@ -1800,11 +1878,13 @@ void unset_storage_service(http_context& ctx, routes& r) {
ss::get_natural_endpoints.unset(r);
ss::cdc_streams_check_and_repair.unset(r);
ss::cleanup_all.unset(r);
ss::reset_cleanup_needed.unset(r);
ss::force_flush.unset(r);
ss::force_keyspace_flush.unset(r);
ss::decommission.unset(r);
ss::move.unset(r);
ss::remove_node.unset(r);
ss::exclude_node.unset(r);
ss::get_removal_status.unset(r);
ss::force_remove_completion.unset(r);
ss::set_logging_level.unset(r);
@@ -1927,12 +2007,16 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
auto tag = req->get_query_param("tag");
auto column_families = split(req->get_query_param("cf"), ",");
auto sfopt = req->get_query_param("sf");
auto sf = db::snapshot_ctl::skip_flush(strcasecmp(sfopt.c_str(), "true") == 0);
auto tcopt = req->get_query_param("tc");
db::snapshot_options opts = {
.skip_flush = strcasecmp(sfopt.c_str(), "true") == 0,
};
std::vector<sstring> keynames = split(req->get_query_param("kn"), ",");
try {
if (column_families.empty()) {
co_await snap_ctl.local().take_snapshot(tag, keynames, sf);
co_await snap_ctl.local().take_snapshot(tag, keynames, opts);
} else {
if (keynames.empty()) {
throw httpd::bad_param_exception("The keyspace of column families must be specified");
@@ -1940,7 +2024,7 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
if (keynames.size() > 1) {
throw httpd::bad_param_exception("Only one keyspace allowed when specifying a column family");
}
co_await snap_ctl.local().take_column_family_snapshot(keynames[0], column_families, tag, sf);
co_await snap_ctl.local().take_column_family_snapshot(keynames[0], column_families, tag, opts);
}
co_return json_void();
} catch (...) {
@@ -1949,6 +2033,27 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
}
});
ss::take_cluster_snapshot.set(r, [&snap_ctl](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
apilog.info("take_cluster_snapshot: {}", req->get_query_params());
auto tag = req->get_query_param("tag");
auto column_families = split(req->get_query_param("table"), ",");
// Note: not published/active. Retain as internal option, but...
auto sfopt = req->get_query_param("skip_flush");
db::snapshot_options opts = {
.skip_flush = strcasecmp(sfopt.c_str(), "true") == 0,
};
std::vector<sstring> keynames = split(req->get_query_param("keyspace"), ",");
try {
co_await snap_ctl.local().take_cluster_column_family_snapshot(keynames, column_families, tag, opts);
co_return json_void();
} catch (...) {
apilog.error("take_cluster_snapshot failed: {}", std::current_exception());
throw;
}
});
ss::del_snapshot.set(r, [&snap_ctl](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
apilog.info("del_snapshot: {}", req->get_query_params());
auto tag = req->get_query_param("tag");
@@ -1975,7 +2080,8 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
auto info = parse_scrub_options(ctx, std::move(req));
if (!info.snapshot_tag.empty()) {
co_await snap_ctl.local().take_column_family_snapshot(info.keyspace, info.column_families, info.snapshot_tag, db::snapshot_ctl::skip_flush::no);
db::snapshot_options opts = {.skip_flush = false};
co_await snap_ctl.local().take_column_family_snapshot(info.keyspace, info.column_families, info.snapshot_tag, opts);
}
compaction::compaction_stats stats;

View File

@@ -54,7 +54,8 @@ void set_system(http_context& ctx, routes& r) {
hm::set_metrics_config.set(r, [](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
rapidjson::Document doc;
doc.Parse(req->content.c_str());
auto content = co_await util::read_entire_stream_contiguous(*req->content_stream);
doc.Parse(content.c_str());
if (!doc.IsArray()) {
throw bad_param_exception("Expected a json array");
}
@@ -87,21 +88,19 @@ void set_system(http_context& ctx, routes& r) {
relabels[i].expr = element["regex"].GetString();
}
}
return do_with(std::move(relabels), false, [](const std::vector<seastar::metrics::relabel_config>& relabels, bool& failed) {
return smp::invoke_on_all([&relabels, &failed] {
return metrics::set_relabel_configs(relabels).then([&failed](const metrics::metric_relabeling_result& result) {
if (result.metrics_relabeled_due_to_collision > 0) {
failed = true;
}
return;
});
}).then([&failed](){
if (failed) {
throw bad_param_exception("conflicts found during relabeling");
bool failed = false;
co_await smp::invoke_on_all([&relabels, &failed] {
return metrics::set_relabel_configs(relabels).then([&failed](const metrics::metric_relabeling_result& result) {
if (result.metrics_relabeled_due_to_collision > 0) {
failed = true;
}
return make_ready_future<json::json_return_type>(seastar::json::json_void());
return;
});
});
if (failed) {
throw bad_param_exception("conflicts found during relabeling");
}
co_return seastar::json::json_void();
});
hs::get_system_uptime.set(r, [](const_req req) {

View File

@@ -9,6 +9,7 @@
#include <seastar/core/chunked_fifo.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/exception.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <seastar/http/exception.hh>
#include "task_manager.hh"
@@ -264,7 +265,7 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>
if (id) {
module->unregister_task(id);
}
co_await maybe_yield();
co_await coroutine::maybe_yield();
}
});
co_return json_void();

View File

@@ -38,76 +38,78 @@ static auto wrap_ks_cf(http_context &ctx, ks_cf_func f) {
};
}
static future<shared_ptr<compaction::major_keyspace_compaction_task_impl>> force_keyspace_compaction(http_context& ctx, std::unique_ptr<http::request> req) {
auto& db = ctx.db;
auto [ keyspace, table_infos ] = parse_table_infos(ctx, *req, "cf");
auto flush = validate_bool_x(req->get_query_param("flush_memtables"), true);
auto consider_only_existing_data = validate_bool_x(req->get_query_param("consider_only_existing_data"), false);
apilog.info("force_keyspace_compaction: keyspace={} tables={}, flush={} consider_only_existing_data={}", keyspace, table_infos, flush, consider_only_existing_data);
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
std::optional<compaction::flush_mode> fmopt;
if (!flush && !consider_only_existing_data) {
fmopt = compaction::flush_mode::skip;
}
return compaction_module.make_and_start_task<compaction::major_keyspace_compaction_task_impl>({}, std::move(keyspace), tasks::task_id::create_null_id(), db, table_infos, fmopt, consider_only_existing_data);
}
static future<shared_ptr<compaction::upgrade_sstables_compaction_task_impl>> upgrade_sstables(http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) {
auto& db = ctx.db;
bool exclude_current_version = req_param<bool>(*req, "exclude_current_version", false);
apilog.info("upgrade_sstables: keyspace={} tables={} exclude_current_version={}", keyspace, table_infos, exclude_current_version);
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
return compaction_module.make_and_start_task<compaction::upgrade_sstables_compaction_task_impl>({}, std::move(keyspace), db, table_infos, exclude_current_version);
}
static future<shared_ptr<compaction::cleanup_keyspace_compaction_task_impl>> force_keyspace_cleanup(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
auto& db = ctx.db;
auto [keyspace, table_infos] = parse_table_infos(ctx, *req);
const auto& rs = db.local().find_keyspace(keyspace).get_replication_strategy();
if (rs.is_local() || !rs.is_vnode_based()) {
auto reason = rs.is_local() ? "require" : "support";
apilog.info("Keyspace {} does not {} cleanup", keyspace, reason);
co_return nullptr;
}
apilog.info("force_keyspace_cleanup: keyspace={} tables={}", keyspace, table_infos);
if (!co_await ss.local().is_vnodes_cleanup_allowed(keyspace)) {
auto msg = "Can not perform cleanup operation when topology changes";
apilog.warn("force_keyspace_cleanup: keyspace={} tables={}: {}", keyspace, table_infos, msg);
co_await coroutine::return_exception(std::runtime_error(msg));
}
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
co_return co_await compaction_module.make_and_start_task<compaction::cleanup_keyspace_compaction_task_impl>(
{}, std::move(keyspace), db, table_infos, compaction::flush_mode::all_tables, tasks::is_user_task::yes);
}
void set_tasks_compaction_module(http_context& ctx, routes& r, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& snap_ctl) {
t::force_keyspace_compaction_async.set(r, [&ctx](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto& db = ctx.db;
auto [ keyspace, table_infos ] = parse_table_infos(ctx, *req, "cf");
auto flush = validate_bool_x(req->get_query_param("flush_memtables"), true);
apilog.debug("force_keyspace_compaction_async: keyspace={} tables={}, flush={}", keyspace, table_infos, flush);
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
std::optional<compaction::flush_mode> fmopt;
if (!flush) {
fmopt = compaction::flush_mode::skip;
}
auto task = co_await compaction_module.make_and_start_task<compaction::major_keyspace_compaction_task_impl>({}, std::move(keyspace), tasks::task_id::create_null_id(), db, table_infos, fmopt);
auto task = co_await force_keyspace_compaction(ctx, std::move(req));
co_return json::json_return_type(task->get_status().id.to_sstring());
});
ss::force_keyspace_compaction.set(r, [&ctx](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto& db = ctx.db;
auto [ keyspace, table_infos ] = parse_table_infos(ctx, *req, "cf");
auto flush = validate_bool_x(req->get_query_param("flush_memtables"), true);
auto consider_only_existing_data = validate_bool_x(req->get_query_param("consider_only_existing_data"), false);
apilog.info("force_keyspace_compaction: keyspace={} tables={}, flush={} consider_only_existing_data={}", keyspace, table_infos, flush, consider_only_existing_data);
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
std::optional<compaction::flush_mode> fmopt;
if (!flush && !consider_only_existing_data) {
fmopt = compaction::flush_mode::skip;
}
auto task = co_await compaction_module.make_and_start_task<compaction::major_keyspace_compaction_task_impl>({}, std::move(keyspace), tasks::task_id::create_null_id(), db, table_infos, fmopt, consider_only_existing_data);
auto task = co_await force_keyspace_compaction(ctx, std::move(req));
co_await task->done();
co_return json_void();
});
t::force_keyspace_cleanup_async.set(r, [&ctx, &ss](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto& db = ctx.db;
auto [keyspace, table_infos] = parse_table_infos(ctx, *req);
apilog.info("force_keyspace_cleanup_async: keyspace={} tables={}", keyspace, table_infos);
if (!co_await ss.local().is_cleanup_allowed(keyspace)) {
auto msg = "Can not perform cleanup operation when topology changes";
apilog.warn("force_keyspace_cleanup_async: keyspace={} tables={}: {}", keyspace, table_infos, msg);
co_await coroutine::return_exception(std::runtime_error(msg));
tasks::task_id id = tasks::task_id::create_null_id();
auto task = co_await force_keyspace_cleanup(ctx, ss, std::move(req));
if (task) {
id = task->get_status().id;
}
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<compaction::cleanup_keyspace_compaction_task_impl>({}, std::move(keyspace), db, table_infos, compaction::flush_mode::all_tables, tasks::is_user_task::yes);
co_return json::json_return_type(task->get_status().id.to_sstring());
co_return json::json_return_type(id.to_sstring());
});
ss::force_keyspace_cleanup.set(r, [&ctx, &ss](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto& db = ctx.db;
auto [keyspace, table_infos] = parse_table_infos(ctx, *req);
const auto& rs = db.local().find_keyspace(keyspace).get_replication_strategy();
if (rs.is_local() || !rs.is_vnode_based()) {
auto reason = rs.is_local() ? "require" : "support";
apilog.info("Keyspace {} does not {} cleanup", keyspace, reason);
co_return json::json_return_type(0);
auto task = co_await force_keyspace_cleanup(ctx, ss, std::move(req));
if (task) {
co_await task->done();
}
apilog.info("force_keyspace_cleanup: keyspace={} tables={}", keyspace, table_infos);
if (!co_await ss.local().is_cleanup_allowed(keyspace)) {
auto msg = "Can not perform cleanup operation when topology changes";
apilog.warn("force_keyspace_cleanup: keyspace={} tables={}: {}", keyspace, table_infos, msg);
co_await coroutine::return_exception(std::runtime_error(msg));
}
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<compaction::cleanup_keyspace_compaction_task_impl>(
{}, std::move(keyspace), db, table_infos, compaction::flush_mode::all_tables, tasks::is_user_task::yes);
co_await task->done();
co_return json::json_return_type(0);
});
@@ -129,25 +131,12 @@ void set_tasks_compaction_module(http_context& ctx, routes& r, sharded<service::
}));
t::upgrade_sstables_async.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) -> future<json::json_return_type> {
auto& db = ctx.db;
bool exclude_current_version = req_param<bool>(*req, "exclude_current_version", false);
apilog.info("upgrade_sstables: keyspace={} tables={} exclude_current_version={}", keyspace, table_infos, exclude_current_version);
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<compaction::upgrade_sstables_compaction_task_impl>({}, std::move(keyspace), db, table_infos, exclude_current_version);
auto task = co_await upgrade_sstables(ctx, std::move(req), std::move(keyspace), std::move(table_infos));
co_return json::json_return_type(task->get_status().id.to_sstring());
}));
ss::upgrade_sstables.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) -> future<json::json_return_type> {
auto& db = ctx.db;
bool exclude_current_version = req_param<bool>(*req, "exclude_current_version", false);
apilog.info("upgrade_sstables: keyspace={} tables={} exclude_current_version={}", keyspace, table_infos, exclude_current_version);
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<compaction::upgrade_sstables_compaction_task_impl>({}, std::move(keyspace), db, table_infos, exclude_current_version);
auto task = co_await upgrade_sstables(ctx, std::move(req), std::move(keyspace), std::move(table_infos));
co_await task->done();
co_return json::json_return_type(0);
}));
@@ -157,7 +146,8 @@ void set_tasks_compaction_module(http_context& ctx, routes& r, sharded<service::
auto info = parse_scrub_options(ctx, std::move(req));
if (!info.snapshot_tag.empty()) {
co_await snap_ctl.local().take_column_family_snapshot(info.keyspace, info.column_families, info.snapshot_tag, db::snapshot_ctl::skip_flush::no);
db::snapshot_options opts = {.skip_flush = false};
co_await snap_ctl.local().take_column_family_snapshot(info.keyspace, info.column_families, info.snapshot_tag, opts);
}
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

View File

@@ -62,6 +62,17 @@ void set_token_metadata(http_context& ctx, routes& r, sharded<locator::shared_to
return addr | std::ranges::to<std::vector>();
});
ss::get_excluded_nodes.set(r, [&tm](const_req req) {
const auto& local_tm = *tm.local().get();
std::vector<sstring> eps;
local_tm.get_topology().for_each_node([&] (auto& node) {
if (node.is_excluded()) {
eps.push_back(node.host_id().to_sstring());
}
});
return eps;
});
ss::get_joining_nodes.set(r, [&tm, &g](const_req req) {
const auto& local_tm = *tm.local().get();
const auto& points = local_tm.get_bootstrap_tokens();
@@ -130,6 +141,7 @@ void unset_token_metadata(http_context& ctx, routes& r) {
ss::get_leaving_nodes.unset(r);
ss::get_moving_nodes.unset(r);
ss::get_joining_nodes.unset(r);
ss::get_excluded_nodes.unset(r);
ss::get_host_id_map.unset(r);
httpd::endpoint_snitch_info_json::get_datacenter.unset(r);
httpd::endpoint_snitch_info_json::get_rack.unset(r);

View File

@@ -5,6 +5,7 @@ target_sources(scylla_audit
PRIVATE
audit.cc
audit_cf_storage_helper.cc
audit_composite_storage_helper.cc
audit_syslog_storage_helper.cc)
target_include_directories(scylla_audit
PUBLIC
@@ -16,4 +17,7 @@ target_link_libraries(scylla_audit
PRIVATE
cql3)
if (Scylla_USE_PRECOMPILED_HEADER_USE)
target_precompile_headers(scylla_audit REUSE_FROM scylla-precompiled-header)
endif()
add_whole_archive(audit scylla_audit)

View File

@@ -13,9 +13,11 @@
#include "cql3/statements/batch_statement.hh"
#include "cql3/statements/modification_statement.hh"
#include "storage_helper.hh"
#include "audit_cf_storage_helper.hh"
#include "audit_syslog_storage_helper.hh"
#include "audit_composite_storage_helper.hh"
#include "audit.hh"
#include "../db/config.hh"
#include "utils/class_registrator.hh"
#include <boost/algorithm/string/split.hpp>
#include <boost/algorithm/string/trim.hpp>
@@ -26,6 +28,47 @@ namespace audit {
logging::logger logger("audit");
static std::set<sstring> parse_audit_modes(const sstring& data) {
std::set<sstring> result;
if (!data.empty()) {
std::vector<sstring> audit_modes;
boost::split(audit_modes, data, boost::is_any_of(","));
if (audit_modes.empty()) {
return {};
}
for (sstring& audit_mode : audit_modes) {
boost::trim(audit_mode);
if (audit_mode == "none") {
return {};
}
if (audit_mode != "table" && audit_mode != "syslog") {
throw audit_exception(fmt::format("Bad configuration: invalid 'audit': {}", audit_mode));
}
result.insert(std::move(audit_mode));
}
}
return result;
}
static std::unique_ptr<storage_helper> create_storage_helper(const std::set<sstring>& audit_modes, cql3::query_processor& qp, service::migration_manager& mm) {
SCYLLA_ASSERT(!audit_modes.empty() && !audit_modes.contains("none"));
std::vector<std::unique_ptr<storage_helper>> helpers;
for (const sstring& audit_mode : audit_modes) {
if (audit_mode == "table") {
helpers.emplace_back(std::make_unique<audit_cf_storage_helper>(qp, mm));
} else if (audit_mode == "syslog") {
helpers.emplace_back(std::make_unique<audit_syslog_storage_helper>(qp, mm));
}
}
SCYLLA_ASSERT(!helpers.empty());
if (helpers.size() == 1) {
return std::move(helpers.front());
}
return std::make_unique<audit_composite_storage_helper>(std::move(helpers));
}
static sstring category_to_string(statement_category category)
{
switch (category) {
@@ -103,7 +146,9 @@ static std::set<sstring> parse_audit_keyspaces(const sstring& data) {
}
audit::audit(locator::shared_token_metadata& token_metadata,
sstring&& storage_helper_name,
cql3::query_processor& qp,
service::migration_manager& mm,
std::set<sstring>&& audit_modes,
std::set<sstring>&& audited_keyspaces,
std::map<sstring, std::set<sstring>>&& audited_tables,
category_set&& audited_categories,
@@ -112,28 +157,21 @@ audit::audit(locator::shared_token_metadata& token_metadata,
, _audited_keyspaces(std::move(audited_keyspaces))
, _audited_tables(std::move(audited_tables))
, _audited_categories(std::move(audited_categories))
, _storage_helper_class_name(std::move(storage_helper_name))
, _cfg(cfg)
, _cfg_keyspaces_observer(cfg.audit_keyspaces.observe([this] (sstring const& new_value){ update_config<std::set<sstring>>(new_value, parse_audit_keyspaces, _audited_keyspaces); }))
, _cfg_tables_observer(cfg.audit_tables.observe([this] (sstring const& new_value){ update_config<std::map<sstring, std::set<sstring>>>(new_value, parse_audit_tables, _audited_tables); }))
, _cfg_categories_observer(cfg.audit_categories.observe([this] (sstring const& new_value){ update_config<category_set>(new_value, parse_audit_categories, _audited_categories); }))
{ }
{
_storage_helper_ptr = create_storage_helper(std::move(audit_modes), qp, mm);
}
audit::~audit() = default;
future<> audit::create_audit(const db::config& cfg, sharded<locator::shared_token_metadata>& stm) {
sstring storage_helper_name;
if (cfg.audit() == "table") {
storage_helper_name = "audit_cf_storage_helper";
} else if (cfg.audit() == "syslog") {
storage_helper_name = "audit_syslog_storage_helper";
} else if (cfg.audit() == "none") {
// Audit is off
future<> audit::start_audit(const db::config& cfg, sharded<locator::shared_token_metadata>& stm, sharded<cql3::query_processor>& qp, sharded<service::migration_manager>& mm) {
std::set<sstring> audit_modes = parse_audit_modes(cfg.audit());
if (audit_modes.empty()) {
logger.info("Audit is disabled");
return make_ready_future<>();
} else {
throw audit_exception(fmt::format("Bad configuration: invalid 'audit': {}", cfg.audit()));
}
category_set audited_categories = parse_audit_categories(cfg.audit_categories());
std::map<sstring, std::set<sstring>> audited_tables = parse_audit_tables(cfg.audit_tables());
@@ -143,19 +181,20 @@ future<> audit::create_audit(const db::config& cfg, sharded<locator::shared_toke
cfg.audit(), cfg.audit_categories(), cfg.audit_keyspaces(), cfg.audit_tables());
return audit_instance().start(std::ref(stm),
std::move(storage_helper_name),
std::ref(qp),
std::ref(mm),
std::move(audit_modes),
std::move(audited_keyspaces),
std::move(audited_tables),
std::move(audited_categories),
std::cref(cfg));
}
future<> audit::start_audit(const db::config& cfg, sharded<cql3::query_processor>& qp, sharded<service::migration_manager>& mm) {
if (!audit_instance().local_is_initialized()) {
return make_ready_future<>();
}
return audit_instance().invoke_on_all([&cfg, &qp, &mm] (audit& local_audit) {
return local_audit.start(cfg, qp.local(), mm.local());
std::cref(cfg))
.then([&cfg] {
if (!audit_instance().local_is_initialized()) {
return make_ready_future<>();
}
return audit_instance().invoke_on_all([&cfg] (audit& local_audit) {
return local_audit.start(cfg);
});
});
}
@@ -170,26 +209,14 @@ future<> audit::stop_audit() {
});
}
audit_info_ptr audit::create_audit_info(statement_category cat, const sstring& keyspace, const sstring& table) {
audit_info_ptr audit::create_audit_info(statement_category cat, const sstring& keyspace, const sstring& table, bool batch) {
if (!audit_instance().local_is_initialized()) {
return nullptr;
}
return std::make_unique<audit_info>(cat, keyspace, table);
return std::make_unique<audit_info>(cat, keyspace, table, batch);
}
audit_info_ptr audit::create_no_audit_info() {
return audit_info_ptr();
}
future<> audit::start(const db::config& cfg, cql3::query_processor& qp, service::migration_manager& mm) {
try {
_storage_helper_ptr = create_object<storage_helper>(_storage_helper_class_name, qp, mm);
} catch (no_such_class& e) {
logger.error("Can't create audit storage helper {}: not supported", _storage_helper_class_name);
throw;
} catch (...) {
throw;
}
future<> audit::start(const db::config& cfg) {
return _storage_helper_ptr->start(cfg);
}
@@ -236,18 +263,21 @@ future<> audit::log_login(const sstring& username, socket_address client_ip, boo
}
future<> inspect(shared_ptr<cql3::cql_statement> statement, service::query_state& query_state, const cql3::query_options& options, bool error) {
cql3::statements::batch_statement* batch = dynamic_cast<cql3::statements::batch_statement*>(statement.get());
if (batch != nullptr) {
auto audit_info = statement->get_audit_info();
if (!audit_info) {
return make_ready_future<>();
}
if (audit_info->batch()) {
cql3::statements::batch_statement* batch = static_cast<cql3::statements::batch_statement*>(statement.get());
return do_for_each(batch->statements().begin(), batch->statements().end(), [&query_state, &options, error] (auto&& m) {
return inspect(m.statement, query_state, options, error);
});
} else {
auto audit_info = statement->get_audit_info();
if (bool(audit_info) && audit::local_audit_instance().should_log(audit_info)) {
if (audit::local_audit_instance().should_log(audit_info)) {
return audit::local_audit_instance().log(audit_info, query_state, options, error);
}
return make_ready_future<>();
}
return make_ready_future<>();
}
future<> inspect_login(const sstring& username, socket_address client_ip, bool error) {

View File

@@ -75,11 +75,13 @@ class audit_info final {
sstring _keyspace;
sstring _table;
sstring _query;
bool _batch;
public:
audit_info(statement_category cat, sstring keyspace, sstring table)
audit_info(statement_category cat, sstring keyspace, sstring table, bool batch)
: _category(cat)
, _keyspace(std::move(keyspace))
, _table(std::move(table))
, _batch(batch)
{ }
void set_query_string(const std::string_view& query_string) {
_query = sstring(query_string);
@@ -89,6 +91,7 @@ public:
const sstring& query() const { return _query; }
sstring category_string() const;
statement_category category() const { return _category; }
bool batch() const { return _batch; }
};
using audit_info_ptr = std::unique_ptr<audit_info>;
@@ -102,7 +105,6 @@ class audit final : public seastar::async_sharded_service<audit> {
std::map<sstring, std::set<sstring>> _audited_tables;
category_set _audited_categories;
sstring _storage_helper_class_name;
std::unique_ptr<storage_helper> _storage_helper_ptr;
const db::config& _cfg;
@@ -125,18 +127,19 @@ public:
static audit& local_audit_instance() {
return audit_instance().local();
}
static future<> create_audit(const db::config& cfg, sharded<locator::shared_token_metadata>& stm);
static future<> start_audit(const db::config& cfg, sharded<cql3::query_processor>& qp, sharded<service::migration_manager>& mm);
static future<> start_audit(const db::config& cfg, sharded<locator::shared_token_metadata>& stm, sharded<cql3::query_processor>& qp, sharded<service::migration_manager>& mm);
static future<> stop_audit();
static audit_info_ptr create_audit_info(statement_category cat, const sstring& keyspace, const sstring& table);
static audit_info_ptr create_no_audit_info();
audit(locator::shared_token_metadata& stm, sstring&& storage_helper_name,
static audit_info_ptr create_audit_info(statement_category cat, const sstring& keyspace, const sstring& table, bool batch = false);
audit(locator::shared_token_metadata& stm,
cql3::query_processor& qp,
service::migration_manager& mm,
std::set<sstring>&& audit_modes,
std::set<sstring>&& audited_keyspaces,
std::map<sstring, std::set<sstring>>&& audited_tables,
category_set&& audited_categories,
const db::config& cfg);
~audit();
future<> start(const db::config& cfg, cql3::query_processor& qp, service::migration_manager& mm);
future<> start(const db::config& cfg);
future<> stop();
future<> shutdown();
bool should_log(const audit_info* audit_info) const;

View File

@@ -11,11 +11,11 @@
#include "cql3/query_processor.hh"
#include "data_dictionary/keyspace_metadata.hh"
#include "utils/UUID_gen.hh"
#include "utils/class_registrator.hh"
#include "cql3/query_options.hh"
#include "cql3/statements/ks_prop_defs.hh"
#include "service/migration_manager.hh"
#include "service/storage_proxy.hh"
#include "locator/abstract_replication_strategy.hh"
namespace audit {
@@ -64,8 +64,8 @@ future<> audit_cf_storage_helper::migrate_audit_table(service::group0_guard grou
data_dictionary::database db = _qp.db();
cql3::statements::ks_prop_defs old_ks_prop_defs;
auto old_ks_metadata = old_ks_prop_defs.as_ks_metadata_update(
ks->metadata(), *_qp.proxy().get_token_metadata_ptr(), db.features());
std::map<sstring, sstring> strategy_opts;
ks->metadata(), *_qp.proxy().get_token_metadata_ptr(), db.features(), db.get_config());
locator::replication_strategy_config_options strategy_opts;
for (const auto &dc: _qp.proxy().get_token_metadata_ptr()->get_topology().get_datacenters())
strategy_opts[dc] = "3";
@@ -73,6 +73,7 @@ future<> audit_cf_storage_helper::migrate_audit_table(service::group0_guard grou
"org.apache.cassandra.locator.NetworkTopologyStrategy",
strategy_opts,
std::nullopt, // initial_tablets
std::nullopt, // consistency_option
old_ks_metadata->durable_writes(),
old_ks_metadata->get_storage_options(),
old_ks_metadata->tables());
@@ -196,7 +197,4 @@ cql3::query_options audit_cf_storage_helper::make_login_data(socket_address node
return cql3::query_options(cql3::default_cql_config, db::consistency_level::ONE, std::nullopt, std::move(values), false, cql3::query_options::specific_options::DEFAULT);
}
using registry = class_registrator<storage_helper, audit_cf_storage_helper, cql3::query_processor&, service::migration_manager&>;
static registry registrator1("audit_cf_storage_helper");
}

View File

@@ -0,0 +1,68 @@
/*
* Copyright (C) 2025 ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include <seastar/core/loop.hh>
#include <seastar/core/future-util.hh>
#include "audit/audit_composite_storage_helper.hh"
#include "utils/class_registrator.hh"
namespace audit {
audit_composite_storage_helper::audit_composite_storage_helper(std::vector<std::unique_ptr<storage_helper>>&& storage_helpers)
: _storage_helpers(std::move(storage_helpers))
{}
future<> audit_composite_storage_helper::start(const db::config& cfg) {
auto res = seastar::parallel_for_each(
_storage_helpers,
[&cfg] (std::unique_ptr<storage_helper>& h) {
return h->start(cfg);
}
);
return res;
}
future<> audit_composite_storage_helper::stop() {
auto res = seastar::parallel_for_each(
_storage_helpers,
[] (std::unique_ptr<storage_helper>& h) {
return h->stop();
}
);
return res;
}
future<> audit_composite_storage_helper::write(const audit_info* audit_info,
socket_address node_ip,
socket_address client_ip,
db::consistency_level cl,
const sstring& username,
bool error) {
return seastar::parallel_for_each(
_storage_helpers,
[audit_info, node_ip, client_ip, cl, &username, error](std::unique_ptr<storage_helper>& h) {
return h->write(audit_info, node_ip, client_ip, cl, username, error);
}
);
}
future<> audit_composite_storage_helper::write_login(const sstring& username,
socket_address node_ip,
socket_address client_ip,
bool error) {
return seastar::parallel_for_each(
_storage_helpers,
[&username, node_ip, client_ip, error](std::unique_ptr<storage_helper>& h) {
return h->write_login(username, node_ip, client_ip, error);
}
);
}
} // namespace audit

View File

@@ -0,0 +1,37 @@
/*
* Copyright (C) 2025 ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include "audit/audit.hh"
#include <seastar/core/future.hh>
#include "storage_helper.hh"
namespace audit {
class audit_composite_storage_helper : public storage_helper {
std::vector<std::unique_ptr<storage_helper>> _storage_helpers;
public:
explicit audit_composite_storage_helper(std::vector<std::unique_ptr<storage_helper>>&&);
virtual ~audit_composite_storage_helper() = default;
virtual future<> start(const db::config& cfg) override;
virtual future<> stop() override;
virtual future<> write(const audit_info* audit_info,
socket_address node_ip,
socket_address client_ip,
db::consistency_level cl,
const sstring& username,
bool error) override;
virtual future<> write_login(const sstring& username,
socket_address node_ip,
socket_address client_ip,
bool error) override;
};
} // namespace audit

View File

@@ -21,7 +21,6 @@
#include <fmt/chrono.h>
#include "cql3/query_processor.hh"
#include "utils/class_registrator.hh"
namespace cql3 {
@@ -54,10 +53,10 @@ static std::string json_escape(std::string_view str) {
}
future<> audit_syslog_storage_helper::syslog_send_helper(const sstring& msg) {
future<> audit_syslog_storage_helper::syslog_send_helper(temporary_buffer<char> msg) {
try {
auto lock = co_await get_units(_semaphore, 1, std::chrono::hours(1));
co_await _sender.send(_syslog_address, net::packet{msg.data(), msg.size()});
co_await _sender.send(_syslog_address, std::span(&msg, 1));
}
catch (const std::exception& e) {
auto error_msg = seastar::format(
@@ -91,7 +90,7 @@ future<> audit_syslog_storage_helper::start(const db::config& cfg) {
co_return;
}
co_await syslog_send_helper("Initializing syslog audit backend.");
co_await syslog_send_helper(temporary_buffer<char>::copy_of("Initializing syslog audit backend."));
}
future<> audit_syslog_storage_helper::stop() {
@@ -121,7 +120,7 @@ future<> audit_syslog_storage_helper::write(const audit_info* audit_info,
audit_info->table(),
username);
co_await syslog_send_helper(msg);
co_await syslog_send_helper(std::move(msg).release());
}
future<> audit_syslog_storage_helper::write_login(const sstring& username,
@@ -140,10 +139,7 @@ future<> audit_syslog_storage_helper::write_login(const sstring& username,
client_ip,
username);
co_await syslog_send_helper(msg.c_str());
co_await syslog_send_helper(std::move(msg).release());
}
using registry = class_registrator<storage_helper, audit_syslog_storage_helper, cql3::query_processor&, service::migration_manager&>;
static registry registrator1("audit_syslog_storage_helper");
}

View File

@@ -26,7 +26,7 @@ class audit_syslog_storage_helper : public storage_helper {
net::datagram_channel _sender;
seastar::semaphore _semaphore;
future<> syslog_send_helper(const sstring& msg);
future<> syslog_send_helper(seastar::temporary_buffer<char> msg);
public:
explicit audit_syslog_storage_helper(cql3::query_processor&, service::migration_manager&);
virtual ~audit_syslog_storage_helper();

View File

@@ -9,6 +9,7 @@ target_sources(scylla_auth
allow_all_authorizer.cc
authenticated_user.cc
authenticator.cc
cache.cc
certificate_authenticator.cc
common.cc
default_authorizer.cc
@@ -16,7 +17,6 @@ target_sources(scylla_auth
password_authenticator.cc
passwords.cc
permission.cc
permissions_cache.cc
resource.cc
role_or_anonymous.cc
roles-metadata.cc
@@ -44,5 +44,8 @@ target_link_libraries(scylla_auth
add_whole_archive(auth scylla_auth)
if (Scylla_USE_PRECOMPILED_HEADER_USE)
target_precompile_headers(scylla_auth REUSE_FROM scylla-precompiled-header)
endif()
check_headers(check-headers scylla_auth
GLOB_RECURSE ${CMAKE_CURRENT_SOURCE_DIR}/*.hh)

View File

@@ -9,7 +9,6 @@
#include "auth/allow_all_authenticator.hh"
#include "service/migration_manager.hh"
#include "utils/alien_worker.hh"
#include "utils/class_registrator.hh"
namespace auth {
@@ -23,6 +22,6 @@ static const class_registrator<
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&,
utils::alien_worker&> registration("org.apache.cassandra.auth.AllowAllAuthenticator");
cache&> registration("org.apache.cassandra.auth.AllowAllAuthenticator");
}

View File

@@ -12,8 +12,8 @@
#include "auth/authenticated_user.hh"
#include "auth/authenticator.hh"
#include "auth/cache.hh"
#include "auth/common.hh"
#include "utils/alien_worker.hh"
namespace cql3 {
class query_processor;
@@ -29,7 +29,7 @@ extern const std::string_view allow_all_authenticator_name;
class allow_all_authenticator final : public authenticator {
public:
allow_all_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&) {
allow_all_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&) {
}
virtual future<> start() override {

353
auth/cache.cc Normal file
View File

@@ -0,0 +1,353 @@
/*
* Copyright (C) 2017-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "auth/cache.hh"
#include "auth/common.hh"
#include "auth/role_or_anonymous.hh"
#include "auth/roles-metadata.hh"
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
#include "db/consistency_level_type.hh"
#include "db/system_keyspace.hh"
#include "schema/schema.hh"
#include <iterator>
#include <seastar/core/abort_source.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <seastar/core/format.hh>
#include <seastar/core/metrics.hh>
#include <seastar/core/do_with.hh>
namespace auth {
logging::logger logger("auth-cache");
cache::cache(cql3::query_processor& qp, abort_source& as) noexcept
: _current_version(0)
, _qp(qp)
, _loading_sem(1)
, _as(as)
, _permission_loader(nullptr)
, _permission_loader_sem(8) {
namespace sm = seastar::metrics;
_metrics.add_group("auth_cache", {
sm::make_gauge("roles", [this] { return _roles.size(); },
sm::description("Number of roles currently cached")),
sm::make_gauge("permissions", [this] {
return _cached_permissions_count;
}, sm::description("Total number of permission sets currently cached across all roles"))
});
}
void cache::set_permission_loader(permission_loader_func loader) {
_permission_loader = std::move(loader);
}
lw_shared_ptr<const cache::role_record> cache::get(const role_name_t& role) const noexcept {
auto it = _roles.find(role);
if (it == _roles.end()) {
return {};
}
return it->second;
}
future<permission_set> cache::get_permissions(const role_or_anonymous& role, const resource& r) {
std::unordered_map<resource, permission_set>* perms_cache;
lw_shared_ptr<role_record> role_ptr;
if (is_anonymous(role)) {
perms_cache = &_anonymous_permissions;
} else {
const auto& role_name = *role.name;
auto role_it = _roles.find(role_name);
if (role_it == _roles.end()) {
// Role might have been deleted but there are some connections
// left which reference it. They should no longer have access to anything.
return make_ready_future<permission_set>(permissions::NONE);
}
role_ptr = role_it->second;
perms_cache = &role_ptr->cached_permissions;
}
if (auto it = perms_cache->find(r); it != perms_cache->end()) {
return make_ready_future<permission_set>(it->second);
}
// keep alive role_ptr as it holds perms_cache (except anonymous)
return do_with(std::move(role_ptr), [this, &role, &r, perms_cache] (auto& role_ptr) {
return load_permissions(role, r, perms_cache);
});
}
future<permission_set> cache::load_permissions(const role_or_anonymous& role, const resource& r, std::unordered_map<resource, permission_set>* perms_cache) {
SCYLLA_ASSERT(_permission_loader);
auto units = co_await get_units(_permission_loader_sem, 1, _as);
// Check again, perhaps we were blocked and other call loaded
// the permissions already. This is a protection against misses storm.
if (auto it = perms_cache->find(r); it != perms_cache->end()) {
co_return it->second;
}
auto perms = co_await _permission_loader(role, r);
add_permissions(*perms_cache, r, perms);
co_return perms;
}
future<> cache::prune(const resource& r) {
auto units = co_await get_units(_loading_sem, 1, _as);
_anonymous_permissions.erase(r);
for (auto& it : _roles) {
// Prunning can run concurrently with other functions but it
// can only cause cached_permissions extra reload via get_permissions.
remove_permissions(it.second->cached_permissions, r);
co_await coroutine::maybe_yield();
}
}
future<> cache::reload_all_permissions() noexcept {
SCYLLA_ASSERT(_permission_loader);
auto units = co_await get_units(_loading_sem, 1, _as);
auto copy_keys = [] (const std::unordered_map<resource, permission_set>& m) {
std::vector<resource> keys;
keys.reserve(m.size());
for (const auto& [res, _] : m) {
keys.push_back(res);
}
return keys;
};
const role_or_anonymous anon;
for (const auto& res : copy_keys(_anonymous_permissions)) {
_anonymous_permissions[res] = co_await _permission_loader(anon, res);
}
for (auto& [role, entry] : _roles) {
auto& perms_cache = entry->cached_permissions;
auto r = role_or_anonymous(role);
for (const auto& res : copy_keys(perms_cache)) {
perms_cache[res] = co_await _permission_loader(r, res);
}
}
logger.debug("Reloaded auth cache with {} entries", _roles.size());
}
future<lw_shared_ptr<cache::role_record>> cache::fetch_role(const role_name_t& role) const {
auto rec = make_lw_shared<role_record>();
rec->version = _current_version;
auto fetch = [this, &role](const sstring& q) {
return _qp.execute_internal(q, db::consistency_level::LOCAL_ONE,
internal_distributed_query_state(), {role},
cql3::query_processor::cache_internal::yes);
};
// roles
{
static const sstring q = format("SELECT * FROM {}.{} WHERE role = ?", db::system_keyspace::NAME, meta::roles_table::name);
auto rs = co_await fetch(q);
if (!rs->empty()) {
auto& r = rs->one();
rec->is_superuser = r.get_or<bool>("is_superuser", false);
rec->can_login = r.get_or<bool>("can_login", false);
rec->salted_hash = r.get_or<sstring>("salted_hash", "");
if (r.has("member_of")) {
auto mo = r.get_set<sstring>("member_of");
rec->member_of.insert(
std::make_move_iterator(mo.begin()),
std::make_move_iterator(mo.end()));
}
} else {
// role got deleted
co_return nullptr;
}
}
// members
{
static const sstring q = format("SELECT role, member FROM {}.{} WHERE role = ?", db::system_keyspace::NAME, ROLE_MEMBERS_CF);
auto rs = co_await fetch(q);
for (const auto& r : *rs) {
rec->members.insert(r.get_as<sstring>("member"));
co_await coroutine::maybe_yield();
}
}
// attributes
{
static const sstring q = format("SELECT role, name, value FROM {}.{} WHERE role = ?", db::system_keyspace::NAME, ROLE_ATTRIBUTES_CF);
auto rs = co_await fetch(q);
for (const auto& r : *rs) {
rec->attributes[r.get_as<sstring>("name")] =
r.get_as<sstring>("value");
co_await coroutine::maybe_yield();
}
}
// permissions
{
static const sstring q = format("SELECT role, resource, permissions FROM {}.{} WHERE role = ?", db::system_keyspace::NAME, PERMISSIONS_CF);
auto rs = co_await fetch(q);
for (const auto& r : *rs) {
auto resource = r.get_as<sstring>("resource");
auto perms_strings = r.get_set<sstring>("permissions");
std::unordered_set<sstring> perms_set(perms_strings.begin(), perms_strings.end());
auto pset = permissions::from_strings(perms_set);
rec->permissions[std::move(resource)] = std::move(pset);
co_await coroutine::maybe_yield();
}
}
co_return rec;
}
future<> cache::prune_all() noexcept {
for (auto it = _roles.begin(); it != _roles.end(); ) {
if (it->second->version != _current_version) {
remove_role(it++);
co_await coroutine::maybe_yield();
} else {
++it;
}
}
co_return;
}
future<> cache::load_all() {
if (legacy_mode(_qp)) {
co_return;
}
SCYLLA_ASSERT(this_shard_id() == 0);
auto units = co_await get_units(_loading_sem, 1, _as);
++_current_version;
logger.info("Loading all roles");
const uint32_t page_size = 128;
auto loader = [this](const cql3::untyped_result_set::row& r) -> future<stop_iteration> {
const auto name = r.get_as<sstring>("role");
auto role = co_await fetch_role(name);
if (role) {
add_role(name, role);
}
co_return stop_iteration::no;
};
co_await _qp.query_internal(format("SELECT * FROM {}.{}",
db::system_keyspace::NAME, meta::roles_table::name),
db::consistency_level::LOCAL_ONE, {}, page_size, loader);
co_await prune_all();
for (const auto& [name, role] : _roles) {
co_await distribute_role(name, role);
}
co_await container().invoke_on_others([this](cache& c) -> future<> {
auto units = co_await get_units(c._loading_sem, 1, c._as);
c._current_version = _current_version;
co_await c.prune_all();
});
}
future<> cache::gather_inheriting_roles(std::unordered_set<role_name_t>& roles, lw_shared_ptr<cache::role_record> role, const role_name_t& name) {
if (!role) {
// Role might have been removed or not yet added, either way
// their members will be handled by another top call to this function.
co_return;
}
for (const auto& member_name : role->members) {
bool is_new = roles.insert(member_name).second;
if (!is_new) {
continue;
}
lw_shared_ptr<cache::role_record> member_role;
auto r = _roles.find(member_name);
if (r != _roles.end()) {
member_role = r->second;
}
co_await gather_inheriting_roles(roles, member_role, member_name);
}
}
future<> cache::load_roles(std::unordered_set<role_name_t> roles) {
if (legacy_mode(_qp)) {
co_return;
}
SCYLLA_ASSERT(this_shard_id() == 0);
auto units = co_await get_units(_loading_sem, 1, _as);
std::unordered_set<role_name_t> roles_to_clear_perms;
for (const auto& name : roles) {
logger.info("Loading role {}", name);
auto role = co_await fetch_role(name);
if (role) {
add_role(name, role);
co_await gather_inheriting_roles(roles_to_clear_perms, role, name);
} else {
if (auto it = _roles.find(name); it != _roles.end()) {
auto old_role = it->second;
remove_role(it);
co_await gather_inheriting_roles(roles_to_clear_perms, old_role, name);
}
}
co_await distribute_role(name, role);
}
co_await container().invoke_on_all([&roles_to_clear_perms] (cache& c) -> future<> {
for (const auto& name : roles_to_clear_perms) {
c.clear_role_permissions(name);
co_await coroutine::maybe_yield();
}
});
}
future<> cache::distribute_role(const role_name_t& name, lw_shared_ptr<role_record> role) {
auto role_ptr = role.get();
co_await container().invoke_on_others([&name, role_ptr](cache& c) -> future<> {
auto units = co_await get_units(c._loading_sem, 1, c._as);
if (!role_ptr) {
c.remove_role(name);
co_return;
}
auto role_copy = make_lw_shared<role_record>(*role_ptr);
c.add_role(name, std::move(role_copy));
});
}
bool cache::includes_table(const table_id& id) noexcept {
return id == db::system_keyspace::roles()->id()
|| id == db::system_keyspace::role_members()->id()
|| id == db::system_keyspace::role_attributes()->id()
|| id == db::system_keyspace::role_permissions()->id();
}
void cache::add_role(const role_name_t& name, lw_shared_ptr<role_record> role) {
if (auto it = _roles.find(name); it != _roles.end()) {
_cached_permissions_count -= it->second->cached_permissions.size();
}
_cached_permissions_count += role->cached_permissions.size();
_roles[name] = std::move(role);
}
void cache::remove_role(const role_name_t& name) {
if (auto it = _roles.find(name); it != _roles.end()) {
remove_role(it);
}
}
void cache::remove_role(roles_map::iterator it) {
_cached_permissions_count -= it->second->cached_permissions.size();
_roles.erase(it);
}
void cache::clear_role_permissions(const role_name_t& name) {
if (auto it = _roles.find(name); it != _roles.end()) {
_cached_permissions_count -= it->second->cached_permissions.size();
it->second->cached_permissions.clear();
}
}
void cache::add_permissions(std::unordered_map<resource, permission_set>& cache, const resource& r, permission_set perms) {
if (cache.emplace(r, perms).second) {
++_cached_permissions_count;
}
}
void cache::remove_permissions(std::unordered_map<resource, permission_set>& cache, const resource& r) {
_cached_permissions_count -= cache.erase(r);
}
} // namespace auth

94
auth/cache.hh Normal file
View File

@@ -0,0 +1,94 @@
/*
* Copyright (C) 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include <seastar/core/abort_source.hh>
#include <unordered_set>
#include <unordered_map>
#include <seastar/core/sstring.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sharded.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/semaphore.hh>
#include <seastar/core/metrics_registration.hh>
#include <absl/container/flat_hash_map.h>
#include "auth/permission.hh"
#include "auth/common.hh"
#include "auth/resource.hh"
#include "auth/role_or_anonymous.hh"
namespace cql3 { class query_processor; }
namespace auth {
class cache : public peering_sharded_service<cache> {
public:
using role_name_t = sstring;
using version_tag_t = char;
using permission_loader_func = std::function<future<permission_set>(const role_or_anonymous&, const resource&)>;
struct role_record {
bool can_login = false;
bool is_superuser = false;
std::unordered_set<role_name_t> member_of;
std::unordered_set<role_name_t> members;
sstring salted_hash;
std::unordered_map<sstring, sstring> attributes;
std::unordered_map<sstring, permission_set> permissions;
private:
friend cache;
// cached permissions include effects of role's inheritance
std::unordered_map<resource, permission_set> cached_permissions;
version_tag_t version; // used for seamless cache reloads
};
explicit cache(cql3::query_processor& qp, abort_source& as) noexcept;
lw_shared_ptr<const role_record> get(const role_name_t& role) const noexcept;
void set_permission_loader(permission_loader_func loader);
future<permission_set> get_permissions(const role_or_anonymous& role, const resource& r);
future<> prune(const resource& r);
future<> reload_all_permissions() noexcept;
future<> load_all();
future<> load_roles(std::unordered_set<role_name_t> roles);
static bool includes_table(const table_id&) noexcept;
private:
using roles_map = absl::flat_hash_map<role_name_t, lw_shared_ptr<role_record>>;
roles_map _roles;
// anonymous permissions map exists mainly due to compatibility with
// higher layers which use role_or_anonymous to get permissions.
std::unordered_map<resource, permission_set> _anonymous_permissions;
version_tag_t _current_version;
cql3::query_processor& _qp;
semaphore _loading_sem; // protects iteration of _roles map
abort_source& _as;
permission_loader_func _permission_loader;
semaphore _permission_loader_sem; // protects against reload storms on a single role change
metrics::metric_groups _metrics;
size_t _cached_permissions_count = 0;
future<lw_shared_ptr<role_record>> fetch_role(const role_name_t& role) const;
future<> prune_all() noexcept;
future<> distribute_role(const role_name_t& name, const lw_shared_ptr<role_record> role);
future<> gather_inheriting_roles(std::unordered_set<role_name_t>& roles, lw_shared_ptr<cache::role_record> role, const role_name_t& name);
void add_role(const role_name_t& name, lw_shared_ptr<role_record> role);
void remove_role(const role_name_t& name);
void remove_role(roles_map::iterator it);
void clear_role_permissions(const role_name_t& name);
void add_permissions(std::unordered_map<resource, permission_set>& cache, const resource& r, permission_set perms);
void remove_permissions(std::unordered_map<resource, permission_set>& cache, const resource& r);
future<permission_set> load_permissions(const role_or_anonymous& role, const resource& r, std::unordered_map<resource, permission_set>* perms_cache);
};
} // namespace auth

View File

@@ -8,6 +8,7 @@
*/
#include "auth/certificate_authenticator.hh"
#include "auth/cache.hh"
#include <boost/regex.hpp>
#include <fmt/ranges.h>
@@ -34,13 +35,13 @@ static const class_registrator<auth::authenticator
, cql3::query_processor&
, ::service::raft_group0_client&
, ::service::migration_manager&
, utils::alien_worker&> cert_auth_reg(CERT_AUTH_NAME);
, auth::cache&> cert_auth_reg(CERT_AUTH_NAME);
enum class auth::certificate_authenticator::query_source {
subject, altname
};
auth::certificate_authenticator::certificate_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&)
auth::certificate_authenticator::certificate_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&, auth::cache&)
: _queries([&] {
auto& conf = qp.db().get_config();
auto queries = conf.auth_certificate_role_queries();
@@ -75,9 +76,9 @@ auth::certificate_authenticator::certificate_authenticator(cql3::query_processor
throw std::invalid_argument(fmt::format("Invalid source: {}", map.at(cfg_source_attr)));
}
continue;
} catch (std::out_of_range&) {
} catch (const std::out_of_range&) {
// just fallthrough
} catch (boost::regex_error&) {
} catch (const boost::regex_error&) {
std::throw_with_nested(std::invalid_argument(fmt::format("Invalid query expression: {}", map.at(cfg_query_attr))));
}
}

View File

@@ -10,7 +10,6 @@
#pragma once
#include "auth/authenticator.hh"
#include "utils/alien_worker.hh"
#include <boost/regex_fwd.hpp> // IWYU pragma: keep
namespace cql3 {
@@ -26,13 +25,15 @@ class raft_group0_client;
namespace auth {
class cache;
extern const std::string_view certificate_authenticator_name;
class certificate_authenticator : public authenticator {
enum class query_source;
std::vector<std::pair<query_source, boost::regex>> _queries;
public:
certificate_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&);
certificate_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&);
~certificate_authenticator();
future<> start() override;

View File

@@ -94,7 +94,7 @@ static future<> create_legacy_metadata_table_if_missing_impl(
try {
co_return co_await mm.announce(co_await ::service::prepare_new_column_family_announcement(qp.proxy(), table, ts),
std::move(group0_guard), format("auth: create {} metadata table", table->cf_name()));
} catch (exceptions::already_exists_exception&) {}
} catch (const exceptions::already_exists_exception&) {}
}
}

View File

@@ -48,6 +48,10 @@ extern constinit const std::string_view AUTH_PACKAGE_NAME;
} // namespace meta
constexpr std::string_view PERMISSIONS_CF = "role_permissions";
constexpr std::string_view ROLE_MEMBERS_CF = "role_members";
constexpr std::string_view ROLE_ATTRIBUTES_CF = "role_attributes";
// This is a helper to check whether auth-v2 is on.
bool legacy_mode(cql3::query_processor& qp);

View File

@@ -37,7 +37,6 @@ std::string_view default_authorizer::qualified_java_name() const {
static constexpr std::string_view ROLE_NAME = "role";
static constexpr std::string_view RESOURCE_NAME = "resource";
static constexpr std::string_view PERMISSIONS_NAME = "permissions";
static constexpr std::string_view PERMISSIONS_CF = "role_permissions";
static logging::logger alogger("default_authorizer");
@@ -257,7 +256,7 @@ future<> default_authorizer::revoke_all(std::string_view role_name, ::service::g
} else {
co_await collect_mutations(_qp, mc, query, {sstring(role_name)});
}
} catch (exceptions::request_execution_exception& e) {
} catch (const exceptions::request_execution_exception& e) {
alogger.warn("CassandraAuthorizer failed to revoke all permissions of {}: {}", role_name, e);
}
}
@@ -294,13 +293,13 @@ future<> default_authorizer::revoke_all_legacy(const resource& resource) {
[resource](auto ep) {
try {
std::rethrow_exception(ep);
} catch (exceptions::request_execution_exception& e) {
} catch (const exceptions::request_execution_exception& e) {
alogger.warn("CassandraAuthorizer failed to revoke all permissions on {}: {}", resource, e);
}
});
});
} catch (exceptions::request_execution_exception& e) {
} catch (const exceptions::request_execution_exception& e) {
alogger.warn("CassandraAuthorizer failed to revoke all permissions on {}: {}", resource, e);
return make_ready_future();
}

View File

@@ -83,25 +83,35 @@ static const class_registrator<
ldap_role_manager,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&> registration(ldap_role_manager_full_name);
::service::migration_manager&,
cache&> registration(ldap_role_manager_full_name);
ldap_role_manager::ldap_role_manager(
std::string_view query_template, std::string_view target_attr, std::string_view bind_name, std::string_view bind_password,
cql3::query_processor& qp, ::service::raft_group0_client& rg0c, ::service::migration_manager& mm)
: _std_mgr(qp, rg0c, mm), _group0_client(rg0c), _query_template(query_template), _target_attr(target_attr), _bind_name(bind_name)
uint32_t permissions_update_interval_in_ms,
utils::observer<uint32_t> permissions_update_interval_in_ms_observer,
cql3::query_processor& qp, ::service::raft_group0_client& rg0c, ::service::migration_manager& mm, cache& cache)
: _std_mgr(qp, rg0c, mm, cache), _group0_client(rg0c), _query_template(query_template), _target_attr(target_attr), _bind_name(bind_name)
, _bind_password(bind_password)
, _connection_factory(bind(std::mem_fn(&ldap_role_manager::reconnect), std::ref(*this))) {
, _permissions_update_interval_in_ms(permissions_update_interval_in_ms)
, _permissions_update_interval_in_ms_observer(std::move(permissions_update_interval_in_ms_observer))
, _connection_factory(bind(std::mem_fn(&ldap_role_manager::reconnect), std::ref(*this)))
, _cache(cache)
, _cache_pruner(make_ready_future<>()) {
}
ldap_role_manager::ldap_role_manager(cql3::query_processor& qp, ::service::raft_group0_client& rg0c, ::service::migration_manager& mm)
ldap_role_manager::ldap_role_manager(cql3::query_processor& qp, ::service::raft_group0_client& rg0c, ::service::migration_manager& mm, cache& cache)
: ldap_role_manager(
qp.db().get_config().ldap_url_template(),
qp.db().get_config().ldap_attr_role(),
qp.db().get_config().ldap_bind_dn(),
qp.db().get_config().ldap_bind_passwd(),
qp.db().get_config().permissions_update_interval_in_ms(),
qp.db().get_config().permissions_update_interval_in_ms.observe([this] (const uint32_t& v) { _permissions_update_interval_in_ms = v; }),
qp,
rg0c,
mm) {
mm,
cache) {
}
std::string_view ldap_role_manager::qualified_java_name() const noexcept {
@@ -117,6 +127,22 @@ future<> ldap_role_manager::start() {
return make_exception_future(
std::runtime_error(fmt::format("error getting LDAP server address from template {}", _query_template)));
}
_cache_pruner = futurize_invoke([this] () -> future<> {
while (true) {
try {
co_await seastar::sleep_abortable(std::chrono::milliseconds(_permissions_update_interval_in_ms), _as);
} catch (const seastar::sleep_aborted&) {
co_return; // ignore
}
co_await _cache.container().invoke_on_all([] (cache& c) -> future<> {
try {
co_await c.reload_all_permissions();
} catch (...) {
mylog.warn("Cache reload all permissions failed: {}", std::current_exception());
}
});
}
});
return _std_mgr.start();
}
@@ -173,7 +199,11 @@ future<conn_ptr> ldap_role_manager::reconnect() {
future<> ldap_role_manager::stop() {
_as.request_abort();
return _std_mgr.stop().then([this] { return _connection_factory.stop(); });
return std::move(_cache_pruner).then([this] {
return _std_mgr.stop();
}).then([this] {
return _connection_factory.stop();
});
}
future<> ldap_role_manager::create(std::string_view name, const role_config& config, ::service::group0_batch& mc) {
@@ -233,9 +263,9 @@ future<role_set> ldap_role_manager::query_granted(std::string_view grantee_name,
}
future<role_to_directly_granted_map>
ldap_role_manager::query_all_directly_granted() {
ldap_role_manager::query_all_directly_granted(::service::query_state& qs) {
role_to_directly_granted_map result;
auto roles = co_await query_all();
auto roles = co_await query_all(qs);
for (auto& role: roles) {
auto granted_set = co_await query_granted(role, recursive_role_query::no);
for (auto& granted: granted_set) {
@@ -247,8 +277,8 @@ ldap_role_manager::query_all_directly_granted() {
co_return result;
}
future<role_set> ldap_role_manager::query_all() {
return _std_mgr.query_all();
future<role_set> ldap_role_manager::query_all(::service::query_state& qs) {
return _std_mgr.query_all(qs);
}
future<> ldap_role_manager::create_role(std::string_view role_name) {
@@ -311,12 +341,12 @@ future<bool> ldap_role_manager::can_login(std::string_view role_name) {
}
future<std::optional<sstring>> ldap_role_manager::get_attribute(
std::string_view role_name, std::string_view attribute_name) {
return _std_mgr.get_attribute(role_name, attribute_name);
std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) {
return _std_mgr.get_attribute(role_name, attribute_name, qs);
}
future<role_manager::attribute_vals> ldap_role_manager::query_attribute_for_all(std::string_view attribute_name) {
return _std_mgr.query_attribute_for_all(attribute_name);
future<role_manager::attribute_vals> ldap_role_manager::query_attribute_for_all(std::string_view attribute_name, ::service::query_state& qs) {
return _std_mgr.query_attribute_for_all(attribute_name, qs);
}
future<> ldap_role_manager::set_attribute(

View File

@@ -10,10 +10,12 @@
#pragma once
#include <seastar/core/abort_source.hh>
#include <seastar/core/future.hh>
#include <stdexcept>
#include "ent/ldap/ldap_connection.hh"
#include "standard_role_manager.hh"
#include "auth/cache.hh"
namespace auth {
@@ -33,22 +35,31 @@ class ldap_role_manager : public role_manager {
seastar::sstring _target_attr; ///< LDAP entry attribute containing the Scylla role name.
seastar::sstring _bind_name; ///< Username for LDAP simple bind.
seastar::sstring _bind_password; ///< Password for LDAP simple bind.
uint32_t _permissions_update_interval_in_ms;
utils::observer<uint32_t> _permissions_update_interval_in_ms_observer;
mutable ldap_reuser _connection_factory; // Potentially modified by query_granted().
seastar::abort_source _as;
cache& _cache;
seastar::future<> _cache_pruner;
public:
ldap_role_manager(
std::string_view query_template, ///< LDAP query template as described in Scylla documentation.
std::string_view target_attr, ///< LDAP entry attribute containing the Scylla role name.
std::string_view bind_name, ///< LDAP bind credentials.
std::string_view bind_password, ///< LDAP bind credentials.
uint32_t permissions_update_interval_in_ms,
utils::observer<uint32_t> permissions_update_interval_in_ms_observer,
cql3::query_processor& qp, ///< Passed to standard_role_manager.
::service::raft_group0_client& rg0c, ///< Passed to standard_role_manager.
::service::migration_manager& mm ///< Passed to standard_role_manager.
::service::migration_manager& mm, ///< Passed to standard_role_manager.
cache& cache ///< Passed to standard_role_manager.
);
/// Retrieves LDAP configuration entries from qp and invokes the other constructor. Required by
/// class_registrator<role_manager>.
ldap_role_manager(cql3::query_processor& qp, ::service::raft_group0_client& rg0c, ::service::migration_manager& mm);
ldap_role_manager(cql3::query_processor& qp, ::service::raft_group0_client& rg0c, ::service::migration_manager& mm, cache& cache);
/// Thrown when query-template parsing fails.
struct url_error : public std::runtime_error {
@@ -75,9 +86,9 @@ class ldap_role_manager : public role_manager {
future<role_set> query_granted(std::string_view, recursive_role_query) override;
future<role_to_directly_granted_map> query_all_directly_granted() override;
future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state&) override;
future<role_set> query_all() override;
future<role_set> query_all(::service::query_state&) override;
future<bool> exists(std::string_view) override;
@@ -85,9 +96,9 @@ class ldap_role_manager : public role_manager {
future<bool> can_login(std::string_view) override;
future<std::optional<sstring>> get_attribute(std::string_view, std::string_view) override;
future<std::optional<sstring>> get_attribute(std::string_view, std::string_view, ::service::query_state&) override;
future<role_manager::attribute_vals> query_attribute_for_all(std::string_view) override;
future<role_manager::attribute_vals> query_attribute_for_all(std::string_view, ::service::query_state&) override;
future<> set_attribute(std::string_view, std::string_view, std::string_view, ::service::group0_batch& mc) override;

View File

@@ -11,6 +11,7 @@
#include <seastar/core/future.hh>
#include <stdexcept>
#include <string_view>
#include "auth/cache.hh"
#include "cql3/description.hh"
#include "utils/class_registrator.hh"
@@ -23,7 +24,8 @@ static const class_registrator<
maintenance_socket_role_manager,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&> registration(sstring{maintenance_socket_role_manager_name});
::service::migration_manager&,
cache&> registration(sstring{maintenance_socket_role_manager_name});
std::string_view maintenance_socket_role_manager::qualified_java_name() const noexcept {
@@ -78,11 +80,11 @@ future<role_set> maintenance_socket_role_manager::query_granted(std::string_view
return operation_not_supported_exception<role_set>("QUERY GRANTED");
}
future<role_to_directly_granted_map> maintenance_socket_role_manager::query_all_directly_granted() {
future<role_to_directly_granted_map> maintenance_socket_role_manager::query_all_directly_granted(::service::query_state&) {
return operation_not_supported_exception<role_to_directly_granted_map>("QUERY ALL DIRECTLY GRANTED");
}
future<role_set> maintenance_socket_role_manager::query_all() {
future<role_set> maintenance_socket_role_manager::query_all(::service::query_state&) {
return operation_not_supported_exception<role_set>("QUERY ALL");
}
@@ -98,11 +100,11 @@ future<bool> maintenance_socket_role_manager::can_login(std::string_view role_na
return make_ready_future<bool>(true);
}
future<std::optional<sstring>> maintenance_socket_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name) {
future<std::optional<sstring>> maintenance_socket_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state&) {
return operation_not_supported_exception<std::optional<sstring>>("GET ATTRIBUTE");
}
future<role_manager::attribute_vals> maintenance_socket_role_manager::query_attribute_for_all(std::string_view attribute_name) {
future<role_manager::attribute_vals> maintenance_socket_role_manager::query_attribute_for_all(std::string_view attribute_name, ::service::query_state&) {
return operation_not_supported_exception<role_manager::attribute_vals>("QUERY ATTRIBUTE");
}

View File

@@ -8,6 +8,7 @@
#pragma once
#include "auth/cache.hh"
#include "auth/resource.hh"
#include "auth/role_manager.hh"
#include <seastar/core/future.hh>
@@ -29,7 +30,7 @@ extern const std::string_view maintenance_socket_role_manager_name;
// system_auth keyspace, which may be not yet created when the maintenance socket starts listening.
class maintenance_socket_role_manager final : public role_manager {
public:
maintenance_socket_role_manager(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&) {}
maintenance_socket_role_manager(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&) {}
virtual std::string_view qualified_java_name() const noexcept override;
@@ -53,9 +54,9 @@ public:
virtual future<role_set> query_granted(std::string_view grantee_name, recursive_role_query) override;
virtual future<role_to_directly_granted_map> query_all_directly_granted() override;
virtual future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state&) override;
virtual future<role_set> query_all() override;
virtual future<role_set> query_all(::service::query_state&) override;
virtual future<bool> exists(std::string_view role_name) override;
@@ -63,9 +64,9 @@ public:
virtual future<bool> can_login(std::string_view role_name) override;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name) override;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state&) override;
virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name) override;
virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name, ::service::query_state&) override;
virtual future<> set_attribute(std::string_view role_name, std::string_view attribute_name, std::string_view attribute_value, ::service::group0_batch& mc) override;

View File

@@ -49,7 +49,7 @@ static const class_registrator<
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&,
utils::alien_worker&> password_auth_reg("org.apache.cassandra.auth.PasswordAuthenticator");
cache&> password_auth_reg("org.apache.cassandra.auth.PasswordAuthenticator");
static thread_local auto rng_for_salt = std::default_random_engine(std::random_device{}());
@@ -63,13 +63,13 @@ std::string password_authenticator::default_superuser(const db::config& cfg) {
password_authenticator::~password_authenticator() {
}
password_authenticator::password_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm, utils::alien_worker& hashing_worker)
password_authenticator::password_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm, cache& cache)
: _qp(qp)
, _group0_client(g0)
, _migration_manager(mm)
, _cache(cache)
, _stopped(make_ready_future<>())
, _superuser(default_superuser(qp.db().get_config()))
, _hashing_worker(hashing_worker)
{}
static bool has_salted_hash(const cql3::untyped_result_set_row& row) {
@@ -315,24 +315,31 @@ future<authenticated_user> password_authenticator::authenticate(
const sstring password = credentials.at(PASSWORD_KEY);
try {
const std::optional<sstring> salted_hash = co_await get_password_hash(username);
if (!salted_hash) {
throw exceptions::authentication_exception("Username and/or password are incorrect");
std::optional<sstring> salted_hash;
if (legacy_mode(_qp)) {
salted_hash = co_await get_password_hash(username);
if (!salted_hash) {
throw exceptions::authentication_exception("Username and/or password are incorrect");
}
} else {
auto role = _cache.get(username);
if (!role || role->salted_hash.empty()) {
throw exceptions::authentication_exception("Username and/or password are incorrect");
}
salted_hash = role->salted_hash;
}
const bool password_match = co_await _hashing_worker.submit<bool>([password = std::move(password), salted_hash = std::move(salted_hash)]{
return passwords::check(password, *salted_hash);
});
const bool password_match = co_await passwords::check(password, *salted_hash);
if (!password_match) {
throw exceptions::authentication_exception("Username and/or password are incorrect");
}
co_return username;
} catch (std::system_error &) {
} catch (const std::system_error &) {
std::throw_with_nested(exceptions::authentication_exception("Could not verify password"));
} catch (exceptions::request_execution_exception& e) {
} catch (const exceptions::request_execution_exception& e) {
std::throw_with_nested(exceptions::authentication_exception(e.what()));
} catch (exceptions::authentication_exception& e) {
} catch (const exceptions::authentication_exception& e) {
std::throw_with_nested(e);
} catch (exceptions::unavailable_exception& e) {
} catch (const exceptions::unavailable_exception& e) {
std::throw_with_nested(exceptions::authentication_exception(e.get_message()));
} catch (...) {
std::throw_with_nested(exceptions::authentication_exception("authentication failed"));

View File

@@ -16,8 +16,8 @@
#include "db/consistency_level_type.hh"
#include "auth/authenticator.hh"
#include "auth/passwords.hh"
#include "auth/cache.hh"
#include "service/raft/raft_group0_client.hh"
#include "utils/alien_worker.hh"
namespace db {
class config;
@@ -41,19 +41,19 @@ class password_authenticator : public authenticator {
cql3::query_processor& _qp;
::service::raft_group0_client& _group0_client;
::service::migration_manager& _migration_manager;
cache& _cache;
future<> _stopped;
abort_source _as;
std::string _superuser; // default superuser name from the config (may or may not be present in roles table)
shared_promise<> _superuser_created_promise;
// We used to also support bcrypt, SHA-256, and MD5 (ref. scylladb#24524).
constexpr static auth::passwords::scheme _scheme = passwords::scheme::sha_512;
utils::alien_worker& _hashing_worker;
public:
static db::consistency_level consistency_for_user(std::string_view role_name);
static std::string default_superuser(const db::config&);
password_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&);
password_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&);
~password_authenticator();

View File

@@ -7,6 +7,8 @@
*/
#include "auth/passwords.hh"
#include "utils/crypt_sha512.hh"
#include <seastar/core/coroutine.hh>
#include <cerrno>
@@ -21,27 +23,48 @@ static thread_local crypt_data tlcrypt = {};
namespace detail {
void verify_hashing_output(const char * res) {
if (!res || (res[0] == '*')) {
throw std::system_error(errno, std::system_category());
}
}
void verify_scheme(scheme scheme) {
const sstring random_part_of_salt = "aaaabbbbccccdddd";
const sstring salt = sstring(prefix_for_scheme(scheme)) + random_part_of_salt;
const char* e = crypt_r("fisk", salt.c_str(), &tlcrypt);
if (e && (e[0] != '*')) {
return;
try {
verify_hashing_output(e);
} catch (const std::system_error& ex) {
throw no_supported_schemes();
}
throw no_supported_schemes();
}
sstring hash_with_salt(const sstring& pass, const sstring& salt) {
auto res = crypt_r(pass.c_str(), salt.c_str(), &tlcrypt);
if (!res || (res[0] == '*')) {
throw std::system_error(errno, std::system_category());
}
verify_hashing_output(res);
return res;
}
seastar::future<sstring> hash_with_salt_async(const sstring& pass, const sstring& salt) {
sstring res;
// Only SHA-512 hashes for passphrases shorter than 256 bytes can be computed using
// the __crypt_sha512 method. For other computations, we fall back to the
// crypt_r implementation from `<crypt.h>`, which can stall.
if (salt.starts_with(prefix_for_scheme(scheme::sha_512)) && pass.size() <= 255) {
char buf[128];
const char * output_ptr = co_await __crypt_sha512(pass.c_str(), salt.c_str(), buf);
verify_hashing_output(output_ptr);
res = output_ptr;
} else {
const char * output_ptr = crypt_r(pass.c_str(), salt.c_str(), &tlcrypt);
verify_hashing_output(output_ptr);
res = output_ptr;
}
co_return res;
}
std::string_view prefix_for_scheme(scheme c) noexcept {
switch (c) {
case scheme::bcrypt_y: return "$2y$";
@@ -58,8 +81,9 @@ no_supported_schemes::no_supported_schemes()
: std::runtime_error("No allowed hashing schemes are supported on this system") {
}
bool check(const sstring& pass, const sstring& salted_hash) {
return detail::hash_with_salt(pass, salted_hash) == salted_hash;
seastar::future<bool> check(const sstring& pass, const sstring& salted_hash) {
const auto pwd_hash = co_await detail::hash_with_salt_async(pass, salted_hash);
co_return pwd_hash == salted_hash;
}
} // namespace auth::passwords

View File

@@ -11,6 +11,7 @@
#include <random>
#include <stdexcept>
#include <seastar/core/future.hh>
#include <seastar/core/sstring.hh>
#include "seastarx.hh"
@@ -75,11 +76,23 @@ sstring generate_salt(RandomNumberEngine& g, scheme scheme) {
///
/// Hash a password combined with an implementation-specific salt string.
/// Deprecated in favor of `hash_with_salt_async`. This function is still used
/// when generating password hashes for storage to ensure that
/// `hash_with_salt` and `hash_with_salt_async` produce identical results,
/// preserving backward compatibility.
///
/// \throws \ref std::system_error when an unexpected implementation-specific error occurs.
///
sstring hash_with_salt(const sstring& pass, const sstring& salt);
///
/// Async version of `hash_with_salt` that returns a future.
/// If possible, hashing uses `coroutine::maybe_yield` to prevent reactor stalls.
///
/// \throws \ref std::system_error when an unexpected implementation-specific error occurs.
///
seastar::future<sstring> hash_with_salt_async(const sstring& pass, const sstring& salt);
} // namespace detail
///
@@ -107,6 +120,6 @@ sstring hash(const sstring& pass, RandomNumberEngine& g, scheme scheme) {
///
/// \throws \ref std::system_error when an unexpected implementation-specific error occurs.
///
bool check(const sstring& pass, const sstring& salted_hash);
seastar::future<bool> check(const sstring& pass, const sstring& salted_hash);
} // namespace auth::passwords

View File

@@ -36,7 +36,8 @@ static const std::unordered_map<sstring, auth::permission> permission_names({
{"MODIFY", auth::permission::MODIFY},
{"AUTHORIZE", auth::permission::AUTHORIZE},
{"DESCRIBE", auth::permission::DESCRIBE},
{"EXECUTE", auth::permission::EXECUTE}});
{"EXECUTE", auth::permission::EXECUTE},
{"VECTOR_SEARCH_INDEXING", auth::permission::VECTOR_SEARCH_INDEXING}});
const sstring& auth::permissions::to_string(permission p) {
for (auto& v : permission_names) {

View File

@@ -33,6 +33,7 @@ enum class permission {
// data access
SELECT, // required for SELECT.
MODIFY, // required for INSERT, UPDATE, DELETE, TRUNCATE.
VECTOR_SEARCH_INDEXING, // required for SELECT from tables with vector indexes if SELECT permission is not granted.
// permission management
AUTHORIZE, // required for GRANT and REVOKE.
@@ -54,7 +55,8 @@ typedef enum_set<
permission::MODIFY,
permission::AUTHORIZE,
permission::DESCRIBE,
permission::EXECUTE>> permission_set;
permission::EXECUTE,
permission::VECTOR_SEARCH_INDEXING>> permission_set;
bool operator<(const permission_set&, const permission_set&);

View File

@@ -1,38 +0,0 @@
/*
* Copyright (C) 2017-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "auth/permissions_cache.hh"
#include <fmt/ranges.h>
#include "auth/authorizer.hh"
#include "auth/service.hh"
namespace auth {
permissions_cache::permissions_cache(const utils::loading_cache_config& c, service& ser, logging::logger& log)
: _cache(c, log, [&ser, &log](const key_type& k) {
log.debug("Refreshing permissions for {}", k.first);
return ser.get_uncached_permissions(k.first, k.second);
}) {
}
bool permissions_cache::update_config(utils::loading_cache_config c) {
return _cache.update_config(std::move(c));
}
void permissions_cache::reset() {
_cache.reset();
}
future<permission_set> permissions_cache::get(const role_or_anonymous& maybe_role, const resource& r) {
return do_with(key_type(maybe_role, r), [this](const auto& k) {
return _cache.get(k);
});
}
}

Some files were not shown because too many files have changed in this diff Show More