Commit Graph

53943 Commits

Author SHA1 Message Date
Nadav Har'El
cd61a44ab8 test/alternator: test response compression of tiny responses
This patch adds to the existing collection of tests for Alternator
response compression another test with a tiny response being compressed.
This test serves two purposes:

1. It verifies setting alternator_response_compression_threshold_in_bytes
   to a tiny number like 1 really means that tiny responses would be
   compressed.

2. It verifies that our compression code, which has a special code path
   for the small chunk at the end of the compression, works correctly.

The original motivation for writing this test was a false alarm by
Claude Code, which claimed that Alternator's response compression code
has a serious, exploitable memory overrun bug because it set the
wrong size limit on that last chunk. Claude was wrong: there is no such
bug. We did set an oversized limit on the last chunk (so this patch
fixes this typo), but it didn't matter, because the code used
deflateBound (the guaranteed maximum size of the compressed data)
for the buffer's size. The buffer was therefore always big enough:
no matter which avail_out limit we passed to deflate(), it could never
overflow.

The included test passes even before this patch, even with ASAN
enabled to detect memory overflows - no overflow was happening.
It also passes after the typo correction in this patch.
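
The bounded-buffer argument can be illustrated with a minimal Python sketch.
Python's zlib module does not expose deflateBound, so the bound below uses
the formula of zlib's compressBound() as a stand-in; the helper names are
hypothetical and this is not Alternator's code.

```python
import zlib


def compress_bound(source_len: int) -> int:
    """Worst-case size of a zlib-wrapped stream, per zlib's compressBound()."""
    return source_len + (source_len >> 12) + (source_len >> 14) + (source_len >> 25) + 13


def compress_in_chunks(data: bytes, chunk_size: int = 4) -> bytes:
    """Feed the input in small pieces, then flush the final partial chunk."""
    compressor = zlib.compressobj()
    out = bytearray()
    for start in range(0, len(data), chunk_size):
        out += compressor.compress(data[start:start + chunk_size])
    out += compressor.flush()  # the "last chunk" path
    return bytes(out)


# Tiny payloads still fit in a buffer sized by the bound and round-trip intact.
for payload in (b"", b"x", b"{}", b"tiny response"):
    compressed = compress_in_chunks(payload)
    assert len(compressed) <= compress_bound(len(payload))
    assert zlib.decompress(compressed) == payload
```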

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#29718
2026-05-19 10:02:26 +03:00
Avi Kivity
85374207ca Merge 'test.py: rewrite gather metrics' from Andrei Chekun
Rewrite metrics gathering so that metrics are gathered correctly for Python tests.
Python tests require different handling of cgroup metrics gathering than C++ tests. pytest does not execute each Python test in a separate process, so we can't put each test in its own cgroup and read the metrics from there.
The idea is to put the whole pytest process into the cgroup and get the metrics from it. This works because pytest runs the threads as completely separate processes, and inside each thread it runs tests consecutively.
Additionally, to simplify things, the system resource monitor was moved to the pytest main thread.
Change the behavior of metrics gathering. Starting with this PR, some data is collected even with `--no-gather-metrics`. This data needs no configuration and is just test metadata: test name, execution time, and test status. When `--gather-metrics` is provided, the data gathered from the cgroups about the memory of each specific test and the system CPU/RAM utilization is written as well.

Backport is not needed, because it's a framework change only.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-575

~Blocked by: https://github.com/scylladb/scylladb/pull/27618~

Now Python tests, with their own Scylla instances, have metrics gathered from the cgroups as well.
```bash
$ sqlite3 --header testlog/sqlite_af8cb.db 'select tst.path, tst.file, tst.test_name, user_sec,system_sec,usage_sec,memory_peak /1024/1024 as memory_peak_mb from test_metrics join tests as tst where tst.id = test_metrics.test_id order by memory_peak_mb desc limit 10;'
path|file|test_name|user_sec|system_sec|usage_sec|memory_peak_mb
test/cluster/dtest|limits_test.py|test_max_cells|489.468174|27.6638949999999|517.132069|4241
test/cluster/dtest|rebuild_test.py|test_rebuild_stream_abort_repro|93.6400869999998|28.9843249999999|122.624412|4241
test/cluster/dtest|schema_management_test.py|test_prepared_statements_work_after_node_restart_after_altering_schema_without_changing_columns|6.8933219999999|3.63569899999993|10.5290209999994|4241
test/cluster/dtest|schema_management_test.py|test_dropping_keyspace_with_many_columns|1.31770999999981|0.754742999999962|2.07245299999977|4241
test/cluster/dtest|schema_management_test.py|test_multiple_create_table_in_parallel|5.48435300000028|2.72915200000011|8.21350499999971|4241
test/cluster/dtest|schema_management_test.py|test_alter_table_in_parallel_to_read_and_write[write]|80.687293|18.5562|99.2434920000005|4241
test/cluster/dtest|schema_management_test.py|test_alter_table_in_parallel_to_read_and_write[read]|79.1984790000001|18.0969829999999|97.2954609999997|4241
test/cluster/dtest|schema_management_test.py|test_alter_table_in_parallel_to_read_and_write[mixed]|85.332915|18.9321070000001|104.265022|4241
test/cluster/dtest|schema_management_test.py|test_update_schema_while_node_is_killed[create_table]|10.5875369999999|5.67954400000008|16.267081|4241
test/cluster/dtest|schema_management_test.py|test_update_schema_while_node_is_killed[alter_table]|11.3801709999998|6.54689099999996|17.9270630000001|4241
```
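
As a rough illustration of where columns such as `user_sec`, `system_sec`, `usage_sec`, and `memory_peak` could come from, here is a minimal sketch that reads the cgroup v2 `cpu.stat` and `memory.peak` files. The function names and the cgroup directory are assumptions for illustration; the actual test.py code may be organized differently.

```python
from pathlib import Path


def parse_cpu_stat(text: str) -> dict[str, float]:
    """Convert the *_usec counters of a cgroup v2 cpu.stat file to seconds."""
    seconds = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if key.endswith("_usec"):
            seconds[key.removesuffix("_usec") + "_sec"] = int(value) / 1_000_000
    return seconds


def read_cgroup_metrics(cgroup_dir: Path) -> dict[str, float]:
    """Read CPU time and peak memory for one cgroup (path supplied by the caller)."""
    metrics = parse_cpu_stat((cgroup_dir / "cpu.stat").read_text())
    metrics["memory_peak"] = int((cgroup_dir / "memory.peak").read_text())
    return metrics


# Sample cpu.stat content; counters that are not in microseconds are ignored.
sample = "usage_usec 2000000\nuser_usec 1500000\nsystem_usec 500000\nnr_periods 0\n"
print(parse_cpu_stat(sample))
```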

Closes scylladb/scylladb#28206

* github.com:scylladb/scylladb:
  test.py: Add host hardware info
  test.py: rewrite resource gather
2026-05-18 20:35:14 +03:00
Dawid Pawlik
c2d27d1a50 index: remove Chinese, Japanese, and Korean language analyzers
Remove "chinese", "japanese", and "korean" from the list of accepted
full-text search analyzer options. Exposing these options commits
ScyllaDB to supporting them long-term — if we ever switch from one
backend search engine to another, CJK analyzers are the most likely
to lose out-of-the-box support, unlike the popular European languages
that are broadly available across text analysis libraries.

Restrict the accepted set now, while FTS is still new, to avoid a
future compatibility burden.

Add a test to check if the CJK language analyzer options are rejected.

Fixes: VECTOR-672

Closes scylladb/scylladb#29877
2026-05-18 18:20:47 +03:00
Szymon Malewski
15493872b2 vector_search: fix decimal/varint precision loss in filter value_to_json()
value_to_json() converts CQL values to JSON for vector search filters.
For decimal and varint types, it used rjson::parse() on the JSON string,
which parses through a double and silently loses precision for values
exceeding ~15 significant digits — producing wrong filter results.

Additionally, for decimal type we need an exact string representation
that preserves the original (unscaled, scale) pair, because partition
keys use byte-level identity: different serialized representations of
the same numeric value are distinct rows, so the filter must reproduce
the exact representation stored in the key.

Add big_decimal::to_string_canonical() which follows the Java BigDecimal
toString() spec (JDK 8+), producing a bijective string representation
that uses exponential notation for extreme scales instead of expanding
trailing zeros (which could cause OOM). This could replace to_string(),
but doing so has wider consequences (e.g. hash/equality contract for
decimal_type) described in SCYLLADB-1574. Use it in value_to_json() for
decimal_type, and use rjson::from_string() for varint_type, both
bypassing the lossy double parse path.

Tests cover the new to_string_canonical() and the filter fix, as well as
existing decimal type behavior (key representation, clustering order,
toJson) that we rely on and must not break. The CQL decimal type tests
(test_type_decimal.py) also pass against Cassandra.
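
The two problems described above can be reproduced with Python's standard
library. This is only an analogy: `json` and `decimal.Decimal` stand in for
rjson and big_decimal, and the sample value is made up.

```python
import json
from decimal import Decimal

text = "12345678901234567890.123456789"

# Parsing through a binary double keeps only about 15-17 significant digits.
lossy = json.loads(text)
# Asking the parser for an exact decimal type keeps every digit.
exact = json.loads(text, parse_float=Decimal)

assert isinstance(lossy, float)
assert Decimal(lossy) != Decimal(text)
assert str(exact) == text

# Numerically equal decimals can still differ in their (unscaled, scale) pair,
# so a string form must preserve the pair to reproduce the stored key bytes.
a, b = Decimal("1.0"), Decimal("1.00")
assert a == b
assert a.as_tuple() != b.as_tuple()
assert (str(a), str(b)) == ("1.0", "1.00")
```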

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1583
Refs: https://scylladb.atlassian.net/browse/SCYLLADB-1574

Closes scylladb/scylladb#29505
2026-05-18 17:07:26 +03:00
Piotr Dulikowski
26671d4d5f Merge 'Refactor view_update_builder' from Wojciech Mitros
This series improves the readability and structure of
view_update_builder, the component that generates materialized view
updates from base-table mutations.

The first four patches are pure renames and refactoring with no
semantic changes:

  1. Document that the builder operates on a single base partition.
  2. Rename member fields to clearly distinguish readers (the
     mutation_reader streams) from the cached fragments (the last
     mutation_fragment_v2 read from each stream).
  3. Rename advance/on_results methods to names that describe what
     they actually do: read the next fragment, or generate view
     updates.
  4. Extract partition-start handling into its own method.

The next two patches are minor optimizations:

  5. Simplify clustering-row handling by moving the row out of the
     fragment before applying the tombstone, avoiding an unnecessary
     memory-usage recalculation in the reader permit.
  6. Replace deep copies with moves in the existing-only tail path,
     matching the pattern used everywhere else.

Finally, patch 7 deduplicates the fragment-consuming logic by
extracting the three repeated blocks into consume_both_fragments(),
consume_update_fragment(), and consume_existing_fragment().
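
The shape of that deduplication can be sketched as a merge of two sorted
streams with three handlers. Everything here is a simplified, hypothetical
stand-in (fragments are plain `(position, payload)` tuples), not the actual
view_update_builder code.

```python
def merge_fragments(updates, existings, on_both, on_update, on_existing):
    """Walk two position-sorted streams, dispatching each step to one handler."""
    up, ex = iter(updates), iter(existings)
    u, e = next(up, None), next(ex, None)
    while u is not None or e is not None:
        if e is None or (u is not None and u[0] < e[0]):
            on_update(u)              # fragment present only in the update
            u = next(up, None)
        elif u is None or e[0] < u[0]:
            on_existing(e)            # fragment present only in the existing data
            e = next(ex, None)
        else:
            on_both(u, e)             # same position in both streams
            u, e = next(up, None), next(ex, None)


events = []
merge_fragments(
    [(1, "a"), (3, "c")],
    [(2, "B"), (3, "C"), (4, "D")],
    on_both=lambda u, e: events.append(("both", u[0])),
    on_update=lambda u: events.append(("update", u[0])),
    on_existing=lambda e: events.append(("existing", e[0])),
)
print(events)
```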

Code reorganization - no backport needed

Closes scylladb/scylladb#29497

* github.com:scylladb/scylladb:
  mv: deduplicate code for consuming fragments in view_update_builder
  mv: avoid unnecessary copies of existing rows in generate_updates()
  mv: simplify clustering row handling in generate_updates()
  mv: rename methods in view_update_builder for clarity
  mv: rename view_update_builder readers and cached fragments
  mv: drop redundant std::move from partition key extraction
  mv: document single-partition builder scope
2026-05-18 15:52:26 +02:00
Piotr Dulikowski
5efb43195e Merge 'db/schema_tables: don't emit empty view_building_tasks mutation on ALTER TABLE' from Michał Jadwiszczak
After a recent change (1a32ccd), `make_update_indices_mutations()` unconditionally adds a mutation for `system.view_building_tasks`, even when no indices are being dropped.

In a mixed-version cluster, the older node may not have this table, causing the Raft schema applier to fail with 'Can't find a column family with UUID ...'.

This patch fixes the bug by emitting the mutation only when indices are actually dropped (i.e., when the view building cleanup code path was entered).
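
A minimal sketch of the pattern, using hypothetical Python stand-ins for the builder and the mutation list (the real code is C++ and its interfaces may differ):

```python
class TaskMutationBuilder:
    """Hypothetical builder that records pending view-building task deletions."""

    def __init__(self):
        self._deleted = []

    def delete_tasks_for(self, index_name):
        self._deleted.append(index_name)

    def empty(self):
        return not self._deleted

    def build(self):
        return {"table": "system.view_building_tasks", "deleted": list(self._deleted)}


def make_mutations(dropped_indices):
    """Emit the view_building_tasks mutation only if something was recorded."""
    mutations = [{"table": "schema", "dropped": list(dropped_indices)}]
    builder = TaskMutationBuilder()
    for name in dropped_indices:
        builder.delete_tasks_for(name)
    if not builder.empty():
        mutations.append(builder.build())
    return mutations


print([m["table"] for m in make_mutations([])])
print([m["table"] for m in make_mutations(["idx"])])
```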

Fixes: SCYLLADB-2026
Refs: scylladb#26557

scylladb#26557 wasn't backported, so this patch also doesn't need to be.

Closes scylladb/scylladb#29908

* github.com:scylladb/scylladb:
  db/schema_tables: don't emit empty view_building_tasks mutation on ALTER TABLE
  db/view_building_task_mutation_builder: add `empty()` method
2026-05-18 15:37:02 +02:00
Nadav Har'El
5dbd0d71d5 Merge 'test/pylib: test/pylib: Cached Scylla package resolver' from Alex Dathskovsky
This series adds a shared helper for resolving, downloading, unpacking, and
installing Scylla relocatable packages for test.py.

The first patch introduces `version_fetch_utils`, which can resolve public
Scylla artifacts from the downloads bucket by version, architecture, package
variant, or direct URL. It also centralizes the local cache/install flow using
retry handling, marker files, and file locking so repeated or concurrent test
runs can safely reuse an existing installation.

The second patch wires this helper into the existing Scylla executable setup
paths. This removes the hard-coded 2025.1 package URL and replaces the local
download/unpack/install logic in `scylla_cluster.py` with the shared resolver.
It also makes `--exe-url` use the same cached installer path.
Together, these changes make upgrade-test executable selection less brittle,
avoid duplicated install logic, and provide a reusable foundation for fetching
other Scylla versions in test.py.
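
The marker-file and file-locking idea can be sketched as follows. This is a
simplified illustration, not the `version_fetch_utils` API: the function name,
the `.installed` and `.lock` file names, and the cache layout are assumptions,
retries are omitted, and `fcntl` limits it to Unix-like systems.

```python
import fcntl
import tempfile
from pathlib import Path
from typing import Callable


def install_once(cache_dir: Path, install: Callable[[Path], None]) -> Path:
    """Run install() at most once per cache directory, even across processes."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    marker = cache_dir / ".installed"
    with open(cache_dir / ".lock", "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks while another process installs
        if not marker.exists():
            install(cache_dir)
            marker.touch()  # written last, so a crash mid-install leaves no marker
    return cache_dir  # the lock is released when lock_file is closed


with tempfile.TemporaryDirectory() as tmp:
    calls = []
    for _ in range(2):
        install_once(Path(tmp) / "scylla-pkg", calls.append)
    assert len(calls) == 1  # the second call reuses the existing installation
```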

Closes scylladb/scylladb#29855

* github.com:scylladb/scylladb:
  test/pylib: use version fetcher for Scylla executable setup
  test/pylib: add cached Scylla package installer
2026-05-18 16:32:47 +03:00
Yaniv Michael Kaul
5d8d158bdd call_backport_with_jira.yaml: add missing workflow permissions
Add explicit permissions block matching the requirements of the called
reusable workflow (contents: read, pull-requests: write, issues: write).
Fixes code scanning alerts #181, #182, #183.

Closes scylladb/scylladb#29182
2026-05-18 15:50:00 +03:00
Yaniv Michael Kaul
fbf5be5587 docs: update Python deps
Ran 'make update' to get the latest version of all dependencies needed to build docs.
Tested with 'make test' only.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
AI-Assisted: no, to my surprise.
Backport: not sure.

Closes scylladb/scylladb#29909
2026-05-18 15:45:59 +03:00
Andrei Chekun
a03c4fd754 test.py: Add host hardware info
Gather additional information about the running host for better metrics analysis
2026-05-18 12:23:40 +02:00
Andrei Chekun
6414c48fc2 test.py: rewrite resource gather
Python tests require different handling of cgroup metrics gathering
than C++ tests. pytest does not execute each Python test in
a separate process, so we can't put each test in its own cgroup and
read the metrics from there.
The idea is to put the whole pytest process into the cgroup and get the
metrics from it. This works because pytest runs the threads as completely
separate processes, and inside each thread it runs tests consecutively.
Additionally, to simplify things, the system resource monitor was moved
to the pytest main thread.
2026-05-18 12:23:40 +02:00
Marcin Maliszkiewicz
628e1ef2de Merge 'Introduce auth::config to decouple auth modules from db::config' from Pavel Emelyanov
Auth modules (authenticators, role managers, and auth::service) access their configuration options by reaching into db::config through the query processor. This abuses the database as a proxy object for getting configuration.

This series introduces a dedicated auth::config struct that carries the configuration options used by auth modules. The config is populated in main.cc and delivered to each shard via sharded_parameter. This makes the auth service conform to the overall design, where db::config is split into smaller per-service configs on start, thus decoupling individual components/services from the global configuration.

Cleaning components dependencies, not backporting.

Closes scylladb/scylladb#29870

* github.com:scylladb/scylladb:
  auth: Remove unused default_superuser() function
  auth: Switch role managers to use auth::config
  auth: Switch authenticators to use auth::config
  auth: Introduce auth::config and wire it through service
2026-05-18 11:32:11 +02:00
Patryk Jędrzejczak
c9592a495e Merge 'cql: fix missing TABLETS_ROUTING_V1 payload after CAS shard bounce' from Petr Gusev
After an internal CAS shard bounce, check_locality() was evaluating
against this_shard_id() of the post-bounce shard — which is the correct
tablet shard — so it returned nullopt, and LWT/SERIAL responses omitted
the tablets-routing-v1 custom payload. The client never learned the
correct tablet map.

Fix by recording the original entry shard in client_state (initialized
to this_shard_id() at construction, preserved across shard bounces via
client_state_for_another_shard) and passing it to check_locality() so
it compares against the client's actual routing decision.

No host_id tracking or forwarded_client_state IDL changes are needed
because CAS shard bounces are always intra-node.
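
The effect of the fix can be sketched with a toy model. The names mirror the
description above, but the signatures and return values are hypothetical
simplifications, not the actual C++ interfaces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClientState:
    original_shard: int  # shard where the request first arrived


def check_locality(tablet_shard: int, request_shard: int) -> Optional[dict]:
    """Return routing info when the request was not sent to the tablet's shard."""
    if request_shard == tablet_shard:
        return None
    return {"tablet_shard": tablet_shard}


tablet_shard, entry_shard = 3, 0
state = ClientState(original_shard=entry_shard)

# After the bounce, the request executes on the tablet's shard.
current_shard = tablet_shard

# Comparing against the current shard hides the misrouting: no payload.
assert check_locality(tablet_shard, current_shard) is None
# Comparing against the recorded entry shard reports it to the client.
assert check_locality(tablet_shard, state.original_shard) == {"tablet_shard": 3}
```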

Fixes SCYLLADB-2041

backport: need to backport to all versions with LWT over tablets

Closes scylladb/scylladb#29910

* https://github.com/scylladb/scylladb:
  cql: refactor add_tablet_info to take tablet_routing_info directly
  cql: fix UB dereference of nullopt tablet_info in execute_with_condition
  test/boost: add regression test for missing tablet routing after CAS bounce
  cql: fix missing TABLETS_ROUTING_V1 payload after CAS shard bounce
2026-05-18 11:19:04 +02:00
Yehuda Lebi
6307e17795 fix: raise scylla-helper.slice CPUWeight from 10 to 100 to prevent node_exporter CPU starvation
Closes scylladb/scylladb#29839
2026-05-18 11:55:14 +03:00
Yaniv Michael Kaul
f047e6fd5c trigger_jenkins.yaml: add missing permissions and fix script injection
Add explicit empty permissions block (permissions: {}) since this
workflow only triggers Jenkins and sends Slack notifications using its
own secrets. Also move expression interpolations into env vars to
prevent potential script injection. Fixes code scanning alert #147.

Also remove the pre-existing 'permissions: contents: read' block,
which would result in duplicate YAML keys (invalid per the YAML spec).

Closes scylladb/scylladb#29186
2026-05-18 11:39:39 +03:00
Botond Dénes
cc210813c8 Merge 'cmake: add IDL comparison to build system tool and fix PCH propagation' from Ernest Zaslavsky
This series adds IDL file comparison to the build system comparison tool and fixes CMake PCH propagation.

1. `scripts/compare_build_systems.py` only compared compilation flags, link targets, and linker settings — it did not compare IDL-generated file sets. This allowed PR #28843 to pass CI despite adding `strong_consistency/groups_manager.idl.hh` to `configure.py` but not to `idl/CMakeLists.txt`.

2. CMake's `scylla-main` target was not using the precompiled header (`stdafx.hh`), even though configure.py applies it to every source file via `-include-pch`. This caused compilation failures for files relying on transitive includes from the PCH — e.g., `sstables_loader.cc` failed with `no member named 'read_entire_stream' in namespace 'seastar::util'`.

Add a 4th comparison check to the build system comparison script: extract IDL-generated file sets from both build systems' ninja files and compare them. The extractors parse ninja build statements — configure.py side filters by build mode, CMake side handles the `|` separator for implicit outputs — and normalize to a canonical relative path for comparison.
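
A simplified sketch of such an extractor is shown below. It is not the code in `scripts/compare_build_systems.py`: the function name, the `.dist.hh` suffix filter, the sample ninja lines, and the normalization rule are assumptions chosen for illustration, and ninja escape sequences are ignored.

```python
def idl_outputs(ninja_text: str, suffix: str = ".dist.hh") -> set[str]:
    """Collect normalized IDL outputs from the `build` statements of a ninja file."""
    found = set()
    for line in ninja_text.splitlines():
        if not line.startswith("build "):
            continue
        targets, _, _ = line[len("build "):].partition(":")
        for token in targets.split():
            if token == "|":  # separator before implicit outputs
                continue
            if token.endswith(suffix):
                # Normalize to the path from the last "idl/" component onward.
                idx = token.rfind("idl/")
                found.add(token[idx:] if idx >= 0 else token)
    return found


configure_ninja = (
    "build build/dev/gen/idl/a.dist.hh build/dev/gen/idl/a.dist.impl.hh: idl idl/a.idl.hh\n"
    "build build/dev/gen/idl/b.dist.hh: idl idl/b.idl.hh\n"
)
cmake_ninja = "build idl/a.dist.hh | idl/a.dist.impl.hh: CUSTOM_COMMAND ../idl/a.idl.hh\n"

# Outputs known to one build system but missing from the other.
print(idl_outputs(configure_ninja) - idl_outputs(cmake_ninja))
```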

Add the missing `strong_consistency/groups_manager.idl.hh` to `idl/CMakeLists.txt`.

Add `target_precompile_headers(scylla-main REUSE_FROM scylla-precompiled-header)` so that all sources compiled under `scylla-main` benefit from the PCH, matching configure.py's behavior.

Update documentation to reflect the new IDL comparison check.

Refs: https://github.com/scylladb/scylladb/pull/29901
Refs: https://github.com/scylladb/scylladb/pull/28843

No backport needed — these are build system improvements only.

Closes scylladb/scylladb#29912

* github.com:scylladb/scylladb:
  cmake: reuse precompiled header in scylla-main target
  idl: add missing groups_manager.idl.hh to CMakeLists.txt
  scripts: add IDL-generated file comparison to compare_build_systems
2026-05-18 11:38:14 +03:00
Yaniv Michael Kaul
34aac2030c paxos: enable paging for internal paxos state queries
The paxos state queries (load_paxos_state, save_paxos_promise, etc.)
were using page_size=-1 (no paging). While each query returns at most
one row and paging never actually kicks in, the lack of paging causes
these internal queries to be counted as non-paged reads in the metrics,
which can be confusing to users monitoring their cluster.

Add LIMIT 1 to the SELECT query so that may_need_paging() short-circuits
to false (row_limit <= 1), avoiding pager allocation overhead entirely.
Set page_size=1000 so these queries are no longer reported as non-paged
reads.

Refs: https://scylladb.atlassian.net/browse/CUSTOMER-372
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Backport: no, improvement

Closes scylladb/scylladb#29852
2026-05-18 11:35:55 +03:00
Michał Jadwiszczak
a9b2baf36b db/schema_tables: don't emit empty view_building_tasks mutation on ALTER TABLE
After a recent change (1a32ccd), `make_update_indices_mutations()` unconditionally
adds a mutation for `system.view_building_tasks`, even when no indices are being dropped.

In a mixed-version cluster, the older node may not have this table,
causing the Raft schema applier to fail with 'Can't find a column
family with UUID ...'.

This patch fixes the bug by emitting the mutation only when indices are actually
dropped (i.e., when the view building cleanup code path was entered).

Fixes: SCYLLADB-2026
Refs: scylladb#26557
2026-05-18 10:01:21 +02:00
Michał Jadwiszczak
82eb5611ab db/view_building_task_mutation_builder: add empty() method
The method allows checking whether the builder contains any changes,
which makes it possible to skip emitting an empty mutation.
2026-05-18 09:54:26 +02:00
Ernest Zaslavsky
834eed10d9 test: fix use-after-free in start_docker_service retry path
start_docker_service is a coroutine that took docker_args and
image_args by const reference. Its caller start_fake_gcs_server
is a regular function that passes temporaries (initializer lists)
and immediately returns a future. The temporaries are destroyed
when the caller returns, leaving the coroutine holding dangling
references.

On the first loop iteration this works by luck (memory not yet
reused), but on retry (after "address already in use") the
params.append_range(image_args) reads freed memory, causing
use-after-free that manifests as std::bad_alloc or broken_promise
in non-sanitizer builds.

Fix by taking docker_args and image_args by value so the coroutine
frame owns the vectors for its entire lifetime.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-2003

Closes scylladb/scylladb#29932
2026-05-18 10:50:19 +03:00
Szymon Malewski
cb8e11653f test/alternator: Number normalization tests
DynamoDB normalizes Number values, so different string representations
of the same number (e.g., "1000" vs "1e3") should be treated as the
same value in all contexts.
In Alternator this is true in most cases, thanks to implicit normalization in
the Decimal `to_string()` function.
However, this is fragile, and in fact this function should be fixed
due to an OOM vulnerability in CQL use (#8002).

This patch adds tests that should prevent regressions in cases
that currently work.

Unfortunately, not all contexts work currently: mainly, HASH keys
are not normalized and the backend handles them by byte representation.
Added tests replicate this incorrect behaviour.

All added tests pass with DynamoDB, with one exception: weirdly,
DynamoDB doesn't recognise unnormalized numbers in BatchGetItem
as duplicate keys.
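
The difference between value equality and byte-wise identity can be shown
with Python's `decimal` module. This is an analogy only; the `canonical()`
helper is hypothetical and is not how Alternator normalizes numbers.

```python
from decimal import Decimal


def canonical(number_text: str) -> str:
    """One possible canonical form: strip trailing zeros via normalize()."""
    return str(Decimal(number_text).normalize())


a, b = Decimal("1000"), Decimal("1e3")
assert a == b                # same numeric value
assert str(a) != str(b)      # different textual representations

# Keys compared by raw representation are distinct ...
assert len({"1000", "1e3"}) == 2
# ... while keys compared after normalization collapse into one.
assert len({canonical("1000"), canonical("1e3")}) == 1
```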

Ref SCYLLADB-1575

Closes scylladb/scylladb#29501
2026-05-18 09:42:33 +03:00
Evgeniy Naydanov
39a10d6d67 test: remove dead suite subclasses and legacy execution pipeline
After all test suites migrated to test_config.yaml with type: Python,
the specialized suite classes (Topology, CQLApproval, Run, Tool) and
the legacy execution pipeline (find_tests, run_test, TestSuite.run,
Test.run) became unreachable. Remove all this dead code.

Deleted files:
- suite/topology.py, suite/cql_approval.py, suite/run.py, suite/tool.py

Simplified:
- base.py: remove run_test(), read_log(), TestSuite.run(),
  add_test_list(), build_test_list(), all_tests(), test_count(),
  SUITE_CONFIG_FILENAME, disabled/flaky test tracking, and dead
  Test attributes (args, core_args, valid_exit_codes, allure_dir,
  is_flaky, is_cancelled, etc.)
- python.py: remove PythonTestSuite.run(), PythonTest.run(),
  _prepare_pytest_params(), pattern, test_file_ext, xmlout,
  server_log, scylla_env setup, and shlex import.
  Simplify run_ctx() to take no parameters.
- runner.py: remove --scylla-log-filename option,
  print_scylla_log_filename fixture, SUITE_CONFIG_FILENAME import,
  and suite.yaml probe in TestSuiteConfig.from_pytest_node().
- __init__.py: remove re-exports of deleted classes.
- test_config.yaml: Topology -> Python, Approval -> Python.
- conftest files: run_ctx(options=...) -> run_ctx().
- docs/dev/testing.md: update to reflect current pytest-based
  architecture, log paths, and removed features.

Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com>

Closes scylladb/scylladb#29613
2026-05-17 22:16:31 +03:00
Alex
176dbf12c2 test/pylib: use version fetcher for Scylla executable setup
Replace the hard-coded 2025.1 archive download and local install logic with the
shared Scylla package fetch/install helper. This keeps upgrade-test executable
resolution and `--exe-url` handling on the same cached installer path.
2026-05-17 17:43:56 +03:00
Alex
1efe9a7243 test/pylib: add cached Scylla package installer
Add utilities to resolve relocatable Scylla artifacts from the public downloads
bucket by version, architecture, package variant, or direct URL. Download,
unpack, and install the selected archive into the test.py cache with retry
handling, marker files, and file locking so repeated or concurrent test runs can
reuse the same installation safely.
2026-05-17 17:43:56 +03:00
Andrzej Jackowski
61e5ec9888 test: storage: retry fusermount3 unmount on teardown
After stopping scylla server processes, the FUSE daemon
(fuse2fs) may still be processing file handle closures.
An immediate fusermount3 -u can fail with 'device busy',
causing spurious test failures on teardown.

Retry the unmount up to 10 times with 0.5s delay between
attempts, and capture stderr for diagnostics.
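
The retry loop can be sketched as below. The helper name and its injectable
`run` and `sleep` parameters are assumptions made so the example runs without
a real FUSE mount; the actual test code may differ.

```python
import subprocess
import time
from types import SimpleNamespace


def unmount_with_retry(mount_point, attempts=10, delay=0.5,
                       run=subprocess.run, sleep=time.sleep):
    """Try `fusermount3 -u` several times, keeping stderr for diagnostics."""
    last_stderr = ""
    for attempt in range(attempts):
        result = run(["fusermount3", "-u", mount_point],
                     capture_output=True, text=True)
        if result.returncode == 0:
            return
        last_stderr = result.stderr
        if attempt + 1 < attempts:
            sleep(delay)
    raise RuntimeError(
        f"unmount of {mount_point} failed after {attempts} attempts: "
        f"{last_stderr.strip()}"
    )


# Simulated run: busy twice, then success (no real unmount is performed).
outcomes = iter([1, 1, 0])
fake_run = lambda cmd, **kwargs: SimpleNamespace(
    returncode=next(outcomes), stderr="device busy")
unmount_with_retry("/mnt/test", run=fake_run, sleep=lambda seconds: None)
```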

Fixes: SCYLLADB-2049

Closes scylladb/scylladb#29920
2026-05-16 19:36:48 +03:00
Piotr Dulikowski
460cb1656e Merge 'test: limits: optimize test_max_cells to avoid large allocations and fragmentation' from Dario Mirovic
The `test_max_cells` test was flaky due to `std::bad_alloc` caused by Seastar buddy allocator fragmentation. The root causes are:
1. The doubling loop with 24 iterations of CREATE/INSERT/DROP fragmented the allocator
2. The test built the whole batch as a single string that takes contiguous memory

Also, some iterations inserted zero rows, but still did CREATE/DROP table which also contributed to the fragmentation.

This patch series:
- Skips iterations that insert zero rows
- Creates the table once, truncates it after each test iteration
- Switches to prepared statements

Investigation results are presented in detail in https://scylladb.atlassian.net/browse/SCYLLADB-1645

Fixes SCYLLADB-1645

CI stability improvement. Backport to versions that have this test.

Closes scylladb/scylladb#29759

* github.com:scylladb/scylladb:
  test: prepare max cells inserts
  test: reuse max cells schema
  test: limits: skip empty max cells iterations
2026-05-15 18:12:48 +02:00
Pavel Emelyanov
98bea152a8 auth: Remove unused default_superuser() function
All callers have been migrated to read the superuser name from
auth::config directly. Remove the now-unused helper that fetched
it from db::config via the query processor.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-15 18:55:02 +03:00
Pavel Emelyanov
9b58d2213b auth: Switch role managers to use auth::config
Convert all role manager implementations to receive their
configuration from auth::config instead of accessing db::config
through the query processor:

- standard_role_manager: reads superuser name from config
- ldap_role_manager: reads LDAP URL template, attribute, bind
  credentials, and permissions update interval from config;
  passes config to inner standard_role_manager
- maintenance_socket_role_manager: keeps a const reference to
  service's config and passes it directly when lazily
  constructing standard_role_manager

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-15 18:55:02 +03:00
Aleksandra Martyniuk
d874d355c2 service: skip load_sketch unload for excluded nodes on RF shrink
When an RF change shrinks replicas on a DC and the node being shrunk is
excluded, refresh_tablet_load_stats() only provides load_stats for that
node if it has a cached snapshot from when the node was still up. If the
snapshot is missing or predates the tables being shrunk (e.g. they were
created after the node went down), stats stay incomplete. In that case
load_sketch::unload() called from make_rf_change_plan() throws:

    Can't provide accurate load computation with incomplete load_stats
    for host: <uuid>

Since an excluded node is not expected to come back, load_stats will
never become complete, and the topology coordinator retries the plan
infinitely, hanging ALTER KEYSPACE.

Add a check for excluded nodes and skip unload() for them: we are
removing the replica, so accurate load data for that node is not
needed. For all other node states the throw-and-retry behavior is
preserved.

Modify test_excludenode_shrink_rf to always trigger the bug: a new
error injection 'force_down_node_load_stats_invalid' forces the
invalid-stats path in refresh_tablet_load_stats() for a down node, so
the test does not depend on whether the load-stats refresher happened
to cache the excluded node's stats while it was still up.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1702.

Closes scylladb/scylladb#29622
2026-05-15 17:46:28 +02:00
Pavel Emelyanov
14b36b3db1 auth: Switch authenticators to use auth::config
Convert all authenticator implementations to receive their
configuration from auth::config instead of accessing db::config
through the query processor:

- password_authenticator: reads superuser name and salted password
  from config, stores them as members
- saslauthd_authenticator: reads socket path from config
- certificate_authenticator: reads role queries from config
- transitional_authenticator: passes config to inner
  password_authenticator
- maintenance_socket_authenticator: inherits new constructor
  via using declaration

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-15 18:45:01 +03:00
Pavel Emelyanov
07ed557a2f auth: Introduce auth::config and wire it through service
Add a dedicated auth::config struct that carries all configuration
options needed by auth modules. The config is created per-shard using
sharded_parameter to ensure updateable_value fields are shard-local.

The config is stored as a member in auth::service and passed by
const reference to factories so that each auth module can receive its
configuration when constructed. The modules themselves are not yet
converted — they still read from db::config via the query processor.

The stored config is also used in describe_roles() to read the
superuser name, eliminating the default_superuser() call that reached
into db::config via the query processor.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-15 18:44:37 +03:00
Petr Gusev
9e3209e4a3 cql: refactor add_tablet_info to take tablet_routing_info directly
Change add_tablet_info() to accept locator::tablet_routing_info instead
of destructured (tablet_replica_set, token_range) pair. This simplifies
all three call sites.

Remove the empty-replicas guard inside add_tablet_info(): the only
producer of tablet_routing_info is tablet ERM's check_locality(), which
returns either nullopt (correctly routed) or info with replicas copied
from tablet_info — a tablet always has replicas. All callers already
check for nullopt before calling add_tablet_info(), so by the time we
enter the function replicas are guaranteed non-empty.
2026-05-15 12:28:33 +02:00
Petr Gusev
738b7b4a86 cql: fix UB dereference of nullopt tablet_info in execute_with_condition
When check_locality() returns nullopt (correctly routed LWT), the
optional tablet_info was unconditionally dereferenced in the lambda
capture list: tablet_info->tablet_replicas, tablet_info->token_range.

The code previously masked this by initializing tablet_info with an
empty-but-present value, so the dereference happened to work but
only because the empty tablet_replicas made add_tablet_info() a no-op.
After check_locality() overwrites it with nullopt, the dereference
is UB.

Fix by initializing tablet_info as empty (nullopt) and guarding the
dereference.
2026-05-15 11:56:14 +02:00
Petr Gusev
8a76ec7e65 test/boost: add regression test for missing tablet routing after CAS bounce
Add test_tablet_routing_info_after_cas_shard_bounce that verifies
TABLETS_ROUTING_V1 payload is returned after an internal CAS shard
bounce.

The test simulates the transport-layer bounce: it creates a table whose
single tablet replica lands on a shard different from the test thread,
executes an LWT (which bounces), then transfers client_state via
client_state_for_another_shard (preserving _original_shard) and
re-executes on the tablet shard. The test asserts that check_locality()
correctly detects the misrouting and returns tablet routing info.

Refs SCYLLADB-2041
2026-05-15 11:56:14 +02:00
Petr Gusev
167a3c9c50 cql: fix missing TABLETS_ROUTING_V1 payload after CAS shard bounce
After an internal CAS shard bounce, check_locality() was evaluating
against this_shard_id() of the post-bounce shard — which is the correct
tablet shard — so it returned nullopt, and LWT/SERIAL responses omitted
the tablets-routing-v1 custom payload. The client never learned the
correct tablet map.

Fix by recording the original entry shard in client_state (initialized
to this_shard_id() at construction, preserved across shard bounces via
client_state_for_another_shard) and passing it to check_locality() so
it compares against the client's actual routing decision.

No host_id tracking or forwarded_client_state IDL changes are needed
because CAS shard bounces are always intra-node.
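
The comparison can be illustrated with a small Python model. This is only a
sketch, not the C++ code from this commit: the ClientState class, its
for_another_shard() method, the shard numbers and the shape of the returned
routing info are hypothetical stand-ins for client_state,
client_state_for_another_shard and check_locality().

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ClientState:
    # Shard on which the request first arrived; set once at construction.
    original_shard: int

    def for_another_shard(self) -> "ClientState":
        # A bounce copies the state but keeps the original entry shard.
        return replace(self)


def check_locality(tablet_shard: int, entry_shard: int) -> Optional[dict]:
    """Return routing info if the client picked the wrong shard, else None."""
    if entry_shard == tablet_shard:
        return None
    return {"correct_shard": tablet_shard}


# The request arrives on shard 0, but the tablet lives on shard 3.
state = ClientState(original_shard=0)
bounced = state.for_another_shard()  # now executing on shard 3
current_shard = 3

# Comparing against the post-bounce shard hides the misrouting.
before_fix = check_locality(tablet_shard=3, entry_shard=current_shard)
# Comparing against the original entry shard reports it.
after_fix = check_locality(tablet_shard=3, entry_shard=bounced.original_shard)
```

In this model before_fix is None (no payload is sent), while after_fix
carries the routing hint the client needs.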

Fixes SCYLLADB-2041
2026-05-15 11:56:14 +02:00
Jenkins Promoter
db3c44440b Update pgo profiles - aarch64 2026-05-15 05:49:12 +03:00
Jenkins Promoter
a2fd608b7d Update pgo profiles - x86_64 2026-05-15 05:10:51 +03:00
Ernest Zaslavsky
8d85382c55 cmake: reuse precompiled header in scylla-main target
scylla-precompiled-header defines the PCH (stdafx.hh) with PRIVATE
visibility, so targets linking to it do not inherit the PCH.
scylla-main was missing the PCH entirely, causing files like
sstables_loader.cc to fail with 'no member read_entire_stream' since
that symbol comes from <seastar/util/short_streams.hh> which is
included in stdafx.hh.

PR #29901 worked around this by adding the missing #include directly,
but the real fix is to propagate the PCH to scylla-main — matching
the configure.py behavior where every source file is compiled with
-include-pch stdafx.hh.pch.

Add target_precompile_headers(scylla-main REUSE_FROM
scylla-precompiled-header) so that all sources in scylla-main benefit
from the precompiled header.

Refs: https://github.com/scylladb/scylladb/pull/29901
2026-05-14 19:46:51 +03:00
Ernest Zaslavsky
d0ac01af2f idl: add missing groups_manager.idl.hh to CMakeLists.txt
PR #28843 added strong_consistency/groups_manager.idl.hh to
configure.py but not to idl/CMakeLists.txt, causing the CMake build
to fail with a missing generated header.
2026-05-14 19:46:51 +03:00
Ernest Zaslavsky
c36932d252 scripts: add IDL-generated file comparison to compare_build_systems
Add a 4th check that compares IDL-generated file sets between
configure.py and CMake. Previously only compilation flags, link
targets, and linker settings were compared — a missing IDL entry
(like strong_consistency/groups_manager.idl.hh in PR #28843) would
go undetected.

The extractors parse ninja build statements from both systems and
normalize to a canonical relative path (e.g. cache_temperature.dist.hh)
for comparison. configure.py outputs are filtered by mode; CMake
outputs handle the | separator for implicit outputs in ninja build
lines.
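
A minimal sketch of this kind of extraction, assuming ninja build lines
without escaped spaces or colons. The function names, the example paths and
the rule that the canonical name is whatever follows the last "idl/"
component are illustrative assumptions, not the actual code of
compare_build_systems. Filtering by mode is not modeled.

```python
from typing import List, Optional, Set


def build_outputs(line: str) -> List[str]:
    """Return explicit and implicit outputs of one ninja `build` line."""
    if not line.startswith("build "):
        return []
    outputs_part = line[len("build "):].split(":", 1)[0]
    # `|` separates explicit outputs from implicit ones; both are outputs.
    return [tok for tok in outputs_part.split() if tok != "|"]


def canonical_idl_name(path: str) -> Optional[str]:
    """Reduce a generated header path to a build-system-independent name."""
    idx = path.rfind("idl/")
    if idx == -1 or not path.endswith(".dist.hh"):
        return None
    return path[idx + len("idl/"):]


def idl_outputs(ninja_text: str) -> Set[str]:
    """Collect canonical names of all IDL-generated headers in a ninja file."""
    names = set()
    for line in ninja_text.splitlines():
        for out in build_outputs(line):
            name = canonical_idl_name(out)
            if name is not None:
                names.add(name)
    return names


# Hypothetical excerpts from the two generated ninja files.
configure_py_ninja = """\
build build/dev/gen/idl/cache_temperature.dist.hh: idl idl/cache_temperature.idl.hh
build build/dev/gen/idl/strong_consistency/groups_manager.dist.hh: idl idl/strong_consistency/groups_manager.idl.hh
"""
cmake_ninja = """\
build idl/cache_temperature.dist.hh | idl/cache_temperature.dist.impl.hh: CUSTOM_COMMAND ../idl/cache_temperature.idl.hh
"""

missing_in_cmake = idl_outputs(configure_py_ninja) - idl_outputs(cmake_ninja)
```

Here missing_in_cmake contains only the groups_manager header, which is the
kind of discrepancy the new check is meant to surface.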

Also update the documentation to mention the new check.
2026-05-14 19:46:51 +03:00
Marcin Maliszkiewicz
0574055b73 test: prepare max cells inserts
Switch from raw CQL batch string to using a prepared statement.
The old approach constructed the entire 50-row batch as a single
CQL text string (~19.8 MiB with 32768 column names spelled out
per row). This caused large contiguous allocations in the server.
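
The size difference can be sketched as follows. The table name, column
names, values and exact batch syntax below are hypothetical and scaled down;
the snippet only contrasts a text that repeats every column name per row
with a statement text that is built once.

```python
def raw_batch_text(table, columns, rows):
    """One CQL text containing every row, with all column names repeated per row."""
    names = ", ".join(columns)
    inserts = []
    for row in rows:
        values = ", ".join(str(v) for v in row)
        inserts.append(f"INSERT INTO {table} ({names}) VALUES ({values});")
    return "BEGIN BATCH\n" + "\n".join(inserts) + "\nAPPLY BATCH;"


def prepared_insert_text(table, columns):
    """Text sent once at prepare time; rows are later sent as bound values only."""
    names = ", ".join(columns)
    markers = ", ".join("?" for _ in columns)
    return f"INSERT INTO {table} ({names}) VALUES ({markers})"


# Scaled-down stand-ins for the 32768 columns and 50 rows of the real test.
columns = [f"c{i}" for i in range(1000)]
rows = [[0] * len(columns) for _ in range(50)]

raw = raw_batch_text("ks.t", columns, rows)
prepared = prepared_insert_text("ks.t", columns)
```

The raw text grows with the number of rows, while the prepared text does
not: each row then travels as bound values only.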

Fixes SCYLLADB-1645
2026-05-14 17:25:39 +02:00
Marcin Maliszkiewicz
0fd6f6f292 test: reuse max cells schema
Extract table creation into _create_max_cell_count_table(). Call
it once before the loop instead of creating and dropping the table
on every iteration. Use TRUNCATE instead of DROP TABLE between
iterations to clear data while keeping the schema.

This avoids repeated schema operations that fragment the Seastar
buddy allocator's address space with scattered small allocations.

Refs SCYLLADB-1645
2026-05-14 17:24:53 +02:00
Marcin Maliszkiewicz
ec8f8e3a5b Merge 'test: make test_vector_search_with_vector_store_mock 30 times faster!' from Nadav Har'El
Before this patch,
```
test/cqlpy/run test_vector_search_with_vector_store_mock.py
```

took 34 seconds.

After this patch, it takes **1 second**.

Look at the individual patches for how the magic happened. The first patch lowers the test duration from 34 to 5 seconds, the second patch lowers it further to 1 second.

Closes scylladb/scylladb#29891

* github.com:scylladb/scylladb:
  test/cqlpy: make test_vector_search_with_vector_store_mock faster
  vector-search: reset DNS timeout after changing host
2026-05-14 17:12:47 +02:00
Marcin Maliszkiewicz
3debae9a37 test: limits: skip empty max cells iterations
The doubling loop in test_max_cells started from cells=1. Since
each row has MAX_CELLS_COLUMNS (32768) cells, iterations where
cells < MAX_CELLS_COLUMNS produced zero rows (cells // columns = 0).
Those iterations only did CREATE TABLE / DROP TABLE with no data
inserted.

Start the loop from MAX_CELLS_COLUMNS and use a while loop.
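
The arithmetic can be checked with a short sketch. The upper limit used
below is a hypothetical value chosen for illustration, and the two helper
functions model only the loop bounds, not the test itself.

```python
MAX_CELLS_COLUMNS = 32768


def row_counts_old(limit):
    """Rows per iteration when the doubling loop starts from cells=1."""
    counts = []
    cells = 1
    while cells <= limit:
        counts.append(cells // MAX_CELLS_COLUMNS)
        cells *= 2
    return counts


def row_counts_new(limit):
    """Rows per iteration when the loop starts from MAX_CELLS_COLUMNS."""
    counts = []
    cells = MAX_CELLS_COLUMNS
    while cells <= limit:
        counts.append(cells // MAX_CELLS_COLUMNS)
        cells *= 2
    return counts


limit = 4 * MAX_CELLS_COLUMNS  # hypothetical upper bound
old = row_counts_old(limit)
new = row_counts_new(limit)
empty_iterations = old.count(0)
```

Since 32768 is 2**15, the old loop spends its first 15 iterations
(cells = 1 through 16384) on zero rows; the new loop keeps only the
iterations that insert data.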

Co-authored-by: Dario Mirovic <dario.mirovic@scylladb.com>

Refs SCYLLADB-1645
2026-05-14 17:00:15 +02:00
Botond Dénes
8a305dd6c7 docs: expand OCI Object Storage configuration section
The existing OCI section in admin.rst was a minimal stub that only showed
a config snippet without explaining how to actually set up connectivity.

Add documentation for:
- The OCI S3-compatible endpoint URL format (namespace + region)
- That credentials must be set explicitly via AWS_ACCESS_KEY_ID /
  AWS_SECRET_ACCESS_KEY using OCI Customer Secret Keys (unlike AWS,
  OCI has no instance metadata fallback compatible with STS/EC2)
- A note that iam_role_arn is AWS-specific and should be omitted for OCI

Fixes: SCYLLADB-501

Closes scylladb/scylladb#29689
2026-05-14 16:44:42 +02:00
Piotr Dulikowski
5b269be37b Merge 'test/cluster/test_view_building_coordinator: migrate test from dtest' from Michał Jadwiszczak
Move `materialized_views_test.py::TestMaterializedViews::test_do_not_finish_view_building_with_hints`
test from dtest to test.py.

The dtest throttled IO throughput in the hope that view building would
not finish too soon. This made the test unreliable, which can be avoided
by using error injection to pause view building until the necessary
nodes are stopped.

This patch adds 2 tests: one for a tablet-based view and one for a vnode-based view. Both tests use error injection to pause view building.

Fixes [SCYLLADB-1261](https://scylladb.atlassian.net/browse/SCYLLADB-1261)

The issue was seen in 2026.2, so we should backport this patch to this version.

[SCYLLADB-1261]: https://scylladb.atlassian.net/browse/SCYLLADB-1261?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#29788

* github.com:scylladb/scylladb:
  test/cluster/mv/test_mv_building: add similar test for vnode-based view
  test/cluster/test_view_building_coordinator: migrate test from dtest
  db/view/view_building_worker: add more logs when flushing base table
2026-05-14 15:34:26 +02:00
Michał Jadwiszczak
25c176c1b4 sstables_loader: fix missing include
Commit c97232b introduced use of `seastar::util::read_entire_stream()`,
but it didn't include the relevant header, which causes a compilation
error.

It probably passed CI unnoticed because of precompiled headers.

Refs scylladb#28763

Closes scylladb/scylladb#29901
2026-05-14 15:16:34 +02:00
Piotr Szymaniak
ac3fff897a alternator/doc: update Streams compatibility docs
Alternator Streams graduated from experimental in #29604.  Update the
compatibility and FAQ docs accordingly:

- Replace the "Experimental API features" section with a new
  "Alternator Streams" section that lists known differences without
  the experimental framing.
- Expand the alternator_streams_increased_compatibility paragraph to
  explain both consequences of leaving it off (spurious no-op events
  and inaccurate INSERT/MODIFY distinction) and the performance cost
  of enabling it (LWT path for every write).
- Drop the stale ShardFilter limitation (now implemented).
- Replace the alternator-streams FAQ example with
  strongly-consistent-tables so the multi-feature syntax example
  remains useful.

Fixes SCYLLADB-462

Closes scylladb/scylladb#29695
2026-05-14 15:06:19 +02:00
Michał Jadwiszczak
5c84cff78a test/cluster/mv/test_mv_building: add similar test for vnode-based view
In the dtest repo, the test runs for both vnode-based and tablet-based views.

Since in test.py infra we're using error injection to pause the view
building process, we need separate tests for those two cases.
2026-05-14 10:52:44 +02:00
Piotr Dulikowski
0c016cecc3 Merge 'QOS: self-heal stale V1-to-V2 migration state on upgrade' from Alex Dathskovsky
service_levels: self-heal stale v1 marker after raft topology upgrade

This PR handles an upgrade corner case where a node may already be using
raft topology, while `system.scylla_local` still marks service levels as v1.

The problem was introduced by commit 2917ec5d51
("service:qos: service levels migration"), which added the service-levels
migration from `system_distributed.service_levels` to
`system.service_levels_v2` as part of the raft topology upgrade.

However, if the cluster had no service levels configured, there was no data
to migrate. In that case, the migration path could leave the local version
marker unchanged, so the node would later observe an inconsistent state:

  * raft topology is already enabled;
  * service levels are still marked as v1 in `system.scylla_local`.

Such clusters can be left in a stale state and fail startup during upgrade to
2026.2.

This PR makes the upgrade path self-healing.

The first commit restores `service_level_controller::migrate_to_v2()`, giving
us a group0-based path for writing the service-levels v2 state even after raft
topology is already in use.

The second commit wires this path into startup. When the node detects the
stale raft-topology + service-levels-v1 state, it retries the migration a
bounded number of times and updates the version marker to v2 instead of
failing startup.
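
The bounded retry can be sketched as below. This is an illustrative Python
model, not the C++ startup code: the function name, the callbacks, the
attempt count and the exception type are assumptions, and what the real code
does once all attempts are exhausted is not stated here.

```python
def heal_service_levels_version(read_version, migrate_to_v2, max_attempts=3):
    """Retry the v1-to-v2 migration a bounded number of times.

    Returns True if the version marker reads 2 afterwards, False otherwise.
    """
    for _ in range(max_attempts):
        if read_version() == 2:
            return True
        try:
            migrate_to_v2()
        except RuntimeError:
            # Simulated transient failure; try again on the next iteration.
            continue
    return read_version() == 2


# Fake node state: the marker is stale (v1) and the migration fails twice.
state = {"version": 1, "calls": 0}


def read_version():
    return state["version"]


def migrate_to_v2():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("group0 not ready")
    state["version"] = 2


healed = heal_service_levels_version(read_version, migrate_to_v2, max_attempts=5)
```

In this run the third attempt succeeds, so healed is True and the marker
reads 2 without exhausting the attempt budget.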

With this change, clusters that were left in this stale state can recover
automatically during upgrade to 2026.
Fixes: SCYLLADB-1807

backport: 2026.2, 2026.1; we need this functionality when upgrading older servers

Closes scylladb/scylladb#29749

* github.com:scylladb/scylladb:
  test/auth_cluster: simulate v1 state in self-heal test When skip_service_levels_v2_initialization is used, write an explicit v1 service level version marker while skipping v2 initialization. This lets the restart test exercise self-healing from v1 to v2.
  qos: self-heal stale service levels version on startup
  qos: reintroduce service levels v2 migration self-heal
2026-05-14 10:32:43 +02:00