Compare commits

...

498 Commits

Author SHA1 Message Date
copilot-swe-agent[bot]
39b6ff982c topology_coordinator: suppress cancel warning in should_preempt_balancing
Agent-Logs-Url: https://github.com/scylladb/scylladb/sessions/ff8e4ba3-e470-4446-8a15-9f173b22c277

Co-authored-by: tgrabiec <283695+tgrabiec@users.noreply.github.com>
2026-04-10 19:25:21 +00:00
Michał Hudobski
7d648961ed vector_search: forward non-primary key restrictions to Vector Store service
Include non-primary key restrictions (e.g. regular column filters) in
the filter JSON sent to the Vector Store service. Previously only
partition key and clustering column restrictions were forwarded, so
filtering on regular columns was silently ignored.

Add get_nonprimary_key_restrictions() getter to statement_restrictions.

Add unit tests for non-primary key equality, range, and bind marker
restrictions in filter_test.

Fixes: SCYLLADB-970

Closes scylladb/scylladb#29019
2026-04-10 17:16:29 +02:00
Piotr Dulikowski
3bd770d4d9 Merge 'counters: reuse counter IDs by rack' from Michael Litvak
For counter updates, use a counter ID that is constructed from the
node's rack instead of the node's host ID.

A rack can have at most two active tablet replicas at a time: a single
normal tablet replica, and during tablet migration there are two active
replicas, the normal and pending replica. Therefore we can have two
unique counter IDs per rack that are reused by all replicas in the rack.

We construct the counter ID from the rack UUID, which is constructed
from the name "dc:rack". The pending replica uses a deterministic
variation of the rack's counter ID by negating it.

This improves the performance and size of counter cells by having fewer
unique counter IDs and fewer counter shards in a counter cell.

Previously the number of counter shards was the number of distinct
host_ids that updated the counter, which is typically the number of
nodes in the cluster and grows indefinitely as nodes are replaced.
With the rack-based counter ID, the number of counter shards will be
at most twice the number of distinct racks (including removed racks,
which should not be significant).
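
The scheme above can be sketched in isolation (a hypothetical illustration: `counter_id_for_rack` and `pending_variant` are made-up names, and std::hash stands in for the real name-based UUID construction):

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Toy counter ID; the real one is a UUID derived from the "dc:rack" name.
struct counter_id {
    uint64_t msb;
    uint64_t lsb;
    bool operator==(const counter_id& o) const { return msb == o.msb && lsb == o.lsb; }
};

// One deterministic ID per rack, shared by all replicas in the rack.
counter_id counter_id_for_rack(const std::string& dc, const std::string& rack) {
    uint64_t h = std::hash<std::string>{}(dc + ":" + rack);
    return counter_id{h, ~h};
}

// The pending replica uses a deterministic variation of the rack's ID
// (negation), so normal and pending replicas never collide.
counter_id pending_variant(counter_id id) {
    return counter_id{~id.msb, ~id.lsb};
}
```

Since both IDs are pure functions of the rack name, replicas need no coordination to agree on them.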

Fixes SCYLLADB-356

backport not needed - an enhancement

Closes scylladb/scylladb#28901

* github.com:scylladb/scylladb:
  docs/dev: add counters doc
  counters: reuse counter IDs by rack
2026-04-10 12:24:18 +02:00
Wojciech Mitros
163c6f71d6 transport: refactor result_message bounce interface
Replace move_to_shard()/move_to_host() with as_bounce()/target_shard()/
target_host() to clarify the interface after bounce was extended to
support cross-node bouncing.

- Add virtual as_bounce() returning const bounce* to the base class
  (nullptr by default, overridden in bounce to return this), replacing
  the virtual move_to_shard() which conflated bounce detection with
  shard access
- Rename move_to_shard() -> target_shard() (now non-virtual, returns
  unsigned directly) and move_to_host() -> target_host() on bounce
- Replace dynamic_pointer_cast with static_pointer_cast at call sites
  that already checked as_bounce()
- Move forward declarations of message types before the virtual
  methods so as_bounce() can reference bounce
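
A condensed sketch of the resulting interface shape (hypothetical and simplified; the real classes carry more state, and `try_get_target_shard` is an illustrative helper, not actual code):

```cpp
#include <memory>

struct result_message {
    virtual ~result_message() = default;
    struct bounce;
    // Returns nullptr by default; only the bounce subclass overrides it,
    // separating "is this a bounce?" from shard/host access.
    virtual const bounce* as_bounce() const { return nullptr; }
};

struct result_message::bounce : result_message {
    unsigned shard;
    explicit bounce(unsigned s) : shard(s) {}
    const bounce* as_bounce() const override { return this; }
    // Non-virtual: call sites have already established the type.
    unsigned target_shard() const { return shard; }
};

// Call-site shape: after checking as_bounce(), static_pointer_cast is
// safe, so no dynamic_pointer_cast is needed.
inline bool try_get_target_shard(const std::shared_ptr<result_message>& msg, unsigned& out) {
    if (msg->as_bounce()) {
        out = std::static_pointer_cast<result_message::bounce>(msg)->target_shard();
        return true;
    }
    return false;
}
```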

Fixes: SCYLLADB-1066

Closes scylladb/scylladb#29367
2026-04-10 12:17:43 +02:00
Piotr Dulikowski
32e3a01718 Merge 'service: strong_consistency: Allow for aborting operations' from Dawid Mędrek
Motivation
----------

Since strongly consistent tables are based on the concept of Raft
groups, operations on them can get stuck for indefinite amounts of
time. That may be problematic, and so we'd like to implement a way
to cancel those operations at suitable times.

Description of solution
-----------------------

The situations we focus on are the following:

* Timed-out queries
* Leader changes
* Tablet migrations
* Table drops
* Node shutdowns

We handle each of them and provide validation tests.

Implementation strategy
-----------------------

1. Auxiliary commits.
2. Abort operations on timeout.
3. Abort operations on tablet removal.
4. Extend `client_state`.
5. Abort operation on shutdown.
6. Help `state_machine` be aborted as soon as possible.

Tests
-----

We provide tests that validate the correctness of the solution.

The total time spent on `test_strong_consistency.py`
(measured on my local machine, dev mode):

Before:
```
real    0m31.809s
user    1m3.048s
sys     0m21.812s
```

After:
```
real    0m34.523s
user    1m10.307s
sys     0m27.223s
```

The incremental differences in time can be found in the commit messages.

Fixes SCYLLADB-429

Backport: not needed. This is an enhancement to an experimental feature.

Closes scylladb/scylladb#28526

* github.com:scylladb/scylladb:
  service: strong_consistency: Abort state_machine::apply when aborting server
  service: strong_consistency: Abort ongoing operations when shutting down
  service: client_state: Extend with abort_source
  service: strong_consistency: Handle abort when removing Raft group
  service: strong_consistency: Abort Raft operations on timeout
  service: strong_consistency: Use timeout when mutating
  service: strong_consistency: Fix indentation
  service: strong_consistency: Enclose coordinator methods with try-catch
  service: strong_consistency: Crash at unexpected exception
  test: cluster: Extract default config & cmdline in test_strong_consistency.py
2026-04-10 11:11:21 +02:00
Pavel Emelyanov
0b336da89d Revert "cmake: add missing rolling_max_tracker_test and symmetric_key_test"
This reverts commit 8b4a91982b.

Two commits independently added rolling_max_tracker_test to test/boost/CMakeLists.txt:
8b4a919 cmake: add missing rolling_max_tracker_test and symmetric_key_test
f3a91df test/cmake: add missing tests to boost test suite

The second was merged two days after the first. They didn't conflict at the
code level and applied cleanly, resulting in duplicate add_scylla_test()
entries that break the CMake build:

    CMake Error: add_executable cannot create target
    "test_boost_rolling_max_tracker_test" because another target
    with the same name already exists.

Remove the duplicate.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Reported-by: Łukasz Paszkowski <lukasz.paszkowski@scylladb.com>
2026-04-10 11:19:43 +03:00
Patryk Jędrzejczak
751bf31273 Merge 'More gossiper cleanups' from Gleb Natapov
The PR contains more code cleanups, mostly in the gossiper. More gossiper states are dropped, leaving only NORMAL and SHUTDOWN; all other states are now checked against the raft topology state. Those two remain because the SHUTDOWN state is propagated only through the gossiper, and a node that is not in SHUTDOWN should be in some other state.

No need to backport. Cleanups.

Closes scylladb/scylladb#29129

* https://github.com/scylladb/scylladb:
  storage_service: cleanup unused code
  storage_service: simplify get_peer_info_for_update
  gossiper: send shutdown notifications in parallel
  gms: remove unused code
  virtual_tables: no need to call gossiper if we already know that the node is in shutdown
  gossiper: print node state from raft topology in the logs
  gossiper: use is_shutdown instead of code it manually
  gossiper: mark endpoint_state(inet_address ip) constructor as explicit
  gossiper: remove unused code
  gossiper: drop last use of LEFT state and drop the state
  gossiper: drop unused STATUS_BOOTSTRAPPING state
  gossiper: rename is_dead_state to is_left since this is all that the function checks now.
  gossiper: use raft topology state instead of gossiper one when checking node's state
  storage_service: drop check_for_endpoint_collision function
  storage_service: drop is_first_node function
  gossiper: remove unused REMOVED_TOKEN state
  gossiper: remove unused advertise_token_removed function
2026-04-10 09:56:20 +02:00
Nadav Har'El
6674aa29ca Merge 'Add Cassandra SAI (StorageAttachedIndex) compatibility' from Szymon Wasik
Cassandra's native vector index type is StorageAttachedIndex (SAI). Libraries such as CassIO, LangChain, and LlamaIndex generate `CREATE CUSTOM INDEX` statements using the SAI class name. Previously, ScyllaDB rejected these with "Non-supported custom class".

This PR adds compatibility so that SAI-style CQL statements work on ScyllaDB without modification.

1. **test: enable SAI_VECTOR_ALLOW_CUSTOM_PARAMETERS for Cassandra tests**
   Enables the `SAI_VECTOR_ALLOW_CUSTOM_PARAMETERS` Cassandra system property so that `search_beam_width` tests pass against Cassandra 5.0.7.

2. **test: modernize vector index test comments and fix xfail**
   Updates test comments from "Reproduces" to "Validates fix for" for clarity, and converts the `test_ann_query_with_pk_restriction` xfail into a stripped-down CREATE INDEX syntax test (removing unused INSERT/SELECT lines). Removes the redundant `test_ann_query_with_non_pk_restriction` test.

3. **cql: add Cassandra SAI (StorageAttachedIndex) compatibility**
   Core implementation: the SAI class name is detected and translated to ScyllaDB's native `vector_index`. The fully-qualified class name (`org.apache.cassandra.index.sai.StorageAttachedIndex`) requires exact case; short names (`StorageAttachedIndex`, `sai`) are matched case-insensitively — matching Cassandra's behavior. Non-vector and multi-column SAI targets are rejected with clear errors. Adds `skip_on_scylla_vnodes` fixture, SAI compatibility docs, and the Cassandra compatibility table entry (split into "SAI general" vs "SAI for vector search").

4. **cql: accept source_model option for Cassandra SAI compatibility**
   The `source_model` option is a Cassandra SAI property used by Cassandra libraries (e.g., CassIO) to tag vector indexes with the name of the embedding model. ScyllaDB accepts it for compatibility but does not use it — the validator is a no-op lambda. The option is preserved in index metadata and returned in DESCRIBE INDEX output.
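
The class-name matching described in point 3 can be sketched as follows (a hypothetical helper, not the actual ScyllaDB code: the fully-qualified name requires an exact match, while short names are matched case-insensitively):

```cpp
#include <algorithm>
#include <cctype>
#include <string>

static std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    return s;
}

// True if the CREATE CUSTOM INDEX class name should be treated as SAI.
bool is_sai_class_name(const std::string& name) {
    // The fully-qualified Java class name must match exactly.
    if (name == "org.apache.cassandra.index.sai.StorageAttachedIndex") {
        return true;
    }
    // Short names are accepted in any case, mirroring Cassandra.
    auto lower = to_lower(name);
    return lower == "storageattachedindex" || lower == "sai";
}
```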

- `cql3/statements/create_index_statement.cc`: SAI class detection and rewriting logic
- `index/secondary_index_manager.cc`: case-insensitive class name lookup (lowercasing restored before `classes.find()`)
- `index/vector_index.cc`: `source_model` accepted as a valid option with no-op validator
- `docs/cql/secondary-indexes.rst`: SAI compatibility documentation with `source_model` table row
- `docs/using-scylla/cassandra-compatibility.rst`: SAI entry split into general (not supported) and vector search (supported)
- `test/cqlpy/conftest.py`: `scylla_with_tablets` renamed to `skip_on_scylla_vnodes`
- `test/cqlpy/test_vector_index.py`: SAI tests inlined (no constants), `check_bad_option()` helper for numeric validation, uppercase class name test, merged `source_model` tests with DESCRIBE check

| Backend            | Passed | Skipped | Failed |
|--------------------|--------|---------|--------|
| ScyllaDB (dev)     | 42     | 0       | 0      |
| Cassandra 5.0.7    | 16     | 26      | 0      |

Backport: not needed (new feature).

Fixes: SCYLLADB-239

Closes scylladb/scylladb#28645

* github.com:scylladb/scylladb:
  cql: accept source_model option and show options in DESCRIBE
  cql: add Cassandra SAI (StorageAttachedIndex) compatibility
  test: modernize vector index test comments and fix xfail
  test: enable SAI_VECTOR_ALLOW_CUSTOM_PARAMETERS for Cassandra tests
2026-04-10 10:21:20 +03:00
Avi Kivity
f67d0739d0 test: user_function_test: adjust Lua error message tests
Lua 5.5 changed the error message slightly ("?:-1" -> "?:?"). Relax
the error message tests so they do not depend on this unimportant fragment.

Closes scylladb/scylladb#29414
2026-04-10 01:09:35 +03:00
Piotr Szymaniak
98d6edaa88 alternator: add comment explaining delta_mode::keys in add_stream_options()
Clarify that cdc::delta_mode is ignored by Alternator, so we use the
least expensive mode (keys) to reduce overhead.

Fixes scylladb/scylladb#24812

Closes scylladb/scylladb#29408
2026-04-10 01:07:21 +03:00
Michał Hudobski
c8b9fde828 auth: allow VECTOR_SEARCH_INDEXING permission to access system.tablets
Add system.tablets to the set of system resources that can be
accessed with the VECTOR_SEARCH_INDEXING permission.

Fixes: VECTOR-605

Closes scylladb/scylladb#29397
2026-04-09 21:53:07 +03:00
Szymon Wasik
573def7cd8 cql: accept source_model option and show options in DESCRIBE
Accept the Cassandra SAI 'source_model' option for vector indexes.
This option is used by Cassandra libraries (e.g., CassIO, LangChain)
to tag vector indexes with the name of the embedding model that
produced the vectors.

ScyllaDB does not use the source_model value but stores it and
includes it in the DESCRIBE INDEX output for Cassandra compatibility.

Additionally, extend vector_index::describe() to emit a
WITH OPTIONS = {...} clause containing all user-provided index options
(filtering out system keys: target, class_name, index_version).
This makes options like similarity_function, source_model, etc.
visible in DESCRIBE output.
2026-04-09 17:20:03 +02:00
Szymon Wasik
80a2e4a0ab cql: add Cassandra SAI (StorageAttachedIndex) compatibility
Libraries such as CassIO, LangChain, and LlamaIndex create vector
indexes using Cassandra's StorageAttachedIndex (SAI) class name.
This commit lets ScyllaDB accept these statements without modification.

When a CREATE CUSTOM INDEX statement specifies an SAI class name on a
vector column, ScyllaDB automatically rewrites it to the native
vector_index implementation. Accepted class names (case-insensitive):
  - org.apache.cassandra.index.sai.StorageAttachedIndex
  - StorageAttachedIndex
  - sai

SAI on non-vector columns is rejected with a clear error directing
users to a secondary index instead.

The SAI detection and rewriting logic is extracted into a dedicated
static function (maybe_rewrite_sai_to_vector_index) to keep the
already-long validate_while_executing method manageable.

Multi-column (local index) targets and nonexistent columns are
skipped with continue — the former are treated as filtering columns
by vector_index::check_target(), and the latter are caught later by
vector_index::validate().

Tests that exercise features common to both backends (basic creation,
similarity_function, IF NOT EXISTS, bad options, etc.) now use the
SAI class name with the skip_on_scylla_vnodes fixture so they run
against both ScyllaDB and Cassandra. ScyllaDB-specific tests continue
to use USING 'vector_index' with scylla_only.
2026-04-09 17:20:03 +02:00
Szymon Wasik
fa7edc627c test: modernize vector index test comments and fix xfail
- Change 'Reproduces' to 'Validates fix for' in test comments to
  reflect that the referenced issues are already fixed.
- Condense the VECTOR-179 comment to two lines.
- Replace the xfailed test_ann_query_with_restriction_works_only_on_pk
  with a focused test (test_ann_query_with_pk_restriction) that creates
  a vector index on a table with a PK column restriction, validating
  the VECTOR-374 fix.
2026-04-09 17:20:02 +02:00
Szymon Wasik
4eab050be4 test: enable SAI_VECTOR_ALLOW_CUSTOM_PARAMETERS for Cassandra tests 2026-04-09 17:20:02 +02:00
Andrzej Jackowski
23c386a27f test: perf: add audit-unix-socket-path to perf-simple-query
To allow performance benchmarking with custom syslog sinks.

Example use case:

-- Audit + default syslog: ~100k tps
taskset -c 0,2,4 ./build/release/scylla perf-simple-query --smp 3 --write --duration 30 --audit "syslog" --audit-keyspace "ks" --audit-categories "DCL,DDL,AUTH,DML,QUERY"

```
110263.72 tps ( 66.1 allocs/op,  16.0 logallocs/op,  25.7 tasks/op,  254900 insns/op,  144796 cycles/op,        0 errors)
throughput:
	mean=   107137.48 standard-deviation=3142.98
	median= 106665.00 median-absolute-deviation=1786.03
	maximum=111435.19 minimum=97620.79
instructions_per_op:
	mean=   256311.36 standard-deviation=5037.13
	median= 256288.09 median-absolute-deviation=2223.08
	maximum=274220.89 minimum=248141.40
cpu_cycles_per_op:
	mean=   146443.47 standard-deviation=2844.19
	median= 146001.85 median-absolute-deviation=1514.82
	maximum=157177.54 minimum=142981.03
```

-- Audit + custom syslog: ~400k tps
socat -u UNIX-RECV:/tmp/audit-null.sock,type=2 OPEN:/dev/null
taskset -c 0,2,4 ./build/release/scylla perf-simple-query --smp 3 --write --duration 30 --audit "syslog" --audit-keyspace "ks" --audit-categories "DCL,DDL,AUTH,DML,QUERY" --audit-unix-socket-path /tmp/audit-null.sock

```
404929.62 tps ( 65.9 allocs/op,  16.0 logallocs/op,  25.5 tasks/op,   77406 insns/op,   35559 cycles/op,        0 errors)
throughput:
	mean=   399868.39 standard-deviation=6232.88
	median= 401770.65 median-absolute-deviation=3859.09
	maximum=406126.79 minimum=383434.84
instructions_per_op:
	mean=   77481.26 standard-deviation=168.31
	median= 77405.54 median-absolute-deviation=84.33
	maximum=78081.46 minimum=77332.84
cpu_cycles_per_op:
	mean=   35871.32 standard-deviation=516.83
	median= 35699.70 median-absolute-deviation=251.15
	maximum=37454.86 minimum=35432.60
```

-- No audit: ~800k tps
taskset -c 0,2,4 ./build/release/scylla perf-simple-query --smp 3 --write --duration 30

```
808970.95 tps ( 53.3 allocs/op,  16.0 logallocs/op,  14.9 tasks/op,   49904 insns/op,   20471 cycles/op,        0 errors)
throughput:
	mean=   809065.31 standard-deviation=6222.39
	median= 810507.10 median-absolute-deviation=1827.99
	maximum=815213.41 minimum=782104.84
instructions_per_op:
	mean=   49905.50 standard-deviation=21.81
	median= 49900.12 median-absolute-deviation=7.72
	maximum=50010.97 minimum=49892.57
cpu_cycles_per_op:
	mean=   20429.00 standard-deviation=41.40
	median= 20425.18 median-absolute-deviation=29.11
	maximum=20530.74 minimum=20355.42
```

Closes scylladb/scylladb#29396
2026-04-09 16:00:41 +03:00
Anna Stuchlik
c6587c6a70 doc: Fix malformed markdown link in alternator network docs
Fixes https://github.com/scylladb/scylladb/issues/29400

Closes scylladb/scylladb#29402
2026-04-09 15:54:43 +03:00
Botond Dénes
5886d1841a Merge 'cmake: align CMake build system with configure.py and add comparison script' from Ernest Zaslavsky
Every time someone modifies the build system — adding a source file, changing a compilation flag, or wiring a new test — the change tends to land in only one of our two build systems (configure.py or CMake). Over time this causes three classes of problems:

1. **CMake stops compiling entirely.** Missing defines, wrong sanitizer flags, or misplaced subdirectory ordering cause hard build failures that are only discovered when someone tries to use CMake (e.g. for IDE integration).

2. **Missing build targets.** Tests or binaries present in configure.py are never added to CMake, so `cmake --build` silently skips them. This PR fixes several such cases (e.g. `symmetric_key_test`, `auth_cache_test`, `sstable_tablet_streaming`).

3. **Missing compilation units in targets.** A `.cc` file is added to a test binary in one system but not the other, causing link errors or silently omitted test coverage.

To fix the existing drift and prevent future divergence, this series:

**Adds a build-system comparison script**
(`scripts/compare_build_systems.py`) that configures both systems into a temporary directory, parses their generated `build.ninja` files, and compares per-file compilation flags, link target sets, and per-target libraries. configure.py is treated as the baseline; CMake must match it. The script supports a `--ci` mode suitable for gating PRs that touch
build files.

**Fixes all current mismatches** found by the script:
- Mode flag alignment in `mode.common.cmake` and `mode.Coverage.cmake`
  (sanitizer flags, `-fno-lto`, stack-usage warnings, coverage defines).
- Global define alignment (`SEASTAR_NO_EXCEPTION_HACK`, `XXH_PRIVATE_API`,
  `BOOST_ALL_DYN_LINK`, `SEASTAR_TESTING_MAIN` placement).
- Seastar build configuration (shared vs static per mode, coverage
  sanitizer link options).
- Abseil sanitizer flags (`-fno-sanitize=vptr`).
- Missing test targets in `test/boost/CMakeLists.txt`.
- Redundant per-test flags now covered by global settings.
- Lua library resolution via a custom `cmake/FindLua.cmake` using
  pkg-config, matching configure.py's approach.

**Adds documentation** (`docs/dev/compare-build-systems.md`) describing how to run the script and interpret its output.

No backport needed — this is build infrastructure improvement only.

Closes scylladb/scylladb#29273

* github.com:scylladb/scylladb:
  scripts: remove lua library rename workaround from comparison script
  cmake: add custom FindLua using pkg-config to match configure.py
  test/cmake: add missing tests to boost test suite
  test/cmake: remove per-test LTO disable
  cmake: add BOOST_ALL_DYN_LINK and strip per-component defines
  cmake: move SEASTAR_TESTING_MAIN after seastar and abseil subdirs
  cmake: add -fno-sanitize=vptr for abseil sanitizer flags
  cmake: align Seastar build configuration with configure.py
  cmake: align global compile defines and options with configure.py
  cmake: fix Coverage mode in mode.Coverage.cmake
  cmake: align mode.common.cmake flags with configure.py
  configure.py: add sstable_tablet_streaming to combined_tests
  docs: add compare-build-systems.md
  scripts: add compare_build_systems.py to compare ninja build files
2026-04-09 15:46:09 +03:00
Yaniv Michael Kaul
13879b023f tracing: set_skip_when_empty() for error-path metrics
Add .set_skip_when_empty() to all error-path metrics in the tracing
module. Tracing itself is not a commonly used feature, making all of
these metrics almost always zero:

Tier 1 (very rare - corruption/schema issues):
- tracing_keyspace_helper::bad_column_family_errors: tracing schema
  missing or incompatible, should never happen post-bootstrap
- tracing::trace_errors: internal error building trace parameters

Tier 2 (overload - tracing backend saturated):
- tracing::dropped_sessions: too many pending sessions
- tracing::dropped_records: too many pending records

Tier 3 (general tracing write errors):
- tracing_keyspace_helper::tracing_errors: errors during writes to
  system_traces keyspace

Since tracing is an opt-in feature that most deployments rarely use,
all five metrics are almost always zero and create unnecessary
reporting overhead.

AI-Assisted: yes
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#29346
2026-04-09 14:28:16 +03:00
Michael Litvak
3964040008 docs/dev: add counters doc
Add a documentation of the counters feature implementation in
docs/dev/counters.md.

The documentation is taken from the wiki and updated according to the
current state of the code - legacy details are removed, and a section
about the counter id is added.
2026-04-09 13:08:02 +02:00
Michael Litvak
b71762d5da counters: reuse counter IDs by rack
For counter updates, use a counter ID that is constructed from the
node's rack instead of the node's host ID.

A rack can have at most two active tablet replicas at a time: a single
normal tablet replica, and during tablet migration there are two active
replicas, the normal and pending replica. Therefore we can have two
unique counter IDs per rack that are reused by all replicas in the rack.

We construct the counter ID from the rack UUID, which is constructed
from the name "dc:rack". The pending replica uses a deterministic
variation of the rack's counter ID by negating it.

This improves the performance and size of counter cells by having fewer
unique counter IDs and fewer counter shards in a counter cell.

Previously the number of counter shards was the number of distinct
host_ids that updated the counter, which is typically the number of
nodes in the cluster and grows indefinitely as nodes are replaced.
With the rack-based counter ID, the number of counter shards will be
at most twice the number of distinct racks (including removed racks,
which should not be significant).

Fixes SCYLLADB-356
2026-04-09 13:08:02 +02:00
Yaniv Michael Kaul
2c0076d3ef replica: set_skip_when_empty() for rare error-path metrics
Add .set_skip_when_empty() to four metrics in replica/database.cc that
are only incremented on very rare error paths and are almost always zero:

- database::dropped_view_updates: view updates dropped due to overload.
  NOTE: this metric appears to never be incremented in the current
  codebase and may be a candidate for removal.
- database::multishard_query_failed_reader_stops: documented as a 'hard
  badness counter' that should always be zero. NOTE: no increment site
  was found in the current codebase; may be a candidate for removal.
- database::multishard_query_failed_reader_saves: documented as a 'hard
  badness counter' that should always be zero.
- database::total_writes_rejected_due_to_out_of_space_prevention: only
  fires when disk utilization is critical and user table writes are
  disabled, a very rare operational state.

These metrics create unnecessary reporting overhead when they are
perpetually zero. set_skip_when_empty() suppresses them from metrics
output until they become non-zero.

AI-Assisted: yes
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#29345
2026-04-09 14:07:28 +03:00
Botond Dénes
86417d49de Merge 'transport: improve memory accounting for big responses and slow network' from Marcin Maliszkiewicz
After obtaining the CQL response, check if its actual size exceeds the initially acquired memory permit. If so, acquire additional semaphore units and adopt them into the permit, ensuring accurate memory accounting for large responses.

Additionally, move the permit into a .then() continuation so that the semaphore units are kept alive until write_message finishes, preventing premature release of the memory permit. This matters especially with slow networks and big responses, when buffers can accumulate and deplete a node's memory.
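
The size-adoption step can be illustrated with a toy semaphore (hypothetical types; the real code uses Seastar semaphore units):

```cpp
#include <cassert>
#include <cstddef>

// Toy memory semaphore: tracks how many bytes are still available.
struct memory_semaphore {
    size_t available;
    size_t take(size_t units) { assert(units <= available); available -= units; return units; }
    void give_back(size_t units) { available += units; }
};

// Toy permit: releases its units back to the semaphore when destroyed,
// standing in for the permit held until the response write completes.
struct memory_permit {
    memory_semaphore* sem;
    size_t units;
    void adopt_extra(size_t extra) { units += sem->take(extra); }
    ~memory_permit() { sem->give_back(units); }
};

// If the actual response size exceeds the initial estimate, acquire the
// difference and fold it into the permit so accounting stays accurate.
void account_actual_size(memory_permit& permit, size_t actual_size) {
    if (actual_size > permit.units) {
        permit.adopt_extra(actual_size - permit.units);
    }
}
```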

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1306
Related https://scylladb.atlassian.net/browse/SCYLLADB-740

Backport: all supported versions

Closes scylladb/scylladb#29288

* github.com:scylladb/scylladb:
  transport: add per-service-level pending response memory metric
  transport: hold memory permit until response write completes
  transport: account for response size exceeding initial memory estimate
2026-04-09 13:36:31 +03:00
Yaniv Michael Kaul
5c8b4a003e db: set_skip_when_empty() for rare error-path metrics
Add .set_skip_when_empty() to four metrics in the db module that are
only incremented on very rare error paths and are almost always zero:

- cache::pinned_dirty_memory_overload: described as 'should sit
  constantly at 0, nonzero is indicative of a bug'
- corrupt_data::entries_reported: only fires on actual data corruption
- hints::corrupted_files: only fires on on-disk hint file corruption
- rate_limiter::failed_allocations: only fires when the rate limiter
  hash table is completely full and gives up allocating, requiring
  extreme cardinality pressure

These metrics create unnecessary reporting overhead when they are
perpetually zero. set_skip_when_empty() suppresses them from metrics
output until they become non-zero.

AI-Assisted: yes
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#29344
2026-04-09 13:32:09 +03:00
Gleb Natapov
dbaba7ab8a storage_service: cleanup unused code
Remove unused definition and double includes.
2026-04-09 13:31:41 +03:00
Gleb Natapov
b050b593b3 storage_service: simplify get_peer_info_for_update
It does nothing for fields managed in raft, so drop their processing.
2026-04-09 13:31:41 +03:00
Gleb Natapov
d0576c109f gossiper: send shutdown notifications in parallel 2026-04-09 13:31:40 +03:00
Gleb Natapov
1586fa65af gms: remove unused code
Also move version_string(...) and make_token_string(...) to private: they are internal helpers used only by normal(), not part of the public API.
2026-04-09 13:31:40 +03:00
Gleb Natapov
b2e35c538f virtual_tables: no need to call gossiper if we already know that the node is in shutdown 2026-04-09 13:31:40 +03:00
Gleb Natapov
e17fc180a0 gossiper: print node state from raft topology in the logs
Raft topology has the real node state now. Gossiper state is now set to
NORMAL or SHUTDOWN only.
2026-04-09 13:31:40 +03:00
Gleb Natapov
8439154851 gossiper: use is_shutdown instead of coding it manually 2026-04-09 13:31:39 +03:00
Gleb Natapov
7d700d0377 gossiper: mark endpoint_state(inet_address ip) constructor as explicit
The get_live_members function called is_shutdown with an inet_address
argument, which caused a temporary endpoint_state to be created. Fix
it by prohibiting the implicit conversion and calling the correct
is_shutdown overload instead.
2026-04-09 13:31:39 +03:00
Gleb Natapov
6df4f572d5 gossiper: remove unused code 2026-04-09 13:31:39 +03:00
Gleb Natapov
67102496c8 gossiper: drop last use of LEFT state and drop the state
Decommission sets the LEFT gossiper state only to prevent a shutdown
notification from being issued by the node during shutdown. Since the
notification code now checks the state in raft topology, this is no
longer needed.
2026-04-09 13:31:39 +03:00
Gleb Natapov
54d2c95094 gossiper: drop unused STATUS_BOOTSTRAPPING state 2026-04-09 13:31:38 +03:00
Gleb Natapov
7c895ced19 gossiper: rename is_dead_state to is_left since this is all that the function checks now. 2026-04-09 13:31:38 +03:00
Gleb Natapov
7dfb0577b8 gossiper: use raft topology state instead of gossiper one when checking node's state
Raft topology state is the source of truth for a node's state, so use
it instead of the gossiper's.
2026-04-09 13:31:38 +03:00
Gleb Natapov
c17c4806a1 storage_service: drop check_for_endpoint_collision function
All the checks it performs are also done by the join coordinator, which
uses the more reliable raft state instead of the gossiper's.
2026-04-09 13:31:37 +03:00
Gleb Natapov
1ac8edb22b storage_service: drop is_first_node function
It makes no sense now, since the first node to bootstrap is determined
by the discover_group0 algorithm.
2026-04-09 13:31:37 +03:00
Gleb Natapov
681aa9ebe1 gossiper: remove unused REMOVED_TOKEN state 2026-04-09 13:31:37 +03:00
Gleb Natapov
5af17aa578 gossiper: remove unused advertise_token_removed function 2026-04-09 13:31:36 +03:00
Dawid Mędrek
f0dfe29d88 service: strong_consistency: Abort state_machine::apply when aborting server
The state machine used by strongly consistent tablets may need to
resolve pending mutations that the local schema is not yet sufficient
to interpret [1]. To deal with that, we perform a read barrier, which
may block for a long time.

When a strongly consistent tablet is being removed, we'd like to cancel
all ongoing executions of `state_machine::apply`: the shard is no
longer responsible for the tablet, so it doesn't matter what the outcome
is.

---

In the implementation, we abort the operations by simply throwing
an exception from `state_machine::apply` without doing anything else.
That's a red flag, considering that it may lead to the instance
being killed on the spot [2].

Fortunately for us, strongly consistent tables use the default Raft
server implementation, i.e. `raft::server_impl`, which actually
handles one type of an exception thrown by the method: namely,
`abort_requested_exception`, which is the default exception thrown
by `seastar::abort_source` [3]. We leverage this property.
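
The exception-based cancellation pattern can be illustrated generically (plain C++, not the actual Seastar/Raft code; names are hypothetical):

```cpp
#include <atomic>
#include <stdexcept>

// Stand-in for the exception thrown by seastar::abort_source.
struct abort_requested_exception : std::runtime_error {
    abort_requested_exception() : std::runtime_error("abort requested") {}
};

struct state_machine {
    std::atomic<bool> aborted{false};
    void apply() {
        if (aborted.load()) {
            // The one exception type the server loop tolerates.
            throw abort_requested_exception{};
        }
        // ... resolve and apply pending mutations ...
    }
};

// Server-loop shape: this one exception type means clean cancellation;
// any other exception would propagate and kill the instance.
bool run_applier(state_machine& sm) {
    try {
        sm.apply();
        return true;
    } catch (const abort_requested_exception&) {
        return true;  // cleanly aborted
    }
}
```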

---

Unfortunately, `raft::server_impl::abort` isn't perfectly suited for
us. If we look into its code, we'll see that the relevant portion of
the procedure boils down to three steps:

1. Prevent scheduling adding new entries.
2. Wait for the applier fiber.
3. Abort the state machine.

Since aborting the state machine happens only after the applier fiber
has already finished, there will no longer be anything to abort. Either
all executions of `state_machine::apply` have already finished, or they
are hanging and we cannot do anything.

That's a pre-existing problem that we won't be solving here (even
though it's possible). We hope the problem will be solved, and it seems
likely: the code suggests that the behavior is not intended. For more
details, see e.g. [4].

---

We provide two validation tests. They simulate aborting
`state_machine::apply` in two different scenarios:

* when the table is dropped (which should also cover the case of tablet
  migration),
* when the node is shutting down.

The value of the tests isn't high since they don't ensure that the
state of the group is still valid (though it should be), nor do they
perform any other check. Instead, we rely on the testing framework to
spot any anomalies or errors. That's probably the best we can do at
the moment.

Unfortunately, both tests are marked as skipped because of the current
limitations of `raft::server_impl::abort` described above and in [4].

References:
[1] 4c8dba1
[2] See the description of `raft::state_machine` in `raft/raft.hh`.
[3] See `server_impl::applier_fiber` in `raft/server.cc`.
[4] SCYLLADB-1056
2026-04-09 11:36:51 +02:00
Dawid Mędrek
ad8a263683 service: strong_consistency: Abort ongoing operations when shutting down
These changes are complementary to those from a recent commit where we
handled aborting ongoing operations during tablet events, such as
tablet migration. In this commit, we consider the case of shutting down
a node.

When a node is shutting down, we eventually close the connections. When
the client can no longer get a response from the server, it makes no
sense to continue with the queries. We'd like to cancel them at that
point.

We leverage the abort source passed down via `client_state` down to
the strongly consistent coordinator. This way, the transport layer can
communicate with it and signal that the queries should be canceled.
The abort source is triggered by the CQL server (cf.
`generic_server::server::{stop,shutdown}`).
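The plumbing can be sketched in Python under stated assumptions (this is a hypothetical model of `client_state` carrying an optional abort source, not the actual C++ code):

```python
import threading
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClientState:
    # Hypothetical mirror of client_state holding an optional abort source.
    abort_source: Optional[threading.Event] = None

def run_query(state: ClientState, steps: int) -> str:
    # A long-running operation periodically checks the abort source, the
    # way the strongly consistent coordinator would between Raft calls.
    for _ in range(steps):
        if state.abort_source is not None and state.abort_source.is_set():
            return "aborted"
    return "done"
```

On shutdown, the server triggers the shared abort source once, and every in-flight query observes it at its next check.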

---

Note that this is not an optional change. In fact, if we don't abort
those requests, we might hang for an indefinite amount of time when
executing the following code in `main.cc`:

```
// Register at_exit last, so that storage_service::drain_on_shutdown will be called first
auto do_drain = defer_verbose_shutdown("local storage", [&ss] {
    ss.local().drain_on_shutdown().get();
});
```

The problem boils down to the fact that `generic_server::server::stop`
will wait for all connections to be closed, but that won't happen until
all ongoing operations (at least those to strongly consistent tables)
are finished.

It's important to highlight that even though we hang on this, the
client can no longer get any response. Thus, it's crucial that at that
point we simply abort ongoing operations to proceed with the rest of
shutdown.

---

Two tests are added to verify that the implementation is correct:
one focusing on local operations, the other on a forwarded write.

Difference in time spent on the whole test file
`test_strong_consistency.py` on my local machine, in dev mode:

Before:
```
real    0m31.775s
user    1m4.475s
sys     0m22.615s
```

After:
```
real    0m32.024s
user    1m10.751s
sys     0m23.871s
```

Individual runs of the added tests:

test_queries_when_shutting_down:
```
real    0m12.818s
user    0m36.726s
sys     0m4.577s
```

test_abort_forwarded_write_upon_shutdown:
```
real    0m12.930s
user    0m36.622s
sys     0m4.752s
```
2026-04-09 11:36:17 +02:00
Dawid Mędrek
4a87bdc778 service: client_state: Extend with abort_source
We make `client_state` store a pointer to an `abort_source`. This will
be useful in the following commit that will implement aborting ongoing
requests to strongly consistent tables upon connection shutdowns.
It might also be useful in some other places in the code in the future.

We set the abort source for client states in relevant places.
2026-04-09 11:35:35 +02:00
Dawid Mędrek
89c049b889 service: strong_consistency: Handle abort when removing Raft group
When a strongly consistent Raft group is being removed, it means one of
the following cases:

(A) The node is shutting down and it's simply part of the shutdown
    procedure.

(B) The tablet is somehow leaving the replica. For example, due to:
    - Tablet migration
    - Tablet split/merge
    - Tablet removal (e.g. because the table is dropped)

In this commit, we focus on case (B). Case (A) will be handled in the
following one.

---

There are almost no changes in the code, and there's a reason for that.

First, let's note that we've already implemented aborting timed-out
requests. There is a limit to how long a query can run, so sooner or
later it will finish, regardless of what we do.

Second, we need to ask ourselves whether the case we're considering in
this commit (i.e. case (B)) is a situation where we'd like to speed up
the process. The answer is no.

Tablet migrations are effectively internal operations that are invisible
to the users. User requests are, quite obviously, the opposite of that.
Because of that, we want to patiently wait for the queries to finish or
time out, even though it's technically possible to trigger an abort
earlier.

Lastly, the changes in the code that actually appear in this commit are
not completely irrelevant either. We consider the important case of
the `leader_info_updater` fiber and argue that it's safe to not pass
any abort source to the Raft methods used by it.

---

Unfortunately, we don't have tablet migrations implemented yet [1],
so our testing capabilities are limited. Still, we provide a new test
that corresponds to case (B) described above. We simulate a tablet
migration by dropping a table and observe how reads and writes behave
in such a situation. There's no extremely careful validation involved
there, but that's what we can have for the time being.

Difference in time spent on the whole test file
`test_strong_consistency.py` on my local machine, in dev mode:

Before:
```
real  0m30.841s
user  1m3.294s
sys   0m21.091s
```

After:
```
real    0m31.775s
user    1m4.475s
sys     0m22.615s
```

The time spent on the new test only:
```
real    0m5.264s
user    0m34.646s
sys     0m3.374s
```

References:
[1] SCYLLADB-868
2026-04-09 11:35:31 +02:00
Dawid Mędrek
7dcc3e85b9 service: strong_consistency: Abort Raft operations on timeout
If a query, either a write, or a read to a strongly consistent table,
times out, we immediately abort the operation and throw an exception.

Unfortunately, due to the inconsistency in exception types thrown
on timeout by the many methods we use in the code, it results in
pretty messy `try-catch` clauses. Perhaps there's a better alternative
to this, but it's beyond the scope of this work, so we leave it as-is.

We provide a validation test that consists of three cases corresponding
to reads, writes, and waiting for the leader. They verify that the code
works as expected in all affected places.

A comparison of time spent on the whole `test_strong_consistency.py` on
my local machine, in dev mode:

Before:
```
real    0m32.185s
user    0m55.391s
sys     0m15.745s
```

After:
```
real  0m30.841s
user  1m3.294s
sys   0m21.091s
```

The time spent on the new test only:
```
real  0m7.077s
user  0m35.359s
sys   0m3.717s
```
2026-04-09 11:35:04 +02:00
Piotr Szymaniak
65a1bdd368 docs: document Alternator auditing in the operator-facing auditing guide
- Document Alternator (DynamoDB-compatible API) auditing support in
  the operator-facing auditing guide (docs/operating-scylla/security/auditing.rst)
- Cover operation-to-category mapping, operation field format,
  keyspace/table filtering, and audit log examples
- Document the audit_tables=alternator.<table> shorthand format
- Minor wording improvements throughout (Scylla -> ScyllaDB,
  clarify default audit backend)

Closes scylladb/scylladb#29231
2026-04-09 12:26:57 +03:00
Dawid Mędrek
2243e0ffea service: strong_consistency: Use timeout when mutating
We remove the inconsistency between reads and writes to strongly
consistent tables. Before the commit, only reads used a timeout.
Now, writes do as well.

Although the parameter isn't used yet, that will change in the following
commit. This is a prerequisite for it.
2026-04-09 11:25:57 +02:00
Dawid Mędrek
fd9c907be1 service: strong_consistency: Fix indentation 2026-04-09 11:25:57 +02:00
Dawid Mędrek
ca7f24516e service: strong_consistency: Enclose coordinator methods with try-catch
We enclose `coordinator::{mutate,query}` with `try-catch` clauses. They
do nothing at the moment, but we'll use them later. We do this now to
avoid noise in the upcoming commits.

We'll fix the indentation in the following commit.
2026-04-09 11:25:57 +02:00
Dawid Mędrek
e9ea9e7259 service: strong_consistency: Crash at unexpected exception
The loop shouldn't throw any exception other than the ones already
covered by the `catch` clauses. Crash, at least when
`abort_on_internal_error` is set, if we catch any other type since
that may be a sign of a bug.
2026-04-09 11:25:57 +02:00
Dawid Mędrek
f499a629ab test: cluster: Extract default config & cmdline in test_strong_consistency.py
All used configs and cmdlines share the same values. Let's extract them
to avoid repeating them every time a new test is written. Those options
should be enabled for all tests in the file anyway.
2026-04-09 11:25:57 +02:00
Geoff Montee
7d7ec7025e docs: Document system keyspaces for developers / internal usage
Fixes #29043 with the following docs changes:

- docs/dev/system-keyspaces.md: Added a new file that documents all keyspaces created internally

Closes scylladb/scylladb#29044
2026-04-09 11:49:58 +03:00
Guy Shtub
40a861016a docs/faq.rst: Fixing small spelling mistake
Closes scylladb/scylladb#29131
2026-04-09 11:48:46 +03:00
Pavel Emelyanov
78f5bab7cf table: Add formatter for group_id argument in tablet merge exception message
Fixes: SCYLLADB-1432

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29143
2026-04-09 11:45:57 +03:00
Botond Dénes
fbbe2bdce8 Merge 'Introduce repair_service::config and cut dependency from db::config' from Pavel Emelyanov
Spreading db::config around and making all services depend on it is not nice. Most other services that need configuration provide their own config that's populated from db::config in main.cc/cql_test_env.cc and use it, not the global config.

This PR does the same for repair_service.

Enhancing component dependencies, not backporting

Closes scylladb/scylladb#29153

* github.com:scylladb/scylladb:
  repair: Remove db/config.hh from repair/*.cc files
  repair: Move repair_multishard_reader options onto repair_service::config
  repair: Move critical_disk_utilization_level onto repair_service::config
  repair: Move repair_partition_count_estimation_ratio onto repair_service::config
  repair: Move repair_hints_batchlog_flush_cache_time_in_ms onto repair_service::config
  repair: Move enable_small_table_optimization_for_rbno onto repair_service::config
  repair: Introduce repair_service::config
2026-04-09 11:44:25 +03:00
Botond Dénes
76c8794f4f Merge 'Strong consistency: allow taking snapshots (but not transfer) and make them less likely' from Piotr Dulikowski
While working on benchmarks for strong consistency we noticed that the raft logic attempted to take snapshots during the benchmark. Snapshot transfer is not implemented for strong consistency yet and the methods that take or transfer snapshots throw exceptions. This causes the raft groups to stop working completely.

While implementing snapshot transfers is out of scope, we can implement some mitigations now to stop the tests from breaking:

- The first commit adjusts the configuration options. First, it disables periodic snapshotting (i.e. creating a snapshot every X log entries). Second, it increases the memory threshold for the raft log before which a snapshot is created from 2MB to 10MB.
- The second commit relaxes the take snapshot / drop snapshot methods and makes it possible to actually use them - they are no-ops. It is still forbidden to transfer snapshots.

I am including both commits because applying only the first one didn't completely prevent the issue from occurring when testing locally.

Refs: SCYLLADB-1115

Strong consistency is experimental, no need for backport.

Closes scylladb/scylladb#29189

* github.com:scylladb/scylladb:
  strong_consistency: fake taking and dropping snapshots
  strong_consistency: adjust limits for snapshots
2026-04-09 11:44:03 +03:00
Anna Stuchlik
dd34d2afb4 doc: remove references to old versions from Docker Hub docs
This commit removes references to ScyllaDB versions ("Since x.y")
from the ScyllaDB documentation on Docker Hub, as they are
redundant and confusing (some versions are super ancient).

Fixes SCYLLADB-1212

Closes scylladb/scylladb#29204
2026-04-09 11:43:40 +03:00
Botond Dénes
c162277b28 Merge 'Perform full connection set-up for CertificateAuthorization in process_startup()' from Pavel Emelyanov
The code responds early with a READY message, but lacks some necessary set-up, namely:

* update_scheduling_group(): without it, the connection runs under the default scheduling group instead of the one mapped to the user's service level.

* on_connection_ready(): without it, the connection never releases its slot in the uninitialized-connections concurrency semaphore (acquired at connection creation), leaking one unit per cert-authenticated connection for the lifetime of the connection.

* _authenticating = false / _ready = true: without them, system.clients reports connection_stage = AUTHENTICATING forever instead of READY (not critical, but not nice either)

The PR fixes it and adds a regression test that (for sanity) also covers the AllowAll and Password authenticators

Fixes SCYLLADB-1226

Present since 2025.1, probably worth backporting

Closes scylladb/scylladb#29220

* github.com:scylladb/scylladb:
  transport: fix process_startup cert-auth path missing connection-ready setup
  transport: test that connection_stage is READY after auth via all process_startup paths
2026-04-09 11:43:02 +03:00
Raphael S. Carvalho
16e387d5f9 repair/replica: Fix race window where post-repair data is wrongly promoted to repaired
During incremental repair, each tablet replica holds three SSTable views:
UNREPAIRED, REPAIRING, and REPAIRED.  The repair lifecycle is:

  1. Replicas snapshot unrepaired SSTables and mark them REPAIRING.
  2. Row-level repair streams missing rows between replicas.
  3. mark_sstable_as_repaired() runs on all replicas, rewriting the
     SSTables with repaired_at = sstables_repaired_at + 1 (e.g. N+1).
  4. The coordinator atomically commits sstables_repaired_at=N+1 and
     the end_repair stage to Raft, then broadcasts
     repair_update_compaction_ctrl which calls clear_being_repaired().

The bug lives in the window between steps 3 and 4.  After step 3, each
replica has on-disk SSTables with repaired_at=N+1, but sstables_repaired_at
in Raft is still N.  The classifier therefore sees:

  is_repaired(N, sst{repaired_at=N+1}) == false
  sst->being_repaired == null   (lost on restart, or not yet set)

and puts them in the UNREPAIRED view.  If a new write arrives and is
flushed (repaired_at=0), STCS minor compaction can fire immediately and
merge the two SSTables.  The output gets repaired_at = max(N+1, 0) = N+1
because compaction preserves the maximum repaired_at of its inputs.

Once step 4 commits sstables_repaired_at=N+1, the compacted output is
classified REPAIRED on the affected replica even though it contains data
that was never part of the repair scan.  Other replicas, which did not
experience this compaction, classify the same rows as UNREPAIRED.  This
divergence is never healed by future repairs because the repaired set is
considered authoritative.  The result is data resurrection: deleted rows
can reappear after the next compaction that merges unrepaired data with the
wrongly-promoted repaired SSTable.

The fix has two layers:

Layer 1 (in-memory, fast path): mark_sstable_as_repaired() now also calls
mark_as_being_repaired(session) on the new SSTables it writes.  This keeps
them in the REPAIRING view from the moment they are created until
repair_update_compaction_ctrl clears the flag after step 4, covering the
race window in the normal (no-restart) case.

Layer 2 (durable, restart-safe): a new is_being_repaired() helper on
tablet_storage_group_manager detects the race window even after a node
restart, when being_repaired has been lost from memory.  It checks:

  sst.repaired_at == sstables_repaired_at + 1
  AND tablet transition kind == tablet_transition_kind::repair

Both conditions survive restarts: repaired_at is on-disk in SSTable
metadata, and the tablet transition is persisted in Raft.  Once the
coordinator commits sstables_repaired_at=N+1 (step 4), is_repaired()
returns true and the SSTable naturally moves to the REPAIRED view.
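The two-layer classification can be sketched in Python. The `is_repaired` predicate here is inferred from the description (an SSTable is repaired once its on-disk `repaired_at` is covered by the committed `sstables_repaired_at`); all names and shapes are illustrative, not the actual classifier code:

```python
REPAIR = "repair"  # stand-in for tablet_transition_kind::repair

def is_repaired(sstables_repaired_at, sst_repaired_at):
    # Assumed predicate: repaired once the committed watermark covers
    # the SSTable's on-disk repaired_at (0 means never repaired).
    return sst_repaired_at != 0 and sst_repaired_at <= sstables_repaired_at

def is_being_repaired(sstables_repaired_at, sst_repaired_at, transition_kind):
    # Durable race-window check: both inputs survive a restart
    # (repaired_at in SSTable metadata, the transition kind in Raft).
    return (sst_repaired_at == sstables_repaired_at + 1
            and transition_kind == REPAIR)

def classify(sstables_repaired_at, sst_repaired_at, transition_kind):
    if is_repaired(sstables_repaired_at, sst_repaired_at):
        return "REPAIRED"
    if is_being_repaired(sstables_repaired_at, sst_repaired_at, transition_kind):
        return "REPAIRING"
    return "UNREPAIRED"
```

In the race window (watermark still N=1, SSTable already rewritten with repaired_at=2, transition kind `repair`), the SSTable lands in REPAIRING rather than UNREPAIRED, which is the whole point of the fix.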

The classifier in make_repair_sstable_classifier_func() is updated to call
is_being_repaired(sst, sstables_repaired_at) in place of the previous
sst->being_repaired.uuid().is_null() check.

A new test, test_incremental_repair_race_window_promotes_unrepaired_data,
reproduces the bug by:
  - Running repair round 1 to establish sstables_repaired_at=1.
  - Injecting delay_end_repair_update to hold the race window open.
  - Running repair round 2 so all replicas complete mark_sstable_as_repaired
    (repaired_at=2) but the coordinator has not yet committed step 4.
  - Writing post-repair keys to all replicas and flushing servers[1] to
    create an SSTable with repaired_at=0 on disk.
  - Restarting servers[1] so being_repaired is lost from memory.
  - Waiting for autocompaction to merge the two SSTables on servers[1].
  - Asserting that the merged SSTable contains post-repair keys (the bug)
    and that servers[0] and servers[2] do not see those keys as repaired.

NOTE FOR MAINTAINER: Copilot initially only implemented Layer 1 (the
in-memory being_repaired guard), missing the restart scenario entirely.
I pointed out that being_repaired is lost on restart and guided Copilot
to add the durable Layer 2 check.  I also polished the implementation:
moving is_being_repaired into tablet_storage_group_manager so it can
reuse the already-held _tablet_map (avoiding an ERM lookup and try/catch),
passing sstables_repaired_at in from the classifier to avoid re-reading it,
and using compaction_group_for_sstable inside the function rather than
threading a tablet_id parameter through the classifier.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1239.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29244
2026-04-09 11:42:28 +03:00
Dawid Mędrek
a8bc90a375 Merge 'cql3: fix DESCRIBE INDEX WITH INTERNALS name' from Piotr Smaron
This series fixes two related inconsistencies around secondary-index
names.
1. `DESCRIBE INDEX ... WITH INTERNALS` returned the backing
   materialized-view name in the `name` column instead of the logical
   index name.
2. The snapshot REST API accepted backing table names for MV-backed
   secondary indexes, but not the logical index names exposed to users.

The snapshot side now resolves logical secondary-index names to backing
table names where applicable, reports logical index names in snapshot
details, rejects vector index names with HTTP 400, and keeps multi-keyspace
DELETE atomic by resolving all keyspaces before deleting anything.
The tests were also extended accordingly, and the snapshot test helper
was fixed to clean up multi-table snapshots using one DELETE per table.

Fixes: SCYLLADB-1122

Minor bugfix, no need to backport.

Closes scylladb/scylladb#29083

* github.com:scylladb/scylladb:
  cql3: fix DESCRIBE INDEX WITH INTERNALS name
  test: add snapshot REST API tests for logical index names
  test: fix snapshot cleanup helper
  api: clarify snapshot REST parameter descriptions
  api: surface no_such_column_family as HTTP 400
  db: fix clear_snapshot() atomicity and use C++23 lambda form
  db: normalize index names in get_snapshot_details()
  db: add resolve_table_name() to snapshot_ctl
2026-04-09 08:37:51 +03:00
Piotr Dulikowski
ec0231c36c Merge 'db/view/view_building_worker: lock staging sstables mutex for all necessary shards when creating tasks' from Michał Jadwiszczak
To create `process_staging` view building tasks, we first need to collect information about them on shard 0, create the necessary mutations, commit them to group0 and move the staging sstable objects to their original shards.

But there is a possible race after committing the group0 command and before moving the staging sstables to their shards. Between those two events, the coordinator may schedule freshly created tasks and dispatch them to the worker but the worker won't have the sstables objects because they weren't moved yet.

This patch fixes the race by holding `_staging_sstables_mutex` locks from all necessary shards while executing `create_staging_sstable_tasks()`. With this, even if a task is scheduled and dispatched quickly, the worker will wait to execute it until the sstable objects have been moved and the locks are released.

Fixes SCYLLADB-816

This PR should be backported to all versions containing view building coordinator (2025.4 and newer).

Closes scylladb/scylladb#29174

* github.com:scylladb/scylladb:
  db/view/view_building_worker: fix indentation
  db/view/view_building_worker: lock staging sstables mutex for necessary shards when creating tasks
2026-04-09 08:37:51 +03:00
Piotr Smaron
d458ff50b0 cql3: fix DESCRIBE INDEX WITH INTERNALS name
DESCRIBE INDEX ... WITH INTERNALS returned the name of
the backing materialized view in the name column instead
of the logical index name.

Return the logical index name from schema::describe()
for index schemas so all callers observe the
user-facing name consistently.

Fixes: SCYLLADB-1122
2026-04-08 13:38:17 +02:00
Piotr Smaron
04837ba20f test: add snapshot REST API tests for logical index names
Add focused REST coverage for logical secondary-index
names in snapshot creation, deletion, and details
output.

Also cover vector-index rejection and verify
multi-keyspace delete resolves all keyspaces before
deleting anything so mixed index kinds cannot cause
partial removal.
2026-04-08 13:38:17 +02:00
Piotr Smaron
6b85da3ce3 test: fix snapshot cleanup helper
The snapshot REST helper cleaned up multi-table
snapshots with a single DELETE request that passed a
comma-separated cf filter, but the API accepts only one
table name there.

Delete each table snapshot separately so existing tests
that snapshot multiple tables use the API as
documented.
2026-04-08 13:36:27 +02:00
Piotr Smaron
3090684dad api: clarify snapshot REST parameter descriptions
Document the current /storage_service/snapshots behavior
more accurately.

For DELETE, cf is a table filter applied independently
in each keyspace listed in kn. If cf is omitted or
empty, snapshots for all tables are eligible, and
secondary indexes can be addressed by their logical
index name.
2026-04-08 13:36:27 +02:00
Piotr Smaron
6ee75c74bd api: surface no_such_column_family as HTTP 400
Snapshot requests that name a non-existent table or a
non-snapshotable logical index currently surface an
internal server error.

Translate no_such_column_family into a bad request so
callers get a client-facing error that matches the
invalid input.
2026-04-08 13:36:27 +02:00
Piotr Smaron
7d83a264ac db: fix clear_snapshot() atomicity and use C++23 lambda form
clear_snapshot() applies a table filter independently in
each keyspace, so logical index names must be resolved
per keyspace on the delete path as well.

Resolve all keyspaces before deleting anything so a later
failure cannot partially remove a snapshot, and use the
explicit-object-parameter coroutine lambda form for the
asynchronous implementation.
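The resolve-before-delete pattern can be sketched as follows (a minimal Python sketch; `resolve` and `delete` are hypothetical stand-ins for the per-keyspace name resolution and the actual snapshot removal):

```python
def clear_snapshots(names, resolve, delete):
    # Resolve every name up front; any failure raises before a single
    # delete happens, so a bad name cannot cause partial removal.
    resolved = [resolve(name) for name in names]
    for backing_table in resolved:
        delete(backing_table)
```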
2026-04-08 13:36:27 +02:00
Piotr Smaron
39baa1870e db: normalize index names in get_snapshot_details()
Snapshot details exposed backing secondary-index view
names instead of logical index names.

Normalize index entries in get_snapshot_details() so the
REST API reports the user-facing name, and update the
existing REST test to assert that behavior directly.
2026-04-08 13:36:27 +02:00
Piotr Smaron
9c37f1def2 db: add resolve_table_name() to snapshot_ctl
The snapshot REST API accepted backing secondary-index
table names, but not logical index names.

Introduce resolve_table_name() so snapshot creation can
translate a logical index name to the backing table when
the index is materialized as a view.
2026-04-08 13:36:27 +02:00
Petr Gusev
7750d5737c strong consistency: replace local consistency with global
Currently we don't support 'local' consistency, which would
imply maintaining separate raft group for each dc. What we
support is actually 'global' consistency -- one raft group
per tablet replica set. We don't plan to support local
consistency for the first GA.

Closes scylladb/scylladb#29221
2026-04-08 12:52:32 +02:00
Patryk Jędrzejczak
850db950f8 Merge 'raft: include demoted voters in read barrier during joint config' from Qian Cheng
Hi, thanks for Scylla!

We found a small issue in tracker::set_configuration() during joint consensus and put together a fix.

When a server is demoted from voter to non-voter, set_configuration processes the current config first (can_vote=false), then the previous config. But when it finds the server already in the progress map (tracker.cc:118), it hits `continue` without updating can_vote. So the server's follower_progress::can_vote stays false even though it's still a voter in the previous config.

This causes broadcast_read_quorum (fsm.cc:1055) to skip the demoted server, reducing the pool of responders. Since committed() correctly includes the server in _previous_voters for quorum calculation, read barriers can stall if other servers are slow.

The fix is to use configuration::can_vote() in tracker::set_configuration.
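A toy Python model of the progress-map construction makes the mismatch visible (shapes and names are illustrative, not the raft C++ code; `fixed=True` models the `configuration::can_vote()` behavior):

```python
def set_configuration(current, previous, fixed):
    # current/previous map server id -> is_voter in that config; the
    # result maps server id -> follower_progress.can_vote.
    progress = {}
    for cfg in (current, previous):
        for server, is_voter in cfg.items():
            if server in progress:
                if fixed:
                    # configuration::can_vote(): voter in either config.
                    progress[server] = progress[server] or is_voter
                continue  # buggy path: stale can_vote is left untouched
            progress[server] = is_voter
    return progress
```

For a server demoted from voter to non-voter, the buggy path leaves `can_vote=False` from the current config even though the previous config still counts it toward the read quorum.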

We included a reproduction unit test (test_tracker_voter_demotion_joint_config) that extracts the set_configuration algorithm and demonstrates the mismatch. We weren't able to build the full Scylla test suite to add an in-tree test, so we kept it as a standalone file for reference.

No backport: the bug is non-critical and the change needs some soak time in master.

Closes scylladb/scylladb#29226

* https://github.com/scylladb/scylladb:
  fix: use is_voter::yes instead of true in test assertions
  test: add tracker voter demotion test to fsm_test.cc
  fix: use configuration::can_vote() in tracker::set_configuration
2026-04-08 12:37:27 +02:00
Qian-Cheng-nju
a416238155 test: add tracker voter demotion test to fsm_test.cc 2026-04-08 12:37:19 +02:00
Qian-Cheng-nju
f72528c759 raft: use configuration::can_vote() in tracker::set_configuration 2026-04-08 12:37:16 +02:00
Michał Jadwiszczak
568f20396a test: fix flaky test_create_index_synchronous_updates trace event race
The test_create_index_synchronous_updates test in test_secondary_index_properties.py
was intermittently failing with 'assert found_wanted_trace' because the expected
trace event 'Forcing ... view update to be synchronous' was missing from the
trace events returned by get_query_trace().

Root cause: trace events are written asynchronously to system_traces.events.
The Python driver's populate() method considers a trace complete once the
session row in system_traces.sessions has duration IS NOT NULL, then reads
events exactly once. Since the session row and event rows are written as
separate mutations with no transactional guarantee, the driver can read an
incomplete set of events.

Evidence from the failed CI run logs:
- The entire test (CREATE TABLE through DROP TABLE) completed in ~300ms
  (01:38:54,859 - 01:38:55,157)
- The INSERT with tracing happened in a ~50ms window between the second
  CREATE INDEX completing (01:38:55,108) and DROP TABLE starting
  (01:38:55,157)
- The 'Forcing ... synchronous' trace message is generated during the
  INSERT write path (db/view/view.cc:2061), so it was produced, but
  not yet flushed to system_traces.events when the driver read them
- This matches the known limitation documented in test/alternator/
  test_tracing.py: 'we have no way to know whether the tracing events
  returned is the entire trace'

Fix: replace the single-shot trace.events read with a retry loop that
directly queries system_traces.events until the expected event appears
(with a 30s timeout). Use ConsistencyLevel.ONE since system_traces has
RF=2 and cqlpy tests run on a single-node cluster.
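The retry loop can be sketched like this (a hedged sketch; `query_events` abstracts the repeated CQL read of `system_traces.events`, and the parameter names are illustrative):

```python
import time

def wait_for_trace_event(query_events, wanted_substring, timeout_s=30.0):
    # Trace events are written asynchronously, so poll until the expected
    # event shows up or the timeout expires.
    deadline = time.monotonic() + timeout_s
    while True:
        events = query_events()
        if any(wanted_substring in event for event in events):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(0.1)
```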

The same race condition pattern exists in test_mv_synchronous_updates in
test_materialized_view.py (which this test was modeled after), so the
same fix is proactively applied there as well.

Fixes SCYLLADB-1314

Closes scylladb/scylladb#29374
2026-04-08 12:35:10 +02:00
Raphael S. Carvalho
f941a77867 scripts/base36-uuid: dump date in UTC
Previously, the timestamp decoded from a timeuuid was printed using the
local timezone via datetime.fromtimestamp(), which produces different
output depending on the machine's locale settings.

ScyllaDB logs are emitted in UTC by default. Printing the decoded date
in UTC makes it straightforward to correlate SSTable identifiers with
log entries without having to mentally convert timezones.

Also fix the embedded pytest assertion, which was accidentally correct
only on machines in UTC+8 — it now uses an explicit UTC-aware datetime.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29253
2026-04-08 12:19:55 +03:00
Yaniv Michael Kaul
c385c0bdf9 .github/workflows/call_validate_pr_author_email.yml: add missing workflow permissions
Add explicit permissions block (contents: read, pull-requests: write,
statuses: write) matching the requirements of the called reusable
workflow which checks out code, posts PR comments, and sets commit
statuses. Fixes code scanning alert #172.

Closes scylladb/scylladb#29183
2026-04-08 12:19:55 +03:00
Pavel Emelyanov
788ecaa682 api: Fix enable_injection to accept case-insensitive bool parameter
Replace strict case-sensitive '== "True"' check with strcasecmp(..., "true")
so that Python's str(True) -> "True" is properly recognized. Accepts any
case variation of "true" ("True", "TRUE", etc.), with empty string
defaulting to false.
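A minimal Python equivalent of the new rule (the server side uses `strcasecmp`; this sketch mirrors its accept set: any case variation of "true" is true, everything else, including the empty string, is false):

```python
def parse_bool_param(value: str) -> bool:
    # Mirrors strcasecmp(value, "true") == 0: case-insensitive match
    # against "true"; the empty string naturally defaults to False.
    return value.casefold() == "true"
```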

Maintains backward compatibility with out-of-tree tests that rely on
Python's bool stringification.

The goal is to reduce the number of distinct ways API handlers use to
convert string http query parameters into bool variables. This place is the
only one that simply compares param to "True".

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29236
2026-04-08 12:19:55 +03:00
Avi Kivity
0fd9ea9701 abseil: update to lts_2026_01_07
Switch to branch lts_2026_01_07, which is exactly equal to
upstream now.

There were no notable changes in the release notes, but the
new versions are more friendly to newer compilers (specifically,
in include hygiene).

configure.py needs a few library updates; cmake works without
change.

scylla-gdb.py updated for new hash table layout (by Claude Opus 4.6).

* abseil d7aaad83...255c84da (1179):
  > Abseil LTS branch, Jan 2026, Patch 1 (#2007)
  > Cherry-picks for LTS 20260107 (#1990)
  > Apply LTS transformations for 20260107 LTS branch (#1989)
  > Mark legacy Mutex methods and MutexLock pointer constructors as deprecated
  > `cleanup`: specify that it's safe to use the class in a signal handler.
  > Suppress bugprone-use-after-move in benign cases
  > StrFormat: format scientific notation without heap allocation
  > Introduce a legacy copy of GetDebugStackTraceHook API.
  > Report 1ns instead of 0ns for probe_benchmarks. Some tools incorrectly assume that benchmark was not run if 0ns reported.
  > Add absl::chunked_queue
  > `CRC32` version of `CombineContiguous` for length <= 32.
  > Add `absl::down_cast`
  > Fix FixedArray iterator constructor, which should require input_iterator, not forward_iterator
  > Add a latency benchmark for hashing a pair of integers.
  > Delete absl::strings_internal::STLStringReserveAmortized()
  > As IsAtLeastInputIterator helper
  > Use StringAppendAndOverwrite() in CEscapeAndAppendInternal()
  > Add support for absl::(u)int128 in FastIntToBuffer()
  > absl/strings: Prepare helper for printing objects to string representations.
  > Use SimpleAtob() for parsing bool flags
  > No-op changes to relative timeout support code.
  > Adjust visibility of heterogeneous_lookup_testing.h
  > Remove -DUNORDERED_SET_CXX17 since the macro no longer exists
  > [log] Prepare helper for streaming container contents to strings.
  > Restrict the visibility of some internal testing utilities
  > Add absl::linked_hash_set and absl::linked_hash_map
  > [meta] Add constexpr testing helper.
  > BUILD file reformatting.
  > `absl/meta`: Add C++17 port of C++20 `requires` expression for internal use
  > Remove the implementation of `absl::string_view`, which was only needed prior to C++17. `absl::string_view` is now an alias for `std::string_view`. It is recommended that clients simply use `std::string_view`.
  > No public description
  > absl/flags: Stop echoing file content in flagfile parsing errors. Modified ArgsList::ReadFromFlagfile to redact the content of unexpected lines from error messages.
  > Refactor the declaration of `raw_hash_set`/`btree` to omit default template parameters from the subclasses.
  > Import of CCTZ from GitHub.
  > Add ABSL_ATTRIBUTE_LIFETIME_BOUND to Flag help generator
  > Correct `Mix4x16Vectors` comment.
  > Special implementation for string hash with sizes greater than 64.
  > Reorder function parameters so that hash state is the first argument.
  > Search more aggressively for open slots in absl::internal_stacktrace::BorrowedFixupBuffer
  > Implement SpinLockHolder in terms of std::lock_guard.
  > No public description
  > Avoid discarding test matchers.
  > Import of CCTZ from GitHub.
  > Automated rollback of commit 9f40d6d6f3cfc1fb0325dd8637eb65f8299a4b00.
  > Enable clang-specific warnings on the clang-cl build instead of just trying to be MSVC
  > Enable clang-specific warnings on the clang-cl build instead of just trying to be MSVC
  > Make AnyInvocable remember more information
  > Add further diagnostics under clang for string_view(nullptr)
  > Import of CCTZ from GitHub.
  > Document the differing trimming behavior of absl::Span::subspan() and std::span::subspan()
  > Special implementation for string hash with sizes in range [33, 64].
  > Add the deleted string_view(std::nullptr_t) constructor from C++23
  > CI: Use a cached copy of GoogleTest in CMake builds if possible to minimize the possibility of errors downloading from GitHub
  > CI: Enable libc++ hardening in the ASAN build for even more checks https://libcxx.llvm.org/Hardening.html
  > Call the common case of AllocateBackingArray directly instead of through the function pointer.
  > Change AlignedType to have a void* array member so that swisstable backing arrays end up in the pointer-containing partition for heap partitioning.
  > base: Discourage use of ABSL_ATTRIBUTE_PACKED
  > Revert: Add an attribute to HashtablezInfo which performs a bitwise XOR on all hashes. The purposes of this attribute is to identify if identical hash tables are being created. If we see a large number of identical tables, it's likely the code can be improved by using a common table as opposed to keep rebuilding the same one.
  > Import of CCTZ from GitHub.
  > Record insert misses in hashtable profiling.
  > Add absl::StatusCodeToStringView.
  > Add a missing dependency on str_format that was being pulled in transitively
  > Pico-optimize `SkipWhitespace` to use `StripLeadingAsciiWhitespace`.
  > absl::string_view: Upgrade the debug assert on the single argument char* constructor to ABSL_HARDENING_ASSERT
  > Use non-stack storage for stack trace buffers
  > Fixed incorrect include for ABSL_NAMESPACE_BEGIN
  > Add ABSL_REFACTOR_INLINE to separate the inliner directive from the deprecated directive so that we can give users a custom deprecation message.
  > Reduce stack usage when unwinding without fixups
  > Reduce stack usage when unwinding from 170 to 128 on x64
  > Rename RecordInsert -> RecordInsertMiss.
  > PR #1968: Use std::move_backward within InlinedVector's Storage::Insert
  > Use the new absl::StringResizeAndOverwrite() in CUnescape()
  > Explicitly instantiate common `raw_hash_set` backing array functions.
  > Rollback reduction of maximum load factor. Now it is back to 28/32.
  > Export Mutex::Dtor from shared libraries in NDEBUG mode
  > Allow `IsOkAndHolds` to rely on duck typing for matching `StatusOr` like types instead of uniquely `absl::StatusOr`, e.g. `google::cloud::StatusOr`.
  > Fix typo in macro and add missing static_cast for WASM builds.
  > windows(cmake): add abseil_test_dll to target link libraries when required
  > Handle empty strings in `SimpleAtof` after stripping whitespace
  > Avoid using a thread_local in an inline function since this causes issues on some platforms.
  > (Roll forward) Change Abseil's SpinLock adaptive_spin_count to a class static variable that can be set by tcmalloc friend classes.
  > Change Abseil's SpinLock adaptive_spin_count to a class static variable that can be set by tcmalloc friend classes.
  > Change Abseil's SpinLock adaptive_spin_count to a class static variable that can be set by tcmalloc friend classes.
  > Fixes for String{Resize|Append}AndOverwrite: StringAppendAndOverwrite() should always call StringResizeAndOverwrite() with at least capacity() in case the standard library decides to shrink the buffer (Fixes #1965); small refactor to make the minimum growth an addition for clarity and to make it easier to test 1.5x growth in the future; turn an ABSL_HARDENING_ASSERT into a ThrowStdLengthError; add a missing std::move
  > Correct the supported features of Status Matchers
  > absl/time: Use "memory order acquire" for loads, which would allow for the safe removal of the data memory barrier.
  > Use the new absl::StringResizeAndOverwrite() in string escaping utilities
  > Add an internal-only helper StringAppendAndOverwrite() similar to StringResizeAndOverwrite() but optimized for repeated appends, using exponential growth to ensure amortized complexity of increasing a string size by a small amount is O(1).
  > Release `ABSL_EXPECT_OK` and `ABSL_ASSERT_OK`.
  > Fix the CHECK_XX family of macros to not print `char*` arguments as C-strings if the comparison happened as pointers. Printing as pointers is more relevant to the result of the comparison.
  > Rollback StringAppendAndOverwrite() - the problem is that StringResizeAndOverwrite has MSAN testing of the entire string. This causes quadratic MSAN verification on small appends.
  > Add an internal-only helper StringAppendAndOverwrite() similar to StringResizeAndOverwrite() but optimized for repeated appends, using exponential growth to ensure amortized complexity of increasing a string size by a small amount is O(1).
  > PR #1961: Fix Clang warnings on powerpc
  > Use the new absl::StringResizeAndOverwrite() in string escaping utilities
  > Use the new absl::StringResizeAndOverwrite() in string escaping utilities
  > macOS CI: Move the Bazel vendor_dir to ${HOME} to workaround a Bazel issue where it does not work when it is in ${TMP} and also fix the quoting which was causing it to incorrectly receive the argument
  > Use __msan_check_mem_is_initialized for detailed MSan report
  > Optimize stack unwinding by reducing `AddressIsReadable` calls.
  > Add internal API to allow bypassing stack trace fixups when needed
  > absl::StrFormat: improve test coverage with scientific exponent test cases
  > Add throughput and latency benchmarks for `absl::ToDoubleXYZ` functions.
  > CordzInfo: Use absl::NoDestructor to remove a global destructor. Chromium requires no global destructors.
  > string_view: Enable std::view and std::borrowed_range
  > cleanup: s/logging_internal/log_internal/ig for consistency
  > Use the new absl::StringResizeAndOverwrite() in string escaping utilities
  > Use the new absl::StringResizeAndOverwrite() in string escaping utilities
  > Use the new absl::StringResizeAndOverwrite() in absl::AsciiStrTo{Lower|Upper}
  > Use the new absl::StringResizeAndOverwrite() in absl::StrJoin()
  > Use the new absl::StringResizeAndOverwrite() in absl::StrCat()
  > string_view: Fix include order
  > Don't pass nullptr as the 1st arg of `from_chars`
  > absl/types: format code with clang-format.
  > Validate absl::StringResizeAndOverwrite op has written bytes as expected.
  > Skip the ShortStringCollision test on WASM.
  > Rollback `absl/types`: format code with clang-format.
  > Remove usage of the WasmOffsetConverter for Wasm / Emscripten stack-traces.
  > Use the new absl::StringResizeAndOverwrite() in absl::CordCopyToString()
  > Remove an undocumented behavior of --vmodule and absl::SetVLogLevel that could set a module_pattern to defer to the global vlog threshold.
  > Update to rules_cc 0.2.9
  > Avoid redefine warnings with ntstatus constants
  > PR #1944: Use same element-width for non-temporal loads and stores on Arm
  > absl::StringResizeAndOverwrite(): Add the requirement that the only value that can be written to buf[size] is the terminator character.
  > absl/types: format code with clang-format.
  > Minor formatting changes.
  > Remove `IntIdentity` and `PtrIdentity` from `raw_hash_set_probe_benchmark`.
  > Automated rollback of commit cad60580dba861d36ed813564026d9774d9e4e2b.
  > FlagStateInterface implementors need only support being restored once.
  > Clarify the post-condition of `reserve()` in Abseil hash containers.
  > Clarify the post-condition of `reserve()` in Abseil hash containers.
  > Represent dropped samples in hashtable profile.
  > Add lifetimebound to absl::implicit_cast and make it work for rvalue references as it already does with lvalue references
  > Clean up a doc example where we had `absl_nonnull` and `= nullptr;`
  > Change Cordz to synchronize tracked cords with Snapshots / DeleteQueue
  > Minor refactor to `num_threads` in deadlock test
  > Rename VLOG macro parameter to match other uses of this pseudo type.
  > `time`: Fix indentation
  > Automated Code Change
  > Adds `absl::StringResizeAndOverwrite` as a polyfill for C++23's `std::basic_string<CharT,Traits,Allocator>::resize_and_overwrite`
  > Internal-only change
  > absl/time: format code with clang-format.
  > No public description
  > Expose typed releasers of externally appended memory.
  > Fix __declspec support for ABSL_DECLARE_FLAG()
  > Annotate absl::AnyInvocable as an owner type via [[gsl::Owner]] and absl_internal_is_view = std::false_type
  > Annotate absl::FunctionRef as a view type via [[gsl::Pointer]] and absl_internal_is_view
  > Remove unnecessary dep on `core_headers` from the `nullability` cc_library
  > type_traits: Add type_identity and type_traits_t backfills
  > Refactor raw_hash_set range insertion to call private insert_range function.
  > Fix bug in absl::FunctionRef conversions from non-const to const
  > PR #1937: Simplify ConvertSpecialToEmptyAndFullToDeleted
  > Improve absl::FunctionRef compatibility with C++26
  > Add a workaround for unused variable warnings inside of not-taken if-constexpr codepaths in older versions of GCC
  > Annotate ABSL_DIE_IF_NULL's return type with `absl_nonnull`
  > Move insert index computation into `PrepareInsertLarge` in order to reduce inlined part of insert/emplace operations.
  > Automated Code Change
  > PR #1939: Add missing rules_cc loads
  > Expose (internally) a LogMessage constructor taking file as a string_view for (internal, upcoming) FFI integration.
  > Fixed up some #includes in mutex.h
  > Make absl::FunctionRef support non-const callables, aligning it with std::function_ref from C++26
  > Move capacity update in `Grow1To3AndPrepareInsert` after accessing `common.infoz()` to prevent assertion failure in `control()`.
  > Fix check_op(s) compilation failures on gcc 8 which eagerly tries to instantiate std::underlying_type for non-num types.
  > Use `ABSL_ATTRIBUTE_ALWAYS_INLINE`for lambda in `find_or_prepare_insert_large`.
  > Mark the implicit floating operators as constexpr for `absl::int128` and `absl::uint128`
  > PR #1931: raw_hash_set: fix instantiation for recursive types on MSVC with /Zc:__cplusplus
  > Add std::pair specializations for IsOwner and IsView
  > Cast ABSL_MIN_LOG_LEVEL to absl::LogSeverityAtLeast instead of absl::LogSeverity.
  > Fix a corner case in the aarch64 unwinder
  > Fix inconsistent nullability annotation in ReleasableMutexLock
  > Remove support for Native Client
  > Rollback f040e96b93dba46e8ed3ca59c0444cbd6c0a0955
  > When printing CHECK_XX failures and both types are unprintable, don't bother printing " (UNPRINTABLE vs. UNPRINTABLE)".
  > PR #1929: Fix shorten-64-to-32 warning in stacktrace_riscv-inl.inc
  > Refactor `find_or_prepare_insert_large` to use a single return statement using a lambda.
  > Use possible CPUs to identify NumCPUs() on Linux.
  > Fix incorrect nullability annotation of `absl::Cord::InlineRep::set_data()`.
  > Move SetCtrl* family of functions to cc file.
  > Change absl::InlinedVector::clear() so that it does not deallocate any allocated space. This allows allocations to be reused and matches the behavior specification of std::vector::clear().
  > Mark Abseil container algorithms as `constexpr` for C++20.
  > Fix `CHECK_<OP>` ambiguous overload for `operator<<` in older versions of GCC when C-style strings are compared
  > stacktrace_test: avoid spoiling errno in the test signal handler.
  > Optimize `CRC32AcceleratedX86ARMCombinedMultipleStreams::Extend` by interleaving the `CRC32_u64` calls at a lower level.
  > stacktrace_test: avoid spoiling errno in the test signal handler.
  > stacktrace_test: avoid spoiling errno in the test signal handler.
  > std::multimap::find() is not guaranteed to return the first entry with the requested key. Any may be returned if many exist.
  > Mark `/`, `%`, and `*` operators as constexpr when intrinsics are available.
  > Add the C++20 string_view constructor that uses iterators
  > Implement absl::erase_if for absl::InlinedVector
  > Adjust software prefetch to fetch 5 cachelines ahead, as benchmarking suggests this should perform better.
  > Reduce maximum load factor to 27/32 (from 28/32).
  > Remove unused include
  > Remove unused include statement
  > PR #1921: Fix ABSL_BUILD_DLL mode (absl_make_dll) with mingw
  > PR #1922: Enable mmap for WASI if it supports the mman header
  > Rollback C++20 string_view constructor that uses iterators due to broken builds
  > Add the C++20 string_view constructor that uses iterators
  > Bump versions of dependencies in MODULE.bazel
  > Automated Code Change
  > PR #1918: base: add musl + ppc64le fallback for UnscaledCycleClock::Frequency
  > Optimize crc32 Extend by removing obsolete length alignment.
  > Fix typo in comment of `ABSL_ATTRIBUTE_UNUSED`.
  > Mark AnyInvocable as being nullability compatible.
  > Ensure stack usage remains low when unwinding the stack, to prevent stack overflows
  > Shrink #if ABSL_HAVE_ATTRIBUTE_WEAK region sizes in stacktrace_test.cc
  > <filesystem> is not supported for XTENSA. Disable it in //absl/hash/internal/hash.h.
  > Use signal-safe dynamic memory allocation for stack traces when necessary
  > PR #1915: Fix SYCL Build Compatibility with Intel LLVM Compiler on Windows for abseil
  > Import of CCTZ from GitHub.
  > Tag tests that currently fail on ios_sim_arm64 with "no_test_ios_sim_arm64"
  > Automated Code Change
  > Automated Code Change
  > Import of CCTZ from GitHub.
  > Move comment specific to pointer-taking MutexLock variant to its definition.
  > Add lifetime annotations to MutexLock, SpinLockHolder, etc.
  > Add lifetimebound annotations to absl::MakeSpan and absl::MakeConstSpan to detect dangling references
  > Remove comment mentioning dereferenceability.
  > Add referenceful MutexLock with Condition overload.
  > Mark SpinLock camel-cased methods as ready for inlining.
  > Whitespace change
  > In logging tests that write expectations against `ScopedMockLog::Send`, suppress the default behavior that forwards to `ScopedMockLog::Log` so that unexpected logs are printed with full metadata.  Many of these tests are poking at those metadata, and a failure message that doesn't include them is unhelpful.
  > Add ABSL_ATTRIBUTE_LIFETIME_BOUND to absl::ClippedSubstr
  > Inline internal usages of Mutex::Lock, etc. in favor of lock.
  > Inline internal usages of pointerful SpinLockHolder/MutexLock.
  > Remove wrong comment in Cord::Unref
  > Update the crc32 dynamic dispatch table with newer platforms.
  > PR #1914: absl/base/internal/poison.cc: Minor build fix
  > Accept references on SpinLockHolder/MutexLock
  > Import of CCTZ from GitHub.
  > Fix typos in comments.
  > Inline SpinLock Lock->lock, Unlock->unlock internal to Abseil.
  > Rename Mutex methods to use the typical C++ lower case names.
  > Rename SpinLock methods to use the typical C++ lower case names.
  > Add an assert that absl::StrSplit is not called with a null char* argument.
  > Fix sign conversion warning
  > PR #1911: Fix absl_demangle_test on ppc64
  > Disallow using a hash function whose return type is smaller than size_t.
  > Optimize CRC-32C extension by zeroes
  > Deduplicate stack trace implementations in stacktrace.cc
  > Align types of location_table_ and mapping_table_ keys (-Wshorten-64-to-32).
  > Move SigSafeArena() out to absl/base/internal/low_level_alloc.h
  > Allow CHECK_<OP> variants to be used with unprintable types.
  > Import of CCTZ from GitHub.
  > Adds required load statements for C++ rules to BUILD and bzl files.
  > Disable sanitizer bounds checking in ComputeZeroConstant.
  > Roll back NDK weak symbol mode for backtrace() due to internal test breakage
  > Add converter for extracting SwissMap profile information into a https://github.com/google/pprof suitable format for inspection.
  > Allocate memory for frames and sizes during stack trace fix-up when no memory is provided
  > Support NDK weak symbol mode for backtrace() on Android.
  > Change skip_empty_or_deleted to not use groups.
  > Fix bug of dereferencing invalidated iterator in test case.
  > Refactor: split erase_meta_only into large and small versions.
  > Fix a TODO to use std::is_nothrow_swappable when it became available.
  > Clean up the testing of alternate options that were removed in previous changes
  > Only use generic stacktrace when ABSL_HAVE_THREAD_LOCAL.
  > Automated Code Change
  > Add triviality tests for absl::Span
  > Loosen the PointerAlignment test to allow up to 5 stuck bits to avoid flakiness.
  > Prevent conversion constructions from absl::Span to itself
  > Skip flaky expectations in waiter_test for MSVC.
  > Refactor: call AssertIsFull from iterator::assert_is_full to avoid passing the same arguments repeatedly.
  > In AssertSameContainer, remove the logic checking for whether the iterators are from SOO tables or not since we don't use it to generate a more informative debug message.
  > Remove unused NonIterableBitMask::HighestBitSet function.
  > Refactor: move iterator unchecked_* members before data members to comply with Google C++ style guide.
  > Mix pointers once instead of twice now that we've improved mixing on 32-bit platforms and improved the kMul constant.
  > Remove unused utility functions/constants.
  > Revert a change for breaking downstream third party libs
  > Remove unneeded include from cord_rep_btree_navigator.h
  > Refactor: move find_first_non_full into raw_hash_set.cc.
  > Perform stronger mixing on 32-bit platforms and enable the LowEntropyStrings test.
  > Include deallocated caller-provided size in delete hooks.
  > Roll back one more time: In debug mode, assert that the probe sequence isn't excessively long.
  > Allow a `std::move` of `delimiter_` to happen in `ByString::ByString(ByString&&)`. Right now the move ctor is making a copy because the source object is `const`.
  > Assume that control bytes don't alias CommonFields.
  > Consistently use [[maybe_unused]] in raw_hash_set.h for better compiler warning compatibility.
  > Roll forward: In debug mode, assert that the probe sequence isn't excessively long.
  > Add a new test for hash collisions for short strings when PrecombineLengthMix has low quality.
  > Refactor: define CombineRawImpl for repeated `Mix(state ^ value, kMul)` operations.
  > Automated Code Change
  > Mark hash_test as large so that the timeout is increased.
  > Change the value of kMul to have higher entropy and prevent collisions when keys are aligned integers or pointers.
  > Fix LIFETIME annotations for op*/op->/value operators for reference types.
  > Update StatusOr to support lvalue reference value types.
  > Rollback debug assertion that the probe sequence isn't excessively long.
  > AnyInvocable: Fix operator==/!= comments
  > In debug mode, assert that the probe sequence isn't excessively long.
  > Improve NaN handling in absl::Duration arithmetic.
  > Change PrecombineLengthMix to sample data from kStaticRandomData.
  > Fix includes and fuse constructors of SpinLock.
  > Enable `operator==` for `StatusOr` only if the contained type is equality-comparable
  > Enable SIMD memcpy-crc on ARM cores.
  > Improve mixing on 32-bit platforms.
  > Change DurationFromDouble to return -InfiniteDuration() for all NaNs.
  > Change return type of hash internal `Seed` to `size_t` from `uint64_t`
  > CMake: Add a fatal error when the compiler defaults to or is set to a C++ language standard prior to C++17.
  > Make bool true hash be ~size_t{} instead of 1 so that all bits are different between true/false instead of only one.
  > Automated Code Change
  > Pass swisstable seed as seed to absl::Hash so we can save an XOR in H1.
  > Add support for scoped enumerations in CHECK_XX().
  > Revert no-inline on Voidify::operator&&() -- caused unexpected binary size growth
  > Mark Voidify::operator&&() as no-inline. This improves stack trace for `LOG(FATAL)` with optimization on.
  > Refactor long strings hash computations and move `len <= PiecewiseChunkSize()` out of the line to keep only one function call in the inlined hash code.
  > rotr/rotl: Fix undefined behavior when passing INT_MIN as the number of positions to rotate by
  > Reorder members of MixingHashState to comply with Google C++ style guide ordering of type declarations, static constants, ctors, non-ctor functions.
  > Delete unused function ShouldSampleHashtablezInfoOnResize.
  > Remove redundant comments that just name the following symbol without providing additional information.
  > Remove unnecessary modification of growth info in small table case.
  > Suppress CFI violation on VDSO call.
  > Replace WeakMix usage with Mix and change H2 to use the most significant 7 bits - saving 1 cycle in H1.
  > Fix -Wundef warning
  > Fix conditional constexpr in ToInt64{Nano|Micro|Milli}seconds under GCC7 and GCC8 using an else clause as a workaround
  > Enable CompressedTupleTest.NestedEbo test case.
  > Lift restriction on using EBCO[1] for nested CompressedTuples. The current implementation of CompressedTuple explicitly disallows EBCO for cases where CompressedTuples are nested. This is because the implementation for a tuple with EBCO-compatible element T inherits from Storage<T, I>, where I is the index of T in the tuple, and
  > absl::string_view: assert against (data() == nullptr && size() != 0)
  > Fix a false nullability warning in [Q]CHECK_OK by replacing nullptr with an empty char*
  > Make `combine_contiguous` to mix length in a weak way by adding `size << 24`, so that we can avoid a separate mixing of size later. The empty range is mixing 0x57 byte.
  > Add a test case that -1.0 and 1.0 have different hashes.
  > Update CI to a more recent Clang on Linux x86-64
  > `absl::string_view`: Add a debug assert to the single-argument constructor that the argument is not `nullptr`.
  > Fix CI on macOS Sequoia
  > Use Xcode 16.3 for testing
  > Use a proper fix instead of a workaround for a parameter annotated absl_nonnull since the latest Clang can see through the workaround
  > Assert that SetCtrl isn't called on small tables - there are no control bytes in such cases.
  > Use `MaskFullOrSentinel` in `skip_empty_or_deleted`.
  > Reduce flakiness in MockDistributions.Examples test case.
  > Rename PrepareInsertNonSoo to PrepareInsertLarge now that it's no longer used in all non-SOO cases.
  > PR #1895: use c++17 in podspec
  > Avoid hashing the key in prefetch() for small tables.
  > Remove template alias nullability annotations.
  > Add `Group::MaskFullOrSentinel` implementation without usage.
  > Move `hashtable_control_bytes` tests into their own file.
  > Simplify calls to `EqualElement` by introducing `equal_to` helper function.
  > Do `common.increment_size()` directly in SmallNonSooPrepareInsert if inserting to reserved 1 element table.
  > Import of CCTZ from GitHub.
  > Small cleanup of `infoz` processing to get the logic out of the line or removed.
  > Extract the entire PrepareInsert to Small non SOO table out of the line.
  > Take `get_hash` implementation out of the SwissTable class to minimize number of instantiations.
  > Change kEmptyGroup to kDefaultIterControl now that it's only used for default-constructed iterators.
  > [bits] Add tests for return types
  > Avoid allocating control bytes in capacity==1 swisstables.
  > PR #1888: Adjust Table.GrowExtremelyLargeTable to avoid OOM on i386
  > Avoid mixing after `Hash64` calls for long strings by passing `state` instead of `Seed` to low level hash.
  > Indent absl container examples consistently
  > Revert- Doesn't actually work because SWIG doesn't use the full preprocessor
  > Add tags to skip some tests under UBSAN.
  > Avoid subtracting `it.control()` and `table.control()` in single element table during erase.
  > Remove the `salt` parameter from low level hash and use a global constant. That may potentially remove some loads.
  > In SwissTable, don't hash the key when capacity<=1 on insertions.
  > Remove the "small" size designation for thread_identity_test, which causes the test to timeout after 60s.
  > Add comment explaining math behind expressions.
  > Exclude SWIG from ABSL_DEPRECATED and ABSL_DEPRECATE_AND_INLINE
  > stacktrace_x86: Handle nested signals on altstack
  > Import of CCTZ from GitHub.
  > Simplify MixingHashState::Read9To16 to not depend on endianness.
  > Delete deprecated `absl::Cord::Get` and its remaining call sites.
  > PR #1884: Remove duplicate dependency
  > Remove relocatability test that is no longer useful
  > Import of CCTZ from GitHub.
  > Fix a bug of casting sizeof(slot_type) to uint16_t instead of uint32_t.
  > Rewrite `WideToUtf8` for improved readability.
  > Avoid requiring default-constructability of iterator type in algorithms that use ContainerIterPairType
  > Added test cases for invalid surrogates sequences.
  > Use __builtin_is_cpp_trivially_relocatable to implement absl::is_trivially_relocatable in a way that is compatible with PR2786 in the upcoming C++26.
  > Remove dependency on `wcsnlen` for string length calculation.
  > Stop being strict about validating the "clone" part of mangled names
  > Add support for logging wide strings in `absl::log`.
  > Deprecate `ABSL_HAVE_STD_STRING_VIEW`.
  > Change some nullability annotations in absl::Span to absl_nullability_unknown to workaround a bug that makes nullability checks trigger in foreach loops, while still fixing the -Wnullability-completeness warnings.
  > Linux CI update
  > Fix new -Wnullability-completeness warnings found after upgrading the Clang version used in the Linux ARM CI to Clang 19.
  > Add __restrict for uses of PolicyFunctions.
  > Use Bazel vendor mode to cache external dependencies on Windows and macOS
  > Move PrepareInsertCommon from header file to cc file.
  > Remove the explicit from the constructor to a test allocator in hash_policy_testing.h. This is rejected by Clang when using the libstdc++ that ships with GCC15
  > Extract `WideToUtf8` helper to `utf8.h`.
  > Updates the documentation for `CHECK` to make it more explicit that it is used to require that a condition is true.
  > Add PolicyFunctions::soo_capacity() so that the compiler knows that soo_capacity() is always 0 or 1.
  > Expect different representations of pointers from the Windows toolchain.
  > Add set_no_seed_for_testing for use in GrowExtremelyLargeTable test.
  > Update GoogleTest dependency to 1.17.0 to support GCC15
  > Assume that frame pointers inside known stack bounds are readable.
  > Remove fallback code in absl/algorithm/container.h
  > Fix GCC15 warning that <ciso646> is deprecated in C++17
  > Fix misplaced closing brace
  > Remove unused include.
  > Automated Code Change
  > Type erase copy constructor.
  > Refactor to use hash_of(key) instead of hash_ref()(key).
  > Create Table.Prefetch test to make sure that it works.
  > Remove NOINLINE on the constructor with buckets.
  > In SwissTable, don't hash the key in find when capacity<=1.
  > Use 0x57 instead of Seed() for weakly mixing of size.
  > Use absl::InsecureBitGen in place of std::random_device in Abseil tests.
  > Remove unused include.
  > Use large 64 bits kMul for 32 bits platforms as well.
  > Import of CCTZ from GitHub.
  > Define `combine_weakly_mixed_integer` in HashSelect::State in order to allow `friend auto AbslHashValue` instead of `friend H AbslHashValue`.
  > PR #1878: Fix typos in comments
  > Update Abseil dependencies in preparation for release
  > Use weaker mixing for absl::Hash for types that mix their sizes.
  > Update comments on UnscaledCycleClock::Now.
  > Use alignas instead of the manual alignment for the Randen entropy pool.
  > Document nullability annotation syntax for array declarations (not many people may know the syntax).
  > Import of CCTZ from GitHub.
  > Release tests for ABSL_RAW_DCHECK and ABSL_RAW_DLOG.
  > Adjust threshold for stuck bits to avoid flaky failures.
  > Deprecate template type alias nullability annotations.
  > Add more probe benchmarks
  > PR #1874: Simplify detection of the powerpc64 ELFv1 ABI
  > Make `absl::FunctionRef` copy-assignable. This brings it more in line with `std::function_ref`.
  > Remove unused #includes from absl/base/internal/nullability_impl.h
  > PR #1870: Retry SymInitialize on STATUS_INFO_LENGTH_MISMATCH
  > Prefetch from slots in parallel with reading from control.
  > Migrate template alias nullability annotations to macros.
  > Improve dependency graph in `TryFindNewIndexWithoutProbing` hot path evaluation.
  > Add latency benchmarks for Hash for strings with size 3, 5 and 17.
  > Exclude UnwindImpl etc. from thread sanitizer due to false positives.
  > Use `GroupFullEmptyOrDeleted` inside of `transfer_unprobed_elements_to_next_capacity_fn`.
  > PR #1863: [minor] Avoid variable shadowing for absl btree
  > Extend stack-frame walking functionality to allow dynamic fixup
  > Fix "unsafe narrowing" in absl for Emscripten
  > Roll back change to address breakage
  > Extend stack-frame walking functionality to allow dynamic fixup
  > Introduce `absl::Cord::Distance()`
  > Avoid aliasing issues in growth information initialization.
  > Make `GrowSooTableToNextCapacityAndPrepareInsert` in order to initialize control bytes all at once and avoid two function calls on growth right after SOO.
  > Simplify `SingleGroupTableH1` since we do not need to mix all bits anymore. Per table seed has a good last bit distribution.
  > Use `NextSeed` instead of `NextSeedBaseNumber` and make the result type be `uint16_t`. That avoids unnecessary bit twiddling and simplifies the code.
  > Optimize `GrowthToLowerBoundCapacity` in order to avoid division.
  > [base] Make :endian internal to absl
  > Fully qualify absl names in check macros to avoid invalid name resolution when the user scope has those names defined.
  > Fix memory sanitization in `GrowToNextCapacityAndPrepareInsert`.
  > Define and use `ABSL_SWISSTABLE_ASSERT` in cc file since a lot of logic moved there.
  > Remove `ShouldInsertBackwards` functionality. It was used for additional order randomness in debug mode. It is not necessary anymore with introduction of separate per table `seed`.
  > Fast growing to the next capacity based on carbon hash table ideas.
  > Automated Code Change
  > Refactor CombinePiecewiseBuffer test case to (a) call PiecewiseChunkSize() to get the chunk size and (b) use ASSERT for expectation in a loop.
  > PR #1867: Remove global static in stacktrace_win32-inl.inc
  > Mark Abseil hardening assert in AssertIsValidForComparison as slow.
  > Roll back a problematic change.
  > Add absl::FastTypeId<T>()
  > Automated Code Change
  > Update TestIntrinsicInt128 test to print the indices with the conflicting hashes.
  > Code simplification: we don't need XOR and kMul when mixing large string hashes into hash state.
  > Refactor absl::CUnescape() to use direct string output instead of pointer/size.
  > Rename `policy.transfer` to `policy.transfer_n`.
  > Optimize `ResetCtrl` for small tables with `capacity < Group::kWidth * 2` (<32 if SSE enabled and <16 if not).
  > Use 16 bits of per-table-seed so that we can save an `and` instruction in H1.
  > Fully annotate nullability in headers where it is partially annotated.
  > Add note about sparse containers to (flat|node)_hash_(set|map).
  > Make low_level_alloc compatible with -Wthread-safety-pointer
  > Add missing direct includes to enable the removal of unused includes from absl/base/internal/nullability_impl.h.
  > Add tests for macro nullability annotations analogous to existing tests for type alias annotations.
  > Adds functionality to return stack frame pointers during stack walking, in addition to code addresses
  > Use even faster reduction algorithm in FinalizePclmulStream()
  > Add nullability annotations to some very-commonly-used APIs.
  > PR #1860: Add `unsigned` to character buffers to ensure they can provide storage (https://eel.is/c++draft/intro.object#3)
  > Release benchmarks for absl::Status and absl::StatusOr
  > Use more efficient reduction algorithm in FinalizePclmulStream()
  > Add a test case to make it clear that `--vmodule=foo/*=1` does match any children and grandchildren and so on under `foo/`.
  > Gate use of clang nullability qualifiers through absl nullability macros on `nullability_on_classes`.
  > Mark `absl::StatusOr::status()` as ABSL_MUST_USE_RESULT
  > Cleanups related to benchmarks   * Fix many benchmarks to be cc_binary instead of cc_test   * Add a few benchmarks for StrFormat   * Add benchmarks for Substitute   * Add benchmarks for Damerau-Levenshtein distance used in flags
  > Add a log severity alias `DO_NOT_$UBMIT` intended for logging during development
  > Avoid relying on true and false tokens in the preprocessor macros used in any_invocable.h
  > Avoid relying on true and false tokens in the preprocessor macros used in absl/container
  > Refactor to make it clear that H2 computation is not repeated in each iteration of the probe loop.
  > Turn on C++23 testing for GCC and Clang on Linux
  > Fix overflow of kSeedMask on 32 bits platform in `generate_new_seed`.
  > Add a workaround for std::pair not being trivially copyable in C++23 in some standard library versions
  > Refactor WeakMix to include the XOR of the state with the input value.
  > Migrate ClearPacBits() to a more generic implementation and location
  > Annotate more Abseil container methods with [[clang::lifetime_capture_by(...)]] and make them all forward to the non-captured overload
  > Make PolicyFunctions always be the second argument (after CommonFields) for type-erased functions.
  > Move GrowFullSooTableToNextCapacity implementation with some dependencies to cc file.
  > Optimize btree_iterator increment/decrement to avoid aliasing issues by using local variables instead of repeatedly writing to `this`.
  > Add constexpr conversions from absl::Duration to int64_t
  > PR #1853: Add support for QCC compiler
  > Fix documentation for key requirements of flat_hash_set
  > Use `extern template` for `GrowFullSooTableToNextCapacity` since we know the most common set of parameters.
  > C++23: Fix log_format_test to match the stream format for volatile pointers
  > C++23: Fix compressed_tuple_test.
  > Implement `btree::iterator::+=` and `-=`.
  > Stop calling `ABSL_ANNOTATE_MEMORY_IS_INITIALIZED` for threadlocal counter.
  > Automated Code Change
  > Introduce seed stored in the hash table inside of the size.
  > Replace ABSL_ATTRIBUTE_UNUSED with [[maybe_unused]]
  > Minor consistency cleanups to absl::BitGen mocking.
  > Restore the empty CMake targets for bad_any_cast, bad_optional_access, and bad_variant_access to allow clients to migrate.
  > bits.h: Add absl::endian and absl::byteswap polyfills
  > Use absl::NoDestructor for an absl::Mutex instance in the flags library to prevent some exit-time destructor warnings
  > Add thread GetEntropyFromRandenPool test
  > Update nullability annotation documentation to focus on macro annotations.
  > Simplify some random/internal types; expose one function to acquire entropy.
  > Remove pre-C++17 workarounds for lack of std::launder
  > UBSAN: Use -fno-sanitize-recover
  > int128_test: Avoid testing signed integer overflow
  > Remove leading commas in `Describe*` methods of `StatusIs` matcher.
  > absl::StrFormat: Avoid passing null to memcpy
  > str_cat_test: Avoid using invalid enum values
  > hash_generator_testing: Avoid using invalid enum values
  > absl::Cord: Avoid passing null to memcpy and memset
  > graphcycles_test: Avoid applying a non-zero offset to a null pointer
  > Make warning about wrapping empty std::function in AnyInvocable stronger.
  > absl/random: Convert absl::BitGen / absl::InsecureBitGen to classes from aliases.
  > Fix buffer overflow in the internal demangling function
  > Avoid calling `ShouldRehashForBugDetection` on the first two inserts to the table.
  > Remove the polyfill implementations for many type traits and alias them to their std equivalents. It is recommended that clients now simply use the std equivalents.
  > ROLLBACK: Limit slot_size to 2^16-1 and maximum table size to 2^43-1.
  > Limit `slot_size` to `2^16-1` and maximum table size to `2^43-1`.
  > Use C++17 [[nodiscard]] instead of the deprecated ABSL_MUST_USE_RESULT
  > Remove the polyfills for absl::apply and absl::make_from_tuple, which were only needed prior to C++17. It is recommended that clients simply use std::apply and std::make_from_tuple.
  > PR #1846: Fix build on big endian
  > Bazel: Move environment variables to --action_env
  > Remove the implementation of `absl::variant`, which was only needed prior to C++17. `absl::variant` is now an alias for `std::variant`. It is recommended that clients simply use `std::variant`.
  > MSVC: Fix warnings c4244 and c4267 in the main library code
  > Update LowLevelHashLenGt16 to be LowLevelHashLenGt32 now that the input is guaranteed to be >32 in length.
  > Xtensa does not support thread_local. Disable it in absl/base/config.h.
  > Add support for 8-bit and 16-bit integers to absl::SimpleAtoi
  > CI: Update Linux ARM latest container
  > Add time hash tests
  > `any_invocable`: Update comment that refer to C++17 and C++11
  > `check_test_impl.inc`: Use C++17 features unconditionally
  > Remove the implementation of `absl::optional`, which was only needed prior to C++17. `absl::optional` is now an alias for `std::optional`. It is recommended that clients simply use `std::optional`.
  > Move hashtable control bytes manipulation to a separate file.
  > Fix a use-after-free bug in which the string passed to `AtLocation` may be referenced after it is destroyed. While the string does live until the end of the full statement, logging previously occurred in the destructor of the `LogMessage`, which may be constructed before the temporary string (and thus destroyed after it).
  > `internal/layout`: Delete pre-C++17 out of line definition of constexpr class member
  > Extract slow path for PrepareInsertNonSoo to a separate function `PrepareInsertNonSooSlow`.
  > Minor code cleanups
  > `internal/log_message`: Use `if constexpr` instead of SFINAE for `operator<<`
  > [absl] Use `std::min` in `constexpr` contexts in `absl::string_view`
  > Remove the implementation of `absl::any`, which was only needed prior to C++17. `absl::any` is now an alias for `std::any`. It is recommended that clients simply use `std::any`.
  > Remove ABSL_INTERNAL_NEED_REDUNDANT_CONSTEXPR_DECL, which is no longer needed with the C++17 floor
  > Make `OptimalMemcpySizeForSooSlotTransfer` ready to work with MaxSooSlotSize up to `3*sizeof(size_t)`.
  > `internal/layout`: Replace SFINAE with `if constexpr`
  > PR #1830: C++17 improvement: use if constexpr in internal/hash.h
  > `absl`: Deprecate `ABSL_HAVE_CLASS_TEMPLATE_ARGUMENT_DEDUCTION`
  > Add verification for access to a table that is being destroyed. Also enable the access-after-destroy check in ASan optimized mode.
  > Store `CharAlloc` in SwissTable in order to simplify type erasure of functions accepting allocator as `void*`.
  > Introduce and use `SetCtrlInLargeTable` when we know that the table is at least one group. Similarly to `SetCtrlInSingleGroupTable`, we can save some operations.
  > Make raw_hash_set::slot_type private.
  > Delete absl/utility/internal/if_constexpr.h
  > `internal/any_invocable`: Use `if constexpr` instead of SFINAE when initializing storage accessor
  > Depend on string_view directly
  > Optimize and slightly simplify `PrepareInsertNonSoo`.
  > PR #1833: Make ABSL_INTERNAL_STEP_n macros consistent in crc code
  > `internal/any_invocable`: Use alias `RawT` consistently in `InitializeStorage`
  > Move the implementation of absl::ComputeCrc32c to the header file, to facilitate inlining.
  > Delete absl/base/internal/inline_variable.h
  > Add lifetimebound to absl::StripAsciiWhitespace
  > Revert: Random: Use target attribute instead of -march
  > Add return for opt mode in AssertNotDebugCapacity to make sure that code is not evaluated in opt mode.
  > `internal/any_invocable`: Delete TODO, improve comment and simplify pragma in constructor
  > Split resizing routines and type erase similar instructions.
  > Random: Use target attribute instead of -march
  > `internal/any_invocable`: Use `std::launder` unconditionally
  > `internal/any_invocable`: Remove suppression of false positive -Wmaybe-uninitialized on GCC 12
  > Fix feature test for ABSL_HAVE_STD_OPTIONAL
  > Support C++20 iterators in raw_hash_map's random-access iterator detection
  > Fix mis-located test dependency
  > Disable the DestroyedCallsFail test on GCC due to flakiness.
  > `internal/any_invocable`: Implement invocation using `if constexpr` instead of SFINAE
  > PR #1835: Bump deployment_target version and add visionos to podspec
  > PR #1828: Fix spelling of pseudorandom in README.md
  > Make raw_hash_map::key_arg private.
  > `overload`: Delete obsolete macros for undefining `absl::Overload` when C++ < 17
  > `absl/base`: Delete `internal/invoke.h` and `invoke_test.cc`
  > Remove `WORKSPACE.bazel`
  > `absl`: Replace `base_internal::{invoke,invoke_result_t,is_invocable_r}` with `std` equivalents
  > Allow C++20 forward iterators to use fast paths
  > Factor out some iterator traits detection code
  > Type erase IterateOverFullSlots to decrease code size.
  > `any_invocable`: Delete pre-C++17 workarounds for `noexcept` and guaranteed copy elision
  > Make raw_hash_set::key_arg private.
  > Rename nullability macros to use new lowercase spelling.
  > Fix bug where ABSL_REQUIRE_EXPLICIT_INIT did not actually result in a linker error
  > Make Randen benchmark program use runtime CPU detection.
  > Add CI for the C++20/Clang/libstdc++ combination
  > Move Abseil to GoogleTest 1.16.0
  > `internal/any_invocable`: Use `if constexpr` instead of SFINAE in `InitializeStorage`
  > More type-erasing of InitializeSlots by removing the Alloc and AlignOfSlot template parameters.
  > Actually use the hint space instruction to strip PAC bits for return addresses in stack traces as the comment says
  > `log/internal`: Replace `..._ATTRIBUTE_UNUSED_IF_STRIP_LOG` with C++17 `[[maybe_unused]]`
  > `attributes`: Document `ABSL_ATTRIBUTE_UNUSED` as deprecated
  > `internal/any_invocable`: Initialize using `if constexpr` instead of ternary operator, enum, and templates
  > Fix flaky tests due to sampling by introducing utility to refresh sampling counters for the current thread.
  > Minor reformatting in raw_hash_set: - Add a clear_backing_array member to declutter calls to ClearBackingArray. - Remove some unnecessary `inline` keywords on functions. - Make PoisonSingleGroupEmptySlots static.
  > Update CI for linux_gcc-floor to use GCC9, Bazel 7.5, and CMake 3.31.5.
  > `internal/any_invocable`: Rewrite `IsStoredLocally` type trait into a simpler constexpr function
  > Add ABSL_REQUIRE_EXPLICIT_INIT to Abseil to enable enforcing explicit field initializations
  > Require C++17
  > Minimize number of `InitializeSlots` with respect to SizeOfSlot.
  > Leave the call to `SampleSlow` only in type erased InitializeSlots.
  > Update comments for Read4To8 and Read1To3.
  > PR #1819: fix compilation with AppleClang
  > Move SOO processing inside of InitializeSlots and move it once.
  > PR #1816: Random: use getauxval() via <sys/auxv.h>
  > Optimize `InitControlBytesAfterSoo` to have less writes and make them with compile time known size.
  > Remove stray plus operator in cleanup_internal::Storage
  > Include <cerrno> to fix compilation error in chromium build.
  > Adjust internal logging namespacing for consistency s/ABSL_LOGGING_INTERNAL_/ABSL_LOG_INTERNAL_/
  > Rewrite LOG_EVERY_N (et al) docs to clarify that the first instance is logged.  Also, deliberately avoid giving exact numbers or examples since IRL behavior is not so exact.
  > ABSL_ASSUME: Use a ternary operator instead of do-while in the implementations that use a branch marked unreachable so that it is usable in more contexts.
  > Simplify the comment for raw_hash_set::erase.
  > Remove preprocessors for now unsupported compilers.
  > `absl::ScopedMockLog`: Explicitly document that it captures logs emitted by all threads
  > Fix potential integer overflow in hash container create/resize
  > Add lifetimebound to StripPrefix/StripSuffix.
  > Random: Rollforward support runtime dispatch on AArch64 macOS
  > Crc: Only test non_temporal_store_memcpy_avx on AVX targets
  > Provide information about types of all flags.
  > Deprecate the precomputed hash find() API in swisstable.
  > Import of CCTZ from GitHub.
  > Adjust whitespace
  > Expand documentation for absl::raw_hash_set::erase to include idiom example of iterator post-increment.
  > Performance improvement for absl::AsciiStrToUpper() and absl::AsciiStrToLower()
  > Crc: Remove the __builtin_cpu_supports path for SupportsArmCRC32PMULL
  > Use absl::NoDestructor for some absl::Mutex instances in the flags library to prevent some exit-time destructor warnings
  > Update the WORKSPACE dependency of rules_cc to 0.1.0
  > Rollback support runtime dispatch on AArch64 macOS for breaking some builds
  > Downgrade to rules_cc 0.0.17 because 0.1.0 was yanked
  > Use unused set in testing.
  > Random: Support runtime dispatch on AArch64 macOS
  > crc: Use absl::nullopt when returning absl::optional
  > Annotate absl::FixedArray to warn when unused.
  > PR #1806: Fix undefined symbol: __android_log_write
  > Move ABSL_HAVE_PTHREAD_CPU_NUMBER_NP to the file where it is needed
  > Use rbit instruction on ARM rather than rev.
  > Debugging: Report the CPU we are running on under Darwin
  > Add a microbenchmark for very long int/string tuples.
  > Crc: Detect support for pmull and crc instructions on Apple AArch64 With a newer clang, we can use __builtin_cpu_supports which caches all the feature bits.
  > Add special handling for hashing integral types so that we can optimize Read1To3 and Read4To8 for the strings case.
  > Use unused FixedArray instances.
  > Minor reformatting
  > Avoid flaky expectation in WaitDurationWoken test case in MSVC.
  > Use Bazel rules_cc for many compiler-specific rules instead of our custom ones from before the Bazel rules existed.
  > Mix pointers twice in absl::Hash.
  > New internal-use-only classes `AsStructuredLiteralImpl` and `AsStructuredValueImpl`
  > Annotate some Abseil container methods with [[clang::lifetime_capture_by(...)]]
  > Faster copy from inline Cords to inline Strings
  > Add new benchmark cases for hashing string lengths 1,2,4,8.
  > Move the Arm implementation of UnscaledCycleClock::Now() into the header file, like the x86 implementation, so it can be more easily inlined.
  > Minor include cleanup in absl/random/internal
  > Import of CCTZ from GitHub.
  > Use Bazel Platforms to support AES-NI compile options for Randen
  > In HashState::Create, require that T is a subclass of HashStateBase in order to discourage users from defining their own HashState types.
  > PR #1801: Remove unncessary <iostream> includes
  > New class StructuredProtoField
  > Mix pointers twice in TSan and MSVC to avoid flakes in the PointerAlignment test.
  > Add a test case that type-erased absl::HashState is consistent with absl::HashOf.
  > Mix pointers twice in build modes in which the PointerAlignment test is flaky if we mix once.
  > Increase threshold for stuck bits in PointerAlignment test on android.
  > Use hashing ideas from Carbon's hashtable in absl hashing: - Use byte swap instead of mixing pointers twice. - Change order of branches to check for len<=8 first. - In len<=16 case, do one multiply to mix the data instead of using the logic from go/absl-hash-rl (reinforcement learning was used to optimize the instruction sequence). - Add special handling for len<=32 cases in 64-bit architectures.
  > Test that using a table that was moved-to from a moved-from table fails in sanitizer mode.
  > Remove a trailing comma causing an issue for an OSS user
  > Add missing includes in hash.h.
  > Use the public implementation rule for "@bazel_tools//tools/cpp:clang-cl"
  > Import of CCTZ from GitHub.
  > Change the definition of is_trivially_relocatable to be a bit less conservative.
  > Updates to CI to support newer versions of tools
  > Check if ABSL_HAVE_INTRINSIC_INT128 is defined
  > Print hash expansions in the hash_testing error messages.
  > Avoid flakiness in notification_test on MSVC.
  > Roll back: Add more debug capacity validation checks on moves.
  > Add more debug capacity validation checks on moves.
  > Add macro versions of nullability annotations.
  > Improve fork-safety by opening files with `O_CLOEXEC`.
  > Move ABSL_HARDENING_ASSERTs in constexpr methods to their own lines.
  > Add test cases for absl::Hash: - That hashes are consistent for the same int value across different int types. - That hashes of vectors of strings are unequal even when their concatenations are equal. - That FragmentedCord hashes works as intended for small Cords.
  > Skip the IterationOrderChangesOnRehash test case in ASan mode because it's flaky.
  > Add missing includes in absl hash.
  > Try to use file descriptors in the 2000+ range to avoid mis-behaving client interference.
  > Add weak implementation of the __lsan_is_turned_off in Leak Checker
  > Fix a bug where EOF resulted in infinite loop.
  > static_assert that absl::Time and absl::Duration are trivially destructible.
  > Move Duration ToInt64<unit> functions to be inline.
  > string_view: Add defaulted copy constructor and assignment
  > Use `#ifdef` to avoid errors when `-Wundef` is used.
  > Strip PAC bits for return addresses in stack traces
  > PR #1794: Update cpu_detect.cc fix hw crc32 and AES capability check, fix undefined
  > PR #1790: Respect the allocator's .destroy method in ~InlinedVector
  > Cast away nullability in the guts of CHECK_EQ (et al) where Clang doesn't see that the nullable string returned by Check_EQImpl is statically nonnull inside the loop.
  > string_view: Correct string_view(const char*, size_type) docs
  > Add support for std::string_view in StrCat even when absl::string_view != std::string_view.
  > Misc. adjustments to unit tests for logging.
  > Use local_config_cc from rules_cc and make it a dev dependency
  > Add additional iteration order tests with reservation. Reserved tables have a different way of iteration randomization compared to gradually resized tables (at least for small tables).
  > Use all the bits (`popcount`) in `FindFirstNonFullAfterResize` and `PrepareInsertAfterSoo`.
  > Mark ConsumePrefix, ConsumeSuffix, StripPrefix, and StripSuffix as constexpr since they are all pure functions.
  > PR #1789: Add missing #ifdef pp directive to the TypeName() function in the layout.h
  > PR #1788: Fix warning for sign-conversion on riscv
  > Make StartsWith and EndsWith constexpr.
  > Simplify logic for growing single group table.
  > Document that absl::Time and absl::Duration are trivially destructible.
  > Change some C-arrays to std::array as this enables bounds checking in some hardened standard library builds
  > Replace outdated select() on --cpu with platform API equivalent.
  > Take failure_message as const char* instead of string_view in LogMessageFatal and friends.
  > Mention `c_any_of` in the function comment of `absl::c_linear_search`.
  > Import of CCTZ from GitHub.
  > Rewrite some string_view methods to avoid a -Wunreachable-code warning
  > IWYU: Update includes and fix minor spelling mistakes.
  > Add comment on how to get next element after using erase.
  > Add ABSL_ATTRIBUTE_LIFETIME_BOUND and a doc note about absl::LogAsLiteral to clarify its intended use.
  > Import of CCTZ from GitHub.
  > Reduce memory consumption of structured logging proto encoding by passing tag value
  > Remove usage of _LIBCPP_HAS_NO_FILESYSTEM_LIBRARY.
  > Make Span's relational operators constexpr since C++20.
  > distributions: support a zero max value in Zipf.
  > PR #1786: Fix typo in test case.
  > absl/random: run clang-format.
  > Add some nullability annotations in logging and tidy up some NOLINTs and comments.
  > CMake: Change the default for ABSL_PROPAGATE_CXX_STD to ON
  > Delete UnvalidatedMockingBitGen
  > PR #1783: [riscv][debugging] Fix a few warnings in RISC-V inlines
  > Add conversion operator to std::array for StrSplit.
  > Add a comment explaining the extra comparison in raw_hash_set::operator==. Also add a small optimization to avoid the extra comparison in sets that use hash_default_eq as the key_equal functor.
  > Add benchmark for absl::HexStringToBytes
  > Avoid installing options.h with the other headers
  > Add ABSL_ATTRIBUTE_LIFETIME_BOUND to absl::Span constructors.
  > Annotate absl::InlinedVector to warn when unused.
  > Make `c_find_first_of`'s `options` parameter a const reference to allow temporaries.
  > Disable Elf symbols for Xtensa
  > PR #1775: Support symbolize only on WINAPI_PARTITION_DESKTOP
  > Require through an internal presubmit that .h|.cc|.inc files contain either the string ABSL_NAMESPACE_BEGIN or SKIP_ABSL_INLINE_NAMESPACE_CHECK
  > Xtensa supports mmap, enable it in absl/base/config.h
  > PR #1777: Avoid std::ldexp in `operator double(int128)`.
  > Marks absl::Span as view and borrowed_range, like std::span.
  > Mark inline functions with only a simple comparison in strings/ascii.h as constexpr.
  > Add missing Abseil inline namespace and fix includes
  > Fix bug where the high bits of `__int128_t`/`__uint128_t` might go unused in the hash function. This fix increases the hash quality of these types.
  > Add a test to verify bit casting between signed and unsigned int128 works as expected
  > Add suggestions to enable sanitizers for asserts when doing so may be helpful.
  > Add nullability attributes to nullability type aliases.
  > Refactor swisstable moves.
  > Improve ABSL_ASSERT performance by guaranteeing it is optimized away under NDEBUG in C++20
  > Mark Abseil hardening assert in AssertSameContainer as slow.
  > Add workaround for q++ 8.3.0 (QNX 7.1) compiler by making sure MaskedPointer is trivially copyable and copy constructible.
  > Small Mutex::Unlock optimization
  > Optimize `CEscape` and `CEscapeAndAppend` by up to 40%.
  > Fix the conditional compilation of non_temporal_store_memcpy_avx to verify that AVX can be forced via `gnu::target`.
  > Delete TODOs to move functors when moving hashtables and add a test that fails when we do so.
  > Fix benchmarks in `escaping_benchmark.cc` by properly calling `benchmark::DoNotOptimize` on both inputs and outputs and by removing the unnecessary and wrong `ABSL_RAW_CHECK` condition (`check != 0`) of `BM_ByteStringFromAscii_Fail` benchmark.
  > It seems like commit abc9b916a94ebbf251f0934048295a07ecdbf32a did not work as intended.
  > Fix a bug in `absl::SetVLogLevel` where a less generic pattern incorrectly removed a more generic one.
  > Remove the side effects between tests in vlog_is_on_test.cc
  > Attempt to fix flaky Abseil waiter/sleep tests
  > Add an explicit tag for non-SOO CommonFields (removing default ctor) and add a small optimization for early return in AssertNotDebugCapacity.
  > Make moved-from swisstables behave the same as empty tables. Note that we may change this in the future.
  > Tag tests that currently fail on darwin_arm64 with "no_test_darwin_arm64"
  > add gmock to cmake defs for no_destructor_test
  > Optimize raw_hash_set moves by allowing some members of CommonFields to be uninitialized when moved-from.
  > Add more debug capacity validation checks on iteration/size.
  > Add more debug capacity validation checks on copies.
  > constinit -> constexpr for DisplayUnits
  > LSC: Fix null safety issues diagnosed by Clang’s `-Wnonnull` and `-Wnullability`.
  > Remove the extraneous variable creation in Match().
  > Import of CCTZ from GitHub.
  > Add more debug capacity validation checks on merge/swap.
  > Add `absl::` namespace to c_linear_search implementation in order to avoid ADL
  > Distinguish the debug message for the case of self-move-assigned swiss tables.
  > Update LowLevelHash comment regarding number of hash state variables.
  > Add an example for the `--vmodule` flag.
  > Remove first prefetch.
  > Add moved-from validation for the case of self-move-assignment.
  > Allow slow and fast abseil hardening checks to be enabled independently.
  > Update `ABSL_RETIRED_FLAG` comment to reflect `default_value` is no longer used.
  > Add validation against use of moved-from hash tables.
  > Provide file-scoped pragma behind macro ABSL_POINTERS_DEFAULT_NONNULL to indicate the default nullability. This is a no-op for now (not understood by checkers), but does communicate intention to human readers.
  > Add stacktrace config for android using the generic implementation
  > Fix nullability annotations in ABSL code.
  > Replace CHECKs with ASSERTs and EXPECTs -- no reason to crash on failure.
  > Remove ABSL_INTERNAL_ATTRIBUTE_OWNER and ABSL_INTERNAL_ATTRIBUTE_VIEW
  > Migrate ABSL_INTERNAL_ATTRIBUTE_OWNER and ABSL_INTERNAL_ATTRIBUTE_VIEW to ABSL_ATTRIBUTE_OWNER and ABSL_ATTRIBUTE_VIEW
  > Disable ABSL_ATTRIBUTE_OWNER and ABSL_ATTRIBUTE_VIEW prior to Clang-13 due to false positives.
  > Make ABSL_ATTRIBUTE_VIEW and ABSL_ATTRIBUTE_OWNER public
  > Optimize raw_hash_set::AssertHashEqConsistent a bit to avoid having as much runtime overhead.
  > PR #1728: Workaround broken compilation against NDK r25
  > Add validation against use of destroyed hash tables.
  > Do not truncate `ABSL_RAW_LOG` output at null bytes
  > Use several unused cord instances in tests and benchmarks.
  > Add comments about ThreadIdentity struct allocation behavior.
  > Refactoring followup for reentrancy validation in swisstable.
  > Add debug mode checks that element constructors/destructors don't make reentrant calls to raw_hash_set member functions.
  > Add tagging for cc_tests that are incompatible with Fuchsia
  > Add GetTID() implementation for Fuchsia
  > PR #1738: Fix shell option group handling in pkgconfig files
  > Disable weak attribute when absl is compiled as a Windows DLL
  > Remove `CharIterator::operator->`.
  > Mark non-modifying container algorithms as constexpr for C++20.
  > PR #1739: container/internal: Explicitly include <cstdint>
  > Don't match -Wnon-virtual-dtor in the "flags are needed to suppress warnings in headers". It should fall through to the "don't impose our warnings on others" case. Do this by matching on "-Wno-*" instead of "-Wno*".
  > PR #1732: Fix build on NVIDIA Jetson board. Fix #1665
  > Update GoogleTest dependency to 1.15.2
  > Enable AsciiStrToLower and AsciiStrToUpper overloads for rvalue references.
  > PR #1735: Avoid `int` to `bool` conversion warning
  > Add `absl::swap` functions for `*_hash_*` to avoid calling `std::swap`
  > Change internal visibility
  > Remove resolved issue.
  > Increase test timeouts to support running on Fuchsia emulators
  > Add tracing annotations to absl::Notification
  > Suppress compiler optimizations which may break container poisoning.
  > Disable ABSL_INTERNAL_HAVE_DEBUGGING_STACK_CONSUMPTION for Fuchsia
  > Add tracing annotations to absl::BlockingCounter
  > Add absl_vlog_is_on and vlog_is_on to ABSL_INTERNAL_DLL_TARGETS
  > Update swisstable swap API comments to no longer guarantee that we don't move/swap individual elements.
  > PR #1726: cmake: Fix RUNPATH when using BUILD_WITH_INSTALL_RPATH=True
  > Avoid unnecessary copying when upper-casing or lower-casing ASCII string_view
  > Add weak internal tracing API
  > Fix LINT.IfChange syntax
  > PR #1720: Fix spelling mistake: occurrance -> occurrence
  > Add missing include for Windows ASAN configuration in poison.cc
  > Delete absl/strings/internal/has_absl_stringify.h now that the GoogleTest version we depend on uses the public file
  > Update versions of dependencies in preparation for release
  > PR #1699: Add option to build with MSVC static runtime
  > Remove unneeded 'be' from comment.
  > PR #1715: Generate options.h using CMake only once
  > Small typo fix in absl/log/internal/log_impl.h
  > PR #1709: Handle RPATH CMake configuration
  > PR #1710: fixup! PR #1707: Fixup absl_random compile breakage in Apple ARM64 targets
  > PR #1695: Fix time library build for Apple platforms
  > Remove cyclic cmake dependency that breaks in cmake 3.30.0
  > Roll forward poisoned pointer API and fix portability issues.
  > Use GetStatus in IsOkAndHoldsMatcher
  > PR #1707: Fixup absl_random compile breakage in Apple ARM64 targets
  > PR #1706: Require CMake version 3.16
  > Add an MSVC implementation of ABSL_ATTRIBUTE_LIFETIME_BOUND
  > Mark c_min_element, c_max_element, and c_minmax_element as constexpr in C++17.
  > Optimize the absl::GetFlag cost for most non built-in flag types (including string).
  > Encode some additional metadata when writing protobuf-encoded logs.
  > Replace signed integer overflow, since that's undefined behavior, with unsigned integer overflow.
  > Make mutable CompressedTuple::get() constexpr.
  > vdso_support: support DT_GNU_HASH
  > Make c_begin, c_end, and c_distance conditionally constexpr.
  > Add operator<=> comparison to absl::Time and absl::Duration.
  > Deprecate `ABSL_ATTRIBUTE_NORETURN` in favor of the `[[noreturn]]` standardized in C++11
  > Rollback new poisoned pointer API
  > Static cast instead of reinterpret cast for raw hash set slots, as casting from void* to T* is well defined
  > Fix absl::NoDestructor documentation about its use as a global
  > Declare Rust demangling feature-complete.
  > Split demangle_internal into a tree of smaller libraries.
  > Decode Rust Punycode when it's not too long.
  > Add assertions to detect reentrance in `IterateOverFullSlots` and `absl::erase_if`.
  > Decoder for Rust-style Punycode encodings of bounded length.
  > Add `c_contains()` and `c_contains_subrange()` to `absl/algorithm/container.h`.
  > Three-way comparison spaceship <=> operators for Cord.
  > internal-only change
  > Remove erroneous preprocessor branch on SGX_SIM.
  > Add an internal API to get a poisoned pointer.
  > optimization.h: Add missing <utility> header for C++
  > Add a compile test for headers that require C compatibility
  > Fix comment typo
  > Expand documentation for SetGlobalVLogLevel and SetVLogLevel.
  > Roll back 6f972e239f668fa29cab43d7968692cd285997a9
  > PR #1692: Add missing `<utility>` include
  > Remove NOLINT for `#include <new>` for __cpp_lib_launder
  > Remove not used after all kAllowRemoveReentrance parameter from IterateOverFullSlots.
  > Create `absl::container_internal::c_for_each_fast` for SwissTable.
  > Disable flaky test cases in kernel_timeout_internal_test.
  > Document that swisstable and b-tree containers are not exception-safe.
  > Add `ABSL_NULLABILITY_COMPATIBLE` attribute.
  > LSC: Move expensive variables on their last use to avoid copies.
  > Add ABSL_INTERNAL_ATTRIBUTE_VIEW and ABSL_INTERNAL_ATTRIBUTE_OWNER attributes to more types in Abseil
  > Drop std:: qualification from integer types like uint64_t.
  > Increase slop time on MSVC in PerThreadSemTest.Timeouts again due to continued flakiness.
  > Turn on validation for out of bounds MockUniform in MockingBitGen
  > Use ABSL_UNREACHABLE() instead of equivalent
  > If so configured, report which part of a C++ mangled name didn't parse.
  > Sequence of 1-to-4 values with prefix sum to support Punycode decoding.
  > Add the missing inline namespace to the nullability files
  > Add ABSL_INTERNAL_ATTRIBUTE_VIEW and ABSL_INTERNAL_ATTRIBUTE_OWNER attributes to types in Abseil
  > Disallow reentrance removal in `absl::erase_if`.
  > Fix implicit conversion of temporary bitgen to BitGenRef
  > Use `IterateOverFullSlots` in `absl::erase_if` for hash table.
  > UTF-8 encoding library to support Rust Punycode decoding.
  > Disable negative NaN float ostream format checking on RISC-V
  > PR #1689: Minor: Add missing quotes in CMake string view library definition
  > Demangle template parameter object names, TA <template-arg>.
  > Demangle sr St <simple-id> <simple-id>, a dubious encoding found in the wild.
  > Try not to lose easy type combinators in S::operator const int*() and the like.
  > Demangle fixed-width floating-point types, DF....
  > Demangle _BitInt types DB..., DU....
  > Demangle complex floating-point literals.
  > Demangle <extended-qualifier> in types, e.g., U5AS128 for address_space(128).
  > Demangle operator co_await (aw).
  > Demangle fully general vendor extended types (any <template-args>).
  > Demangle transaction-safety notations GTt and Dx.
  > Demangle C++11 user-defined literal operator functions.
  > Demangle C++20 constrained friend names, F (<source-name> | <operator-name>).
  > Demangle dependent GNU vector extension types, Dv <expression> _ <type>.
  > Demangle elaborated type names, (Ts | Tu | Te) <name>.
  > Add validation that hash/eq functors are consistent, meaning that `eq(k1, k2) -> hash(k1) == hash(k2)`.
  > Demangle delete-expressions with the global-scope operator, gs (dl | da) ....
  > Demangle new-expressions with braced-init-lists.
  > Demangle array new-expressions, [gs] na ....
  > Demangle object new-expressions, [gs] nw ....
  > Demangle preincrement and predecrement, pp_... and mm_....
  > Demangle throw and rethrow (tw... and tr).
  > Remove redundant check of is_soo() while prefetching heap blocks.
  > Demangle ti... and te... expressions (typeid).
  > Demangle nx... syntax for noexcept(e) as an expression in a dependent signature.
  > Demangle alignof expressions, at... and az....
  > Demangle C++17 structured bindings, DC...E.
  > Demangle modern _ZGR..._ symbols.
  > Remove redundant check of is_soo() while prefetching heap blocks.
  > Demangle sizeof...(pack captured from an alias template), sP ... E.
  > Demangle types nested under vendor extended types.
  > Demangle il ... E syntax (braced list other than direct-list-initialization).
  > Avoid signed overflow for Ed <number> _ manglings with large <number>s.
  > Remove redundant check of is_soo() while prefetching heap blocks.
  > Remove obsolete TODO
  > Clarify function comment for `erase` by stating that this idiom only works for "some" standard containers.
  > Move SOVERSION to global CMakeLists, apply SOVERSION to DLL
  > Set ABSL_HAVE_THREAD_LOCAL to 1 on all platforms
  > Demangle constrained auto types (Dk <type-constraint>).
  > Parse <discriminator> more accurately.
  > Demangle lambdas in class member functions' default arguments.
  > Demangle unofficial <unresolved-qualifier-level> encodings like S0_IT_E.
  > Do not make std::filesystem::path hash available for macOS <10.15
  > Include flags in DLL build (non-Windows only)
  > Enable building monolithic shared library on macOS and Linux.
  > Demangle Clang's last-resort notation _SUBSTPACK_.
  > Demangle C++ requires-expressions with parameters (rQ ... E).
  > Demangle Clang's encoding of __attribute__((enable_if(condition, "message"))).
  > Demangle static_cast and friends.
  > Demangle decltype(expr)::nested_type (NDT...E).
  > Optimize GrowIntoSingleGroupShuffleControlBytes.
  > Demangle C++17 fold-expressions.
  > Demangle thread_local helper functions.
  > Demangle lambdas with explicit template arguments (UlTy and similar forms).
  > Demangle &-qualified function types.
  > Demangle valueless literals LDnE (nullptr) and LA<number>_<type>E ("foo").
  > Correctly demangle the <unresolved-name> at the end of dt and pt (x.y, x->y).
  > Add missing targets to ABSL_INTERNAL_DLL_TARGETS
  > Build abseil_test_dll with ABSL_BUILD_TESTING
  > Demangle C++ requires-expressions without parameters (rq ... E).
  > overload: make the constructor constexpr
  > Update Abseil CI Docker image to use Clang 19, GCC 14, and CMake 3.29.3
  > Workaround symbol resolution bug in Clang 19
  > Workaround bogus GCC14 -Wmaybe-uninitialized warning
  > Silence a bogus GCC14 -Warray-bounds warning
  > Forbid absl::Uniform<absl::int128>(gen)
  > Use IN_LIST to replace list(FIND) + > -1
  > Recognize C++ vendor extended expressions (e.g., u9__is_same...E).
  > `overload_test`: Remove a few unnecessary trailing return types
  > Demangle the C++ this pointer (fpT).
  > Stop eating an extra E in ParseTemplateArg for some L<type><value>E literals.
  > Add ABSL_INTERNAL_ATTRIBUTE_VIEW and ABSL_INTERNAL_ATTRIBUTE_OWNER attributes to Abseil.
  > Demangle C++ direct-list-initialization (T{1, 2, 3}, tl ... E).
  > Demangle the C++ spaceship operator (ss, operator<=>).
  > Demangle C++ sZ encodings (sizeof...(pack)).
  > Demangle C++ so ... E encodings (typically array-to-pointer decay).
  > Recognize dyn-trait-type in Rust demangling.
  > Rework casting in raw_hash_set's IsFull().
  > Remove test references to absl::SharedBitGen, which was never part of the open source release. This was only used in tests that never ran as part of the open source release.
  > Recognize fn-type and lifetimes in Rust demangling.
  > Support int128/uint128 in validated MockingBitGen
  > Recognize inherent-impl and trait-impl in Rust demangling.
  > Recognize const and array-type in Rust mangled names.
  > Remove Asylo from absl.
  > Recognize generic arguments containing only types in Rust mangled names.
  > Fix missing #include <random> for std::uniform_int_distribution
  > Move `prepare_insert` out of the line as type erased `PrepareInsertNonSoo`.
  > Revert: Add -Wdead-code-aggressive to ABSL_LLVM_FLAGS
  > Add (unused) validation to absl::MockingBitGen
  > Support `AbslStringify` with `DCHECK_EQ`.
  > PR #1672: Optimize StrJoin with tuple without user defined formatter
  > Give ReturnAddresses and N<uppercase> namespaces separate stacks for clarity.
  > Demangle Rust backrefs.
  > Use Nt for struct and trait names in Rust demangler test inputs.
  > Allow __cxa_demangle on MIPS
  > Add a `string_view` overload to `absl::StrJoin`
  > Demangle Rust's Y<type><path> production for passably simple <type>s.
  > `convert_test`: Delete obsolete condition around ASSERT_EQ in TestWithMultipleFormatsHelper
  > `any_invocable`: Clean up #includes
  > Resynchronize absl/functional/CMakeLists.txt with BUILD.bazel
  > `any_invocable`: Add public documentation for undefined behavior when invoking an empty AnyInvocable
  > `any_invocable`: Delete obsolete reference to proposed standard type
  > PR #1662: Replace shift with addition in crc multiply
  > Doc fix.
  > `convert_test`: Extract loop over tested floats from helper function
  > Recognize some simple Rust mangled names in Demangle.
  > Use __builtin_ctzg and __builtin_clzg in the implementations of CountTrailingZeroesNonzero16 and CountLeadingZeroes16 when they are available.
  > Remove the forked absl::Status matchers implementation in statusor_test
  > Add comment hack to fix copybara reversibility
  > Add GoogleTest matchers for absl::Status
  > [random] LogUniform: Document as a discrete distribution
  > Enable Cord tests with Crc.
  > Fix order of qualifiers in `absl::AnyInvocable` documentation.
  > Guard against null pointer dereference in DumpNode.
  > Apply ABSL_MUST_USE_RESULT to try lock functions.
  > Add public aliases for default hash/eq types in hash-based containers
  > Import of CCTZ from GitHub.
  > Remove the hand-rolled CordLeaker and replace with absl::NoDestructor to test the after-exit behavior
  > `convert_test`: Delete obsolete `skip_verify` parameter in test helper
  > overload: allow using the underlying type with CTAD directly.
  > PR #1653: Remove unnecessary casts when calling CRC32_u64
  > PR #1652: Avoid C++23 deprecation warnings from float_denorm_style
  > Minor cleanup for `absl::Cord`
  > PR #1651: Implement ABSL_INTERNAL_DISABLE_DEPRECATED_DECLARATION_WARNING for MSVC compiler
  > Add `operator<=>` support to `absl::int128` and `absl::uint128`
  > [absl] Re-use the existing `std::type_identity` backfill instead of redefining it again
  > Add `absl::AppendCordToString`
  > `str_format/convert_test`: Delete workaround for [glibc bug](https://sourceware.org/bugzilla/show_bug.cgi?id=22142)
  > `absl/log/internal`: Document conditional ABSL_ATTRIBUTE_UNUSED, add C++17 TODO
  > `log/internal/check_op`: Add ABSL_ATTRIBUTE_UNUSED to CHECK macros when STRIP_LOG is enabled
  > log_benchmark: Add VLOG_IS_ON benchmark
  > Restore string_view detection check
  > Remove an unnecessary ABSL_ATTRIBUTE_UNUSED from a logging macro
  < Abseil LTS Branch, Jan 2024, Patch 2 (#1650)
  > In example code, add missing template parameter.
  > Optimize crc32 V128_From2x64 on Arm
  > Annotate that Mutex should warn when unused.
  > Add ABSL_ATTRIBUTE_LIFETIME_BOUND to Cord::Flatten/TryFlat
  > Deprecate `absl::exchange`, `absl::forward` and `absl::move`, which were only useful before C++14.
  > Temporarily revert dangling std::string_view detection until dependent is fixed
  > Use _decimal_ literals for the CivilDay example.
  > Fix bug in BM_EraseIf.
  > Add internal traits to absl::string_view for lifetimebound detection
  > Add internal traits to absl::StatusOr for lifetimebound detection
  > Add internal traits to absl::Span for lifetimebound detection
  > Add missing dependency for log test build target
  > Add internal traits for lifetimebound detection
  > Use local decoding buffer in HexStringToBytes
  > Only check if the frame pointer is inside a signal stack with known bounds
  > Roll forward: enable small object optimization in swisstable.
  > Optimize LowLevelHash by breaking dependency between final loads and previous len/ptr updates.
  > Fix the wrong link.
  > Optimize InsertMiss for tables without kDeleted slots.
  > Use GrowthInfo without applying any optimizations based on it.
  > Disable small object optimization while debugging some failing tests.
  > Adjust conditional compilation in non_temporal_memcpy.h
  > Reformat log/internal/BUILD
  > Remove deprecated errno constants from the absl::Status mapping
  > Introduce GrowthInfo with tests, but without usage.
  > Enable small object optimization in swisstable.
  > Refactor the GCC uninitialized memory warning suppression in raw_hash_set.h.
  > Respect `NDEBUG_SANITIZER`
  > Revert integer-to-string conversion optimizations pending more thorough analysis
  > Fix a bug in `Cord::{Append,Prepend}(CordBuffer)`: call `MaybeRemoveEmptyCrcNode()`. Otherwise appending a `CordBuffer` to an empty Cord with a CRC node crashes (`RemoveCrcNode()` increases the refcount of a nullptr child).
  > Add `BM_EraseIf` benchmark.
  > Record sizeof(key_type), sizeof(value_type) in hashtable profiles.
  > Fix ClangTidy warnings in btree.h.
  > LSC: Move expensive variables on their last use to avoid copies.
  > PR #1644: unscaledcycleclock: remove RISC-V support
  > Reland: Make DLOG(FATAL) not understood as [[noreturn]]
  > Separate out absl::StatusOr constraints into statusor_internal.h
  > Use Layout::WithStaticSizes in btree.
  > `layout`: Delete outdated comments about ElementType alias not being used because of MSVC
  > Performance improvement for absl::AsciiStrToUpper() and absl::AsciiStrToLower()
  > `layout_benchmark`: Replace leftover comment with intended call to MyAlign
  > Remove absl::aligned_storage_t
  > Delete ABSL_ANNOTATE_MEMORY_IS_INITIALIZED under Thread Sanitizer
  > Remove vestigial variables in the DumpNode() helper in absl::Cord
  > Do hashtablez sampling on the first insertion into an empty SOO hashtable.
  > Add explicit #include directives for <tuple>, "absl/base/config.h", and "absl/strings/string_view.h".
  > Add a note about the cost of `VLOG` in non-debug builds.
  > Fix flaky test failures on MSVC.
  > Add template keyword to example comment for Layout::WithStaticSizes.
  > PR #1643: add xcprivacy to all subspecs
  > Record sampling stride in cord profiling to facilitate unsampling.
  > Fix a typo in a comment.
  > [log] Correct SetVLOGLevel to SetVLogLevel in comments
  > Add a feature to container_internal::Layout that lets you specify some array sizes at compile-time as template parameters. This can make offset and size calculations faster.
  > `layout`: Mark parameter of Slices with ABSL_ATTRIBUTE_UNUSED, remove old workaround
  > `layout`: Use auto return type for functions that explicitly instantiate std::tuple in return statements
  > Remove redundant semicolons introduced by macros
  > [log] Make :vlog_is_on/:absl_vlog_is_on public in BUILD.bazel
  > Add additional checks for size_t overflows
  > Replace //visibility:private with :__pkg__ for certain targets
  > PR #1603: Disable -Wnon-virtual-dtor warning for CommandLineFlag implementations
  > Add several missing includes in crc/internal
  > Roll back extern template instantiations in swisstable due to binary size increases in shared libraries.
  > Add nodiscard to SpinLockHolder.
  > Test that rehash(0) reduces capacity to minimum.
  > Add extern templates for common swisstable types.
  > Disable ubsan for benign unaligned access in crc_memcpy
  > Make swisstable SOO support GDB pretty printing and still compile in OSS.
  > Fix OSX support with CocoaPods and Xcode 15
  > Fix GCC7 C++17 build
  > Use UnixEpoch and ZeroDuration
  > Make flaky failures much less likely in BasicMocking.MocksNotTriggeredForIncorrectTypes test.
  > Delete a stray comment
  > Move GCC uninitialized memory warning suppression into MaybeInitializedPtr.
  > Replace usages of absl::move, absl::forward, and absl::exchange with their std:: equivalents
  > Fix the move to itself
  > Work around an implicit conversion signedness compiler warning
  > Avoid MSan: use-of-uninitialized-value error in find_non_soo.
  > Fix flaky MSVC test failures by using longer slop time.
  > Add ABSL_ATTRIBUTE_UNUSED to variables used in an ABSL_ASSUME.
  > Implement small object optimization in swisstable - disabled for now.
  > Document and test ability to use absl::Overload with generic lambdas.
  > Extract `InsertPosition` function to be able to reuse it.
  > Increase GraphCycles::PointerMap size
  > PR #1632: inlined_vector: Use trivial relocation for `erase`
  > Create `BM_GroupPortable_Match`.
  > [absl] Mark `absl::NoDestructor` methods with `absl::Nonnull` as appropriate
  > Automated Code Change
  > Rework casting in raw_hash_set's `IsFull()`.
  > Adds ABSL_ATTRIBUTE_LIFETIME_BOUND to absl::BitGenRef
  > Workaround for NVIDIA C++ compiler being unable to parse variadic expansions in range of range-based for loop
  > Rollback: Make DLOG(FATAL) not understood as [[noreturn]]
  > Make DLOG(FATAL) not understood as [[noreturn]]
  > Optimize `absl::Duration` division and modulo: Avoid repeated redundant comparisons in `IDivFastPath`.
  > Optimize `absl::Duration` division and modulo: Allow the compiler to inline `time_internal::IDivDuration`, by splitting the slow path to a separate function.
  > Fix typo in example code snippet.
  > Automated Code Change
  > Add braces for conditional statements in raw_hash_map functions.
  > Optimize `prepare_insert`, when resize happens. It removes single unnecessary probing before resize that is beneficial for small tables the most.
  > Add noexcept to move assignment operator and swap function
  > Import of CCTZ from GitHub.
  > Minor documentation updates.
  > Change find_or_prepare_insert to return std::pair<iterator, bool> to match return type of insert.
  > PR #1618: inlined_vector: Use trivial relocation for `SwapInlinedElements`
  > Improve raw_hash_set tests.
  > Performance improvement for absl::AsciiStrToUpper() and absl::AsciiStrToLower()
  > Use const_cast to avoid duplicating the implementation of raw_hash_set::find(key).
  > Import of CCTZ from GitHub.
  > Performance improvement for absl::AsciiStrToUpper() and absl::AsciiStrToLower()
  > Annotate that SpinLock should warn when unused.
  > PR #1625: absl::is_trivially_relocatable now respects assignment operators
  > Introduce `Group::MaskNonFull` without usage.
  > `demangle`: Parse template template and C++20 lambda template param substitutions
  > PR #1617: fix MSVC 32-bit build with -arch:AVX
  > Minor documentation fix for `absl::StrSplit()`
  > Prevent overflow in `absl::CEscape()`
  > `demangle`: Parse optional single template argument for built-in types
  > PR #1412: Filter out `-Xarch_` flags from pkg-config files
  > `demangle`: Add complexity guard to `ParseQRequiresExpr`
  < Prepare 20240116.1 patch for Apple Privacy Manifest (#1623)
  > Remove deprecated symbol absl::kuint128max
  > Add ABSL_ATTRIBUTE_WARN_UNUSED.
  > `demangle`: Parse `requires` clauses on template params, before function return type
  > On Apple, implement absl::is_trivially_relocatable with the fallback.
  > `demangle`: Parse `requires` clauses on functions
  > Make `begin()` to return `end()` on empty tables.
  > `demangle`: Parse C++20-compatible template param declarations, except those with `requires` expressions
  > Add the ABSL_DEPRECATE_AND_INLINE() macro
  > Span: Fixed comment referencing std::span as_writable_bytes() as as_mutable_bytes().
  > Switch rank structs to be consistent with written guidance in go/ranked-overloads
  > Avoid hash computation and `Group::Match` in small tables copy and use `IterateOverFullSlots` for iterating for all tables.
  > Optimize `absl::Hash` by making `LowLevelHash` faster.
  > Add -Wdead-code-aggressive to ABSL_LLVM_FLAGS
  < Backport Apple Privacy Manifest (#1613)
  > Stop using `std::basic_string<uint8_t>` which relies on a non-standard generic `char_traits<>` implementation, recently removed from `libc++`.
  > Add absl_container_hash-based HashEq specialization
  > `demangle`: Implement parsing for simplest constrained template arguments
  > Roll forward 9d8588bfc4566531c4053b5001e2952308255f44 (which was rolled back in 146169f9ad357635b9cd988f976b38bcf83476e3) with fix.
  > Add a version of absl::HexStringToBytes() that returns a bool to validate that the input was actually valid hexadecimal data.
  > Enable StringLikeTest in hash_function_defaults_test
  > Fix a typo.
  > Minor changes to the BUILD file for absl/synchronization
  > Avoid static initializers in case of ABSL_FLAGS_STRIP_NAMES=1
  > Rollback 9d8588bfc4566531c4053b5001e2952308255f44 for breaking the build
  > No public description
  > Decrease the precision of absl::Now in x86-64 debug builds
  > Optimize raw_hash_set destructor.
  > Add ABSL_ATTRIBUTE_UNINITIALIZED macros for use with clang and GCC's `uninitialized`
  > Optimize `Cord::Swap()` for missed compiler optimization in clang.
  > Type erased hash_slot_fn that depends only on key types (and hash function).
  > Replace `testonly = 1` with `testonly = True` in abseil BUILD files.
  > Avoid extra `& msbs` on every iteration over the mask for GroupPortableImpl.
  > Missing parenthesis.
  > Early return from destroy_slots for trivially destructible types in flat_hash_{*}.
  > Avoid export of testonly target absl::test_allocator in CMake builds
  > Use absl::NoDestructor for cordz global queue.
  > Add empty WORKSPACE.bzlmod
  > Introduce `RawHashSetLayout` helper class.
  > Fix a corner case in SpyHashState for exact boundaries.
  > Add nullability annotations
  > Use absl::NoDestructor for global HashtablezSampler.
  > Always check if the new frame pointer is readable.
  > PR #1604: Add privacy manifest
  < Disable ABSL_ATTRIBUTE_TRIVIAL_ABI in open-source builds  (#1606)
  > Remove code pieces for no longer supported GCC versions.
  > Disable ABSL_ATTRIBUTE_TRIVIAL_ABI in open-source builds
  > Prevent brace initialization of AlphaNum
  > Remove code pieces for no longer supported MSVC versions.
  > Added benchmarks for smaller size copy constructors.
  > Migrate empty CrcCordState to absl::NoDestructor.
  > Add protected copy ctor+assign to absl::LogSink, and clarify thread-safety requirements to apply to the interface methods.
  < Apply LTS transformations for 20240116 LTS branch (#1599)

Closes scylladb/scylladb#28756
2026-04-08 12:19:54 +03:00
Liapkovich
4f17cc6d83 docs: add missing rack value for internode_compression parameter
The rack option was fully implemented in the code but omitted from
both docs/operating-scylla/admin.rst and conf/scylla.yaml comments.

Closes scylladb/scylladb#29239
2026-04-08 12:19:54 +03:00
Pavel Emelyanov
0ea76a468f schema: Avoid copies in column_mapping::operator==
In a multi-declarator declaration, the & ref-qualifier is part of each
individual declarator, not the shared type specifier. So:

    const auto& a = x(), b = y();

declares 'a' as a reference but 'b' as a value, silently copying y().
The same applies to:

    const T& a = v[i], b = v[j];

Both operator== lines had this pattern, causing an unnecessary copy of
the column vector and an unnecessary copy of each entry on every call.

Fix by repeating & on the second declarator in both lines.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29213
2026-04-08 12:19:54 +03:00
Pavel Emelyanov
b7c14c6d29 token_metadata: Clear _topology_change_info gently
clear_gently() (introduced in 322aa2f8b5) clears all token_metadata_impl
members using co_await to avoid reactor stalls on large data structures.

_topology_change_info (introduced in 10bf8c7901) was added later and not
included in clear_gently().

update_topology_change_info() already uses utils::clear_gently() when
replacing the value, so it looks reasonable to apply the same pattern
in clear_gently().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29210
2026-04-08 12:19:54 +03:00
Pavel Emelyanov
54fbbf0410 locator/tablets: Fix missing selector value in error messages
Some on_internal_error() calls pass the selector argument to a format
string that has no placeholder for it.

"While at it", disambiguate selector type in the message text.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29208
2026-04-08 12:19:54 +03:00
Botond Dénes
418141ec08 Merge 'Drop create_dataset() helper from object_store tests' from Pavel Emelyanov
There's only one test left that uses it, and it can be patched to use standard ks/cf creation helpers from pylib. This patch does so and drops the lengthy create_dataset() helper

Tests improvements, no need to backport

Closes scylladb/scylladb#29176

* github.com:scylladb/scylladb:
  test/backup: drop create_dataset helper
  test/backup: use new_test_keyspace in test_restore_primary_replica
2026-04-08 12:19:54 +03:00
Petr Gusev
1e3c8c5a87 test_mutation_schema_change: use tablets
enable_tablets(false) was added when LWT wasn't supported for tablets. It is now, so this attribute is no longer needed.

The test covers behavior that should work the same way for both vnodes and tablets, so it doesn't seem it would benefit much from running in both enable_tablets(true) and enable_tablets(false) modes.

Closes scylladb/scylladb#29167
2026-04-08 12:19:54 +03:00
Pavel Emelyanov
7f854c0255 hints: Use shorter fault-injection overload
To apply a fault-injected delay, there's the inject(duration)
overload, which results in shorter code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29168
2026-04-08 10:51:37 +03:00
Botond Dénes
aeefbda304 Merge 'Simplify and improve API describe_ring code flow' from Pavel Emelyanov
The endpoint in question has some places worth fixing, in particular

- the keyspace parameter is not validated
- the validated table name is resolved into table_id, but the id is unused
- two ugly static helpers to stream obtained token ranges into json

Improving the API code flow, not backporting

Closes scylladb/scylladb#29154

* github.com:scylladb/scylladb:
  api: Inline describe_ring JSON handling
  storage_service: Make describe_ring_for_table() take table_id
2026-04-08 10:50:07 +03:00
Artsiom Mishuta
b1e9c0b867 test/pylib: add typed skip markers plugin
Add skip_reason_plugin.py — a framework-agnostic pytest plugin that
provides typed skip markers (skip_bug, skip_not_implemented, skip_slow,
skip_env) so that the reason a test is skipped is machine-readable in
JUnit XML and Allure reports.  Bare untyped pytest.mark.skip now
triggers a warning (to become an error after full migration).  Runtime
skips via skip() are also enriched by parsing the [type] prefix from
the skip message.

The plugin is a class (SkipReasonPlugin) that receives the concrete
SkipType enum and an optional report_callback from conftest.py, keeping
it decoupled from allure and project-specific types.

Extract SkipType enum and convenience runtime skip wrappers (skip_bug,
skip_env, etc.) into test/pylib/skip_types.py so callers only need a
single import instead of importing both SkipType and skip() separately.
conftest.py imports SkipType from the new module and registers the
plugin instance unconditionally (for all test runners).

New files:
- test/pylib/skip_reason_plugin.py: core plugin — typed marker
  processing, bare-skip warnings, JUnit/Allure report enrichment
  (including runtime skip() parsing via _parse_skip_type helper)
- test/pylib/skip_types.py: SkipType enum and convenience wrappers
  (skip_bug, skip_not_implemented, skip_slow, skip_env)
- test/pylib_test/test_skip_reason_plugin.py: 17 pytester-based
  test functions (51 cases across 3 build modes) covering markers,
  warnings, reports, callbacks, and skip_mode interaction

Infrastructure changes:
- test/conftest.py: import SkipType from skip_types, register
  SkipReasonPlugin with allure report callback
- test/pylib/runner.py: set SKIP_TYPE_KEY/SKIP_REASON_KEY stash keys
  for skip_mode so the report hook can enrich JUnit/Allure with
  skip_type=mode without longrepr parsing
- test/pytest.ini: register typed marker definitions (required for
  --strict-markers even when plugin is not loaded)

Migrated test files (representative samples):
- test/cluster/test_tablet_repair_scheduler.py:
  skip -> skip_bug (#26844), skip -> skip_not_implemented
- test/cqlpy/.../timestamp_test.py: skip -> skip_slow
- test/cluster/dtest/schema_management_test.py: skip -> skip_not_implemented
- test/cluster/test_change_replication_factor_1_to_0.py: skip -> skip_bug (#20282)
- test/alternator/conftest.py: skip -> skip_env
- test/alternator/test_https.py: use skip_env() wrapper

Fixes SCYLLADB-79

Closes scylladb/scylladb#29235
2026-04-08 10:38:56 +03:00
Pavel Emelyanov
e0fa9ee332 Merge 'storage: implement sstable clone for object storage' from Ernest Zaslavsky
This patch series implements `object_storage_base::clone`, which was previously a stub that aborted at runtime. Clone creates a copy of an sstable under a new generation and is used during compaction.

The implementation uses server-side object copies (S3 CopyObject / GCS Objects: rewrite) and mirrors the filesystem clone semantics: TemporaryTOC is written first to mark the operation as in-progress, component objects are copied, and TemporaryTOC is removed to commit (unless the caller requested the destination be left unsealed).

The first two patches fix pre-existing bugs in the underlying storage clients that were exposed by the new clone code path:
- GCS `copy_object` used the wrong HTTP method (PUT instead of POST) and sent an invalid empty request body.
- S3 `copy_object` silently ignored the abort_source parameter.

1. **gcp_client: fix copy_object request method and body** — Fix two bugs in the GCS rewrite API call.
2. **s3_client: pass through abort_source in copy_object** — Stop ignoring the abort_source parameter.
3. **object_storage: add copy_object to object_storage_client** — New interface method with S3 and GCS implementations.
4. **storage: add make_object_name overload with generation** — Helper for building destination object names with a different generation.
5. **storage: make delete_object const** — Needed by the const clone method.
6. **storage: implement object_storage_base::clone** — The actual clone implementation plus a copy_object wrapper.
7. **test/boost: enable sstable clone tests for S3 and GCS** — Re-enable the previously skipped tests.

A test similar to `sstable_clone_leaving_unsealed_dest_sstable` was added to properly test the sealed/unsealed states for object storage. Works for both S3 and GCS.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1045
Prerequisite: https://github.com/scylladb/scylladb/pull/28790
No need to backport since this code targets future feature

Closes scylladb/scylladb#29166

* github.com:scylladb/scylladb:
  compaction_test: enable sstable clone tests for S3 and GCS
  storage: implement object_storage_base::clone
  storage: make delete_object const in object_storage_base
  storage: add make_object_name overload with generation
  sstables: add get_format() accessor to sstable
  object_storage: add copy_object to object_storage_client
  s3_client: pass through abort_source in copy_object
  gcp_client: fix copy_object request method and body
2026-04-08 09:35:10 +03:00
Nadav Har'El
4eeb9f4120 lwt, vector: write to CDC when vector index is enabled.
The vector-search feature introduced the somewhat confusing feature of
enabling CDC without explicitly enabling CDC: When a vector index is
enabled on a table, CDC is "enabled" for it even if the user didn't
ask to enable CDC.

For this, write-path code began to use a new cdc_enabled() function
instead of checking schema.cdc_options.enabled() directly.
cdc_enabled() returns true if either cdc_options.enabled() or
has_vector_index() is true.

Unfortunately, LWT writes continued to use cdc_options.enabled() instead
of the new cdc_enabled(). This means that if a vector index is used and
a vector is written using an LWT write, the new value is not indexed.

This patch fixes this bug. It also adds a regression test that fails
before this patch and passes afterwards - the new test verifies that
when a table has a vector index (but no explicit CDC enabled), the CDC
log is updated both after regular writes and after successful LWT writes.

This patch was also tested in the context of the upcoming vector-search-
for-Alternator pull request, which has a test reproducing this bug
(Alternator uses LWT frequently, so this is very important there).
It will also be tested by the vector-store test suite ("validator").

Fixes SCYLLADB-1342

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#29300
2026-04-08 07:55:05 +03:00
Marcin Maliszkiewicz
1bf3110adb Merge 'test: add test_upgrade_preserves_ddl_audit_for_tables' from Andrzej Jackowski
Verify that upgrading from 2025.1 to master does not silently drop DDL
auditing for table-scoped audit configurations ([SCYLLADB-1155](https://scylladb.atlassian.net/browse/SCYLLADB-1155)).

Test time in dev: 4s

Refs: SCYLLADB-1155
Fixes: SCYLLADB-1305
No backport, test for bug on master

[SCYLLADB-1155]: https://scylladb.atlassian.net/browse/SCYLLADB-1155?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#29223

* github.com:scylladb/scylladb:
  test: add test_upgrade_preserves_ddl_audit_for_tables
  test: audit: split validate helper so callers need not pass audit_settings
  test: audit: declare manager attribute in AuditTester base class
2026-04-07 17:29:11 +02:00
Marcin Maliszkiewicz
895fdb6d29 Merge 'ldap: fix double-free of LDAPMessage in poll_results()' from Andrzej Jackowski
In the unregistered-ID branch, ldap_msgfree() was called on a result
already owned by an RAII ldap_msg_ptr, causing a double-free on scope
exit. Remove the redundant manual free.
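The ownership bug can be modeled outside C++ with a toy allocator (all names here are illustrative stand-ins, not the real libldap API):

```python
from contextlib import contextmanager

class MsgPool:
    """Toy allocator that detects double frees, standing in for the C
    library's ldap_msgfree()."""
    def __init__(self):
        self._live = set()

    def alloc(self):
        msg = object()
        self._live.add(msg)
        return msg

    def free(self, msg):
        if msg not in self._live:
            raise RuntimeError("double free of LDAPMessage")
        self._live.discard(msg)

@contextmanager
def owned_msg(pool):
    # Models the RAII ldap_msg_ptr: frees the message once, at scope exit.
    msg = pool.alloc()
    try:
        yield msg
    finally:
        pool.free(msg)

pool = MsgPool()
with owned_msg(pool) as msg:
    # The bug: an extra manual pool.free(msg) here would turn the
    # scope-exit free into a double free. The fix removes the manual
    # free, leaving ownership solely with the RAII wrapper.
    pass
assert not pool._live  # exactly one free happened, at scope exit
```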

Fixes: SCYLLADB-1344

Backport: 2026.1, 2025.4, 2025.1 - it's a memory corruption, with a one-line fix, so better backport it everywhere.

Closes scylladb/scylladb#29302

* github.com:scylladb/scylladb:
  test: ldap: add regression test for double-free on unregistered message ID
  ldap: fix double-free of LDAPMessage in poll_results()
2026-04-07 17:27:43 +02:00
Ernest Zaslavsky
422f107122 compaction_test: enable sstable clone tests for S3 and GCS
Now that object_storage_base::clone is implemented,
remove the early-return skips and re-enable the
sstable_clone_leaving_unsealed_dest_sstable tests for
both S3 and GCS storage backends.
2026-04-07 18:16:52 +03:00
Ernest Zaslavsky
7cd9bbb010 storage: implement object_storage_base::clone
Implement the clone method for object_storage_base, which creates
a copy of an sstable with a new generation using server-side object
copies. Also add a const copy_object convenience wrapper, similar
to the existing put_object and delete_object wrappers.

A dedicated test for the new object storage clone path will be
added in the following commit. The preexisting local-filesystem
clone is already covered by the sstable_clone_leaving_unsealed_dest_sstable
test.
2026-04-07 18:16:52 +03:00
Ernest Zaslavsky
8fa82e6b6f storage: make delete_object const in object_storage_base
The method doesn't modify any member state. Making it
const is needed for calling it from the const clone
method.
2026-04-07 18:16:52 +03:00
Ernest Zaslavsky
47387341bb storage: add make_object_name overload with generation
Add a make_object_name overload that accepts a target
generation parameter for constructing object names with
a generation different from the source sstable's own.

Refactor the original make_object_name to delegate to
the new overload, eliminating code duplication.

This is needed by clone to build destination object
names for the new generation.
2026-04-07 18:16:52 +03:00
Ernest Zaslavsky
8bd891c6ed sstables: add get_format() accessor to sstable
Add a public get_format() accessor for the _format member, following
the same pattern as the existing get_version(). This allows storage
implementations to access the sstable format without reaching into
private members, and is needed by the upcoming object_storage_base::clone
to construct entry_descriptor for the sstables registry.
2026-04-07 18:16:52 +03:00
Ernest Zaslavsky
3d23490615 object_storage: add copy_object to object_storage_client
Add a copy_object method to the object_storage_client
interface for server-side object copies, with
implementations for both S3 and GCS wrappers.

The S3 wrapper delegates to s3::client::copy_object.
The GCS wrapper delegates to gcp::storage::client's
cross-bucket copy_object overload.

This is a prerequisite for implementing sstable clone
on object storage.
2026-04-07 18:16:52 +03:00
Ernest Zaslavsky
1702d6e6d4 s3_client: pass through abort_source in copy_object
The abort_source parameter in s3::client::copy_object
was ignored — the function accepted it but always passed
nullptr to the underlying copy_s3_object. Forward it
properly so callers can cancel in-progress copies.
2026-04-07 18:16:52 +03:00
Ernest Zaslavsky
bfdc1e5267 gcp_client: fix copy_object request method and body
The GCP copy_object (rewrite API) had two bugs:

1. The request body was an empty string, but the GCP
   rewrite endpoint always parses it as JSON metadata.
   An empty string is not valid JSON, resulting in
   400 "Metadata in the request couldn't decode".
   Fix: send "{}" (empty JSON object) as the body.

2. The HTTP method was PUT, but the GCP Objects: rewrite
   API requires POST per the documentation.
   Fix: use POST.
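The first bug can be demonstrated with any strict JSON parser (Python's here, standing in for the GCS endpoint's metadata decoder; the helper name is illustrative):

```python
import json

def parse_rewrite_metadata(body: str) -> dict:
    # The GCS rewrite endpoint parses the request body as JSON metadata.
    # An empty body is not valid JSON and is rejected (HTTP 400); an
    # empty JSON object "{}" parses to "no metadata overrides".
    return json.loads(body)

# "{}" is accepted and yields an empty metadata dict.
assert parse_rewrite_metadata("{}") == {}

# An empty string is rejected by the JSON parser.
rejected = False
try:
    parse_rewrite_metadata("")
except json.JSONDecodeError:
    rejected = True
assert rejected
```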

Test coverage in a follow-up patch
2026-04-07 18:16:52 +03:00
Nadav Har'El
a0e79f391f Merge 'alternator: fix batch write item squashing cdc entries' from Radosław Cybulski
When `BatchWriteItem` operates on multiple items sharing the same partition key in `always_use_lwt` write isolation mode, all CDC log entries are emitted under a single timestamp. The previous `get_records` parsing algorithm in `alternator/streams.cc` assumed that all CDC log entries sharing the same timestamp correspond to a single DynamoDB item change. As a result, it would incorrectly squash multiple distinct item changes into a single Streams record — producing wrong event data (e.g., one INSERT instead of four, with mismatched key/attribute values).

Note: the bug is specific to `always_use_lwt` mode because only in LWT mode does the entire batch share a single timestamp. In non-LWT modes, each item in the batch receives a separate timestamp, so the entries naturally stay separate.

**Commit 1: alternator: add BatchWriteItem Streams test**

- Adds new tests `test_streams_batchwrite_no_clustering_deletes_non_existing_items` and `test_streams_batchwrite_no_clustering_deletes_existing_items` that cover the corner cases of batch-deleting an existing and a non-existing item in a table without a clustering key. CDC tables without clustering keys are handled differently, and this path was previously untested for delete operations.
- Adds a new test `test_streams_batchwrite_into_the_same_partition_will_report_wrong_stream_data`, which is a simple way to trigger the bug.
- Adds a new test `test_streams_batchwrite_into_the_same_partition_deletes_existing_items`, that validates various combinations of puts and deletes in a single BatchWrite against the same partition.
- Adds a new `test_table_ss_new_and_old_images_write_isolation_always` fixture and extends `create_table_ss` to accept `additional_tags`, enabling tests with a specific write isolation mode.

**Commit 2: alternator: fix BatchWriteItem squashed Streams entries**

The core fix rewrites the CDC log entry parsing in `get_records` to distinguish items by their clustering key:

- Introduces `managed_bytes_ptr_hash` and `managed_bytes_ptr_equal` helper structs for pointer-based hash map lookups on `managed_bytes`.
- Replaces the single `record`/`dynamodb` pair with a `std::unordered_map<const managed_bytes*, Record, ...>` (`records_map`) keyed by the base table's clustering key value from each CDC log row. For tables without a clustering key, all entries map to a single sentinel key.
- Adds a validation that Alternator tables have at most one clustering key column (as required by the DynamoDB data model).
- On end-of-record (`eor`), flushes all accumulated per-clustering-key records into the output, each with a unique `eventID` (the `event_id` format now includes an index suffix).
- Adjusts the limit check: since a single CDC timestamp bucket can now produce multiple output records, the limit may be slightly exceeded to avoid breaking mid-batch.
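A rough Python model of the keying scheme (row shapes and names are illustrative, not the C++ code):

```python
NO_CLUSTERING_KEY = object()  # sentinel for tables without a clustering key

def group_cdc_rows(rows, timestamp):
    """Group CDC log rows that share one timestamp into per-item records.

    Rows for the same item carry the same clustering key ('ck'). The old
    logic merged everything under one timestamp into a single record;
    keying by clustering key keeps distinct items apart.
    """
    records = {}
    for row in rows:
        key = row.get("ck", NO_CLUSTERING_KEY)
        records.setdefault(key, []).append(row)
    # Flush with unique event IDs: the timestamp plus an index suffix.
    return [
        {"eventID": f"{timestamp}-{i}", "rows": grouped}
        for i, grouped in enumerate(records.values())
    ]

rows = [{"ck": "a", "op": "PUT"}, {"ck": "b", "op": "PUT"},
        {"ck": "a", "op": "DELETE"}]
out = group_cdc_rows(rows, "ts1")
assert len(out) == 2                      # two distinct items, not one
assert out[0]["eventID"] != out[1]["eventID"]
```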

Fixes #28439
Fixes: SCYLLADB-540

Closes scylladb/scylladb#28452

* github.com:scylladb/scylladb:
  alternator/test: explain why 'always' write isolation mode is used in tests
  alternator/test: add scylla_only to always write isolation fixture
  alternator: fix BatchWriteItem squashed Streams entries
  alternator: add BatchWriteItem test (failing)
2026-04-07 17:49:23 +03:00
Nadav Har'El
22e7ef46a7 Merge 'vector_search: fix SELECT on local vector index' from Karol Nowacki
Queries against local vector indexes were failing with the error:
```ANN ordering by vector requires the column to be indexed using 'vector_index'```

This was a regression introduced by 15788c3734, which incorrectly
assumed the first column in the targets list is always the vector column.
For local vector indexes, the first column is the partition key, causing
the failure.

Previously, serialization logic for the target index option was shared
between vector and secondary indexes. This is no longer viable due to
the introduction of local vector indexes and vector indexes with filtering
columns, which have a different target format.

This commit introduces a dedicated JSON-based serialization format for
vector index targets, identifying the target column (tc), filtering
columns (fc), and partition key columns (pk). This ensures unambiguous
serialization and deserialization for all vector index types.

This change is backward compatible for regular vector indexes. However,
it breaks compatibility for local vector indexes and vector indexes with
filtering columns created in version 2026.1.0. To mitigate this, usage
of these specific index types will be blocked in the 2026.1.0 release
by failing ANN queries against them in vector-store service.

Fixes: SCYLLADB-895

Backport to 2026.1 is required as this issue occurs also on this branch.

Closes scylladb/scylladb#28862

* github.com:scylladb/scylladb:
  index: fix DESC INDEX for vector index
  vector_search: test: refactor boilerplate setup
  vector_search: fix SELECT on local vector index
  index: test: vector index target option serialization test
  index: test: secondary index target option serialization test
2026-04-07 17:43:35 +03:00
Michał Jadwiszczak
9cf94116c2 db/view/view_building_worker: fix indentation 2026-04-07 16:12:04 +02:00
Michał Jadwiszczak
c9aa5bb09c db/view/view_building_worker: lock staging sstables mutex for necessary shards when creating tasks
To create `process_staging` view building tasks, we first need to
collect information about them on shard 0, create the necessary mutations,
commit them to group0 and move the staging sstable objects to their
original shards.

But there is a possible race after committing the group0 command
and before moving the staging sstables to their shards.
Between those two events, the coordinator may schedule freshly created
tasks and dispatch them to the worker, but the worker won't have the
sstable objects because they weren't moved yet.

This patch fixes the race by holding `_staging_sstables_mutex` locks
from the necessary shards when executing `create_staging_sstable_tasks()`.
With this, even if the task is scheduled and dispatched quickly,
the worker will wait to execute it until the sstable objects are
moved and the locks are released.

Fixes SCYLLADB-816
2026-04-07 16:11:45 +02:00
Pavel Emelyanov
58e59e8c0d Merge 'test: add test_sstable_clone_preserves_staging_state' from Benny Halevy
Add a test that verifies filesystem_storage::clone preserves the sstable
state: an sstable in staging is cloned to a new generation, the clone is
re-loaded from the staging directory, and its state is asserted to still
be staging.

The change proves that https://scylladb.atlassian.net/browse/SCYLLADB-1205
is invalid, and can be closed.

* No functional change and no backport needed

Closes scylladb/scylladb#29209

* github.com:scylladb/scylladb:
  test: add test_sstable_clone_preserves_staging_state
  test: derive sstable state from directory in test_env::make_sstable
  sstables: log debug message in filesystem_storage::clone
2026-04-07 17:02:04 +03:00
Botond Dénes
816f2bf163 Merge 'cql3: fix null handling in data_value formatting' from Dario Mirovic
`data_value::to_parsable_string()` crashes with a null pointer dereference when called on a `null` data_value. Return `"null"` instead.

Added tests after the fix. Manually checked that tests fail without the fix.

Fixes SCYLLADB-1350

This fix prevents a crash during formatting. There is no known occurrence in production, but a backport is desirable.
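The shape of the fix can be modeled in Python (the real code is C++; the function and quoting rules here are illustrative):

```python
def to_parsable_string(value):
    """Stand-in for data_value::to_parsable_string(): before the fix, a
    null value led to a null pointer dereference; the fix handles null
    explicitly and renders it as the CQL literal "null"."""
    if value is None:
        return "null"
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"  # CQL string quoting
    return str(value)

assert to_parsable_string(None) == "null"
assert to_parsable_string(42) == "42"
assert to_parsable_string("it's") == "'it''s'"
```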

Closes scylladb/scylladb#29262

* github.com:scylladb/scylladb:
  test: boost: test null data value to_parsable_string
  cql3: fix null handling in data_value formatting
2026-04-07 16:35:31 +03:00
Dimitrios Symonidis
701808d7aa test/object_store: parametrize test_basic over replication factor
Extend test_basic to run with both RF=1 and RF=3 to verify that
object storage works correctly with multiple replicas. The test now
starts one server per replica (each on its own rack), flushes all
nodes, validates tablet replica counts for RF>1, and restarts all
servers before verifying data is still readable.

Fixes: SCYLLADB-546

Closes scylladb/scylladb#28583
2026-04-07 16:27:44 +03:00
Nadav Har'El
f642db0693 test/alternator: tests for missing support of ReturnConsumedCapacity
As noted in issue #5027 and issue #29138, Alternator's support for
ReturnConsumedCapacity is lacking in two areas:

1. While ReturnConsumedCapacity is supported for most relevant
   operations, it's not supported in two operations: Query and Scan.

2. While ReturnConsumedCapacity=TOTAL is supported, INDEXES is not
   supported at all.

This patch adds extensive tests for all these cases. All these tests
pass on DynamoDB but fail on Alternator, so are marked with "xfail".

The tests for ReturnConsumedCapacity=INDEXES are deliberately split
into two: First, we test the case where the table has no indexes, so
INDEXES is almost the same as TOTAL and should be very easy to
implement. A second test checks the cases where there are indexes,
and different operations increment the capacity of the base table
and/or indexes differently - it will require significantly more work
to make the second test pass.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#29188
2026-04-07 16:07:41 +03:00
Nadav Har'El
f590ee2b7e cdc, vector: fix CDC result tracker for vector indexes
When a table has a vector index, cdc::cdc_enabled() returns true because
vector index writes are implemented via the CDC augmentation path. However,
register_cdc_operation_result_tracker() was checking only
cdc_options().enabled(), which is false for tables that have a vector index
but not traditional CDC.

As a result, the operation_result_tracker was never attached to write
response handlers for vector-indexed tables. This tracker was added in
commit 1b92cbe, and its job is to update CDC operation metrics.
Since vector search really does use CDC under the hood, these
metrics can be useful when diagnosing problems.

Fix by using cdc::cdc_enabled() instead of cdc_options().enabled(), which
covers both traditional CDC and vector-indexed tables.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#29343
2026-04-07 15:54:51 +03:00
Avi Kivity
8c629d55b0 test: vector_search: check [[nodiscard]] return values of expected<> types
Clang 22 verifies [[nodiscard]] for co_await,
causing compilation failures where return values of expected<> were
silently discarded.

These call sites were discarding the return value of client::request()
and vector_store_client::ann(), both of which return expected<> types
marked [[nodiscard]]. Rather than suppressing the warning with (void)
casts, properly check the return values using the established test
patterns: BOOST_CHECK(result) where the call is expected to succeed,
and BOOST_CHECK(!result) where the call is expected to fail.

Closes scylladb/scylladb#29297
2026-04-07 15:25:08 +03:00
Anna Stuchlik
176f6fb59e doc: add the 2026.x patch release upgrade guide-from-2025
This patch adds the upgrade guide for all patch releases within the 2026.x major release.

In addition, it fixes the link to Upgrade Policy in the 2025.x-to-2026.1 upgrade guide.

Fixes SCYLLADB-1247

Closes scylladb/scylladb#29307
2026-04-07 13:52:16 +02:00
Anna Stuchlik
d329c91f9e doc: remove About Upgrade and redirect to Upgrade Policy
While fixing https://github.com/scylladb/scylladb/issues/28997, we added a new page about upgrade policy:
https://docs.scylladb.com/stable/versioning/upgrade-policy.html

This commit removes the old page and adds redirections to the new Upgrade Policy page
in the unversioned documentation set.

Closes scylladb/scylladb#29251
2026-04-07 13:44:10 +02:00
Andrei Chekun
93583bf193 test.py: use safe_drive_shutdown in the tests
These methods for closing the driver were missed during the original fix.

Fixes: SCYLLADB-900

Closes scylladb/scylladb#29093
2026-04-07 14:35:18 +03:00
Avi Kivity
00409b61f1 Merge 'Add Vnodes to Tablets Migration Procedure' from Nikos Dragazis
This PR introduces the vnodes-to-tablets migration procedure, which enables converting an existing vnode-based keyspace to tablets.

The migration is implemented as a manual, operator-driven process executed in several stages. The core idea is to first create tablet maps with the same token boundaries and replica hosts as the vnodes, and then incrementally convert the storage of each node to the tablets layout. At a high level, the procedure is the following:
1. Create tablet maps for all tables in the keyspace.
2. Sequentially upgrade all nodes from vnodes to tablets:
    1. Mark a node for upgrade in the topology state.
    2. Restart the node. During startup, while the node is offline, it reshards the SSTables on vnode boundaries and switches to a tablet ERM.
    3. Wait for the node to return online before proceeding to the next node.
3. Finalize the migration:
    1. Update the keyspace schema to mark it as tablet-based.
    2. Clear the group0 state related to the migration.

From the client's perspective, the migration is online; the cluster can still serve requests on that keyspace, although performance may be temporarily degraded.

During the migration, some nodes use vnode ERMs while others use tablet ERMs. Cluster-level algorithms such as load balancing will treat the keyspace's tables as vnode-based. Once migration is finalized, the keyspace is permanently switched to tablets and cannot be reverted back to vnodes. However, a rollback procedure is available before finalization.

The patch series consists of:
* Load balancer adjustments to ignore tablets belonging to a migrating keyspace.
* A new vnode-based resharding mode, where SSTables are segregated on vnode boundaries rather than with the static sharder.
* A new per-node `intended_storage_mode` column in `system.topology`, representing the migration intent (whether migration should occur on restart) and direction.
* Four new REST endpoints for driving the migration (start, node upgrade/downgrade, finalize, status), along with `nodetool` wrappers. The finalization is implemented as a global topology request.
* Wiring of the migration process into the startup logic: the `distributed_loader` determines a migrating table's ERM flavor from the `intended_storage_mode` and the ERM flavor determines the `table_populator`'s resharding mode. Token metadata changes have been adjusted to preserve the ERM flavor.
* Cluster tests for the migration process.

Fixes SCYLLADB-722.
Fixes SCYLLADB-723.
Fixes SCYLLADB-725.
Fixes SCYLLADB-779.
Fixes SCYLLADB-948.

New feature, no backport is needed.

Closes scylladb/scylladb#29065

* github.com:scylladb/scylladb:
  docs: Add ops guide for vnodes-to-tablets migration
  test: cluster: Add test for migration of multiple keyspaces
  test: cluster: Add test for error conditions
  test: cluster: Add vnodes->tablets migration test (rollback)
  test: cluster: Add vnodes->tablets migration test (1 table, 3 nodes)
  test: cluster: Add vnodes->tablets migration test (1 table, 1 node)
  scylla-nodetool: Add migrate-to-tablets subcommand
  api: Add REST endpoint for vnode-to-tablet migration status
  api: Add REST endpoint for migration finalization
  topology_coordinator: Add `finalize_migration` request
  database: Construct migrating tables with tablet ERMs
  api: Add REST endpoint for upgrading nodes to tablets
  api: Add REST endpoint for starting vnodes-to-tablets migration
  topology_state_machine: Add intended_storage_mode to system.topology
  distributed_loader: Wire vnode-based resharding into table populator
  replica: Pick any compaction group for resharding
  compaction: resharding_compaction: add vnodes_resharding option
  storage_service: Preserve ERM flavor of migrating tables
  tablet_allocator: Exclude migrating tables from load balancing
  feature_service: Add vnodes_to_tablets_migrations feature
2026-04-07 14:32:22 +03:00
Łukasz Paszkowski
6f364fd3b7 db: fix system.size_estimates to aggregate sstable estimates across all shards
The estimate() function in the size_estimates virtual reader only
considered sstables local to the shard that happened to own the
keyspace's partition key token. Since sstables are distributed across
shards, this caused partition count estimates to be approximately
1/smp_count of the actual value.

This bug has been present since the virtual reader was introduced in
225648780d.

Use db.container().map_reduce0() to aggregate sstable estimates
across all shards. Each shard contributes its local count and
estimated_histogram, which are then merged to produce the correct
total.
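The aggregation can be sketched in Python (a model of the map_reduce0 merge, not the actual C++ code; the shard data is made up):

```python
def merge_estimates(per_shard):
    """Merge per-shard (partition_count, histogram) pairs.

    Models the cross-shard aggregation: before the fix, only one shard's
    local estimate was reported, roughly 1/smp_count of the real value.
    """
    total_count = 0
    total_hist = []
    for count, hist in per_shard:
        total_count += count
        # Grow the merged histogram if this shard has more buckets.
        if len(hist) > len(total_hist):
            total_hist.extend([0] * (len(hist) - len(total_hist)))
        for i, bucket in enumerate(hist):
            total_hist[i] += bucket
    return total_count, total_hist

# Two shards, each holding part of the keyspace's sstables.
shards = [(100, [5, 3]), (140, [2, 4, 1])]
count, hist = merge_estimates(shards)
assert count == 240          # not just one shard's 100 or 140
assert hist == [7, 7, 1]
```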

Also fix the `test_partitions_estimate_full_overlap` test which becomes
flaky (xpassing ~1% of runs) because autocompaction could merge the
two overlapping sstables before the size estimate was read. Wrap the
test body in nodetool.no_autocompaction_context to prevent this race.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1179
Refs https://github.com/scylladb/scylladb/issues/9083

Closes scylladb/scylladb#29286
2026-04-07 14:13:26 +03:00
Piotr Smaron
7d449a307c docs: remove old audit design doc
As discussed with @ScyllaPiotr in
https://github.com/scylladb/scylladb/pull/29232, the doc about to be
removed is just:
> Looking at history, I think this audit.md is a design doc: scylladb/scylla-enterprise@87a5c19, for which the feature has been implemented differently, eventually, and was created around the time when design docs, apparently, were stored within the repository itself. So for me it's some trash (sorry for strong language) that can be safely removed.

Closes scylladb/scylladb#29316
2026-04-07 14:11:53 +03:00
Avi Kivity
8b4a91982b cmake: add missing rolling_max_tracker_test and symmetric_key_test
Added in 5b2a07b408 and c596ae6eb1 without cmake integration.

Closes scylladb/scylladb#29328
2026-04-07 14:09:00 +03:00
Avi Kivity
d01c9a425f test: test_out_of_storage_prevention: fix invalid escape in regex
Python warns that the sequence "\(" is an invalid escape and
might be rejected in the future. Protect against that by using
a raw string.
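For illustration, the difference between the two spellings in Python:

```python
import re

# In a normal string literal, a backslash before "(" is an invalid
# escape sequence: Python emits a SyntaxWarning and may reject it
# outright in a future release. A raw string sidesteps string-level
# escaping entirely, so the backslash reaches the regex engine
# unchanged, which is what the pattern intends.
pattern = re.compile(r"\(foo\)")        # safe: raw string
assert pattern.match("(foo)")
assert pattern.match("foo") is None

# The two spellings denote the same characters; only the literal differs.
assert r"\(foo\)" == "\\(foo\\)"
```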

Closes scylladb/scylladb#29334
2026-04-07 14:06:32 +03:00
Pavel Emelyanov
0ae781c008 Merge 'test: auth_test: coroutinize' from Avi Kivity
Convert auth_test.cc to coroutines for improved readability. Each test is converted in its own commit. Some
are trivial.

Indentation is left broken in some commits to reduce the diff, then fixed up in the last commit.

Code cleanup, so no backport.

Closes scylladb/scylladb#29336

* github.com:scylladb/scylladb:
  auth_test: fix whitespace
  auth_test: coroutinize test_try_describe_schema_with_internals_and_passwords_as_anonymous_user
  auth_test: coroutinize test_try_login_after_creating_roles_with_hashed_password
  auth_test: coroutinize test_create_roles_with_hashed_password_and_log_in
  auth_test: coroutinize test_try_create_role_with_hashed_password_as_anonymous_user
  auth_test: coroutinize test_try_to_create_role_with_password_and_hashed_password
  auth_test: coroutinize test_try_to_create_role_with_hashed_password_and_password
  auth_test: coroutinize test_alter_with_workload_type
  auth_test: coroutinize test_alter_with_timeouts
  auth_test: coroutinize role_permissions_table_is_protected
  auth_test: coroutinize role_members_table_is_protected
  auth_test: coroutinize roles_table_is_protected
  auth_test: coroutinize test_password_authenticator_operations
  auth_test: coroutinize test_password_authenticator_attributes
  auth_test: coroutinize test_default_authenticator
2026-04-07 14:05:32 +03:00
Botond Dénes
513af59130 encryption: improve error message when KMS host is not configured
When an SSTable was encrypted with a KMS host that is not present in
scylla.yaml, the error thrown was:

  std::invalid_argument (No such host: <host-name>)

This message is very obscure in general, and especially confusing when
encountered while using the scylla-sstable tool: it gives no indication
that the SSTable is encrypted, that a KMS host lookup is involved, or
what the user needs to do to fix the problem.

Replace it with a message that names the missing host and points
directly to the relevant scylla.yaml section:

  Encryption host "<host-name>" is not defined in scylla.yaml.
  Make sure it is listed under the "kmip_hosts" section.

The wording is intentionally kept neutral (not framed as an SSTable tool
problem) because the same code path is exercised by production ScyllaDB
when a node's configuration no longer contains a host referenced by an
existing data file (e.g. after a config rollback or when restoring data
from a different cluster). The production use-case takes precedence, but
the message is equally actionable from the tool.

Closes scylladb/scylladb#29228
2026-04-07 14:00:27 +03:00
Botond Dénes
7344c05494 scylla-gdb.py: fix small_vector.__len__()
start - end will result in negative length, rejected by the python
runtime. Use the correct end - start to calculate length.
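A minimal illustration of why the bug surfaces as a runtime error rather than a wrong number (the class is a toy model of the pretty-printer, not scylla-gdb.py itself):

```python
class SmallVectorView:
    """Toy model of the gdb pretty-printer: __len__ computed the span
    as start - end, which is negative, and Python's len() rejects any
    negative result with "ValueError: __len__() should return >= 0"."""
    def __init__(self, start, end):
        self.start, self.end = start, end

    def __len__(self):
        # Buggy version: return self.start - self.end  -> negative
        return self.end - self.start  # correct: end - start

v = SmallVectorView(start=100, end=104)
assert len(v) == 4
```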

Closes scylladb/scylladb#29249
2026-04-07 13:57:21 +03:00
Botond Dénes
f71d2e78d8 tombstone_gc: don't use real-db for validation and determining default
data_dictionary::database was converted to replica::database in two
places, just to call find_keyspace(), then call
get_replication_strategy() on the returned keyspace. This is not
necessary, data_dictionary::database already has find_keyspace() and the
returned data_dictionary::keyspace also has get_replication_strategy().

This patch removes a small layering violation but more importantly, it
is necessary for the sstable tool to be able to load schemas from disk,
when said schema has tombstone_gc props.

Closes scylladb/scylladb#29279
2026-04-07 13:56:24 +03:00
Pavel Emelyanov
d6df5ef60a Merge 'compaction_test: Make compaction tests backend‑agnostic and add S3/GCS support' from Ernest Zaslavsky
This series updates the storage abstraction and extends the compaction tests to support object‑storage backends (S3 and GCS), while tightening several parts of the test environment.

The changes include:

- New exists/object_exists helpers across storage backends and clock fixes in the S3 client to make signature generation stable under test conditions.

- A new get_storage_for_tests accessor and adjustments to the test environment to avoid premature teardown of the sstable registry.

- Refactoring of compaction tests to remove direct sstable access, ensure proper schema setup, and avoid use of moved‑from objects.

- Extraction of test_env‑based logic into reusable functions and addition of S3/GCS variants of the compaction tests.

Not all tests were converted to be backend‑agnostic yet, and a few require further investigation before they can run cleanly against S3/GCS backends. These will be addressed in follow‑up work.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-704, however a follow-up is needed

No backport needed since this change targets a future feature

Closes scylladb/scylladb#28790

* github.com:scylladb/scylladb:
  compaction_test: fix formatting after previous patches
  compaction_test: add S3/GCS variations to tests
  compaction_test: extract test_env-based tests into functions
  compaction_test: replace file_exists with storage::exists
  compaction_test: initialize tables with schema via make_table_for_tests
  compaction_test: use sstable APIs to manipulate component files
  compaction_test: fix use-after-move issue
  sstable_utils: add `get_storage` and `open_file` helpers
  test_env: delay unplugging sstable registry
  storage: add `exists` method to storage abstraction
  s3_client: use lowres_system_clock for aws_sigv4
  s3_client: add `object_exists` helper
  gcs_client: add `object_exists` helper
2026-04-07 13:53:48 +03:00
Piotr Dulikowski
4161273b4c Merge 'view_building_worker: fix race during draining procedure' from Michał Jadwiszczak
View building worker was breaking semaphores without holding their locks.
This led to races like SCYLLADB-844 and SCYLLADB-543,
where a new batch was started after `view_building_worker::state` was cleared in the `drain()` process.

This patch fixes the race by:
- taking a lock on the mutex before breaking it
- distinguishing between `state::clear()` (can happen multiple times) and `state::drain()` (can be called only once during shutdown)
- asserting that the state is not doing any new work after it was drained

Fixes SCYLLADB-844
Fixes SCYLLADB-543

This PR should be backported to all versions containing view building coordinator (2025.4 and newer).

Closes scylladb/scylladb#29303

* github.com:scylladb/scylladb:
  view_building_worker: extract starting a new batch to state's method
  view_building_worker: distinguish between state's `clear()` and `drain()`
  view_building_worker: lock mutexes before breaking them in `drain()`
  view_building_worker: execute drain() once
2026-04-07 12:13:51 +02:00
Avi Kivity
bc10e1a171 test: fix flaky test_login by not retrying authentication failures
The fix for SCYLLADB-1373 (b4f652b7c1) changed get_session() to use
the default timeout=30 for the retry loop in patient_*_cql_connection
(previously timeout=0.1). This correctly allowed retrying transient
NoHostAvailable errors during node startup, but introduced a new
flakiness in test_login and other auth tests.

The failure chain:

1. test_login connects with bad credentials (e.g. user="doesntexist")
2. get_session() calls patient_exclusive_cql_connection(), which calls
   retry_till_success() with bypassed_exception=NoHostAvailable
3. The first attempt correctly fails: the server rejects the credentials
   with AuthenticationFailed, wrapped in NoHostAvailable
4. retry_till_success() catches NoHostAvailable indiscriminately and
   retries, not distinguishing between transient errors (node not ready)
   and permanent errors (bad credentials)
5. A subsequent retry attempt times out (connect_timeout=5), producing
   OperationTimedOut wrapped in NoHostAvailable
6. After 30 seconds, the last NoHostAvailable is raised -- now wrapping
   OperationTimedOut instead of the original AuthenticationFailed
7. The assertion `isinstance(..., AuthenticationFailed)` fails

With the old timeout=0.1, the deadline was already exceeded after the
first attempt, so the original AuthenticationFailed propagated.

Fix: Add a `should_retry` predicate parameter to retry_till_success()
and use it in patient_cql_connection() and
patient_exclusive_cql_connection() to immediately re-raise
NoHostAvailable when it wraps AuthenticationFailed. Retrying
authentication failures is never useful since the credentials won't
change between attempts.
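A sketch of the predicate approach (simplified stand-ins for the dtest helpers and driver exceptions, not the actual test-infrastructure code):

```python
import time

def retry_till_success(fn, *, timeout=30.0, bypassed_exception=Exception,
                       should_retry=lambda exc: True, interval=0.0):
    """Retry fn until it succeeds or the deadline passes.

    should_retry lets callers re-raise permanent failures (e.g. bad
    credentials) immediately instead of burning the whole timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return fn()
        except bypassed_exception as exc:
            if not should_retry(exc) or time.monotonic() >= deadline:
                raise
            time.sleep(interval)

class NoHostAvailable(Exception):
    def __init__(self, errors):
        self.errors = errors  # host -> underlying error

class AuthenticationFailed(Exception):
    pass

def is_transient(exc):
    # Retrying bad credentials is never useful: they won't change.
    return not any(isinstance(e, AuthenticationFailed)
                   for e in exc.errors.values())

def bad_login():
    raise NoHostAvailable({"127.0.0.1": AuthenticationFailed("bad user")})

try:
    retry_till_success(bad_login, timeout=30.0,
                       bypassed_exception=NoHostAvailable,
                       should_retry=is_transient)
except NoHostAvailable as exc:
    failure = exc
# The original AuthenticationFailed is preserved, with no 30s retry loop.
assert isinstance(next(iter(failure.errors.values())), AuthenticationFailed)
```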

Fixes: SCYLLADB-1382

Closes scylladb/scylladb#29348
2026-04-07 10:17:31 +03:00
Michał Jadwiszczak
51c164c8d2 view_building_worker: extract starting a new batch to state's method
Following the previous commit, a new batch cannot be started if the
state was already drained.
This commit also adds a check that only one batch is running at a time.
2026-04-07 08:39:05 +02:00
Michał Jadwiszczak
639aa223f3 view_building_worker: distinguish between state's clear() and drain()
While both of these methods do the same thing (abort the current batch, clear
data), we can clear the state multiple times during the view_building_worker
lifetime (for instance when the processed base table changes), but
`view_building_worker::state::drain()` should be called only once, and
after that no other work on the state should be done.
2026-04-07 08:39:05 +02:00
Michał Jadwiszczak
7aea524f52 view_building_worker: lock mutexes before breaking them in drain()
Not doing this may lead to races like SCYLLADB-844.
If some consumer is holding a lock on a mutex and `drain()`
just breaks the mutex without locking it beforehand,
then the consumer may keep running code that should have been aborted.

An example of the race is SCYLLADB-844, where `work_on_tasks()` is
holding `_state._mutex` while it is broken by `drain()`.
This allows a new batch to be started after the `_state` is cleared.
2026-04-07 08:39:00 +02:00
Michał Jadwiszczak
91c7ac1fb2 view_building_worker: execute drain() once
Future changes will require that the view building worker is drained
only once per its lifetime.
2026-04-07 08:35:02 +02:00
Avi Kivity
b4f652b7c1 test: fix flaky test_create_ks_auth by removing bad retry timeout
get_session() was passing timeout=0.1 to patient_exclusive_cql_connection
and patient_cql_connection, leaving only 0.1 seconds for the retry loop
in retry_till_success(). Since each connection attempt can take up to 5
seconds (connect_timeout=5), the retry loop effectively got only one
attempt with no chance to retry on transient NoHostAvailable errors.

Use the default timeout=30 seconds, consistent with all other callers.

Fixes: SCYLLADB-1373

Closes scylladb/scylladb#29332
2026-04-05 19:13:15 +03:00
Avi Kivity
2f0d178510 auth_test: fix whitespace
Fix over-indented lines inside do_with_mc lambda bodies introduced
during coroutinization.
2026-04-05 18:28:23 +03:00
Avi Kivity
7a24da9e88 auth_test: coroutinize test_try_describe_schema_with_internals_and_passwords_as_anonymous_user
Use co_await instead of return for improved readability.
2026-04-05 18:26:30 +03:00
Avi Kivity
e1b52cf337 auth_test: coroutinize test_try_login_after_creating_roles_with_hashed_password
Use co_await instead of return for improved readability.
2026-04-05 18:26:30 +03:00
Avi Kivity
24d36ad459 auth_test: coroutinize test_create_roles_with_hashed_password_and_log_in
Use co_await instead of return for improved readability.
2026-04-05 18:26:30 +03:00
Avi Kivity
6f20129eec auth_test: coroutinize test_try_create_role_with_hashed_password_as_anonymous_user
Use co_await instead of return for improved readability.
2026-04-05 18:26:30 +03:00
Avi Kivity
cece181113 auth_test: coroutinize test_try_to_create_role_with_password_and_hashed_password
Use co_await instead of return for improved readability.
2026-04-05 18:26:30 +03:00
Avi Kivity
752391f757 auth_test: coroutinize test_try_to_create_role_with_hashed_password_and_password
Use co_await instead of return for improved readability.
2026-04-05 18:26:30 +03:00
Avi Kivity
287625b297 auth_test: coroutinize test_alter_with_workload_type
Use co_await instead of return for improved readability.
2026-04-05 18:26:30 +03:00
Avi Kivity
4eeb5ef54d auth_test: coroutinize test_alter_with_timeouts
Use co_await instead of return for improved readability.
2026-04-05 18:26:30 +03:00
Avi Kivity
170c71b25d auth_test: coroutinize role_permissions_table_is_protected
Use co_await for improved readability.
2026-04-05 18:26:30 +03:00
Avi Kivity
13eccf519f auth_test: coroutinize role_members_table_is_protected
Use co_await for improved readability.
2026-04-05 18:26:30 +03:00
Avi Kivity
43ff3798ad auth_test: coroutinize roles_table_is_protected
Use co_await for improved readability.
2026-04-05 18:26:30 +03:00
Avi Kivity
c586eeb003 auth_test: coroutinize test_password_authenticator_operations
Flatten continuation chains (.then()) into linear thread-style code
with .get() calls for improved readability. Remove the now-unused
require_throws helper template.
2026-04-05 18:26:25 +03:00
Avi Kivity
fbccfe5c9d auth_test: coroutinize test_password_authenticator_attributes
Use co_await instead of return+do_with_cql_env+make_ready_future
for improved readability.
2026-04-05 17:28:09 +03:00
Avi Kivity
e3dee64003 auth_test: coroutinize test_default_authenticator
Use co_await instead of return+do_with_cql_env+make_ready_future
for improved readability.
2026-04-05 17:27:45 +03:00
Jenkins Promoter
ab4a2cdde2 Update pgo profiles - aarch64 2026-04-05 16:58:02 +03:00
Jenkins Promoter
b97cf0083c Update pgo profiles - x86_64 2026-04-05 16:00:15 +03:00
Nikos Dragazis
6d50e67bd2 scylla_swap_setup: Remove Before=swap.target dependency from swap unit
When a Scylla node starts, the scylla-image-setup.service invokes the
`scylla_swap_setup` script to provision swap. This script allocates a
swap file and creates a swap systemd unit to delegate control to
systemd. By default, systemd injects a Before=swap.target dependency
into every swap unit, allowing other services to use swap.target to wait
for swap to be enabled.

On Azure, this doesn't work so well because we store the swap file on
the ephemeral disk [1] which has network dependencies (`_netdev` mount
option, configured by cloud-init [2]). This makes the swap.target
indirectly depend on the network, leading to dependency cycles such as:

swap.target -> mnt-swapfile.swap -> mnt.mount -> network-online.target
-> network.target -> systemd-resolved.service -> tmp.mount -> swap.target

This patch breaks the cycle by removing the swap unit from swap.target
using DefaultDependencies=no. The swap unit will still be activated via
WantedBy=multi-user.target, just not during early boot.

Although this problem is specific to Azure, this patch applies the fix
to all clouds to keep the code simple.
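The resulting unit would look roughly like this (a hedged sketch; the actual unit file generated by `scylla_swap_setup` may differ in paths and options):

```ini
# mnt-swapfile.swap (illustrative path)
[Unit]
Description=Swap file provisioned by scylla_swap_setup
# Opt out of the implicit Before=swap.target dependency that systemd
# injects into swap units by default, breaking the boot-time cycle.
DefaultDependencies=no

[Swap]
What=/mnt/swapfile

[Install]
# Still activated at normal boot, just not during early boot via swap.target.
WantedBy=multi-user.target
```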

Fixes #26519.
Fixes SCYLLADB-1257

[1] https://github.com/scylladb/scylla-machine-image/pull/426
[2] https://github.com/canonical/cloud-init/pull/1213#issuecomment-1026065501

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#28504
2026-04-05 15:07:50 +03:00
Tomasz Grabiec
74542be5aa test: pylib: Ignore exceptions in wait_for()
ManagerClient::get_ready_cql() calls server_sees_others(), which waits
for servers to see each other as alive in gossip. If one of the
servers is still early in boot, a RESTful API call to
"gossiper/endpoint/live" may fail. The resulting exception currently
terminates wait_for() and propagates up, failing the test.

Fix this by ignoring errors when polling inside wait_for(). In case of
timeout, we log the last exception. This fixes the problem not
only in this case, but for all uses of wait_for().

Example output:

```
pred = <function ManagerClient.server_sees_others.<locals>._sees_min_others at 0x7f022af9a140>
deadline = 1775218828.9172852, period = 1.0, before_retry = None
backoff_factor = 1.5, max_period = 1.0, label = None

    async def wait_for(
            pred: Callable[[], Awaitable[Optional[T]]],
            deadline: float,
            period: float = 0.1,
            before_retry: Optional[Callable[[], Any]] = None,
            backoff_factor: float = 1.5,
            max_period: float = 1.0,
            label: Optional[str] = None) -> T:
        tag = label or getattr(pred, '__name__', 'unlabeled')
        start = time.time()
        retries = 0
        last_exception: Exception | None = None
        while True:
            elapsed = time.time() - start
            if time.time() >= deadline:
                timeout_msg = f"wait_for({tag}) timed out after {elapsed:.2f}s ({retries} retries)"
                if last_exception is not None:
                    timeout_msg += (
                        f"; last exception: {type(last_exception).__name__}: {last_exception}"
                    )
                    raise AssertionError(timeout_msg) from last_exception
                raise AssertionError(timeout_msg)

            try:
>               res = await pred()

test/pylib/util.py:80:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    async def _sees_min_others():
>       raise Exception("asd")
E       Exception: asd

test/pylib/manager_client.py:802: Exception

The above exception was the direct cause of the following exception:

manager = <test.pylib.manager_client.ManagerClient object at 0x7f022af7e7b0>

    @pytest.mark.asyncio
    async def test_auth_after_reset(manager: ManagerClient) -> None:
        servers = await manager.servers_add(3, config=auth_config, auto_rack_dc="dc1")
>       cql, _ = await manager.get_ready_cql(servers)

test/cluster/auth_cluster/test_auth_after_reset.py:33:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/pylib/manager_client.py:137: in get_ready_cql
    await self.servers_see_each_other(servers)
test/pylib/manager_client.py:820: in servers_see_each_other
    await asyncio.gather(*others)
test/pylib/manager_client.py:806: in server_sees_others
    await wait_for(_sees_min_others, time() + interval, period=.5)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

pred = <function ManagerClient.server_sees_others.<locals>._sees_min_others at 0x7f022af9a140>
deadline = 1775218828.9172852, period = 1.0, before_retry = None
backoff_factor = 1.5, max_period = 1.0, label = None

    async def wait_for(
            pred: Callable[[], Awaitable[Optional[T]]],
            deadline: float,
            period: float = 0.1,
            before_retry: Optional[Callable[[], Any]] = None,
            backoff_factor: float = 1.5,
            max_period: float = 1.0,
            label: Optional[str] = None) -> T:
        tag = label or getattr(pred, '__name__', 'unlabeled')
        start = time.time()
        retries = 0
        last_exception: Exception | None = None
        while True:
            elapsed = time.time() - start
            if time.time() >= deadline:
                timeout_msg = f"wait_for({tag}) timed out after {elapsed:.2f}s ({retries} retries)"
                if last_exception is not None:
                    timeout_msg += (
                        f"; last exception: {type(last_exception).__name__}: {last_exception}"
                    )
>                   raise AssertionError(timeout_msg) from last_exception
E                   AssertionError: wait_for(_sees_min_others) timed out after 45.30s (46 retries); last exception: Exception: asd

test/pylib/util.py:76: AssertionError
```

Fixes a failure observed in test_auth_after_reset:

```
manager = <test.pylib.manager_client.ManagerClient object at 0x7fb3740e1630>

    @pytest.mark.asyncio
    async def test_auth_after_reset(manager: ManagerClient) -> None:
        servers = await manager.servers_add(3, config=auth_config, auto_rack_dc="dc1")
        cql, _ = await manager.get_ready_cql(servers)
        await cql.run_async("ALTER ROLE cassandra WITH PASSWORD = 'forgotten_pwd'")

        logging.info("Stopping cluster")
        await asyncio.gather(*[manager.server_stop_gracefully(server.server_id) for server in servers])

        logging.info("Deleting sstables")
        for table in ["roles", "role_members", "role_attributes", "role_permissions"]:
            await asyncio.gather(*[manager.server_wipe_sstables(server.server_id, "system", table) for server in servers])

        logging.info("Starting cluster")
        # Don't try connect to the servers yet, with deleted superuser it will be possible only after
        # quorum is reached.
        await asyncio.gather(*[manager.server_start(server.server_id, connect_driver=False) for server in servers])

        logging.info("Waiting for CQL connection")
        await repeat_until_success(lambda: manager.driver_connect(auth_provider=PlainTextAuthProvider(username="cassandra", password="cassandra")))
>       await manager.get_ready_cql(servers)

test/cluster/auth_cluster/test_auth_after_reset.py:50:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/pylib/manager_client.py:137: in get_ready_cql
    await self.servers_see_each_other(servers)
test/pylib/manager_client.py:819: in servers_see_each_other
    await asyncio.gather(*others)
test/pylib/manager_client.py:805: in server_sees_others
    await wait_for(_sees_min_others, time() + interval, period=.5)
test/pylib/util.py:71: in wait_for
    res = await pred()
test/pylib/manager_client.py:802: in _sees_min_others
    alive_nodes = await self.api.get_alive_endpoints(server_ip)
test/pylib/rest_client.py:243: in get_alive_endpoints
    data = await self.client.get_json(f"/gossiper/endpoint/live", host=node_ip)
test/pylib/rest_client.py:99: in get_json
    ret = await self._fetch("GET", resource_uri, response_type = "json", host = host,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <test.pylib.rest_client.TCPRESTClient object at 0x7fb2404a0650>
method = 'GET', resource = '/gossiper/endpoint/live', response_type = 'json'
host = '127.15.252.8', port = 10000, params = None, json = None, timeout = None
allow_failed = False

    async def _fetch(self, method: str, resource: str, response_type: Optional[str] = None,
                     host: Optional[str] = None, port: Optional[int] = None,
                     params: Optional[Mapping[str, str]] = None,
                     json: Optional[Mapping] = None, timeout: Optional[float] = None, allow_failed: bool = False) -> Any:
        # Can raise exception. See https://docs.aiohttp.org/en/latest/web_exceptions.html
        assert method in ["GET", "POST", "PUT", "DELETE"], f"Invalid HTTP request method {method}"
        assert response_type is None or response_type in ["text", "json"], \
                f"Invalid response type requested {response_type} (expected 'text' or 'json')"
        # Build the URI
        port = port if port else self.default_port if hasattr(self, "default_port") else None
        port_str = f":{port}" if port else ""
        assert host is not None or hasattr(self, "default_host"), "_fetch: missing host for " \
                "{method} {resource}"
        host_str = host if host is not None else self.default_host
        uri = self.uri_scheme + "://" + host_str + port_str + resource
        logging.debug(f"RESTClient fetching {method} {uri}")

        client_timeout = ClientTimeout(total = timeout if timeout is not None else 300)
        async with request(method, uri,
                           connector = self.connector if hasattr(self, "connector") else None,
                           params = params, json = json, timeout = client_timeout) as resp:
            if allow_failed:
                return await resp.json()
            if resp.status != 200:
                text = await resp.text()
>               raise HTTPError(uri, resp.status, params, json, text)
E               test.pylib.rest_client.HTTPError: HTTP error 404, uri: http://127.15.252.8:10000/gossiper/endpoint/live, params: None, json: None, body:
E               {"message": "Not found", "code": 404}

test/pylib/rest_client.py:77: HTTPError
```

Fixes: SCYLLADB-1367

Closes scylladb/scylladb#29323
2026-04-05 13:52:26 +03:00
Ernest Zaslavsky
c7a74237b3 compaction_test: fix formatting after previous patches 2026-04-05 11:07:17 +03:00
Ernest Zaslavsky
101b4ad7fa compaction_test: add S3/GCS variations to tests
Add S3 and GCS variants of the compaction tests to expand coverage for
keyspaces configured to use object_storage backends.
2026-04-05 11:07:17 +03:00
Ernest Zaslavsky
03bd3010bf compaction_test: extract test_env-based tests into functions
Move all test code that relies on test_env into standalone free
functions so they can be reused by upcoming S3 and GCS test suites.
2026-04-05 11:07:17 +03:00
Ernest Zaslavsky
b18528e97e compaction_test: replace file_exists with storage::exists
Replace direct filesystem checks (file_exists) with the
storage-agnostic exists() method in unsealed_sstable_compaction,
sstable_clone_leaving_unsealed_dest_sstable, and
failure_when_adding_new_sstable tests, making them compatible
with object-storage backends (S3, GCS).
2026-04-05 11:07:17 +03:00
Ernest Zaslavsky
98492e4ea8 compaction_test: initialize tables with schema via make_table_for_tests
Start using `table_for_tests::make_default_schema` so test tables are
created with a real schema. This is required for object-storage
backends, which cannot operate correctly without proper schema
initialization.
2026-04-05 11:07:17 +03:00
Ernest Zaslavsky
5ba79e2ed4 compaction_test: use sstable APIs to manipulate component files
Switch tests to use sstable member functions for file manipulation
instead of opening files directly on the filesystem. This affects the
helpers that emulate sstable corruption: we now overwrite the entire
component file rather than just the first few kilobytes, which is
sufficient for producing a corrupted sstable.
2026-04-05 11:07:17 +03:00
Ernest Zaslavsky
405c032f48 compaction_test: fix use-after-move issue
We were moving `compaction_type_options` inside a loop, so on the
second iteration the test received an already moved-from instance.
2026-04-05 11:07:17 +03:00
Ernest Zaslavsky
437a581b04 sstable_utils: add get_storage and open_file helpers
Add a non-const `get_storage` accessor to expose underlying storage,
and an `open_file` helper to access sstable component files directly.
These are needed so compaction tests can read and write sstable
components.
2026-04-05 11:07:17 +03:00
Ernest Zaslavsky
2ad2dbae03 test_env: delay unplugging sstable registry
Unplugging the mock sstable_registry happened too early in the test
environment. During sstable destruction, components may still need
access to the registry, so the unplugging is moved to a later stage.
2026-04-05 11:07:17 +03:00
Ernest Zaslavsky
8f6630e9cd storage: add exists method to storage abstraction
Add an `exists` method to the storage abstraction to allow S3, GCS,
and local storage implementations to check whether an sstable
component is present.
2026-04-05 11:07:17 +03:00
Ernest Zaslavsky
ba785f6cab s3_client: use lowres_system_clock for aws_sigv4
Switch aws_sigv4 to lowres_system_clock since it is not affected by
time offsets often introduced in tests, which can skew db_clock. S3
requests cannot represent time shifts greater than 15 minutes from
server time, so a stable clock is required.
2026-04-05 11:07:17 +03:00
Ernest Zaslavsky
e08d779922 s3_client: add object_exists helper
Introduce `object_exists` to the S3 client to check whether an object
exists. This is primarily useful for test scenarios.
2026-04-05 11:07:16 +03:00
Ernest Zaslavsky
016b344a8a gcs_client: add object_exists helper
Introduce `object_exists` to the GCS client to check whether an object
exists. This is primarily useful for test scenarios.
2026-04-05 11:07:16 +03:00
Andrzej Jackowski
8c0920202b test: protect populate_range in row_cache_test from bad_alloc
When test_exception_safety_of_update_from_memtable was converted from
manual fail_after()/catch to with_allocation_failures() in 74db08165d,
the populate_range() call ended up inside the failure injection scope
without a scoped_critical_alloc_section guard. The other two tests
converted in the same commit (test_exception_safety_of_transitioning...
and test_exception_safety_of_partition_scan) were correctly guarded.

Without the guard, the allocation failure injector can sometimes
target an allocation point inside the cleanup path of populate_range().
In a rare corner case, this triggers a bad_alloc in a noexcept context
(reader_concurrency_semaphore::stop()), causing std::terminate.

Fixes SCYLLADB-1346

Closes scylladb/scylladb#29321
2026-04-04 21:13:26 +03:00
Andrzej Jackowski
ec274cf7b6 test: add test_upgrade_preserves_ddl_audit_for_tables
Verify that upgrading from 2025.1 to master does not silently drop DDL
auditing for table-scoped audit configurations (SCYLLADB-1155).

Test time in dev: 4s

Refs: SCYLLADB-1155
Fixes: SCYLLADB-1305
2026-04-03 13:53:28 +02:00
Andrzej Jackowski
9c7b7ac3e3 test: audit: split validate helper so callers need not pass audit_settings
The old execute_and_validate_audit_entry required every caller to
pass audit_settings so it could decide internally whether to expect
an entry. A test added later in this series needs to simply assert
an entry was produced, without specifying audit_settings at all.

Split into two methods:
- execute_and_validate_new_audit_entry: unconditionally expects an
  audit entry.
- execute_and_validate_if_category_enabled: checks audit_settings
  to decide whether to expect an entry or assert absence.

Local wrapper functions and **kwargs forwarding are removed in favor
of explicit arguments at each call site, and expected-error cases are
handled inline with assert_invalid + assert_entries_were_added.
2026-04-03 13:52:47 +02:00
Andrzej Jackowski
189bff1d5c test: audit: declare manager attribute in AuditTester base class
AuditTester uses self.manager throughout but never declares it.
The attribute is only assigned in the CQLAuditTester subclass
__init__, so the type checker reports 'Attribute "manager" is
unknown' on every self.manager reference in the base class.

Add an __init__ to AuditTester that accepts and stores the manager
instance, and update CQLAuditTester to forward it via super().__init__
instead of assigning self.manager directly.
2026-04-03 13:52:47 +02:00
Botond Dénes
2c22d69793 Merge 'Pytest: fix variable handling in GSServer (mock) and ensure docker service logs go to test log as well' from Calle Wilund
Fixes: SCYLLADB-1106

* Small fix in scylla_cluster - remove debug print
* Fix GSServer::unpublish so it does not raise an exception if publish was not called beforehand
* Improve dockerized_server so mock server logs echo to the test log to help diagnose CI failures (because we don't collect log files from mocks etc, and in any case correlation will be much easier).

No backport needed.

Closes scylladb/scylladb#29112

* github.com:scylladb/scylladb:
  dockerized_service: Convert log reader to pipes and push to test log
  test::cluster::conftest::GSServer: Fix unpublish for when publish was not called
  scylla_cluster: Use thread safe future signalling
  scylla_cluster: Remove left-over debug printout
2026-04-03 06:38:05 +03:00
Raphael S. Carvalho
b6ebbbf036 test/cluster/test_tablets2: Fix test_split_stopped_on_shutdown race with stale log messages
The test was failing because the call to:

    await log.wait_for('Stopping.*ongoing compactions')

was missing the 'from_mark=log_mark' argument. The log mark was updated
(line: log_mark = await log.mark()) immediately after detecting
'splitting_mutation_writer_switch_wait: waiting', and just before
launching the shutdown task. However, the wait_for call on the following
line was scanning from the beginning of the log, not from that mark.

As a result, the search immediately matched old 'Stopping N tasks for N
ongoing compactions for table system.X due to table removal' messages
emitted during initial server bootstrap (for system.large_partitions,
system.large_rows, system.large_cells), rather than waiting for the
shutdown to actually stop the user-table split compaction.

This caused the test to prematurely send the message to the
'splitting_mutation_writer_switch_wait' injection. The split compaction
was unblocked before the shutdown had aborted it, so it completed
successfully. Since the split succeeded, 'Failed to complete splitting
of table' was never logged.

Meanwhile, 'storage_service_drain_wait' was blocking do_drain() waiting
for a message. With the split already done, the test was stuck waiting
for the expected failure log that would never come (600s timeout). At
the same time, after 60s the 'storage_service_drain_wait' injection
timed out internally, triggering on_internal_error() which -- with
--abort-on-internal-error=1 -- crashed the server (exit code -6).

Fix: pass from_mark=log_mark to the wait_for('Stopping.*ongoing
compactions') call so it only matches messages that appear after the
shutdown has started, ensuring the test correctly synchronizes with the
shutdown aborting the user-table split compaction before releasing the
injection.
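The mark/from_mark semantics can be illustrated with a tiny self-contained model (the `Log` class below is a hypothetical stand-in, not the real test-harness API):

```python
import re

class Log:
    # Minimal model of a log follower with mark-based scanning.
    def __init__(self):
        self.lines = []

    def mark(self):
        # Remember the current end of the log.
        return len(self.lines)

    def wait_for(self, pattern, from_mark=0):
        # Scan only lines appended at or after from_mark.
        for line in self.lines[from_mark:]:
            if re.search(pattern, line):
                return line
        return None

log = Log()
log.lines.append("Stopping 3 tasks for 3 ongoing compactions (bootstrap)")
mark = log.mark()          # taken just before launching the shutdown task
log.lines.append("unrelated message")
# Without from_mark, the stale bootstrap-time message matches immediately:
stale = log.wait_for("Stopping.*ongoing compactions")
# With from_mark, only messages after the mark are considered:
fresh = log.wait_for("Stopping.*ongoing compactions", from_mark=mark)
```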

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1319.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29311
2026-04-03 06:28:51 +03:00
Andrei Chekun
6526a78334 test.py: fix nodetool mock server port collision
Replace the random port selection with an OS-assigned port. We open
a temporary TCP socket, bind it to (ip, 0) with SO_REUSEADDR, read back
the port number the OS selected, then close the socket before launching
rest_api_mock.py.
Add reuse_address=True and reuse_port=True to TCPSite in rest_api_mock.py
so the server itself can also reclaim a TIME_WAIT port if needed.
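The OS-assigned-port idiom can be sketched as follows (names are illustrative; the real test.py code may differ):

```python
import socket

def pick_free_port(ip="127.0.0.1"):
    # Ask the OS for a free port: bind to port 0, read the chosen port back,
    # then close the socket before handing the port to the mock server.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((ip, 0))
    port = s.getsockname()[1]
    s.close()
    return port
```

There is an inherently small window between closing the socket and the mock server rebinding the port, which is why rest_api_mock.py also sets reuse_address/reuse_port on its TCPSite.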

Fixes: SCYLLADB-1275

Closes scylladb/scylladb#29314
2026-04-02 16:24:07 +02:00
Botond Dénes
eb78498e07 test: fix flaky test_timeout_is_applied_on_lookup by using eventually_true
On slow/overloaded CI machines the lowres_clock timer may not have
fired after the fixed 2x sleep, causing the assertion on
get_abort_exception() to fail. Replace the fixed sleep with
sleep(1x) + eventually_true() which retries with exponential backoff,
matching the pattern already used in test_time_based_cache_eviction.

Fixes: SCYLLADB-1311

Closes scylladb/scylladb#29299
2026-04-01 18:20:11 +03:00
Marcin Maliszkiewicz
a74665b300 transport: add per-service-level pending response memory metric
Track the total memory consumed by responses waiting to be
written to the socket, exposed as a per-scheduling-group gauge
(cql_pending_response_memory). This complements the response
memory accounting added in the previous commits by giving
visibility into how much memory each service level is holding
in unsent response buffers.
2026-04-01 17:15:28 +02:00
Robert Bindar
e7527392c4 test: close clients if cluster teardown throws
Make sure the driver is stopped even if cluster teardown throws,
to avoid stale driver connections entering infinite reconnect
loops that exhaust CPU resources.

Fixes: SCYLLADB-1189

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#29230
2026-04-01 17:22:19 +03:00
Tomasz Grabiec
2ec47a8a21 tests: address_map_test: Fix flakiness in debug mode due to task reordering
Debug mode shuffles task position in the queue. So the following is possible:
 1) shard 1 calls manual_clock::advance(). This expires timers on shard 1 and queues a background smp call to shard 0 which will expire timers there
 2) the smp::submit_to(0, ...) from shard 1 called by the test submits the call
 3) shard 0 creates tasks for both calls, but (2) is run first, and preempts the reactor
 4) shard 1 sees the completion, completes m_svc.invoke_on(1, ..)
 5) shard 0 inserts the completion from (4) before task from (1)
 6) the check on shard 0: m.find(id1) fails because the timer is not expired yet

To fix that, wait for timer expiration on shard 0, so that the test
doesn't depend on task execution order.

Note: I was not able to reproduce the problem locally using test.py --mode
debug --repeat 1000.

It happens in jenkins very rarely. Which is expected as the scenario which
leads to this is quite unlikely.

Fixes SCYLLADB-1265

Closes scylladb/scylladb#29290
2026-04-01 17:17:35 +03:00
Aleksandra Martyniuk
4d4ce074bb test: node_ops_tasks_tree: reconnect driver after topology changes
The test exercises all five node operations (bootstrap, replace, rebuild,
removenode, decommission) and by the end only one node out of four
remains alive. The CQL driver session, however, still holds stale
references to the dead hosts in its connection pool and load-balancing
policy state.

When the new_test_keyspace context manager exits and attempts
DROP KEYSPACE, the driver routes the query to the dead hosts first,
gets ConnectionShutdown from each, and throws NoHostAvailable before
ever trying the single live node.

Fix by calling driver_connect() after the decommission step, which
closes the old session and creates a fresh one connected only to the
servers the test manager reports as running.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1313.

Closes scylladb/scylladb#29306
2026-04-01 17:13:11 +03:00
Dario Mirovic
85127fded8 test: boost: test null data value to_parsable_string
Add tests for null value in data_type::to_parsable_string().
We now explicitly return "null".

Refs SCYLLADB-1350
2026-04-01 14:15:25 +02:00
Dario Mirovic
fc705dfb4b cql3: fix null handling in data_value formatting
data_value::to_parsable_string() crashed with a null pointer
dereference when called on a null data_value. Return "null" instead.

Fixes SCYLLADB-1350
2026-04-01 14:15:18 +02:00
Andrzej Jackowski
cccb014747 test: ldap: add regression test for double-free on unregistered message ID
Sends a search via the raw LDAP handle (bypassing _msgid_to_promise
registration), then triggers poll_results() through the public API
to exercise the unregistered-ID branch.

Refs: SCYLLADB-1344
2026-04-01 12:57:50 +02:00
Botond Dénes
0351756b15 Merge 'test: fix fuzzy_test timeout in release mode' from Piotr Smaron
The multishard_query_test/fuzzy_test was timing out (SIGKILL after
15 minutes) in release mode CI.

In release mode the test generates up to 64 partitions with up to
1000 clustering rows and 1000 range tombstones each.  With deeply
nested randomly-generated types (e.g. frozen<map<varint,
frozen<map<frozen<tuple<...>>>>>>), this volume of data can exceed
the 15-minute CI timeout.

Reduce the release-mode clustering-row and range-tombstone
distributions from 0-1000 to 0-200.  This caps the worst case at
~12,800 rows -- still 2x the devel-mode maximum (0-100) and
sufficient to exercise multi-partition paged scanning with many
pages.

Fixes: SCYLLADB-1270

No need to backport for now, only appeared on master.

Closes scylladb/scylladb#29293

* github.com:scylladb/scylladb:
  test: clean up fuzzy_test_config and add comments
  test: fix fuzzy_test timeout in release mode
2026-04-01 11:50:15 +03:00
Andrzej Jackowski
f0028c06dc ldap: fix double-free of LDAPMessage in poll_results()
In the unregistered-ID branch, ldap_msgfree() was called on a result
already owned by an RAII ldap_msg_ptr, causing a double-free on scope
exit. Remove the redundant manual free.

Fixes: SCYLLADB-1344
2026-04-01 10:35:13 +02:00
Andrei Chekun
18f41dcd71 test.py: introduce new scheduler for choosing job count
This commit improves how test.py chooses the default number of
parallel jobs.
It keeps the logic of deriving the job count from memory and CPU limits,
but simplifies the heuristic so it is smoother and easier to reason about.
This avoids discontinuities such as neighboring machine sizes producing
unexpectedly different job counts, and behaves more predictably on asymmetric
machines where CPU and RAM do not scale together.

Compared to the current threshold-based version, this approach:
- avoids hard jumps around memory cutoffs
- avoids bucketed debug scaling based on CPU count
- keeps CPU and memory as separate constraints and combines them in one place
- avoids double-penalizing debug mode
- is easier to tune later by adjusting a few constants instead of rewriting branching logic
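A min-of-constraints heuristic along these lines could look like this (all constants and names here are hypothetical; the commit does not specify the actual formula):

```python
import math

def default_job_count(ncpus, ram_gb, cpus_per_job=1.0, ram_gb_per_job=2.0):
    # Treat CPU and memory as independent constraints and combine them in
    # one place by taking the minimum. No thresholds or buckets, so
    # neighboring machine sizes produce smoothly varying job counts.
    by_cpu = ncpus / cpus_per_job
    by_ram = ram_gb / ram_gb_per_job
    return max(1, math.floor(min(by_cpu, by_ram)))
```

Tuning then means adjusting `cpus_per_job` / `ram_gb_per_job` (e.g. with larger values for debug mode) rather than rewriting branching logic.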

Closes scylladb/scylladb#28904
2026-04-01 11:11:15 +03:00
Avi Kivity
d438e35cdd test/cluster: fix race in test_insert_failure_standalone audit log query
get_audit_partitions_for_operation() returns None when no audit log
rows are found. In _test_insert_failure_doesnt_report_success_assign_nodes,
this None is passed to set(), causing TypeError: 'NoneType' object is
not iterable.

The audit log entry may not yet be visible immediately after executing
the INSERT, so use wait_for() from test.pylib.util with exponential
backoff to poll until the entry appears. Import it as wait_for_async
to avoid shadowing the existing wait_for from test.cluster.dtest.dtest_class,
which has a different signature (timeout vs deadline).
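The aliased polling described above can be modeled with a self-contained sketch (the helper below is illustrative; in the real tree the import would be `from test.pylib.util import wait_for as wait_for_async`, and the stand-in predicate mimics get_audit_partitions_for_operation() returning None until rows appear):

```python
import asyncio
import time

async def wait_for_async(pred, deadline, period=0.1):
    # Poll the async predicate until it yields a non-None value or the
    # deadline passes (deadline-based, unlike the timeout-based wait_for).
    while True:
        res = await pred()
        if res is not None:
            return res
        if time.time() >= deadline:
            raise AssertionError("wait_for timed out")
        await asyncio.sleep(period)

async def main():
    rows = []
    async def audit_rows():
        # Stand-in for the audit-log query: None until the entry is visible.
        rows.append(1)
        return rows if len(rows) >= 2 else None
    return await wait_for_async(audit_rows, time.time() + 5, period=0.01)

result = asyncio.run(main())
```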

Fixes SCYLLADB-1330

Closes scylladb/scylladb#29289
2026-04-01 10:59:02 +03:00
Botond Dénes
2d2ff4fbda sstables: use chunked_managed_vector for promoted indexes in partition_index_page
Switch _promoted_indexes storage in partition_index_page from
managed_vector to chunked_managed_vector to avoid large contiguous
allocations.

Avoid allocation failure (or crashes with --abort-on-internal-error)
when large partitions have enough promoted index entries to trigger a
large allocation with managed_vector.

Fixes: SCYLLADB-1315

Closes scylladb/scylladb#29283
2026-03-31 18:43:57 +03:00
Piotr Smaron
2ce409dca0 test: clean up fuzzy_test_config and add comments
Remove the unused timeout field from fuzzy_test_config.  It was
declared, initialized per build mode, and logged, but never actually
enforced anywhere.

Document the intentionally small max_size (1024 bytes) passed to
read_partitions_with_paged_scan in run_fuzzy_test_scan: it forces
many pages per scan to stress the paging and result-merging logic.
2026-03-31 17:13:26 +02:00
Piotr Smaron
df2924b2a3 test: fix fuzzy_test timeout in release mode
The multishard_query_test/fuzzy_test was timing out (SIGKILL after
15 minutes) in release mode CI.

In release mode the test generates up to 64 partitions with up to
1000 clustering rows and 1000 range tombstones each.  With deeply
nested randomly-generated types (e.g. frozen<map<varint,
frozen<map<frozen<tuple<...>>>>>>), this volume of data can exceed
the 15-minute CI timeout.

Reduce the release-mode clustering-row and range-tombstone
distributions from 0-1000 to 0-200.  This caps the worst case at
~12,800 rows -- still 2x the devel-mode maximum (0-100) and
sufficient to exercise multi-partition paged scanning with many
pages.

Fixes: SCYLLADB-1270
2026-03-31 17:13:06 +02:00
Piotr Szymaniak
6d8ec8a0c0 alternator: fix flaky test_update_condition_unused_entries_short_circuit
The test was flaky because it stopped dc2_node immediately after an
LWT write, before cross-DC replication could complete. The LWT commit
uses LOCAL_QUORUM, which only guarantees persistence in the
coordinator's DC. Replication to the remote DC is async background
work, and CAS mutations don't store hints. Stopping dc2_node could
drop in-flight RPCs, leaving DC1 without the mutation.

Fix by polling both live DC1 nodes after the write to confirm
cross-DC replication completed before stopping dc2_node. Both nodes
must have the data so that the later ConsistentRead=True
(LOCAL_QUORUM) read on restarted node1 is guaranteed to succeed.

Fixes SCYLLADB-1267

Closes scylladb/scylladb#29287
2026-03-31 16:50:51 +03:00
Dawid Mędrek
f040f1b703 Merge 'raft: remake the read barrier optimization' from Patryk Jędrzejczak
The approach taken in 1ae2ae50a6 turned
out to be incorrect. The Raft member requesting a read barrier could
incorrectly advance its commit_idx and break linearizability. We revert that
commit in this PR.

We also remake the read barrier optimization with a completely new approach.
We make the leader replicate to the non-voting requester of a read barrier if
its `commit_idx` is behind.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-998

No backport: the issue is present only in master.

Closes scylladb/scylladb#29216

* github.com:scylladb/scylladb:
  raft: speed up read barrier requested by non-voters
  Revert "raft: read_barrier: update local commit_idx to read_idx when it's safe"
2026-03-31 15:11:56 +02:00
Marcin Maliszkiewicz
a26ca0f5f7 transport: hold memory permit until response write completes
Capture the memory permit in the leave lambda's .finally()
continuation so that the semaphore units are kept alive until
write_response finishes, preventing premature release of
memory accounting.

This is especially important with slow networks and big responses,
when buffers can accumulate and deplete the node's memory.
2026-03-31 14:05:00 +02:00
Avi Kivity
216d39883a Merge 'test: audit: fix audit test syslog race' from Dario Mirovic
Fix two independent race conditions in the syslog audit test that cause intermittent `assert 2 <= 1` failures in `assert_entries_were_added`.

**Datagram ordering race:**
`UnixSockerListener` used `ThreadingUnixDatagramServer`, where each datagram spawns a new thread. The notification barrier in `get_lines()` assumes FIFO handling, but the notification thread can win the lock before an audit entry thread, so `clear_audit_logs()` misses entries that arrive moments later. Fix: switch to sequential `UnixDatagramServer`.

**Config reload race:**
The live-update path used `wait_for_config` (REST API poll on shard 0) which can return before `broadcast_to_all_shards()` completes. Fix: wait for `"completed re-reading configuration file"` in the server log after each SIGHUP, which guarantees all shards have the new config.

Fixes SCYLLADB-1277

This is CI improvement for the latest code. No need for backport.

Closes scylladb/scylladb#29282

* github.com:scylladb/scylladb:
  test: cluster: wait for full config reload in audit live-update path
  test: cluster: fix syslog listener datagram ordering race
2026-03-31 13:53:01 +03:00
Tomasz Grabiec
b355bb70c2 dtest/alternator: stop concurrent-requests test when workers hit limit
`test_limit_concurrent_requests` could create far more tables than intended
because worker threads looped indefinitely and only the probe path terminated
the test. In practice, workers often hit `RequestLimitExceeded` first, but the
test kept running and creating tables, increasing memory pressure and causing
flakiness due to bad_alloc errors in logs.

Fix by replacing the old probe-driven termination with worker-driven
termination. Workers now run until any worker sees
`RequestLimitExceeded`.
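The worker-driven termination can be sketched with a shared stop event (a minimal model; `fake_request` and the exception class stand in for the real Alternator client calls):

```python
import threading

class RequestLimitExceeded(Exception):
    """Stand-in for the Alternator error the workers watch for."""

def run_workers(do_request, num_workers=6):
    """Run workers until any one of them observes RequestLimitExceeded."""
    stop = threading.Event()
    def worker():
        while not stop.is_set():
            try:
                do_request()
            except RequestLimitExceeded:
                stop.set()  # any worker hitting the limit stops all of them
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stop.is_set()

# Simulated server that rejects every 5th request.
count = 0
lock = threading.Lock()
def fake_request():
    global count
    with lock:
        count += 1
        if count % 5 == 0:
            raise RequestLimitExceeded()

print(run_workers(fake_request))
```

Once any worker sets the event, the remaining workers exit on their next loop check, so no further tables are created after the limit is first observed.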

Fixes SCYLLADB-1181

Closes scylladb/scylladb#29270
2026-03-31 13:35:50 +03:00
Patryk Jędrzejczak
b9f82f6f23 raft_group0: join_group0: fix join hang when node joins group 0 before post_server_start
A joining node hung forever if the topology coordinator added it to the
group 0 configuration before the node reached `post_server_start`. In
that case, `server->get_configuration().contains(my_id)` returned true
and the node broke out of the join loop early, skipping
`post_server_start`. `_join_node_group0_started` was therefore never set,
so the node's `join_node_response` RPC handler blocked indefinitely.
Meanwhile the topology coordinator's `respond_to_joining_node` call
(which has no timeout) hung forever waiting for the reply that never came.

Fix by only taking the early-break path when not starting as a follower
(i.e. when the node is the discovery leader or is restarting). A joining
node must always reach `post_server_start`.

We also provide a regression test. It takes 6s in dev mode.

Fixes SCYLLADB-959

Closes scylladb/scylladb#29266
2026-03-31 12:33:56 +02:00
Marcin Maliszkiewicz
2645b95888 transport: account for response size exceeding initial memory estimate
After obtaining the CQL response, check if its actual size exceeds
the initially acquired memory permit. If so, take semaphore units
and adopt them into the permit (non-blocking).

This doesn't fully prevent allocating too much memory, as the size
is only known once the buffer is already allocated, but it improves
memory accounting for big responses.
2026-03-31 11:57:41 +02:00
Dario Mirovic
0cb63fb669 test: cluster: wait for full config reload in audit live-update path
_apply_config_to_running_servers used wait_for_config (REST API poll)
to confirm live config updates. The REST API reads from shard 0 only,
so it can return before broadcast_to_all_shards() completes — other
shards may still have stale audit config, generating unexpected entries.
Additionally, server_remove_config_option for absent keys sent separate
SIGHUPs before server_update_config, and the single wait_for_config at
the end could match a completion from an earlier SIGHUP.

Wait for "completed re-reading configuration file" in the server log
after each SIGHUP-producing operation. This message is logged only
after both read_config() and broadcast_to_all_shards() finish,
guaranteeing all shards have the new config. Each operation gets its
own mark+wait so no stale completion is matched.
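The per-operation mark+wait idea can be sketched over an append-only log (hypothetical `LogWatcher` helper; the real implementation reads the server log file):

```python
class LogWatcher:
    """Append-only log with a mark+check primitive: each check only
    matches lines written after its own mark, so a completion message
    from an earlier SIGHUP can never satisfy a later wait."""
    def __init__(self):
        self._lines = []
    def append(self, line):
        self._lines.append(line)
    def mark(self):
        return len(self._lines)
    def seen_after(self, mark, needle):
        return any(needle in line for line in self._lines[mark:])

log = LogWatcher()
DONE = "completed re-reading configuration file"

log.append(DONE)                 # completion from an earlier SIGHUP
m = log.mark()                   # mark taken before our own SIGHUP
stale = log.seen_after(m, DONE)  # False: the old message is not matched
log.append(DONE)                 # our SIGHUP's completion arrives
fresh = log.seen_after(m, DONE)  # True: only the new message matches
print(stale, fresh)
```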

Fixes SCYLLADB-1277
2026-03-31 02:27:11 +02:00
Dario Mirovic
1d623196eb test: cluster: fix syslog listener datagram ordering race
UnixSockerListener used ThreadingUnixDatagramServer, which spawns a
new thread per datagram. The notification barrier in get_lines() relies
on all prior datagrams being handled before the notification. With
threading, the notification handler can win the lock before an audit
entry handler, so get_lines() returns before the entry is appended.
clear_audit_logs() then clears an incomplete buffer, and the late
entry leaks into the next test's before/after diff.

Switch to sequential UnixDatagramServer. The server thread now handles
datagrams in kernel FIFO order, so the notification is always processed
after all preceding audit entries.
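The sequential-server fix can be demonstrated with Python's `socketserver` (a self-contained sketch, not the actual test harness):

```python
import os
import socket
import socketserver
import tempfile
import threading

received = []
done = threading.Event()

class Handler(socketserver.BaseRequestHandler):
    def handle(self):
        # For datagram servers, self.request is (data, socket).
        data = self.request[0].decode()
        if data == "NOTIFY":
            done.set()  # barrier: every earlier datagram is already handled
        else:
            received.append(data)

path = os.path.join(tempfile.mkdtemp(), "audit.sock")
# Sequential server: datagrams are handled one at a time in kernel FIFO
# order, so the notification runs only after all preceding audit entries.
server = socketserver.UnixDatagramServer(path, Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

client = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
for msg in ("entry-1", "entry-2", "NOTIFY"):
    client.sendto(msg.encode(), path)
done.wait(timeout=5)
server.shutdown()
print(received)
```

With `ThreadingUnixDatagramServer` in place of `UnixDatagramServer`, each datagram would be handled on its own thread and the NOTIFY handler could run before the entry handlers finish, which is the race being fixed.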

Refs SCYLLADB-1277
2026-03-31 02:27:11 +02:00
Karol Nowacki
493a4433e7 index: fix DESC INDEX for vector index
The `DESC INDEX` command returned incorrect results for local vector
indexes and for vector indexes that included filtering columns.

This patch corrects the implementation to ensure `DESCRIBE INDEX`
accurately reflects the index configuration.

This was a pre-existing issue, not a regression from recent
serialization schema changes for vector index target options.
2026-03-30 16:46:48 +02:00
Karol Nowacki
a32e4bb9f4 vector_search: test: refactor boilerplate setup
The test boilerplate setup for some vector store client tests
has been extracted to a common function.
2026-03-30 16:46:48 +02:00
Karol Nowacki
6bc88e817f vector_search: fix SELECT on local vector index
Queries against local vector indexes were failing with the error:
"ANN ordering by vector requires the column to be indexed using 'vector_index'"

This was a regression introduced by 15788c3734, which incorrectly
assumed the first column in the targets list is always the vector column.
For local vector indexes, the first column is the partition key, causing
the failure.

Previously, serialization logic for the target index option was shared
between vector and secondary indexes. This is no longer viable due to
the introduction of local vector indexes and vector indexes with filtering
columns, which have a different target format.

This commit introduces a dedicated JSON-based serialization format for
vector index targets, identifying the target column (tc), filtering
columns (fc), and partition key columns (pk). This ensures unambiguous
serialization and deserialization for all vector index types.

This change is backward compatible for regular vector indexes. However,
it breaks compatibility for local vector indexes and vector indexes with
filtering columns created in version 2026.1.0. To mitigate this, usage
of these specific index types will be blocked in the 2026.1.0 release
by failing ANN queries against them in vector-store service.

Fixes: SCYLLADB-895
2026-03-30 16:46:48 +02:00
Karol Nowacki
c0b78477a5 index: test: vector index target option serialization test
This test ensures that the serialization format for vector index target
options remains stable. Maintaining backward compatibility is critical
because the index is restored from this property on startup.
Any unintended changes to the serialization schema could break existing
indexes after an upgrade.

This option is also an interface for the vector-store service,
which uses it to identify the indexed column.
2026-03-30 16:46:48 +02:00
Karol Nowacki
4dc28dfa52 index: test: secondary index target option serialization test
Target option serialization must remain stable for backward compatibility.
The index is restored from this property on startup, so unintentional
changes to the serialization schema can break indexes after upgrade.
2026-03-30 16:46:47 +02:00
Patryk Jędrzejczak
ba54b2272b raft: speed up read barrier requested by non-voters
We achieve this by making the leader replicate to the non-voting requester
of a read barrier if its commit_idx is behind.

There are some corner cases where the new `replicate_to(*opt_progress, true);`
call will be a no-op, while the corresponding call in `tick_leader()` would
result in sending the AppendEntries RPC to the follower. These cases are:
- `progress.state == follower_progress::state::PROBE && progress.probe_sent`,
- `progress.state == follower_progress::state::PIPELINE
  && progress.in_flight == follower_progress::max_in_flight`.
We could try to improve the optimization by including some of the cases above,
but it would only complicate the code without noticeable benefits (at least
for group0).

Note: this is the second attempt for this optimization. The first approach
turned out to be incorrect and was reverted in the previous commit. The
performance improvement is the same as in the previous case.
2026-03-30 15:56:24 +02:00
Patryk Jędrzejczak
4913acd742 Revert "raft: read_barrier: update local commit_idx to read_idx when it's safe"
This reverts commit 1ae2ae50a6.

The reverted change turned out to be incorrect. The Raft member requesting
a read barrier could incorrectly advance its commit_idx and break
linearizability. More details in
https://scylladb.atlassian.net/browse/SCYLLADB-998?focusedCommentId=42935
2026-03-30 15:56:24 +02:00
Andrzej Jackowski
ab43420d30 test: use exclusive driver connection in test_limited_concurrency_of_writes
Use get_cql_exclusive(node1) so the driver only connects to node1 and
never attempts to contact the stopped node2. The test was flaky because
the driver received `Host has been marked down or removed` from node2.

Fixes: SCYLLADB-1227

Closes scylladb/scylladb#29268
2026-03-30 11:50:44 +02:00
Botond Dénes
068a7894aa test/cluster: fix flaky test_cleanup_stop by using asyncio.sleep
The test was using time.sleep(1) (a blocking call) to wait after
scheduling the stop_compaction task, intending to let it register on
the server before releasing the sstable_cleanup_wait injection point.

However, time.sleep() blocks the asyncio event loop entirely, so the
asyncio.create_task(stop_compaction) task never gets to run during the
sleep. After the sleep, the directly-awaited message_injection() runs
first, releasing the injection point before stop_compaction is even
sent. By the time stop_compaction reaches Scylla, the cleanup has
already completed successfully -- no exception is raised and the test
fails.

Fix by replacing time.sleep(1) with await asyncio.sleep(1), which
yields control to the event loop and allows the stop_compaction task
to actually send its HTTP request before message_injection is called.
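The event-loop behavior behind the fix can be shown in isolation (a standalone demonstration, not the test's code):

```python
import asyncio
import time

async def background_task(log):
    log.append("task ran")

async def with_blocking_sleep():
    log = []
    task = asyncio.create_task(background_task(log))
    time.sleep(0.1)          # blocks the event loop: the task cannot run
    snapshot = list(log)     # still empty here
    await task               # yield to the loop so the task finally runs
    return snapshot

async def with_async_sleep():
    log = []
    task = asyncio.create_task(background_task(log))
    await asyncio.sleep(0.1) # yields to the loop: the task runs meanwhile
    snapshot = list(log)
    await task
    return snapshot

blocked = asyncio.run(with_blocking_sleep())
yielded = asyncio.run(with_async_sleep())
print(blocked, yielded)
```

The scheduled task only makes progress when the coroutine yields to the loop, which `time.sleep()` never does.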

Fixes: SCYLLADB-834

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29202
2026-03-30 11:40:47 +03:00
Nikos Dragazis
3b3b02b15a docs: Add ops guide for vnodes-to-tablets migration
The vnodes-to-tablets migration is a manual procedure, so instructions
need to be provided to the users.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-29 22:18:46 +03:00
Ernest Zaslavsky
1d779804a0 scripts: remove lua library rename workaround from comparison script
Now that cmake/FindLua.cmake uses pkg-config (matching configure.py),
both build systems resolve to the same 'lua' library name.  Remove the
lua/lua-5.4 entries from _KNOWN_LIB_ASYMMETRIES and add 'm' (math
library) as a known transitive dependency that configure.py gets via
pkg-config for lua.
2026-03-29 16:17:45 +03:00
Ernest Zaslavsky
c32851b102 cmake: add custom FindLua using pkg-config to match configure.py
CMake's built-in FindLua resolves to the versioned library file
(e.g. liblua-5.4.so) instead of the unversioned symlink (liblua.so),
causing a library name mismatch between the two build systems.
Add a custom cmake/FindLua.cmake that uses pkg-config — matching
configure.py's approach — and find_library(NAMES lua) to find the
unversioned symlink.  This also mirrors the pattern used by other
Find modules in cmake/ (FindxxHash, Findlz4, etc.).
2026-03-29 16:17:45 +03:00
Ernest Zaslavsky
f3a91df0b4 test/cmake: add missing tests to boost test suite
Add symmetric_key_test (standalone, links encryption library) and
auth_cache_test to the combined_tests binary. These tests already
exist in configure.py; this aligns the CMake build.
2026-03-29 16:17:45 +03:00
Ernest Zaslavsky
de606cc17a test/cmake: remove per-test LTO disable
The per-test -fno-lto link option is now redundant since -fno-lto
was added globally in mode.common.cmake. LTO-enabled targets
(the scylla binary in RelWithDebInfo) override it via enable_lto().
2026-03-29 16:17:45 +03:00
Ernest Zaslavsky
38ba58567a cmake: add BOOST_ALL_DYN_LINK and strip per-component defines
Match configure.py's Boost handling:
- Add BOOST_ALL_DYN_LINK when using shared Boost libraries.
- Strip per-component defines (BOOST_UNIT_TEST_FRAMEWORK_DYN_LINK,
  BOOST_REGEX_DYN_LINK, etc.) that CMake's Boost package config
  adds on imported targets. configure.py only uses the umbrella
  BOOST_ALL_DYN_LINK define.
2026-03-29 16:17:45 +03:00
Ernest Zaslavsky
7e72898150 cmake: move SEASTAR_TESTING_MAIN after seastar and abseil subdirs
Place add_compile_definitions(SEASTAR_TESTING_MAIN) after both
add_subdirectory(seastar) and add_subdirectory(abseil) are processed.
This matches configure.py's global define without leaking into
seastar's subdirectory build (which would cause a duplicate main
symbol in seastar_testing).
Remove the now-redundant per-test SEASTAR_TESTING_MAIN compile
definition from test/CMakeLists.txt.
2026-03-29 16:17:45 +03:00
Ernest Zaslavsky
b0837ead3e cmake: add -fno-sanitize=vptr for abseil sanitizer flags
Match configure.py line 2192: abseil gets sanitizer flags with
-fno-sanitize=vptr to exclude vptr checks which are incompatible
with abseil's usage of type-punning patterns.
2026-03-29 16:17:45 +03:00
Ernest Zaslavsky
dd829fa69c cmake: align Seastar build configuration with configure.py
- Set BUILD_SHARED_LIBS based on build type to match configure.py's
  build_seastar_shared_libs: Debug and Dev build Seastar as a shared
  library, all other modes build it static.
- Add sanitizer link options on the seastar target for Coverage
  mode. Seastar's CMake only activates sanitizer targets for
  Debug/Sanitize configs, but Coverage mode needs them too since
  configure.py's seastar_libs_coverage carries -fsanitize flags.
2026-03-29 16:17:45 +03:00
Ernest Zaslavsky
52e4d44a75 cmake: align global compile defines and options with configure.py
- Disable CMake's automatic -fcolor-diagnostics injection for
  Clang+Ninja (CMake 3.24+), matching configure.py which does not
  add any color diagnostics flags.
- Add SEASTAR_NO_EXCEPTION_HACK and XXH_PRIVATE_API as global
  defines (previously SEASTAR_NO_EXCEPTION_HACK was only on the
  seastar target as PRIVATE; it needs to be project-wide).
- Add -fpch-validate-input-files-content to check precompiled
  header content when timestamps don't match.
2026-03-29 16:17:44 +03:00
Ernest Zaslavsky
6f2fe3c2fc cmake: fix Coverage mode in mode.Coverage.cmake
Fix multiple deviations from configure.py's coverage mode:
- Remove -fprofile-list from CMAKE_CXX_FLAGS_COVERAGE. That flag
  belongs in COVERAGE_INST_FLAGS applied to other modes, not to
  coverage mode itself.
- Replace incorrect defines (DEBUG, SANITIZE, DEBUG_LSA_SANITIZER,
  SCYLLA_ENABLE_ERROR_INJECTION) with the correct Seastar debug
  defines (SEASTAR_DEBUG, SEASTAR_DEFAULT_ALLOCATOR, etc.) that
  configure.py's pkg-config query produces for coverage mode.
- Add sanitizer and stack-clash-protection compile flags for
  Coverage config, matching the flags that Seastar's pkg-config
  --cflags output includes for debug builds.
- Change CMAKE_STATIC_LINKER_FLAGS_COVERAGE to
  CMAKE_EXE_LINKER_FLAGS_COVERAGE. Coverage flags need to reach
  the executable linker, not the static archiver.
2026-03-29 16:17:44 +03:00
Ernest Zaslavsky
7d23ba7dc8 cmake: align mode.common.cmake flags with configure.py
Add three flag-alignment changes:
- -Wno-error=stack-usage= alongside the stack-usage threshold flag,
  preventing hard errors from stack-usage warnings (matching
  configure.py behavior).
- -fno-lto global link option. configure.py adds -fno-lto to all
  binaries; LTO-enabled targets override it via enable_lto().
- Sanitizer link flags (-fsanitize=address, -fsanitize=undefined) for
  Debug/Sanitize configs, matching configure.py's cxx_ld_flags.
2026-03-29 16:17:44 +03:00
Ernest Zaslavsky
38088a8a94 configure.py: add sstable_tablet_streaming to combined_tests 2026-03-29 16:17:44 +03:00
Ernest Zaslavsky
33bca2428a docs: add compare-build-systems.md
Document the purpose, usage, and examples for
scripts/compare_build_systems.py which compares the configure.py
and CMake build systems by parsing their ninja build files.
2026-03-29 16:17:44 +03:00
Ernest Zaslavsky
d3972369a0 scripts: add compare_build_systems.py to compare ninja build files
Add a script that compares configure.py and CMake build systems by
parsing their generated build.ninja files. The script checks:
  - Per-file compilation flags (defines, warnings, optimization)
  - Link target sets (detect missing/extra targets)
  - Per-target linker flags and libraries

configure.py is treated as the baseline. CMake should match it.
Both systems are always configured into a temporary directory so the
user's build tree is never touched.

Usage:
  scripts/compare_build_systems.py -m dev   # single mode
  scripts/compare_build_systems.py          # all modes
  scripts/compare_build_systems.py --ci     # CI mode (strict)
2026-03-29 16:17:44 +03:00
Nadav Har'El
d32fe72252 Merge 'alternator: check concurrency limit before memory acquisition' from Łukasz Paszkowski
Fix the ordering of the concurrency limit check in the Alternator HTTP server so it happens before memory acquisition, and reduce test pressure to avoid LSA exhaustion on the memory-constrained test node.

The patch moves the concurrency check to right after the content-length early-out, before any memory acquisition or I/O. The check was originally placed before memory acquisition but was inadvertently moved after it during a refactoring. This allowed unlimited requests to pile up consuming memory, reading bodies, verifying signatures, and decompressing — all before being rejected. Restores the original ordering and mirrors the CQL transport (`transport/server.cc`).

Lowers `concurrent_requests_limit` from 5 to 3 and the thread multiplier from 5 to 2 (6 threads instead of 25). This is still sufficient to reliably trigger RequestLimitExceeded, while keeping flush pressure within what 512MB per shard can sustain.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1248
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1181

The test started to fail quite recently. It affects master only. No backport is needed. We might want to consider backporting a commit moving the concurrency check earlier.

Closes scylladb/scylladb#29272

* github.com:scylladb/scylladb:
  test: reduce concurrent-request-limit test pressure to avoid LSA exhaustion
  alternator: check concurrency limit before memory acquisition
2026-03-29 11:08:28 +03:00
Łukasz Paszkowski
b8e3ef0c64 test: reduce concurrent-request-limit test pressure to avoid LSA exhaustion
The test_limit_concurrent_requests dtest uses concurrent CreateTable
requests to verify Alternator's concurrency limiting.  Each admitted
CreateTable triggers Raft consensus, schema mutations, and memtable
flushes—all of which consume LSA memory.  On the 1 GB test node
(2 SMP × 512 MB), the original settings (limit=5, 25 threads) created
enough flush pressure to exhaust the LSA emergency reserve, producing
logalloc::bad_alloc errors in the node log.  The test was always
marginal under these settings and became flaky as new system tables
increased baseline LSA usage over time.

Lower concurrent_requests_limit from 5 to 3 and the thread multiplier
from 5 to 2 (6 threads total).  This is still well above the limit and
sufficient to reliably trigger RequestLimitExceeded, while keeping flush
pressure within what 512 MB per shard can sustain.
2026-03-28 20:40:33 +01:00
Łukasz Paszkowski
a86928caa1 alternator: check concurrency limit before memory acquisition
The concurrency limit check in the Alternator server was positioned after
memory acquisition (get_units), request body reading (read_entire_stream),
signature verification, and decompression. This allowed unlimited requests
to pile up consuming memory before being rejected, exhausting LSA memory
and causing logalloc::bad_alloc errors that cascade into Raft applier
and topology coordinator failures, breaking subsequent operations.

Without this fix, test_limit_concurrent_requests on a 1GB node produces
50 logalloc::bad_alloc errors and cascading failures: reads from
system.scylla_local fail, the Raft applier fiber stops, the topology
coordinator stops, and all subsequent CreateTable operations fail with
InternalServerError (500). With this fix, the cascade is eliminated --
admitted requests may still cause LSA pressure on a memory-constrained
node, but the server remains functional.

Move the concurrency check to right after the content-length early-out,
before any memory acquisition or I/O. This mirrors the CQL transport
which correctly checks concurrency before memory acquisition
(transport/server.cc).

The concurrency check was originally added in 1b8c946ad7 (Sep 2020)
*before* memory acquisition, which at the time lived inside with_gate
(after the concurrency gate). The ordering was inverted by f41dac2a3a
(Mar 2021, "avoid large contiguous allocation for request body"), which
moved get_units() earlier in the function to reserve memory before
reading the newly-introduced content stream -- but inadvertently also
moved it before the concurrency check. c3593462a4 (Mar 2025) further
worsened the situation by adding a 16MB fallback reservation for
requests without Content-Length and ungzip/deflate decompression steps
-- all before the concurrency check -- greatly increasing the memory
consumed by requests that would ultimately be rejected.
2026-03-28 20:40:33 +01:00
Emil Maskovsky
9dad68e58d raft: abort stale snapshot transfers when term changes
**The Bug**

Assertion failure: `SCYLLA_ASSERT(res.second)` in `raft/server.cc`
when creating a snapshot transfer for a destination that already had a
stale in-flight transfer.

**Root Cause**

If a node loses leadership and later becomes leader again before the next
`io_fiber` iteration, the old transfer from the previous term can remain
in `_snapshot_transfers` while `become_leader()` resets progress state.
When the new term emits `install_snapshot(dst)`, `send_snapshot(dst)`
tries to create a new entry for the same destination and can hit the
assertion.

**The Fix**

Abort all in-flight snapshot transfers in `process_fsm_output()` when
`term_and_vote` is persisted. A term/vote change marks existing transfers
as stale, so we clean them up before dispatching messages from that batch
and before any new snapshot transfer is started.

With cross-term cleanup moved to the term-change path, `send_snapshot()`
now asserts the within-term invariant that there is at most one in-flight
transfer per destination.

Fixes: SCYLLADB-862

Backport: The issue is reproducible in master, but is present in all
active branches.

Closes scylladb/scylladb#29092
2026-03-27 10:00:15 +01:00
Andrzej Jackowski
181ad9f476 Revert "audit: disable DDL by default"
This reverts commit c30607d80b.

With the default configuration, enabling DDL has no effect because
no `audit_keyspaces` or `audit_tables` are specified. Including DDL
in the default categories can be misleading for some customers, and
ideally we would like to avoid it.

However, DDL has been one of the default audit categories for years,
and removing it risks silently breaking existing deployments that
depend on it. Therefore, the recent change to disable DDL by default
is reverted.

Fixes: SCYLLADB-1155

Closes scylladb/scylladb#29169
2026-03-27 09:55:11 +01:00
Botond Dénes
854c374ebf test/encryption: wait for topology convergence after abrupt restart
test_reboot uses a custom restart function that SIGKILLs and restarts
nodes sequentially. After all nodes are back up, the test proceeded
directly to reads after wait_for_cql_and_get_hosts(), which only
confirms CQL reachability.

While a node is restarted, other nodes might execute global token
metadata barriers, which advance the topology fence version. The
restarted node has to learn about the new version before it can send
reads/writes to the other nodes. The test issues reads as soon as the
CQL port is opened, which might happen before the last restarted node
learns of the latest topology version. If this node acts as a
coordinator for reads/write before this happens, these will fail as the
other nodes will reject the ops with the outdated topology fence
version.

Fix this by replacing wait_for_cql_and_get_hosts() on the abrupt-restart
path with the more robust get_ready_cql(), which makes sure servers see
each other before refreshing the cql connection. This should ensure that
nodes have exchanged gossip and converged on topology state before any
reads are executed. The rolling_restart() path is unaffected as it
handles this internally.

Fixes: SCYLLADB-557

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29211
2026-03-27 09:52:27 +01:00
Avi Kivity
b708e5d7c9 Merge 'test: fix race condition in test_crashed_node_substitution' from Sergey Zolotukhin
`test_crashed_node_substitution` intermittently failed:
```python
   assert len(gossiper_eps) == (len(server_eps) + 1)
```
The test crashed the node right after a single ACK2 handshake (`finished do_send_ack2_msg`), assuming the node state was visible to all peers. However, since gossip is eventually consistent, the update may not have propagated yet, so some nodes did not see the failed node.

This change: Wait until the gossiper state is visible on peers before continuing the test and asserting.

Fixes: [SCYLLADB-1256](https://scylladb.atlassian.net/browse/SCYLLADB-1256).

backport: this issue may affect CI for all branches, so should be backported to all versions.

[SCYLLADB-1256]: https://scylladb.atlassian.net/browse/SCYLLADB-1256?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#29254

* github.com:scylladb/scylladb:
  test: test_crashed_node_substitution: add docstring and fix whitespace
  test: fix race condition in test_crashed_node_substitution
2026-03-26 21:40:33 +02:00
Petr Gusev
c38e312321 test_lwt_fencing_upgrade: fix quorum failure due to gossip lag
If lwt_workload() sends an update immediately after a
rolling restart, the coordinator might still see a replica as
down due to gossip lagging behind. Concurrently restarting another
node leaves only one available replica, failing the
LOCAL_QUORUM requirement for learn or eventually consistent
sp::query() in sp::cas() and resulting in
a mutation_write_failure_exception.

We fix this problem by waiting for the restarted server
to see 2 other peers. The server_change_version
doesn't do that by default -- it passes
wait_others=0 to server_start().

Fixes SCYLLADB-1136

Closes scylladb/scylladb#29234
2026-03-26 21:25:53 +02:00
bitpathfinder
627a8294ed test: test_crashed_node_substitution: add docstring and fix whitespace
Add a description of the test's intent and scenario; remove extra blanks.
2026-03-26 18:40:17 +01:00
bitpathfinder
5a086ae9b7 test: fix race condition in test_crashed_node_substitution
`test_crashed_node_substitution` intermittently failed:
```
    assert len(gossiper_eps) == (len(server_eps) + 1)
```
The test crashed the node right after a single ACK2 handshake
("finished do_send_ack2_msg"), assuming the node state was
visible to all peers. However, since gossip is eventually
consistent, the update may not have propagated yet, so some
nodes did not see the failed node.

This change: Wait until the gossiper state is visible on
peers before continuing the test and asserting.

Fixes: SCYLLADB-1256.
2026-03-26 18:25:05 +01:00
Robert Bindar
c575bbf1e8 test_refresh_deletes_uploaded_sstables should wait for sstables to get deleted
SSTable unlinking is async, so in some cases it may happen that
the upload dir is not empty immediately after refresh is done.
This patch adjusts test_refresh_deletes_uploaded_sstables so
it waits, with a timeout, until the upload dir becomes empty
instead of just assuming the API will sync on sstables being
gone.

Fixes SCYLLADB-1190

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#29215
2026-03-26 08:43:14 +03:00
Nikos Dragazis
8789c95a85 test: cluster: Add test for migration of multiple keyspaces
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-25 19:11:29 +02:00
Nikos Dragazis
25af8bdc24 test: cluster: Add test for error conditions
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-25 19:11:29 +02:00
Nikos Dragazis
01a51817c4 test: cluster: Add vnodes->tablets migration test (rollback)
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-25 19:11:29 +02:00
Nikos Dragazis
56ec33d3e0 test: cluster: Add vnodes->tablets migration test (1 table, 3 nodes)
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-25 19:11:29 +02:00
Nikos Dragazis
58e930c490 test: cluster: Add vnodes->tablets migration test (1 table, 1 node)
This test runs the vnodes-to-tablets migration for a single table on a
single-node cluster. The node has multiple shards and multiple
power-of-two aligned vnodes, so resharding is triggered.

More details in the docstring.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-25 19:11:29 +02:00
Nikos Dragazis
8837dac2f9 scylla-nodetool: Add migrate-to-tablets subcommand
The vnodes-to-tablets migration is a manual procedure, so orchestration
must be done via nodetool.

This patch adds the following new commands:

* nodetool migrate-to-tablets start {ks}
* nodetool migrate-to-tablets upgrade
* nodetool migrate-to-tablets downgrade
* nodetool migrate-to-tablets status {ks}
* nodetool migrate-to-tablets finalize {ks}

The commands are just wrappers over the REST API.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-25 19:11:29 +02:00
Nikos Dragazis
2a5e6b832a api: Add REST endpoint for vnode-to-tablet migration status
If the keyspace is migrating, it reports the intended and actual storage
mode for each node.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-25 19:11:24 +02:00
Marcin Maliszkiewicz
7fdd650009 Merge 'test: audit: clean up test helper class naming' from Dario Mirovic
Remove unused `pytest.mark.single_node` marker from `TestCQLAudit`.

Rename `TestCQLAudit` to `CQLAuditTester` to reflect that it is a test helper, not a test class. This avoids accidental pytest collection and subsequent warning about `__init__`.

Logs before the fixes:
```
test/cluster/test_audit.py:514: 14 warnings
  /home/dario/dev/scylladb/test/cluster/test_audit.py:514: PytestCollectionWarning: cannot collect test class 'TestCQLAudit' because it has a __init__ constructor (from: cluster/test_audit.py)
    @pytest.mark.single_node
```

Fixes SCYLLADB-1237

This is an addition to the latest master code. No backport needed.

Closes scylladb/scylladb#29237

* github.com:scylladb/scylladb:
  test: audit: rename TestCQLAudit to CQLAuditTester
  test: audit: remove unused pytest.mark.single_node
2026-03-25 15:30:16 +01:00
Radosław Cybulski
1dc20cc8f9 alternator/test: explain why 'always' write isolation mode is used in tests
Improve test comments for test_streams_batchwrite_into_the_same_partition_deletes_existing_items
and test_streams_batchwrite_into_the_same_partition_will_report_wrong_stream_data to explain why
'always' write isolation mode is required: in always_use_lwt mode all items in a batch get the
same CDC timestamp, which triggers the squashing bug. In other modes each item gets a separate
timestamp so the bug doesn't manifest.

Also fix the example in the second test comment to use cleaner key values and correct event type
(INSERT, not MODIFY, since items are inserted into an empty table), and fix the issue reference
from #28452 (the PR) to #28439 (the issue).
2026-03-25 15:15:20 +01:00
Dario Mirovic
552a2d0995 test: audit: rename TestCQLAudit to CQLAuditTester
pytest tries to collect tests for execution in several ways.
One is to pick all classes that start with 'Test'. Those classes
must not have a custom '__init__' constructor. TestCQLAudit does.

TestCQLAudit after migration from test/cluster/dtest is not a test
class anymore, but rather a helper class. There are two ways to fix
this:
1. Add __test__ = False to the TestCQLAudit class
2. Rename it to not start with 'Test'

Option 2 feels better because the new name itself does not convey
the wrong message about its role.
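Both approaches can be sketched as follows; pytest honors a class-level `__test__ = False` attribute to opt a class out of collection, and the class/attribute names here are purely illustrative:

```python
# Option 1: keep the Test* prefix but tell pytest not to collect it.
class TestStyleHelper:
    __test__ = False  # pytest skips this class despite the Test* prefix

    def __init__(self, session):
        self.session = session


# Option 2 (chosen in this patch): rename the class so it no longer
# matches pytest's default Test* collection pattern at all.
class CQLAuditTester:
    def __init__(self, session):
        self.session = session
```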

Fixes SCYLLADB-1237
2026-03-25 13:21:08 +01:00
Dario Mirovic
73de865ca3 test: audit: remove unused pytest.mark.single_node
Remove unused pytest.mark.single_node in TestCQLAudit class.
This is a leftover from audit tests migration from
test/cluster/dtest to test/cluster.

Refs SCYLLADB-1237
2026-03-25 13:18:37 +01:00
Radosław Cybulski
ded62b2c5e alternator/test: add scylla_only to always write isolation fixture
Add scylla_only fixture dependency to the
test_table_ss_new_and_old_images_write_isolation_always fixture.
This ensures all tests using the 'always' write isolation mode
are skipped when running against DynamoDB (--aws), since the
system:write_isolation tag is a Scylla-only feature.
2026-03-25 12:38:09 +01:00
Radosław Cybulski
7d404cdd51 alternator: fix BatchWriteItem squashed Streams entries
BatchWriteItem with items for the same partition (and write isolation
set to always) will trigger LWT and run different cdc code path, which
will result in wrong Streams data being returned to the user -
changes will be randomly squashed together.
For example batch write:

  batch.put_item(Item={'p': 'p', 'c': 'c0'})
  batch.put_item(Item={'p': 'p', 'c': 'c1'})
  batch.put_item(Item={'p': 'p', 'c': 'c2'})

instead of producing 3 modify / insert events will produce one:

  type=INSERT, key={'c': {'S': 'c0'}, 'p': {'S': 'p'}},
      old_image=None, new_image={'c': {'S': 'c2'}, 'p': {'S': 'p'}}

with `new_image` having different `c` key from `key` field.

This happens because BatchWriteItem (when using LWT) emits its changes
to cdc under the same timestamp. This results in all log entries
being put in a single cdc "bucket" (under the same cdc$timestamp key).
The previous parsing algorithm would interpret those changes as a change
to a single item and squash them together.

The patch rewrites the algorithm to group records in a
`std::unordered_map` keyed by the value of the clustering key, which is
added to every cdc log entry. This allows rebuilding all item
modifications.
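The grouping idea can be sketched in Python (the entry field names are assumed for illustration; the real implementation uses a C++ `std::unordered_map`):

```python
from collections import defaultdict

def rebuild_records(cdc_entries):
    """Group CDC log entries that share one cdc$timestamp bucket by
    their clustering key, so each item in the batch yields its own
    stream record instead of all items being squashed into one."""
    by_clustering_key = defaultdict(list)
    for entry in cdc_entries:
        by_clustering_key[entry["clustering_key"]].append(entry)
    # Emit one record per distinct item (i.e. per clustering key).
    return [entries[-1] for entries in by_clustering_key.values()]

# The batch from the example above: three puts into the same partition.
entries = [
    {"clustering_key": "c0", "op": "INSERT"},
    {"clustering_key": "c1", "op": "INSERT"},
    {"clustering_key": "c2", "op": "INSERT"},
]
# Grouping by clustering key yields three records instead of one.
```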

Fixes #28439
Fixes: SCYLLADB-540
2026-03-25 11:40:53 +01:00
Radosław Cybulski
85da03c88d alternator: add BatchWriteItem test (failing)
Add additional BatchWriteItem tests (some failing):
- `test_streams_batchwrite_no_clustering_deletes_non_existing_items`
  `test_streams_batchwrite_no_clustering_deletes_existing_items` -
  these tests pass; we add them here for completeness, as non-clustering
  tables trigger different paths.
- `test_streams_batchwrite_into_the_same_partition_deletes_existing_items` -
  a failing test that checks combinations of puts and deletes in a single
  batch write (for example 3 items: 2 puts and 1 delete).
- `test_streams_batchwrite_into_the_same_partition_will_report_wrong_stream_data` -
  a simple failing test.

The tests fail because the current implementation, when writing cdc log
entries, squashes all changes made to the same partition together.
The data is still there, but when GetRecords is called and we parse the
cdc log entries, we don't correctly recover it (see issue #28439 for
more details).
2026-03-25 11:40:53 +01:00
Marcin Maliszkiewicz
f988ec18cb test/lib: fix port in-use detection in start_docker_service
Previously, the result of when_all was discarded. when_all stores
exceptions in the returned futures rather than throwing, so the outer
catch(in_use&) could never trigger. Now we capture the when_all result
and inspect each future individually to properly detect in_use from
either stream.
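The same pitfall exists in Python's asyncio: `gather(..., return_exceptions=True)` stores exceptions in the result list rather than raising them, so a surrounding try/except never fires unless each result is inspected. An illustrative analog of the fix (not the Seastar code):

```python
import asyncio

class InUse(Exception):
    """Stand-in for the in_use exception raised when a port is taken."""

async def read_stream(fail: bool) -> str:
    if fail:
        raise InUse("port already in use")
    return "ok"

async def detect_in_use() -> bool:
    # Like Seastar's when_all, gather(return_exceptions=True) stores
    # exceptions inside the returned results instead of throwing, so a
    # try/except around this await alone would never catch InUse.
    results = await asyncio.gather(
        read_stream(False), read_stream(True), return_exceptions=True
    )
    # Inspect each result individually to detect InUse from either stream.
    return any(isinstance(r, InUse) for r in results)
```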

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1216

Closes scylladb/scylladb#29219
2026-03-25 11:45:53 +02:00
Artsiom Mishuta
cd1679934c test/pylib: use exponential backoff in wait_for()
Change wait_for() defaults from period=1s/no backoff to period=0.1s
with 1.5x backoff capped at 1.0s. This catches fast conditions in
100ms instead of 1000ms, benefiting ~100 call sites automatically.

Add completion logging with elapsed time and iteration count.

Tested locally with test/cluster/test_fencing.py::test_fence_hints (dev mode),
log output:

  wait_for(at_least_one_hint_failed) completed in 0.83s (4 iterations)
  wait_for(exactly_one_hint_sent) completed in 1.34s (5 iterations)
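The new defaults can be sketched as a simple backoff loop (a synchronous sketch with an assumed signature, not the actual pylib helper, which is async):

```python
import time

def wait_for(condition, deadline, period=0.1, backoff=1.5, max_period=1.0):
    """Poll `condition` until it returns a truthy value or `deadline`
    (a time.monotonic() timestamp) passes.

    Starts with a short period and grows it exponentially, capped at
    `max_period`, so fast conditions are caught within ~100ms while
    slow ones don't keep busy-polling at the initial rate.
    """
    start = time.monotonic()
    iterations = 0
    while True:
        iterations += 1
        value = condition()
        if value:
            elapsed = time.monotonic() - start
            print(f"wait_for completed in {elapsed:.2f}s "
                  f"({iterations} iterations)")
            return value
        if time.monotonic() > deadline:
            raise TimeoutError("wait_for: deadline exceeded")
        time.sleep(period)
        period = min(period * backoff, max_period)
```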

Fixes SCYLLADB-738

Closes scylladb/scylladb#29173
2026-03-24 23:49:49 +02:00
Botond Dénes
d52fbf7ada Merge 'test: cluster: Deflake test_startup_with_keyspaces_violating_rf_rack_valid_keyspaces' from Dawid Mędrek
The test was flaky. The scenario looked like this:

1. Stop server 1.
2. Set its rf_rack_valid_keyspaces configuration option to true.
3. Create an RF-rack-invalid keyspace.
4. Start server 1 and expect a failure during start-up.

It was wrong. We cannot predict when the Raft mutation corresponding to
the newly created keyspace will arrive at the node or when it will be
processed. If the check of the RF-rack-valid keyspaces we perform at
start-up was done before that, it won't include the keyspace. This will
lead to a test failure.

Unfortunately, it's not feasible to perform a read barrier during
start-up. What's more, although it would help the test, it wouldn't be
useful otherwise. Because of that, we simply fix the test, at least for
now.

The new scenario looks like this:

1. Disable the rf_rack_valid_keyspaces configuration option on server 1.
2. Start the server.
3. Create an RF-rack-invalid keyspace.
4. Perform a read barrier on server 1. This will ensure that it has
   observed all Raft mutations, and we won't run into the same problem.
5. Stop the node.
6. Set its rf_rack_valid_keyspaces configuration option to true.
7. Try to start the node and observe a failure.

This will make the test perform consistently.

---

I ran the test (in dev mode, on my local machine) three times before
these changes, and three times with them. I include the time results
below.

Before:
```
real    0m47.570s
user    0m41.631s
sys     0m8.634s

real    0m50.495s
user    0m42.499s
sys     0m8.607s

real    0m50.375s
user    0m41.832s
sys     0m8.789s
```

After:
```
real    0m50.509s
user    0m43.535s
sys     0m9.715s

real    0m50.857s
user    0m44.185s
sys     0m9.811s

real    0m50.873s
user    0m44.289s
sys     0m9.737s
```

Fixes SCYLLADB-1137

Backport: The test is present on all supported branches, and so we
          should backport these changes to them.

Closes scylladb/scylladb#29218

* github.com:scylladb/scylladb:
  test: cluster: Deflake test_startup_with_keyspaces_violating_rf_rack_valid_keyspaces
  test: cluster: Mark test with @pytest.mark.asyncio in test_multidc.py
2026-03-24 21:09:19 +02:00
Patryk Jędrzejczak
141aa2d696 Merge 'test/cluster/test_incremental_repair.py: fix typo + enable compaction DEBUG logs' from Botond Dénes
This PR contains two small improvements to `test_incremental_repair.py`
motivated by the sporadic failure of
`test_tablet_incremental_repair_and_scrubsstables_abort`.

The test fails with `assert 3 == 2` on `len(sst_add)` in the second
repair round. The extra SSTable has `repaired_at=0`, meaning scrub
unexpectedly produced more unrepaired SSTables than anticipated. Since
scrub (and compaction in general) logs at DEBUG level and the test did
not enable debug logging, the existing logs do not contain enough
information to determine the root cause.

**Commit 1** fixes a long-standing typo in the helper function name
(`preapre` -> `prepare`).

**Commit 2** enables `compaction=debug` for the Scylla nodes started by
`do_tablet_incremental_repair_and_ops`, which covers all
`test_tablet_incremental_repair_and_*` variants. This will capture full
compaction/scrub activity on the next reproduction, making the failure
diagnosable.

Refs: SCYLLADB-1086

Backport: test improvement, no backport

Closes scylladb/scylladb#29175

* https://github.com/scylladb/scylladb:
  test/cluster/test_incremental_repair.py: enable compaction DEBUG logs in do_tablet_incremental_repair_and_ops
  test/cluster/test_incremental_repair.py: fix typo preapre -> prepare
2026-03-24 16:27:01 +01:00
Pavel Emelyanov
2d8540f1ee transport: fix process_startup cert-auth path missing connection-ready setup
When authenticate() returns a user directly (certificate-based auth,
introduced in 20e9619bb1), process_startup was missing the same
post-authentication bookkeeping that the no-auth and SASL paths perform:

  - update_scheduling_group(): without it, the connection runs under the
    default scheduling group instead of the one mapped to the user's
    service level.

  - _authenticating = false / _ready = true: without them,
    system.clients reports connection_stage = AUTHENTICATING forever
    instead of READY.

  - on_connection_ready(): without it, the connection never releases its
    slot in the uninitialized-connections concurrency semaphore (acquired
    at connection creation), leaking one unit per cert-authenticated
    connection for the lifetime of the connection.

The omission was introduced when on_connection_ready() was added to the
else and SASL branches in 474e84199c but the cert-auth branch was missed.

Fixes: 20e9619bb1 ("auth: support certificate-based authentication")

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-24 18:02:46 +03:00
Pavel Emelyanov
da6fe14035 transport: test that connection_stage is READY after auth via all process_startup paths
The cert-auth path in process_startup (introduced in 20e9619bb1) was
missing _ready = true, _authenticating = false, update_scheduling_group()
and on_connection_ready(). The result is that connections authenticated
via certificate show connection_stage = AUTHENTICATING in system.clients
forever, run under the wrong service-level scheduling group, and hold
the uninitialized-connections semaphore slot for the lifetime of the
connection.

Add a parametrized cluster test that verifies all three process_startup
branches result in connection_stage = READY:
  - allow_all: AllowAllAuthenticator (no-auth path)
  - password:  PasswordAuthenticator (SASL/process_auth_response path)
  - cert_bypass: CertificateAuthenticator with transport_early_auth_bypass
                 error injection (cert-auth path -- the buggy one)

The injection is added to certificate_authenticator::authenticate() so
tests can bypass actual TLS certificate parsing while still exercising
the cert-auth code path in process_startup.

The cert_bypass case is marked xfail until the bug is fixed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-24 18:01:28 +03:00
Benny Halevy
1a7b013377 test: add test_sstable_clone_preserves_staging_state 2026-03-24 16:48:01 +02:00
Benny Halevy
22f2010477 test: derive sstable state from directory in test_env::make_sstable
Instead of always passing sstable_state::normal, infer the state from
the last component of the directory path by comparing against the known
state subdirectory constants (staging_dir, upload_dir, quarantine_dir).
Any unrecognized path component (the common case for normal-state
sstables) maps to sstable_state::normal.

When a non-normal state is detected, strip the state subdirectory from
dir so that the base table directory is passed to storage.
2026-03-24 16:48:01 +02:00
Ernest Zaslavsky
c670183be8 cmake: fix precompiled header (PCH) creation
Two issues prevented the precompiled header from compiling
successfully when using CMake directly (rather than the
configure.py + ninja build system):

a) Propagate build flags to Rust binding targets reusing the
   PCH. The wasmtime_bindings and inc targets reuse the PCH
   from scylla-precompiled-header, which is compiled with
   Seastar's flags (including sanitizer flags in
   Debug/Sanitize modes). Without matching compile options,
   the compiler rejects the PCH due to flag mismatch (e.g.,
   -fsanitize=address). Link these targets against
   Seastar::seastar to inherit the required compile options.

Closes scylladb/scylladb#28941
2026-03-24 15:53:40 +02:00
Dawid Mędrek
e639dcda0b test: cluster: Deflake test_startup_with_keyspaces_violating_rf_rack_valid_keyspaces
The test was flaky. The scenario looked like this:

1. Stop server 1.
2. Set its rf_rack_valid_keyspaces configuration option to true.
3. Create an RF-rack-invalid keyspace.
4. Start server 1 and expect a failure during start-up.

It was wrong. We cannot predict when the Raft mutation corresponding to
the newly created keyspace will arrive at the node or when it will be
processed. If the check of the RF-rack-valid keyspaces we perform at
start-up was done before that, it won't include the keyspace. This will
lead to a test failure.

Unfortunately, it's not feasible to perform a read barrier during
start-up. What's more, although it would help the test, it wouldn't be
useful otherwise. Because of that, we simply fix the test, at least for
now.

The new scenario looks like this:

1. Disable the rf_rack_valid_keyspaces configuration option on server 1.
2. Start the server.
3. Create an RF-rack-invalid keyspace.
4. Perform a read barrier on server 1. This will ensure that it has
   observed all Raft mutations, and we won't run into the same problem.
5. Stop the node.
6. Set its rf_rack_valid_keyspaces configuration option to true.
7. Try to start the node and observe a failure.

This will make the test perform consistently.

---

I ran the test (in dev mode, on my local machine) three times before
these changes, and three times with them. I include the time results
below.

Before:
```
real    0m47.570s
user    0m41.631s
sys     0m8.634s

real    0m50.495s
user    0m42.499s
sys     0m8.607s

real    0m50.375s
user    0m41.832s
sys     0m8.789s
```

After:
```
real    0m50.509s
user    0m43.535s
sys     0m9.715s

real    0m50.857s
user    0m44.185s
sys     0m9.811s

real    0m50.873s
user    0m44.289s
sys     0m9.737s
```

Fixes SCYLLADB-1137
2026-03-24 14:27:36 +01:00
Patryk Jędrzejczak
503a6e2d7e locator: everywhere_replication_strategy: fix sanity_check_read_replicas when read_new is true
ERMs created in `calculate_vnode_effective_replication_map` have RF computed based
on the old token metadata during a topology change. The reading replicas, however,
are computed based on the new token metadata (`target_token_metadata`) when
`read_new` is true. That can create a mismatch for EverywhereStrategy during some
topology changes - RF can be equal to the number of reading replicas +-1. During
bootstrap, this can cause the
`everywhere_replication_strategy::sanity_check_read_replicas` check to fail in
debug mode.

We fix the check in this commit by allowing one more reading replica when
`read_new` is true.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1147

Closes scylladb/scylladb#29150
2026-03-24 13:43:39 +01:00
Jenkins Promoter
0f02c0d6fa Update pgo profiles - x86_64 2026-03-24 14:11:38 +02:00
Dawid Mędrek
4fead4baae test: cluster: Mark test with @pytest.mark.asyncio in test_multidc.py
One of the tests,
test_startup_with_keyspaces_violating_rf_rack_valid_keyspaces,
didn't have the marker. Let's add it now.
2026-03-24 12:52:00 +01:00
Botond Dénes
ffd58ca1f0 Merge 'test: cluster: Deflake test_write_cl_any_to_dead_node_generates_hints' from Dawid Mędrek
Before these changes, we would send mutations to the node and
immediately query the metrics to see how many hints had been written.
However, that could lead to random failures of the test: even if the
mutations have finished executing, hints are stored asynchronously, so
we don't have a guarantee they have already been processed.

To prevent such failures, we rewrite the check: we will perform multiple
checks against the metrics until we have confirmed that the hints have
indeed been written or we hit the timeout.

We're generous with the timeout: we give the test 60 seconds. That
should be enough time to avoid flakiness even on super slow machines,
and if the test does fail, we will know something is really wrong.

As a bonus, we improve the test in general too. We explicitly express
the preconditions we rely on, as well as bump the log level. If the
test fails in the future, it might be very difficult to debug it
without this additional information.

Fixes SCYLLADB-1133

Backport: The test is present on all supported branches. To avoid
          running into more failures, we should backport these changes
          to them.

Closes scylladb/scylladb#29191

* github.com:scylladb/scylladb:
  test: cluster: Increase log level in test_write_cl_any_to_dead_node_generates_hints
  test: cluster: Await all mutations concurrently in test_write_cl_any_to_dead_node_generates_hints
  test: cluster: Specify min_tablet_count in test_write_cl_any_to_dead_node_generates_hints
  test: cluster: Use new_test_table in test_write_cl_any_to_dead_node_generates_hints
  test: cluster: Introduce auxiliary function keyspace_has_tablets
  test: cluster: Deflake test_write_cl_any_to_dead_node_generates_hints
2026-03-24 13:39:56 +02:00
Calle Wilund
f1b3bff4a5 dockerized_service: Convert log reader to pipes and push to test log
Refs: SCYLLADB-1106

Ensures any stderr logs from mock services will echo to the test log
regardless of the log file we write. To help debug failed CI.
2026-03-24 12:35:42 +01:00
Calle Wilund
38aaed1ed4 test::cluster::conftest::GSServer: Fix unpublish for when publish was not called
Use checked dict access to check the set vars.

Fixes: SCYLLADB-1106
2026-03-24 12:33:56 +01:00
Calle Wilund
b382f3593c scylla_cluster: Use thread safe future signalling 2026-03-24 12:33:56 +01:00
Nikos Dragazis
d09196068c api: Add REST endpoint for migration finalization
The endpoint is the following:

    POST /storage_service/vnode_tablet_migrations/keyspaces/{keyspace}/finalization

When called, it issues a `finalize_migration` topology request and waits
for its completion.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 13:21:12 +02:00
Nikos Dragazis
c88ddecfca topology_coordinator: Add finalize_migration request
Vnodes-to-tablets migration needs a finalization step to finish or
rollback the migration. Finishing the migration involves switching the
keyspace schema to tablets and clearing the `intended_storage_mode` from
system.topology. Rolling back the migration involves deleting the tablet
maps and clearing the `intended_storage_mode`.

The finalization needs to be done as a topology request so that it is
mutually exclusive with other operations such as repair and TRUNCATE.

This patch introduces the `finalize_migration` global topology request
for this purpose. The request takes a keyspace name as an argument.
The direction of the finalization (i.e., forward path vs rollback) is
inferred from the `intended_storage_mode` of all nodes (not ideal,
should be made explicit).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 13:20:39 +02:00
Nikos Dragazis
0e1e6ebdc5 database: Construct migrating tables with tablet ERMs
Extend `database::add_column_family()` with a `storage_mode` argument.
If the table is under vnodes-to-tablets migration and the storage mode
is "tablets", create a tablet ERM.

Make the distributed loader determine the storage mode from topology
(`intended_storage_mode` column in system.topology).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 13:20:39 +02:00
Nikos Dragazis
2f93ab281b api: Add REST endpoint for upgrading nodes to tablets
The endpoint is the following:

    POST /storage_service/vnode_tablet_migrations/node/storage_mode?intended_mode={tablets,vnodes}

This endpoint is part of the vnodes-to-tablets migration process and
controls a node's intended_storage_mode in system.topology. The storage
mode represents the node-local data distribution model, i.e., how data
are organized across shards. The node will apply the intended storage
mode to migrating tables upon next restart by resharding their SSTables
(either on vnode boundaries if intended_mode=tablets, or with the static
sharder if intended_mode=vnodes).

Note that this endpoint controls the intended_storage_mode of the local
node only. This has the nice benefit that once the API call returns, the
change has not only been committed to group0 but also applied to the
local node's state machine. This guarantees that the change is part of
the node's local copy upon next restart; no additional read barrier is
needed.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 13:20:35 +02:00
Nikos Dragazis
c4c3a95863 api: Add REST endpoint for starting vnodes-to-tablets migration
The endpoint is the following:

    POST /storage_service/vnode_tablet_migrations/keyspaces/{keyspace}

Its purpose is to start the migration of a whole keyspace from vnodes to
tablets.

When called, Scylla will synchronously create a tablet map for each
table in the specified keyspace. The tablet maps of all tables are
identical and they mirror the vnode layout; they contain one tablet per
vnode and each tablet uses the same replica hosts and token boundaries
as the corresponding vnode.

The only difference from vnodes lies in the sharding approach. Tablets
are assigned to a single shard - using a round-robin strategy in this
patch - whereas vnodes are distributed evenly across all shards. If the
tablet count per shard is low and tablet sizes are uneven, or some
shards have more tablets than others, performance may degrade during the
migration process. For example, a cluster with i8g.48xlarge (192 vCPUs),
256 vnodes per node and RF=3 will have 256 * 3 / 192 vCPUs = 4 tablet
replicas per shard during the migration. One additional tablet or a
double-sized tablet would cause 25% overcommit.
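The overcommit arithmetic from the example above, restated as a quick check:

```python
# Example cluster from the commit message: i8g.48xlarge nodes.
vcpus = 192
vnodes_per_node = 256
rf = 3

# One tablet per vnode, RF replicas each, spread round-robin over shards.
tablet_replicas_per_shard = vnodes_per_node * rf // vcpus

# With so few tablet replicas per shard, a single extra tablet (or one
# double-sized tablet) on a shard overcommits it by 1/4 = 25%.
overcommit = 1 / tablet_replicas_per_shard
```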

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 13:19:47 +02:00
Andrei Chekun
f6fd3bbea0 test.py: reduce timeout for one test
Reduce the timeout for one test to 60 minutes. The longest test we had
so far was ~10-15 minutes. So reducing this timeout is pretty safe and
should help with hanging tests.

Closes scylladb/scylladb#29212
2026-03-24 12:50:10 +02:00
Benny Halevy
ca9ff134b8 sstables: log debug message in filesystem_storage::clone 2026-03-24 12:26:03 +02:00
Nikos Dragazis
b7f4ae8218 topology_state_machine: Add intended_storage_mode to system.topology
Part of the vnodes-to-tablets migration is to reshard the SSTables of
each node on vnode boundaries. Resharding is a heavy operation that
runs on startup while the node is offline. Since nodes can restart
for unexpected reasons, we need a flag to do it in a controllable way.

We also need the ability to roll back the migration, which requires
resharding in the opposite direction. This means a node must be aware of
the intended migration direction.

To address both requirements, this patch introduces a new column,
intended_storage_mode, in system.topology. A non-null value indicates
that a node should perform a migration and specifies the migration
direction.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 11:06:38 +02:00
Nikos Dragazis
bc8109f1a4 distributed_loader: Wire vnode-based resharding into table populator
Make the table populator migration-aware. If a table is migrating to
tablets, switch from normal resharding to vnode-based resharding.

Vnode-based resharding requires passing a vector of "owned ranges" upon
which resharding will segregate the SSTables. Compute it from the tablet
map. We could also compute them from the vnodes, since tablets are
identical to vnodes during the migration, but in the future we may
switch to a different model (multiple tablets per vnode).

Let the distributed loader decide if a table is migrating or not and
communicate that to the table populator. A table is migrating if the
keyspace replication strategy uses vnodes but the table replication
strategy uses tablets.

Currently, tables cannot enter this "migrating" state; support for this
will be introduced in the next patches.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 11:06:38 +02:00
Nikos Dragazis
63399951df replica: Pick any compaction group for resharding
In the previous patch, reshard compaction was extended with a special
operation mode where SSTables from vnode-based tables are segregated on
vnode boundaries and not with the static sharder. This will later be
wired into vnodes-to-tablets migration.

The problem is that resharding requires a compaction group. With a
vnode-based table, there is only one compaction group per shard, and
this is what the current code utilizes
(`try_get_compaction_group_view_with_static_sharding()`). But the new
operation mode will apply to migrating tables, which use a
`tablet_storage_group_manager`, which creates one compaction group for
each tablet. Some compaction group needs to be selected.

Pick any compaction group that is available on the current shard.
Reshard compaction is an operation that happens early in the startup
process; compaction groups do not own any SSTables yet, so all
compaction groups are equivalent.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 11:06:38 +02:00
Benny Halevy
d1c6141407 compaction: resharding_compaction: add vnodes_resharding option
In this mode, the output sstables generated by resharding
compaction are segregated by token range, based on the keyspace
vnode-based owned token ranges vector.

A basic unit test was also added to sstable_directory_test.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-24 11:06:38 +02:00
Nikos Dragazis
d153a95943 storage_service: Preserve ERM flavor of migrating tables
When a table is migrating from vnodes to tablets, the cluster is in a
mixed state where some nodes use vnode ERMs and others use tablet ERMs.
The ERM flavor is a node-local property that expresses the node's
storage organization.

Preserve the flavor across token metadata changes. The flavor needs to
be on par with storage, but the storage can change only on startup, as
it requires resharding all SSTables to conform with the flavor.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 11:06:38 +02:00
Nikos Dragazis
4a3e26d5e3 tablet_allocator: Exclude migrating tables from load balancing
The tablet load balancer operates on all tablet-based tables that appear
in the tablet metadata.

With the introduction of the vnodes-to-tablets migration procedure later
in this series, migrating tables will also appear in the tablet
metadata, but they need to be treated as vnode tables until migration is
finished. This patch excludes such tables from load balancing.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 11:06:38 +02:00
Nikos Dragazis
3e2dc078c9 feature_service: Add vnodes_to_tablets_migrations feature
Vnodes-to-tablets migrations require cluster-level support: the REST API
and the group0 state need to be supported by all nodes.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 11:06:38 +02:00
Marcin Maliszkiewicz
66be0f4577 Merge 'test: cluster: audit test suite optimization' from Dario Mirovic
Migrate audit tests from test/cluster/dtest to test/cluster. Optimize their execution time through cluster reuse.

The audit test suite is heavy. There are more than 70 test executions. Environment preparation is a significant part of each test case execution time.

This PR:
1. Copies audit tests from test/cluster/dtest to test/cluster, refactoring and enabling them
2. Groups test functions by non-live cluster configuration variations to enable cluster reuse between them
    - Execution time reduced from 4m 29s to 2m 47s, a ~38% decrease
3. Removes the old audit tests from test/cluster/dtest

Includes two supporting changes:
- Allow specifying `AuthProvider` in `ManagerClient.get_cql_exclusive`
- Fix server log file handling for clean clusters

Refs [SCYLLADB-573](https://scylladb.atlassian.net/browse/SCYLLADB-573)

This PR is an improvement and does not require a backport.

[SCYLLADB-573]: https://scylladb.atlassian.net/browse/SCYLLADB-573?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#28650

* github.com:scylladb/scylladb:
  test: cluster: fix log clear race condition in test_audit.py
  test: pylib: shut down exclusive cql connections in ManagerClient
  test: cluster: fix multinode audit entry comparison in test_audit.py
  test: cluster: dtest: remove old audit tests
  test: cluster: group migrated audit tests for cluster reuse
  test: cluster: enable migrated audit tests and make them work
  test: pylib: manager_client: specify AuthProvider in get_cql_exclusive
  test: pylib: scylla cluster after_test log fix
  test: audit: copy audit test from dtest
2026-03-24 09:29:52 +01:00
Dario Mirovic
120f381a9d pgo: fix maintenance socket path too long
The maintenance socket path used for PGO is in the node workdir.
When the node workdir path is too long, the maintenance socket path
(workdir/cql.m) can exceed the Unix domain socket sun_path limit,
failing the PGO training pipeline.

To prevent this:
- pass an explicit --maintenance-socket override
  pointing to a short deterministic path in /tmp derived from the MD5
  hash of the workdir maintenance socket path
- update maintenance_socket_path to return the matching short path
  so that exec_cql.py connects to the right socket

The short path socket files are cleaned up after the cluster stops.

The path uses the MD5 hash of the workdir path, so it is deterministic.
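As a rough illustration (the helper name and the /tmp prefix here are assumptions, not the actual pylib code), the derivation might look like:

```python
import hashlib

# sun_path for AF_UNIX sockets is limited (about 108 bytes on Linux),
# so the derived path must stay well under that.
SUN_PATH_MAX = 108

def short_socket_path(workdir_socket_path: str) -> str:
    # Hypothetical sketch: hash the long workdir socket path with MD5 so
    # the short /tmp path is deterministic for a given workdir.
    digest = hashlib.md5(workdir_socket_path.encode()).hexdigest()
    return f"/tmp/cql-{digest[:16]}.m"
```

Because the hash is taken over the full workdir path, two clusters with different workdirs get distinct socket paths without any coordination.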

Fixes SCYLLADB-1070

Closes scylladb/scylladb#29149
2026-03-24 09:17:10 +01:00
Pavel Emelyanov
f112e42ddd raft: Fix split mutations freeze
Commit faa0ee9844 accidentally broke the way the split snapshot mutation
was frozen -- instead of appending the sub-mutation `m`, the commit kept
the old variable name `mut`, which in the new code corresponds to the
"old" non-split mutation.

Fixes #29051

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29052
2026-03-24 08:53:50 +02:00
Botond Dénes
56c375b1f3 Merge 'table: don't close a disengaged querier in query()' from Pavel Emelyanov
There's a flaw in table::query() -- calling querier_opt->close() can dereference a disengaged std::optional. The fix is pretty simple. Once fixed, there are two if-s checking whether querier_opt is engaged that are worth merging.

The problem doesn't really show itself because table::query() is not called with a null saved_querier, so the de-facto condition is always correct. However, it's better to be on the safe side.

The problem doesn't show itself for real, so it's not worth backporting

Closes scylladb/scylladb#29142

* github.com:scylladb/scylladb:
  table: merge adjacent querier_opt checks in query()
  table: don't close a disengaged querier in query()
2026-03-24 08:47:35 +02:00
Yaniv Kaul
e59a21752d .github/workflows/trigger_jenkins.yaml: add workflow permissions
Potential fix for https://github.com/scylladb/scylladb/security/code-scanning/147.

To fix the problem, add an explicit `permissions:` block to the workflow
(either at the top level or inside the `trigger-jenkins` job) that
constrains the `GITHUB_TOKEN` to the minimal necessary privileges. This
codifies least-privilege in the workflow itself instead of relying on
repository or organization defaults.

The best minimal, non‑breaking change is to define a root‑level
`permissions:` block with read‑only contents access because the job does
not perform any write operations to the repository, nor does it interact
with issues, pull requests, or other GitHub resources. A conservative,
widely accepted baseline is `contents: read`. If later steps require more
permissions, they can be added explicitly, but for this snippet, no such
need is visible.

Concretely, in `.github/workflows/trigger_jenkins.yaml`, insert:

```yaml
permissions:
  contents: read
```

between the `name:` block and the `on:` block (e.g., after line 2).
No additional methods, imports, or definitions are needed since this is
a pure YAML configuration change and does not alter runtime behavior of
the existing shell steps.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27815
2026-03-24 08:40:30 +02:00
Yaniv Kaul
85a531819b .github/workflows/trigger-scylla-ci.yaml: add permissions to workflow
Potential fix for https://github.com/scylladb/scylladb/security/code-scanning/169.

In general, the fix is to add an explicit `permissions:` block to the
workflow (at the root level or per job) so that the `GITHUB_TOKEN` has
only the minimal scopes needed. Since this job only reads event data and
uses secrets to talk to Jenkins, we can restrict `GITHUB_TOKEN` to
read‑only repository contents.

The single best fix here is to add a top‑level `permissions:` block
right under the `name:` (and before `on:`) in
`.github/workflows/trigger-scylla-ci.yaml`, setting `contents: read`.
This applies to all jobs in the workflow, including `trigger-jenkins`,
and does not alter any existing steps or logic. No additional imports or
methods are needed, as this is purely a YAML configuration change for
GitHub Actions.

Concretely, edit `.github/workflows/trigger-scylla-ci.yaml` to insert:

```yaml
permissions:
  contents: read
```

after line 1. No other lines in the file need to change.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27812
2026-03-24 08:37:49 +02:00
Dawid Mędrek
148217bed6 test: cluster: Increase log level in test_write_cl_any_to_dead_node_generates_hints
We increase the log level of `hints_manager` to TRACE in the test.
If it fails, it may be incredibly difficult to debug it without any
additional information.
2026-03-23 19:19:17 +01:00
Dawid Mędrek
2b472fe7fd test: cluster: Await all mutations concurrently in test_write_cl_any_to_dead_node_generates_hints 2026-03-23 19:19:17 +01:00
Dawid Mędrek
ae12c712ce test: cluster: Specify min_tablet_count in test_write_cl_any_to_dead_node_generates_hints
The test relies on the assumption that mutations will be distributed
more or less uniformly over the nodes. Although in practice this should
not happen, it's theoretically possible that there's only one tablet
allocated for the table.

To clearly indicate this precondition, we explicitly set the property
`min_tablet_count` when creating the table. This way, we have a guarantee
that the table has multiple tablets. The load balancer should now take
care of distributing them over the nodes equally. Thanks to that,
`servers[1]` will have some tablets, and so it'll be the target for some
of the mutations we perform.
2026-03-23 19:19:14 +01:00
Dawid Mędrek
dd446aa442 test: cluster: Use new_test_table in test_write_cl_any_to_dead_node_generates_hints
The context manager is the de-facto standard in the test suite. It will
also allow a prettier way to conditionally enable per-table tablet
options in the following commit.
2026-03-23 19:07:01 +01:00
Dawid Mędrek
dea79b09a9 test: cluster: Introduce auxiliary function keyspace_has_tablets
The function is adapted from its counterpart in the cqlpy test suite:
cqlpy/util.py::keyspace_has_tablets. We will use it in a commit in this
series to conditionally set tablet properties when creating a table.
It might also be useful in general.
2026-03-23 19:07:01 +01:00
Dawid Mędrek
3d04fd1d13 test: cluster: Deflake test_write_cl_any_to_dead_node_generates_hints
Before these changes, we would send mutations to the node and
immediately query the metrics to see how many hints had been written.
However, that could lead to random failures of the test: even if the
mutations have finished executing, hints are stored asynchronously, so
we don't have a guarantee they have already been processed.

To prevent such failures, we rewrite the check: we will perform multiple
checks against the metrics until we have confirmed that the hints have
indeed been written or we hit the timeout.

We're generous with the timeout: we give the test 60 seconds. That
should be enough time to avoid flakiness even on super slow machines,
and if the test does fail, we will know something is really wrong.
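The check-until-timeout pattern can be sketched as follows (an illustrative helper, not the actual test code):

```python
import time

def wait_for(condition, timeout: float = 60.0, period: float = 0.5) -> bool:
    # Poll condition() until it returns a truthy value or the timeout
    # elapses. Returns True on success, False if the deadline passed.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(period)
    return False
```

The test would poll the hint metrics with such a loop instead of asserting on a single immediate read, tolerating the asynchronous hint-store path.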

Fixes SCYLLADB-1133
2026-03-23 19:06:57 +01:00
Piotr Dulikowski
63067f594d strong_consistency: fake taking and dropping snapshots
Snapshots are not implemented yet for strong consistency - attempting to
take, transfer or drop a snapshot results in an exception. However, the
logic of our state machine forces a snapshot every
raft::server::configuration::snapshot_threshold log entries, even if
there are no lagging replicas.
where snapshots were being taken even though the cluster was not under
any disruption, and this is one of the possible causes.

It turns out that we can safely allow for taking snapshots right now -
we can just implement it as a no-op and return a random UUID.
Conversely, dropping a snapshot can also be a no-op. This is safe
because snapshot transfer still throws an exception - as long as the
taken/recovered snapshots are never attempted to be transferred.
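A minimal sketch of the idea, with hypothetical class and method names rather than Scylla's actual raft state machine interface:

```python
import uuid

class NoopSnapshotStateMachine:
    # Sketch: taking a snapshot is a no-op that returns a fresh random
    # id, and dropping one is also a no-op. This stays safe because
    # transferring a snapshot still raises, and the taken/recovered
    # snapshots are never actually transferred.
    def take_snapshot(self) -> uuid.UUID:
        return uuid.uuid4()

    def drop_snapshot(self, snapshot_id: uuid.UUID) -> None:
        pass

    def transfer_snapshot(self, snapshot_id: uuid.UUID) -> None:
        raise NotImplementedError("snapshot transfer not implemented")
```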
2026-03-23 17:03:36 +01:00
Piotr Dulikowski
dd1d3dd1ee strong_consistency: adjust limits for snapshots
Raft snapshots are not implemented yet for strong consistency. Adjust
the current raft group config to make them much less likely to occur:

- snapshot_threshold config option decides how many log entries need to
  be applied after the last snapshot before a new one is taken. Set it
  to the maximum value for size_t in order to effectively disable it.
- snapshot_threshold_log_size defines a threshold for the log memory
  usage over which a snapshot is created. Increase it from the default
  2MB to 10MB.
- max_log_size defines the threshold for the log memory usage over which
  requests stop being admitted until the log is shrunk back by a
  snapshot. Set it to 20MB, as this option is recommended to be at least
  twice as much as snapshot_threshold_log_size.

Refs: SCYLLADB-1115
2026-03-23 17:03:36 +01:00
Botond Dénes
772b32d9f7 test/scylla_gdb: fix flakiness by preparing objects at test time
Fixtures previously ran GDB once (module scope) to find live objects
(sstables, tasks, schemas) and stored their addresses. Tests then
reused those addresses in separate GDB invocations. Sometimes these
addresses would become stale and the test would step on use-after-free
(e.g. sstables compacted away between invocations).

Fix by dropping the fixtures. The helper functions used by the fixtures
to obtain the required objects are converted to gdb convenience
functions, which can be used in the same expression as the test command
invocation. Thus, the object is acquired on-demand at the moment it is
used, so it is guaranteed to be fresh and relevant.

Fixes: SCYLLADB-1020

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#28999
2026-03-23 16:54:03 +02:00
Piotr Dulikowski
60fb5270a9 logstor: fix fmt::format use with std::filesystem::path
The version of fmt installed on my machine refuses to work with
`std::filesystem::path` directly. Add `.string()` calls in places that
attempt to print paths directly in order to make them work.

Closes scylladb/scylladb#29148
2026-03-23 15:15:52 +01:00
Pavel Emelyanov
3b9398dfc8 Merge 'encryption: fix deadlock in encrypted_data_source::get()' from Ernest Zaslavsky
When encrypted_data_source::get() caches a trailing block in _next, the next call takes it directly — bypassing input_stream::read(), which checks _eof. It then calls input_stream::read_exactly() on the already-drained stream. Unlike read(), read_up_to(), and consume(), read_exactly() does not check _eof when the buffer is empty, so it calls _fd.get() on a source that already returned EOS.

In production this manifested as stuck encrypted SSTable component downloads during tablet restore: the underlying chunked_download_source hung forever on the post-EOS get(), causing 4 tablets to never complete. The stuck files were always block-aligned sizes (8k, 12k) where _next gets populated and the source is fully consumed in the same call.

Fix by checking _input.eof() before calling read_exactly(). When the stream already reached EOF, buf2 is known to be empty, so the call is skipped entirely.
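The shape of the fix can be illustrated with a toy Python stream (the real code is seastar's C++ input_stream; the names here are stand-ins):

```python
class PostEOSError(Exception):
    """Raised when the underlying source is read after end-of-stream."""

class ToyStream:
    # Minimal stand-in for an input_stream over a strict source that
    # fails on any further read after it has returned end-of-stream.
    def __init__(self, data: bytes):
        self._data = data
        self._eof = False

    def eof(self) -> bool:
        return self._eof

    def read_exactly(self, n: int) -> bytes:
        # Like the real read_exactly(), this does not check eof() when
        # the buffer is empty -- it goes to the source regardless.
        if self._eof:
            raise PostEOSError("source read after end-of-stream")
        buf, self._data = self._data[:n], self._data[n:]
        if not self._data:
            self._eof = True
        return buf

def read_trailing_block(stream: ToyStream, n: int) -> bytes:
    # The fix: check eof() first. Past EOF the result is known to be
    # empty, so read_exactly() is skipped entirely.
    if stream.eof():
        return b""
    return stream.read_exactly(n)
```

Without the eof() guard, the second read on a fully drained stream would hit the source again, which is the post-EOS get() that hung in production.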

A comprehensive test is added that uses a strict_memory_source which fails on post-EOS get(), reproducing the exact code path that caused the production deadlock.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1128

Backport to 2025.3/4 and 2026.1 is needed since it fixes a bug that may bite us in production, to be on the safe side

Closes scylladb/scylladb#29110

* github.com:scylladb/scylladb:
  encryption: fix deadlock in encrypted_data_source::get()
  test_lib: mark `limiting_data_source_impl` as not `final`
  Fix formatting after previous patch
  Fix indentation after previous patch
  test_lib: make limiting_data_source_impl available to tests
2026-03-23 17:12:44 +03:00
Pavel Emelyanov
57ef712243 test/backup: drop create_dataset helper
It has no more callers after the previous patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-23 17:01:20 +03:00
Pavel Emelyanov
2353091cbd test/backup: use new_test_keyspace in test_restore_primary_replica
Replace create_dataset + manual DROP/CREATE KEYSPACE with two sequential
new_test_keyspace context manager blocks, matching the pattern used by
do_test_streaming_scopes. The first block covers backup, the second
covers restore. Keyspace lifecycle is now automatic.

The streaming directions validation loop is moved outside of the second
context block, since it only parses logs and has no dependency on the
keyspace being alive.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-23 16:59:47 +03:00
Botond Dénes
f5438e0587 test/cluster/test_incremental_repair.py: enable compaction DEBUG logs in do_tablet_incremental_repair_and_ops
The test sporadically fails because scrub produces an unexpected number
of SSTables. Compaction logs are needed to diagnose why, but were not
captured since scrub runs at DEBUG level. Enable compaction=debug for
the servers started by do_tablet_incremental_repair_and_ops so the next
reproduction provides enough information to root-cause the issue.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-23 15:48:26 +02:00
Botond Dénes
f6ab576ed9 test/cluster/test_incremental_repair.py: fix typo preapre -> prepare
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-23 15:48:12 +02:00
Piotr Dulikowski
df68d0c0f7 directories: add missing seastar/util/closeable.hh include
Without this include the file would not compile on its own. The issue
was most likely masked by the use of precompiled headers in our CI.

Closes scylladb/scylladb#29170
2026-03-23 15:46:56 +03:00
Yaniv Michael Kaul
051107f5bc scylla-gdb: fix sstable-summary crash on ms-format sstables
The 'scylla sstable-summary' GDB command crashes with
'ValueError: Argument "count" should be greater than zero' when
inspecting ms-format (trie-based) sstables. This happens because
ms-format sstables don't populate the traditional summary structure,
leaving all fields zeroed out, which causes gdb.read_memory() to be
called with a zero count.

Fix by:
- Adding zero-length guards to sstring.to_hex() and sstring.as_bytes()
  to return early when the data length is zero, consistent with the
  existing guard in managed_bytes.get().
- Adding the same guard to scylla_sstable_summary.to_hex().
- Detecting ms-format sstables (version == 5) early in
  scylla_sstable_summary.invoke() and printing an informative message
  instead of attempting to read the unpopulated summary.

Fixes: SCYLLADB-1180

Closes scylladb/scylladb#29162
2026-03-23 12:44:47 +02:00
Calle Wilund
b36dc80835 scylla_cluster: Remove left-over debug printout 2026-03-23 11:07:59 +01:00
Piotr Szymaniak
c8e7e20c5c test/cluster: retry create_table on transient schema agreement timeout
In test_index_requires_rf_rack_valid_keyspace, the create_table call
for a plain tablet-based table can fail with 'Unable to reach schema
agreement' after the server's 10s timeout is exceeded. This happens
when schema gossip propagation across the 4-node cluster takes longer
than expected after a sequence of rapid schema changes earlier in the
test.

Add a retry (up to 2 attempts) on schema agreement errors for this
specific create_table call rather than increasing the server-side
timeout.
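A generic sketch of such a retry wrapper (a hypothetical helper; the test uses the suite's own retry logic):

```python
import time

def with_retries(fn, attempts=2, retry_on=("schema agreement",), delay=1.0):
    # Retry fn up to `attempts` times when the error message matches a
    # known-transient condition (e.g. 'Unable to reach schema
    # agreement'); re-raise any other error immediately.
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as e:
            transient = any(s in str(e) for s in retry_on)
            if attempt == attempts or not transient:
                raise
            time.sleep(delay)
```

Scoping the retry to the error message keeps genuine failures loud while absorbing the occasional slow schema propagation.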

Fixes: SCYLLADB-1135

Closes scylladb/scylladb#29132
2026-03-23 10:45:30 +02:00
Yaniv Kaul
fb1f995d6b .github/workflows/backport-pr-fixes-validation.yaml: workflow does not contain permissions (Potential fix for code scanning alert no. 139)
Potential fix for https://github.com/scylladb/scylladb/security/code-scanning/139,

To fix the problem, explicitly restrict the `GITHUB_TOKEN` permissions
for this workflow/job so it has only what is needed. The script reads PR
data and repository info (which is covered by `contents: read`/default
read scopes) and posts a comment via `github.rest.issues.createComment`,
which requires `issues: write`. No other write scopes (e.g., `contents:
write`, `pull-requests: write`) are necessary.

The best fix without changing functionality is to add a `permissions`
block scoped to this job (or at the workflow root). Since we only see a
single job here, we’ll add it under `check-fixes-prefix`. Concretely, in
`.github/workflows/backport-pr-fixes-validation.yaml`, between the
`runs-on: ubuntu-latest` line (line 10) and `steps:` (line 11), add:

```yaml
    permissions:
      contents: read
      issues: write
```

This keeps the token minimally privileged while still allowing the script
to create issue/PR comments.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27810
2026-03-23 10:30:01 +02:00
Piotr Smaron
32225797cd dtest: fix flaky test_writes_schema_recreated_while_node_down
`read_barrier(session2)` was supposed to ensure `node2` has caught up on schema
before a CL=ALL write. But `patient_cql_connection(node2)` creates a
cluster-aware driver session `(TokenAwarePolicy(DCAwareRoundRobinPolicy()))`
that can route the barrier CQL statement to any node — not necessarily `node2`.
If the barrier runs on `node1` or `node3` (which already have the new schema),
it's a no-op, and `node2` remains stale, thus the observed `WriteFailure`.
The fix is to switch to `patient_exclusive_cql_connection(node2)`,
which uses `WhiteListRoundRobinPolicy([node2_ip])` to pin all CQL to `node2`.
This is already the established pattern used by other tests in the same file.

Fixes: SCYLLADB-1139

No need to backport yet, appeared only on master.

Closes scylladb/scylladb#29151
2026-03-23 10:25:54 +02:00
Michał Chojnowski
f29525f3a6 test/boost/cache_algorithm_test: disable sstable compression to avoid giant index pages
The test intentionally creates huge index pages.
But since 5e7fb08bf3,
the index reader allocates a block of memory for a whole index page,
instead of incrementally allocating small pieces during index parsing.
This giant allocation causes the test to fail spuriously in CI sometimes.

Fix this by disabling sstable compression on the test table,
which puts a hard cap of 2000 keys per index page.

Fixes: SCYLLADB-1152

Closes scylladb/scylladb#29152
2026-03-23 09:57:11 +02:00
Raphael S. Carvalho
05b11a3b82 sstables_loader: use new sstable add path
Use add_new_sstable_and_update_cache() when attaching SSTables
downloaded by the node-scoped local loader.

This is the correct variant for new SSTables: it can unlink the
SSTable on failure to add it, and it can split the SSTable if a
tablet split is in progress. The older
add_sstable_and_update_cache() helper is intended for preexisting
SSTables that are already stable on disk.

Additionally, downloaded SSTables are now left unsealed (TemporaryTOC)
until they are successfully added to the table's SSTable set. The
download path (download_fully_contained_sstables) passes
leave_unsealed=true to create_stream_sink, and attach_sstable opens
the SSTable with unsealed_sstable=true and seals it only inside the
on_add callback — matching the pattern used by stream_blob.cc and
storage_service.cc for tablet streaming.

This prevents a data-resurrection hazard: previously, if the process
crashed between download and attach_sstable, or if attach_sstable
failed mid-loop, sealed (TOC) SSTables would remain in the table
directory and be reloaded by distributed_loader on restart. With
TemporaryTOC, sstable_directory automatically cleans them up on
restart instead.

Fixes  https://scylladb.atlassian.net/browse/SCYLLADB-1085.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#29072
2026-03-23 10:33:04 +03:00
Piotr Szymaniak
f511264831 alternator/test: fix test_ttl_with_load_and_decommission flaky Connection refused error
The native Scylla nodetool reports ECONNREFUSED as 'Connection refused',
not as 'ConnectException' (which is the Java nodetool format). Add
'Connection refused' to the valid_errors list so that transient
connection failures during concurrent decommission/bootstrap topology
changes are properly tolerated.

Fixes SCYLLADB-1167

Closes scylladb/scylladb#29156
2026-03-22 11:01:45 +02:00
Pavel Emelyanov
c114d1b82c api: Inline describe_ring JSON handling
There are two helpers for describe_ring endpoint. Both can be squashed
together for code brevity.

Also, while at it, add validation for the "keyspace" parameter, which
the endpoint previously did not check properly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-20 19:51:32 +03:00
Pavel Emelyanov
9a2e583f29 storage_service: Make describe_ring_for_table() take table_id
All callers already have it. It makes no difference for the method
itself with which table identifier to work, but will help to simplify
the flow in API handler (next patch)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-20 19:49:24 +03:00
Pavel Emelyanov
4bc8ec174c repair: Remove db/config.hh from repair/*.cc files
Now all the code uses repair_service::config and no longer needs global
config description.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-20 19:36:50 +03:00
Pavel Emelyanov
35f625e5c7 repair: Move repair_multishard_reader options onto repair_service::config
This actually uses two interconnected options:
repair_multishard_reader_buffer_hint_size and
repair_multishard_reader_enable_read_ahead.

Both are propagated through repair_service::config and pass their
values to repair_reader/make_reader at construction time.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-20 19:36:50 +03:00
Pavel Emelyanov
9bc0d27aae repair: Move critical_disk_utilization_level onto repair_service::config
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-20 19:23:47 +03:00
Pavel Emelyanov
80aa0fcdc2 repair: Move repair_partition_count_estimation_ratio onto repair_service::config
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-20 19:23:47 +03:00
Pavel Emelyanov
585cb0c718 repair: Move repair_hints_batchlog_flush_cache_time_in_ms onto repair_service::config
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-20 19:23:47 +03:00
Pavel Emelyanov
d8f7f86e10 repair: Move enable_small_table_optimization_for_rbno onto repair_service::config
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-20 19:23:47 +03:00
Pavel Emelyanov
38a23ff927 repair: Introduce repair_service::config
Most other services have their own configs; repair still uses the global
db::config.

Add an empty config struct to repair_service to carry db::config options
the repair service needs.

Subsequent patches will populate the struct with options.

The config is created in main.cc as sharded_parameter because all future
options are live-updateable and should capture their source from
db::config on the correct shard.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-20 19:23:47 +03:00
Pavel Emelyanov
7dce43363e table: merge adjacent querier_opt checks in query()
After the previous fix both guarding if-s start with 'if (querier_opt &&'.
Merge them into a single outer 'if (querier_opt)' block to avoid the
redundant check and make the structure easier to follow.

No functional change.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-20 14:48:08 +03:00
Piotr Dulikowski
cc695bc3f7 Merge 'vector_search: fix race condition on connection timeout' from Karol Nowacki
When a `with_connect` operation timed out, the underlying connection
attempt continued to run in the reactor. This could lead to a crash
if the connection was established/rejected after the client object had
already been destroyed. This issue was observed during the teardown
phase of an upcoming high-availability test case.

This commit fixes the race condition by ensuring the connection attempt
is properly canceled on timeout.

Additionally, the explicit TLS handshake previously forced during the
connection is now deferred to the first I/O operation, which is the
default and preferred behavior.

Fixes: SCYLLADB-832

Backports to 2026.1 and 2025.4 are required, as this issue also exists on those branches and is causing CI flakiness.

Closes scylladb/scylladb#29031

* github.com:scylladb/scylladb:
  vector_search: test: fix flaky test
  vector_search: fix race condition on connection timeout
2026-03-20 11:12:04 +01:00
Petr Gusev
4bfcd035ae test_fencing: add missing await-s
Fixes SCYLLADB-1099

Closes scylladb/scylladb#29133
2026-03-20 10:55:35 +01:00
Pavel Emelyanov
9c1c41df03 table: don't close a disengaged querier in query()
The condition guarding querier_opt->close() checked saved_querier first.
When saved_querier is null, the short-circuit makes the whole condition
true regardless of whether querier_opt is engaged. If partition_ranges
is empty, query_state::done() is true before the while-loop body ever
runs, so querier_opt is never created. Calling querier_opt->close()
then dereferences a disengaged std::optional, which is undefined
behaviour.

Fix by checking querier_opt first. This preserves all existing
semantics (close when not saving, or when saving wouldn't be useful)
while making the no-querier path safe.

Why this doesn't surface today: the sole production call site,
database::query(), never passes a null saved_querier in practice. The
API header documents nullptr as valid ("Pass nullptr when queriers are
not saved"), so the bug is real but latent.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-20 12:25:13 +03:00
Pavel Emelyanov
c4a0f6f2e6 object_store: Don't leave dangling objects by iterating moved-from names vector
The code in upload_file std::move()-s the vector of names into the
merge_objects() method, then iterates over this vector to delete
objects. The iteration is a no-op on the moved-from vector.

The fix is to make the merge_objects() helper take the vector of names
by const reference -- the method doesn't modify the names collection,
and the caller keeps it in stable storage.

Fixes #29060

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29061
2026-03-20 10:09:30 +02:00
Pavel Emelyanov
712ba5a31f utils: Use yielding directory_lister in owner verification
Switch directories::do_verify_owner_and_mode() from lister::scan_dir() to
utils::directory_lister while preserving the previous hidden-entry
behavior.

Make do_verify_subpath use lister::filter_type directly so the
verification helper can pass it straight into directory_lister, and keep
a single yielding iteration loop for directory traversal.

One scan_dir user fewer, towards removing scan_dir from the code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29064
2026-03-20 10:08:38 +02:00
Pavel Emelyanov
961fc9e041 s3: Don't rearm credential timers when credentials are not refreshed
The update_credentials_and_rearm() may get "empty" credentials from
_creds_provider_chain.get_aws_credentials() -- it doesn't throw, but
returns a default-initialized value. In that case expires_at will be
set to time_point::min, and it's probably not a good idea to arm the
refresh timer, and an even worse idea to subtract 1h from it.

Fixes #29056

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29057
2026-03-20 10:07:01 +02:00
Pavel Emelyanov
0a8dc4532b s3: Fix missing upload ID in copy_part trace log
The format string had two {} placeholders but three arguments, so the
_upload_id argument was skipped from formatting.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29053
2026-03-20 10:05:44 +02:00
Botond Dénes
bb5c328a16 Merge 'Squash two primary-replica restoration tests together' from Pavel Emelyanov
The test_restore_primary_replica_same_domain and test_restore_primary_replica_different_domain tests have very much in common. Previously each was also split into two, so we had four tests; now we have two that can also be squashed, and the lines-of-code savings are still worth it.

This is the continuation of #28569

Tests improvement, not backporting

Closes scylladb/scylladb#28994

* github.com:scylladb/scylladb:
  test: Replace a bunch of ternary operators with an if-else block
  test: Squash test_restore_primary_replica_same|different_domain tests
  test: Use the same regexp in test_restore_primary_replica_different|same_domain-s
2026-03-20 10:05:16 +02:00
Pavel Emelyanov
ea2a214959 test/backup: Use unique_name() for backup prefix instead of cf_dir
The do_test_backup_abort() fetched the node's workdir and resolved cf_dir
solely to construct a unique-ish backup prefix:

    prefix = f'{cf_dir}/backup'

The comment already acknowledged this was only "unique(ish)" — relying
on the UUID-derived cf_dir name as a uniqueness source is roundabout.
unique_name() is already imported and used for exactly this purpose
elsewhere in the file.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29030
2026-03-20 10:04:22 +02:00
Pavel Emelyanov
65032877d4 api: Move /storage_service/toppartitions from storage_service.cc to column_family.cc
The endpoint URL remains intact. Having it next to another toppartitions
endpoint (the /column_family/toppartitions one) is natural.

This endpoint only needs sharded<replica::database>&, grabs it from
http_context and doesn't use any other service. In column_family.cc the
database reference is already available as a parameter. One more user
of http_context.db is gone.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#28996
2026-03-20 09:52:33 +02:00
Botond Dénes
de0bdf1a65 Merge 'Decouple test_refresh_deletes_uploaded_sstables from backup test-suite' from Pavel Emelyanov
The test in question uses several helpers from the backup suite, but it doesn't really need them -- the operations it wants to perform can be done with standard pylib methods. "While at it", also remove some dangling, effectively unused local variables from this test (these were apparently left over from the backup tests this one was copied and reworked from)

Enhancing tests, not backporting

Closes scylladb/scylladb#29130

* github.com:scylladb/scylladb:
  test/refresh: Simplify refresh invocation
  test/refresh: Remove r_servers alias for servers
  test/refresh: Replace check_mutation_replicas with a plain CQL SELECT
  test/refresh: Inline keyspace/table/data setup in test_refresh_deletes_uploaded_sstables
  test/refresh: Prepare indentation for new_test_keyspace in test_refresh_deletes_uploaded_sstables
  test/refresh: Decouple test_refresh_deletes_uploaded_sstables from backup tests
  test/refresh: Remove unused wait_for_cql_and_get_hosts import
2026-03-20 09:29:15 +02:00
Botond Dénes
97430e2df5 Merge 'Fix object storage lister entries walking loop' from Pavel Emelyanov
Two issues were found in the lister returned by gs_client_wrapper::make_object_lister():
the lister can report EOF too early when a filter is active, and there is a potential vector out-of-bounds access.

Fixes #29058

The code appeared in 2026.1, worth fixing it there as well

Closes scylladb/scylladb#29059

* github.com:scylladb/scylladb:
  sstables: Fix object storage lister not resetting position in batch vector
  sstables: Fix object storage lister skipping entries when filter is active
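
The corrected walking loop can be illustrated with a minimal Python sketch (schematic names, not the actual gs_client_wrapper code): EOF must be decided on the unfiltered batch, and the read position must be reset for every new batch.

```python
def list_entries(fetch_batch, predicate):
    """Yield entries batch by batch, applying a client-side filter.

    fetch_batch() returns the next batch (a list), or an empty list at EOF.
    Two pitfalls the fixes address:
    - EOF is decided on the *unfiltered* batch: a batch whose entries are
      all filtered out is not end-of-stream.
    - The read position is reset for every new batch; otherwise we index
      out of bounds or skip entries.
    """
    while True:
        batch = fetch_batch()
        if not batch:          # EOF only when the source returns nothing
            return
        pos = 0                # reset position per batch
        while pos < len(batch):
            entry = batch[pos]
            pos += 1
            if predicate(entry):
                yield entry

# The second batch is fully filtered out, yet listing must continue past it.
batches = [["a1", "b1"], ["b2", "b3"], ["a2"], []]
it = iter(batches)
result = list(list_entries(lambda: next(it), lambda e: e.startswith("a")))
```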
2026-03-20 09:12:42 +02:00
Botond Dénes
5573c3b18e Merge 'tablets: Fix deadlock in background storage group merge fiber' from Tomasz Grabiec
When it deadlocks, groups stop merging and the compaction group merge
backlog runs away.

Also, graceful shutdown will be blocked on it.

Found by flaky unit test
test_merge_chooses_best_replica_with_odd_count, which timed-out in 1
in 100 runs.

Reason for deadlock:

When storage groups are merged, the main compaction group of the new
storage group takes a compaction lock, which is appended to
_compaction_reenablers_for_merging, and released when the merge
completion fiber is done with the whole batch.

If we accumulate more than one merge cycle for the fiber, a deadlock
occurs. The lock order will be as follows.

Initial state:

 cg0: main
 cg1: main
 cg2: main
 cg3: main

After 1st merge:

 cg0': main [locked], merging_groups=[cg0.main, cg1.main]
 cg1': main [locked], merging_groups=[cg2.main, cg3.main]

After 2nd merge:

 cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main]

The merge completion fiber will try to stop cg0'.main, which will be
blocked on the compaction lock, which is held by the reenabler in
_compaction_reenablers_for_merging, hence the deadlock.
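
The circular wait can be modeled schematically (names are illustrative stand-ins for the actual C++ objects, not real code from the tree):

```python
# The merge-completion fiber must stop cg0'.main, but cg0'.main's
# compaction lock is only released when that same fiber finishes its
# batch: a classic circular wait.
waits_for = {
    "merge_fiber": "cg0_prime.main.lock",  # fiber waits on the lock
    "cg0_prime.main.lock": "merge_fiber",  # lock released only when fiber completes
}

def has_cycle(graph, start):
    # Follow the single-successor wait chain; revisiting a node means deadlock.
    seen, node = set(), start
    while node in graph:
        if node in seen:
            return True
        seen.add(node)
        node = graph[node]
    return False

deadlocked = has_cycle(waits_for, "merge_fiber")
```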

The fix is to wait for background merge to finish before we start the
next merge. It's achieved by holding old erm in the background merge,
and doing a topology barrier from the merge finalizing transition.

Background merge is supposed to be a relatively quick operation: it
stops compaction groups, so it may wait for active requests, but it
shouldn't prolong the barrier indefinitely.

Tablet tests which trigger merge need to be adjusted to call the
barrier, otherwise they will be vulnerable to the deadlock.

Fixes SCYLLADB-928

Backport to >= 2025.4 because it's the earliest vulnerable due to f9021777d8.

Closes scylladb/scylladb#29007

* github.com:scylladb/scylladb:
  tablets: Fix deadlock in background storage group merge fiber
  replica: table: Propagate old erm to storage group merge
  test: boost: tablets_test: Save tablet metadata when ACKing split resize decision
  storage_service: Extract local_topology_barrier()
2026-03-20 09:05:52 +02:00
Botond Dénes
34473302b0 Merge 'docs: document existing guardrails' from Andrzej Jackowski
This patch series introduces new documentation for existing guardrails.

Moreover:
 - Warning/failure messages of the recently added write CL guardrails (SCYLLADB-259) are rephrased, so all guardrails have similar messages.
 - Some new tests are added to help verify the correctness of the documentation and avoid situations where the documentation and implementation diverge.

Fixes: [SCYLLADB-257](https://scylladb.atlassian.net/browse/SCYLLADB-257)

No backport, just new docs and tests.


Closes scylladb/scylladb#29011

* github.com:scylladb/scylladb:
  test: add new guardrail tests matching documentation scenarios
  test: add metric assertions to guardrail replication strategy tests
  test: use regex matching in guardrail replication strategy tests
  test: extract ks_opts helper in test_guardrail_replication_strategy
  docs: document CQL guardrails
  cql: improve write consistency level guardrail messages
2026-03-20 08:56:00 +02:00
artem.penner
9898e5700b scylla-node-exporter: Add systemd collector to node exporter
This PR enables the node_exporter systemd collector and configures the unit whitelist to include scylla-server.service and systemd-coredump services.

**Motivation**: We currently lack visibility into system-level service states, which is critical for diagnosing stability issues.

This configuration enables two specific use cases:
- Detecting Coredump Loops: We encounter scenarios where ScyllaDB enters a restart loop. To pinpoint SIGSEGV (coredumps) as the root cause, we need to track when the systemd-coredump service becomes active, indicating a dump is being processed.
- Identifying Startup Failures: We need to detect when the scylla-server unit enters a failed state. This is essential for catching unrecoverable errors (e.g., corrupted commitlogs or configuration bugs) that prevent the server from starting.

example of promql queries:
- `node_systemd_unit_state{name=~"systemd-coredump@.*", state="active"} == 1`
- `node_systemd_unit_state{name="scylla-server.service", state="failed"} == 1`

Closes #28402
2026-03-20 08:39:56 +02:00
Andrzej Jackowski
10c4b9b5b0 test: verify signal() detects resource negative leak in rcs
reader_concurrency_semaphore::signal() guards against available
resources exceeding the initial limit after a signal, which would
indicate a bug such as double-returning resources. It reports the
issue via on_internal_error_noexcept and clamps resources back to
the initial values. However, before this commit there were no tests
that verified this behavior, so bugs like SCYLLADB-1014 went
undetected.

Add a test that artificially signals resources that were never
consumed and verifies that signal() detects the negative leak and
clamps available resources back to the initial limit.
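
A minimal Python model of the guard being tested (hypothetical class, not the actual C++ reader_concurrency_semaphore):

```python
# Returning more resources than were ever consumed ("negative leak",
# e.g. a double-return) must be detected and clamped.
class Semaphore:
    def __init__(self, initial):
        self.initial = initial
        self.available = initial
        self.errors = 0          # stands in for on_internal_error_noexcept

    def consume(self, n):
        self.available -= n

    def signal(self, n):
        self.available += n
        if self.available > self.initial:   # negative leak detected
            self.errors += 1
            self.available = self.initial   # clamp back to the initial limit

sem = Semaphore(10)
sem.consume(2)
sem.signal(2)   # legitimate return: no error
sem.signal(2)   # never consumed: detected and clamped
```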

Refs: SCYLLADB-1014
Fixes: SCYLLADB-1031

Closes scylladb/scylladb#28993
2026-03-20 09:21:20 +03:00
Botond Dénes
f9adbc7548 test/cqlpy/test_tombstone_limit.py: disable tombstone-gc for test table
Since 7564a56dc8, all tables default to
repair-mode tombstone-gc, which is identical to immediate-mode for RF=1
tables. Consequently, the tombstones written by the tests in this test
file are immediately collectible, and with some unlucky timing some of
them can be collected before the end of the test, failing the empty-page
prefix check because the empty-page prefix will be smaller than
expected based on the number of tombstones written.
Disable tombstone-gc to remove this source of flakiness.

Fixes: SCYLLADB-1062

Closes scylladb/scylladb#29077
2026-03-20 09:14:29 +03:00
Michał Chojnowski
6b18d95dec test: add a missing reconnect_driver in test_sstable_compression_dictionaries_upgrade.py
Need to work around https://github.com/scylladb/python-driver/issues/295,
lest a CQL query fail spuriously after the cluster restart.

Fixes: SCYLLADB-1114

Closes scylladb/scylladb#29118
2026-03-20 09:05:14 +03:00
Botond Dénes
89388510a0 test/cluster/test_data_resurrection_in_memtable.py: use explicit CL
The test has expectations w.r.t. which writes make it to which nodes:
* inserts make it to all nodes
* the delete makes it to all but one node (QUORUM)

However, this was not expressed with CL, and the default CL=ONE allowed
some nodes to miss the writes, violating the test's expectations on
what data is present on which nodes. This resulted in the test being
flaky and failing the data checks.

Use an explicit CL for the ingestion to prevent this.
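
The quorum arithmetic behind the fix can be sketched in a few lines (schematic, not driver code):

```python
# Minimum replica acknowledgements per consistency level, for RF replicas.
def min_acks(cl, rf):
    return {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[cl]

rf = 3
# With CL=ONE, a write acked by 1 replica plus a QUORUM read touching the
# other two can fail to overlap, so per-node data checks are nondeterministic.
assert min_acks("ONE", rf) + min_acks("QUORUM", rf) <= rf
# CL=ALL puts the inserts on every node; CL=QUORUM makes any two quorums
# overlap in at least one replica (2 + 2 > 3), so the delete is always seen.
assert min_acks("ALL", rf) == 3
assert min_acks("QUORUM", rf) * 2 > rf
```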

The improvements to the test introduced in
a8dd13731f were of great help in
investigating this: traces are now available, and the check happens after
the data is dumped to the logs.

Fixes: SCYLLADB-870
Fixes: SCYLLADB-812
Fixes: SCYLLADB-1102

Closes scylladb/scylladb#29128
2026-03-20 09:02:57 +03:00
Avi Kivity
6b259babeb Merge 'logstor: initial log-structured storage for key-value tables' from Michael Litvak
Introduce an initial and experimental implementation of an alternative log-structured storage engine for key-value tables.

Main flows and components:
* The storage is composed of 32MB files, each divided into segments of 128k. We sequentially write records that contain a mutation and additional metadata: records go to a buffer first and are then written to the active segment sequentially in 4k-sized blocks.
* The primary index in memory maps keys to their location on disk. It is a B-tree per-table that is ordered by tokens, similar to a memtable.
* On reads we calculate the key and look it up in the primary index, then read the mutation from disk with a single disk IO.
* On writes we write the record to a buffer, wait for it to be written to disk, then update the index with the new location, and free the previous record.
* We track the used space in each segment. When overwriting a record, we increase the free space counter for the segment of the previous record that becomes dead. We store the segments in a histogram by usage.
* The compaction process takes segments with low utilization, reads them and writes the live records to new segments, and frees the old segments.
* Segments are initially "mixed" - we write to the active segment records from all tables and all tablets. The "separator" process rewrites records from mixed segments into new segments that are organized by compaction groups (tablets), and frees the mixed segments. Each write is written to the active segment and to a separator buffer of the compaction group, which is eventually flushed to a new segment in the compaction group.
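
The per-segment space accounting described above can be sketched in Python (schematic, not the actual C++ code): overwriting a record adds dead space to the old record's segment, and compaction picks the segments with the least live data first.

```python
SEGMENT_SIZE = 128 * 1024

# Live bytes per segment; all three start fully live.
segments = {0: SEGMENT_SIZE, 1: SEGMENT_SIZE, 2: SEGMENT_SIZE}

def overwrite(seg, record_size):
    # The previous record becomes dead: its segment's live counter drops.
    segments[seg] -= record_size

for _ in range(900):
    overwrite(1, 100)   # segment 1 accumulates lots of dead space
overwrite(2, 100)       # segment 2 only a little

# Compaction candidates, lowest utilization first (the histogram's low end).
candidates = sorted(segments, key=segments.get)
```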

Currently this mode is experimental and requires an experimental flag to be enabled.
Some things that are not supported yet are strong consistency, tablet migration, tablet split/merge, big mutations, tombstone gc, ttl.

to use, add to config:
```
enable_logstor: true

experimental_features:
  - logstor
```

create a table:
```
CREATE TABLE ks.t(pk int PRIMARY KEY, a int, v text) WITH storage_engine = 'logstor';
```

INSERT, SELECT, and DELETE work as expected.
UPDATE is not supported yet.

no backport - new feature

Closes scylladb/scylladb#28706

* github.com:scylladb/scylladb:
  logstor: trigger separator flush for buffers that hold old segments
  docs/dev: add logstor documentation
  logstor: recover segments into compaction groups
  logstor: range read
  logstor: change index to btree by token per table
  logstor: move segments to replica::compaction_group
  db: update dirty mem limits dynamically
  logstor: track memory usage
  logstor: logstor stats api
  logstor: compaction buffer pool
  logstor: separator: flush buffer when full
  logstor: hold segment until index updates
  logstor: truncate table
  logstor: enable/disable compaction per table
  logstor: separator buffer pool
  test: logstor: add separator and compaction tests
  logstor: segment and separator barrier
  logstor: separator debt controller
  logstor: compaction controller
  logstor: recovery: recover mixed segments using separator
  logstor: wait for pending reads in compaction
  logstor: separator
  logstor: compaction groups
  logstor: cache files for read
  logstor: recovery: initial
  logstor: add segment generation
  logstor: reserve segments for compaction
  logstor: index: buckets
  logstor: add buffer header
  logstor: add group_id
  logstor: record generation
  logstor: generation utility
  logstor: use RIPEMD-160 for index key
  test: add test_logstor.py
  api: add logstor compaction trigger endpoint
  replica: add logstor to db
  schema: add logstor cf property
  logstor: initial commit
  db: disable tablet balancing with logstor
  db: add logstor experimental feature flag
2026-03-20 00:18:09 +02:00
Avi Kivity
062751fcec Merge 'db/config: enable ms sstable format by default' from Łukasz Paszkowski
Trie-based sstable indexes are supposed to be (hopefully) a better default than the old BIG indexes.
Make the new format the default for new clusters by naming ms in the default scylla.yaml.

New functionality. No backport needed.

This PR is basically Michał's https://github.com/scylladb/scylladb/pull/26377, plus Jakub's https://github.com/scylladb/scylladb/pull/27332 fixing `sstables_manager::get_highest_supported_format()`, and one test fix.

Closes scylladb/scylladb#28960

* github.com:scylladb/scylladb:
  db/config: announce ms format as highest supported
  db/config: enable `ms` sstable format by default
  cluster/dtest/bypass_cache_test: switch from highest_supported_sstable_format to chosen_sstable_format
  api/system: add /system/chosen_sstable_version
  test/cluster/dtest: reduce num_tokens to 16
2026-03-19 18:19:01 +02:00
Pavel Emelyanov
969dddb630 test/refresh: Simplify refresh invocation
take_snapshot return values were unused so drop them. do_refresh was a
thin wrapper around load_new_sstables that added no logic; inline it
directly into the gather expression.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-19 18:42:57 +03:00
Pavel Emelyanov
de21572b31 test/refresh: Remove r_servers alias for servers
r_servers = servers was a no-op assignment; use servers directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-19 18:42:52 +03:00
Pavel Emelyanov
20b1531e6d test/refresh: Replace check_mutation_replicas with a plain CQL SELECT
The goal of test_refresh_deletes_uploaded_sstables is to verify that
sstables are removed from the upload directory after refresh. The replica
check was just a sanity guard; a simple SELECT of all keys is sufficient
and much lighter.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-19 18:42:48 +03:00
Pavel Emelyanov
c591b9ebe2 test/refresh: Inline keyspace/table/data setup in test_refresh_deletes_uploaded_sstables
Replace create_dataset() with explicit keyspace creation via new_test_keyspace,
inline CREATE TABLE, and direct cql.run_async inserts — matching the pattern
used in do_test_streaming_scopes. This removes the last dependency on backup
helpers for dataset setup and makes the test self-contained.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-19 18:42:44 +03:00
Pavel Emelyanov
06006a6328 test/refresh: Prepare indentation for new_test_keyspace in test_refresh_deletes_uploaded_sstables
Wrap the test body under if True: to pre-indent it, making the subsequent
patch that introduces new_test_keyspace a pure content change with no
whitespace noise.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-19 18:42:40 +03:00
Pavel Emelyanov
67d8cde42d test/refresh: Decouple test_refresh_deletes_uploaded_sstables from backup tests
Replace create_cluster() from object_store/test_backup.py with a plain
manager.servers_add(2) call. The test does not use object storage, so
there is no need to pull in the backup helper along with its config and
logging knobs.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-19 18:42:36 +03:00
Pavel Emelyanov
04f046d2d8 test/refresh: Remove unused wait_for_cql_and_get_hosts import
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-19 18:42:32 +03:00
Botond Dénes
e8b37d1a89 Merge 'doc: fix the installation section' from Anna Stuchlik
This PR fixes the Installation page:

- Replaces `http `with `https `in the download command.
- Replaces the Open Source example from the Installation section for CentOS (we overlooked this example before).

Fixes https://github.com/scylladb/scylladb/issues/29087

This update affects all supported versions and should be backported as a bug fix.

Closes scylladb/scylladb#29088

* github.com:scylladb/scylladb:
  doc: remove the Open Source Example from Installation
  doc: replace http with https in the installation instructions
2026-03-19 17:13:53 +02:00
Dario Mirovic
d2c44722e1 test: cluster: fix log clear race condition in test_audit.py
assert_entries_were_added:
- takes a "before" snapshot of the audit log
- yields to execute a statement
- takes an "after" snapshot of the audit log
- computes new rows by diffing "after" minus "before"

If an audit entry generated by prepare() arrives between the snapshot
and the diff, it inflates the new row count and the test fails with
assert 2 <= 1.

Fix by:
- Adding clear_audit_logs() at the end of prepare(), after all setup
- Waiting for the "completed re-reading configuration file" log message
  after server_update_config
- Draining pending syslog lines before clearing the buffer
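
The drain-then-clear step can be sketched as follows (hypothetical helpers around a plain list/deque, not the actual test code): any lines already in flight must be consumed before the buffer is cleared, or they land after the "before" snapshot and inflate the diff.

```python
from collections import deque

pending = deque(["late-entry-1", "late-entry-2"])  # lines still in flight
buffer = ["old-entry"]                             # the audit log buffer

def drain():
    # Pull every pending line into the buffer before we wipe it.
    while pending:
        buffer.append(pending.popleft())

def clear_audit_logs():
    drain()          # consume in-flight lines first
    buffer.clear()   # now clearing cannot race with late arrivals

clear_audit_logs()
```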

Refs SCYLLADB-573
2026-03-19 16:12:13 +01:00
Dario Mirovic
821f8696a7 test: pylib: shut down exclusive cql connections in ManagerClient
get_cql_exclusive() creates a Cluster object per call, but never
records it. driver_close() cannot shut it down. The cluster's
internal scheduler thread then tries to submit work to an already
shut down executor. This causes RuntimeError:

RuntimeError: cannot schedule new futures after shutdown

Fix this by tracking every exclusive Cluster in a list and shutting
them all down in driver_close().
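
The shape of the fix can be sketched with stand-in types (FakeCluster/Manager are hypothetical, not the actual pylib classes):

```python
class FakeCluster:
    def __init__(self):
        self.shut_down = False
    def shutdown(self):
        self.shut_down = True

class Manager:
    def __init__(self):
        self._exclusive_clusters = []   # the new tracking list

    def get_cql_exclusive(self):
        c = FakeCluster()
        self._exclusive_clusters.append(c)   # previously: never recorded
        return c

    def driver_close(self):
        # Shut down every exclusively created cluster, so no scheduler
        # thread outlives the executor it submits to.
        for c in self._exclusive_clusters:
            c.shutdown()
        self._exclusive_clusters.clear()

m = Manager()
a, b = m.get_cql_exclusive(), m.get_cql_exclusive()
m.driver_close()
```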

Refs SCYLLADB-573
2026-03-19 16:12:13 +01:00
Dario Mirovic
d94999f87b test: cluster: fix multinode audit entry comparison in test_audit.py
assert_entries_were_added computes new audit rows by slicing the "after"
list at the length of the "before" list: rows_after[len(rows_before):].
This assumes new rows always appear at the tail of the combined sorted
list. In a multinode setup, each node generates its own event_time
timestamps. A new row from node A can sort before an old row from node
B, breaking the tail assumption. The assertion "new rows are not the
last rows in the audit table" then fires.

Fix this by splitting the before/after lists per node and computing the
new rows tail independently for each node. This guarantees that per node
ordering, which is monotonic, is respected, and the combined new rows
are sorted afterwards.
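
The per-node diff can be sketched with a simplified row shape of (node, event_time) tuples (a sketch of the idea, not the actual test code): the tail-slice invariant only holds within one node's monotonic timestamps, so split first, diff per node, then merge.

```python
def new_rows(before, after):
    nodes = {n for n, _ in after}
    new = []
    for node in nodes:
        b = [r for r in before if r[0] == node]
        a = [r for r in after if r[0] == node]
        new.extend(a[len(b):])          # tail diff is valid per node
    return sorted(new, key=lambda r: r[1])  # combine and re-sort by time

before = [("A", 1), ("B", 2)]
# Node A's new row (t=3) interleaves with node B's rows in the combined
# sort order, so a single global tail slice would miscount.
after = [("A", 1), ("B", 2), ("A", 3), ("B", 4)]
rows = new_rows(before, after)
```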

Refs SCYLLADB-573
2026-03-19 16:12:13 +01:00
Dario Mirovic
249a6cec1b test: cluster: dtest: remove old audit tests
Since audit tests have been migrated to test/cluster/test_audit.py,
old tests in test/cluster/dtest/audit_test.py have to be removed.

Refs SCYLLADB-573
2026-03-19 16:12:13 +01:00
Dario Mirovic
adc790a8bf test: cluster: group migrated audit tests for cluster reuse
This patch reorganizes the execution flow of the test functions.
They are grouped to enable cluster reuse between specific test
functions. One of the main contributors to the test execution time
is the cluster preparation. This patch significantly reduces the
total test execution time by having way less new cluster preparation
calls and more cluster reuse.

Performance increase on the developer machine is around 38%:
- before: 4m 29s
- after: 2m 47s

Fixes SCYLLADB-573
2026-03-19 16:11:47 +01:00
Dario Mirovic
967b7ff6bf test: cluster: enable migrated audit tests and make them work
Move audit tests from test/cluster/dtest to test/cluster.
The test/cluster environment has less overhead, and audit tests
are heavy, their execution taking a lot of time. This patch
is part of an effort to improve audit test suite performance.

This patch refactors the tests so that they execute correctly,
as well as enables them. A follow up patch will remove the
audit tests in test/cluster/dtest.

All the tests are confirmed to be running after the change.
No dead code present.

Test test_audit_categories_invalid is not parametrized anymore.
It never used the parametrized helper class, so it just ran
the same logic three times. This is why there are now 74,
and not 76, test executions.

Refs SCYLLADB-573
2026-03-19 16:07:28 +01:00
Dario Mirovic
5d51501a0b pgo: use maintenance socket for CQL setup in PGO training
The default 'cassandra' superuser was removed from ScyllaDB, which
broke PGO training. exec_cql.py relied on username/password auth
('cassandra'/'cassandra') to execute setup CQL scripts like auth.cql
and counters.cql.

Switch exec_cql.py to connect via the Unix domain maintenance socket
instead. The maintenance socket bypasses authentication, no credentials
are needed. Additionally, create the 'cassandra' superuser via the
maintenance socket during the populate phase, so that cassandra-stress
keeps working. cassandra-stress hardcodes user=cassandra password=cassandra.

Changes:
- exec_cql.py: replace host/port/username/password arguments with a
  single --socket argument; add connect_maintenance_socket() with
  wait ready logic
- pgo.py: add maintenance_socket_path() helper; update
  populate_auth_conns() and populate_counters() to pass the socket
  path to exec_cql.py

Fixes SCYLLADB-1070

Closes scylladb/scylladb#29081
2026-03-19 16:52:36 +02:00
Dario Mirovic
8367509b3b test: pylib: manager_client: specify AuthProvider in get_cql_exclusive
This patch allows ManagerClient.get_cql_exclusive to accept AuthProvider
as parameter. This will be used in a follow up patch which migrates
audit test suite to test/cluster and requires this functionality for
some tests.

Refs SCYLLADB-573
2026-03-19 15:35:24 +01:00
Dario Mirovic
0a7a69345c test: pylib: scylla cluster after_test log fix
Before any test, a pool of ScyllaCluster objects is created.

At the beginning of a test suite, a ScyllaClusterManager is created,
and given a reference to the pool.
At the end of a test suite, the ScyllaClusterManager is destroyed.

Before each test case:
- ManagerClient is constructed and connected to the ScyllaClusterManager
  of that test suite
- A ScyllaCluster object is fetched from the pool
  - If the pool is empty, a new ScyllaCluster object is created
  - If the pool is not empty, a cached ScyllaCluster object is returned

After each test case:
- Return ScyllaCluster object from ManagerClient to the pool
  - If the cluster is dirty, the pool destroys it
  - If the cluster is clean, the pool caches it
- ManagerClient is destroyed

Many actions mark a cluster as dirty. Normal test execution will always
cause the cluster to be destroyed upon its return to the pool.
ManagerClient.mark_clean is not used in the tests. When it is used,
the flow with cluster reuse happens.

The bug is that the log file is closed even if cluster is not dirty.
This causes an error when trying to log to a reused cluster server.

The solution in this patch is to not close the log file if the cluster
is not dirty. Upon cluster reuse the log file will be open and functional.

Another approach would be to reopen the log file if closed, but the
chosen approach seems cleaner.

Refs SCYLLADB-573
2026-03-19 15:35:24 +01:00
Dario Mirovic
899ae71349 test: audit: copy audit test from dtest
This patch just copies the audit test suite from dtest and
disables it in the test config file. Later patches will
update the code and enable the test suite.

Refs SCYLLADB-573
2026-03-19 15:35:24 +01:00
Andrzej Jackowski
4deeb7ebfc test: add new guardrail tests matching documentation scenarios
Add tests for RF guardrails (min/max warn/fail, RF=0 bypass,
threshold=-1 disable, ALTER KEYSPACE) and write consistency level
guardrails to cover all scenarios described in guardrails.rst.

Test runtime (dev):
test_guardrail_replication_strategy - 6s
test_guardrail_write_consistency_level - 5s

Refs: SCYLLADB-257
2026-03-19 15:07:03 +01:00
Andrzej Jackowski
2a03c634c0 test: add metric assertions to guardrail replication strategy tests
Verify that guardrail violations increment the corresponding metrics.

Refs: SCYLLADB-257
2026-03-19 15:07:03 +01:00
Andrzej Jackowski
81c4e717e2 test: use regex matching in guardrail replication strategy tests
Replace loose substring assertions with regex-based matching against
the exact server message formats. Add regex constants for all
guardrail messages and rewrite create_ks_and_assert_warnings_and_errors()
to verify count and content of warnings and failures.

Refs: SCYLLADB-257
2026-03-19 15:07:03 +01:00
Anna Stuchlik
6b1df5202c doc: remove the instructions to install old versions from Web Installer
The Web Installer page includes instructions to install the old pre-2025.1 Enterprise versions,
which are no longer supported (since we released 2026.1).

This commit removes those redundant and misleading instructions.

Fixes https://github.com/scylladb/scylladb/issues/29099

Closes scylladb/scylladb#29103
2026-03-19 15:47:00 +02:00
Piotr Dulikowski
171504c84f Merge 'auth: migrate some standard role manager APIs to use cache' from Marcin Maliszkiewicz
This patchset migrates: query_all_directly_granted, query_all,
get_attribute, query_attribute_for_all functions to use cache
instead of doing CQL queries. It also includes some preparatory
work which fixes cache update order and triggering.

Main motivation behind this is to make sure that all calls
from service_level_controller::auth_integration are cached,
which we achieve here.

Alternative implementation could move the whole auth_integration
data into auth cache but since auth_integration manages also lifetime
and contains service levels specific logic such solution would be
too complex for little (if any) gain.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-159
Backport: no, not a bug

Closes scylladb/scylladb#28791

* github.com:scylladb/scylladb:
  auth: switch query_attribute_for_all to use cache
  auth: switch get_attribute to use cache
  auth: cache: add heterogeneous map lookups
  auth: switch query_all to use cache
  auth: switch query_all_directly_granted to use cache
  auth: cache: add ability to go over all roles
  raft: service: reload auth cache before service levels
  service: raft: move update_service_levels_effective_cache check
2026-03-19 14:37:22 +01:00
Avi Kivity
5e7fb08bf3 Merge 'Fix bad performance for densely populated partition index pages' from Tomasz Grabiec
This applies to small-partition workloads where index pages have a high partition count and the index doesn't fit in cache. It was observed that the count can be on the order of hundreds. In such a workload, pages undergo constant population, LSA compaction, and LSA eviction, which has a severe impact on CPU utilization.

Refs https://scylladb.atlassian.net/browse/SCYLLADB-620

This PR reduces the impact by several changes:

  - reducing memory footprint in the partition index. Assuming partition key size is 16 bytes, the cost dropped from 96 bytes to 36 bytes per partition.

  - flattening the object graph and amortizing storage. Storing entries directly in the vector. Storing all key values in a single managed_bytes. Making index_entry a trivial struct.

  - index entries and key storage are now trivially moveable, and batched inside vector storage
    so LSA migration can use memcpy(), which amortizes the cost per key. This reduces the cost of LSA segment compaction.

  - LSA eviction is now pretty much constant time for the whole page
    regardless of the number of entries, because elements are trivial and batched inside vectors.
    Page eviction cost dropped from 50 us to 1 us.
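
The flattened layout can be illustrated with a schematic Python analogue (not the actual C++ code): all key bytes live in one contiguous buffer, and each index entry is a trivial (offset, length) pair, so the whole page moves with one memcpy-style copy.

```python
class IndexPage:
    def __init__(self):
        self.key_storage = bytearray()   # single managed_bytes analogue
        self.entries = []                # trivial (offset, length) structs

    def add(self, key: bytes):
        # Append the key bytes to shared storage; record only where they are.
        self.entries.append((len(self.key_storage), len(key)))
        self.key_storage += key

    def key(self, i: int) -> bytes:
        off, length = self.entries[i]
        return bytes(self.key_storage[off:off + length])

page = IndexPage()
for k in [b"apple", b"banana", b"pear"]:
    page.add(k)
```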

Performance evaluated with:

   scylla perf-simple-query -c1 -m200M --partitions=1000000

Before:

```
7774.96 tps (166.0 allocs/op, 521.7 logallocs/op,  54.0 tasks/op,  802428 insns/op,  430457 cycles/op,        0 errors)
7511.08 tps (166.1 allocs/op, 527.2 logallocs/op,  54.0 tasks/op,  804185 insns/op,  430752 cycles/op,        0 errors)
7740.44 tps (166.3 allocs/op, 526.2 logallocs/op,  54.2 tasks/op,  805347 insns/op,  432117 cycles/op,        0 errors)
7818.72 tps (165.2 allocs/op, 517.6 logallocs/op,  53.7 tasks/op,  794965 insns/op,  427751 cycles/op,        0 errors)
7865.49 tps (165.1 allocs/op, 513.3 logallocs/op,  53.6 tasks/op,  788898 insns/op,  425171 cycles/op,        0 errors)
```

After (+318%):

```
32492.40 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109236 insns/op,  103203 cycles/op,        0 errors)
32591.99 tps (130.4 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  108947 insns/op,  102889 cycles/op,        0 errors)
32514.52 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109118 insns/op,  103219 cycles/op,        0 errors)
32491.14 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109349 insns/op,  103272 cycles/op,        0 errors)
32582.90 tps (130.5 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109269 insns/op,  102872 cycles/op,        0 errors)
32479.43 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109313 insns/op,  103242 cycles/op,        0 errors)
32418.48 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109201 insns/op,  103301 cycles/op,        0 errors)
31394.14 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109267 insns/op,  103301 cycles/op,        0 errors)
32298.55 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109323 insns/op,  103551 cycles/op,        0 errors)
```

When the workload is miss-only, with both row cache and index cache disabled (no cache maintenance cost):

  perf-simple-query -c1 -m200M --duration 6000 --partitions=100000 --enable-index-cache=0 --enable-cache=0

Before:

```
9124.57 tps (146.2 allocs/op, 789.0 logallocs/op,  45.3 tasks/op,  889320 insns/op,  357937 cycles/op,        0 errors)
9437.23 tps (146.1 allocs/op, 789.3 logallocs/op,  45.3 tasks/op,  889613 insns/op,  357782 cycles/op,        0 errors)
9455.65 tps (146.0 allocs/op, 787.4 logallocs/op,  45.2 tasks/op,  887606 insns/op,  357167 cycles/op,        0 errors)
9451.22 tps (146.0 allocs/op, 787.4 logallocs/op,  45.3 tasks/op,  887627 insns/op,  357357 cycles/op,        0 errors)
9429.50 tps (146.0 allocs/op, 787.4 logallocs/op,  45.3 tasks/op,  887761 insns/op,  358148 cycles/op,        0 errors)
9430.29 tps (146.1 allocs/op, 788.2 logallocs/op,  45.3 tasks/op,  888501 insns/op,  357679 cycles/op,        0 errors)
9454.08 tps (146.0 allocs/op, 787.3 logallocs/op,  45.3 tasks/op,  887545 insns/op,  357132 cycles/op,        0 errors)
```

After (+55%):

```
14484.84 tps (150.7 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  396164 insns/op,  229490 cycles/op,        0 errors)
14526.21 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  396401 insns/op,  228824 cycles/op,        0 errors)
14567.53 tps (150.7 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  396319 insns/op,  228701 cycles/op,        0 errors)
14545.63 tps (150.6 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395889 insns/op,  228493 cycles/op,        0 errors)
14626.06 tps (150.5 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395254 insns/op,  227891 cycles/op,        0 errors)
14593.74 tps (150.5 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395480 insns/op,  227993 cycles/op,        0 errors)
14538.10 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  397035 insns/op,  228831 cycles/op,        0 errors)
14527.18 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  396992 insns/op,  228839 cycles/op,        0 errors)
```

Same as above, but with summary ratio increased from 0.0005 to 0.005 (smaller pages):

Before:

```
33906.70 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170553 insns/op,   98104 cycles/op,        0 errors)
32696.16 tps (146.0 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170369 insns/op,   98405 cycles/op,        0 errors)
33889.05 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170551 insns/op,   98135 cycles/op,        0 errors)
33893.24 tps (146.1 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170488 insns/op,   98168 cycles/op,        0 errors)
33836.73 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170528 insns/op,   98226 cycles/op,        0 errors)
33897.61 tps (146.0 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170428 insns/op,   98081 cycles/op,        0 errors)
33834.73 tps (146.1 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170438 insns/op,   98178 cycles/op,        0 errors)
33776.31 tps (146.3 allocs/op,  83.9 logallocs/op,  45.2 tasks/op,  170958 insns/op,   98418 cycles/op,        0 errors)
33808.08 tps (146.3 allocs/op,  83.9 logallocs/op,  45.2 tasks/op,  170940 insns/op,   98388 cycles/op,        0 errors)
```

After (+18%):

```
40081.51 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121047 insns/op,   82231 cycles/op,        0 errors)
40005.85 tps (148.6 allocs/op,   4.4 logallocs/op,  45.2 tasks/op,  121327 insns/op,   82545 cycles/op,        0 errors)
39816.75 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121067 insns/op,   82419 cycles/op,        0 errors)
39953.11 tps (148.1 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121027 insns/op,   82258 cycles/op,        0 errors)
40073.96 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121006 insns/op,   82313 cycles/op,        0 errors)
39882.25 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  120925 insns/op,   82320 cycles/op,        0 errors)
39916.08 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121054 insns/op,   82393 cycles/op,        0 errors)
39786.30 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121027 insns/op,   82465 cycles/op,        0 errors)
38662.45 tps (148.3 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121108 insns/op,   82312 cycles/op,        0 errors)
39849.42 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121098 insns/op,   82447 cycles/op,        0 errors)
```

Closes scylladb/scylladb#28603

* github.com:scylladb/scylladb:
  sstables: mx: index_reader: Optimize parsing for no promoted index case
  vint: Use std::countl_zero()
  test: sstable_partition_index_cache_test: Validate scenario of pages with sparse promoted index placement
  sstables: mx: index_reader: Amortize partition key storage
  managed_bytes: Hoist write_fragmented() to common header
  utils: managed_vector: Use std::uninitialized_move() to move objects
  sstables: mx: index_reader: Keep promoted_index info next to index_entry
  sstables: mx: index_reader: Extract partition_index_page::clear_gently()
  sstables: mx: index_reader: Shave-off 16 bytes from index_entry by using raw_token
  sstables: mx: index_reader: Reduce allocation_section overhead during index page parsing by batching allocation
  sstables: mx: index_reader: Keep index_entry directly in the vector
  dht: Introduce raw_token
  test: perf_simple_query: Add 'sstable-format' command-line option
  test: perf_simple_query: Add 'sstable-summary-ratio' command-line option
  test: perf-simple-query: Add option to disable index cache
  test: cql_test_env: Respect enable-index-cache config
2026-03-19 14:42:50 +02:00
Botond Dénes
4981e72607 Merge 'replica: avoid unnecessary computation on token lookup hot path' from Łukasz Paszkowski
`storage_group_of()` sits on the replica-side token lookup hot path, yet it called `tablet_map::get_tablet_id_and_range_side()`, which always computes both the tablet id and the post-split range side — even though most callers only need the storage group id.

The range-side computation is only relevant when a storage group is in tablet splitting mode, but we were paying for it unconditionally on every lookup.

This series fixes that by:

1. Adding `tablet_map::get_tablet_range_side()` so the range side can be computed independently when needed.
2. Adding lazy `select_compaction_group()` overloads that defer the range-side computation until splitting mode is actually active.
3. Switching `storage_group_of()` to use the cheaper `get_tablet_id()` path, only computing the range side on demand.

These are improvements; no backport is required.

Closes scylladb/scylladb#28963

* github.com:scylladb/scylladb:
  replica/table: avoid computing token range side in storage_group_of() on hot path
  replica/compaction_group: add lazy select_compaction_group() overloads
  locator/tablets: add tablet_map::get_tablet_range_side()
2026-03-19 14:27:12 +02:00
Ernest Zaslavsky
aa9da87e97 encryption: fix deadlock in encrypted_data_source::get()
When encrypted_data_source::get() caches a trailing block in
_next, the next call takes it directly — bypassing
input_stream::read(), which checks _eof. It then calls
input_stream::read_exactly() on the already-drained stream.
Unlike read(), read_up_to(), and consume(), read_exactly()
does not check _eof when the buffer is empty, so it calls
_fd.get() on a source that already returned EOS.

In production this manifested as stuck encrypted SSTable
component downloads during tablet restore: the underlying
chunked_download_source hung forever on the post-EOS get(),
causing 4 tablets to never complete. The stuck files were
always block-aligned sizes (8k, 12k) where _next gets
populated and the source is fully consumed in the same call.

Fix by checking _input.eof() before calling read_exactly().
When the stream already reached EOF, buf2 is known to be
empty, so the call is skipped entirely.

A comprehensive test is added that uses a strict_memory_source
which fails on post-EOS get(), reproducing the exact code
path that caused the production deadlock.
2026-03-19 13:54:54 +02:00
Ernest Zaslavsky
f74a54f005 test_lib: mark limiting_data_source_impl as not final 2026-03-19 13:54:54 +02:00
Ernest Zaslavsky
151e945d9f Fix formatting after previous patch 2026-03-19 13:54:44 +02:00
Andrzej Jackowski
517bb8655d test: extract ks_opts helper in test_guardrail_replication_strategy
Factor out ks_opts() to build keyspace options with tablets handling
and use it across all existing replication strategy guardrail tests.
No behavioral changes.

This facilitates further modification of the tests later in this
patch series.

Refs: SCYLLADB-257
2026-03-19 12:49:41 +01:00
Andrzej Jackowski
9b24d9ee7d docs: document CQL guardrails
Add docs/cql/guardrails.rst covering replication factor, replication
strategy, write consistency level, and compact storage guardrails.

Fixes: SCYLLADB-257
2026-03-19 12:49:41 +01:00
Ernest Zaslavsky
537747cf5d Fix indentation after previous patch 2026-03-19 13:48:53 +02:00
Ernest Zaslavsky
2535164542 test_lib: make limiting_data_source_impl available to tests
Relocate the `limiting_data_source_impl` declaration to the header file
so that test code can access it directly.
2026-03-19 13:48:53 +02:00
Botond Dénes
86d7c82993 test/cluster/test_repair.py: use tablets in test_repair_timestamp_difference
After repair, the test does a major to compact all sstables into a
single one, so the results can be simply checked by a select from
mutation_fragments() query. Sometimes off-strategy happens parallel to
this major, so after the major there are still 2 sstables, resulting in
the test failing when checking that the query returns just a single row.
To fix, just use tablets for the test table, tablets don't use
off-strategy anymore.

Fixes: SCYLLADB-940

Closes scylladb/scylladb#29071
2026-03-19 12:42:18 +03:00
Michael Litvak
399260a6c0 test: mv: fix flaky wait for commitlog sync
Previously the test test_interrupt_view_build_shard_registration stopped
the node ungracefully and relied on commitlog periodic mode to persist
the view build progress, which was not reliable.

It can happen that due to timing issues, the view build progress is not
persisted, or some of it is persisted in a different ordering than
expected.

To make the test more reliable we change it to stop the node gracefully,
so the commitlog is persisted in a graceful and consistent way, without
using the periodic mode delay. We need to also change the injection for
the shutdown to not get stuck.

Fixes SCYLLADB-1005

Closes scylladb/scylladb#29008
2026-03-19 10:41:21 +01:00
Pavel Emelyanov
f27dc12b7c Merge 'Fix directory lister leak in table::get_snapshot_details: ' from Benny Halevy
As reported in SCYLLADB-1013, the directory lister must be closed also when an exception is thrown.

For example, see backtrace below:
```
seastar::on_internal_error(seastar::logger&, std::basic_string_view<char, std::char_traits<char>>) at ./build/release/seastar/./seastar/src/core/on_internal_error.cc:57
directory_lister::~directory_lister() at ./utils/lister.cc:77
replica::table::get_snapshot_details(std::filesystem::__cxx11::path, std::filesystem::__cxx11::path) (.resume) at ./replica/table.cc:4081
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/coroutine:247
 (inlined by) seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:129
seastar::reactor::task_queue::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2695
 (inlined by) seastar::reactor::task_queue_group::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3201
seastar::reactor::task_queue_group::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3185
 (inlined by) seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3353
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3245
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:266
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:160
scylla_main(int, char**) at ./main.cc:756
```

Fixes: [SCYLLADB-1013](https://scylladb.atlassian.net/browse/SCYLLADB-1013)

* Requires backport to 2026.1 since the leak exists since 004c08f525

[SCYLLADB-1013]: https://scylladb.atlassian.net/browse/SCYLLADB-1013?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#29084

* github.com:scylladb/scylladb:
  test/boost/database_test: add test_snapshot_ctl_details_exception_handling
  table: get_snapshot_details: fix indentation inside try block
  table: per-snapshot get_snapshot_details: fix typo in comment
  table: per-snapshot get_snapshot_details: always close lister using try/catch
  table: get_snapshot_details: always close lister using deferred_close
2026-03-19 12:40:23 +03:00
Raphael S. Carvalho
3143134968 test: avoid split/major compaction deadlock in tablet split test
Run keyspace compaction asynchronously in
`test_tombstone_gc_correctness_during_tablet_split` and only await it
after `split_sstable_rewrite` is disabled.

The problem is that `keyspace_compaction()` starts with a flush, and that
flush can take around five seconds. During that window the split
compaction is stopped before major compaction is retried. The stop aborts
the in-flight major compaction attempt, then the split proceeds far enough
to enter the `split_sstable_rewrite` injection point.

At that point the test used to wait synchronously for major compaction to
finish, but major compaction cannot finish yet: when it retries, it needs
the same semaphore that is still effectively tied up behind the blocked
split rewrite. So the test waits for major compaction, while the split
waits for the injection to be released, and the code that would release
that injection never runs.

Starting major compaction as a task breaks that cycle. The test can first
disable `split_sstable_rewrite`, let the split get out of the way, and
only then wait for major compaction to complete.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-827.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#29066
2026-03-19 11:12:21 +02:00
Botond Dénes
2e47fd9f56 Merge 'tasks: do not fail the wait request if rpc fails' from Aleksandra Martyniuk
During decommission, we first mark a topology request as done, then shut
down the node, and only in the following steps do we remove the node from
the topology. Thus, a finished request does not imply that the node has
been removed from the topology.

Due to that, in node_ops_virtual_task::wait, while gathering children
from the whole cluster, we may hit a connection exception, because the
node is still in the topology even though it is down.

Modify the get_children method to ignore the exception and warn
about the failure instead.

Keep token_metadata_ptr in get_children to prevent topology from changing.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-867

Needs backports to all versions

Closes scylladb/scylladb#29035

* github.com:scylladb/scylladb:
  tasks: fix indentation
  tasks: do not fail the wait request if rpc fails
  tasks: pass token_metadata_ptr to task_manager::virtual_task::impl::get_children
2026-03-19 10:03:18 +02:00
Piotr Smaron
a2ad57062f docs/cql: clarify WHERE clause boolean limitations
Document that `SELECT ... WHERE` clause currently accepts only conjunctions
of relations joined by `AND` (`OR` is not supported), and that
parentheses cannot be used to group boolean subexpressions.
Add an unsupported query example and point readers to equivalent `IN`
rewrites when applicable.
This problem was raised by one of our users in
https://forum.scylladb.com/t/error-parsing-query-or-unsupported-statement/5299.
While one could infer the answer to the user's question by looking at the
grammar of `SELECT ... WHERE`, it is not immediately obvious to
non-advanced users, so clarifying these concepts is justified.

Fixes: SCYLLADB-1116

Closes scylladb/scylladb#29100
2026-03-19 09:47:22 +02:00
Michael Litvak
31d339e54a logstor: trigger separator flush for buffers that hold old segments
A compaction group has a separator buffer that holds the mixed segments
alive until the separator buffer is flushed. A mixed segment can be
freed only after all separator buffers that hold writes from the segment
are flushed.

Typically a separator buffer is flushed when it becomes full. However,
it is possible, for example, that one compaction group fills more slowly
than others and so keeps many segments alive.

To fix this we trigger a separator flush periodically for separator
buffers that hold old segments. We track the active segment sequence
number and for each separator buffer the oldest sequence number it
holds.
2026-03-18 19:24:28 +01:00
Michael Litvak
ad87eda835 docs/dev: add logstor documentation 2026-03-18 19:24:28 +01:00
Michael Litvak
a0da07e5b7 logstor: recover segments into compaction groups
Fix the logstor recovery to work with compaction groups. When recovering
a segment, find its token range and add it to the appropriate compaction
group. If it doesn't fit in a single compaction group, write each
record to its compaction group's separator buffer.
2026-03-18 19:24:28 +01:00
Michael Litvak
24379acc76 logstor: range read
extend the logstor mutation reader to support range read
2026-03-18 19:24:28 +01:00
Michael Litvak
a9d0211a64 logstor: change index to btree by token per table
Change the primary index to be a btree that is ordered by token,
similarly to a memtable, and create an index per table instead of a
single global index.
2026-03-18 19:24:28 +01:00
Michael Litvak
e7c3942d43 logstor: move segments to replica::compaction_group
Add a segment_set member to replica::compaction_group that manages the
logstor segments that belong to the compaction group, similarly to how
it manages sstables. Add also a separator buffer in each compaction
group.

When writing a mutation to a compaction group, the mutation is written
to the active segment and to the separator buffer of the compaction
group, and when the separator buffer is flushed the segment is added to
the compaction_group's segment set.
2026-03-18 19:24:28 +01:00
Michael Litvak
d69f7eb0ee db: update dirty mem limits dynamically
when logstor is enabled, update the db dirty memory limits dynamically.

previously the threshold was set to 0.5 of the available memory, so 0.5
goes to memtables and 0.5 to others (cache).

when logstor is enabled, we calculate the available memory excluding
logstor, and divide it evenly between memtables and cache.
2026-03-18 19:24:27 +01:00
Michael Litvak
65cd0b5639 logstor: track memory usage
add logstor::get_memory_usage() that returns an estimate of logstor's
memory usage. add tracking of how many unique keys are held in the
index.
2026-03-18 19:24:27 +01:00
Michael Litvak
b7bdb1010a logstor: logstor stats api
add api to get logstor statistics about segments for a table
2026-03-18 19:24:27 +01:00
Michael Litvak
8bd3bd7e2a logstor: compaction buffer pool
pre-allocate write buffers for compaction
2026-03-18 19:24:27 +01:00
Michael Litvak
caf5aa47c2 logstor: separator: flush buffer when full
flush separator buffers when they become full and are switched, instead
of aggregating all the buffers and flushing them when the separator is
switched.
2026-03-18 19:24:27 +01:00
Michael Litvak
6ddb7a4d13 logstor: hold segment until index updates
add a write gate to write_buffer. when writing a record to the write
buffer, the gate is held and passed back to the caller, and the caller
holds the gate until the write operation is complete, including
follow-up operations such as updating the index after the write.

in particular, when writing a mutation in logstor::write, the write
buffer is held open until the write is completed and updated in the
index.

when writing the write buffer to the active segment, we write the buffer
and then wait for the write buffer gate to close, i.e. we wait for all
index updates to complete before proceeding. the segment is held open
until all the write operations and index updates are complete.

this property is useful for correctness: when a segment is closed we
know that all the writes to it are updated in the index. this is needed
in compaction for example, where we take closed segments and check
which records in them are alive by looking them up in the index. if the
index is not updated yet then it will be wrong.
2026-03-18 19:24:27 +01:00
Michael Litvak
bd66edee5c logstor: truncate table
implement freeing all segments of a table for table truncate.

first do a barrier to flush all active and mixed segments and put all the
table's data in compaction groups, then stop compaction for the table,
then free the table's segments and remove the live entries from the
index.
2026-03-18 19:24:27 +01:00
Michael Litvak
489efca47c logstor: enable/disable compaction per table
add functions to enable or disable compaction for a specific compaction
group or for all compaction groups of a table.
2026-03-18 19:24:27 +01:00
Michael Litvak
21db4f3ed8 logstor: separator buffer pool
pre-allocate write buffers for the separator
2026-03-18 19:24:27 +01:00
Michael Litvak
37c485e3d1 test: logstor: add separator and compaction tests 2026-03-18 19:24:27 +01:00
Michael Litvak
31aefdc07d logstor: segment and separator barrier
add barrier operation that forces switch of the active segment and
separator, and waits for all existing segments to close and all
separators to flush.
2026-03-18 19:24:27 +01:00
Michael Litvak
1231fafb46 logstor: separator debt controller
add tracking of the total separator debt (writes that were written to a
separator and are waiting to be flushed), and add flow control that
keeps the debt bounded by delaying normal writes.
2026-03-18 19:24:27 +01:00
Michael Litvak
17cb173e18 logstor: compaction controller
adjust compaction shares by the compaction overhead: how many segments
compaction writes to generate a single free segment for new writes.
2026-03-18 19:24:27 +01:00
Michael Litvak
1da1bb9d99 logstor: recovery: recover mixed segments using separator
on recovery we may find mixed segments. recover them by adding them to a
separator, reading all their records, writing them to the separator, and
flushing the separator.
2026-03-18 19:24:27 +01:00
Michael Litvak
b78cc787a6 logstor: wait for pending reads in compaction
we free a segment from compaction after updating all live records in the
segment to point to new locations in the index. we need to ensure there
are no running operations that use the old locations before we free the
segment.
2026-03-18 19:24:27 +01:00
Michael Litvak
600ec82bec logstor: separator
initial implementation of the separator. it replaces "mixed" segments
(segments that have records from different groups) with per-group
segments.

every write is written to the active segment and to a buffer in the
active separator. the active separator has in-memory buffers by group.
at some threshold number of segments we switch the active segment and
separator atomically, and start flushing the separator.

the separator is flushed by writing the buffers into new non-mixed
segments, adding them to a compaction group, and frees the mixed
segments.
2026-03-18 19:24:27 +01:00
Michael Litvak
009fc3757a logstor: compaction groups
divide the segments in the compaction manager into compaction groups.
compaction will compact only segments from a single compaction group at
a time.
2026-03-18 19:24:27 +01:00
Michael Litvak
b3293f8579 logstor: cache files for read
keep the files of all segments open for reading to improve read performance.
2026-03-18 19:24:26 +01:00
Michael Litvak
5a16980845 logstor: recovery: initial
initial and basic recovery implementation.
* find all files, read their segments and populate the index with the
  newest record for each key.
* find which segments are used and build the usage histogram
2026-03-18 19:24:26 +01:00
Michael Litvak
bc9fc96579 logstor: add segment generation
add a segment generation number that is incremented when the segment is
reused, and written to every buffer that goes into the segment.
this is useful for recovery.
2026-03-18 19:24:26 +01:00
Michael Litvak
719f7cca57 logstor: reserve segments for compaction
reserve segments for compaction so it always has enough segments to run
and doesn't get stuck.

do the compaction writes into full new segments instead of the active
segment.
2026-03-18 19:24:26 +01:00
Michael Litvak
521fca5c92 logstor: index: buckets
divide the primary index into buckets, each bucket containing a btree.
the bucket is selected using bits from the key hash.
2026-03-18 19:24:26 +01:00
Michael Litvak
99c3b1998a logstor: add buffer header
add a buffer header in each write buffer we write that contains some
information that can be useful for recovery and reading.
2026-03-18 19:24:26 +01:00
Michael Litvak
ddd72a16b0 logstor: add group_id
add group_id value to each log record that is passed with the mutation
when writing it.

the group_id will be used to group log records in segments, such that a
segment will contain records only from a single group.

this will be useful for tablet migration. we want each tablet to have
its own segments with all its records, so we can migrate a tablet
efficiently by copying its segments.

the group_id value is set to a value equivalent to the tablet id.
2026-03-18 19:24:26 +01:00
Michael Litvak
08bea860ef logstor: record generation
add a record generation number for each record so we can compare
records and find which one is newer.
2026-03-18 19:24:26 +01:00
Michael Litvak
28f820eb1c logstor: generation utility
basic utility for generation numbers that will be useful next. a
generation number is an unsigned integer that can be incremented and
compared correctly even if it wraps around, assuming the values we
compare were written around the same time.
2026-03-18 19:24:26 +01:00
Michael Litvak
5f649dd39f logstor: use RIPEMD-160 for index key
use a 20-byte hash function for the index key to make hash collisions
very unlikely. we assume there are no hash collisions.
2026-03-18 19:24:26 +01:00
Michael Litvak
a521bcbcee test: add test_logstor.py
add basic tests for key-value tables with logstor storage
2026-03-18 19:24:26 +01:00
Michael Litvak
1ae1f37ec1 api: add logstor compaction trigger endpoint
add a new api endpoint that triggers logstor compaction.
2026-03-18 19:24:26 +01:00
Michael Litvak
2128b1b15c replica: add logstor to db
Add a single logstor instance to the database that is used for writing
to and reading from tables with kv storage.
2026-03-18 19:24:26 +01:00
Michael Litvak
9172cc172e schema: add logstor cf property
add a schema property for tables with logstor storage
2026-03-18 19:24:26 +01:00
Michael Litvak
0b1343747f logstor: initial commit
initial implementation of the logstor storage engine for key-value
tables that supports writes, reads and basic compaction.

main components:
* logstor: this is the main interface to users that supports writing and
  reading back mutations, and manages the internal components.
* index: the primary index in-memory that maps a key to a location on
  disk.
* write buffer: writes go initially to a write buffer. it accumulates
  multiple records in a buffer and writes them to the segment manager in
  4k sized blocks.
* segment manager: manages the storage - files, segments, compaction. it
  manages file and segment allocation, and writes 4k aligned buffers to
  the active segment sequentially. it tracks the used space in each
  segment. the compaction finds segments with low space usage, writes
  them to new segments, and frees the old segments.
2026-03-18 19:24:26 +01:00
Michael Litvak
27fd0c119f db: disable tablet balancing with logstor
initially logstor tables will not support tablet migrations, so
disable tablet balancing if the experimental feature flag is set.
2026-03-18 19:24:26 +01:00
Michael Litvak
ed852a2af2 db: add logstor experimental feature flag
add a new experimental feature flag for key-value tables with the new
logstor storage engine.
2026-03-18 19:24:26 +01:00
Anna Stuchlik
88b98fac3a doc: update the warning about shared dictionary training
This commit updates the inadequate warning on the Advanced Internode (RPC) Compression page.

The warning is replaced with a note about how training data is encrypted.

Fixes https://github.com/scylladb/scylladb/issues/29109

Closes scylladb/scylladb#29111
2026-03-18 19:35:18 +02:00
Avi Kivity
46a6f8e1d3 Merge 'auth: add maintenance_socket_authorizer' from Dario Mirovic
GRANT/REVOKE fails on the maintenance socket connections, because maintenance_auth_service uses allow_all_authorizer. allow_all_authorizer allows all operations, but not GRANT/REVOKE, because they make no sense in its context.

This has been observed during PGO run failure in operations from ./pgo/conf/auth.cql file.

This patch introduces maintenance_socket_authorizer that supports the capabilities of default_authorizer ('CassandraAuthorizer') without needing authorization.

Refs SCYLLADB-1070

This is an improvement, no need for backport.

Closes scylladb/scylladb#29080

* github.com:scylladb/scylladb:
  test: use NetworkTopologyStrategy in maintenance socket tests
  test: use cleanup fixture in maintenance socket auth tests
  auth: add maintenance_socket_authorizer
2026-03-18 19:29:57 +02:00
Pavel Emelyanov
d6c01be09b s3/client: Don't reconstruct regex on every parse_content_range call
Make the pattern static const so it is compiled once at first call rather
than on every Content-Range header parse.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29054
2026-03-18 17:56:33 +02:00
Tomasz Grabiec
4410e9c61a sstables: mx: index_reader: Optimize parsing for no promoted index case
It's a common case with small partition workloads.
2026-03-18 16:25:21 +01:00
Tomasz Grabiec
32f8609b89 vint: Use std::countl_zero()
It handles 0 correctly, and can generate better code. On Broadwell
architecture, it translates to a single instruction (LZCNT). We're
still on Westmere, so it translates to BSR with a conditional move.

Also, drop unnecessary casts and bit arithmetic, which saves a few
instructions.

Move to header so that it's inlined in parsers.
2026-03-18 16:25:21 +01:00
Tomasz Grabiec
6017688445 test: sstable_partition_index_cache_test: Validate scenario of pages with sparse promoted index placement 2026-03-18 16:25:21 +01:00
Tomasz Grabiec
f55bb154ec sstables: mx: index_reader: Amortize partition key storage
This change reduces the cost of partition index page construction and
LSA migration. This is achieved by several things working together:

 - index entries don't store keys as separate small objects (managed_bytes)
   They are written into one managed_bytes fragmented storage, entries
   hold offset into it.

   Before, we paid 16 bytes for managed_bytes plus LSA descriptor for
   the storage (1 byte) plus back-reference in the storage (8 bytes),
   so 25 bytes. Now we only pay 4 bytes for the size offset. If keys are 16
   bytes, that's a reduction from 31 bytes to 20 bytes per key.

 - index entries and key storage are now trivially moveable, so LSA
   migration can use memcpy(), which amortizes the cost per key.

   LSA eviction is now trivial and constant time for the whole page
   regardless of the number of entries. Page eviction dropped from
   14 us to 1 us.

This improves throughput in a CPU-bound miss-heavy read workload where
the partition index doesn't fit in memory.

  scylla perf-simple-query -c1 -m200M --partitions=1000000

Before:

    15328.25 tps (150.0 allocs/op,  14.1 logallocs/op,  45.4 tasks/op,  286769 insns/op,  218134 cycles/op,        0 errors)
    15279.01 tps (149.9 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  287696 insns/op,  218637 cycles/op,        0 errors)
    15347.78 tps (149.7 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  285851 insns/op,  217795 cycles/op,        0 errors)
    15403.68 tps (149.6 allocs/op,  14.1 logallocs/op,  45.2 tasks/op,  285111 insns/op,  216984 cycles/op,        0 errors)
    15189.47 tps (150.0 allocs/op,  14.1 logallocs/op,  45.5 tasks/op,  289509 insns/op,  219602 cycles/op,        0 errors)
    15295.04 tps (149.8 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  288021 insns/op,  218545 cycles/op,        0 errors)
    15162.01 tps (149.8 allocs/op,  14.1 logallocs/op,  45.4 tasks/op,  291265 insns/op,  220451 cycles/op,        0 errors)

After:

    21620.18 tps (148.4 allocs/op,  13.4 logallocs/op,  43.7 tasks/op,  176817 insns/op,  153183 cycles/op,        0 errors)
    20644.03 tps (149.8 allocs/op,  13.5 logallocs/op,  44.3 tasks/op,  187941 insns/op,  160409 cycles/op,        0 errors)
    20588.06 tps (150.1 allocs/op,  13.5 logallocs/op,  44.5 tasks/op,  188090 insns/op,  160818 cycles/op,        0 errors)
    20789.29 tps (149.5 allocs/op,  13.5 logallocs/op,  44.2 tasks/op,  186495 insns/op,  159382 cycles/op,        0 errors)
    20977.89 tps (149.5 allocs/op,  13.4 logallocs/op,  44.2 tasks/op,  183969 insns/op,  158140 cycles/op,        0 errors)
    21125.34 tps (149.1 allocs/op,  13.4 logallocs/op,  44.1 tasks/op,  183204 insns/op,  156925 cycles/op,        0 errors)
    21244.42 tps (148.6 allocs/op,  13.4 logallocs/op,  43.8 tasks/op,  181276 insns/op,  155973 cycles/op,        0 errors)

Mostly because the index now fits in memory.

When it doesn't, the benefits are still visible due to lower LSA overhead.
2026-03-18 16:25:21 +01:00
Tomasz Grabiec
1452e92567 managed_bytes: Hoist write_fragmented() to common header 2026-03-18 16:25:20 +01:00
Tomasz Grabiec
75e6412b1c utils: managed_vector: Use std::uninitialized_move() to move objects
It's shorter, and is supposed to be optimized for trivially-moveable
types.

Important for managed_vector<index_entry>, which can have lots of
elements.
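A minimal sketch of what the call replaces (this is not the managed_vector code; the buffer and demo function are illustrative):

```cpp
#include <cassert>
#include <memory>
#include <string>

// std::uninitialized_move() move-constructs elements into raw,
// uninitialized storage in one call, replacing a hand-written loop.
// Implementations can lower it to a memcpy for trivially-movable types.
bool demo_uninitialized_move() {
    alignas(std::string) unsigned char raw[2 * sizeof(std::string)];
    std::string src[2] = {"hello", "world"};
    auto* dst = reinterpret_cast<std::string*>(raw);
    std::uninitialized_move(src, src + 2, dst);
    bool ok = (dst[0] == "hello") && (dst[1] == "world");
    std::destroy(dst, dst + 2);   // raw storage: destroy elements manually
    return ok;
}
```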
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
50dc7c6dd8 sstables: mx: index_reader: Keep promoted_index info next to index_entry
Densely populated pages have no promoted index (small partitions), so
we can save space in such workloads by keeping the promoted index in a
separate vector.

For workloads which do have a promoted index, pages have only one
partition. There aren't many such pages and they are long-lived, so
the extra allocation of the vector is amortized.

promoted_index class is removed, and replaced with equivalent
parsed_promoted_index_entry for simplicity. Because it's removed,
make_cursor() is moved into the index_reader class.

Reducing the size of index_entry is important for performance if pages
are densely populated. It helps to reduce LSA allocator pressure and
improves compaction/eviction speed.

This change, combined with the earlier change "Shave-off 16 bytes from
index_entry by using raw_token", gives significant improvement in
throughput in perf_simple_query run where the index doesn't fit in
memory:

  scylla perf-simple-query -c1 -m200M --partitions=1000000

Before:

9714.78 tps (170.9 allocs/op,  16.9 logallocs/op,  55.3 tasks/op,  494788 insns/op,  343920 cycles/op,        0 errors)
9603.13 tps (171.6 allocs/op,  17.0 logallocs/op,  55.6 tasks/op,  502358 insns/op,  348344 cycles/op,        0 errors)
9621.43 tps (171.9 allocs/op,  17.0 logallocs/op,  55.8 tasks/op,  500612 insns/op,  347508 cycles/op,        0 errors)
9597.75 tps (171.6 allocs/op,  17.0 logallocs/op,  55.6 tasks/op,  501428 insns/op,  348604 cycles/op,        0 errors)
9615.54 tps (171.6 allocs/op,  16.9 logallocs/op,  55.6 tasks/op,  501313 insns/op,  347935 cycles/op,        0 errors)
9577.03 tps (171.8 allocs/op,  17.0 logallocs/op,  55.7 tasks/op,  503283 insns/op,  349251 cycles/op,        0 errors)

After:

15328.25 tps (150.0 allocs/op,  14.1 logallocs/op,  45.4 tasks/op,  286769 insns/op,  218134 cycles/op,        0 errors)
15279.01 tps (149.9 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  287696 insns/op,  218637 cycles/op,        0 errors)
15347.78 tps (149.7 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  285851 insns/op,  217795 cycles/op,        0 errors)
15403.68 tps (149.6 allocs/op,  14.1 logallocs/op,  45.2 tasks/op,  285111 insns/op,  216984 cycles/op,        0 errors)
15189.47 tps (150.0 allocs/op,  14.1 logallocs/op,  45.5 tasks/op,  289509 insns/op,  219602 cycles/op,        0 errors)
15295.04 tps (149.8 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  288021 insns/op,  218545 cycles/op,        0 errors)
15162.01 tps (149.8 allocs/op,  14.1 logallocs/op,  45.4 tasks/op,  291265 insns/op,  220451 cycles/op,        0 errors)
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
5e228a8387 sstables: mx: index_reader: Extract partition_index_page::clear_gently()
There will be more elements to clear. And partition_index_page should
know how to clear itself.
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
2d77e4fc28 sstables: mx: index_reader: Shave-off 16 bytes from index_entry by using raw_token
The std::optional<> adds 8 bytes.

And dht::token adds 8 bytes due to _kind, which in this case is always
kind::key.

The size changed from 56 to 48 bytes.
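The arithmetic can be illustrated with hypothetical stand-in types (these are not the actual Scylla definitions, and the sizes assume a typical 64-bit ABI):

```cpp
#include <cstdint>
#include <optional>

// Illustrative layouts only, not the real dht::token/raw_token types.
enum class token_kind : uint8_t { before_all_keys, key, after_all_keys };

struct token_with_kind {      // like dht::token: kind tag + 64-bit value
    token_kind kind;
    uint64_t value;           // padding rounds the struct up to 16 bytes
};

struct raw_token {            // kind is implicitly token_kind::key
    uint64_t value;           // 8 bytes, no padding
};

static_assert(sizeof(raw_token) == 8);
// The kind tag costs 8 bytes via alignment padding...
static_assert(sizeof(token_with_kind) == 2 * sizeof(raw_token));
// ...and wrapping in std::optional<> costs another 8.
static_assert(sizeof(std::optional<raw_token>) == 2 * sizeof(raw_token));
```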
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
e9c98274b5 sstables: mx: index_reader: Reduce allocation_section overhead during index page parsing by batching allocation
If the page has many entries, we continuously enter and leave the
allocating section for every key. This can be avoided by batching LSA
operations for the whole page, after collecting all the entries.

Later optimizations will also build on this, where we will allocate
fragmented storage for keys in LSA using a single managed_bytes
constructor.

This alone brings only a minor improvement, but it does reduce LSA
allocations, probably due to less frequent memory reclamation:

  scylla perf-simple-query -c1 -m200M --duration 6000 --partitions=1000000

Before:

  9560.42 tps (172.2 allocs/op,  19.6 logallocs/op,  57.7 tasks/op,  567741 insns/op,  345158 cycles/op,        0 errors)
  9445.95 tps (173.1 allocs/op,  19.7 logallocs/op,  58.1 tasks/op,  579075 insns/op,  352173 cycles/op,        0 errors)
  9576.75 tps (172.2 allocs/op,  19.6 logallocs/op,  57.6 tasks/op,  572004 insns/op,  347373 cycles/op,        0 errors)
  9597.16 tps (172.2 allocs/op,  19.6 logallocs/op,  57.6 tasks/op,  569615 insns/op,  346618 cycles/op,        0 errors)
  9454.07 tps (173.5 allocs/op,  19.8 logallocs/op,  58.3 tasks/op,  579213 insns/op,  351569 cycles/op,        0 errors)

After:

  9562.21 tps (172.0 allocs/op,  17.0 logallocs/op,  55.8 tasks/op,  499225 insns/op,  347832 cycles/op,        0 errors)
  9480.20 tps (172.3 allocs/op,  17.0 logallocs/op,  55.9 tasks/op,  507271 insns/op,  350640 cycles/op,        0 errors)
  9512.42 tps (172.1 allocs/op,  17.0 logallocs/op,  55.9 tasks/op,  504247 insns/op,  350392 cycles/op,        0 errors)
  9498.45 tps (172.4 allocs/op,  17.1 logallocs/op,  55.9 tasks/op,  505765 insns/op,  350320 cycles/op,        0 errors)
  9076.30 tps (173.5 allocs/op,  17.1 logallocs/op,  56.5 tasks/op,  512791 insns/op,  354792 cycles/op,        0 errors)
  9542.62 tps (171.9 allocs/op,  17.0 logallocs/op,  55.8 tasks/op,  502532 insns/op,  348922 cycles/op,        0 errors)
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
0e0f9f41b3 sstables: mx: index_reader: Keep index_entry directly in the vector
Partition index entries are relatively small, and if the workload has
small partitions, index pages have a lot of elements. Currently, index
entries are indirected via managed_ref, which causes increased cost of
LSA eviction and compaction. This patch amortizes this cost by storing
them directly in the managed_chunked_vector.
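In miniature, the layout change looks like this (hypothetical entry type; the real code uses LSA-managed containers rather than std::vector):

```cpp
#include <memory>
#include <vector>

// Illustrative stand-in for an index entry, not the real index_entry.
struct entry { long token; long position; };

// Entries stored inline are contiguous and move as one block; the
// managed_ref-style indirection the patch removes would add one
// separately tracked allocation (and relocation target) per element:
//   std::vector<std::unique_ptr<entry>> page;   // one allocation each
bool inline_entries_are_contiguous() {
    std::vector<entry> page;          // entries stored directly
    page.push_back({1, 10});
    page.push_back({2, 20});
    return &page[1] == &page[0] + 1;  // contiguous inline storage
}
```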

This gives about 23% improvement in throughput in perf-simple-query
for a workload where the index doesn't fit in memory:

  scylla perf-simple-query -c1 -m200M --duration 6000 --partitions=1000000

Before:

  7774.96 tps (166.0 allocs/op, 521.7 logallocs/op,  54.0 tasks/op,  802428 insns/op,  430457 cycles/op,        0 errors)
  7511.08 tps (166.1 allocs/op, 527.2 logallocs/op,  54.0 tasks/op,  804185 insns/op,  430752 cycles/op,        0 errors)
  7740.44 tps (166.3 allocs/op, 526.2 logallocs/op,  54.2 tasks/op,  805347 insns/op,  432117 cycles/op,        0 errors)
  7818.72 tps (165.2 allocs/op, 517.6 logallocs/op,  53.7 tasks/op,  794965 insns/op,  427751 cycles/op,        0 errors)
  7865.49 tps (165.1 allocs/op, 513.3 logallocs/op,  53.6 tasks/op,  788898 insns/op,  425171 cycles/op,        0 errors)

After:

  9560.42 tps (172.2 allocs/op,  19.6 logallocs/op,  57.7 tasks/op,  567741 insns/op,  345158 cycles/op,        0 errors)
  9445.95 tps (173.1 allocs/op,  19.7 logallocs/op,  58.1 tasks/op,  579075 insns/op,  352173 cycles/op,        0 errors)
  9576.75 tps (172.2 allocs/op,  19.6 logallocs/op,  57.6 tasks/op,  572004 insns/op,  347373 cycles/op,        0 errors)
  9597.16 tps (172.2 allocs/op,  19.6 logallocs/op,  57.6 tasks/op,  569615 insns/op,  346618 cycles/op,        0 errors)
  9454.07 tps (173.5 allocs/op,  19.8 logallocs/op,  58.3 tasks/op,  579213 insns/op,  351569 cycles/op,        0 errors)

Disabling the partition index doesn't improve the throughput beyond
that.
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
b6bfdeb111 dht: Introduce raw_token
Most tokens stored in data structures are for key-scoped tokens, and
we don't need to pay for token::kind storage.
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
3775593e53 test: perf_simple_query: Add 'sstable-format' command-line option 2026-03-18 16:25:20 +01:00
Tomasz Grabiec
6ee9bc63eb test: perf_simple_query: Add 'sstable-summary-ratio' command-line option 2026-03-18 16:25:20 +01:00
Tomasz Grabiec
38d130d9d0 test: perf-simple-query: Add option to disable index cache 2026-03-18 16:25:20 +01:00
Tomasz Grabiec
5ee61f067d test: cql_test_env: Respect enable-index-cache config
Mirrors the code in main.cc
2026-03-18 16:25:20 +01:00
Aleksandra Martyniuk
2d16083ba6 tasks: fix indentation 2026-03-18 15:37:24 +01:00
Aleksandra Martyniuk
1fbf3a4ba1 tasks: do not fail the wait request if rpc fails
During decommission, we first mark a topology request as done, then shut
down the node, and in the following steps we remove the node from the
topology. Thus, a finished request does not imply that the node has been
removed from the topology.

Due to that, in node_ops_virtual_task::wait, while gathering children
from the whole cluster, we may hit a connection exception, because
the node is still in the topology even though it is down.

Modify the get_children method to ignore the exception and warn
about the failure instead.
2026-03-18 15:37:24 +01:00
Aleksandra Martyniuk
d4fdeb4839 tasks: pass token_metadata_ptr to task_manager::virtual_task::impl::get_children
In get_children we get the vector of alive nodes with get_nodes.
Yet, between this and sending RPCs to those nodes there might be
a preemption. Currently, the liveness of a node is checked once
again before the RPCs (only with the gossiper, not the topology,
unlike get_nodes).

Modify get_children, so that it keeps a token_metadata_ptr,
preventing topology from changing between get_nodes and rpcs.

Remove test_get_children, as it checked that the get_children method
won't fail if a node goes down after get_nodes, which cannot happen
currently.
2026-03-18 15:37:24 +01:00
Calle Wilund
0013f22374 memtable_test::memtable_flush_period: Change sleep to use injection signal instead
Fixes: SCYLLADB-942

Adds an injection signal _from_ table::seal_active_memtable to allow us to
reliably wait for flushing. And does so.

Closes scylladb/scylladb#29070
2026-03-18 16:23:13 +02:00
Botond Dénes
ae17596c2a Merge 'Demote log level on split failure during shutdown' from Raphael Raph Carvalho
Since commit 509f2af8db, gate_closed_exception can be triggered for ongoing split during shutdown. The commit is correct, but it causes split failure on shutdown to log an error, which causes CI instability. Previously, aborted_exception would be triggered instead which is logged as warning. Let's do the same.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-951.
Fixes https://github.com/scylladb/scylladb/issues/24850.

Only 2026.1 is affected.

Closes scylladb/scylladb#29032

* github.com:scylladb/scylladb:
  replica: Demote log level on split failure during shutdown
  service: Demote log level on split failure during shutdown
2026-03-18 16:21:05 +02:00
Pavel Emelyanov
8b1ca6dcd6 database: Rate limit all tokens from a range
The limiter scans ranges to decide whether or not to rate-limit the
query. However, when considering each range, only its front token
is accounted for. This looks like a misprint.
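A hedged sketch of the before/after accounting (hypothetical types; the actual limiter operates on dht::token ranges):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative stand-ins, not the real limiter code.
using token_t = uint64_t;
using range_t = std::vector<token_t>;

// Before the fix: only the front token of each range was accounted.
size_t tokens_accounted_before(const std::vector<range_t>& ranges) {
    size_t n = 0;
    for (const auto& r : ranges) {
        if (!r.empty()) {
            n += 1;            // front token only
        }
    }
    return n;
}

// After the fix: every token in every scanned range is accounted.
size_t tokens_accounted_after(const std::vector<range_t>& ranges) {
    size_t n = 0;
    for (const auto& r : ranges) {
        n += r.size();
    }
    return n;
}
```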

The limiter was introduced in cc9a2ad41f

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29050
2026-03-18 13:50:48 +01:00
Pavel Emelyanov
d68c92ec04 test: Replace a bunch of ternary operators with an if-else block
A followup of the merge of two test cases that happened in the previous
patch. Both used `foo = N if domain == bar else M` to evaluate the
parameters for topology. Using if-else block makes it immediately obvious
which topology and scope apply for each domain value without having to
evaluate multiple inline conditionals.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-18 13:08:36 +03:00
Pavel Emelyanov
b1d4fc5e6e test: Squash test_restore_primary_replica_same|different_domain tests
The two tests differ only in the way they set up the topology for the
cluster and the post-restore checks against the resulting streams.

The merge happens with the help of a "scope_is_same" boolean parameter
and corresponding updates in the topology setup and post-checks.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-18 13:08:36 +03:00
Pavel Emelyanov
21c603a79e test: Use the same regexp in test_restore_primary_replica_different|same_domain-s
The one in the "different domain" test is simpler because that test
performs fewer checks. The next patch will merge both tests, and making
the regexps identical makes the merge even smoother.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-18 13:07:09 +03:00
Emil Maskovsky
34f3916e7d .github: update test instructions for unified pytest runner
Update test running instructions to reflect unified pytest-based runner.

The test.py now requires full test paths with file extensions for both
C++ and Python tests.

No backport: The change is only relevant for recent test.py changes in
master.

Closes scylladb/scylladb#29062
2026-03-18 09:28:28 +01:00
Marcin Maliszkiewicz
04bf631d7f auth: switch query_attribute_for_all to use cache 2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
cf578fd81a auth: switch get_attribute to use cache 2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
06d16b6ea2 auth: cache: add heterogeneous map lookups
Some callers have only string_view role name,
they shouldn't need to allocate sstring to do the
lookup.
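The idiom is standard C++14 heterogeneous lookup via a transparent comparator; a sketch (the actual cache type is Scylla-specific):

```cpp
#include <map>
#include <string>
#include <string_view>

// With std::less<> (a transparent comparator), find() accepts a
// std::string_view directly, so callers holding only a view need not
// allocate an sstring/std::string temporary for the lookup.
// Hypothetical map, not the real auth cache.
std::map<std::string, int, std::less<>> roles = {
    {"cassandra", 1},
    {"alice", 2},
};

int lookup(std::string_view name) {
    auto it = roles.find(name);    // no std::string temporary created
    return it == roles.end() ? -1 : it->second;
}
```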
2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
7fdb1118f5 auth: switch query_all to use cache 2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
fca11c5a21 auth: switch query_all_directly_granted to use cache 2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
6f682f7eb1 auth: cache: add ability to go over all roles
This is needed to implement the auth service API where
we list all roles.
2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
61952cd985 raft: service: reload auth cache before service levels
Since service levels depend on auth data, and not the other
way around, we need to ensure a proper loading order.
2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
c4cfb278bc service: raft: move update_service_levels_effective_cache check
The auth::cache::includes_table function also covers role_members and
role_attributes. The existing check was removed because it blocked these
tables from triggering necessary cache updates.

While previously non-critical (due to unused attributes and table coupling),
maintaining a correct cache is essential for upcoming changes.
2026-03-18 09:06:20 +01:00
Benny Halevy
c2a6d1e930 test/boost/database_test: add test_snapshot_ctl_details_exception_handling
Verify that the directory listers opened by get_snapshot_details
are properly closed when handling an (injected) exception.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-18 09:37:44 +02:00
Benny Halevy
6dc4ea766b table: get_snapshot_details: fix indentation inside try block
Whitespace-only change: indent the loop body one level inside the
try block added in the previous commit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-18 09:28:50 +02:00
Benny Halevy
b09d45b89a table: per-snapshot get_snapshot_details: fix typo in comment
The comment says the snapshot directory may contain a `schema.sql` file,
but the code treats `schema.cql` as the special-case schema file.

Reported-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-18 09:27:40 +02:00
Benny Halevy
580cc309d2 table: per-snapshot get_snapshot_details: always close lister using try/catch
Since this is a coroutine, we cannot just use deferred_close;
instead we need to catch the error, close the lister, and then
return the error, if applicable.
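The shape of the pattern, with a stand-in lister type (the real code awaits an asynchronous close inside the coroutine):

```cpp
#include <exception>
#include <stdexcept>

// Illustrative stand-in: close() here represents co_await lister.close().
struct lister {
    bool closed = false;
    void close() { closed = true; }
};

// Catch the error, close the lister unconditionally, then rethrow.
// A deferred_close guard is not usable when close() must be awaited.
bool run(bool fail, lister& l) {
    std::exception_ptr ex;
    try {
        if (fail) {
            throw std::runtime_error("injected");
        }
    } catch (...) {
        ex = std::current_exception();
    }
    l.close();                       // always close, error or not
    if (ex) {
        std::rethrow_exception(ex);
    }
    return true;
}
```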

Fixes: SCYLLADB-1013

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-18 09:27:23 +02:00
Benny Halevy
78c817f71e table: get_snapshot_details: always close lister using deferred_close
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-18 09:26:26 +02:00
Dario Mirovic
71e6918f28 test: use NetworkTopologyStrategy in maintenance socket tests
NetworkTopologyStrategy is the preferred choice. We should not
use SimpleStrategy anymore. This patch changes the topology strategy
for all the maintenance socket tests.

Refs SCYLLADB-1070
2026-03-17 20:20:47 +01:00
Dario Mirovic
278535e4e3 test: use cleanup fixture in maintenance socket auth tests
Add a cql_clusters pytest fixture that tracks CQL driver Cluster
objects and shuts them down automatically after test completion.
This replaces manual shutdown() calls at the end of each test.

Also consolidate shutdown() calls in retry helpers into finally
blocks for consistent cleanup.

Refs SCYLLADB-1070
2026-03-17 20:15:30 +01:00
Dario Mirovic
2e4b72c6b9 auth: add maintenance_socket_authorizer
GRANT/REVOKE fails on the maintenance socket connections,
because maintenance_auth_service uses allow_all_authorizer.
allow_all_authorizer allows all operations, but not GRANT/REVOKE,
because they make no sense in its context.

This has been observed during PGO run failure in operations from
./pgo/conf/auth.cql file.

This patch introduces maintenance_socket_authorizer that supports
the capabilities of default_authorizer ('CassandraAuthorizer')
without needing authorization.

Refs SCYLLADB-1070
2026-03-17 19:19:41 +01:00
Botond Dénes
172c786079 Merge 'perf-alternator: wait for alternator port before running workload' from Marcin Maliszkiewicz
This patch is mostly for the purpose of running pgo CI job.

We may receive a connection error if the asyncio.sleep(5) in
pgo.py is not a sufficient waiting time.

In pgo.py we do wait for a port, but only for CQL; in any case,
it's better to have a high-level check here than to wait for the
alternator port there.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1071
Backport: 2026.1 - it failed on CI for that build

Closes scylladb/scylladb#29063

* github.com:scylladb/scylladb:
  perf: add abort_source support to wait-for-port loops
  perf-alternator: wait for alternator port before running workload
2026-03-17 18:38:11 +02:00
Botond Dénes
5d868dcc55 Merge 's3_client: fix s3::range max value for object size' from Ernest Zaslavsky
- fix s3::range max value for object size, which is 50TiB and not 5TiB.
- refactor constants to make them accessible to all interested parties; also reuse these constants in tests

No need to backport, doubt we will encounter an object larger than 5TiB

Closes scylladb/scylladb#28601

* github.com:scylladb/scylladb:
  s3_client: reorganize tests in part_size_calculation_test
  s3_client: switch using s3 limits constants in tests
  s3_client: fix the s3::range max object size
  s3_client: remove "aws" prefix from object limits constants
  s3_client: make s3 object limits accessible
2026-03-17 16:34:42 +02:00
Anna Stuchlik
f4a6bb1885 doc: remove the Open Source Example from Installation
This commit replaces the Open Source example in the Installation section for CentOS.
We updated the example for Ubuntu, but not for CentOS.
We don't want to have any Open Source information in the docs.

Fixes https://github.com/scylladb/scylladb/issues/29087
2026-03-17 14:54:32 +01:00
Anna Stuchlik
95bc8911dd doc: replace http with https in the installation instructions
Fixes https://github.com/scylladb/scylladb/issues/17227
2026-03-17 14:46:16 +01:00
Dawid Mędrek
a8dd13731f Merge 'Improve debuggability of test/cluster/test_data_resurrection_in_memtable.py' from Botond Dénes
This test was observed to fail in CI recently but there is not enough information in the logs to figure out what went wrong. This PR makes a few improvements to make the next investigation easier, should it be needed:
* storage-service: add table name to mutation write failure error messages.
* database: the `database_apply` error injection used to cause trouble, catching writes to bystander tables, making tests flaky. To eliminate this, it gained a filter to apply only to non-system keyspaces. Unfortunately, this still allows it to catch writes to the trace tables. While this should not fail the test, it reduces observability, as some traces disappear. Improve this error injection to apply only to the selected table. Also merge it with the `database_apply_wait` error injection, to streamline the code a bit.
* test/test_data_resurrection_in_memtable.py: dump data from the table before the checks for expected data, so that if the checks fail, the data in the table is known.

Refs: SCYLLADB-812
Refs: SCYLLADB-870
Fixes: SCYLLADB-1050 (by restricting `database_apply` error injection, so it doesn't affect writes to system traces)

Backport: test related improvement, no backport

Closes scylladb/scylladb#28899

* github.com:scylladb/scylladb:
  test/cluster/test_data_resurrection_in_memtable.py: dump rows before check
  replica/database: consolidate the two database_apply error injections
  service/storage_proxy: add name of table to error message for write errors
2026-03-17 13:35:19 +01:00
Botond Dénes
318aa07158 Merge ' test/alternator: use module-scope fixtures in test_streams.py ' from Nadav Har'El
Previously, all stream-table fixtures in test_streams.py used scope="function",
forcing a fresh table to be created for every test, slowing down the test a bit
(though not much), and discouraging writing small new tests.

 This was a workaround for a DynamoDB quirk (that Alternator doesn't have):
LATEST shard iterators have a time slack and may point slightly before  the true
stream head, causing leftover events from a previous test to appear in the next
test's reads.

The first two tests in this series fix small problems that turn up once we start
sharing test tables in test_streams.py. The final patch fixes the "LATEST" problem
and enables sharing the test table by using "module" scope fixtures instead of
"function".

After this series, test_streams.py run time went down a bit, from 20.2 seconds to 17.7 seconds.

Closes scylladb/scylladb#28972

* github.com:scylladb/scylladb:
  test/alternator: speed up test_streams.py by using module-scope fixtures
  test/alternator: test_streams.py don't use fixtures in 4 tests
  test/alternator: fix do_test() in test_streams.py
2026-03-17 13:56:16 +02:00
Ernest Zaslavsky
7f597aca67 cmake: fix broken build
Add raft_util.idl.hh to cmake to generate the code properly

Closes scylladb/scylladb#29055
2026-03-17 10:35:34 +01:00
Botond Dénes
dbe70cddca test/boost/querier_cache_test: make test_time_based_cache_eviction less sensitive to timing
This test relies on the cache entry being evicted after 200ms past the
TTL. This may not happen on a busy CI machine. Make the test less
reliant on timing by using eventually_true().
Simplify the test by dropping the second entry; it doesn't add anything
to the test.
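A minimal eventually_true() in the spirit described (the real helper lives in Scylla's test framework; this sketch polls with std::thread sleeps):

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Poll the condition until it holds or the deadline passes, instead of
// asserting after one fixed sleep. This makes timing-sensitive tests
// robust on busy machines.
bool eventually_true(const std::function<bool()>& cond,
                     std::chrono::milliseconds timeout = std::chrono::milliseconds(1000),
                     std::chrono::milliseconds step = std::chrono::milliseconds(10)) {
    auto deadline = std::chrono::steady_clock::now() + timeout;
    while (std::chrono::steady_clock::now() < deadline) {
        if (cond()) {
            return true;
        }
        std::this_thread::sleep_for(step);
    }
    return cond();    // one last check at the deadline
}
```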

Fixes: SCYLLADB-811

Closes scylladb/scylladb#28958
2026-03-17 10:32:23 +01:00
Botond Dénes
0fd51c4adb test/nodetool: rest_api_mock_server: add retry for status code 404
This fixture starts the mock server and immediately connects to it to
set up the expected requests. The connection attempt might be too early,
so there is a retry loop with a timeout. The loop currently checks for
requests.exceptions.ConnectionError. We've seen a case where the
connection is successful but the request fails with 404: the mock
started the server but didn't set up the routes yet. Add a retry for
HTTP 404 to handle this.

Fixes: SCYLLADB-966

Closes scylladb/scylladb#29003
2026-03-17 10:30:23 +01:00
Pavel Emelyanov
9fe19ec9d9 sstables: Fix object storage lister not resetting position in batch vector

The lister loop in get() pre-fetches records in batches and keeps them
in an _info vector, iterating over it with the help of the _pos cursor.
When the vector is re-read, the cursor must be reset too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-17 10:32:42 +03:00
Pavel Emelyanov
1a6a7647c6 sstables: Fix object storage lister skipping entries when filter is active
The lister loop in the get() method looks odd. It uses a do-while(false)
loop and calls continue; inside it when the filter asks to skip an entry.
Skipping thus aborts the whole thing and EOF-s, which is not what's
supposed to happen.
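The pitfall in miniature (a stand-in lister, not the actual code): `continue` inside do { } while (false) jumps to the loop condition, which is false, so the loop exits and falls through to the EOF path:

```cpp
#include <optional>
#include <vector>

// get() is meant to return the next entry passing the filter
// (non-negative values here), or nullopt at EOF.
struct lister {
    std::vector<int> entries;
    std::size_t pos = 0;

    // Buggy: `continue` was meant as "try the next entry", but it jumps
    // to while(false), ending the loop, so every entry after a filtered
    // one is silently dropped as EOF.
    std::optional<int> get_buggy() {
        do {
            if (pos == entries.size()) break;
            int e = entries[pos++];
            if (e < 0) continue;      // exits the do-while(false)!
            return e;
        } while (false);
        return std::nullopt;          // EOF
    }

    // Fixed: a real loop, so skipping advances to the next entry.
    std::optional<int> get_fixed() {
        while (pos != entries.size()) {
            int e = entries[pos++];
            if (e < 0) continue;      // skip filtered entry, keep scanning
            return e;
        }
        return std::nullopt;
    }
};
```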

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-17 10:32:40 +03:00
Marcin Maliszkiewicz
9318c80203 perf: add abort_source support to wait-for-port loops
Check abort_source on each retry iteration in
wait_for_alternator and wait_for_cql so the
wait can be interrupted on shutdown.

Didn't use sleep_abortable as the sleep is very short
anyway.
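The shape of the change (hypothetical wait loop; the real code uses seastar::abort_source rather than a raw atomic flag):

```cpp
#include <atomic>

// Check the abort flag on every retry iteration so a shutdown request
// interrupts the wait instead of letting it spin through all retries.
bool wait_for_port(std::atomic<bool>& aborted, int max_retries,
                   bool (*try_connect)()) {
    for (int i = 0; i < max_retries; ++i) {
        if (aborted.load()) {
            return false;            // interrupted by shutdown
        }
        if (try_connect()) {
            return true;             // port is up
        }
        // short sleep between retries elided
    }
    return false;                    // retries exhausted
}
```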
2026-03-16 16:14:10 +01:00
Marcin Maliszkiewicz
edf0148bee perf-alternator: wait for alternator port before running workload
This patch is mostly for the purpose of running pgo CI job.

We may receive a connection error if the asyncio.sleep(5) in
pgo.py is not a sufficient waiting time.

In pgo.py we do wait for a port, but only for CQL; in any case,
it's better to have a high-level check here than to wait for the
alternator port there.
2026-03-16 16:07:52 +01:00
Raphael S. Carvalho
ee87b66033 replica: Demote log level on split failure during shutdown
Dtest failed with:

table - Failed to load SSTable .../me-3gyn_0qwi_313gw2n2y90v2j4fcv-big-Data.db
of origin memtable due to std::runtime_error (Cannot split
.../me-3gyn_0qwi_313gw2n2y90v2j4fcv-big-Data.db because manager has compaction
disabled, reason might be out of space prevention), it will be unlinked...

The reason is that the error above is being triggered when the cause is
shutdown, not out of space prevention. Let's distinguish between the two
cases and log the error with warning level on shutdown.

Fixes https://github.com/scylladb/scylladb/issues/24850.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2026-03-16 12:03:17 -03:00
Raphael S. Carvalho
b508f3dd38 service: Demote log level on split failure during shutdown
Since commit 509f2af8db, gate_closed_exception can be triggered
for ongoing split during shutdown. The commit is correct, but it
causes split failure on shutdown to log an error, which causes
CI instability. Previously, aborted_exception would be triggered
instead which is logged as warning. Let's do the same.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-951.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2026-03-16 11:52:00 -03:00
Karol Nowacki
7659a5b878 vector_search: test: fix flaky test
The test assumes that the sleep duration will be at least the value of
the sleep parameter. However, the actual sleep time can be slightly less
than requested (e.g., a 100ms sleep request might result in a 99ms
sleep).

This commit adjusts the test's time comparison to be more lenient,
preventing test flakiness.
2026-03-13 16:28:22 +01:00
Karol Nowacki
5474cc6cc2 vector_search: fix race condition on connection timeout
When a `with_connect` operation timed out, the underlying connection
attempt continued to run in the reactor. This could lead to a crash
if the connection was established/rejected after the client object had
already been destroyed. This issue was observed during the teardown
phase of an upcoming high-availability test case.

This commit fixes the race condition by ensuring the connection attempt
is properly canceled on timeout.

Additionally, the explicit TLS handshake previously forced during the
connection is now deferred to the first I/O operation, which is the
default and preferred behavior.

Fixes: SCYLLADB-832
2026-03-13 16:28:22 +01:00
Andrzej Jackowski
60aaea8547 cql: improve write consistency level guardrail messages
Update warn and fail messages for the write_consistency_levels_warned
and write_consistency_levels_disallowed guardrails to include the
configuration option name and actionable guidance. The main motivation
is to make the messages follow the conventions of other guardrails.

Refs: SCYLLADB-257
2026-03-13 14:40:45 +01:00
Tomasz Grabiec
1256a9faa7 tablets: Fix deadlock in background storage group merge fiber
When it deadlocks, groups stop merging and compaction group merge
backlog will run-away.

Also, graceful shutdown will be blocked on it.

Found by flaky unit test
test_merge_chooses_best_replica_with_odd_count, which timed-out in 1
in 100 runs.

Reason for deadlock:

When storage groups are merged, the main compaction group of the new
storage group takes a compaction lock, which is appended to
_compaction_reenablers_for_merging, and released when the merge
completion fiber is done with the whole batch.

If we accumulate more than 1 merge cycle for the fiber, deadlock
occurs. Lock order will be this

Initial state:

 cg0: main
 cg1: main
 cg2: main
 cg3: main

After 1st merge:

 cg0': main [locked], merging_groups=[cg0.main, cg1.main]
 cg1': main [locked], merging_groups=[cg2.main, cg3.main]

After 2nd merge:

 cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main]

merge completion fiber will try to stop cg0'.main, which will be
blocked on compaction lock. which is held by the reenabler in
_compaction_reenablers_for_merging, hence deadlock.

The fix is to wait for background merge to finish before we start the
next merge. It's achieved by holding old erm in the background merge,
and doing a topology barrier from the merge finalizing transition.

Background merge is supposed to be a relatively quick operation: it
stops compaction groups, so it may wait for active requests, but it
shouldn't prolong the barrier indefinitely.

Tablet boost unit tests which trigger merge need to be adjusted to
call the barrier, otherwise they will be vulnerable to the deadlock.

Two cluster tests were removed because they assumed that merge happens
in the background. Now that it happens as part of merge finalization,
and blocks topology state machine, those tests deadlock because they
are unable to make topology changes (node bootstrap) while background
merge is blocked.

The test "test_tablets_merge_waits_for_lwt" needed to be adjusted. It
assumed that merge finalization doesn't wait for the erm held by the
LWT operation, and triggered tablet movement afterwards, and assumed
that this migration will issue a barrier which will block on the LWT
operation. After this commit, it's the barrier in merge finalization
which is blocked. The test was adjusted to use an earlier log mark
when waiting for "Got raft_topology_cmd::barrier_and_drain", which
will catch the barrier in merge finalization.

Fixes SCYLLADB-928
2026-03-12 22:45:01 +01:00
Tomasz Grabiec
7706c9e8c4 replica: table: Propagate old erm to storage group merge 2026-03-12 22:45:01 +01:00
Tomasz Grabiec
582a4abeb6 test: boost: tablets_test: Save tablet metadata when ACKing split resize decision
Needs to be ordered before split finalization, because storage_group
must be in split mode already at finalization time. There must be
split-ready compaction groups, otherwise finalization fails with this
error:

  Found 0 split ready compaction groups, but expected 2 instead.

Exposed by increased split activity in tests.
2026-03-12 22:45:01 +01:00
Tomasz Grabiec
279fcdd5ff storage_service: Extract local_topology_barrier()
Will be called in tests. It does the local part of the global topology
barrier.

The comment:

        // We capture the topology version right after the checks
        // above, before any yields. This is crucial since _topology_state_machine._topology
        // might be altered concurrently while this method is running,
        // which can cause the fence command to apply an invalid fence version.

was dropped, because it's no longer true after
fad6c41cee, and it doesn't make sense in
the context of local_topology_barrier(). We'd have to propagate the
version to local_topology_barrier(), but it's pointless. The fence
version is decided before calling the local barrier, and it will be
valid even if local version moves ahead.
2026-03-12 22:44:56 +01:00
Nadav Har'El
92ee959e9b test/alternator: speed up test_streams.py by using module-scope fixtures
Previously, all stream-table fixtures in this test file used
scope="function", forcing a fresh table to be created for every test,
slowing down the test a bit (though not much), and discouraging writing
small new tests.

This was a workaround for a DynamoDB quirk (that Alternator doesn't have):
LATEST shard iterators have a time slack and may point slightly before
the true stream head, causing leftover events from a previous test to
appear in the next test's reads.

We fix this by draining the stream inside latest_iterators() and
shards_and_latest_iterators() after obtaining the LATEST iterators:
fetch records in a loop until two consecutive polling rounds both return
empty, guaranteeing the iterators are positioned past all pre-existing
events before the caller writes anything.  With this guarantee in place,
all stream-table fixtures can safely use scope="module".
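The drain-until-quiet loop described above can be sketched roughly like this (a sketch against the boto3 DynamoDB Streams API; the helper name and the `max_rounds` safety cap are assumptions, not the actual fixture code):

```python
def drain(client, iterator, max_rounds=100):
    """Advance a LATEST shard iterator past any pre-existing events.

    Polls the shard repeatedly and only stops once two consecutive
    rounds both come back empty, so leftover events written by a
    previous test cannot leak into the next test's reads.
    """
    empty_rounds = 0
    for _ in range(max_rounds):
        resp = client.get_records(ShardIterator=iterator)
        iterator = resp['NextShardIterator']
        if resp['Records']:
            empty_rounds = 0          # saw leftovers; keep draining
        else:
            empty_rounds += 1
            if empty_rounds == 2:     # two quiet rounds in a row
                break
    return iterator
```

Requiring two consecutive empty rounds (rather than one) guards against the time slack mentioned above, where a single empty poll can land just before delayed events become visible.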

After this patch, test_streams.py continues to pass on DynamoDB.
On Alternator, the test file's run time went down a bit, from
20.2 seconds to 17.7 seconds.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-10 17:14:04 +02:00
Nadav Har'El
6ac1f1333f test/alternator: test_streams.py don't use fixtures in 4 tests
In the next patch, we plan to make the fixtures in test_streams.py
shared between tests. Most tests work well with shared tables, but two
(test_streams_trim_horizon and test_streams_starting_sequence_number)
were written to expect a new table with an empty history, and two
others (test_streams_closed_read and test_streams_disabled_stream) want
to disable streaming and would break a shared table.

So in this patch we modify these four tests to create their own new
table instead of using a fixture.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-10 17:12:33 +02:00
Nadav Har'El
16e7a88a02 test/alternator: fix do_test() in test_streams.py
Many tests in test/alternator/test_streams.py use a do_test() function
which performs a user-defined function that runs some write requests,
and then verifies that the expected output appears on the stream.

Because DynamoDB drops do-nothing changes from the stream - such as
writing to an item a value that it already has - these tests need to
write to a different item each time, so do_test() invents a random key
and passes it to the user-defined function to use. But... we had a bug:
the random number generation was done only once, instead of every time.
The fix is to do the random number generation on every call.
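The once-vs-every-call distinction can be illustrated like this (whether the original bug came from a default argument or a cached module-level value is an assumption; the pattern is the same either way):

```python
import random

# Buggy pattern: the default argument is evaluated once, at function
# definition time, so every call to do_test() reuses the same key.
def do_test_buggy(test_func, p=random.randint(1, 1000000000)):
    return test_func(p)

# Fixed pattern: generate a fresh random key on every call, so repeated
# tests against a shared table always write to a different item.
def do_test_fixed(test_func, p=None):
    if p is None:
        p = random.randint(1, 1000000000)
    return test_func(p)
```

With a brand-new table per test the stale key was harmless; with a shared table, a repeated key turns the second write into a do-nothing change that DynamoDB drops from the stream.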

We never noticed this bug when each test used a brand new table. But the
next patch will make the tests share the test table, and tests start
to fail. It's especially visible if you run the same test twice against
DynamoDB, e.g.,

test/alternator/run --count 2 --aws \
    test_streams.py::test_streams_putitem_keys_only

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-09 19:21:53 +02:00
Łukasz Paszkowski
147b355326 replica/table: avoid computing token range side in storage_group_of() on hot path
`storage_group_of()` is on the replica-side token lookup hot path but
used `tablet_map::get_tablet_id_and_range_side()`, which computes both
tablet id and post-split range side.

Most callers only need the storage group id. Switch `storage_group_of()`
to use `get_tablet_id()` via `tablet_id_for_token()`, and select the
compaction group via new overloads that compute the range side only
when splitting mode is active.
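The lazy pattern can be modeled in a few lines (the bit layout and all names here are purely illustrative assumptions, not the actual `tablet_map` code):

```python
def tablet_id_for_token(token, log2_tablets):
    # Cheap lookup: the tablet id is just the top bits of the 64-bit token.
    return token >> (64 - log2_tablets)

def range_side_for_token(token, log2_tablets):
    # One extra bit decides which post-split half the token falls in.
    return (token >> (64 - log2_tablets - 1)) & 1

def select_compaction_group(token, log2_tablets, splitting):
    tid = tablet_id_for_token(token, log2_tablets)
    if not splitting:
        return (tid, None)  # hot path: the side is never computed
    return (tid, range_side_for_token(token, log2_tablets))
```

The point of the commit is exactly this shape: the common, non-splitting path pays only for the tablet-id shift and skips the range-side computation entirely.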
2026-03-09 17:59:36 +01:00
Łukasz Paszkowski
419e9aa323 replica/compaction_group: add lazy select_compaction_group() overloads
Change `storage_group::select_compaction_group()` to accept a token
(and tablet_map) and compute the tablet range side only when
splitting_mode() is active.

Add an overload for selecting the compaction group for an sstable
spanning a token range.
2026-03-09 17:59:36 +01:00
Łukasz Paszkowski
3f70611504 locator/tablets: add tablet_map::get_tablet_range_side()
Add `tablet_map::get_tablet_range_side(token)` to compute the
post-split range side without computing the tablet id.

Pure addition, no behavior change.
2026-03-09 17:59:36 +01:00
Jakub Smolar
7cdd979158 db/config: announce ms format as highest supported
Uncomment the feature flag check in get_highest_supported_format()
to return MS format when supported, otherwise fall back to ME.
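In Python-flavoured pseudocode, the selection rule amounts to the following (the format names come from the commit message; the function shape is an assumption):

```python
def get_highest_supported_format(ms_feature_enabled):
    # Announce the trie-indexed 'ms' format only once the cluster-wide
    # feature flag says every node supports it; otherwise fall back
    # to the older 'me' format.
    return 'ms' if ms_feature_enabled else 'me'
```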
2026-03-09 17:12:09 +01:00
Michał Chojnowski
949fc85217 db/config: enable ms sstable format by default
Trie-based sstable indexes are supposed to be (hopefully)
a better default than the old BIG indexes.
Make them the new default.

If we change our mind, this change can be reverted later.
2026-03-09 17:12:09 +01:00
Michał Chojnowski
6b413e3959 cluster/dtest/bypass_cache_test: switch from highest_supported_sstable_format to chosen_sstable_format
Trie-based indexes and older indexes have a difference in metrics,
and the test uses the metrics to check for bypass cache.
To choose the right metrics, it uses highest_supported_sstable_format,
which is inappropriate, because the sstable format chosen for writes
by Scylla might be different than highest_supported_sstable_format.

Use chosen_sstable_format instead.
2026-03-09 17:12:09 +01:00
Michał Chojnowski
b89840c4b9 api/system: add /system/chosen_sstable_version
Returns the sstable version currently chosen for new sstables.

We are adding it because some tests want to know what format they are
writing (tests using upgradesstable, tests which check stats that only
apply to one of the index types, etc).

(Currently they are using `highest_supported_sstable_format` for this
purpose, which is inappropriate, and will become invalid if a non-latest
format is the default).
2026-03-09 17:12:09 +01:00
Michał Chojnowski
9280a039ee test/cluster/dtest: reduce num_tokens to 16
cluster.dtest_alternator_tests.test_slow_query_logging performs
a bootstrap with 768 token ranges.

It works with `me` sstables, which have 2 open file descriptors
per open sstable, but with `ms` sstables, which have 3 open
file descriptors per open sstable, it fails with EMFILE.

To avoid this problem, let's just decrease the number of vnodes
in the test suite. It's appropriate anyway, because it avoids some
unneeded work without weakening the tests.
(Note: pylib-based tests have been setting `num_tokens` to 16 for a long time too).

This breaks `bypass_cache_test`, which is written in a way that expects
a certain number of token ranges. We adjust the relevant parameter
accordingly.
2026-03-09 17:12:09 +01:00
Botond Dénes
cd13a911cc test/cluster/test_data_resurrection_in_memtable.py: dump rows before check
So that if the check of expected rows fails, we have a dump to look at
and see what is different.
2026-03-05 11:44:02 +02:00
Botond Dénes
f375aae257 replica/database: consolidate the two database_apply error injections
Into a single database_apply one. Add three parameters:
* ks_name and cf_name to filter the tables to be affected
* what - what to do: throw or wait

This leads to smaller footprint in the code and improved filtering for
table names at the cost of some extra error injection params in the
tests.
2026-03-05 11:44:02 +02:00
Botond Dénes
44b8cad3df service/storage_proxy: add name of table to error message for write errors
It is useful to know what table the failed write belongs to.
2026-03-05 10:51:12 +02:00
Ernest Zaslavsky
afac984632 s3_client: reorganize tests in part_size_calculation_test
just group all BOOST_REQUIRE_EXCEPTION tests in one block and
remove artificial scopes
2026-02-18 12:12:04 +02:00
Ernest Zaslavsky
1a20877afe s3_client: switch using s3 limits constants in tests
instead of using magic numbers, switch to using the s3 limit constants
to make it clearer what is tested and why
2026-02-18 12:12:04 +02:00
Ernest Zaslavsky
d763bdabc2 s3_client: fix the s3::range max object size
in the s3::Range class, start using the global s3 constant, for two reasons:
1) uniformity, no need to introduce a semantically identical constant in each class
2) the value was wrong
2026-02-18 12:12:04 +02:00
Ernest Zaslavsky
24e70b30c8 s3_client: remove "aws" prefix from object limits constants
remove the "aws" prefix from the object limits constants, since it is
unnecessary when sitting under the s3 namespace
2026-02-18 12:12:04 +02:00
Ernest Zaslavsky
329c156600 s3_client: make s3 object limits accessible
make the s3 limits constants publicly accessible to reuse them later
2026-02-18 12:12:04 +02:00
349 changed files with 24158 additions and 8151 deletions


@@ -55,22 +55,26 @@ ninja build/<mode>/test/boost/<test_name>
ninja build/<mode>/scylla
# Run all tests in a file
./test.py --mode=<mode> <test_path>
./test.py --mode=<mode> test/<suite>/<test_name>.py
# Run a single test case from a file
./test.py --mode=<mode> <test_path>::<test_function_name>
./test.py --mode=<mode> test/<suite>/<test_name>.py::<test_function_name>
# Run all tests in a directory
./test.py --mode=<mode> test/<suite>/
# Examples
./test.py --mode=dev alternator/
./test.py --mode=dev cluster/test_raft_voters::test_raft_limited_voters_retain_coordinator
./test.py --mode=dev test/alternator/
./test.py --mode=dev test/cluster/test_raft_voters.py::test_raft_limited_voters_retain_coordinator
./test.py --mode=dev test/cqlpy/test_json.py
# Optional flags
./test.py --mode=dev cluster/test_raft_no_quorum -v # Verbose output
./test.py --mode=dev cluster/test_raft_no_quorum --repeat 5 # Repeat test 5 times
./test.py --mode=dev test/cluster/test_raft_no_quorum.py -v # Verbose output
./test.py --mode=dev test/cluster/test_raft_no_quorum.py --repeat 5 # Repeat test 5 times
```
**Important:**
- Use path without `.py` extension (e.g., `cluster/test_raft_no_quorum`, not `cluster/test_raft_no_quorum.py`)
- Use full path with `.py` extension (e.g., `test/cluster/test_raft_no_quorum.py`, not `cluster/test_raft_no_quorum`)
- To run a single test case, append `::<test_function_name>` to the file path
- Add `-v` for verbose output
- Add `--repeat <num>` to repeat a test multiple times


@@ -8,6 +8,9 @@ on:
jobs:
check-fixes-prefix:
runs-on: ubuntu-latest
permissions:
contents: read
issues: write
steps:
- name: Check PR body for "Fixes" prefix patterns
uses: actions/github-script@v7


@@ -7,6 +7,11 @@ on:
- synchronize
- reopened
permissions:
contents: read
pull-requests: write
statuses: write
jobs:
validate_pr_author_email:
uses: scylladb/github-automation/.github/workflows/validate_pr_author_email.yml@main


@@ -1,4 +1,6 @@
name: Trigger Scylla CI Route
permissions:
contents: read
on:
issue_comment:


@@ -1,5 +1,8 @@
name: Trigger next gating
permissions:
contents: read
on:
push:
branches:


@@ -2,6 +2,12 @@ cmake_minimum_required(VERSION 3.27)
project(scylla)
# Disable CMake's automatic -fcolor-diagnostics injection (CMake 3.24+ adds
# it for Clang+Ninja). configure.py does not add any color diagnostics flags,
# so we clear the internal CMake variable to prevent injection.
set(CMAKE_CXX_COMPILE_OPTIONS_COLOR_DIAGNOSTICS "")
set(CMAKE_C_COMPILE_OPTIONS_COLOR_DIAGNOSTICS "")
list(APPEND CMAKE_MODULE_PATH
${CMAKE_CURRENT_SOURCE_DIR}/cmake
${CMAKE_CURRENT_SOURCE_DIR}/seastar/cmake)
@@ -51,6 +57,16 @@ set(CMAKE_CXX_EXTENSIONS ON CACHE INTERNAL "")
set(CMAKE_CXX_SCAN_FOR_MODULES OFF CACHE INTERNAL "")
set(CMAKE_VISIBILITY_INLINES_HIDDEN ON)
# Global defines matching configure.py
# Since gcc 13, libgcc doesn't need the exception workaround
add_compile_definitions(SEASTAR_NO_EXCEPTION_HACK)
# Hacks needed to expose internal APIs for xxhash dependencies
add_compile_definitions(XXH_PRIVATE_API)
# SEASTAR_TESTING_MAIN is added later (after add_subdirectory(seastar) and
# add_subdirectory(abseil)) to avoid leaking into the seastar subdirectory.
# If SEASTAR_TESTING_MAIN is defined globally before seastar, it causes a
# duplicate 'main' symbol in seastar_testing.
if(is_multi_config)
find_package(Seastar)
# this is atypical compared to standard ExternalProject usage:
@@ -98,10 +114,31 @@ else()
set(Seastar_IO_URING ON CACHE BOOL "" FORCE)
set(Seastar_SCHEDULING_GROUPS_COUNT 21 CACHE STRING "" FORCE)
set(Seastar_UNUSED_RESULT_ERROR ON CACHE BOOL "" FORCE)
# Match configure.py's build_seastar_shared_libs: Debug and Dev
# build Seastar as a shared library, others build it static.
if(CMAKE_BUILD_TYPE STREQUAL "Debug" OR CMAKE_BUILD_TYPE STREQUAL "Dev")
set(BUILD_SHARED_LIBS ON CACHE BOOL "" FORCE)
else()
set(BUILD_SHARED_LIBS OFF CACHE BOOL "" FORCE)
endif()
add_subdirectory(seastar)
target_compile_definitions (seastar
PRIVATE
SEASTAR_NO_EXCEPTION_HACK)
# Coverage mode sets cmake_build_type='Debug' for Seastar
# (configure.py:515), so Seastar's pkg-config output includes sanitizer
# link flags in seastar_libs_coverage (configure.py:2514,2649).
# Seastar's own CMake only activates sanitizer targets for Debug/Sanitize
# configs, so we inject link options on the seastar target for Coverage.
# Using PUBLIC ensures they propagate to all targets linking Seastar
# (but not standalone tools like patchelf), matching configure.py's
# behavior. Compile-time flags and defines are handled globally in
# cmake/mode.Coverage.cmake.
if(CMAKE_BUILD_TYPE STREQUAL "Coverage")
target_link_options(seastar
PUBLIC
-fsanitize=address
-fsanitize=undefined
-fsanitize=vptr)
endif()
endif()
set(ABSL_PROPAGATE_CXX_STD ON CACHE BOOL "" FORCE)
@@ -111,8 +148,10 @@ if(Scylla_ENABLE_LTO)
endif()
find_package(Sanitizers QUIET)
# Match configure.py:2192 — abseil gets sanitizer flags with -fno-sanitize=vptr
# to exclude vptr checks which are incompatible with abseil's usage.
list(APPEND absl_cxx_flags
$<$<CONFIG:Debug,Sanitize>:$<TARGET_PROPERTY:Sanitizers::address,INTERFACE_COMPILE_OPTIONS>;$<TARGET_PROPERTY:Sanitizers::undefined_behavior,INTERFACE_COMPILE_OPTIONS>>)
$<$<CONFIG:Debug,Sanitize>:$<TARGET_PROPERTY:Sanitizers::address,INTERFACE_COMPILE_OPTIONS>;$<TARGET_PROPERTY:Sanitizers::undefined_behavior,INTERFACE_COMPILE_OPTIONS>;-fno-sanitize=vptr>)
if(CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
list(APPEND ABSL_GCC_FLAGS ${absl_cxx_flags})
elseif(CMAKE_CXX_COMPILER_ID STREQUAL "Clang")
@@ -137,9 +176,38 @@ add_library(absl::headers ALIAS absl-headers)
# unfortunately.
set_target_properties(absl_strerror PROPERTIES EXCLUDE_FROM_ALL TRUE)
# Now that seastar and abseil subdirectories are fully processed, add
# SEASTAR_TESTING_MAIN globally. This matches configure.py's global define
# without leaking into seastar (which would cause duplicate main symbols).
add_compile_definitions(SEASTAR_TESTING_MAIN)
# System libraries dependencies
find_package(Boost REQUIRED
COMPONENTS filesystem program_options system thread regex unit_test_framework)
# When using shared Boost libraries, define BOOST_ALL_DYN_LINK (matching configure.py)
if(NOT Boost_USE_STATIC_LIBS)
add_compile_definitions(BOOST_ALL_DYN_LINK)
endif()
# CMake's Boost package config adds per-component defines like
# BOOST_UNIT_TEST_FRAMEWORK_DYN_LINK, BOOST_REGEX_DYN_LINK, etc. on the
# imported targets. configure.py only uses BOOST_ALL_DYN_LINK (which covers
# all components), so strip the per-component defines to align the two build
# systems.
foreach(_boost_target
Boost::unit_test_framework
Boost::regex
Boost::filesystem
Boost::program_options
Boost::system
Boost::thread)
if(TARGET ${_boost_target})
# Completely remove all INTERFACE_COMPILE_DEFINITIONS from the Boost target.
# This prevents per-component *_DYN_LINK and *_NO_LIB defines from
# propagating. BOOST_ALL_DYN_LINK (set globally) covers all components.
set_property(TARGET ${_boost_target} PROPERTY INTERFACE_COMPILE_DEFINITIONS)
endif()
endforeach()
target_link_libraries(Boost::regex
INTERFACE
ICU::i18n
@@ -196,6 +264,10 @@ if (Scylla_USE_PRECOMPILED_HEADER)
message(STATUS "Using precompiled header for Scylla - remember to add `sloppiness = pch_defines,time_macros` to ccache.conf, if you're using ccache.")
target_precompile_headers(scylla-precompiled-header PRIVATE "stdafx.hh")
target_compile_definitions(scylla-precompiled-header PRIVATE SCYLLA_USE_PRECOMPILED_HEADER)
# Match configure.py: -fpch-validate-input-files-content tells the compiler
# to check content of stdafx.hh if timestamps don't match (important for
# ccache/git workflows where timestamps may not be preserved).
add_compile_options(-fpch-validate-input-files-content)
endif()
else()
set(Scylla_USE_PRECOMPILED_HEADER_USE OFF)


Submodule abseil updated: d7aaad83b4...255c84dadd


@@ -699,6 +699,17 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
// for such a size.
co_return api_error::payload_too_large(fmt::format("Request content length limit of {} bytes exceeded", request_content_length_limit));
}
// Check the concurrency limit early, before acquiring memory and
// reading the request body, to avoid piling up memory from excess
// requests that will be rejected anyway. This mirrors the CQL
// transport which also checks concurrency before memory acquisition
// (transport/server.cc).
if (_pending_requests.get_count() >= _max_concurrent_requests) {
_executor._stats.requests_shed++;
co_return api_error::request_limit_exceeded(format("too many in-flight requests (configured via max_concurrent_requests_per_shard): {}", _pending_requests.get_count()));
}
_pending_requests.enter();
auto leave = defer([this] () noexcept { _pending_requests.leave(); });
// JSON parsing can allocate up to roughly 2x the size of the raw
// document, + a couple of bytes for maintenance.
// If the Content-Length of the request is not available, we assume
@@ -760,12 +771,6 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
_executor._stats.unsupported_operations++;
co_return api_error::unknown_operation(fmt::format("Unsupported operation {}", op));
}
if (_pending_requests.get_count() >= _max_concurrent_requests) {
_executor._stats.requests_shed++;
co_return api_error::request_limit_exceeded(format("too many in-flight requests (configured via max_concurrent_requests_per_shard): {}", _pending_requests.get_count()));
}
_pending_requests.enter();
auto leave = defer([this] () noexcept { _pending_requests.leave(); });
executor::client_state client_state(service::client_state::external_tag(),
_auth_service, &_sl_controller, _timeout_config.current_values(), req->get_client_address());
if (!username.empty()) {
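The hunk above moves the admission check ahead of memory acquisition and body parsing; the shape of that ordering, as a hedged Python sketch (the counter and stat names are assumptions, not the server's actual members):

```python
class Admission:
    """Reject excess requests before buying memory for them, mirroring
    the check-concurrency-first ordering introduced in the diff above."""
    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.pending = 0
        self.shed = 0

    def handle(self, read_body, process):
        # Check the in-flight count *before* reading/allocating the body,
        # so requests that will be shed never pile up memory.
        if self.pending >= self.max_concurrent:
            self.shed += 1
            return 'request_limit_exceeded'
        self.pending += 1
        try:
            return process(read_body())
        finally:
            self.pending -= 1
```

Checking after the body is read (the old order) admits the allocation cost of every shed request; checking first bounds memory by the concurrency limit.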


@@ -33,7 +33,7 @@
#include "data_dictionary/data_dictionary.hh"
#include "utils/rjson.hh"
static logging::logger elogger("alternator-streams");
static logging::logger slogger("alternator-streams");
/**
* Base template type to implement rapidjson::internal::TypeHelper<...>:s
@@ -437,7 +437,7 @@ const cdc::stream_id& find_parent_shard_in_previous_generation(db_clock::time_po
if (prev_streams.empty()) {
// something is really wrong - streams are empty
// let's try internal_error in hope it will be notified and fixed
on_internal_error(elogger, fmt::format("streams are empty for cdc generation at {} ({})", prev_timestamp, prev_timestamp.time_since_epoch().count()));
on_internal_error(slogger, fmt::format("streams are empty for cdc generation at {} ({})", prev_timestamp, prev_timestamp.time_since_epoch().count()));
}
auto it = std::lower_bound(prev_streams.begin(), prev_streams.end(), child.token(), [](const cdc::stream_id& id, const dht::token& t) {
return id.token() < t;
@@ -787,16 +787,18 @@ future<executor::request_return_type> executor::get_shard_iterator(client_state&
struct event_id {
cdc::stream_id stream;
utils::UUID timestamp;
size_t index = 0;
static constexpr auto marker = 'E';
event_id(cdc::stream_id s, utils::UUID ts)
event_id(cdc::stream_id s, utils::UUID ts, size_t index)
: stream(s)
, timestamp(ts)
, index(index)
{}
friend std::ostream& operator<<(std::ostream& os, const event_id& id) {
fmt::print(os, "{}{}:{}", marker, id.stream.to_bytes(), id.timestamp);
fmt::print(os, "{}{}:{}:{}", marker, id.stream.to_bytes(), id.timestamp, id.index);
return os;
}
};
@@ -808,7 +810,19 @@ struct rapidjson::internal::TypeHelper<ValueType, alternator::event_id>
{};
namespace alternator {
namespace {
struct managed_bytes_ptr_hash {
size_t operator()(const managed_bytes *k) const noexcept {
return std::hash<managed_bytes>{}(*k);
}
};
struct managed_bytes_ptr_equal {
bool operator()(const managed_bytes *a, const managed_bytes *b) const noexcept {
return *a == *b;
}
};
}
future<executor::request_return_type> executor::get_records(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request) {
_stats.api_operations.get_records++;
auto start_time = std::chrono::steady_clock::now();
@@ -879,6 +893,12 @@ future<executor::request_return_type> executor::get_records(client_state& client
auto pks = schema->partition_key_columns();
auto cks = schema->clustering_key_columns();
auto base_cks = base->clustering_key_columns();
if (base_cks.size() > 1) {
throw api_error::internal(fmt::format("invalid alternator table, clustering key count ({}) is bigger than one", base_cks.size()));
}
const bytes *clustering_key_column_name = !base_cks.empty() ? &base_cks.front().name() : nullptr;
std::transform(pks.begin(), pks.end(), std::back_inserter(columns), [](auto& c) { return &c; });
std::transform(cks.begin(), cks.end(), std::back_inserter(columns), [](auto& c) { return &c; });
@@ -933,42 +953,40 @@ future<executor::request_return_type> executor::get_records(client_state& client
return cdef->name->name() == eor_column_name;
})
);
auto clustering_key_index = clustering_key_column_name ? std::distance(metadata.get_names().begin(),
std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [&](const lw_shared_ptr<cql3::column_specification>& cdef) {
return cdef->name->name() == *clustering_key_column_name;
})
) : 0;
std::optional<utils::UUID> timestamp;
auto dynamodb = rjson::empty_object();
auto record = rjson::empty_object();
struct Record {
rjson::value record;
rjson::value dynamodb;
};
const managed_bytes empty_managed_bytes;
std::unordered_map<const managed_bytes*, Record, managed_bytes_ptr_hash, managed_bytes_ptr_equal> records_map;
const auto dc_name = _proxy.get_token_metadata_ptr()->get_topology().get_datacenter();
using op_utype = std::underlying_type_t<cdc::operation>;
auto maybe_add_record = [&] {
if (!dynamodb.ObjectEmpty()) {
rjson::add(record, "dynamodb", std::move(dynamodb));
dynamodb = rjson::empty_object();
}
if (!record.ObjectEmpty()) {
rjson::add(record, "awsRegion", rjson::from_string(dc_name));
rjson::add(record, "eventID", event_id(iter.shard.id, *timestamp));
rjson::add(record, "eventSource", "scylladb:alternator");
rjson::add(record, "eventVersion", "1.1");
rjson::push_back(records, std::move(record));
record = rjson::empty_object();
--limit;
}
};
for (auto& row : result_set->rows()) {
auto op = static_cast<cdc::operation>(value_cast<op_utype>(data_type_for<op_utype>()->deserialize(*row[op_index])));
auto ts = value_cast<utils::UUID>(data_type_for<utils::UUID>()->deserialize(*row[ts_index]));
auto eor = row[eor_index].has_value() ? value_cast<bool>(boolean_type->deserialize(*row[eor_index])) : false;
const managed_bytes* cs_ptr = clustering_key_column_name ? &*row[clustering_key_index] : &empty_managed_bytes;
auto records_it = records_map.emplace(cs_ptr, Record{});
auto &record = records_it.first->second;
if (!dynamodb.HasMember("Keys")) {
if (records_it.second) {
record.dynamodb = rjson::empty_object();
record.record = rjson::empty_object();
auto keys = rjson::empty_object();
describe_single_item(*selection, row, key_names, keys);
rjson::add(dynamodb, "Keys", std::move(keys));
rjson::add(dynamodb, "ApproximateCreationDateTime", utils::UUID_gen::unix_timestamp_in_sec(ts).count());
rjson::add(dynamodb, "SequenceNumber", sequence_number(ts));
rjson::add(dynamodb, "StreamViewType", type);
rjson::add(record.dynamodb, "Keys", std::move(keys));
rjson::add(record.dynamodb, "ApproximateCreationDateTime", utils::UUID_gen::unix_timestamp_in_sec(ts).count());
rjson::add(record.dynamodb, "SequenceNumber", sequence_number(ts));
rjson::add(record.dynamodb, "StreamViewType", type);
// TODO: SizeBytes
}
@@ -992,6 +1010,10 @@ future<executor::request_return_type> executor::get_records(client_state& client
* flags on CDC log, instead we use data to
* drive what is returned. This is (afaict)
* consistent with dynamo streams
*
* Note: BatchWriteItem will generate multiple records with
* the same timestamp, when write isolation is set to always
* (which triggers lwt), so we need to unpack them based on clustering key.
*/
switch (op) {
case cdc::operation::pre_image:
@@ -1000,14 +1022,14 @@ future<executor::request_return_type> executor::get_records(client_state& client
auto item = rjson::empty_object();
describe_single_item(*selection, row, attr_names, item, nullptr, true);
describe_single_item(*selection, row, key_names, item);
rjson::add(dynamodb, op == cdc::operation::pre_image ? "OldImage" : "NewImage", std::move(item));
rjson::add(record.dynamodb, op == cdc::operation::pre_image ? "OldImage" : "NewImage", std::move(item));
break;
}
case cdc::operation::update:
rjson::add(record, "eventName", "MODIFY");
rjson::add(record.record, "eventName", "MODIFY");
break;
case cdc::operation::insert:
rjson::add(record, "eventName", "INSERT");
rjson::add(record.record, "eventName", "INSERT");
break;
case cdc::operation::service_row_delete:
case cdc::operation::service_partition_delete:
@@ -1015,28 +1037,41 @@ future<executor::request_return_type> executor::get_records(client_state& client
auto user_identity = rjson::empty_object();
rjson::add(user_identity, "Type", "Service");
rjson::add(user_identity, "PrincipalId", "dynamodb.amazonaws.com");
rjson::add(record, "userIdentity", std::move(user_identity));
rjson::add(record, "eventName", "REMOVE");
rjson::add(record.record, "userIdentity", std::move(user_identity));
rjson::add(record.record, "eventName", "REMOVE");
break;
}
default:
rjson::add(record, "eventName", "REMOVE");
rjson::add(record.record, "eventName", "REMOVE");
break;
}
if (eor) {
maybe_add_record();
size_t index = 0;
for (auto& [_, rec] : records_map) {
rjson::add(rec.record, "awsRegion", rjson::from_string(dc_name));
rjson::add(rec.record, "eventID", event_id(iter.shard.id, *timestamp, index++));
rjson::add(rec.record, "eventSource", "scylladb:alternator");
rjson::add(rec.record, "eventVersion", "1.1");
rjson::add(rec.record, "dynamodb", std::move(rec.dynamodb));
rjson::push_back(records, std::move(rec.record));
}
records_map.clear();
timestamp = ts;
if (limit == 0) {
if (records.Size() >= limit) {
// Note: we might have more than limit rows here - BatchWriteItem will emit multiple items
// with the same timestamp and we have no way of resume iteration midway through those,
// so we return all of them here.
break;
}
}
}
auto ret = rjson::empty_object();
auto nrecords = records.Size();
rjson::add(ret, "Records", std::move(records));
if (nrecords != 0) {
if (timestamp) {
// #9642. Set next iterators threshold to > last
shard_iterator next_iter(iter.table, iter.shard, *timestamp, false);
// Note that here we unconditionally return NextShardIterator,
@@ -1087,6 +1122,7 @@ bool executor::add_stream_options(const rjson::value& stream_specification, sche
cdc::options opts;
opts.enabled(true);
// cdc::delta_mode is ignored by Alternator, so aim for the least overhead.
opts.set_delta_mode(cdc::delta_mode::keys);
opts.ttl(std::chrono::duration_cast<std::chrono::seconds>(dynamodb_streams_max_window).count());


@@ -743,7 +743,7 @@
"parameters":[
{
"name":"tag",
"description":"the tag given to the snapshot",
"description":"The snapshot tag to delete. If omitted, all snapshots are removed.",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -751,7 +751,7 @@
},
{
"name":"kn",
"description":"Comma-separated keyspaces name that their snapshot will be deleted",
"description":"Comma-separated list of keyspace names to delete snapshots from. If omitted, snapshots are deleted from all keyspaces.",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -759,7 +759,7 @@
},
{
"name":"cf",
"description":"an optional table name that its snapshot will be deleted",
"description":"A table name used to filter which table's snapshots are deleted. If omitted or empty, snapshots for all tables are eligible. When provided together with 'kn', the table is looked up in each listed keyspace independently. For secondary indexes, the logical index name (e.g. 'myindex') can be used and is resolved automatically.",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -1295,6 +1295,45 @@
}
]
},
{
"path":"/storage_service/logstor_compaction",
"operations":[
{
"method":"POST",
"summary":"Trigger compaction of the key-value storage",
"type":"void",
"nickname":"logstor_compaction",
"produces":[
"application/json"
],
"parameters":[
{
"name":"major",
"description":"When true, perform a major compaction",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/logstor_flush",
"operations":[
{
"method":"POST",
"summary":"Trigger flush of logstor storage",
"type":"void",
"nickname":"logstor_flush",
"produces":[
"application/json"
],
"parameters":[]
}
]
},
{
"path":"/storage_service/active_repair/",
"operations":[
@@ -3127,6 +3166,83 @@
]
},
{
"path":"/storage_service/vnode_tablet_migrations/keyspaces/{keyspace}",
"operations":[{
"method":"POST",
"summary":"Start vnodes-to-tablets migration for all tables in a keyspace",
"type":"void",
"nickname":"create_vnode_tablet_migration",
"produces":["application/json"],
"parameters":[
{
"name":"keyspace",
"description":"Keyspace name",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
},
{
"method":"GET",
"summary":"Get a keyspace's vnodes-to-tablets migration status",
"type":"vnode_tablet_migration_status",
"nickname":"get_vnode_tablet_migration",
"produces":["application/json"],
"parameters":[
{
"name":"keyspace",
"description":"Keyspace name",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}]
},
{
"path":"/storage_service/vnode_tablet_migrations/node/storage_mode",
"operations":[{
"method":"PUT",
"summary":"Set the intended storage mode for this node during vnodes-to-tablets migration",
"type":"void",
"nickname":"set_vnode_tablet_migration_node_storage_mode",
"produces":["application/json"],
"parameters":[
{
"name":"intended_mode",
"description":"Intended storage mode (tablets or vnodes)",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}]
},
{
"path":"/storage_service/vnode_tablet_migrations/keyspaces/{keyspace}/finalization",
"operations":[{
"method":"POST",
"summary":"Finalize vnodes-to-tablets migration for all tables in a keyspace",
"type":"void",
"nickname":"finalize_vnode_tablet_migration",
"produces":["application/json"],
"parameters":[
{
"name":"keyspace",
"description":"Keyspace name",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}]
},
{
"path":"/storage_service/quiesce_topology",
"operations":[
@@ -3229,6 +3345,38 @@
}
]
},
{
"path":"/storage_service/logstor_info",
"operations":[
{
"method":"GET",
"summary":"Logstor segment information for one table",
"type":"table_logstor_info",
"nickname":"logstor_info",
"produces":[
"application/json"
],
"parameters":[
{
"name":"keyspace",
"description":"The keyspace",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"table name",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/retrain_dict",
"operations":[
@@ -3637,6 +3785,47 @@
}
}
},
"logstor_hist_bucket":{
"id":"logstor_hist_bucket",
"properties":{
"bucket":{
"type":"long"
},
"count":{
"type":"long"
},
"min_data_size":{
"type":"long"
},
"max_data_size":{
"type":"long"
}
}
},
"table_logstor_info":{
"id":"table_logstor_info",
"description":"Per-table logstor segment distribution",
"properties":{
"keyspace":{
"type":"string"
},
"table":{
"type":"string"
},
"compaction_groups":{
"type":"long"
},
"segments":{
"type":"long"
},
"data_size_histogram":{
"type":"array",
"items":{
"$ref":"logstor_hist_bucket"
}
}
}
},
"tablet_repair_result":{
"id":"tablet_repair_result",
"description":"Tablet repair result",
@@ -3671,6 +3860,45 @@
"description":"The resulting compression ratio (estimated on a random sample of files)"
}
}
},
"vnode_tablet_migration_node_status":{
"id":"vnode_tablet_migration_node_status",
"description":"Node storage mode info during vnodes-to-tablets migration",
"properties":{
"host_id":{
"type":"string",
"description":"The host ID"
},
"current_mode":{
"type":"string",
"description":"The current storage mode: `vnodes` or `tablets`"
},
"intended_mode":{
"type":"string",
"description":"The intended storage mode: `vnodes` or `tablets`"
}
}
},
"vnode_tablet_migration_status":{
"id":"vnode_tablet_migration_status",
"description":"Vnodes-to-tablets migration status for a keyspace",
"properties":{
"keyspace":{
"type":"string",
"description":"The keyspace name"
},
"status":{
"type":"string",
"description":"The migration status: `vnodes` (not started), `migrating_to_tablets` (in progress), or `tablets` (complete)"
},
"nodes":{
"type":"array",
"items":{
"$ref":"vnode_tablet_migration_node_status"
},
"description":"Per-node storage mode information. Empty if the keyspace is not being migrated."
}
}
}
}
}


@@ -209,6 +209,21 @@
"parameters":[]
}
]
},
{
"path":"/system/chosen_sstable_version",
"operations":[
{
"method":"GET",
"summary":"Get sstable version currently chosen for use in new sstables",
"type":"string",
"nickname":"get_chosen_sstable_version",
"produces":[
"application/json"
],
"parameters":[]
}
]
}
]
}


@@ -18,7 +18,9 @@
#include "utils/assert.hh"
#include "utils/estimated_histogram.hh"
#include <algorithm>
#include <sstream>
#include "db/data_listeners.hh"
#include "utils/hash.hh"
#include "storage_service.hh"
#include "compaction/compaction_manager.hh"
#include "unimplemented.hh"
@@ -342,6 +344,56 @@ uint64_t accumulate_on_active_memtables(replica::table& t, noncopyable_function<
return ret;
}
static
future<json::json_return_type>
rest_toppartitions_generic(sharded<replica::database>& db, std::unique_ptr<http::request> req) {
bool filters_provided = false;
std::unordered_set<std::tuple<sstring, sstring>, utils::tuple_hash> table_filters {};
if (auto filters = req->get_query_param("table_filters"); !filters.empty()) {
filters_provided = true;
std::stringstream ss { filters };
std::string filter;
while (!filters.empty() && ss.good()) {
std::getline(ss, filter, ',');
table_filters.emplace(parse_fully_qualified_cf_name(filter));
}
}
std::unordered_set<sstring> keyspace_filters {};
if (auto filters = req->get_query_param("keyspace_filters"); !filters.empty()) {
filters_provided = true;
std::stringstream ss { filters };
std::string filter;
while (!filters.empty() && ss.good()) {
std::getline(ss, filter, ',');
keyspace_filters.emplace(std::move(filter));
}
}
// filters were provided but parsed to nothing: return empty results immediately
if (filters_provided && table_filters.empty() && keyspace_filters.empty()) {
apilog.debug("toppartitions query: processing results");
cf::toppartitions_query_results results;
results.read_cardinality = 0;
results.write_cardinality = 0;
return make_ready_future<json::json_return_type>(results);
}
api::req_param<std::chrono::milliseconds, unsigned> duration{*req, "duration", 1000ms};
api::req_param<unsigned> capacity(*req, "capacity", 256);
api::req_param<unsigned> list_size(*req, "list_size", 10);
apilog.info("toppartitions query: #table_filters={} #keyspace_filters={} duration={} list_size={} capacity={}",
!table_filters.empty() ? std::to_string(table_filters.size()) : "all", !keyspace_filters.empty() ? std::to_string(keyspace_filters.size()) : "all", duration.value, list_size.value, capacity.value);
return seastar::do_with(db::toppartitions_query(db, std::move(table_filters), std::move(keyspace_filters), duration.value, list_size, capacity), [] (db::toppartitions_query& q) {
return run_toppartitions_query(q);
});
}
void set_column_family(http_context& ctx, routes& r, sharded<replica::database>& db) {
cf::get_column_family_name.set(r, [&db] (const_req req){
std::vector<sstring> res;
@@ -1047,6 +1099,10 @@ void set_column_family(http_context& ctx, routes& r, sharded<replica::database>&
});
});
ss::toppartitions_generic.set(r, [&db] (std::unique_ptr<http::request> req) {
return rest_toppartitions_generic(db, std::move(req));
});
cf::force_major_compaction.set(r, [&ctx, &db](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
if (!req->get_query_param("split_output").empty()) {
fail(unimplemented::cause::API);
@@ -1213,6 +1269,7 @@ void unset_column_family(http_context& ctx, routes& r) {
cf::get_sstable_count_per_level.unset(r);
cf::get_sstables_for_key.unset(r);
cf::toppartitions.unset(r);
ss::toppartitions_generic.unset(r);
cf::force_major_compaction.unset(r);
ss::get_load.unset(r);
ss::get_metrics_load.unset(r);


@@ -23,7 +23,7 @@ void set_error_injection(http_context& ctx, routes& r) {
hf::enable_injection.set(r, [](std::unique_ptr<request> req) -> future<json::json_return_type> {
sstring injection = req->get_path_param("injection");
bool one_shot = req->get_query_param("one_shot") == "True";
bool one_shot = strcasecmp(req->get_query_param("one_shot").c_str(), "true") == 0;
auto params = co_await util::read_entire_stream_contiguous(*req->content_stream);
const size_t max_params_size = 1024 * 1024;


@@ -17,9 +17,7 @@
#include "gms/feature_service.hh"
#include "schema/schema_builder.hh"
#include "sstables/sstables_manager.hh"
#include "utils/hash.hh"
#include <optional>
#include <sstream>
#include <stdexcept>
#include <time.h>
#include <algorithm>
@@ -32,6 +30,7 @@
#include <fmt/ranges.h>
#include "service/raft/raft_group0_client.hh"
#include "service/storage_service.hh"
#include "service/topology_state_machine.hh"
#include "service/load_meter.hh"
#include "gms/feature_service.hh"
#include "gms/gossiper.hh"
@@ -574,14 +573,6 @@ void unset_view_builder(http_context& ctx, routes& r) {
cf::get_built_indexes.unset(r);
}
static future<json::json_return_type> describe_ring_as_json(sharded<service::storage_service>& ss, sstring keyspace) {
co_return json::json_return_type(stream_range_as_array(co_await ss.local().describe_ring(keyspace), token_range_endpoints_to_json));
}
static future<json::json_return_type> describe_ring_as_json_for_table(const sharded<service::storage_service>& ss, sstring keyspace, sstring table) {
co_return json::json_return_type(stream_range_as_array(co_await ss.local().describe_ring_for_table(keyspace, table), token_range_endpoints_to_json));
}
namespace {
template <typename Key, typename Value>
storage_service_json::mapper map_to_json(const std::pair<Key, Value>& i) {
@@ -612,56 +603,6 @@ rest_get_token_endpoint(http_context& ctx, sharded<service::storage_service>& ss
co_return json::json_return_type(stream_range_as_array(token_endpoints, &map_to_json<dht::token, gms::inet_address>));
}
static
future<json::json_return_type>
rest_toppartitions_generic(http_context& ctx, std::unique_ptr<http::request> req) {
bool filters_provided = false;
std::unordered_set<std::tuple<sstring, sstring>, utils::tuple_hash> table_filters {};
if (auto filters = req->get_query_param("table_filters"); !filters.empty()) {
filters_provided = true;
std::stringstream ss { filters };
std::string filter;
while (!filters.empty() && ss.good()) {
std::getline(ss, filter, ',');
table_filters.emplace(parse_fully_qualified_cf_name(filter));
}
}
std::unordered_set<sstring> keyspace_filters {};
if (auto filters = req->get_query_param("keyspace_filters"); !filters.empty()) {
filters_provided = true;
std::stringstream ss { filters };
std::string filter;
while (!filters.empty() && ss.good()) {
std::getline(ss, filter, ',');
keyspace_filters.emplace(std::move(filter));
}
}
// when the query is empty return immediately
if (filters_provided && table_filters.empty() && keyspace_filters.empty()) {
apilog.debug("toppartitions query: processing results");
httpd::column_family_json::toppartitions_query_results results;
results.read_cardinality = 0;
results.write_cardinality = 0;
return make_ready_future<json::json_return_type>(results);
}
api::req_param<std::chrono::milliseconds, unsigned> duration{*req, "duration", 1000ms};
api::req_param<unsigned> capacity(*req, "capacity", 256);
api::req_param<unsigned> list_size(*req, "list_size", 10);
apilog.info("toppartitions query: #table_filters={} #keyspace_filters={} duration={} list_size={} capacity={}",
!table_filters.empty() ? std::to_string(table_filters.size()) : "all", !keyspace_filters.empty() ? std::to_string(keyspace_filters.size()) : "all", duration.value, list_size.value, capacity.value);
return seastar::do_with(db::toppartitions_query(ctx.db, std::move(table_filters), std::move(keyspace_filters), duration.value, list_size, capacity), [] (db::toppartitions_query& q) {
return run_toppartitions_query(q);
});
}
static
json::json_return_type
rest_get_release_version(sharded<service::storage_service>& ss, const_req& req) {
@@ -729,13 +670,16 @@ rest_describe_ring(http_context& ctx, sharded<service::storage_service>& ss, std
if (!req->param.exists("keyspace")) {
throw bad_param_exception("The keyspace param is not provided");
}
auto keyspace = req->get_path_param("keyspace");
auto keyspace = validate_keyspace(ctx, req);
auto table = req->get_query_param("table");
utils::chunked_vector<dht::token_range_endpoints> ranges;
if (!table.empty()) {
validate_table(ctx.db.local(), keyspace, table);
return describe_ring_as_json_for_table(ss, keyspace, table);
auto table_id = validate_table(ctx.db.local(), keyspace, table);
ranges = co_await ss.local().describe_ring_for_table(table_id);
} else {
ranges = co_await ss.local().describe_ring(keyspace);
}
return describe_ring_as_json(ss, validate_keyspace(ctx, req));
co_return json::json_return_type(stream_range_as_array(std::move(ranges), token_range_endpoints_to_json));
}
static
@@ -833,6 +777,28 @@ rest_force_keyspace_flush(http_context& ctx, std::unique_ptr<http::request> req)
co_return json_void();
}
static
future<json::json_return_type>
rest_logstor_compaction(http_context& ctx, std::unique_ptr<http::request> req) {
bool major = false;
if (auto major_param = req->get_query_param("major"); !major_param.empty()) {
major = validate_bool(major_param);
}
apilog.info("logstor_compaction: major={}", major);
auto& db = ctx.db;
co_await replica::database::trigger_logstor_compaction_on_all_shards(db, major);
co_return json_void();
}
static
future<json::json_return_type>
rest_logstor_flush(http_context& ctx, std::unique_ptr<http::request> req) {
apilog.info("logstor_flush");
auto& db = ctx.db;
co_await replica::database::flush_logstor_separator_on_all_shards(db);
co_return json_void();
}
static
future<json::json_return_type>
rest_decommission(sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& ssc, std::unique_ptr<http::request> req) {
@@ -1553,6 +1519,54 @@ rest_sstable_info(http_context& ctx, std::unique_ptr<http::request> req) {
});
}
static
future<json::json_return_type>
rest_logstor_info(http_context& ctx, std::unique_ptr<http::request> req) {
auto keyspace = api::req_param<sstring>(*req, "keyspace", {}).value;
auto table = api::req_param<sstring>(*req, "table", {}).value;
if (table.empty()) {
table = api::req_param<sstring>(*req, "cf", {}).value;
}
if (keyspace.empty()) {
throw bad_param_exception("The query parameter 'keyspace' is required");
}
if (table.empty()) {
throw bad_param_exception("The query parameter 'table' is required");
}
keyspace = validate_keyspace(ctx, keyspace);
auto tid = validate_table(ctx.db.local(), keyspace, table);
auto& cf = ctx.db.local().find_column_family(tid);
if (!cf.uses_logstor()) {
throw bad_param_exception(fmt::format("Table {}.{} does not use logstor", keyspace, table));
}
return do_with(replica::logstor::table_segment_stats{}, [keyspace = std::move(keyspace), table = std::move(table), tid, &ctx] (replica::logstor::table_segment_stats& merged_stats) {
return ctx.db.map_reduce([&merged_stats](replica::logstor::table_segment_stats&& shard_stats) {
merged_stats += shard_stats;
}, [tid](const replica::database& db) {
return db.get_logstor_table_segment_stats(tid);
}).then([&merged_stats, keyspace = std::move(keyspace), table = std::move(table)] {
ss::table_logstor_info result;
result.keyspace = keyspace;
result.table = table;
result.compaction_groups = merged_stats.compaction_group_count;
result.segments = merged_stats.segment_count;
for (const auto& bucket : merged_stats.histogram) {
ss::logstor_hist_bucket hist;
hist.count = bucket.count;
hist.max_data_size = bucket.max_data_size;
result.data_size_histogram.push(std::move(hist));
}
return make_ready_future<json::json_return_type>(stream_object(result));
});
});
}
static
future<json::json_return_type>
rest_reload_raft_topology_state(sharded<service::storage_service>& ss, service::raft_group0_client& group0_client, std::unique_ptr<http::request> req) {
@@ -1709,6 +1723,69 @@ rest_tablet_balancing_enable(sharded<service::storage_service>& ss, std::unique_
co_return json_void();
}
static
future<json::json_return_type>
rest_create_vnode_tablet_migration(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
if (!ss.local().get_feature_service().vnodes_to_tablets_migrations) {
apilog.warn("create_vnode_tablet_migration: called before the cluster feature was enabled");
throw std::runtime_error("vnodes-to-tablets migration requires all nodes to support the VNODES_TO_TABLETS_MIGRATIONS cluster feature");
}
auto keyspace = validate_keyspace(ctx, req);
co_await ss.local().prepare_for_tablets_migration(keyspace);
co_return json_void();
}
static
future<json::json_return_type>
rest_get_vnode_tablet_migration(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
if (!ss.local().get_feature_service().vnodes_to_tablets_migrations) {
apilog.warn("get_vnode_tablet_migration: called before the cluster feature was enabled");
throw std::runtime_error("vnodes-to-tablets migration requires all nodes to support the VNODES_TO_TABLETS_MIGRATIONS cluster feature");
}
auto keyspace = validate_keyspace(ctx, req);
auto status = co_await ss.local().get_tablets_migration_status(keyspace);
ss::vnode_tablet_migration_status result;
result.keyspace = status.keyspace;
result.status = status.status;
result.nodes._set = true;
for (const auto& node : status.nodes) {
ss::vnode_tablet_migration_node_status n;
n.host_id = fmt::to_string(node.host_id);
n.current_mode = node.current_mode;
n.intended_mode = node.intended_mode;
result.nodes.push(n);
}
co_return result;
}
static
future<json::json_return_type>
rest_set_vnode_tablet_migration_node_storage_mode(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
if (!ss.local().get_feature_service().vnodes_to_tablets_migrations) {
apilog.warn("set_vnode_tablet_migration_node_storage_mode: called before the cluster feature was enabled");
throw std::runtime_error("vnodes-to-tablets migration requires all nodes to support the VNODES_TO_TABLETS_MIGRATIONS cluster feature");
}
auto mode_str = req->get_query_param("intended_mode");
auto mode = service::intended_storage_mode_from_string(mode_str);
co_await ss.local().set_node_intended_storage_mode(mode);
co_return json_void();
}
static
future<json::json_return_type>
rest_finalize_vnode_tablet_migration(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
if (!ss.local().get_feature_service().vnodes_to_tablets_migrations) {
apilog.warn("finalize_vnode_tablet_migration: called before the cluster feature was enabled");
throw std::runtime_error("vnodes-to-tablets migration requires all nodes to support the VNODES_TO_TABLETS_MIGRATIONS cluster feature");
}
auto keyspace = validate_keyspace(ctx, req);
validate_keyspace(ctx, keyspace);
co_await ss.local().finalize_tablets_migration(keyspace);
co_return json_void();
}
static
future<json::json_return_type>
rest_quiesce_topology(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
@@ -1784,7 +1861,6 @@ rest_bind(FuncType func, BindArgs&... args) {
void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& ssc, service::raft_group0_client& group0_client) {
ss::get_token_endpoint.set(r, rest_bind(rest_get_token_endpoint, ctx, ss));
ss::toppartitions_generic.set(r, rest_bind(rest_toppartitions_generic, ctx));
ss::get_release_version.set(r, rest_bind(rest_get_release_version, ss));
ss::get_scylla_release_version.set(r, rest_bind(rest_get_scylla_release_version, ss));
ss::get_schema_version.set(r, rest_bind(rest_get_schema_version, ss));
@@ -1800,6 +1876,8 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::force_flush.set(r, rest_bind(rest_force_flush, ctx));
ss::force_keyspace_flush.set(r, rest_bind(rest_force_keyspace_flush, ctx));
ss::decommission.set(r, rest_bind(rest_decommission, ss, ssc));
ss::logstor_compaction.set(r, rest_bind(rest_logstor_compaction, ctx));
ss::logstor_flush.set(r, rest_bind(rest_logstor_flush, ctx));
ss::move.set(r, rest_bind(rest_move, ss));
ss::remove_node.set(r, rest_bind(rest_remove_node, ss));
ss::exclude_node.set(r, rest_bind(rest_exclude_node, ss));
@@ -1848,6 +1926,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::retrain_dict.set(r, rest_bind(rest_retrain_dict, ctx, ss, group0_client));
ss::estimate_compression_ratios.set(r, rest_bind(rest_estimate_compression_ratios, ctx, ss));
ss::sstable_info.set(r, rest_bind(rest_sstable_info, ctx));
ss::logstor_info.set(r, rest_bind(rest_logstor_info, ctx));
ss::reload_raft_topology_state.set(r, rest_bind(rest_reload_raft_topology_state, ss, group0_client));
ss::upgrade_to_raft_topology.set(r, rest_bind(rest_upgrade_to_raft_topology, ss));
ss::raft_topology_upgrade_status.set(r, rest_bind(rest_raft_topology_upgrade_status, ss));
@@ -1857,6 +1936,10 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::del_tablet_replica.set(r, rest_bind(rest_del_tablet_replica, ctx, ss));
ss::repair_tablet.set(r, rest_bind(rest_repair_tablet, ctx, ss));
ss::tablet_balancing_enable.set(r, rest_bind(rest_tablet_balancing_enable, ss));
ss::create_vnode_tablet_migration.set(r, rest_bind(rest_create_vnode_tablet_migration, ctx, ss));
ss::get_vnode_tablet_migration.set(r, rest_bind(rest_get_vnode_tablet_migration, ctx, ss));
ss::set_vnode_tablet_migration_node_storage_mode.set(r, rest_bind(rest_set_vnode_tablet_migration_node_storage_mode, ctx, ss));
ss::finalize_vnode_tablet_migration.set(r, rest_bind(rest_finalize_vnode_tablet_migration, ctx, ss));
ss::quiesce_topology.set(r, rest_bind(rest_quiesce_topology, ss));
sp::get_schema_versions.set(r, rest_bind(rest_get_schema_versions, ss));
ss::drop_quarantined_sstables.set(r, rest_bind(rest_drop_quarantined_sstables, ctx, ss));
@@ -1864,7 +1947,6 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
void unset_storage_service(http_context& ctx, routes& r) {
ss::get_token_endpoint.unset(r);
ss::toppartitions_generic.unset(r);
ss::get_release_version.unset(r);
ss::get_scylla_release_version.unset(r);
ss::get_schema_version.unset(r);
@@ -1878,6 +1960,8 @@ void unset_storage_service(http_context& ctx, routes& r) {
ss::reset_cleanup_needed.unset(r);
ss::force_flush.unset(r);
ss::force_keyspace_flush.unset(r);
ss::logstor_compaction.unset(r);
ss::logstor_flush.unset(r);
ss::decommission.unset(r);
ss::move.unset(r);
ss::remove_node.unset(r);
@@ -1925,6 +2009,7 @@ void unset_storage_service(http_context& ctx, routes& r) {
ss::get_ownership.unset(r);
ss::get_effective_ownership.unset(r);
ss::sstable_info.unset(r);
ss::logstor_info.unset(r);
ss::reload_raft_topology_state.unset(r);
ss::upgrade_to_raft_topology.unset(r);
ss::raft_topology_upgrade_status.unset(r);
@@ -1934,6 +2019,10 @@ void unset_storage_service(http_context& ctx, routes& r) {
ss::del_tablet_replica.unset(r);
ss::repair_tablet.unset(r);
ss::tablet_balancing_enable.unset(r);
ss::create_vnode_tablet_migration.unset(r);
ss::get_vnode_tablet_migration.unset(r);
ss::set_vnode_tablet_migration_node_storage_mode.unset(r);
ss::finalize_vnode_tablet_migration.unset(r);
ss::quiesce_topology.unset(r);
sp::get_schema_versions.unset(r);
ss::drop_quarantined_sstables.unset(r);
@@ -2024,6 +2113,8 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
co_await snap_ctl.local().take_column_family_snapshot(keynames[0], column_families, tag, opts);
}
co_return json_void();
} catch (const data_dictionary::no_such_column_family& e) {
throw httpd::bad_param_exception(e.what());
} catch (...) {
apilog.error("take_snapshot failed: {}", std::current_exception());
throw;
@@ -2060,6 +2151,8 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
try {
co_await snap_ctl.local().clear_snapshot(tag, keynames, column_family);
co_return json_void();
} catch (const data_dictionary::no_such_column_family& e) {
throw httpd::bad_param_exception(e.what());
} catch (...) {
apilog.error("del_snapshot failed: {}", std::current_exception());
throw;


@@ -190,6 +190,13 @@ void set_system(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(seastar::to_sstring(format));
});
});
hs::get_chosen_sstable_version.set(r, [&ctx] (std::unique_ptr<request> req) {
return smp::submit_to(0, [&ctx] {
auto format = ctx.db.local().get_user_sstables_manager().get_preferred_sstable_version();
return make_ready_future<json::json_return_type>(seastar::to_sstring(format));
});
});
}
}


@@ -47,7 +47,7 @@ void cache::set_permission_loader(permission_loader_func loader) {
_permission_loader = std::move(loader);
}
lw_shared_ptr<const cache::role_record> cache::get(const role_name_t& role) const noexcept {
lw_shared_ptr<const cache::role_record> cache::get(std::string_view role) const noexcept {
auto it = _roles.find(role);
if (it == _roles.end()) {
return {};
@@ -55,6 +55,16 @@ lw_shared_ptr<const cache::role_record> cache::get(const role_name_t& role) cons
return it->second;
}
void cache::for_each_role(const std::function<void(const role_name_t&, const role_record&)>& func) const {
for (const auto& [name, record] : _roles) {
func(name, *record);
}
}
size_t cache::roles_count() const noexcept {
return _roles.size();
}
future<permission_set> cache::get_permissions(const role_or_anonymous& role, const resource& r) {
std::unordered_map<resource, permission_set>* perms_cache;
lw_shared_ptr<role_record> role_ptr;


@@ -9,6 +9,7 @@
#pragma once
#include <seastar/core/abort_source.hh>
#include <string_view>
#include <unordered_set>
#include <unordered_map>
@@ -19,7 +20,7 @@
#include <seastar/core/semaphore.hh>
#include <seastar/core/metrics_registration.hh>
#include <absl/container/flat_hash_map.h>
#include "absl-flat_hash_map.hh"
#include "auth/permission.hh"
#include "auth/common.hh"
@@ -42,8 +43,8 @@ public:
std::unordered_set<role_name_t> member_of;
std::unordered_set<role_name_t> members;
sstring salted_hash;
std::unordered_map<sstring, sstring> attributes;
std::unordered_map<sstring, permission_set> permissions;
std::unordered_map<sstring, sstring, sstring_hash, sstring_eq> attributes;
std::unordered_map<sstring, permission_set, sstring_hash, sstring_eq> permissions;
private:
friend cache;
// cached permissions include effects of role's inheritance
@@ -52,7 +53,7 @@ public:
};
explicit cache(cql3::query_processor& qp, abort_source& as) noexcept;
lw_shared_ptr<const role_record> get(const role_name_t& role) const noexcept;
lw_shared_ptr<const role_record> get(std::string_view role) const noexcept;
void set_permission_loader(permission_loader_func loader);
future<permission_set> get_permissions(const role_or_anonymous& role, const resource& r);
future<> prune(const resource& r);
@@ -61,8 +62,15 @@ public:
future<> load_roles(std::unordered_set<role_name_t> roles);
static bool includes_table(const table_id&) noexcept;
// Returns the number of roles in the cache.
size_t roles_count() const noexcept;
// The callback doesn't suspend (no co_await) so it observes the state
// of the cache atomically.
void for_each_role(const std::function<void(const role_name_t&, const role_record&)>& func) const;
private:
using roles_map = absl::flat_hash_map<role_name_t, lw_shared_ptr<role_record>>;
using roles_map = absl::flat_hash_map<role_name_t, lw_shared_ptr<role_record>, sstring_hash, sstring_eq>;
roles_map _roles;
// anonymous permissions map exists mainly due to compatibility with
// higher layers which use role_or_anonymous to get permissions.


@@ -14,6 +14,7 @@
#include <fmt/ranges.h>
#include "utils/to_string.hh"
#include "utils/error_injection.hh"
#include "data_dictionary/data_dictionary.hh"
#include "cql3/query_processor.hh"
#include "db/config.hh"
@@ -105,6 +106,9 @@ auth::authentication_option_set auth::certificate_authenticator::alterable_optio
}
future<std::optional<auth::authenticated_user>> auth::certificate_authenticator::authenticate(session_dn_func f) const {
if (auto user = utils::get_local_injector().inject_parameter("transport_early_auth_bypass")) {
co_return auth::authenticated_user{sstring(*user)};
}
if (!f) {
co_return std::nullopt;
}


@@ -0,0 +1,37 @@
/*
* Copyright (C) 2026-present ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* SPDX-License-Identifier: (LicenseRef-ScyllaDB-Source-Available-1.0 and Apache-2.0)
*/
#pragma once
#include "auth/default_authorizer.hh"
#include "auth/permission.hh"
namespace auth {
// maintenance_socket_authorizer is used for clients connecting to the
// maintenance socket. It grants all permissions unconditionally (like
// AllowAllAuthorizer) while still supporting grant/revoke operations
// (delegated to the underlying CassandraAuthorizer / default_authorizer).
class maintenance_socket_authorizer : public default_authorizer {
public:
using default_authorizer::default_authorizer;
~maintenance_socket_authorizer() override = default;
future<> start() override {
return make_ready_future<>();
}
future<permission_set> authorize(const role_or_anonymous&, const resource&) const override {
return make_ready_future<permission_set>(permissions::ALL);
}
};
} // namespace auth


@@ -30,6 +30,7 @@
#include "auth/default_authorizer.hh"
#include "auth/ldap_role_manager.hh"
#include "auth/maintenance_socket_authenticator.hh"
#include "auth/maintenance_socket_authorizer.hh"
#include "auth/maintenance_socket_role_manager.hh"
#include "auth/password_authenticator.hh"
#include "auth/role_or_anonymous.hh"
@@ -866,6 +867,12 @@ authenticator_factory make_maintenance_socket_authenticator_factory(
};
}
authorizer_factory make_maintenance_socket_authorizer_factory(sharded<cql3::query_processor>& qp) {
return [&qp] {
return std::make_unique<maintenance_socket_authorizer>(qp.local());
};
}
role_manager_factory make_maintenance_socket_role_manager_factory(
sharded<cql3::query_processor>& qp,
::service::raft_group0_client& g0,


@@ -434,6 +434,11 @@ authenticator_factory make_maintenance_socket_authenticator_factory(
sharded<::service::migration_manager>& mm,
sharded<cache>& cache);
/// Creates a factory for the maintenance socket authorizer.
/// This authorizer is not config-selectable and is only used for the maintenance socket.
/// It grants all permissions unconditionally while delegating grant/revoke to the default authorizer.
authorizer_factory make_maintenance_socket_authorizer_factory(sharded<cql3::query_processor>& qp);
/// Creates a factory for the maintenance socket role manager.
/// This role manager is not config-selectable and is only used for the maintenance socket.
role_manager_factory make_maintenance_socket_role_manager_factory(


@@ -44,13 +44,12 @@ namespace auth {
static logging::logger log("standard_role_manager");
future<std::optional<standard_role_manager::record>> standard_role_manager::find_record(std::string_view role_name) {
auto name = sstring(role_name);
auto role = _cache.get(name);
auto role = _cache.get(role_name);
if (!role) {
return make_ready_future<std::optional<record>>(std::nullopt);
}
return make_ready_future<std::optional<record>>(std::make_optional(record{
.name = std::move(name),
.name = sstring(role_name),
.is_superuser = role->is_superuser,
.can_login = role->can_login,
.member_of = role->member_of
@@ -393,51 +392,21 @@ future<role_set> standard_role_manager::query_granted(std::string_view grantee_n
}
future<role_to_directly_granted_map> standard_role_manager::query_all_directly_granted(::service::query_state& qs) {
const sstring query = seastar::format("SELECT * FROM {}.{}",
db::system_keyspace::NAME,
ROLE_MEMBERS_CF);
const auto results = co_await _qp.execute_internal(
query,
db::consistency_level::ONE,
qs,
cql3::query_processor::cache_internal::yes);
role_to_directly_granted_map roles_map;
std::transform(
results->begin(),
results->end(),
std::inserter(roles_map, roles_map.begin()),
[] (const cql3::untyped_result_set_row& row) {
return std::make_pair(row.get_as<sstring>("member"), row.get_as<sstring>("role")); }
);
_cache.for_each_role([&roles_map] (const cache::role_name_t& name, const cache::role_record& record) {
for (const auto& granted_role : record.member_of) {
roles_map.emplace(name, granted_role);
}
});
co_return roles_map;
}
future<role_set> standard_role_manager::query_all(::service::query_state& qs) {
const sstring query = seastar::format("SELECT {} FROM {}.{}",
meta::roles_table::role_col_name,
db::system_keyspace::NAME,
meta::roles_table::name);
// To avoid many copies of a view.
static const auto role_col_name_string = sstring(meta::roles_table::role_col_name);
const auto results = co_await _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
qs,
cql3::query_processor::cache_internal::yes);
role_set roles;
std::transform(
results->begin(),
results->end(),
std::inserter(roles, roles.begin()),
[] (const cql3::untyped_result_set_row& row) {
return row.get_as<sstring>(role_col_name_string);}
);
roles.reserve(_cache.roles_count());
_cache.for_each_role([&roles] (const cache::role_name_t& name, const cache::role_record&) {
roles.insert(name);
});
co_return roles;
}
@@ -460,31 +429,26 @@ future<bool> standard_role_manager::can_login(std::string_view role_name) {
}
future<std::optional<sstring>> standard_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) {
const sstring query = seastar::format("SELECT name, value FROM {}.{} WHERE role = ? AND name = ?",
db::system_keyspace::NAME,
ROLE_ATTRIBUTES_CF);
const auto result_set = co_await _qp.execute_internal(query, db::consistency_level::ONE, qs, {sstring(role_name), sstring(attribute_name)}, cql3::query_processor::cache_internal::yes);
if (!result_set->empty()) {
const cql3::untyped_result_set_row &row = result_set->one();
co_return std::optional<sstring>(row.get_as<sstring>("value"));
auto role = _cache.get(role_name);
if (!role) {
co_return std::nullopt;
}
co_return std::optional<sstring>{};
auto it = role->attributes.find(attribute_name);
if (it != role->attributes.end()) {
co_return it->second;
}
co_return std::nullopt;
}
future<role_manager::attribute_vals> standard_role_manager::query_attribute_for_all (std::string_view attribute_name, ::service::query_state& qs) {
return query_all(qs).then([this, attribute_name, &qs] (role_set roles) {
return do_with(attribute_vals{}, [this, attribute_name, roles = std::move(roles), &qs] (attribute_vals &role_to_att_val) {
return parallel_for_each(roles.begin(), roles.end(), [this, &role_to_att_val, attribute_name, &qs] (sstring role) {
return get_attribute(role, attribute_name, qs).then([&role_to_att_val, role] (std::optional<sstring> att_val) {
if (att_val) {
role_to_att_val.emplace(std::move(role), std::move(*att_val));
}
});
}).then([&role_to_att_val] () {
return make_ready_future<attribute_vals>(std::move(role_to_att_val));
});
});
future<role_manager::attribute_vals> standard_role_manager::query_attribute_for_all(std::string_view attribute_name, ::service::query_state& qs) {
attribute_vals result;
_cache.for_each_role([&result, attribute_name] (const cache::role_name_t& name, const cache::role_record& record) {
auto it = record.attributes.find(attribute_name);
if (it != record.attributes.end()) {
result.emplace(name, it->second);
}
});
co_return result;
}
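The hunk above replaces one internal SELECT per role with a single pass over the in-memory role cache. A minimal sketch of that aggregation pattern, using simplified stand-in types (`role_record`, `role_cache` here are hypothetical, not Scylla's actual cache types):

```cpp
#include <map>
#include <string>

// Simplified stand-ins for the role cache types (hypothetical, not Scylla's).
struct role_record {
    std::map<std::string, std::string> attributes;
};
using role_cache = std::map<std::string, role_record>;

// Collect one attribute's value across all cached roles, skipping roles
// that don't define it -- the same shape as query_attribute_for_all.
std::map<std::string, std::string>
attribute_for_all(const role_cache& cache, const std::string& attr) {
    std::map<std::string, std::string> result;
    for (const auto& [name, record] : cache) {
        if (auto it = record.attributes.find(attr); it != record.attributes.end()) {
            result.emplace(name, it->second);
        }
    }
    return result;
}
```

The old implementation fanned out `get_attribute()` calls via `parallel_for_each`; the cache walk makes that fan-out (and the nested `do_with`/continuation plumbing) unnecessary.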
future<> standard_role_manager::set_attribute(std::string_view role_name, std::string_view attribute_name, std::string_view attribute_value, ::service::group0_batch& mc) {

cmake/FindLua.cmake

@@ -0,0 +1,47 @@
#
# Copyright 2025-present ScyllaDB
#
#
# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
#
# Custom FindLua module that uses pkg-config, matching configure.py's
# approach. CMake's built-in FindLua resolves to the versioned library
# (e.g. liblua-5.4.so) instead of the unversioned symlink (liblua.so),
# causing a name mismatch between the two build systems.
find_package(PkgConfig REQUIRED)
# configure.py: lua53 on Debian-like, lua on others
pkg_search_module(PC_lua QUIET lua53 lua)
find_library(Lua_LIBRARY
NAMES lua lua5.3 lua53
HINTS
${PC_lua_LIBDIR}
${PC_lua_LIBRARY_DIRS})
find_path(Lua_INCLUDE_DIR
NAMES lua.h
HINTS
${PC_lua_INCLUDEDIR}
${PC_lua_INCLUDE_DIRS})
mark_as_advanced(
Lua_LIBRARY
Lua_INCLUDE_DIR)
include(FindPackageHandleStandardArgs)
find_package_handle_standard_args(Lua
REQUIRED_VARS
Lua_LIBRARY
Lua_INCLUDE_DIR
VERSION_VAR PC_lua_VERSION)
if(Lua_FOUND)
set(LUA_LIBRARIES ${Lua_LIBRARY})
set(LUA_INCLUDE_DIR ${Lua_INCLUDE_DIR})
endif()


@@ -1,5 +1,5 @@
set(CMAKE_CXX_FLAGS_COVERAGE
"-fprofile-instr-generate -fcoverage-mapping -fprofile-list=${CMAKE_SOURCE_DIR}/coverage_sources.list"
"-fprofile-instr-generate -fcoverage-mapping"
CACHE
INTERNAL
"")
@@ -8,18 +8,33 @@ update_build_flags(Coverage
OPTIMIZATION_LEVEL "g")
set(scylla_build_mode_Coverage "coverage")
# Coverage mode sets cmake_build_type='Debug' for Seastar
# (configure.py:515), so Seastar's pkg-config --cflags output
# (configure.py:2252-2267, queried at configure.py:3039) includes debug
# defines, sanitizer compile flags, and -fstack-clash-protection.
# Seastar's CMake generator expressions only activate these for
# Debug/Sanitize configs, so we add them explicitly for Coverage.
set(Seastar_DEFINITIONS_COVERAGE
SCYLLA_BUILD_MODE=${scylla_build_mode_Coverage}
DEBUG
SANITIZE
DEBUG_LSA_SANITIZER
SCYLLA_ENABLE_ERROR_INJECTION)
SEASTAR_DEBUG
SEASTAR_DEFAULT_ALLOCATOR
SEASTAR_SHUFFLE_TASK_QUEUE
SEASTAR_DEBUG_SHARED_PTR
SEASTAR_DEBUG_PROMISE
SEASTAR_TYPE_ERASE_MORE)
foreach(definition ${Seastar_DEFINITIONS_COVERAGE})
add_compile_definitions(
$<$<CONFIG:Coverage>:${definition}>)
endforeach()
set(CMAKE_STATIC_LINKER_FLAGS_COVERAGE
add_compile_options(
$<$<CONFIG:Coverage>:-fsanitize=address>
$<$<CONFIG:Coverage>:-fsanitize=undefined>
$<$<CONFIG:Coverage>:-fsanitize=vptr>
$<$<CONFIG:Coverage>:-fstack-clash-protection>)
set(CMAKE_EXE_LINKER_FLAGS_COVERAGE
"-fprofile-instr-generate -fcoverage-mapping")
maybe_limit_stack_usage_in_KB(40 Coverage)


@@ -131,6 +131,7 @@ function(maybe_limit_stack_usage_in_KB stack_usage_threshold_in_KB config)
check_cxx_compiler_flag(${_stack_usage_threshold_flag} _stack_usage_flag_supported)
if(_stack_usage_flag_supported)
add_compile_options($<$<CONFIG:${config}>:${_stack_usage_threshold_flag}>)
add_compile_options($<$<CONFIG:${config}>:-Wno-error=stack-usage=>)
endif()
endfunction()
@@ -260,6 +261,23 @@ endif()
# Force SHA1 build-id generation
add_link_options("LINKER:--build-id=sha1")
# Match configure.py: add -fno-lto globally. configure.py adds -fno-lto to
# all binaries (except standalone cpp_apps like patchelf) via the per-binary
# $libs variable. LTO-enabled targets (scylla binary in RelWithDebInfo) will
# override with -flto=thin -ffat-lto-objects via enable_lto().
add_link_options(-fno-lto)
# Match configure.py:2633-2636 — sanitizer link flags for standalone binaries
# (e.g. patchelf) that don't link Seastar. Seastar-linked targets get these
# via seastar_libs (configure.py:2649).
# Coverage mode gets sanitizer link flags via the seastar target instead
# (see CMakeLists.txt), matching configure.py where only seastar_libs_coverage
# carries -fsanitize (not cxx_ld_flags).
add_link_options(
$<$<CONFIG:Debug,Sanitize>:-fsanitize=address>
$<$<CONFIG:Debug,Sanitize>:-fsanitize=undefined>)
include(CheckLinkerFlag)
set(Scylla_USE_LINKER
""


@@ -44,6 +44,7 @@
#include "dht/partition_filter.hh"
#include "mutation_writer/shard_based_splitting_writer.hh"
#include "mutation_writer/partition_based_splitting_writer.hh"
#include "mutation_writer/token_group_based_splitting_writer.hh"
#include "mutation/mutation_source_metadata.hh"
#include "mutation/mutation_fragment_stream_validator.hh"
#include "utils/assert.hh"
@@ -1933,6 +1934,7 @@ class resharding_compaction final : public compaction {
};
std::vector<estimated_values> _estimation_per_shard;
std::vector<sstables::run_id> _run_identifiers;
bool _reshard_vnodes;
private:
// return estimated partitions per sstable for a given shard
uint64_t partitions_per_sstable(shard_id s) const {
@@ -1945,7 +1947,11 @@ public:
: compaction(table_s, std::move(descriptor), cdata, progress_monitor, use_backlog_tracker::no)
, _estimation_per_shard(smp::count)
, _run_identifiers(smp::count)
, _reshard_vnodes(descriptor.options.as<compaction_type_options::reshard>().vnodes_resharding)
{
if (_reshard_vnodes && !_owned_ranges) {
on_internal_error(clogger, "Resharding vnodes requires owned_ranges");
}
for (auto& sst : _sstables) {
const auto& shards = sst->get_shards_for_this_sstable();
auto size = sst->bytes_on_disk();
@@ -1983,8 +1989,25 @@ public:
}
mutation_reader_consumer make_interposer_consumer(mutation_reader_consumer end_consumer) override {
return [end_consumer = std::move(end_consumer)] (mutation_reader reader) mutable -> future<> {
return mutation_writer::segregate_by_shard(std::move(reader), std::move(end_consumer));
auto owned_ranges = _reshard_vnodes ? _owned_ranges : nullptr;
return [end_consumer = std::move(end_consumer), owned_ranges = std::move(owned_ranges)] (mutation_reader reader) mutable -> future<> {
if (owned_ranges) {
auto classify = [owned_ranges, it = owned_ranges->begin(), idx = mutation_writer::token_group_id(0)] (dht::token t) mutable -> mutation_writer::token_group_id {
dht::token_comparator cmp;
while (it != owned_ranges->end() && it->after(t, cmp)) {
clogger.debug("Token {} is after current range {}: advancing to the next range", t, *it);
++it;
++idx;
}
if (it == owned_ranges->end() || !it->contains(t, cmp)) {
on_internal_error(clogger, fmt::format("Token {} is outside of owned ranges", t));
}
return idx;
};
return mutation_writer::segregate_by_token_group(std::move(reader), std::move(classify), std::move(end_consumer));
} else {
return mutation_writer::segregate_by_shard(std::move(reader), std::move(end_consumer));
}
};
}
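The `classify` lambda above walks a sorted, disjoint list of owned ranges once, assigning each token a monotonically increasing group id and erroring on tokens outside all ranges. That single-forward-scan pattern (valid because the reader emits tokens in non-decreasing order) can be sketched with plain integer intervals; the types here are illustrative stand-ins, not `dht::token` or Scylla's range classes:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Stand-in for an owned token range: a closed [start, end] interval.
// Ranges are assumed sorted and disjoint.
struct range { int start, end; };

// Mirrors the classify lambda: advance through the ranges while the token
// lies past the current one, then either return the group id or fail.
struct classifier {
    const std::vector<range>& ranges;
    std::size_t it = 0;
    std::size_t operator()(int t) {
        while (it < ranges.size() && t > ranges[it].end) {
            ++it; // token is past the current range: advance to the next
        }
        if (it == ranges.size() || t < ranges[it].start) {
            throw std::runtime_error("token outside of owned ranges");
        }
        return it;
    }
};
```

Because `it` only moves forward, classification of a whole stream is O(tokens + ranges), which is what lets `segregate_by_token_group` split vnode sstables at range boundaries without per-token binary searches.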


@@ -87,6 +87,8 @@ public:
drop_unfixable_sstables drop_unfixable = drop_unfixable_sstables::no;
};
struct reshard {
// If set, resharding compaction will apply the owned_ranges to segregate sstables in vnode boundaries.
bool vnodes_resharding = false;
};
struct reshape {
};
@@ -115,8 +117,8 @@ public:
return compaction_type_options(reshape{});
}
static compaction_type_options make_reshard() {
return compaction_type_options(reshard{});
static compaction_type_options make_reshard(bool vnodes_resharding = false) {
return compaction_type_options(reshard{.vnodes_resharding = vnodes_resharding});
}
static compaction_type_options make_regular() {


@@ -1268,9 +1268,15 @@ future<> compaction_manager::start(const db::config& cfg, utils::disk_space_moni
if (dsm && (this_shard_id() == 0)) {
_out_of_space_subscription = dsm->subscribe(cfg.critical_disk_utilization_level, [this] (auto threshold_reached) {
if (threshold_reached) {
return container().invoke_on_all([] (compaction_manager& cm) { return cm.drain(); });
return container().invoke_on_all([] (compaction_manager& cm) {
cm._in_critical_disk_utilization_mode = true;
return cm.drain();
});
}
return container().invoke_on_all([] (compaction_manager& cm) { cm.enable(); });
return container().invoke_on_all([] (compaction_manager& cm) {
cm._in_critical_disk_utilization_mode = false;
cm.enable();
});
});
}
@@ -2348,6 +2354,16 @@ future<compaction_manager::compaction_stats_opt> compaction_manager::perform_spl
return perform_task_on_all_files<split_compaction_task_executor>("split", info, t, std::move(options), std::move(owned_ranges_ptr), std::move(get_sstables), throw_if_stopping::no);
}
std::exception_ptr compaction_manager::make_disabled_exception(compaction::compaction_group_view& cg) {
std::exception_ptr ex;
if (_in_critical_disk_utilization_mode) {
ex = std::make_exception_ptr(std::runtime_error("critical disk utilization"));
} else {
ex = std::make_exception_ptr(compaction_stopped_exception(cg.schema()->ks_name(), cg.schema()->cf_name(), "compaction disabled"));
}
return ex;
}
future<std::vector<sstables::shared_sstable>>
compaction_manager::maybe_split_new_sstable(sstables::shared_sstable sst, compaction_group_view& t, compaction_type_options::split opt) {
if (!split_compaction_task_executor::sstable_needs_split(sst, opt)) {
@@ -2357,8 +2373,7 @@ compaction_manager::maybe_split_new_sstable(sstables::shared_sstable sst, compac
// We don't want to prevent split because compaction is temporarily disabled on a view only for synchronization,
// which is unneeded against new sstables that aren't part of any set yet, so never use can_proceed(&t) here.
if (is_disabled()) {
co_return coroutine::exception(std::make_exception_ptr(std::runtime_error(format("Cannot split {} because manager has compaction disabled, " \
"reason might be out of space prevention", sst->get_filename()))));
co_return coroutine::exception(make_disabled_exception(t));
}
std::vector<sstables::shared_sstable> ret;


@@ -115,6 +115,8 @@ private:
uint32_t _disabled_state_count = 0;
bool is_disabled() const { return _state != state::running || _disabled_state_count > 0; }
// precondition: is_disabled() is true.
std::exception_ptr make_disabled_exception(compaction::compaction_group_view& cg);
std::optional<future<>> _stop_future;
@@ -170,6 +172,7 @@ private:
shared_tombstone_gc_state _shared_tombstone_gc_state;
utils::disk_space_monitor::subscription _out_of_space_subscription;
bool _in_critical_disk_utilization_mode = false;
private:
// Requires task->_compaction_state.gate to be held and task to be registered in _tasks.
future<compaction_stats_opt> perform_task(shared_ptr<compaction::compaction_task_executor> task, throw_if_stopping do_throw_if_stopping);


@@ -132,7 +132,7 @@ distribute_reshard_jobs(sstables::sstable_directory::sstable_open_info_vector so
// A creator function must be passed that will create an SSTable object in the correct shard,
// and an I/O priority must be specified.
future<> reshard(sstables::sstable_directory& dir, sstables::sstable_directory::sstable_open_info_vector shared_info, replica::table& table,
compaction::compaction_sstable_creator_fn creator, compaction::owned_ranges_ptr owned_ranges_ptr, tasks::task_info parent_info)
compaction::compaction_sstable_creator_fn creator, compaction::owned_ranges_ptr owned_ranges_ptr, bool vnodes_resharding, tasks::task_info parent_info)
{
// Resharding doesn't like empty sstable sets, so bail early. There is nothing
// to reshard in this shard.
@@ -160,13 +160,22 @@ future<> reshard(sstables::sstable_directory& dir, sstables::sstable_directory::
// There is a semaphore inside the compaction manager in run_resharding_jobs. So we
// parallel_for_each so the statistics about pending jobs are updated to reflect all
// jobs. But only one will run in parallel at a time
auto& t = table.try_get_compaction_group_view_with_static_sharding();
//
// The compaction group view is used here only for job registration and gate-holding;
// resharding never reads or writes the group's own SSTables. With static (vnode)
// sharding there is exactly one group per shard; with tablets there may be many.
// In either case, any registered group suffices.
auto* cg = table.get_any_compaction_group();
if (!cg) {
on_internal_error(tasks::tmlogger, format("No compaction group found for table {}.{}", table.schema()->ks_name(), table.schema()->cf_name()));
}
auto& t = cg->view_for_unrepaired_data();
co_await coroutine::parallel_for_each(buckets, [&] (std::vector<sstables::shared_sstable>& sstlist) mutable {
return table.get_compaction_manager().run_custom_job(t, compaction_type::Reshard, "Reshard compaction", [&] (compaction_data& info, compaction_progress_monitor& progress_monitor) -> future<> {
auto erm = table.get_effective_replication_map(); // keep alive around compaction.
compaction_descriptor desc(sstlist);
desc.options = compaction_type_options::make_reshard();
desc.options = compaction_type_options::make_reshard(vnodes_resharding);
desc.creator = creator;
desc.sharder = &erm->get_sharder(*table.schema());
desc.owned_ranges = owned_ranges_ptr;
@@ -906,7 +915,7 @@ future<> table_resharding_compaction_task_impl::run() {
if (_owned_ranges_ptr) {
local_owned_ranges_ptr = make_lw_shared<const dht::token_range_vector>(*_owned_ranges_ptr);
}
auto task = co_await compaction_module.make_and_start_task<shard_resharding_compaction_task_impl>(parent_info, _status.keyspace, _status.table, _status.id, _dir, db, _creator, std::move(local_owned_ranges_ptr), destinations);
auto task = co_await compaction_module.make_and_start_task<shard_resharding_compaction_task_impl>(parent_info, _status.keyspace, _status.table, _status.id, _dir, db, _creator, std::move(local_owned_ranges_ptr), _vnodes_resharding, destinations);
co_await task->done();
}));
@@ -926,12 +935,14 @@ shard_resharding_compaction_task_impl::shard_resharding_compaction_task_impl(tas
replica::database& db,
compaction_sstable_creator_fn creator,
compaction::owned_ranges_ptr local_owned_ranges_ptr,
bool vnodes_resharding,
std::vector<replica::reshard_shard_descriptor>& destinations) noexcept
: resharding_compaction_task_impl(module, tasks::task_id::create_random_id(), 0, "shard", std::move(keyspace), std::move(table), "", parent_id)
, _dir(dir)
, _db(db)
, _creator(std::move(creator))
, _local_owned_ranges_ptr(std::move(local_owned_ranges_ptr))
, _vnodes_resharding(vnodes_resharding)
, _destinations(destinations)
{
_expected_workload = _destinations[this_shard_id()].size();
@@ -941,7 +952,7 @@ future<> shard_resharding_compaction_task_impl::run() {
auto& table = _db.find_column_family(_status.keyspace, _status.table);
auto info_vec = std::move(_destinations[this_shard_id()].info_vec);
tasks::task_info info{_status.id, _status.shard};
co_await reshard(_dir.local(), std::move(info_vec), table, _creator, std::move(_local_owned_ranges_ptr), info);
co_await reshard(_dir.local(), std::move(info_vec), table, _creator, std::move(_local_owned_ranges_ptr), _vnodes_resharding, info);
co_await _dir.local().move_foreign_sstables(_dir);
}


@@ -693,6 +693,7 @@ private:
sharded<replica::database>& _db;
compaction_sstable_creator_fn _creator;
compaction::owned_ranges_ptr _owned_ranges_ptr;
bool _vnodes_resharding;
public:
table_resharding_compaction_task_impl(tasks::task_manager::module_ptr module,
std::string keyspace,
@@ -700,12 +701,14 @@ public:
sharded<sstables::sstable_directory>& dir,
sharded<replica::database>& db,
compaction_sstable_creator_fn creator,
compaction::owned_ranges_ptr owned_ranges_ptr) noexcept
compaction::owned_ranges_ptr owned_ranges_ptr,
bool vnodes_resharding) noexcept
: resharding_compaction_task_impl(module, tasks::task_id::create_random_id(), module->new_sequence_number(), "table", std::move(keyspace), std::move(table), "", tasks::task_id::create_null_id())
, _dir(dir)
, _db(db)
, _creator(std::move(creator))
, _owned_ranges_ptr(std::move(owned_ranges_ptr))
, _vnodes_resharding(vnodes_resharding)
{}
protected:
virtual future<> run() override;
@@ -718,6 +721,7 @@ private:
replica::database& _db;
compaction_sstable_creator_fn _creator;
compaction::owned_ranges_ptr _local_owned_ranges_ptr;
bool _vnodes_resharding;
std::vector<replica::reshard_shard_descriptor>& _destinations;
public:
shard_resharding_compaction_task_impl(tasks::task_manager::module_ptr module,
@@ -728,6 +732,7 @@ public:
replica::database& db,
compaction_sstable_creator_fn creator,
compaction::owned_ranges_ptr local_owned_ranges_ptr,
bool vnodes_resharding,
std::vector<replica::reshard_shard_descriptor>& destinations) noexcept;
protected:
virtual future<> run() override;


@@ -397,6 +397,17 @@ commitlog_total_space_in_mb: -1
# you can cache more hot rows
# column_index_size_in_kb: 64
# sstable format version for newly written sstables.
# Currently allowed values are `me` and `ms`.
# If not specified in the config, this defaults to `me`.
#
# The difference between `me` and `ms` are the data structures used
# in the primary index.
# In short, `ms` needs more CPU during sstable writes,
# but should behave better during reads,
# although it might behave worse for very long clustering keys.
sstable_format: ms
# Auto-scaling of the promoted index prevents running out of memory
# when the promoted index grows too large (due to partitions with many rows
# vs. too small column_index_size_in_kb). When the serialized representation
@@ -477,6 +488,7 @@ commitlog_total_space_in_mb: -1
# compressed.
# can be: all - all traffic is compressed
# dc - traffic between different datacenters is compressed
# rack - traffic between different racks is compressed
# none - nothing is compressed.
# internode_compression: none
@@ -572,8 +584,7 @@ commitlog_total_space_in_mb: -1
audit: "table"
#
# List of statement categories that should be audited.
# Possible categories are: QUERY, DML, DCL, DDL, AUTH, ADMIN
audit_categories: "DCL,AUTH,ADMIN"
audit_categories: "DCL,DDL,AUTH,ADMIN"
#
# List of tables that should be audited.
# audit_tables: "<keyspace_name>.<table_name>,<keyspace_name>.<table_name>"


@@ -896,6 +896,9 @@ scylla_core = (['message/messaging_service.cc',
'replica/multishard_query.cc',
'replica/mutation_dump.cc',
'replica/querier.cc',
'replica/logstor/segment_manager.cc',
'replica/logstor/logstor.cc',
'replica/logstor/write_buffer.cc',
'mutation/atomic_cell.cc',
'mutation/canonical_mutation.cc',
'mutation/frozen_mutation.cc',
@@ -1467,6 +1470,7 @@ idls = ['idl/gossip_digest.idl.hh',
'idl/query.idl.hh',
'idl/idl_test.idl.hh',
'idl/commitlog.idl.hh',
'idl/logstor.idl.hh',
'idl/tracing.idl.hh',
'idl/consistency_level.idl.hh',
'idl/cache_temperature.idl.hh',
@@ -1704,12 +1708,14 @@ deps['test/boost/combined_tests'] += [
'test/boost/sstable_compression_config_test.cc',
'test/boost/sstable_directory_test.cc',
'test/boost/sstable_set_test.cc',
'test/boost/sstable_tablet_streaming.cc',
'test/boost/statement_restrictions_test.cc',
'test/boost/storage_proxy_test.cc',
'test/boost/tablets_test.cc',
'test/boost/tracing_test.cc',
'test/boost/user_function_test.cc',
'test/boost/user_types_test.cc',
'test/boost/vector_index_test.cc',
'test/boost/view_build_test.cc',
'test/boost/view_complex_test.cc',
'test/boost/view_schema_ckey_test.cc',
@@ -2227,16 +2233,20 @@ abseil_libs = ['absl/' + lib for lib in [
'container/libabsl_raw_hash_set.a',
'synchronization/libabsl_synchronization.a',
'synchronization/libabsl_graphcycles_internal.a',
'synchronization/libabsl_kernel_timeout_internal.a',
'debugging/libabsl_stacktrace.a',
'debugging/libabsl_symbolize.a',
'debugging/libabsl_debugging_internal.a',
'debugging/libabsl_demangle_internal.a',
'debugging/libabsl_demangle_rust.a',
'debugging/libabsl_decode_rust_punycode.a',
'debugging/libabsl_utf8_for_code_point.a',
'debugging/libabsl_borrowed_fixup_buffer.a',
'time/libabsl_time.a',
'time/libabsl_time_zone.a',
'numeric/libabsl_int128.a',
'hash/libabsl_hash.a',
'hash/libabsl_city.a',
'hash/libabsl_low_level_hash.a',
'base/libabsl_malloc_internal.a',
'base/libabsl_spinlock_wait.a',
'base/libabsl_base.a',


@@ -201,6 +201,10 @@ public:
return _clustering_columns_restrictions;
}
const expr::expression& get_nonprimary_key_restrictions() const {
return _nonprimary_key_restrictions;
}
// Get a set of columns restricted by the IS NOT NULL restriction.
// IS NOT NULL is a special case that is handled separately from other restrictions.
const std::unordered_set<const column_definition*> get_not_null_columns() const;


@@ -265,7 +265,10 @@ future<shared_ptr<cql_transport::messages::result_message>> batch_statement::do_
if (guardrail_state == query_processor::write_consistency_guardrail_state::FAIL) {
return make_exception_future<shared_ptr<cql_transport::messages::result_message>>(
exceptions::invalid_request_exception(
format("Consistency level {} is not allowed for write operations", cl)));
format("Write consistency level {} is forbidden by the current configuration "
"setting of write_consistency_levels_disallowed. Please use a different "
"consistency level, or remove {} from write_consistency_levels_disallowed "
"set in the configuration.", cl, cl)));
}
for (size_t i = 0; i < _statements.size(); ++i) {
@@ -277,7 +280,8 @@ future<shared_ptr<cql_transport::messages::result_message>> batch_statement::do_
_stats.statements_in_cas_batches += _statements.size();
return execute_with_conditions(qp, options, query_state).then([guardrail_state, cl] (auto result) {
if (guardrail_state == query_processor::write_consistency_guardrail_state::WARN) {
result->add_warning(format("Write with consistency level {} is warned by guardrail configuration", cl));
result->add_warning(format("Using write consistency level {} listed on the "
"write_consistency_levels_warned is not recommended.", cl));
}
return result;
});
@@ -297,7 +301,8 @@ future<shared_ptr<cql_transport::messages::result_message>> batch_statement::do_
}
auto result = make_shared<cql_transport::messages::result_message::void_message>();
if (guardrail_state == query_processor::write_consistency_guardrail_state::WARN) {
result->add_warning(format("Write with consistency level {} is warned by guardrail configuration", cl));
result->add_warning(format("Using write consistency level {} listed on the "
"write_consistency_levels_warned is not recommended.", cl));
}
return make_ready_future<shared_ptr<cql_transport::messages::result_message>>(std::move(result));
});


@@ -59,6 +59,8 @@ const sstring cf_prop_defs::COMPACTION_ENABLED_KEY = "enabled";
const sstring cf_prop_defs::KW_TABLETS = "tablets";
const sstring cf_prop_defs::KW_STORAGE_ENGINE = "storage_engine";
schema::extensions_map cf_prop_defs::make_schema_extensions(const db::extensions& exts) const {
schema::extensions_map er;
for (auto& p : exts.schema_extensions()) {
@@ -106,6 +108,7 @@ void cf_prop_defs::validate(const data_dictionary::database db, sstring ks_name,
KW_BF_FP_CHANCE, KW_MEMTABLE_FLUSH_PERIOD, KW_COMPACTION,
KW_COMPRESSION, KW_CRC_CHECK_CHANCE, KW_ID, KW_PAXOSGRACESECONDS,
KW_SYNCHRONOUS_UPDATES, KW_TABLETS,
KW_STORAGE_ENGINE,
});
static std::set<sstring> obsolete_keywords({
sstring("index_interval"),
@@ -196,6 +199,20 @@ void cf_prop_defs::validate(const data_dictionary::database db, sstring ks_name,
}
db::tablet_options::validate(*tablet_options_map);
}
if (has_property(KW_STORAGE_ENGINE)) {
auto storage_engine = get_string(KW_STORAGE_ENGINE, "");
if (storage_engine == "logstor") {
if (!db.features().logstor) {
throw exceptions::configuration_exception(format("The experimental feature 'logstor' must be enabled in order to use the 'logstor' storage engine."));
}
if (!db.get_config().enable_logstor()) {
throw exceptions::configuration_exception(format("The configuration option 'enable_logstor' must be set to true in the configuration in order to use the 'logstor' storage engine."));
}
} else {
throw exceptions::configuration_exception(format("Illegal value for '{}'", KW_STORAGE_ENGINE));
}
}
}
std::map<sstring, sstring> cf_prop_defs::get_compaction_type_options() const {
@@ -396,6 +413,13 @@ void cf_prop_defs::apply_to_builder(schema_builder& builder, schema::extensions_
if (auto tablet_options_opt = get_map(KW_TABLETS)) {
builder.set_tablet_options(std::move(*tablet_options_opt));
}
if (has_property(KW_STORAGE_ENGINE)) {
auto storage_engine = get_string(KW_STORAGE_ENGINE, "");
if (storage_engine == "logstor") {
builder.set_logstor();
}
}
}
void cf_prop_defs::validate_minimum_int(const sstring& field, int32_t minimum_value, int32_t default_value) const


@@ -64,6 +64,8 @@ public:
static const sstring KW_TABLETS;
static const sstring KW_STORAGE_ENGINE;
// FIXME: In origin the following consts are in CFMetaData.
static constexpr int32_t DEFAULT_DEFAULT_TIME_TO_LIVE = 0;
static constexpr int32_t DEFAULT_MIN_INDEX_INTERVAL = 128;


@@ -8,6 +8,7 @@
* SPDX-License-Identifier: (LicenseRef-ScyllaDB-Source-Available-1.0 and Apache-2.0)
*/
#include <boost/algorithm/string.hpp>
#include <seastar/core/coroutine.hh>
#include "create_index_statement.hh"
#include "db/config.hh"
@@ -35,8 +36,10 @@
#include "db/schema_tables.hh"
#include "index/secondary_index_manager.hh"
#include "types/concrete_types.hh"
#include "types/vector.hh"
#include "db/tags/extension.hh"
#include "tombstone_gc_extension.hh"
#include "index/secondary_index.hh"
#include <stdexcept>
@@ -116,6 +119,58 @@ static data_type type_for_computed_column(cql3::statements::index_target::target
}
}
// Cassandra SAI compatibility: detect the StorageAttachedIndex class name
// used by Cassandra to create vector and metadata indexes.
static bool is_sai_class_name(const sstring& class_name) {
return class_name == "org.apache.cassandra.index.sai.StorageAttachedIndex"
|| boost::iequals(class_name, "storageattachedindex")
|| boost::iequals(class_name, "sai");
}
// Returns true if the custom class name refers to a vector-capable index
// (either ScyllaDB's native vector_index or Cassandra's SAI).
static bool is_vector_capable_class(const sstring& class_name) {
return class_name == "vector_index" || is_sai_class_name(class_name);
}
// When the custom class is SAI, verify that at least one target is a
// vector column and rewrite the class to ScyllaDB's native "vector_index".
// Non-vector single-column targets and multi-column (local-index partition
// key) targets are skipped — they are treated as filtering columns by
// vector_index::check_target().
static void maybe_rewrite_sai_to_vector_index(
const schema& schema,
const std::vector<::shared_ptr<index_target>>& targets,
index_specific_prop_defs& props) {
if (!props.custom_class || !is_sai_class_name(*props.custom_class)) {
return;
}
for (const auto& target : targets) {
auto* ident = std::get_if<::shared_ptr<column_identifier>>(&target->value);
if (!ident) {
// Multi-column target (local-index partition key) — skip.
continue;
}
auto cd = schema.get_column_definition((*ident)->name());
if (!cd) {
// Nonexistent column — skip; vector_index::validate() will catch it.
continue;
}
if (dynamic_cast<const vector_type_impl*>(cd->type.get())) {
props.custom_class = "vector_index";
return;
}
}
throw exceptions::invalid_request_exception(
"StorageAttachedIndex (SAI) is only supported on vector columns; "
"use a secondary index for non-vector columns");
}
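The class-name matching in `is_sai_class_name` mixes one exact comparison (the fully-qualified Cassandra class) with two case-insensitive alias checks via `boost::iequals`. A self-contained sketch of that predicate, with a hand-rolled ASCII `iequals` standing in for the Boost call:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// ASCII-only case-insensitive compare, standing in for boost::iequals.
static bool iequals(const std::string& a, const std::string& b) {
    return a.size() == b.size() &&
        std::equal(a.begin(), a.end(), b.begin(), [](char x, char y) {
            return std::tolower(static_cast<unsigned char>(x)) ==
                   std::tolower(static_cast<unsigned char>(y));
        });
}

// Mirrors is_sai_class_name: the fully-qualified Cassandra class name is
// matched exactly, the short aliases case-insensitively.
static bool is_sai_class_name(const std::string& class_name) {
    return class_name == "org.apache.cassandra.index.sai.StorageAttachedIndex"
        || iequals(class_name, "storageattachedindex")
        || iequals(class_name, "sai");
}
```

Matching the aliases loosely keeps `CREATE CUSTOM INDEX ... USING 'SAI'` written for Cassandra working verbatim, while the rewrite to `"vector_index"` later in the hunk routes it to ScyllaDB's native implementation.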
static bool is_vector_index(const index_options_map& options) {
auto class_it = options.find(db::index::secondary_index::custom_class_option_name);
return class_it != options.end() && is_vector_capable_class(class_it->second);
}
view_ptr create_index_statement::create_view_for_index(const schema_ptr schema, const index_metadata& im,
const data_dictionary::database& db) const
{
@@ -265,8 +320,8 @@ create_index_statement::validate(query_processor& qp, const service::client_stat
_idx_properties->validate();
// FIXME: This is ugly and can be improved.
const bool is_vector_index = _idx_properties->custom_class && *_idx_properties->custom_class == "vector_index";
const bool is_vector_index = _idx_properties->custom_class && is_vector_capable_class(*_idx_properties->custom_class);
const bool uses_view_properties = _view_properties.properties()->count() > 0
|| _view_properties.use_compact_storage()
|| _view_properties.defined_ordering().size() > 0;
@@ -352,6 +407,8 @@ create_index_statement::validate_while_executing(data_dictionary::database db, l
targets.emplace_back(raw_target->prepare(*schema));
}
maybe_rewrite_sai_to_vector_index(*schema, targets, *_idx_properties);
if (_idx_properties && _idx_properties->custom_class) {
auto custom_index_factory = secondary_index::secondary_index_manager::get_custom_class_factory(*_idx_properties->custom_class);
if (!custom_index_factory) {
@@ -697,7 +754,9 @@ index_metadata create_index_statement::make_index_metadata(const std::vector<::s
const index_options_map& options)
{
index_options_map new_options = options;
auto target_option = secondary_index::target_parser::serialize_targets(targets);
auto target_option = is_vector_index(options)
? secondary_index::vector_index::serialize_targets(targets)
: secondary_index::target_parser::serialize_targets(targets);
new_options.emplace(index_target::target_option_name, target_option);
const auto& first_target = targets.front()->value;


@@ -9,6 +9,7 @@
*/
#include "cql3/statements/cf_prop_defs.hh"
#include "utils/assert.hh"
#include <inttypes.h>
#include <boost/regex.hpp>
@@ -266,6 +267,13 @@ std::unique_ptr<prepared_statement> create_table_statement::raw_statement::prepa
stmt_warning("CREATE TABLE WITH COMPACT STORAGE is deprecated and will eventually be removed in a future version.");
}
if (_properties.properties()->has_property(cf_prop_defs::KW_STORAGE_ENGINE)) {
auto storage_engine = _properties.properties()->get_string(cf_prop_defs::KW_STORAGE_ENGINE, "");
if (storage_engine == "logstor" && !_column_aliases.empty()) {
throw exceptions::configuration_exception("The 'logstor' storage engine cannot be used with tables that have clustering columns");
}
}
auto& key_aliases = _key_aliases[0];
std::vector<data_type> key_types;
for (auto&& alias : key_aliases) {


@@ -273,7 +273,10 @@ modification_statement::do_execute(query_processor& qp, service::query_state& qs
if (guardrail_state == query_processor::write_consistency_guardrail_state::FAIL) {
co_return coroutine::exception(
std::make_exception_ptr(exceptions::invalid_request_exception(
format("Consistency level {} is not allowed for write operations", cl))));
format("Write consistency level {} is forbidden by the current configuration "
"setting of write_consistency_levels_disallowed. Please use a different "
"consistency level, or remove {} from write_consistency_levels_disallowed "
"set in the configuration.", cl, cl))));
}
_restrictions->validate_primary_key(options);
@@ -281,7 +284,8 @@ modification_statement::do_execute(query_processor& qp, service::query_state& qs
if (has_conditions()) {
auto result = co_await execute_with_condition(qp, qs, options);
if (guardrail_state == query_processor::write_consistency_guardrail_state::WARN) {
result->add_warning(format("Write with consistency level {} is warned by guardrail configuration", cl));
result->add_warning(format("Using write consistency level {} listed on the "
"write_consistency_levels_warned is not recommended.", cl));
}
co_return result;
}
@@ -303,7 +307,8 @@ modification_statement::do_execute(query_processor& qp, service::query_state& qs
auto result = seastar::make_shared<cql_transport::messages::result_message::void_message>();
if (guardrail_state == query_processor::write_consistency_guardrail_state::WARN) {
result->add_warning(format("Write with consistency level {} is warned by guardrail configuration", cl));
result->add_warning(format("Using write consistency level {} listed on the "
"write_consistency_levels_warned is not recommended.", cl));
}
if (keys_size_one) {
auto&& table = s->table();


@@ -52,6 +52,7 @@ future<shared_ptr<result_message>> modification_statement::execute_without_check
}
auto [coordinator, holder] = qp.acquire_strongly_consistent_coordinator();
const auto mutate_result = co_await coordinator.get().mutate(_statement->s,
keys[0].start()->value().token(),
[&](api::timestamp_type ts) {
@@ -65,7 +66,7 @@ future<shared_ptr<result_message>> modification_statement::execute_without_check
raw_cql_statement, muts.size()));
}
return std::move(*muts.begin());
});
}, timeout, qs.get_client_state().get_abort_source());
using namespace service::strong_consistency;
if (const auto* redirect = get_if<need_redirect>(&mutate_result)) {

View File

@@ -42,7 +42,7 @@ future<::shared_ptr<result_message>> select_statement::do_execute(query_processo
const auto timeout = db::timeout_clock::now() + get_timeout(state.get_client_state(), options);
auto [coordinator, holder] = qp.acquire_strongly_consistent_coordinator();
auto query_result = co_await coordinator.get().query(_query_schema, *read_command,
key_ranges, state.get_trace_state(), timeout);
key_ranges, state.get_trace_state(), timeout, state.get_client_state().get_abort_source());
using namespace service::strong_consistency;
if (const auto* redirect = get_if<need_redirect>(&query_result)) {
@@ -54,4 +54,4 @@ future<::shared_ptr<result_message>> select_statement::do_execute(query_processo
read_command, options, now);
}
}
}

View File

@@ -250,8 +250,8 @@ void keyspace_metadata::validate(const gms::feature_service& fs, const locator::
if (params.consistency && !fs.strongly_consistent_tables) {
throw exceptions::configuration_exception("The strongly_consistent_tables feature must be enabled to use a consistency option");
}
if (params.consistency && *params.consistency == data_dictionary::consistency_config_option::global) {
throw exceptions::configuration_exception("Global consistency is not supported yet");
if (params.consistency && *params.consistency == data_dictionary::consistency_config_option::local) {
throw exceptions::configuration_exception("Local consistency is not supported yet");
}
}

View File

@@ -679,6 +679,8 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"The directory where hints files are stored if hinted handoff is enabled.")
, view_hints_directory(this, "view_hints_directory", value_status::Used, "",
"The directory where materialized-view updates are stored while a view replica is unreachable.")
, logstor_directory(this, "logstor_directory", value_status::Used, "",
"The directory where data files for logstor storage are stored.")
, saved_caches_directory(this, "saved_caches_directory", value_status::Unused, "",
"The directory location where table key and row caches are stored.")
/**
@@ -862,6 +864,14 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"* offheap_objects Native memory, eliminating NIO buffer heap overhead.")
, memtable_cleanup_threshold(this, "memtable_cleanup_threshold", value_status::Invalid, .11,
"Ratio of occupied non-flushing memtable size to total permitted size for triggering a flush of the largest memtable. Larger values mean larger flushes and less compaction, but also less concurrent flush activity, which can make it difficult to keep your disks saturated under heavy write load.")
, logstor_disk_size_in_mb(this, "logstor_disk_size_in_mb", value_status::Used, 2048,
"Total size in megabytes allocated for logstor storage on disk.")
, logstor_file_size_in_mb(this, "logstor_file_size_in_mb", value_status::Used, 32,
"Total size in megabytes allocated for each logstor data file on disk.")
, logstor_separator_delay_limit_ms(this, "logstor_separator_delay_limit_ms", value_status::Used, 100,
"Maximum delay in milliseconds for logstor separator debt control.")
, logstor_separator_max_memory_in_mb(this, "logstor_separator_max_memory_in_mb", value_status::Used, 256,
"Maximum memory in megabytes for logstor separator memory buffers.")
, file_cache_size_in_mb(this, "file_cache_size_in_mb", value_status::Unused, 512,
"Total memory to use for SSTable-reading buffers.")
, memtable_flush_queue_size(this, "memtable_flush_queue_size", value_status::Unused, 4,
@@ -1281,6 +1291,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, enable_in_memory_data_store(this, "enable_in_memory_data_store", value_status::Used, false, "Enable in memory mode (system tables are always persisted).")
, enable_cache(this, "enable_cache", value_status::Used, true, "Enable cache.")
, enable_commitlog(this, "enable_commitlog", value_status::Used, true, "Enable commitlog.")
, enable_logstor(this, "enable_logstor", value_status::Used, false, "Enable the logstor storage engine.")
, volatile_system_keyspace_for_testing(this, "volatile_system_keyspace_for_testing", value_status::Used, false, "Don't persist system keyspace - testing only!")
, api_port(this, "api_port", value_status::Used, 10000, "Http Rest API port.")
, api_address(this, "api_address", value_status::Used, "", "Http Rest API address.")
@@ -1571,7 +1582,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"\tnone : No auditing enabled.\n"
"\tsyslog : Audit messages sent to Syslog.\n"
"\ttable : Audit messages written to column family named audit.audit_log.\n")
, audit_categories(this, "audit_categories", liveness::LiveUpdate, value_status::Used, "DCL,AUTH,ADMIN", "Comma separated list of operation categories that should be audited.")
, audit_categories(this, "audit_categories", liveness::LiveUpdate, value_status::Used, "DCL,DDL,AUTH,ADMIN", "Comma separated list of operation categories that should be audited.")
, audit_tables(this, "audit_tables", liveness::LiveUpdate, value_status::Used, "", "Comma separated list of table names (<keyspace>.<table>) that will be audited.")
, audit_keyspaces(this, "audit_keyspaces", liveness::LiveUpdate, value_status::Used, "", "Comma separated list of keyspaces that will be audited. All tables in those keyspaces will be audited")
, audit_unix_socket_path(this, "audit_unix_socket_path", value_status::Used, "/dev/log", "The path to the unix socket used for writing to syslog. Only applicable when audit is set to syslog.")
@@ -1692,6 +1703,7 @@ void db::config::setup_directories() {
maybe_in_workdir(data_file_directories, "data");
maybe_in_workdir(hints_directory, "hints");
maybe_in_workdir(view_hints_directory, "view_hints");
maybe_in_workdir(logstor_directory, "logstor");
maybe_in_workdir(saved_caches_directory, "saved_caches");
}
@@ -1861,7 +1873,8 @@ std::map<sstring, db::experimental_features_t::feature> db::experimental_feature
{"keyspace-storage-options", feature::KEYSPACE_STORAGE_OPTIONS},
{"tablets", feature::UNUSED},
{"views-with-tablets", feature::UNUSED},
{"strongly-consistent-tables", feature::STRONGLY_CONSISTENT_TABLES}
{"strongly-consistent-tables", feature::STRONGLY_CONSISTENT_TABLES},
{"logstor", feature::LOGSTOR}
};
}
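Taken together, the new knobs might appear in `scylla.yaml` roughly as follows (illustrative values only, taken from the defaults in the hunks above; `logstor` must also be listed under `experimental_features` per the feature map just added):

```yaml
# Illustrative scylla.yaml fragment for the new logstor options
experimental_features:
    - logstor
enable_logstor: true
logstor_directory: /var/lib/scylla/logstor
logstor_disk_size_in_mb: 2048
logstor_file_size_in_mb: 32
logstor_separator_delay_limit_ms: 100
logstor_separator_max_memory_in_mb: 256
```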

View File

@@ -117,7 +117,8 @@ struct experimental_features_t {
ALTERNATOR_STREAMS,
BROADCAST_TABLES,
KEYSPACE_STORAGE_OPTIONS,
STRONGLY_CONSISTENT_TABLES
STRONGLY_CONSISTENT_TABLES,
LOGSTOR,
};
static std::map<sstring, feature> map(); // See enum_option.
static std::vector<enum_option<experimental_features_t>> all();
@@ -201,6 +202,7 @@ public:
named_value<uint64_t> data_file_capacity;
named_value<sstring> hints_directory;
named_value<sstring> view_hints_directory;
named_value<sstring> logstor_directory;
named_value<sstring> saved_caches_directory;
named_value<sstring> commit_failure_policy;
named_value<sstring> disk_failure_policy;
@@ -244,6 +246,10 @@ public:
named_value<bool> defragment_memory_on_idle;
named_value<sstring> memtable_allocation_type;
named_value<double> memtable_cleanup_threshold;
named_value<uint32_t> logstor_disk_size_in_mb;
named_value<uint32_t> logstor_file_size_in_mb;
named_value<uint32_t> logstor_separator_delay_limit_ms;
named_value<uint32_t> logstor_separator_max_memory_in_mb;
named_value<uint32_t> file_cache_size_in_mb;
named_value<uint32_t> memtable_flush_queue_size;
named_value<uint32_t> memtable_flush_writers;
@@ -364,6 +370,7 @@ public:
named_value<bool> enable_in_memory_data_store;
named_value<bool> enable_cache;
named_value<bool> enable_commitlog;
named_value<bool> enable_logstor;
named_value<bool> volatile_system_keyspace_for_testing;
named_value<uint16_t> api_port;
named_value<sstring> api_address;

View File

@@ -22,7 +22,7 @@ corrupt_data_handler::corrupt_data_handler(register_metrics rm) {
_metrics.add_group("corrupt_data", {
sm::make_counter("entries_reported", _stats.corrupt_data_reported,
sm::description("Counts the number of corrupt data instances reported to the corrupt data handler. "
"A non-zero value indicates that the database suffered data corruption."))
"A non-zero value indicates that the database suffered data corruption.")).set_skip_when_empty()
});
}
}

View File

@@ -50,9 +50,7 @@ future<> hint_endpoint_manager::do_store_hint(schema_ptr s, lw_shared_ptr<const
size_t mut_size = fm->representation().size();
shard_stats().size_of_hints_in_progress += mut_size;
if (utils::get_local_injector().enter("slow_down_writing_hints")) {
co_await seastar::sleep(std::chrono::seconds(10));
}
co_await utils::get_local_injector().inject("slow_down_writing_hints", std::chrono::seconds(10));
try {
const auto shared_lock = co_await get_shared_lock(file_update_mutex());

View File

@@ -186,7 +186,7 @@ void manager::register_metrics(const sstring& group_name) {
sm::description("Number of unexpected errors during sending, sending will be retried later")),
sm::make_counter("corrupted_files", _stats.corrupted_files,
sm::description("Number of hints files that were discarded during sending because the file was corrupted.")),
sm::description("Number of hints files that were discarded during sending because the file was corrupted.")).set_skip_when_empty(),
sm::make_gauge("pending_drains",
sm::description("Number of tasks waiting in the queue for draining hints"),

View File

@@ -206,7 +206,7 @@ void rate_limiter_base::register_metrics() {
sm::description("Number of times a lookup returned an already allocated entry.")),
sm::make_counter("failed_allocations", _metrics.failed_allocations,
sm::description("Number of times the rate limiter gave up trying to allocate.")),
sm::description("Number of times the rate limiter gave up trying to allocate.")).set_skip_when_empty(),
sm::make_counter("probe_count", _metrics.probe_count,
sm::description("Number of probes made during lookups.")),

View File

@@ -174,7 +174,7 @@ cache_tracker::setup_metrics() {
sm::make_counter("sstable_reader_recreations", sm::description("number of times sstable reader was recreated due to memtable flush"), _stats.underlying_recreations),
sm::make_counter("sstable_partition_skips", sm::description("number of times sstable reader was fast forwarded across partitions"), _stats.underlying_partition_skips),
sm::make_counter("sstable_row_skips", sm::description("number of times sstable reader was fast forwarded within a partition"), _stats.underlying_row_skips),
sm::make_counter("pinned_dirty_memory_overload", sm::description("amount of pinned bytes that we tried to unpin over the limit. This should sit constantly at 0, and any number different than 0 is indicative of a bug"), _stats.pinned_dirty_memory_overload),
sm::make_counter("pinned_dirty_memory_overload", sm::description("amount of pinned bytes that we tried to unpin over the limit. This should sit constantly at 0, and any number different than 0 is indicative of a bug"), _stats.pinned_dirty_memory_overload).set_skip_when_empty(),
sm::make_counter("rows_processed_from_memtable", _stats.rows_processed_from_memtable,
sm::description("total number of rows in memtables which were processed during cache update on memtable flush")),
sm::make_counter("rows_dropped_from_memtable", _stats.rows_dropped_from_memtable,

View File

@@ -336,6 +336,8 @@ schema_ptr scylla_tables(schema_features features) {
// since it is written to only after the cluster feature is enabled.
sb.with_column("tablets", map_type_impl::get_instance(utf8_type, utf8_type, false));
sb.with_column("storage_engine", utf8_type);
sb.with_hash_version();
s = sb.build();
}
@@ -1676,6 +1678,9 @@ mutation make_scylla_tables_mutation(schema_ptr table, api::timestamp_type times
m.set_clustered_cell(ckey, cdef, make_map_mutation(map, cdef, timestamp));
}
}
if (table->logstor_enabled()) {
m.set_clustered_cell(ckey, "storage_engine", "logstor", timestamp);
}
// In-memory tables are deprecated since scylla-2024.1.0
// FIXME: delete the column when there's no live version supporting it anymore.
// Writing it here breaks upgrade rollback to versions that do not support the in_memory schema_feature
@@ -2161,6 +2166,13 @@ static void prepare_builder_from_scylla_tables_row(const schema_ctxt& ctxt, sche
auto tablet_options = db::tablet_options(*opt_map);
builder.set_tablet_options(tablet_options.to_map());
}
if (auto storage_engine = table_row.get<sstring>("storage_engine")) {
if (*storage_engine == "logstor") {
builder.set_logstor();
} else {
throw std::invalid_argument(format("Invalid value for storage_engine: {}", *storage_engine));
}
}
}
schema_ptr create_table_from_mutations(const schema_ctxt& ctxt, schema_mutations sm, const data_dictionary::user_types_storage& user_types, schema_ptr cdc_schema, std::optional<table_schema_version> version)

View File

@@ -144,7 +144,7 @@ static std::vector<sstring> get_keyspaces(const schema& s, const replica::databa
/**
* Makes a wrapping range of ring_position from a nonwrapping range of token, used to select sstables.
*/
static dht::partition_range as_ring_position_range(dht::token_range& r) {
static dht::partition_range as_ring_position_range(const dht::token_range& r) {
std::optional<wrapping_interval<dht::ring_position>::bound> start_bound, end_bound;
if (r.start()) {
start_bound = {{ dht::ring_position(r.start()->value(), dht::ring_position::token_bound::start), r.start()->is_inclusive() }};
@@ -156,11 +156,14 @@ static dht::partition_range as_ring_position_range(dht::token_range& r) {
}
/**
* Add a new range_estimates for the specified range, considering the sstables associated with `cf`.
* Add a new range_estimates for the specified range, considering the sstables associated
* with the table identified by `cf_id` across all shards.
*/
static future<system_keyspace::range_estimates> estimate(const replica::column_family& cf, const token_range& r) {
int64_t count{0};
utils::estimated_histogram hist{0};
static future<system_keyspace::range_estimates> estimate(replica::database& db, table_id cf_id, schema_ptr schema, const token_range& r) {
struct shard_estimate {
int64_t count = 0;
utils::estimated_histogram hist{0};
};
auto from_bytes = [] (auto& b) {
return dht::token::from_sstring(utf8_type->to_string(b));
};
@@ -169,14 +172,35 @@ static future<system_keyspace::range_estimates> estimate(const replica::column_f
wrapping_interval<dht::token>({{ from_bytes(r.start), false }}, {{ from_bytes(r.end) }}),
dht::token_comparator(),
[&] (auto&& rng) { ranges.push_back(std::move(rng)); });
for (auto&& r : ranges) {
auto rp_range = as_ring_position_range(r);
for (auto&& sstable : cf.select_sstables(rp_range)) {
count += co_await sstable->estimated_keys_for_range(r);
hist.merge(sstable->get_stats_metadata().estimated_partition_size);
// Estimate partition count and size distribution from sstables on a single shard.
auto estimate_on_shard = [cf_id, ranges] (replica::database& local_db) -> future<shard_estimate> {
auto table_ptr = local_db.get_tables_metadata().get_table_if_exists(cf_id);
if (!table_ptr) {
co_return shard_estimate{};
}
}
co_return system_keyspace::range_estimates{cf.schema(), r.start, r.end, count, count > 0 ? hist.mean() : 0};
auto& cf = *table_ptr;
shard_estimate result;
for (auto&& r : ranges) {
auto rp_range = as_ring_position_range(r);
for (auto&& sstable : cf.select_sstables(rp_range)) {
result.count += co_await sstable->estimated_keys_for_range(r);
result.hist.merge(sstable->get_stats_metadata().estimated_partition_size);
}
}
co_return result;
};
// Combine partial results from two shards.
auto reduce = [] (shard_estimate a, const shard_estimate& b) {
a.count += b.count;
a.hist.merge(b.hist);
return a;
};
auto aggregate = co_await db.container().map_reduce0(std::move(estimate_on_shard), shard_estimate{}, std::move(reduce));
int64_t mean_size = aggregate.count > 0 ? aggregate.hist.mean() : 0;
co_return system_keyspace::range_estimates{std::move(schema), r.start, r.end, aggregate.count, mean_size};
}
/**
@@ -321,7 +345,7 @@ size_estimates_mutation_reader::estimates_for_current_keyspace(std::vector<token
auto rows_to_estimate = range.slice(rows, virtual_row_comparator(_schema));
for (auto&& r : rows_to_estimate) {
auto& cf = _db.find_column_family(*_current_partition, utf8_type->to_string(r.cf_name));
estimates.push_back(co_await estimate(cf, r.tokens));
estimates.push_back(co_await estimate(_db, cf.schema()->id(), cf.schema(), r.tokens));
if (estimates.size() >= _slice.partition_row_limit()) {
co_return estimates;
}

View File

@@ -18,8 +18,11 @@
#include <seastar/coroutine/parallel_for_each.hh>
#include "db/snapshot-ctl.hh"
#include "db/snapshot/backup_task.hh"
#include "db/schema_tables.hh"
#include "index/secondary_index_manager.hh"
#include "replica/database.hh"
#include "replica/global_table_ptr.hh"
#include "replica/schema_describe_helper.hh"
#include "sstables/sstables_manager.hh"
#include "service/storage_proxy.hh"
@@ -154,14 +157,56 @@ future<> snapshot_ctl::do_take_cluster_column_family_snapshot(std::vector<sstrin
);
}
sstring snapshot_ctl::resolve_table_name(const sstring& ks_name, const sstring& name) const {
try {
_db.local().find_uuid(ks_name, name);
return name;
} catch (const data_dictionary::no_such_column_family&) {
// The name may be a logical index name (e.g. "myindex").
// Only indexes with a backing view have a separate backing table
// that can be snapshotted. Custom indexes such as vector indexes
// do not, so keep rejecting them here rather than mapping them to
// a synthetic name.
auto schema = _db.local().find_indexed_table(ks_name, name);
if (schema) {
const auto& im = schema->all_indices().at(name);
if (db::schema_tables::view_should_exist(im)) {
return secondary_index::index_table_name(name);
}
}
throw;
}
}
future<> snapshot_ctl::do_take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, snapshot_options opts) {
for (auto& t : tables) {
t = resolve_table_name(ks_name, t);
}
co_await check_snapshot_not_exist(ks_name, tag, tables);
co_await replica::database::snapshot_tables_on_all_shards(_db, ks_name, std::move(tables), std::move(tag), opts);
}
future<> snapshot_ctl::clear_snapshot(sstring tag, std::vector<sstring> keyspace_names, sstring cf_name) {
return run_snapshot_modify_operation([this, tag = std::move(tag), keyspace_names = std::move(keyspace_names), cf_name = std::move(cf_name)] {
return _db.local().clear_snapshot(tag, keyspace_names, cf_name);
co_return co_await run_snapshot_modify_operation([this, tag = std::move(tag), keyspace_names = std::move(keyspace_names), cf_name = std::move(cf_name)] (this auto) -> future<> {
// clear_snapshot enumerates keyspace_names and uses cf_name as a
// filter in each. When cf_name needs resolution (e.g. logical index
// name -> backing table name), the result may differ per keyspace,
// so resolve and clear individually.
if (!cf_name.empty() && !keyspace_names.empty()) {
std::vector<std::pair<sstring, sstring>> resolved_targets;
resolved_targets.reserve(keyspace_names.size());
// Resolve every keyspace first so a later failure doesn't delete
// snapshots that were already matched in earlier keyspaces.
for (const auto& ks_name : keyspace_names) {
resolved_targets.emplace_back(ks_name, resolve_table_name(ks_name, cf_name));
}
for (auto& [ks_name, resolved_cf_name] : resolved_targets) {
co_await _db.local().clear_snapshot(tag, {ks_name}, std::move(resolved_cf_name));
}
co_return;
}
co_await _db.local().clear_snapshot(std::move(tag), std::move(keyspace_names), cf_name);
});
}
@@ -170,7 +215,26 @@ snapshot_ctl::get_snapshot_details() {
using snapshot_map = std::unordered_map<sstring, db_snapshot_details>;
co_return co_await run_snapshot_list_operation(coroutine::lambda([this] () -> future<snapshot_map> {
return _db.local().get_snapshot_details();
auto details = co_await _db.local().get_snapshot_details();
for (auto& [snapshot_name, snapshot_details] : details) {
for (auto& table : snapshot_details) {
auto schema = _db.local().as_data_dictionary().try_find_table(
table.ks, table.cf);
if (!schema || !schema->schema()->is_view()) {
continue;
}
auto helper = replica::make_schema_describe_helper(
schema->schema(), _db.local().as_data_dictionary());
if (helper.type == schema_describe_helper::type::index) {
table.cf = secondary_index::index_name_from_table_name(
table.cf);
}
}
}
co_return details;
}));
}
@@ -235,4 +299,4 @@ future<int64_t> snapshot_ctl::true_snapshots_size(sstring ks, sstring cf) {
}));
}
}
}

View File

@@ -133,6 +133,12 @@ private:
future<> check_snapshot_not_exist(sstring ks_name, sstring name, std::optional<std::vector<sstring>> filter = {});
// Resolve a user-provided table name that may be a logical index name
// (e.g. "myindex") to its backing column family name (e.g.
// "myindex_index"). Returns the name unchanged if it already
// matches a column family.
sstring resolve_table_name(const sstring& ks_name, const sstring& name) const;
future<> run_snapshot_modify_operation(noncopyable_function<future<>()> &&);
template <typename Func>
@@ -151,4 +157,4 @@ private:
future<> do_take_cluster_column_family_snapshot(std::vector<sstring> ks_names, std::vector<sstring> tables, sstring tag, snapshot_options opts = {});
};
}
}

View File

@@ -281,6 +281,7 @@ schema_ptr system_keyspace::topology() {
.with_column("cleanup_status", utf8_type)
.with_column("supported_features", set_type_impl::get_instance(utf8_type, true))
.with_column("request_id", timeuuid_type)
.with_column("intended_storage_mode", utf8_type)
.with_column("ignore_nodes", set_type_impl::get_instance(uuid_type, true), column_kind::static_column)
.with_column("new_cdc_generation_data_uuid", timeuuid_type, column_kind::static_column)
.with_column("new_keyspace_rf_change_ks_name", utf8_type, column_kind::static_column) // deprecated
@@ -323,6 +324,7 @@ schema_ptr system_keyspace::topology_requests() {
.with_column("snapshot_tag", utf8_type)
.with_column("snapshot_expiry", timestamp_type)
.with_column("snapshot_skip_flush", boolean_type)
.with_column("finalize_migration_ks_name", utf8_type)
.set_comment("Topology request tracking")
.with_hash_version()
.build();
@@ -3052,7 +3054,7 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
co_return ret;
}
const bool strongly_consistent_tables = _db.features().strongly_consistent_tables;
const bool tablet_balancing_not_supported = _db.features().strongly_consistent_tables || _db.features().logstor;
for (auto& row : *rs) {
if (!row.has("host_id")) {
@@ -3169,6 +3171,11 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
}
}
std::optional<service::intended_storage_mode> storage_mode;
if (row.has("intended_storage_mode")) {
storage_mode = service::intended_storage_mode_from_string(row.get_as<sstring>("intended_storage_mode"));
}
std::unordered_map<raft::server_id, service::replica_state>* map = nullptr;
if (nstate == service::node_state::normal) {
map = &ret.normal_nodes;
@@ -3193,7 +3200,7 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
map->emplace(host_id, service::replica_state{
nstate, std::move(datacenter), std::move(rack), std::move(release_version),
ring_slice, shard_count, ignore_msb, std::move(supported_features),
service::cleanup_status_from_string(cleanup_status), request_id});
service::cleanup_status_from_string(cleanup_status), request_id, storage_mode});
}
}
@@ -3289,7 +3296,7 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
ret.session = service::session_id(some_row.get_as<utils::UUID>("session"));
}
if (strongly_consistent_tables) {
if (tablet_balancing_not_supported) {
ret.tablet_balancing_enabled = false;
} else if (some_row.has("tablet_balancing_enabled")) {
ret.tablet_balancing_enabled = some_row.get_as<bool>("tablet_balancing_enabled");
@@ -3506,6 +3513,9 @@ system_keyspace::topology_requests_entry system_keyspace::topology_request_row_t
entry.snapshot_expiry = row.get_as<db_clock::time_point>("snapshot_expiry");
}
}
if (row.has("finalize_migration_ks_name")) {
entry.finalize_migration_ks_name = row.get_as<sstring>("finalize_migration_ks_name");
}
return entry;
}

View File

@@ -427,6 +427,7 @@ public:
std::optional<sstring> snapshot_tag;
std::optional<db_clock::time_point> snapshot_expiry;
bool snapshot_skip_flush;
std::optional<sstring> finalize_migration_ks_name;
};
using topology_requests_entries = std::unordered_map<utils::UUID, system_keyspace::topology_requests_entry>;

View File

@@ -2647,7 +2647,7 @@ future<> view_builder::add_new_view(view_ptr view, build_step& step) {
}
if (this_shard_id() == smp::count - 1) {
co_await utils::get_local_injector().inject("add_new_view_pause_last_shard", utils::wait_for_message(5min));
inject_failure("add_new_view_fail_last_shard");
}
co_await _sys_ks.register_view_for_building(view->ks_name(), view->cf_name(), step.current_token());

View File

@@ -143,10 +143,18 @@ dht::token_range view_building_worker::get_tablet_token_range(table_id table_id,
}
future<> view_building_worker::drain() {
auto drain_started = std::exchange(_drain_started, started_drain::yes);
if (drain_started == started_drain::no) {
_drain_finished = shared_future(do_drain());
}
return _drain_finished.get_future();
}
future<> view_building_worker::do_drain() {
if (!_as.abort_requested()) {
_as.request_abort();
}
_state._mutex.broken();
co_await _staging_sstables_mutex.wait();
_staging_sstables_mutex.broken();
_sstables_to_register_event.broken();
if (this_shard_id() == 0) {
@@ -156,7 +164,9 @@ future<> view_building_worker::drain() {
co_await std::move(state_observer);
co_await _mnotifier.unregister_listener(this);
}
co_await _state.clear();
co_await _state._mutex.wait();
_state._mutex.broken();
co_await _state.drain();
co_await uninit_messaging_service();
}
@@ -200,9 +210,7 @@ future<> view_building_worker::run_staging_sstables_registrator() {
while (!_as.abort_requested()) {
bool sleep = false;
try {
auto lock = co_await get_units(_staging_sstables_mutex, 1, _as);
co_await create_staging_sstable_tasks();
lock.return_all();
_as.check();
co_await _sstables_to_register_event.when();
} catch (semaphore_aborted&) {
@@ -227,13 +235,45 @@ future<> view_building_worker::run_staging_sstables_registrator() {
}
}
future<std::vector<foreign_ptr<semaphore_units<>>>> view_building_worker::lock_staging_mutex_on_multiple_shards(std::flat_set<shard_id> shards) {
SCYLLA_ASSERT(this_shard_id() == 0);
// Collect `_staging_sstables_mutex` locks from multiple shards,
// so other shards won't interact with their `_staging_sstables` map
// until the caller releases them.
std::vector<foreign_ptr<semaphore_units<>>> locks;
locks.resize(smp::count);
// Locks are acquired from multiple shards in parallel.
// This is the only place where multiple-shard locks are acquired at once
// and the method is called only once at a time (from `create_staging_sstable_tasks()`
// on shard 0), so no deadlock may occur.
co_await coroutine::parallel_for_each(shards, [&locks, &sharded_vbw = container()] (auto shard_id) -> future<> {
auto lock_ptr = co_await smp::submit_to(shard_id, [&sharded_vbw] () -> future<foreign_ptr<semaphore_units<>>> {
auto& vbw = sharded_vbw.local();
auto lock = co_await get_units(vbw._staging_sstables_mutex, 1, vbw._as);
co_return make_foreign(std::move(lock));
});
locks[shard_id] = std::move(lock_ptr);
});
co_return std::move(locks);
}
future<> view_building_worker::create_staging_sstable_tasks() {
// Explicitly lock shard0 beforehand to prevent other shards from modifying `_sstables_to_register` from `register_staging_sstable_tasks()`
auto lock0 = co_await get_units(_staging_sstables_mutex, 1, _as);
if (_sstables_to_register.empty()) {
co_return;
}
utils::chunked_vector<canonical_mutation> cmuts;
auto shards = _sstables_to_register
| std::views::values
| std::views::join
| std::views::transform([] (const auto& sst_info) { return sst_info.shard; })
| std::ranges::to<std::flat_set<shard_id>>();
shards.erase(0); // We're already holding shard0 lock
auto locks = co_await lock_staging_mutex_on_multiple_shards(std::move(shards));
utils::chunked_vector<canonical_mutation> cmuts;
auto guard = co_await _group0.client().start_operation(_as);
auto my_host_id = _db.get_token_metadata().get_topology().my_host_id();
for (auto& [table_id, sst_infos]: _sstables_to_register) {
@@ -460,6 +500,16 @@ static std::unordered_set<table_id> get_ids_of_all_views(replica::database& db,
}) | std::ranges::to<std::unordered_set>();
}
void view_building_worker::state::start_batch(std::unique_ptr<batch> batch) {
if (_drained) {
on_internal_error(vbw_logger, "view_building_worker::state was already drained");
} else if (_batch) {
on_internal_error(vbw_logger, fmt::format("view_building_worker::state::start_batch(): some batch (tasks: {}) is already running", _batch->tasks | std::views::keys));
}
_batch = std::move(batch);
_batch->start();
}
// If `state::processing_base_table` is different from `view_building_state::currently_processed_base_table`,
// clear the state, then save and flush the new base table
future<> view_building_worker::state::update_processing_base_table(replica::database& db, const view_building_state& building_state, abort_source& as) {
@@ -485,6 +535,10 @@ future<> view_building_worker::state::clean_up_after_batch() {
// Flush base table, set is as currently processing base table and save which views exist at the time of flush
future<> view_building_worker::state::flush_base_table(replica::database& db, table_id base_table_id, abort_source& as) {
if (_drained) {
on_internal_error(vbw_logger, "view_building_worker::state was already drained");
}
auto cf = db.find_column_family(base_table_id).shared_from_this();
co_await when_all(cf->await_pending_writes(), cf->await_pending_streams());
co_await flush_base(cf, as);
@@ -503,6 +557,11 @@ future<> view_building_worker::state::clear() {
flushed_views.clear();
}
future<> view_building_worker::state::drain() {
_drained = true;
co_await clear();
}
view_building_worker::batch::batch(sharded<view_building_worker>& vbw, std::unordered_map<utils::UUID, view_building_task> tasks, table_id base_id, locator::tablet_replica replica)
: base_id(base_id)
, replica(replica)
@@ -667,24 +726,34 @@ future<> view_building_worker::do_build_range(table_id base_id, std::vector<tabl
}
future<> view_building_worker::do_process_staging(table_id table_id, dht::token last_token) {
if (_staging_sstables[table_id].empty()) {
auto table = _db.get_tables_metadata().get_table(table_id).shared_from_this();
std::vector<sstables::shared_sstable> sstables_to_process;
try {
// Acquire `_staging_sstables_mutex` to prevent `create_staging_sstable_tasks()` from
// concurrently modifying `_staging_sstables` (moving entries from `_sstables_to_register`)
// while we read them.
auto lock = co_await get_units(_staging_sstables_mutex, 1, _as);
auto& tablet_map = table->get_effective_replication_map()->get_token_metadata().tablets().get_tablet_map(table_id);
auto tid = tablet_map.get_tablet_id(last_token);
auto tablet_range = tablet_map.get_token_range(tid);
// Select sstables belonging to the tablet (identified by `last_token`)
for (auto& sst: _staging_sstables[table_id]) {
auto sst_last_token = sst->get_last_decorated_key().token();
if (tablet_range.contains(sst_last_token, dht::token_comparator())) {
sstables_to_process.push_back(sst);
}
}
lock.return_all();
} catch (semaphore_aborted&) {
vbw_logger.warn("Semaphore was aborted while waiting to remove processed sstables for table {}", table_id);
co_return;
}
auto table = _db.get_tables_metadata().get_table(table_id).shared_from_this();
auto& tablet_map = table->get_effective_replication_map()->get_token_metadata().tablets().get_tablet_map(table_id);
auto tid = tablet_map.get_tablet_id(last_token);
auto tablet_range = tablet_map.get_token_range(tid);
// Select sstables belonging to the tablet (identified by `last_token`)
std::vector<sstables::shared_sstable> sstables_to_process;
for (auto& sst: _staging_sstables[table_id]) {
auto sst_last_token = sst->get_last_decorated_key().token();
if (tablet_range.contains(sst_last_token, dht::token_comparator())) {
sstables_to_process.push_back(sst);
}
if (sstables_to_process.empty()) {
co_return;
}
co_await _vug.process_staging_sstables(std::move(table), sstables_to_process);
try {
@@ -799,8 +868,8 @@ future<std::vector<utils::UUID>> view_building_worker::work_on_tasks(raft::term_
}
// Create and start the batch
_state._batch = std::make_unique<batch>(container(), std::move(tasks), *building_state.currently_processed_base_table, my_replica);
_state._batch->start();
auto batch = std::make_unique<view_building_worker::batch>(container(), std::move(tasks), *building_state.currently_processed_base_table, my_replica);
_state.start_batch(std::move(batch));
}
if (std::ranges::all_of(ids, [&] (auto& id) { return !_state._batch->tasks.contains(id); })) {

View File

@@ -14,6 +14,7 @@
#include <seastar/core/shared_future.hh>
#include <unordered_map>
#include <unordered_set>
#include <flat_set>
#include "locator/abstract_replication_strategy.hh"
#include "locator/tablets.hh"
#include "raft/raft.hh"
@@ -98,11 +99,14 @@ class view_building_worker : public seastar::peering_sharded_service<view_buildi
std::unordered_set<table_id> flushed_views;
semaphore _mutex = semaphore(1);
bool _drained = false;
// All of the methods below should be executed while holding `_mutex` units!
void start_batch(std::unique_ptr<batch> batch);
future<> update_processing_base_table(replica::database& db, const view_building_state& building_state, abort_source& as);
future<> flush_base_table(replica::database& db, table_id base_table_id, abort_source& as);
future<> clean_up_after_batch();
future<> clear();
future<> drain();
};
// Wrapper which represents information needed to create
@@ -169,14 +173,24 @@ private:
future<> do_process_staging(table_id base_id, dht::token last_token);
future<> run_staging_sstables_registrator();
// Caller must hold units from `_staging_sstables_mutex`
// Acquires `_staging_sstables_mutex` on all shards internally,
// so callers must not hold `_staging_sstables_mutex` when invoking it.
future<> create_staging_sstable_tasks();
future<> discover_existing_staging_sstables();
std::unordered_map<table_id, std::vector<staging_sstable_task_info>> discover_local_staging_sstables(building_tasks building_tasks);
// Acquire `_staging_sstables_mutex` on multiple shards in parallel.
// Must be called only from shard 0.
// Must be called ONLY by `create_staging_sstable_tasks()` and only once at a time to avoid deadlock.
future<std::vector<foreign_ptr<semaphore_units<>>>> lock_staging_mutex_on_multiple_shards(std::flat_set<shard_id> shards);
void init_messaging_service();
future<> uninit_messaging_service();
future<std::vector<utils::UUID>> work_on_tasks(raft::term_t term, std::vector<utils::UUID> ids);
using started_drain = bool_class<struct started_drain_tag>;
started_drain _drain_started = started_drain::no;
shared_future<> _drain_finished;
future<> do_drain();
};
}

View File

@@ -99,7 +99,7 @@ public:
set_cell(cr, "up", gossiper.is_alive(hostid));
if (gossiper.is_shutdown(endpoint)) {
set_cell(cr, "status", gossiper.get_gossip_status(endpoint));
set_cell(cr, "status", "shutdown");
} else {
set_cell(cr, "status", boost::to_upper_copy<std::string>(fmt::format("{}", ss.get_node_state(hostid))));
}
@@ -224,12 +224,12 @@ public:
}
if (_db.find_keyspace(e.name).get_replication_strategy().uses_tablets()) {
co_await _db.get_tables_metadata().for_each_table_gently([&, this] (table_id, lw_shared_ptr<replica::table> table) -> future<> {
co_await _db.get_tables_metadata().for_each_table_gently([&, this] (table_id tid, lw_shared_ptr<replica::table> table) -> future<> {
if (table->schema()->ks_name() != e.name) {
co_return;
}
const auto& table_name = table->schema()->cf_name();
utils::chunked_vector<dht::token_range_endpoints> ranges = co_await _ss.describe_ring_for_table(e.name, table_name);
utils::chunked_vector<dht::token_range_endpoints> ranges = co_await _ss.describe_ring_for_table(tid);
co_await emit_ring(result, e.key, table_name, std::move(ranges));
});
} else {

View File

@@ -30,6 +30,31 @@ enum class token_kind {
after_all_keys,
};
// Represents a token for partition keys.
// Has a disengaged state, which sorts before all engaged states.
struct raw_token {
int64_t value;
/// Constructs a disengaged token.
raw_token() : value(std::numeric_limits<int64_t>::min()) {}
/// Constructs an engaged token.
/// The token must be of token_kind::key kind.
explicit raw_token(const token&);
explicit raw_token(int64_t v) : value(v) {};
std::strong_ordering operator<=>(const raw_token& o) const noexcept = default;
std::strong_ordering operator<=>(const token& o) const noexcept;
/// Returns true iff engaged.
explicit operator bool() const noexcept {
return value != std::numeric_limits<int64_t>::min();
}
};
using raw_token_opt = seastar::optimized_optional<raw_token>;
class token {
// INT64_MIN is not a legal token, but a special value used to represent
// infinity in token intervals.
@@ -52,6 +77,10 @@ public:
constexpr explicit token(int64_t d) noexcept : token(kind::key, normalize(d)) {}
token(raw_token raw) noexcept
: token(raw ? kind::key : kind::before_all_keys, raw.value)
{ }
// This constructor seems redundant with the bytes_view constructor, but
// it's necessary for IDL, which passes a deserialized_bytes_proxy here.
// (deserialized_bytes_proxy is convertible to bytes&&, but not bytes_view.)
@@ -223,6 +252,29 @@ public:
}
};
inline
raw_token::raw_token(const token& t)
: value(t.raw())
{
#ifdef DEBUG
assert(t._kind == token::kind::key);
#endif
}
inline
std::strong_ordering raw_token::operator<=>(const token& o) const noexcept {
switch (o._kind) {
case token::kind::after_all_keys:
return std::strong_ordering::less;
case token::kind::before_all_keys:
// before_all_keys has a raw value set to the same raw value as a disengaged raw_token, and sorts before all keys.
// So we can order them by just comparing raw values.
[[fallthrough]];
case token::kind::key:
return value <=> o._data;
}
}
inline constexpr std::strong_ordering tri_compare_raw(const int64_t l1, const int64_t l2) noexcept {
if (l1 == l2) {
return std::strong_ordering::equal;
@@ -329,6 +381,17 @@ struct fmt::formatter<dht::token> : fmt::formatter<string_view> {
}
};
template <>
struct fmt::formatter<dht::raw_token> : fmt::formatter<string_view> {
template <typename FormatContext>
auto format(const dht::raw_token& t, FormatContext& ctx) const {
if (!t) {
return fmt::format_to(ctx.out(), "null");
}
return fmt::format_to(ctx.out(), "{}", t.value);
}
};
namespace std {
template<>

View File

@@ -9,6 +9,7 @@
import os
import sys
import shlex
import argparse
import psutil
from pathlib import Path
@@ -103,16 +104,41 @@ if __name__ == '__main__':
run('dd if=/dev/zero of={} bs=1M count={}'.format(swapfile, swapsize_mb), shell=True, check=True)
swapfile.chmod(0o600)
run('mkswap -f {}'.format(swapfile), shell=True, check=True)
mount_point = find_mount_point(swap_directory)
mount_unit = out(f'systemd-escape -p --suffix=mount {shlex.quote(str(mount_point))}')
# Add DefaultDependencies=no to the swap unit to avoid getting the default
# Before=swap.target dependency. We apply this to all clouds, but the
# requirement came from Azure:
#
# On Azure, the swap directory is on the Azure ephemeral disk (mounted on /mnt).
# However, cloud-init makes this mount (i.e., the mnt.mount unit) depend on
# the network (After=network-online.target). By extension, this means that
# the swap unit depends on the network. If we didn't use DefaultDependencies=no,
# then the swap unit would be part of the swap.target which other services
# assume to be a local boot target, so we would end up with dependency cycles
# such as:
#
# swap.target -> mnt-swapfile.swap -> mnt.mount -> network-online.target -> network.target -> systemd-resolved.service -> tmp.mount -> swap.target
#
# By removing the automatic Before=swap.target, the swap unit is no longer
# part of swap.target, avoiding such cycles. The swap will still be
# activated via WantedBy=multi-user.target.
unit_data = '''
[Unit]
Description=swapfile
DefaultDependencies=no
After={}
Conflicts=umount.target
Before=umount.target
[Swap]
What={}
[Install]
WantedBy=multi-user.target
'''[1:-1].format(swapfile)
'''[1:-1].format(mount_unit, swapfile)
with swapunit.open('w') as f:
f.write(unit_data)
systemd_unit.reload()

View File

@@ -1 +1 @@
SCYLLA_NODE_EXPORTER_ARGS="--collector.interrupts --collector.ethtool.metrics-include='(bw_in_allowance_exceeded|bw_out_allowance_exceeded|conntrack_allowance_exceeded|conntrack_allowance_available|linklocal_allowance_exceeded)' --collector.ethtool --no-collector.hwmon --no-collector.bcache --no-collector.btrfs --no-collector.fibrechannel --no-collector.infiniband --no-collector.ipvs --no-collector.nfs --no-collector.nfsd --no-collector.powersupplyclass --no-collector.rapl --no-collector.tapestats --no-collector.thermal_zone --no-collector.udp_queues --no-collector.zfs"
SCYLLA_NODE_EXPORTER_ARGS="--collector.interrupts --collector.ethtool.metrics-include='(bw_in_allowance_exceeded|bw_out_allowance_exceeded|conntrack_allowance_exceeded|conntrack_allowance_available|linklocal_allowance_exceeded)' --collector.ethtool --collector.systemd --collector.systemd.unit-include='^(scylla-server|systemd-coredump.*)\.service$' --no-collector.hwmon --no-collector.bcache --no-collector.btrfs --no-collector.fibrechannel --no-collector.infiniband --no-collector.ipvs --no-collector.nfs --no-collector.nfsd --no-collector.powersupplyclass --no-collector.rapl --no-collector.tapestats --no-collector.thermal_zone --no-collector.udp_queues --no-collector.zfs"

View File

@@ -1,6 +1,12 @@
### a dictionary of redirections
#old path: new path
# Move the Upgrade Support (About Upgrade) page
/stable/upgrade/about-upgrade.html: https://docs.scylladb.com/stable/versioning/upgrade-policy.html
/branch-2025.4/upgrade/about-upgrade.html: https://docs.scylladb.com/stable/versioning/upgrade-policy.html
/branch-2026.1/upgrade/about-upgrade.html: https://docs.scylladb.com/stable/versioning/upgrade-policy.html
# Move the OS Support page
/stable/getting-started/os-support.html: https://docs.scylladb.com/stable/versioning/os-support-per-version.html

View File

@@ -31,7 +31,7 @@ was used. Alternator currently supports two compression algorithms, `gzip`
and `deflate`, both standardized in ([RFC 9110](https://www.rfc-editor.org/rfc/rfc9110.html)).
Other standard compression types which are listed in
[IANA's HTTP Content Coding Registry](https://www.iana.org/assignments/http-parameters/http-parameters.xhtml#content-coding),
including `zstd` ([RFC 8878][https://www.rfc-editor.org/rfc/rfc8878.html]),
including `zstd` ([RFC 8878](https://www.rfc-editor.org/rfc/rfc8878.html)),
are not yet supported by Alternator.
Note that HTTP's compression only compresses the request's _body_ - not the

View File

@@ -139,7 +139,7 @@ The ``WHERE`` clause
~~~~~~~~~~~~~~~~~~~~
The ``WHERE`` clause specifies which rows must be queried. It is composed of relations on the columns that are part of
the ``PRIMARY KEY``.
the ``PRIMARY KEY``, and relations can be joined only with ``AND`` (``OR`` and other logical operators are not supported).
Not all relations are allowed in a query. For instance, non-equal relations (where ``IN`` is considered as an equal
relation) on a partition key are not supported (see the use of the ``TOKEN`` method below to do non-equal queries on
@@ -200,6 +200,23 @@ The tuple notation may also be used for ``IN`` clauses on clustering columns::
WHERE userid = 'john doe'
AND (blog_title, posted_at) IN (('John''s Blog', '2012-01-01'), ('Extreme Chess', '2014-06-01'))
This tuple notation is different from boolean grouping. For example, the following query is not supported::
SELECT * FROM users
WHERE (country = 'BR' AND state = 'SP')
because parentheses are only allowed around a single relation, so this works: ``(country = 'BR') AND (state = 'SP')``, but this does not: ``(country = 'BR' AND state = 'SP')``.
Similarly, an extended query of the form of::
SELECT * FROM users
WHERE (country = 'BR' AND state = 'SP')
OR (country = 'BR' AND state = 'RJ')
won't work for two reasons: boolean grouping of relations is not supported, and neither is ``OR``. When possible,
rewrite such queries with ``IN`` on the varying column, for example
``country = 'BR' AND state IN ('SP', 'RJ')``, or run multiple queries and merge
the results client-side.
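When such a rewrite applies, the resulting single supported query looks like this (a sketch, assuming the ``users`` table from the examples above)::

   SELECT * FROM users
   WHERE country = 'BR'
   AND state IN ('SP', 'RJ');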
The ``CONTAINS`` operator may only be used on collection columns (lists, sets, and maps). In the case of maps,
``CONTAINS`` applies to the map values. The ``CONTAINS KEY`` operator may only be used on map columns and applies to the
map keys.
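For example (a sketch, assuming a hypothetical ``users`` table with an indexed ``set<text>`` column ``emails`` and an indexed ``map<text, text>`` column ``todo``)::

   SELECT * FROM users WHERE emails CONTAINS 'alice@example.com';
   SELECT * FROM users WHERE todo CONTAINS KEY '2026-01-01';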

236
docs/cql/guardrails.rst Normal file
View File

@@ -0,0 +1,236 @@
.. highlight:: cql
.. _cql-guardrails:
CQL Guardrails
==============
ScyllaDB provides a set of configurable guardrail parameters that help operators
enforce best practices and prevent misconfigurations that could degrade cluster
health, availability, or performance. Guardrails operate at two severity levels:
* **Warn**: The request succeeds, but the server includes a warning in the CQL
response. Depending on the specific guardrail, the warning may also be logged on the server side.
* **Fail**: The request is rejected with an error/exception (the specific type
depends on the guardrail). The user must correct the request or adjust the
guardrail configuration to proceed.
.. note::
Guardrails are checked only when a statement is
executed. They do not retroactively validate existing keyspaces, tables, or
previously completed writes.
For the full list of configuration properties, including types, defaults, and
liveness information, see :doc:`Configuration Parameters </reference/configuration-parameters>`.
.. _guardrails-replication-factor:
Replication Factor Guardrails
-----------------------------
These four parameters control the minimum and maximum allowed replication factor
(RF) values. They are evaluated whenever a ``CREATE KEYSPACE`` or
``ALTER KEYSPACE`` statement is executed. Each data center's RF is checked
individually.
An RF of ``0`` — which means "do not replicate to this data center" — is
always allowed and never triggers a guardrail.
A threshold value of ``-1`` disables the corresponding check.
``minimum_replication_factor_warn_threshold``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If any data center's RF is set to a value greater than ``0`` and lower than
this threshold, the server attaches a warning to the CQL response identifying
the offending data center and RF value.
**When to use.** The default of ``3`` is the standard recommendation for
production clusters. An RF below ``3`` means that the cluster cannot tolerate
even a single node failure without data loss or read unavailability (assuming
``QUORUM`` consistency). Keep this at ``3`` unless your deployment has specific
constraints (e.g., a development or test cluster with fewer than 3 nodes).
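For example, with the default warn threshold of ``3``, the following statement succeeds but carries a warning for ``dc1`` (``low_rf`` is a hypothetical keyspace name)::

   CREATE KEYSPACE low_rf
   WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 1};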
``minimum_replication_factor_fail_threshold``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If any data center's RF is set to a value greater than ``0`` and lower than
this threshold, the request is rejected with a ``ConfigurationException``
identifying the offending data center and RF value.
**When to use.** Enable this parameter (e.g., set to ``3``) in production
environments where allowing a low RF would be operationally dangerous. Unlike
the warn threshold, this provides a hard guarantee that no keyspace can be
created or altered to have an RF below the limit.
``maximum_replication_factor_warn_threshold``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If any data center's RF exceeds this threshold, the server attaches a warning to the CQL response identifying
the offending data center and RF value.
**When to use.** An excessively high RF increases write amplification and
storage costs proportionally. For example, an RF of ``5`` means every write
is replicated to five nodes. Set this threshold to alert operators who
may unintentionally set an RF that is too high.
``maximum_replication_factor_fail_threshold``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If any data center's RF exceeds this threshold, the request is rejected with a ``ConfigurationException``
identifying the offending data center and RF value.
**When to use.** Enable this parameter to prevent accidental creation of
keyspaces with an unreasonably high RF. An extremely high RF wastes storage and
network bandwidth and can lead to write latency spikes. This is a hard limit —
the keyspace creation or alteration will not proceed until the RF is lowered.
**Metrics.** ScyllaDB exposes per-shard metrics that track the number of
times each replication factor guardrail has been triggered:
* ``scylla_cql_minimum_replication_factor_warn_violations``
* ``scylla_cql_minimum_replication_factor_fail_violations``
* ``scylla_cql_maximum_replication_factor_warn_violations``
* ``scylla_cql_maximum_replication_factor_fail_violations``
A sustained increase in any of these metrics indicates that
``CREATE KEYSPACE`` or ``ALTER KEYSPACE`` requests are hitting the configured
thresholds.
.. _guardrails-replication-strategy:
Replication Strategy Guardrails
-------------------------------
These two parameters control which replication strategies trigger warnings or
are rejected when a keyspace is created or altered.
``replication_strategy_warn_list``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If the replication strategy used in a ``CREATE KEYSPACE`` or ``ALTER KEYSPACE``
statement is on this list, the server attaches a warning to the CQL response
identifying the discouraged strategy and the affected keyspace.
**When to use.** ``SimpleStrategy`` is not recommended for production use.
It places replicas without awareness of data center or rack topology, which
can undermine fault tolerance in multi-DC deployments. Even in single-DC
deployments, ``NetworkTopologyStrategy`` is recommended because it keeps the
schema ready for future topology changes.
The default configuration warns on ``SimpleStrategy``, which is appropriate
for most deployments. If you have existing keyspaces that use
``SimpleStrategy``, see :doc:`Update Topology Strategy From Simple to Network
</operating-scylla/procedures/cluster-management/update-topology-strategy-from-simple-to-network>`
for the migration procedure.
``replication_strategy_fail_list``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If the replication strategy used in a ``CREATE KEYSPACE`` or ``ALTER KEYSPACE``
statement is on this list, the request is rejected with a
``ConfigurationException`` identifying the forbidden strategy and the affected
keyspace.
**When to use.** In production environments, add ``SimpleStrategy`` to this
list to enforce ``NetworkTopologyStrategy`` across all keyspaces. This helps
prevent new production keyspaces from being created with a topology-unaware
strategy.
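For example, if ``SimpleStrategy`` is on the fail list, the following statement is rejected with a ``ConfigurationException`` (``legacy_ks`` is a hypothetical keyspace name)::

   CREATE KEYSPACE legacy_ks
   WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};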
**Metrics.** The following per-shard metrics track replication strategy
guardrail violations:
* ``scylla_cql_replication_strategy_warn_list_violations``
* ``scylla_cql_replication_strategy_fail_list_violations``
.. _guardrails-write-consistency-level:
Write Consistency Level Guardrails
----------------------------------
These two parameters control which consistency levels (CL) are allowed for
write operations (``INSERT``, ``UPDATE``, ``DELETE``, and ``BATCH``
statements).
Be aware that adding warnings to CQL responses can significantly increase
network traffic and reduce overall throughput.
``write_consistency_levels_warned``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If a write operation uses a consistency level on this list, the server attaches
a warning to the CQL response identifying the discouraged consistency level.
**When to use.** Use this parameter to alert application developers when they
use a consistency level that, while technically functional, is not recommended
for the workload. Common examples:
* **Warn on** ``ANY``: writes at ``ANY`` are acknowledged as soon as at least
one node (including a coordinator acting as a hinted handoff store) receives
the mutation. This means data may not be persisted on any replica node at
the time of acknowledgement, risking data loss if the coordinator fails
before hinted handoff completes.
* **Warn on** ``ALL``: writes at ``ALL`` require every replica to acknowledge
the write. If any single replica is down, the write fails. This significantly
reduces write availability.
``write_consistency_levels_disallowed``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If a write operation uses a consistency level on this list, the request is
rejected with an ``InvalidRequestException`` identifying the forbidden
consistency level.
**When to use.** Use this parameter to hard-block consistency levels that are
considered unsafe for your deployment:
* **Disallow** ``ANY``: in production environments, ``ANY`` is almost never
appropriate. It provides the weakest durability guarantee and is a common
source of data-loss incidents when operators or application developers use it
unintentionally.
* **Disallow** ``ALL``: in clusters where high write availability is critical,
blocking ``ALL`` prevents a single node failure from causing write
unavailability.
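For example, with ``ANY`` in ``write_consistency_levels_disallowed``, a write issued at that level from ``cqlsh`` is rejected (``ks.events`` is a hypothetical table)::

   CONSISTENCY ANY;
   INSERT INTO ks.events (id, payload) VALUES (1, 'x');
   -- rejected with an InvalidRequestException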
**Metrics.** The following per-shard metrics track write consistency level
guardrail violations:
* ``scylla_cql_write_consistency_levels_warned_violations``
* ``scylla_cql_write_consistency_levels_disallowed_violations``
Additionally, ScyllaDB exposes the
``scylla_cql_writes_per_consistency_level`` metric, labeled by consistency
level, which tracks the total number of write requests per CL. This metric is
useful for understanding the current write-CL distribution across the cluster
*before* deciding which levels to warn on or disallow. For example, querying
this metric can reveal whether any application is inadvertently using ``ANY``
or ``ALL`` for writes.
.. _guardrails-compact-storage:
Compact Storage Guardrail
-------------------------
``enable_create_table_with_compact_storage``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This boolean parameter controls whether ``CREATE TABLE`` statements with the
deprecated ``COMPACT STORAGE`` option are allowed. Unlike the other guardrails,
it acts as a simple on/off switch rather than using separate warn and fail
thresholds.
**When to use.** Leave this at the default (``false``) for all new
deployments. ``COMPACT STORAGE`` is a legacy feature that will be permanently
removed in a future version of ScyllaDB. Set to ``true`` only if you have a specific,
temporary need to create compact storage tables (e.g., compatibility with legacy
applications during a migration). For details on the ``COMPACT STORAGE`` option, see
:ref:`Compact Tables <compact-tables>` in the Data Definition documentation.
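For example, with the default setting of ``false``, the following statement is rejected (``legacy_tbl`` is a hypothetical table name)::

   CREATE TABLE legacy_tbl (pk int PRIMARY KEY, v text)
   WITH COMPACT STORAGE;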
Additional References
---------------------
* :doc:`Consistency Level </cql/consistency>`
* :doc:`Data Definition (CREATE/ALTER KEYSPACE) </cql/ddl>`
* :doc:`How to Safely Increase the Replication Factor </kb/rf-increase>`
* :doc:`Metrics Reference </reference/metrics>`

View File

@@ -17,6 +17,7 @@ CQL Reference
secondary-indexes
time-to-live
functions
guardrails
wasm
json
mv
@@ -46,6 +47,7 @@ It allows you to create keyspaces and tables, insert and query tables, and more.
* :doc:`Data Types </cql/types>`
* :doc:`Definitions </cql/definitions>`
* :doc:`Global Secondary Indexes </cql/secondary-indexes>`
* :doc:`CQL Guardrails </cql/guardrails>`
* :doc:`Expiring Data with Time to Live (TTL) </cql/time-to-live>`
* :doc:`Functions </cql/functions>`
* :doc:`JSON Support </cql/json>`

View File

@@ -261,8 +261,51 @@ The following options are supported for vector indexes. All of them are optional
| | * ``true``: Enable rescoring. | |
| | * ``false``: Disable rescoring. | |
+------------------------------+----------------------------------------------------------------------------------------------------------+---------------+
| ``source_model`` | The name of the embedding model that produced the vectors (e.g., ``"ada002"``). Cassandra client | *(none)* |
| | libraries such as CassIO send this option to tag the index with the model. Cassandra SAI rejects it as | |
| | an unrecognized property; ScyllaDB accepts and preserves it in ``DESCRIBE`` output for compatibility | |
| | with those libraries, but does not act on it. | |
+------------------------------+----------------------------------------------------------------------------------------------------------+---------------+
.. _cassandra-sai-compatibility:
Cassandra SAI Compatibility for Vector Search
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ScyllaDB accepts the Cassandra ``StorageAttachedIndex`` (SAI) class name in ``CREATE CUSTOM INDEX``
statements **for vector columns**. Cassandra libraries such as
`CassIO <https://cassio.org/>`_ and `LangChain <https://www.langchain.com/>`_ use SAI to create
vector indexes; ScyllaDB recognizes these statements for compatibility.
When ScyllaDB encounters an SAI class name on a **vector column**, the index is automatically
created as a native ``vector_index``. The following class names are recognized:
* ``org.apache.cassandra.index.sai.StorageAttachedIndex`` (exact case required)
* ``StorageAttachedIndex`` (case-insensitive)
* ``SAI`` (case-insensitive)
Example::
-- Cassandra SAI statement accepted by ScyllaDB:
CREATE CUSTOM INDEX ON my_table (embedding)
USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
WITH OPTIONS = {'similarity_function': 'COSINE'};
-- Equivalent to:
CREATE CUSTOM INDEX ON my_table (embedding)
USING 'vector_index'
WITH OPTIONS = {'similarity_function': 'COSINE'};
The ``similarity_function`` option is supported by both Cassandra SAI and ScyllaDB.
.. note::
SAI class names are only supported on **vector columns**. Using an SAI class name on a
non-vector column (e.g., ``text`` or ``int``) will result in an error. General SAI
indexing of non-vector columns is not supported by ScyllaDB; use a
:doc:`secondary index </cql/secondary-indexes>` instead.
.. _drop-index-statement:
DROP INDEX

View File

@@ -1,111 +0,0 @@
# Introduction
Similar to the approach described in CASSANDRA-12151, we add the
concept of an audit specification. An audit has a target (syslog or a
table) and a set of events/actions that it wants recorded. We
introduce new CQL syntax for Scylla users to describe and manipulate
audit specifications.
Prior art:
- Microsoft SQL Server [audit
description](https://docs.microsoft.com/en-us/sql/relational-databases/security/auditing/sql-server-audit-database-engine?view=sql-server-ver15)
- pgAudit [docs](https://github.com/pgaudit/pgaudit/blob/master/README.md)
- MySQL audit_log docs in
[MySQL](https://dev.mysql.com/doc/refman/8.0/en/audit-log.html) and
[Azure](https://docs.microsoft.com/en-us/azure/mysql/concepts-audit-logs)
- DynamoDB can [use CloudTrail](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/logging-using-cloudtrail.html) to log all events
# CQL extensions
## Create an audit
```cql
CREATE AUDIT [IF NOT EXISTS] audit-name WITH TARGET { SYSLOG | table-name }
[ AND TRIGGER KEYSPACE IN (ks1, ks2, ks3) ]
[ AND TRIGGER TABLE IN (tbl1, tbl2, tbl3) ]
[ AND TRIGGER ROLE IN (usr1, usr2, usr3) ]
[ AND TRIGGER CATEGORY IN (cat1, cat2, cat3) ]
;
```
From this point on, every database event that matches all present
triggers will be recorded in the target. When the target is a table,
it behaves like the [current
design](https://docs.scylladb.com/operating-scylla/security/auditing/#table-storage).
The audit name must be different from all other audits, unless IF NOT
EXISTS precedes it, in which case the existing audit must be identical
to the new definition. Case sensitivity and length limit are the same
as for table names.
A trigger kind (ie, `KEYSPACE`, `TABLE`, `ROLE`, or `CATEGORY`) can be
specified at most once.
## Show an audit
```cql
DESCRIBE AUDIT [audit-name ...];
```
Prints definitions of all audits named herein. If no names are
provided, prints all audits.
## Delete an audit
```cql
DROP AUDIT audit-name;
```
Stops logging events specified by this audit. Doesn't impact the
already logged events. If the target is a table, it remains as it is.
## Alter an audit
```cql
ALTER AUDIT audit-name WITH {same syntax as CREATE}
```
Any trigger provided will be updated (or newly created, if previously
absent). To drop a trigger, use `IN *`.
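For example, to stop triggering on specific roles while keeping the audit's other triggers, the role trigger could be cleared with `IN *` (a sketch of the proposed syntax, assuming a hypothetical audit named `my_audit` with a syslog target):

```cql
ALTER AUDIT my_audit WITH TARGET SYSLOG
    AND TRIGGER ROLE IN *;
```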
## Permissions
Only superusers can modify audits or turn them on and off.
Only superusers can read tables that are audit targets; no user can
modify them. Only superusers can drop tables that are audit targets,
after the audit itself is dropped. If a superuser doesn't drop a
target table, it remains in existence indefinitely.
# Implementation
## Efficient trigger evaluation
```c++
namespace audit {
/// Stores triggers from an AUDIT statement.
class triggers {
// Use trie structures for speedy string lookup.
optional<trie> _ks_trigger, _tbl_trigger, _usr_trigger;
// A logical-AND filter.
optional<unsigned> _cat_trigger;
public:
/// True iff every non-null trigger matches the corresponding ainf element.
bool should_audit(const audit_info& ainf);
};
} // namespace audit
```
To prevent modification of target tables, `audit::inspect()` will
check the statement and throw if it is disallowed, similar to what
`check_access()` currently does.
## Persisting audit definitions
Obviously, an audit definition must survive a server restart and stay
consistent among all nodes in a cluster. We'll accomplish both by
storing audits in a system table.

View File

@@ -0,0 +1,155 @@
# Comparing Build Systems: configure.py vs CMake
ScyllaDB has two build systems: the primary `configure.py` + Ninja pipeline
and an alternative CMake build (used mainly for IDE integration — CLion,
clangd, etc.). Both must produce equivalent compilation and link commands.
`scripts/compare_build_systems.py` verifies this by parsing the `build.ninja`
files generated by each system and comparing:
1. **Per-file compilation flags** — defines, warnings, optimization, language
flags for every Scylla source file.
2. **Link target sets** — are the same executables produced by both systems?
3. **Per-target linker settings** — link flags and libraries for every common
executable.
`configure.py` is treated as the baseline. CMake should match it.
## Quick start
```bash
# Compare a single mode
scripts/compare_build_systems.py -m dev
# Compare all modes
scripts/compare_build_systems.py
# Verbose output — show per-file and per-target differences
scripts/compare_build_systems.py -m debug -v
```
The script automatically configures both build systems into a temporary
directory for every run — the user's existing build tree is never touched.
No manual `configure.py` or `cmake` invocation is required.
## Mode mapping
| configure.py | CMake |
|--------------|------------------|
| `debug` | `Debug` |
| `dev` | `Dev` |
| `release` | `RelWithDebInfo` |
| `sanitize` | `Sanitize` |
| `coverage` | `Coverage` |
## Examples
```bash
# Check dev mode only (fast, most common during development)
scripts/compare_build_systems.py -m dev
# Check all modes
scripts/compare_build_systems.py
# CI mode: quiet, strict (exit 1 on any diff)
scripts/compare_build_systems.py --ci
# Verbose output for debugging a specific mode
scripts/compare_build_systems.py -m sanitize -v
# Quiet mode — only prints summary and errors
scripts/compare_build_systems.py -m dev -q
```
## Exit codes
| Code | Meaning |
|------|--------------------------------------------------------------------------|
| `0` | All checked modes match |
| `1` | Differences found |
| `2` | Configuration failure or some modes could not be compared (e.g. skipped) |
## What it ignores
The script intentionally ignores certain structural differences that are
inherent to how the two build systems work:
- **Include paths** (`-I`, `-isystem`) — directory layout differs between
the two systems.
- **LTO/PGO flags** — these are configuration-dependent options, not
mode-inherent.
- **Internal library targets** — CMake creates intermediate static/shared
libraries (e.g., `scylla-main`, `test-lib`, abseil targets) while
`configure.py` links `.o` files directly.
- **Per-component Boost defines** — CMake adds `BOOST_REGEX_DYN_LINK` etc.
per component; `configure.py` uses a single `BOOST_ALL_DYN_LINK`.
## Typical workflow
After modifying `CMakeLists.txt` or `cmake/mode.*.cmake`:
```bash
# 1. Run the comparison (auto-configures both systems in a temp dir)
scripts/compare_build_systems.py -m dev -v
# 2. Fix any differences, repeat
```
## AI agent workflow
When the script reports mismatches, you can paste its summary output into
an AI coding agent (GitHub Copilot, etc.) and ask it to fix the
discrepancies. The agent has access to both `configure.py` and the
CMake files and can resolve most differences automatically.
### Example interaction
**1. Run the script:**
```bash
scripts/compare_build_systems.py
```
**2. Copy the summary and paste it to the agent:**
> I ran `scripts/compare_build_systems.py` and got:
>
> ```
> Summary
> ══════════════════════════════════════════════════════════════════════
> debug (CMake: Debug ): ✗ MISMATCH
> Compilation: 3 files with flag diffs, 1 sources only in configure.py
> only-configure.py defines: -DSOME_FLAG (3 files)
> Link targets: 1 only in configure.py
> Linker: 2 targets with lib diffs
> lib only in CMake: boost_filesystem (2 targets)
> dev (CMake: Dev ): ✗ MISMATCH
> Compilation: 1 sources only in configure.py
> Link targets: 1 only in configure.py
> release (CMake: RelWithDebInfo ): ✓ MATCH
> sanitize (CMake: Sanitize ): ✓ MATCH
> coverage (CMake: Coverage ): ✓ MATCH
> ```
>
> Please fix all issues and commit according to project guidelines.
**3. The agent will:**
- Identify each discrepancy (missing sources, missing targets, extra
libraries, missing defines).
- Trace root causes — e.g., a test added to `configure.py` but not to
`test/boost/CMakeLists.txt`, or an unnecessary `Boost::filesystem`
link in a CMake target.
- Apply fixes to the appropriate `CMakeLists.txt` files.
- Re-run cmake and the comparison script to verify the fix.
- Commit each fix to the correct commit in the series (using
`git commit --fixup` + `git rebase --autosquash`).
### Tips
- **Paste the full summary block** — the inline diff details (compilation,
link targets, linker) give the agent enough context to act without
scrolling through verbose output.
- **Use `-v` for stubborn issues** — if the agent needs per-file or
per-target detail, re-run with `-v` and paste the relevant section.

docs/dev/counters.md
# Counters
Counters are special kinds of cells whose value can only be incremented, decremented, read and (with some limitations) deleted. In particular, once deleted, a counter cannot be used again. For example:
```cql
> UPDATE cf SET my_counter = my_counter + 6 WHERE pk = 0
> SELECT * FROM cf;
pk | my_counter
----+------------
0 | 6
(1 rows)
> UPDATE cf SET my_counter = my_counter - 1 WHERE pk = 0
> SELECT * FROM cf;
pk | my_counter
----+------------
0 | 5
(1 rows)
> DELETE my_counter FROM cf WHERE pk = 0;
> SELECT * FROM cf;
pk | my_counter
----+------------
(0 rows)
> UPDATE cf SET my_counter = my_counter + 3 WHERE pk = 0
> SELECT * FROM cf;
pk | my_counter
----+------------
(0 rows)
```
## Counters representation
Counters are represented as sets of so-called shards, which are triples containing:
* counter id: a UUID identifying the writer owning that shard (see below)
* logical clock: incremented each time the owning writer modifies the shard value
* current value: the sum of increments and decrements done by the owning writer
During each write operation one of the replicas is chosen as a leader. The leader reads its shard, increments its logical clock, updates the current value, and then sends the new version of its shard to the other replicas.
Shards owned by the same writer are merged (see below) so that each counter cell contains only one shard per counter id. Reading the actual counter value requires summing values of all shards.
### Counter id
The counter id is a 128-bit UUID that identifies which writer owns a shard. How it is assigned depends on whether the table uses vnodes or tablets.
**Vnodes:** the counter id is the host id of the node that owns the shard. Each node in the cluster gets a unique counter id, so the number of shards in a counter cell grows with the number of distinct nodes that have ever written to it.
**Tablets:** the counter id is rack-based rather than node-based. It is a deterministic type-3 (name-based) UUID derived from the string `"<datacenter>:<rack>"`. All nodes in the same rack share the same counter id.
During tablet migration, since there are two active replicas in a rack and in order to avoid conflicts, the node that is a *pending replica* uses the **negated** rack UUID as its counter id.
This bounds the number of shards in a counter cell to at most `2 × (number of racks)` regardless of node replacements.
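A sketch of the rack-based derivation in Python; the actual namespace UUID Scylla uses for the name-based UUID is not shown here, so `uuid.NAMESPACE_DNS` stands in for it, and "negating" is taken to mean the bitwise complement of the 128-bit value:

```python
import uuid

# Placeholder namespace: Scylla's real namespace UUID is an implementation detail.
NAMESPACE = uuid.NAMESPACE_DNS

def rack_counter_id(dc: str, rack: str) -> uuid.UUID:
    """Deterministic type-3 (name-based) UUID derived from "<dc>:<rack>"."""
    return uuid.uuid3(NAMESPACE, f"{dc}:{rack}")

def pending_counter_id(rack_id: uuid.UUID) -> uuid.UUID:
    """Counter id used by the pending replica during tablet migration:
    the bitwise complement of the rack's counter id."""
    return uuid.UUID(int=rack_id.int ^ ((1 << 128) - 1))
```

Because the derivation is deterministic, every node in the same rack computes the same counter id, and negating twice returns the original id.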
### Merging and reconciliation
Reconciliation of two counters requires merging all shards belonging to the same counter id. The rule is: the shard with the highest logical clock wins.
Since support for deleting counters is limited (once deleted, they cannot be used again), during reconciliation tombstones always win over live counter cells regardless of their timestamps.
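The merge rule can be sketched as follows, using a simplified shard model rather than Scylla's actual types:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Shard:
    counter_id: str
    clock: int      # logical clock
    value: int      # sum of this writer's increments and decrements

def merge(a, b):
    """Merge two shard sets (dicts keyed by counter id):
    for the same counter id, the shard with the highest logical clock wins."""
    out = dict(a)
    for cid, shard in b.items():
        cur = out.get(cid)
        if cur is None or shard.clock > cur.clock:
            out[cid] = shard
    return out

def counter_value(shards):
    """Reading the counter value means summing the values of all shards."""
    return sum(s.value for s in shards.values())
```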
### Digest
Computing a digest of counter cells needs to be done based solely on the shard contents (counter id, value, logical clock) rather than any structural metadata.
## Writes
1. A counter update starts with the client sending a counter delta as a long (CQL3 `bigint`) to the coordinator.
2. CQL3 creates a `CounterMutation` containing a `counter_update` cell, which is just the delta.
3. The coordinator chooses the leader of the counter update and sends it the mutation. The leader is always one of the replicas owning the partition the modified counter belongs to.
4. Now, the leader needs to transform counter deltas into shards. To do that it reads the current value of the shard it owns, and produces a new shard with the value modified by the delta and the logical clock incremented.
5. The mutation with the newly created shard is both used to update the memtable on the leader as well as sent to the other nodes for replication.
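The leader-side transform in step 4 can be sketched as follows, modeling a shard as a (clock, value) pair, a simplification of the real cell format:

```python
def apply_delta(local_shard, delta):
    """Leader-side step 4 (sketch): turn a counter_update delta into a new
    shard version by incrementing the logical clock and adding the delta.
    local_shard is None if the leader has never written this counter."""
    if local_shard is None:
        return (1, delta)
    clock, value = local_shard
    return (clock + 1, value + delta)
```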
### Choosing leader
Choosing the replica that becomes the leader for a counter update is completely at the coordinator's discretion. It is not a static role in any way, and any concurrent update could be forwarded to a different leader. This means that all problems related to leader election are avoided.
The coordinator chooses the leader using the following algorithm:
1. If the coordinator can be a leader it chooses itself.
2. Otherwise, a random replica from the local DC is chosen.
3. If there is no eligible node available in the local DC the replica closest to the coordinator (according to the snitch) is chosen.
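A sketch of that algorithm; the `dc_of` and `distance` inputs stand in for topology metadata and the snitch:

```python
import random

def choose_leader(coordinator, replicas, local_dc, dc_of, distance):
    """Coordinator-side leader choice (sketch of the three rules above)."""
    if coordinator in replicas:
        return coordinator                  # rule 1: the coordinator itself
    local = [r for r in replicas if dc_of[r] == local_dc]
    if local:
        return random.choice(local)         # rule 2: random local-DC replica
    return min(replicas, key=distance)      # rule 3: snitch-closest replica
```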
## Reads
Querying counter values is much simpler than updating them. The first part of the read operation is performed as for all other cell types. When counter cells from different sources are reconciled, their shards are merged. Once the final counter cell is known and the `CounterCell` is serialised, the current values of all shards are summed up, and the output of the serialisation is a long integer.


@@ -192,14 +192,10 @@ For example, to configure ScyllaDB to use listen address `10.0.0.5`:
$ docker run --name some-scylla -d scylladb/scylla --listen-address 10.0.0.5
```
**Since: 1.4**
#### `--alternator-address ADDR`
The `--alternator-address` command line option configures the Alternator API listen address. The default value is the same as `--listen-address`.
**Since: 3.2**
#### `--alternator-port PORT`
The `--alternator-port` command line option configures the Alternator API listen port. The Alternator API is disabled by default. You need to specify the port to enable it.
@@ -210,22 +206,16 @@ For example, to configure ScyllaDB to listen to Alternator API at port `8000`:
$ docker run --name some-scylla -d scylladb/scylla --alternator-port 8000
```
**Since: 3.2**
#### `--alternator-https-port PORT`
The `--alternator-https-port` option is similar to `--alternator-port`, except that it enables an encrypted (HTTPS) port. Either the `--alternator-https-port` or `--alternator-http-port`, or both, can be used to enable Alternator.
Note that the `--alternator-https-port` option also requires that files `/etc/scylla/scylla.crt` and `/etc/scylla/scylla.key` be inserted into the image. These files contain an SSL certificate and key, respectively.
**Since: 4.2**
#### `--alternator-write-isolation policy`
The `--alternator-write-isolation` command line option chooses between four allowed write isolation policies described in docs/alternator/alternator.md. This option must be specified if Alternator is enabled - it does not have a default.
**Since: 4.1**
#### `--broadcast-address ADDR`
The `--broadcast-address` command line option configures the IP address the ScyllaDB instance tells other ScyllaDB nodes in the cluster to connect to.
@@ -304,8 +294,6 @@ For example, to skip running I/O setup:
$ docker run --name some-scylla -d scylladb/scylla --io-setup 0
```
**Since: 4.3**
#### `--cpuset CPUSET`
The `--cpuset` command line option restricts ScyllaDB to run only on the CPUs specified by `CPUSET`.
@@ -341,26 +329,18 @@ For example, to enable the User Defined Functions (UDF) feature:
$ docker run --name some-scylla -d scylladb/scylla --experimental-feature=udf
```
**Since: 2.0**
#### `--disable-version-check`
The `--disable-version-check` command line option disables the version validation check.
**Since: 2.2**
#### `--authenticator AUTHENTICATOR`
The `--authenticator` command line option specifies the authenticator class ScyllaDB will use. By default, ScyllaDB uses the `AllowAllAuthenticator`, which performs no credentials checks. The other option is the `PasswordAuthenticator`, which relies on username/password pairs to authenticate users.
**Since: 2.3**
#### `--authorizer AUTHORIZER`
The `--authorizer` command line option specifies the authorizer class ScyllaDB will use. By default, ScyllaDB uses the `AllowAllAuthorizer`, which allows any action by any user. The other option is the `CassandraAuthorizer`, which stores permissions in the `system.permissions` table.
**Since: 2025.4**
#### `--dc NAME`
The `--dc` command line option sets the datacenter name for the ScyllaDB node.

docs/dev/logstor.md
# Logstor
## Introduction
Logstor is a log-structured storage engine for ScyllaDB optimized for key-value workloads. It provides an alternative storage backend for key-value tables - tables with a partition key only, with no clustering columns.
Unlike the traditional LSM-tree based storage, logstor uses a log-structured approach with in-memory indexing, making it particularly suitable for workloads with frequent overwrites and point lookups.
## Architecture
Logstor consists of several key components:
### Components
#### Primary Index
The primary index is entirely in memory; it maps a partition key to its location in the log segments. It consists of one B-tree per table, ordered by token.
#### Segment Manager
The `segment_manager` handles the allocation and management of fixed-size segments (default 128KB). Segments are grouped into large files (default 32MB). Key responsibilities include:
- **Segment allocation**: Provides segments for writing new data
- **Space reclamation**: Tracks free space in each segment
- **Compaction**: Copies live data from sparse segments to reclaim space
- **Recovery**: Scans segments on startup to rebuild the index
- **Separator**: Rewrites segments that have records from different compaction groups into new segments that are separated by compaction group.
The data in the segments consists of records of type `log_record`. Each record contains the value for some key as a `canonical_mutation` and additional metadata.
The `segment_manager` receives new writes via a `write_buffer` and writes them sequentially to the active segment with 4k-block alignment.
#### Write Buffer
The `write_buffer` manages a buffer of log records and handles the serialization of the records including headers and alignment. It can be used to write multiple records to the buffer and then write the buffer to the segment manager.
The `buffered_writer` manages multiple write buffers for user writes, an active buffer and multiple flushing ones, to batch writes and manage backpressure.
### Data Flow
**Write Path:**
1. Application writes mutation to logstor
2. Mutation is converted to a log record
3. Record is written to write buffer
4. The buffer is switched and written to the active segment.
5. Index is updated with new record locations
6. Old record locations (for overwrites) are marked as free
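A toy model of the index bookkeeping in the write path; real segments hold serialized `log_record`s, while here a record is just a key and an offset is a slot number:

```python
class LogstorIndexSketch:
    """Sketch of steps 3-6: append writes to the active segment,
    point the index at the new location, and mark overwritten
    locations as free space for later compaction."""

    def __init__(self, segment_slots=4):
        self.index = {}            # partition key -> (segment, offset)
        self.dead = {}             # segment -> count of overwritten records
        self.active, self.offset = 0, 0
        self.segment_slots = segment_slots

    def write(self, key):
        old = self.index.get(key)
        if old is not None:
            # step 6: the old location of an overwrite is marked as free
            self.dead[old[0]] = self.dead.get(old[0], 0) + 1
        # steps 3-5: append to the active segment, update the index
        self.index[key] = (self.active, self.offset)
        self.offset += 1
        if self.offset == self.segment_slots:
            self.active, self.offset = self.active + 1, 0
```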
**Read Path:**
1. Application requests data for a partition key
2. Index lookup returns record location
3. Segment manager reads record from disk
4. Record is deserialized into a mutation and returned
**Separator:**
1. When a record is written to the active segment, it is also written to its compaction group's separator buffer. The separator buffer holds a reference to the original segment.
2. The separator buffer is flushed when it is full, or when requested to flush for some other reason. It is written into a new segment in the compaction group, and the locations of the records from the original mixed segments are updated to point to the new segments in the compaction group.
3. After the separator buffer is flushed and all records from the original segment have been moved, it releases its reference to the segment. When there are no more references to the segment, it is freed.
**Compaction:**
1. The amount of live data is tracked for each segment in its segment_descriptor. The segment descriptors are stored in a histogram by live data.
2. A segment set from a single compaction group is submitted for compaction.
3. Compaction picks segments for compaction from the segment set. It chooses the segments with the lowest utilization such that compacting them results in a net gain of free segments.
4. It reads the segments, finds all live records, and writes them into a write buffer. When the buffer is full, it is flushed into a new segment, and the index location of each record is updated to point to the new location.
5. After all live records are rewritten the old segments are freed.
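The segment-picking rule in step 3 can be sketched as a greedy prefix over segments sorted by live data; live data here is measured in the same units as `segment_size`:

```python
def pick_for_compaction(segments, segment_size):
    """Sketch: consider segments from least to most live data and keep the
    prefix with the best net gain of free segments (segments freed minus
    new segments needed to hold the surviving live data)."""
    order = sorted(segments, key=segments.get)
    best, best_gain, live = [], 0, 0
    for i, seg in enumerate(order, start=1):
        live += segments[seg]
        needed = -(-live // segment_size)   # ceil division
        gain = i - needed
        if gain > best_gain:
            best, best_gain = order[:i], gain
    return best
```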
## Usage
### Enabling Logstor
To use logstor, enable it in the configuration:
```yaml
enable_logstor: true
experimental_features:
- logstor
```
### Creating Tables
Tables using logstor must have no clustering columns and must be created with the `storage_engine` property set to `'logstor'`:
```cql
CREATE TABLE keyspace.user_profiles (
user_id uuid PRIMARY KEY,
name text,
email text,
metadata frozen<map<text, text>>
) WITH storage_engine = 'logstor';
```
### Basic Operations
**Insert/Update:**
```cql
INSERT INTO keyspace.table_name (pk, v) VALUES (1, 'value1');
INSERT INTO keyspace.table_name (pk, v) VALUES (2, 'value2');
-- Overwrite with new value
INSERT INTO keyspace.table_name (pk, v) VALUES (1, 'updated_value');
```
Currently, updates must write the full row. Updating individual columns is not yet supported. Each write replaces the entire partition.
**Select:**
```cql
SELECT * FROM keyspace.table_name WHERE pk = 1;
-- Returns: (1, 'updated_value')
SELECT pk, v FROM keyspace.table_name WHERE pk = 2;
-- Returns: (2, 'value2')
SELECT * FROM keyspace.table_name;
-- Returns: (1, 'updated_value'), (2, 'value2')
```
**Delete:**
```cql
DELETE FROM keyspace.table_name WHERE pk = 1;
```


@@ -37,8 +37,17 @@ Global index's target is usually just the indexed column name, unless the index
- index on map, set or list values: VALUES(v)
- index on map entries: ENTRIES(v)
Their serialization uses lowercase type names as prefixes, except for `full` which is serialized
as just the column name (without any prefix):
`"v"`, `"keys(v)"`, `"values(v)"`, `"entries(v)"` are valid targets; a frozen full collection
index on column `v` is stored simply as `"v"` (same as a regular index).
If the column name contains characters that could be confused with the above formats
(e.g., a name containing parentheses or braces), it is escaped using the CQL
quoted-identifier syntax (column_identifier::to_cql_string()), which wraps the
name in double quotes and doubles any embedded double-quote characters. For example,
a column named `hEllo` is stored as `"hEllo"`, and a column named `keys(m)` is
stored as `"keys(m)"`.
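A rough approximation of that escaping rule; the real implementation is `column_identifier::to_cql_string()`, and the quoting condition below is a simplification for illustration:

```python
import re

def escape_target_column(name):
    """Quote a column name if it could be confused with a collection-index
    prefix form or needs CQL quoted-identifier escaping; embedded double
    quotes are doubled."""
    needs_quoting = (
        re.fullmatch(r'(keys|values|entries)\(.*\)', name)  # looks like a prefix form
        or '"' in name
        or not name.islower()                               # mixed case, digits only, etc.
    )
    if needs_quoting:
        return '"' + name.replace('"', '""') + '"'
    return name
```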
## Local index

# System Keyspaces Overview
This page gives a high-level overview of several internal keyspaces and what they are used for.
## Table of Contents
- [system_replicated_keys](#system_replicated_keys)
- [system_distributed](#system_distributed)
- [system_distributed_everywhere](#system_distributed_everywhere)
- [system_auth](#system_auth)
- [system](#system)
- [system_schema](#system_schema)
- [system_traces](#system_traces)
- [system_audit/audit](#system_auditaudit)
## `system_replicated_keys`
Internal keyspace for encryption-at-rest key material used by the replicated key provider. It stores encrypted data keys so nodes can retrieve the correct key IDs when reading encrypted data.
This keyspace is created as an internal system keyspace and uses `EverywhereStrategy` so key metadata is available on every node. It is not intended for user data.
## `system_distributed`
Internal distributed metadata keyspace used for cluster-wide coordination data that is shared across nodes.
In practice, it is used for metadata such as:
- materialized view build coordination state
- CDC stream/timestamp metadata exposed to clients
- service level definitions used by workload prioritization
This keyspace is managed by Scylla and is not intended for application tables.
It is created as an internal keyspace (historically with `SimpleStrategy` and RF=3 by default).
## `system_distributed_everywhere`
Legacy keyspace. It is no longer used.
## `system_auth`
Legacy auth keyspace name kept primarily for compatibility.
Auth tables have moved to the `system` keyspace (`roles`, `role_members`, `role_permissions`, and related auth state). `system_auth` may still exist for compatibility with legacy tooling/queries, but it is no longer where current auth state is primarily stored.
## `system`
This keyspace is a local one, so each node has its own, independent content for the tables in this keyspace. For some tables, the content is coordinated at a higher level (Raft), but not via the traditional replication systems (storage proxy).
See the detailed table-level documentation here: [system_keyspace](system_keyspace.md)
## `system_schema`
This keyspace is a local one, so each node has its own, independent content for the tables in this keyspace. All tables in this keyspace are coordinated via the schema replication system.
See the detailed table-level documentation here: [system_schema_keyspace](system_schema_keyspace.md)
## `system_traces`
Internal tracing keyspace used for query tracing and slow-query logging records (`sessions`, `events`, and related index/log tables).
This keyspace is written by Scylla's tracing subsystem for diagnostics and observability. It is operational metadata, not user application data (historically created with `SimpleStrategy` and RF=2).
## `system_audit`/`audit`
Internal audit-logging keyspace used to persist audit events when table-backed auditing is enabled.
Scylla's audit table storage is implemented as an internal audit keyspace for audit records (for example, auth/admin/DCL activity, depending on the audit configuration). In current code this keyspace is named `audit`, while operational material may refer to it by its historical name (`system_audit`). It is intended for security/compliance observability, not for application data.


@@ -1611,6 +1611,7 @@ CREATE TABLE system.topology (
cleanup_status text,
datacenter text,
ignore_msb int,
intended_storage_mode text,
node_state text,
num_tokens int,
rack text,
@@ -1663,6 +1664,7 @@ CREATE TABLE system.topology (
- `tokens_string`: Alternative representation of tokens
- `shard_count`: Number of shards on the node
- `ignore_msb`: MSB bits to ignore for token calculation
- `intended_storage_mode`: Intended storage mode for tables under vnodes-to-tablets migration. The node switches to this mode on next restart.
- `cleanup_status`: Status of cleanup operations
- `supported_features`: Features supported by this node
- `request_id`: ID of the current topology request for this node


@@ -700,6 +700,7 @@ CREATE TABLE system.topology (
host_id uuid,
datacenter text,
ignore_msb int,
intended_storage_mode text,
node_state text,
num_tokens int,
rack text,
@@ -741,6 +742,7 @@ Each node has a clustering row in the table where its `host_id` is the clusterin
- `datacenter` - a name of the datacenter the node belongs to
- `rack` - a name of the rack the node belongs to
- `ignore_msb` - the value of the node's `murmur3_partitioner_ignore_msb_bits` parameter
- `intended_storage_mode` - if set, it indicates the intended storage mode for tables under vnodes-to-tablets migration
- `shard_count` - the node's `smp::count`
- `release_version` - the node's `version::current()` (corresponding to a Cassandra version, used by drivers)
- `node_state` - current state of the node (as described earlier)

docs/dev/vector_index.md
# Vector index in Scylla
Vector indexes are custom indexes (`USING 'vector_index'`). Their `target` option in `system_schema.indexes` uses the following format:
- Simple single-column vector index `(v)`: just the (escaped) column name, e.g. `v`
- Vector index with filtering columns `(v, f1, f2)`: JSON with `tc` (target column) and `fc` (filtering columns): `{"tc":"v","fc":["f1","f2"]}`
- Local vector index `((p1, p2), v)`: JSON with `tc` and `pk` (partition key columns): `{"tc":"v","pk":["p1","p2"]}`
- Local vector index with filtering columns `((p1, p2), v, f1, f2)`: JSON with `tc`, `pk`, and `fc`: `{"tc":"v","pk":["p1","p2"],"fc":["f1","f2"]}`
The `target` option acts as the interface for the vector-store service, providing the metadata necessary to determine which columns are indexed and how they are structured.
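A sketch of decoding the formats above (a hypothetical helper for illustration, not part of Scylla):

```python
import json

def parse_vector_index_target(target):
    """Decode a vector index target option into
    (target_column, partition_key_columns, filtering_columns)."""
    if target.startswith("{"):
        t = json.loads(target)
        return t["tc"], t.get("pk", []), t.get("fc", [])
    return target, [], []       # plain (escaped) column name form
```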


@@ -289,7 +289,7 @@ Yes, but it will require running a full repair (or cleanup) to change the replic
- If you're reducing the replication factor, run ``nodetool cleanup <updated Keyspace>`` on the keyspace you modified to remove surplus replicated data.
Cleanup runs on a per-node basis.
- If you're increasing the replication factor, refer to :doc:`How to Safely Increase the RF </kb/rf-increase>`
- Note that you need to provide the keyspace name. If you do not, the cleanup or repair operation runs on all keyspaces for the specific node.
Why can't I set ``listen_address`` to listen to 0.0.0.0 (all my addresses)?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


@@ -52,7 +52,7 @@ Install ScyllaDB
.. code-block:: console
:substitutions:
sudo wget -O /etc/apt/sources.list.d/scylla.list https://downloads.scylladb.com/deb/debian/|UBUNTU_SCYLLADB_LIST|
#. Install ScyllaDB packages.
@@ -125,7 +125,7 @@ Install ScyllaDB
.. code-block:: console
:substitutions:
sudo curl -o /etc/yum.repos.d/scylla.repo -L https://downloads.scylladb.com/rpm/centos/|CENTOS_SCYLLADB_REPO|
#. Install ScyllaDB packages.
@@ -133,19 +133,19 @@ Install ScyllaDB
sudo yum install scylla
Running the command installs the latest official version of ScyllaDB.
Alternatively, you can install a specific patch version:
.. code-block:: console
sudo yum install scylla-<your patch version>
Example: The following example shows installing ScyllaDB 2025.3.1.
.. code-block:: console
:class: hide-copy-button
sudo yum install scylla-2025.3.1
.. include:: /getting-started/_common/setup-after-install.rst


@@ -36,11 +36,8 @@ release versions, run:
curl -sSf get.scylladb.com/server | sudo bash -s -- --list-active-releases
Versions 2025.1 and Later
==============================
To install a non-default version, run the command with the ``--scylla-version``
option to specify the version you want to install.
**Example**
@@ -50,20 +47,4 @@ you want to install.
curl -sSf get.scylladb.com/server | sudo bash -s -- --scylla-version |CURRENT_VERSION|
Versions Earlier than 2025.1
================================
To install a supported version of *ScyllaDB Enterprise*, run the command with:
* ``--scylla-product scylla-enterprise`` to specify that you want to install
ScyllaDB Enterprise.
* ``--scylla-version`` to specify the version you want to install.
For example:
.. code:: console
curl -sSf get.scylladb.com/server | sudo bash -s -- --scylla-product scylla-enterprise --scylla-version 2024.1
.. include:: /getting-started/_common/setup-after-install.rst


@@ -181,6 +181,7 @@ internode_compression controls whether traffic between nodes is compressed.
* all - all traffic is compressed.
* dc - traffic between different datacenters is compressed.
* rack - traffic between different racks is compressed.
* none - nothing is compressed (default).
Configuring TLS/SSL in scylla.yaml


@@ -57,12 +57,11 @@ To enable shared dictionaries:
internode_compression_enable_advanced: true
rpc_dict_training_when: when_leader
.. warning:: Enabling shared dictionary training might leak unencrypted data to disk.
.. note::
Trained dictionaries contain randomly chosen samples of data transferred between
nodes. The data samples are persisted in the Raft log, which is not encrypted.
As a result, some data from otherwise encrypted tables might be stored on disk
unencrypted.
Reference


@@ -10,6 +10,7 @@ ScyllaDB Configuration Procedures
How to do a Rolling Restart <rolling-restart>
Advanced Internode (RPC) Compression <advanced-internode-compression>
Shared-dictionary compression for SSTables <sstable-dictionary-compression>
Migrate a Keyspace from Vnodes to Tablets <migrate-vnodes-to-tablets>
Procedures to change ScyllaDB Configuration settings.
@@ -22,3 +23,5 @@ Procedures to change ScyllaDB Configuration settings.
* :doc:`Advanced Internode (RPC) Compression </operating-scylla/procedures/config-change/advanced-internode-compression>`
* :doc:`Shared-dictionary compression for SSTables </operating-scylla/procedures/config-change/sstable-dictionary-compression>`
* :doc:`Migrate a Keyspace from Vnodes to Tablets </operating-scylla/procedures/config-change/migrate-vnodes-to-tablets>`

Migrate a Keyspace from Vnodes to Tablets
==========================================
This procedure describes how to migrate an existing keyspace from vnodes
to tablets. Tablets are designed to be the long-term replacement for vnodes,
offering numerous benefits such as faster topology operations, automatic load
balancing, automatic cleanups, and improved streaming performance. Migrating to
tablets is strongly recommended. See :doc:`Data Distribution with Tablets </architecture/tablets/>`
for details.
.. note::
The migration is an online operation. This means that the keyspace remains
fully available to users throughout the migration, provided that its
replication factor is greater than 1. Reads and writes continue to be served
using vnodes until the migration is finished.
.. warning::
During the migration, you should expect degraded performance on the migrating
keyspace. The reasons are the following:
* **Rolling restart**: Each node must upgrade its storage from vnodes to
tablets. This is an offline operation happening on startup, so a restart is
needed. Upon restart, each node performs a heavy and time-consuming
resharding operation to reorganize its data based on tablets, and remains
offline until this operation completes. Resharding may last from minutes to
hours, depending on the amount of data that the node holds. At this time,
the node cannot serve any requests.
* **Unbalanced tablets**: The initial tablet layout mirrors the vnode layout.
The tablet load balancer does not rebalance tablets until the migration is
finished, so some shards may carry more data than others during the
migration. The imbalance is expected to be more prominent in clusters with
very large nodes (hundreds of vCPUs).
* **Loss of shard awareness**: During the migration and until the rolling
restart is complete, the cluster is in a mixed state with some nodes using
vnodes and others using tablets. In this state, queries may cause
cross-shard operations within nodes, reducing performance.
The performance will return to normal after the migration finishes and the
tablet load balancer rebalances the data.
Prerequisites
-------------
* All nodes in the cluster must be **up and running**. You can check the status
of all nodes with
:doc:`nodetool status </operating-scylla/nodetool-commands/status/>`.
* All nodes must be running ScyllaDB 2026.2 or later.
Limitations
-----------
The current migration procedure has the following limitations:
* The total number of **vnode tokens** in the cluster must be a **power of two**
and the tokens must be **evenly spaced** across the token ring. This is
verified automatically when starting the migration.
* **No schema changes** during the migration. Do not create, alter, or drop
tables in the migrating keyspace until the migration is finished.
* **No topology changes** during the migration. Do not add, remove, decommission,
or replace nodes while a migration is in progress.
* **No TRUNCATE** on tables in the migrating keyspace during the migration.
* Only **CQL base tables** can be migrated. Materialized views, secondary
indexes, CDC tables, and Alternator tables are not supported.
* Tables with **counters** or **LWTs** cannot be migrated.
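The power-of-two and even-spacing prerequisite can be sketched as the following check; this is a simplification of what the server verifies, assuming the 64-bit Murmur3 token ring:

```python
def check_vnode_tokens(tokens):
    """Sketch of the prerequisite: the token count is a power of two and
    the tokens are evenly spaced on the 64-bit token ring."""
    n = len(tokens)
    if n == 0 or n & (n - 1):
        return False                        # not a power of two
    ring = 2 ** 64
    step = ring // n
    ts = sorted(tokens)
    # every gap between neighbors (including the wrap-around) must equal step
    return all((ts[(i + 1) % n] - ts[i]) % ring == step for i in range(n))
```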
Overview
--------
The migration consists of three phases:
1. **Prepare**: Create tablet maps for all tables in the keyspace. Each tablet
inherits its token range and replica set from the corresponding vnode range.
2. **Storage upgrade**: Restart each node one at a time, upgrading its storage
from vnodes to tablets. Upon restart, the node begins resharding data into
tablets. This is a storage-layer operation and is unrelated to ScyllaDB
version upgrades.
3. **Finalize**: Once all nodes have been upgraded, commit the migration by
clearing the migration state and switching the keyspace schema to tablets.
During the first two phases, the migration is reversible; you can roll back to
vnodes. However, once the migration is finalized, it cannot be reversed.
.. note::
In the following sections, any reference to "upgrade" or "downgrade" of a
node will refer to the migration of its storage from vnodes to tablets or
vice versa. Do not confuse it with version upgrades/downgrades.
Procedure
---------
#. Prepare the keyspace for migration:
#. Create tablet maps for all tables in the keyspace:
.. code-block:: console
scylla nodetool migrate-to-tablets start <keyspace>
#. Verify that the keyspace is in ``migrating_to_tablets`` state and all nodes are still using vnodes:
.. code-block:: console
scylla nodetool migrate-to-tablets status <keyspace>
**Example:**
.. code-block:: console
$ scylla nodetool migrate-to-tablets status ks
Keyspace: ks
Status: migrating_to_tablets
Nodes:
Host ID Status
99d8de76-3954-4727-911a-6a07251b180c uses vnodes
0b5fd6f6-9670-4faf-a480-ad58cf119007 uses vnodes
017dd39a-3d06-4c8a-8ac4-379f9e595607 uses vnodes
.. _upgrade-nodes:
#. Upgrade all nodes to tablets:
#. Pick a node.
#. Mark the node for upgrade to tablets:
.. note::
This is a node-local operation. Use the IP address of the node that
you are upgrading.
.. caution::
Do not mark more than one node for upgrade at the same time. Even if
you restart them serially, unexpected restarts can happen for various
reasons (crashes, power failures, etc.) leading to parallel node
upgrades which can reduce availability.
.. code-block:: console
scylla nodetool -h <node-ip> migrate-to-tablets upgrade
#. Verify that the node status changed from ``vnodes`` to ``migrating to tablets``:
.. code-block:: console
scylla nodetool migrate-to-tablets status <keyspace>
**Example:**
.. code-block:: console
$ scylla nodetool migrate-to-tablets status ks
Keyspace: ks
Status: migrating_to_tablets
Nodes:
Host ID Status
99d8de76-3954-4727-911a-6a07251b180c migrating to tablets <---
0b5fd6f6-9670-4faf-a480-ad58cf119007 uses vnodes
017dd39a-3d06-4c8a-8ac4-379f9e595607 uses vnodes
#. Drain and stop the node:
.. code-block:: console
scylla nodetool -h <node-ip> drain
.. include:: /rst_include/scylla-commands-stop-index.rst
#. Restart the node:
.. include:: /rst_include/scylla-commands-start-index.rst
#. Wait until the node is UP and has returned to the ScyllaDB cluster using :doc:`nodetool status </operating-scylla/nodetool-commands/status/>`.
This operation may take a long time due to resharding. To monitor
resharding progress, use the task manager API:
.. code-block:: console
scylla nodetool tasks list compaction -h <node-ip> --keyspace <keyspace> | grep -i reshard
#. Verify that the node status changed from ``migrating to tablets`` to ``uses tablets``:
.. code-block:: console
scylla nodetool migrate-to-tablets status <keyspace>
**Example:**
.. code-block:: console
$ scylla nodetool migrate-to-tablets status ks
Keyspace: ks
Status: migrating_to_tablets
Nodes:
Host ID Status
99d8de76-3954-4727-911a-6a07251b180c uses tablets <---
0b5fd6f6-9670-4faf-a480-ad58cf119007 uses vnodes
017dd39a-3d06-4c8a-8ac4-379f9e595607 uses vnodes
#. Move to the next node and repeat from step a until all nodes are upgraded.
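On large clusters it can help to script the completion check. A minimal sketch, assuming the status output format shown in the examples above (the sample output is hard-coded for illustration; in practice pipe the real ``status`` output into the same check):

```shell
# Sample status output, hard-coded for illustration; in practice:
#   status_output=$(scylla nodetool migrate-to-tablets status <keyspace>)
status_output='Keyspace: ks
Status: migrating_to_tablets
Nodes:
Host ID                              Status
99d8de76-3954-4727-911a-6a07251b180c uses tablets
0b5fd6f6-9670-4faf-a480-ad58cf119007 uses tablets
017dd39a-3d06-4c8a-8ac4-379f9e595607 uses vnodes'

# Count nodes that still report "uses vnodes".
remaining=$(printf '%s\n' "$status_output" | grep -c 'uses vnodes')
if [ "$remaining" -eq 0 ]; then
  echo "all nodes upgraded"
else
  echo "$remaining node(s) still on vnodes"
fi
```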
#. Finalize the migration:
.. warning::
Finalization **cannot be undone**. Once the migration is finalized, the
keyspace cannot be switched back to vnodes.
#. Issue the finalization request:
.. code-block:: console
scylla nodetool migrate-to-tablets finalize <keyspace>
#. Verify that the keyspace status changed to ``tablets``:
.. code-block:: console
scylla nodetool migrate-to-tablets status <keyspace>
**Example:**
.. code-block:: console
$ scylla nodetool migrate-to-tablets status ks
Keyspace: ks
Status: tablets
Rollback Procedure
------------------
.. note::
Rollback is only possible **before finalization**. Once the migration is
finalized, it cannot be reversed.
If you need to abort the migration **before finalization**, you can roll back
by downgrading each node back to vnodes. The rollback procedure is the
following:
#. Find all nodes that have been upgraded to tablets (status: ``uses tablets``)
or are in the process of upgrading to tablets (status: ``migrating to tablets``):
.. code-block:: console
scylla nodetool migrate-to-tablets status <keyspace>
**Example:**
.. code-block:: console
$ scylla nodetool migrate-to-tablets status ks
Keyspace: ks
Status: migrating_to_tablets
Nodes:
Host ID Status
99d8de76-3954-4727-911a-6a07251b180c uses tablets <---
0b5fd6f6-9670-4faf-a480-ad58cf119007 migrating to tablets <---
017dd39a-3d06-4c8a-8ac4-379f9e595607 uses vnodes
#. For **each upgraded or upgrading node** in the cluster, perform a downgrade
(one node at a time):
#. Mark the node for downgrade:
.. code-block:: console
scylla nodetool -h <node-ip> migrate-to-tablets downgrade
#. Check the node status. The status for a previously upgraded node should
change from ``uses tablets`` to ``migrating to vnodes``. The status for a
previously upgrading node should change from ``migrating to tablets`` to
``uses vnodes`` or ``migrating to vnodes``:
.. code-block:: console
scylla nodetool migrate-to-tablets status <keyspace>
**Example:**
.. code-block:: console
$ scylla nodetool migrate-to-tablets status ks
Keyspace: ks
Status: migrating_to_tablets
Nodes:
Host ID Status
99d8de76-3954-4727-911a-6a07251b180c migrating to vnodes <---
0b5fd6f6-9670-4faf-a480-ad58cf119007 migrating to tablets
017dd39a-3d06-4c8a-8ac4-379f9e595607 uses vnodes
#. If the node status is ``uses vnodes``, the downgrade is complete. Move to
the next node and repeat from step a.
#. If the node is ``migrating to vnodes``, restart it to complete the
downgrade:
#. Drain and stop the node:
.. code-block:: console
scylla nodetool -h <node-ip> drain
.. include:: /rst_include/scylla-commands-stop-index.rst
#. Restart the node:
.. include:: /rst_include/scylla-commands-start-index.rst
#. Wait until the node is UP and has returned to the ScyllaDB cluster using :doc:`nodetool status </operating-scylla/nodetool-commands/status/>`.
This operation may take a long time due to resharding. To monitor
resharding progress, use the task manager API:
.. code-block:: console
scylla nodetool tasks list compaction -h <node-ip> --keyspace <keyspace> | grep -i reshard
#. Verify that the node status changed from ``migrating to vnodes`` to ``uses vnodes``:
.. code-block:: console
scylla nodetool migrate-to-tablets status <keyspace>
**Example:**
.. code-block:: console
$ scylla nodetool migrate-to-tablets status ks
Keyspace: ks
Status: migrating_to_tablets
Nodes:
Host ID Status
99d8de76-3954-4727-911a-6a07251b180c uses vnodes <---
0b5fd6f6-9670-4faf-a480-ad58cf119007 migrating to tablets
017dd39a-3d06-4c8a-8ac4-379f9e595607 uses vnodes
#. Move to the next node and repeat from step a until all nodes are
downgraded.
#. Once all nodes have been downgraded, finalize the rollback:
.. code-block:: console
scylla nodetool migrate-to-tablets finalize <keyspace>
Migrating multiple keyspaces
----------------------------
Migrating multiple keyspaces simultaneously is supported. The procedure is the
same as with a single keyspace except that the preparation and finalization
steps need to be repeated for each keyspace. Note, however, that a new migration
cannot be started while another migration is in the upgrade phase; all migrations
must be prepared together and finalized together.
To migrate multiple keyspaces simultaneously, follow these steps:
#. For **each keyspace**, prepare it for migration:
.. code-block:: console
scylla nodetool migrate-to-tablets start <keyspace1>
scylla nodetool migrate-to-tablets start <keyspace2>
...
Verify that all keyspaces are in ``migrating_to_tablets`` state before
proceeding:
.. code-block:: console
scylla nodetool migrate-to-tablets status <keyspace1>
scylla nodetool migrate-to-tablets status <keyspace2>
...
#. Upgrade all nodes in the cluster following the same :ref:`procedure <upgrade-nodes>`
as for a single keyspace. Each node restart reshards all keyspaces under
migration in one pass.
#. For **each keyspace**, finalize the migration:
.. code-block:: console
scylla nodetool migrate-to-tablets finalize <keyspace1>
scylla nodetool migrate-to-tablets finalize <keyspace2>
...
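The multi-keyspace flow above can be sketched as a loop. This is a dry run that only prints the commands it would issue (keyspace names are placeholders; remove the ``echo`` prefix to actually execute them):

```shell
# Dry run: DRY_RUN=echo prints the commands instead of executing them.
DRY_RUN=echo
keyspaces="ks1 ks2"
for ks in $keyspaces; do
  $DRY_RUN scylla nodetool migrate-to-tablets start "$ks"
done
# ... upgrade all nodes as in the single-keyspace procedure ...
for ks in $keyspaces; do
  $DRY_RUN scylla nodetool migrate-to-tablets finalize "$ks"
done
```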

@@ -2,8 +2,8 @@
ScyllaDB Auditing Guide
========================
Auditing allows the administrator to monitor activities on a Scylla cluster, including queries and data changes.
The information is stored in a Syslog or a Scylla table.
Auditing allows the administrator to monitor activities on a ScyllaDB cluster, including CQL queries and data changes, as well as Alternator (DynamoDB-compatible API) requests.
The information is stored in a Syslog or a ScyllaDB table.
Prerequisite
------------
@@ -14,15 +14,15 @@ Enable ScyllaDB :doc:`Authentication </operating-scylla/security/authentication>
Enabling Audit
---------------
By default, table auditing is **enabled**. Enabling auditing is controlled by the ``audit:`` parameter in the ``scylla.yaml`` file.
By default, auditing is **enabled** with the ``table`` backend. Enabling auditing is controlled by the ``audit:`` parameter in the ``scylla.yaml`` file.
You can set the following options:
* ``none`` - Audit is disabled.
* ``table`` - Audit is enabled, and messages are stored in a Scylla table (default).
* ``table`` - Audit is enabled, and messages are stored in a ScyllaDB table (default).
* ``syslog`` - Audit is enabled, and messages are sent to Syslog.
* ``syslog,table`` - Audit is enabled, and messages are stored in a Scylla table and sent to Syslog.
* ``syslog,table`` - Audit is enabled, and messages are stored in a ScyllaDB table and sent to Syslog.
Configuring any other value results in an error at Scylla startup.
Configuring any other value results in an error at ScyllaDB startup.
Configuring Audit
-----------------
@@ -34,7 +34,9 @@ Flag Default Value Description
================== ================================== ========================================================================================================================
audit_categories "DCL,AUTH,ADMIN" Comma-separated list of statement categories that should be audited
------------------ ---------------------------------- ------------------------------------------------------------------------------------------------------------------------
audit_tables “” Comma-separated list of table names that should be audited, in the format of <keyspacename>.<tablename>
audit_tables “” Comma-separated list of table names that should be audited, in the format ``<keyspace_name>.<table_name>``.
For Alternator tables use the ``alternator.<table_name>`` format (see :ref:`alternator-auditing`).
------------------ ---------------------------------- ------------------------------------------------------------------------------------------------------------------------
audit_keyspaces “” Comma-separated list of keyspaces that should be audited. You must specify at least one keyspace.
If you leave this option empty, no keyspace will be audited.
@@ -47,30 +49,137 @@ You can use DCL, AUTH, and ADMIN audit categories without including any keyspace
audit_categories parameter description
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
========= =========================================================================================
Parameter Logs Description
========= =========================================================================================
AUTH Logs login events
--------- -----------------------------------------------------------------------------------------
DML Logs insert, update, delete, and other data manipulation language (DML) events
--------- -----------------------------------------------------------------------------------------
DDL Logs object and role create, alter, drop, and other data definition language (DDL) events
--------- -----------------------------------------------------------------------------------------
DCL Logs grant, revoke, create role, drop role, and list roles events
--------- -----------------------------------------------------------------------------------------
QUERY Logs all queries
--------- -----------------------------------------------------------------------------------------
ADMIN Logs service level operations: create, alter, drop, attach, detach, list.
========= ========================================================================================= ====================
Parameter Logs Description Applies To
========= ========================================================================================= ====================
AUTH Logs login events CQL
--------- ----------------------------------------------------------------------------------------- --------------------
DML Logs insert, update, delete, and other data manipulation language (DML) events CQL, Alternator
--------- ----------------------------------------------------------------------------------------- --------------------
DDL Logs object and role create, alter, drop, and other data definition language (DDL) events CQL, Alternator
--------- ----------------------------------------------------------------------------------------- --------------------
DCL Logs grant, revoke, create role, drop role, and list roles events CQL
--------- ----------------------------------------------------------------------------------------- --------------------
QUERY Logs all queries CQL, Alternator
--------- ----------------------------------------------------------------------------------------- --------------------
ADMIN Logs service level operations: create, alter, drop, attach, detach, list. CQL
For :ref:`service level <workload-priorization-service-level-management>`
auditing.
========= =========================================================================================
========= ========================================================================================= ====================
For details on auditing Alternator operations, see :ref:`alternator-auditing`.
Note that enabling audit may negatively impact performance and audit-to-table may consume extra storage. That's especially true when auditing DML and QUERY categories, which generate a high volume of audit messages.
.. _alternator-auditing:
Auditing Alternator Requests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When auditing is enabled, Alternator (DynamoDB-compatible API) requests are audited using the same
backends and the same filtering configuration (``audit_categories``, ``audit_keyspaces``,
``audit_tables``) as CQL operations. No additional configuration is needed.
Both successful and failed Alternator requests are audited.
Alternator Operation Categories
""""""""""""""""""""""""""""""""
Each Alternator API operation is assigned to one of the standard audit categories:
========= ====================================================================================================
Category Alternator Operations
========= ====================================================================================================
DDL CreateTable, DeleteTable, UpdateTable, TagResource, UntagResource, UpdateTimeToLive
--------- ----------------------------------------------------------------------------------------------------
DML PutItem, UpdateItem, DeleteItem, BatchWriteItem
--------- ----------------------------------------------------------------------------------------------------
QUERY GetItem, BatchGetItem, Query, Scan, DescribeTable, ListTables, DescribeEndpoints,
ListTagsOfResource, DescribeTimeToLive, DescribeContinuousBackups,
ListStreams, DescribeStream, GetShardIterator, GetRecords
========= ====================================================================================================
.. note:: AUTH, DCL, and ADMIN categories do not apply to Alternator operations. These categories
are specific to CQL authentication, authorization, and service-level management.
Operation Field Format
"""""""""""""""""""""""
For CQL operations, the ``operation`` field in the audit log contains the raw CQL query string.
For Alternator operations, the format is:
.. code-block:: none
<OperationName>|<JSON request body>
For example:
.. code-block:: none
PutItem|{"TableName":"my_table","Item":{"p":{"S":"pk_val"},"c":{"S":"ck_val"},"v":{"S":"data"}}}
.. note:: The full JSON request body is included in the ``operation`` field. For batch operations
(such as BatchWriteItem), this can be very large (up to 16 MB).
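Given the format above, the operation name and the JSON request body can be separated on the first ``|`` with plain parameter expansion. A sketch using the PutItem example from this section (shortened):

```shell
# Split an Alternator audit `operation` field on the first '|'.
op_field='PutItem|{"TableName":"my_table","Item":{"p":{"S":"pk_val"}}}'
op_name=${op_field%%|*}   # everything before the first '|'
op_body=${op_field#*|}    # everything after the first '|'
echo "$op_name"
```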
Keyspace and Table Filtering for Alternator
""""""""""""""""""""""""""""""""""""""""""""
The real keyspace name of an Alternator table ``T`` is ``alternator_T``.
The ``audit_tables`` config flag uses the shorthand format ``alternator.T`` to refer to such
tables -- the parser expands it to the real keyspace name automatically.
For ``audit_keyspaces``, use the real keyspace name directly.
For example, to audit an Alternator table called ``my_table_name`` use either of the below:
.. code-block:: yaml
# Using audit_tables - use 'alternator' as the keyspace name:
audit_tables: "alternator.my_table_name"
# Using audit_keyspaces - use the real keyspace name:
audit_keyspaces: "alternator_my_table_name"
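The shorthand expansion described above amounts to a simple string rewrite. A sketch (the table name is the example from this section):

```shell
# Expand the audit_tables shorthand `alternator.<table>` to the
# real keyspace name `alternator_<table>`.
entry="alternator.my_table_name"
table=${entry#alternator.}
keyspace="alternator_${table}"
echo "$keyspace"
```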
**Global and batch operations**: Some Alternator operations are not scoped to a single table:
* ``ListTables`` and ``DescribeEndpoints`` have no associated keyspace or table.
* ``BatchWriteItem`` and ``BatchGetItem`` may span multiple tables.
These operations are logged whenever their category matches ``audit_categories``, regardless of
``audit_keyspaces`` or ``audit_tables`` filters. Their ``keyspace_name`` field is empty, and for
batch operations the ``table_name`` field contains a pipe-separated (``|``) list of all involved table names.
**DynamoDB Streams operations**: For streams-related operations (``DescribeStream``, ``GetShardIterator``,
``GetRecords``), the ``table_name`` field contains the base table name and the CDC log table name
separated by a pipe (e.g., ``my_table|my_table_scylla_cdc_log``).
Alternator Audit Log Examples
""""""""""""""""""""""""""""""
Syslog output example (PutItem):
.. code-block:: shell
Mar 18 10:15:03 ip-10-143-2-108 scylla-audit[28387]: node="10.143.2.108", category="DML", cl="LOCAL_QUORUM", error="false", keyspace="alternator_my_table", query="PutItem|{\"TableName\":\"my_table\",\"Item\":{\"p\":{\"S\":\"pk_val\"}}}", client_ip="127.0.0.1", table="my_table", username="anonymous"
Table output example (PutItem):
.. code-block:: shell
SELECT * FROM audit.audit_log ;
returns:
.. code-block:: none
date | node | event_time | category | consistency | error | keyspace_name | operation | source | table_name | username |
-------------------------+--------------+--------------------------------------+----------+--------------+-------+-----------------------+----------------------------------------------------------------------------------+-----------+------------+-----------+
2026-03-18 00:00:00+0000 | 10.143.2.108 | 3429b1a5-2a94-11e8-8f4e-000000000001 | DML | LOCAL_QUORUM | False | alternator_my_table | PutItem|{"TableName":"my_table","Item":{"p":{"S":"pk_val"}}} | 127.0.0.1 | my_table | anonymous |
(1 row)
Configuring Audit Storage
---------------------------
Auditing messages can be sent to :ref:`Syslog <auditing-syslog-storage>` or stored in a Scylla :ref:`table <auditing-table-storage>` or both.
Auditing messages can be sent to :ref:`Syslog <auditing-syslog-storage>` or stored in a ScyllaDB :ref:`table <auditing-table-storage>` or both.
.. _auditing-syslog-storage:
@@ -99,13 +208,13 @@ Storing Audit Messages in Syslog
# All tables in those keyspaces will be audited
audit_keyspaces: "mykeyspace"
#. Restart the Scylla node.
#. Restart the ScyllaDB node.
.. include:: /rst_include/scylla-commands-restart-index.rst
By default, audit messages are written to the same destination as Scylla :doc:`logging </getting-started/logging>`, with ``scylla-audit`` as the process name.
By default, audit messages are written to the same destination as ScyllaDB :doc:`logging </getting-started/logging>`, with ``scylla-audit`` as the process name.
Logging output example (drop table):
Logging output example (CQL drop table):
.. code-block:: shell
@@ -123,7 +232,7 @@ To redirect the Syslog output to a file, follow the steps below (available only
Storing Audit Messages in a Table
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Messages are stored in a Scylla table named ``audit.audit_log``.
Messages are stored in a ScyllaDB table named ``audit.audit_log``.
For example:
@@ -170,11 +279,11 @@ For example:
# All tables in those keyspaces will be audited
audit_keyspaces: "mykeyspace"
#. Restart Scylla node.
#. Restart the ScyllaDB node.
.. include:: /rst_include/scylla-commands-restart-index.rst
Table output example (drop table):
Table output example (CQL drop table):
.. code-block:: shell
@@ -196,7 +305,7 @@ Storing Audit Messages in a Table and Syslog Simultaneously
**Procedure**
#. Follow both procedures from above, and set the ``audit`` parameter in the ``scylla.yaml`` file to both ``syslog`` and ``table``. You need to restart scylla only once.
#. Follow both procedures from above, and set the ``audit`` parameter in the ``scylla.yaml`` file to both ``syslog`` and ``table``. You need to restart ScyllaDB only once.
To have both syslog and table you need to specify both backends separated by a comma:

@@ -1,41 +0,0 @@
================
About Upgrade
================
ScyllaDB upgrade is a rolling procedure - it does not require a full cluster
shutdown and is performed without any downtime or disruption of service.
To ensure a successful upgrade, follow
the :doc:`documented upgrade procedures <upgrade-guides/index>` tested by
ScyllaDB. This means that:
* You should follow the upgrade policy:
* Starting with version **2025.4**, upgrades can **skip minor versions** if:
* They remain within the same major version (for example, upgrading
directly from *2025.1 → 2025.4* is supported).
* You upgrade to the next major version (for example, upgrading
directly from *2025.3 → 2026.1* is supported).
* For versions **prior to 2025.4**, upgrades must be performed consecutively—
each successive X.Y version must be installed in order, **without skipping
any major or minor version** (for example, upgrading directly from 2025.1 → 2025.3
is not supported).
* You cannot skip major versions. Upgrades must move from one major version to
the next using the documented major-version upgrade path.
* You should upgrade to a supported version of ScyllaDB.
See `ScyllaDB Version Support <https://docs.scylladb.com/stable/versioning/version-support.html>`_.
* Before you upgrade to the next version, the whole cluster (each node) must
be upgraded to the previous version.
* You cannot perform an upgrade by replacing the nodes in the cluster with new
nodes with a different ScyllaDB version. You should never add a new node with
a different version to a cluster - if you
:doc:`add a node </operating-scylla/procedures/cluster-management/add-node-to-cluster>`,
it must have the same X.Y.Z (major.minor.patch) version as the other nodes in
the cluster.
Upgrading to each patch version by following the Maintenance Release Upgrade
Guide is optional. However, we recommend upgrading to the latest patch release
for your version before upgrading to a new version.

@@ -5,7 +5,6 @@ Upgrade ScyllaDB
.. toctree::
:titlesonly:
About Upgrade <about-upgrade>
Upgrade Guides <upgrade-guides/index>

@@ -5,6 +5,7 @@ Upgrade ScyllaDB
.. toctree::
ScyllaDB 2025.x to ScyllaDB 2026.1 <upgrade-guide-from-2025.x-to-2026.1/index>
ScyllaDB 2026.x Patch Upgrades <upgrade-guide-from-2026.x.y-to-2026.x.z>
ScyllaDB Image <ami-upgrade>

@@ -20,7 +20,7 @@ This guide covers upgrading ScyllaDB on Red Hat Enterprise Linux (RHEL), CentOS,
and Ubuntu. See `OS Support by Platform and Version <https://docs.scylladb.com/stable/versioning/os-support-per-version.html>`_
for information about supported versions. It also applies when using the ScyllaDB official image on EC2, GCP, or Azure.
See :doc:`About Upgrade </upgrade/about-upgrade/>` for the ScyllaDB upgrade policy.
See `Upgrade Policy <https://docs.scylladb.com/stable/versioning/upgrade-policy.html>`_ for the ScyllaDB upgrade policy.
Before You Upgrade ScyllaDB
==============================

@@ -0,0 +1,268 @@
.. |SCYLLA_NAME| replace:: ScyllaDB
.. |SRC_VERSION| replace:: 2026.x.y
.. |NEW_VERSION| replace:: 2026.x.z
==========================================================================
Upgrade - |SCYLLA_NAME| |SRC_VERSION| to |NEW_VERSION| (Patch Upgrades)
==========================================================================
This document describes a step-by-step procedure for upgrading from
|SCYLLA_NAME| |SRC_VERSION| to |SCYLLA_NAME| |NEW_VERSION| (where "z" is
the latest available version), and rolling back to version |SRC_VERSION|
if necessary.
This guide covers upgrading ScyllaDB on Red Hat Enterprise Linux (RHEL),
CentOS, Debian, and Ubuntu.
See `OS Support by Platform and Version <https://docs.scylladb.com/stable/versioning/os-support-per-version.html>`_
for information about supported versions.
It also applies to the ScyllaDB official image on EC2, GCP, or Azure.
See `Upgrade Policy <https://docs.scylladb.com/stable/versioning/upgrade-policy.html>`_ for the ScyllaDB upgrade policy.
Upgrade Procedure
=================
.. note::
Apply the following procedure **serially** on each node. Do not move to the next
node before validating that the node is up and running the new version.
A ScyllaDB upgrade is a rolling procedure that does **not** require a full cluster
shutdown. For each of the nodes in the cluster, you will:
#. Drain the node and back up the data.
#. Back up the configuration file.
#. Stop ScyllaDB.
#. Download and install new ScyllaDB packages.
#. Start ScyllaDB.
#. Validate that the upgrade was successful.
**Before** upgrading, check which version you are running now using
``scylla --version``. Note the current version in case you want to roll back
the upgrade.
**During** the rolling upgrade it is highly recommended:
* Not to use new |NEW_VERSION| features.
* Not to run administration operations, such as repair, refresh, rebuild, or
adding or removing nodes. See
`sctool <https://manager.docs.scylladb.com/stable/sctool/>`_ for suspending
ScyllaDB Manager's scheduled or running repairs.
* Not to apply schema changes.
Upgrade Steps
=============
Back up the data
------------------------------
Back up all the data to an external device. We recommend using
`ScyllaDB Manager <https://manager.docs.scylladb.com/stable/backup/index.html>`_
to create backups.
Alternatively, you can use the ``nodetool snapshot`` command.
For **each** node in the cluster, run the following:
.. code:: sh
nodetool drain
nodetool snapshot
Take note of the directory name that nodetool gives you, and copy all
the directories with this name under ``/var/lib/scylla`` to a backup device.
When the upgrade is completed on all nodes, remove the snapshot with the
``nodetool clearsnapshot -t <snapshot>`` command to prevent running out of
space.
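The snapshot directories can be located by the tag name that ``nodetool snapshot`` reports. A minimal sketch using a temporary directory as a stand-in for ``/var/lib/scylla`` (the tag value and table path are illustrative, not real nodetool output):

```shell
# Stand-in layout for /var/lib/scylla/data; the tag is illustrative.
data_dir=$(mktemp -d)
tag=1650000000000
mkdir -p "$data_dir/ks/tbl-1234abcd/snapshots/$tag"

# Every table that was snapshotted gets a directory named after the tag.
found=$(find "$data_dir" -type d -name "$tag" | grep -c .)
echo "$found snapshot dir(s) for tag $tag"
```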
Back up the configuration file
------------------------------
Back up the ``scylla.yaml`` configuration file and the ScyllaDB packages
in case you need to roll back the upgrade.
.. tabs::
.. group-tab:: Debian/Ubuntu
.. code:: sh
sudo cp -a /etc/scylla/scylla.yaml /etc/scylla/scylla.yaml.backup
sudo cp /etc/apt/sources.list.d/scylla.list ~/scylla.list-backup
.. group-tab:: RHEL/CentOS
.. code:: sh
sudo cp -a /etc/scylla/scylla.yaml /etc/scylla/scylla.yaml.backup
sudo cp /etc/yum.repos.d/scylla.repo ~/scylla.repo-backup
Gracefully stop the node
------------------------
.. code:: sh
sudo service scylla-server stop
Download and install the new release
------------------------------------
You don't need to update the ScyllaDB DEB or RPM repo when you upgrade to
a patch release.
.. tabs::
.. group-tab:: Debian/Ubuntu
To install a patch version on Debian or Ubuntu, run:
.. code:: sh
sudo apt-get clean all
sudo apt-get update
sudo apt-get dist-upgrade scylla
Answer y to the first two questions.
.. group-tab:: RHEL/CentOS
To install a patch version on RHEL or CentOS, run:
.. code:: sh
sudo yum clean all
sudo yum update scylla\* -y
.. group-tab:: EC2/GCP/Azure Ubuntu Image
If you're using the ScyllaDB official image (recommended), see
the **Debian/Ubuntu** tab for upgrade instructions.
If you're using your own image and have installed ScyllaDB packages for
Ubuntu or Debian, you need to apply an extended upgrade procedure:
#. Install the new ScyllaDB version with the additional
``scylla-machine-image`` package:
.. code-block:: console
sudo apt-get clean all
sudo apt-get update
sudo apt-get dist-upgrade scylla
sudo apt-get dist-upgrade scylla-machine-image
#. Run ``scylla_setup`` without running ``io_setup``.
#. Run ``sudo /opt/scylladb/scylla-machine-image/scylla_cloud_io_setup``.
Start the node
--------------
.. code:: sh
sudo service scylla-server start
Validate
--------
#. Check cluster status with ``nodetool status`` and make sure **all** nodes,
including the one you just upgraded, are in UN status.
#. Use ``curl -X GET "http://localhost:10000/storage_service/scylla_release_version"``
to check the ScyllaDB version.
#. Use ``journalctl _COMM=scylla`` to check there are no new errors in the log.
#. Check again after 2 minutes to validate that no new issues are introduced.
Once you are sure the node upgrade is successful, move to the next node in
the cluster.
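The version check can be scripted by comparing the API response with the expected version. A sketch with hard-coded values (the version string is a placeholder; in practice capture the ``curl`` output, which is typically a JSON-quoted string):

```shell
# Placeholder for the target patch version.
expected="2026.1.2"
# In practice:
#   reported=$(curl -s "http://localhost:10000/storage_service/scylla_release_version")
reported='"2026.1.2"'
# Strip the JSON quoting before comparing.
reported=$(printf '%s' "$reported" | tr -d '"')
if [ "$reported" = "$expected" ]; then
  echo "node is running $expected"
else
  echo "version mismatch: $reported"
fi
```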
Rollback Procedure
==================
The following procedure describes a rollback from ScyllaDB release
|NEW_VERSION| to |SRC_VERSION|. Apply this procedure if an upgrade from
|SRC_VERSION| to |NEW_VERSION| failed before completing on all nodes.
* Use this procedure only on nodes you upgraded to |NEW_VERSION|.
* Execute the following commands one node at a time, moving to the next node only
after the rollback procedure is completed successfully.
ScyllaDB rollback is a rolling procedure that does **not** require a full
cluster shutdown. For each of the nodes to roll back to |SRC_VERSION|, you will:
#. Drain the node and stop ScyllaDB.
#. Downgrade to the previous release.
#. Restore the configuration file.
#. Restart ScyllaDB.
#. Validate the rollback success.
Rollback Steps
==============
Gracefully shutdown ScyllaDB
-----------------------------
.. code:: sh
nodetool drain
sudo service scylla-server stop
Downgrade to the previous release
----------------------------------
.. tabs::
.. group-tab:: Debian/Ubuntu
To downgrade to |SRC_VERSION| on Debian or Ubuntu, run:
.. code-block:: console
:substitutions:
sudo apt-get install scylla=|SRC_VERSION|\* scylla-server=|SRC_VERSION|\* scylla-tools=|SRC_VERSION|\* scylla-tools-core=|SRC_VERSION|\* scylla-kernel-conf=|SRC_VERSION|\* scylla-conf=|SRC_VERSION|\*
Answer y to the first two questions.
.. group-tab:: RHEL/CentOS
To downgrade to |SRC_VERSION| on RHEL or CentOS, run:
.. code-block:: console
:substitutions:
sudo yum downgrade scylla\*-|SRC_VERSION|-\* -y
.. group-tab:: EC2/GCP/Azure Ubuntu Image
If you're using the ScyllaDB official image (recommended), see
the **Debian/Ubuntu** tab for downgrade instructions.
If you're using your own image and have installed ScyllaDB packages for
Ubuntu or Debian, you need to additionally downgrade
the ``scylla-machine-image`` package.
.. code-block:: console
:substitutions:
sudo apt-get install scylla=|SRC_VERSION|\* scylla-server=|SRC_VERSION|\* scylla-tools=|SRC_VERSION|\* scylla-tools-core=|SRC_VERSION|\* scylla-kernel-conf=|SRC_VERSION|\* scylla-conf=|SRC_VERSION|\*
sudo apt-get install scylla-machine-image=|SRC_VERSION|\*
Answer y to the first two questions.
Restore the configuration file
------------------------------
.. code:: sh
sudo rm -rf /etc/scylla/scylla.yaml
sudo cp -a /etc/scylla/scylla.yaml.backup /etc/scylla/scylla.yaml
Start the node
--------------
.. code:: sh
sudo service scylla-server start
Validate
--------
Follow the validation steps from the upgrade procedure above. Once you are sure
the node rollback is successful, move to the next node in the cluster.

@@ -227,13 +227,19 @@ Security
Indexing and Caching
^^^^^^^^^^^^^^^^^^^^^
+--------------------------------------------------------------+--------------------------------------------------------------------------------------+
| Options | Support |
+==============================================================+======================================================================================+
|:doc:`Secondary Index </features/secondary-indexes>` | |v| |
+--------------------------------------------------------------+--------------------------------------------------------------------------------------+
|:doc:`Materialized Views </features/materialized-views>` | |v| |
+--------------------------------------------------------------+--------------------------------------------------------------------------------------+
+----------------------------------------------------------------+--------------------------------------------------------------------------------------+
| Options | Support |
+================================================================+======================================================================================+
|:doc:`Secondary Index </features/secondary-indexes>` | |v| |
+----------------------------------------------------------------+--------------------------------------------------------------------------------------+
|StorageAttachedIndex (SAI) | |x| |
+----------------------------------------------------------------+--------------------------------------------------------------------------------------+
|:ref:`SAI for vector search <cassandra-sai-compatibility>` | |v| :sup:`*` |
+----------------------------------------------------------------+--------------------------------------------------------------------------------------+
|:doc:`Materialized Views </features/materialized-views>` | |v| |
+----------------------------------------------------------------+--------------------------------------------------------------------------------------+
:sup:`*` SAI class name on vector columns is rewritten to native ``vector_index``
Additional Features

View File

@@ -727,7 +727,12 @@ public:
// now we need one page more to be able to save one for next lap
auto fill_size = align_up(buf1.size(), block_size) + block_size - buf1.size();
auto buf2 = co_await _input.read_exactly(fill_size);
// If the underlying stream is already at EOF (e.g. buf1 came from
// cached _next while the previous read_exactly drained the source),
// skip the read_exactly call — it would return empty anyway.
auto buf2 = _input.eof()
? temporary_buffer<char>()
: co_await _input.read_exactly(fill_size);
temporary_buffer<char> output(buf1.size() + buf2.size());

View File

@@ -380,7 +380,7 @@ public:
}
template<typename HostType, typename CacheType, typename ConfigType>
shared_ptr<HostType> get_host(const sstring& host, CacheType& cache, const ConfigType& config_map) {
shared_ptr<HostType> get_host(const sstring& host, CacheType& cache, const ConfigType& config_map, std::string_view config_entry_name) {
auto& host_cache = cache[this_shard_id()];
auto it = host_cache.find(host);
if (it != host_cache.end()) {
@@ -394,23 +394,26 @@ public:
return result;
}
throw std::invalid_argument("No such host: " + host);
throw std::invalid_argument(fmt::format(
"Encryption host \"{}\" is not defined in scylla.yaml. "
"Make sure it is listed under the \"{}\" section.",
host, config_entry_name));
}
shared_ptr<kmip_host> get_kmip_host(const sstring& host) override {
return get_host<kmip_host>(host, _per_thread_kmip_host_cache, _cfg->kmip_hosts());
return get_host<kmip_host>(host, _per_thread_kmip_host_cache, _cfg->kmip_hosts(), "kmip_hosts");
}
shared_ptr<kms_host> get_kms_host(const sstring& host) override {
return get_host<kms_host>(host, _per_thread_kms_host_cache, _cfg->kms_hosts());
return get_host<kms_host>(host, _per_thread_kms_host_cache, _cfg->kms_hosts(), "kms_hosts");
}
shared_ptr<gcp_host> get_gcp_host(const sstring& host) override {
return get_host<gcp_host>(host, _per_thread_gcp_host_cache, _cfg->gcp_hosts());
return get_host<gcp_host>(host, _per_thread_gcp_host_cache, _cfg->gcp_hosts(), "gcp_hosts");
}
shared_ptr<azure_host> get_azure_host(const sstring& host) override {
return get_host<azure_host>(host, _per_thread_azure_host_cache, _cfg->azure_hosts());
return get_host<azure_host>(host, _per_thread_azure_host_cache, _cfg->azure_hosts(), "azure_hosts");
}

View File

@@ -437,7 +437,6 @@ void ldap_connection::poll_results() {
const auto found = _msgid_to_promise.find(id);
if (found == _msgid_to_promise.end()) {
mylog.error("poll_results: got valid result for unregistered id {}, dropping it", id);
ldap_msgfree(result);
} else {
found->second.set_value(std::move(result_ptr));
_msgid_to_promise.erase(found);

View File

@@ -41,7 +41,7 @@ public:
_ip == other._ip;
}
endpoint_state(inet_address ip) noexcept
explicit endpoint_state(inet_address ip) noexcept
: _heart_beat_state()
, _update_timestamp(clk::now())
, _ip(ip)

View File

@@ -172,12 +172,14 @@ public:
gms::feature rack_list_rf { *this, "RACK_LIST_RF"sv };
gms::feature driver_service_level { *this, "DRIVER_SERVICE_LEVEL"sv };
gms::feature strongly_consistent_tables { *this, "STRONGLY_CONSISTENT_TABLES"sv };
gms::feature logstor { *this, "LOGSTOR"sv };
gms::feature client_routes { *this, "CLIENT_ROUTES"sv };
gms::feature removenode_with_left_token_ring { *this, "REMOVENODE_WITH_LEFT_TOKEN_RING"sv };
gms::feature size_based_load_balancing { *this, "SIZE_BASED_LOAD_BALANCING"sv };
gms::feature topology_noop_request { *this, "TOPOLOGY_NOOP_REQUEST"sv };
gms::feature tablets_intermediate_fallback_cleanup { *this, "TABLETS_INTERMEDIATE_FALLBACK_CLEANUP"sv };
gms::feature batchlog_v2 { *this, "BATCHLOG_V2"sv };
gms::feature vnodes_to_tablets_migrations { *this, "VNODES_TO_TABLETS_MIGRATIONS"sv };
public:
const std::unordered_map<sstring, std::reference_wrapper<feature>>& registered_features() const;

View File

@@ -59,7 +59,6 @@ using clk = gossiper::clk;
static logging::logger logger("gossip");
constexpr std::chrono::milliseconds gossiper::INTERVAL;
constexpr std::chrono::hours gossiper::A_VERY_LONG_TIME;
constexpr generation_type::value_type gossiper::MAX_GENERATION_DIFFERENCE;
const sstring& gossiper::get_cluster_name() const noexcept {
@@ -648,7 +647,7 @@ future<> gossiper::do_apply_state_locally(locator::host_id node, endpoint_state
}
// Re-rake after apply_new_states
es = get_endpoint_state_ptr(node);
if (!is_alive(es->get_host_id()) && !is_dead_state(*es) && !shadow_round) { // unless of course, it was dead
if (!is_alive(es->get_host_id()) && !is_left(*es) && !shadow_round) { // unless of course, it was dead
mark_alive(es);
}
} else {
@@ -767,7 +766,7 @@ future<> gossiper::remove_endpoint(locator::host_id endpoint, permit_id pid) {
if (was_alive) {
try {
logger.info("InetAddress {}/{} is now DOWN, status = {}", state->get_host_id(), ip, get_gossip_status(*state));
logger.info("InetAddress {}/{} is now DOWN, status = {}", host_id, ip, get_node_status(host_id));
co_await do_on_dead_notifications(ip, std::move(state), pid);
} catch (...) {
logger.warn("Fail to call on_dead callback: {}", std::current_exception());
@@ -1174,10 +1173,10 @@ future<> gossiper::unregister_(shared_ptr<i_endpoint_state_change_subscriber> su
std::set<locator::host_id> gossiper::get_live_members() const {
std::set<locator::host_id> live_members(_live_endpoints.begin(), _live_endpoints.end());
auto myip = get_broadcast_address();
auto myid = my_host_id();
logger.debug("live_members before={}", live_members);
if (!is_shutdown(myip)) {
live_members.insert(my_host_id());
if (!is_shutdown(myid)) {
live_members.insert(myid);
}
logger.debug("live_members after={}", live_members);
return live_members;
@@ -1248,7 +1247,6 @@ future<> gossiper::evict_from_membership(locator::host_id hid, permit_id pid) {
}
g._endpoint_state_map.erase(hid);
});
_expire_time_endpoint_map.erase(hid);
logger.debug("evicting {} from gossip", hid);
}
@@ -1321,21 +1319,6 @@ future<> gossiper::replicate(endpoint_state es, permit_id pid) {
}
}
future<> gossiper::advertise_token_removed(locator::host_id host_id, permit_id pid) {
auto permit = co_await lock_endpoint(host_id, pid);
pid = permit.id();
auto eps = get_endpoint_state(host_id);
eps.update_timestamp(); // make sure we don't evict it too soon
eps.get_heart_beat_state().force_newer_generation_unsafe();
auto expire_time = compute_expire_time();
eps.add_application_state(application_state::STATUS, versioned_value::removed_nonlocal(host_id, expire_time.time_since_epoch().count()));
logger.info("Completing removal of {}", host_id);
add_expire_time_for_endpoint(host_id, expire_time);
co_await replicate(std::move(eps), pid);
// ensure at least one gossip round occurs before returning
co_await sleep_abortable(INTERVAL * 2, _abort_source);
}
future<> gossiper::assassinate_endpoint(sstring address) {
throw std::runtime_error("Assassinating endpoint is not supported in topology over raft mode");
}
@@ -1368,13 +1351,10 @@ future<> gossiper::do_gossip_to_unreachable_member(gossip_digest_syn message) {
std::uniform_real_distribution<double> dist(0, 1);
double rand_dbl = dist(_random_engine);
if (rand_dbl < prob) {
std::set<locator::host_id> addrs;
for (auto&& x : _unreachable_endpoints) {
// Ignore the node which is decommissioned
if (get_gossip_status(_address_map.get(x.first)) != sstring(versioned_value::STATUS_LEFT)) {
addrs.insert(x.first);
}
}
auto addrs = _unreachable_endpoints | std::ranges::views::keys | std::views::filter([this] (auto ep) {
// Ignore the node which is no longer part of the cluster
return !_topo_sm._topology.left_nodes.contains(raft::server_id(ep.uuid()));
}) | std::ranges::to<std::set>();
logger.trace("do_gossip_to_unreachable_member: live_endpoint nr={} unreachable_endpoints nr={}",
live_endpoint_count, unreachable_endpoint_count);
return send_gossip(message, addrs);
@@ -1383,17 +1363,6 @@ future<> gossiper::do_gossip_to_unreachable_member(gossip_digest_syn message) {
return make_ready_future<>();
}
clk::time_point gossiper::get_expire_time_for_endpoint(locator::host_id id) const noexcept {
/* default expire_time is A_VERY_LONG_TIME */
auto it = _expire_time_endpoint_map.find(id);
if (it == _expire_time_endpoint_map.end()) {
return compute_expire_time();
} else {
auto stored_time = it->second;
return stored_time;
}
}
endpoint_state_ptr gossiper::get_endpoint_state_ptr(locator::host_id ep) const noexcept {
auto it = _endpoint_state_map.find(ep);
if (it == _endpoint_state_map.end()) {
@@ -1420,7 +1389,7 @@ endpoint_state& gossiper::my_endpoint_state() {
auto ep = get_broadcast_address();
auto it = _endpoint_state_map.find(id);
if (it == _endpoint_state_map.end()) {
it = _endpoint_state_map.emplace(id, make_endpoint_state_ptr({ep})).first;
it = _endpoint_state_map.emplace(id, make_endpoint_state_ptr(endpoint_state{ep})).first;
}
return const_cast<endpoint_state&>(*it->second);
}
@@ -1634,9 +1603,8 @@ future<> gossiper::real_mark_alive(locator::host_id host_id) {
}
// Do not mark a node with status shutdown as UP.
auto status = sstring(get_gossip_status(*es));
if (status == sstring(versioned_value::SHUTDOWN)) {
logger.warn("Skip marking node {} with status = {} as UP", host_id, status);
if (is_shutdown(*es)) {
logger.warn("Skip marking node {} with status = shutdown as UP", host_id);
co_return;
}
@@ -1649,7 +1617,6 @@ future<> gossiper::real_mark_alive(locator::host_id host_id) {
auto [it_, inserted] = data.live.insert(addr);
was_live = !inserted;
});
_expire_time_endpoint_map.erase(host_id);
if (was_live) {
co_return;
}
@@ -1662,7 +1629,7 @@ future<> gossiper::real_mark_alive(locator::host_id host_id) {
auto addr = es->get_ip();
logger.info("InetAddress {}/{} is now UP, status = {}", host_id, addr, status);
logger.info("InetAddress {}/{} is now UP, status = {}", host_id, addr, get_node_status(host_id));
co_await _subscribers.for_each([addr, host_id, es, pid = permit.id()] (shared_ptr<i_endpoint_state_change_subscriber> subscriber) -> future<> {
co_await subscriber->on_alive(addr, host_id, es, pid);
@@ -1678,7 +1645,7 @@ future<> gossiper::mark_dead(locator::host_id addr, endpoint_state_ptr state, pe
data.live.erase(addr);
data.unreachable[addr] = now();
});
logger.info("InetAddress {} is now DOWN, status = {}", addr, get_gossip_status(*state));
logger.info("InetAddress {} is now DOWN, status = {}", addr, get_node_status(addr));
co_await do_on_dead_notifications(state->get_ip(), std::move(state), pid);
}
@@ -1688,14 +1655,14 @@ future<> gossiper::handle_major_state_change(endpoint_state eps, permit_id pid,
endpoint_state_ptr eps_old = get_endpoint_state_ptr(ep);
if (!is_dead_state(eps) && !shadow_round) {
if (!is_left(eps) && !shadow_round) {
if (_endpoint_state_map.contains(ep)) {
logger.info("Node {} has restarted, now UP, status = {}", ep, get_gossip_status(eps));
logger.info("Node {} has restarted, now UP, status = {}", ep, get_node_status(ep));
} else {
logger.debug("Node {} is now part of the cluster, status = {}", ep, get_gossip_status(eps));
logger.debug("Node {} is now part of the cluster, status = {}", ep, get_node_status(ep));
}
}
logger.trace("Adding endpoint state for {}, status = {}", ep, get_gossip_status(eps));
logger.trace("Adding endpoint state for {}, status = {}", ep, get_node_status(ep));
co_await replicate(eps, pid);
if (shadow_round) {
@@ -1713,10 +1680,10 @@ future<> gossiper::handle_major_state_change(endpoint_state eps, permit_id pid,
if (!ep_state) {
throw std::out_of_range(format("ep={}", ep));
}
if (!is_dead_state(*ep_state)) {
if (!is_left(*ep_state)) {
mark_alive(ep_state);
} else {
logger.debug("Not marking {} alive due to dead state {}", ep, get_gossip_status(eps));
logger.debug("Not marking {} alive due to dead state {}", ep, get_node_status(ep));
co_await mark_dead(ep, ep_state, pid);
}
@@ -1730,8 +1697,8 @@ future<> gossiper::handle_major_state_change(endpoint_state eps, permit_id pid,
}
}
bool gossiper::is_dead_state(const endpoint_state& eps) const {
return std::ranges::any_of(DEAD_STATES, [state = get_gossip_status(eps)](const auto& deadstate) { return state == deadstate; });
bool gossiper::is_left(const endpoint_state& eps) const {
return _topo_sm._topology.left_nodes.contains(raft::server_id(eps.get_host_id().uuid()));
}
bool gossiper::is_shutdown(const locator::host_id& endpoint) const {
@@ -1746,10 +1713,6 @@ bool gossiper::is_normal(const locator::host_id& endpoint) const {
return get_gossip_status(endpoint) == versioned_value::STATUS_NORMAL;
}
bool gossiper::is_silent_shutdown_state(const endpoint_state& ep_state) const{
return std::ranges::any_of(SILENT_SHUTDOWN_STATES, [state = get_gossip_status(ep_state)](const auto& deadstate) { return state == deadstate; });
}
future<> gossiper::apply_new_states(endpoint_state local_state, const endpoint_state& remote_state, permit_id pid, bool shadow_round) {
// don't SCYLLA_ASSERT here, since if the node restarts the version will go back to zero
//int oldVersion = local_state.get_heart_beat_state().get_heart_beat_version();
@@ -2173,16 +2136,14 @@ future<> gossiper::do_stop_gossiping() {
logger.info("gossip is already stopped");
co_return;
}
auto my_ep_state = get_this_endpoint_state_ptr();
if (my_ep_state) {
logger.info("My status = {}", get_gossip_status(*my_ep_state));
}
if (my_ep_state && !is_silent_shutdown_state(*my_ep_state)) {
if (my_ep_state && _topo_sm._topology.normal_nodes.contains(raft::server_id(my_host_id().uuid()))) {
auto local_generation = my_ep_state->get_heart_beat_state().get_generation();
logger.info("Announcing shutdown");
co_await add_local_application_state(application_state::STATUS, versioned_value::shutdown(true));
auto live_endpoints = _live_endpoints;
for (locator::host_id id : live_endpoints) {
co_await coroutine::parallel_for_each(live_endpoints, [this, &local_generation] (locator::host_id id) -> future<> {
logger.info("Sending a GossipShutdown to {} with generation {}", id, local_generation);
try {
co_await ser::gossip_rpc_verbs::send_gossip_shutdown(&_messaging, id, get_broadcast_address(), local_generation.value());
@@ -2190,7 +2151,7 @@ future<> gossiper::do_stop_gossiping() {
} catch (...) {
logger.warn("Fail to send GossipShutdown to {}: {}", id, std::current_exception());
}
}
});
co_await sleep(std::chrono::milliseconds(_gcfg.shutdown_announce_ms));
} else {
logger.warn("No local state or state is in silent shutdown, not announcing shutdown");
@@ -2241,19 +2202,6 @@ bool gossiper::is_enabled() const {
return _enabled && !_abort_source.abort_requested();
}
void gossiper::add_expire_time_for_endpoint(locator::host_id endpoint, clk::time_point expire_time) {
auto now_ = now();
auto diff = std::chrono::duration_cast<std::chrono::seconds>(expire_time - now_).count();
logger.info("Node {} will be removed from gossip at [{:%Y-%m-%d %T %z}]: (expire = {}, now = {}, diff = {} seconds)",
endpoint, fmt::gmtime(clk::to_time_t(expire_time)), expire_time.time_since_epoch().count(),
now_.time_since_epoch().count(), diff);
_expire_time_endpoint_map[endpoint] = expire_time;
}
clk::time_point gossiper::compute_expire_time() {
return now() + A_VERY_LONG_TIME;
}
bool gossiper::is_alive(locator::host_id id) const {
if (id == my_host_id()) {
return true;
@@ -2373,91 +2321,22 @@ std::string_view gossiper::get_gossip_status(const locator::host_id& endpoint) c
return do_get_gossip_status(get_application_state_ptr(endpoint, application_state::STATUS));
}
bool gossiper::is_safe_for_bootstrap(inet_address endpoint) const {
// We allow to bootstrap a new node in only two cases:
// 1) The node is a completely new node and no state in gossip at all
// 2) The node has state in gossip and it is already removed from the
// cluster either by nodetool decommission or nodetool removenode
bool allowed = true;
auto host_id = try_get_host_id(endpoint);
if (!host_id) {
logger.debug("is_safe_for_bootstrap: node={}, status=no state in gossip, allowed_to_bootstrap={}", endpoint, allowed);
return allowed;
std::string gossiper::get_node_status(const locator::host_id& endpoint) const noexcept {
if (this_shard_id() != 0) {
on_internal_error(logger, "get_node_status should only be called on shard 0");
}
auto eps = get_endpoint_state_ptr(*host_id);
if (!eps) {
logger.debug("is_safe_for_bootstrap: node={}, status=no state in gossip, allowed_to_bootstrap={}", endpoint, allowed);
return allowed;
if (is_shutdown(endpoint)) {
return "shutdown";
}
auto status = get_gossip_status(*eps);
std::unordered_set<std::string_view> allowed_statuses{
versioned_value::STATUS_LEFT,
versioned_value::REMOVED_TOKEN,
};
allowed = allowed_statuses.contains(status);
logger.debug("is_safe_for_bootstrap: node={}, status={}, allowed_to_bootstrap={}", endpoint, status, allowed);
return allowed;
}
std::set<sstring> gossiper::get_supported_features(locator::host_id endpoint) const {
auto app_state = get_application_state_ptr(endpoint, application_state::SUPPORTED_FEATURES);
if (!app_state) {
return {};
}
return feature_service::to_feature_set(app_state->value());
}
std::set<sstring> gossiper::get_supported_features(const std::unordered_map<locator::host_id, sstring>& loaded_peer_features, ignore_features_of_local_node ignore_local_node) const {
std::unordered_map<locator::host_id, std::set<sstring>> features_map;
std::set<sstring> common_features;
for (auto& x : loaded_peer_features) {
auto features = feature_service::to_feature_set(x.second);
if (features.empty()) {
logger.warn("Loaded empty features for peer node {}", x.first);
} else {
features_map.emplace(x.first, std::move(features));
auto n = _topo_sm._topology.find(raft::server_id{endpoint.uuid()});
if (!n) {
if (_topo_sm._topology.left_nodes.contains(raft::server_id{endpoint.uuid()})) {
return "left";
}
return "unknown";
} else {
return fmt::format("{}", n->second.state);
}
for (auto& x : _endpoint_state_map) {
auto host_id = x.second->get_host_id();
auto features = get_supported_features(host_id);
if (ignore_local_node && host_id == my_host_id()) {
logger.debug("Ignore SUPPORTED_FEATURES of local node: features={}", features);
continue;
}
if (features.empty()) {
auto it = loaded_peer_features.find(host_id);
if (it != loaded_peer_features.end()) {
logger.info("Node {} does not contain SUPPORTED_FEATURES in gossip, using features saved in system table, features={}", host_id, feature_service::to_feature_set(it->second));
} else {
logger.warn("Node {} does not contain SUPPORTED_FEATURES in gossip or system table", host_id);
}
} else {
// Replace the features with live info
features_map[host_id] = std::move(features);
}
}
if (ignore_local_node) {
features_map.erase(my_host_id());
}
if (!features_map.empty()) {
common_features = features_map.begin()->second;
}
for (auto& x : features_map) {
auto& features = x.second;
std::set<sstring> result;
std::set_intersection(features.begin(), features.end(),
common_features.begin(), common_features.end(),
std::inserter(result, result.end()));
common_features = std::move(result);
}
common_features.erase("");
return common_features;
}
void gossiper::check_snitch_name_matches(sstring local_snitch_name) const {

View File

@@ -91,7 +91,6 @@ struct loaded_endpoint_state {
class gossiper : public seastar::async_sharded_service<gossiper>, public seastar::peering_sharded_service<gossiper> {
public:
using clk = seastar::lowres_system_clock;
using ignore_features_of_local_node = bool_class<class ignore_features_of_local_node_tag>;
using generation_for_nodes = std::unordered_map<locator::host_id, generation_type>;
private:
using messaging_verb = netw::messaging_verb;
@@ -198,18 +197,7 @@ private:
endpoint_locks_map _endpoint_locks;
public:
static constexpr std::array DEAD_STATES{
versioned_value::REMOVED_TOKEN,
versioned_value::STATUS_LEFT,
};
static constexpr std::array SILENT_SHUTDOWN_STATES{
versioned_value::REMOVED_TOKEN,
versioned_value::STATUS_LEFT,
versioned_value::STATUS_BOOTSTRAPPING,
versioned_value::STATUS_UNKNOWN,
};
static constexpr std::chrono::milliseconds INTERVAL{1000};
static constexpr std::chrono::hours A_VERY_LONG_TIME{24 * 3};
// Maximum difference between remote generation value and generation
// value this node would get if this node were restarted that we are
@@ -241,7 +229,6 @@ private:
/* initial seeds for joining the cluster */
std::set<inet_address> _seeds;
std::map<locator::host_id, clk::time_point> _expire_time_endpoint_map;
bool _in_shadow_round = false;
@@ -341,13 +328,6 @@ private:
utils::chunked_vector<gossip_digest> make_random_gossip_digest() const;
public:
/**
* Handles switching the endpoint's state from REMOVING_TOKEN to REMOVED_TOKEN
*
* @param endpoint
* @param host_id
*/
future<> advertise_token_removed(locator::host_id host_id, permit_id);
/**
* Do not call this method unless you know what you are doing.
@@ -363,7 +343,6 @@ public:
future<generation_type> get_current_generation_number(locator::host_id endpoint) const;
future<version_type> get_current_heart_beat_version(locator::host_id endpoint) const;
bool is_safe_for_bootstrap(inet_address endpoint) const;
private:
/**
* Returns true if the chosen target was also a seed. False otherwise
@@ -383,7 +362,6 @@ private:
future<> do_gossip_to_unreachable_member(gossip_digest_syn message);
public:
clk::time_point get_expire_time_for_endpoint(locator::host_id endpoint) const noexcept;
// Gets a shared pointer to the endpoint_state, if exists.
// Otherwise, returns a null ptr.
@@ -467,7 +445,7 @@ private:
public:
bool is_alive(locator::host_id id) const;
bool is_dead_state(const endpoint_state& eps) const;
bool is_left(const endpoint_state& eps) const;
// Wait for nodes to be alive on all shards
future<> wait_alive(std::vector<gms::inet_address> nodes, std::chrono::milliseconds timeout);
future<> wait_alive(std::vector<locator::host_id> nodes, std::chrono::milliseconds timeout);
@@ -588,17 +566,12 @@ public:
public:
bool is_enabled() const;
public:
void add_expire_time_for_endpoint(locator::host_id endpoint, clk::time_point expire_time);
static clk::time_point compute_expire_time();
public:
bool is_seed(const inet_address& endpoint) const;
bool is_shutdown(const locator::host_id& endpoint) const;
bool is_shutdown(const endpoint_state& eps) const;
bool is_normal(const locator::host_id& endpoint) const;
bool is_cql_ready(const locator::host_id& endpoint) const;
bool is_silent_shutdown_state(const endpoint_state& ep_state) const;
void force_newer_generation();
public:
std::string_view get_gossip_status(const endpoint_state& ep_state) const noexcept;
@@ -615,12 +588,8 @@ private:
gossip_address_map& _address_map;
gossip_config _gcfg;
condition_variable _failure_detector_loop_cv;
// Get features supported by a particular node
std::set<sstring> get_supported_features(locator::host_id endpoint) const;
locator::token_metadata_ptr get_token_metadata_ptr() const noexcept;
public:
// Get features supported by all the nodes this node knows about
std::set<sstring> get_supported_features(const std::unordered_map<locator::host_id, sstring>& loaded_peer_features, ignore_features_of_local_node ignore_local_node) const;
std::string get_node_status(const locator::host_id& endpoint) const noexcept;
private:
seastar::metrics::metric_groups _metrics;
public:

Some files were not shown because too many files have changed in this diff.