Commit Graph

53114 Commits

Author SHA1 Message Date
Nadav Har'El
bb2fb810bb cql3: document WRITETIME() and TTL() for elements of map, set or UDT
Add to the SELECT documentation (docs/cql/dml/select.rst) a description
of the new ability to select WRITETIME() and TTL() of a single element
of map, set or UDT.

Also in the TTL documentation (docs/cql/time-to-live.rst), which already
had a section on "TTL for a collection", add a mention of the ability
to read a single element's TTL(), and an example.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-12 14:28:01 +03:00
Nadav Har'El
a544dae047 test/boost: test WRITETIME() and TTL() on map collection elements
Add tests in test/boost/expr_test.cc for the low-level implementation
of writetime() and ttl() on a map element.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-12 14:28:01 +03:00
Nadav Har'El
ccb94618cc test/cqlpy: test WRITETIME() and TTL() on element of map, set or UDT
This patch adds many tests verifying the behavior of WRITETIME() and
TTL() on individual elements of maps, sets and UDTs, serving as a
regression test for issue #15427. We also add tests verifying our
understanding of related issues like WRITETIME() and TTL() of entire
collections and of individual elements of *frozen* collections.

All new tests pass on Cassandra 5.0, helping to verify that our
implementation is compatible with Cassandra. They also pass on
ScyllaDB after the previous patch (most didn't before that patch).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-12 14:27:40 +03:00
Nadav Har'El
35e807a36c cql3: prepare and evaluate WRITETIME/TTL on collection elements and UDT fields
Complete the implementation of SELECT WRITETIME(col[key])/TTL(col[key])
and WRITETIME(col.field)/TTL(col.field), building on the grammar (commit 1),
wire format (commit 2), and selection-layer (commit 3) changes in the
preceding patches.

* prepare_column_mutation_attribute() (prepare_expr.cc) now handles the
  subscript and field_selection nodes that the grammar produces:
  - For subscripts, it validates that the inner column is a non-frozen
    map or set and checks the 'writetime_ttl_individual_element' feature
    flag so the feature is rejected during rolling upgrades.
  - For field selections, it validates that the inner column is a
    non-frozen UDT, with the same feature-flag check.

* do_evaluate(column_mutation_attribute) (expression.cc) handles the
  same two cases. For a field selection it serializes the field index as
  a key and looks it up in collection_element_metadata; for a subscript
  it evaluates the subscript key and looks it up in the same map.
  A missing key (element not found or expired) returns NULL, matching
  Cassandra behavior.
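
The lookup described in the second bullet can be sketched roughly as follows. This is a minimal Python illustration, not the C++ code in expression.cc: the dictionary stands in for collection_element_metadata, and `field_key()` and `evaluate_attribute()` are hypothetical names. In particular, serializing the UDT field index as a 4-byte big-endian integer is an assumption; the commit only says the index is serialized as a key.

```python
import struct


def field_key(field_index):
    # Assumption: the field index is encoded as a 4-byte big-endian integer.
    # The real serialization used by ScyllaDB may differ.
    return struct.pack(">i", field_index)


def evaluate_attribute(metadata, key, attribute):
    """Look up WRITETIME or TTL of one element.

    metadata maps raw key bytes to a (timestamp, ttl) pair.
    A missing key (element not found or expired) yields None, i.e. NULL.
    """
    entry = metadata.get(key)
    if entry is None:
        return None
    timestamp, ttl = entry
    return timestamp if attribute == "writetime" else ttl
```

A subscript such as col[key] would pass the evaluated key bytes directly, while col.field would pass `field_key(index)`.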

Together with the preceding three patches, this finally fixes #15427.

The next three patches will add tests and documentation for the new
feature, and the final eighth patch will fix the implementation of
UDT fields in LWT expressions - which the first patch made the grammar
allow but is still not implemented correctly.
2026-04-12 13:28:28 +03:00
Nadav Har'El
4ac63de063 cql3: parse per-element timestamps/TTLs in the selection layer
Wire up the selection and result-set infrastructure to consume the
extended collection wire format introduced in the previous patch and
expose per-element timestamps and TTLs to the expression evaluator.

* Add collection_cell_metadata: maps from raw element-key bytes to
  timestamp and remaining TTL, one entry per collection or UDT cell.
  Add a corresponding collection_element_metadata span to
  evaluation_inputs so that evaluators can access it.

* Add a flag _collect_collection_timestamps to selection (selection.hh/cc).
  When any selected expression contains a WRITETIME(col[key])/TTL(col[key])
  or WRITETIME(col.field)/TTL(col.field) attribute, the flag is set and
  the send_collection_timestamps partition-slice option is enabled,
  causing storage nodes to use the extended wire format from the
  previous patch.

* Implement result_set_builder::add_collection() (selection.cc): when
  _collect_collection_timestamps is set, parse the extended format,
  decode per-element timestamps and remaining TTLs (computed from the
  stored expiry time and the query time), and store them in
  _collection_element_metadata indexed by column position.  When the
  flag is not set, the existing plain-bytes path is unchanged.
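
The remaining-TTL computation in the last bullet can be sketched as below. This is a minimal Python illustration rather than the code in selection.cc: `build_element_metadata()` is a hypothetical name, the time unit is left abstract, and dropping already-expired elements is an assumption. The -1 sentinel for "no TTL" follows the wire format described in the previous patch.

```python
NO_TTL = -1  # sentinel from the extended wire format: element has no TTL


def build_element_metadata(entries, query_time):
    """Map each element key to (timestamp, remaining TTL or None).

    entries is an iterable of (key_bytes, timestamp, expiry) tuples,
    with expiry and query_time expressed in the same time unit.
    """
    metadata = {}
    for key, timestamp, expiry in entries:
        if expiry == NO_TTL:
            metadata[key] = (timestamp, None)
        elif expiry > query_time:
            # Remaining TTL is derived from the stored expiry and the query time.
            metadata[key] = (timestamp, expiry - query_time)
        # Assumption: elements that have already expired are not recorded.
    return metadata
```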

After this patch, the new selection feature is still not available to
the end-user because the prepare step still forbids it. The next patch
will finally complete the expression preparation and evaluation.
It will read the new collection_element_metadata and return the correct
timestamp or TTL value.
2026-04-12 12:51:06 +03:00
Nadav Har'El
bb63db34e5 cql3: add extended wire format for per-element timestamps and TTLs
Introduce the infrastructure needed to transport per-element timestamps
and TTL expiry times from replicas to coordinators, required for
WRITETIME(col[key]) / TTL(col[key]) and WRITETIME(col.field) /
TTL(col.field).

* Add a 'writetime_ttl_individual_element' cluster feature flag that
  guards usage of the new wire format during rolling upgrades: the
  extended format is only emitted and consumed when every node in the
  cluster supports it.

* Implement serialize_for_cql_with_timestamps() (types/types.cc), a
  variant of serialize_for_cql() that appends a per-element section to
  the regular CQL bytes, listing each live element's serialized key,
  timestamp, and expiry.  The format is:
    [uint32 cql_len][cql bytes]
    [int32  entry_count]
    [per entry: (int32 key_len)(key bytes)(int64 timestamp)(int64 expiry)]
  expiry is -1 when the element has no TTL.

* Add partition_slice::option::send_collection_timestamps and modify
  write_cell() (mutation_partition.cc) to use the new function
  serialize_for_cql_with_timestamps() when this option is available.
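
To make the layout above concrete, here is a small Python sketch that packs and unpacks a buffer with the same field order. It is an illustration only, not the implementation in types/types.cc: the function names are hypothetical, and big-endian byte order is an assumption (the commit message does not state the endianness).

```python
import struct


def pack_with_timestamps(cql_bytes, entries):
    """Build [uint32 cql_len][cql bytes][int32 entry_count][entries...].

    entries is a list of (key_bytes, timestamp, expiry); expiry is -1
    when the element has no TTL. Big-endian is assumed throughout.
    """
    out = bytearray()
    out += struct.pack(">I", len(cql_bytes))
    out += cql_bytes
    out += struct.pack(">i", len(entries))
    for key, timestamp, expiry in entries:
        out += struct.pack(">i", len(key))
        out += key
        out += struct.pack(">qq", timestamp, expiry)
    return bytes(out)


def unpack_with_timestamps(buf):
    """Inverse of pack_with_timestamps(); returns (cql_bytes, entries)."""
    (cql_len,) = struct.unpack_from(">I", buf, 0)
    pos = 4
    cql_bytes = buf[pos:pos + cql_len]
    pos += cql_len
    (count,) = struct.unpack_from(">i", buf, pos)
    pos += 4
    entries = []
    for _ in range(count):
        (key_len,) = struct.unpack_from(">i", buf, pos)
        pos += 4
        key = buf[pos:pos + key_len]
        pos += key_len
        timestamp, expiry = struct.unpack_from(">qq", buf, pos)
        pos += 16
        entries.append((key, timestamp, expiry))
    return cql_bytes, entries
```

Because the regular CQL bytes come first and are length-prefixed, a reader can recover them without understanding the per-element section.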

This commit stands alone with no user-visible effect: nothing yet sets
the new partition-slice option.  The next patch adds the selection-layer
code that sets the option and parses the extended response.
2026-04-12 11:49:06 +03:00
Nadav Har'El
38b675737d cql3: extend WRITETIME/TTL grammar to accept collection and UDT elements
Previously, WRITETIME() and TTL() only accepted a simple column name
(cident), so WRITETIME(m['key']) or WRITETIME(x.a) was a syntax error.
This patch begins to implement support for applying WRITETIME() and
TTL() to individual elements of a non-frozen map, set or UDT, as
requested in issue #15427.

On its own this commit only changes the parser (Cql.g). The prepare
step still rejects subscript and field-selection nodes with an
invalid_request_exception, so there is no user-visible behavior change
yet - just that a syntax error is replaced by a different error.

Upcoming patches add the extended wire format for per-element timestamps
(commit 2), the selection layer that consumes it (commit 3), and the
prepare/evaluate logic that ties everything together (commit 4), after
which WRITETIME() and TTL() for collection or UDT elements
will finally be fully functional.

The parser change in this patch expands the subscriptExpr rule to
support the col.field syntax, not only col[key]. This change also
allows the UDT field syntax to be used in LWT conditions, which is
another long-standing missing feature (#13624). But to correctly
support this feature we'll need an additional patch to fix a couple
of remaining bugs - this will be the eighth commit in this series.
2026-04-12 11:10:23 +03:00
Avi Kivity
8ccee6803e Merge 'Remove upgrade view builder' from Gleb Natapov
Since we no longer support upgrades from versions that do not support
v2 of the "view building status" code (building status is managed by raft), we can remove the v1 code and the upgrade code, and make sure we do not boot with the old "builder status" version.

v2 version was introduced by 8d25a4d678 which is included in scylla-2025.1.0.

No backport needed since this is code removal.

Closes scylladb/scylladb#29105

* github.com:scylladb/scylladb:
  view: drop unused v1 builder code
  view: remove upgrade to raft code
2026-04-12 00:39:26 +03:00
Botond Dénes
9770a4c081 test/cluster/test_encryption.py: use single-partition reads in read_verify_workload()
Replace the range scan in read_verify_workload() with individual
single-partition queries, using the keys returned by
prepare_write_workload() instead of hard-coding them.

The range scan was previously observed to time out in debug mode after
a hard cluster restart. Single-partition reads are lighter on the
cluster and less likely to time out under load.

The new verification is also stricter: instead of merely checking that
the expected number of rows is returned, it verifies that each written
key is individually readable, catching any data-loss or key-identity
mismatch that the old count-only check would have missed.
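
The difference between the two checks can be illustrated with a toy Python model, where a dictionary stands in for the table. The function names are hypothetical and do not correspond to the helpers in test_encryption.py.

```python
def verify_by_count(table, expected_keys):
    # Old-style check: only the number of rows is compared.
    return len(table) == len(expected_keys)


def verify_per_key(table, expected_keys):
    # Stricter check: every written key must be individually readable.
    return all(key in table for key in expected_keys)
```

If one written key is lost and an unrelated row appears in its place, the count still matches but the per-key check fails.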

This is the second attempt at stabilizing this test, after the recent
854c374ebf. That fix made sure that the
cluster has converged on topology and nodes see each other before running
the verify workload.

Fixes: SCYLLADB-1331

Closes scylladb/scylladb#29313
2026-04-12 00:38:20 +03:00
Avi Kivity
ca80ee8586 Merge 'Introduce maintenance scheduling supergroup and do initial population' from Pavel Emelyanov
The supergroup replaces the streaming (a.k.a. maintenance) group, inherits 200 shares from it, and consists of four sub-groups (all have equal shares of 200 within the new supergroup):

* maintenance_compaction. This group configures `compaction_manager::maintenance_sg()` group. User-triggered compaction runs in it
* backup. This group configures `snapshot_ctl::config::backup_sched_group`. Native backup activity runs there
* maintenance. It's a new "visible" name, everything that was called "maintenance" in the code ran in "streaming" group. Now it will run in "maintenance". The activities include those that don't communicate over RPC (see below why)
  * `tablet_allocator::balance_tablets()`
  * `sstables_manager::components_reclaim_reload_fiber()`
  * `tablet_storage_group_manager::merge_completion_fiber()`
  * metrics exporting http server altogether
* streaming. This is purely existing streaming group that just moves under the new supergroup. Everything else that was run there, continues doing so, including
  * hints sender
  * all view building related components (update generator, builder, workers)
  * repair
  * stream_manager
  * messaging service (except for verb handlers that switch groups)
  * join_cluster() activity
  * REST API
  * ... something else I forgot

The `--maintenance_io_throughput_mb_per_sec` option is introduced. It controls the IO throughput limit applied to the maintenance supergroup. If not set, the `--stream_io_throughput_mb_per_sec` option is used to preserve backward compatibility.

All new sched groups inherit `request_class::maintenance` (however, "backup" seems not to make any requests yet).

Moving more activities from "streaming" into "maintenance" (or its own group) is possible, but one will need to take care of RPC group switching. The thing is that when a client makes an RPC call, the server may switch to one of pre-negotiated scheduling groups. Verbs for existing activities that run in "streaming" group are routed through RPC index that negotiates "streaming" group on the server side. If any of that client code moves to some other group, server will still run the handlers in "streaming" which is not quite expected. That's one of the main reasons why only the selected fibers were moved to their own "maintenance" group. Similar for backup -- this code doesn't use RPC, so it can be moved. Restoring code uses load-and-stream and corresponding RPCs, so it cannot be just moved into its own new group.

Fixes SCYLLADB-351

New feature, not backporting

Closes scylladb/scylladb#28542

* github.com:scylladb/scylladb:
  code: Add maintenance/maintenance group
  backup: Add maintenance/backup group
  compaction: Add maintenance/maintenance_compaction group
  main: Introduce maintenance supergroup
  main: Move all maintenance sched group into streaming one
  database: Use local variable for current_scheduling_group
  code: Live-update IO throughputs from main
2026-04-12 00:34:48 +03:00
Botond Dénes
3289928679 repair: fix quadratic complexity when loading repair history
shared_tombstone_gc_state::update_repair_time() uses copy-on-write
semantics: each call copies the entire per_table_history_maps and the
per-table repair_history_map.  repair_service::load_history() called
this once per history entry, making the load O(N²) in both time and
memory.

Introduce batch_update_repair_time() which performs a single
copy-on-write for any number of entries belonging to the same table.
Restructure load_history() to collect entries into batches of up to
1000 and flush each batch in one call, keeping peak memory bounded.
The batch size limit is intentional: the repair history table currently
has no bound on the number of entries and can grow large.  Note that
this does not cause a problem in the in-memory history map itself:
entries are coalesced internally and only the latest repair time is
kept per range.  The unbounded entry count only makes the batched
update during load expensive.
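
The effect of batching on copy-on-write cost can be sketched in Python as follows. This is a simplified model, not the repair_service code: plain dictionaries stand in for per_table_history_maps, the function names are hypothetical, and the input is assumed to be already grouped by table.

```python
def update_batch(history, table, entries):
    """One copy-on-write step for any number of entries of a single table.

    history maps table -> {range: latest repair time}. The input maps are
    never mutated; a new outer map and a new per-table map are returned.
    """
    new_history = dict(history)                   # copy outer map once
    table_map = dict(new_history.get(table, {}))  # copy per-table map once
    for rng, repair_time in entries:
        # Only the latest repair time is kept per range.
        table_map[rng] = max(repair_time, table_map.get(rng, repair_time))
    new_history[table] = table_map
    return new_history


def load_history(entries_by_table, batch_size=1000):
    """Apply entries in bounded batches; returns (history, number of copies)."""
    history = {}
    copies = 0
    for table, entries in entries_by_table.items():
        for start in range(0, len(entries), batch_size):
            history = update_batch(history, table, entries[start:start + batch_size])
            copies += 1
    return history, copies
```

With N entries, the per-entry approach performs N full copies, while the batched approach performs about N / 1000, and each batch holds at most 1000 pending entries in memory.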

Fixes: SCYLLADB-104

Closes scylladb/scylladb#29326
2026-04-11 23:54:26 +03:00
Michał Hudobski
7d648961ed vector_search: forward non-primary key restrictions to Vector Store service
Include non-primary key restrictions (e.g. regular column filters) in
the filter JSON sent to the Vector Store service. Previously only
partition key and clustering column restrictions were forwarded, so
filtering on regular columns was silently ignored.

Add get_nonprimary_key_restrictions() getter to statement_restrictions.

Add unit tests for non-primary key equality, range, and bind marker
restrictions in filter_test.

Fixes: SCYLLADB-970

Closes scylladb/scylladb#29019
2026-04-10 17:16:29 +02:00
Piotr Dulikowski
3bd770d4d9 Merge 'counters: reuse counter IDs by rack' from Michael Litvak
For counter updates, use a counter ID that is constructed from the
node's rack instead of the node's host ID.

A rack can have at most two active tablet replicas at a time: a single
normal tablet replica, and during tablet migration there are two active
replicas, the normal and pending replica. Therefore we can have two
unique counter IDs per rack that are reused by all replicas in the rack.

We construct the counter ID from the rack UUID, which is constructed
from the name "dc:rack". The pending replica uses a deterministic
variation of the rack's counter ID by negating it.

This improves the performance and size of counter cells by having fewer
unique counter IDs and fewer counter shards in a counter cell.

Previously the number of counter shards was the number of different
host_ids that updated the counter, which is typically the number of
nodes in the cluster and keeps growing indefinitely when nodes are
replaced. With the rack-based counter ID the number of counter shards
will be at most twice the number of different racks (including removed
racks, which should not be significant).
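
The ID scheme can be illustrated with a short Python sketch. It is not the ScyllaDB implementation: the function names are hypothetical, deriving the rack UUID with `uuid.uuid5` under an arbitrary namespace is an assumption, and so is modelling "negating" as a bitwise complement of the 128-bit value. The description above states only that the ID is built from the name "dc:rack" and that the pending replica uses a deterministic negated variation.

```python
import uuid

MASK_128 = (1 << 128) - 1


def rack_counter_id(dc, rack):
    # Assumption: a name-based UUID over "dc:rack"; the namespace is arbitrary.
    return uuid.uuid5(uuid.NAMESPACE_OID, f"{dc}:{rack}")


def pending_counter_id(counter_id):
    # Assumption: "negating" is modelled as a bitwise complement.
    return uuid.UUID(int=~counter_id.int & MASK_128)
```

Every replica in a rack derives the same pair of IDs, so the number of distinct IDs is bounded by twice the number of racks regardless of how many nodes are replaced.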

Fixes SCYLLADB-356

backport not needed - an enhancement

Closes scylladb/scylladb#28901

* github.com:scylladb/scylladb:
  docs/dev: add counters doc
  counters: reuse counter IDs by rack
2026-04-10 12:24:18 +02:00
Wojciech Mitros
163c6f71d6 transport: refactor result_message bounce interface
Replace move_to_shard()/move_to_host() with as_bounce()/target_shard()/
target_host() to clarify the interface after bounce was extended to
support cross-node bouncing.

- Add virtual as_bounce() returning const bounce* to the base class
  (nullptr by default, overridden in bounce to return this), replacing
  the virtual move_to_shard() which conflated bounce detection with
  shard access
- Rename move_to_shard() -> target_shard() (now non-virtual, returns
  unsigned directly) and move_to_host() -> target_host() on bounce
- Replace dynamic_pointer_cast with static_pointer_cast at call sites
  that already checked as_bounce()
- Move forward declarations of message types before the virtual
  methods so as_bounce() can reference bounce

Fixes: SCYLLADB-1066

Closes scylladb/scylladb#29367
2026-04-10 12:17:43 +02:00
Piotr Dulikowski
32e3a01718 Merge 'service: strong_consistency: Allow for aborting operations' from Dawid Mędrek
Motivation
----------

Since strongly consistent tables are based on the concept of Raft
groups, operations on them can get stuck for indefinite amounts of
time. That may be problematic, and so we'd like to implement a way
to cancel those operations at suitable times.

Description of solution
-----------------------

The situations we focus on are the following:

* Timed-out queries
* Leader changes
* Tablet migrations
* Table drops
* Node shutdowns

We handle each of them and provide validation tests.

Implementation strategy
-----------------------

1. Auxiliary commits.
2. Abort operations on timeout.
3. Abort operations on tablet removal.
4. Extend `client_state`.
5. Abort operation on shutdown.
6. Help `state_machine` be aborted as soon as possible.

Tests
-----

We provide tests that validate the correctness of the solution.

The total time spent on `test_strong_consistency.py`
(measured on my local machine, dev mode):

Before:
```
real    0m31.809s
user    1m3.048s
sys     0m21.812s
```

After:
```
real    0m34.523s
user    1m10.307s
sys     0m27.223s
```

The incremental differences in time can be found in the commit messages.

Fixes SCYLLADB-429

Backport: not needed. This is an enhancement to an experimental feature.

Closes scylladb/scylladb#28526

* github.com:scylladb/scylladb:
  service: strong_consistency: Abort state_machine::apply when aborting server
  service: strong_consistency: Abort ongoing operations when shutting down
  service: client_state: Extend with abort_source
  service: strong_consistency: Handle abort when removing Raft group
  service: strong_consistency: Abort Raft operations on timeout
  service: strong_consistency: Use timeout when mutating
  service: strong_consistency: Fix indentation
  service: strong_consistency: Enclose coordinator methods with try-catch
  service: strong_consistency: Crash at unexpected exception
  test: cluster: Extract default config & cmdline in test_strong_consistency.py
2026-04-10 11:11:21 +02:00
Pavel Emelyanov
0b336da89d Revert "cmake: add missing rolling_max_tracker_test and symmetric_key_test"
This reverts commit 8b4a91982b.

Two commits independently added rolling_max_tracker_test to test/boost/CMakeLists.txt:
8b4a919 cmake: add missing rolling_max_tracker_test and symmetric_key_test
f3a91df test/cmake: add missing tests to boost test suite

The second was merged two days after the first. They didn't conflict at
the code level and applied cleanly, resulting in duplicate add_scylla_test()
entries that break the CMake build:

    CMake Error: add_executable cannot create target
    "test_boost_rolling_max_tracker_test" because another target
    with the same name already exists.

Remove the duplicate.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Reported-by: Łukasz Paszkowski <lukasz.paszkowski@scylladb.com>
2026-04-10 11:19:43 +03:00
Patryk Jędrzejczak
751bf31273 Merge 'More gossiper cleanups' from Gleb Natapov
The PR contains more code cleanups, mostly in gossiper. It drops more gossiper states, leaving only NORMAL and SHUTDOWN. All other states are checked against the topology state. Those two are left because the SHUTDOWN state is propagated through gossiper only, and when the node is not in SHUTDOWN it should be in some other state.

No need to backport. Cleanups.

Closes scylladb/scylladb#29129

* https://github.com/scylladb/scylladb:
  storage_service: cleanup unused code
  storage_service: simplify get_peer_info_for_update
  gossiper: send shutdown notifications in parallel
  gms: remove unused code
  virtual_tables: no need to call gossiper if we already know that the node is in shutdown
  gossiper: print node state from raft topology in the logs
  gossiper: use is_shutdown instead of code it manually
  gossiper: mark endpoint_state(inet_address ip) constructor as explicit
  gossiper: remove unused code
  gossiper: drop last use of LEFT state and drop the state
  gossiper: drop unused STATUS_BOOTSTRAPPING state
  gossiper: rename is_dead_state to is_left since this is all that the function checks now.
  gossiper: use raft topology state instead of gossiper one when checking node's state
  storage_service: drop check_for_endpoint_collision function
  storage_service: drop is_first_node function
  gossiper: remove unused REMOVED_TOKEN state
  gossiper: remove unused advertise_token_removed function
2026-04-10 09:56:20 +02:00
Nadav Har'El
6674aa29ca Merge 'Add Cassandra SAI (StorageAttachedIndex) compatibility' from Szymon Wasik
Cassandra's native vector index type is StorageAttachedIndex (SAI). Libraries such as CassIO, LangChain, and LlamaIndex generate `CREATE CUSTOM INDEX` statements using the SAI class name. Previously, ScyllaDB rejected these with "Non-supported custom class".

This PR adds compatibility so that SAI-style CQL statements work on ScyllaDB without modification.

1. **test: enable SAI_VECTOR_ALLOW_CUSTOM_PARAMETERS for Cassandra tests**
   Enables the `SAI_VECTOR_ALLOW_CUSTOM_PARAMETERS` Cassandra system property so that `search_beam_width` tests pass against Cassandra 5.0.7.

2. **test: modernize vector index test comments and fix xfail**
   Updates test comments from "Reproduces" to "Validates fix for" for clarity, and converts the `test_ann_query_with_pk_restriction` xfail into a stripped-down CREATE INDEX syntax test (removing unused INSERT/SELECT lines). Removes the redundant `test_ann_query_with_non_pk_restriction` test.

3. **cql: add Cassandra SAI (StorageAttachedIndex) compatibility**
   Core implementation: the SAI class name is detected and translated to ScyllaDB's native `vector_index`. The fully-qualified class name (`org.apache.cassandra.index.sai.StorageAttachedIndex`) requires exact case; short names (`StorageAttachedIndex`, `sai`) are matched case-insensitively — matching Cassandra's behavior. Non-vector and multi-column SAI targets are rejected with clear errors. Adds `skip_on_scylla_vnodes` fixture, SAI compatibility docs, and the Cassandra compatibility table entry (split into "SAI general" vs "SAI for vector search").

4. **cql: accept source_model option for Cassandra SAI compatibility**
   The `source_model` option is a Cassandra SAI property used by Cassandra libraries (e.g., CassIO) to tag vector indexes with the name of the embedding model. ScyllaDB accepts it for compatibility but does not use it — the validator is a no-op lambda. The option is preserved in index metadata and returned in DESCRIBE INDEX output.
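
The class-name matching rule from item 3 can be sketched in Python as follows. This is an illustration of the stated rule only, not the logic in create_index_statement.cc; `is_sai_class()` is a hypothetical name.

```python
FULLY_QUALIFIED = "org.apache.cassandra.index.sai.StorageAttachedIndex"
SHORT_NAMES = {"storageattachedindex", "sai"}


def is_sai_class(class_name):
    """Return True if class_name designates Cassandra's SAI index class.

    The fully-qualified name must match exactly, including case;
    the short names are matched case-insensitively.
    """
    return class_name == FULLY_QUALIFIED or class_name.lower() in SHORT_NAMES
```

Under this rule `SAI` and `storageattachedindex` are accepted, while an upper-cased fully-qualified name is not.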

- `cql3/statements/create_index_statement.cc`: SAI class detection and rewriting logic
- `index/secondary_index_manager.cc`: case-insensitive class name lookup (lowercasing restored before `classes.find()`)
- `index/vector_index.cc`: `source_model` accepted as a valid option with no-op validator
- `docs/cql/secondary-indexes.rst`: SAI compatibility documentation with `source_model` table row
- `docs/using-scylla/cassandra-compatibility.rst`: SAI entry split into general (not supported) and vector search (supported)
- `test/cqlpy/conftest.py`: `scylla_with_tablets` renamed to `skip_on_scylla_vnodes`
- `test/cqlpy/test_vector_index.py`: SAI tests inlined (no constants), `check_bad_option()` helper for numeric validation, uppercase class name test, merged `source_model` tests with DESCRIBE check

| Backend            | Passed | Skipped | Failed |
|--------------------|--------|---------|--------|
| ScyllaDB (dev)     | 42     | 0       | 0      |
| Cassandra 5.0.7    | 16     | 26      | 0      |

None: new feature.

Fixes: SCYLLADB-239

Closes scylladb/scylladb#28645

* github.com:scylladb/scylladb:
  cql: accept source_model option and show options in DESCRIBE
  cql: add Cassandra SAI (StorageAttachedIndex) compatibility
  test: modernize vector index test comments and fix xfail
  test: enable SAI_VECTOR_ALLOW_CUSTOM_PARAMETERS for Cassandra tests
2026-04-10 10:21:20 +03:00
Avi Kivity
f67d0739d0 test: user_function_test: adjust Lua error message tests
Lua 5.5 changed the error message slightly ("?:-1" -> "?:?"). Relax
the error message tests to avoid this unimportant fragment.

Closes scylladb/scylladb#29414
2026-04-10 01:09:35 +03:00
Piotr Szymaniak
98d6edaa88 alternator: add comment explaining delta_mode::keys in add_stream_options()
Clarify that cdc::delta_mode is ignored by Alternator, so we use the
least expensive mode (keys) to reduce overhead.

Fixes scylladb/scylladb#24812

Closes scylladb/scylladb#29408
2026-04-10 01:07:21 +03:00
Michał Hudobski
c8b9fde828 auth: allow VECTOR_SEARCH_INDEXING permission to access system.tablets
Add system.tablets to the set of system resources that can be
accessed with the VECTOR_SEARCH_INDEXING permission.

Fixes: VECTOR-605

Closes scylladb/scylladb#29397
2026-04-09 21:53:07 +03:00
Szymon Wasik
573def7cd8 cql: accept source_model option and show options in DESCRIBE
Accept the Cassandra SAI 'source_model' option for vector indexes.
This option is used by Cassandra libraries (e.g., CassIO, LangChain)
to tag vector indexes with the name of the embedding model that
produced the vectors.

ScyllaDB does not use the source_model value but stores it and
includes it in the DESCRIBE INDEX output for Cassandra compatibility.

Additionally, extend vector_index::describe() to emit a
WITH OPTIONS = {...} clause containing all user-provided index options
(filtering out system keys: target, class_name, index_version).
This makes options like similarity_function, source_model, etc.
visible in DESCRIBE output.
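
The option filtering can be sketched in Python as below. This is a rough illustration, not vector_index::describe(): the function name is hypothetical, and the exact quoting and ordering of the emitted clause are assumptions. Only the set of filtered system keys comes from the description above.

```python
SYSTEM_KEYS = {"target", "class_name", "index_version"}


def with_options_clause(options):
    """Render user-provided index options as a WITH OPTIONS clause.

    System keys are filtered out; an empty string is returned when no
    user-provided options remain. Keys are sorted for stable output.
    """
    user = {k: v for k, v in sorted(options.items()) if k not in SYSTEM_KEYS}
    if not user:
        return ""
    body = ", ".join(f"'{k}': '{v}'" for k, v in user.items())
    return f" WITH OPTIONS = {{{body}}}"
```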
2026-04-09 17:20:03 +02:00
Szymon Wasik
80a2e4a0ab cql: add Cassandra SAI (StorageAttachedIndex) compatibility
Libraries such as CassIO, LangChain, and LlamaIndex create vector
indexes using Cassandra's StorageAttachedIndex (SAI) class name.
This commit lets ScyllaDB accept these statements without modification.

When a CREATE CUSTOM INDEX statement specifies an SAI class name on a
vector column, ScyllaDB automatically rewrites it to the native
vector_index implementation. Accepted class names (case-insensitive):
  - org.apache.cassandra.index.sai.StorageAttachedIndex
  - StorageAttachedIndex
  - sai

SAI on non-vector columns is rejected with a clear error directing
users to a secondary index instead.

The SAI detection and rewriting logic is extracted into a dedicated
static function (maybe_rewrite_sai_to_vector_index) to keep the
already-long validate_while_executing method manageable.

Multi-column (local index) targets and nonexistent columns are
skipped with continue — the former are treated as filtering columns
by vector_index::check_target(), and the latter are caught later by
vector_index::validate().

Tests that exercise features common to both backends (basic creation,
similarity_function, IF NOT EXISTS, bad options, etc.) now use the
SAI class name with the skip_on_scylla_vnodes fixture so they run
against both ScyllaDB and Cassandra. ScyllaDB-specific tests continue
to use USING 'vector_index' with scylla_only.
2026-04-09 17:20:03 +02:00
Szymon Wasik
fa7edc627c test: modernize vector index test comments and fix xfail
- Change 'Reproduces' to 'Validates fix for' in test comments to
  reflect that the referenced issues are already fixed.
- Condense the VECTOR-179 comment to two lines.
- Replace the xfailed test_ann_query_with_restriction_works_only_on_pk
  with a focused test (test_ann_query_with_pk_restriction) that creates
  a vector index on a table with a PK column restriction, validating
  the VECTOR-374 fix.
2026-04-09 17:20:02 +02:00
Szymon Wasik
4eab050be4 test: enable SAI_VECTOR_ALLOW_CUSTOM_PARAMETERS for Cassandra tests
2026-04-09 17:20:02 +02:00
Andrzej Jackowski
23c386a27f test: perf: add audit-unix-socket-path to perf-simple-query
To allow performance benchmarking with custom syslog sinks.

Example use case:

-- Audit + default syslog: ~100k tps
taskset -c 0,2,4 ./build/release/scylla perf-simple-query --smp 3 --write --duration 30 --audit "syslog" --audit-keyspace "ks" --audit-categories "DCL,DDL,AUTH,DML,QUERY"

```
110263.72 tps ( 66.1 allocs/op,  16.0 logallocs/op,  25.7 tasks/op,  254900 insns/op,  144796 cycles/op,        0 errors)
throughput:
	mean=   107137.48 standard-deviation=3142.98
	median= 106665.00 median-absolute-deviation=1786.03
	maximum=111435.19 minimum=97620.79
instructions_per_op:
	mean=   256311.36 standard-deviation=5037.13
	median= 256288.09 median-absolute-deviation=2223.08
	maximum=274220.89 minimum=248141.40
cpu_cycles_per_op:
	mean=   146443.47 standard-deviation=2844.19
	median= 146001.85 median-absolute-deviation=1514.82
	maximum=157177.54 minimum=142981.03
```

-- Audit + custom syslog: ~400k tps
socat -u UNIX-RECV:/tmp/audit-null.sock,type=2 OPEN:/dev/null
taskset -c 0,2,4 ./build/release/scylla perf-simple-query --smp 3 --write --duration 30 --audit "syslog" --audit-keyspace "ks" --audit-categories "DCL,DDL,AUTH,DML,QUERY" --audit-unix-socket-path /tmp/audit-null.sock

```
404929.62 tps ( 65.9 allocs/op,  16.0 logallocs/op,  25.5 tasks/op,   77406 insns/op,   35559 cycles/op,        0 errors)
throughput:
	mean=   399868.39 standard-deviation=6232.88
	median= 401770.65 median-absolute-deviation=3859.09
	maximum=406126.79 minimum=383434.84
instructions_per_op:
	mean=   77481.26 standard-deviation=168.31
	median= 77405.54 median-absolute-deviation=84.33
	maximum=78081.46 minimum=77332.84
cpu_cycles_per_op:
	mean=   35871.32 standard-deviation=516.83
	median= 35699.70 median-absolute-deviation=251.15
	maximum=37454.86 minimum=35432.60
```

-- No audit: ~800k tps
taskset -c 0,2,4 ./build/release/scylla perf-simple-query --smp 3 --write --duration 30

```
808970.95 tps ( 53.3 allocs/op,  16.0 logallocs/op,  14.9 tasks/op,   49904 insns/op,   20471 cycles/op,        0 errors)
throughput:
	mean=   809065.31 standard-deviation=6222.39
	median= 810507.10 median-absolute-deviation=1827.99
	maximum=815213.41 minimum=782104.84
instructions_per_op:
	mean=   49905.50 standard-deviation=21.81
	median= 49900.12 median-absolute-deviation=7.72
	maximum=50010.97 minimum=49892.57
cpu_cycles_per_op:
	mean=   20429.00 standard-deviation=41.40
	median= 20425.18 median-absolute-deviation=29.11
	maximum=20530.74 minimum=20355.42
```

Closes scylladb/scylladb#29396
2026-04-09 16:00:41 +03:00
Anna Stuchlik
c6587c6a70 doc: Fix malformed markdown link in alternator network docs
Fixes https://github.com/scylladb/scylladb/issues/29400

Closes scylladb/scylladb#29402
2026-04-09 15:54:43 +03:00
Botond Dénes
5886d1841a Merge 'cmake: align CMake build system with configure.py and add comparison script' from Ernest Zaslavsky
Every time someone modifies the build system — adding a source file, changing a compilation flag, or wiring a new test — the change tends to land in only one of our two build systems (configure.py or CMake). Over time this causes three classes of problems:

1. **CMake stops compiling entirely.** Missing defines, wrong sanitizer flags, or misplaced subdirectory ordering cause hard build failures that are only discovered when someone tries to use CMake (e.g. for IDE integration).

2. **Missing build targets.** Tests or binaries present in configure.py are never added to CMake, so `cmake --build` silently skips them. This PR fixes several such cases (e.g. `symmetric_key_test`, `auth_cache_test`, `sstable_tablet_streaming`).

3. **Missing compilation units in targets.** A `.cc` file is added to a test binary in one system but not the other, causing link errors or silently omitted test coverage.

To fix the existing drift and prevent future divergence, this series:

**Adds a build-system comparison script**
(`scripts/compare_build_systems.py`) that configures both systems into a temporary directory, parses their generated `build.ninja` files, and compares per-file compilation flags, link target sets, and per-target libraries. configure.py is treated as the baseline; CMake must match it. The script supports a `--ci` mode suitable for gating PRs that touch build files.
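
A minimal sketch of the per-file flag comparison idea, not the actual `scripts/compare_build_systems.py`. It assumes a simplified `build.ninja` syntax (no `$` escapes, line continuations, or variable expansion) and a hypothetical per-build variable named `FLAGS`; `parse_build_flags` and `compare` are illustrative names.

```python
import shlex
from typing import Dict, List, Optional, Set, Tuple


def parse_build_flags(ninja_text: str, var: str = "FLAGS") -> Dict[str, Set[str]]:
    """Map each build output to the set of tokens bound to `var` (hypothetical name)."""
    flags: Dict[str, Set[str]] = {}
    outputs: List[str] = []
    for line in ninja_text.splitlines():
        if line.startswith("build "):
            # "build <outputs>: <rule> <inputs>"; keep only the outputs.
            outputs = line[len("build "):].split(":", 1)[0].split()
            for out in outputs:
                flags.setdefault(out, set())
        elif line.startswith(" ") and outputs:
            # Indented "name = value" binding belonging to the last build statement.
            name, sep, value = line.strip().partition("=")
            if sep and name.strip() == var:
                for out in outputs:
                    flags[out] = set(shlex.split(value))
        else:
            outputs = []
    return flags


def compare(
    baseline: Dict[str, Set[str]], candidate: Dict[str, Set[str]]
) -> Dict[str, Optional[Tuple[List[str], List[str]]]]:
    """Return mismatches: None if the target is absent, else (missing, extra) flags."""
    diffs: Dict[str, Optional[Tuple[List[str], List[str]]]] = {}
    for out, want in baseline.items():
        got = candidate.get(out)
        if got is None:
            diffs[out] = None
        elif got != want:
            diffs[out] = (sorted(want - got), sorted(got - want))
    return diffs
```

Comparing flags as sets makes the check insensitive to flag order, which is an assumption of this sketch; a real comparison may need to preserve order for flags where it matters.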

**Fixes all current mismatches** found by the script:
- Mode flag alignment in `mode.common.cmake` and `mode.Coverage.cmake`
  (sanitizer flags, `-fno-lto`, stack-usage warnings, coverage defines).
- Global define alignment (`SEASTAR_NO_EXCEPTION_HACK`, `XXH_PRIVATE_API`,
  `BOOST_ALL_DYN_LINK`, `SEASTAR_TESTING_MAIN` placement).
- Seastar build configuration (shared vs static per mode, coverage
  sanitizer link options).
- Abseil sanitizer flags (`-fno-sanitize=vptr`).
- Missing test targets in `test/boost/CMakeLists.txt`.
- Redundant per-test flags now covered by global settings.
- Lua library resolution via a custom `cmake/FindLua.cmake` using
  pkg-config, matching configure.py's approach.

**Adds documentation** (`docs/dev/compare-build-systems.md`) describing how to run the script and interpret its output.

No backport needed — this is build infrastructure improvement only.

Closes scylladb/scylladb#29273

* github.com:scylladb/scylladb:
  scripts: remove lua library rename workaround from comparison script
  cmake: add custom FindLua using pkg-config to match configure.py
  test/cmake: add missing tests to boost test suite
  test/cmake: remove per-test LTO disable
  cmake: add BOOST_ALL_DYN_LINK and strip per-component defines
  cmake: move SEASTAR_TESTING_MAIN after seastar and abseil subdirs
  cmake: add -fno-sanitize=vptr for abseil sanitizer flags
  cmake: align Seastar build configuration with configure.py
  cmake: align global compile defines and options with configure.py
  cmake: fix Coverage mode in mode.Coverage.cmake
  cmake: align mode.common.cmake flags with configure.py
  configure.py: add sstable_tablet_streaming to combined_tests
  docs: add compare-build-systems.md
  scripts: add compare_build_systems.py to compare ninja build files
2026-04-09 15:46:09 +03:00
Yaniv Michael Kaul
13879b023f tracing: set_skip_when_empty() for error-path metrics
Add .set_skip_when_empty() to all error-path metrics in the tracing
module. Tracing itself is not a commonly used feature, making all of
these metrics almost always zero:

Tier 1 (very rare - corruption/schema issues):
- tracing_keyspace_helper::bad_column_family_errors: tracing schema
  missing or incompatible, should never happen post-bootstrap
- tracing::trace_errors: internal error building trace parameters

Tier 2 (overload - tracing backend saturated):
- tracing::dropped_sessions: too many pending sessions
- tracing::dropped_records: too many pending records

Tier 3 (general tracing write errors):
- tracing_keyspace_helper::tracing_errors: errors during writes to
  system_traces keyspace

Since tracing is an opt-in feature that most deployments rarely use,
all five metrics are almost always zero and create unnecessary
reporting overhead.

AI-Assisted: yes
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#29346
2026-04-09 14:28:16 +03:00
Michael Litvak
3964040008 docs/dev: add counters doc
Add documentation of the counters feature implementation in
docs/dev/counters.md.

The documentation is taken from the wiki and updated according to the
current state of the code - legacy details are removed, and a section
about the counter id is added.
2026-04-09 13:08:02 +02:00
Michael Litvak
b71762d5da counters: reuse counter IDs by rack
For counter updates, use a counter ID that is constructed from the
node's rack instead of the node's host ID.

A rack can have at most two active tablet replicas at a time: normally
a single tablet replica, and during tablet migration both the normal
and the pending replica. Therefore we can have two unique counter IDs
per rack that are reused by all replicas in the rack.

We construct the counter ID from the rack UUID, which is constructed
from the name "dc:rack". The pending replica uses a deterministic
variation of the rack's counter ID by negating it.

This improves the performance and size of counter cells by having fewer
unique counter IDs and fewer counter shards in a counter cell.

Previously the number of counter shards was the number of different
host_id's that updated the counter, which is typically the number of
nodes in the cluster and can keep growing indefinitely as nodes are
replaced. With the rack-based counter ID the number of counter shards
will be at most twice the number of different racks (including removed
racks, which should not be significant).
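
A minimal sketch of the rack-based ID scheme, under two loudly stated
assumptions: the rack UUID is modeled as a name-based UUID over
"dc:rack" (the actual UUID scheme is not given here), and "negating"
is modeled as flipping all 128 bits. `rack_counter_id` is a
hypothetical name.

```python
import uuid

MASK128 = (1 << 128) - 1


def rack_counter_id(dc: str, rack: str, pending: bool = False) -> uuid.UUID:
    """Derive a deterministic counter ID from the rack name (illustrative only)."""
    # Assumption: a name-based UUID built from "dc:rack".
    base = uuid.uuid3(uuid.NAMESPACE_OID, f"{dc}:{rack}")
    if not pending:
        return base
    # Assumption: the pending replica's variation flips every bit of the ID.
    return uuid.UUID(int=~base.int & MASK128)


# Five replicas spread over two racks still yield at most 2 * 2 distinct IDs.
replica_racks = ["r1", "r2", "r1", "r2", "r1"]
ids = {
    rack_counter_id("dc1", rack, pending)
    for rack in replica_racks
    for pending in (False, True)
}
```

Because the ID depends only on the rack name and the normal/pending
role, replacing a node leaves the set of IDs unchanged.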

Fixes SCYLLADB-356
2026-04-09 13:08:02 +02:00
Yaniv Michael Kaul
2c0076d3ef replica: set_skip_when_empty() for rare error-path metrics
Add .set_skip_when_empty() to four metrics in replica/database.cc that
are only incremented on very rare error paths and are almost always zero:

- database::dropped_view_updates: view updates dropped due to overload.
  NOTE: this metric appears to never be incremented in the current
  codebase and may be a candidate for removal.
- database::multishard_query_failed_reader_stops: documented as a 'hard
  badness counter' that should always be zero. NOTE: no increment site
  was found in the current codebase; may be a candidate for removal.
- database::multishard_query_failed_reader_saves: documented as a 'hard
  badness counter' that should always be zero.
- database::total_writes_rejected_due_to_out_of_space_prevention: only
  fires when disk utilization is critical and user table writes are
  disabled, a very rare operational state.

These metrics create unnecessary reporting overhead when they are
perpetually zero. set_skip_when_empty() suppresses them from metrics
output until they become non-zero.
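
A minimal sketch of the suppression semantics described above. It
models the behavior only and is not the Seastar metrics API; the
`Counter` class, the `render` function, and the `total_reads` metric
name are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Counter:
    name: str
    value: int = 0
    skip_when_empty: bool = False


def render(metrics: List[Counter]) -> List[str]:
    """Emit one line per metric, omitting flagged metrics that are still zero."""
    lines = []
    for m in metrics:
        if m.skip_when_empty and m.value == 0:
            continue  # suppressed until the first increment
        lines.append(f"{m.name} {m.value}")
    return lines


reads = Counter("total_reads")
dropped = Counter("dropped_view_updates", skip_when_empty=True)
before = render([reads, dropped])   # only total_reads is reported
dropped.value += 1
after = render([reads, dropped])    # both metrics are reported
```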

AI-Assisted: yes
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#29345
2026-04-09 14:07:28 +03:00
Botond Dénes
86417d49de Merge 'transport: improve memory accounting for big responses and slow network' from Marcin Maliszkiewicz
After obtaining the CQL response, check if its actual size exceeds the initially acquired memory permit. If so, acquire additional semaphore units and adopt them into the permit, ensuring accurate memory accounting for large responses.

Additionally, move the permit into a .then() continuation so that the semaphore units are kept alive until write_message finishes, preventing premature release of memory permit. This is especially important with slow networks and big responses when buffers can accumulate and deplete a node's memory.
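
A minimal asyncio sketch of the two ideas above: top up the permit when the response turns out larger than the estimate, and release it only after the write completes. This is an illustration, not the transport code; `UnitSemaphore`, `Permit`, `serve`, and the callback parameters are hypothetical names.

```python
import asyncio


class UnitSemaphore:
    """Counting semaphore that hands out an arbitrary number of units."""

    def __init__(self, units: int) -> None:
        self.available = units
        self._cond = asyncio.Condition()

    async def acquire(self, units: int) -> None:
        async with self._cond:
            await self._cond.wait_for(lambda: self.available >= units)
            self.available -= units

    async def release(self, units: int) -> None:
        async with self._cond:
            self.available += units
            self._cond.notify_all()


class Permit:
    """Tracks the units held on behalf of one request."""

    def __init__(self, sem: UnitSemaphore, units: int) -> None:
        self.sem = sem
        self.units = units

    async def grow_to(self, size: int) -> None:
        extra = size - self.units
        if extra > 0:
            # Response is bigger than estimated: account for the difference.
            await self.sem.acquire(extra)
            self.units += extra

    async def release(self) -> None:
        await self.sem.release(self.units)
        self.units = 0


async def serve(sem, estimate, build_response, write_message):
    await sem.acquire(estimate)
    permit = Permit(sem, estimate)
    try:
        response = await build_response()
        await permit.grow_to(len(response))
        # The permit stays held while the (possibly slow) write is in flight.
        await write_message(response)
    finally:
        await permit.release()
```

Releasing in `finally` after `write_message` is the sketch's stand-in for keeping the units alive until the write finishes.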

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1306
Related https://scylladb.atlassian.net/browse/SCYLLADB-740

Backport: all supported versions

Closes scylladb/scylladb#29288

* github.com:scylladb/scylladb:
  transport: add per-service-level pending response memory metric
  transport: hold memory permit until response write completes
  transport: account for response size exceeding initial memory estimate
2026-04-09 13:36:31 +03:00
Yaniv Michael Kaul
5c8b4a003e db: set_skip_when_empty() for rare error-path metrics
Add .set_skip_when_empty() to four metrics in the db module that are
only incremented on very rare error paths and are almost always zero:

- cache::pinned_dirty_memory_overload: described as 'should sit
  constantly at 0, nonzero is indicative of a bug'
- corrupt_data::entries_reported: only fires on actual data corruption
- hints::corrupted_files: only fires on on-disk hint file corruption
- rate_limiter::failed_allocations: only fires when the rate limiter
  hash table is completely full and gives up allocating, requiring
  extreme cardinality pressure

These metrics create unnecessary reporting overhead when they are
perpetually zero. set_skip_when_empty() suppresses them from metrics
output until they become non-zero.

AI-Assisted: yes
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#29344
2026-04-09 13:32:09 +03:00
Gleb Natapov
dbaba7ab8a storage_service: cleanup unused code
Remove unused definition and double includes.
2026-04-09 13:31:41 +03:00
Gleb Natapov
b050b593b3 storage_service: simplify get_peer_info_for_update
It does nothing for fields managed in raft, so drop their processing.
2026-04-09 13:31:41 +03:00
Gleb Natapov
d0576c109f gossiper: send shutdown notifications in parallel 2026-04-09 13:31:40 +03:00
Gleb Natapov
1586fa65af gms: remove unused code
Also moved version_string(...) and make_token_string(...) to the private: section, since they are internal helpers used only by normal() and are not part of the public API.
2026-04-09 13:31:40 +03:00
Gleb Natapov
b2e35c538f virtual_tables: no need to call gossiper if we already know that the node is in shutdown 2026-04-09 13:31:40 +03:00
Gleb Natapov
e17fc180a0 gossiper: print node state from raft topology in the logs
Raft topology now holds the node's real state. Gossiper states are now
set to NORMAL and SHUTDOWN only.
2026-04-09 13:31:40 +03:00
Gleb Natapov
8439154851 gossiper: use is_shutdown instead of coding it manually 2026-04-09 13:31:39 +03:00
Gleb Natapov
7d700d0377 gossiper: mark endpoint_state(inet_address ip) constructor as explicit
The get_live_members function called is_shutdown with an inet_address
argument, which caused a temporary endpoint_state to be created. Fix
it by prohibiting the implicit conversion and calling the correct
is_shutdown function instead.
2026-04-09 13:31:39 +03:00
Gleb Natapov
6df4f572d5 gossiper: remove unused code 2026-04-09 13:31:39 +03:00
Gleb Natapov
67102496c8 gossiper: drop last use of LEFT state and drop the state
Decommission sets the LEFT gossiper state only to prevent a shutdown
notification from being issued by the node during shutdown. Since the
notification code now checks the state in raft topology, this is no
longer needed.
2026-04-09 13:31:39 +03:00
Gleb Natapov
54d2c95094 gossiper: drop unused STATUS_BOOTSTRAPPING state 2026-04-09 13:31:38 +03:00
Gleb Natapov
7c895ced19 gossiper: rename is_dead_state to is_left since this is all that the function checks now. 2026-04-09 13:31:38 +03:00
Gleb Natapov
7dfb0577b8 gossiper: use raft topology state instead of gossiper one when checking node's state
Raft topology state is the source of truth for a node's state, so use it
instead of the gossiper one.
2026-04-09 13:31:38 +03:00
Gleb Natapov
c17c4806a1 storage_service: drop check_for_endpoint_collision function
All the checks that it does are also done by the join coordinator, and
the join coordinator uses the more reliable raft state instead of the
gossiper one.
2026-04-09 13:31:37 +03:00
Gleb Natapov
1ac8edb22b storage_service: drop is_first_node function
It makes no sense now since the first node to bootstrap is determined by
the discover_group0 algorithm.
2026-04-09 13:31:37 +03:00
Gleb Natapov
681aa9ebe1 gossiper: remove unused REMOVED_TOKEN state 2026-04-09 13:31:37 +03:00