11884 Commits

Author SHA1 Message Date
Evgeniy Naydanov
901c452c82 test.py: reduce cgroup overhead in resource metrics gathering
Only enable the memory controller in cgroup subtree_control instead of
all available controllers. cpu.stat is available in cgroup v2 without
enabling the cpu controller (base accounting), and enabling io/pids/cpu
controllers adds unnecessary per-operation kernel overhead to Scylla
processes - particularly the memory controller's per-page-cache-operation
accounting combined with io controller overhead during heavy I/O.

Additionally, restrict SystemResourceMonitor to the master process only.
System-wide metrics (CPU%, memory) are identical from any process, so
running a monitoring thread in each xdist worker was redundant and added
unnecessary SQLite write contention and thread scheduling noise.
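
A minimal sketch of the controller selection described above, assuming a cgroup v2 hierarchy. The helper names `subtree_control_payload` and `enable_controllers` are hypothetical and are not taken from test.py:

```python
from pathlib import Path


def subtree_control_payload(wanted=("memory",)):
    """Build the string written to cgroup.subtree_control, e.g. '+memory'."""
    return " ".join(f"+{name}" for name in wanted)


def enable_controllers(cgroup_dir, wanted=("memory",)):
    """Enable only the requested controllers for children of cgroup_dir.

    Controllers that the kernel does not list in cgroup.controllers are
    skipped, so the write never asks for something unavailable.
    """
    root = Path(cgroup_dir)
    available = (root / "cgroup.controllers").read_text().split()
    selected = [name for name in wanted if name in available]
    (root / "cgroup.subtree_control").write_text(subtree_control_payload(selected))
    return selected
```

Leaving cpu, io and pids out of `wanted` is what keeps their per-operation accounting off the monitored processes.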

Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com>
2026-05-26 05:10:05 +00:00
Nadav Har'El
96dd3121e7 Merge 'cql: rewrite CassIO SAI metadata index to regular secondary index' from Szymon Wasik
CassIO (the library backing LangChain's `langchain_community.vectorstores.Cassandra` integration) issues the following DDL during schema setup to create a metadata index:

```sql
CREATE CUSTOM INDEX IF NOT EXISTS eidx_metadata_s_<table>
ON <keyspace>.<table> (ENTRIES(metadata_s))
USING 'org.apache.cassandra.index.sai.StorageAttachedIndex';
```

ScyllaDB does not support Cassandra's StorageAttachedIndex (SAI) for non-vector columns and previously rejected this statement with:

```
StorageAttachedIndex (SAI) is only supported on vector columns; use a secondary index for non-vector columns
```

This blocks seamless migration of existing LangChain/CassIO applications from Cassandra to ScyllaDB — applications fail during initialization before any application-level workaround can run, even when metadata filtering is not used (`metadata_indexing="none"`).

CassIO is no longer actively maintained but remains the only official LangChain integration path for Apache Cassandra over CQL, meaning existing applications will continue using this setup pattern.

Instead of rejecting the CassIO metadata-map SAI DDL, detect the pattern and rewrite it to a standard ScyllaDB secondary index on collection entries:

- **Detection**: SAI class name + single `ENTRIES` target on a non-frozen `map` column
- **Rewrite**: Clear the custom class so the index is created through the standard secondary index path (which already fully supports indexing map entries)
- **Warning**: Emit a CQL warning informing the user that SAI is not supported by ScyllaDB, a regular secondary index was created instead, and metadata filtering behavior may differ from Cassandra SAI

The rewrite is placed early in `validate_while_executing()`, before the rf-rack-validity check, so the standard secondary index code path handles all subsequent validation naturally — no code duplication.
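
The detection and rewrite steps can be sketched roughly as follows. This is an illustrative Python model, not the C++ code in `validate_while_executing()`; the function name, the string-based type check and the exact warning text are assumptions:

```python
# Class names treated as SAI (the fully-qualified name and the short alias).
SAI_CLASS_NAMES = {
    "org.apache.cassandra.index.sai.StorageAttachedIndex",
    "sai",
}


def rewrite_cassio_index(custom_class, targets, column_types):
    """Return (custom_class, warning) after applying the CassIO rewrite.

    targets:      list of (column_name, target_kind) pairs, e.g. [("metadata_s", "ENTRIES")]
    column_types: mapping of column name to a CQL type string (hypothetical representation)
    """
    if custom_class not in SAI_CLASS_NAMES or len(targets) != 1:
        return custom_class, None
    column, kind = targets[0]
    col_type = column_types.get(column, "")
    # Only a single ENTRIES target on a non-frozen map qualifies.
    if kind != "ENTRIES" or not col_type.startswith("map<"):
        return custom_class, None
    warning = ("SAI is not supported by ScyllaDB; "
               "a regular secondary index was created instead")
    # Clearing the class sends the statement down the regular secondary index path.
    return None, warning
```

Anything that does not match the pattern keeps its custom class, so the existing rejection and vector-index paths still see it.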

After this change, the CassIO schema setup succeeds on ScyllaDB:
- `CREATE CUSTOM INDEX ... USING 'sai'` on `ENTRIES(metadata_s)` creates a real secondary index
- The index is functional and can accelerate metadata filtering queries
- A CQL warning makes the rewrite transparent to operators
- SAI on non-vector, non-map-entries columns is still rejected as before
- Vector SAI indexes continue to be rewritten to `vector_index` as before

- `test_sai_entries_on_map_creates_regular_index` — verifies the index is created and the warning is emitted (fully-qualified SAI class name)
- `test_sai_entries_on_map_short_name` — same with the `'sai'` short alias
- `test_sai_on_regular_column_rejected` — confirms SAI on regular scalar columns is still rejected

All 148 tests in `test_vector_index.py` and `test_secondary_index.py` pass with no regressions (125 passed, 22 xfailed, 1 skipped).

Fixes: SCYLLADB-2113
Backport: 2026.2 as this is the version where the support for SAI class needed by LangChain was added.

Closes scylladb/scylladb#29981

* github.com:scylladb/scylladb:
  cql: rewrite CassIO SAI metadata index to regular secondary index
  db/config: add enable_cassio_compatibility flag
2026-05-26 00:19:03 +03:00
Michał Hudobski
1d17d2144f index, vector_index: limit primary key columns to 255
The vector-store's InvariantKey type supports at most 255 key
components. Reject vector index creation when the base table's
primary key (partition + clustering columns) exceeds this limit.
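
As a rough illustration of the check (a Python sketch with hypothetical names, not the actual C++ validation):

```python
MAX_KEY_COMPONENTS = 255  # limit stated for the vector-store's InvariantKey


def validate_vector_index_key(partition_columns, clustering_columns):
    """Raise ValueError when the base table's primary key is too wide."""
    total = len(partition_columns) + len(clustering_columns)
    if total > MAX_KEY_COMPONENTS:
        raise ValueError(
            f"primary key has {total} columns, at most {MAX_KEY_COMPONENTS} are supported"
        )
    return total
```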

Fixes: VECTOR-553

Closes scylladb/scylladb#29317
2026-05-25 19:24:17 +03:00
Szymon Wasik
5ee339b11d cql: rewrite CassIO SAI metadata index to regular secondary index
When CassIO creates a SAI ENTRIES index on a map column,
ScyllaDB now rewrites it to a regular secondary index and emits
a CQL warning. This allows LangChain/CassIO applications to work
without DDL errors.

The rewrite is gated behind the enable_cassio_compatibility flag
(disabled by default).

Refs: SCYLLADB-2113
2026-05-25 15:11:43 +02:00
Botond Dénes
db89f3f095 Merge 'compaction_manager: unregister compaction module on early shutdown' from Patryk Jędrzejczak
The compaction module is registered with task_manager in the compaction_manager
constructor, and unregistered in compaction_manager::really_do_stop(), which
was gated behind `_state != state::none` in compaction_manager::do_stop().
Since enable() -- which transitions _state from none to running -- is called
later during startup (from database::start() or the disk space monitor callback)
than the compaction_manager constructor, an early shutdown could leave the
compaction module registered after compaction_manager::do_stop() returned.
task_manager::stop() then aborted with 'Tried to stop task manager while
some modules were not unregistered'.

Fix compaction_manager::do_stop() to call _task_manager_module->stop() even
when `_state == state::none`, so that the compaction module is always properly
unregistered.
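
The lifecycle problem and the fix can be modelled in a few lines. This is a simplified Python sketch, not the C++ implementation; the class and method names only mirror the ones mentioned above:

```python
class TaskManager:
    def __init__(self):
        self.modules = set()

    def register(self, name):
        self.modules.add(name)

    def unregister(self, name):
        self.modules.discard(name)

    def stop(self):
        if self.modules:
            raise RuntimeError(
                "Tried to stop task manager while some modules were not unregistered")


class CompactionManager:
    def __init__(self, task_manager):
        self.task_manager = task_manager
        self.state = "none"
        task_manager.register("compaction")  # registered in the constructor

    def enable(self):
        self.state = "running"

    def do_stop(self):
        if self.state == "none":
            # The fix: unregister even if enable() was never called.
            self.task_manager.unregister("compaction")
            return
        self.really_do_stop()

    def really_do_stop(self):
        self.task_manager.unregister("compaction")
        self.state = "stopped"
```

Without the `state == "none"` branch, stopping before `enable()` would leave the module registered and `TaskManager.stop()` would raise.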

Fixes: SCYLLADB-2106

Backport to all supported branches, as the bug is there and it has
already caused a failure in 2026.1 CI.

Closes scylladb/scylladb#30015

* github.com:scylladb/scylladb:
  test: add test_stop_before_starting_compaction_manager
  compaction_manager: unregister compaction module on early shutdown
2026-05-25 16:08:20 +03:00
Dmitry Kropachev
74fa423271 transport: report host id in SUPPORTED
Currently the driver builds the network layout (node IP addresses and ports)
from `system.local`, `system.peers`, and `system.client_routes`, and then
assumes that this layout is correct without verifying it.
If, for example, a node's IP/port (say, on a proxy) does not match what
the driver calculated, the mismatch goes unnoticed.

The goal of this feature is to report the host id to the driver in the SUPPORTED frame,
so that it knows which node it connected to and can decide
whether to keep the connection or drop it.

- add `SCYLLA_HOST_ID` to the CQL `SUPPORTED` response
- add a regression test that hooks the Python driver handshake and
  verifies the reported host id
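
On the driver side, the check enabled by this option might look like the sketch below. The function name is hypothetical, and the assumption that the value is a textual UUID is mine; only the `SCYLLA_HOST_ID` key comes from this change. The SUPPORTED body is a string multimap, modelled here as a dict of lists:

```python
import uuid


def connected_to_expected_host(supported_options, expected_host_id):
    """Compare the host id from SUPPORTED with the one the driver expected.

    Returns None when the server does not report SCYLLA_HOST_ID (older
    versions), otherwise True or False.
    """
    values = supported_options.get("SCYLLA_HOST_ID")
    if not values:
        return None
    return uuid.UUID(values[0]) == uuid.UUID(str(expected_host_id))
```

A driver could drop the connection when this returns `False` and keep its current behavior when it returns `None`.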

- `python3.12 -m py_compile test/cqlpy/test_protocol_exceptions.py`
- syntax-only compile of `transport/server.cc` with the repo toolchain
  flags inside `dbuild`

Refs #27452
Refs https://scylladb.atlassian.net/browse/DRIVER-610

Closes scylladb/scylladb#29809
2026-05-25 14:36:53 +03:00
Avi Kivity
892f22f49c Merge 'cql: atomic add/subtract operations with LWT' from Nadav Har'El
ScyllaDB has special counter columns for which atomic add/subtract operations like `SET a = a + 1` are allowed. Such operations have not been allowed on ordinary non-counter columns, as they would not be properly atomic - the read and the write are separate, and concurrent operations can have incorrect results.

This patch makes it allowed to use such atomic add/subtract operations in **LWT** statements. For example

	UPDATE ... SET a = a - 7 IF a > 0

or

	UPDATE ... SET a = a + 1 IF a != NULL

The row updated in the operation, and the updated column (`a`) should be initialized before the update. The example `SET a = a + 1 IF a != NULL` will fail the condition if `a` is not set. A different request `SET a = a + 1 IF EXISTS` will just leave `a` unset if it's unset (NULL + 1 is NULL, this is SQL's null propagation rules).
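
The difference between the two conditions can be modelled with a small Python sketch. It only illustrates the semantics described above (condition check, then NULL-propagating arithmetic); the helper names are hypothetical and rows are plain dicts:

```python
def lwt_update(row, column, delta, condition):
    """Apply `SET column = column + delta IF <condition>` to a row.

    Returns (applied, new_row). A NULL (None) operand yields NULL.
    """
    if not condition(row):
        return False, row
    old = row.get(column) if row is not None else None
    new_row = dict(row or {})
    new_row[column] = None if old is None else old + delta
    return True, new_row


def if_exists(row):
    return row is not None


def if_not_null(column):
    return lambda row: row is not None and row.get(column) is not None
```

With `a` unset, `IF a != NULL` fails the condition, while `IF EXISTS` applies the update and leaves `a` unset.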

These add/subtract operations are allowed on any numeric (integer or floating point) column.

The ability of LWT to fetch the old values of a column and use it to calculate the new value has long been available in our internal CAS implementation - and has been in use for years in Alternator - but until this patch it was not exposed in CQL's LWT.

This series does not add new syntax to CQL - the "SET a = a + b" and "SET a = a - b" syntax already existed for counters, and we just allow the same syntax for non-counters. However, the series does add a bit of machinery that will allow us to easily support more general expressions in the future. In particular, this series implements the addition, subtraction, and unary-minus operators for expressions, and adds the machinery needed to run **any** expression in "SET a = expr()", using existing row values fetched by LWT.

This is a new Scylla-only feature that does not exist in Cassandra.

Fixes #10568
Refs #22918 ("Support arithmetic operators"), SCYLLADB-1576 ("Decimal arithmetic operations OOM")

This is a new feature, so normally it would not be backported.

Closes scylladb/scylladb#29939

* github.com:scylladb/scylladb:
  cql: atomic add/subtract operations with LWT
  cql3: let constants::setter evaluate expressions using prefetched row data
  cql3/expr: add NEG unary operator for numeric negation
  cql3/expr: add SUB binary operator for numeric subtraction
  cql3/expr: add ADD binary operator for numeric addition
  types: add is_arithmetic() method for types
2026-05-25 14:27:33 +03:00
Dmitry Kropachev
06eeaf48ff tests: avoid CQL_ALTERNATOR_QUERIED on zero-token nodes
The keyspace RF test starts zero-token nodes as part of its topology setup.

The python driver 3.29.9 can't schedule queries on zero-token nodes, so waiting for `CQL_ALTERNATOR_QUERIED` on those nodes is the wrong readiness gate.
This change makes the zero-token `server_add()` calls stop at `CQL_ALTERNATOR_CONNECTED`.
The test still exercises the keyspace replication assertions through a normal token-owning contact point.

Verified with running all 4 variations of `cluster.test_keyspace_rf::test_create_keyspace_with_default_replication_factor` on this branch.

Closes scylladb/scylladb#29779
2026-05-25 14:22:04 +03:00
Piotr Dulikowski
3a5dd2e5be Merge 'strong_consistency: forward reads to the raft leader' from Wojciech Mitros
Strongly consistent reads currently call read_barrier() on whichever
replica happens to process the request. When a follower runs
read_barrier(), it sends an RPC to the leader to get the current read
index, then waits for its local apply index to catch up. If the follower
is behind, this wait can be significant.

By forwarding linearizable reads to the leader, we don't need an RPC from replica to leader to get the index to wait for apply -- it's available locally.

Note that read_barrier() is still required on the leader to confirm it
is still the leader and guarantee linearizability. A future optimization
would be to implement leases in the raft library, which could eliminate
read_barrier() on the leader entirely.

The CL-to-behavior mapping is isolated in a single parse_consistency_level()
function:
- CL=(LOCAL_)QUORUM -> linearizable: forwarded to the raft leader
- CL=(LOCAL_)ONE -> non-linearizable: existing behavior (no read_barrier()/forwarding, may return stale results)
- All other CLs -> invalid request
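
The mapping above can be sketched as follows. The real `parse_consistency_level()` is C++; this Python version, with its enum and exception type, is only an illustration:

```python
from enum import Enum


class ReadKind(Enum):
    LINEARIZABLE = "linearizable"          # forwarded to the raft leader
    NON_LINEARIZABLE = "non_linearizable"  # served locally, may be stale


def parse_consistency_level(cl):
    """Classify a strongly consistent read by its consistency level name."""
    if cl in ("QUORUM", "LOCAL_QUORUM"):
        return ReadKind.LINEARIZABLE
    if cl in ("ONE", "LOCAL_ONE"):
        return ReadKind.NON_LINEARIZABLE
    raise ValueError(f"consistency level {cl} is not supported for strongly consistent reads")
```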

Read forwarding reuses the same CQL-layer bounce_to_node() mechanism
that write forwarding already uses. The transport layer's existing
requests_forwarded_* metrics automatically count forwarded reads.
Coordinator-level metrics (linearizable_reads, non_linearizable_reads,
writes) are added for visibility into the strong consistency workload.

Fixes: SCYLLADB-1157

Closes scylladb/scylladb#29575

* github.com:scylladb/scylladb:
  strong_consistency: test read forwarding to leader
  strong_consistency: skip read_barrier() for non-linearizable reads
  strong_consistency: split coordinator-level read latency metrics
  strong_consistency: forward linearizable reads to raft leader
  strong_consistency: classify reads by consistency level
  strong_consistency: add begin_read() to raft_server
2026-05-25 10:55:00 +02:00
Nadav Har'El
f8aaeb5e87 cql: atomic add/subtract operations with LWT
ScyllaDB has special counter columns for which atomic add/subtract
operations like `SET a = a + 1` are allowed. Such operations have not
been allowed on ordinary non-counter columns, as they would not be
properly atomic - the read and the write are separate, and concurrent
operations can have incorrect results.

This patch makes it allowed to use such atomic add/subtract operations
in *LWT* statements. Some examples:

        UPDATE ... SET a = a - 1 IF a > 0

        UPDATE ... SET a = a + 1 IF EXISTS

        UPDATE ... SET a = a + 1 IF a != NULL

The row updated in the operation, and the updated column (a) should
be initialized before the update - arithmetic operations on missing
column values silently leave the column null (no error is generated).

These add/subtract operations are allowed on any numeric column -
integer or floating point of any size.

The ability of LWT to fetch the old values of a column and use it to
calculate the new value has long been available in our internal CAS
implementation - and has been in use for years in Alternator - but until
this patch it was not exposed in CQL's LWT.

This patch does not add new syntax to CQL - the "SET a = a + b"
and "SET a = a - b" syntax that already existed for counters is now
allowed for non-counters.

This is a new Scylla-only feature that does not exist in Cassandra.

Fixes #10568

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-25 10:09:11 +03:00
Nadav Har'El
b026aea6f7 cql3/expr: add NEG unary operator for numeric negation
This patch adds a new expression type, unary_operator, analogous to
the existing binary_operator but takes just one operand instead of
two.

This patch also implements the first and only unary operator type,
unary_oper_t::NEG, implementing negation (unary minus) for all numeric
types.

For fixed-width integer types overflow or underflow results in an error.
If the operand is NULL, the result is a NULL as well.

The new operator is not yet used by the CQL syntax - our parser doesn't
parse arithmetic expressions yet. We also do not plan to use it in the
following patch which uses the separate SUB (subtraction) operation,
not the new NEG. But since I already implemented a unary minus operator,
and we'll surely need it in the future for general arithmetic operations,
I thought I might as well include this patch as well.

Refs #22918 ("Support arithmetic operators")
2026-05-25 10:08:11 +03:00
Nadav Har'El
f27d1f08fc cql3/expr: add SUB binary operator for numeric subtraction
In this patch we add to our expressions oper_t::SUB, for subtraction,
analogous to the ADD from the previous patch.

The only reason why we need a separate SUB operation and can't just
combine ADD with a unary minus (NEG) operator is the minimum value of a
fixed-width integer type. For example, 8-bit integers have the range
-128...127. A subtraction like -1 - (-128) is valid (its value is 127)
but the negation of (-128) would be invalid (128). One of the tests
we add in this patch validates this fact.
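
The 8-bit example can be checked with a short Python sketch. Python integers are unbounded, so the range check is explicit; the function names are illustrative and are not the ones used in cql3/expr:

```python
INT8_MIN, INT8_MAX = -128, 127


def checked_int8(value):
    """Return value if it fits in a signed 8-bit integer, else raise."""
    if not INT8_MIN <= value <= INT8_MAX:
        raise OverflowError(f"{value} does not fit in tinyint")
    return value


def checked_sub(a, b):
    return checked_int8(a - b)


def checked_neg(a):
    return checked_int8(-a)


def checked_add(a, b):
    return checked_int8(a + b)
```

`checked_sub(-1, -128)` yields 127, whereas `checked_add(-1, checked_neg(-128))` fails in the negation step because 128 is out of range.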

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-25 10:06:28 +03:00
Nadav Har'El
083adf84ab cql3/expr: add ADD binary operator for numeric addition
Extend oper_t with a new ADD operator, to represent addition between two
numeric expressions. Supports all numeric types - tinyint, smallint,
int, bigint, float, double, varint, and decimal.

For fixed-width integer types, overflow or underflow results in an error.
If one of the operands is NULL, the result is also NULL.
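
A Python model of these two rules, assuming the standard CQL integer widths; the function and table names are hypothetical and the float, double, varint and decimal cases are simply left unchecked here:

```python
# Bit widths of the fixed-width CQL integer types.
FIXED_WIDTH_BITS = {"tinyint": 8, "smallint": 16, "int": 32, "bigint": 64}


def cql_add(lhs, rhs, cql_type):
    """Add two operands of the same CQL type, with NULL modelled as None."""
    if lhs is None or rhs is None:
        return None  # NULL propagates
    result = lhs + rhs
    bits = FIXED_WIDTH_BITS.get(cql_type)
    if bits is not None:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
        if not lo <= result <= hi:
            raise OverflowError(f"{cql_type} overflow: {lhs} + {rhs}")
    return result
```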

The new operator is not yet used by the CQL syntax - our parser doesn't
parse arithmetic expressions yet. We plan to start using this new operator
in a following patch which implements counter syntax ("SET r = r + 1" )
for LWT, but in the future we can use it for more general cases.

At the moment, ADD requires that both operands have the same type.
This is all we need for the first use case, and this limitation can
be relaxed later.

Interestingly, ADD is our first binary operator implementation that
does not return a boolean. Until now all our binary operators have been
comparison operators, and all returned boolean. In contrast, ADD's
return type is the type of its operands.

This implementation is susceptible to the pre-existing bug SCYLLADB-1576,
where adding 1e1000000 and 1 in "decimal" or "varint" types will
happily allocate a million-digit number and run out of memory. A
reproducing test is included, and this issue will be solved in one
place for all operations that have additions (including aggregations
and arithmetic expressions) in a followup pull-request.

Refs #22918 ("Support arithmetic operators")
2026-05-25 10:05:09 +03:00
Gleb Natapov
0bf050d175 storage_proxy: hold shared pointer to a table object during entire query_partition_key_range_concurrent execution
Otherwise if a table is dropped in the middle of a scan the object may
disappear.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-2137

Closes scylladb/scylladb#29988
2026-05-24 21:54:08 +03:00
Michael Litvak
73470150a0 logstor: disable logstor compaction in table truncate
in database::truncate_table_on_all_shards disable logstor compaction
before the table data is truncated, similarly to how non-logstor
compaction is disabled, to avoid race conditions between logstor
compaction and segments discarding.

Fixes SCYLLADB-2186
2026-05-24 10:25:08 +02:00
Wojciech Mitros
45f5df14e5 strong_consistency: test read forwarding to leader
Test the linearizable read forwarding behavior in a single test that
exercises all scenarios on one cluster:
- CL=QUORUM reads on leader, follower, and non-replica nodes
- CL=ONE reads (non-linearizable, no forwarding)
- Linearizability: write + CL=QUORUM read from follower (10 iterations)
- Coordinator latency histogram metrics for both read types

Refs: SCYLLADB-1157
2026-05-23 11:35:37 +02:00
Wojciech Mitros
d07692a7ff strong_consistency: split coordinator-level read latency metrics
Split the latency metrics for strongly consistent reads into two
categories: linearizable and non-linearizable. They replace the
existing metrics for both types combined - this shouldn't cause
issues because the feature is still experimental and both the
initial introduction of latency metrics and the split will be
a part of the same release.

Also fix a test that was using the old metric.
2026-05-23 11:35:37 +02:00
Yaniv Michael Kaul
acd3115645 sstables: include SSTable filename in Stats metadata error messages
When Stats metadata is not available or malformed, include the SSTable
filename in the error message to help operators identify which SSTable
files need attention during startup failures.

Fixes: https://github.com/scylladb/scylla-enterprise/issues/5439
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
AI-assisted: yes
Backport: no, benign improvement

Closes scylladb/scylladb#29950
2026-05-22 16:49:37 +03:00
Łukasz Paszkowski
96a992002c tasks: fix busy-spin and shutdown hang in tablet_virtual_task::wait() for repair tasks
The condition variable predicate for repair tasks unconditionally
returned true (introduced in e5928497ce), which meant event.wait(pred)
never actually suspended: do_until checks the predicate first, and if
it's already satisfied, returns immediately without calling the inner
wait(). This caused two problems:
1. The while(true) loop busy-spun, polling without blocking between
   topology changes.
2. During shutdown, event.broken() had no effect because no waiter was
   registered on the CV. The loop kept spinning, holding the HTTP
   server's task gate open and preventing http_server::stop() from
   completing. After ~15 minutes, systemd killed the process with
   SIGABRT.

The fix replaces the synchronous predicate with an async task_finished()
helper that dispatches on the task type. Since the repair check is async
(for_each_tablet scans every tablet), we cannot use event.wait(Pred).
Instead, we register a waiter via event.wait() *before* running the async
check, ensuring no broadcast is missed during the check. event.broken()
during shutdown propagates broken_condition_variable to the registered
waiter and unblocks the loop promptly.
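
The "register the waiter first, then run the async check" ordering can be illustrated with asyncio. This is a loose Python analogue, not Seastar code; `Broadcast` is a hypothetical stand-in for the condition variable:

```python
import asyncio


class Broadcast:
    """Minimal condition-variable stand-in: waiters are futures."""

    def __init__(self):
        self._waiters = []

    def wait(self):
        fut = asyncio.get_running_loop().create_future()
        self._waiters.append(fut)
        return fut

    def _drain(self):
        waiters, self._waiters = self._waiters, []
        return [fut for fut in waiters if not fut.done()]

    def broadcast(self):
        for fut in self._drain():
            fut.set_result(None)

    def broken(self):
        for fut in self._drain():
            fut.set_exception(RuntimeError("broken condition variable"))


async def wait_until_finished(event, task_finished):
    while True:
        waiter = event.wait()      # register before the async check
        if await task_finished():  # a broadcast during this check is not lost
            waiter.cancel()
            return
        await waiter               # suspends until broadcast() or broken()
```

Because a waiter is always registered while the loop is idle, `broken()` at shutdown raises in the waiting task instead of leaving it spinning.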

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1532

Closes scylladb/scylladb#29485
2026-05-22 16:47:48 +03:00
Raphael S. Carvalho
3ba6184462 repair, test: fix split-repair synchronization test timeout in debug mode
The test_split_and_incremental_repair_synchronization[True] test was
timing out waiting for 'Finalizing resize decision for table' in
debug mode.

The root cause is a timing race: the incremental_repair_prepare_wait
error injection has a hardcoded 60s auto-expiry timeout
(wait_for_message(60s)), but split compactions in debug mode take ~58s
per SSTable due to -O0 compilation and scheduler starvation (the
maintenance_compaction group gets ~10% of wall-clock time). When the
injection auto-expires before split finalization, the repair fails,
leaving tablets stuck in transition=repair state. This prevents the
topology coordinator from finalizing the split, causing the 600s test
timeout.

Fix both contributing factors:

- Increase the injection timeout from 60s to 10min, giving split
  compactions ample time to complete before the injection auto-expires.
  The test explicitly messages the injection to release it (line 2200),
  so the longer timeout is just a safety net.

- Reduce data volume from 256 to 64 rows (and repair data from 256 to
  64 rows), producing smaller SSTables that split much faster in debug
  mode.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-2123.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#30004
2026-05-22 15:03:47 +03:00
Patryk Jędrzejczak
082936ce43 Merge 'test: pylib: Convict the node on server_stop()' from Tomasz Grabiec
This is about ungraceful stop, where the node is killed.

Test cases typically need to wait for other nodes to notice that the
node is down before proceeding. By default, that takes about 20s. It can
be reduced via config by lowering the failure detector threshold, but that is
not the best solution:

 - cannot set the threshold too low, or we'll introduce flakiness due to false
   positives
 - so it's still slow (a couple of seconds)
 - developers forget about it and the test still works

This patch speeds this up by adding a way to convict the node immediately after stopping the node, controlled by the "convict" parameter.

At the end of the series the "convict" parameter is required, and each test decides what it wants. Commits are split into steps:

- the series starts with defaulting to convict=False
- each test case sets "convict" explicitly, and changes are split into 3 commits depending on whether convict=True is: useless, beneficial, undesirable
- finally, the "convict" parameter is made mandatory

There is also a dedicated test for natural failure detection (test_natural_failure_detection in test_gossiper.py) to ensure FD coverage is not lost.

Tested on dev-mode
cluster/test_tablets_parallel_decommission.py::test_node_lost_during_decommission_drain:
Wall clock time reduced from 41s to 16s

No backport: enhancement

Closes scylladb/scylladb#28495

* https://github.com/scylladb/scylladb:
  test: gossiper: Add test for natural failure detection
  test: pylib: Make convict a required parameter in server_stop()
  test: Annotate server_stop() calls where conviction is harmful
  test: Annotate server_stop() calls where conviction is beneficial
  test: Annotate server_stop() calls where conviction is useless
  test: pylib: Add convict option to server_stop()
  api: failure_detector: Introduce convict-node API
  gms: gossiper: Make convict() public and safe to call from any scheduling group
  api: Extract validate functions to common header
2026-05-22 13:39:50 +02:00
Yaniv Michael Kaul
bb69ae5a02 test: assert ALTER TYPE RENAME rejected on frozen PK UDTs
Add assertion that ALTER TYPE RENAME is rejected when the UDT is used
as a frozen partition key column. The existing test only covered ALTER
TYPE ADD. This closes the coverage gap from dtest
udtencoding_test.py::test_udt_change_in_partition_key, enabling its
removal.

Refs: SCYLLADB-1929

Closes scylladb/scylladb#29840
2026-05-22 12:29:43 +02:00
Marcin Maliszkiewicz
dcff319221 Merge 'cql: request-side custom payload parsing' from Dario Mirovic
When a CQL client sends a request with the `CUSTOM_PAYLOAD` flag (`0x04`) set, the frame body starts with a [bytes map] before the message. Scylla never implemented parsing of this map on the request side. This caused it to fail parsing with protocol errors such as `"truncated frame: expected 65546 bytes"`.

This was discovered through DataStax Java Driver 4.19.x tests that attach a `request-id` to queries via custom payload. The same issue affects any CQL client that sets the `CUSTOM_PAYLOAD` flag.

Fix this by skipping over the custom payload [bytes map] from the frame body before dispatching to opcode-specific handlers. The payload contents are discarded since Scylla has no pluggable `QueryHandler`. Cassandra's default `QueryHandler` also discards them.
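
For reference, skipping a [bytes map] can be sketched in Python as below. The layout follows the CQL native protocol notation ([short] pair count, then [string] keys and [bytes] values, where a negative [bytes] length means null); the function name and error handling are illustrative and do not reproduce the C++ fix:

```python
import struct


def skip_bytes_map(body, offset=0):
    """Return the offset just past a [bytes map] that starts at `offset`."""
    (count,) = struct.unpack_from(">H", body, offset)  # [short] number of pairs
    offset += 2
    for _ in range(count):
        (key_len,) = struct.unpack_from(">H", body, offset)    # [string] key
        offset += 2 + key_len
        (value_len,) = struct.unpack_from(">i", body, offset)  # [bytes] value
        offset += 4
        if value_len > 0:  # negative length encodes a null value
            offset += value_len
    if offset > len(body):
        raise ValueError("truncated frame")
    return offset
```

The opcode-specific handler then starts parsing at the returned offset, and the payload contents are never inspected.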

Fixes SCYLLADB-745

Reported on 2026.2, backport.

Closes scylladb/scylladb#30005

* github.com:scylladb/scylladb:
  cql: fix request-side custom payload parsing
  test/cqlpy: add tests for request-side custom payload handling
2026-05-22 12:18:26 +02:00
Patryk Jędrzejczak
b7400d20dd test: add test_stop_before_starting_compaction_manager
2026-05-22 11:58:37 +02:00
Marcin Maliszkiewicz
18dd281e72 Merge 'test: audit: pin empty-keyspace DDL audit behavior' from Andrzej Jackowski
9646ee05bd changed behavior of empty keyspace handling and this code path was never tested for CQL audit. Test CREATE/DROP FUNCTION and CREATE/DROP AGGREGATE targeting both an existing keyspace and a nonexistent one to verify both are audited with empty keyspace.

No backport, just a missing test case.

Closes scylladb/scylladb#29542

* github.com:scylladb/scylladb:
  test: audit: pin empty-keyspace DDL audit behavior
  test: audit: restart server when any non-live config key changes
  test: audit: rename 'needed' to 'target_config' for clarity
2026-05-22 09:42:34 +02:00
Tomasz Grabiec
445a8b9a3e test: gossiper: Add test for natural failure detection
Add test_natural_failure_detection which verifies that the failure
detector detects a killed node as DOWN without using the convict
mechanism. Uses the failure_detector_timeout fixture to keep the
FD timeout short (2s in release mode).

This ensures that natural failure detection continues to work
correctly even as other tests adopt the convict mechanism for speed.
2026-05-21 21:33:24 +02:00
Tomasz Grabiec
fa7e24f5f7 test: pylib: Make convict a required parameter in server_stop()
Remove the default value from the convict parameter in
ManagerClient.server_stop(), making it required. All call sites
have been annotated with explicit convict=True or convict=False
in the preceding commits, so this change enforces that every
future caller must make a conscious choice about whether to
convict the stopped node.
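
The effect of dropping the default can be shown with a stand-alone Python sketch. The signature below is hypothetical (the real `ManagerClient.server_stop()` is an async method with its own parameters); making the parameter keyword-only is an assumption used here for illustration:

```python
def server_stop(server_id, *, convict):
    """Describe the actions taken when stopping a node non-gracefully.

    `convict` has no default, so every caller must choose explicitly.
    """
    actions = [f"kill server {server_id}"]
    if convict:
        actions.append(f"mark server {server_id} as DOWN on all live nodes")
    return actions
```

Calling `server_stop(1)` without `convict` raises `TypeError`, which is the enforcement this commit relies on.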
2026-05-21 21:33:24 +02:00
Tomasz Grabiec
9b40cf89fe test: Annotate server_stop() calls where conviction is harmful
Add explicit convict=False to server_stop() calls where convicting
the node would break or weaken the test.

In test_backoff_when_node_fails_task_rpc, the desired behavior is for
the node to not be marked as down immediately:

    # The purpose of this is to simulate a situation when the gossiper
    # doesn't mark a dead node as such immediately.

In raft tests, conviction could trigger voter reassignment while the
test wants to test the scenario with voters being still down.

In test_tablet_mv_replica_pairing_during_replace, conviction triggers
SCYLLADB-1996 (replace fails with "Failed to add server").
2026-05-21 21:33:19 +02:00
Tomasz Grabiec
92416d850a test: Annotate server_stop() calls where conviction is beneficial
Add explicit convict=True to server_stop() calls where the test
needs other nodes to detect the stopped node as DOWN in order to
proceed. These are cases before remove_node, replace, or explicit
waits for failure detection (server_not_sees_other_server,
wait_new_coordinator_elected).

Convicting immediately speeds up the test.
2026-05-21 21:31:22 +02:00
Tomasz Grabiec
624fe11178 test: Annotate server_stop() calls where conviction is useless
Pass convict=False explicitly to server_stop() calls where conviction
provides no benefit because there is no consumer of the failure
detection:

 - single-node clusters (no other node to call the API on)
 - all nodes being stopped concurrently (no live node remains)
 - immediate restart (no test logic between stop and start
   depends on other nodes detecting the stopped node as dead)
 - node stopped for file manipulation or bootstrap abort
 - majority killed with no quorum on surviving nodes to react
 - no test logic depends on other nodes detecting the failure

This is a no-op change since the default is already convict=False,
but makes the intent explicit for each call site.
2026-05-21 21:13:55 +02:00
Tomasz Grabiec
a19e3f6f64 test: pylib: Add convict option to server_stop()
Add support for convicting a node after stopping it non-gracefully.

Convicting a node means that all live nodes will immediately mark it as
DOWN, bypassing the natural failure detection delay (~20s by default).

The convict parameter defaults to False (no conviction). Tests that
want fast failure detection after a kill should pass convict=True
explicitly.

Tested on dev-mode
cluster/test_tablets_parallel_decommission.py::test_node_lost_during_decommission_drain:
Wall clock time reduced from 41s to 16s (when using convict=True)
2026-05-21 21:13:54 +02:00
Dario Mirovic
f9e8518776 cql: fix request-side custom payload parsing
When a CQL client sends a request with the CUSTOM_PAYLOAD flag (0x04)
set, the frame body starts with a [bytes map] before the message.
Scylla never implemented parsing of this map on the request side.
This caused it to fail parsing with protocol errors such as
"truncated frame: expected 65546 bytes".

Fix this by skipping over the custom payload [bytes map] from the frame
body before dispatching to opcode-specific handlers. The payload
contents are discarded since Scylla has no pluggable QueryHandler.
Cassandra's default QueryHandler also discards them.

Fixes SCYLLADB-745
2026-05-21 18:36:37 +02:00
Dario Mirovic
8e6d2d0631 test/cqlpy: add tests for request-side custom payload handling
Add tests that verify Scylla's handling of CQL native protocol
requests with the CUSTOM_PAYLOAD flag (0x04) set. Each test asserts
the specific parse error that the unfixed server produces.

A separate CQL session is used for each test. The protocol error
kills the driver connection, and we need to catch it properly.

Refs SCYLLADB-745
2026-05-21 18:34:43 +02:00
Avi Kivity
305346a3ec Merge 'Don't materialize collections into intermediate representations' from Botond Dénes
Collections have an age-old problem in ScyllaDB: they had to be deserialized into an intermediate representation for any access or manipulation. The intermediate representation takes effort to produce and requires additional memory to store. Both costs can be significant for large collections. This intermediate representation is then either discarded immediately after use or re-serialized.
This problem was significant enough for us to consider the use of collections as somewhat of an anti-pattern. But our customers keep using them. Alternator is also a heavy user of collections.

This PR aims to solve this problem once and for all. The plan is as follows:
* Promote direct use of the serialized collection format:
    - Add accessor methods to `collection_mutation_view` which read from the serialized format directly: `tomb()`, `size()` and `begin()`/`end()`.
    - Add a `collection_mutation_writer` which provides container semantics for generating a serialized `collection_mutation` directly on the go (`push_back()`).
* Replace all usage of `collection_mutation_description`, `collection_mutation_view_description` and friends with use of the new infrastructure.
* Drop the old infrastructure, to avoid accidental regressions.

Continues the work started by https://github.com/scylladb/scylladb/pull/29033 and takes it to its conclusion.

To help focus review, here is a summary of the patches:
* [1, 2] preparatory refactoring: drop some unused abstract_type params
* [3, 6] introduce new infrastructure to write and read serialized collections directly; this is the meat of the PR
* [6, -1) replace all usage of old materializing infrastructure with usage of the new one
* [-1] drop old infrastructure

**Command:**
```
dbuild -it -- build/release/scylla perf-simple-query --collection=16 -c1 -m2G --default-log-level=error
```

| Metric                   |  Before |   After | Change     |
|--------------------------|--------:|--------:|------------|
| Throughput (median tps)  | 315,760 | 332,021 | **+5.1%**  |
| Instructions/op (median) |  53,776 |  48,681 | **-9.5%**  |
| CPU cycles/op (median)   |  17,365 |  16,471 | **-5.1%**  |
| Allocations/op           |    85.1 |    82.1 | **-3.5%**  |

**Significant improvement.** Throughput is up ~5%, and both instruction count and cycle count are meaningfully reduced.

---

**Command:**
```
dbuild -it -- build/release/scylla perf-simple-query --collection=16 -c1 -m2G --default-log-level=error --write
```

| Metric                   |    Before |    After | Change    |
|--------------------------|----------:|---------:|-----------|
| Throughput (median tps)  |   150,823 |  149,678 | **-0.8%** |
| Instructions/op (median) |   108,388 |  103,858 | **-4.2%** |
| CPU cycles/op (median)   |    34,860 |   35,371 | **+1.5%** |
| Allocations/op           | ~105–108  | ~102–103 | **-3.0%** |

**Mixed, mostly neutral.** Throughput is essentially flat (within noise). Instructions/op improved by ~4%, allocations dropped slightly, but cycles/op edged up marginally.

---

**Command:**
```
dbuild -it -- build/release/scylla perf-alternator --workload write --developer-mode=1 --alternator-port=8000 --alternator-write-isolation=unsafe -c1 -m2G --default-log-level=error
```

| Metric                   |  Before |  After | Change    |
|--------------------------|--------:|-------:|-----------|
| Throughput (median tps)  |  55,777 | 56,051 | **+0.5%** |
| Instructions/op (median) | 246,215 | 246,610 | **+0.2%** |
| CPU cycles/op (median)   |  77,641 | 77,020 | **-0.8%** |
| Allocations/op           |   340.4 |  335.4 | **-1.5%** |

**Essentially neutral.** All metrics are within noise margins. Slight reduction in allocations and cycles, negligible otherwise.

---

The change has a **clear, substantial positive effect on reads** (~5% throughput gain, ~9.5% fewer instructions per op).
The write and alternator paths are **unaffected in practice** — changes there are within measurement noise. No regressions are apparent.
This is expected: https://github.com/scylladb/scylladb/pull/29033 did the heavy lifting when it comes to the write path, this PR finishes the job, mostly improving reads.

Fixes: #3602

Improvement, no backport.

Closes scylladb/scylladb#29127

* github.com:scylladb/scylladb:
  mutation/collection_mutation: make collection_mutation::_data private
  mutation_collection: drop collection_mutation_description and friends
  test: move away from collection_mutation_description
  tree: move away from collection_mutation_description
  test: move away from collection_mutation_view::with_deserialized()
  tree: move away from collection_mutation_view::with_deserialized()
  types: fix indentation, left broken by previous commit
  types: move away from collection_mutation_view::with_deserialized()
  types: serialize_for_cql(): use throwing_assert() instead of SCYLLA_ASSERT()
  schema: column_computation: move away from collection_mutation_view::with_deserialized()
  mutation: move away from collection_mutation_view::with_deserialized()
  alternator: move away from collection_mutation_view::with_deserialized()
  cdc: move away from collection_mutation_view::with_deserialized()
  mutation/collection_mutation: printer: don't deserialize collections
  mutation/collection_mutation: difference(): don't deserialize collections
  mutation/collection_mutation: merge(): don't deserialize collections
  mutation/collection_mutation: extract compact_and_expire() to free function
  mutation/collection_mutation: refactor empty(), is_any_live() and last_update()
  compaction_garbage_collector: pass collection_mutation to collect()
  test/boost/mutation_test: add tests for collection_mutation_{view,writer}
  mutation/collection_mutation: collection_mutation_view: add methods to inspect content
  mutation/collection_mutation: add collection_mutation_writer
  mutation/collection_mutation: collection_mutation(): generate valid collection
  mutation/collection_mutation: collection_mutation(): remove unused abstract_type param
  mutation/atomic_cell: drop unused type param from from_bytes()
2026-05-21 17:10:40 +03:00
Patryk Jędrzejczak
1ed3f5c4af Merge 'storage_service: cancel write handlers during drain to prevent shutdown deadlock' from Petr Gusev
Fixes a shutdown deadlock where a node hangs because `stale_versions_in_use()` blocks on stale `token_metadata` versions held by write handlers whose `MUTATION_DONE` responses can never arrive (transport is already stopped).

Two manifestations depending on whether the shutting-down node is the topology coordinator:
- Coordinator: do_drain → wait_for_group0_stop deadlocks because the topology coordinator fiber is stuck in barrier_and_drain → stale_versions_in_use().
- Non-coordinator: ss::stop → uninit_messaging_service deadlocks because the barrier_and_drain RPC handler holds the gate open.

The non-coordinator case was fixed in PR #24714 (cancel all write requests on storage_proxy shutdown), but its test never actually failed: the write handler always captured the current token_metadata version because `pause_before_barrier_and_drain` used `one_shot=True`, so only the first `barrier_and_drain` was paused. The topology state hadn't advanced by that point, meaning the write handler's ERM version matched the current version and `stale_versions_in_use()` returned immediately. The coordinator case was not covered at all.

Cancel all write response handlers on all shards right after `stop_transport()` in `do_drain()`. This releases their ERMs and the associated stale token_metadata versions, unblocking `stale_versions_in_use()`.

Fixed the test to ensure the write handler holds a stale version: use one_shot=False, let the first barrier_and_drain through (version still current), then wait for the second one (version now stale). Extended to cover both coordinator and non-coordinator shutdown on the same 2-node cluster.

Also includes supporting changes:
- error_injection: release wait_for_message waiters on disable() so the test can atomically unblock paused handlers
- error_injection: add non-shared mode to wait_for_message for per-invocation message semantics
- scylla_cluster.py: allow stop() to bypass start_stop_lock so SIGKILL works while stop_gracefully is blocked

Fixes: SCYLLADB-1842
Refs: scylladb/scylladb#23665

backports: SCYLLADB-1842 reported a failure in 2025.1, so we need to backport to all versions starting from 2025.1

Closes scylladb/scylladb#29882

* https://github.com/scylladb/scylladb:
  storage_service: cancel write handlers during drain to prevent shutdown deadlock
  test_unfinished_writes_during_shutdown: extend to cover coordinator shutdown
  test_unfinished_writes_during_shutdown: fix to reproduce the shutdown deadlock
  test_unfinished_writes_during_shutdown: await add_last_node_task instead of cancelling it
  test_unfinished_writes_during_shutdown: add timeout and deadlock detection for shutdown_task
  test: scylla_cluster: allow stop() to bypass start_stop_lock
  error_injection: add non-shared mode to wait_for_message
  error_injection: release waiters when injection is disabled
2026-05-21 15:43:36 +02:00
Piotr Dulikowski
6148316f66 Merge 'db/view/view_building_coordinator: add flag to mark if any remote work was finished' from Michał Jadwiszczak
There is a small window, just after the view building coordinator releases
the group0 guard and before it waits on view_building_state_machine's CV,
in which the coordinator may miss a CV broadcast triggered by finished
remote work.

To fix it, this patch adds a boolean flag, which is set to true before
broadcasting the CV and is checked before awaiting on the CV.
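
A minimal sketch of the flag-plus-CV pattern, using Python's `asyncio.Condition` as a stand-in for the Seastar condition variable. The class and method names are hypothetical, and resetting the flag after a successful wait is an assumption rather than something stated in the patch.

```python
import asyncio


class WorkSignal:
    """Condition variable paired with a flag, so a broadcast sent before the wait is not lost."""

    def __init__(self) -> None:
        self._cond = asyncio.Condition()
        self._remote_work_finished = False

    async def notify_finished(self) -> None:
        async with self._cond:
            self._remote_work_finished = True  # set the flag before broadcasting
            self._cond.notify_all()

    async def wait_for_work(self) -> None:
        async with self._cond:
            # Check the flag first: if work finished while nobody was waiting,
            # return at once instead of sleeping on a broadcast that already happened.
            while not self._remote_work_finished:
                await self._cond.wait()
            self._remote_work_finished = False  # consume the notification (assumption)


async def demo() -> bool:
    signal = WorkSignal()
    await signal.notify_finished()  # broadcast arrives before the coordinator waits
    await asyncio.wait_for(signal.wait_for_work(), timeout=1.0)  # returns, does not hang
    return True
```

Without the flag, `wait_for_work()` in `demo()` would block until the timeout, because the only broadcast was sent before anyone was waiting.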

Fixes SCYLLADB-2029

The problem is not critical, but the fix should be backported to 2025.4 and newer versions, all of which contain the view building coordinator.

Closes scylladb/scylladb#27313

* github.com:scylladb/scylladb:
  test/cluster/test_view_building_coordinator: add reproducer
  db/view/view_building_coordinator: add flag to mark if any remote work was finished
2026-05-21 15:11:58 +02:00
Andrzej Jackowski
f8156702de tree: add missing -present to copyright headers
~2076 files used "Copyright (C) YYYY-present ScyllaDB" while
~88 files used "Copyright (C) YYYY ScyllaDB". This
inconsistency leads to unnecessary code review discussions
and gradual spread of the less common format.

Standardize all ScyllaDB copyright headers to use -present.

Fixes SCYLLADB-1984

Closes scylladb/scylladb#29876
2026-05-21 10:57:42 +02:00
Wojciech Mitros
13c043903d strong_consistency: cache leader location for non-replica nodes
When a non-replica node handles a strongly consistent write, it must
forward the request to a replica. If the closest replica is not the
leader, the request gets redirected again, causing an extra roundtrip.

Add a leader location cache in groups_manager, keyed by raft group_id.
After a write request is forwarded, the CQL transport layer records the
final node as the leader in the cache. Subsequent write requests from
the same node for the same group are forwarded directly to the cached
leader, eliminating the extra roundtrip.

The cache is only used for writes. Reads can be served by any replica,
so they skip the cache and use proximity-based routing instead.

Cache entries are validated at use time: if the cached leader is no
longer a replica (e.g. after tablet migration), the entry is evicted
and the normal closest-replica path is taken. This prevents a scenario
where two nodes keep redirecting to each other because both think that
the other is the leader but actually both are non-replicas - such a loop
is broken as soon as the tablet maps are updated.

On token_metadata updates, entries for groups that no longer exist
(e.g. table dropped, tablet merged) are evicted. Entries for groups
that still exist are kept — use-time validation handles staleness.
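
A minimal sketch of the cache behavior described above, in Python rather than the C++ of groups_manager. All names (`LeaderCache`, `record_leader`, `lookup`, `prune`, `pick_write_target`) are hypothetical; nodes and group ids are plain strings for illustration.

```python
from typing import Dict, Optional, Set


class LeaderCache:
    """Remembers the last known leader per raft group; entries are validated when used."""

    def __init__(self) -> None:
        self._leaders: Dict[str, str] = {}

    def record_leader(self, group_id: str, node: str) -> None:
        # Called once the final node of a forwarded write is known.
        self._leaders[group_id] = node

    def lookup(self, group_id: str, replicas: Set[str]) -> Optional[str]:
        node = self._leaders.get(group_id)
        if node is None:
            return None
        if node not in replicas:
            # Cached leader is no longer a replica (e.g. tablet migrated): evict.
            del self._leaders[group_id]
            return None
        return node

    def prune(self, existing_groups: Set[str]) -> None:
        # On a token_metadata update, drop entries for groups that no longer exist.
        for group_id in list(self._leaders):
            if group_id not in existing_groups:
                del self._leaders[group_id]


def pick_write_target(cache: LeaderCache, group_id: str,
                      replicas: Set[str], closest_replica: str) -> str:
    """Writes prefer the cached leader and fall back to the closest replica."""
    return cache.lookup(group_id, replicas) or closest_replica


cache = LeaderCache()
cache.record_leader("g1", "node-b")
target = pick_write_target(cache, "g1", {"node-a", "node-b"}, "node-a")
```

Reads would bypass `pick_write_target` entirely and go straight to the closest replica.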

An on_node_resolved callback is propagated through the redirect/bounce
path so the transport layer can update the cache generically without
coupling to the strong-consistency coordinator. The coordinator creates
the callback only for writes (capturing the groups_manager and
group_id) and attaches it to the bounce message; the transport layer
invokes it once the final node is known, keeping the forwarding
infrastructure subsystem-agnostic.

We also add a test which verifies that after the initial redirect,
following requests to the same node avoid the extra redirect and
forward directly to the leader.

Fixes: SCYLLADB-1064

Closes scylladb/scylladb#29392
2026-05-21 10:32:56 +02:00
Gleb Natapov
cc034f84c5 schema: ensure committed_by_group0 is set for all non-system tables on boot
Tables created before the GROUP0_SCHEMA_VERSIONING feature was enabled
have committed_by_group0 = null in system_schema.scylla_tables. This
causes maybe_delete_schema_version() to delete their version cell,
forcing the legacy hash-based schema version computation path.

Add ensure_committed_by_group0() which runs on boot and fixes up any
non-system tables where committed_by_group0 is not true (null or false):

1. Queries system_schema.scylla_tables for rows where committed_by_group0
   is null or false, skipping system keyspaces (system, system_schema).
2. Takes a group0 guard.
3. Re-checks after the raft barrier in case another node already fixed it.
4. For each table needing fixup, creates a mutation writing the version
   cell (from the in-memory schema). The committed_by_group0 = true flag
   is stamped by add_committed_by_group0_flag() inside announce().
5. Announces via raft group0.
6. Retries with a small random delay on group0_concurrent_modification.
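
The re-check and retry steps above can be sketched as follows, in Python rather than C++. `ConcurrentModification` is a hypothetical stand-in for group0_concurrent_modification, `find_unfixed` and `announce` are placeholder callables, and the unbounded loop and delay bound are assumptions.

```python
import random
import time


class ConcurrentModification(Exception):
    """Hypothetical stand-in for group0_concurrent_modification."""


def fix_up_with_retry(find_unfixed, announce, max_delay_s=0.05, sleep=time.sleep):
    """Repeat the check-and-announce cycle until it commits or nothing is left to fix.

    Returns the list of tables written by the successful attempt (possibly empty).
    """
    while True:
        tables = find_unfixed()  # re-check each attempt: another node may have fixed them
        if not tables:
            return []
        try:
            announce(tables)  # may lose the race against a concurrent group0 change
            return tables
        except ConcurrentModification:
            sleep(random.uniform(0.0, max_delay_s))  # small random delay before retrying


# Example: the first announce loses a race, the second one succeeds.
state = {"unfixed": ["ks.t1", "ks.t2"], "failures_left": 1}


def announce(tables):
    if state["failures_left"] > 0:
        state["failures_left"] -= 1
        raise ConcurrentModification()
    state["unfixed"] = []


fixed = fix_up_with_retry(lambda: list(state["unfixed"]), announce, sleep=lambda s: None)
```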

On other nodes, schema_applier will detect these as "altered" tables
(scylla_tables mutation changed), but since the actual table definition
is unchanged, update_column_family is effectively a no-op.

This is a prerequisite for eventually removing the legacy hash-based
schema versioning code path.

Closes scylladb/scylladb#29911
2026-05-21 10:22:07 +02:00
Patryk Jędrzejczak
cbadc3d675 test: fix flaky test_raft_snapshot_truncation by waiting for async log truncation
Snapshot creation and raft log truncation happen asynchronously in the
IO fiber after a schema change completes. The test was querying
system.raft immediately after the schema change returned, racing with
the IO fiber's store_snapshot_descriptor call.

Replace immediate assertions with wait_for polling loops:
- log_size == 0: wait for log truncation after drop keyspace
- new_snap_id != original_snap_id: wait for new snapshot to be persisted
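
A minimal sketch of such a polling loop. `poll_until` is a hypothetical helper written for illustration, not the `wait_for` used by the test suite; the convention that the probe returns `None` to keep waiting is an assumption.

```python
import time


def poll_until(probe, timeout_s=30.0, period_s=0.1):
    """Call `probe` until it returns something other than None, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = probe()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s}s")
        time.sleep(period_s)


# Example: a fake raft log whose size reaches 0 on the third observation.
log_sizes = iter([5, 2, 0])


def log_truncated():
    return True if next(log_sizes) == 0 else None


truncated = poll_until(log_truncated, timeout_s=5.0, period_s=0.01)
```

Polling tolerates the delay between the schema change returning and the IO fiber persisting the snapshot, which an immediate assertion does not.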

Fixes: SCYLLADB-2120

Closes scylladb/scylladb#29967
2026-05-21 10:50:00 +03:00
Artsiom Mishuta
2259307c2e test.py: remove redundant pytest.mark.asyncio decorators
Fixes: SCYLLADB-1935
2026-05-21 10:36:47 +03:00
Botond Dénes
da7903de79 test: move away from collection_mutation_description
Use collection_mutation_writer instead.
2026-05-21 10:23:29 +03:00
Botond Dénes
c76ab90fb2 test: move away from collection_mutation_view::with_deserialized()
Use the collection_mutation_view directly.
2026-05-21 10:23:29 +03:00
Botond Dénes
7c8b5681f4 mutation/collection_mutation: extract compact_and_expire() to free function
The new free-function variant operates on a collection_mutation_view
directly, instead of on collection_mutation_description.
2026-05-21 10:23:15 +03:00
Andrzej Jackowski
d2bb72438e test: audit: pin empty-keyspace DDL audit behavior
9646ee05bd changed behavior of empty keyspace handling
and this code path was never tested for CQL audit.
Test CREATE/DROP FUNCTION and CREATE/DROP AGGREGATE
targeting both an existing keyspace and a nonexistent
one to verify both are audited with empty keyspace.

Before 9646ee05bd, an empty keyspace in audit_info
would be checked against audit_keyspaces like any other
value, silently skipping the statement when "" did not
match any configured keyspace. That commit introduced a
will_log() helper that treats an empty keyspace as
unfilterable, so these DDL statements are now always
logged when their category matches.
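
A minimal Python model of the described filtering rule. The real `will_log()` helper lives in the server's C++ code; the signature and parameter names below are assumptions, and table-level filtering is left out.

```python
def will_log(category: str, keyspace: str,
             audit_categories: set, audit_keyspaces: set) -> bool:
    """Model of the rule: an empty keyspace cannot be filtered, so only the category decides."""
    if category not in audit_categories:
        return False
    if keyspace == "":
        return True  # unfilterable: always logged when the category matches
    return keyspace in audit_keyspaces


# CREATE FUNCTION audited with an empty keyspace is logged even though "" is not configured.
logged = will_log("DDL", "", {"DDL"}, {"ks1"})
```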

Refs SCYLLADB-1641
2026-05-21 08:49:44 +02:00
Andrzej Jackowski
2c15277d02 test: audit: restart server when any non-live config key changes
_check_restart_needed only compared NON_LIVE_AUDIT_KEYS against the
running server config, so extra keys like enable_user_defined_functions
were silently ignored and never applied. Generalize the check to
restart whenever any key outside LIVE_AUDIT_KEYS differs.
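
A minimal sketch of the generalized check. The contents of `LIVE_AUDIT_KEYS` and the function name `restart_needed` are assumptions made for illustration, not copied from the test helper.

```python
# Hypothetical set of keys that can be changed on a running server.
LIVE_AUDIT_KEYS = {"audit_categories", "audit_keyspaces", "audit_tables"}


def restart_needed(running_config: dict, target_config: dict) -> bool:
    """Restart if any requested key outside LIVE_AUDIT_KEYS differs from the running value."""
    return any(
        running_config.get(key) != value
        for key, value in target_config.items()
        if key not in LIVE_AUDIT_KEYS
    )


running = {"audit": "table", "audit_categories": "DDL"}
needs_restart = restart_needed(running, {"audit": "table", "enable_user_defined_functions": True})
```

With the old allow-list approach, `enable_user_defined_functions` would not have been compared at all.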
2026-05-21 08:49:44 +02:00
Botond Dénes
f8ac8540bd Merge 'logstor: compare records by timestamp and segment sequence number' from Michael Litvak
Add a record timestamp. The timestamp is extracted from the row marker
of the mutation when we write it.
When inserting a record into the index, we compare it with the existing
record, and insert it only if it has a newer timestamp.

Add a segment sequence number that is a global (per-shard) increasing
number that is allocated when getting a new segment for write, and is
written in buffer headers in the segment.
It is used to distinguish between buffers written to different generations
of a segment, and for recovery to break ties by keeping the record
from the newest segment.
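
A minimal sketch of the ordering, in Python. The `Record` type and `should_replace` function are hypothetical, and folding both rules (newer timestamp wins; on a tie the newer segment wins) into a single tuple comparison is an assumption about how they combine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    key: str
    timestamp: int    # taken from the mutation's row marker
    segment_seq: int  # per-shard sequence number of the segment holding the record


def should_replace(existing: Optional[Record], incoming: Record) -> bool:
    """Newer timestamp wins; equal timestamps are resolved by the newer segment."""
    if existing is None:
        return True
    return (incoming.timestamp, incoming.segment_seq) > (existing.timestamp, existing.segment_seq)


old = Record("k", timestamp=10, segment_seq=3)
same_ts_newer_segment = Record("k", timestamp=10, segment_seq=7)
tie_broken = should_replace(old, same_ts_newer_segment)
```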

Refs https://scylladb.atlassian.net/browse/SCYLLADB-770

no backport - logstor is a new feature

Closes scylladb/scylladb#29933

* github.com:scylladb/scylladb:
  test: logstor: add basic delete test
  logstor: rewrite segment seq num from streaming
  logstor: add segment sequence number
  logstor: get_segment helper
  logstor: compare records by timestamp
2026-05-21 08:44:18 +03:00
Andrzej Jackowski
29b7bef15d test: audit: rename 'needed' to 'target_config' for clarity
2026-05-21 07:41:51 +02:00
Botond Dénes
c5d12d44c6 test/boost/mutation_test: add tests for collection_mutation_{view,writer}
Test the new facilities for producing and inspecting serialized
collection mutations directly, without intermediate formats.
2026-05-21 08:34:21 +03:00
Botond Dénes
24fdfa34dd mutation/collection_mutation: collection_mutation(): remove unused abstract_type param
2026-05-21 08:34:21 +03:00