Commit Graph

54113 Commits

Author SHA1 Message Date
Botond Dénes
853edcbf75 tracing: add_query(): change query param to utils::chunked_string
Having to unconditionally linearize the chunked query string when
passing it to tracing undoes the work put into reducing large
allocations on the query path. add_query() is evaluated eagerly on
every query, even if tracing is disabled. Defer the linearization to
build_parameters_map(), which is only called if tracing is enabled.
2026-05-26 09:08:06 +03:00
Botond Dénes
6c3f104b67 cql3: store raw query string in utils::chunked_string
Read query as fragmented string from the input stream in
transport/server.cc, propagate it as such to query_processor::prepare()
and also store it as such in cql3::cql_statement::raw_cql_statement.

Unfortunately, the query still has to be linearized for parsing, as
ANTLR -- although it allows for custom InputStream implementations --
plays pointer-arithmetic games with the pointers obtained from them, so
fragmented input cannot be used.
To amortize the cost of this linearization, the query string is
linearized through utils::reusable_buffer. The parser can be
invoked recursively; nested invocations linearize directly.

Still, this patch limits the places where the query is linearized to the
following:
* Parsing
* Audit
* Logs and error messages

So the normal query paths for queries that actually can get arbitrarily
large (UPDATE and INSERT) should only linearize the query temporarily
for parsing.
2026-05-26 09:08:06 +03:00
Botond Dénes
bf1a775fe4 serializer: add serializer<utils::chunked_string>
Also add normalizer which maps to sstring. utils::chunked_string's wire
representation is binary compatible with that of sstring, which allows
for seamless migration of RPCs from sstring to utils::chunked_string
where needed. Will be used in the next commit for forward CQL prepare
request (query string).
2026-05-26 09:08:06 +03:00
Botond Dénes
05cfd7ac5e utils/reusable_buffer: add get_linearized_view(managed_bytes_view)
Allow using reusable buffer with managed bytes too. To be used soon to
amortize linearizing query strings before passing them to ANTLR for
parsing.
2026-05-26 09:08:06 +03:00
Botond Dénes
4af3359744 cql3/expr: use utils::chunked_string for untyped_constant::raw_text
This value can be a string or bytes literal, which can get very large in
rare cases. Use chunked storage to avoid large allocations.
2026-05-26 09:08:06 +03:00
Botond Dénes
2c9a5f9634 types: abstract_type::from_string() switch to fragmented buffers (implementation)
The previous patch changed the interface and callers, this one updates
the implementation to actually work with fragmented buffers. Most types
just use with_linearized() to linearize the fragmented input buffer for
parsing. This is fine, as most types have a fixed or bounded-size string
representation that is small.
Importantly, the input is not linearized for the 3 types which have
unbounded values: ascii, bytes and text. The tuple type can contain any
of these types itself, so it is also converted to avoid linearization.
2026-05-26 09:08:06 +03:00
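
The with_linearized() pattern mentioned in 2c9a5f9634 can be illustrated with a minimal sketch. This is not the ScyllaDB implementation: `fragments`, `with_linearized_sketch` and `parse_bool` are hypothetical names, and a vector of strings stands in for a fragmented buffer.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a fragmented buffer: a list of contiguous chunks.
using fragments = std::vector<std::string>;

// Calls fn with a contiguous view of the data. A single-fragment buffer is
// passed through without copying; otherwise the fragments are copied into a
// temporary that lives only for the duration of the call.
template <typename Fn>
auto with_linearized_sketch(const fragments& frags, Fn&& fn) {
    if (frags.empty()) {
        return fn(std::string_view{});
    }
    if (frags.size() == 1) {
        return fn(std::string_view{frags.front()});
    }
    std::string tmp;
    for (const auto& f : frags) {
        tmp += f;
    }
    return fn(std::string_view{tmp});
}

// A type with a small, bounded string representation can afford to linearize.
inline bool parse_bool(const fragments& frags) {
    return with_linearized_sketch(frags, [](std::string_view s) { return s == "true"; });
}
```

The temporary copy is acceptable only because the input is known to be small; unbounded types would process the fragments one by one instead.
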
Botond Dénes
597d4252dc types: abstract_type::from_string() switch to fragmented buffers (interface)
Change input: std::string_view -> utils::chunked_string_view.
Change return value: bytes -> managed_bytes.

This patch only changes the interface, with some to_bytes() sprinkled in
the internals to deal with recursive calls.
Internals will be updated in the next patch, to keep the churn of
updating callers separate from the actually important changes.
2026-05-26 09:08:06 +03:00
Botond Dénes
c8aba19114 types: use write_fragmented from utils/fragment_range.hh
Instead of local open-coded equivalent (used in a single place).
2026-05-26 09:08:06 +03:00
Botond Dénes
40cb9f8ccb types: timestamp_from_string(): don't assume std::string_view is null-terminated
std::string_view is not guaranteed to point to null-terminated string
literals, it may point to a substring of such a string or a string which
is not null-terminated.

std::strtoll() assumes a null terminated string and triggers heap buffer
overflow if this is not true.
Use std::from_chars() -- which doesn't assume or require null-terminated
strings -- to parse numbers from strings instead of std::strtoll().

While at it: fix a small mistake in error reporting. When reporting
failure to parse the number, include the original string in the error
report, instead of the (failed-to-parse) number.

Not a problem on current master, as all callers pass null-terminated
strings.
2026-05-26 09:08:06 +03:00
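
For commit 40cb9f8ccb, a minimal sketch of parsing a number from a std::string_view with std::from_chars follows. `parse_i64` is a hypothetical helper, not the function changed in the commit.

```cpp
#include <cassert>
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Parses the whole view as a base-10 signed integer.
// std::from_chars only reads [data(), data() + size()), so the view does not
// need to be null-terminated. Returns nullopt on error or trailing characters.
inline std::optional<long long> parse_i64(std::string_view s) {
    long long value = 0;
    const char* first = s.data();
    const char* last = s.data() + s.size();
    auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{} || ptr != last) {
        return std::nullopt;
    }
    return value;
}
```

Unlike std::strtoll(), std::from_chars does not skip leading whitespace and does not accept a leading '+', so callers that relied on either behaviour would have to handle it themselves.
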
Botond Dénes
ffefa1aa05 types/duration: don't assume std::string_view is null-terminated
std::string_view is not guaranteed to point to null-terminated string
literals, it may point to a substring of such a string or a string which
is not null-terminated.

The cql_duration() constructor obtains the data() pointer from
std::string_view and creates another std::string_view from it, after some
conditional pointer arithmetic. Constructing a new std::string_view from a
raw pointer, without specifying its length, will lead to strlen() being
called on the pointer, resulting in undefined behaviour if the string
is not null-terminated. Use substr() instead of pointer arithmetic to
avoid this problem altogether.

boost::regex_match() invocations also use std::string_view::data().
This leads to strlen() and heap-buffer-overflow if the string is not
null-terminated. Invoke the overload which takes an iterator pair
instead.

Not a problem on current master, as all callers pass null-terminated
strings.
2026-05-26 09:08:06 +03:00
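
The substr() approach from ffefa1aa05 can be sketched as below. `strip_sign` is a hypothetical helper, assuming the conditional pointer arithmetic was skipping an optional leading sign.

```cpp
#include <cassert>
#include <string_view>

// Strips an optional leading '-' and reports whether it was present.
// substr() carries the remaining length along, so nothing ever scans for a
// terminating '\0' (unlike constructing std::string_view from a bare pointer).
inline std::string_view strip_sign(std::string_view s, bool& negative) {
    negative = !s.empty() && s.front() == '-';
    return negative ? s.substr(1) : s;
}
```

Because the result is still a view into the original buffer with an explicit length, it stays valid for inputs that are substrings of a larger string.
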
Botond Dénes
a9028d88b2 utils/hashers: add calculate(managed_bytes_view) overload
Uses update() for each fragment, then finalize. Yields identical hash to
calling calculate(std::string_view) with linearized buffer. This is
checked by new tests.
2026-05-26 09:08:05 +03:00
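
The property claimed in a9028d88b2 (per-fragment update() yields the same hash as hashing the linearized buffer) can be illustrated with a minimal sketch. FNV-1a is used here only as a stand-in hasher; `fnv1a_hasher` and both `calculate` overloads are hypothetical and unrelated to the hashers in utils/hashers.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Minimal incremental hasher (FNV-1a, 64-bit) with update()/finalize().
class fnv1a_hasher {
    std::uint64_t _state = 14695981039346656037ull;  // FNV offset basis
public:
    void update(std::string_view data) {
        for (unsigned char c : data) {
            _state ^= c;
            _state *= 1099511628211ull;  // FNV prime
        }
    }
    std::uint64_t finalize() const { return _state; }
};

// Hash of a contiguous buffer.
inline std::uint64_t calculate(std::string_view linearized) {
    fnv1a_hasher h;
    h.update(linearized);
    return h.finalize();
}

// Hash of a fragmented buffer: feed each fragment in order, then finalize.
inline std::uint64_t calculate(const std::vector<std::string>& fragments) {
    fnv1a_hasher h;
    for (const auto& f : fragments) {
        h.update(f);
    }
    return h.finalize();
}
```

The equality holds for any hasher whose state depends only on the byte sequence seen so far, regardless of where the fragment boundaries fall.
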
Botond Dénes
e924b5871b utils/ascii: add validate(managed_bytes_view) overload
Allow validating fragmented buffers without linearization.
2026-05-26 09:08:05 +03:00
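
A minimal sketch of validating a fragmented buffer without linearization, as in e924b5871b. `validate_ascii_fragments` is a hypothetical name and a vector of strings stands in for managed_bytes_view.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns true if every byte in every fragment is 7-bit ASCII.
// ASCII validity is a per-byte property, so no state has to be carried
// across fragment boundaries and no contiguous copy is needed.
inline bool validate_ascii_fragments(const std::vector<std::string>& fragments) {
    for (const auto& frag : fragments) {
        for (unsigned char c : frag) {
            if (c > 0x7f) {
                return false;
            }
        }
    }
    return true;
}
```
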
Botond Dénes
332d011530 utils: add managed_bytes_fwd.hh
Forward declaration of managed_bytes[_view].
enum class mutable_view was moved from utils/managed_bytes.hh to
utils/mutable_view.hh, because it is needed in the forward declaration.
2026-05-26 09:08:05 +03:00
Botond Dénes
a2fff12bcd utils: add chunked_string
A thin facade over managed_bytes[_view], offering some extra
convenience for working with strings, as well as a strong type
communicating the purpose (storing text instead of a blob).

Also introduces utils::from_hex(chunked_string_view), a fragmented
hex-decode that operates directly on a chunked_string_view without
requiring linearization. Hex pairs straddling fragment boundaries
are handled via a carry-over nibble.
2026-05-26 09:08:05 +03:00
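
The carry-over nibble technique described in a2fff12bcd can be sketched as follows. This is an illustration under assumptions, not utils::from_hex itself: `hex_value` and `from_hex_fragments` are hypothetical, and fragments are modelled as a vector of strings.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Converts one hex digit to its value, throwing on anything else.
inline int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    throw std::invalid_argument("non-hex character");
}

// Decodes hex text spread over several fragments. A digit pair split across
// a fragment boundary is handled by keeping the high nibble in `carry` until
// the low nibble arrives in the next fragment.
inline std::string from_hex_fragments(const std::vector<std::string>& fragments) {
    std::string out;
    std::optional<int> carry;
    for (const auto& frag : fragments) {
        for (char c : frag) {
            const int v = hex_value(c);
            if (carry) {
                out.push_back(static_cast<char>((*carry << 4) | v));
                carry.reset();
            } else {
                carry = v;
            }
        }
    }
    if (carry) {
        throw std::invalid_argument("odd number of hex digits");
    }
    return out;
}
```
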
Botond Dénes
09743aed36 utils: add managed_bytes_basic_view::byte_iterator
Byte-wise iterator which works both as a bidirectional iterator and as
an output iterator (for mutable views). Allows using managed_bytes_view in
algorithms which are iterator based.

Added unit tests covering the iterator functionality.
2026-05-26 09:08:05 +03:00
Nadav Har'El
96dd3121e7 Merge 'cql: rewrite CassIO SAI metadata index to regular secondary index' from Szymon Wasik
CassIO (the library backing LangChain's `langchain_community.vectorstores.Cassandra` integration) issues the following DDL during schema setup to create a metadata index:

```sql
CREATE CUSTOM INDEX IF NOT EXISTS eidx_metadata_s_<table>
ON <keyspace>.<table> (ENTRIES(metadata_s))
USING 'org.apache.cassandra.index.sai.StorageAttachedIndex';
```

ScyllaDB does not support Cassandra's StorageAttachedIndex (SAI) for non-vector columns and previously rejected this statement with:

```
StorageAttachedIndex (SAI) is only supported on vector columns; use a secondary index for non-vector columns
```

This blocks seamless migration of existing LangChain/CassIO applications from Cassandra to ScyllaDB — applications fail during initialization before any application-level workaround can run, even when metadata filtering is not used (`metadata_indexing="none"`).

CassIO is no longer actively maintained but remains the only official LangChain integration path for Apache Cassandra over CQL, meaning existing applications will continue using this setup pattern.

Instead of rejecting the CassIO metadata-map SAI DDL, detect the pattern and rewrite it to a standard ScyllaDB secondary index on collection entries:

- **Detection**: SAI class name + single `ENTRIES` target on a non-frozen `map` column
- **Rewrite**: Clear the custom class so the index is created through the standard secondary index path (which already fully supports indexing map entries)
- **Warning**: Emit a CQL warning informing the user that SAI is not supported by ScyllaDB, a regular secondary index was created instead, and metadata filtering behavior may differ from Cassandra SAI

The rewrite is placed early in `validate_while_executing()`, before the rf-rack-validity check, so the standard secondary index code path handles all subsequent validation naturally — no code duplication.

After this change, the CassIO schema setup succeeds on ScyllaDB:
- `CREATE CUSTOM INDEX ... USING 'sai'` on `ENTRIES(metadata_s)` creates a real secondary index
- The index is functional and can accelerate metadata filtering queries
- A CQL warning makes the rewrite transparent to operators
- SAI on non-vector, non-map-entries columns is still rejected as before
- Vector SAI indexes continue to be rewritten to `vector_index` as before

- `test_sai_entries_on_map_creates_regular_index` — verifies the index is created and the warning is emitted (fully-qualified SAI class name)
- `test_sai_entries_on_map_short_name` — same with the `'sai'` short alias
- `test_sai_on_regular_column_rejected` — confirms SAI on regular scalar columns is still rejected

All 148 tests in `test_vector_index.py` and `test_secondary_index.py` pass with no regressions (125 passed, 22 xfailed, 1 skipped).

Fixes: SCYLLADB-2113
Backport: 2026.2 as this is the version where the support for SAI class needed by LangChain was added.

Closes scylladb/scylladb#29981

* github.com:scylladb/scylladb:
  cql: rewrite CassIO SAI metadata index to regular secondary index
  db/config: add enable_cassio_compatibility flag
2026-05-26 00:19:03 +03:00
Botond Dénes
722efb4d8f storage_proxy: avoid large allocation in data_read_resolver::resolve
The versions collection in data_read_resolver::resolve() is a
std::vector<std::vector<version>>. This contains one entry per unique
partition in the union of all results from each replica.
The vector's capacity is reserved to the number of partitions in the first
replica's response. Later, new entries are added via `emplace_back()`
for partitions found only in other replicas' responses.
This can become really large if there are a lot of small partitions, and
especially when there are big differences between the partition sets
returned by individual replicas.

With small partitions (e.g. Alternator items with TTL, typically 150-200
bytes each), a single 1 MB read page can carry thousands of partitions,
easily pushing this vector past 2730 entries -- the point at which a
std::vector doubling reallocation exceeds the 128 KB seastar
large-allocation warning threshold:

    2 * 2731 * sizeof(std::vector<version>=24) > 131072

Switching to utils::chunked_vector caps every individual allocation at
128 KB by design, regardless of the number of partitions or how much
the replicas diverge.  The four internal helper functions that receive
this container (find_short_partitions, get_last_row,
got_incomplete_information_across_partitions, got_incomplete_information)
are updated to accept the new type; their logic is unchanged.

Fixes: SCYLLADB-460

Closes scylladb/scylladb#29325
2026-05-25 21:09:36 +03:00
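
The threshold arithmetic in 722efb4d8f can be checked at compile time. The sketch assumes, as the commit message does, a 128 KiB warning threshold, 24-byte elements, and a vector that doubles its capacity when full; `doubled_allocation` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>

// Assumptions taken from the commit message.
constexpr std::size_t threshold = 128 * 1024;  // 131072 bytes
constexpr std::size_t element_size = 24;       // sizeof(std::vector<version>)

// Size of the new buffer when a full vector of n elements doubles its capacity.
constexpr std::size_t doubled_allocation(std::size_t n) {
    return 2 * n * element_size;
}

static_assert(doubled_allocation(2730) == 131040);
static_assert(doubled_allocation(2730) <= threshold);
static_assert(doubled_allocation(2731) == 131088);
static_assert(doubled_allocation(2731) > threshold);
```
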
Michał Hudobski
1d17d2144f index, vector_index: limit primary key columns to 255
The vector-store's InvariantKey type supports at most 255 key
components. Reject vector index creation when the base table's
primary key (partition + clustering columns) exceeds this limit.

Fixes: VECTOR-553

Closes scylladb/scylladb#29317
2026-05-25 19:24:17 +03:00
Szymon Wasik
5ee339b11d cql: rewrite CassIO SAI metadata index to regular secondary index
When CassIO creates a SAI ENTRIES index on a map column,
ScyllaDB now rewrites it to a regular secondary index and emits
a CQL warning. This allows LangChain/CassIO applications to work
without DDL errors.

The rewrite is gated behind the enable_cassio_compatibility flag
(disabled by default).

Refs: SCYLLADB-2113
2026-05-25 15:11:43 +02:00
Botond Dénes
db89f3f095 Merge 'compaction_manager: unregister compaction module on early shutdown' from Patryk Jędrzejczak
The compaction module is registered with task_manager in the compaction_manager
constructor, and unregistered in compaction_manager::really_do_stop(), which
was gated behind `_state != state::none` in compaction_manager::do_stop().
Since enable() -- which transitions _state from none to running -- is called
later during startup (from database::start() or the disk space monitor callback)
than the compaction_manager constructor, an early shutdown could leave the
compaction module registered after compaction_manager::do_stop() returned.
task_manager::stop() then aborted with 'Tried to stop task manager while
some modules were not unregistered'.

Fix compaction_manager::do_stop() to call _task_manager_module->stop() even
when `_state == state::none`, so that the compaction module is always properly
unregistered.

Fixes: SCYLLADB-2106

Backport to all supported branches, as the bug is there and it has
already caused a failure in 2026.1 CI.

Closes scylladb/scylladb#30015

* github.com:scylladb/scylladb:
  test: add test_stop_before_starting_compaction_manager
  compaction_manager: unregister compaction module on early shutdown
2026-05-25 16:08:20 +03:00
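
The shutdown ordering fixed in db89f3f095 can be modelled with a small sketch. The types below are hypothetical and far simpler than compaction_manager; they only show why unregistering must not depend on enable() having run.

```cpp
#include <cassert>

// Hypothetical stand-in for the task manager module handle.
struct task_module_sketch {
    bool registered = true;  // registered in the owner's constructor
    void stop() { registered = false; }
};

class compaction_manager_sketch {
public:
    enum class state { none, running, stopped };

    void enable() { _state = state::running; }

    void do_stop() {
        if (_state == state::stopped) {
            return;
        }
        // Work that only exists once enable() has run.
        if (_state == state::running) {
            _tasks_stopped = true;
        }
        // Unregister unconditionally, so that an early shutdown
        // (state::none) does not leave the module registered.
        _module.stop();
        _state = state::stopped;
    }

    bool module_registered() const { return _module.registered; }
    bool tasks_stopped() const { return _tasks_stopped; }

private:
    state _state = state::none;
    task_module_sketch _module;
    bool _tasks_stopped = false;
};
```
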
Dmitry Kropachev
74fa423271 transport: report host id in SUPPORTED
Currently the driver creates the network layout (node IP addresses and ports)
from `system.local`, `system.peers`, `system.client_routes` and then
runs on the assumption that this network layout is correct.
It does not check whether it is.
If, for example, a node's IP/port (say, on a proxy) does not
match what the driver calculated, it goes unnoticed.

The goal of this feature is to provide the driver with the host id in the
SUPPORTED frame, so that it knows which node it connected to and can decide
whether to keep the connection or drop it.

- add `SCYLLA_HOST_ID` to the CQL `SUPPORTED` response
- add a regression test that hooks the Python driver handshake and
  verifies the reported host id

- `python3.12 -m py_compile test/cqlpy/test_protocol_exceptions.py`
- syntax-only compile of `transport/server.cc` with the repo toolchain
  flags inside `dbuild`

Refs #27452
Refs https://scylladb.atlassian.net/browse/DRIVER-610

Closes scylladb/scylladb#29809
2026-05-25 14:36:53 +03:00
Avi Kivity
892f22f49c Merge 'cql: atomic add/subtract operations with LWT' from Nadav Har'El
ScyllaDB has special counter columns for which atomic add/subtract operations like `SET a = a + 1` are allowed. Such operations have not been allowed on ordinary non-counter columns, as they would not be properly atomic - the read and the write are separate, and concurrent operations can have incorrect results.

This patch allows such atomic add/subtract operations in **LWT** statements. For example

	UPDATE ... SET a = a - 7 IF a > 0

or

	UPDATE ... SET a = a + 1 IF a != NULL

The row updated in the operation, and the updated column (`a`) should be initialized before the update. The example `SET a = a + 1 IF a != NULL` will fail the condition if `a` is not set. A different request `SET a = a + 1 IF EXISTS` will just leave `a` unset if it's unset (NULL + 1 is NULL, this is SQL's null propagation rules).

These add/subtract operations are allowed on any numeric (integer or floating point) column.

The ability of LWT to fetch the old values of a column and use it to calculate the new value has long been available in our internal CAS implementation - and has been in use for years in Alternator - but until this patch it was not exposed in CQL's LWT.

This series does not add new syntax to CQL - the "SET a = a + b" and "SET a = a - b" syntax already existed for counters, and we just allow the same syntax for non-counters. However, the series does add a bit of machinery that will allow us to easily support more general expressions in the future. In particular, this series implements the addition, subtraction, and unary-minus operators for expressions, and adds the machinery needed to run **any** expression in "SET a = expr()", using existing row values fetched by LWT.

This is a new Scylla-only feature that does not exist in Cassandra.

Fixes #10568
Refs #22918 ("Support arithmetic operators"), SCYLLADB-1576 ("Decimal arithmetic operations OOM")

This is a new feature, so normally it would not be backported.

Closes scylladb/scylladb#29939

* github.com:scylladb/scylladb:
  cql: atomic add/subtract operations with LWT
  cql3: let constants::setter evaluate expressions using prefetched row data
  cql3/expr: add NEG unary operator for numeric negation
  cql3/expr: add SUB binary operator for numeric subtraction
  cql3/expr: add ADD binary operator for numeric addition
  types: add is_arithmetic() method for types
2026-05-25 14:27:33 +03:00
Dmitry Kropachev
06eeaf48ff tests: avoid CQL_ALTERNATOR_QUERIED on zero-token nodes
The keyspace RF test starts zero-token nodes as part of its topology setup.

The python driver 3.29.9 can't schedule queries on zero-token nodes, so waiting for `CQL_ALTERNATOR_QUERIED` on those nodes is the wrong readiness gate.
This change makes the zero-token `server_add()` calls stop at `CQL_ALTERNATOR_CONNECTED`.
The test still exercises the keyspace replication assertions through a normal token-owning contact point.

Verified with running all 4 variations of `cluster.test_keyspace_rf::test_create_keyspace_with_default_replication_factor` on this branch.

Closes scylladb/scylladb#29779
2026-05-25 14:22:04 +03:00
Michael Litvak
704fb3a5fd logstor: fix compaction state removal
previously the logstor compaction state for a compaction group could be
removed by a compaction reenabler guard. this caused an invalid access
in stop_ongoing_compactions, because it holds an iterator to the
compaction state across a yield point, so the iterator can be
invalidated if erased by another source concurrently.

we change the compaction state removal to be done only in a remove()
function that is called when the compaction group is stopped, after
waiting for ongoing compaction to stop and after the gates are closed.
this is safer because we keep the compaction state while the compaction
group exists, and remove it only when it's stopped and there are no
compactions in progress. this is similar to compaction state removal for
non-logstor tables in compaction_group::stop.

Fixes SCYLLADB-2199

Closes scylladb/scylladb#30068
2026-05-25 14:07:07 +03:00
Piotr Dulikowski
3a5dd2e5be Merge 'strong_consistency: forward reads to the raft leader' from Wojciech Mitros
Strongly consistent reads currently call read_barrier() on whichever
replica happens to process the request. When a follower runs
read_barrier(), it sends an RPC to the leader to get the current read
index, then waits for its local apply index to catch up. If the follower
is behind, this wait can be significant.

By forwarding linearizable reads to the leader, we don't need an RPC from replica to leader to get the index to wait for apply -- it's available locally.

Note that read_barrier() is still required on the leader to confirm it
is still the leader and guarantee linearizability. A future optimization
would be to implement leases in the raft library, which could eliminate
read_barrier() on the leader entirely.

The CL-to-behavior mapping is isolated in a single parse_consistency_level()
function:
- CL=(LOCAL_)QUORUM -> linearizable: forwarded to the raft leader
- CL=(LOCAL_)ONE -> non-linearizable: existing behavior (no read_barrier()/forwarding, may return stale results)
- All other CLs -> invalid request
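
The mapping above can be sketched as a single switch. The enum definitions and the `_sketch` function are hypothetical; the real parse_consistency_level() signature and error type are not shown in this message.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical enums for illustration only.
enum class consistency_level {
    any, one, two, three, quorum, all,
    local_quorum, each_quorum, serial, local_serial, local_one
};
enum class read_mode { linearizable, non_linearizable };

// Maps a consistency level to the read behaviour listed above.
inline read_mode parse_consistency_level_sketch(consistency_level cl) {
    switch (cl) {
    case consistency_level::quorum:
    case consistency_level::local_quorum:
        return read_mode::linearizable;      // forwarded to the raft leader
    case consistency_level::one:
    case consistency_level::local_one:
        return read_mode::non_linearizable;  // may return stale results
    default:
        throw std::invalid_argument("unsupported consistency level for strongly consistent reads");
    }
}
```
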

Read forwarding reuses the same CQL-layer bounce_to_node() mechanism
that write forwarding already uses. The transport layer's existing
requests_forwarded_* metrics automatically count forwarded reads.
Coordinator-level metrics (linearizable_reads, non_linearizable_reads,
writes) are added for visibility into the strong consistency workload.

Fixes: SCYLLADB-1157

Closes scylladb/scylladb#29575

* github.com:scylladb/scylladb:
  strong_consistency: test read forwarding to leader
  strong_consistency: skip read_barrier() for non-linearizable reads
  strong_consistency: split coordinator-level read latency metrics
  strong_consistency: forward linearizable reads to raft leader
  strong_consistency: classify reads by consistency level
  strong_consistency: add begin_read() to raft_server
2026-05-25 10:55:00 +02:00
Nadav Har'El
f8aaeb5e87 cql: atomic add/subtract operations with LWT
ScyllaDB has special counter columns for which atomic add/subtract
operations like `SET a = a + 1` are allowed. Such operations have not
been allowed on ordinary non-counter columns, as they would not be
properly atomic - the read and the write are separate, and concurrent
operations can have incorrect results.

This patch makes it allowed to use such atomic add/subtract operations
in *LWT* statements. Some examples:

        UPDATE ... SET a = a - 1 IF a > 0

        UPDATE ... SET a = a + 1 IF EXISTS

        UPDATE ... SET a = a + 1 IF a != NULL

The row updated in the operation, and the updated column (a) should
be initialized before the update - arithmetic operations on missing
column values silently leave the column null (no error is generated).

These add/subtract operations are allowed on any numeric column -
integer or floating point of any size.

The ability of LWT to fetch the old values of a column and use it to
calculate the new value has long been available in our internal CAS
implementation - and has been in use for years in Alternator - but until
this patch it was not exposed in CQL's LWT.

This patch does not add new syntax to CQL - the "SET a = a + b"
and "SET a = a - b" syntax that already existed for counters is now
allowed for non-counters.

This is a new Scylla-only feature that does not exist in Cassandra.

Fixes #10568

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-25 10:09:11 +03:00
Nadav Har'El
3c6931c1ed cql3: let constants::setter evaluate expressions using prefetched row data
Previously, constants::setter evaluated its expression using only the query
options, which means expressions referencing row columns (column_value nodes)
would crash or return incorrect results.

Add evaluate_on_prefetched_row() to update_parameters: it evaluates an
expression in the context of the prefetched row for a given (pkey, ckey),
falling back to options-only evaluate() when no selection is available
(non-LWT context) or no column values are needed, and treating absent
columns needed by the expression as null.

Extend constants::setter to use this method:
- setter::execute() now calls evaluate_on_prefetched_row() or evaluate()
  as needed.
- setter::requires_read() returns true when the expression contains a
  column_value node, triggering a prefetch read.
- setter::requires_lwt() mirrors requires_read(), enforcing that column-
  referencing arithmetic is only allowed inside a conditional (IF) statement.

We'll use this new feature to implement "SET r = r + 1" and similar
expressions in the next patch.
2026-05-25 10:09:11 +03:00
Nadav Har'El
b026aea6f7 cql3/expr: add NEG unary operator for numeric negation
This patch adds a new expression type, unary_operator, analogous to
the existing binary_operator but takes just one operand instead of
two.

This patch also implements the first and only unary operator type,
unary_oper_t::NEG, implementing negation (unary minus) for all numeric
types.

For fixed-width integer types, overflow or underflow results in an error.
If the operand is NULL, the result is a NULL as well.

The new operator is not yet used by the CQL syntax - our parser doesn't
parse arithmetic expressions yet. We also do not plan to use it in the
following patch which uses the separate SUB (subtraction) operation,
not the new NEG. But since I already implemented a unary minus operator,
and we'll surely need it in the future for general arithmetic operations,
I thought I might as well include this patch as well.

Refs #22918 ("Support arithmetic operators")
2026-05-25 10:08:11 +03:00
Nadav Har'El
f27d1f08fc cql3/expr: add SUB binary operator for numeric subtraction
In this patch we add to our expressions oper_t::SUB, for subtraction,
analogous to the ADD from the previous patch.

The only reason why we need a separate SUB operation and can't just
combine ADD with a unary minus (NEG) operator is the minimum value of
fixed-size integer types. For example, 8-bit integers have the range
-128...127. A subtraction like -1 - (-128) is valid (its value is 127)
but the negation of (-128) would be invalid (128). One of the tests
we add in this patch validates this fact.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-25 10:06:28 +03:00
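
The 8-bit example from f27d1f08fc can be reproduced with a minimal sketch. The `checked_*` helpers are hypothetical and use std::nullopt in place of the overflow error; they are not the cql3/expr implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Returns the value as int8_t if it fits, nullopt otherwise.
inline std::optional<std::int8_t> checked_fit(int wide) {
    if (wide < std::numeric_limits<std::int8_t>::min() ||
        wide > std::numeric_limits<std::int8_t>::max()) {
        return std::nullopt;
    }
    return static_cast<std::int8_t>(wide);
}

// Each operation is computed in a wider type and then range-checked.
inline std::optional<std::int8_t> checked_neg(std::int8_t a) {
    return checked_fit(-int{a});
}
inline std::optional<std::int8_t> checked_add(std::int8_t a, std::int8_t b) {
    return checked_fit(int{a} + int{b});
}
inline std::optional<std::int8_t> checked_sub(std::int8_t a, std::int8_t b) {
    return checked_fit(int{a} - int{b});
}
```

With these helpers, `checked_sub(-1, -128)` yields 127, while emulating it as ADD plus NEG fails at the first step because `checked_neg(-128)` has no representable result.
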
Nadav Har'El
083adf84ab cql3/expr: add ADD binary operator for numeric addition
Extend oper_t with a new ADD operator, to represent addition between two
numeric expressions. Supports all numeric types - tinyint, smallint,
int, bigint, float, double, varint, and decimal.

For fixed-width integer types, overflow or underflow results in an error.
If one of the operands is NULL, the result is also NULL.

The new operator is not yet used by the CQL syntax - our parser doesn't
parse arithmetic expressions yet. We plan to start using this new operator
in a following patch which implements counter syntax ("SET r = r + 1" )
for LWT, but in the future we can use it for more general cases.

At the moment, ADD requires that both operands have the same type.
This is all we need for the first use case, and this limitation can
be relaxed later.

Interestingly, ADD is our first binary operator implementation that
does not return a boolean. Until now all our binary operators have been
comparison operators, and all returned boolean. In contrast, ADD's
return type is the type of its operands.

This implementation is susceptible to the pre-existing bug SCYLLADB-1576,
where adding 1e1000000 and 1 in "decimal" or "varint" types will
happily allocate a million-digit number and run out of memory. A
reproducing test is included, and this issue will be solved in one
place for all operations that have additions (including aggregations
and arithmetic expressions) in a followup pull-request.

Refs #22918 ("Support arithmetic operators")
2026-05-25 10:05:09 +03:00
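
The ADD semantics described in 083adf84ab (NULL propagation, error on fixed-width overflow, result type equal to the operand type) can be sketched for the `int` case. `cql_int` and `add_int` are hypothetical; std::nullopt models a CQL NULL and an exception models the error.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>
#include <stdexcept>

using cql_int = std::optional<std::int32_t>;  // nullopt models CQL NULL

// Adds two nullable 32-bit integers.
inline cql_int add_int(cql_int a, cql_int b) {
    if (!a || !b) {
        return std::nullopt;  // NULL + x is NULL
    }
    // Compute in 64 bits so the range check itself cannot overflow.
    const std::int64_t wide = std::int64_t{*a} + std::int64_t{*b};
    if (wide < std::numeric_limits<std::int32_t>::min() ||
        wide > std::numeric_limits<std::int32_t>::max()) {
        throw std::overflow_error("int addition overflow");
    }
    return static_cast<std::int32_t>(wide);
}
```
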
Nadav Har'El
a21779928e types: add is_arithmetic() method for types
Add an is_arithmetic() method for types, which can be used to check if
this is a numeric type on which arithmetic operators will be allowed -
for example in the following patch to support `SET x = x + 1`.

The arithmetic types are byte, short, int, long, varint, float, double
and decimal.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-25 09:58:51 +03:00
Avi Kivity
69a5b417d1 Merge 'pgo: enable tablets for SI and LWT' from Michael Litvak
PGO training for secondary indexes and LWT was configured with tablets
disabled because that combination wasn't supported at the time. This is
no longer the case, so we should remove the restrictions and enable the
training with the default mode.

To make this work we also need to fix the training cluster to be RF-rack-valid,
because some workloads have RF=3 but the cluster has 3 nodes in a single rack.
We change the script to create a 3-rack cluster by writing a separate rackdc file
for each node.

no backport needed - small build improvement

Closes scylladb/scylladb#30002

* github.com:scylladb/scylladb:
  pgo: enable train with tablets for SI and LWT
  pgo: make training cluster RF-rack-valid
2026-05-24 22:15:23 +03:00
Gleb Natapov
0bf050d175 storage_proxy: hold shared pointer to a table object during entire query_partition_key_range_concurrent execution
Otherwise if a table is dropped in the middle of a scan the object may
disappear.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-2137

Closes scylladb/scylladb#29988
2026-05-24 21:54:08 +03:00
Yaron Kaikov
648fa8f5b1 dist: remove bundled node_exporter, add dependency on scylla-node-exporter
The node_exporter binary has moved to its own dedicated repository
(scylladb/scylla-node-exporter). Remove the bundled copy from the core
repo to eliminate the toolchain dependency required to build/package it
here and to resolve associated CVEs inherited from the vendored binary.

This removes the download logic, build rules, packaging subpackage,
systemd/sysconfig/supervisor files, and install/uninstall references.
Instead, a hard dependency on the separate scylla-node-exporter package
is declared in both the RPM spec and Debian control file.

    [Yaron:
      - regenerate frozen toolchain with optimized clang from

        https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-aarch64.tar.gz
        https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-x86_64.tar.gz
    ]

Fixes: RELENG-502
Fixes: RELENG-503

Closes scylladb/scylladb#29716
2026-05-24 16:30:24 +03:00
Avi Kivity
8f5c67b458 Merge 'logstor: disable logstor compaction in table truncate' from Michael Litvak
in database::truncate_table_on_all_shards disable logstor compaction
before the table data is truncated, similarly to how non-logstor
compaction is disabled, to avoid race conditions between logstor
compaction and segments discarding.

Fixes [SCYLLADB-2186](https://scylladb.atlassian.net/browse/SCYLLADB-2186)

backport to 2026.2 for CI stability

Closes scylladb/scylladb#30055

* github.com:scylladb/scylladb:
  logstor: compaction state cleanup
  logstor: disable logstor compaction in table truncate
2026-05-24 16:10:17 +03:00
Michael Litvak
bde18c4e51 logstor: compaction state cleanup
add a simple cleanup for the logstor compaction state map to remove
entries of stale compaction groups. remove the state of a compaction group
from the map if it doesn't have anything in progress.
2026-05-24 10:25:37 +02:00
Michael Litvak
73470150a0 logstor: disable logstor compaction in table truncate
in database::truncate_table_on_all_shards disable logstor compaction
before the table data is truncated, similarly to how non-logstor
compaction is disabled, to avoid race conditions between logstor
compaction and segments discarding.

Fixes SCYLLADB-2186
2026-05-24 10:25:08 +02:00
Petr Gusev
954426407e storage_proxy: only cancel write handlers with pending remote targets during drain
The previous fix (cancel_all_write_response_handlers in do_drain)
was too aggressive — it killed all handlers including ones used by
group0 for raft commits. Since group0 is still running at that point
(before wait_for_group0_stop), this caused group0 operations to fail
(SCYLLADB-2168).

The actual problem is only with handlers that have pending remote
targets: after stop_transport() their MUTATION_DONE responses can
never arrive via messaging. Handlers whose only pending targets are
local can still complete via apply_locally and should be left alone.

Add cancel_nonlocal_write_response_handlers() which checks each
handler's remaining targets against the local host ID. Only handlers
with at least one remote pending target are cancelled. Use it in
do_drain instead of cancel_all_write_response_handlers. The latter
remains unchanged for drain_on_shutdown (final proxy shutdown where
all handlers must be killed).

Fixes: SCYLLADB-2168

Closes scylladb/scylladb#30020
2026-05-23 13:37:34 +02:00
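
The selection rule from 954426407e (cancel only handlers with at least one pending remote target) can be sketched as below. `host_id`, `write_handler_sketch` and both functions are hypothetical simplifications of the storage_proxy types.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

using host_id = std::string;  // stand-in for a real host identifier type

struct write_handler_sketch {
    std::vector<host_id> pending_targets;  // replicas that have not replied yet
    bool cancelled = false;
};

// True if at least one pending target is not the local node.
inline bool has_pending_remote_target(const write_handler_sketch& h, const host_id& local) {
    return std::any_of(h.pending_targets.begin(), h.pending_targets.end(),
                       [&](const host_id& t) { return t != local; });
}

// Cancels only handlers that still wait for a remote reply. Handlers whose
// remaining targets are all local are left alone and can still complete.
inline void cancel_nonlocal(std::vector<write_handler_sketch>& handlers, const host_id& local) {
    for (auto& h : handlers) {
        if (has_pending_remote_target(h, local)) {
            h.cancelled = true;
        }
    }
}
```
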
Wojciech Mitros
45f5df14e5 strong_consistency: test read forwarding to leader
Test the linearizable read forwarding behavior in a single test that
exercises all scenarios on one cluster:
- CL=QUORUM reads on leader, follower, and non-replica nodes
- CL=ONE reads (non-linearizable, no forwarding)
- Linearizability: write + CL=QUORUM read from follower (10 iterations)
- Coordinator latency histogram metrics for both read types

Refs: SCYLLADB-1157
2026-05-23 11:35:37 +02:00
Wojciech Mitros
afa2ef6816 strong_consistency: skip read_barrier() for non-linearizable reads
Non-linearizable reads (CL=ONE) no longer call read_barrier() before
querying the local replica. This is safe because state_machine::apply()
only writes to the table after raft commit, so a local read without
read_barrier cannot see uncommitted data — just potentially stale data
which is acceptable for CL=ONE semantics.
2026-05-23 11:35:37 +02:00
Wojciech Mitros
d07692a7ff strong_consistency: split coordinator-level read latency metrics
Split the latency metrics for strongly consistent reads into two
categories: linearizable and non-linearizable. They replace the
existing metrics for both types combined - this shouldn't cause
issues because the feature is still experimental and both the
initial introduction of latency metrics and the split will be
a part of the same release.

Also fix a test that was using the old metric.
2026-05-23 11:35:37 +02:00
Wojciech Mitros
297094c08f strong_consistency: forward linearizable reads to raft leader
For linearizable reads (CL=QUORUM), check leadership via begin_read()
before proceeding. If this node is not the leader, redirect the request
to the leader via need_redirect (handled by bounce_to_node() in the CQL
layer). If the leader is unknown, wait and retry. When this node is the
leader, perform read_barrier() locally. This avoids sending an RPC from
the replica to the leader to get the index to wait for apply - it's
available locally. Also, linearizable reads can use and fill the cache
of leaders that we store for strongly consistent tablet groups.

Non-linearizable reads (CL=ONE) retain the existing behavior:
create_operation_ctx() redirects if not a replica, then read_barrier()
is performed on the local replica. This will be changed in the following
commit.

Also fix a copy-paste typo in the unknown exception log message that said
"mutate()" instead of "query()".

Fixes: SCYLLADB-1157
2026-05-23 11:35:37 +02:00
Wojciech Mitros
c0ea98f922 strong_consistency: classify reads by consistency level
Introduce a read_type enum (linearizable vs non_linearizable) and transform
the existing "validate" function into a "parse" method - instead of checking
if the consistency level is one of the accepted ones, we now also return the
corresponding read type for strong consistency.
The "parse" function maps CQL consistency levels to the following read types:
- CL=(LOCAL_)QUORUM -> linearizable (this is the default CL)
- CL=(LOCAL_)ONE -> non_linearizable
- all others -> throw
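
The mapping in the list above might look roughly like this. It is a sketch under assumptions: the `consistency_level` enum is a simplified stand-in for the CQL one, and the function name `parse_read_type()` and the exception type are illustrative, not taken from the commit.

```cpp
#include <stdexcept>

// Simplified stand-in for the CQL consistency level enum.
enum class consistency_level {
    ANY, ONE, TWO, THREE, QUORUM, ALL,
    LOCAL_QUORUM, EACH_QUORUM, SERIAL, LOCAL_SERIAL, LOCAL_ONE
};

enum class read_type { linearizable, non_linearizable };

// Maps a consistency level to a read type, rejecting unsupported levels.
// The name and the exception type are hypothetical.
read_type parse_read_type(consistency_level cl) {
    switch (cl) {
    case consistency_level::QUORUM:
    case consistency_level::LOCAL_QUORUM:
        return read_type::linearizable;
    case consistency_level::ONE:
    case consistency_level::LOCAL_ONE:
        return read_type::non_linearizable;
    default:
        throw std::invalid_argument(
                "unsupported consistency level for a strongly consistent read");
    }
}
```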

The classification is performed in the CQL layer (select_statement) to
keep the coordinator free of CL concepts.
2026-05-23 11:35:37 +02:00
Wojciech Mitros
1f91524547 strong_consistency: add begin_read() to raft_server
Add begin_read() method to raft_server that checks leadership for read
operations. Unlike begin_mutate(), it does not need to compute a
timestamp or interact with leader_info. It simply checks current_leader()
and returns one of three dispositions:

  - ok: this node is the leader, proceed with read_barrier() locally
  - raft::not_a_leader: redirect to the indicated leader
  - need_wait_for_leader: leader unknown, caller must wait and retry

This will be used by the read forwarding logic in subsequent commits.
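
As an illustration of the three dispositions, here is a standalone sketch using std::variant. The types below (`read_ok`, `not_a_leader`, `need_wait_for_leader`, `server_id`) and the free-function signature are assumptions made for the example; the actual raft_server method and raft::not_a_leader are defined differently.

```cpp
#include <cstdint>
#include <optional>
#include <variant>

// Hypothetical stand-ins for the types used by the real raft_server.
using server_id = std::uint64_t;

struct read_ok {};                          // this node is the leader
struct not_a_leader { server_id leader; };  // redirect to the known leader
struct need_wait_for_leader {};             // leader unknown, retry later

using begin_read_result =
        std::variant<read_ok, not_a_leader, need_wait_for_leader>;

// Decides how a read should proceed based only on the current leader.
// No timestamp is computed and no leader_info is touched.
begin_read_result begin_read(server_id self,
                             std::optional<server_id> current_leader) {
    if (!current_leader) {
        return need_wait_for_leader{};
    }
    if (*current_leader == self) {
        return read_ok{};
    }
    return not_a_leader{*current_leader};
}
```

A caller would typically dispatch on the result with std::visit or std::holds_alternative, performing read_barrier() only in the `read_ok` case.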
2026-05-23 11:35:36 +02:00
Tomasz Grabiec
6bffc0d2e0 Merge 'utils/serialized_action: harden shutdown synchronization' from Piotr Szymaniak
`serialized_action::join()` is used as a shutdown barrier. After it returns, callers commonly destroy the owning object, and action lambdas often capture that owner by `this`.

The previous implementation waited for the internal semaphore once. This handles actions that are already running or triggers already queued before `join()`, because Seastar semaphores serve waiters FIFO. The problematic case is a late `trigger()` after `join()` has started while an older action is still running. Such a trigger can queue behind `join()`, allowing `join()` to return before that late trigger runs.

Review also found a separate semaphore bookkeeping bug in `trigger()`. The code manually waited on the semaphore and later signaled it through the caller-visible pending future. If the wait itself completed exceptionally, the signal path could still run and give back a semaphore unit that had never been acquired.

Make `join()` a terminal operation for `serialized_action`. Once `join()` starts, new `trigger()` calls fail with `broken_semaphore`. `join()` still waits for work that was accepted before it started, and only then breaks the semaphore so later waiters are rejected.

I audited the existing `serialized_action` users. Some callers explicitly remove trigger sources before `join()`, such as audit and topology_coordinator. Others rely on observer destruction or broader shutdown ordering, such as database, compaction_manager, io_throughput_updater, and schema_push. The least locally fenced case is `migration_manager::_group0_barrier`, which is reachable through several external paths, including task status lookup and other services. That makes this better enforced in `serialized_action` itself rather than relying on each caller to prove all trigger entrances are closed.

This is generic hardening of the shutdown contract, not a fix for a confirmed topology_coordinator-specific reproducer.

Also restore acquire/release ownership in `trigger()` by using `with_semaphore()`. This keeps semaphore release tied to successful acquisition while preserving the existing behavior where action completion and action errors are reported through the shared pending future.

Refs SCYLLADB-1904

No backport: this is generic shutdown hardening without a confirmed user-visible reproducer. The semaphore bookkeeping fix closes a latent exceptional wait path noticed during review, not a known production failure.

Closes scylladb/scylladb#29991

* github.com:scylladb/scylladb:
  utils/serialized_action: pair semaphore release with acquisition
  utils/serialized_action: harden join() against late triggers
2026-05-23 00:45:24 +02:00
Łukasz Paszkowski
cf0ad2bde9 tablet_allocator: use chunked_vector in cluster_resize_load to avoid oversized allocations
In make_resize_plan(), the tables_need_resize vector in cluster_resize_load
accumulates all tables that require a resize decision before the downstream
heap-based logic selects the top-N most urgent ones to emit.

In clusters with thousands of tables and aggressive tablets-per-shard scaling
(e.g., 5000 empty tables with scaling factors of 0.04-0.12), nearly all tables
satisfy the merge condition (scaled target < current tablet count), causing
the vector to grow to thousands of entries. With ~100 bytes per element,
std::vector's doubling strategy triggers contiguous allocations exceeding
256KB, producing seastar oversized allocation warnings.

Replace std::vector with utils::chunked_vector in cluster_resize_load for
both tables_need_resize and tables_being_resized. chunked_vector caps
individual allocations at 128KB, splitting into multiple chunks when needed.
For normal workloads (fewer than ~1300 resize candidates), behavior is
identical to std::vector: a single contiguous chunk with the same performance.
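
The "~1300" figure follows from the numbers quoted above. A small worked check, assuming exactly 100 bytes per element (the message only says "~100") and a 128KB per-chunk cap; the helper names are made up for this example:

```cpp
#include <cstddef>

// Figures taken from the commit message; the element size is approximate.
constexpr std::size_t max_chunk_bytes = 128 * 1024;
constexpr std::size_t assumed_element_bytes = 100;

// How many elements fit into one chunk of the given size.
constexpr std::size_t elements_per_chunk(std::size_t chunk_bytes,
                                         std::size_t element_bytes) {
    return chunk_bytes / element_bytes;
}

// How many chunks are needed to hold n elements (rounding up).
constexpr std::size_t chunks_needed(std::size_t n, std::size_t per_chunk) {
    return (n + per_chunk - 1) / per_chunk;
}

// 131072 / 100 = 1310 elements fit into a single chunk.
static_assert(elements_per_chunk(max_chunk_bytes, assumed_element_bytes) == 1310);
// 5000 candidates, as in the example cluster, would span 4 chunks.
static_assert(chunks_needed(5000, 1310) == 4);
```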

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1955

Closes scylladb/scylladb#29946
2026-05-22 16:52:12 +03:00
Yaniv Michael Kaul
acd3115645 sstables: include SSTable filename in Stats metadata error messages
When Stats metadata is not available or malformed, include the SSTable
filename in the error message to help operators identify which SSTable
files need attention during startup failures.

Fixes: https://github.com/scylladb/scylla-enterprise/issues/5439
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
AI-assisted: yes
Backport: no, benign improvement

Closes scylladb/scylladb#29950
2026-05-22 16:49:37 +03:00
Łukasz Paszkowski
96a992002c tasks: fix busy-spin and shutdown hang in tablet_virtual_task::wait() for repair tasks
The condition variable predicate for repair tasks unconditionally
returned true (introduced in e5928497ce), which meant event.wait(pred)
never actually suspended: do_until checks the predicate first, and if
it's already satisfied, returns immediately without calling the inner
wait(). This caused two problems:
1. The while(true) loop busy-spun, polling without blocking between
   topology changes.
2. During shutdown, event.broken() had no effect because no waiter was
   registered on the CV. The loop kept spinning, holding the HTTP
   server's task gate open and preventing http_server::stop() from
   completing. After ~15 minutes, systemd killed the process with
   SIGABRT.

The fix replaces the synchronous predicate with an async task_finished()
helper that dispatches on the task type. Since the repair check is async
(for_each_tablet scans every tablet), we cannot use event.wait(Pred).
Instead, we register a waiter via event.wait() *before* running the async
check, ensuring no broadcast is missed during the check. event.broken()
during shutdown propagates broken_condition_variable to the registered
waiter and unblocks the loop promptly.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1532

Closes scylladb/scylladb#29485
2026-05-22 16:47:48 +03:00
Piotr Szymaniak
b9a7a6c25d utils/serialized_action: pair semaphore release with acquisition
The previous manual wait/signal split could signal the semaphore even if wait() completed exceptionally, giving back units that were never acquired. Use with_semaphore() so failed acquisition does not release anything.
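
The idea of tying release to successful acquisition can be modelled without Seastar. The sketch below is a synchronous toy, not Seastar's with_semaphore(): `simple_semaphore`, `units_guard` and `with_units()` are hypothetical names, and a "broken" semaphore is modelled by wait() throwing.

```cpp
#include <stdexcept>
#include <utility>

// Toy semaphore: wait() fails if the semaphore was broken.
struct simple_semaphore {
    int units = 1;
    bool broken = false;

    void wait() {
        if (broken) {
            throw std::runtime_error("broken semaphore");
        }
        --units;
    }
    void signal() { ++units; }
};

// RAII guard: the destructor (release) runs only if the constructor
// (acquisition) completed, so a failed wait() never gives back a unit.
class units_guard {
    simple_semaphore& _sem;
public:
    explicit units_guard(simple_semaphore& sem) : _sem(sem) { _sem.wait(); }
    ~units_guard() { _sem.signal(); }
    units_guard(const units_guard&) = delete;
    units_guard& operator=(const units_guard&) = delete;
};

// Runs f while holding one unit; releases it afterwards, even if f throws.
template <typename Func>
void with_units(simple_semaphore& sem, Func&& f) {
    units_guard guard(sem);
    std::forward<Func>(f)();
}
```

With a manual wait/signal split, an unconditional signal() after a failed wait() would leave the toy semaphore with two units instead of one; the guard avoids that by construction.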

Bug found by tgrabiec.
2026-05-22 14:19:36 +02:00
Pavel Emelyanov
8cb32a9958 replica: Fix use-after-free in get_sstables_from_object_store
The lambdas inside get_sstables_from_object_store captured get_abort_src
by reference, but get_abort_src is a by-value function parameter living
on the stack frame of get_sstables_from_object_store. Since the outer
lambda is moved into seastar::async via get_sstables_from and executed
after get_sstables_from_object_store returns, the reference becomes
dangling.

Fix by capturing get_abort_src by value (copying the std::function)
in both lambdas.
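
The capture pattern can be shown in isolation. This is a simplified illustration, not the distributed_loader code: `get_source_fn` and `make_deferred_task()` are hypothetical, and a returned std::function stands in for the lambda that seastar::async runs later.

```cpp
#include <functional>
#include <string>

// Hypothetical stand-in for the by-value std::function parameter.
using get_source_fn = std::function<std::string()>;

// Returns a task that is executed after this function has returned.
std::function<std::string()> make_deferred_task(get_source_fn get_src) {
    // Capturing with [&get_src] would leave a dangling reference: the
    // parameter is destroyed when make_deferred_task() returns.
    // Capturing by value copies the std::function into the closure,
    // so the task owns everything it needs.
    return [get_src] { return get_src(); };
}
```

Calling the returned task after make_deferred_task() has returned is safe here because the closure holds its own copy of `get_src`.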

Found by AddressSanitizer: stack-use-after-return at
distributed_loader.cc:243.

Fixes SCYLLADB-2172

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29954
2026-05-22 15:05:21 +03:00