Commit Graph

11954 Commits

Author SHA1 Message Date
Gleb Natapov
492a75ffbb raft: separate group0 server start from in-memory state machine enablement
Split start_server_for_group0 so it only starts the raft server and
replays the log (applying mutations to system tables), without loading
the state into memory. A new enable_group0_state_machine() method is
added which callers must invoke explicitly after all dependencies
(CDC generation service, non-system schemas, etc.) are available.

This prepares for moving setup_group0_if_exist earlier in the startup
sequence so the raft log can be replayed before non-system keyspaces
are loaded, while deferring the in-memory state loading until after
all dependencies are initialized.
2026-06-02 12:42:49 +03:00
Botond Dénes
3613d9a07d test/cqlpy/nodetool: fix NameError in compact_keyspace nodetool path
compact_keyspace() operates on a whole keyspace and has no 'cf' variable
in scope, but the nodetool fallback branch mistakenly passed it to
args.extend([ks, cf]), which would raise NameError whenever that path
was taken. Fix by passing only the keyspace.
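
A minimal sketch of this bug class and its fix. The function names and the argument layout are hypothetical and only mirror the description above, not the actual test/cqlpy/nodetool.py code.

```python
def compact_keyspace_args_buggy(ks):
    """Build nodetool arguments the way the broken fallback branch did."""
    args = ["compact"]
    # 'cf' is not defined anywhere in this scope, so this line raises
    # NameError as soon as the fallback path is executed.
    args.extend([ks, cf])  # noqa: F821
    return args


def compact_keyspace_args_fixed(ks):
    """Whole-keyspace compaction only needs the keyspace name."""
    args = ["compact"]
    args.extend([ks])
    return args
```

Because Python resolves names at call time, the error stays hidden until the rarely used branch actually runs.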

Closes scylladb/scylladb#30097
2026-06-02 08:44:06 +03:00
Łukasz Paszkowski
a84d9cb8c4 test_backup.py: fix race in test_restore_tablets_vs_migration
The test was racing move_tablet against restore_tablets without
ensuring that move_tablet had actually reached the streaming phase
before restore began. This caused restore to win the group0 race,
putting the tablet into transition first, which made move_tablet
fail with "Tablet is in transition".

Fix by adding a log message to the block_tablet_streaming error
injection and waiting for it in the test, ensuring the move has
entered the streaming phase (and is blocked) before restore starts.
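
The ordering this fix enforces can be sketched with plain asyncio. Everything below is a hypothetical model: the two events stand in for the new log message and for the block_tablet_streaming injection, and none of the names are the real test helpers.

```python
import asyncio


async def move_tablet(events, reached_streaming, unblock):
    events.append("move: tablet in transition")
    events.append("move: streaming blocked")
    reached_streaming.set()   # stands in for the new log message
    await unblock.wait()      # stands in for the error injection
    events.append("move: finished")


async def restore_tablets(events):
    events.append("restore: started")


async def scenario():
    events = []
    reached_streaming = asyncio.Event()
    unblock = asyncio.Event()
    move = asyncio.create_task(move_tablet(events, reached_streaming, unblock))
    # Wait until the move is known to be blocked in streaming; only then
    # start the restore, so the restore can no longer win the race.
    await reached_streaming.wait()
    await restore_tablets(events)
    unblock.set()
    await move
    return events
```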

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-2147

Closes scylladb/scylladb#30173
2026-06-02 08:17:54 +03:00
Nadav Har'El
75a05fc2b3 Merge 'cql3: fix stack overflow and quadratic behavior' from Avi Kivity
This series fixes two vulnerabilities:

- unbounded recursion during expression evaluation with deeply nested expressions
- quadratic computation with large WHERE clauses

The fixes simply bound the depth of recursion and the length of the WHERE clause.

The WHERE clause limits are configurable. The nesting limit is less likely to be exceeded, so it is not configurable.

Limits inspired by Common Expression Language:

https://github.com/google/cel-spec/blob/master/doc/langdef.md#syntax

Implementations are required to support at least:

- 24-32 repetitions of repeating rules
- 12 repetitions of recursive rules
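
Bounding recursion depth, as this series does for nested function calls and CASTs, can be sketched as follows. This is an illustration only: the tuple-based expression format, the `MAX_DEPTH` value, and the `TooDeep` exception are assumptions, not ScyllaDB's parser or its actual limit.

```python
MAX_DEPTH = 12  # illustrative value, borrowed from the CEL minimum above


class TooDeep(Exception):
    """Raised when an expression nests deeper than MAX_DEPTH."""


def evaluate(expr, depth=0):
    """Evaluate ints and ("add", ...) / ("neg", x) tuples with a depth bound."""
    if depth > MAX_DEPTH:
        # Fail with a controlled error instead of exhausting the stack.
        raise TooDeep(f"expression nested deeper than {MAX_DEPTH}")
    if isinstance(expr, int):
        return expr
    op, *args = expr
    values = [evaluate(arg, depth + 1) for arg in args]
    if op == "add":
        return sum(values)
    if op == "neg":
        return -values[0]
    raise ValueError(f"unknown operator: {op}")
```

A hostile input nested a thousand levels deep is rejected after a dozen frames rather than consuming stack proportional to its depth.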

CVE-2026-31948
CVE-2026-31947

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1003
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1002
Fixes https://github.com/scylladb/scylladb/issues/14472

Closes scylladb/scylladb-ghsa-m4h7-g37h-mgxf#3

* github.com:scylladb/scylladb-ghsa-m4h7-g37h-mgxf:
  cql3: limit number of relations in WHERE clause
  cql3: add max_relations_in_where_clause to dialect
  test/cqlpy: add tests for WHERE clause relation count limit
  cql3: limit nesting depth of function calls and CASTs in CQL parser
  test/cqlpy: add tests for deeply nested function calls and CASTs
2026-06-01 22:31:56 +03:00
Marcin Maliszkiewicz
1dc975c491 Merge 'table_helper: observe detached setup_table() future' from Andrzej Jackowski
During shutdown, group0 may be torn down while
cache_table_info() has a detached setup_table() future
in flight. This causes raft_group_not_found to propagate
as an abandoned failed future.

Add .handle_exception() to log the failure at debug level
instead of leaving the future unobserved.
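
The same idea, expressed as a Python asyncio analogy rather than Seastar code: a detached task gets a callback that retrieves and logs its exception, so the failure is observed instead of abandoned. All names here are hypothetical.

```python
import asyncio
import logging

log = logging.getLogger("table_helper_sketch")
observed_failures = []  # kept only so the sketch is easy to inspect


class RaftGroupNotFound(Exception):
    pass


async def setup_table():
    # Models group0 being torn down while the setup is in flight.
    raise RaftGroupNotFound("group0 is gone")


def observe(task):
    if task.cancelled():
        return
    exc = task.exception()  # retrieving the exception marks it as observed
    if exc is not None:
        log.debug("detached setup_table() failed: %r", exc)
        observed_failures.append(exc)


def start_detached_setup():
    """Must be called from within a running event loop."""
    task = asyncio.create_task(setup_table())
    task.add_done_callback(observe)
    return task
```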

Fixes: SCYLLADB-2224

Backport to 2026.2 and 2026.1, because the test failed on 2026.1

Closes scylladb/scylladb#30093

* github.com:scylladb/scylladb:
  test: table_helper: verify detached setup failure is consumed
  table_helper: observe detached setup_table() future
2026-06-01 19:32:34 +02:00
Aleksandra Martyniuk
33af16d808 test/cluster/test_tablets: increase timeout for test_multi_rf_of_many_keyspaces_0_N
Multi-RF change handles multiple keyspaces concurrently, but tablet
rebuilds are not all started at once — the load balancer considers
machine load when scheduling them. With 3 keyspaces each having a base
table and materialized view, the total operation time approaches the
default 200s CQL timeout on slow/busy CI machines (observed at ~191s).

Double the timeout to 400s to provide sufficient margin.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-2042.

Closes scylladb/scylladb#30018
2026-06-01 20:07:03 +03:00
Michael Litvak
a7a7f02392 test: test_cdc_with_tablets: add read barrier
Add a group0 read barrier in test_cdc_with_tablets whenever we observe a
condition such as tablet count change or cdc stream change, and we want
to proceed to check that cdc tables are consistent with the change. For
example, when we wait for tablet count change and then check the cdc
streams changed as well.

The problem is that when we observe the tablet count change, for
example, even though the cdc streams are changed in the same group0
operation, we may observe it during the group0 apply, when the operation
is only partially applied. The read barrier ensures that the change we
observed is fully applied.

Fixes SCYLLADB-2352

Closes scylladb/scylladb#30177
2026-06-01 13:56:01 +02:00
Avi Kivity
520b130b97 cql3: limit number of relations in WHERE clause
A WHERE clause with many relations (e.g. hundreds of AND-ed conditions)
can cause quadratic complexity. Check the relation count during parsing
and reject queries exceeding the configurable max_relations_in_where_clause
limit (default 100) with a SyntaxException.
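
A minimal sketch of such a count check, assuming the relations have already been split into a list. The function name is hypothetical and the exception is a local stand-in; only the default of 100 comes from the text above.

```python
class SyntaxException(Exception):
    """Local stand-in for the error the CQL parser reports."""


def check_relation_count(relations, max_relations_in_where_clause=100):
    """Reject a WHERE clause whose relation count exceeds the limit."""
    count = len(relations)
    if count > max_relations_in_where_clause:
        raise SyntaxException(
            f"WHERE clause has {count} relations, "
            f"limit is {max_relations_in_where_clause}"
        )
    return count
```

Checking the count up front keeps the later, super-linear analysis from ever seeing an oversized clause.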

The changes to IDL don't cause problems during upgrade, because
CQL forwarding is not in any released version, and because
it is part of an experimental feature.

CVE-2026-31947

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1002
2026-06-01 14:01:27 +03:00
Avi Kivity
fdcc44c425 cql3: add max_relations_in_where_clause to dialect
Add a configurable max_relations_in_where_clause parameter (default 100)
to the CQL dialect, plumbed through db::config, transport server, and
test environment. This will be used by the CQL parser to reject WHERE
clauses with too many relations that cause quadratic complexity.
2026-06-01 14:01:27 +03:00
Avi Kivity
1ad1c8ef7f test/cqlpy: add tests for WHERE clause relation count limit
Add tests that verify the CQL parser rejects WHERE clauses with too
many relations (e.g. WHERE a=1 AND b=1 AND ... repeated 200 times),
and that a reasonable number of relations (50) is still accepted.
2026-06-01 14:01:25 +03:00
Botond Dénes
bb81dbf65e Merge 'guardrails: Add replica-side large data guardrails' from Taras Veretilnyk
Adds write-path guardrails that reject or warn on mutations targeting partitions, rows, or collections that already exceed configured size thresholds, based on SSTable `large_data_record` metadata.
ScyllaDB already detects and records large partitions/rows/cells in `system.large_data_records` after compaction, but takes no preventive action on the write path. Once a partition grows past operational limits it causes latency spikes, OOM, and repair failures. These guardrails let operators set hard and soft thresholds so that writes to already-oversized data are rejected (hard) or logged as warnings (soft) before they make the problem worse.
- **Intrusive index over SSTable metadata**: A per-table `large_data_record_index` maintains three `boost::intrusive::multiset`s (partitions, rows, cells) using `auto_unlink` hooks directly on `large_data_record`. SSTable destruction automatically removes records from the index — no explicit deregistration needed.
- **Virtual dispatch for zero-cost disabled path**: `large_data_guardrail_base` → `noop_large_data_guardrail` / `large_data_guardrail`. Tables without guardrails enabled pay only a virtual call to a no-op. No index is built or maintained for disabled tables.
- **Schema storage**: The per-table flag is stored as a scylla_tables column, following the tablets pattern: only write a live cell when enabled, omit entirely when disabled. The CQL feature gate prevents enabling until all nodes are upgraded.
- **Write-path integration**: The guardrail check runs in `do_apply` after the frozen mutation is deserialized but before it is applied to the memtable. Hint replay and Paxos learn skip the check via `skip_large_data_guardrails`.
Uses existing `large_*_warn_threshold` config options as soft limits and new `large_*_fail_threshold` options as hard limits. Checked dimensions:
- Partition size (bytes)
- Partition row count
- Row size (bytes)
- Collection element count

Backport is not required

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-180

Closes scylladb/scylladb#29733

* github.com:scylladb/scylladb:
  test/cqlpy: add per-table toggle, LWT exemption, and multi-category tests
  test/cqlpy: add large collection guardrail tests
  test/cqlpy: add large row guardrail tests
  test/cqlpy: add large partition guardrail tests
  test/boost: add large_data_guardrail unit tests
  test/cluster: add large data guardrails rolling upgrade test
  replica: wire large_data_guardrail into the write path
  schema: add per-table large_data_guardrails_enabled flag
  db: implement large_data_guardrail
  db: implement large_data_record_index
  sstables: add intrusive index hook to large_data_record
  db: add large_collection_elements_fail_threshold config option
  db: add large_row_fail_threshold_mb config option
  db: add rows_count_fail_threshold config option
  db: add large_partition_fail_threshold_mb config option
  replica: introduce large_data_exception
2026-06-01 13:26:00 +03:00
Nadav Har'El
b254a9826a test/cluster: add pylib-style nodetool.py
Tests in test/cqlpy use a tiny nodetool-like library, where calls to
nodetool.flush() are translated to the parallel REST API request on
Scylla - but use an external "nodetool" command when running the test
against Cassandra.

Some tests/cluster also began using test/cqlpy/nodetool.py, but it is
NOT a good fit for test/cluster tests, because:

1. It falls back to using the external "nodetool" when it thinks the
   REST API is not available. In cluster tests, no such fallback is
   needed (these tests can't be run on Cassandra). If the REST API is
   down, the test should fail - not fall back to an irrelevant method.

2. The nodetool.flush() et al. functions are not async, and cluster
   tests are supposed (by design...) to only use async APIs.

3. test/cqlpy/nodetool.py was not written in the "style" defined for
   the test/cluster codebase - specifically, its functions lack
   docstrings and type hints.

This patch introduces test/pylib/nodetool.py, based on
test/cqlpy/nodetool.py but fixing all the above problems - there are
no Cassandra fallbacks, there are docstrings and type hints, and
all the functions are async.

We also fix the test/cluster tests that used test/cqlpy/nodetool.py to
switch to test/pylib/nodetool.py. Of course it means the newly async
functions need to be "await"ed, not just called, so this patch changes
that too.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#30129
2026-06-01 13:03:29 +03:00
Piotr Szymaniak
21f1380df1 test/pylib: fix starting server cleanup race
test_localnodes_joining_nodes stops a server while manager.server_add() is still waiting for that server to finish startup. Stopping the process can make the background add_server() fail and run its cleanup path first, removing the server from ScyllaCluster.starting. When the stop request later resumes, its own self.starting.pop(server_id) raises KeyError, which the manager returns as HTTP 500.

The opposite ordering is possible as well: server_stop() can remove the entry before add_server() reaches its finally block.

Make cleanup of ScyllaCluster.starting idempotent in both paths. add_server() remains the normal cleanup path, while server_stop() provides fallback cleanup when it wins the race.
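
A minimal sketch of idempotent cleanup, using a hypothetical class in place of ScyllaCluster. The key point is `dict.pop(key, None)`, which does not raise when the other path has already removed the entry.

```python
class StartingServers:
    """Hypothetical stand-in for the bookkeeping in ScyllaCluster.starting."""

    def __init__(self):
        self.starting = {}

    def begin_add(self, server_id, server):
        self.starting[server_id] = server

    def finish_add(self, server_id):
        # Normal cleanup path; a plain pop(server_id) would raise KeyError
        # if stop() had already removed the entry.
        return self.starting.pop(server_id, None)

    def stop(self, server_id):
        # Fallback cleanup when the stop request wins the race.
        return self.starting.pop(server_id, None)
```

Whichever path runs second simply gets `None` back, so neither ordering turns into an error.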

Fixes SCYLLADB-2314

Closes scylladb/scylladb#30128
2026-05-31 23:22:44 +03:00
Nadav Har'El
33dce2b7fc Merge 'cql3: statement_restrictions: continue exploitation of predicate work' from Avi Kivity
In 6165124fcc, we changed analysis of expressions in the WHERE clause
to use predicates, an annotated form of an expression that constrains a column
when the expression is set to true.

Here, we exploit this work to simplify the analysis further, reusing already computed
attributes rather than re-analyzing the expression.

Not backporting, this is a refactor with no functional change and no bugs fixed.

Closes scylladb/scylladb#30049

* github.com:scylladb/scylladb:
  cql3: statement_restrictions: simplify find_idx to return only the index
  cql3: statement_restrictions: replace has_only_eq_binops with tracked booleans
  cql3: statement_restrictions: use index-selection predicates for value_for_index_partition_key
  cql3: statement_restrictions: replace find_clustering_order with predicate order field
  cql3: statement_restrictions: replace has_partition_token with variant check
  cql3: statement_restrictions: replace has_slice with predicate is_slice check
  cql3: statement_restrictions: replace contains_multi_column_restriction filter with _has_multi_column
  cql3: statement_restrictions: remove unused find_needs_filtering and has_slice_or_needs_filtering
  cql3: statement_restrictions: replace has_slice_or_needs_filtering with tracked bool
  cql3: statement_restrictions: replace contains_multi_column_restriction with _has_multi_column
  cql3: statement_restrictions: replace find_needs_filtering with predicate op check
  cql3: statement_restrictions: replace find_binop is_on_collection with tracked bool
  cql3: statement_restrictions: replace find_binop column extraction with predicate on field
  cql3: statement_restrictions: set op on all binary-operator-derived predicates
2026-05-31 23:22:43 +03:00
Avi Kivity
503add224d cql3: statement_restrictions: simplify find_idx to return only the index
The expression returned as the second element of find_idx()'s pair was
stored in view_indexed_table_select_statement::_used_index_restrictions
but never read — dead code. Simplify find_idx() to return just the
optional<index>, and remove the dead member and constructor parameter
from view_indexed_table_select_statement.

The now unused _idx_restrictions is also removed.
2026-05-29 17:18:21 +03:00
Taras Veretilnyk
9abf594397 test/cqlpy: add per-table toggle, LWT exemption, and multi-category tests
Per-table toggle: disabled-at-create, alter-disable, alter-reenable.
LWT exemption: Paxos learn must bypass the guardrail.
Multi-category independence: all three guardrails warn/reject
independently when SSTable records span partition, row, and collection
categories.
2026-05-29 12:51:43 +02:00
Dimitrios Symonidis
4c0a991017 test/cluster: fix proxy resource leak in internode compression test
The test_internode_compression_between_datacenters test was flaky due to
proxy servers and leased host IPs not being cleaned up on failure paths.
If any exception occurred after proxies were started (e.g. during
server_start or driver_connect), the asyncio.Server listeners remained
bound and leased hosts were never released back to HostRegistry. On
subsequent test runs, this caused EADDRINUSE (errno 98) when trying to
bind the same address:port.

Wrap the proxy/server lifecycle in try/finally to ensure proxies are
always stopped and hosts are always released, regardless of whether
the test succeeds or fails.
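
The try/finally pattern can be sketched with stand-in objects. `FakeHostRegistry` and `FakeProxy` are hypothetical and only model the lease/release and start/stop behavior described above.

```python
import asyncio


class FakeHostRegistry:
    def __init__(self):
        self.leased = set()

    def lease(self, host):
        self.leased.add(host)
        return host

    def release(self, host):
        self.leased.discard(host)


class FakeProxy:
    def __init__(self):
        self.running = False

    async def start(self):
        self.running = True

    async def stop(self):
        self.running = False


async def run_with_proxy(registry, proxy, host, body):
    leased = registry.lease(host)
    try:
        await proxy.start()
        await body()  # e.g. server start or driver connect; may raise
    finally:
        # Runs on success and on failure alike, so the listener is closed
        # and the host is returned before the next test needs them.
        await proxy.stop()
        registry.release(leased)
```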

Fixes: SCYLLADB-2183

Closes scylladb/scylladb#30127
2026-05-29 13:51:43 +03:00
Taras Veretilnyk
7d365844a3 test/cqlpy: add large collection guardrail tests
Tests for collection element-count guardrail: hard-limit rejection,
disabled-when-zero, soft-limit log warning, and no-warning below
threshold.
2026-05-29 12:51:43 +02:00
Taras Veretilnyk
19a9e45da8 test/cqlpy: add large row guardrail tests
Tests for row-size guardrail: hard-limit rejection, disabled-when-zero,
soft-limit log warning, and no-warning below threshold.
2026-05-29 12:51:42 +02:00
Taras Veretilnyk
67b659e2bf test/cqlpy: add large partition guardrail tests
Tests for partition size and row-count guardrails: hard-limit rejection,
disabled-when-zero, soft-limit log warnings, and no-warning below
threshold.  Includes shared helpers and log assertion utilities used by
subsequent commits.
2026-05-29 12:51:42 +02:00
Taras Veretilnyk
ff84b1dbc4 test/boost: add large_data_guardrail unit tests
8 tests covering the record_compare template comparator,
intrusive multiset equal_range grouping with heterogeneous
lookup_key, and auto_unlink on record destruction.
2026-05-29 12:51:42 +02:00
Taras Veretilnyk
0201c1530e test/cluster: add large data guardrails rolling upgrade test
Simulated rolling upgrade: start a 2-node cluster where one node
suppresses the LARGE_DATA_GUARDRAILS feature, verify that enabling
guardrails is rejected, then upgrade the old node and verify that
enabling guardrails succeeds.
2026-05-29 12:51:31 +02:00
Pavel Emelyanov
5d0371620d test/backup: Reduce s3 logging from trace to debug
Change s3 log level from TRACE to DEBUG in backup tests.

TRACE level generates excessive log volume with too much low-level
detail about S3 operations. While it was useful in the early days
of the S3 client, nowadays DEBUG level likely provides sufficient
diagnostic information for backup test troubleshooting.

The reduced log volume significantly improves test performance, which
is the main outcome of this change:
- Less I/O time writing logs during test execution
- Faster teardown: each test scans all server logs for errors, and
  smaller logs mean faster grep operations (23.3s → 9.97s for 8-node
  cluster teardown)

Impact on test_restore_with_streaming_scopes[topology4] (8 nodes):
- Log volume: 49 MB → 23 MB (reduced by half)
- Test runtime: 82.55s → 57.53s (30% faster)
- Teardown time: 23.3s → 9.97s (57% faster)

Tests that start smaller clusters also show notable timing improvements.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#30109
2026-05-29 13:46:10 +03:00
Pavel Emelyanov
24c0ea6b19 sstables_loader: Prevent table destruction during tablet restore download
Similar to e5e6608f20 ("sstables_loader: prevent use-after-free on
table drop during streaming") which fixed the same class of race for
load_and_stream, the tablet restore path also holds a replica::table&
reference across the download_sstable() coroutine without preventing
concurrent table destruction.

If DROP KEYSPACE is applied while download_sstable() is writing SSTable
components to the table's data directory, the directory is removed
mid-write causing ENOENT → abort (with --abort-on-internal-error).

Fix by acquiring a stream_in_progress() phaser guard after
find_column_family() and before download_sstable(). table::stop()
calls _pending_streams_phaser.close() which blocks until all
outstanding guards are released, keeping the table alive for the
duration of the download.

Fixes: SCYLLADB-2187

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#30094
2026-05-29 13:43:37 +03:00
Pavel Emelyanov
8b2ff16cae schema: Move grace_period from schema_ctxt to schema_registry
The schema_registry_grace_period field on schema_ctxt was only used by
schema_registry itself for eviction timing. Move it to be a direct member
of schema_registry, passed at init() time. This removes one db::config
dependency from schema_ctxt.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#30038
2026-05-29 13:42:23 +03:00
Botond Dénes
1384c9523e Merge 'Simplify handler injection call sites to use appropriate existing API' from Pavel Emelyanov
Several error injection call sites use the verbose handler-lambda API when simpler alternatives already exist in the framework. This series converts them to use the appropriate overloads, reducing boilerplate and making the injection intent immediately obvious from the call site.

Cleaning up in-code debugging facilities, no need to backport

Closes scylladb/scylladb#29962

* github.com:scylladb/scylladb:
  error_injection: Convert handler-style breakpoints to wait_for_message sugar
  error_injection: Convert no-op handler injections to enter()/is_enabled()
  error_injection: Convert handler-throw injections to lambda-throw style
  utils: Add share_messages parameter to breakpoint injection API
2026-05-29 13:41:09 +03:00
Botond Dénes
3ae88e31bd Merge 'test/pylib: stop using random ports for MinIO and JMX' from Piotr Smaron
Replace random port selection in MinIO and JMX test helpers with fixed
ports on unique per-test loopback IPs, eliminating TOCTOU races.

Commits:
- kmip_wrapper: default hostname to 127.0.0.1
- nodetool: bind JMX to the per-module loopback IP with fixed port 7199
- minio: use fixed service and console ports on a unique HostRegistry IP
  instead of probing the ephemeral range; raise on start failure

Fixes: SCYLLADB-1817

Minor improvement, no need to backport.

Closes scylladb/scylladb#29741

* github.com:scylladb/scylladb:
  test/pylib: use fixed MinIO ports on unique loopback IPs
  test/nodetool: bind JMX to per-module loopback IP
  test/pylib: default KMIP wrapper to loopback
2026-05-29 13:40:24 +03:00
Taras Veretilnyk
5a0974e781 schema: add per-table large_data_guardrails_enabled flag
Add a per-table large_data_guardrails_enabled flag controlled via the CQL
table property WITH large_data_guardrails_enabled = true|false.

Store the flag as a boolean column in system_schema_ext.scylla_tables.
Only write a live cell when enabled; when disabled (the default), omit
the cell entirely so that old nodes that don't know this column can
still read the SSTable during rolling upgrade or rollback.  When the
property transitions from true to false via ALTER TABLE, a tombstone is
written in make_update_table_mutations to override the previous live
cell — this is safe because the CQL feature gate ensures all nodes are
upgraded before the property can be set to true.

Gate the CQL property behind the LARGE_DATA_GUARDRAILS cluster feature:
attempting to set large_data_guardrails_enabled = true before all nodes
advertise the feature raises a ConfigurationException.
2026-05-29 12:18:33 +02:00
Botond Dénes
46631692cd mutation_fragment_stream_validator: use legacy byte order for same-token partition key comparison
When two partition keys share the same token, their relative order is
determined by their raw serialized bytes (legacy_tri_compare), which
matches the physical on-disk order in SSTables.  The validator was
using partition_key::tri_compare instead — a type-aware comparator
that can disagree with byte order for types like timeuuid.

The result was a false-positive "out-of-order partition key" error
for any two same-token partitions whose timeuuid (or other type-aware)
order is the reverse of their byte order.  In scrub mode this caused
the second partition to be silently dropped.
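
The disagreement between the two orders is easy to reproduce with version 1 UUIDs, whose leading bytes hold the low bits of the timestamp. This sketch uses Python's uuid module with hand-picked field values; it illustrates the mismatch only and is not ScyllaDB's comparator.

```python
import uuid


def tri_compare(x, y):
    """Three-way comparison returning -1, 0 or 1."""
    return (x > y) - (x < y)


# Fields: time_low, time_mid, time_hi_version, clock_seq_hi_variant,
# clock_seq_low, node. 0x1000 sets version 1, 0x80 the RFC 4122 variant.
a = uuid.UUID(fields=(0x00000002, 0x0000, 0x1000, 0x80, 0x00, 1))
b = uuid.UUID(fields=(0x00000001, 0x0001, 0x1000, 0x80, 0x00, 1))

# Timestamp order: b has the larger time_mid, so a sorts first.
type_aware = tri_compare(a.time, b.time)
# Raw byte order: a starts with 00 00 00 02, b with 00 00 00 01,
# so b sorts first.
byte_order = tri_compare(a.bytes, b.bytes)
```

A validator that uses the timestamp order on data laid out in byte order would flag this correctly ordered pair as out of order.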

Fixes: SCYLLADB-2304

Closes scylladb/scylladb#30120
2026-05-29 11:54:20 +02:00
Tomasz Grabiec
5ceabcbcc5 Merge 'tablets: fix update_tablet_metadata failures during bootstrap' from Aleksandra Martyniuk
When partition_split_builder splits a tablet metadata partition into
multiple mutations, the first mutation gets the partition tombstone
and/or static row while subsequent mutations contain only clustered
rows. The hint logic would correctly clear tokens (marking a full
partition read) upon seeing the tombstone in the first mutation, but
then re-add tokens when processing the subsequent row-only mutations.
This caused update_tablet_metadata to attempt a point update via
mutate_tablet_map_async on a tablet map that doesn't exist yet during
bootstrap, throwing no_such_tablet_map and failing the snapshot transfer.

Fix by adding a full_read flag to table_hint. Once a full partition read
is decided (due to partition tombstone, range tombstone, static row, or
row deletion), the flag prevents subsequent mutations for the same table
from re-adding tokens. Additionally, fall back to a full partition read
when the tablet map is missing locally, which happens when the joining
node receives tablet metadata for a table it has never seen before.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-2303.

Needs backports to 2026.1+. 2026.1 introduces the regression with b17a36c071

Closes scylladb/scylladb#30115

* github.com:scylladb/scylladb:
  tablets: fall back to full partition read when tablet map is missing
  tablets: fix hint re-adding tokens after full partition read decision
2026-05-29 11:53:36 +02:00
Botond Dénes
091e3f5191 Merge 'test.py: reduce resource metrics gathering overhead' from Evgeniy Naydanov
Only enable the memory controller in cgroup subtree_control instead of all available controllers. cpu.stat is available in cgroup v2 without enabling the cpu controller (base accounting), and enabling io/pids/cpu controllers adds unnecessary per-operation kernel overhead to Scylla processes - particularly the memory controller's per-page-cache-operation accounting combined with io controller overhead during heavy I/O.

Additionally, restrict SystemResourceMonitor to the master process only. System-wide metrics (CPU%, memory) are identical from any process, so running a monitoring thread in each xdist worker was redundant and added unnecessary SQLite write contention and thread scheduling noise.

Replace cpu_percent(interval=0.1) with a non-blocking cpu_percent()
that returns CPU% since the previous call. Use stop_event.wait(timeout=2.0) as the
loop control to both space out iterations and allow immediate shutdown responsiveness.
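
The loop shape described here can be sketched as below. The sampler is passed in as a parameter so the example stays self-contained; in the real monitor it would presumably be psutil's `cpu_percent()` called without an interval, and `monitor_loop` itself is a hypothetical name.

```python
import threading


def monitor_loop(sample, record, stop_event, interval=2.0):
    """Call sample() every `interval` seconds until stop_event is set."""
    # Event.wait() returns False on timeout (time to take a sample) and
    # True as soon as the event is set, so shutdown does not have to wait
    # for the current interval to elapse.
    while not stop_event.wait(timeout=interval):
        record(sample())
```

Using the wait both as the pacing sleep and as the exit check removes the need for a separate blocking measurement window.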

Fixes SCYLLADB-2141

Closes scylladb/scylladb#29987

* github.com:scylladb/scylladb:
  test: use non-blocking cpu_percent in SystemResourceMonitor
  test.py: reduce cgroup overhead in resource metrics gathering
2026-05-29 10:52:17 +03:00
Nadav Har'El
7a387a499f Merge 'cql3: extract vector search select statement into cql3/statements/external_search/' from Szymon Malewski
Extract vector_indexed_table_select_statement and its filter logic out of
the monolithic select_statement.cc and vector_search/ module into a
dedicated directory cql3/statements/external_search/.

This improves modularity and eliminates a circular dependency between cql3
and vector_search: the filter code depends heavily on cql3 types
(expressions, query_options, statement_restrictions) and belongs in the cql3
layer. Follow-up to VECTOR-250 which originally addressed the same
dependency but has since regressed.

This is also a preparatory refactoring for full-text search select statements,
which can share some implementation with the vector search.

Pure refactoring, no semantic changes - no need for backporting.

Closes scylladb/scylladb#30100

* github.com:scylladb/scylladb:
  vector_index: move filter into cql3/statements/external_search
  cql3: extract vector_indexed_table_select_statement into own compilation unit
  vector_index: split query_base_table to return raw coordinator_result
2026-05-28 11:26:49 +03:00
Piotr Dulikowski
8dfd455001 Merge 'strong consistency: fix drop table blocking on stuck writes and handle timeout in update()' from Petr Gusev
- Fix table drop blocking for the full client timeout when in-flight writes can't reach quorum
- Handle unhandled timeout exception in the wait-for-leader loop during group startup

When a strongly consistent table is dropped, `schedule_raft_group_deletion()` calls `g->close()` which waits for all in-flight operations to release their gate holders. But other nodes may have already destroyed their raft servers for this group, so an in-flight write on the leader cannot reach quorum and hangs until the client timeout expires (~seconds), unnecessarily delaying group deletion.

Additionally, the wait-for-leader loop in groups_manager::update() uses abort_on_expiry with a 60-second timeout but never catches the exception if it fires, leaving the group in an indeterminate state.

SCYLLADB-2080 fix:
- Reorder `schedule_raft_group_deletion`: initiate gate close (prevents new operations), then abort the raft server (unblocks stuck writes by causing `raft::stopped_error`), then await the gate future (resolves immediately since holders are released).
- Handle `raft::stopped_error` in the coordinator's top-level catch blocks (both write and read paths): if the table no longer exists, return `no_such_column_family` (CQL layer converts to InvalidRequest: unconfigured table). Otherwise fall through to the default timeout handling.
- Replace gate->hold() with try_hold() + on_internal_error in acquire_server, with a comment explaining why the gate can never be closed at that point (table removal in `schema_applier::commit_on_shard` precedes gate closure, with no scheduling point in between).

Timeout handling fix:
- Use `coroutine::as_future` in the wait-for-leader loop to catch timeout exceptions gracefully — log a warning and break out instead of propagating unhandled.

Includes a cluster test reproducer (test_drop_table_unblocks_stuck_write) that:
1. Pauses a write on the leader before add_entry
2. Drops the table (follower destroys its group immediately)
3. Resumes the write — verifies it fails promptly with InvalidRequest ("unconfigured table") instead of hanging for 15 seconds

backport: no need, strong consistency is not released yet

Fixes: SCYLLADB-2080

Closes scylladb/scylladb#30105

* github.com:scylladb/scylladb:
  strong consistency/groups_manager: handle timeout in update() wait-for-leader loop
  strong consistency: abort raft server before gate close when dropping a table
  test/cluster: rewrite test_queries_while_dropping_table for SCYLLADB-2080
2026-05-28 09:59:20 +02:00
Szymon Malewski
ed1006928f vector_index: move filter into cql3/statements/external_search
Move prepared_filter, prepared_restriction, prepared_rhs types and
prepare_filter() from vector_search/filter.{hh,cc} into new files
cql3/statements/external_search/filter.{hh,cc} under namespace
cql3::statements::external_search.

This eliminates a circular dependency between the cql3 and vector_search
modules: the filter code depends heavily on cql3 types (expressions,
query_options, statement_restrictions) and belongs in the cql3 layer.

This is a follow-up to VECTOR-250 which originally addressed the same
circular dependency but has since regressed.
2026-05-27 21:43:56 +02:00
Ferenc Szili
76dac2fd8e test: fix format string typo in error logging in ldap_server.py
This change fixes a typo in the error logging format string: s% -> %s

Fixes: SCYLLADB-2244

Closes scylladb/scylladb#30088
2026-05-27 17:22:21 +03:00
Aleksandra Martyniuk
d6c1707a04 tablets: fix hint re-adding tokens after full partition read decision
When partition_split_builder splits a tablet metadata partition into
multiple mutations, the first mutation gets the partition tombstone and/or
static row while subsequent mutations contain only clustered rows.

The tablet metadata change hint logic would correctly clear tokens (marking
a full partition read) upon seeing the tombstone in the first mutation,
but then re-add tokens when processing the subsequent row-only mutations.
This caused update_tablet_metadata to attempt a point update via
mutate_tablet_map_async on a tablet map that doesn't exist yet during
bootstrap, throwing no_such_tablet_map and failing the snapshot transfer.

Fix by adding a full_read flag to table_hint. Once a full partition read
is decided (due to partition tombstone, range tombstone, static row, or
row deletion), the flag prevents subsequent mutations for the same table
from re-adding tokens.
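
A minimal model of the sticky flag. The dictionary-based mutation format and all names are hypothetical simplifications of the hint logic described above.

```python
from dataclasses import dataclass, field


@dataclass
class TableHint:
    tokens: set = field(default_factory=set)
    full_read: bool = False


def update_hint(hint, mutation):
    """Fold one mutation of a split partition into the table's hint."""
    if hint.full_read:
        # A full partition read was already decided; row-only mutations
        # that follow must not re-add tokens.
        return
    if mutation.get("partition_tombstone") or mutation.get("static_row"):
        hint.tokens.clear()
        hint.full_read = True
        return
    hint.tokens.update(mutation.get("row_tokens", ()))
```

Without the flag, the second mutation of a split partition would repopulate `tokens` and steer the update back toward a point update.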
2026-05-27 15:36:16 +02:00
Nadav Har'El
21ecc12fc6 Merge 'index: fix local vector index locality detection after schema reload' from Michał Hudobski
After schema reload, `target_parser::is_local()` did not recognize the
vector-index local target format `{"pk": [...], "tc": "..."}`, causing
local vector indexes to be treated as global. This broke duplicate
detection when both a global and a local vector index existed on the same
column. Fix by introducing `vector_index::is_local()` and dispatching
to it from `create_index_from_index_row()` based on the index class.
Also adds tests for local/global vector index coexistence.
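A minimal sketch of the kind of check involved, written in Python for brevity. Only the local target shape `{"pk": [...], "tc": "..."}` comes from the description above; the function name and the assumption that any other target string is non-local are illustrative, not the actual `vector_index::is_local()` logic.

```python
import json


def looks_like_local_vector_target(target):
    """Return True if target is a JSON object with a "pk" list and a "tc" string."""
    try:
        parsed = json.loads(target)
    except ValueError:
        # Not JSON at all (assumed here to mean a non-local target).
        return False
    return (
        isinstance(parsed, dict)
        and isinstance(parsed.get("pk"), list)
        and isinstance(parsed.get("tc"), str)
    )
```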

Fixes: SCYLLADB-987

backport reasoning: we added local vector index support in 2026.1

Closes scylladb/scylladb#29492

* github.com:scylladb/scylladb:
  test/cqlpy: add tests for global and local vector index coexistence
  index: fix local vector index locality detection after schema reload
2026-05-27 15:34:57 +03:00
Petr Gusev
d922c43358 strong consistency: abort raft server before gate close when dropping a table
When a strongly consistent table is dropped, schedule_raft_group_deletion()
used to call g->close() first, which waits for all in-flight operations to
release their gate holders. But other nodes may have already destroyed their
raft servers for this group, so an in-flight write on the leader cannot
reach quorum and hangs until the client timeout expires, unnecessarily
delaying group deletion.

Fix: initiate gate close (prevents new operations from entering), then
abort the raft server (causes in-flight add_entry/read_barrier to throw
raft::stopped_error, releasing their gate holders), then await the gate
future (resolves immediately since holders are now released).
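The ordering can be modeled with a small asyncio sketch. Everything here is a stand-in: `Gate`, `FakeRaftServer`, and `StoppedError` are hypothetical Python analogues of the Seastar gate, the raft server, and raft::stopped_error, and `add_entry` simply blocks to imitate a write that can no longer reach quorum.

```python
import asyncio


class StoppedError(Exception):
    """Stand-in for raft::stopped_error."""


class Gate:
    def __init__(self):
        self._holders = 0
        self._closed = False
        self._drained = asyncio.Event()

    def try_hold(self):
        if self._closed:
            return False
        self._holders += 1
        return True

    def release(self):
        self._holders -= 1
        if self._closed and self._holders == 0:
            self._drained.set()

    def close(self):
        """Reject new holders now; return an awaitable for the drain."""
        self._closed = True
        if self._holders == 0:
            self._drained.set()
        return self._drained.wait()


class FakeRaftServer:
    def __init__(self):
        self._aborted = asyncio.Event()

    async def add_entry(self):
        await self._aborted.wait()  # quorum is never reached
        raise StoppedError()

    def abort(self):
        self._aborted.set()


async def write(gate, server):
    if not gate.try_hold():
        raise RuntimeError("gate closed")
    try:
        await server.add_entry()
    finally:
        gate.release()


async def drop_group(gate, server):
    closed = gate.close()  # 1. no new operations may enter
    server.abort()         # 2. in-flight operations fail and release holders
    await closed           # 3. drain completes promptly


async def main():
    gate, server = Gate(), FakeRaftServer()
    writer = asyncio.create_task(write(gate, server))
    await asyncio.sleep(0)  # let the write take its gate hold
    await asyncio.wait_for(drop_group(gate, server), timeout=1)
    try:
        await writer
    except StoppedError:
        return "stopped"
    return "completed"
```

Awaiting the gate before calling `abort()` in `drop_group` would leave the model waiting on the blocked write, which mirrors the delay described above.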

Handle raft::stopped_error in the coordinator's top-level catch blocks
(both write and read paths): if the table no longer exists, return
no_such_column_family (which the CQL layer converts to InvalidRequest
'unconfigured table'). Otherwise fall through to the default timeout
handling.

Also replace gate->hold() with try_hold() + on_internal_error in
acquire_server, and handle the timeout exception in the wait-for-leader
loop in update() gracefully (log + break instead of propagating).

Fixes: SCYLLADB-2080
2026-05-27 12:06:46 +02:00
Petr Gusev
89307064b5 test/cluster: rewrite test_queries_while_dropping_table for SCYLLADB-2080
Rewrite the test to use 2 nodes (RF=2) instead of 1 (RF=1), which exposes
the quorum-loss scenario: when a table is dropped, the follower destroys
its raft group immediately while the leader's in-flight operations are
still holding the gate.

The test pauses both a read and a write on the leader, drops the table,
then resumes them. Both are expected to fail with 'no such column family'
since the raft server is aborted as part of group deletion. A 15-second
timeout guard detects the old buggy behavior (write stuck forever).

Marked xfail until the fix is applied in the next commit.
2026-05-27 12:06:46 +02:00
Botond Dénes
555cfbcd38 Merge 'treewide: replace deprecated smp::count and smp::all_cpus() with new APIs' from Avi Kivity
Replace all uses of the deprecated seastar::smp::count with this_smp_shard_count() and smp::all_cpus() with this_smp_all_shards() across the ScyllaDB codebase (seastar submodule untouched).

Both replacement functions require a reactor thread context. All call sites were verified to run on reactor threads.

Notable cases:
- dht/token-sharding.hh: this_smp_shard_count() is used as a default parameter value. This is safe since all callers are on reactor threads, but the expression is now evaluated at each call site rather than being a reference to a global variable.
- service/storage_service.hh, locator/abstract_replication_strategy.hh, ent/encryption/encryption.cc: used in default member initializers and constructor member-init-lists. Objects are always constructed on reactor threads.
- schema_builder: sometimes called from BOOST_AUTO_TEST_CASE without a reactor. Added a pre-patch that makes the implicit shard count parameter explicit and passes 1 in those cases.

Not changed:
- scylla-gdb.py: reads smp::count as a GDB symbol (no reactor context).
- Python test files: only reference smp::count in comments/strings.

No backport: the Seastar commit that deprecated these functions hasn't made (and won't make) its way into any release branches (and the warnings are cosmetic anyway).

Closes scylladb/scylladb#29990

* github.com:scylladb/scylladb:
  treewide: replace deprecated smp::count and smp::all_cpus() with new APIs
  scylla-gdb: read shard count from smp::_this_smp instead of smp::count
  schema_builder: make shard_count an explicit constructor parameter
2026-05-27 09:42:06 +03:00
Avi Kivity
8010e408a2 treewide: replace deprecated smp::count and smp::all_cpus() with new APIs
Replace all uses of the deprecated seastar::smp::count with
this_smp_shard_count() and smp::all_cpus() with this_smp_all_shards()
across the ScyllaDB codebase (seastar submodule untouched).

Both replacement functions require a reactor thread context. All call
sites were verified to run on reactor threads.

Notable cases:
- dht/token-sharding.hh: this_smp_shard_count() is used as a default
  parameter value. This is safe since all callers are on reactor threads,
  but the expression is now evaluated at each call site rather than being
  a reference to a global variable.
- service/storage_service.hh, locator/abstract_replication_strategy.hh,
  ent/encryption/encryption.cc: used in default member initializers and
  constructor member-init-lists. Objects are always constructed on reactor
  threads.

Not changed:
- scylla-gdb.py: reads smp::count as a GDB symbol (no reactor context).
- Python test files: only reference smp::count in comments/strings.
2026-05-26 17:35:20 +03:00
Wojciech Mitros
ae0d77257f mv: fix view_update_builder losing fragments across batch boundaries
When a mutation generates more view updates than max_rows_for_view_updates
(100), view_update_builder::build_some() splits the work into multiple
batches. There was a bug in how fragments were read between batches:

When should_stop_updates() returned true, the old code called stop()
which returned stop_iteration::yes without reading the next fragments.
On the next build_some() call, read_both_next_fragments() was called
at the start, which advanced BOTH readers - skipping any fragment that
was already read but not yet consumed. A row could be left unconsumed if
either:
- the 100th (last in the batch) update was a row insertion and we still
  had insertions/updates remaining
- the 100th (last in the batch) update was a row deletion and we still
  had deletions/updates remaining
For the most common case where work is split in batches, i.e. range
deletions, we couldn't hit this because range delete generates only
view row deletions.
On tables with a single materialized view, we also couldn't get this
for any batches with less than 50 statements (unless the batch also
contained range deletions), because one non-range-delete update can
generate up to 2 view updates.
However, for a range of scenarios outside these two, we could lose
view updates, resulting in persistent inconsistencies.

The fix:
- read_*_next_fragment() now accept a stop_iteration parameter, so the
  next fragments are always read after consuming (even when stopping),
  but stop_iteration::yes is correctly propagated to break the loop.
- build_some() no longer re-reads fragments at the start. Instead, an
  initialize() method performs the initial read once at construction.
- because we now advance readers only after consuming, we won't advance
  readers after end_of_partition, so we extend the break condition to
  accept the readers either evaluating to `false` or being at
  end_of_partition. We also handle the optimization with
  _skip_row_updates.
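
The difference between the old and new reading order can be shown with a simplified Python model. It is not the view_update_builder code: the two sorted key lists stand in for the update and existing readers, one consumed key stands in for one view update, and `merge_in_batches` and its `buggy` switch are illustrative names.

```python
def merge_in_batches(updates, existing, batch_size, buggy):
    """Merge two sorted key streams, emitting results in bounded batches."""
    upd, ex = iter(updates), iter(existing)
    u, e = next(upd, None), next(ex, None)  # initial read, done once
    first, batches = True, []
    while u is not None or e is not None:
        if buggy and not first:
            # Old behavior: every later batch starts by advancing BOTH readers.
            u, e = next(upd, None), next(ex, None)
        first, batch = False, []
        while u is not None or e is not None:
            take_u = u is not None and (e is None or u <= e)
            take_e = e is not None and (u is None or e <= u)
            batch.append(u if take_u else e)  # consume the smaller key
            stop = len(batch) >= batch_size
            if buggy and stop:
                break  # old behavior: stop without reading the next fragment
            # New behavior: advance only what was consumed, even when stopping.
            if take_u:
                u = next(upd, None)
            if take_e:
                e = next(ex, None)
            if stop:
                break
        if batch:
            batches.append(batch)
    return batches


fixed = merge_in_batches([1, 3], [2, 4], batch_size=1, buggy=False)
lossy = merge_in_batches([1, 3], [2, 4], batch_size=1, buggy=True)
```

In the buggy mode, key 2 was already read from the second stream when the first batch stopped, and it is discarded when both readers are advanced at the start of the next batch.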

Fixes: scylladb/scylladb#29155

Closes scylladb/scylladb#29498
2026-05-26 14:15:12 +02:00
Pavel Emelyanov
cd7d9a63bc error_injection: Convert handler-style breakpoints to wait_for_message sugar
Replace verbose handler lambdas that only log and call
wait_for_message() with the equivalent one-liner breakpoint sugar.
The behavior is identical -- the sugar produces the same log messages
in the format "{name}: waiting for message" / "{name}: message received".

Update Python tests that waited for the old ad-hoc log messages to
match the new standardized format.

Converted injections:
 - topology_state_load_before_update_cdc (storage_service.cc)
 - migration_streaming_wait x2 (storage_service.cc)
 - pause_after_streaming_tablet (storage_service.cc)
 - cdc_generation_publisher_fiber (topology_coordinator.cc)
 - wait_after_tablet_cleanup (topology_coordinator.cc)
 - fast_orphan_removal_fiber (topology_coordinator.cc)
 - split_storage_groups_wait (table.cc)
 - wait_before_stop_compaction_groups (table.cc)
 - tasks_vt_get_children (task_manager.cc)
 - truncate_compaction_disabled_wait (database.cc)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-26 15:01:01 +03:00
Avi Kivity
c59985c38b Merge 'cql3: limit large allocations when parsing queries' from Botond Dénes
Queries are stored and passed around as sstring/std::string_view. While normally they are small enough to not cause problems, as the `test_cdc_large_values.TestLargeColumnsWithCDC.test_single_column_blob_max_size_with_cdc_preimage_full_postimage[unprepared_statements]` demonstrates, queries can be arbitrarily large, putting heavy strain on Scylla internals via large allocations, in the extreme case causing denial of service.

This PR attempts to alleviate this by using fragmented storage for queries: read the query as a fragmented string from the input stream in `transport/server.cc`, propagate it as such to `query_processor::prepare()` and also store it as such in `cql3::cql_statement::raw_cql_statement`. Also avoid linearizing raw values in the CQL expression tree: switch `cql3::expr::untyped_constant::raw_text` to fragmented storage.

For this to be possible, some infrastructure code had to be made fragmented storage friendly: ascii/utf8 validation, hashers, from_hex and importantly: `abstract_type::from_string()`.
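
To illustrate what "fragmented storage friendly" means for validation and hashing, here is a small Python sketch that processes a value fragment by fragment without joining it into one buffer. The helper names are invented for this example and do not correspond to the Scylla utilities; a list of `bytes` objects stands in for a fragmented string.

```python
import codecs
import hashlib


def split_fragments(data, size):
    """Cut a bytes value into fragments of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def is_ascii_fragmented(fragments):
    # ASCII validity is per byte, so fragments can be checked independently.
    return all(f.isascii() for f in fragments)


def is_utf8_fragmented(fragments):
    # A multi-byte sequence may straddle fragments, so carry decoder state.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        for f in fragments:
            decoder.decode(f)
        decoder.decode(b"", final=True)  # reject a truncated trailing sequence
    except UnicodeDecodeError:
        return False
    return True


def sha256_fragmented(fragments):
    # Feeding fragments in order gives the same digest as the joined value.
    h = hashlib.sha256()
    for f in fragments:
        h.update(f)
    return h.hexdigest()
```

The UTF-8 case shows why a dedicated overload is needed: validating each fragment in isolation would wrongly reject a character whose bytes are split across two fragments.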

Unfortunately, the query still has to be linearized for parsing itself, as ANTLR -- although it allows for custom InputStream implementations -- plays pointer arithmetic games with the pointers obtained from them, so fragmented input cannot be used.

Still, this PR limits the places where the query is linearized to the
following:
* Parsing
* Audit
* Logs and error messages

So the normal query paths for queries that actually can get arbitrarily large (UPDATE and INSERT) should only linearize the query temporarily for parsing.

Fixes #10779

Improvement, no backport

Closes scylladb/scylladb#28619

* github.com:scylladb/scylladb:
  tracing: add_query(): change query param to utils::chunked_string
  cql3: store raw query string in utils::chunked_string
  serializer: add serializer<utils::chunked_string>
  utils/reusable_buffer: add get_linearized_view(managed_bytes_view)
  cql3/expr: use utils::chunked_string for untyped_constant::raw_text
  types: abstract_type::from_string() switch to fragmented buffers (implementation)
  types: abstract_type::from_string() switch to fragmented buffers (interface)
  types: use write_fragmented from utils/fragment_range.hh
  types: timestamp_from_string(): don't assume std::string_view is null-terminated
  types/duration: don't assume std::string_view is null-terminated
  utils/hashers: add calculate(managed_bytes_view) overload
  utils/ascii: add validate(managed_bytes_view) overload
  utils: add managed_bytes_fwd.hh
  utils: add chunked_string
  utils: add managed_bytes_basic_view::byte_iterator
2026-05-26 15:00:53 +03:00
Andrzej Jackowski
45ff773466 test: table_helper: verify detached setup failure is consumed
Add test_best_effort_setup_table_failure_is_consumed which
triggers a setup_table() failure via a missing keyspace and
asserts no abandoned future escapes. This guards against
regressions where the detached future loses its exception
handler.

Remove the test_skipped_no_error_injection placeholder since
the new test runs unconditionally, keeping the suite non-empty
in all build modes.
2026-05-26 13:32:56 +02:00
Avi Kivity
f165b396fd schema_builder: make shard_count an explicit constructor parameter
A recent Seastar update deprecated smp::count and introduced
this_smp_shard_count() as a replacement. One difference is that
this_smp_shard_count() wants to run on a reactor thread.

This poses a problem for non-reactor tests (BOOST_AUTO_TEST_CASE)
that nevertheless use a schema, as the schema_builder constructor
references smp::count. If we replace it with this_smp_shard_count()
then it will crash when running without a reactor.

To fix, remove the implicit this_smp_shard_count() call from raw_schema's
constructor and require callers to pass shard_count explicitly to
schema_builder. This allows tests that don't run on a reactor thread
to construct schemas without crashing.

Production code and reactor-based tests pass this_smp_shard_count().
Non-reactor test files (expr_test, keys_test, nonwrapping_interval_test,
wrapping_interval_test, bti_key_translation_test, range_tombstone_list_test)
pass a fixed shard count of 1.

Note: sstable_test.cc is a Seastar test file (SEASTAR_THREAD_TEST_CASE)
but also contains one plain BOOST_AUTO_TEST_CASE
(test_empty_key_view_comparison) that constructs a schema_builder without
a reactor context. This test also receives a fixed shard count of 1.
2026-05-26 11:55:56 +03:00
Nikos Dragazis
54cb6d4608 test: Order task-wait before finalization in test_migration_wait_task
The purpose of this test is to verify that the task manager's "wait" API
works correctly for vnodes-to-tablets migration virtual tasks. It starts
a `wait_task` HTTP request concurrently with a finalize (or rollback)
operation, and asserts that the wait returns the correct final state
("done" or "suspended").

The test uses `asyncio.create_task()` to wrap the wait request into a
task, and then immediately calls finalize. With asyncio's lazy task
scheduling, the wait coroutine does not start until the event loop
yields, so the finalization request reaches the server before wait, and
therefore may also complete before it. Once finalization completes, the
virtual migration task is no longer discoverable, causing a
"task not found" error.

Add a log message in Scylla's wait handler and a synchronization point
in the test to ensure that the wait request lands on the server before
finalization. This follows the same pattern used in
`test_tablet_tasks.py::check_and_abort_repair_task`.
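
The scheduling behavior behind this race can be reproduced in a few lines of plain asyncio. This is a standalone sketch, not the test itself: `wait_request` and `finalize` are placeholders for the HTTP calls, and an `asyncio.Event` stands in for waiting on the new log message.

```python
import asyncio


async def demo(synchronize):
    order = []
    wait_started = asyncio.Event()

    async def wait_request():
        order.append("wait")
        wait_started.set()  # stand-in for the server-side log message

    async def finalize():
        order.append("finalize")

    # create_task() only schedules the coroutine; it has not run yet.
    task = asyncio.create_task(wait_request())
    if synchronize:
        # Yield to the event loop until the wait request has really started.
        await wait_started.wait()
    await finalize()
    await task
    return order
```

Calling `asyncio.run(demo(False))` records the finalize step first, while `asyncio.run(demo(True))` records the wait step first, which is the ordering the fixed test enforces.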

Fixes SCYLLADB-2077

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#29973
2026-05-26 10:43:22 +03:00
Botond Dénes
0fd25dc47c Merge 'Replace get_injection_parameters() with inject_parameter() where appropriate' from Pavel Emelyanov
Several error injection sites use the low-level get_injection_parameters() API to fetch the entire parameters map and then manually look up a single key. The inject_parameter() API is better suited for these cases — it combines the enabled check and typed single-parameter extraction in one call, returning std::optional.

Cleaning error injection usage, not backporting

Closes scylladb/scylladb#29970

* github.com:scylladb/scylladb:
  test: Use inject_parameter() in row_cache_test
  sstables: Use inject_parameter() for mx reader fill buffer timeout
  streaming: Use inject_parameter() for order_sstables_for_streaming
2026-05-26 10:32:44 +03:00
Nadav Har'El
f65a52f3ec Merge 'vector_search: test: migrate rescoring tests from C++/Boost to pytest' from Szymon Malewski
Migrate mock-based rescoring and oversampling tests from
test/vector_search/rescoring_test.cc to pytest and delete the C++ file.
Index option validation tests go to test_vector_index.py; rescoring tests
go to a new test_vector_search_rescoring.py which introduces shared
infrastructure (EmbeddingRow dataclass, TEST_DATA dict,
reversed_ann_response() helper, rescoring_test_table() context manager).

Two tests have updated assertions (semantic change):
filters_invalid_similarity_scores now uses per-function expected result
sets including a zero-vector row, and rescoring_with_zerovector_query
asserts empty results after NaN filtering (cosine only). Both are marked
xfail pending SCYLLADB-924.

Follow-up to #29593.

Does not require backport - simple refactoring of tests

Closes scylladb/scylladb#29906

* github.com:scylladb/scylladb:
  test/vector_search: migrate zero-vector query rescoring test to pytest; delete rescoring_test.cc
  test/vector_search: migrate invalid similarity score filtering test to pytest
  test/vector_search: migrate non-ANN similarity argument rescoring test to pytest
  test/vector_search: migrate wildcard select rescoring test to pytest
  test/vector_search: migrate similarity_function rescoring test to pytest
  test/vector_search: migrate rescoring and f32 quantization tests to pytest
  test/vector_search: migrate oversampling tests to pytest
  test/vector_search: migrate vector_index option validation tests to pytest
2026-05-26 09:45:40 +03:00
Botond Dénes
2c9a5f9634 types: abstract_type::from_string() switch to fragmented buffers (implementation)
The previous patch changed the interface and callers, this one updates
the implementation to actually work with fragmented buffers. Most types
just use with_linearized() to linearize the fragmented input buffer for
parsing. This is fine, as most types have a fixed or bounded-size string
representation that is small.
Importantly, the input is not linearized for the 3 types which have
unbounded values: ascii, bytes and text. The tuple type can contain any
of these types itself, so it is also converted to avoid linearization.
2026-05-26 09:08:06 +03:00