Split start_server_for_group0 so it only starts the raft server and
replays the log (applying mutations to system tables), without loading
the state into memory. A new enable_group0_state_machine() method is
added which callers must invoke explicitly after all dependencies
(CDC generation service, non-system schemas, etc.) are available.
This prepares for moving setup_group0_if_exist earlier in the startup
sequence so the raft log can be replayed before non-system keyspaces
are loaded, while deferring the in-memory state loading until after
all dependencies are initialized.
compact_keyspace() operates on a whole keyspace and has no 'cf' variable
in scope, but the nodetool fallback branch mistakenly passed it to
args.extend([ks, cf]), which would raise NameError whenever that path
was taken. Fix by passing only the keyspace.
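A minimal sketch of the corrected fallback, assuming a helper of roughly this shape. `build_compact_args` is a hypothetical name used for illustration only; it is not the actual function in the test library.

```python
def build_compact_args(ks: str) -> list[str]:
    """Build the argument list for a keyspace-wide `nodetool compact` call.

    Hypothetical helper: only the keyspace is in scope here, so only the
    keyspace is appended. Referencing an undefined `cf` at this point would
    raise NameError the first time this branch runs.
    """
    args = ["compact"]
    args.extend([ks])
    return args
```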
Closes scylladb/scylladb#30097
The test was racing move_tablet against restore_tablets without
ensuring that move_tablet had actually reached the streaming phase
before restore began. This caused restore to win the group0 race,
putting the tablet into transition first, which made move_tablet
fail with "Tablet is in transition".
Fix by adding a log message to the block_tablet_streaming error
injection and waiting for it in the test, ensuring the move has
entered the streaming phase (and is blocked) before restore starts.
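The ordering guarantee can be illustrated with a small asyncio sketch. It is only an analogy: the event `reached_streaming` stands in for the new log message, `unblock` stands in for the error injection, and all function names are hypothetical.

```python
import asyncio


async def move_tablet(reached_streaming: asyncio.Event,
                      unblock: asyncio.Event, log: list[str]) -> None:
    log.append("move: entered streaming phase")
    reached_streaming.set()   # stands in for the injection's log message
    await unblock.wait()      # stands in for the blocking error injection
    log.append("move: finished")


async def restore_tablets(log: list[str]) -> None:
    log.append("restore: started")


async def run_test() -> list[str]:
    log: list[str] = []
    reached_streaming = asyncio.Event()
    unblock = asyncio.Event()
    move = asyncio.create_task(move_tablet(reached_streaming, unblock, log))
    # Do not start the restore until the move is known to be blocked
    # in the streaming phase.
    await reached_streaming.wait()
    await restore_tablets(log)
    unblock.set()
    await move
    return log
```

Without the wait on `reached_streaming`, the restore could be recorded before the move had started, which is the race the test used to hit.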
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-2147
Closes scylladb/scylladb#30173
This series fixes two vulnerabilities:
- unbounded recursion during expression evaluation with deeply nested expressions
- quadratic computation with large WHERE clauses
The fixes simply bound the depth of recursion and the length of the WHERE clause.
The WHERE clause limits are configurable. The nesting limit is less likely to be exceeded, so it is not configurable.
Limits inspired by Common Expression Language:
https://github.com/google/cel-spec/blob/master/doc/langdef.md#syntax
Implementations are required to support at least:
- 24-32 repetitions of repeating rules
- 12 repetitions of recursive rules
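The idea of bounding recursion depth can be sketched as follows. This is an illustration, not the parser code: expressions are modeled as nested Python lists, and the value 12 is borrowed from the CEL figure quoted above purely as an assumption; the limit chosen by the series is not stated here.

```python
MAX_NESTING_DEPTH = 12  # assumed value, for illustration only


class NestingTooDeep(Exception):
    pass


def check_depth(expr, depth: int = 0) -> None:
    """Reject expressions nested deeper than MAX_NESTING_DEPTH.

    An expression is either a leaf (any non-list value) or a list of
    sub-expressions, e.g. f(g(x)) is modeled as ["f", ["g", "x"]].
    Each list counts as one nesting level.
    """
    if isinstance(expr, list):
        depth += 1
        if depth > MAX_NESTING_DEPTH:
            raise NestingTooDeep(
                f"expression nested deeper than {MAX_NESTING_DEPTH} levels")
        for sub in expr:
            check_depth(sub, depth)


def nest(levels: int):
    """Build an expression with `levels` nested calls around a leaf."""
    expr = "x"
    for _ in range(levels):
        expr = ["f", expr]
    return expr
```

Because the check raises before descending past the limit, the checker's own recursion stays bounded no matter how deeply the input is nested.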
CVE-2026-31948
CVE-2026-31947
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1003
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1002
Fixes https://github.com/scylladb/scylladb/issues/14472
Closes scylladb/scylladb-ghsa-m4h7-g37h-mgxf#3
* github.com:scylladb/scylladb-ghsa-m4h7-g37h-mgxf:
cql3: limit number of relations in WHERE clause
cql3: add max_relations_in_where_clause to dialect
test/cqlpy: add tests for WHERE clause relation count limit
cql3: limit nesting depth of function calls and CASTs in CQL parser
test/cqlpy: add tests for deeply nested function calls and CASTs
During shutdown, group0 may be torn down while
cache_table_info() has a detached setup_table() future
in flight. This causes raft_group_not_found to propagate
as an abandoned failed future.
Add .handle_exception() to log the failure at debug level
instead of leaving the future unobserved.
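The patch itself is C++ on Seastar. The same pattern can be sketched in Python's asyncio as an analogy only: a done-callback plays the role of `.handle_exception()`, and `GroupNotFound`, `setup_table` and `cache_table_info` are hypothetical stand-ins.

```python
import asyncio
import logging

logger = logging.getLogger("table_helper_sketch")


class GroupNotFound(Exception):
    """Stands in for raft_group_not_found in this sketch."""


async def setup_table() -> None:
    raise GroupNotFound("group0 is gone")


def observe(task: asyncio.Task) -> None:
    """Consume the task's outcome so a failure is logged, not abandoned."""
    if task.cancelled():
        return
    exc = task.exception()
    if exc is not None:
        logger.debug("detached setup_table() failed: %s", exc)


async def cache_table_info() -> asyncio.Task:
    task = asyncio.create_task(setup_table())  # detached: nobody awaits it
    task.add_done_callback(observe)
    return task
```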
Fixes: SCYLLADB-2224
Backport to 2026.2 and 2026.1, because the test failed on 2026.1
Closes scylladb/scylladb#30093
* github.com:scylladb/scylladb:
test: table_helper: verify detached setup failure is consumed
table_helper: observe detached setup_table() future
Multi-RF change handles multiple keyspaces concurrently, but tablet
rebuilds are not all started at once — the load balancer considers
machine load when scheduling them. With 3 keyspaces each having a base
table and materialized view, the total operation time approaches the
default 200s CQL timeout on slow/busy CI machines (observed at ~191s).
Double the timeout to 400s to provide sufficient margin.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-2042.
Closes scylladb/scylladb#30018
Add a group0 read barrier in test_cdc_with_tablets whenever we observe a
condition such as a tablet count change or a cdc stream change and then
want to check that the cdc tables are consistent with that change. For
example, when we wait for a tablet count change and then check that the
cdc streams changed as well.
The problem is that when we observe the tablet count change, for
example, even though the cdc streams are changed in the same group0
operation, we may observe it during the group0 apply, when the operation
is only partially applied. The read barrier ensures that the change we
observed is fully applied.
Fixes SCYLLADB-2352
Closes scylladb/scylladb#30177
A WHERE clause with many relations (e.g. hundreds of AND-ed conditions)
can cause quadratic complexity. Check the relation count during parsing
and reject queries exceeding the configurable max_relations_in_where_clause
limit (default 100) with a SyntaxException.
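A minimal sketch of the count check, assuming the WHERE clause has already been split into individual relations. The function and exception class here are hypothetical illustrations, not the parser's actual code; only the default of 100 comes from the text above.

```python
MAX_RELATIONS_IN_WHERE_CLAUSE = 100  # default named in the text


class SyntaxException(Exception):
    """Stands in for the CQL SyntaxException in this sketch."""


def check_relation_count(relations: list[str],
                         limit: int = MAX_RELATIONS_IN_WHERE_CLAUSE) -> None:
    """Reject a WHERE clause whose relation count exceeds `limit`."""
    if len(relations) > limit:
        raise SyntaxException(
            f"WHERE clause has {len(relations)} relations, limit is {limit}")
```

Checking the count up front keeps the later, super-linear analysis from ever seeing an oversized clause.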
The changes to IDL don't cause problems during upgrade, because
CQL forwarding is not in any released version, and because
it is part of an experimental feature.
CVE-2026-31947
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1002
Add a configurable max_relations_in_where_clause parameter (default 100)
to the CQL dialect, plumbed through db::config, transport server, and
test environment. This will be used by the CQL parser to reject WHERE
clauses with too many relations that cause quadratic complexity.
Add tests that verify the CQL parser rejects WHERE clauses with too
many relations (e.g. WHERE a=1 AND b=1 AND ... repeated 200 times),
and that a reasonable number of relations (50) is still accepted.
Adds write-path guardrails that reject or warn on mutations targeting partitions, rows, or collections that already exceed configured size thresholds, based on SSTable `large_data_record` metadata.
ScyllaDB already detects and records large partitions/rows/cells in `system.large_data_records` after compaction, but takes no preventive action on the write path. Once a partition grows past operational limits it causes latency spikes, OOM, and repair failures. These guardrails let operators set hard and soft thresholds so that writes to already-oversized data are rejected (hard) or logged as warnings (soft) before they make the problem worse.
- **Intrusive index over SSTable metadata**: A per-table `large_data_record_index` maintains three `boost::intrusive::multiset`s (partitions, rows, cells) using `auto_unlink` hooks directly on `large_data_record`. SSTable destruction automatically removes records from the index — no explicit deregistration needed.
- **Virtual dispatch for zero-cost disabled path**: `large_data_guardrail_base` → `noop_large_data_guardrail` / `large_data_guardrail`. Tables without guardrails enabled pay only a virtual call to a no-op. No index is built or maintained for disabled tables.
- **Schema storage**: The per-table flag is stored as a scylla_tables column, following the tablets pattern: only write a live cell when enabled, omit entirely when disabled. The CQL feature gate prevents enabling until all nodes are upgraded.
- **Write-path integration**: The guardrail check runs in `do_apply` after the frozen mutation is deserialized but before it is applied to the memtable. Hint replay and Paxos learn skip the check via `skip_large_data_guardrails`.
Uses existing `large_*_warn_threshold` config options as soft limits and new `large_*_fail_threshold` options as hard limits. Checked dimensions:
- Partition size (bytes)
- Partition row count
- Row size (bytes)
- Collection element count
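The soft/hard decision for a single dimension can be sketched as below. This is an illustration under assumptions: the names are hypothetical, a threshold of 0 is treated as "disabled", and the comparison is assumed to be strict; none of these details are confirmed by the description above.

```python
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"      # soft limit exceeded: log a warning
    REJECT = "reject"  # hard limit exceeded: fail the write


def check_threshold(recorded: int, warn_threshold: int,
                    fail_threshold: int) -> Verdict:
    """Classify a write against one recorded size or count.

    `recorded` is the value already stored for the target (for example
    a partition size in bytes). A threshold of 0 is assumed to mean the
    corresponding limit is disabled.
    """
    if fail_threshold and recorded > fail_threshold:
        return Verdict.REJECT
    if warn_threshold and recorded > warn_threshold:
        return Verdict.WARN
    return Verdict.ALLOW
```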
Backport is not required
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-180
Closes scylladb/scylladb#29733
* github.com:scylladb/scylladb:
test/cqlpy: add per-table toggle, LWT exemption, and multi-category tests
test/cqlpy: add large collection guardrail tests
test/cqlpy: add large row guardrail tests
test/cqlpy: add large partition guardrail tests
test/boost: add large_data_guardrail unit tests
test/cluster: add large data guardrails rolling upgrade test
replica: wire large_data_guardrail into the write path
schema: add per-table large_data_guardrails_enabled flag
db: implement large_data_guardrail
db: implement large_data_record_index
sstables: add intrusive index hook to large_data_record
db: add large_collection_elements_fail_threshold config option
db: add large_row_fail_threshold_mb config option
db: add rows_count_fail_threshold config option
db: add large_partition_fail_threshold_mb config option
replica: introduce large_data_exception
Tests in test/cqlpy use a tiny nodetool-like library, where calls such as
nodetool.flush() are translated to the equivalent REST API request on
Scylla, but use an external "nodetool" command when running the test
against Cassandra.
Some test/cluster tests also began using test/cqlpy/nodetool.py, but it is
NOT a good fit for test/cluster tests, because:
1. It falls back to using the external "nodetool" when it thinks the
REST API is not available. In cluster tests, no such fallback is
needed (these tests can't be run on Cassandra). If the REST API is
down, the test should fail - not fall back to an irrelevant method.
2. The nodetool.flush() et al. functions are not async, and cluster
tests are supposed (by design...) to only use async APIs.
3. test/cqlpy/nodetool.py was not written in the "style" defined for
the test/cluster codebase - specifically, its functions don't have
docstrings or strong typing.
This patch introduces test/pylib/nodetool.py, based on
test/cqlpy/nodetool.py but fixing all the above problems - there are
no Cassandra fallbacks, there are docstrings and type hints, and
all the functions are async.
We also fix the test/cluster tests that used test/cqlpy/nodetool.py to
switch to test/pylib/nodetool.py. Of course it means the newly async
functions need to be "await"ed, not just called, so this patch changes
that too.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#30129
test_localnodes_joining_nodes stops a server while manager.server_add() is still waiting for that server to finish startup. Stopping the process can make the background add_server() fail and run its cleanup path first, removing the server from ScyllaCluster.starting. When the stop request later resumes, its own self.starting.pop(server_id) raises KeyError, which the manager returns as HTTP 500.
The opposite ordering is possible as well: server_stop() can remove the entry before add_server() reaches its finally block.
Make cleanup of ScyllaCluster.starting idempotent in both paths. add_server() remains the normal cleanup path, while server_stop() provides fallback cleanup when it wins the race.
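A minimal sketch of idempotent cleanup, assuming `starting` is a plain dict keyed by server id. The class and method names below are simplified stand-ins, not the real ScyllaCluster code.

```python
class ScyllaClusterSketch:
    """Simplified stand-in for the `starting` bookkeeping."""

    def __init__(self) -> None:
        self.starting: dict[int, str] = {}

    def begin_start(self, server_id: int, name: str) -> None:
        self.starting[server_id] = name

    def finish_add_server(self, server_id: int) -> None:
        # Normal cleanup path; tolerates server_stop() having run first.
        self.starting.pop(server_id, None)

    def server_stop(self, server_id: int) -> None:
        # Fallback cleanup; tolerates add_server() having run first.
        self.starting.pop(server_id, None)
```

Passing a default to `dict.pop` means whichever path runs second is a no-op instead of raising KeyError.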
Fixes SCYLLADB-2314
Closes scylladb/scylladb#30128
In 6165124fcc, we changed analysis of expressions in the WHERE clause
to use predicates, an annotated form of an expression that constrains a column
when the expression is set to true.
Here, we exploit this work to simplify the analysis further, reusing already computed
attributes rather than re-analyzing the expression.
Not backporting, this is a refactor with no functional change and no bugs fixed.
Closes scylladb/scylladb#30049
* github.com:scylladb/scylladb:
cql3: statement_restrictions: simplify find_idx to return only the index
cql3: statement_restrictions: replace has_only_eq_binops with tracked booleans
cql3: statement_restrictions: use index-selection predicates for value_for_index_partition_key
cql3: statement_restrictions: replace find_clustering_order with predicate order field
cql3: statement_restrictions: replace has_partition_token with variant check
cql3: statement_restrictions: replace has_slice with predicate is_slice check
cql3: statement_restrictions: replace contains_multi_column_restriction filter with _has_multi_column
cql3: statement_restrictions: remove unused find_needs_filtering and has_slice_or_needs_filtering
cql3: statement_restrictions: replace has_slice_or_needs_filtering with tracked bool
cql3: statement_restrictions: replace contains_multi_column_restriction with _has_multi_column
cql3: statement_restrictions: replace find_needs_filtering with predicate op check
cql3: statement_restrictions: replace find_binop is_on_collection with tracked bool
cql3: statement_restrictions: replace find_binop column extraction with predicate on field
cql3: statement_restrictions: set op on all binary-operator-derived predicates
The expression returned as the second element of find_idx()'s pair was
stored in view_indexed_table_select_statement::_used_index_restrictions
but never read — dead code. Simplify find_idx() to return just the
optional<index>, and remove the dead member and constructor parameter
from view_indexed_table_select_statement.
The now unused _idx_restrictions is also removed.
Per-table toggle: disabled-at-create, alter-disable, alter-reenable.
LWT exemption: Paxos learn must bypass the guardrail.
Multi-category independence: all three guardrails warn/reject
independently when SSTable records span partition, row, and collection
categories.
The test_internode_compression_between_datacenters test was flaky due to
proxy servers and leased host IPs not being cleaned up on failure paths.
If any exception occurred after proxies were started (e.g. during
server_start or driver_connect), the asyncio.Server listeners remained
bound and leased hosts were never released back to HostRegistry. On
subsequent test runs, this caused EADDRINUSE (errno 98) when trying to
bind the same address:port.
Wrap the proxy/server lifecycle in try/finally to ensure proxies are
always stopped and hosts are always released, regardless of whether
the test succeeds or fails.
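The shape of the fix can be sketched with fake resources. `FakeProxy`, `FakeHostRegistry` and `run_with_proxy` are hypothetical names for illustration; the real test uses asyncio.Server listeners and HostRegistry.

```python
import asyncio


class FakeProxy:
    def __init__(self) -> None:
        self.running = False

    async def start(self) -> None:
        self.running = True

    async def stop(self) -> None:
        self.running = False


class FakeHostRegistry:
    def __init__(self) -> None:
        self.leased: set[str] = set()

    def lease(self, host: str) -> str:
        self.leased.add(host)
        return host

    def release(self, host: str) -> None:
        self.leased.discard(host)


async def run_with_proxy(registry: FakeHostRegistry, proxy: FakeProxy, body) -> None:
    host = registry.lease("127.0.10.1")
    try:
        await proxy.start()
        await body(host)
    finally:
        # Runs on success and on any exception raised after the lease.
        await proxy.stop()
        registry.release(host)
```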
Fixes: SCYLLADB-2183
Closes scylladb/scylladb#30127
Tests for partition size and row-count guardrails: hard-limit rejection,
disabled-when-zero, soft-limit log warnings, and no-warning below
threshold. Includes shared helpers and log assertion utilities used by
subsequent commits.
8 tests covering the record_compare template comparator,
intrusive multiset equal_range grouping with heterogeneous
lookup_key, and auto_unlink on record destruction.
Simulated rolling upgrade: start a 2-node cluster where one node
suppresses the LARGE_DATA_GUARDRAILS feature, verify that enabling
guardrails is rejected, then upgrade the old node and verify that
enabling guardrails succeeds.
Change s3 log level from TRACE to DEBUG in backup tests.
TRACE level generates excessive log volume with too much low-level
detail about S3 operations. While it was useful in the early days
of the S3 client, nowadays DEBUG level likely provides sufficient
diagnostic information for backup test troubleshooting.
The reduced log volume significantly improves test performance, which
is the main outcome of this change:
- Less I/O time writing logs during test execution
- Faster teardown: each test scans all server logs for errors, and
smaller logs mean faster grep operations (23.3s → 9.97s for 8-node
cluster teardown)
Impact on test_restore_with_streaming_scopes[topology4] (8 nodes):
- Log volume: 49 MB → 23 MB (reduced by half)
- Test runtime: 82.55s → 57.53s (30% faster)
- Teardown time: 23.3s → 9.97s (57% faster)
Tests that start smaller clusters also have notable timing improvements.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closes scylladb/scylladb#30109
Similar to e5e6608f20 ("sstables_loader: prevent use-after-free on
table drop during streaming") which fixed the same class of race for
load_and_stream, the tablet restore path also holds a replica::table&
reference across the download_sstable() coroutine without preventing
concurrent table destruction.
If DROP KEYSPACE is applied while download_sstable() is writing SSTable
components to the table's data directory, the directory is removed
mid-write causing ENOENT → abort (with --abort-on-internal-error).
Fix by acquiring a stream_in_progress() phaser guard after
find_column_family() and before download_sstable(). table::stop()
calls _pending_streams_phaser.close() which blocks until all
outstanding guards are released, keeping the table alive for the
duration of the download.
Fixes: SCYLLADB-2187
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#30094
The schema_registry_grace_period field on schema_ctxt was only used by
schema_registry itself for eviction timing. Move it to be a direct member
of schema_registry, passed at init() time. This removes one db::config
dependency from schema_ctxt.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closes scylladb/scylladb#30038
Several error injection call sites use the verbose handler-lambda API when simpler alternatives already exist in the framework. This series converts them to use the appropriate overloads, reducing boilerplate and making the injection intent immediately obvious from the call site.
Cleaning up in-code debugging facilities, no need to backport
Closes scylladb/scylladb#29962
* github.com:scylladb/scylladb:
error_injection: Convert handler-style breakpoints to wait_for_message sugar
error_injection: Convert no-op handler injections to enter()/is_enabled()
error_injection: Convert handler-throw injections to lambda-throw style
utils: Add share_messages parameter to breakpoint injection API
Replace random port selection in MinIO and JMX test helpers with fixed
ports on unique per-test loopback IPs, eliminating TOCTOU races.
Commits:
- kmip_wrapper: default hostname to 127.0.0.1
- nodetool: bind JMX to the per-module loopback IP with fixed port 7199
- minio: use fixed service and console ports on a unique HostRegistry IP
instead of probing the ephemeral range; raise on start failure
Fixes: SCYLLADB-1817
Minor improvement, no need to backport.
Closes scylladb/scylladb#29741
* github.com:scylladb/scylladb:
test/pylib: use fixed MinIO ports on unique loopback IPs
test/nodetool: bind JMX to per-module loopback IP
test/pylib: default KMIP wrapper to loopback
Add a per-table large_data_guardrails_enabled flag controlled via the CQL
table property WITH large_data_guardrails_enabled = true|false.
Store the flag as a boolean column in system_schema_ext.scylla_tables.
Only write a live cell when enabled; when disabled (the default), omit
the cell entirely so that old nodes that don't know this column can
still read the SSTable during rolling upgrade or rollback. When the
property transitions from true to false via ALTER TABLE, a tombstone is
written in make_update_table_mutations to override the previous live
cell — this is safe because the CQL feature gate ensures all nodes are
upgraded before the property can be set to true.
Gate the CQL property behind the LARGE_DATA_GUARDRAILS cluster feature:
attempting to set large_data_guardrails_enabled = true before all nodes
advertise the feature raises a ConfigurationException.
When two partition keys share the same token, their relative order is
determined by their raw serialized bytes (legacy_tri_compare), which
matches the physical on-disk order in SSTables. The validator was
using partition_key::tri_compare instead — a type-aware comparator
that can disagree with byte order for types like timeuuid.
The result was a false-positive "out-of-order partition key" error
for any two same-token partitions whose timeuuid (or other type-aware)
order is the reverse of their byte order. In scrub mode this caused
the second partition to be silently dropped.
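How byte order and a type-aware order can disagree is easy to show with Python's uuid module. The time-based comparison below is a simplification standing in for the type-aware comparator, not Scylla's actual timeuuid comparison, and the helper names are hypothetical.

```python
import uuid


def make_timeuuid(time_low: int, time_mid: int, time_hi: int) -> uuid.UUID:
    """Build a version-1 UUID from its three timestamp fields."""
    return uuid.UUID(fields=(time_low, time_mid, 0x1000 | time_hi, 0x80, 0x00, 0))


def three_way(a, b) -> int:
    return (a > b) - (a < b)


def byte_compare(a: uuid.UUID, b: uuid.UUID) -> int:
    """Compare the raw serialized bytes, as the on-disk order does."""
    return three_way(a.bytes, b.bytes)


def time_compare(a: uuid.UUID, b: uuid.UUID) -> int:
    """Compare by embedded timestamp (simplified type-aware order)."""
    return three_way(a.time, b.time)


# time_low is serialized first, but it holds the least significant
# timestamp bits, so the two orders disagree for this pair.
EARLIER = make_timeuuid(0x00000002, 0x0000, 0x000)  # timestamp 2
LATER = make_timeuuid(0x00000001, 0x0001, 0x000)    # timestamp 2**32 + 1
```

Here `time_compare(EARLIER, LATER)` is negative while `byte_compare(EARLIER, LATER)` is positive, which is the kind of disagreement that led to the false "out-of-order" report.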
Fixes: SCYLLADB-2304
Closes scylladb/scylladb#30120
When partition_split_builder splits a tablet metadata partition into
multiple mutations, the first mutation gets the partition tombstone
and/or static row while subsequent mutations contain only clustered
rows. The hint logic would correctly clear tokens (marking a full
partition read) upon seeing the tombstone in the first mutation, but
then re-add tokens when processing the subsequent row-only mutations.
This caused update_tablet_metadata to attempt a point update via
mutate_tablet_map_async on a tablet map that doesn't exist yet during
bootstrap, throwing no_such_tablet_map and failing the snapshot transfer.
Fix by adding a full_read flag to table_hint. Once a full partition read
is decided (due to partition tombstone, range tombstone, static row, or
row deletion), the flag prevents subsequent mutations for the same table
from re-adding tokens. Additionally, fall back to a full partition read
when the tablet map is missing locally, which happens when the joining
node receives tablet metadata for a table it has never seen before.
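The sticky-flag idea can be sketched as follows. This is a simplified model: mutations are plain dicts, `TableHint` and `process_mutation` are hypothetical names, and only the partition tombstone and static row triggers are modeled.

```python
from dataclasses import dataclass, field


@dataclass
class TableHint:
    tokens: set[int] = field(default_factory=set)
    full_read: bool = False

    def mark_full_read(self) -> None:
        # An empty token set together with full_read means
        # "re-read the whole partition".
        self.full_read = True
        self.tokens.clear()

    def add_token(self, token: int) -> None:
        # Once a full read is decided, later mutations must not
        # narrow it back to a point update.
        if not self.full_read:
            self.tokens.add(token)


def process_mutation(hint: TableHint, mutation: dict) -> None:
    if mutation.get("partition_tombstone") or mutation.get("static_row"):
        hint.mark_full_read()
    for token in mutation.get("row_tokens", []):
        hint.add_token(token)
```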
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-2303.
Needs backports to 2026.1+. 2026.1 introduces the regression with b17a36c071
Closes scylladb/scylladb#30115
* github.com:scylladb/scylladb:
tablets: fall back to full partition read when tablet map is missing
tablets: fix hint re-adding tokens after full partition read decision
Only enable the memory controller in cgroup subtree_control instead of all available controllers. cpu.stat is available in cgroup v2 without enabling the cpu controller (base accounting), and enabling io/pids/cpu controllers adds unnecessary per-operation kernel overhead to Scylla processes - particularly the memory controller's per-page-cache-operation accounting combined with io controller overhead during heavy I/O.
Additionally, restrict SystemResourceMonitor to the master process only. System-wide metrics (CPU%, memory) are identical from any process, so running a monitoring thread in each xdist worker was redundant and added unnecessary SQLite write contention and thread scheduling noise.
Replace cpu_percent(interval=0.1) with a non-blocking cpu_percent()
that returns CPU% since the previous call. Use stop_event.wait(timeout=2.0) as the
loop control to both space out iterations and allow immediate shutdown responsiveness.
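The loop structure can be sketched as below. `monitor_loop` and its parameters are hypothetical; in the real monitor the sampler would be a non-blocking call such as psutil's `cpu_percent()` with no interval, which is left out here so that the sketch needs only the standard library.

```python
import threading
from typing import Callable


def monitor_loop(sample: Callable[[], float],
                 record: Callable[[float], None],
                 stop_event: threading.Event,
                 interval: float = 2.0) -> None:
    """Sample periodically until `stop_event` is set.

    Event.wait() returns False on timeout and True once the event is
    set, so it both paces the loop and lets shutdown interrupt the
    pause immediately.
    """
    while not stop_event.wait(timeout=interval):
        record(sample())
```

Compared with a blocking sample followed by a sleep, the thread spends its idle time in a wait that a shutdown request can cut short.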
Fixes SCYLLADB-2141
Closes scylladb/scylladb#29987
* github.com:scylladb/scylladb:
test: use non-blocking cpu_percent in SystemResourceMonitor
test.py: reduce cgroup overhead in resource metrics gathering
Extract vector_indexed_table_select_statement and its filter logic out of
the monolithic select_statement.cc and vector_search/ module into a
dedicated directory cql3/statements/external_search/.
This improves modularity and eliminates a circular dependency between cql3
and vector_search: the filter code depends heavily on cql3 types
(expressions, query_options, statement_restrictions) and belongs in the cql3
layer. Follow-up to VECTOR-250 which originally addressed the same
dependency but has since regressed.
This is also a preparatory refactoring for full-text search select statements,
which can share some implementation with the vector search.
Pure refactoring, no semantic changes - no need for backporting.
Closes scylladb/scylladb#30100
* github.com:scylladb/scylladb:
vector_index: move filter into cql3/statements/external_search
cql3: extract vector_indexed_table_select_statement into own compilation unit
vector_index: split query_base_table to return raw coordinator_result
- Fix table drop blocking for the full client timeout when in-flight writes can't reach quorum
- Handle unhandled timeout exception in the wait-for-leader loop during group startup
When a strongly consistent table is dropped, `schedule_raft_group_deletion()` calls `g->close()`, which waits for all in-flight operations to release their gate holders. But other nodes may have already destroyed their raft servers for this group, so an in-flight write on the leader cannot reach quorum and hangs until the client timeout expires (~seconds), unnecessarily delaying group deletion.
Additionally, the wait-for-leader loop in groups_manager::update() uses abort_on_expiry with a 60-second timeout but never catches the exception if it fires, leaving the group in an indeterminate state.
SCYLLADB-2080 fix:
- Reorder `schedule_raft_group_deletion`: initiate gate close (prevents new operations), then abort the raft server (unblocks stuck writes by causing `raft::stopped_error`), then await the gate future (resolves immediately since holders are released).
- Handle `raft::stopped_error` in the coordinator's top-level catch blocks (both write and read paths): if the table no longer exists, return `no_such_column_family` (CQL layer converts to InvalidRequest: unconfigured table). Otherwise fall through to the default timeout handling.
- Replace gate->hold() with try_hold() + on_internal_error in acquire_server, with a comment explaining why the gate can never be closed at that point (table removal in `schema_applier::commit_on_shard` precedes gate closure, with no scheduling point in between).
Timeout handling fix:
- Use `coroutine::as_future` in the wait-for-leader loop to catch timeout exceptions gracefully — log a warning and break out instead of propagating unhandled.
Includes a cluster test reproducer (test_drop_table_unblocks_stuck_write) that:
1. Pauses a write on the leader before add_entry
2. Drops the table (follower destroys its group immediately)
3. Resumes the write — verifies it fails promptly with InvalidRequest ("unconfigured table") instead of hanging for 15 seconds
backport: no need, strong consistency is not released yet
Fixes: SCYLLADB-2080
Closes scylladb/scylladb#30105
* github.com:scylladb/scylladb:
strong consistency/groups_manager: handle timeout in update() wait-for-leader loop
strong consistency: abort raft server before gate close when dropping a table
test/cluster: rewrite test_queries_while_dropping_table for SCYLLADB-2080
Move prepared_filter, prepared_restriction, prepared_rhs types and
prepare_filter() from vector_search/filter.{hh,cc} into new files
cql3/statements/external_search/filter.{hh,cc} under namespace
cql3::statements::external_search.
This eliminates a circular dependency between the cql3 and vector_search
modules: the filter code depends heavily on cql3 types (expressions,
query_options, statement_restrictions) and belongs in the cql3 layer.
This is a follow-up to VECTOR-250 which originally addressed the same
circular dependency but has since regressed.
When partition_split_builder splits a tablet metadata partition into
multiple mutations, the first mutation gets the partition tombstone and/or
static row while subsequent mutations contain only clustered rows.
The tablet metadata change hint logic would correctly clear tokens (marking
a full partition read) upon seeing the tombstone in the first mutation,
but then re-add tokens when processing the subsequent row-only mutations.
This caused update_tablet_metadata to attempt a point update via
mutate_tablet_map_async on a tablet map that doesn't exist yet during
bootstrap, throwing no_such_tablet_map and failing the snapshot transfer.
Fix by adding a full_read flag to table_hint. Once a full partition read
is decided (due to partition tombstone, range tombstone, static row, or
row deletion), the flag prevents subsequent mutations for the same table
from re-adding tokens.
After schema reload, `target_parser::is_local()` did not recognize the
vector-index local target format `{"pk": [...], "tc": "..."}`, causing
local vector indexes to be treated as global. This broke duplicate
detection when both a global and a local vector index existed on the same
column. Fix by introducing `vector_index::is_local()` and dispatching
to it from `create_index_from_index_row()` based on the index class.
Also adds tests for local/global vector index coexistence.
Fixes: SCYLLADB-987
backport reasoning: we added local vector index support in 2026.1
Closes scylladb/scylladb#29492
* github.com:scylladb/scylladb:
test/cqlpy: add tests for global and local vector index coexistence
index: fix local vector index locality detection after schema reload
When a strongly consistent table is dropped, schedule_raft_group_deletion()
used to call g->close() first, which waits for all in-flight operations to
release their gate holders. But other nodes may have already destroyed their
raft servers for this group, so an in-flight write on the leader cannot
reach quorum and hangs until the client timeout expires, unnecessarily
delaying group deletion.
Fix: initiate gate close (prevents new operations from entering), then
abort the raft server (causes in-flight add_entry/read_barrier to throw
raft::stopped_error, releasing their gate holders), then await the gate
future (resolves immediately since holders are now released).
Handle raft::stopped_error in the coordinator's top-level catch blocks
(both write and read paths): if the table no longer exists, return
no_such_column_family (which the CQL layer converts to InvalidRequest
'unconfigured table'). Otherwise fall through to the default timeout
handling.
Also replace gate->hold() with try_hold() + on_internal_error in
acquire_server, and handle the timeout exception in the wait-for-leader
loop in update() gracefully (log + break instead of propagating).
Fixes: SCYLLADB-2080
Rewrite the test to use 2 nodes (RF=2) instead of 1 (RF=1), which exposes
the quorum-loss scenario: when a table is dropped, the follower destroys
its raft group immediately while the leader's in-flight operations are
still holding the gate.
The test pauses both a read and a write on the leader, drops the table,
then resumes them. Both are expected to fail with 'no such column family'
since the raft server is aborted as part of group deletion. A 15-second
timeout guard detects the old buggy behavior (write stuck forever).
Marked xfail until the fix is applied in the next commit.
Replace all uses of the deprecated seastar::smp::count with this_smp_shard_count() and smp::all_cpus() with this_smp_all_shards() across the ScyllaDB codebase (seastar submodule untouched).
Both replacement functions require a reactor thread context. All call sites were verified to run on reactor threads.
Notable cases:
- dht/token-sharding.hh: this_smp_shard_count() is used as a default parameter value. This is safe since all callers are on reactor threads, but the expression is now evaluated at each call site rather than being a reference to a global variable.
- service/storage_service.hh, locator/abstract_replication_strategy.hh, ent/encryption/encryption.cc: used in default member initializers and constructor member-init-lists. Objects are always constructed on reactor threads.
- schema_builder: sometimes called from BOOST_AUTO_TEST_CASE without a reactor. Added a pre-patch that makes the implicit shard count parameter explicit and passes 1 in those cases.
Not changed:
- scylla-gdb.py: reads smp::count as a GDB symbol (no reactor context).
- Python test files: only reference smp::count in comments/strings.
No backport: the Seastar commit that deprecated these functions hasn't made (and won't make) its way into any release branches (and the warnings are cosmetic anyway).
Closes scylladb/scylladb#29990
* github.com:scylladb/scylladb:
treewide: replace deprecated smp::count and smp::all_cpus() with new APIs
scylla-gdb: read shard count from smp::_this_smp instead of smp::count
schema_builder: make shard_count an explicit constructor parameter
Replace all uses of the deprecated seastar::smp::count with
this_smp_shard_count() and smp::all_cpus() with this_smp_all_shards()
across the ScyllaDB codebase (seastar submodule untouched).
Both replacement functions require a reactor thread context. All call
sites were verified to run on reactor threads.
Notable cases:
- dht/token-sharding.hh: this_smp_shard_count() is used as a default
parameter value. This is safe since all callers are on reactor threads,
but the expression is now evaluated at each call site rather than being
a reference to a global variable.
- service/storage_service.hh, locator/abstract_replication_strategy.hh,
ent/encryption/encryption.cc: used in default member initializers and
constructor member-init-lists. Objects are always constructed on reactor
threads.
Not changed:
- scylla-gdb.py: reads smp::count as a GDB symbol (no reactor context).
- Python test files: only reference smp::count in comments/strings.
When a mutation generates more view updates than max_rows_for_view_updates
(100), view_update_builder::build_some() splits the work into multiple
batches. There was a bug in how fragments were read between batches:
When should_stop_updates() returned true, the old code called stop()
which returned stop_iteration::yes without reading the next fragments.
On the next build_some() call, read_both_next_fragments() was called
at the start, which advanced BOTH readers - skipping any fragment that
was already read but not yet consumed. A row could be left unconsumed if
either:
- the 100th (last in the batch) update was a row insertion and we still
had insertions/updates remaining
- the 100th (last in the batch) update was a row deletion and we still
had deletions/updates remaining
For the most common case where work is split in batches, i.e. range
deletions, we couldn't hit this because range delete generates only
view row deletions.
On tables with a single materialized view, we also couldn't get this
for any batches with fewer than 50 statements (unless the batch also
contained range deletions), because one non-range-delete update can
generate up to 2 view updates.
However, for a range of scenarios outside these 2, we could lose
view updates, resulting in persistent inconsistencies.
The fix:
- read_*_next_fragment() now accept a stop_iteration parameter, so the
next fragments are always read after consuming (even when stopping),
but stop_iteration::yes is correctly propagated to break the loop.
- build_some() no longer re-reads fragments at the start. Instead, an
initialize() method performs the initial read once at construction.
- because we now only advance readers after consuming, we never advance
them past end_of_partition, so the break condition is extended to
accept readers that either evaluate to `false` or are at
end_of_partition. We also handle the _skip_row_updates optimization.
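The skipped-fragment failure mode can be reproduced in a small model. The
sketch below is a simplified Python illustration, not the actual
view_update_builder code: the class name, the integer keys and the `buggy`
flag are assumptions made for the example. In buggy mode a batch stops
without advancing the consumed reader and the next batch advances both
readers; in fixed mode the initial read happens once at construction and
only the consumed reader is advanced, even when stopping.

```python
from typing import Iterable, List, Optional, Tuple

Op = Tuple[str, int]


class BatchedMerger:
    """Toy model of merging an update stream with existing rows in batches."""

    def __init__(self, updates: Iterable[int], existing: Iterable[int],
                 batch_size: int, buggy: bool) -> None:
        self._updates = iter(updates)
        self._existing = iter(existing)
        self._batch_size = batch_size
        self._buggy = buggy
        self._update: Optional[int] = None
        self._existing_row: Optional[int] = None
        if not buggy:
            # Fixed behaviour: perform the initial read once, at construction.
            self._read_both()

    def _read_both(self) -> None:
        self._update = next(self._updates, None)
        self._existing_row = next(self._existing, None)

    def build_some(self) -> List[Op]:
        if self._buggy:
            # Old behaviour: every batch starts by advancing BOTH readers.
            self._read_both()
        batch: List[Op] = []
        while self._update is not None or self._existing_row is not None:
            u, e = self._update, self._existing_row
            use_update = u is not None and (e is None or u <= e)
            use_existing = e is not None and (u is None or e <= u)
            if use_update and use_existing:
                batch.append(("update", u))
            elif use_update:
                batch.append(("insert", u))
            else:
                batch.append(("delete", e))
            stop = len(batch) >= self._batch_size
            if stop and self._buggy:
                # Old behaviour: stop without reading the next fragments.
                break
            # Advance only the reader(s) whose fragment was just consumed.
            if use_update:
                self._update = next(self._updates, None)
            if use_existing:
                self._existing_row = next(self._existing, None)
            if stop:
                break
        return batch


def drain(merger: BatchedMerger) -> List[int]:
    """Call build_some() until it returns an empty batch; collect the keys."""
    keys: List[int] = []
    while True:
        batch = merger.build_some()
        if not batch:
            return keys
        keys.extend(key for _, key in batch)
```

With updates [1, 3, 5], existing rows [2, 4, 6] and a batch size of 2, the
fixed mode yields all six keys, while the buggy mode drops keys 3 and 6:
each was already read by the reader that was not consumed last, and was
then skipped when both readers were advanced at the start of the next batch.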
Fixes: scylladb/scylladb#29155
Closes scylladb/scylladb#29498
Replace verbose handler lambdas that only log and call
wait_for_message() with the equivalent one-liner breakpoint sugar.
The behavior is identical -- the sugar produces the same log messages
in the format "{name}: waiting for message" / "{name}: message received".
Update Python tests that waited for the old ad-hoc log messages to
match the new standardized format.
Converted injections:
- topology_state_load_before_update_cdc (storage_service.cc)
- migration_streaming_wait x2 (storage_service.cc)
- pause_after_streaming_tablet (storage_service.cc)
- cdc_generation_publisher_fiber (topology_coordinator.cc)
- wait_after_tablet_cleanup (topology_coordinator.cc)
- fast_orphan_removal_fiber (topology_coordinator.cc)
- split_storage_groups_wait (table.cc)
- wait_before_stop_compaction_groups (table.cc)
- tasks_vt_get_children (task_manager.cc)
- truncate_compaction_disabled_wait (database.cc)
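A test that waits on the standardized messages only needs the injection
name. The helper below is a hypothetical illustration (its name is not
part of the test suite); it builds the two patterns in the format quoted
above and escapes the name so it can be used in a regex search over a log.

```python
import re
from typing import Tuple


def breakpoint_log_patterns(name: str) -> Tuple[str, str]:
    """Return regexes for the 'waiting' and 'received' lines of an injection."""
    escaped = re.escape(name)
    return (f"{escaped}: waiting for message",
            f"{escaped}: message received")
```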
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Queries are stored and passed around as sstring/std::string_view. Normally they are small enough not to cause problems, but as the `test_cdc_large_values.TestLargeColumnsWithCDC.test_single_column_blob_max_size_with_cdc_preimage_full_postimage[unprepared_statements]` test demonstrates, queries can be arbitrarily large, putting heavy strain on Scylla internals via large allocations and, in the extreme case, causing denial of service.
This PR attempts to alleviate this by using fragmented storage for queries: read the query as a fragmented string from the input stream in `transport/server.cc`, propagate it as such to `query_processor::prepare()`, and store it as such in `cql3::cql_statement::raw_cql_statement`. Also avoid linearizing raw values in the CQL expression tree: switch `cql3::expr::untyped_constant::raw_text` to fragmented storage.
For this to be possible, some infrastructure code had to be made fragmented storage friendly: ascii/utf8 validation, hashers, from_hex and importantly: `abstract_type::from_string()`.
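To illustrate what "fragmented storage friendly" means for such helpers,
here is a minimal Python sketch. It is not the Scylla implementation; the
function names and the fragment size are assumptions. Hashing is fed one
fragment at a time, and hex decoding carries a dangling digit over a
fragment boundary, so neither needs the whole input in one contiguous
buffer.

```python
import hashlib
from typing import Iterable, Iterator, List


def split_fragments(data: bytes, size: int) -> List[bytes]:
    """Cut a buffer into fixed-size fragments (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def hash_fragments(fragments: Iterable[bytes]) -> str:
    """Hash fragment by fragment; the digest equals that of the joined input."""
    hasher = hashlib.sha256()
    for fragment in fragments:
        hasher.update(fragment)
    return hasher.hexdigest()


def from_hex_fragments(fragments: Iterable[bytes]) -> Iterator[bytes]:
    """Decode hex text fragment by fragment, yielding decoded fragments."""
    carry = b""
    for fragment in fragments:
        pending = carry + fragment
        usable = len(pending) - (len(pending) % 2)
        # A hex pair may straddle a boundary: keep the odd digit for later.
        carry = pending[usable:]
        if usable:
            yield bytes.fromhex(pending[:usable].decode("ascii"))
    if carry:
        raise ValueError("odd number of hex digits")
```

Each step only touches one fragment plus at most one carried byte, so peak
contiguous memory is bounded by the fragment size rather than by the size
of the query.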
Unfortunately, the query still has to be linearized for parsing itself, as ANTLR -- although it allows custom InputStream implementations -- plays pointer arithmetic games with the pointers obtained from them, so fragmented input cannot be used.
Still, this PR limits the places where the query is linearized to the
following:
* Parsing
* Audit
* Logs and error messages
So the normal query paths for queries that actually can get arbitrarily large (UPDATE and INSERT) should only linearize the query temporarily for parsing.
Fixes #10779
Improvement, no backport
Closes scylladb/scylladb#28619
* github.com:scylladb/scylladb:
tracing: add_query(): change query param to utils::chunked_string
cql3: store raw query string in utils::chunked_string
serializer: add serializer<utils::chunked_string>
utils/reusable_buffer: add get_linearized_view(managed_bytes_view)
cql3/expr: use utils::chunked_string for untyped_constant::raw_text
types: abstract_type::from_string() switch to fragmented buffers (implementation)
types: abstract_type::from_string() switch to fragmented buffers (interface)
types: use write_fragmented from utils/fragment_range.hh
types: timestamp_from_string(): don't assume std::string_view is null-terminated
types/duration: don't assume std::string_view is null-terminated
utils/hashers: add calculate(managed_bytes_view) overload
utils/ascii: add validate(managed_bytes_view) overload
utils: add managed_bytes_fwd.hh
utils: add chunked_string
utils: add managed_bytes_basic_view::byte_iterator
Add test_best_effort_setup_table_failure_is_consumed which
triggers a setup_table() failure via a missing keyspace and
asserts no abandoned future escapes. This guards against
regressions where the detached future loses its exception
handler.
Remove the test_skipped_no_error_injection placeholder since
the new test runs unconditionally, keeping the suite non-empty
in all build modes.
A recent Seastar update deprecated smp::count and introduced
this_smp_shard_count() as a replacement. One difference is that
this_smp_shard_count() wants to run on a reactor thread.
This poses a problem for non-reactor tests (BOOST_AUTO_TEST_CASE)
that nevertheless use a schema, as the schema_builder constructor
references smp::count. If we replace it with this_smp_shard_count()
then it will crash when running without a reactor.
To fix, remove the implicit this_smp_shard_count() call from raw_schema's
constructor and require callers to pass shard_count explicitly to
schema_builder. This allows tests that don't run on a reactor thread
to construct schemas without crashing.
Production code and reactor-based tests pass this_smp_shard_count().
Non-reactor test files (expr_test, keys_test, nonwrapping_interval_test,
wrapping_interval_test, bti_key_translation_test, range_tombstone_list_test)
pass a fixed shard count of 1.
Note: sstable_test.cc is a Seastar test file (SEASTAR_THREAD_TEST_CASE)
but also contains one plain BOOST_AUTO_TEST_CASE
(test_empty_key_view_comparison) that constructs a schema_builder without
a reactor context. This test also receives a fixed shard count of 1.
The purpose of this test is to verify that the task manager's "wait" API
works correctly for vnodes-to-tablets migration virtual tasks. It starts
a `wait_task` HTTP request concurrently with a finalize (or rollback)
operation, and asserts that the wait returns the correct final state
("done" or "suspended").
The test uses `asyncio.create_task()` to wrap the wait request into a
task, and then immediately calls finalize. With asyncio's lazy task
scheduling, the wait coroutine does not start until the event loop
yields, so the finalization request reaches the server before wait, and
therefore may also complete before it. Once finalization completes, the
virtual migration task is no longer discoverable, causing a
"task not found" error.
Add a log message in Scylla's wait handler and a synchronization point
in the test to ensure that the wait request reaches the server before
finalization. This follows the same pattern used in
`test_tablet_tasks.py::check_and_abort_repair_task`.
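The ordering problem and the fix can be shown with plain asyncio. This is
a minimal sketch, not the test itself: the coroutine names are made up,
and an `asyncio.Event` stands in for waiting on the server log message.

```python
import asyncio
from typing import List


async def run(synchronize: bool) -> List[str]:
    """Record the order in which the two 'requests' are sent."""
    events: List[str] = []
    wait_started = asyncio.Event()

    async def wait_task() -> None:
        events.append("wait sent")
        wait_started.set()

    async def finalize() -> None:
        events.append("finalize sent")

    # create_task() only schedules the coroutine; its body does not run
    # until the current coroutine yields to the event loop.
    task = asyncio.create_task(wait_task())
    if synchronize:
        # Synchronization point: yield until the wait request has started.
        await wait_started.wait()
    await finalize()
    await task
    return events
```

Without the synchronization point `finalize()` runs first, because awaiting
a coroutine directly does not give the scheduled task a chance to start.
With it, the wait request is always sent before finalization.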
Fixes SCYLLADB-2077
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closes scylladb/scylladb#29973
Several error injection sites use the low-level get_injection_parameters() API to fetch the entire parameters map and then manually look up a single key. The inject_parameter() API is better suited for these cases — it combines the enabled check and typed single-parameter extraction in one call, returning std::optional.
Cleaning error injection usage, not backporting
Closes scylladb/scylladb#29970
* github.com:scylladb/scylladb:
test: Use inject_parameter() in row_cache_test
sstables: Use inject_parameter() for mx reader fill buffer timeout
streaming: Use inject_parameter() for order_sstables_for_streaming
Migrate mock-based rescoring and oversampling tests from
test/vector_search/rescoring_test.cc to pytest and delete the C++ file.
Index option validation tests go to test_vector_index.py; rescoring tests
go to a new test_vector_search_rescoring.py which introduces shared
infrastructure (EmbeddingRow dataclass, TEST_DATA dict,
reversed_ann_response() helper, rescoring_test_table() context manager).
Two tests have updated assertions (semantic change):
filters_invalid_similarity_scores now uses per-function expected result
sets including a zero-vector row, and rescoring_with_zerovector_query
asserts empty results after NaN filtering (cosine only). Both are marked
xfail pending SCYLLADB-924.
Follow-up to #29593.
Does not require backport - simple refactoring of tests
Closes scylladb/scylladb#29906
* github.com:scylladb/scylladb:
test/vector_search: migrate zero-vector query rescoring test to pytest; delete rescoring_test.cc
test/vector_search: migrate invalid similarity score filtering test to pytest
test/vector_search: migrate non-ANN similarity argument rescoring test to pytest
test/vector_search: migrate wildcard select rescoring test to pytest
test/vector_search: migrate similarity_function rescoring test to pytest
test/vector_search: migrate rescoring and f32 quantization tests to pytest
test/vector_search: migrate oversampling tests to pytest
test/vector_search: migrate vector_index option validation tests to pytest
The previous patch changed the interface and callers, this one updates
the implementation to actually work with fragmented buffers. Most types
just use with_linearized() to linearize the fragmented input buffer for
parsing. This is fine, as most types have a fixed or bounded-size string
representation that is small.
Importantly, the input is not linearized for the 3 types which have
unbounded values: ascii, bytes and text. The tuple type can contain any
of these types itself, so it is also converted to avoid linearization.