scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-05 06:23:03 +00:00

Author	SHA1	Message	Date
Gleb Natapov	492a75ffbb	raft: separate group0 server start from in-memory state machine enablement Split start_server_for_group0 so it only starts the raft server and replays the log (applying mutations to system tables), without loading the state into memory. A new enable_group0_state_machine() method is added which callers must invoke explicitly after all dependencies (CDC generation service, non-system schemas, etc.) are available. This prepares for moving setup_group0_if_exist earlier in the startup sequence so the raft log can be replayed before non-system keyspaces are loaded, while deferring the in-memory state loading until after all dependencies are initialized.	2026-06-02 12:42:49 +03:00
Botond Dénes	3613d9a07d	test/cqlpy/nodetool: fix NameError in compact_keyspace nodetool path compact_keyspace() operates on a whole keyspace and has no 'cf' variable in scope, but the nodetool fallback branch mistakenly passed it to args.extend([ks, cf]), which would raise NameError whenever that path was taken. Fix by passing only the keyspace. Closes scylladb/scylladb#30097	2026-06-02 08:44:06 +03:00
Łukasz Paszkowski	a84d9cb8c4	test_backup.py: fix race in test_restore_tablets_vs_migration The test was racing move_tablet against restore_tablets without ensuring that move_tablet had actually reached the streaming phase before restore began. This caused restore to win the group0 race, putting the tablet into transition first, which made move_tablet fail with "Tablet is in transition". Fix by adding a log message to the block_tablet_streaming error injection and waiting for it in the test, ensuring the move has entered the streaming phase (and is blocked) before restore starts. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-2147 Closes scylladb/scylladb#30173	2026-06-02 08:17:54 +03:00
Nadav Har'El	75a05fc2b3	Merge 'cql3: fix stack overflow and quadratic behavior' from Avi Kivity This series fixes two vulnerabilities: unbounded recursion during expression evaluation with deeply nested expressions, and quadratic computation with large WHERE clauses. The fixes simply bound the depth of recursion and the length of the WHERE clause. The WHERE clause limits are configurable. Nesting is less likely to be exceeded, so not configurable. Limits inspired by Common Expression Language: https://github.com/google/cel-spec/blob/master/doc/langdef.md#syntax Implementations are required to support at least: 24-32 repetitions of repeating rules; 12 repetitions of recursive rules. CVE-2026-31948 CVE-2026-31947 Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1003 Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1002 Fixes https://github.com/scylladb/scylladb/issues/14472 Closes scylladb/scylladb-ghsa-m4h7-g37h-mgxf#3 * github.com:scylladb/scylladb-ghsa-m4h7-g37h-mgxf: cql3: limit number of relations in WHERE clause cql3: add max_relations_in_where_clause to dialect test/cqlpy: add tests for WHERE clause relation count limit cql3: limit nesting depth of function calls and CASTs in CQL parser test/cqlpy: add tests for deeply nested function calls and CASTs	2026-06-01 22:31:56 +03:00
Marcin Maliszkiewicz	1dc975c491	Merge 'table_helper: observe detached setup_table() future' from Andrzej Jackowski During shutdown, group0 may be torn down while cache_table_info() has a detached setup_table() future in flight. This causes raft_group_not_found to propagate as an abandoned failed future. Add .handle_exception() to log the failure at debug level instead of leaving the future unobserved. Fixes: SCYLLADB-2224 Backport to 2026.2 and 2026.1, because the test failed on 2026.1 Closes scylladb/scylladb#30093 * github.com:scylladb/scylladb: test: table_helper: verify detached setup failure is consumed table_helper: observe detached setup_table() future	2026-06-01 19:32:34 +02:00
Aleksandra Martyniuk	33af16d808	test/cluster/test_tablets: increase timeout for test_multi_rf_of_many_keyspaces_0_N Multi-RF change handles multiple keyspaces concurrently, but tablet rebuilds are not all started at once — the load balancer considers machine load when scheduling them. With 3 keyspaces each having a base table and materialized view, the total operation time approaches the default 200s CQL timeout on slow/busy CI machines (observed at ~191s). Double the timeout to 400s to provide sufficient margin. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-2042. Closes scylladb/scylladb#30018	2026-06-01 20:07:03 +03:00
Michael Litvak	a7a7f02392	test: test_cdc_with_tablets: add read barrier Add group0 read barrier in test_cdc_with_tablets whenever we observed a condition such as tablet count change or cdc stream change, and we want to proceed to check that cdc tables are consistent with the change. For example, when we wait for tablet count change and then check the cdc streams changed as well. The problem is that when we observe the tablet count change, for example, even though the cdc streams are changed in the same group0 operation, we may observe it during the group0 apply, when the operation is only partially applied. The read barrier ensures that the change we observed is fully applied. Fixes SCYLLADB-2352 Closes scylladb/scylladb#30177	2026-06-01 13:56:01 +02:00
Avi Kivity	520b130b97	cql3: limit number of relations in WHERE clause A WHERE clause with many relations (e.g. hundreds of AND-ed conditions) can cause quadratic complexity. Check the relation count during parsing and reject queries exceeding the configurable max_relations_in_where_clause limit (default 100) with a SyntaxException. The changes to IDL don't cause problems during upgrade, because CQL forwarding is not in any released version, and because it is part of an experimental feature. CVE-2026-31947 Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1002	2026-06-01 14:01:27 +03:00
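The relation-count check described in the commit above can be illustrated with a minimal Python sketch. This is not the CQL parser's code: `split_where`, `check_relation_count`, and the local `SyntaxException` class are hypothetical stand-ins, and the splitter assumes relations are joined only by ` AND `. The default of 100 is the one stated in the commit message.

```python
class SyntaxException(Exception):
    """Local stand-in for the SyntaxException named in the commit message."""


DEFAULT_MAX_RELATIONS = 100  # default stated in the commit message


def split_where(where_clause):
    """Naive splitter (hypothetical): assumes relations are joined by ' AND ' only."""
    return [r.strip() for r in where_clause.split(" AND ") if r.strip()]


def check_relation_count(relations, max_relations=DEFAULT_MAX_RELATIONS):
    """Reject a WHERE clause whose relation count exceeds the limit."""
    if len(relations) > max_relations:
        raise SyntaxException(
            f"WHERE clause has {len(relations)} relations, "
            f"limit is {max_relations}"
        )
    return relations
```

Checking the count up front, before any further analysis, is what keeps the cost bounded: the quadratic work never starts for an oversized clause.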
Avi Kivity	fdcc44c425	cql3: add max_relations_in_where_clause to dialect Add a configurable max_relations_in_where_clause parameter (default 100) to the CQL dialect, plumbed through db::config, transport server, and test environment. This will be used by the CQL parser to reject WHERE clauses with too many relations that cause quadratic complexity.	2026-06-01 14:01:27 +03:00
Avi Kivity	1ad1c8ef7f	test/cqlpy: add tests for WHERE clause relation count limit Add tests that verify the CQL parser rejects WHERE clauses with too many relations (e.g. WHERE a=1 AND b=1 AND ... repeated 200 times), and that a reasonable number of relations (50) is still accepted.	2026-06-01 14:01:25 +03:00
Botond Dénes	bb81dbf65e	Merge 'guardrails: Add replica-side large data guardrails' from Taras Veretilnyk Adds write-path guardrails that reject or warn on mutations targeting partitions, rows, or collections that already exceed configured size thresholds, based on SSTable `large_data_record` metadata. ScyllaDB already detects and records large partitions/rows/cells in `system.large_data_records` after compaction, but takes no preventive action on the write path. Once a partition grows past operational limits it causes latency spikes, OOM, and repair failures. These guardrails let operators set hard and soft thresholds so that writes to already-oversized data are rejected (hard) or logged as warnings (soft) before they make the problem worse. - Intrusive index over SSTable metadata: A per-table `large_data_record_index` maintains three `boost::intrusive::multiset`s (partitions, rows, cells) using `auto_unlink` hooks directly on `large_data_record`. SSTable destruction automatically removes records from the index — no explicit deregistration needed. - Virtual dispatch for zero-cost disabled path: `large_data_guardrail_base` → `noop_large_data_guardrail` / `large_data_guardrail`. Tables without guardrails enabled pay only a virtual call to a no-op. No index is built or maintained for disabled tables. - Schema storage: The per-table flag is stored as a scylla_tables column, following the tablets pattern: only write a live cell when enabled, omit entirely when disabled. The CQL feature gate prevents enabling until all nodes are upgraded. - Write-path integration: The guardrail check runs in `do_apply` after the frozen mutation is deserialized but before it is applied to the memtable. Hint replay and Paxos learn skip the check via `skip_large_data_guardrails`. Uses existing `large__warn_threshold` config options as soft limits and new `large__fail_threshold` options as hard limits. 
Checked dimensions: - Partition size (bytes) - Partition row count - Row size (bytes) - Collection element count Backport is not required Fixes https://scylladb.atlassian.net/browse/SCYLLADB-180 Closes scylladb/scylladb#29733 * github.com:scylladb/scylladb: test/cqlpy: add per-table toggle, LWT exemption, and multi-category tests test/cqlpy: add large collection guardrail tests test/cqlpy: add large row guardrail tests test/cqlpy: add large partition guardrail tests test/boost: add large_data_guardrail unit tests test/cluster: add large data guardrails rolling upgrade test replica: wire large_data_guardrail into the write path schema: add per-table large_data_guardrails_enabled flag db: implement large_data_guardrail db: implement large_data_record_index sstables: add intrusive index hook to large_data_record db: add large_collection_elements_fail_threshold config option db: add large_row_fail_threshold_mb config option db: add rows_count_fail_threshold config option db: add large_partition_fail_threshold_mb config option replica: introduce large_data_exception	2026-06-01 13:26:00 +03:00
Nadav Har'El	b254a9826a	test/cluster: add pylib-style nodetool.py Tests in test/cqlpy use a tiny nodetool-like library, where calls to nodetool.flush() are translated to the parallel REST API request on Scylla - but use an external "nodetool" command when running the test against Cassandra. Some tests/cluster also began using test/cqlpy/nodetool.py, but it is NOT a good fit for test/cluster tests, because: 1. It falls back to using the external "nodetool" when it thinks the REST API is not available. In cluster tests, no such fallback is needed (these tests can't be run on Cassandra). If the REST API is down, the test should fail - not fall back to an irrelevant method. 2. The nodetool.flush() et al. functions are not async, and cluster tests are supposed (by design...) to only use async APIs. 3. test/cqlpy/nodetool.py was not written in the "style" defined for the test/cluster codebase - specifically they don't have docstrings or strong typing. This patch introduces test/pylib/nodetool.py, based on test/cqlpy/nodetool.py but fixing all the above problems - there are no Cassandra fallbacks, there are docstrings and type hints, and all the functions are async. We also fix the test/cluster tests that used test/cqlpy/nodetool.py to switch to test/pylib/nodetool.py. Of course it means the newly async functions need to be "await"ed, not just called, so this patch changes that too. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#30129	2026-06-01 13:03:29 +03:00
Piotr Szymaniak	21f1380df1	test/pylib: fix starting server cleanup race test_localnodes_joining_nodes stops a server while manager.server_add() is still waiting for that server to finish startup. Stopping the process can make the background add_server() fail and run its cleanup path first, removing the server from ScyllaCluster.starting. When the stop request later resumes, its own self.starting.pop(server_id) raises KeyError, which the manager returns as HTTP 500. The opposite ordering is possible as well: server_stop() can remove the entry before add_server() reaches its finally block. Make cleanup of ScyllaCluster.starting idempotent in both paths. add_server() remains the normal cleanup path, while server_stop() provides fallback cleanup when it wins the race. Fixes SCYLLADB-2314 Closes scylladb/scylladb#30128	2026-05-31 23:22:44 +03:00
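A minimal sketch of the idempotent cleanup idea from the commit above, assuming the `starting` collection behaves like a plain dict keyed by server id. The `StartingRegistry` class and its method names are hypothetical and are not the test/pylib code.

```python
class StartingRegistry:
    """Hypothetical stand-in for the ScyllaCluster.starting mapping."""

    def __init__(self):
        self.starting = {}

    def begin(self, server_id, server):
        """Record a server that is still starting up."""
        self.starting[server_id] = server

    def cleanup(self, server_id):
        """Remove the entry if present; safe to call from both racing paths.

        dict.pop() with a default never raises KeyError, so whichever of
        add_server() or server_stop() runs second simply gets None back.
        """
        return self.starting.pop(server_id, None)
```

With this shape, the order in which the two paths reach `cleanup()` no longer matters.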
Nadav Har'El	33dce2b7fc	Merge 'cql3: statement_restrictions: continue exploitation of predicate work' from Avi Kivity In `6165124fcc`, we changed analysis of expressions in the WHERE clause to use predicates, an annotated form of an expression that constrains a column when the expression is set to true. Here, we exploit this work to simplify the analysis further, reusing already computed attributes rather than re-analyzing the expression. Not backporting, this is a refactor with no functional change and no bugs fixed. Closes scylladb/scylladb#30049 * github.com:scylladb/scylladb: cql3: statement_restrictions: simplify find_idx to return only the index cql3: statement_restrictions: replace has_only_eq_binops with tracked booleans cql3: statement_restrictions: use index-selection predicates for value_for_index_partition_key cql3: statement_restrictions: replace find_clustering_order with predicate order field cql3: statement_restrictions: replace has_partition_token with variant check cql3: statement_restrictions: replace has_slice with predicate is_slice check cql3: statement_restrictions: replace contains_multi_column_restriction filter with _has_multi_column cql3: statement_restrictions: remove unused find_needs_filtering and has_slice_or_needs_filtering cql3: statement_restrictions: replace has_slice_or_needs_filtering with tracked bool cql3: statement_restrictions: replace contains_multi_column_restriction with _has_multi_column cql3: statement_restrictions: replace find_needs_filtering with predicate op check cql3: statement_restrictions: replace find_binop is_on_collection with tracked bool cql3: statement_restrictions: replace find_binop column extraction with predicate on field cql3: statement_restrictions: set op on all binary-operator-derived predicates	2026-05-31 23:22:43 +03:00
Avi Kivity	503add224d	cql3: statement_restrictions: simplify find_idx to return only the index The expression returned as the second element of find_idx()'s pair was stored in view_indexed_table_select_statement::_used_index_restrictions but never read — dead code. Simplify find_idx() to return just the optional<index>, and remove the dead member and constructor parameter from view_indexed_table_select_statement. The now unused _idx_restrictions is also removed.	2026-05-29 17:18:21 +03:00
Taras Veretilnyk	9abf594397	test/cqlpy: add per-table toggle, LWT exemption, and multi-category tests Per-table toggle: disabled-at-create, alter-disable, alter-reenable. LWT exemption: Paxos learn must bypass the guardrail. Multi-category independence: all three guardrails warn/reject independently when SSTable records span partition, row, and collection categories.	2026-05-29 12:51:43 +02:00
Dimitrios Symonidis	4c0a991017	test/cluster: fix proxy resource leak in internode compression test The test_internode_compression_between_datacenters test was flaky due to proxy servers and leased host IPs not being cleaned up on failure paths. If any exception occurred after proxies were started (e.g. during server_start or driver_connect), the asyncio.Server listeners remained bound and leased hosts were never released back to HostRegistry. On subsequent test runs, this caused EADDRINUSE (errno 98) when trying to bind the same address:port. Wrap the proxy/server lifecycle in try/finally to ensure proxies are always stopped and hosts are always released, regardless of whether the test succeeds or fails. Fixes: SCYLLADB-2183 Closes scylladb/scylladb#30127	2026-05-29 13:51:43 +03:00
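The try/finally pattern described in the commit above might look roughly like the following sketch. `FakeProxy`, `FakeHostRegistry`, and `with_proxy` are hypothetical names used for illustration only; the real test works with asyncio.Server listeners and HostRegistry.

```python
import asyncio


class FakeProxy:
    """Hypothetical proxy; the real test wraps an asyncio.Server listener."""

    def __init__(self):
        self.running = False

    async def start(self):
        self.running = True

    async def stop(self):
        self.running = False


class FakeHostRegistry:
    """Hypothetical registry that leases and takes back loopback IPs."""

    def __init__(self):
        self.leased = set()
        self._next = 2

    async def lease_host(self):
        host = f"127.0.0.{self._next}"
        self._next += 1
        self.leased.add(host)
        return host

    async def release_host(self, host):
        self.leased.discard(host)


async def with_proxy(registry, proxy, body):
    """Run body(host); always stop the proxy and release the host afterwards."""
    host = await registry.lease_host()
    try:
        await proxy.start()
        return await body(host)
    finally:
        # Runs whether body() returned normally or raised.
        await proxy.stop()
        await registry.release_host(host)
```

Because the cleanup lives in `finally`, a failure inside `body` (for example during a connect step) cannot leave the listener bound or the host leased.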
Taras Veretilnyk	7d365844a3	test/cqlpy: add large collection guardrail tests Tests for collection element-count guardrail: hard-limit rejection, disabled-when-zero, soft-limit log warning, and no-warning below threshold.	2026-05-29 12:51:43 +02:00
Taras Veretilnyk	19a9e45da8	test/cqlpy: add large row guardrail tests Tests for row-size guardrail: hard-limit rejection, disabled-when-zero, soft-limit log warning, and no-warning below threshold.	2026-05-29 12:51:42 +02:00
Taras Veretilnyk	67b659e2bf	test/cqlpy: add large partition guardrail tests Tests for partition size and row-count guardrails: hard-limit rejection, disabled-when-zero, soft-limit log warnings, and no-warning below threshold. Includes shared helpers and log assertion utilities used by subsequent commits.	2026-05-29 12:51:42 +02:00
Taras Veretilnyk	ff84b1dbc4	test/boost: add large_data_guardrail unit tests 8 tests covering the record_compare template comparator, intrusive multiset equal_range grouping with heterogeneous lookup_key, and auto_unlink on record destruction.	2026-05-29 12:51:42 +02:00
Taras Veretilnyk	0201c1530e	test/cluster: add large data guardrails rolling upgrade test Simulated rolling upgrade: start a 2-node cluster where one node suppresses the LARGE_DATA_GUARDRAILS feature, verify that enabling guardrails is rejected, then upgrade the old node and verify that enabling guardrails succeeds.	2026-05-29 12:51:31 +02:00
Pavel Emelyanov	5d0371620d	test/backup: Reduce s3 logging from trace to debug Change s3 log level from TRACE to DEBUG in backup tests. TRACE level generates excessive log volume with too much low-level detail about S3 operations. While it was useful in the early days of the S3 client, nowadays DEBUG level likely provides sufficient diagnostic information for backup test troubleshooting. The reduced log volume significantly improves test performance, which is the main outcome of this change: - Less I/O time writing logs during test execution - Faster teardown: each test scans all server logs for errors, and smaller logs mean faster grep operations (23.3s → 9.97s for 8-node cluster teardown) Impact on test_restore_with_streaming_scopes[topology4] (8 nodes): - Log volume: 49 MB → 23 MB (reduced by half) - Test runtime: 82.55s → 57.53s (30% faster) - Teardown time: 23.3s → 9.97s (57% faster) Tests that start smaller clusters also have notable timing improvements. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#30109	2026-05-29 13:46:10 +03:00
Pavel Emelyanov	24c0ea6b19	sstables_loader: Prevent table destruction during tablet restore download Similar to `e5e6608f20` ("sstables_loader: prevent use-after-free on table drop during streaming") which fixed the same class of race for load_and_stream, the tablet restore path also holds a replica::table& reference across the download_sstable() coroutine without preventing concurrent table destruction. If DROP KEYSPACE is applied while download_sstable() is writing SSTable components to the table's data directory, the directory is removed mid-write causing ENOENT → abort (with --abort-on-internal-error). Fix by acquiring a stream_in_progress() phaser guard after find_column_family() and before download_sstable(). table::stop() calls _pending_streams_phaser.close() which blocks until all outstanding guards are released, keeping the table alive for the duration of the download. Fixes: SCYLLADB-2187 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#30094	2026-05-29 13:43:37 +03:00
Pavel Emelyanov	8b2ff16cae	schema: Move grace_period from schema_ctxt to schema_registry The schema_registry_grace_period field on schema_ctxt was only used by schema_registry itself for eviction timing. Move it to be a direct member of schema_registry, passed at init() time. This removes one db::config dependency from schema_ctxt. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#30038	2026-05-29 13:42:23 +03:00
Botond Dénes	1384c9523e	Merge 'Simplify handler injection call sites to use appropriate existing API' from Pavel Emelyanov Several error injection call sites use the verbose handler-lambda API when simpler alternatives already exist in the framework. This series converts them to use the appropriate overloads, reducing boilerplate and making the injection intent immediately obvious from the call site. Cleaning up in-code debugging facilities, no need to backport Closes scylladb/scylladb#29962 * github.com:scylladb/scylladb: error_injection: Convert handler-style breakpoints to wait_for_message sugar error_injection: Convert no-op handler injections to enter()/is_enabled() error_injection: Convert handler-throw injections to lambda-throw style utils: Add share_messages parameter to breakpoint injection API	2026-05-29 13:41:09 +03:00
Botond Dénes	3ae88e31bd	Merge 'test/pylib: stop using random ports for MinIO and JMX' from Piotr Smaron Replace random port selection in MinIO and JMX test helpers with fixed ports on unique per-test loopback IPs, eliminating TOCTOU races. Commits: - kmip_wrapper: default hostname to 127.0.0.1 - nodetool: bind JMX to the per-module loopback IP with fixed port 7199 - minio: use fixed service and console ports on a unique HostRegistry IP instead of probing the ephemeral range; raise on start failure Fixes: SCYLLADB-1817 Minor improvement, no need to backport. Closes scylladb/scylladb#29741 * github.com:scylladb/scylladb: test/pylib: use fixed MinIO ports on unique loopback IPs test/nodetool: bind JMX to per-module loopback IP test/pylib: default KMIP wrapper to loopback	2026-05-29 13:40:24 +03:00
Taras Veretilnyk	5a0974e781	schema: add per-table large_data_guardrails_enabled flag Add a per-table large_data_guardrails_enabled flag controlled via the CQL table property WITH large_data_guardrails_enabled = true|false. Store the flag as a boolean column in system_schema_ext.scylla_tables. Only write a live cell when enabled; when disabled (the default), omit the cell entirely so that old nodes that don't know this column can still read the SSTable during rolling upgrade or rollback. When the property transitions from true to false via ALTER TABLE, a tombstone is written in make_update_table_mutations to override the previous live cell; this is safe because the CQL feature gate ensures all nodes are upgraded before the property can be set to true. Gate the CQL property behind the LARGE_DATA_GUARDRAILS cluster feature: attempting to set large_data_guardrails_enabled = true before all nodes advertise the feature raises a ConfigurationException.	2026-05-29 12:18:33 +02:00
Botond Dénes	46631692cd	mutation_fragment_stream_validator: use legacy byte order for same-token partition key comparison When two partition keys share the same token, their relative order is determined by their raw serialized bytes (legacy_tri_compare), which matches the physical on-disk order in SSTables. The validator was using partition_key::tri_compare instead — a type-aware comparator that can disagree with byte order for types like timeuuid. The result was a false-positive "out-of-order partition key" error for any two same-token partitions whose timeuuid (or other type-aware) order is the reverse of their byte order. In scrub mode this caused the second partition to be silently dropped. Fixes: SCYLLADB-2304 Closes scylladb/scylladb#30120	2026-05-29 11:54:20 +02:00
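To see why a type-aware comparison and a raw byte comparison can disagree for timeuuid values, consider the sketch below. It relies only on the standard version-1 UUID layout, in which the low 32 bits of the timestamp come first in the serialized bytes. The helper names are hypothetical, and this is an illustration of the ordering mismatch, not ScyllaDB's `tri_compare` or `legacy_tri_compare`.

```python
import uuid


def make_timeuuid(timestamp, node=1):
    """Build a version-1 UUID from a 60-bit timestamp (hypothetical helper)."""
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000  # version 1
    # clock_seq_hi_variant=0x80 sets the RFC 4122 variant, clock_seq_low=0
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0, node))


def three_way(a, b):
    """Return -1, 0 or 1, like a tri-compare."""
    return (a > b) - (a < b)


def time_order(a, b):
    """Type-aware order: compare the embedded timestamps."""
    return three_way(a.time, b.time)


def byte_order(a, b):
    """Byte order: compare the 16 serialized bytes lexicographically."""
    return three_way(a.bytes, b.bytes)


# The earlier timestamp has the larger low 32 bits, so its bytes sort later.
earlier = make_timeuuid(0x0000_0000_FFFF_FFFF)
later = make_timeuuid(0x0000_0001_0000_0000)
```

Here `time_order(earlier, later)` is negative while `byte_order(earlier, later)` is positive, which is the kind of disagreement that made the validator report a false out-of-order partition key.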
Tomasz Grabiec	5ceabcbcc5	Merge 'tablets: fix update_tablet_metadata failures during bootstrap' from Aleksandra Martyniuk When partition_split_builder splits a tablet metadata partition into multiple mutations, the first mutation gets the partition tombstone and/or static row while subsequent mutations contain only clustered rows. The hint logic would correctly clear tokens (marking a full partition read) upon seeing the tombstone in the first mutation, but then re-add tokens when processing the subsequent row-only mutations. This caused update_tablet_metadata to attempt a point update via mutate_tablet_map_async on a tablet map that doesn't exist yet during bootstrap, throwing no_such_tablet_map and failing the snapshot transfer. Fix by adding a full_read flag to table_hint. Once a full partition read is decided (due to partition tombstone, range tombstone, static row, or row deletion), the flag prevents subsequent mutations for the same table from re-adding tokens. Additionally, fall back to a full partition read when the tablet map is missing locally, which happens when the joining node receives tablet metadata for a table it has never seen before. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-2303. Needs backports to 2026.1+. 2026.1 introduces the regression with `b17a36c071` Closes scylladb/scylladb#30115 * github.com:scylladb/scylladb: tablets: fall back to full partition read when tablet map is missing tablets: fix hint re-adding tokens after full partition read decision	2026-05-29 11:53:36 +02:00
Botond Dénes	091e3f5191	Merge 'test.py: reduce resource metrics gathering overhead' from Evgeniy Naydanov Only enable the memory controller in cgroup subtree_control instead of all available controllers. cpu.stat is available in cgroup v2 without enabling the cpu controller (base accounting), and enabling io/pids/cpu controllers adds unnecessary per-operation kernel overhead to Scylla processes - particularly the memory controller's per-page-cache-operation accounting combined with io controller overhead during heavy I/O. Additionally, restrict SystemResourceMonitor to the master process only. System-wide metrics (CPU%, memory) are identical from any process, so running a monitoring thread in each xdist worker was redundant and added unnecessary SQLite write contention and thread scheduling noise. Replace cpu_percent(interval=0.1) with a non-blocking cpu_percent() that returns CPU% since the previous call. Use stop_event.wait(timeout=2.0) as the loop control to both space out iterations and allow immediate shutdown responsiveness. Fixes SCYLLADB-2141 Closes scylladb/scylladb#29987 * github.com:scylladb/scylladb: test: use non-blocking cpu_percent in SystemResourceMonitor test.py: reduce cgroup overhead in resource metrics gathering	2026-05-29 10:52:17 +03:00
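The loop shape described above, where `stop_event.wait(timeout=...)` both spaces out iterations and lets shutdown take effect immediately, can be sketched as follows. `monitor_loop`, `sample`, and `record` are hypothetical names; in the real monitor the sampler would presumably be a non-blocking CPU reading and the recorder a database write.

```python
import threading


def monitor_loop(sample, record, stop_event, interval=2.0):
    """Call record(sample()) every `interval` seconds until stop_event is set.

    Event.wait() returns False on timeout (keep looping) and True as soon
    as the event is set, so a shutdown request is noticed without waiting
    for the rest of the interval.
    """
    while not stop_event.wait(timeout=interval):
        record(sample())
```

A usage note: run `monitor_loop` in a background thread and call `stop_event.set()` from the owner to end it; no separate sleep call is needed.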
Nadav Har'El	7a387a499f	Merge 'cql3: extract vector search select statement into cql3/statements/external_search/' from Szymon Malewski Extract vector_indexed_table_select_statement and its filter logic out of the monolithic select_statement.cc and vector_search/ module into a dedicated directory cql3/statements/external_search/. This improves modularity and eliminates a circular dependency between cql3 and vector_search: the filter code depends heavily on cql3 types (expressions, query_options, statement_restrictions) and belongs in the cql3 layer. Follow-up to VECTOR-250 which originally addressed the same dependency but has since regressed. This is also a preparatory refactoring for full-text search select statements, which can share some implementation with the vector search. Pure refactoring, no semantic changes - no need for backporting. Closes scylladb/scylladb#30100 * github.com:scylladb/scylladb: vector_index: move filter into cql3/statements/external_search cql3: extract vector_indexed_table_select_statement into own compilation unit vector_index: split query_base_table to return raw coordinator_result	2026-05-28 11:26:49 +03:00
Piotr Dulikowski	8dfd455001	Merge 'strong consistency: fix drop table blocking on stuck writes and handle timeout in update()' from Petr Gusev - Fix table drop blocking for the full client timeout when in-flight writes can't reach quorum - Handle unhandled timeout exception in the wait-for-leader loop during group startup When a strongly consistent table is dropped, `schedule_raft_group_deletion`() calls `g->close()` which waits for all in-flight operations to release their gate holders. But other nodes may have already destroyed their raft servers for this group, so an in-flight write on the leader cannot reach quorum and hangs until the client timeout expires (~seconds), unnecessarily delaying group deletion. Additionally, the wait-for-leader loop in groups_manager::update() uses abort_on_expiry with a 60-second timeout but never catches the exception if it fires, leaving the group in an indeterminate state. SCYLLADB-2080 fix: - Reorder `schedule_raft_group_deletion`: initiate gate close (prevents new operations), then abort the raft server (unblocks stuck writes by causing `raft::stopped_error`), then await the gate future (resolves immediately since holders are released). - Handle `raft::stopped_error` in the coordinator's top-level catch blocks (both write and read paths): if the table no longer exists, return `no_such_column_family` (CQL layer converts to InvalidRequest: unconfigured table). Otherwise fall through to the default timeout handling. - Replace gate->hold() with try_hold() + on_internal_error in acquire_server, with a comment explaining why the gate can never be closed at that point (table removal in `schema_applier::commit_on_shard` precedes gate closure, with no scheduling point in between). Timeout handling fix: - Use `coroutine::as_future` in the wait-for-leader loop to catch timeout exceptions gracefully — log a warning and break out instead of propagating unhandled. 
Includes a cluster test reproducer (test_drop_table_unblocks_stuck_write) that: 1. Pauses a write on the leader before add_entry 2. Drops the table (follower destroys its group immediately) 3. Resumes the write — verifies it fails promptly with InvalidRequest ("unconfigured table") instead of hanging for 15 seconds backport: no need, strong consistency is not released yet Fixes: SCYLLADB-2080 Closes scylladb/scylladb#30105 * github.com:scylladb/scylladb: strong consistency/groups_manager: handle timeout in update() wait-for-leader loop strong consistency: abort raft server before gate close when dropping a table test/cluster: rewrite test_queries_while_dropping_table for SCYLLADB-2080	2026-05-28 09:59:20 +02:00
Szymon Malewski	ed1006928f	vector_index: move filter into cql3/statements/external_search Move prepared_filter, prepared_restriction, prepared_rhs types and prepare_filter() from vector_search/filter.{hh,cc} into new files cql3/statements/external_search/filter.{hh,cc} under namespace cql3::statements::external_search. This eliminates a circular dependency between the cql3 and vector_search modules: the filter code depends heavily on cql3 types (expressions, query_options, statement_restrictions) and belongs in the cql3 layer. This is a follow-up to VECTOR-250 which originally addressed the same circular dependency but has since regressed.	2026-05-27 21:43:56 +02:00
Ferenc Szili	76dac2fd8e	test: fix format string typo in error logging in ldap_server.py This change fixes a typo in the error logging format string: s% -> %s Fixes: SCYLLADB-2244 Closes scylladb/scylladb#30088	2026-05-27 17:22:21 +03:00
Aleksandra Martyniuk	d6c1707a04	tablets: fix hint re-adding tokens after full partition read decision When partition_split_builder splits a tablet metadata partition into multiple mutations, the first mutation gets the partition tombstone and/or static row while subsequent mutations contain only clustered rows. The tablet metadata change hint logic would correctly clear tokens (marking a full partition read) upon seeing the tombstone in the first mutation, but then re-add tokens when processing the subsequent row-only mutations. This caused update_tablet_metadata to attempt a point update via mutate_tablet_map_async on a tablet map that doesn't exist yet during bootstrap, throwing no_such_tablet_map and failing the snapshot transfer. Fix by adding a full_read flag to table_hint. Once a full partition read is decided (due to partition tombstone, range tombstone, static row, or row deletion), the flag prevents subsequent mutations for the same table from re-adding tokens.	2026-05-27 15:36:16 +02:00
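The sticky full_read decision from the commit above can be modelled with a small sketch. This is a simplified, hypothetical model: `TableHint`, `update_hint`, and the dict-shaped mutation are illustrative and do not reflect the actual table_hint structure or mutation types.

```python
from dataclasses import dataclass, field


@dataclass
class TableHint:
    """Simplified model of a per-table hint (hypothetical fields)."""
    tokens: set = field(default_factory=set)
    full_read: bool = False


def update_hint(hint, mutation):
    """Fold one mutation into the hint.

    `mutation` is a dict with 'needs_full_read' (True for a partition
    tombstone, range tombstone, static row, or row deletion) and 'tokens'.
    """
    if hint.full_read:
        return  # decision is sticky: later row-only mutations add nothing
    if mutation["needs_full_read"]:
        hint.tokens.clear()
        hint.full_read = True
    else:
        hint.tokens.update(mutation["tokens"])
```

Without the early return, a row-only mutation arriving after the tombstone-carrying one would repopulate `tokens`, which is the behaviour the commit describes as the bug.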
Nadav Har'El	21ecc12fc6	Merge 'index: fix local vector index locality detection after schema reload' from Michał Hudobski After schema reload, `target_parser::is_local()` did not recognize the vector-index local target format `{"pk": [...], "tc": "..."}`, causing local vector indexes to be treated as global. This broke duplicate detection when both a global and a local vector index existed on the same column. Fix by introducing `vector_index::is_local()` and dispatching to it from `create_index_from_index_row()` based on the index class. Also adds tests for local/global vector index coexistence. Fixes: SCYLLADB-987 backport reasoning: we added local vector index support in 2026.1 Closes scylladb/scylladb#29492 * github.com:scylladb/scylladb: test/cqlpy: add tests for global and local vector index coexistence index: fix local vector index locality detection after schema reload	2026-05-27 15:34:57 +03:00
Petr Gusev	d922c43358	strong consistency: abort raft server before gate close when dropping a table When a strongly consistent table is dropped, schedule_raft_group_deletion() used to call g->close() first, which waits for all in-flight operations to release their gate holders. But other nodes may have already destroyed their raft servers for this group, so an in-flight write on the leader cannot reach quorum and hangs until the client timeout expires, unnecessarily delaying group deletion. Fix: initiate gate close (prevents new operations from entering), then abort the raft server (causes in-flight add_entry/read_barrier to throw raft::stopped_error, releasing their gate holders), then await the gate future (resolves immediately since holders are now released). Handle raft::stopped_error in the coordinator's top-level catch blocks (both write and read paths): if the table no longer exists, return no_such_column_family (which the CQL layer converts to InvalidRequest 'unconfigured table'). Otherwise fall through to the default timeout handling. Also replace gate->hold() with try_hold() + on_internal_error in acquire_server, and handle the timeout exception in the wait-for-leader loop in update() gracefully (log + break instead of propagating). Fixes: SCYLLADB-2080	2026-05-27 12:06:46 +02:00
Petr Gusev	89307064b5	test/cluster: rewrite test_queries_while_dropping_table for SCYLLADB-2080 Rewrite the test to use 2 nodes (RF=2) instead of 1 (RF=1), which exposes the quorum-loss scenario: when a table is dropped, the follower destroys its raft group immediately while the leader's in-flight operations are still holding the gate. The test pauses both a read and a write on the leader, drops the table, then resumes them. Both are expected to fail with 'no such column family' since the raft server is aborted as part of group deletion. A 15-second timeout guard detects the old buggy behavior (write stuck forever). Marked xfail until the fix is applied in the next commit.	2026-05-27 12:06:46 +02:00
Botond Dénes	555cfbcd38	Merge 'treewide: replace deprecated smp::count and smp::all_cpus() with new APIs' from Avi Kivity Replace all uses of the deprecated seastar::smp::count with this_smp_shard_count() and smp::all_cpus() with this_smp_all_shards() across the ScyllaDB codebase (seastar submodule untouched). Both replacement functions require a reactor thread context. All call sites were verified to run on reactor threads. Notable cases: - dht/token-sharding.hh: this_smp_shard_count() is used as a default parameter value. This is safe since all callers are on reactor threads, but the expression is now evaluated at each call site rather than being a reference to a global variable. - service/storage_service.hh, locator/abstract_replication_strategy.hh, ent/encryption/encryption.cc: used in default member initializers and constructor member-init-lists. Objects are always constructed on reactor threads. - schema_builder: sometimes called from BOOST_AUTO_TEST_CASE without a reactor. Added pre-patch that makes the implicit shard count parameter explicit and passes 1 in those cases. Not changed: - scylla-gdb.py: reads smp::count as a GDB symbol (no reactor context). - Python test files: only reference smp::count in comments/strings. No backport: the Seastar commit that deprecated these functions hasn't (and won't) make its way into any release branches (and the warnings are cosmetic anyway) Closes scylladb/scylladb#29990 * github.com:scylladb/scylladb: treewide: replace deprecated smp::count and smp::all_cpus() with new APIs scylla-gdb: read shard count from smp::_this_smp instead of smp::count schema_builder: make shard_count an explicit constructor parameter	2026-05-27 09:42:06 +03:00
Avi Kivity	8010e408a2	treewide: replace deprecated smp::count and smp::all_cpus() with new APIs Replace all uses of the deprecated seastar::smp::count with this_smp_shard_count() and smp::all_cpus() with this_smp_all_shards() across the ScyllaDB codebase (seastar submodule untouched). Both replacement functions require a reactor thread context. All call sites were verified to run on reactor threads. Notable cases: - dht/token-sharding.hh: this_smp_shard_count() is used as a default parameter value. This is safe since all callers are on reactor threads, but the expression is now evaluated at each call site rather than being a reference to a global variable. - service/storage_service.hh, locator/abstract_replication_strategy.hh, ent/encryption/encryption.cc: used in default member initializers and constructor member-init-lists. Objects are always constructed on reactor threads. Not changed: - scylla-gdb.py: reads smp::count as a GDB symbol (no reactor context). - Python test files: only reference smp::count in comments/strings.	2026-05-26 17:35:20 +03:00
Wojciech Mitros	ae0d77257f	mv: fix view_update_builder losing fragments across batch boundaries When a mutation generates more view updates than max_rows_for_view_updates (100), view_update_builder::build_some() splits the work into multiple batches. There was a bug in how fragments were read between batches: When should_stop_updates() returned true, the old code called stop() which returned stop_iteration::yes without reading the next fragments. On the next build_some() call, read_both_next_fragments() was called at the start, which advanced BOTH readers - skipping any fragment that was already read but not yet consumed. A row could be not consumed if either: - the 100th (last in the batch) update was a row insertion and we still had insertions/updates remaining - the 100th (last in the batch) update was a row deletion and we still had deletions/updates remaining For the most common case where work is split in batches, i.e. range deletions, we couldn't hit this because range delete generates only view row deletions. On tables with a single materialized view, we also couldn't get this for any batches with fewer than 50 statements (unless the batch also contained range deletions), because one non-range-delete update can generate up to 2 view updates. However, for a range of scenarios outside these 2, we could lose view updates, resulting in persistent inconsistencies. The fix: - read_*_next_fragment() now accept a stop_iteration parameter, so the next fragments are always read after consuming (even when stopping), but stop_iteration::yes is correctly propagated to break the loop. - build_some() no longer re-reads fragments at the start. Instead, an initialize() method performs the initial read once at construction. - because now we only advance readers after consuming, we won't advance readers after end_of_partition, so we extend the break condition to accept either readers evaluating to `false` or them being at the end_of_partition.
We also handle the optimization with _skip_row_updates Fixes: scylladb/scylladb#29155 Closes scylladb/scylladb#29498	2026-05-26 14:15:12 +02:00
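The batch-boundary bug described above can be reproduced in a much simplified model. The sketch below is an assumption-laden Python toy, not the real `view_update_builder`: the `Builder` class, the integer keys, and the "insert"/"delete"/"update" labels are hypothetical, and each consumed fragment is assumed to produce exactly one view update. The `buggy` mode re-reads both readers at the start of every batch and stops without advancing. The fixed mode reads once at construction and always advances after consuming.

```python
class Builder:
    """Toy merge of two sorted readers (new updates vs. existing rows)."""

    def __init__(self, updates, existings, batch_size, buggy):
        self._updates = iter(updates)
        self._existings = iter(existings)
        self._batch_size = batch_size
        self._buggy = buggy
        self._update = None
        self._existing = None
        if not buggy:
            self._read_both()  # fixed: a single initial read at construction

    def _read_both(self):
        self._update = next(self._updates, None)
        self._existing = next(self._existings, None)

    def build_some(self):
        if self._buggy:
            # Old behaviour: advance BOTH readers at the start of each batch,
            # discarding any fragment that was read but not yet consumed.
            self._read_both()
        out = []
        while self._update is not None or self._existing is not None:
            u, e = self._update, self._existing
            take_u = u is not None and (e is None or u <= e)
            take_e = e is not None and (u is None or e <= u)
            kind = "update" if take_u and take_e else "insert" if take_u else "delete"
            out.append((kind, u if take_u else e))
            stop = len(out) >= self._batch_size
            if self._buggy and stop:
                return out  # old behaviour: stop without reading the next fragment
            # Advance only the readers whose fragment was consumed.
            if take_u:
                self._update = next(self._updates, None)
            if take_e:
                self._existing = next(self._existings, None)
            if stop:
                return out
        return out


def run(updates, existings, batch_size, buggy):
    """Drain the builder batch by batch and collect every emitted view update."""
    builder = Builder(updates, existings, batch_size, buggy)
    result = []
    while True:
        batch = builder.build_some()
        if not batch:
            return result
        result.extend(batch)


fixed = run([1, 3], [2, 4], batch_size=1, buggy=False)
lossy = run([1, 3], [2, 4], batch_size=1, buggy=True)
print(fixed)
print(lossy)
```

In this model the fixed variant emits an update for every key, while the buggy variant silently drops the fragments that were pending on the other reader when a batch ended.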
Pavel Emelyanov	cd7d9a63bc	error_injection: Convert handler-style breakpoints to wait_for_message sugar Replace verbose handler lambdas that only log and call wait_for_message() with the equivalent one-liner breakpoint sugar. The behavior is identical -- the sugar produces the same log messages in the format "{name}: waiting for message" / "{name}: message received". Update Python tests that waited for the old ad-hoc log messages to match the new standardized format. Converted injections: - topology_state_load_before_update_cdc (storage_service.cc) - migration_streaming_wait x2 (storage_service.cc) - pause_after_streaming_tablet (storage_service.cc) - cdc_generation_publisher_fiber (topology_coordinator.cc) - wait_after_tablet_cleanup (topology_coordinator.cc) - fast_orphan_removal_fiber (topology_coordinator.cc) - split_storage_groups_wait (table.cc) - wait_before_stop_compaction_groups (table.cc) - tasks_vt_get_children (task_manager.cc) - truncate_compaction_disabled_wait (database.cc) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-26 15:01:01 +03:00
Avi Kivity	c59985c38b	Merge 'cql3: limit large allocations when parsing queries' from Botond Dénes Queries are stored and passed around as sstring/std::string_view. While normally they are small enough to not cause problems, as the `test_cdc_large_values.TestLargeColumnsWithCDC.test_single_column_blob_max_size_with_cdc_preimage_full_postimage[unprepared_statements]` demonstrates, queries can be arbitrarily large, putting heavy strain on Scylla internals via large allocations, in the extreme case causing denial of service. This PR attempts to alleviate this by using fragmented storage for queries: read query as fragmented string from the input stream in `transport/server.cc`, propagate it as such to `query_processor::prepare()` and also store it as such in `cql3::cql_statement::raw_cql_statement`. Also avoid linearizing raw values in the CQL expression tree: switch `cql3::expr::untyped_constant::raw_text` to fragmented storage. For this to be possible, some infrastructure code had to be made fragmented storage friendly: ascii/utf8 validation, hashers, from_hex and importantly: `abstract_type::from_string()`. Unfortunately, the query still has to be linearized for parsing itself, as ANTLR -- although it allows for a custom InputStream implementation -- plays pointer-arithmetic games with the pointers obtained from it, so fragmented input cannot be used. Still, this PR limits the places where the query is linearized to the following: * Parsing * Audit * Logs and error messages So the normal query paths for queries that actually can get arbitrarily large (UPDATE and INSERT) should only linearize the query temporarily for parsing.
Fixes #10779 Improvement, no backport Closes scylladb/scylladb#28619 * github.com:scylladb/scylladb: tracing: add_query(): change query param to utils::chunked_string cql3: store raw query string in utils::chunked_string serializer: add serializer<utils::chunked_string> utils/reusable_buffer: add get_linearized_view(managed_bytes_view) cql3/expr: use utils::chunked_string for untyped_constant::raw_text types: abstract_type::from_string() switch to fragmented buffers (implementation) types: abstract_type::from_string() switch to fragmented buffers (interface) types: use write_fragmented from utils/fragment_range.hh types: timestamp_from_string(): don't assume std::string_view is null-terminated types/duration: don't assume std::string_view is null-terminated utils/hashers: add calculate(managed_bytes_view) overload utils/ascii: add validate(managed_bytes_view) overload utils: add managed_bytes_fwd.hh utils: add chunked_string utils: add managed_bytes_basic_view::byte_iterator	2026-05-26 15:00:53 +03:00
Andrzej Jackowski	45ff773466	test: table_helper: verify detached setup failure is consumed Add test_best_effort_setup_table_failure_is_consumed which triggers a setup_table() failure via a missing keyspace and asserts no abandoned future escapes. This guards against regressions where the detached future loses its exception handler. Remove the test_skipped_no_error_injection placeholder since the new test runs unconditionally keeping the suite non-empty in all build modes.	2026-05-26 13:32:56 +02:00
Avi Kivity	f165b396fd	schema_builder: make shard_count an explicit constructor parameter A recent Seastar update deprecated smp::count and introduced this_smp_shard_count() as a replacement. One difference is that this_smp_shard_count() wants to run on a reactor thread. This poses a problem for non-reactor tests (BOOST_AUTO_TEST_CASE) that nevertheless use a schema, as the schema_builder constructor references smp::count. If we replace it with this_smp_shard_count() then it will crash when running without a reactor. To fix, remove the implicit this_smp_shard_count() call from raw_schema's constructor and require callers to pass shard_count explicitly to schema_builder. This allows tests that don't run on a reactor thread to construct schemas without crashing. Production code and reactor-based tests pass this_smp_shard_count(). Non-reactor test files (expr_test, keys_test, nonwrapping_interval_test, wrapping_interval_test, bti_key_translation_test, range_tombstone_list_test) pass a fixed shard count of 1. Note: sstable_test.cc is a Seastar test file (SEASTAR_THREAD_TEST_CASE) but also contains one plain BOOST_AUTO_TEST_CASE (test_empty_key_view_comparison) that constructs a schema_builder without a reactor context. This test also receives a fixed shard count of 1.	2026-05-26 11:55:56 +03:00
Nikos Dragazis	54cb6d4608	test: Order task-wait before finalization in test_migration_wait_task The purpose of this test is to verify that the task manager's "wait" API works correctly for vnodes-to-tablets migration virtual tasks. It starts a `wait_task` HTTP request concurrently with a finalize (or rollback) operation, and asserts that the wait returns the correct final state ("done" or "suspended"). The test uses `asyncio.create_task()` to wrap the wait request into a task, and then immediately calls finalize. With asyncio's lazy task scheduling, the wait coroutine does not start until the event loop yields, so the finalization request reaches the server before wait, and therefore may also complete before it. Once finalization completes, the virtual migration task is no longer discoverable, causing a "task not found" error. Add a log message in Scylla's wait handler and a synchronization point in the test to ensure that the wait request reaches the server before finalization. This follows the same pattern used in `test_tablet_tasks.py::check_and_abort_repair_task`. Fixes SCYLLADB-2077 Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#29973	2026-05-26 10:43:22 +03:00
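The lazy scheduling behaviour of `asyncio.create_task()` mentioned above can be shown with a small self-contained sketch. The coroutine names and the `asyncio.Event` used as the synchronization point are assumptions for illustration; the actual test synchronizes on a server log message rather than an in-process event.

```python
import asyncio


async def run_scenario(synchronize: bool) -> list[str]:
    """Record the order in which the 'wait' and 'finalize' steps begin."""
    events: list[str] = []
    wait_started = asyncio.Event()

    async def wait_task() -> None:
        # Stands in for the HTTP "wait" request reaching the server.
        events.append("wait started")
        wait_started.set()
        await asyncio.sleep(0)
        events.append("wait done")

    async def finalize() -> None:
        # Stands in for the finalize (or rollback) request.
        events.append("finalize")

    # create_task() only schedules the coroutine; it does not run until
    # the current coroutine yields control to the event loop.
    task = asyncio.create_task(wait_task())
    if synchronize:
        # Synchronization point: do not proceed until the wait has begun.
        await wait_started.wait()
    await finalize()
    await task
    return events


racy = asyncio.run(run_scenario(synchronize=False))
ordered = asyncio.run(run_scenario(synchronize=True))
print(racy)
print(ordered)
```

Without the synchronization point, `finalize` is recorded before the wait coroutine has executed a single statement, which mirrors the race the commit fixes.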
Botond Dénes	0fd25dc47c	Merge 'Replace get_injection_parameters() with inject_parameter() where appropriate' from Pavel Emelyanov Several error injection sites use the low-level get_injection_parameters() API to fetch the entire parameters map and then manually look up a single key. The inject_parameter() API is better suited for these cases — it combines the enabled check and typed single-parameter extraction in one call, returning std::optional. Cleaning error injection usage, not backporting Closes scylladb/scylladb#29970 * github.com:scylladb/scylladb: test: Use inject_parameter() in row_cache_test sstables: Use inject_parameter() for mx reader fill buffer timeout streaming: Use inject_parameter() for order_sstables_for_streaming	2026-05-26 10:32:44 +03:00
Nadav Har'El	f65a52f3ec	Merge 'vector_search: test: migrate rescoring tests from C++/Boost to pytest' from Szymon Malewski Migrate mock-based rescoring and oversampling tests from test/vector_search/rescoring_test.cc to pytest and delete the C++ file. Index option validation tests go to test_vector_index.py; rescoring tests go to a new test_vector_search_rescoring.py which introduces shared infrastructure (EmbeddingRow dataclass, TEST_DATA dict, reversed_ann_response() helper, rescoring_test_table() context manager). Two tests have updated assertions (semantic change): filters_invalid_similarity_scores now uses per-function expected result sets including a zero-vector row, and rescoring_with_zerovector_query asserts empty results after NaN filtering (cosine only). Both are marked xfail pending SCYLLADB-924. Follow-up to #29593. Does not require backport - simple refactoring of tests Closes scylladb/scylladb#29906 * github.com:scylladb/scylladb: test/vector_search: migrate zero-vector query rescoring test to pytest; delete rescoring_test.cc test/vector_search: migrate invalid similarity score filtering test to pytest test/vector_search: migrate non-ANN similarity argument rescoring test to pytest test/vector_search: migrate wildcard select rescoring test to pytest test/vector_search: migrate similarity_function rescoring test to pytest test/vector_search: migrate rescoring and f32 quantization tests to pytest test/vector_search: migrate oversampling tests to pytest test/vector_search: migrate vector_index option validation tests to pytest	2026-05-26 09:45:40 +03:00
Botond Dénes	2c9a5f9634	types: abstract_type::from_string() switch to fragmented buffers (implementation) The previous patch changed the interface and callers, this one updates the implementation to actually work with fragmented buffers. Most types just use with_linearized() to linearize the fragmented input buffer for parsing. This is fine, as most types have a fixed or bounded-size string representation that is small. Importantly, the input is not linearized for the 3 types which have unbounded values: ascii, bytes and text. The tuple type can contain any of these types itself, so it is also converted to avoid linearization.	2026-05-26 09:08:06 +03:00


11954 Commits