scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-09 08:23:29 +00:00

Author	SHA1	Message	Date
Radosław Cybulski	cc39b54173	alternator: use `stream_arn` instead of `std::string` in list_streams Use a `stream_arn` object instead of a raw `std::string` to store the last stream returned to the user. `stream_arn` was already used to parse ARNs coming from the user; for returned values `std::string` was used because the copy / move operations of `stream_arn` were buggy. Those were fixed, so we're fixing the usage as well. Fixes: SCYLLADB-1241 Closes scylladb/scylladb#29578	2026-04-22 14:02:53 +02:00
Artsiom Mishuta	183c6d120e	test: exclude pylib_test from default test runs Add pylib_test to norecursedirs in pytest.ini so it is not collected during ./test.py or pytest test/ runs, but can still be run directly via 'pytest test/pylib_test'. Also fix pytest log cleanup: worker log files (pytest_gw*) were not being deleted on success because cleanup was restricted to the main process only. Now each process (main and workers) cleans up its own log file on success. Closes scylladb/scylladb#29551	2026-04-22 11:38:40 +02:00
Botond Dénes	18ceeaf3ef	Merge 'Restrict tombstone GC sstable set to repaired sstables for tombstone_gc=repair mode' from Raphael Raph Carvalho When tombstone_gc=repair, the repaired compaction view's sstable_set_for_tombstone_gc() previously returned all sstables across all three views (unrepaired, repairing, repaired). This is correct but unnecessarily expensive: the unrepaired and repairing sets are never the source of a GC-blocking shadow when tombstone_gc=repair, for base tables. The key ordering guarantee that makes this safe is: - topology_coordinator sends send_tablet_repair RPC and waits for it to complete. Inside that RPC, mark_sstable_as_repaired() runs on all replicas, moving D from repairing → repaired (repaired_at stamped on disk). - Only after the RPC returns does the coordinator commit repair_time + sstables_repaired_at to Raft. - gc_before = repair_time - propagation_delay only advances once that Raft commit applies. Therefore, when a tombstone T in the repaired set first becomes GC-eligible (its deletion_time < gc_before), any data D it shadows is already in the repaired set on every replica. This holds because: - The memtable is flushed before the repairing snapshot is taken (take_storage_snapshot calls sg->flush()), capturing all data present at repair time. - Hints and batchlog are flushed before the snapshot, ensuring remotely-hinted writes arrive before the snapshot boundary. - Legitimate unrepaired data has timestamps close to 'now', always newer than any GC-eligible tombstone (USING TIMESTAMP to write backdated data is user error / UB). Excluding the repairing and unrepaired sets from the GC shadow check cannot cause any tombstone to be wrongly collected. The memtable check is also skipped for the same reason: memtable data is either newer than the GC-eligible tombstone, or was flushed into the repairing/repaired set before gc_before advanced. Safety restriction — materialized views: The optimization IS applied to materialized view tables. 
Two possible paths could inject D_view into the MV's unrepaired set after MV repair: view hints and staging via the view-update-generator. Both are safe: (1) View hints: flush_hints() creates a sync point covering BOTH _hints_manager (base mutations) AND _hints_for_views_manager (view mutations). It waits until ALL pending view hints — including D_view entries queued in _hints_for_views_manager while the target MV replica was down — have been replayed to the target node before take_storage_snapshot() is called. D_view therefore lands in the MV's repairing sstable and is promoted to repaired. When a repaired compaction then checks for shadows it finds D_view in the repaired set, keeping T_mv non-purgeable. (2) View-update-generator staging path: Base table repair can write a missing D_base to a replica via a staging sstable. The view-update-generator processes the staging sstable ASYNCHRONOUSLY: it may fire arbitrarily later, even after MV repair has committed repair_time and T_mv has been GC'd from the repaired set. However, the staging processor calls stream_view_replica_updates() which performs a READ-BEFORE-WRITE via as_mutation_source_excluding_staging(): it reads the CURRENT base table state before building the view update. If T_base was written to the base table (as it always is before the base replica can be repaired and the MV tombstone can become GC-eligible), the view_update_builder sees T_base as the existing partition tombstone. D_base's row marker (ts_d < ts_t) is expired by T_base, so the view update is a no-op: D_view is never dispatched to the MV replica. No resurrection can occur regardless of how long staging is delayed. A potential sub-edge-case is T_base being purged BEFORE staging fires (leaving D_base as the sole survivor, so stream_view_replica_updates would dispatch D_view). 
This is blocked by an additional invariant: for tablet-based tables, the repair writer stamps repaired_at on staging sstables (repair_writer_impl::create_writer sets mark_as_repaired = true and perform_component_rewrite writes repaired_at = sstables_repaired_at + 1 on every staging sstable). After base repair commits sstables_repaired_at to Raft, the staging sstable satisfies is_repaired(sstables_repaired_at, staging_sst) and therefore appears in make_repaired_sstable_set(). Any subsequent base repair that advances sstables_repaired_at further still includes the staging sstable (its repaired_at ≤ new sstables_repaired_at). D_base in the staging sstable thus shadows T_base in every repaired compaction's shadow check, keeping T_base non-purgeable as long as D_base remains in staging. A base table hint also cannot bypass this. A base hint is replayed as a base mutation. The resulting view update is generated synchronously on the base replica and sent to the MV replica via _hints_for_views_manager (path 1 above), not via staging. USING TIMESTAMP with timestamps predating (gc_before + propagation_delay) is explicitly UB and excluded from the safety argument. For tombstone_gc modes other than repair (timeout, immediate, disabled) the invariant does not hold for base tables either, so the full storage-group set is returned. The expected gain is reduced bloom filter and memtable key-lookup I/O during repaired compactions: the unrepaired set is typically the largest (it holds all recent writes), yet for tombstone_gc=repair it never influences GC decisions. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-231. Closes scylladb/scylladb#29310 * github.com:scylladb/scylladb: compaction: Restrict tombstone GC sstable set to repaired sstables for tombstone_gc=repair mode test/repair: Add tombstone GC safety tests for incremental repair	2026-04-22 10:21:37 +03:00
Avi Kivity	f5eb99f149	test: bump multishard_query_test querier_cache TTL to 60s to avoid flake Three test cases in multishard_query_test.cc set the querier_cache entry TTL to 2s and then assert, between pages of a stateful paged query, that cached queriers are still present (population >= 1) and that time_based_evictions stays 0. The 2s TTL is not load-bearing for what these tests exercise — they are checking the paging-cache handoff, not TTL semantics. But on busy CI runners (SCYLLADB-1642 was observed on aarch64 release), scheduling jitter between saving a reader and sampling the population can exceed 2s. When that happens, the TTL fires, both saved queriers are time-evicted, population drops to 0, and the assertion `require_greater_equal(saved_readers, 1u)` fails. The trailing `require_equal(time_based_evictions, 0)` check never runs because the earlier assertion has already aborted the iteration — which is why the Jenkins failure surfaces only as a bare "C++ failure at seastar_test.cc:93". Reproduced deterministically in test_read_with_partition_row_limits by injecting a `seastar::sleep(2500ms)` between the save and the sample: the hook then reports population=0 inserts=2 drops=0 time_based_evictions=2 resource_based_evictions=0 and the assertion fires — matching the Jenkins symptoms exactly. Bump the TTL to 60s in all three affected tests: - test_read_with_partition_row_limits (confirmed repro for SCYLLADB-1642) - test_read_all (same pattern, same invariants — suspect) - test_read_all_multi_range (same pattern, same invariants — suspect) Leave test_abandoned_read (1s TTL, actually tests TTL-driven eviction) and test_evict_a_shard_reader_on_each_page (tests manual eviction via evict_one(); its TTL is not load-bearing but the fix is deferred for a separate review) unchanged. Fixes: SCYLLADB-1642 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Closes scylladb/scylladb#29564	2026-04-22 09:48:59 +03:00
Tomasz Grabiec	cddde464ca	Merge 'service: Support adding/removing a datacenter with tablets by changing RF' from Aleksandra Martyniuk With this change, you can add or remove one or more DCs in a single ALTER KEYSPACE statement. It requires the keyspace to use rack list replication factor. In the existing approach, during an RF change all tablet replicas are rebuilt at once. This isn't the case now. In global_topology_request::keyspace_rf_change the request is added to ongoing_rf_changes - a new column in the system.topology table. In a new column in system_schema.keyspaces - next_replication - we keep the target RF. In make_rf_change_plan, the load balancer schedules the necessary migrations, considering the load of nodes and other pending tablet transitions. Requests from ongoing_rf_changes are processed concurrently, independently from one another. In each request racks are processed concurrently. No tablet replica will be removed until all required replicas are added. While adding replicas to each rack we always start with base tables and won't proceed with views until they are done (while removing - the other way around). The intermediary steps aren't reflected in the schema. When the RF change is finished: - in system_schema.keyspaces: - next_replication is cleared; - new keyspace properties are saved; - request is removed from ongoing_rf_changes; - the request is marked as done in system.topology_requests. Until the request is done, DESCRIBE KEYSPACE shows the replication_v2. If a request hasn't started to remove replicas, it can be aborted using the task manager. system.topology_requests::error is set (but the request isn't marked as done) and next_replication = replication_v2. This is interpreted by the load balancer, which starts the rollback of the request. After the rollback is done, we set the relevant system.topology_requests entry as done (failed), clear the request id from system.topology::ongoing_rf_changes, and remove next_replication. Fixes: SCYLLADB-567. 
No backport needed; new feature. Closes scylladb/scylladb#24421 * github.com:scylladb/scylladb: service: fix indentation docs: update documentation test: test multi RF changes service: tasks: allow aborting ongoing RF changes cql3: allow changing RF by more than one when adding or removing a DC service: handle multi_rf_change service: implement make_rf_change_plan service: add keyspace_rf_change_plan to migration_plan service: extend tablet_migration_info to handle rebuilds service: split update_node_load_on_migration service: rearrange keyspace_rf_change handler db: add columns to system_schema.keyspaces db: service: add ongoing_rf_changes to system.topology gms: add keyspace_multi_rf_change feature	2026-04-22 01:46:11 +02:00
Dario Mirovic	cf237e060a	test: auth_cluster: use safe_driver_shutdown() for Cluster teardown A handful of cassandra-driver Cluster.shutdown() call sites in the auth_cluster tests were missed by the previous sweep that introduced safe_driver_shutdown(), because the local variable holding the Cluster is named "c" rather than "cluster". Direct Cluster.shutdown() is racy: the driver's "Task Scheduler" thread may raise RuntimeError ("cannot schedule new futures after shutdown") during or after the call, occasionally failing tests. safe_driver_shutdown() suppresses this expected RuntimeError and joins the scheduler thread. Replace the remaining c.shutdown() calls in: - test/cluster/auth_cluster/test_startup_response.py - test/cluster/auth_cluster/test_maintenance_socket.py with safe_driver_shutdown(c) and add the corresponding import from test.pylib.driver_utils. No behavioral change to the tests; only the driver teardown is hardened against a known driver-side race. Fixes SCYLLADB-1662 Closes scylladb/scylladb#29576	2026-04-21 17:45:11 +02:00
Radosław Cybulski	6f7bf30a14	alternator: increase wait time to tablet sync When forcing a tablet count change via a CQL command, the underlying tablet machinery takes some time to adjust. The original code waited at most 0.1s for tablet data to be synchronized. This seems not to be enough on debug builds, so we add exponential backoff and increase the maximum waiting time. The code now waits 0.1s the first time and doubles the wait on each retry, up to a maximum of 6 waits, or ~6s in total. Fixes: SCYLLADB-1655 Closes scylladb/scylladb#29573	2026-04-21 17:38:07 +02:00
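The backoff schedule described in the commit above can be sketched roughly as follows. This is a minimal illustration in Python, not the code from the commit: `wait_with_backoff`, `check`, and the injectable `sleep` parameter are hypothetical names, and only the 0.1s initial delay, the doubling, and the limit of 6 waits come from the commit message.

```python
import time
from typing import Callable


def wait_with_backoff(
    check: Callable[[], bool],
    initial_delay: float = 0.1,   # first wait, as stated in the commit message
    max_waits: int = 6,           # maximum number of waits, as stated in the commit message
    sleep: Callable[[float], None] = time.sleep,  # injectable for testing (assumption)
) -> bool:
    """Poll `check`, sleeping between attempts and doubling the sleep each time.

    Returns True as soon as `check` succeeds, or the result of one final
    check after the last wait.
    """
    delay = initial_delay
    for _ in range(max_waits):
        if check():
            return True
        sleep(delay)
        delay *= 2  # exponential backoff: 0.1, 0.2, 0.4, ...
    return check()


def total_wait(initial_delay: float = 0.1, max_waits: int = 6) -> float:
    """Worst-case total sleep time for the schedule above."""
    return sum(initial_delay * 2**i for i in range(max_waits))
```

With the defaults, the worst-case total is 0.1 + 0.2 + 0.4 + 0.8 + 1.6 + 3.2 = 6.3s, which matches the "~6s" figure in the message.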
Radosław Cybulski	74b523ea20	treewide: fix spelling errors. Fix various spelling errors. Closes scylladb/scylladb#29574	2026-04-21 18:20:26 +03:00
Piotr Dulikowski	cb8253067d	Merge 'strong_consistency: fix crash when DROP TABLE races with in-flight DML' from Petr Gusev When DROP TABLE races with an in-flight DML on a strongly-consistent table, the node aborts in `groups_manager::acquire_server()` because the raft group has already been erased from `_raft_groups`. A concurrent `DROP TABLE` may have already removed the table from database registries and erased the raft group via `schedule_raft_group_deletion`. The `schema.table()` in `create_operation_ctx()` might not fail though because someone might be holding `lw_shared_ptr<table>`, so that the table is dropped but the table object is still alive. Fix by accepting table_id in acquire_server and checking that the table still exists in the database via `find_column_family` before looking up the raft group. If the table has been dropped, find_column_family throws no_such_column_family instead of the node aborting via on_internal_error. When the table does exist, acquire_server proceeds to acquire state.gate; schedule_raft_group_deletion co_awaits gate::close, so it will wait for the DML operation to complete before erasing the group. backport: not needed (not released feature) Fixes SCYLLADB-1450 Closes scylladb/scylladb#29430 * github.com:scylladb/scylladb: strong_consistency: fix crash when DROP TABLE races with in-flight DML test: add regression test for DROP TABLE racing with in-flight DML	2026-04-21 16:54:20 +02:00
Dario Mirovic	bcda39f716	test: audit: use set diff to identify new audit rows assert_entries_were_added asserted that new audit rows always appear at the tail of each per-node, event_time-sorted sequence. That invariant is not a property of the audit feature: audit writes are asynchronous with respect to query completion, and on a multi-node cluster QUORUM reads of audit.audit_log can reveal a row with an older event_time after a row with a newer one has already been observed. Replace the positional tail slice with a per-node set difference between the rows observed before and after the audited operation. The wait_for retry loop, noise filtering, and final by-value comparison against expected_entries are unchanged, so the test still verifies the real contract, that the expected audit entries appear, without relying on a visibility-ordering invariant that the audit log does not guarantee. Fixes SCYLLADB-1589 Closes scylladb/scylladb#29567	2026-04-21 15:33:36 +02:00
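A rough sketch of the per-node set difference described in the commit above, assuming audit rows can be modelled as hashable tuples. `AuditRow` and `new_rows_per_node` are hypothetical names used for illustration; the real test helper and row schema are not shown in the commit message.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class AuditRow(NamedTuple):
    # Hypothetical, simplified row shape; the real audit.audit_log schema has more columns.
    node: str
    event_time: int
    operation: str


def new_rows_per_node(
    before: Iterable[AuditRow], after: Iterable[AuditRow]
) -> Dict[str, Set[AuditRow]]:
    """Return, per node, the rows present in `after` but not in `before`.

    Unlike a positional tail slice, this does not assume that new rows
    sort after all previously observed rows by event_time.
    """
    seen: Dict[str, Set[AuditRow]] = defaultdict(set)
    for row in before:
        seen[row.node].add(row)

    added: Dict[str, Set[AuditRow]] = defaultdict(set)
    for row in after:
        if row not in seen[row.node]:
            added[row.node].add(row)
    return dict(added)


# A row with an older event_time becomes visible after a newer one was observed.
before = [AuditRow("n1", 20, "SELECT")]
after = [AuditRow("n1", 10, "INSERT"), AuditRow("n1", 20, "SELECT")]
late_rows = new_rows_per_node(before, after)
```

In this example a tail slice of the event_time-sorted `after` list would pick the already-seen row with event_time 20, while the set difference picks the late-arriving row with event_time 10.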
Nadav Har'El	6165124fcc	Merge 'cql3: statement_restrictions: analyze during prepare time' from Avi Kivity The statement_restrictions code is responsible for analyzing the WHERE clause, deciding on the query plan (which index to use), and extracting the partition and clustering keys to use for the index. Currently, it suffers from repetition in making its decisions: there are 15 calls to expr::visit in statement_restrictions.cc, and 14 find_binop calls. This reduces to 2 visits (one nested in the other) and 6 find_binop calls. The analysis of binary operators is done once, then reused. The key data structure introduced is the predicate. While an expression takes inputs from the row evaluated, constants, and bind variables, and produces a boolean result, predicates ask which values for a column (or a number of columns) are needed to satisfy (part of) the WHERE clause. The WHERE clause is then expressed as a conjunction of such predicates. The analyzer uses the predicates to select the index, then uses the predicates to compute the partition and clustering keys. The refactoring is composed of these parts (but patches from different parts are interspersed): 1. an exhaustive regression test is added as the first commit, to ensure behavior doesn't change 2. move computation from query time to prepare time 3. introduce, gradually enrich, and use predicates to implement the statement_restrictions API Major refactoring, and no bugs fixed, so definitely not backporting. 
Closes scylladb/scylladb#29114 * github.com:scylladb/scylladb: cql3: statement_restrictions: replace has_eq_restriction_on_column with precomputed set cql3: statement_restrictions: replace multi_column_range_accumulator_builder with direct predicate iteration cql3: statement_restrictions: use predicate fields in build_get_clustering_bounds_fn cql3: statement_restrictions: remove extract_single_column_restrictions_for_column cql3: statement_restrictions: use predicate vectors in prepare_indexed_local cql3: statement_restrictions: use predicate vector size for clustering prefix length cql3: statement_restrictions: replace do_find_idx and is_supported_by with predicate-based versions cql3: statement_restrictions: remove expression-based has_supporting_index and index_supports_some_column cql3: statement_restrictions: replace multi-column and PK index support checks with predicate-based versions cql3: statement_restrictions: add predicate-based index support checking cql3: statement_restrictions: use pre-built single-column maps for index support checks cql3: statement_restrictions: build clustering-prefix restrictions incrementally cql3: statement_restrictions: build partition-range restrictions incrementally cql3: statement_restrictions: build clustering-key single-column restrictions map incrementally cql3: statement_restrictions: build partition-key single-column restrictions map incrementally cql3: statement_restrictions: build non-primary-key single-column restrictions map incrementally cql3: statement_restrictions: use tracked has_mc_clustering for _has_multi_column cql3: statement_restrictions: track has-token state incrementally cql3: statement_restrictions: track partition-key-empty state incrementally cql3: statement_restrictions: track first multi-column predicate incrementally cql3: statement_restrictions: track last clustering column incrementally cql3: statement_restrictions: track clustering-has-slice incrementally cql3: statement_restrictions: track 
has-multi-column-clustering incrementally cql3: statement_restrictions: track clustering-empty state incrementally cql3: statement_restrictions: replace restr bridge variable with pred.filter cql3: statement_restrictions: convert single-column branch to use predicate properties cql3: statement_restrictions: convert multi-column branch to use predicate properties cql3: statement_restrictions: convert constructor loop to iterate over predicates cql3: statement_restrictions: annotate predicates with operator properties cql3: statement_restrictions: annotate predicates with is_not_null and is_multi_column cql3: statement_restrictions: complete preparation early cql3: statement_restrictions: convert expressions to predicates without being directed at a specific column cql3: statement_restrictions: refine possible_lhs_values() function_call processing cql3: statement_restrictions: return nullptr for function solver if not token cql3: statement_restrictions: refine possible_lhs_values() subscript solving cql3: statement_restrictions: return nullptr from possible_lhs_values instead of on_internal_error cql3: statement_restrictions: convert possible_lhs_values into a solver cql3: statement_restrictions: split _where to boolean factors in preparation for predicates conversion cql3: statement_restrictions: refactor IS NOT NULL processing cql3: statement_restrictions: fold add_single_column_nonprimary_key_restriction() into its caller cql3: statement_restrictions: fold add_single_column_clustering_key_restriction() into its caller cql3: statement_restrictions: fold add_single_column_partition_key_restriction() into its caller cql3: statement_restrictions: fold add_token_partition_key_restriction() into its caller cql3: statement_restrictions: fold add_multi_column_clustering_key_restriction() into its caller cql3: statement_restrictions: avoid early return in add_multi_column_clustering_key_restrictions cql3: statement_restrictions: fold add_is_not_restriction() into its 
caller cql3: statement_restrictions: fold add_restriction() into its caller cql3: statement_restrictions: remove possible_partition_token_values() cql3: statement_restrictions: remove possible_column_values cql3: statement_restrictions: pass schema to possible_column_values() cql3: statement_restrictions: remove fallback path in solve() cql3: statement_restrictions: reorder possible_lhs_column parameters cql3: statement_restrictions: prepare solver for multi-column restrictions cql3: statement_restrictions: add solver for token restriction on index cql3: statement_restrictions: pre-analyze column in value_for() cql3: statement_restrictions: don't handle boolean constants in multi_column_range_accumulator_builder cql3: statement_restrictions: split range_from_raw_bounds into prepare phase and query phase cql3: statement_restrictions: adjust signature of range_from_raw_bounds cql3: statement_restrictions: split multi_column_range_accumulator into prepare-time and query-time phases cql3: statement_restrictions: make get_multi_column_clustering_bounds a builder cql3: statement_restrictions: multi-key clustering restrictions one layer deeper cql3: statement_restrictions: push multi-column post-processing into get_multi_column_clustering_bounds() cql3: statement_restrictions: pre-analyze single-column clustering key restrictions cql3: statement_restrictions: wrap value_for_index_partition_key() cql3: statement_restrictions: hide value_for() cql3: statement_restrictions: push down clustering prefix wrapper one level cql3: statement_restrictions: wrap functions that return clustering ranges cql3: statement_restrictions: do not pass view schema back and forth cql3: statement_restrictions: pre-analyze token range restrictions cql3: statement_restrictions: pre-analyze partition key columns cql3: statement_restrictions: do not collect subscripted partition key columns cql3: statement_restrictions: split _partition_range_restrictions into three cases cql3: 
statement_restrictions: move value_list, value_set to header file cql3: statement_restrictions: wrap get_partition_key_ranges cql3: statement_restrictions: prepare statement_restrictions for capturing `this` test: statement_restrictions: add index_selection regression test	2026-04-21 15:44:06 +03:00
Anna Stuchlik	d222e6e2a4	doc: document support for OCI Object Storage This commit extends the object storage configuration section with support for OCI object storage. Fixes SCYLLADB-502 Closes scylladb/scylladb#29503	2026-04-21 15:11:58 +03:00
Botond Dénes	cfebe17592	sstables: fix segfault in parse_assert() when message is nullptr parse_assert() accepts an optional `message` parameter that defaults to nullptr. When the assertion fails and message is nullptr, it is implicitly converted to sstring via the sstring(const char*) constructor, which calls strlen(nullptr) -- undefined behavior that manifests as a segfault in __strlen_evex. This turns what should be a graceful malformed_sstable_exception into a fatal crash. In the case of CUSTOMER-279, a corrupt SSTable triggered parse_assert() during streaming (in continuous_data_consumer:: fast_forward_to()), causing a crash loop on the affected node. Fix by guarding the nullptr case with a ternary, passing an empty sstring() when message is null. on_parse_error() already handles the empty-message case by substituting "parse_assert() failed". Fixes: SCYLLADB-1329 Closes scylladb/scylladb#29285	2026-04-21 12:40:33 +02:00
Marcin Maliszkiewicz	935e6a495d	Merge 'transport: add per-service-level cql_requests_serving metric' from Piotr Smaron The existing scylla_transport_requests_serving metric is a single global per-shard gauge counting outstanding CQL requests. When debugging latency spikes, it's useful to know which service level is contributing the most in-flight requests. This PR adds a new per-scheduling-group gauge scylla_transport_cql_requests_serving (with the scheduling_group_name label), using the existing cql_sg_stats per-SG infrastructure. The cql_ prefix is intentional — it follows the convention of all other per-SG transport metrics (cql_requests_count, cql_request_bytes, etc.) and avoids Prometheus confusion with the global requests_serving metric (which lacks the scheduling_group_name label). Fixes: SCYLLADB-1340 New feature, no backport. Closes scylladb/scylladb#29493 * github.com:scylladb/scylladb: transport: add per-service-level cql_requests_serving metric transport: move requests_serving decrement to after response is sent	2026-04-21 12:35:50 +02:00
Aleksandra Martyniuk	cd79b99112	test: fix flaky test_alter_tablets_rf_dc_drop by using read barrier The test was reading system_schema.keyspaces from an arbitrary node that may not have applied the latest schema change yet. Pin the read to a specific node and issue a read barrier before querying, ensuring the node has up-to-date data. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1643. Closes scylladb/scylladb#29563	2026-04-21 09:12:51 +03:00
Raphael S. Carvalho	474e962e01	compaction: Restrict tombstone GC sstable set to repaired sstables for tombstone_gc=repair mode When tombstone_gc=repair, the repaired compaction view's sstable_set_for_tombstone_gc() previously returned all sstables across all three views (unrepaired, repairing, repaired). This is correct but unnecessarily expensive: the unrepaired and repairing sets are never the source of a GC-blocking shadow when tombstone_gc=repair, for base tables. The key ordering guarantee that makes this safe is: - topology_coordinator sends send_tablet_repair RPC and waits for it to complete. Inside that RPC, mark_sstable_as_repaired() runs on all replicas, moving D from repairing → repaired (repaired_at stamped on disk). - Only after the RPC returns does the coordinator commit repair_time + sstables_repaired_at to Raft. - gc_before = repair_time - propagation_delay only advances once that Raft commit applies. Therefore, when a tombstone T in the repaired set first becomes GC-eligible (its deletion_time < gc_before), any data D it shadows is already in the repaired set on every replica. This holds because: - The memtable is flushed before the repairing snapshot is taken (take_storage_snapshot calls sg->flush()), capturing all data present at repair time. - Hints and batchlog are flushed before the snapshot, ensuring remotely-hinted writes arrive before the snapshot boundary. - Legitimate unrepaired data has timestamps close to 'now', always newer than any GC-eligible tombstone (USING TIMESTAMP to write backdated data is user error / UB). Excluding the repairing and unrepaired sets from the GC shadow check cannot cause any tombstone to be wrongly collected. The memtable check is also skipped for the same reason: memtable data is either newer than the GC-eligible tombstone, or was flushed into the repairing/repaired set before gc_before advanced. Safety restriction — materialized views: The optimization IS applied to materialized view tables. 
Two possible paths could inject D_view into the MV's unrepaired set after MV repair: view hints and staging via the view-update-generator. Both are safe: (1) View hints: flush_hints() creates a sync point covering BOTH _hints_manager (base mutations) AND _hints_for_views_manager (view mutations). It waits until ALL pending view hints — including D_view entries queued in _hints_for_views_manager while the target MV replica was down — have been replayed to the target node before take_storage_snapshot() is called. D_view therefore lands in the MV's repairing sstable and is promoted to repaired. When a repaired compaction then checks for shadows it finds D_view in the repaired set, keeping T_mv non-purgeable. (2) View-update-generator staging path: Base table repair can write a missing D_base to a replica via a staging sstable. The view-update-generator processes the staging sstable ASYNCHRONOUSLY: it may fire arbitrarily later, even after MV repair has committed repair_time and T_mv has been GC'd from the repaired set. However, the staging processor calls stream_view_replica_updates() which performs a READ-BEFORE-WRITE via as_mutation_source_excluding_staging(): it reads the CURRENT base table state before building the view update. If T_base was written to the base table (as it always is before the base replica can be repaired and the MV tombstone can become GC-eligible), the view_update_builder sees T_base as the existing partition tombstone. D_base's row marker (ts_d < ts_t) is expired by T_base, so the view update is a no-op: D_view is never dispatched to the MV replica. No resurrection can occur regardless of how long staging is delayed. A potential sub-edge-case is T_base being purged BEFORE staging fires (leaving D_base as the sole survivor, so stream_view_replica_updates would dispatch D_view). 
This is blocked by an additional invariant: for tablet-based tables, the repair writer stamps repaired_at on staging sstables (repair_writer_impl::create_writer sets mark_as_repaired = true and perform_component_rewrite writes repaired_at = sstables_repaired_at + 1 on every staging sstable). After base repair commits sstables_repaired_at to Raft, the staging sstable satisfies is_repaired(sstables_repaired_at, staging_sst) and therefore appears in make_repaired_sstable_set(). Any subsequent base repair that advances sstables_repaired_at further still includes the staging sstable (its repaired_at ≤ new sstables_repaired_at). D_base in the staging sstable thus shadows T_base in every repaired compaction's shadow check, keeping T_base non-purgeable as long as D_base remains in staging. A base table hint also cannot bypass this. A base hint is replayed as a base mutation. The resulting view update is generated synchronously on the base replica and sent to the MV replica via _hints_for_views_manager (path 1 above), not via staging. USING TIMESTAMP with timestamps predating (gc_before + propagation_delay) is explicitly UB and excluded from the safety argument. For tombstone_gc modes other than repair (timeout, immediate, disabled) the invariant does not hold for base tables either, so the full storage-group set is returned. Implementation: - Add compaction_group::is_repaired_view(v): pointer comparison against _repaired_view. - Add compaction_group::make_repaired_sstable_set(): iterates _main_sstables and inserts only sstables classified as repaired (repair::is_repaired(sstables_repaired_at, sst)). - Add storage_group::make_repaired_sstable_set(): collects repaired sstables across all compaction groups in the storage group. - Add table::make_repaired_sstable_set_for_tombstone_gc(): collects repaired sstables from all compaction groups across all storage groups (needed for multi-tablet tables). 
- Add compaction_group_view::skip_memtable_for_tombstone_gc(): returns true iff the repaired-only optimization is active; used by get_max_purgeable_timestamp() in compaction.cc to bypass the memtable shadow check. - is_tombstone_gc_repaired_only() private helper gates both methods: requires is_repaired_view(this) && tombstone_gc_mode == repair. No is_view() exclusion. - Add error injection "view_update_generator_pause_before_processing" in process_staging_sstables() to support testing the staging-delay scenario. - New test test_tombstone_gc_mv_optimization_safe_via_hints: stops servers[2], writes D_base + T_base (view hints queued for servers[2]'s MV replica), restarts, runs MV tablet repair (flush_hints delivers D_view + T_mv before snapshot), triggers repaired compaction, and asserts the MV row is NOT visible — T_mv preserved because D_view landed in the repaired set via the hints-before-snapshot path. - New test test_tombstone_gc_mv_safe_staging_processor_delay: runs base repair before writing T_base so D_base is staged on servers[0] via row-sync; blocks the view-update-generator with an error injection; writes T_base + T_mv; runs MV repair (fast path, T_mv GC-eligible); triggers repaired compaction (T_mv purged — no D_view in repaired set); asserts no resurrection; releases injection; waits for staging to complete; asserts no resurrection after a second flush+compaction. Demonstrates that the read-before-write in stream_view_replica_updates() makes the optimization safe even when staging fires after T_mv has been GC'd. The expected gain is reduced bloom filter and memtable key-lookup I/O during repaired compactions: the unrepaired set is typically the largest (it holds all recent writes), yet for tombstone_gc=repair it never influences GC decisions. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-20 16:59:09 -03:00
Ferenc Szili	a50aa7e689	test/cluster: wait for ready CQL in cross-rack merge test test_tablet_merge_cross_rack_migrations() starts issuing DDL immediately after adding the new cross-rack nodes. In the failing runs the driver is still converging on the updated topology at that point, so the control connection sees incomplete peer metadata while schema changes are in flight. That leaves a race where CREATE TABLE is sent during topology churn and the test can surface a misleading AlreadyExists error even though the table creation has already been committed. Use get_ready_cql(servers) here so the test waits for inter-node visibility and CQL readiness before creating the keyspace and table. Fixes: SCYLLADB-1635 Closes scylladb/scylladb#29561	2026-04-20 20:12:11 +02:00
Łukasz Paszkowski	d18eb9479f	cql/statement: Create keyspace_metadata with correct initial_tablets count In `ks_prop_defs::as_ks_metadata(...)` a default initial tablets count is set to 0 only when tablets are enabled and the replication strategy is NetworkTopologyStrategy. This effectively sets _uses_tablets = false in abstract_replication_strategy for the remaining strategies when no `tablets = {...}` options are specified. As a consequence, it is possible to create vnode-based keyspaces even when tablets are enforced with `tablets_mode_for_new_keyspaces`. The patch sets a default initial tablets count to zero regardless of the chosen replication strategy. Each replication strategy then validates the options and raises a configuration exception when tablets are not supported. All tests are altered in the following way: + whenever it was correct, SimpleStrategy was replaced with NetworkTopologyStrategy + otherwise, tablets were explicitly disabled with ` AND tablets = {'enabled': false}` Fixes https://github.com/scylladb/scylladb/issues/25340 Closes scylladb/scylladb#25342	2026-04-20 17:57:38 +03:00
Botond Dénes	69c58c6589	Merge 'streaming: add oos protection in mutation based streaming' from Łukasz Paszkowski The mutation-fragment-based streaming path in `stream_session.cc` did not check whether the receiving node was in critical disk utilization mode before accepting incoming mutation fragments. This meant that operations like `nodetool refresh --load-and-stream`, which stream data through the `STREAM_MUTATION_FRAGMENTS` RPC handler, could push data onto a node that had already reached critical disk usage. The file-based streaming path in stream_blob.cc already had this protection, but the load&stream path was missing it. This patch adds a check for `is_in_critical_disk_utilization_mode()` in the `stream_mutation_fragments` handler in `stream_session.cc`, throwing a `replica::critical_disk_utilization_exception` when the node is at critical disk usage. This mirrors the existing protection in the blob streaming path and closes the gap that allowed data to be written to a node that should have been rejecting all incoming writes. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-901 The out of space prevention mechanism was introduced in 2025.4. The fix should be backported there and all later versions. Closes scylladb/scylladb#28873 * github.com:scylladb/scylladb: streaming: reject mutation fragments on critical disk utilization test/cluster/storage: Add a reproducer for load-and-stream out-of-space rejection sstables: clean up TemporaryHashes file in wipe() sstables: add error injection point in write_components test/cluster/storage: extract validate_data_existence to module scope test/cluster: enable suppress_disk_space_threshold_checks in tests using data_file_capacity utils/disk_space_monitor: add error injection to suppress threshold checks	2026-04-20 17:56:36 +03:00
David Garcia	16ed338a89	Fix CODEOWNERS to cover nested docs subfolders The `docs/*` pattern only matches files directly inside `docs/`, not files in nested subfolders like `docs/folder_b/test.md` or `docs/alternator/setup.md`. Those files currently have no code owner assigned. Replace with `/docs/` and `/docs/alternator/` which match the directories and all their subdirectories recursively, per GitHub's CODEOWNERS syntax. Ref: https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners Closes scylladb/scylladb#29521	2026-04-20 17:55:43 +03:00
Avi Kivity	5687a4840d	conf: pair sstable_format=ms with column_index_size_in_kb=1 One of the advantages of Trie indexes (with sstable_format=ms) is that the index is more compact, and more suitable for paging from disk (fewer pages required per search). We can exploit it by setting column_index_size_in_kb to 1 rather than 64, increasing the index file size (and requiring more index pages to be loaded and parsed) in return for smaller data file reads. To test this, I created a 1M row partition with 300-byte rows, compacted it into a single sstable, and tested reads to a single row. With column_index_size_in_kb=64: Rows.db file size 60k; 3 pages read from Rows.db (4k each); 2x 32k read from Data.db. With column_index_size_in_kb=1: Rows.db file size 2MB (33X); 5 pages read from Rows.db (4k each, 1.7X); 1x 4107 bytes read from Data.db (0.5X IOPS, 0.06X bandwidth). Given that Rows.db will be typically cached, or at least all but one of the levels (its size is 157X smaller than Data.db), we win on both IOPS and bandwidth. I would have expected the Data.db read to be closer to 1k, but this is already an improvement. Given that, set column_index_size_in_kb=1, but only for new clusters where we also select sstable_format=ms. 
Raw data (w1, w64 are working directories with different column_index_size_in_kb): ```console $ ls -l w/data/bench/wide_partition-/{Rows,Data}.db -rw-r--r-- 1 avi avi 314964958 Apr 19 16:17 w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Data.db -rw-r--r-- 1 avi avi 2001227 Apr 19 16:17 w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db -rw-r--r-- 1 avi avi 314963261 Apr 19 16:18 w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db -rw-r--r-- 1 avi avi 59989 Apr 19 16:18 w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db ``` column_index_size_in_kb=64 trace: ``` cqlsh> SELECT FROM bench.wide_partition WHERE pk = 0 AND ck = 654321 BYPASS CACHE; pk \| ck \| v ----+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 0 \| 654321 \| 9OXdwmDHRapL2w5YruWLTOtiC3PKbyctSDdQ8YpuPKtWkSYBF10G7bKo2rdnxSAd52HLI21568YM7OwK05B6qAF7X2b6910qsJEA106QBEcFWQVybMCkxkpO4VDRcAVNLRgjB3vygcDBP17GBTb2s7l47UOloy3KtZ7J5YQgKcf7zlFSKGHa49vnRrzoXZCdYexOpix6jcSV2SiwRNqgv6XmYhx43ZwGa4zUtOe0eIKJj7KTxu5bzyWUWGW7US4NLFZRD8Vdb6EasIFkOfVKdiFp2LZHMXGRvtvdF93UTFUb (1 rows) Tracing session: 19219900-3bf3-11f1-bc43-c0a4e62b53d1 activity \| timestamp \| source \| source_elapsed \| client --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+-----------+----------------+----------- Execute CQL3 query \| 2026-04-19 16:24:30.992000 \| 127.0.0.1 \| 0 
\| 127.0.0.1 Parsing a statement [shard 0/sl:default] \| 2026-04-19 16:24:30.992643+00:00 \| 127.0.0.1 \| 1 \| 127.0.0.1 Processing a statement for authenticated user: anonymous [shard 0/sl:default] \| 2026-04-19 16:24:30.992738+00:00 \| 127.0.0.1 \| 96 \| 127.0.0.1 Executing read query (reversed false) [shard 0/sl:default] \| 2026-04-19 16:24:30.992765+00:00 \| 127.0.0.1 \| 123 \| 127.0.0.1 Creating read executor for token -3485513579396041028 with all: [cf134ebd-5f1b-4844-94e3-e5c7ad9421f0] targets: [cf134ebd-5f1b-4844-94e3-e5c7ad9421f0] repair decision: NONE [shard 0/sl:default] \| 2026-04-19 16:24:30.992781+00:00 \| 127.0.0.1 \| 139 \| 127.0.0.1 Creating never_speculating_read_executor - speculative retry is disabled or there are no extra replicas to speculate with [shard 0/sl:default] \| 2026-04-19 16:24:30.992782+00:00 \| 127.0.0.1 \| 140 \| 127.0.0.1 read_data: querying locally [shard 0/sl:default] \| 2026-04-19 16:24:30.992795+00:00 \| 127.0.0.1 \| 153 \| 127.0.0.1 Start querying singular range {{-3485513579396041028, pk{000400000000}}} [shard 0/sl:default] \| 2026-04-19 16:24:30.992801+00:00 \| 127.0.0.1 \| 160 \| 127.0.0.1 [reader concurrency semaphore sl:default] admitted immediately [shard 0/sl:default] \| 2026-04-19 16:24:30.992805+00:00 \| 127.0.0.1 \| 163 \| 127.0.0.1 [reader concurrency semaphore sl:default] executing read [shard 0/sl:default] \| 2026-04-19 16:24:30.992814+00:00 \| 127.0.0.1 \| 172 \| 127.0.0.1 Reading key {-3485513579396041028, pk{000400000000}} from sstable w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db [shard 0/sl:default] \| 2026-04-19 16:24:30.992837+00:00 \| 127.0.0.1 \| 195 \| 127.0.0.1 page cache miss: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Partitions.db, page=0, readahead=1 [shard 0/sl:default] \| 2026-04-19 16:24:30.992851+00:00 \| 127.0.0.1 \| 209 \| 127.0.0.1 page cache miss: 
file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db, page=14, readahead=1 [shard 0/sl:default] \| 2026-04-19 16:24:30.995294+00:00 \| 127.0.0.1 \| 2653 \| 127.0.0.1 page cache hit: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db, page=14 [shard 0/sl:default] \| 2026-04-19 16:24:30.995375+00:00 \| 127.0.0.1 \| 2733 \| 127.0.0.1 page cache miss: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db, page=2, readahead=1 [shard 0/sl:default] \| 2026-04-19 16:24:30.995376+00:00 \| 127.0.0.1 \| 2734 \| 127.0.0.1 page cache hit: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db, page=14 [shard 0/sl:default] \| 2026-04-19 16:24:30.995463+00:00 \| 127.0.0.1 \| 2821 \| 127.0.0.1 page cache hit: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db, page=2 [shard 0/sl:default] \| 2026-04-19 16:24:30.995463+00:00 \| 127.0.0.1 \| 2821 \| 127.0.0.1 w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db: scheduling bulk DMA read of size 32768 at offset 206057984 [shard 0/sl:default] \| 2026-04-19 16:24:30.995471+00:00 \| 127.0.0.1 \| 2829 \| 127.0.0.1 w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db: scheduling bulk DMA read of size 32768 at offset 206090752 [shard 0/sl:default] \| 2026-04-19 16:24:30.995475+00:00 \| 127.0.0.1 \| 2833 \| 127.0.0.1 w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db: finished bulk DMA read of size 32768 at offset 206057984, successfully read 32768 bytes [shard 0/sl:default] \| 2026-04-19 16:24:30.995586+00:00 \| 127.0.0.1 \| 2945 \| 127.0.0.1 Page stats: 1 partition(s) (1 live, 0 dead), 0 
static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 1 cell(s) (1 live, 0 dead) [shard 0/sl:default] \| 2026-04-19 16:24:30.995637+00:00 \| 127.0.0.1 \| 2995 \| 127.0.0.1 w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db: finished bulk DMA read of size 32768 at offset 206090752, successfully read 32768 bytes [shard 0/sl:default] \| 2026-04-19 16:24:30.995645+00:00 \| 127.0.0.1 \| 3003 \| 127.0.0.1 Querying is done [shard 0/sl:default] \| 2026-04-19 16:24:30.995653+00:00 \| 127.0.0.1 \| 3012 \| 127.0.0.1 Done processing - preparing a result [shard 0/sl:default] \| 2026-04-19 16:24:30.995670+00:00 \| 127.0.0.1 \| 3028 \| 127.0.0.1 Request complete \| 2026-04-19 16:24:30.995039 \| 127.0.0.1 \| 3039 \| 127.0.0.1 w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db: scheduling bulk DMA read of size 32768 at offset 206090752 [shard 0/sl:default] \| 2026-04-19 16:22:43.107215+00:00 \| 127.0.0.1 \| 8685 \| 127.0.0.1 ``` column_index_size_in_kb=1 trace: ``` cqlsh> SELECT * FROM bench.wide_partition WHERE pk = 0 AND ck = 654321 BYPASS CACHE; pk \| ck \| v ----+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 0 \| 654321 \| FIA7X52ZqYwvDxEGlmWJUSy1I94WTuWZTdLwXr9HBQ90RJLqYKr5nInTADSI6hzofwawaXphAQK07YMoyzFfRaGeKPQPKUb35XpLEGvLJ4xu9r4es8wUEHPXaFBGdMcWUkyDJSTYCFzZAPCzUHEuPJHMXVrI6UExWrIR0Xujg4GZa9UciU9rbEvrSBwSzoPEfbXJ6qZSGiTD8gcXz5kdAblLxsAeWug8tZqslsTu04HMLKfZ8WopQvHbpR6YlGSnM99CiBgz30LMmllULV4VA4u9kMpzsRV2IE2tKmJOddEl (1 rows) Tracing session: 3953a1f0-3bf3-11f1-b976-4a3dc2a7a57f activity \| timestamp \| source \| source_elapsed \| client 
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+-----------+----------------+----------- Execute CQL3 query \| 2026-04-19 16:25:25.007000 \| 127.0.0.1 \| 0 \| 127.0.0.1 Parsing a statement [shard 0/sl:default] \| 2026-04-19 16:25:25.007423+00:00 \| 127.0.0.1 \| 1 \| 127.0.0.1 Processing a statement for authenticated user: anonymous [shard 0/sl:default] \| 2026-04-19 16:25:25.007511+00:00 \| 127.0.0.1 \| 89 \| 127.0.0.1 Executing read query (reversed false) [shard 0/sl:default] \| 2026-04-19 16:25:25.007536+00:00 \| 127.0.0.1 \| 114 \| 127.0.0.1 Creating read executor for token -3485513579396041028 with all: [e7bd75e7-6d2a-46dc-9f66-430524f40e0d] targets: [e7bd75e7-6d2a-46dc-9f66-430524f40e0d] repair decision: NONE [shard 0/sl:default] \| 2026-04-19 16:25:25.007551+00:00 \| 127.0.0.1 \| 129 \| 127.0.0.1 Creating never_speculating_read_executor - speculative retry is disabled or there are no extra replicas to speculate with [shard 0/sl:default] \| 2026-04-19 16:25:25.007553+00:00 \| 127.0.0.1 \| 131 \| 127.0.0.1 read_data: querying locally [shard 0/sl:default] \| 2026-04-19 16:25:25.007556+00:00 \| 127.0.0.1 \| 134 \| 127.0.0.1 Start querying singular range {{-3485513579396041028, pk{000400000000}}} [shard 0/sl:default] \| 2026-04-19 16:25:25.007562+00:00 \| 127.0.0.1 \| 139 \| 127.0.0.1 [reader concurrency semaphore sl:default] admitted immediately [shard 0/sl:default] \| 2026-04-19 16:25:25.007564+00:00 \| 127.0.0.1 \| 142 \| 127.0.0.1 [reader concurrency semaphore sl:default] executing read [shard 0/sl:default] \| 2026-04-19 16:25:25.007573+00:00 \| 127.0.0.1 \| 151 \| 127.0.0.1 Reading key {-3485513579396041028, pk{000400000000}} from sstable w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Data.db [shard 
0/sl:default] \| 2026-04-19 16:25:25.007594+00:00 \| 127.0.0.1 \| 172 \| 127.0.0.1 page cache miss: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Partitions.db, page=0, readahead=1 [shard 0/sl:default] \| 2026-04-19 16:25:25.007607+00:00 \| 127.0.0.1 \| 184 \| 127.0.0.1 page cache miss: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=488, readahead=1 [shard 0/sl:default] \| 2026-04-19 16:25:25.016029+00:00 \| 127.0.0.1 \| 8607 \| 127.0.0.1 page cache hit: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=488 [shard 0/sl:default] \| 2026-04-19 16:25:25.016109+00:00 \| 127.0.0.1 \| 8687 \| 127.0.0.1 page cache miss: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=486, readahead=1 [shard 0/sl:default] \| 2026-04-19 16:25:25.016111+00:00 \| 127.0.0.1 \| 8688 \| 127.0.0.1 page cache miss: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=285, readahead=1 [shard 0/sl:default] \| 2026-04-19 16:25:25.016176+00:00 \| 127.0.0.1 \| 8754 \| 127.0.0.1 page cache hit: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=488 [shard 0/sl:default] \| 2026-04-19 16:25:25.016260+00:00 \| 127.0.0.1 \| 8838 \| 127.0.0.1 page cache hit: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=486 [shard 0/sl:default] \| 2026-04-19 16:25:25.016261+00:00 \| 127.0.0.1 \| 8839 \| 127.0.0.1 page cache hit: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=285 [shard 0/sl:default] \| 2026-04-19 16:25:25.016261+00:00 \| 127.0.0.1 \| 8839 \| 127.0.0.1 
w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Data.db: scheduling bulk DMA read of size 4107 at offset 206086656 [shard 0/sl:default] \| 2026-04-19 16:25:25.016268+00:00 \| 127.0.0.1 \| 8846 \| 127.0.0.1 w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Data.db: finished bulk DMA read of size 4107 at offset 206086656, successfully read 4608 bytes [shard 0/sl:default] \| 2026-04-19 16:25:25.016340+00:00 \| 127.0.0.1 \| 8918 \| 127.0.0.1 Page stats: 1 partition(s) (1 live, 0 dead), 0 static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 1 cell(s) (1 live, 0 dead) [shard 0/sl:default] \| 2026-04-19 16:25:25.016367+00:00 \| 127.0.0.1 \| 8945 \| 127.0.0.1 Querying is done [shard 0/sl:default] \| 2026-04-19 16:25:25.016385+00:00 \| 127.0.0.1 \| 8963 \| 127.0.0.1 Done processing - preparing a result [shard 0/sl:default] \| 2026-04-19 16:25:25.016401+00:00 \| 127.0.0.1 \| 8979 \| 127.0.0.1 Request complete \| 2026-04-19 16:25:25.015989 \| 127.0.0.1 \| 8989 \| 127.0.0.1 ``` Closes scylladb/scylladb#29552	2026-04-20 17:53:56 +03:00
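The ratios quoted in this commit message can be re-derived from the raw file sizes and read sizes shown in the listing and traces. The short calculation below is only a worked check of that arithmetic; the variable names are made up for this example.

```python
# Sizes in bytes, taken from the `ls -l` output in the commit message.
rows_db_64 = 59_989        # Rows.db with column_index_size_in_kb=64
rows_db_1 = 2_001_227      # Rows.db with column_index_size_in_kb=1
data_db_1 = 314_964_958    # Data.db with column_index_size_in_kb=1

# Reads observed in the traces.
index_pages_64, index_pages_1 = 3, 5
data_reads_64 = [32_768, 32_768]   # two bulk DMA reads
data_reads_1 = [4_107]             # one bulk DMA read

index_growth = rows_db_1 / rows_db_64                     # index file size ratio
data_to_index = data_db_1 / rows_db_1                     # Data.db vs Rows.db
pages_ratio = index_pages_1 / index_pages_64              # index pages read
iops_ratio = len(data_reads_1) / len(data_reads_64)       # Data.db read count
bandwidth_ratio = sum(data_reads_1) / sum(data_reads_64)  # Data.db bytes read

print(f"index size: {index_growth:.1f}X, index pages: {pages_ratio:.1f}X")
print(f"Data.db IOPS: {iops_ratio:.1f}X, bandwidth: {bandwidth_ratio:.2f}X")
print(f"Data.db is {data_to_index:.0f}X larger than Rows.db")
```

The results round to the 33X, 1.7X, 0.5X, 0.06X and 157X figures given in the message.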
Marcin Maliszkiewicz	9f11920b15	Merge 'alternator: fix remaining problems with new Stream ARN format' from Nadav Har'El This small series includes a few followups to the patch that changed Alternator Stream ARNs from using our own UUID format to something that resembles Amazon's Stream ARNs (and the KCL library won't reject as bogus-looking ARNs). The first patch is the most important one, fixing ListStreams's LastEvaluatedStreamArn to also use the new ARN format. It fixes SCYLLADB-539. The following patches are additional cleanups and tests for the new ARN code. Closes scylladb/scylladb#29474 * github.com:scylladb/scylladb: alternator: fix ListStreams paging if table is deleted during paging test/alternator: test DescribeStream on non-existent table alternator: ListStreams: on last page, avoid LastEvaluatedStreamArn alternator: remove dead code stream_shard_id alternator: fix ListStreams to return real ARN as LastEvaluatedStreamArn	2026-04-20 14:42:28 +02:00
Raphael S. Carvalho	a50e6215aa	test/repair: Add tombstone GC safety tests for incremental repair Add three cluster tests that verify no data resurrection occurs when tombstone GC runs on the repaired sstable set under incremental repair with tombstone_gc=repair mode. All tests use propagation_delay_in_seconds=0 so that tombstones become GC-eligible immediately after repair_time is committed (gc_before = repair_time), allowing the scenarios to exercise the actual GC eligibility path without artificial sleeps. (test_tombstone_gc_no_resurrection_basic_ordering) Data D (ts=1) and tombstone T (ts=2) are written to all replicas and flushed before repair. Repair captures both in the repairing snapshot and promotes them to repaired. Once repair_time is committed, T is GC-eligible (T.deletion_time < gc_before = repair_time). The test verifies that compaction on the repaired set does NOT purge T, because D is already in repaired (mark_sstable_as_repaired() completes on all replicas before repair_time is committed to Raft) and clamps max_purgeable to D.timestamp=1 < T.timestamp=2. (test_tombstone_gc_no_resurrection_hints_flush_failure) The repair_flush_hints_batchlog_handler_bm_uninitialized injection causes hints flush to fail on one node. When hints flush fails, flush_time stays at gc_clock::time_point{} (epoch). This propagates as repair_time=epoch committed to system.tablets, so gc_before = epoch - propagation_delay is effectively the minimum possible time. No tombstone has a deletion_time older than epoch, so T is never GC-eligible from this repair. The test verifies that repair_time does not advance to a meaningful value after a failed hints flush, and that compaction on the repaired set does not purge T (key remains deleted, no resurrection). 
(test_tombstone_gc_no_resurrection_propagation_delay) Simulates a write D carrying an old CQL USING TIMESTAMP (ts_d = now-2h) that was stored as a hint while a replica was down, and a tombstone T with a higher timestamp (ts_t = now-90min, ts_t > ts_d) that was written to all live replicas. After the replica restarts, repair flushes hints synchronously before taking the repairing snapshot, guaranteeing D is delivered and captured in repairing before the snapshot. After mark_sstable_as_repaired() promotes D to repaired, the coordinator commits repair_time. gc_before = repair_time > T.deletion_time so T is GC-eligible. The test verifies that compaction on the repaired set does NOT purge T: D (ts_d < ts_t) is already in repaired, clamping max_purgeable = ts_d < ts_t = T.timestamp, so T is not purgeable. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-20 09:09:39 -03:00
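The clamping argument used by all three tests can be reduced to two comparisons. The sketch below is a simplified model under stated assumptions, not the code in compaction.cc: it assumes a tombstone is purged only when it is GC-eligible (`deletion_time < gc_before`) and its write timestamp is lower than `max_purgeable`, taken here as the minimum timestamp of data for the same key in the sstables consulted by the shadow check. The function names are hypothetical.

```python
import math


def max_purgeable(shadow_timestamps):
    # Lowest write timestamp among data that the tombstone might shadow;
    # with nothing to shadow, any tombstone timestamp is below the bound.
    return min(shadow_timestamps, default=math.inf)


def can_purge(tombstone_ts, deletion_time, gc_before, shadow_timestamps):
    gc_eligible = deletion_time < gc_before
    return gc_eligible and tombstone_ts < max_purgeable(shadow_timestamps)


# basic_ordering: D (ts=1) is in the repaired set, T (ts=2) is GC-eligible.
kept = can_purge(tombstone_ts=2, deletion_time=100, gc_before=200,
                 shadow_timestamps=[1])

# No shadowed data in the consulted set: an eligible tombstone may be dropped.
purged = can_purge(tombstone_ts=2, deletion_time=100, gc_before=200,
                   shadow_timestamps=[])

# hints_flush_failure: gc_before stays at the epoch, so T is never eligible.
never_eligible = can_purge(tombstone_ts=2, deletion_time=100, gc_before=0,
                           shadow_timestamps=[])
```

Under this model `kept` and `never_eligible` are false and `purged` is true, matching the outcomes the three tests assert.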
Wojciech Mitros	6011cb8a4c	db/view: track range tombstones in update stream during view update building The view update builder ignored range tombstone changes from the update stream when all existing mutation fragments were already consumed. The old code assumed range tombstones 'remove nothing pre-existing, so we can ignore it', but this failed to update _update_current_tombstone. Consequently, when a range delete and an insert within that range appeared in the same batch, the range tombstone was not applied to the inserted row, or was applied to a row outside the range that it covered, causing it to incorrectly survive/be deleted in the materialized view. Fix by handling is_range_tombstone_change() fragments in the update-only branch, updating _update_current_tombstone so subsequent clustering rows correctly have the range tombstone applied to them. Fixes SCYLLADB-1555 Closes scylladb/scylladb#29483	2026-04-20 13:38:52 +02:00
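The effect of tracking range tombstone changes in an update-only stream can be shown with a toy fragment stream. This is an illustrative sketch only: the tuple-based fragment encoding and the `live_rows` helper are hypothetical, and the rule that a row survives only when its timestamp is strictly higher than the active tombstone's is a simplifying assumption.

```python
def live_rows(update_fragments):
    """Return the keys of rows that survive the range tombstones in the stream.

    Fragments are either ("range_tombstone_change", ts_or_None) or
    ("row", key, ts). A change with ts=None closes the open range.
    """
    current_tombstone = None  # plays the role of _update_current_tombstone
    live = []
    for fragment in update_fragments:
        if fragment[0] == "range_tombstone_change":
            # The fix: always record the change, even when no existing
            # fragments remain, so later rows see the right tombstone.
            current_tombstone = fragment[1]
        else:
            _, key, ts = fragment
            if current_tombstone is None or ts > current_tombstone:
                live.append(key)
    return live


batch = [
    ("range_tombstone_change", 10),    # range delete opens
    ("row", "ck=2", 10),               # insert inside the range, same timestamp
    ("range_tombstone_change", None),  # range delete closes
    ("row", "ck=9", 10),               # insert outside the range
]
result = live_rows(batch)
```

Here `result` contains only `"ck=9"`: the row inside the range is covered, and the row after the closing change is not.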
Wojciech Mitros	073710a661	view: apply existing range tombstones after exhausting the update reader When view_update_builder::on_results() hits the path where the update fragment reader is already exhausted, it still needs to keep tracking existing range tombstones and apply them to encountered rows. Otherwise a row covered by an existing range tombstone can appear alive while generating the view update and create a spurious view row. Update the existing tombstone state even on the exhausted-reader path and apply the effective tombstone to clustering rows before generating the row tombstone update. Add a cqlpy regression test covering the partition-delete-after-range-tombstone case. Fixes: SCYLLADB-1554 Closes scylladb/scylladb#29481	2026-04-20 13:29:05 +02:00
Dario Mirovic	40740104ab	test: use DROP KEYSPACE IF EXISTS in new_test_keyspace cleanup The new_test_keyspace context manager in test/cluster/util.py uses DROP KEYSPACE without IF EXISTS during cleanup. The Python driver has a known bug (scylladb/python-driver#317) where connection pool renewal after concurrent node bootstraps causes double statement execution. The DROP succeeds server-side, but the response is lost when the old pool is closed. The driver retries on the new pool and gets a ConfigurationException with the message "Cannot drop non existing keyspace". The CREATE KEYSPACE in create_new_test_keyspace already uses IF NOT EXISTS as a workaround for the same driver bug. This patch applies the same approach to fix DROP KEYSPACE. Fixes SCYLLADB-1538 Closes scylladb/scylladb#29487	2026-04-20 12:51:17 +02:00
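Why IF EXISTS makes the cleanup safe under double execution can be demonstrated with a small in-memory model. Everything in this sketch is hypothetical: `FakeServer`, the local `ConfigurationException` class and `execute_twice` are stand-ins written for this example and do not reflect the Python driver's API or the helpers in test/cluster/util.py.

```python
class ConfigurationException(Exception):
    """Local stand-in for the server-side error named in the commit message."""


class FakeServer:
    """Toy server that only understands DROP KEYSPACE [IF EXISTS] <name>."""

    def __init__(self, keyspaces):
        self.keyspaces = set(keyspaces)

    def execute(self, statement):
        words = statement.split()
        name = words[-1]
        if_exists = words[2:4] == ["IF", "EXISTS"]
        if name not in self.keyspaces:
            if if_exists:
                return  # idempotent: dropping a missing keyspace is a no-op
            raise ConfigurationException("Cannot drop non existing keyspace")
        self.keyspaces.remove(name)


def execute_twice(server, statement):
    # Models the retry: the first execution succeeds but its response is
    # lost, so the same statement is sent again.
    server.execute(statement)
    server.execute(statement)


safe = FakeServer({"ks"})
execute_twice(safe, "DROP KEYSPACE IF EXISTS ks")  # no error

unsafe = FakeServer({"ks"})
try:
    execute_twice(unsafe, "DROP KEYSPACE ks")
    failed = False
except ConfigurationException:
    failed = True
```

The plain DROP fails on the second execution, while the IF EXISTS form tolerates it, which is the same reasoning behind the existing IF NOT EXISTS on the CREATE side.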
Botond Dénes	ad7647c3c7	test/commitlog: reduce resource usage in test_commitlog_handle_replayed_segments The test was using max_size_mb = 8*1024 (8 GB) with 100 iterations, causing it to create up to 260 files of 32 MB each per iteration via fallocate. On a loaded CI machine this totals hundreds of GB of file operations, easily exceeding the 15-minute test timeout (SCYLLADB-1496). The test only needs enough files to verify that delete_segments keeps the disk footprint within [shard_size, shard_size + seg_size]. Reduce max_size_mb to 128 (8 files of 32 MB per iteration) and the iteration count to 10, which is sufficient to exercise the serialized-deletion and recycle logic without imposing excessive I/O load. Closes scylladb/scylladb#29510	2026-04-20 11:02:25 +03:00
Ernest Zaslavsky	e5e6608f20	sstables_loader: prevent use-after-free on table drop during streaming sstables_loader::load_and_stream holds a replica::table& reference via the sstable_streamer for the entire streaming operation. If the table is dropped concurrently (e.g. DROP TABLE or DROP KEYSPACE), the reference becomes dangling and the next access crashes with SEGV. This was observed in a longevity-50gb-12h-master test run where a keyspace was dropped while load_and_stream was still streaming SSTables from a previous batch. Fix by acquiring a stream_in_progress() phaser guard in load_and_stream before creating the streamer. table::stop() calls _pending_streams_phaser.close() which blocks until all outstanding guards are released, keeping the table alive for the duration of the streaming operation. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1352 Closes scylladb/scylladb#29403	2026-04-20 07:39:51 +03:00
Benny Halevy	34adb0e069	test/cluster/dtest: fix test_scrub_static_table flakiness Pass jvm_args=["--smp", "1"] on both cluster.start() calls to ensure consistent shard count across restarts, avoiding resharding on restart. Also pass wait_for_binary_proto=True to cluster.start() to ensure the CQL port is ready before connecting. Fixes: SCYLLADB-824 Closes scylladb/scylladb#29548	2026-04-20 06:53:49 +03:00
Piotr Szymaniak	378bcd69e3	tree: add AGENTS.md router and improve AI instruction files Add AGENTS.md as a minimal router that directs AI agents to the relevant instruction files based on what they are editing. Improve the instruction files: - cpp.instructions.md: clarify seastarx.hh scope (headers, not "many files"), explain std::atomic restriction (single-shard model, not "blocking"), scope macros prohibition to new ad-hoc only, add coroutine exception propagation pattern, add invariant checking section preferring throwing_assert() over SCYLLA_ASSERT (issue #7871) - python.instructions.md: demote PEP 8 to fallback after local style, clarify that only wildcard imports are prohibited - copilot-instructions.md: show configure.py defaults to dev mode, add frozen toolchain section, clarify --no-gather-metrics applies to test.py, fix Python test paths to use .py extension, add license header guidance for new files Closes scylladb/scylladb#29023	2026-04-19 21:59:52 +03:00
Dario Mirovic	f77ff28081	test: manager_client: use safe_driver_shutdown for exclusive_clusters Using cluster.shutdown() is an incorrect way to shut down a Cassandra Cluster. The correct way is using safe_driver_shutdown. Fixes SCYLLADB-1434 Closes scylladb/scylladb#29390	2026-04-19 21:31:18 +03:00
Avi Kivity	d584bd7358	cql3: statement_restrictions: replace has_eq_restriction_on_column with precomputed set has_eq_restriction_on_column() walked expression trees at prepare time to find binary_operators with op==EQ that mention a given column on the LHS. Its only caller is ORDER BY validation in select_statement, which checks that clustering columns without an explicit ordering have an EQ restriction. Replace the 50-line expression-walking free function with a precomputed unordered_set<const column_definition*> (_columns_with_eq) populated during the main predicate loop in analyze_statement_restrictions. For single-column EQ predicates the column is taken from on_column; for multi-column EQ like (ck1, ck2) = (1, 2), all columns in on_clustering_key_prefix are included. The member function becomes a single set::contains() call.	2026-04-19 20:57:09 +03:00
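The shape of this change, replacing a per-query tree walk with a set filled once during the predicate loop, can be sketched as follows. The `Predicate` class, its fields and `collect_columns_with_eq` are simplified, hypothetical stand-ins; the real code stores `const column_definition*` pointers in `_columns_with_eq`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    """Hypothetical predicate: one column, or several for a multi-column EQ."""
    columns: tuple
    op: str


def collect_columns_with_eq(predicates):
    # Filled once while analysing the WHERE clause; later lookups are O(1).
    columns_with_eq = set()
    for pred in predicates:
        if pred.op == "EQ":
            # A multi-column EQ such as (ck1, ck2) = (1, 2) adds every column.
            columns_with_eq.update(pred.columns)
    return columns_with_eq


predicates = [
    Predicate(("pk",), "EQ"),
    Predicate(("ck1", "ck2"), "EQ"),
    Predicate(("ck3",), "LT"),
]
columns_with_eq = collect_columns_with_eq(predicates)


def has_eq_restriction_on_column(column):
    # The former expression walk becomes a single membership test.
    return column in columns_with_eq
```

With these inputs, `ck1` and `ck2` report an EQ restriction and `ck3` does not, so an ORDER BY check in this model would still require an explicit ordering for `ck3`.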
Avi Kivity	b7f86eaabc	cql3: statement_restrictions: replace multi_column_range_accumulator_builder with direct predicate iteration build_get_multi_column_clustering_bounds_fn() used expr::visit() to dispatch each restriction through a 15-handler visitor struct. Only the binary_operator handler did real work; the conjunction handler just recursed, and the remaining 13 handlers were dead-code on_internal_error calls (the filter expression of each predicate is always a binary_operator). Replace the visitor with a loop over predicates that does as<binary_operator>(pred.filter) directly, building the same query-time lambda inline. Promote intersect_all() and process_in_values() from static methods of the deleted struct to free functions in the anonymous namespace -- they are still called from the query-time lambda.	2026-04-19 20:57:09 +03:00
Avi Kivity	ece9af229d	cql3: statement_restrictions: use predicate fields in build_get_clustering_bounds_fn Replace find_binop(..., is_multi_column) with pred.is_multi_column in build_get_clustering_bounds_fn() and add_clustering_restrictions_to_idx_ck_prefix(). Replace is_clustering_order(binop) with pred.order == comparison_order::clustering and iterate predicates directly instead of extracting filter expressions. Remove the now-dead is_multi_column() free function.	2026-04-19 20:57:09 +03:00
Avi Kivity	72da1207d7	cql3: statement_restrictions: remove extract_single_column_restrictions_for_column The previous commit made prepare_indexed_local() use the pre-built predicate vectors instead of calling extract_single_column_restrictions_for_column(). That was the last production caller. Remove the function definition (65 lines of expression-walking visitor) and its declaration/doc-comment from the header. Replace the unit test (expression_extract_column_restrictions) which directly called the removed function with synthetic column_definitions, with per_column_restriction_routing which exercises the same routing logic through the public analyze_statement_restrictions() API. The new test verifies not just factor counts but the exact (column_name, oper_t) pairs in each per-column entry, catching misrouted restrictions that a count-only check would miss.	2026-04-19 20:57:09 +03:00
Avi Kivity	b093477cf7	cql3: statement_restrictions: use predicate vectors in prepare_indexed_local Replace the extract_single_column_restrictions_for_column(_where, ...) call in prepare_indexed_local() with a direct lookup in the pre-built predicate vectors. The old code walked the entire WHERE expression tree to extract binary operators mentioning the indexed column, wrapped them in a conjunction, translated column definitions to the index schema, then called to_predicate_on_column() which walked the expression again to convert back to predicates. The new code selects the appropriate predicate vector map (PK, CK, or non-PK) based on the indexed column's kind, looks up the column's predicates directly, applies replace_column_def to each, and folds them with make_conjunction -- producing the same result without any expression tree walks. This removes the last production caller of extract_single_column_restrictions_for_column (unit tests in statement_restrictions_test.cc still exercise it).	2026-04-19 20:57:09 +03:00
Avi Kivity	a725e39218	cql3: statement_restrictions: use predicate vector size for clustering prefix length Replace the body of num_clustering_prefix_columns_that_need_not_be_filtered() with a single return of _clustering_prefix_restrictions.size(). The old implementation called get_single_column_restrictions_map() to rebuild a per-column map from the clustering expression tree, then iterated it in schema order counting columns until it hit a gap, a needs-filtering predicate, or a slice. But _clustering_prefix_restrictions is already built with exactly that same logic during the constructor (lines 1234-1248): it iterates CK columns in schema order, appending predicates until it encounters a gap in column_id, a predicate that needs_filtering, or a slice -- at which point it stops. So the vector's size is, by construction, the answer to the same question the old code was re-deriving at query time. This makes four helper functions dead code: - get_single_column_restrictions_map(): walked the expression tree to build a map<column_definition*, expression> of per-column restrictions. Was a ~15-line function that called get_sorted_column_defs() and extract_single_column_restrictions_for_column() for each column. - get_the_only_column(): extracted the single column_value from a restriction expression, asserting it was single-column. Called by the old loop body. - is_single_column_restriction(): thin wrapper around get_single_column_restriction_column(). - get_single_column_restriction_column(): ~25-line function that walked an expression tree with for_each_expression<column_value> to determine whether all column_value nodes refer to the same column. Called by the above two. Remove all four functions and their forward declarations (-95 lines).	2026-04-19 20:57:08 +03:00
Avi Kivity	68c2e292ac	cql3: statement_restrictions: replace do_find_idx and is_supported_by with predicate-based versions Convert do_find_idx() from a member function that walks expression trees via index_restrictions()/for_each_expression/extract_single_column_restrictions to a static free function that iterates index_search_group spans using are_predicates_supported_by(). Convert calculate_column_defs_for_filtering_and_erase_restrictions_used_for_index() to use predicate vectors instead of expression-based is_supported_by(). Remove now-dead code: is_supported_by(), is_supported_by_helper(), score() member function, and do_find_idx() member function.	2026-04-19 20:57:08 +03:00
Avi Kivity	c42397e995	cql3: statement_restrictions: remove expression-based has_supporting_index and index_supports_some_column These functions are no longer called now that all index support checks in the constructor use predicate-based alternatives. The expression-based is_supported_by and is_supported_by_helper are still needed by choose_idx() and calculate_column_defs_for_filtering_and_erase_restrictions_used_for_index().	2026-04-19 20:57:08 +03:00
Avi Kivity	1aafe0708a	cql3: statement_restrictions: replace multi-column and PK index support checks with predicate-based versions Replace clustering_columns_restrictions_have_supporting_index(), multi_column_clustering_restrictions_are_supported_by(), get_clustering_slice(), and partition_key_restrictions_have_supporting_index() with predicate-based equivalents that use the already-accumulated mc_ck_preds and sc_pk_pred_vectors locals. The new multi_column_predicates_have_supporting_index() checks each multi-column predicate's columns list directly against indexes, avoiding expression tree walks through find_in_expression and bounds_slice.	2026-04-19 20:57:08 +03:00
Avi Kivity	fa6f239cc7	cql3: statement_restrictions: add predicate-based index support checking Add `op` and `is_subscript` fields to `struct predicate` and populate them in all predicate creation sites in `to_predicates()`. These fields record the binary operator and whether the LHS is a subscript (map element access), which are the two pieces of information needed to query index support. Add `is_predicate_supported_by()` which mirrors `is_supported_by_helper()` but operates on a single predicate's fields instead of walking the expression tree. Add a predicate-vector overload of `index_supports_some_column()` and use it in the constructor to replace expression-based index support checks for single-column partition key, clustering key, and non-primary-key restrictions. The multi-column clustering key case still uses the existing expression-based path.	2026-04-19 20:57:08 +03:00
Avi Kivity	25ba3bd649	cql3: statement_restrictions: use pre-built single-column maps for index support checks Replace index_supports_some_column(expression, ...) with index_supports_some_column(single_column_restrictions_map, ...) to eliminate get_single_column_restrictions_map() tree walks when checking index support. The three call sites now use the maps already built incrementally in the constructor loop: _single_column_nonprimary_key_restrictions, _single_column_clustering_key_restrictions, and _single_column_partition_key_restrictions. Also replace contains_multi_column_restriction() tree walk in clustering_columns_restrictions_have_supporting_index() with _has_multi_column.	2026-04-19 20:57:08 +03:00
Avi Kivity	fab90224b3	cql3: statement_restrictions: build clustering-prefix restrictions incrementally Replace the extract_clustering_prefix_restrictions() tree walk with incremental collection during the main loop. Two new locals -- mc_ck_preds and sc_ck_preds -- accumulate multi-column and single-column clustering key predicates respectively. A short post-loop block computes the longest contiguous prefix from sc_ck_preds (or uses mc_ck_preds directly for multi-column), replacing the removed function. Also remove the now-unused to_predicate_on_clustering_key_prefix(), with_current_binary_operator() helper, and the visitor_with_binary_operator_context concept.	2026-04-19 20:57:08 +03:00
Avi Kivity	3bd308986a	cql3: statement_restrictions: build partition-range restrictions incrementally Replace the extract_partition_range() tree walk with incremental collection during the main loop. Two new locals before the loop -- token_pred and pk_range_preds -- accumulate token and single-column EQ/IN partition key predicates respectively. A short post-loop block materializes _partition_range_restrictions from these locals, replacing the removed function. This removes the last tree walk over partition-key restrictions.	2026-04-19 20:57:08 +03:00
Avi Kivity	db28411548	cql3: statement_restrictions: build clustering-key single-column restrictions map incrementally Instead of accumulating all clustering-key restrictions into a conjunction tree and then decomposing it by column via get_single_column_restrictions_map() post-loop, build the per-column map incrementally as each single-column clustering-key predicate is processed. The post-loop guard (!has_mc_clustering) is no longer needed: multi-column predicates go through the is_multi_column branch and never insert into this map, and mixing multi with single-column is rejected with an exception. This eliminates a post-loop tree walk over _clustering_columns_restrictions.	2026-04-19 20:57:08 +03:00
Avi Kivity	a4608804d8	cql3: statement_restrictions: build partition-key single-column restrictions map incrementally Instead of accumulating all partition-key restrictions into a conjunction tree and then decomposing it by column via get_single_column_restrictions_map() post-loop, build the per-column map incrementally as each single-column partition-key predicate is processed. The post-loop guard (!has_token_restrictions()) is no longer needed: token predicates go through the on_partition_key_token branch and never insert into this map, and mixing token with non-token is rejected with an exception. This eliminates a post-loop tree walk over _partition_key_restrictions.	2026-04-19 20:57:08 +03:00
Avi Kivity	e9b16a11ba	cql3: statement_restrictions: build non-primary-key single-column restrictions map incrementally Instead of accumulating all non-primary-key restrictions into a conjunction tree and then decomposing it by column via get_single_column_restrictions_map() post-loop, build the per-column map incrementally as each non-primary-key predicate is processed. This eliminates a post-loop tree walk over _nonprimary_key_restrictions.	2026-04-19 20:57:08 +03:00
Avi Kivity	701366a8d1	cql3: statement_restrictions: use tracked has_mc_clustering for _has_multi_column Replace the two post-loop find_binop(_clustering_columns_restrictions, is_multi_column) tree walks and the contains_multi_column_restriction() tree walk with the already-tracked local has_mc_clustering. The redundant second assignment inside the _check_indexes block is removed entirely.	2026-04-19 20:57:08 +03:00
Avi Kivity	da438507d0	cql3: statement_restrictions: track has-token state incrementally Replace the two in-loop calls to has_token_restrictions() (which walks the _partition_key_restrictions expression tree looking for token function calls) with a local bool has_token, set to true when a token predicate is processed. The member function is retained since it's used outside the constructor. With this change, the constructor loop's non-error control flow performs zero expression tree scanning. The only remaining tree walks are on error paths (get_sorted_column_defs, get_columns_in_commons for formatting exception messages) and structural (make_conjunction for building accumulated expressions).	2026-04-19 20:57:07 +03:00
Avi Kivity	1344278a19	cql3: statement_restrictions: track partition-key-empty state incrementally Replace the in-loop call to partition_key_restrictions_is_empty() (which walks the _partition_key_restrictions expression tree via is_empty_restriction()) with a local bool pk_is_empty, set to false at the two sites where partition key restrictions are added. The member function is retained since it's used outside the constructor.	2026-04-19 20:57:07 +03:00
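
Several of the messages above (a725e39218, fab90224b3) describe computing the longest contiguous clustering-key prefix: walk the single-column predicates in schema order and stop at a gap in column_id, a predicate that needs filtering, or a slice. The following is a minimal sketch of that rule, not the code from the repository. The `ck_predicate` struct, its fields, and the function name are hypothetical stand-ins. It also assumes one predicate per column and that the slice predicate itself is kept as the last element of the prefix, which the messages do not state explicitly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for a single-column clustering-key predicate.
struct ck_predicate {
    std::size_t column_id;  // position of the column within the clustering key
    bool needs_filtering;   // predicate cannot be served by a key-range lookup
    bool is_slice;          // range restriction such as < or >=
};

// Returns the longest contiguous prefix of predicates that can bound a
// clustering range. Input must be ordered by column_id (schema order).
// Assumption: a slice is included in the prefix and then ends it.
std::vector<ck_predicate> longest_clustering_prefix(const std::vector<ck_predicate>& sc_ck_preds) {
    assert(std::is_sorted(sc_ck_preds.begin(), sc_ck_preds.end(),
                          [](const ck_predicate& a, const ck_predicate& b) {
                              return a.column_id < b.column_id;
                          }));
    std::vector<ck_predicate> prefix;
    std::size_t expected = 0;  // next column_id that keeps the prefix contiguous
    for (const auto& p : sc_ck_preds) {
        if (p.column_id != expected || p.needs_filtering) {
            break;  // gap in the key, or predicate that must be filtered
        }
        prefix.push_back(p);
        ++expected;
        if (p.is_slice) {
            break;  // nothing after a slice can extend the prefix
        }
    }
    return prefix;
}
```

Under these assumptions the size of the returned vector is the number of clustering prefix columns that need not be filtered, which is why a725e39218 can replace the query-time recomputation with a single size() call on the vector built in the constructor.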

53495 Commits