Fixes: SCYLLADB-1523
The returned file object does not increment file pos as is. One line fix.
Added test to make sure this read path works as expected.
Closesscylladb/scylladb#29456
Wait for cql on all hosts after restarting a server in the test.
The problem that was observed is that the test restarts servers[1] and
doesn't wait for the cql to be ready on it. On test teardown it drops
the keyspace, trying to execute it on the host that is not ready, and
fails.
Fixes SCYLLADB-1632
Closesscylladb/scylladb#29562
Commit 16b56c2451 ("Audit: avoid dynamic_cast on a hot path") moved
audit info into batch_statement via set_audit_info(), but only wired it
for the CQL-text BATCH path (raw::batch_statement::prepare()).
Native-protocol BATCH messages (opcode 0x0D), handled by
process_batch_internal in transport/server.cc, construct a
batch_statement without setting audit_info. This causes audit to
silently skip the entire batch.
Set audit_info on the batch_statement so these batches are audited.
Fixes SCYLLADB-1652
No backport - bug introduced recently.
Closesscylladb/scylladb#29570
* github.com:scylladb/scylladb:
test/audit: add reproducer for native-protocol batch not being audited
audit: set audit_info for native-protocol BATCH messages
test/audit: rename internal test methods to avoid CI misdetection
should be merged after #29235
Complete the typed skip markers migration started in the plugin PR.
Every bare `@pytest.mark.skip` decorator and `pytest.skip()` runtime call
across the test suite is replaced with a typed equivalent, making skip
reasons machine-readable in JUnit XML and Allure reports.
**62 files changed** across 8 commits, covering ~127 skip sites in total.
Bare `pytest.skip` provides only a free-text reason string. CI dashboards
(JUnit, Allure) cannot distinguish between a test skipped due to a known
bug, a missing feature, a slow test, or an environment limitation. This
makes it hard to track skip debt, prioritize fixes, or filter dashboards
by skip category.
The typed markers (`skip_bug`, `skip_not_implemented`, `skip_slow`,
`skip_env`) introduced by the `skip_reason_plugin` solve this by embedding
a `skip_type` field into every skip report entry.
| Type | Count | Files | Description |
|------|-------|-------|-------------|
| `skip_bug` | 24 | 16 | Skip reason references a known bug/issue |
| `skip_not_implemented` | 10 | 5 | Feature not yet implemented in Scylla |
| `skip_slow` | 4 | 3 | Test too slow for regular CI runs |
| `skip_not_implemented` (bare) | 2 | 1 | Bare `@pytest.mark.skip` with no reason (COMPACT STORAGE, #3882) |
| Type | Count | Files | Description |
|------|-------|-------|-------------|
| `skip_env` | ~85 | 34 | Feature/config/topology not available at runtime |
| `skip_bug` | 2 | 2 | Known bugs: Streams on tablets (#23838), coroutine task not found (#22501) |
- **Comments**: 7 comments/docstrings across 5 files updated from `pytest.skip()` to `skip()`
- **Plugin hardened**: `warnings.warn()` → `pytest.UsageError` for bare `@pytest.mark.skip` at collection time — bare skips are now a hard error, not a warning
- **Guard tests**: New `test/pylib_test/test_no_bare_skips.py` with 3 tests that prevent regression:
- AST scan for bare `@pytest.mark.skip` decorators
- AST scan for bare `pytest.skip()` runtime calls
- Real `pytest --collect-only` against all Python test directories
Runtime skip sites use the convenience wrappers from `test.pylib.skip_types`:
```python
from test.pylib.skip_types import skip_env
```
Usage:
```python
skip_env("Tablets not enabled")
```
1. **test: migrate @pytest.mark.skip to @pytest.mark.skip_bug for known bugs** — 24 decorator sites, 16 files
2. **test: migrate @pytest.mark.skip to @pytest.mark.skip_not_implemented** — 10 decorator sites, 5 files
3. **test: migrate @pytest.mark.skip to @pytest.mark.skip_slow** — 4 decorator sites, 3 files
4. **test: migrate bare @pytest.mark.skip to skip_not_implemented** — 2 bare decorators, 1 file
5. **test: migrate runtime pytest.skip() to typed skip_env()** — ~85 sites, 34 files
6. **test: migrate runtime pytest.skip() to typed skip_bug()** — 2 sites, 2 files
7. **test: update comments referencing pytest.skip() to skip()** — 7 comments, 5 files
8. **test/pylib: reject bare pytest.mark.skip and add codebase guards** — plugin hardening + 3 guard tests
- All 60 plugin + guard tests pass (`test/pylib_test/`)
- No bare `@pytest.mark.skip` or `pytest.skip()` calls remain in the codebase
- `pytest --collect-only` succeeds across all test directories with the hardened plugin
SCYLLADB-1349
Closesscylladb/scylladb#29305
* github.com:scylladb/scylladb:
test/alternator: replace bare pytest.skip() with typed skip helpers
test: migrate new bare skips introduced by upstream after rebase
test/pylib: reject bare pytest.mark.skip and add codebase guards
test: update comments referencing pytest.skip() to skip_env()
test: migrate runtime pytest.skip() to typed skip_bug()
test: migrate runtime pytest.skip() to typed skip_env()
test: migrate bare @pytest.mark.skip to skip_not_implemented
test: migrate @pytest.mark.skip to @pytest.mark.skip_slow
test: migrate @pytest.mark.skip to @pytest.mark.skip_not_implemented
test: migrate @pytest.mark.skip to @pytest.mark.skip_bug for known bugs
Add pylib_test to norecursedirs in pytest.ini so it is not collected
during ./test.py or pytest test/ runs, but can still be run directly
via 'pytest test/pylib_test'.
Also fix pytest log cleanup: worker log files (pytest_gw*) were not
being deleted on success because cleanup was restricted to the main
process only. Now each process (main and workers) cleans up its own
log file on success.
Closesscylladb/scylladb#29551
When tombstone_gc=repair, the repaired compaction view's sstable_set_for_tombstone_gc()
previously returned all sstables across all three views (unrepaired, repairing, repaired).
This is correct but unnecessarily expensive: the unrepaired and repairing sets are never
the source of a GC-blocking shadow when tombstone_gc=repair, for base tables.
The key ordering guarantee that makes this safe is:
- topology_coordinator sends send_tablet_repair RPC and waits for it to complete.
Inside that RPC, mark_sstable_as_repaired() runs on all replicas, moving D from
repairing → repaired (repaired_at stamped on disk).
- Only after the RPC returns does the coordinator commit repair_time + sstables_repaired_at
to Raft.
- gc_before = repair_time - propagation_delay only advances once that Raft commit applies.
Therefore, when a tombstone T in the repaired set first becomes GC-eligible (its
deletion_time < gc_before), any data D it shadows is already in the repaired set on
every replica. This holds because:
- The memtable is flushed before the repairing snapshot is taken (take_storage_snapshot
calls sg->flush()), capturing all data present at repair time.
- Hints and batchlog are flushed before the snapshot, ensuring remotely-hinted writes
arrive before the snapshot boundary.
- Legitimate unrepaired data has timestamps close to 'now', always newer than any
GC-eligible tombstone (USING TIMESTAMP to write backdated data is user error / UB).
Excluding the repairing and unrepaired sets from the GC shadow check cannot cause any
tombstone to be wrongly collected. The memtable check is also skipped for the same
reason: memtable data is either newer than the GC-eligible tombstone, or was flushed
into the repairing/repaired set before gc_before advanced.
Safety restriction — materialized views:
The optimization IS applied to materialized view tables. Two possible paths could inject
D_view into the MV's unrepaired set after MV repair: view hints and staging via the
view-update-generator. Both are safe:
(1) View hints: flush_hints() creates a sync point covering BOTH _hints_manager (base
mutations) AND _hints_for_views_manager (view mutations). It waits until ALL pending view
hints — including D_view entries queued in _hints_for_views_manager while the target MV
replica was down — have been replayed to the target node before take_storage_snapshot() is
called. D_view therefore lands in the MV's repairing sstable and is promoted to repaired.
When a repaired compaction then checks for shadows it finds D_view in the repaired set,
keeping T_mv non-purgeable.
(2) View-update-generator staging path: Base table repair can write a missing D_base to a
replica via a staging sstable. The view-update-generator processes the staging sstable
ASYNCHRONOUSLY: it may fire arbitrarily later, even after MV repair has committed
repair_time and T_mv has been GC'd from the repaired set. However, the staging processor
calls stream_view_replica_updates() which performs a READ-BEFORE-WRITE via
as_mutation_source_excluding_staging(): it reads the CURRENT base table state before
building the view update. If T_base was written to the base table (as it always is before
the base replica can be repaired and the MV tombstone can become GC-eligible), the
view_update_builder sees T_base as the existing partition tombstone. D_base's row marker
(ts_d < ts_t) is expired by T_base, so the view update is a no-op: D_view is never
dispatched to the MV replica. No resurrection can occur regardless of how long staging is
delayed.
A potential sub-edge-case is T_base being purged BEFORE staging fires (leaving D_base as
the sole survivor, so stream_view_replica_updates would dispatch D_view). This is blocked
by an additional invariant: for tablet-based tables, the repair writer stamps repaired_at
on staging sstables (repair_writer_impl::create_writer sets mark_as_repaired = true and
perform_component_rewrite writes repaired_at = sstables_repaired_at + 1 on every staging
sstable). After base repair commits sstables_repaired_at to Raft, the staging sstable
satisfies is_repaired(sstables_repaired_at, staging_sst) and therefore appears in
make_repaired_sstable_set(). Any subsequent base repair that advances sstables_repaired_at
further still includes the staging sstable (its repaired_at ≤ new sstables_repaired_at).
D_base in the staging sstable thus shadows T_base in every repaired compaction's shadow
check, keeping T_base non-purgeable as long as D_base remains in staging.
A base table hint also cannot bypass this. A base hint is replayed as a base mutation. The
resulting view update is generated synchronously on the base replica and sent to the MV
replica via _hints_for_views_manager (path 1 above), not via staging.
USING TIMESTAMP with timestamps predating (gc_before + propagation_delay) is explicitly
UB and excluded from the safety argument.
For tombstone_gc modes other than repair (timeout, immediate, disabled) the invariant
does not hold for base tables either, so the full storage-group set is returned.
The expected gain is reduced bloom filter and memtable key-lookup I/O during repaired
compactions: the unrepaired set is typically the largest (it holds all recent writes),
yet for tombstone_gc=repair it never influences GC decisions.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-231.
Closesscylladb/scylladb#29310
* github.com:scylladb/scylladb:
compaction: Restrict tombstone GC sstable set to repaired sstables for tombstone_gc=repair mode
test/repair: Add tombstone GC safety tests for incremental repair
Three test cases in multishard_query_test.cc set the querier_cache entry
TTL to 2s and then assert, between pages of a stateful paged query, that
cached queriers are still present (population >= 1) and that
time_based_evictions stays 0.
The 2s TTL is not load-bearing for what these tests exercise — they are
checking the paging-cache handoff, not TTL semantics. But on busy CI
runners (SCYLLADB-1642 was observed on aarch64 release), scheduling
jitter between saving a reader and sampling the population can exceed
2s. When that happens, the TTL fires, both saved queriers are
time-evicted, population drops to 0, and the assertion
`require_greater_equal(saved_readers, 1u)` fails. The trailing
`require_equal(time_based_evictions, 0)` check never runs because the
earlier assertion has already aborted the iteration — which is why the
Jenkins failure surfaces only as a bare "C++ failure at seastar_test.cc:93".
Reproduced deterministically in test_read_with_partition_row_limits by
injecting a `seastar::sleep(2500ms)` between the save and the sample:
the hook then reports
population=0 inserts=2 drops=0 time_based_evictions=2 resource_based_evictions=0
and the assertion fires — matching the Jenkins symptoms exactly.
Bump the TTL to 60s in all three affected tests:
- test_read_with_partition_row_limits (confirmed repro for SCYLLADB-1642)
- test_read_all (same pattern, same invariants — suspect)
- test_read_all_multi_range (same pattern, same invariants — suspect)
Leave test_abandoned_read (1s TTL, actually tests TTL-driven eviction)
and test_evict_a_shard_reader_on_each_page (tests manual eviction via
evict_one(); its TTL is not load-bearing but the fix is deferred for a
separate review) unchanged.
Fixes: SCYLLADB-1642
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closesscylladb/scylladb#29564
With this change, you can add or remove a DC(s) in a single ALTER KEYSPACE statement. It requires the keyspace to use rack list replication factor.
In existing approach, during RF change all tablet replicas are rebuilt at once. This isn't the case now. In global_topology_request::keyspace_rf_change the request is added to a ongoing_rf_changes - a new column in system.topology table. In a new column in system_schema.keyspaces - next_replication - we keep the target RF.
In make_rf_change_plan, load balancer schedules necessary migrations, considering the load of nodes and other pending tablet transitions. Requests from ongoing_rf_changes are processed concurrently, independently from one another. In each request racks are processed concurrently. No tablet replica will be removed until all required replicas are added. While adding replicas to each rack we always start with base tables and won't proceed with views until they are done (while removing - the other way around). The intermediary steps aren't reflected in schema. When the Rf change is finished:
- in system_schema.keyspaces:
- next_replication is cleared;
- new keyspace properties are saved;
- request is removed from ongoing_rf_changes;
- the request is marked as done in system.topology_requests.
Until the request is done, DESCRIBE KEYSPACE shows the replication_v2.
If a request hasn't started to remove replicas, it can be aborted using task manager. system.topology_requests::error is set (but the request isn't marked as done) and next_replication = replication_v2. This will be interpreted by load balancer, that will start the rollback of the request. After the rollback is done, we set the relevant system.topology_requests entry as done (failed), clear the request id from system.topology::ongoing_rf_changes, and remove next_replication.
Fixes: SCYLLADB-567.
No backport needed; new feature.
Closesscylladb/scylladb#24421
* github.com:scylladb/scylladb:
service: fix indentation
docs: update documentation
test: test multi RF changes
service: tasks: allow aborting ongoing RF changes
cql3: allow changing RF by more than one when adding or removing a DC
service: handle multi_rf_change
service: implement make_rf_change_plan
service: add keyspace_rf_change_plan to migration_plan
service: extend tablet_migration_info to handle rebuilds
service: split update_node_load_on_migration
service: rearrange keyspace_rf_change handler
db: add columns to system_schema.keyspaces
db: service: add ongoing_rf_changes to system.topology
gms: add keyspace_multi_rf_change feature
The existing test_batch sends a textual BEGIN BATCH ... APPLY BATCH as a
QUERY message, which goes through the CQL parser and raw::batch_statement::
prepare() — a path that correctly sets audit_info. This missed the bug
where native-protocol BATCH messages (opcode 0x0D), handled by
process_batch_internal in transport/server.cc, construct a batch_statement
without setting audit_info, causing audit to silently skip the batch.
Add _test_batch_native_protocol which uses the driver's BatchStatement
(both unprepared and prepared variants) to exercise this code path.
Refs SCYLLADB-1652
The CI heuristic picks up any function named test_* in changed files
and tries to run it as a standalone pytest test. The AuditTester class
methods (test_batch, test_dml, etc.) are not top-level pytest tests —
they are internal helpers called from the actual test functions.
Prefix them with underscore so CI does not mistake them for
standalone tests.
A handful of cassandra-driver Cluster.shutdown() call sites in the
auth_cluster tests were missed by the previous sweep that introduced
safe_driver_shutdown(), because the local variable holding the Cluster
is named "c" rather than "cluster".
Direct Cluster.shutdown() is racy: the driver's "Task Scheduler"
thread may raise RuntimeError ("cannot schedule new futures after
shutdown") during or after the call, occasionally failing tests.
safe_driver_shutdown() suppresses this expected RuntimeError and
joins the scheduler thread.
Replace the remaining c.shutdown() calls in:
- test/cluster/auth_cluster/test_startup_response.py
- test/cluster/auth_cluster/test_maintenance_socket.py
with safe_driver_shutdown(c) and add the corresponding import from
test.pylib.driver_utils.
No behavioral change to the tests; only the driver teardown is
hardened against a known driver-side race.
Fixes SCYLLADB-1662
Closesscylladb/scylladb#29576
When forcing tablet count change via cql command, the underlying
tablet machinery takes some time to adjust. Original code waited
at most 0.1s for tablet data to be synchronized. This seems to be
not enough on debug builds, so we add exponential backoff and increase
maximum waiting time. Now the code will wait 0.1s first time and
continue waiting with each time doubling the time, up to maximum of 6 times -
or total time ~6s.
Fixes: SCYLLADB-1655
Closesscylladb/scylladb#29573
When DROP TABLE races with an in-flight DML on a strongly-consistent
table, the node aborts in `groups_manager::acquire_server()` because the
raft group has already been erased from `_raft_groups`.
A concurrent `DROP TABLE` may have already removed the table from database
registries and erased the raft group via `schedule_raft_group_deletion`.
The `schema.table()` in `create_operation_ctx()` might not fail though
because someone might be holding `lw_shared_ptr<table>`, so that the
table is dropped but the table object is still alive.
Fix by accepting table_id in acquire_server and checking that the table
still exists in the database via `find_column_family` before looking up
the raft group. If the table has been dropped, find_column_family
throws no_such_column_family instead of the node aborting via
on_internal_error. When the table does exist, acquire_server proceeds
to acquire state.gate; schedule_raft_group_deletion co_awaits
gate::close, so it will wait for the DML operation to complete before
erasing the group.
backport: not needed (not released feature)
Fixes SCYLLADB-1450
Closesscylladb/scylladb#29430
* github.com:scylladb/scylladb:
strong_consistency: fix crash when DROP TABLE races with in-flight DML
test: add regression test for DROP TABLE racing with in-flight DML
assert_entries_were_added asserted that new audit rows always appear at
the tail of each per-node, event_time-sorted sequence. That invariant
is not a property of the audit feature: audit writes are asynchronous
with respect to query completion, and on a multi-node cluster QUORUM
reads of audit.audit_log can reveal a row with an older event_time
after a row with a newer one has already been observed.
Replace the positional tail slice with a per-node set difference
between the rows observed before and after the audited operation.
The wait_for retry loop, noise filtering, and final by-value
comparison against expected_entries are unchanged, so the test still
verifies the real contract, that the expected audit entries appear,
without relying on a visibility-ordering invariant that the audit log
does not guarantee.
Fixes SCYLLADB-1589
Closesscylladb/scylladb#29567
The statement_restrictions code is responsible for analyzing the WHERE
clause, deciding on the query plan (which index to use), and extracting
the partition and clustering keys to use for the index.
Currently, it suffers from repetition in making its decisions: there are 15
calls to expr::visit in statement_restrictions.cc, and 14 find_binop calls. This
reduces to 2 visits (one nested in the other) and 6 find_binop calls. The analysis
of binary operators is done once, then reused.
The key data structure introduced is the predicate. While an expression
takes inputs from the row evaluated, constants, and bind variables, and
produces a boolean result, predicates ask which values for a column (or
a number of columns) are needed to satisfy (part of) the WHERE clause.
The WHERE clause is then expressed as a conjunction of such predicates.
The analyzer uses the predicates to select the index, then uses the predicates
to compute the partition and clustering keys.
The refactoring is composed of these parts (but patches from different parts
are interspersed):
1. an exhaustive regression test is added as the first commit, to ensure behavior doesn't change
2. move computation from query time to prepare time
3. introduce, gradually enrich, and use predicates to implement the statement_restrictions API
Major refactoring, and no bugs fixed, so definitely not backporting.
Closesscylladb/scylladb#29114
* github.com:scylladb/scylladb:
cql3: statement_restrictions: replace has_eq_restriction_on_column with precomputed set
cql3: statement_restrictions: replace multi_column_range_accumulator_builder with direct predicate iteration
cql3: statement_restrictions: use predicate fields in build_get_clustering_bounds_fn
cql3: statement_restrictions: remove extract_single_column_restrictions_for_column
cql3: statement_restrictions: use predicate vectors in prepare_indexed_local
cql3: statement_restrictions: use predicate vector size for clustering prefix length
cql3: statement_restrictions: replace do_find_idx and is_supported_by with predicate-based versions
cql3: statement_restrictions: remove expression-based has_supporting_index and index_supports_some_column
cql3: statement_restrictions: replace multi-column and PK index support checks with predicate-based versions
cql3: statement_restrictions: add predicate-based index support checking
cql3: statement_restrictions: use pre-built single-column maps for index support checks
cql3: statement_restrictions: build clustering-prefix restrictions incrementally
cql3: statement_restrictions: build partition-range restrictions incrementally
cql3: statement_restrictions: build clustering-key single-column restrictions map incrementally
cql3: statement_restrictions: build partition-key single-column restrictions map incrementally
cql3: statement_restrictions: build non-primary-key single-column restrictions map incrementally
cql3: statement_restrictions: use tracked has_mc_clustering for _has_multi_column
cql3: statement_restrictions: track has-token state incrementally
cql3: statement_restrictions: track partition-key-empty state incrementally
cql3: statement_restrictions: track first multi-column predicate incrementally
cql3: statement_restrictions: track last clustering column incrementally
cql3: statement_restrictions: track clustering-has-slice incrementally
cql3: statement_restrictions: track has-multi-column-clustering incrementally
cql3: statement_restrictions: track clustering-empty state incrementally
cql3: statement_restrictions: replace restr bridge variable with pred.filter
cql3: statement_restrictions: convert single-column branch to use predicate properties
cql3: statement_restrictions: convert multi-column branch to use predicate properties
cql3: statement_restrictions: convert constructor loop to iterate over predicates
cql3: statement_restrictions: annotate predicates with operator properties
cql3: statement_restrictions: annotate predicates with is_not_null and is_multi_column
cql3: statement_restrictions: complete preparation early
cql3: statement_restrictions: convert expressions to predicates without being directed at a specific column
cql3: statement_restrictions: refine possible_lhs_values() function_call processing
cql3: statement_restrictions: return nullptr for function solver if not token
cql3: statement_restrictions: refine possible_lhs_values() subscript solving
cql3: statement_restrictions: return nullptr from possible_lhs_values instead of on_internal_error
cql3: statement_restrictions: convert possible_lhs_values into a solver
cql3: statement_restrictions: split _where to boolean factors in preparation for predicates conversion
cql3: statement_restrictions: refactor IS NOT NULL processing
cql3: statement_restrictions: fold add_single_column_nonprimary_key_restriction() into its caller
cql3: statement_restrictions: fold add_single_column_clustering_key_restriction() into its caller
cql3: statement_restrictions: fold add_single_column_partition_key_restriction() into its caller
cql3: statement_restrictions: fold add_token_partition_key_restriction() into its caller
cql3: statement_restrictions: fold add_multi_column_clustering_key_restriction() into its caller
cql3: statement_restrictions: avoid early return in add_multi_column_clustering_key_restrictions
cql3: statement_restrictions: fold add_is_not_restriction() into its caller
cql3: statement_restrictions: fold add_restriction() into its caller
cql3: statement_restrictions: remove possible_partition_token_values()
cql3: statement_restrictions: remove possible_column_values
cql3: statement_restrictions: pass schema to possible_column_values()
cql3: statement_restrictions: remove fallback path in solve()
cql3: statement_restrictions: reorder possible_lhs_column parameters
cql3: statement_restrictions: prepare solver for multi-column restrictions
cql3: statement_restrictions: add solver for token restriction on index
cql3: statement_restrictions: pre-analyze column in value_for()
cql3: statement_restrictions: don't handle boolean constants in multi_column_range_accumulator_builder
cql3: statement_restrictions: split range_from_raw_bounds into prepare phase and query phase
cql3: statement_restrictions: adjust signature of range_from_raw_bounds
cql3: statement_restrictions: split multi_column_range_accumulator into prepare-time and query-time phases
cql3: statement_restrictions: make get_multi_column_clustering_bounds a builder
cql3: statement_restrictions: multi-key clustering restrictions one layer deeper
cql3: statement_restrictions: push multi-column post-processing into get_multi_column_clustering_bounds()
cql3: statement_restrictions: pre-analyze single-column clustering key restrictions
cql3: statement_restrictions: wrap value_for_index_partition_key()
cql3: statement_restrictions: hide value_for()
cql3: statement_restrictions: push down clustering prefix wrapper one level
cql3: statement_restrictions: wrap functions that return clustering ranges
cql3: statement_restrictions: do not pass view schema back and forth
cql3: statement_restrictions: pre-analyze token range restrictions
cql3: statement_restrictions: pre-analyze partition key columns
cql3: statement_restrictions: do not collect subscripted partition key columns
cql3: statement_restrictions: split _partition_range_restrictions into three cases
cql3: statement_restrictions: move value_list, value_set to header file
cql3: statement_restrictions: wrap get_partition_key_ranges
cql3: statement_restrictions: prepare statement_restrictions for capturing `this`
test: statement_restrictions: add index_selection regression test
The existing scylla_transport_requests_serving metric is a single global per-shard gauge counting outstanding CQL requests. When debugging latency spikes, it's useful to know which service level is contributing the most in-flight requests.
This PR adds a new per-scheduling-group gauge scylla_transport_cql_requests_serving (with the scheduling_group_name label), using the existing cql_sg_stats per-SG infrastructure. The cql_ prefix is intentional — it follows the convention of all other per-SG transport metrics (cql_requests_count, cql_request_bytes, etc.) and avoids Prometheus confusion with the global requests_serving metric (which lacks the scheduling_group_name label).
Fixes: SCYLLADB-1340
New feature, no backport.
Closesscylladb/scylladb#29493
* github.com:scylladb/scylladb:
transport: add per-service-level cql_requests_serving metric
transport: move requests_serving decrement to after response is sent
The test was reading system_schema.keyspaces from an arbitrary node
that may not have applied the latest schema change yet. Pin the read
to a specific node and issue a read barrier before querying, ensuring
the node has up-to-date data.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1643.
Closesscylladb/scylladb#29563
When tombstone_gc=repair, the repaired compaction view's sstable_set_for_tombstone_gc()
previously returned all sstables across all three views (unrepaired, repairing, repaired).
This is correct but unnecessarily expensive: the unrepaired and repairing sets are never
the source of a GC-blocking shadow when tombstone_gc=repair, for base tables.
The key ordering guarantee that makes this safe is:
- topology_coordinator sends send_tablet_repair RPC and waits for it to complete.
Inside that RPC, mark_sstable_as_repaired() runs on all replicas, moving D from
repairing → repaired (repaired_at stamped on disk).
- Only after the RPC returns does the coordinator commit repair_time + sstables_repaired_at
to Raft.
- gc_before = repair_time - propagation_delay only advances once that Raft commit applies.
Therefore, when a tombstone T in the repaired set first becomes GC-eligible (its
deletion_time < gc_before), any data D it shadows is already in the repaired set on
every replica. This holds because:
- The memtable is flushed before the repairing snapshot is taken (take_storage_snapshot
calls sg->flush()), capturing all data present at repair time.
- Hints and batchlog are flushed before the snapshot, ensuring remotely-hinted writes
arrive before the snapshot boundary.
- Legitimate unrepaired data has timestamps close to 'now', always newer than any
GC-eligible tombstone (USING TIMESTAMP to write backdated data is user error / UB).
Excluding the repairing and unrepaired sets from the GC shadow check cannot cause any
tombstone to be wrongly collected. The memtable check is also skipped for the same
reason: memtable data is either newer than the GC-eligible tombstone, or was flushed
into the repairing/repaired set before gc_before advanced.
Safety restriction — materialized views:
The optimization IS applied to materialized view tables. Two possible paths could inject
D_view into the MV's unrepaired set after MV repair: view hints and staging via the
view-update-generator. Both are safe:
(1) View hints: flush_hints() creates a sync point covering BOTH _hints_manager (base
mutations) AND _hints_for_views_manager (view mutations). It waits until ALL pending view
hints — including D_view entries queued in _hints_for_views_manager while the target MV
replica was down — have been replayed to the target node before take_storage_snapshot() is
called. D_view therefore lands in the MV's repairing sstable and is promoted to repaired.
When a repaired compaction then checks for shadows it finds D_view in the repaired set,
keeping T_mv non-purgeable.
(2) View-update-generator staging path: Base table repair can write a missing D_base to a
replica via a staging sstable. The view-update-generator processes the staging sstable
ASYNCHRONOUSLY: it may fire arbitrarily later, even after MV repair has committed
repair_time and T_mv has been GC'd from the repaired set. However, the staging processor
calls stream_view_replica_updates() which performs a READ-BEFORE-WRITE via
as_mutation_source_excluding_staging(): it reads the CURRENT base table state before
building the view update. If T_base was written to the base table (as it always is before
the base replica can be repaired and the MV tombstone can become GC-eligible), the
view_update_builder sees T_base as the existing partition tombstone. D_base's row marker
(ts_d < ts_t) is expired by T_base, so the view update is a no-op: D_view is never
dispatched to the MV replica. No resurrection can occur regardless of how long staging is
delayed.
A potential sub-edge-case is T_base being purged BEFORE staging fires (leaving D_base as
the sole survivor, so stream_view_replica_updates would dispatch D_view). This is blocked
by an additional invariant: for tablet-based tables, the repair writer stamps repaired_at
on staging sstables (repair_writer_impl::create_writer sets mark_as_repaired = true and
perform_component_rewrite writes repaired_at = sstables_repaired_at + 1 on every staging
sstable). After base repair commits sstables_repaired_at to Raft, the staging sstable
satisfies is_repaired(sstables_repaired_at, staging_sst) and therefore appears in
make_repaired_sstable_set(). Any subsequent base repair that advances sstables_repaired_at
further still includes the staging sstable (its repaired_at ≤ new sstables_repaired_at).
D_base in the staging sstable thus shadows T_base in every repaired compaction's shadow
check, keeping T_base non-purgeable as long as D_base remains in staging.
A base table hint also cannot bypass this. A base hint is replayed as a base mutation. The
resulting view update is generated synchronously on the base replica and sent to the MV
replica via _hints_for_views_manager (path 1 above), not via staging.
USING TIMESTAMP with timestamps predating (gc_before + propagation_delay) is explicitly
UB and excluded from the safety argument.
For tombstone_gc modes other than repair (timeout, immediate, disabled) the invariant
does not hold for base tables either, so the full storage-group set is returned.
Implementation:
- Add compaction_group::is_repaired_view(v): pointer comparison against _repaired_view.
- Add compaction_group::make_repaired_sstable_set(): iterates _main_sstables and inserts
only sstables classified as repaired (repair::is_repaired(sstables_repaired_at, sst)).
- Add storage_group::make_repaired_sstable_set(): collects repaired sstables across all
compaction groups in the storage group.
- Add table::make_repaired_sstable_set_for_tombstone_gc(): collects repaired sstables from
all compaction groups across all storage groups (needed for multi-tablet tables).
- Add compaction_group_view::skip_memtable_for_tombstone_gc(): returns true iff the
repaired-only optimization is active; used by get_max_purgeable_timestamp() in
compaction.cc to bypass the memtable shadow check.
- is_tombstone_gc_repaired_only() private helper gates both methods: requires
is_repaired_view(this) && tombstone_gc_mode == repair. No is_view() exclusion.
- Add error injection "view_update_generator_pause_before_processing" in
process_staging_sstables() to support testing the staging-delay scenario.
- New test test_tombstone_gc_mv_optimization_safe_via_hints: stops servers[2], writes
D_base + T_base (view hints queued for servers[2]'s MV replica), restarts, runs MV
tablet repair (flush_hints delivers D_view + T_mv before snapshot), triggers repaired
compaction, and asserts the MV row is NOT visible — T_mv preserved because D_view
landed in the repaired set via the hints-before-snapshot path.
- New test test_tombstone_gc_mv_safe_staging_processor_delay: runs base repair before
writing T_base so D_base is staged on servers[0] via row-sync; blocks the
view-update-generator with an error injection; writes T_base + T_mv; runs MV repair
(fast path, T_mv GC-eligible); triggers repaired compaction (T_mv purged — no D_view
in repaired set); asserts no resurrection; releases injection; waits for staging to
complete; asserts no resurrection after a second flush+compaction. Demonstrates that
the read-before-write in stream_view_replica_updates() makes the optimization safe even
when staging fires after T_mv has been GC'd.
The expected gain is reduced bloom filter and memtable key-lookup I/O during repaired
compactions: the unrepaired set is typically the largest (it holds all recent writes),
yet for tombstone_gc=repair it never influences GC decisions.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
test_tablet_merge_cross_rack_migrations() starts issuing DDL immediately
after adding the new cross-rack nodes. In the failing runs the driver is
still converging on the updated topology at that point, so the control
connection sees incomplete peer metadata while schema changes are in
flight.
That leaves a race where CREATE TABLE is sent during topology churn and
the test can surface a misleading AlreadyExists error even though the
table creation has already been committed. Use get_ready_cql(servers)
here so the test waits for inter-node visibility and CQL readiness
before creating the keyspace and table.
Fixes: SCYLLADB-1635
Closesscylladb/scylladb#29561
In `ks_prop_defs::as_ks_metadata(...)` a default initial tablets count
is set to 0, when tablets are enabled and the replication strategy
is NetworkReplicationStrategy.
This effectively sets _uses_tablets = false in abstract_replication_strategy
for the remaining strategies when no `tablets = {...}` options are specified.
As a consequence, it is possible to create vnode-based keyspaces even
when tablets are enforced with `tablets_mode_for_new_keyspaces`.
The patch sets a default initial tablets count to zero regardless of
the chosen replication strategy. Then each of the replication strategy
validates the options and raises a configuration exception when tablets
are not supported.
All tests are altered in the following way:
+ whenever it was correct, SimpleStrategy was replaced with NetworkTopologyStrategy
+ otherwise, tablets were explicitly disabled with ` AND tablets = {'enabled': false}`
Fixes https://github.com/scylladb/scylladb/issues/25340Closesscylladb/scylladb#25342
The mutation-fragment-based streaming path in `stream_session.cc` did not check whether the receiving node was in critical disk utilization mode before accepting incoming mutation fragments. This meant that operations like `nodetool refresh --load-and-stream`, which stream data through the `STREAM_MUTATION_FRAGMENTS` RPC handler, could push data onto a node that had already reached critical disk usage.
The file-based streaming path in stream_blob.cc already had this protection, but the load&stream path was missing it.
This patch adds a check for `is_in_critical_disk_utilization_mode()` in the `stream_mutation_fragments` handler in `stream_session.cc`, throwing a `replica::critical_disk_utilization_exception` when the node is at critical disk usage. This mirrors the existing protection in the blob streaming path and closes the gap that allowed data to be written to a node that should have been rejecting all incoming writes.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-901
The out of space prevention mechanism was introduced in 2025.4. The fix should be backported there and all later versions.
Closesscylladb/scylladb#28873
* github.com:scylladb/scylladb:
streaming: reject mutation fragments on critical disk utilization
test/cluster/storage: Add a reproducer for load-and-stream out-of-space rejection
sstables: clean up TemporaryHashes file in wipe()
sstables: add error injection point in write_components
test/cluster/storage: extract validate_data_existence to module scope
test/cluster: enable suppress_disk_space_threshold_checks in tests using data_file_capacity
utils/disk_space_monitor: add error injection to suppress threshold checks
This small series includes a few followups to the patch that changed Alternator Stream ARNs from using our own UUID format to something that resembles Amazon's Stream ARNs (and the KCL library won't reject as bogus-looking ARNs).
The first patch is the most important one, fixing ListStreams's LastEvaluatedStreamArn to also use the new ARN format. It fixes SCYLLADB-539.
The following patches are additional cleanups and tests for the new ARN code.
Closesscylladb/scylladb#29474
* github.com:scylladb/scylladb:
alternator: fix ListStreams paging if table is deleted during paging
test/alternator: test DescribeStream on non-existent table
alternator: ListStreams: on last page, avoid LastEvaluatedStreamArn
alternator: remove dead code stream_shard_id
alternator: fix ListStreams to return real ARN as LastEvaluatedStreamArn
Add three cluster tests that verify no data resurrection occurs when
tombstone GC runs on the repaired sstable set under incremental repair
with tombstone_gc=repair mode.
All tests use propagation_delay_in_seconds=0 so that tombstones become
GC-eligible immediately after repair_time is committed (gc_before =
repair_time), allowing the scenarios to exercise the actual GC eligibility
path without artificial sleeps.
(test_tombstone_gc_no_resurrection_basic_ordering)
Data D (ts=1) and tombstone T (ts=2) are written to all replicas and
flushed before repair. Repair captures both in the repairing snapshot
and promotes them to repaired. Once repair_time is committed, T is
GC-eligible (T.deletion_time < gc_before = repair_time).
The test verifies that compaction on the repaired set does NOT purge T,
because D is already in repaired (mark_sstable_as_repaired() completes
on all replicas before repair_time is committed to Raft) and clamps
max_purgeable to D.timestamp=1 < T.timestamp=2.
(test_tombstone_gc_no_resurrection_hints_flush_failure)
The repair_flush_hints_batchlog_handler_bm_uninitialized injection causes
hints flush to fail on one node. When hints flush fails, flush_time stays
at gc_clock::time_point{} (epoch). This propagates as repair_time=epoch
committed to system.tablets, so gc_before = epoch - propagation_delay is
effectively the minimum possible time. No tombstone has a deletion_time
older than epoch, so T is never GC-eligible from this repair.
The test verifies that repair_time does not advance to a meaningful value
after a failed hints flush, and that compaction on the repaired set does
not purge T (key remains deleted, no resurrection).
(test_tombstone_gc_no_resurrection_propagation_delay)
Simulates a write D carrying an old CQL USING TIMESTAMP (ts_d = now-2h)
that was stored as a hint while a replica was down, and a tombstone T
with a higher timestamp (ts_t = now-90min, ts_t > ts_d) that was written
to all live replicas. After the replica restarts, repair flushes hints
synchronously before taking the repairing snapshot, guaranteeing D is
delivered and captured in repairing before the snapshot.
After mark_sstable_as_repaired() promotes D to repaired, the coordinator
commits repair_time. gc_before = repair_time > T.deletion_time so T is
GC-eligible. The test verifies that compaction on the repaired set does
NOT purge T: D (ts_d < ts_t) is already in repaired, clamping
max_purgeable = ts_d < ts_t = T.timestamp, so T is not purgeable.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The view update builder ignored range tombstone changes from the update
stream when there all existing mutation fragments were already consumed.
The old code assumed range tombstones 'remove nothing pre-existing, so
we can ignore it', but this failed to update _update_current_tombstone.
Consequently, when a range delete and an insert within that range appeared
in the same batch, the range tombstone was not applied to the inserted row,
or was applied to a row outside the range that it covered causing it to
incorrectly survive/be deleted in the materialized view.
Fix by handling is_range_tombstone_change() fragments in the update-only
branch, updating _update_current_tombstone so subsequent clustering rows
correctly have the range tombstone applied to them.
Fixes SCYLLADB-1555
Closesscylladb/scylladb#29483
When view_update_builder::on_results() hits the path where the update
fragment reader is already exhausted, it still needs to keep tracking
existing range tombstones and apply them to encountered rows.
Otherwise a row covered by an existing range tombstone can appear
alive while generating the view update and create a spurious view row.
Update the existing tombstone state even on the exhausted-reader path
and apply the effective tombstone to clustering rows before generating
the row tombstone update. Add a cqlpy regression test covering the
partition-delete-after-range-tombstone case.
Fixes: SCYLLADB-1554
Closesscylladb/scylladb#29481
The new_test_keyspace context manager in test/cluster/util.py uses
DROP KEYSPACE without IF EXISTS during cleanup. The Python driver
has a known bug (scylladb/python-driver#317) where connection pool
renewal after concurrent node bootstraps causes double statement
execution. The DROP succeeds server-side, but the response is lost
when the old pool is closed. The driver retries on the new pool, and
gets ConfigurationException message "Cannot drop non existing keyspace".
The CREATE KEYSPACE in create_new_test_keyspace already uses IF NOT
EXISTS as a workaround for the same driver bug. This patch applies
the same approach to fix DROP KEYSPACE.
Fixes SCYLLADB-1538
Closesscylladb/scylladb#29487
The test was using max_size_mb = 8*1024 (8 GB) with 100 iterations,
causing it to create up to 260 files of 32 MB each per iteration via
fallocate. On a loaded CI machine this totals hundreds of GB of file
operations, easily exceeding the 15-minute test timeout (SCYLLADB-1496).
The test only needs enough files to verify that delete_segments keeps
the disk footprint within [shard_size, shard_size + seg_size]. Reduce
max_size_mb to 128 (8 files of 32 MB per iteration) and the iteration
count to 10, which is sufficient to exercise the serialized-deletion
and recycle logic without imposing excessive I/O load.
Closesscylladb/scylladb#29510
sstables_loader::load_and_stream holds a replica::table& reference via
the sstable_streamer for the entire streaming operation. If the table
is dropped concurrently (e.g. DROP TABLE or DROP KEYSPACE), the
reference becomes dangling and the next access crashes with SEGV.
This was observed in a longevity-50gb-12h-master test run where a
keyspace was dropped while load_and_stream was still streaming SSTables
from a previous batch.
Fix by acquiring a stream_in_progress() phaser guard in load_and_stream
before creating the streamer. table::stop() calls
_pending_streams_phaser.close() which blocks until all outstanding
guards are released, keeping the table alive for the duration of the
streaming operation.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1352Closesscylladb/scylladb#29403
Pass jvm_args=["--smp", "1"] on both cluster.start() calls to
ensure consistent shard count across restarts, avoiding resharding
on restart. Also pass wait_for_binary_proto=True to cluster.start()
to ensure the CQL port is ready before connecting.
Fixes: SCYLLADB-824
Closesscylladb/scylladb#29548
Using cluster.shutdown() is an incorrect way to shut down a Cassandra Cluster.
The correct way is using safe_driver_shutdown.
Fixes SCYLLADB-1434
Closesscylladb/scylladb#29390
The previous commit made prepare_indexed_local() use the pre-built
predicate vectors instead of calling extract_single_column_restrictions_for_column().
That was the last production caller.
Remove the function definition (65 lines of expression-walking visitor)
and its declaration/doc-comment from the header.
Replace the unit test (expression_extract_column_restrictions) which
directly called the removed function with synthetic column_definitions,
with per_column_restriction_routing which exercises the same routing
logic through the public analyze_statement_restrictions() API. The new
test verifies not just factor counts but the exact (column_name, oper_t)
pairs in each per-column entry, catching misrouted restrictions that a
count-only check would miss.
Expressions are a tree-like structure so a single expression is sufficient
(for complicated ones, a conjunction is used), but predicates are flat.
Prepare for conversion to predicates by storing the expressions that
will correspond to predicates, namely the boolean factors of the WHERE
clause.
For indexed queries, statement_restrictions calculates _view_schema,
which is passed via get_view_schema() to indexed_select_statement(),
which passes it right back to statement_restrictions via one of three
functions to calculate clustering ranges.
Avoid the back-and-forth and use the stored value. Using a different
value would be broken.
This change allows unifying the signatures of the four functions that
get clustering ranges.
Prevent copying/moving, that can change the address, and instead enforce
using shared_ptr. Most of the code is already using shared_ptr, so the
changes aren't very large.
To forbid non-shared_ptr construction, the constructors are annotated
with a private_tag tag class.
In preparation for refactoring statement_restrictions, add a simple
and an exhaustive regression test, encoding the index selection
algorithm into the test. We cannot change the index selection algorithm
because then mixed-node clusters will alter the sorting key mid-query
(if paging takes place).
Because the exhaustive space has such a large stack frame, and
because Address Santizer bloats the stack frame, increase it
for debug builds.
Migrate 3 bare skip sites that appeared in upstream/master after the
initial migration:
- test/cluster/test_strong_consistency.py: 2 @pytest.mark.skip →
@pytest.mark.skip_bug (SCYLLADB-1056)
- test/cqlpy/conftest.py: pytest.skip() → skip_env() in
skip_on_scylla_vnodes fixture
Harden the skip_reason_plugin to reject bare @pytest.mark.skip at
collection time with pytest.UsageError instead of warnings.warn().
Add test/pylib_test/test_no_bare_skips.py with three guard tests:
- AST scan for bare pytest.skip() runtime calls
- Real pytest --collect-only against all Python test directories
Update 7 comments/docstrings across 5 files that still referenced
pytest.skip() to reference the typed skip_env() wrapper for
consistency with the migrated code.
Migrate 2 runtime pytest.skip() calls referencing known bugs to use
the typed skip_bug() wrapper from test.pylib.skip_types:
- test/alternator/test_ttl.py: Streams on tablets (#23838)
- test/scylla_gdb/test_task_commands.py: coroutine task not found (#22501)
Migrate runtime pytest.skip() calls across 34 files to use the typed
skip_env() wrapper from test.pylib.skip_types.
These sites skip at runtime because a required feature, config option,
library version, build mode, or runtime topology is not available.
Also fixes 'raise pytest.skip(...)' in test_audit.py — skip_env()
already raises internally, so the explicit raise was incorrect.
Each file gains one new import:
from test.pylib.skip_types import skip_env
Migrate 2 bare @pytest.mark.skip decorators (no reason string) to
@pytest.mark.skip_not_implemented with an explicit reason referencing
issue #3882 (COMPACT STORAGE not implemented).
Migrate 10 @pytest.mark.skip decorator sites to
@pytest.mark.skip_not_implemented across 5 test files where the
skip reason indicates a feature not yet implemented.
In Alternator's HTTP API, response headers can dominate bandwidth for
small payloads. The Server, Date, and Content-Type headers were sent on
every response but many clients never use them.
This patch introduces three Alternator config options:
- alternator_http_response_server_header,
- alternator_http_response_disable_date_header,
- alternator_http_response_disable_content_type_header,
which allow customizing or suppressing the respective HTTP response
headers. All three options support live update (no restart needed).
The Server header is no longer sent by default; the Date and
Content-Type defaults preserve the existing behavior.
The Server and Date header suppression uses Seastar's
set_server_header() and set_generate_date_header() APIs added in
https://github.com/scylladb/seastar/pull/3217. This patch also
fixes deprecation warnings from older Seastar HTTP APIs.
Tests are in test/alternator/test_http_headers.py.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-70Closesscylladb/scylladb#28288
DynamoDB Streams API can only convey a single parent per stream shard.
Tablet merges produce two parents, making them incompatible with
Alternator Streams. This series blocks tablet merges when streams are
active on a tablet table.
For CreateTable, a freshly created table has no pending merges, so
streams are enabled immediately with tablet merges blocked.
For UpdateTable on an existing table, stream enablement is deferred:
the user's intent is stored via `enable_requested`, tablet merges are
blocked (new merge decisions are suppressed and any active merge
decision is revoked), and the topology coordinator finalizes enablement
once no in-flight merges remain.
The topology coordinator is woken promptly on error injection release
and tablet split completion, reducing finalization latency from ~60s
to seconds.
`test_parent_children_merge` is marked xfail (merges are now blocked),
and downward (merge) steps are removed from `test_parent_filtering` and
`test_get_records_with_alternating_tablets_count`.
Not addressed here: using a topology request to preempt long-running
operations like repair (tracked in SCYLLADB-1304).
Refs SCYLLADB-461
Closesscylladb/scylladb#29224
* github.com:scylladb/scylladb:
topology: Wake coordinator promptly for stream enablement lifecycle
test/cluster: Test deferred stream enablement on tablet tables
alternator/streams: Block tablet merges when Alternator Streams are enabled
Currently, ListStreams paging works by looking in the list of tables
for ExclusiveStartStreamArn and starting there. But it's possible
that during the paging process, one of the tables got deleted and
ExclusiveStartStreamArn no longer points to an existing table. In
the current implementation this caused the paging to stop (think
it reached the end).
The solution is simple: ListStreams will now sort the list of tables
by name (it anyway needs to be sorted by something to be consistent
across pages), and will look with std::upper_bound for the first
table *after* the ExclusiveStartStreamArn - we don't need to find
that table name itself.
The patch also includes a test reproducing this bug. As usual, the
test passes on DynamoDB, fails on Alternator before this patch,
and passes with the patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We already had a test for DescribeStream being called on a bogus ARN
returns a ValidationException. But if the stream is more legitimate-
looking but refers to a non-existent table (e.g., an ARN taken in the
past from a table that no longer exists), we should return
ResourceNotFoundException. In this patch we add a test that verifies
we indeed do this correctly.
Moreover, Alternator's current stream ARNs include both a keyspace
name and a table name, and either one being incorrect should lead
to ResourceNotFoundException, and indeed the new test validates
that it works as expected - there is no bug here (AI guessed we
have a bug in the missing *keyspace* case, but this guess was wrong).