Commit Graph

53948 Commits

Author SHA1 Message Date
Michał Jadwiszczak
b64f2d2e90 view_building: introduce task_uuid_generator
With the new `min_alive_uuid` saved in the group0 table,
we need to make sure that all new tasks are created with time uuid
greater than the value saved in `min_alive_uuid`.

This patch introduces the `task_uuid_generator` which ensures that
when we are generating multiple tasks in one group0 command, each task
will have a unique time uuid and each time uuid will be greater than
`min_alive_uuid`.
2026-04-22 09:10:14 +02:00
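
Illustration (not part of the commit): a minimal sketch of the idea described above, assuming that ordering is decided by the 60-bit timestamp of a version-1 UUID. The names `TaskUuidGenerator` and `timeuuid_from_ts` are hypothetical and do not reflect the actual `task_uuid_generator` code.

```python
import uuid
from typing import Optional


def timeuuid_from_ts(ts: int, node: int = 0) -> uuid.UUID:
    """Build a version-1 UUID from a 60-bit timestamp (hypothetical helper)."""
    time_low = ts & 0xFFFFFFFF
    time_mid = (ts >> 32) & 0xFFFF
    time_hi_version = ((ts >> 48) & 0x0FFF) | 0x1000  # set version 1
    clock_seq_hi_variant = 0x80  # RFC 4122 variant, clock sequence 0
    clock_seq_low = 0
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


class TaskUuidGenerator:
    """Hands out strictly increasing time UUIDs above a floor (assumed model)."""

    def __init__(self, min_alive_uuid: Optional[uuid.UUID], now_ts: int):
        floor = min_alive_uuid.time if min_alive_uuid is not None else -1
        # The next timestamp handed out is max(floor + 1, now_ts).
        self._last_ts = max(floor, now_ts - 1)

    def next(self) -> uuid.UUID:
        # Bumping the timestamp on every call keeps ids unique within one batch.
        self._last_ts += 1
        return timeuuid_from_ts(self._last_ts)
```

Even if the local clock is behind the saved floor, every id generated for one batch stays above it and no two ids collide.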
Michał Jadwiszczak
e5a6ed72b9 view_building: store min_alive_uuid in view building state
Because we now limit the range we read from the view building
tasks table, we need to make sure that new tasks are created with a larger
uuid than the `min_alive_uuid`.

To do that, we need to be able to see the current `min_alive_uuid`
while creating new tasks.
2026-04-22 09:10:14 +02:00
Michał Jadwiszczak
8d0943ce35 view_building: set min_task_id when GC-ing finished tasks
When VIEW_BUILDING_TASKS_MIN_TASK_ID feature is active, write min_task_id
alongside the range tombstone in the same Raft batch. min_task_id is set
to min_alive_uuid so subsequent get_view_building_tasks() scans start
exactly at the first alive row, skipping all tombstoned rows.

When all tasks are deleted, min_task_id is set to a freshly generated UUID
to ensure future tasks (which will have larger timeuuids) are not skipped.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-22 09:10:14 +02:00
Michał Jadwiszczak
b689de0414 view_building: add min_task_id support to view_building_task_mutation_builder
Add set_min_task_id(id) which writes the min_task_id static cell to the main
"view_building" partition. The static cell is written as part of the same
mutation as the range tombstone, keeping everything in one Raft batch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-22 09:10:14 +02:00
Michał Jadwiszczak
8670111cd4 view_building: add min_task_id static column and bounded scan to system_keyspace
Add a min_task_id timeuuid static column to system.view_building_tasks.

When VIEW_BUILDING_TASKS_MIN_TASK_ID feature is active, get_view_building_tasks()
reads min_task_id first using a static-only partition slice (empty _row_ranges +
always_return_static_content). This makes the SSTable reader stop immediately
after the static row before processing any clustering tombstones, so the read
never triggers tombstone_warn_threshold warnings.

min_task_id is then used as the `AND id >= ?` lower bound for the main task scan,
skipping all tombstoned rows below the boundary.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-22 09:10:14 +02:00
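
Illustration (not part of the commit): a toy model of why the lower bound helps. Integer ids stand in for timeuuids and each deleted task is modelled as one tombstone row; `scan_tasks` is a hypothetical name, and the real read path works on SSTables rather than Python lists.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

Row = Tuple[int, bool]  # (task id, is_tombstone), sorted by task id


def scan_tasks(rows: List[Row], min_task_id: Optional[int]) -> Tuple[List[int], int]:
    """Return (alive task ids, number of tombstones the scan stepped over)."""
    start = 0
    if min_task_id is not None:
        # Equivalent of adding `AND id >= ?`: seek straight to the bound.
        ids = [task_id for task_id, _ in rows]
        start = bisect_left(ids, min_task_id)
    alive: List[int] = []
    tombstones_seen = 0
    for task_id, is_tombstone in rows[start:]:
        if is_tombstone:
            tombstones_seen += 1
        else:
            alive.append(task_id)
    return alive, tombstones_seen
```

With the bound in place, only tombstones interleaved with alive rows are still visited; everything below `min_task_id` is never touched.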
Michał Jadwiszczak
8f741b462b view_building: use range tombstone when GC-ing finished tasks
Instead of issuing one row tombstone per finished task, collect all tasks
to delete, find the smallest timeuuid among alive tasks (min_alive_uuid),
then emit a single range tombstone [before_all, min_alive_uuid) covering
all tasks below that boundary. Tasks above the boundary (rare: finished
task interleaved with alive tasks) still get individual row tombstones.

When no alive tasks remain, del_all_tasks() covers the entire partition
with a single range tombstone.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-22 09:10:13 +02:00
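
Illustration (not part of the commit): a sketch of the deletion plan described above, with integer ids standing in for timeuuids. `plan_gc` and the shape of its return value are assumptions made for this example only.

```python
from typing import Dict, Set


def plan_gc(finished: Set[int], alive: Set[int]) -> Dict[str, object]:
    """Decide which tombstones to emit when garbage-collecting finished tasks."""
    if not alive:
        # No alive tasks remain: one range tombstone covers the whole partition.
        return {"delete_all": True, "range_end": None, "row_tombstones": []}
    min_alive = min(alive)
    return {
        "delete_all": False,
        # One range tombstone covers [before_all, min_alive).
        "range_end": min_alive,
        # Finished tasks interleaved with alive ones still need row tombstones.
        "row_tombstones": sorted(t for t in finished if t > min_alive),
    }
```

For example, with finished tasks {1, 2, 3, 6} and alive tasks {4, 7}, the plan is one range tombstone ending at 4 plus a single row tombstone for task 6.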
Michał Jadwiszczak
91697d597c view_building: add range tombstone support to view_building_task_mutation_builder
Add del_tasks_before(id) which emits a range tombstone [before_all, id)
and del_all_tasks() which covers the entire clustering range. These will
be used by the coordinator to delete finished tasks in bulk instead of
issuing one row tombstone per task.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-22 09:10:13 +02:00
Michał Jadwiszczak
e0942bb45a view_building: introduce VIEW_BUILDING_TASKS_MIN_TASK_ID cluster feature
This feature will be used to gate the use of min_task_id static column
in system.view_building_tasks, which will be added in a subsequent commit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-22 09:10:12 +02:00
Michał Jadwiszczak
f77c258c8e strong_consistency: wire up metrics to operations
Track write and read latency using latency_counter in
coordinator::mutate() and coordinator::query().

Count commit_status_unknown errors in coordinator::mutate().

Count node and shard bounces in redirect_statement(), passing the
coordinator's stats from both modification_statement and
select_statement.
2026-04-22 08:59:59 +02:00
Michał Jadwiszczak
55293c34f8 strong_consistency: add stats struct and metrics registration
Introduce per-shard metrics infrastructure for strong consistency
operations under the "strong_consistency_coordinator" metrics category.

The stats struct contains latency histograms/summaries for reads and
writes (using timed_rate_moving_average_summary_and_histogram, same as
storage_proxy uses for eventual consistency), and uint64_t counters for
write_status_unknown, node bounces, and shard bounces.

Metrics are registered in the coordinator constructor but are not yet
wired to actual operations — all counters remain at zero.
2026-04-22 08:58:38 +02:00
Avi Kivity
f5eb99f149 test: bump multishard_query_test querier_cache TTL to 60s to avoid flake
Three test cases in multishard_query_test.cc set the querier_cache entry
TTL to 2s and then assert, between pages of a stateful paged query, that
cached queriers are still present (population >= 1) and that
time_based_evictions stays 0.

The 2s TTL is not load-bearing for what these tests exercise — they are
checking the paging-cache handoff, not TTL semantics. But on busy CI
runners (SCYLLADB-1642 was observed on aarch64 release), scheduling
jitter between saving a reader and sampling the population can exceed
2s. When that happens, the TTL fires, both saved queriers are
time-evicted, population drops to 0, and the assertion
`require_greater_equal(saved_readers, 1u)` fails. The trailing
`require_equal(time_based_evictions, 0)` check never runs because the
earlier assertion has already aborted the iteration — which is why the
Jenkins failure surfaces only as a bare "C++ failure at seastar_test.cc:93".

Reproduced deterministically in test_read_with_partition_row_limits by
injecting a `seastar::sleep(2500ms)` between the save and the sample:
the hook then reports
  population=0 inserts=2 drops=0 time_based_evictions=2 resource_based_evictions=0
and the assertion fires — matching the Jenkins symptoms exactly.

Bump the TTL to 60s in all three affected tests:

  - test_read_with_partition_row_limits (confirmed repro for SCYLLADB-1642)
  - test_read_all                       (same pattern, same invariants — suspect)
  - test_read_all_multi_range           (same pattern, same invariants — suspect)

Leave test_abandoned_read (1s TTL, actually tests TTL-driven eviction)
and test_evict_a_shard_reader_on_each_page (tests manual eviction via
evict_one(); its TTL is not load-bearing but the fix is deferred for a
separate review) unchanged.

Fixes: SCYLLADB-1642

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Closes scylladb/scylladb#29564
2026-04-22 09:48:59 +03:00
Tomasz Grabiec
cddde464ca Merge 'service: Support adding/removing a datacenter with tablets by changing RF' from Aleksandra Martyniuk
With this change, you can add or remove one or more DCs in a single ALTER KEYSPACE statement. It requires the keyspace to use a rack list replication factor.

In the existing approach, all tablet replicas are rebuilt at once during an RF change. This is no longer the case. In global_topology_request::keyspace_rf_change, the request is added to ongoing_rf_changes, a new column in the system.topology table. The target RF is kept in next_replication, a new column in system_schema.keyspaces.

In make_rf_change_plan, the load balancer schedules the necessary migrations, considering the load of nodes and other pending tablet transitions. Requests from ongoing_rf_changes are processed concurrently and independently of one another. Within each request, racks are processed concurrently. No tablet replica is removed until all required replicas have been added. When adding replicas to a rack, we always start with base tables and do not proceed to views until the base tables are done (when removing, the order is reversed). The intermediate steps are not reflected in the schema. When the RF change is finished:
- in system_schema.keyspaces:
  - next_replication is cleared;
  - new keyspace properties are saved;
- request is removed from ongoing_rf_changes;
- the request is marked as done in system.topology_requests.

Until the request is done, DESCRIBE KEYSPACE shows the replication_v2.

If a request hasn't started to remove replicas, it can be aborted using the task manager. In that case system.topology_requests::error is set (but the request isn't marked as done) and next_replication is set to replication_v2. The load balancer interprets this state and starts rolling back the request. After the rollback is done, we mark the relevant system.topology_requests entry as done (failed), clear the request id from system.topology::ongoing_rf_changes, and remove next_replication.

Fixes: SCYLLADB-567.

No backport needed; new feature.

Closes scylladb/scylladb#24421

* github.com:scylladb/scylladb:
  service: fix indentation
  docs: update documentation
  test: test multi RF changes
  service: tasks: allow aborting ongoing RF changes
  cql3: allow changing RF by more than one when adding or removing a DC
  service: handle multi_rf_change
  service: implement make_rf_change_plan
  service: add keyspace_rf_change_plan to migration_plan
  service: extend tablet_migration_info to handle rebuilds
  service: split update_node_load_on_migration
  service: rearrange keyspace_rf_change handler
  db: add columns to system_schema.keyspaces
  db: service: add ongoing_rf_changes to system.topology
  gms: add keyspace_multi_rf_change feature
2026-04-22 01:46:11 +02:00
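
Illustration (not part of the merge): a simplified sketch of the ordering constraints stated in the description, namely that additions precede removals, base tables precede views when adding, and views precede base tables when removing. It linearizes the steps and ignores node load, concurrency across racks, and other pending transitions, all of which the real load balancer takes into account. `order_rf_change_steps` is a hypothetical name.

```python
from typing import Dict, List, Tuple

TableRack = Tuple[str, str]   # (table name, rack name)
Step = Tuple[str, str, str]   # (action, table name, rack name)


def order_rf_change_steps(to_add: List[TableRack],
                          to_remove: List[TableRack],
                          is_view: Dict[str, bool]) -> List[Step]:
    """Order replica additions and removals for one RF change request."""
    # Stable sort: base tables (False) come before views (True) when adding.
    adds = sorted(to_add, key=lambda tr: is_view[tr[0]])
    # When removing, views come first and base tables last.
    removes = sorted(to_remove, key=lambda tr: not is_view[tr[0]])
    # No replica is removed until every required replica has been added.
    return ([("add", t, r) for t, r in adds]
            + [("remove", t, r) for t, r in removes])
```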
Wojciech Mitros
667a928e81 mv: deduplicate code for consuming fragments in view_update_builder
Deduplicate the fragment-consuming logic in
view_update_builder::generate_updates() by extracting it into three
private methods: consume_both_fragments(), consume_update_fragment(),
and consume_existing_fragment().

The three inlined blocks for cmp < 0, cmp > 0, and cmp == 0 were
identical to the trailing "update only" and "existing only" blocks.
The only semantic change is in the trailing "existing only" path: the
outer tombstone guard is replaced by per-branch tombstone checks inside
consume_existing_fragment(), which is both sufficient and more precise
for the static_row case (uses partition tombstone only, not range
tombstone which is irrelevant for static rows).
2026-04-22 00:26:52 +02:00
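
Illustration (not part of the commit): a sketch of the general pattern, a two-way merge of sorted streams whose three comparison outcomes and two tail cases all dispatch to the same three handlers. Integer keys stand in for mutation fragments, and `merge_fragments` is a hypothetical name; tombstone handling is deliberately left out.

```python
from typing import Iterable, List, Tuple


def merge_fragments(updates: Iterable[int],
                    existing: Iterable[int]) -> List[Tuple[str, int]]:
    """Merge two key-sorted streams and record which handler consumed each key."""
    events: List[Tuple[str, int]] = []

    def consume_update_fragment(key: int) -> None:
        events.append(("update", key))

    def consume_existing_fragment(key: int) -> None:
        events.append(("existing", key))

    def consume_both_fragments(key: int) -> None:
        events.append(("both", key))

    up, ex = iter(updates), iter(existing)
    u, e = next(up, None), next(ex, None)
    while u is not None or e is not None:
        if e is None or (u is not None and u < e):
            # cmp < 0, or only updates remain
            consume_update_fragment(u)
            u = next(up, None)
        elif u is None or e < u:
            # cmp > 0, or only existing rows remain
            consume_existing_fragment(e)
            e = next(ex, None)
        else:
            # cmp == 0
            consume_both_fragments(u)
            u, e = next(up, None), next(ex, None)
    return events
```

Because the tail cases reuse the same handlers as the main loop, there is a single place to change the per-fragment logic.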
Wojciech Mitros
00be36e08f mv: avoid unnecessary copies of existing rows in generate_updates()
In the existing-only tail block of generate_updates(), the clustering
row and static row were extracted from the fragment using a deep copy
constructor (e.g. clustering_row(*_schema, fragment.as_clustering_row()))
even though the fragment is not used afterwards. Replace with moves,
matching the pattern used in all other cases.
2026-04-22 00:26:52 +02:00
Wojciech Mitros
74902dceac mv: simplify clustering row handling in generate_updates()
Two of the three clustering-row cases in generate_updates() used
mutate_as_clustering_row() to apply a tombstone to the row in-place,
then immediately moved the row out of the fragment. This triggered an
unnecessary memory usage recalculation in the reader permit, since:

1. apply(tombstone) does not change external memory usage (tombstone
   is stored inline, not heap-allocated), so the recalculation will
   yield the same result.
2. The fragment is consumed on the very next line, so the tracking
   window is effectively zero.

Simplify these two cases to match the first case (cmp < 0), which
already uses the simpler pattern of moving the row out of the fragment
first, then applying the tombstone on the extracted row.
2026-04-22 00:26:52 +02:00
Wojciech Mitros
7727a37085 mv: rename methods in view_update_builder for clarity
Rename advance_all(), advance_updates() and advance_existings() to
read_both_next_fragments(), read_next_update_fragment() and
read_next_existing_fragment(), respectively. The new names make it
clear that these methods read the next mutation fragment from the
corresponding reader into the cached fragment member.

Also rename on_results() to generate_updates(), which better describes
its role of generating view updates from the previously read fragments.
2026-04-22 00:26:52 +02:00
Wojciech Mitros
490b3f5c6f mv: rename view_update_builder readers and cached fragments
Rename the members of view_update_builder to reflect their roles
more precisely:

  _updates   -> _update_reader
  _existings -> _existing_reader
  _update    -> _update_fragment
  _existing  -> _existing_fragment

This makes the code easier to follow by distinguishing the readers
(which produce a stream of fragments) from the cached fragments
(the most recently read mutation_fragment_v2 from each reader).
2026-04-22 00:26:52 +02:00
Wojciech Mitros
6edacdea74 mv: drop redundant std::move from partition key extraction
The expression

    std::move(std::move(_update)->as_partition_start().key().key())

contains two ineffective std::move calls:

1. The inner std::move(_update) has no effect because
   there is no overload for optimized_optional::operator->()
   which takes "this" by rvalue reference.

2. The outer std::move is applied to a const partition_key&
   (decorated_key::key() returns const&), producing a
   const partition_key&& that still binds to the copy constructor,
   not the move constructor.

Drop both std::move calls to avoid misleading the reader.
2026-04-22 00:24:12 +02:00
Wojciech Mitros
a796a58a1f mv: document single-partition builder scope
Add comments to view_update_builder and make_view_update_builder()
documenting that one builder instance processes at most one base
partition, and that the readers provided should span the same single
partition.
2026-04-22 00:19:07 +02:00
Taras Veretilnyk
7cdf215999 sstables: make sstable::unlink() idempotent
Avoid duplicate work when unlink() is called more than once on the
same sstable. This happens when a caller invokes unlink() explicitly
on an sstable that is also marked for deletion: the destructor's
close_files() path would otherwise call unlink() again, re-firing
_on_delete, double-counting _stats.on_delete() and double-invoking
_manager.on_unlink().
2026-04-21 22:41:02 +02:00
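
Illustration (not part of the commit): a minimal model of an idempotent unlink guarded by a flag. The class `SSTableModel` and its members are hypothetical stand-ins, not the real `sstable` class.

```python
from typing import Callable


class SSTableModel:
    """Toy object whose unlink() side effects must fire at most once."""

    def __init__(self, on_delete: Callable[["SSTableModel"], None]):
        self._on_delete = on_delete
        self._unlinked = False
        self.marked_for_deletion = False

    def unlink(self) -> None:
        if self._unlinked:
            return  # already unlinked: skip callbacks and stats
        self._unlinked = True
        self._on_delete(self)

    def close_files(self) -> None:
        # Mirrors the destructor path: unlink if marked for deletion.
        if self.marked_for_deletion:
            self.unlink()
```

An explicit `unlink()` followed by `close_files()` on an object marked for deletion then triggers the callback exactly once.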
Andrzej Jackowski
b6cb025e9b test/audit: add reproducer for native-protocol batch not being audited
The existing test_batch sends a textual BEGIN BATCH ... APPLY BATCH as a
QUERY message, which goes through the CQL parser and raw::batch_statement::
prepare() — a path that correctly sets audit_info. This missed the bug
where native-protocol BATCH messages (opcode 0x0D), handled by
process_batch_internal in transport/server.cc, construct a batch_statement
without setting audit_info, causing audit to silently skip the batch.

Add _test_batch_native_protocol which uses the driver's BatchStatement
(both unprepared and prepared variants) to exercise this code path.

Refs SCYLLADB-1652
2026-04-21 21:52:26 +02:00
Andrzej Jackowski
f5bb9b6282 audit: set audit_info for native-protocol BATCH messages
Commit 16b56c2451 ("Audit: avoid dynamic_cast on a hot path") moved
audit info into batch_statement via set_audit_info(), but only wired it
for the CQL-text BATCH path (raw::batch_statement::prepare()).
Native-protocol BATCH messages (opcode 0x0D), handled by
process_batch_internal in transport/server.cc, construct a
batch_statement without setting audit_info. This causes audit to
silently skip the entire batch.

Set audit_info on the batch_statement so these batches are audited.

Fixes SCYLLADB-1652
2026-04-21 21:52:26 +02:00
Andrzej Jackowski
5f93d57d6e test/audit: rename internal test methods to avoid CI misdetection
The CI heuristic picks up any function named test_* in changed files
and tries to run it as a standalone pytest test. The AuditTester class
methods (test_batch, test_dml, etc.) are not top-level pytest tests —
they are internal helpers called from the actual test functions.

Prefix them with underscore so CI does not mistake them for
standalone tests.
2026-04-21 21:52:26 +02:00
Ernest Zaslavsky
9faaf1f09c test: extract object storage helpers to test/pylib/object_storage.py
Move S3/GCS server classes (S3Server, MinioWrapper, GSFront, GSServer),
factory functions (create_s3_server, create_gs_server), CQL helpers
(format_tuples, keyspace_options), bucket naming (_make_bucket_name),
and the s3_server fixture from test/cluster/object_store/conftest.py
into a shared module at test/pylib/object_storage.py.
The conftest.py is now a thin wrapper that re-exports symbols and
defines only the fixtures specific to the object_store suite
(object_storage, s3_storage).  All external importers are updated.
Old class names (S3_Server, GSServer) are kept as aliases for
backward compatibility.
2026-04-21 19:08:57 +03:00
Ernest Zaslavsky
e9724f52a9 test: add per-test bucket isolation to object_store fixtures
Create a unique S3/GCS bucket for each test function using the pytest
test name (from request.node.name), sanitized into a valid bucket name.
This ensures tests do not share state through a common bucket and makes
bucket names meaningful for debugging (e.g. test-basic-s3-a1b2c3d4).
Each fixture now calls create_test_bucket() on setup and
destroy_test_bucket() on teardown.
2026-04-21 19:08:57 +03:00
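
Illustration (not part of the commit): one way a test name could be sanitized into a bucket name of the form shown above. The function `sketch_bucket_name`, the 8-character random suffix, and the fallback slug are assumptions; the project's actual helper may differ.

```python
import re
import uuid
from typing import Optional


def sketch_bucket_name(test_name: str, suffix: Optional[str] = None) -> str:
    """Turn a pytest node name into a lowercase, hyphenated name of at most 63 chars."""
    suffix = suffix or uuid.uuid4().hex[:8]
    # Replace every run of characters outside [a-z0-9] with a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", test_name.lower()).strip("-")
    # Leave room for "-<suffix>" within the 63-character limit.
    max_slug_len = 63 - len(suffix) - 1
    slug = slug[:max_slug_len].strip("-") or "test"
    return f"{slug}-{suffix}"
```

A parametrized name such as `test_basic[s3]` becomes `test-basic-s3-<suffix>`, which remains readable when debugging.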
Ernest Zaslavsky
8e02e99c36 s3: add client::make overload with custom retry strategy
Add a client::make overload that accepts a custom retry strategy,
allowing callers to override the default exponential backoff.
Use this in s3_test.cc with a test_retry_strategy that sleeps only
1ms between retries instead of exponential backoff, significantly
reducing test runtime for tests that encounter transient errors
during bucket creation/deletion.
2026-04-21 19:08:57 +03:00
Ernest Zaslavsky
e175088db5 test: add s3_test_fixture and migrate tests to per-bucket isolation
Add s3_test_fixture, an RAII class that creates a unique S3 bucket
on construction and tears down everything (delete all objects, delete
bucket, close client) on destruction. Bucket names are derived from
the Boost test name, pid, and a counter to guarantee uniqueness
across concurrent test processes. Names are sanitized to comply with
S3 bucket naming rules (lowercase, hyphens, 3-63 chars).
Migrate all S3 tests that create objects to use the fixture, removing
manual bucket name construction, deferred_delete_object cleanup, and
per-test deferred_close calls. The fixture owns the client lifecycle.
Tests with special semaphore requirements (broken semaphore for
fallback test, small semaphore for abort test, 1MiB for memory
test) create the fixture with a separate normal-sized semaphore and
use their own constrained client for the test operation.
The upload_file tests are converted from SEASTAR_TEST_CASE
(coroutine) to SEASTAR_THREAD_TEST_CASE since the fixture requires
thread context for .get() calls.
Broaden the minio policy to allow the test user to create and delete
arbitrary buckets (s3:CreateBucket, s3:DeleteBucket, s3:ListAllMyBuckets
on arn:aws:s3:::*), and operate on objects in any bucket.
2026-04-21 19:08:57 +03:00
Ernest Zaslavsky
cc0b9791c7 s3: add create_bucket and delete_bucket to client
Add create_bucket (PUT /<bucket>) and delete_bucket (DELETE /<bucket>)
methods to s3::client, following the same make_request pattern used by
existing object operations.
These will be used by the test infrastructure to create per-test
isolated buckets.
2026-04-21 19:08:57 +03:00
Dario Mirovic
cf237e060a test: auth_cluster: use safe_driver_shutdown() for Cluster teardown
A handful of cassandra-driver Cluster.shutdown() call sites in the
auth_cluster tests were missed by the previous sweep that introduced
safe_driver_shutdown(), because the local variable holding the Cluster
is named "c" rather than "cluster".

Direct Cluster.shutdown() is racy: the driver's "Task Scheduler"
thread may raise RuntimeError ("cannot schedule new futures after
shutdown") during or after the call, occasionally failing tests.
safe_driver_shutdown() suppresses this expected RuntimeError and
joins the scheduler thread.

Replace the remaining c.shutdown() calls in:
  - test/cluster/auth_cluster/test_startup_response.py
  - test/cluster/auth_cluster/test_maintenance_socket.py
with safe_driver_shutdown(c) and add the corresponding import from
test.pylib.driver_utils.

No behavioral change to the tests; only the driver teardown is
hardened against a known driver-side race.

Fixes SCYLLADB-1662

Closes scylladb/scylladb#29576
2026-04-21 17:45:11 +02:00
Radosław Cybulski
6f7bf30a14 alternator: increase wait time to tablet sync
When forcing a tablet count change via a CQL command, the underlying
tablet machinery takes some time to adjust. The original code waited
at most 0.1s for tablet data to be synchronized. This does not seem to be
enough on debug builds, so we add exponential backoff and increase the
maximum waiting time. The code now waits 0.1s the first time and
doubles the wait on each retry, up to a maximum of 6 waits,
or about 6s in total.

Fixes: SCYLLADB-1655

Closes scylladb/scylladb#29573
2026-04-21 17:38:07 +02:00
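
Illustration (not part of the commit): a sketch of the retry schedule described above, assuming six waits starting at 0.1s and doubling each time (0.1 + 0.2 + 0.4 + 0.8 + 1.6 + 3.2 = 6.3s). `wait_with_backoff` and its parameters are hypothetical names; the `sleep` argument is injectable so the schedule can be checked without actually sleeping.

```python
import time
from typing import Callable


def wait_with_backoff(check: Callable[[], bool],
                      initial_delay: float = 0.1,
                      max_waits: int = 6,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll check() with exponentially growing pauses; return its final result."""
    delay = initial_delay
    for _ in range(max_waits):
        if check():
            return True
        sleep(delay)
        delay *= 2  # double the pause before the next attempt
    # One last check after the final wait.
    return check()
```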
Radosław Cybulski
74b523ea20 treewide: fix spelling errors.
Fix various spelling errors.

Closes scylladb/scylladb#29574
2026-04-21 18:20:26 +03:00
Piotr Dulikowski
cb8253067d Merge 'strong_consistency: fix crash when DROP TABLE races with in-flight DML' from Petr Gusev
When DROP TABLE races with an in-flight DML on a strongly-consistent
table, the node aborts in `groups_manager::acquire_server()` because the
raft group has already been erased from `_raft_groups`.

A concurrent `DROP TABLE` may have already removed the table from database
registries and erased the raft group via `schedule_raft_group_deletion`.
The `schema.table()` in `create_operation_ctx()` might not fail though
because someone might be holding `lw_shared_ptr<table>`, so that the
table is dropped but the table object is still alive.

Fix by accepting table_id in acquire_server and checking that the table
still exists in the database via `find_column_family` before looking up
the raft group.  If the table has been dropped, find_column_family
throws no_such_column_family instead of the node aborting via
on_internal_error.  When the table does exist, acquire_server proceeds
to acquire state.gate; schedule_raft_group_deletion co_awaits
gate::close, so it will wait for the DML operation to complete before
erasing the group.

backport: not needed (not released feature)

Fixes SCYLLADB-1450

Closes scylladb/scylladb#29430

* github.com:scylladb/scylladb:
  strong_consistency: fix crash when DROP TABLE races with in-flight DML
  test: add regression test for DROP TABLE racing with in-flight DML
2026-04-21 16:54:20 +02:00
Dario Mirovic
bcda39f716 test: audit: use set diff to identify new audit rows
assert_entries_were_added asserted that new audit rows always appear at
the tail of each per-node, event_time-sorted sequence. That invariant
is not a property of the audit feature: audit writes are asynchronous
with respect to query completion, and on a multi-node cluster QUORUM
reads of audit.audit_log can reveal a row with an older event_time
after a row with a newer one has already been observed.

Replace the positional tail slice with a per-node set difference
between the rows observed before and after the audited operation.
The wait_for retry loop, noise filtering, and final by-value
comparison against expected_entries are unchanged, so the test still
verifies the real contract, that the expected audit entries appear,
without relying on a visibility-ordering invariant that the audit log
does not guarantee.

Fixes SCYLLADB-1589

Closes scylladb/scylladb#29567
2026-04-21 15:33:36 +02:00
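
Illustration (not part of the commit): a sketch of a per-node set difference, with audit rows modelled as hashable tuples of (event_time, statement). `new_rows_per_node` is a hypothetical name, not the helper used in the test suite.

```python
from typing import Dict, Hashable, Iterable, Set


def new_rows_per_node(before: Dict[str, Iterable[Hashable]],
                      after: Dict[str, Iterable[Hashable]]) -> Dict[str, Set[Hashable]]:
    """For each node, return the rows present after the operation but not before."""
    return {
        node: set(rows) - set(before.get(node, ()))
        for node, rows in after.items()
    }
```

Unlike a positional tail slice, the set difference still finds a row whose event_time is older than rows that were already visible.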
Nadav Har'El
6165124fcc Merge 'cql3: statement_restrictions: analyze during prepare time' from Avi Kivity
The statement_restrictions code is responsible for analyzing the WHERE
clause, deciding on the query plan (which index to use), and extracting
the partition and clustering keys to use for the index.

Currently, it suffers from repetition in making its decisions: there are 15
calls to expr::visit in statement_restrictions.cc, and 14 find_binop calls. This
series reduces them to 2 visits (one nested in the other) and 6 find_binop calls. The analysis
of binary operators is done once, then reused.

The key data structure introduced is the predicate. While an expression
takes inputs from the row evaluated, constants, and bind variables, and
produces a boolean result, predicates ask which values for a column (or
a number of columns) are needed to satisfy (part of) the WHERE clause.
The WHERE clause is then expressed as a conjunction of such predicates.
The analyzer uses the predicates to select the index, then uses the predicates
to compute the partition and clustering keys.

The refactoring is composed of these parts (but patches from different parts
are interspersed):

1. an exhaustive regression test is added as the first commit, to ensure behavior doesn't change
2. move computation from query time to prepare time
3. introduce, gradually enrich, and use predicates to implement the statement_restrictions API

Major refactoring, and no bugs fixed, so definitely not backporting.

Closes scylladb/scylladb#29114

* github.com:scylladb/scylladb:
  cql3: statement_restrictions: replace has_eq_restriction_on_column with precomputed set
  cql3: statement_restrictions: replace multi_column_range_accumulator_builder with direct predicate iteration
  cql3: statement_restrictions: use predicate fields in build_get_clustering_bounds_fn
  cql3: statement_restrictions: remove extract_single_column_restrictions_for_column
  cql3: statement_restrictions: use predicate vectors in prepare_indexed_local
  cql3: statement_restrictions: use predicate vector size for clustering prefix length
  cql3: statement_restrictions: replace do_find_idx and is_supported_by with predicate-based versions
  cql3: statement_restrictions: remove expression-based has_supporting_index and index_supports_some_column
  cql3: statement_restrictions: replace multi-column and PK index support checks with predicate-based versions
  cql3: statement_restrictions: add predicate-based index support checking
  cql3: statement_restrictions: use pre-built single-column maps for index support checks
  cql3: statement_restrictions: build clustering-prefix restrictions incrementally
  cql3: statement_restrictions: build partition-range restrictions incrementally
  cql3: statement_restrictions: build clustering-key single-column restrictions map incrementally
  cql3: statement_restrictions: build partition-key single-column restrictions map incrementally
  cql3: statement_restrictions: build non-primary-key single-column restrictions map incrementally
  cql3: statement_restrictions: use tracked has_mc_clustering for _has_multi_column
  cql3: statement_restrictions: track has-token state incrementally
  cql3: statement_restrictions: track partition-key-empty state incrementally
  cql3: statement_restrictions: track first multi-column predicate incrementally
  cql3: statement_restrictions: track last clustering column incrementally
  cql3: statement_restrictions: track clustering-has-slice incrementally
  cql3: statement_restrictions: track has-multi-column-clustering incrementally
  cql3: statement_restrictions: track clustering-empty state incrementally
  cql3: statement_restrictions: replace restr bridge variable with pred.filter
  cql3: statement_restrictions: convert single-column branch to use predicate properties
  cql3: statement_restrictions: convert multi-column branch to use predicate properties
  cql3: statement_restrictions: convert constructor loop to iterate over predicates
  cql3: statement_restrictions: annotate predicates with operator properties
  cql3: statement_restrictions: annotate predicates with is_not_null and is_multi_column
  cql3: statement_restrictions: complete preparation early
  cql3: statement_restrictions: convert expressions to predicates without being directed at a specific column
  cql3: statement_restrictions: refine possible_lhs_values() function_call processing
  cql3: statement_restrictions: return nullptr for function solver if not token
  cql3: statement_restrictions: refine possible_lhs_values() subscript solving
  cql3: statement_restrictions: return nullptr from possible_lhs_values instead of on_internal_error
  cql3: statement_restrictions: convert possible_lhs_values into a solver
  cql3: statement_restrictions: split _where to boolean factors in preparation for predicates conversion
  cql3: statement_restrictions: refactor IS NOT NULL processing
  cql3: statement_restrictions: fold add_single_column_nonprimary_key_restriction() into its caller
  cql3: statement_restrictions: fold add_single_column_clustering_key_restriction() into its caller
  cql3: statement_restrictions: fold add_single_column_partition_key_restriction() into its caller
  cql3: statement_restrictions: fold add_token_partition_key_restriction() into its caller
  cql3: statement_restrictions: fold add_multi_column_clustering_key_restriction() into its caller
  cql3: statement_restrictions: avoid early return in add_multi_column_clustering_key_restrictions
  cql3: statement_restrictions: fold add_is_not_restriction() into its caller
  cql3: statement_restrictions: fold add_restriction() into its caller
  cql3: statement_restrictions: remove possible_partition_token_values()
  cql3: statement_restrictions: remove possible_column_values
  cql3: statement_restrictions: pass schema to possible_column_values()
  cql3: statement_restrictions: remove fallback path in solve()
  cql3: statement_restrictions: reorder possible_lhs_column parameters
  cql3: statement_restrictions: prepare solver for multi-column restrictions
  cql3: statement_restrictions: add solver for token restriction on index
  cql3: statement_restrictions: pre-analyze column in value_for()
  cql3: statement_restrictions: don't handle boolean constants in multi_column_range_accumulator_builder
  cql3: statement_restrictions: split range_from_raw_bounds into prepare phase and query phase
  cql3: statement_restrictions: adjust signature of range_from_raw_bounds
  cql3: statement_restrictions: split multi_column_range_accumulator into prepare-time and query-time phases
  cql3: statement_restrictions: make get_multi_column_clustering_bounds a builder
  cql3: statement_restrictions: multi-key clustering restrictions one layer deeper
  cql3: statement_restrictions: push multi-column post-processing into get_multi_column_clustering_bounds()
  cql3: statement_restrictions: pre-analyze single-column clustering key restrictions
  cql3: statement_restrictions: wrap value_for_index_partition_key()
  cql3: statement_restrictions: hide value_for()
  cql3: statement_restrictions: push down clustering prefix wrapper one level
  cql3: statement_restrictions: wrap functions that return clustering ranges
  cql3: statement_restrictions: do not pass view schema back and forth
  cql3: statement_restrictions: pre-analyze token range restrictions
  cql3: statement_restrictions: pre-analyze partition key columns
  cql3: statement_restrictions: do not collect subscripted partition key columns
  cql3: statement_restrictions: split _partition_range_restrictions into three cases
  cql3: statement_restrictions: move value_list, value_set to header file
  cql3: statement_restrictions: wrap get_partition_key_ranges
  cql3: statement_restrictions: prepare statement_restrictions for capturing `this`
  test: statement_restrictions: add index_selection regression test
2026-04-21 15:44:06 +03:00
Anna Stuchlik
d222e6e2a4 doc: document support for OCI Object Storage
This commit extends the object storage configuration section
with support for OCI Object Storage.

Fixes SCYLLADB-502

Closes scylladb/scylladb#29503
2026-04-21 15:11:58 +03:00
Botond Dénes
cfebe17592 sstables: fix segfault in parse_assert() when message is nullptr
parse_assert() accepts an optional `message` parameter that defaults
to nullptr. When the assertion fails and message is nullptr, it is
implicitly converted to sstring via the sstring(const char*) constructor,
which calls strlen(nullptr) -- undefined behavior that manifests as a
segfault in __strlen_evex.

This turns what should be a graceful malformed_sstable_exception into a
fatal crash. In the case of CUSTOMER-279, a corrupt SSTable triggered
parse_assert() during streaming (in continuous_data_consumer::
fast_forward_to()), causing a crash loop on the affected node.

Fix by guarding the nullptr case with a ternary, passing an empty
sstring() when message is null. on_parse_error() already handles
the empty-message case by substituting "parse_assert() failed".
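
A minimal model of the guard described above, written in Python rather than the
C++ of the actual fix. The names parse_assert and on_parse_error mirror this
message; MalformedSSTableError and the function bodies are illustrative
assumptions, not the real implementation.

```python
from typing import Optional


class MalformedSSTableError(Exception):
    """Hypothetical stand-in for malformed_sstable_exception."""


def on_parse_error(message: str) -> None:
    # Per the commit message, an empty message is replaced by a default text.
    raise MalformedSSTableError(message or "parse_assert() failed")


def parse_assert(condition: bool, message: Optional[str] = None) -> None:
    if not condition:
        # The guard: never hand a null message to the string conversion;
        # pass an empty string instead.
        on_parse_error(message if message is not None else "")
```

With the guard in place, a failed assertion without a message surfaces as an
ordinary exception carrying the default text instead of crashing.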

Fixes: SCYLLADB-1329

Closes scylladb/scylladb#29285
2026-04-21 12:40:33 +02:00
Marcin Maliszkiewicz
935e6a495d Merge 'transport: add per-service-level cql_requests_serving metric' from Piotr Smaron
The existing scylla_transport_requests_serving metric is a single global per-shard gauge counting outstanding CQL requests. When debugging latency spikes, it's useful to know which service level is contributing the most in-flight requests.
This PR adds a new per-scheduling-group gauge scylla_transport_cql_requests_serving (with the scheduling_group_name label), using the existing cql_sg_stats per-SG infrastructure. The cql_ prefix is intentional — it follows the convention of all other per-SG transport metrics (cql_requests_count, cql_request_bytes, etc.) and avoids Prometheus confusion with the global requests_serving metric (which lacks the scheduling_group_name label).

Fixes: SCYLLADB-1340

New feature, no backport.

Closes scylladb/scylladb#29493

* github.com:scylladb/scylladb:
  transport: add per-service-level cql_requests_serving metric
  transport: move requests_serving decrement to after response is sent
2026-04-21 12:35:50 +02:00
Aleksandra Martyniuk
cd79b99112 test: fix flaky test_alter_tablets_rf_dc_drop by using read barrier
The test was reading system_schema.keyspaces from an arbitrary node
that may not have applied the latest schema change yet. Pin the read
to a specific node and issue a read barrier before querying, ensuring
the node has up-to-date data.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1643.

Closes scylladb/scylladb#29563
2026-04-21 09:12:51 +03:00
Raphael S. Carvalho
474e962e01 compaction: Restrict tombstone GC sstable set to repaired sstables for tombstone_gc=repair mode
When tombstone_gc=repair, the repaired compaction view's sstable_set_for_tombstone_gc()
previously returned all sstables across all three views (unrepaired, repairing, repaired).
This is correct but unnecessarily expensive: the unrepaired and repairing sets are never
the source of a GC-blocking shadow when tombstone_gc=repair, for base tables.

The key ordering guarantee that makes this safe is:
- topology_coordinator sends send_tablet_repair RPC and waits for it to complete.
  Inside that RPC, mark_sstable_as_repaired() runs on all replicas, moving D from
  repairing → repaired (repaired_at stamped on disk).
- Only after the RPC returns does the coordinator commit repair_time + sstables_repaired_at
  to Raft.
- gc_before = repair_time - propagation_delay only advances once that Raft commit applies.

Therefore, when a tombstone T in the repaired set first becomes GC-eligible (its
deletion_time < gc_before), any data D it shadows is already in the repaired set on
every replica. This holds because:
- The memtable is flushed before the repairing snapshot is taken (take_storage_snapshot
  calls sg->flush()), capturing all data present at repair time.
- Hints and batchlog are flushed before the snapshot, ensuring remotely-hinted writes
  arrive before the snapshot boundary.
- Legitimate unrepaired data has timestamps close to 'now', always newer than any
  GC-eligible tombstone (USING TIMESTAMP to write backdated data is user error / UB).

Excluding the repairing and unrepaired sets from the GC shadow check cannot cause any
tombstone to be wrongly collected. The memtable check is also skipped for the same
reason: memtable data is either newer than the GC-eligible tombstone, or was flushed
into the repairing/repaired set before gc_before advanced.
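
The eligibility rule used in this argument can be sketched as a toy model. This
is a simplification under stated assumptions, not the code in compaction.cc: the
Tombstone type, can_purge() and the per-sstable minimum timestamps are
hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Tombstone:
    timestamp: int      # write timestamp of the deletion
    deletion_time: int  # wall-clock time at which the deletion was issued


def gc_before(repair_time: int, propagation_delay: int) -> int:
    # From the message: gc_before = repair_time - propagation_delay.
    return repair_time - propagation_delay


def can_purge(t: Tombstone, repair_time: int, propagation_delay: int,
              repaired_min_timestamps: Iterable[int]) -> bool:
    # Not GC-eligible until the committed repair_time has moved past it.
    if t.deletion_time >= gc_before(repair_time, propagation_delay):
        return False
    # Shadow check against repaired sstables only (simplified): any sstable
    # that may hold data at or below the tombstone's timestamp blocks the purge.
    return all(min_ts > t.timestamp for min_ts in repaired_min_timestamps)
```

In this model the unrepaired and repairing sets, and the memtable, are simply
not consulted, which is the behavior the optimization introduces.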

Safety restriction — materialized views:
The optimization IS applied to materialized view tables. Two possible paths could inject
D_view into the MV's unrepaired set after MV repair: view hints and staging via the
view-update-generator. Both are safe:

(1) View hints: flush_hints() creates a sync point covering BOTH _hints_manager (base
mutations) AND _hints_for_views_manager (view mutations). It waits until ALL pending view
hints — including D_view entries queued in _hints_for_views_manager while the target MV
replica was down — have been replayed to the target node before take_storage_snapshot() is
called. D_view therefore lands in the MV's repairing sstable and is promoted to repaired.
When a repaired compaction then checks for shadows it finds D_view in the repaired set,
keeping T_mv non-purgeable.

(2) View-update-generator staging path: Base table repair can write a missing D_base to a
replica via a staging sstable. The view-update-generator processes the staging sstable
ASYNCHRONOUSLY: it may fire arbitrarily later, even after MV repair has committed
repair_time and T_mv has been GC'd from the repaired set. However, the staging processor
calls stream_view_replica_updates() which performs a READ-BEFORE-WRITE via
as_mutation_source_excluding_staging(): it reads the CURRENT base table state before
building the view update. If T_base was written to the base table (as it always is before
the base replica can be repaired and the MV tombstone can become GC-eligible), the
view_update_builder sees T_base as the existing partition tombstone. D_base's row marker
(ts_d < ts_t) is expired by T_base, so the view update is a no-op: D_view is never
dispatched to the MV replica. No resurrection can occur regardless of how long staging is
delayed.

A potential sub-edge-case is T_base being purged BEFORE staging fires (leaving D_base as
the sole survivor, so stream_view_replica_updates would dispatch D_view). This is blocked
by an additional invariant: for tablet-based tables, the repair writer stamps repaired_at
on staging sstables (repair_writer_impl::create_writer sets mark_as_repaired = true and
perform_component_rewrite writes repaired_at = sstables_repaired_at + 1 on every staging
sstable). After base repair commits sstables_repaired_at to Raft, the staging sstable
satisfies is_repaired(sstables_repaired_at, staging_sst) and therefore appears in
make_repaired_sstable_set(). Any subsequent base repair that advances sstables_repaired_at
further still includes the staging sstable (its repaired_at ≤ new sstables_repaired_at).
D_base in the staging sstable thus shadows T_base in every repaired compaction's shadow
check, keeping T_base non-purgeable as long as D_base remains in staging.
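
The stamping invariant can be illustrated with a small model. The predicate
below is an assumption inferred from the "repaired_at ≤ sstables_repaired_at"
wording above, and the SSTable type is hypothetical; the real
repair::is_repaired() may check more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTable:
    repaired_at: int  # 0 is used here to mean "never stamped" (assumption)


def is_repaired(sstables_repaired_at: int, sst: SSTable) -> bool:
    # Assumed predicate: stamped, and not newer than the committed value.
    return 0 < sst.repaired_at <= sstables_repaired_at


committed = 5
staging = SSTable(repaired_at=committed + 1)  # stamped by the repair writer

before_commit = is_repaired(committed, staging)      # repair not yet committed
after_commit = is_repaired(committed + 1, staging)   # this repair committed
after_later = is_repaired(committed + 2, staging)    # a later repair committed
```

Under these assumptions the staging sstable counts as repaired once its own
repair commits and stays repaired after every later repair.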

A base table hint also cannot bypass this. A base hint is replayed as a base mutation. The
resulting view update is generated synchronously on the base replica and sent to the MV
replica via _hints_for_views_manager (path 1 above), not via staging.

USING TIMESTAMP with timestamps predating (gc_before + propagation_delay) is explicitly
UB and excluded from the safety argument.

For tombstone_gc modes other than repair (timeout, immediate, disabled) the invariant
does not hold for base tables either, so the full storage-group set is returned.

Implementation:
- Add compaction_group::is_repaired_view(v): pointer comparison against _repaired_view.
- Add compaction_group::make_repaired_sstable_set(): iterates _main_sstables and inserts
  only sstables classified as repaired (repair::is_repaired(sstables_repaired_at, sst)).
- Add storage_group::make_repaired_sstable_set(): collects repaired sstables across all
  compaction groups in the storage group.
- Add table::make_repaired_sstable_set_for_tombstone_gc(): collects repaired sstables from
  all compaction groups across all storage groups (needed for multi-tablet tables).
- Add compaction_group_view::skip_memtable_for_tombstone_gc(): returns true iff the
  repaired-only optimization is active; used by get_max_purgeable_timestamp() in
  compaction.cc to bypass the memtable shadow check.
- is_tombstone_gc_repaired_only() private helper gates both methods: requires
  is_repaired_view(this) && tombstone_gc_mode == repair. No is_view() exclusion.
- Add error injection "view_update_generator_pause_before_processing" in
  process_staging_sstables() to support testing the staging-delay scenario.
- New test test_tombstone_gc_mv_optimization_safe_via_hints: stops servers[2], writes
  D_base + T_base (view hints queued for servers[2]'s MV replica), restarts, runs MV
  tablet repair (flush_hints delivers D_view + T_mv before snapshot), triggers repaired
  compaction, and asserts the MV row is NOT visible — T_mv preserved because D_view
  landed in the repaired set via the hints-before-snapshot path.
- New test test_tombstone_gc_mv_safe_staging_processor_delay: runs base repair before
  writing T_base so D_base is staged on servers[0] via row-sync; blocks the
  view-update-generator with an error injection; writes T_base + T_mv; runs MV repair
  (fast path, T_mv GC-eligible); triggers repaired compaction (T_mv purged — no D_view
  in repaired set); asserts no resurrection; releases injection; waits for staging to
  complete; asserts no resurrection after a second flush+compaction. Demonstrates that
  the read-before-write in stream_view_replica_updates() makes the optimization safe even
  when staging fires after T_mv has been GC'd.

The expected gain is reduced bloom filter and memtable key-lookup I/O during repaired
compactions: the unrepaired set is typically the largest (it holds all recent writes),
yet for tombstone_gc=repair it never influences GC decisions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-20 16:59:09 -03:00
Ferenc Szili
a50aa7e689 test/cluster: wait for ready CQL in cross-rack merge test
test_tablet_merge_cross_rack_migrations() starts issuing DDL immediately
after adding the new cross-rack nodes. In the failing runs the driver is
still converging on the updated topology at that point, so the control
connection sees incomplete peer metadata while schema changes are in
flight.

That leaves a race where CREATE TABLE is sent during topology churn and
the test can surface a misleading AlreadyExists error even though the
table creation has already been committed. Use get_ready_cql(servers)
here so the test waits for inter-node visibility and CQL readiness
before creating the keyspace and table.
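
The general "wait for readiness before issuing DDL" pattern can be sketched as
a polling helper. This is not the implementation of get_ready_cql(); wait_until()
and its parameters are hypothetical and only show the shape of the idea.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool],
               timeout: float = 30.0,
               interval: float = 0.1) -> None:
    """Poll predicate until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met before deadline")
        time.sleep(interval)
```

A test would call such a helper with a predicate that checks that every node is
visible and accepting CQL, and only then create the keyspace and table.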

Fixes: SCYLLADB-1635

Closes scylladb/scylladb#29561
2026-04-20 20:12:11 +02:00
Łukasz Paszkowski
d18eb9479f cql/statement: Create keyspace_metadata with correct initial_tablets count
In `ks_prop_defs::as_ks_metadata(...)`, the default initial tablets count
is set to 0 only when tablets are enabled and the replication strategy
is NetworkTopologyStrategy.

This effectively sets _uses_tablets = false in abstract_replication_strategy
for the remaining strategies when no `tablets = {...}` options are specified.
As a consequence, it is possible to create vnode-based keyspaces even
when tablets are enforced with `tablets_mode_for_new_keyspaces`.

The patch sets the default initial tablets count to zero regardless of
the chosen replication strategy. Each replication strategy then
validates the options and raises a configuration exception when tablets
are not supported.
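
A toy model of the behavior after the patch, as a sketch only. The set of
tablet-capable strategies, the 'initial' option key, and all names below are
assumptions made for illustration; the real logic lives in
`ks_prop_defs::as_ks_metadata(...)` and in each strategy's option validation.

```python
from typing import Optional


class ConfigurationError(Exception):
    """Hypothetical stand-in for the configuration exception."""


# Assumption for this sketch: only this strategy supports tablets.
TABLET_CAPABLE = {"NetworkTopologyStrategy"}


def resolve_initial_tablets(strategy: str,
                            tablets_on_by_default: bool,
                            tablets_option: Optional[dict] = None) -> Optional[int]:
    """Return the initial tablets count, or None for a vnode-based keyspace."""
    if tablets_option is not None and tablets_option.get("enabled") is False:
        return None  # explicit opt-out: tablets = {'enabled': false}
    if tablets_option is None and not tablets_on_by_default:
        return None
    # After the patch: default to 0 regardless of the chosen strategy ...
    initial = (tablets_option or {}).get("initial", 0)
    # ... and let the strategy reject tablets if it cannot support them.
    if strategy not in TABLET_CAPABLE:
        raise ConfigurationError(f"{strategy} does not support tablets")
    return initial
```

In this model a keyspace using SimpleStrategy is rejected when tablets are on by
default, unless it opts out explicitly, which matches how the tests were
adjusted.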

All tests are altered in the following way:
+ whenever it was correct, SimpleStrategy was replaced with NetworkTopologyStrategy
+ otherwise, tablets were explicitly disabled with ` AND tablets = {'enabled': false}`

Fixes https://github.com/scylladb/scylladb/issues/25340

Closes scylladb/scylladb#25342
2026-04-20 17:57:38 +03:00
Botond Dénes
69c58c6589 Merge 'streaming: add oos protection in mutation based streaming' from Łukasz Paszkowski
The mutation-fragment-based streaming path in `stream_session.cc` did not check whether the receiving node was in critical disk utilization mode before accepting incoming mutation fragments. This meant that operations like `nodetool refresh --load-and-stream`, which stream data through the `STREAM_MUTATION_FRAGMENTS` RPC handler, could push data onto a node that had already reached critical disk usage.

The file-based streaming path in `stream_blob.cc` already had this protection, but the load-and-stream path was missing it.

This patch adds a check for `is_in_critical_disk_utilization_mode()` in the `stream_mutation_fragments` handler in `stream_session.cc`, throwing a `replica::critical_disk_utilization_exception` when the node is at critical disk usage. This mirrors the existing protection in the blob streaming path and closes the gap that allowed data to be written to a node that should have been rejecting all incoming writes.
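
The shape of the guard can be sketched as a small Python model. The real handler
is C++ in `stream_session.cc`; the Node class, the exception class and the
function body here are hypothetical and only mirror the names used above.

```python
class CriticalDiskUtilizationError(Exception):
    """Hypothetical stand-in for replica::critical_disk_utilization_exception."""


class Node:
    def __init__(self) -> None:
        self.critical = False
        self.received: list = []

    def is_in_critical_disk_utilization_mode(self) -> bool:
        return self.critical


def stream_mutation_fragments(node: Node, fragments: list) -> int:
    # Reject the stream up front instead of writing to a nearly full disk.
    if node.is_in_critical_disk_utilization_mode():
        raise CriticalDiskUtilizationError("node is at critical disk utilization")
    node.received.extend(fragments)
    return len(fragments)
```

The check runs before any fragment is accepted, so a rejected stream leaves the
receiving node's data untouched.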

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-901

The out-of-space prevention mechanism was introduced in 2025.4. The fix should be backported to that version and all later ones.

Closes scylladb/scylladb#28873

* github.com:scylladb/scylladb:
  streaming: reject mutation fragments on critical disk utilization
  test/cluster/storage: Add a reproducer for load-and-stream out-of-space rejection
  sstables: clean up TemporaryHashes file in wipe()
  sstables: add error injection point in write_components
  test/cluster/storage: extract validate_data_existence to module scope
  test/cluster: enable suppress_disk_space_threshold_checks in tests using data_file_capacity
  utils/disk_space_monitor: add error injection to suppress threshold checks
2026-04-20 17:56:36 +03:00
David Garcia
16ed338a89 Fix CODEOWNERS to cover nested docs subfolders
The `docs/*` pattern only matches files directly inside `docs/`,
not files in nested subfolders like `docs/folder_b/test.md` or
`docs/alternator/setup.md`. Those files currently have no code
owner assigned.

Replace with `/docs/` and `/docs/alternator/` which match the
directories and all their subdirectories recursively, per GitHub's
CODEOWNERS syntax.

Ref: https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners

Closes scylladb/scylladb#29521
2026-04-20 17:55:43 +03:00
Avi Kivity
5687a4840d conf: pair sstable_format=ms with column_index_size_in_kb=1
One of the advantages of Trie indexes (with sstable_format=ms) is that
the index is more compact, and more suitable for paging from disk
(fewer pages required per search). We can exploit it by setting
column_index_size_in_kb to 1 rather than 64, increasing the index file
size (and requiring more index pages to be loaded and parsed) in return
for smaller data file reads.

To test this, I created a 1M row partition with 300-byte rows, compacted
it into a single sstable, and tested reads to a single row.

With column_index_size_in_kb=64:

Rows.db file size 60k
3 pages read from Rows.db (4k each)
2x 32k read from Data.db

With column_index_size_in_kb=1:

Rows.db file size 2MB (33X)
5 pages read from Rows.db (4k each, 1.7X)
1x 4107 bytes read from Data.db (0.5X IOPS, 0.06X bandwidth)

Given that Rows.db will be typically cached, or at least all but one of the
levels (its size is 157X smaller than Data.db), we win on both IOPS
and bandwidth.

I would have expected the Data.db read to be closer to 1k, but this
is already an improvement.

Given that, set column_index_size_in_kb=1, but only for new clusters
where we also select sstable_format=ms.

Raw data (w1, w64 are working directories with different
column_index_size_in_kb):

```console
$ ls -l w*/data/bench/wide_partition-*/*{Rows,Data}.db
-rw-r--r-- 1 avi avi 314964958 Apr 19 16:17 w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Data.db
-rw-r--r-- 1 avi avi   2001227 Apr 19 16:17 w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db
-rw-r--r-- 1 avi avi 314963261 Apr 19 16:18 w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db
-rw-r--r-- 1 avi avi     59989 Apr 19 16:18 w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db
```
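
The ratios quoted earlier in this message can be re-derived from the file sizes
listed above and the Data.db read sizes stated in the summary. This is just
arithmetic; the variable names are made up for the example.

```python
# File sizes in bytes, taken from the ls output above.
rows_db_64 = 59_989        # Rows.db with column_index_size_in_kb=64
rows_db_1 = 2_001_227      # Rows.db with column_index_size_in_kb=1
data_db_1 = 314_964_958    # Data.db with column_index_size_in_kb=1

# Data.db bytes requested per single-row read, from the summary above.
read_64 = 2 * 32_768       # two 32k reads
read_1 = 4_107             # one 4107-byte read

index_growth = rows_db_1 / rows_db_64    # roughly 33
bandwidth_ratio = read_1 / read_64       # roughly 0.06
data_to_index = data_db_1 / rows_db_1    # roughly 157

print(f"Rows.db growth:      {index_growth:.1f}x")
print(f"Data.db bandwidth:   {bandwidth_ratio:.2f}x")
print(f"Data.db / Rows.db:   {data_to_index:.0f}x")
```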

column_index_size_in_kb=64 trace:

```
cqlsh> SELECT * FROM bench.wide_partition WHERE pk = 0 AND ck = 654321 BYPASS CACHE;

 pk | ck     | v
----+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  0 | 654321 | 9OXdwmDHRapL2w5YruWLTOtiC3PKbyctSDdQ8YpuPKtWkSYBF10G7bKo2rdnxSAd52HLI21568YM7OwK05B6qAF7X2b6910qsJEA106QBEcFWQVybMCkxkpO4VDRcAVNLRgjB3vygcDBP17GBTb2s7l47UOloy3KtZ7J5YQgKcf7zlFSKGHa49vnRrzoXZCdYexOpix6jcSV2SiwRNqgv6XmYhx43ZwGa4zUtOe0eIKJj7KTxu5bzyWUWGW7US4NLFZRD8Vdb6EasIFkOfVKdiFp2LZHMXGRvtvdF93UTFUb

(1 rows)

Tracing session: 19219900-3bf3-11f1-bc43-c0a4e62b53d1

 activity                                                                                                                                                                                                                 | timestamp                        | source    | source_elapsed | client
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+-----------+----------------+-----------
                                                                                                                                                                                                       Execute CQL3 query |       2026-04-19 16:24:30.992000 | 127.0.0.1 |              0 | 127.0.0.1
                                                                                                                                                                                 Parsing a statement [shard 0/sl:default] | 2026-04-19 16:24:30.992643+00:00 | 127.0.0.1 |              1 | 127.0.0.1
                                                                                                                                            Processing a statement for authenticated user: anonymous [shard 0/sl:default] | 2026-04-19 16:24:30.992738+00:00 | 127.0.0.1 |             96 | 127.0.0.1
                                                                                                                                                               Executing read query (reversed false) [shard 0/sl:default] | 2026-04-19 16:24:30.992765+00:00 | 127.0.0.1 |            123 | 127.0.0.1
                        Creating read executor for token -3485513579396041028 with all: [cf134ebd-5f1b-4844-94e3-e5c7ad9421f0] targets: [cf134ebd-5f1b-4844-94e3-e5c7ad9421f0] repair decision: NONE [shard 0/sl:default] | 2026-04-19 16:24:30.992781+00:00 | 127.0.0.1 |            139 | 127.0.0.1
                                                                           Creating never_speculating_read_executor - speculative retry is disabled or there are no extra replicas to speculate with [shard 0/sl:default] | 2026-04-19 16:24:30.992782+00:00 | 127.0.0.1 |            140 | 127.0.0.1
                                                                                                                                                                         read_data: querying locally [shard 0/sl:default] | 2026-04-19 16:24:30.992795+00:00 | 127.0.0.1 |            153 | 127.0.0.1
                                                                                                                            Start querying singular range {{-3485513579396041028, pk{000400000000}}} [shard 0/sl:default] | 2026-04-19 16:24:30.992801+00:00 | 127.0.0.1 |            160 | 127.0.0.1
                                                                                                                                      [reader concurrency semaphore sl:default] admitted immediately [shard 0/sl:default] | 2026-04-19 16:24:30.992805+00:00 | 127.0.0.1 |            163 | 127.0.0.1
                                                                                                                                            [reader concurrency semaphore sl:default] executing read [shard 0/sl:default] | 2026-04-19 16:24:30.992814+00:00 | 127.0.0.1 |            172 | 127.0.0.1
                        Reading key {-3485513579396041028, pk{000400000000}} from sstable w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db [shard 0/sl:default] | 2026-04-19 16:24:30.992837+00:00 | 127.0.0.1 |            195 | 127.0.0.1
                                         page cache miss: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Partitions.db, page=0, readahead=1 [shard 0/sl:default] | 2026-04-19 16:24:30.992851+00:00 | 127.0.0.1 |            209 | 127.0.0.1
                                              page cache miss: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db, page=14, readahead=1 [shard 0/sl:default] | 2026-04-19 16:24:30.995294+00:00 | 127.0.0.1 |           2653 | 127.0.0.1
                                                            page cache hit: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db, page=14 [shard 0/sl:default] | 2026-04-19 16:24:30.995375+00:00 | 127.0.0.1 |           2733 | 127.0.0.1
                                               page cache miss: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db, page=2, readahead=1 [shard 0/sl:default] | 2026-04-19 16:24:30.995376+00:00 | 127.0.0.1 |           2734 | 127.0.0.1
                                                            page cache hit: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db, page=14 [shard 0/sl:default] | 2026-04-19 16:24:30.995463+00:00 | 127.0.0.1 |           2821 | 127.0.0.1
                                                             page cache hit: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db, page=2 [shard 0/sl:default] | 2026-04-19 16:24:30.995463+00:00 | 127.0.0.1 |           2821 | 127.0.0.1
                              w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db: scheduling bulk DMA read of size 32768 at offset 206057984 [shard 0/sl:default] | 2026-04-19 16:24:30.995471+00:00 | 127.0.0.1 |           2829 | 127.0.0.1
                              w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db: scheduling bulk DMA read of size 32768 at offset 206090752 [shard 0/sl:default] | 2026-04-19 16:24:30.995475+00:00 | 127.0.0.1 |           2833 | 127.0.0.1
 w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db: finished bulk DMA read of size 32768 at offset 206057984, successfully read 32768 bytes [shard 0/sl:default] | 2026-04-19 16:24:30.995586+00:00 | 127.0.0.1 |           2945 | 127.0.0.1
                            Page stats: 1 partition(s) (1 live, 0 dead), 0 static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 1 cell(s) (1 live, 0 dead) [shard 0/sl:default] | 2026-04-19 16:24:30.995637+00:00 | 127.0.0.1 |           2995 | 127.0.0.1
 w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db: finished bulk DMA read of size 32768 at offset 206090752, successfully read 32768 bytes [shard 0/sl:default] | 2026-04-19 16:24:30.995645+00:00 | 127.0.0.1 |           3003 | 127.0.0.1
                                                                                                                                                                                    Querying is done [shard 0/sl:default] | 2026-04-19 16:24:30.995653+00:00 | 127.0.0.1 |           3012 | 127.0.0.1
                                                                                                                                                                Done processing - preparing a result [shard 0/sl:default] | 2026-04-19 16:24:30.995670+00:00 | 127.0.0.1 |           3028 | 127.0.0.1
                                                                                                                                                                                                         Request complete |       2026-04-19 16:24:30.995039 | 127.0.0.1 |           3039 | 127.0.0.1

```

column_index_size_in_kb=1 trace:

```
cqlsh> SELECT * FROM bench.wide_partition WHERE pk = 0 AND ck = 654321 BYPASS CACHE;

 pk | ck     | v
----+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  0 | 654321 | FIA7X52ZqYwvDxEGlmWJUSy1I94WTuWZTdLwXr9HBQ90RJLqYKr5nInTADSI6hzofwawaXphAQK07YMoyzFfRaGeKPQPKUb35XpLEGvLJ4xu9r4es8wUEHPXaFBGdMcWUkyDJSTYCFzZAPCzUHEuPJHMXVrI6UExWrIR0Xujg4GZa9UciU9rbEvrSBwSzoPEfbXJ6qZSGiTD8gcXz5kdAblLxsAeWug8tZqslsTu04HMLKfZ8WopQvHbpR6YlGSnM99CiBgz30LMmllULV4VA4u9kMpzsRV2IE2tKmJOddEl

(1 rows)

Tracing session: 3953a1f0-3bf3-11f1-b976-4a3dc2a7a57f

 activity                                                                                                                                                                                                              | timestamp                        | source    | source_elapsed | client
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+-----------+----------------+-----------
                                                                                                                                                                                                    Execute CQL3 query |       2026-04-19 16:25:25.007000 | 127.0.0.1 |              0 | 127.0.0.1
                                                                                                                                                                              Parsing a statement [shard 0/sl:default] | 2026-04-19 16:25:25.007423+00:00 | 127.0.0.1 |              1 | 127.0.0.1
                                                                                                                                         Processing a statement for authenticated user: anonymous [shard 0/sl:default] | 2026-04-19 16:25:25.007511+00:00 | 127.0.0.1 |             89 | 127.0.0.1
                                                                                                                                                            Executing read query (reversed false) [shard 0/sl:default] | 2026-04-19 16:25:25.007536+00:00 | 127.0.0.1 |            114 | 127.0.0.1
                     Creating read executor for token -3485513579396041028 with all: [e7bd75e7-6d2a-46dc-9f66-430524f40e0d] targets: [e7bd75e7-6d2a-46dc-9f66-430524f40e0d] repair decision: NONE [shard 0/sl:default] | 2026-04-19 16:25:25.007551+00:00 | 127.0.0.1 |            129 | 127.0.0.1
                                                                        Creating never_speculating_read_executor - speculative retry is disabled or there are no extra replicas to speculate with [shard 0/sl:default] | 2026-04-19 16:25:25.007553+00:00 | 127.0.0.1 |            131 | 127.0.0.1
                                                                                                                                                                      read_data: querying locally [shard 0/sl:default] | 2026-04-19 16:25:25.007556+00:00 | 127.0.0.1 |            134 | 127.0.0.1
                                                                                                                         Start querying singular range {{-3485513579396041028, pk{000400000000}}} [shard 0/sl:default] | 2026-04-19 16:25:25.007562+00:00 | 127.0.0.1 |            139 | 127.0.0.1
                                                                                                                                   [reader concurrency semaphore sl:default] admitted immediately [shard 0/sl:default] | 2026-04-19 16:25:25.007564+00:00 | 127.0.0.1 |            142 | 127.0.0.1
                                                                                                                                         [reader concurrency semaphore sl:default] executing read [shard 0/sl:default] | 2026-04-19 16:25:25.007573+00:00 | 127.0.0.1 |            151 | 127.0.0.1
                      Reading key {-3485513579396041028, pk{000400000000}} from sstable w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Data.db [shard 0/sl:default] | 2026-04-19 16:25:25.007594+00:00 | 127.0.0.1 |            172 | 127.0.0.1
                                       page cache miss: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Partitions.db, page=0, readahead=1 [shard 0/sl:default] | 2026-04-19 16:25:25.007607+00:00 | 127.0.0.1 |            184 | 127.0.0.1
                                           page cache miss: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=488, readahead=1 [shard 0/sl:default] | 2026-04-19 16:25:25.016029+00:00 | 127.0.0.1 |           8607 | 127.0.0.1
                                                         page cache hit: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=488 [shard 0/sl:default] | 2026-04-19 16:25:25.016109+00:00 | 127.0.0.1 |           8687 | 127.0.0.1
                                           page cache miss: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=486, readahead=1 [shard 0/sl:default] | 2026-04-19 16:25:25.016111+00:00 | 127.0.0.1 |           8688 | 127.0.0.1
                                           page cache miss: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=285, readahead=1 [shard 0/sl:default] | 2026-04-19 16:25:25.016176+00:00 | 127.0.0.1 |           8754 | 127.0.0.1
                                                         page cache hit: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=488 [shard 0/sl:default] | 2026-04-19 16:25:25.016260+00:00 | 127.0.0.1 |           8838 | 127.0.0.1
                                                         page cache hit: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=486 [shard 0/sl:default] | 2026-04-19 16:25:25.016261+00:00 | 127.0.0.1 |           8839 | 127.0.0.1
                                                         page cache hit: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=285 [shard 0/sl:default] | 2026-04-19 16:25:25.016261+00:00 | 127.0.0.1 |           8839 | 127.0.0.1
                             w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Data.db: scheduling bulk DMA read of size 4107 at offset 206086656 [shard 0/sl:default] | 2026-04-19 16:25:25.016268+00:00 | 127.0.0.1 |           8846 | 127.0.0.1
 w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Data.db: finished bulk DMA read of size 4107 at offset 206086656, successfully read 4608 bytes [shard 0/sl:default] | 2026-04-19 16:25:25.016340+00:00 | 127.0.0.1 |           8918 | 127.0.0.1
                         Page stats: 1 partition(s) (1 live, 0 dead), 0 static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 1 cell(s) (1 live, 0 dead) [shard 0/sl:default] | 2026-04-19 16:25:25.016367+00:00 | 127.0.0.1 |           8945 | 127.0.0.1
                                                                                                                                                                                 Querying is done [shard 0/sl:default] | 2026-04-19 16:25:25.016385+00:00 | 127.0.0.1 |           8963 | 127.0.0.1
                                                                                                                                                             Done processing - preparing a result [shard 0/sl:default] | 2026-04-19 16:25:25.016401+00:00 | 127.0.0.1 |           8979 | 127.0.0.1
                                                                                                                                                                                                      Request complete |       2026-04-19 16:25:25.015989 | 127.0.0.1 |           8989 | 127.0.0.1
```

Closes scylladb/scylladb#29552
2026-04-20 17:53:56 +03:00
Marcin Maliszkiewicz
e414b2b0b9 test/cluster: scale failure_detector_timeout_in_ms by build mode
Six cluster test files override failure_detector_timeout_in_ms to 2000ms
for faster failure detection. In debug and sanitize builds, this causes
flaky node join failures. The following log analysis shows how.

The coordinator (server 614, IP 127.2.115.3) accepts the joining node
(server 615, host_id 53b01f0b, IP 127.2.115.2) into group0:

  20:10:57,049 [shard 0] raft_group0 - server 614 entered
    'join group0' transition state for 53b01f0b

The joining node begins receiving the raft snapshot 100ms later:

  20:10:57,150 [shard 0] raft_group0 - transfer snapshot from 9fa48539

It then spends ~280ms applying schema changes -- creating 6 keyspaces
and 12+ tables from the snapshot:

  20:10:57,511 [shard 0] migration_manager - Creating keyspace
    system_auth_v2
  ...
  20:10:57,788 [shard 0] migration_manager - Creating
    system_auth_v2.role_members

Meanwhile, the coordinator's failure detector pings the joining node.
Under debug+ASan load the RPC call times out after ~4.6 seconds:

  20:11:01,643 [shard 0] direct_failure_detector - unexpected exception
    when pinging 53b01f0b: seastar::rpc::timeout_error
    (rpc call timed out)

25ms later, the coordinator marks the joining node DOWN and removes it:

  20:11:01,668 [shard 0] raft_group0 - failure_detector_loop:
    Mark node 53b01f0b as DOWN
  20:11:01,717 [shard 0] raft_group0 - bootstrap: failed to accept
    53b01f0b

The joining node was still retrying the snapshot transfer at that point:

  20:11:01,745 [shard 0] raft_group0 - transfer snapshot from 9fa48539

It then receives the ban notification and aborts:

  20:11:01,844 [shard 0] raft_group0 - received notification of being
    banned from the cluster
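
The intervals quoted in the walk-through above can be re-derived from the log timestamps. The snippet below is only an illustrative check: the helper names `ts` and `ms_between` are hypothetical, and the timestamps are copied from the excerpts in this message.

```python
from datetime import datetime, timedelta


def ts(text: str) -> datetime:
    """Parse an 'HH:MM:SS,mmm' log timestamp (the date is irrelevant here)."""
    return datetime.strptime(text, "%H:%M:%S,%f")


def ms_between(start: str, end: str) -> int:
    """Whole milliseconds elapsed between two log timestamps."""
    return round((ts(end) - ts(start)) / timedelta(milliseconds=1))


join_started = "20:10:57,049"     # coordinator enters 'join group0'
snapshot_started = "20:10:57,150"  # joining node starts snapshot transfer
schema_first = "20:10:57,511"      # first 'Creating keyspace' line
schema_last = "20:10:57,788"       # last 'Creating ...' line shown
ping_timeout = "20:11:01,643"      # rpc call timed out
marked_down = "20:11:01,668"       # node marked DOWN

print(ms_between(join_started, snapshot_started))  # 101
print(ms_between(schema_first, schema_last))       # 277
print(ms_between(join_started, ping_timeout))      # 4594
print(ms_between(ping_timeout, marked_down))       # 25
```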

Replace the hardcoded 2000ms with the failure_detector_timeout fixture
from conftest.py, which scales by MODES_TIMEOUT_FACTOR: 3x for
debug/sanitize (6000ms), 2x for dev (4000ms), 1x for release (2000ms).
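
The scaling described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual conftest.py code: the factor values mirror the ones quoted in this message, while `BASE_TIMEOUT_MS`, `scaled_failure_detector_timeout_ms`, and the 1x fallback for unknown modes are assumptions made for the example. The real fixture may be shaped differently.

```python
# Hypothetical sketch; the real fixture in conftest.py may differ.
# Factors as quoted in the commit message.
MODES_TIMEOUT_FACTOR = {"debug": 3, "sanitize": 3, "dev": 2, "release": 1}

BASE_TIMEOUT_MS = 2000  # the value the tests previously hardcoded


def scaled_failure_detector_timeout_ms(build_mode: str) -> int:
    """Return the failure detector timeout scaled for the given build mode."""
    # Unknown modes fall back to 1x (an assumption of this sketch).
    return BASE_TIMEOUT_MS * MODES_TIMEOUT_FACTOR.get(build_mode, 1)


for mode in ("debug", "sanitize", "dev", "release"):
    print(mode, scaled_failure_detector_timeout_ms(mode))
```

A test would then pass the scaled value as failure_detector_timeout_in_ms instead of the literal 2000.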

Test measurements (before -> after fix):

  debug mode:
  test_replace_with_same_ip_twice           24.02s ->  25.02s
  test_banned_node_notification            217.22s -> 221.72s
  test_kill_coordinator_during_op          116.11s -> 127.13s
  test_node_failure_during_tablet_migration
    [streaming-source]                     183.25s -> 192.69s
  test_replace (4 tests)        skipped in debug (skip_in_debug)
  test_raft_replace_ignore_nodes  skipped in debug (run_in_dev only)

  dev mode:
  test_replace_different_ip                 10.51s ->  11.50s
  test_replace_different_ip_using_host_id   10.01s ->  12.01s
  test_replace_reuse_ip                     10.51s ->  12.03s
  test_replace_reuse_ip_using_host_id       13.01s ->  12.01s
  test_raft_replace_ignore_nodes            19.52s ->  19.52s
2026-04-20 15:28:34 +02:00
Marcin Maliszkiewicz
99ac36b353 test/cluster: add failure_detector_timeout fixture
Add a shared pytest fixture that scales the failure detector timeout
by build mode factor (e.g. 3x for debug/sanitize, 2x for dev).
2026-04-20 15:28:33 +02:00
Marcin Maliszkiewicz
c136b2e640 audit: drop sstring temporaries on the will_log() fast path
audit::will_log() is called for every CQL/Alternator request. With
non-empty keyspace it does:

    _audited_keyspaces.find(sstring(keyspace))
    should_log_table(sstring(keyspace), sstring(table))

constructing three temporary sstrings from the std::string_view
arguments on every call. Now that the underlying associative containers
use std::less<> as comparator (previous commit), find() accepts the
string_view directly. Switch should_log_table() to take string_view as
well so the temporaries disappear entirely.

For short keyspace names the temporaries stay in SSO so allocs/op is
unchanged at 58.1, but each construction still costs ~60 instructions.

perf-simple-query --smp 1 --duration 15 --audit "table"
                  --audit-keyspaces "ks-non-existing"
                  --audit-categories "DCL,DDL,AUTH,DML,QUERY"

build: --mode=release --use-profile="" (no PGO)

Before (regression introduced in 9646ee05bd):
    instructions_per_op: 36952

After:
    instructions_per_op: 36768

Brings insns/op back to the pre-regression baseline 3d0582d51e
(insns/op ~36777) within the per-run noise of ~15 insns standard
deviation, eliminating the ~180 insns/op regression.
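
As a quick sanity check of the figures above, the arithmetic can be spelled out. The variable names are illustrative only; the numbers are the ones reported in this message.

```python
before = 36952        # insns/op with the regression
after = 36768         # insns/op with this patch
baseline = 36777      # approximate insns/op at 3d0582d51e
noise_stddev = 15     # approximate per-run standard deviation

regression = before - baseline   # cost added by the regression
recovered = before - after       # cost removed by this patch
residual = after - baseline      # remaining distance from the baseline

print(regression, recovered, residual)  # 175 184 -9
print(abs(residual) <= noise_stddev)    # True
```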

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1616
2026-04-20 15:18:22 +02:00
Marcin Maliszkiewicz
724b9e66ea audit: enable heterogeneous lookup on audited keyspaces/tables
Replace the bare std::set<sstring>/std::map<sstring, std::set<sstring>>
member types with named aliases that use std::less<> as the comparator.
The transparent comparator enables heterogeneous lookup with
string_view keys.

This commit is a pure refactor with no behavioral change: the parser
return types, constructor parameters, observer template instantiations,
and start_audit() locals are all updated to use the aliases.
2026-04-20 15:14:58 +02:00
Marcin Maliszkiewicz
9f11920b15 Merge 'alternator: fix remaining problems with new Stream ARN format' from Nadav Har'El
This small series includes a few followups to the patch that changed Alternator Stream ARNs from our own UUID format to one that resembles Amazon's Stream ARNs (so that the KCL library won't reject them as bogus-looking).

The first patch is the most important one, fixing ListStreams's LastEvaluatedStreamArn to also use the new ARN format. It fixes SCYLLADB-539.

The following patches are additional cleanups and tests for the new ARN code.

Closes scylladb/scylladb#29474

* github.com:scylladb/scylladb:
  alternator: fix ListStreams paging if table is deleted during paging
  test/alternator: test DescribeStream on non-existent table
  alternator: ListStreams: on last page, avoid LastEvaluatedStreamArn
  alternator: remove dead code stream_shard_id
  alternator: fix ListStreams to return real ARN as LastEvaluatedStreamArn
2026-04-20 14:42:28 +02:00
Raphael S. Carvalho
a50e6215aa test/repair: Add tombstone GC safety tests for incremental repair
Add three cluster tests that verify no data resurrection occurs when
tombstone GC runs on the repaired sstable set under incremental repair
with tombstone_gc=repair mode.

All tests use propagation_delay_in_seconds=0 so that tombstones become
GC-eligible immediately after repair_time is committed (gc_before =
repair_time), allowing the scenarios to exercise the actual GC eligibility
path without artificial sleeps.

  (test_tombstone_gc_no_resurrection_basic_ordering)

Data D (ts=1) and tombstone T (ts=2) are written to all replicas and
flushed before repair.  Repair captures both in the repairing snapshot
and promotes them to repaired.  Once repair_time is committed, T is
GC-eligible (T.deletion_time < gc_before = repair_time).

The test verifies that compaction on the repaired set does NOT purge T,
because D is already in repaired (mark_sstable_as_repaired() completes
on all replicas before repair_time is committed to Raft) and clamps
max_purgeable to D.timestamp=1 < T.timestamp=2.
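
The purge decision in this scenario can be modelled roughly as follows. This is a simplified sketch under stated assumptions, not Scylla's compaction code: `Tombstone`, `max_purgeable`, and `can_purge` are hypothetical names, times are plain integers, and the model assumes a tombstone is purged only when it is GC-eligible and its timestamp is below the smallest timestamp of overlapping data outside the compaction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tombstone:
    timestamp: int       # write timestamp of the deletion
    deletion_time: int   # wall-clock time the deletion happened


def max_purgeable(overlapping_data_timestamps) -> float:
    """Smallest timestamp of overlapping data; unbounded if there is none."""
    return min(overlapping_data_timestamps, default=float("inf"))


def can_purge(t: Tombstone, gc_before: int, overlapping_data_timestamps) -> bool:
    gc_eligible = t.deletion_time < gc_before
    # The tombstone must be older than every piece of data it may still shadow.
    return gc_eligible and t.timestamp < max_purgeable(overlapping_data_timestamps)


T = Tombstone(timestamp=2, deletion_time=100)
repair_time = 200  # gc_before == repair_time when propagation delay is 0

print(can_purge(T, repair_time, [1]))  # False: D (ts=1) clamps max_purgeable
print(can_purge(T, repair_time, []))   # True: nothing left for T to shadow
```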

  (test_tombstone_gc_no_resurrection_hints_flush_failure)

The repair_flush_hints_batchlog_handler_bm_uninitialized injection causes
hints flush to fail on one node.  When hints flush fails, flush_time stays
at gc_clock::time_point{} (epoch).  This propagates as repair_time=epoch
committed to system.tablets, so gc_before = epoch - propagation_delay is
effectively the minimum possible time.  No tombstone has a deletion_time
older than epoch, so T is never GC-eligible from this repair.
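
The effect of an epoch repair_time on eligibility can be illustrated with a small sketch. The function names `gc_before` and `is_gc_eligible` and the sample dates are hypothetical; only the relation gc_before = repair_time - propagation_delay is taken from this message.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def gc_before(repair_time: datetime, propagation_delay: timedelta) -> datetime:
    return repair_time - propagation_delay


def is_gc_eligible(deletion_time: datetime, repair_time: datetime,
                   propagation_delay: timedelta = timedelta(0)) -> bool:
    """A tombstone is GC-eligible only if it was deleted before gc_before."""
    return deletion_time < gc_before(repair_time, propagation_delay)


deletion = datetime(2026, 4, 20, 12, 0, tzinfo=timezone.utc)
successful_repair = deletion + timedelta(minutes=5)

print(is_gc_eligible(deletion, successful_repair))  # True
print(is_gc_eligible(deletion, EPOCH))              # False: repair_time stayed at epoch
```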

The test verifies that repair_time does not advance to a meaningful value
after a failed hints flush, and that compaction on the repaired set does
not purge T (key remains deleted, no resurrection).

  (test_tombstone_gc_no_resurrection_propagation_delay)

Simulates a write D carrying an old CQL USING TIMESTAMP (ts_d = now-2h)
that was stored as a hint while a replica was down, and a tombstone T
with a higher timestamp (ts_t = now-90min, ts_t > ts_d) that was written
to all live replicas.  After the replica restarts, repair flushes hints
synchronously before taking the repairing snapshot, guaranteeing that D
is delivered and captured in the repairing set.

After mark_sstable_as_repaired() promotes D to repaired, the coordinator
commits repair_time.  gc_before = repair_time > T.deletion_time so T is
GC-eligible.  The test verifies that compaction on the repaired set does
NOT purge T: D (ts_d < ts_t) is already in repaired, clamping
max_purgeable = ts_d < ts_t = T.timestamp, so T is not purgeable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-20 09:09:39 -03:00