Compare commits

171 Commits

Author SHA1 Message Date
Anna Mikhlin
5231c77e8e Update ScyllaDB version to: 2026.2.0-rc0 2026-04-26 15:28:16 +03:00
Piotr Szymaniak
d5efd1f676 test/cluster: wait for Alternator readiness in server startup
server_add() only waits for CQL readiness before returning. The
Alternator HTTP port may not be listening yet, causing
ConnectionRefused errors in Alternator tests.

Extend the ServerUpState enum and the startup loop to also check
Alternator port readiness when configured. Whenever one or more
Alternator ports are configured, each is verified to be connectable
and queryable, similar to how the CQL ports are probed.
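As a rough illustration, such a readiness probe could look like the
following (a minimal sketch; the helper name, probe request, and error
handling are assumptions, not the actual test/cluster code):

```python
import asyncio

async def alternator_port_ready(host: str, port: int, timeout: float = 1.0) -> bool:
    """Probe whether the Alternator HTTP port is connectable and answers
    a trivial request (sketch only; the real check lives in test/cluster)."""
    try:
        reader, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), timeout)
    except (OSError, asyncio.TimeoutError):
        return False  # not listening yet; the startup loop keeps polling
    try:
        # Any well-formed HTTP request serves as a liveness probe.
        writer.write(b"GET / HTTP/1.1\r\nHost: " + host.encode() + b"\r\n\r\n")
        await writer.drain()
        status = await asyncio.wait_for(reader.readline(), timeout)
        return status.startswith(b"HTTP/1.1")
    finally:
        writer.close()
        await writer.wait_closed()
```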

Fixes SCYLLADB-1701

Closes scylladb/scylladb#29625
2026-04-25 16:35:44 +03:00
Piotr Smaron
d14d07a079 test: fix flaky test_sstable_write_large_{row,cell} by using a fixed partition key
Commit ce00d61917 ("db: implement large_data virtual tables with feature
flag gating") changed these two tests to construct their mutation with
a randomly generated partition key (simple_schema::make_pkey()) instead
of the previously fixed pk "pv", with the comment that this avoids a
"Failed to generate sharding metadata" error.

simple_schema::make_pkey() delegates to tests::generate_partition_key(),
which defaults to key_size{1, 128}, i.e. the partition key length is
uniformly random in [1, 128] bytes. That interacts badly with the fact
that both tests pick thresholds at exact byte boundaries of the MC
sstable row encoding:

  - The large-data handler records a row's size as
      _data_writer->offset() - current_pos
    (sstables/mx/writer.cc: collect_row_stats()), i.e. the number of
    bytes the row took on disk.
  - For the first clustering row, the body includes a vint-encoded
    prev_row_size = pos - _prev_row_start.
  - _prev_row_start is captured at the start of the partition
    (consume_new_partition()) before the partition key is written to
    the data stream, so prev_row_size rolls in the partition key's
    serialized length (2-byte prefix + pk bytes) + deletion_time +
    static row size.

A random-size partition key therefore perturbs the first clustering
row's encoded size by 1-2 bytes across runs (the vint of prev_row_size
crosses the 128 boundary), flipping the test's byte-exact threshold
comparison. On seed 2104744000 this produced:

  critical check row_size_count == expected.size() has failed [3 != 2]
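For intuition, here is a small worked example of the vint-width flip
(Python; the 120-byte overhead figure is illustrative only, the real
byte counts depend on the schema):

```python
def unsigned_vint_len(v: int) -> int:
    """Encoded size in bytes of an unsigned vint (Cassandra-style):
    1 byte below 2**7, 2 bytes below 2**14, and so on."""
    n = 1
    while v >= 1 << (7 * n):
        n += 1
    return n

# Suppose the fixed overhead rolled into prev_row_size (2-byte pk length
# prefix + deletion_time + static row size) is 120 bytes. A random pk of
# 1..128 bytes puts prev_row_size anywhere in [121, 248], straddling the
# 1-byte/2-byte vint boundary:
for pk_len in (7, 8):
    prev_row_size = 120 + pk_len
    print(pk_len, prev_row_size, unsigned_vint_len(prev_row_size))
# pk_len=7 -> prev_row_size=127, vint is 1 byte
# pk_len=8 -> prev_row_size=128, vint is 2 bytes
# The first clustering row's on-disk size thus shifts by a byte between
# runs, flipping the test's byte-exact threshold comparison.
```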

Fix the two byte-exact-sensitive tests by reverting their partition key
to the fixed s.new_mutation("pv") used before ce00d61917. Under smp=1
(which these tests run with, per -c1 in the test invocation) a fixed
key is always shard-local, so no sharding-metadata issue arises here.

The other tests modified by ce00d61917 (test_sstable_log_too_many_rows,
test_sstable_log_too_many_dead_rows, test_sstable_too_many_collection_elements,
test_large_data_records_round_trip, etc.) assert on row/element counts
or use thresholds with enough slack that the partition key size does
not matter, and are left unchanged.

Add an explanatory comment to each fixed site so the pitfall is not
re-introduced by a future refactor.

Verified stable with:
  ./test.py --mode=dev     test/boost/sstable_3_x_test.cc::test_sstable_write_large_row  --repeat 100 --max-failures 1
  ./test.py --mode=dev     test/boost/sstable_3_x_test.cc::test_sstable_write_large_cell --repeat 100 --max-failures 1
  ./test.py --mode=release test/boost/sstable_3_x_test.cc::test_sstable_write_large_row  --repeat 100 --max-failures 1
  ./test.py --mode=release test/boost/sstable_3_x_test.cc::test_sstable_write_large_cell --repeat 100 --max-failures 1

All four invocations: 100/100 passed.

Fixes: SCYLLADB-1685

Closes scylladb/scylladb#29621
2026-04-25 16:32:02 +03:00
Botond Dénes
70261dc674 Merge 'test/cluster: scale failure_detector_timeout_in_ms by build mode' from Marcin Maliszkiewicz
The failure_detector_timeout_in_ms override of 2000ms in 6 cluster test files is too aggressive for debug/sanitize builds. During node joins, the coordinator's failure detector times out on RPC pings to the joining node while it is still applying schema snapshots, marks it DOWN, and bans it — causing flaky test failures.

Scale the timeout by MODES_TIMEOUT_FACTOR (3x for debug/sanitize, 2x for dev, 1x for release) via a shared failure_detector_timeout fixture in conftest.py.
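A minimal sketch of what such a shared fixture could look like (the
build_mode fixture and the exact factor table are assumptions drawn
from the description above):

```python
import pytest

# 3x for debug/sanitize, 2x for dev, 1x for release (per the description)
MODES_TIMEOUT_FACTOR = {"debug": 3, "sanitize": 3, "dev": 2, "release": 1}

@pytest.fixture
def failure_detector_timeout(build_mode: str) -> int:
    """failure_detector_timeout_in_ms scaled for the current build mode."""
    return 2000 * MODES_TIMEOUT_FACTOR.get(build_mode, 1)
```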

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1587
Backport: no, the Elasticsearch analyser shows only a single failure

Closes scylladb/scylladb#29522

* github.com:scylladb/scylladb:
  test/cluster: scale failure_detector_timeout_in_ms by build mode
  test/cluster: add failure_detector_timeout fixture
2026-04-24 09:10:43 +03:00
Botond Dénes
d280517e27 test/cluster/test_incremental_repair: fix flaky do_tablet_incremental_repair_and_ops
The log grep in get_sst_status searched from the beginning of the log
(no from_mark), so the second-repair assertions were checking cumulative
counts across both repairs rather than counts for the second repair alone.

The expected values (sst_add==2, sst_mark==2) relied on this cumulative
behaviour: 1 from the first repair + 1 from the second = 2. This works
when the second repair encounters exactly one unrepaired sstable, but
fails whenever the second repair sees two.

The second repair can see two unrepaired sstables when the 100 keys
inserted before it (via asyncio.gather) trigger a background auto-flush
before take_storage_snapshot runs. take_storage_snapshot always flushes
the memtable itself, so if an auto-flush already split the batch into two
sstables on disk, the second repair's snapshot contains both and logs
"Added sst" twice, making the cumulative count 3 instead of 2.

Fix: take a log mark per-server before each repair call and pass it to
get_sst_status so each check counts only the entries produced by that
repair. The expected values become 1/0/1 and 1/1/1 respectively,
independent of how many sstables happened to exist beforehand.

get_sst_status gains an optional from_mark parameter (default None)
which preserves existing call sites that intentionally grep from the
start of the log.
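In the test framework's log API this pattern could be sketched as
follows (manager.server_open_log / mark() / grep(..., from_mark=) are
the usual test.pylib idioms; the helper itself is illustrative):

```python
async def count_sst_adds_for_one_repair(manager, servers, do_repair):
    """Count 'Added sst' log lines produced by a single repair run."""
    logs = [await manager.server_open_log(s.server_id) for s in servers]
    marks = [await log.mark() for log in logs]   # snapshot log positions
    await do_repair()
    # Grep only past the marks, so earlier repairs cannot inflate counts.
    hits = [await log.grep("Added sst", from_mark=mark)
            for log, mark in zip(logs, marks)]
    return sum(len(h) for h in hits)
```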

Fixes: SCYLLADB-1086

Closes scylladb/scylladb#29484
2026-04-23 17:17:16 +02:00
Wojciech Mitros
7634d3f7d4 test/cluster: fix flaky test_hints_consistency_during_replace
The test creates a sync point immediately after writing 100 rows
with CL=ANY, without waiting for pending hint writes to complete.

store_hint() is fire-and-forget: it submits do_store_hint() to a gate
and returns immediately. do_store_hint() updates _last_written_rp only
after writing to the commitlog. If create_sync_point() is called before
all do_store_hint() coroutines complete, the captured replay position
is stale, and await_sync_point() returns DONE before all hints are
replayed, leaving some rows missing.

Fix by waiting for the size_of_hints_in_progress metric to reach zero
before creating the sync point, ensuring all in-flight hint writes have
completed and _last_written_rp is up to date. This follows the same
pattern already used in test_sync_point.
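A sketch of the drain-then-sync-point idea (the metric-reading helper
is an assumption; wait_for retries while the predicate returns None):

```python
from test.pylib.util import wait_for  # as used by the cluster tests

async def wait_for_hints_drained(get_metric, deadline):
    """Block until size_of_hints_in_progress reaches zero on the server."""
    async def drained():
        in_progress = await get_metric("size_of_hints_in_progress")
        return True if in_progress == 0 else None  # None means retry
    await wait_for(drained, deadline)
    # Only now is _last_written_rp guaranteed up to date, so a sync point
    # created after this covers all the hinted writes.
```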

Fixes: SCYLLADB-1560

Closes scylladb/scylladb#29623
2026-04-23 17:03:48 +02:00
Botond Dénes
b49cf6247f test: fix flaky test_read_repair_with_trace_logging by reading tracing with CL=ALL
Tracing events are written to system_traces.events with CL=ANY, so they
are only guaranteed to be present on the local node of the query
coordinator. Reading them back with the driver default (CL=LOCAL_ONE)
may route the query to a replica that has not yet received all events,
causing the assertion on 'digest mismatch, starting read repair' to fail
intermittently.

Fix execute_with_tracing() to read tracing via the ResponseFuture API
with query_cl=ConsistencyLevel.ALL, so events from all replicas are
merged before the caller inspects them.
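With the python driver this looks roughly like the following
(get_all_query_traces() accepts a query_cl argument; the surrounding
helper shape is assumed):

```python
from cassandra import ConsistencyLevel

def execute_with_tracing(session, query):
    future = session.execute_async(query, trace=True)
    result = future.result()  # wait for the traced query itself
    # Read system_traces.events back at CL=ALL so events from every
    # replica are merged before the caller inspects them.
    traces = future.get_all_query_traces(query_cl=ConsistencyLevel.ALL)
    events = [event for trace in traces for event in trace.events]
    return result, events
```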

Fixes: SCYLLADB-1633

Closes scylladb/scylladb#29566
2026-04-23 16:57:29 +02:00
Michał Jadwiszczak
878f341338 test/cluster/test_view_building_coordinator: fix view_updates_drained predicate
The previous fix for the flakiness in test_file_streaming waited for
the scylla_database_view_update_backlog metric to drop to 0 via
wait_for(view_updates_drained, ...). However, the predicate returned
True/False, while wait_for treats any non-None result as 'done' and
keeps retrying only on None. So when the backlog was non-zero the
predicate returned False, which wait_for interpreted as success and
returned immediately - the test could then stop servers[0]/servers[1]
before the view updates generated by new_server from the migrated
staging sstable were actually delivered, leading to a partially
populated MV (e.g. 431/1000 rows) and a failing assertion.

Fix the predicate to return None instead of False when the backlog is
not yet drained, so wait_for will actually retry until the metric
reaches 0 (or the deadline is hit).
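In miniature, the corrected predicate contract (the backlog-reading
helper is assumed):

```python
from test.pylib.util import wait_for  # as used by the cluster tests

async def wait_until_view_updates_drained(get_backlog, deadline):
    async def view_updates_drained():
        backlog = await get_backlog()  # scylla_database_view_update_backlog
        # wait_for treats any non-None result as success and retries only
        # on None, so "not drained yet" must map to None, never False.
        return True if backlog == 0 else None
    await wait_for(view_updates_drained, deadline)
```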

Fixes SCYLLADB-1182

Closes scylladb/scylladb#29587
2026-04-23 17:52:22 +03:00
Andrei Chekun
67b3ad94a0 test.py: enhance error output in case no tests were executed
By default, pytest produces an error if a provided file does not exist.
Coupled with xdist, however, it produces no errors; this is due to how
pytest works with xdist. test.py always uses the -n parameter, so if
something goes wrong, no error is reported and only exit code 5 is
returned. This PR prints a warning when pytest's exit code is 5.
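A sketch of the check, assuming test.py drives pytest through
subprocess (exit code 5 is pytest's NO_TESTS_COLLECTED):

```python
import subprocess
import sys

def run_pytest(pytest_args: list[str]) -> int:
    """Run pytest under xdist and warn on the otherwise-silent exit code 5."""
    proc = subprocess.run(
        [sys.executable, "-m", "pytest", "-n", "auto", *pytest_args])
    if proc.returncode == 5:  # pytest.ExitCode.NO_TESTS_COLLECTED
        print("WARNING: no tests were executed; a missing or misspelled "
              "test file is swallowed by xdist and only exit code 5 remains",
              file=sys.stderr)
    return proc.returncode
```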

Closes scylladb/scylladb#29584
2026-04-23 14:03:55 +02:00
Calle Wilund
c97ce32f47 Update position in dma_read(iovec) in create_file_for_seekable_source
Fixes: SCYLLADB-1523

The returned file object does not advance the file position as-is. A
one-line fix. Added a test to make sure this read path works as expected.

Closes scylladb/scylladb#29456
2026-04-23 14:54:20 +03:00
Michael Litvak
3468e8de8b test/mv/test_mv_staging: wait for cql after restart
Wait for cql on all hosts after restarting a server in the test.

The problem that was observed is that the test restarts servers[1] and
doesn't wait for the cql to be ready on it. On test teardown it drops
the keyspace, trying to execute it on the host that is not ready, and
fails.

Fixes SCYLLADB-1632

Closes scylladb/scylladb#29562
2026-04-23 12:40:19 +02:00
Marcin Maliszkiewicz
3df951bc9c Merge 'audit: set audit_info for native-protocol BATCH messages' from Andrzej Jackowski
Commit 16b56c2451 ("Audit: avoid dynamic_cast on a hot path") moved
audit info into batch_statement via set_audit_info(), but only wired it
for the CQL-text BATCH path (raw::batch_statement::prepare()).
Native-protocol BATCH messages (opcode 0x0D), handled by
process_batch_internal in transport/server.cc, construct a
batch_statement without setting audit_info. This causes audit to
silently skip the entire batch.

Set audit_info on the batch_statement so these batches are audited.

Fixes SCYLLADB-1652

No backport - bug introduced recently.

Closes scylladb/scylladb#29570

* github.com:scylladb/scylladb:
  test/audit: add reproducer for native-protocol batch not being audited
  audit: set audit_info for native-protocol BATCH messages
  test/audit: rename internal test methods to avoid CI misdetection
2026-04-22 18:56:28 +02:00
Botond Dénes
eb3326b417 Merge 'test.py: migrate all bare skips to typed skip markers' from Artsiom Mishuta
should be merged after #29235

Complete the typed skip markers migration started in the plugin PR.
Every bare `@pytest.mark.skip` decorator and `pytest.skip()` runtime call
across the test suite is replaced with a typed equivalent, making skip
reasons machine-readable in JUnit XML and Allure reports.

**62 files changed** across 8 commits, covering ~127 skip sites in total.

Bare `pytest.skip` provides only a free-text reason string. CI dashboards
(JUnit, Allure) cannot distinguish between a test skipped due to a known
bug, a missing feature, a slow test, or an environment limitation. This
makes it hard to track skip debt, prioritize fixes, or filter dashboards
by skip category.

The typed markers (`skip_bug`, `skip_not_implemented`, `skip_slow`,
`skip_env`) introduced by the `skip_reason_plugin` solve this by embedding
a `skip_type` field into every skip report entry.

Decorator sites (commits 1-4 in the list below):

| Type | Count | Files | Description |
|------|-------|-------|-------------|
| `skip_bug` | 24 | 16 | Skip reason references a known bug/issue |
| `skip_not_implemented` | 10 | 5 | Feature not yet implemented in Scylla |
| `skip_slow` | 4 | 3 | Test too slow for regular CI runs |
| `skip_not_implemented` (bare) | 2 | 1 | Bare `@pytest.mark.skip` with no reason (COMPACT STORAGE, #3882) |

Runtime call sites (commits 5-6 in the list below):

| Type | Count | Files | Description |
|------|-------|-------|-------------|
| `skip_env` | ~85 | 34 | Feature/config/topology not available at runtime |
| `skip_bug` | 2 | 2 | Known bugs: Streams on tablets (#23838), coroutine task not found (#22501) |

- **Comments**: 7 comments/docstrings across 5 files updated from `pytest.skip()` to `skip()`
- **Plugin hardened**: `warnings.warn()` → `pytest.UsageError` for bare `@pytest.mark.skip` at collection time — bare skips are now a hard error, not a warning
- **Guard tests**: New `test/pylib_test/test_no_bare_skips.py` with 3 tests that prevent regression:
  - AST scan for bare `@pytest.mark.skip` decorators
  - AST scan for bare `pytest.skip()` runtime calls
  - Real `pytest --collect-only` against all Python test directories
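A minimal sketch of the decorator scan from the guard tests above
(paths and matching policy are assumptions; the real guard lives in
test/pylib_test/test_no_bare_skips.py):

```python
import ast
from pathlib import Path

def bare_skip_decorators(root: str) -> list[str]:
    """Find bare @pytest.mark.skip decorator sites via an AST scan."""
    offenders = []
    for path in Path(root).rglob("*.py"):
        tree = ast.parse(path.read_text(), filename=str(path))
        for node in ast.walk(tree):
            for deco in getattr(node, "decorator_list", []):
                # Matches both @pytest.mark.skip and @pytest.mark.skip(...)
                target = deco.func if isinstance(deco, ast.Call) else deco
                if (isinstance(target, ast.Attribute)
                        and target.attr == "skip"
                        and ast.unparse(target) == "pytest.mark.skip"):
                    offenders.append(f"{path}:{deco.lineno}")
    return offenders
```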

Runtime skip sites use the convenience wrappers from `test.pylib.skip_types`:
```python
from test.pylib.skip_types import skip_env
```

Usage:
```python
skip_env("Tablets not enabled")
```

1. **test: migrate @pytest.mark.skip to @pytest.mark.skip_bug for known bugs** — 24 decorator sites, 16 files
2. **test: migrate @pytest.mark.skip to @pytest.mark.skip_not_implemented** — 10 decorator sites, 5 files
3. **test: migrate @pytest.mark.skip to @pytest.mark.skip_slow** — 4 decorator sites, 3 files
4. **test: migrate bare @pytest.mark.skip to skip_not_implemented** — 2 bare decorators, 1 file
5. **test: migrate runtime pytest.skip() to typed skip_env()** — ~85 sites, 34 files
6. **test: migrate runtime pytest.skip() to typed skip_bug()** — 2 sites, 2 files
7. **test: update comments referencing pytest.skip() to skip()** — 7 comments, 5 files
8. **test/pylib: reject bare pytest.mark.skip and add codebase guards** — plugin hardening + 3 guard tests

- All 60 plugin + guard tests pass (`test/pylib_test/`)
- No bare `@pytest.mark.skip` or `pytest.skip()` calls remain in the codebase
- `pytest --collect-only` succeeds across all test directories with the hardened plugin

SCYLLADB-1349

Closes scylladb/scylladb#29305

* github.com:scylladb/scylladb:
  test/alternator: replace bare pytest.skip() with typed skip helpers
  test: migrate new bare skips introduced by upstream after rebase
  test/pylib: reject bare pytest.mark.skip and add codebase guards
  test: update comments referencing pytest.skip() to skip_env()
  test: migrate runtime pytest.skip() to typed skip_bug()
  test: migrate runtime pytest.skip() to typed skip_env()
  test: migrate bare @pytest.mark.skip to skip_not_implemented
  test: migrate @pytest.mark.skip to @pytest.mark.skip_slow
  test: migrate @pytest.mark.skip to @pytest.mark.skip_not_implemented
  test: migrate @pytest.mark.skip to @pytest.mark.skip_bug for known bugs
2026-04-22 15:48:27 +03:00
Avi Kivity
e84e7dfb7a build: drop utils/rolling_max_tracker.hh from precompiled header
Added by mistake. Precompiled headers should only include library
headers that rarely change, since any dependency change causes a
full rebuild.

Closes scylladb/scylladb#29560
2026-04-22 15:46:50 +03:00
Botond Dénes
3aced88586 Merge 'audit: decrease allocations / instructions on will_log() fast path' from Marcin Maliszkiewicz
Audit::will_log() runs on every CQL/Alternator request. Since
9646ee05bd it constructs three temporary sstrings per call to look up
the audited keyspaces set / tables map with std::string_view keys,
costing ~180 insns/op and 2 allocations if sstring misses SSO.

This series switches the containers to std::less<> comparators to
enable heterogeneous lookup, then drops the sstring temporaries from
will_log().

perf-simple-query --smp 1 --duration 15 --audit "table"
                  --audit-keyspaces "ks-non-existing"
                  --audit-categories "DCL,DDL,AUTH,DML,QUERY"

  baseline      3d0582d51e   36777 insns/op
  regression    9646ee05bd   36952 insns/op  (+175)
  this series                36768 insns/op  (-184, fixed)

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1616
Backport: no, offending commit is not backported

Closes scylladb/scylladb#29565

* github.com:scylladb/scylladb:
  audit: drop sstring temporaries on the will_log() fast path
  audit: enable heterogeneous lookup on audited keyspaces/tables
2026-04-22 15:46:16 +03:00
Marcin Maliszkiewicz
4043d95810 Merge 'storage_service: fix REST API races during shutdown and cross-shard forwarding' from Piotr Smaron
REST route removal unregisters handlers but does not wait for requests
that already entered storage_service.  A request can therefore suspend
inside an async operation, restart proceeds to tear the service down,
and the coroutine later resumes against destroyed members such as
_topology_state_machine, _group0, or _sys_ks — a use-after-destruction
bug that surfaces as UBSAN dynamic-type failures (e.g. the crash seen
from topology_state_load()).

Fix this by holding storage_service::_async_gate from the entry
boundary of every externally-triggered async operation so that stop()
drains them before teardown begins.  The gate is acquired in
run_with_api_lock, run_with_no_api_lock, and in individual REST
handlers that bypass those wrappers (reload_raft_topology_state,
mark_excluded, removenode, schema reload, topology-request
waits/abort, cleanup, ring/schema queries, SSTable dictionary
training/publish, and sampling).

Additionally, fix get_ownership() and abort_topology_request() which
forward work to shard 0 but were still referencing the caller-shard's
`this` pointer instead of the destination-shard instance, causing
silent cross-shard access to shard-local state.
Add a cluster regression test that repeatedly exercises the multi-shard
ownership REST path to cover the forwarding fix.

Fixes: SCYLLADB-1415

Should be backported to all branches, the code has been introduced around 2024.1 release.

Closes scylladb/scylladb#29373

* github.com:scylladb/scylladb:
  storage_service: fix shard-0 forwarding in REST helpers
  storage_service: gate REST-facing async operations during shutdown
  storage_service: prepare for async gate in REST handlers
2026-04-22 14:43:31 +02:00
Radosław Cybulski
cc39b54173 alternator: use stream_arn instead of std::string in list_streams
Use a `stream_arn` object to store the last stream ARN returned to the
user, instead of a raw `std::string`. `stream_arn` is already used for
parsing ARNs coming from the user; for returning them, a `std::string`
was used because of buggy copy/move operations in `stream_arn`. Those
have been fixed, so we fix the usage as well.

Fixes: SCYLLADB-1241

Closes scylladb/scylladb#29578
2026-04-22 14:02:53 +02:00
Artsiom Mishuta
183c6d120e test: exclude pylib_test from default test runs
Add pylib_test to norecursedirs in pytest.ini so it is not collected
during ./test.py or pytest test/ runs, but can still be run directly
via 'pytest test/pylib_test'.

Also fix pytest log cleanup: worker log files (pytest_gw*) were not
being deleted on success because cleanup was restricted to the main
process only. Now each process (main and workers) cleans up its own
log file on success.

Closes scylladb/scylladb#29551
2026-04-22 11:38:40 +02:00
Piotr Smaron
dffb266b79 storage_service: fix shard-0 forwarding in REST helpers
get_ownership() and abort_topology_request() forward work to shard 0
via container().invoke_on(0, ...) but the lambda captured 'this' and
accessed members through it instead of through the shard-0 'ss'
parameter.  This means the lambda used the caller-shard's instance,
defeating the purpose of the forwarding.

Use the 'ss' parameter consistently so the operations run against the
correct shard-0 state.
2026-04-22 10:30:33 +02:00
Piotr Smaron
6a91d046f3 storage_service: gate REST-facing async operations during shutdown
Hold _async_gate in all REST-facing async operations so that stop()
drains in-flight requests before teardown, preventing use-after-free
crashes when REST calls race with shutdown.

A centralized gated() wrapper in set_storage_service (api/storage_service.cc)
automatically holds the gate for every REST handler registered there,
so new handlers get shutdown-safety by default.

run_with_api_lock_internal and run_with_no_api_lock hold _async_gate on
shard 0 as well, because REST requests arriving on any shard are forwarded
there for execution.

Methods that previously self-forwarded to shard 0 (mark_excluded,
prepare_for_tablets_migration, set_node_intended_storage_mode,
get_tablets_migration_status, finalize_tablets_migration) now assert
this_shard_id() == 0.  Their REST handlers call them via
run_with_no_api_lock, which performs the shard-0 hop and gate hold
centrally.

Fixes: SCYLLADB-1415
2026-04-22 10:30:33 +02:00
Piotr Smaron
74dd33811e storage_service: prepare for async gate in REST handlers
Add hold_async_gate() public accessor for use by the REST registration
layer in a followup commit.

Convert run_with_no_api_lock to a coroutine so a followup commit can
hold the async gate across the entire forwarded operation.

No functional changes.
2026-04-22 10:28:54 +02:00
Botond Dénes
18ceeaf3ef Merge 'Restrict tombstone GC sstable set to repaired sstables for tombstone_gc=repair mode' from Raphael Raph Carvalho
When tombstone_gc=repair, the repaired compaction view's sstable_set_for_tombstone_gc()
previously returned all sstables across all three views (unrepaired, repairing, repaired).
This is correct but unnecessarily expensive: the unrepaired and repairing sets are never
the source of a GC-blocking shadow when tombstone_gc=repair, for base tables.

The key ordering guarantee that makes this safe is:
- topology_coordinator sends send_tablet_repair RPC and waits for it to complete.
  Inside that RPC, mark_sstable_as_repaired() runs on all replicas, moving D from
  repairing → repaired (repaired_at stamped on disk).
- Only after the RPC returns does the coordinator commit repair_time + sstables_repaired_at
  to Raft.
- gc_before = repair_time - propagation_delay only advances once that Raft commit applies.

Therefore, when a tombstone T in the repaired set first becomes GC-eligible (its
deletion_time < gc_before), any data D it shadows is already in the repaired set on
every replica. This holds because:
- The memtable is flushed before the repairing snapshot is taken (take_storage_snapshot
  calls sg->flush()), capturing all data present at repair time.
- Hints and batchlog are flushed before the snapshot, ensuring remotely-hinted writes
  arrive before the snapshot boundary.
- Legitimate unrepaired data has timestamps close to 'now', always newer than any
  GC-eligible tombstone (USING TIMESTAMP to write backdated data is user error / UB).

Excluding the repairing and unrepaired sets from the GC shadow check cannot cause any
tombstone to be wrongly collected. The memtable check is also skipped for the same
reason: memtable data is either newer than the GC-eligible tombstone, or was flushed
into the repairing/repaired set before gc_before advanced.

Safety restriction — materialized views:
The optimization IS applied to materialized view tables. Two possible paths could inject
D_view into the MV's unrepaired set after MV repair: view hints and staging via the
view-update-generator. Both are safe:

(1) View hints: flush_hints() creates a sync point covering BOTH _hints_manager (base
mutations) AND _hints_for_views_manager (view mutations). It waits until ALL pending view
hints — including D_view entries queued in _hints_for_views_manager while the target MV
replica was down — have been replayed to the target node before take_storage_snapshot() is
called. D_view therefore lands in the MV's repairing sstable and is promoted to repaired.
When a repaired compaction then checks for shadows it finds D_view in the repaired set,
keeping T_mv non-purgeable.

(2) View-update-generator staging path: Base table repair can write a missing D_base to a
replica via a staging sstable. The view-update-generator processes the staging sstable
ASYNCHRONOUSLY: it may fire arbitrarily later, even after MV repair has committed
repair_time and T_mv has been GC'd from the repaired set. However, the staging processor
calls stream_view_replica_updates() which performs a READ-BEFORE-WRITE via
as_mutation_source_excluding_staging(): it reads the CURRENT base table state before
building the view update. If T_base was written to the base table (as it always is before
the base replica can be repaired and the MV tombstone can become GC-eligible), the
view_update_builder sees T_base as the existing partition tombstone. D_base's row marker
(ts_d < ts_t) is expired by T_base, so the view update is a no-op: D_view is never
dispatched to the MV replica. No resurrection can occur regardless of how long staging is
delayed.

A potential sub-edge-case is T_base being purged BEFORE staging fires (leaving D_base as
the sole survivor, so stream_view_replica_updates would dispatch D_view). This is blocked
by an additional invariant: for tablet-based tables, the repair writer stamps repaired_at
on staging sstables (repair_writer_impl::create_writer sets mark_as_repaired = true and
perform_component_rewrite writes repaired_at = sstables_repaired_at + 1 on every staging
sstable). After base repair commits sstables_repaired_at to Raft, the staging sstable
satisfies is_repaired(sstables_repaired_at, staging_sst) and therefore appears in
make_repaired_sstable_set(). Any subsequent base repair that advances sstables_repaired_at
further still includes the staging sstable (its repaired_at ≤ new sstables_repaired_at).
D_base in the staging sstable thus shadows T_base in every repaired compaction's shadow
check, keeping T_base non-purgeable as long as D_base remains in staging.

A base table hint also cannot bypass this. A base hint is replayed as a base mutation. The
resulting view update is generated synchronously on the base replica and sent to the MV
replica via _hints_for_views_manager (path 1 above), not via staging.

USING TIMESTAMP with timestamps predating (gc_before + propagation_delay) is explicitly
UB and excluded from the safety argument.

For tombstone_gc modes other than repair (timeout, immediate, disabled) the invariant
does not hold for base tables either, so the full storage-group set is returned.

The expected gain is reduced bloom filter and memtable key-lookup I/O during repaired
compactions: the unrepaired set is typically the largest (it holds all recent writes),
yet for tombstone_gc=repair it never influences GC decisions.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-231.

Closes scylladb/scylladb#29310

* github.com:scylladb/scylladb:
  compaction: Restrict tombstone GC sstable set to repaired sstables for tombstone_gc=repair mode
  test/repair: Add tombstone GC safety tests for incremental repair
2026-04-22 10:21:37 +03:00
Avi Kivity
f5eb99f149 test: bump multishard_query_test querier_cache TTL to 60s to avoid flake
Three test cases in multishard_query_test.cc set the querier_cache entry
TTL to 2s and then assert, between pages of a stateful paged query, that
cached queriers are still present (population >= 1) and that
time_based_evictions stays 0.

The 2s TTL is not load-bearing for what these tests exercise — they are
checking the paging-cache handoff, not TTL semantics. But on busy CI
runners (SCYLLADB-1642 was observed on aarch64 release), scheduling
jitter between saving a reader and sampling the population can exceed
2s. When that happens, the TTL fires, both saved queriers are
time-evicted, population drops to 0, and the assertion
`require_greater_equal(saved_readers, 1u)` fails. The trailing
`require_equal(time_based_evictions, 0)` check never runs because the
earlier assertion has already aborted the iteration — which is why the
Jenkins failure surfaces only as a bare "C++ failure at seastar_test.cc:93".

Reproduced deterministically in test_read_with_partition_row_limits by
injecting a `seastar::sleep(2500ms)` between the save and the sample:
the hook then reports
  population=0 inserts=2 drops=0 time_based_evictions=2 resource_based_evictions=0
and the assertion fires — matching the Jenkins symptoms exactly.

Bump the TTL to 60s in all three affected tests:

  - test_read_with_partition_row_limits (confirmed repro for SCYLLADB-1642)
  - test_read_all                       (same pattern, same invariants — suspect)
  - test_read_all_multi_range           (same pattern, same invariants — suspect)

Leave test_abandoned_read (1s TTL, actually tests TTL-driven eviction)
and test_evict_a_shard_reader_on_each_page (tests manual eviction via
evict_one(); its TTL is not load-bearing but the fix is deferred for a
separate review) unchanged.

Fixes: SCYLLADB-1642

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Closes scylladb/scylladb#29564
2026-04-22 09:48:59 +03:00
Tomasz Grabiec
cddde464ca Merge 'service: Support adding/removing a datacenter with tablets by changing RF' from Aleksandra Martyniuk
With this change, you can add or remove one or more DCs in a single ALTER KEYSPACE statement. It requires the keyspace to use a rack-list replication factor.

In the existing approach, all tablet replicas are rebuilt at once during an RF change. This is no longer the case. In global_topology_request::keyspace_rf_change the request is added to ongoing_rf_changes, a new column in the system.topology table. The target RF is kept in next_replication, a new column in system_schema.keyspaces.

In make_rf_change_plan, the load balancer schedules the necessary migrations, considering node load and other pending tablet transitions. Requests from ongoing_rf_changes are processed concurrently and independently of one another. Within each request, racks are processed concurrently. No tablet replica is removed until all required replicas are added. While adding replicas to each rack we always start with base tables and do not proceed with views until they are done (while removing, the other way around). The intermediate steps are not reflected in the schema. When the RF change is finished:
- in system_schema.keyspaces:
  - next_replication is cleared;
  - new keyspace properties are saved;
- request is removed from ongoing_rf_changes;
- the request is marked as done in system.topology_requests.

Until the request is done, DESCRIBE KEYSPACE shows replication_v2.

If a request hasn't started removing replicas yet, it can be aborted via the task manager: system.topology_requests::error is set (but the request isn't marked as done) and next_replication = replication_v2. The load balancer interprets this and starts rolling the request back. After the rollback is done, we mark the relevant system.topology_requests entry as done (failed), clear the request id from system.topology::ongoing_rf_changes, and remove next_replication.

Fixes: SCYLLADB-567.

No backport needed; new feature.

Closes scylladb/scylladb#24421

* github.com:scylladb/scylladb:
  service: fix indentation
  docs: update documentation
  test: test multi RF changes
  service: tasks: allow aborting ongoing RF changes
  cql3: allow changing RF by more than one when adding or removing a DC
  service: handle multi_rf_change
  service: implement make_rf_change_plan
  service: add keyspace_rf_change_plan to migration_plan
  service: extend tablet_migration_info to handle rebuilds
  service: split update_node_load_on_migration
  service: rearrange keyspace_rf_change handler
  db: add columns to system_schema.keyspaces
  db: service: add ongoing_rf_changes to system.topology
  gms: add keyspace_multi_rf_change feature
2026-04-22 01:46:11 +02:00
Andrzej Jackowski
b6cb025e9b test/audit: add reproducer for native-protocol batch not being audited
The existing test_batch sends a textual BEGIN BATCH ... APPLY BATCH as a
QUERY message, which goes through the CQL parser and raw::batch_statement::
prepare() — a path that correctly sets audit_info. This missed the bug
where native-protocol BATCH messages (opcode 0x0D), handled by
process_batch_internal in transport/server.cc, construct a batch_statement
without setting audit_info, causing audit to silently skip the batch.

Add _test_batch_native_protocol which uses the driver's BatchStatement
(both unprepared and prepared variants) to exercise this code path.

Refs SCYLLADB-1652
2026-04-21 21:52:26 +02:00
Andrzej Jackowski
f5bb9b6282 audit: set audit_info for native-protocol BATCH messages
Commit 16b56c2451 ("Audit: avoid dynamic_cast on a hot path") moved
audit info into batch_statement via set_audit_info(), but only wired it
for the CQL-text BATCH path (raw::batch_statement::prepare()).
Native-protocol BATCH messages (opcode 0x0D), handled by
process_batch_internal in transport/server.cc, construct a
batch_statement without setting audit_info. This causes audit to
silently skip the entire batch.

Set audit_info on the batch_statement so these batches are audited.

Fixes SCYLLADB-1652
2026-04-21 21:52:26 +02:00
Andrzej Jackowski
5f93d57d6e test/audit: rename internal test methods to avoid CI misdetection
The CI heuristic picks up any function named test_* in changed files
and tries to run it as a standalone pytest test. The AuditTester class
methods (test_batch, test_dml, etc.) are not top-level pytest tests —
they are internal helpers called from the actual test functions.

Prefix them with underscore so CI does not mistake them for
standalone tests.
2026-04-21 21:52:26 +02:00
Dario Mirovic
cf237e060a test: auth_cluster: use safe_driver_shutdown() for Cluster teardown
A handful of cassandra-driver Cluster.shutdown() call sites in the
auth_cluster tests were missed by the previous sweep that introduced
safe_driver_shutdown(), because the local variable holding the Cluster
is named "c" rather than "cluster".

Direct Cluster.shutdown() is racy: the driver's "Task Scheduler"
thread may raise RuntimeError ("cannot schedule new futures after
shutdown") during or after the call, occasionally failing tests.
safe_driver_shutdown() suppresses this expected RuntimeError and
joins the scheduler thread.
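Per that description, the helper's behaviour can be sketched like this
(the real implementation is in test.pylib.driver_utils; this body is an
assumption):

```python
import threading

def safe_driver_shutdown(cluster) -> None:
    """Shut down a cassandra-driver Cluster, tolerating its known race."""
    try:
        cluster.shutdown()
    except RuntimeError as e:
        # The driver's "Task Scheduler" thread may still try to schedule
        # futures during shutdown; only this specific race is expected.
        if "cannot schedule new futures after shutdown" not in str(e):
            raise
    for t in threading.enumerate():
        if t.name == "Task Scheduler":
            t.join(timeout=5)  # let the scheduler thread wind down
```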

Replace the remaining c.shutdown() calls in:
  - test/cluster/auth_cluster/test_startup_response.py
  - test/cluster/auth_cluster/test_maintenance_socket.py
with safe_driver_shutdown(c) and add the corresponding import from
test.pylib.driver_utils.

No behavioral change to the tests; only the driver teardown is
hardened against a known driver-side race.

Fixes SCYLLADB-1662

Closes scylladb/scylladb#29576
2026-04-21 17:45:11 +02:00
Radosław Cybulski
6f7bf30a14 alternator: increase wait time for tablet sync
When forcing a tablet count change via a CQL command, the underlying
tablet machinery takes some time to adjust. The original code waited
at most 0.1s for tablet data to be synchronized. This turns out not to
be enough on debug builds, so add exponential backoff and increase the
maximum waiting time: the code now waits 0.1s the first time and doubles
the wait on each retry, up to a maximum of 6 tries, or ~6s in total.
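The schedule, sketched (the predicate is a stand-in;
0.1 + 0.2 + 0.4 + 0.8 + 1.6 + 3.2 = 6.3s total):

```python
import time

def wait_for_tablet_sync(is_synced, first_wait=0.1, max_tries=6) -> bool:
    """Exponential backoff: 0.1s first, doubling each retry, 6 tries max."""
    wait = first_wait
    for _ in range(max_tries):
        time.sleep(wait)
        if is_synced():
            return True
        wait *= 2
    return False
```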

Fixes: SCYLLADB-1655

Closes scylladb/scylladb#29573
2026-04-21 17:38:07 +02:00
Radosław Cybulski
74b523ea20 treewide: fix spelling errors.
Fix various spelling errors.

Closes scylladb/scylladb#29574
2026-04-21 18:20:26 +03:00
Piotr Dulikowski
cb8253067d Merge 'strong_consistency: fix crash when DROP TABLE races with in-flight DML' from Petr Gusev
When DROP TABLE races with an in-flight DML on a strongly-consistent
table, the node aborts in `groups_manager::acquire_server()` because the
raft group has already been erased from `_raft_groups`.

A concurrent `DROP TABLE` may have already removed the table from database
registries and erased the raft group via `schedule_raft_group_deletion`.
The `schema.table()` call in `create_operation_ctx()` might not fail,
though, because someone might still be holding a `lw_shared_ptr<table>`,
so the table is dropped but the table object remains alive.

Fix by accepting table_id in acquire_server and checking that the table
still exists in the database via `find_column_family` before looking up
the raft group.  If the table has been dropped, find_column_family
throws no_such_column_family instead of the node aborting via
on_internal_error.  When the table does exist, acquire_server proceeds
to acquire state.gate; schedule_raft_group_deletion co_awaits
gate::close, so it will wait for the DML operation to complete before
erasing the group.

backport: not needed (not released feature)

Fixes SCYLLADB-1450

Closes scylladb/scylladb#29430

* github.com:scylladb/scylladb:
  strong_consistency: fix crash when DROP TABLE races with in-flight DML
  test: add regression test for DROP TABLE racing with in-flight DML
2026-04-21 16:54:20 +02:00
Dario Mirovic
bcda39f716 test: audit: use set diff to identify new audit rows
assert_entries_were_added asserted that new audit rows always appear at
the tail of each per-node, event_time-sorted sequence. That invariant
is not a property of the audit feature: audit writes are asynchronous
with respect to query completion, and on a multi-node cluster QUORUM
reads of audit.audit_log can reveal a row with an older event_time
after a row with a newer one has already been observed.

Replace the positional tail slice with a per-node set difference
between the rows observed before and after the audited operation.
The wait_for retry loop, noise filtering, and final by-value
comparison against expected_entries are unchanged, so the test still
verifies the real contract, that the expected audit entries appear,
without relying on a visibility-ordering invariant that the audit log
does not guarantee.
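The set-difference step could be sketched as follows (row_key() stands
in for whatever by-value identity the test uses; names are illustrative):

```python
def new_audit_rows(before, after, row_key):
    """Per-node rows visible after the audited operation but not before,
    independent of event_time visibility ordering."""
    new = {}
    for node, rows in after.items():
        seen = {row_key(r) for r in before.get(node, [])}
        new[node] = [r for r in rows if row_key(r) not in seen]
    return new
```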

Fixes SCYLLADB-1589

Closes scylladb/scylladb#29567
2026-04-21 15:33:36 +02:00
Nadav Har'El
6165124fcc Merge 'cql3: statement_restrictions: analyze during prepare time' from Avi Kivity
The statement_restrictions code is responsible for analyzing the WHERE
clause, deciding on the query plan (which index to use), and extracting
the partition and clustering keys to use for the index.

Currently, it suffers from repetition in making its decisions: there are 15
calls to expr::visit in statement_restrictions.cc, and 14 find_binop calls. This
reduces to 2 visits (one nested in the other) and 6 find_binop calls. The analysis
of binary operators is done once, then reused.

The key data structure introduced is the predicate. While an expression
takes inputs from the row evaluated, constants, and bind variables, and
produces a boolean result, predicates ask which values for a column (or
a number of columns) are needed to satisfy (part of) the WHERE clause.
The WHERE clause is then expressed as a conjunction of such predicates.
The analyzer uses the predicates to select the index, then uses the predicates
to compute the partition and clustering keys.

The refactoring is composed of these parts (but patches from different parts
are interspersed):

1. an exhaustive regression test is added as the first commit, to ensure behavior doesn't change
2. move computation from query time to prepare time
3. introduce, gradually enrich, and use predicates to implement the statement_restrictions API

Major refactoring, and no bugs fixed, so definitely not backporting.

Closes scylladb/scylladb#29114

* github.com:scylladb/scylladb:
  cql3: statement_restrictions: replace has_eq_restriction_on_column with precomputed set
  cql3: statement_restrictions: replace multi_column_range_accumulator_builder with direct predicate iteration
  cql3: statement_restrictions: use predicate fields in build_get_clustering_bounds_fn
  cql3: statement_restrictions: remove extract_single_column_restrictions_for_column
  cql3: statement_restrictions: use predicate vectors in prepare_indexed_local
  cql3: statement_restrictions: use predicate vector size for clustering prefix length
  cql3: statement_restrictions: replace do_find_idx and is_supported_by with predicate-based versions
  cql3: statement_restrictions: remove expression-based has_supporting_index and index_supports_some_column
  cql3: statement_restrictions: replace multi-column and PK index support checks with predicate-based versions
  cql3: statement_restrictions: add predicate-based index support checking
  cql3: statement_restrictions: use pre-built single-column maps for index support checks
  cql3: statement_restrictions: build clustering-prefix restrictions incrementally
  cql3: statement_restrictions: build partition-range restrictions incrementally
  cql3: statement_restrictions: build clustering-key single-column restrictions map incrementally
  cql3: statement_restrictions: build partition-key single-column restrictions map incrementally
  cql3: statement_restrictions: build non-primary-key single-column restrictions map incrementally
  cql3: statement_restrictions: use tracked has_mc_clustering for _has_multi_column
  cql3: statement_restrictions: track has-token state incrementally
  cql3: statement_restrictions: track partition-key-empty state incrementally
  cql3: statement_restrictions: track first multi-column predicate incrementally
  cql3: statement_restrictions: track last clustering column incrementally
  cql3: statement_restrictions: track clustering-has-slice incrementally
  cql3: statement_restrictions: track has-multi-column-clustering incrementally
  cql3: statement_restrictions: track clustering-empty state incrementally
  cql3: statement_restrictions: replace restr bridge variable with pred.filter
  cql3: statement_restrictions: convert single-column branch to use predicate properties
  cql3: statement_restrictions: convert multi-column branch to use predicate properties
  cql3: statement_restrictions: convert constructor loop to iterate over predicates
  cql3: statement_restrictions: annotate predicates with operator properties
  cql3: statement_restrictions: annotate predicates with is_not_null and is_multi_column
  cql3: statement_restrictions: complete preparation early
  cql3: statement_restrictions: convert expressions to predicates without being directed at a specific column
  cql3: statement_restrictions: refine possible_lhs_values() function_call processing
  cql3: statement_restrictions: return nullptr for function solver if not token
  cql3: statement_restrictions: refine possible_lhs_values() subscript solving
  cql3: statement_restrictions: return nullptr from possible_lhs_values instead of on_internal_error
  cql3: statement_restrictions: convert possible_lhs_values into a solver
  cql3: statement_restrictions: split _where to boolean factors in preparation for predicates conversion
  cql3: statement_restrictions: refactor IS NOT NULL processing
  cql3: statement_restrictions: fold add_single_column_nonprimary_key_restriction() into its caller
  cql3: statement_restrictions: fold add_single_column_clustering_key_restriction() into its caller
  cql3: statement_restrictions: fold add_single_column_partition_key_restriction() into its caller
  cql3: statement_restrictions: fold add_token_partition_key_restriction() into its caller
  cql3: statement_restrictions: fold add_multi_column_clustering_key_restriction() into its caller
  cql3: statement_restrictions: avoid early return in add_multi_column_clustering_key_restrictions
  cql3: statement_restrictions: fold add_is_not_restriction() into its caller
  cql3: statement_restrictions: fold add_restriction() into its caller
  cql3: statement_restrictions: remove possible_partition_token_values()
  cql3: statement_restrictions: remove possible_column_values
  cql3: statement_restrictions: pass schema to possible_column_values()
  cql3: statement_restrictions: remove fallback path in solve()
  cql3: statement_restrictions: reorder possible_lhs_column parameters
  cql3: statement_restrictions: prepare solver for multi-column restrictions
  cql3: statement_restrictions: add solver for token restriction on index
  cql3: statement_restrictions: pre-analyze column in value_for()
  cql3: statement_restrictions: don't handle boolean constants in multi_column_range_accumulator_builder
  cql3: statement_restrictions: split range_from_raw_bounds into prepare phase and query phase
  cql3: statement_restrictions: adjust signature of range_from_raw_bounds
  cql3: statement_restrictions: split multi_column_range_accumulator into prepare-time and query-time phases
  cql3: statement_restrictions: make get_multi_column_clustering_bounds a builder
  cql3: statement_restrictions: multi-key clustering restrictions one layer deeper
  cql3: statement_restrictions: push multi-column post-processing into get_multi_column_clustering_bounds()
  cql3: statement_restrictions: pre-analyze single-column clustering key restrictions
  cql3: statement_restrictions: wrap value_for_index_partition_key()
  cql3: statement_restrictions: hide value_for()
  cql3: statement_restrictions: push down clustering prefix wrapper one level
  cql3: statement_restrictions: wrap functions that return clustering ranges
  cql3: statement_restrictions: do not pass view schema back and forth
  cql3: statement_restrictions: pre-analyze token range restrictions
  cql3: statement_restrictions: pre-analyze partition key columns
  cql3: statement_restrictions: do not collect subscripted partition key columns
  cql3: statement_restrictions: split _partition_range_restrictions into three cases
  cql3: statement_restrictions: move value_list, value_set to header file
  cql3: statement_restrictions: wrap get_partition_key_ranges
  cql3: statement_restrictions: prepare statement_restrictions for capturing `this`
  test: statement_restrictions: add index_selection regression test
2026-04-21 15:44:06 +03:00
Anna Stuchlik
d222e6e2a4 doc: document support for OCI Object Storage
This commit extends the object storage configuration section
with support for OCi object storage.

Fixes SCYLLADB-502

Closes scylladb/scylladb#29503
2026-04-21 15:11:58 +03:00
Botond Dénes
cfebe17592 sstables: fix segfault in parse_assert() when message is nullptr
parse_assert() accepts an optional `message` parameter that defaults
to nullptr. When the assertion fails and message is nullptr, it is
implicitly converted to sstring via the sstring(const char*) constructor,
which calls strlen(nullptr) -- undefined behavior that manifests as a
segfault in __strlen_evex.

This turns what should be a graceful malformed_sstable_exception into a
fatal crash. In the case of CUSTOMER-279, a corrupt SSTable triggered
parse_assert() during streaming (in continuous_data_consumer::
fast_forward_to()), causing a crash loop on the affected node.

Fix by guarding the nullptr case with a ternary, passing an empty
sstring() when message is null. on_parse_error() already handles
the empty-message case by substituting "parse_assert() failed".

Fixes: SCYLLADB-1329

Closes scylladb/scylladb#29285
2026-04-21 12:40:33 +02:00
Marcin Maliszkiewicz
935e6a495d Merge 'transport: add per-service-level cql_requests_serving metric' from Piotr Smaron
The existing scylla_transport_requests_serving metric is a single global per-shard gauge counting outstanding CQL requests. When debugging latency spikes, it's useful to know which service level is contributing the most in-flight requests.
This PR adds a new per-scheduling-group gauge scylla_transport_cql_requests_serving (with the scheduling_group_name label), using the existing cql_sg_stats per-SG infrastructure. The cql_ prefix is intentional — it follows the convention of all other per-SG transport metrics (cql_requests_count, cql_request_bytes, etc.) and avoids Prometheus confusion with the global requests_serving metric (which lacks the scheduling_group_name label).

Fixes: SCYLLADB-1340

New feature, no backport.

Closes scylladb/scylladb#29493

* github.com:scylladb/scylladb:
  transport: add per-service-level cql_requests_serving metric
  transport: move requests_serving decrement to after response is sent
2026-04-21 12:35:50 +02:00
Aleksandra Martyniuk
cd79b99112 test: fix flaky test_alter_tablets_rf_dc_drop by using read barrier
The test was reading system_schema.keyspaces from an arbitrary node
that may not have applied the latest schema change yet. Pin the read
to a specific node and issue a read barrier before querying, ensuring
the node has up-to-date data.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1643.

Closes scylladb/scylladb#29563
2026-04-21 09:12:51 +03:00
Raphael S. Carvalho
474e962e01 compaction: Restrict tombstone GC sstable set to repaired sstables for tombstone_gc=repair mode
When tombstone_gc=repair, the repaired compaction view's sstable_set_for_tombstone_gc()
previously returned all sstables across all three views (unrepaired, repairing, repaired).
This is correct but unnecessarily expensive: the unrepaired and repairing sets are never
the source of a GC-blocking shadow when tombstone_gc=repair, for base tables.

The key ordering guarantee that makes this safe is:
- topology_coordinator sends send_tablet_repair RPC and waits for it to complete.
  Inside that RPC, mark_sstable_as_repaired() runs on all replicas, moving D from
  repairing → repaired (repaired_at stamped on disk).
- Only after the RPC returns does the coordinator commit repair_time + sstables_repaired_at
  to Raft.
- gc_before = repair_time - propagation_delay only advances once that Raft commit applies.

Therefore, when a tombstone T in the repaired set first becomes GC-eligible (its
deletion_time < gc_before), any data D it shadows is already in the repaired set on
every replica. This holds because:
- The memtable is flushed before the repairing snapshot is taken (take_storage_snapshot
  calls sg->flush()), capturing all data present at repair time.
- Hints and batchlog are flushed before the snapshot, ensuring remotely-hinted writes
  arrive before the snapshot boundary.
- Legitimate unrepaired data has timestamps close to 'now', always newer than any
  GC-eligible tombstone (USING TIMESTAMP to write backdated data is user error / UB).

Excluding the repairing and unrepaired sets from the GC shadow check cannot cause any
tombstone to be wrongly collected. The memtable check is also skipped for the same
reason: memtable data is either newer than the GC-eligible tombstone, or was flushed
into the repairing/repaired set before gc_before advanced.

Safety restriction — materialized views:
The optimization IS applied to materialized view tables. Two possible paths could inject
D_view into the MV's unrepaired set after MV repair: view hints and staging via the
view-update-generator. Both are safe:

(1) View hints: flush_hints() creates a sync point covering BOTH _hints_manager (base
mutations) AND _hints_for_views_manager (view mutations). It waits until ALL pending view
hints — including D_view entries queued in _hints_for_views_manager while the target MV
replica was down — have been replayed to the target node before take_storage_snapshot() is
called. D_view therefore lands in the MV's repairing sstable and is promoted to repaired.
When a repaired compaction then checks for shadows it finds D_view in the repaired set,
keeping T_mv non-purgeable.

(2) View-update-generator staging path: Base table repair can write a missing D_base to a
replica via a staging sstable. The view-update-generator processes the staging sstable
ASYNCHRONOUSLY: it may fire arbitrarily later, even after MV repair has committed
repair_time and T_mv has been GC'd from the repaired set. However, the staging processor
calls stream_view_replica_updates() which performs a READ-BEFORE-WRITE via
as_mutation_source_excluding_staging(): it reads the CURRENT base table state before
building the view update. If T_base was written to the base table (as it always is before
the base replica can be repaired and the MV tombstone can become GC-eligible), the
view_update_builder sees T_base as the existing partition tombstone. D_base's row marker
(ts_d < ts_t) is expired by T_base, so the view update is a no-op: D_view is never
dispatched to the MV replica. No resurrection can occur regardless of how long staging is
delayed.

A potential sub-edge-case is T_base being purged BEFORE staging fires (leaving D_base as
the sole survivor, so stream_view_replica_updates would dispatch D_view). This is blocked
by an additional invariant: for tablet-based tables, the repair writer stamps repaired_at
on staging sstables (repair_writer_impl::create_writer sets mark_as_repaired = true and
perform_component_rewrite writes repaired_at = sstables_repaired_at + 1 on every staging
sstable). After base repair commits sstables_repaired_at to Raft, the staging sstable
satisfies is_repaired(sstables_repaired_at, staging_sst) and therefore appears in
make_repaired_sstable_set(). Any subsequent base repair that advances sstables_repaired_at
further still includes the staging sstable (its repaired_at ≤ new sstables_repaired_at).
D_base in the staging sstable thus shadows T_base in every repaired compaction's shadow
check, keeping T_base non-purgeable as long as D_base remains in staging.

A base table hint also cannot bypass this. A base hint is replayed as a base mutation. The
resulting view update is generated synchronously on the base replica and sent to the MV
replica via _hints_for_views_manager (path 1 above), not via staging.

USING TIMESTAMP with timestamps predating (gc_before + propagation_delay) is explicitly
UB and excluded from the safety argument.

For tombstone_gc modes other than repair (timeout, immediate, disabled) the invariant
does not hold for base tables either, so the full storage-group set is returned.

Implementation:
- Add compaction_group::is_repaired_view(v): pointer comparison against _repaired_view.
- Add compaction_group::make_repaired_sstable_set(): iterates _main_sstables and inserts
  only sstables classified as repaired (repair::is_repaired(sstables_repaired_at, sst)).
- Add storage_group::make_repaired_sstable_set(): collects repaired sstables across all
  compaction groups in the storage group.
- Add table::make_repaired_sstable_set_for_tombstone_gc(): collects repaired sstables from
  all compaction groups across all storage groups (needed for multi-tablet tables).
- Add compaction_group_view::skip_memtable_for_tombstone_gc(): returns true iff the
  repaired-only optimization is active; used by get_max_purgeable_timestamp() in
  compaction.cc to bypass the memtable shadow check.
- is_tombstone_gc_repaired_only() private helper gates both methods: requires
  is_repaired_view(this) && tombstone_gc_mode == repair. No is_view() exclusion.
- Add error injection "view_update_generator_pause_before_processing" in
  process_staging_sstables() to support testing the staging-delay scenario.
- New test test_tombstone_gc_mv_optimization_safe_via_hints: stops servers[2], writes
  D_base + T_base (view hints queued for servers[2]'s MV replica), restarts, runs MV
  tablet repair (flush_hints delivers D_view + T_mv before snapshot), triggers repaired
  compaction, and asserts the MV row is NOT visible — T_mv preserved because D_view
  landed in the repaired set via the hints-before-snapshot path.
- New test test_tombstone_gc_mv_safe_staging_processor_delay: runs base repair before
  writing T_base so D_base is staged on servers[0] via row-sync; blocks the
  view-update-generator with an error injection; writes T_base + T_mv; runs MV repair
  (fast path, T_mv GC-eligible); triggers repaired compaction (T_mv purged — no D_view
  in repaired set); asserts no resurrection; releases injection; waits for staging to
  complete; asserts no resurrection after a second flush+compaction. Demonstrates that
  the read-before-write in stream_view_replica_updates() makes the optimization safe even
  when staging fires after T_mv has been GC'd.

The expected gain is reduced bloom filter and memtable key-lookup I/O during repaired
compactions: the unrepaired set is typically the largest (it holds all recent writes),
yet for tombstone_gc=repair it never influences GC decisions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-20 16:59:09 -03:00
Ferenc Szili
a50aa7e689 test/cluster: wait for ready CQL in cross-rack merge test
test_tablet_merge_cross_rack_migrations() starts issuing DDL immediately
after adding the new cross-rack nodes. In the failing runs the driver is
still converging on the updated topology at that point, so the control
connection sees incomplete peer metadata while schema changes are in
flight.

That leaves a race where CREATE TABLE is sent during topology churn and
the test can surface a misleading AlreadyExists error even though the
table creation has already been committed. Use get_ready_cql(servers)
here so the test waits for inter-node visibility and CQL readiness
before creating the keyspace and table.

Fixes: SCYLLADB-1635

Closes scylladb/scylladb#29561
2026-04-20 20:12:11 +02:00
Łukasz Paszkowski
d18eb9479f cql/statement: Create keyspace_metadata with correct initial_tablets count
In `ks_prop_defs::as_ks_metadata(...)` a default initial tablets count
is set to 0 only when tablets are enabled and the replication strategy
is NetworkTopologyStrategy.

This effectively leaves _uses_tablets = false in abstract_replication_strategy
for the remaining strategies when no `tablets = {...}` options are specified.
As a consequence, it is possible to create vnode-based keyspaces even
when tablets are enforced with `tablets_mode_for_new_keyspaces`.

The patch sets the default initial tablets count to zero regardless of
the chosen replication strategy. Each replication strategy then
validates the options and raises a configuration exception when tablets
are not supported.
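
As a rough sketch of that per-strategy validation (hypothetical,
simplified signatures; the real abstract_replication_strategy API
differs):

```c++
#include <stdexcept>

// Hypothetical, pared-down option bag; the real code passes replication
// options parsed from the CQL statement.
struct replication_options {
    bool uses_tablets = false; // now populated for every strategy, not just NTS
};

// A strategy that cannot support tablets rejects them itself, instead of
// silently falling back to vnodes.
struct simple_strategy {
    void validate_options(const replication_options& opts) const {
        if (opts.uses_tablets) {
            throw std::invalid_argument(
                "SimpleStrategy does not support tablets; use "
                "NetworkTopologyStrategy or disable tablets explicitly");
        }
    }
};
```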

All tests are altered in the following way:
+ wherever it was correct, SimpleStrategy was replaced with NetworkTopologyStrategy
+ otherwise, tablets were explicitly disabled with ` AND tablets = {'enabled': false}`

Fixes https://github.com/scylladb/scylladb/issues/25340

Closes scylladb/scylladb#25342
2026-04-20 17:57:38 +03:00
Botond Dénes
69c58c6589 Merge 'streaming: add oos protection in mutation based streaming' from Łukasz Paszkowski
The mutation-fragment-based streaming path in `stream_session.cc` did not check whether the receiving node was in critical disk utilization mode before accepting incoming mutation fragments. This meant that operations like `nodetool refresh --load-and-stream`, which stream data through the `STREAM_MUTATION_FRAGMENTS` RPC handler, could push data onto a node that had already reached critical disk usage.

The file-based streaming path in stream_blob.cc already had this protection, but the load&stream path was missing it.

This patch adds a check for `is_in_critical_disk_utilization_mode()` in the `stream_mutation_fragments` handler in `stream_session.cc`, throwing a `replica::critical_disk_utilization_exception` when the node is at critical disk usage. This mirrors the existing protection in the blob streaming path and closes the gap that allowed data to be written to a node that should have been rejecting all incoming writes.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-901

The out-of-space prevention mechanism was introduced in 2025.4. The fix should be backported there and to all later versions.

Closes scylladb/scylladb#28873

* github.com:scylladb/scylladb:
  streaming: reject mutation fragments on critical disk utilization
  test/cluster/storage: Add a reproducer for load-and-stream out-of-space rejection
  sstables: clean up TemporaryHashes file in wipe()
  sstables: add error injection point in write_components
  test/cluster/storage: extract validate_data_existence to module scope
  test/cluster: enable suppress_disk_space_threshold_checks in tests using data_file_capacity
  utils/disk_space_monitor: add error injection to suppress threshold checks
2026-04-20 17:56:36 +03:00
David Garcia
16ed338a89 Fix CODEOWNERS to cover nested docs subfolders
The `docs/*` pattern only matches files directly inside `docs/`,
not files in nested subfolders like `docs/folder_b/test.md` or
`docs/alternator/setup.md`. Those files currently have no code
owner assigned.

Replace with `/docs/` and `/docs/alternator/` which match the
directories and all their subdirectories recursively, per GitHub's
CODEOWNERS syntax.

Ref: https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners

Closes scylladb/scylladb#29521
2026-04-20 17:55:43 +03:00
Avi Kivity
5687a4840d conf: pair sstable_format=ms with column_index_size_in_kb=1
One of the advantages of Trie indexes (with sstable_format=ms) is that
the index is more compact and more suitable for paging from disk
(fewer pages required per search). We can exploit this by setting
column_index_size_in_kb to 1 rather than 64, increasing the index file
size (and requiring more index pages to be loaded and parsed) in return
for smaller data file reads.

To test this, I created a 1M row partition with 300-byte rows, compacted
it into a single sstable, and tested reads to a single row.

With column_index_size_in_kb=64:

Rows.db file size 60k
3 pages read from Rows.db (4k each)
2x 32k read from Data.db

With column_index_size_in_kb=1:

Rows.db file size 2MB (33X)
5 pages read from Rows.db (4k each, 1.7X)
1x 4107 bytes read from Data.db (0.5X IOPS, 0.06X bandwidth)

Given that Rows.db will typically be cached, or at least all but one of
its levels (it is 157X smaller than Data.db), we win on both IOPS
and bandwidth.

I would have expected the Data.db read to be closer to 1k, but this
is already an improvement.

Given that, set column_index_size_in_kb=1, but only for new clusters
where we also select sstable_format=ms.

Raw data (w1, w64 are working directories with different
column_index_size_in_kb):

```console
$ ls -l w*/data/bench/wide_partition-*/*{Rows,Data}.db
-rw-r--r-- 1 avi avi 314964958 Apr 19 16:17 w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Data.db
-rw-r--r-- 1 avi avi   2001227 Apr 19 16:17 w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db
-rw-r--r-- 1 avi avi 314963261 Apr 19 16:18 w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db
-rw-r--r-- 1 avi avi     59989 Apr 19 16:18 w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db
```

column_index_size_in_kb=64 trace:

```
cqlsh> SELECT * FROM bench.wide_partition WHERE pk = 0 AND ck = 654321 BYPASS CACHE;

 pk | ck     | v
----+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  0 | 654321 | 9OXdwmDHRapL2w5YruWLTOtiC3PKbyctSDdQ8YpuPKtWkSYBF10G7bKo2rdnxSAd52HLI21568YM7OwK05B6qAF7X2b6910qsJEA106QBEcFWQVybMCkxkpO4VDRcAVNLRgjB3vygcDBP17GBTb2s7l47UOloy3KtZ7J5YQgKcf7zlFSKGHa49vnRrzoXZCdYexOpix6jcSV2SiwRNqgv6XmYhx43ZwGa4zUtOe0eIKJj7KTxu5bzyWUWGW7US4NLFZRD8Vdb6EasIFkOfVKdiFp2LZHMXGRvtvdF93UTFUb

(1 rows)

Tracing session: 19219900-3bf3-11f1-bc43-c0a4e62b53d1

 activity                                                                                                                                                                                                                 | timestamp                        | source    | source_elapsed | client
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+-----------+----------------+-----------
                                                                                                                                                                                                       Execute CQL3 query |       2026-04-19 16:24:30.992000 | 127.0.0.1 |              0 | 127.0.0.1
                                                                                                                                                                                 Parsing a statement [shard 0/sl:default] | 2026-04-19 16:24:30.992643+00:00 | 127.0.0.1 |              1 | 127.0.0.1
                                                                                                                                            Processing a statement for authenticated user: anonymous [shard 0/sl:default] | 2026-04-19 16:24:30.992738+00:00 | 127.0.0.1 |             96 | 127.0.0.1
                                                                                                                                                               Executing read query (reversed false) [shard 0/sl:default] | 2026-04-19 16:24:30.992765+00:00 | 127.0.0.1 |            123 | 127.0.0.1
                        Creating read executor for token -3485513579396041028 with all: [cf134ebd-5f1b-4844-94e3-e5c7ad9421f0] targets: [cf134ebd-5f1b-4844-94e3-e5c7ad9421f0] repair decision: NONE [shard 0/sl:default] | 2026-04-19 16:24:30.992781+00:00 | 127.0.0.1 |            139 | 127.0.0.1
                                                                           Creating never_speculating_read_executor - speculative retry is disabled or there are no extra replicas to speculate with [shard 0/sl:default] | 2026-04-19 16:24:30.992782+00:00 | 127.0.0.1 |            140 | 127.0.0.1
                                                                                                                                                                         read_data: querying locally [shard 0/sl:default] | 2026-04-19 16:24:30.992795+00:00 | 127.0.0.1 |            153 | 127.0.0.1
                                                                                                                            Start querying singular range {{-3485513579396041028, pk{000400000000}}} [shard 0/sl:default] | 2026-04-19 16:24:30.992801+00:00 | 127.0.0.1 |            160 | 127.0.0.1
                                                                                                                                      [reader concurrency semaphore sl:default] admitted immediately [shard 0/sl:default] | 2026-04-19 16:24:30.992805+00:00 | 127.0.0.1 |            163 | 127.0.0.1
                                                                                                                                            [reader concurrency semaphore sl:default] executing read [shard 0/sl:default] | 2026-04-19 16:24:30.992814+00:00 | 127.0.0.1 |            172 | 127.0.0.1
                        Reading key {-3485513579396041028, pk{000400000000}} from sstable w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db [shard 0/sl:default] | 2026-04-19 16:24:30.992837+00:00 | 127.0.0.1 |            195 | 127.0.0.1
                                         page cache miss: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Partitions.db, page=0, readahead=1 [shard 0/sl:default] | 2026-04-19 16:24:30.992851+00:00 | 127.0.0.1 |            209 | 127.0.0.1
                                              page cache miss: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db, page=14, readahead=1 [shard 0/sl:default] | 2026-04-19 16:24:30.995294+00:00 | 127.0.0.1 |           2653 | 127.0.0.1
                                                            page cache hit: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db, page=14 [shard 0/sl:default] | 2026-04-19 16:24:30.995375+00:00 | 127.0.0.1 |           2733 | 127.0.0.1
                                               page cache miss: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db, page=2, readahead=1 [shard 0/sl:default] | 2026-04-19 16:24:30.995376+00:00 | 127.0.0.1 |           2734 | 127.0.0.1
                                                            page cache hit: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db, page=14 [shard 0/sl:default] | 2026-04-19 16:24:30.995463+00:00 | 127.0.0.1 |           2821 | 127.0.0.1
                                                             page cache hit: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db, page=2 [shard 0/sl:default] | 2026-04-19 16:24:30.995463+00:00 | 127.0.0.1 |           2821 | 127.0.0.1
                              w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db: scheduling bulk DMA read of size 32768 at offset 206057984 [shard 0/sl:default] | 2026-04-19 16:24:30.995471+00:00 | 127.0.0.1 |           2829 | 127.0.0.1
                              w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db: scheduling bulk DMA read of size 32768 at offset 206090752 [shard 0/sl:default] | 2026-04-19 16:24:30.995475+00:00 | 127.0.0.1 |           2833 | 127.0.0.1
 w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db: finished bulk DMA read of size 32768 at offset 206057984, successfully read 32768 bytes [shard 0/sl:default] | 2026-04-19 16:24:30.995586+00:00 | 127.0.0.1 |           2945 | 127.0.0.1
                            Page stats: 1 partition(s) (1 live, 0 dead), 0 static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 1 cell(s) (1 live, 0 dead) [shard 0/sl:default] | 2026-04-19 16:24:30.995637+00:00 | 127.0.0.1 |           2995 | 127.0.0.1
 w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db: finished bulk DMA read of size 32768 at offset 206090752, successfully read 32768 bytes [shard 0/sl:default] | 2026-04-19 16:24:30.995645+00:00 | 127.0.0.1 |           3003 | 127.0.0.1
                                                                                                                                                                                    Querying is done [shard 0/sl:default] | 2026-04-19 16:24:30.995653+00:00 | 127.0.0.1 |           3012 | 127.0.0.1
                                                                                                                                                                Done processing - preparing a result [shard 0/sl:default] | 2026-04-19 16:24:30.995670+00:00 | 127.0.0.1 |           3028 | 127.0.0.1
                                                                                                                                                                                                         Request complete |       2026-04-19 16:24:30.995039 | 127.0.0.1 |           3039 | 127.0.0.1

```

column_index_size_in_kb=1 trace:

```
cqlsh> SELECT * FROM bench.wide_partition WHERE pk = 0 AND ck = 654321 BYPASS CACHE;

 pk | ck     | v
----+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  0 | 654321 | FIA7X52ZqYwvDxEGlmWJUSy1I94WTuWZTdLwXr9HBQ90RJLqYKr5nInTADSI6hzofwawaXphAQK07YMoyzFfRaGeKPQPKUb35XpLEGvLJ4xu9r4es8wUEHPXaFBGdMcWUkyDJSTYCFzZAPCzUHEuPJHMXVrI6UExWrIR0Xujg4GZa9UciU9rbEvrSBwSzoPEfbXJ6qZSGiTD8gcXz5kdAblLxsAeWug8tZqslsTu04HMLKfZ8WopQvHbpR6YlGSnM99CiBgz30LMmllULV4VA4u9kMpzsRV2IE2tKmJOddEl

(1 rows)

Tracing session: 3953a1f0-3bf3-11f1-b976-4a3dc2a7a57f

 activity                                                                                                                                                                                                              | timestamp                        | source    | source_elapsed | client
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+-----------+----------------+-----------
                                                                                                                                                                                                    Execute CQL3 query |       2026-04-19 16:25:25.007000 | 127.0.0.1 |              0 | 127.0.0.1
                                                                                                                                                                              Parsing a statement [shard 0/sl:default] | 2026-04-19 16:25:25.007423+00:00 | 127.0.0.1 |              1 | 127.0.0.1
                                                                                                                                         Processing a statement for authenticated user: anonymous [shard 0/sl:default] | 2026-04-19 16:25:25.007511+00:00 | 127.0.0.1 |             89 | 127.0.0.1
                                                                                                                                                            Executing read query (reversed false) [shard 0/sl:default] | 2026-04-19 16:25:25.007536+00:00 | 127.0.0.1 |            114 | 127.0.0.1
                     Creating read executor for token -3485513579396041028 with all: [e7bd75e7-6d2a-46dc-9f66-430524f40e0d] targets: [e7bd75e7-6d2a-46dc-9f66-430524f40e0d] repair decision: NONE [shard 0/sl:default] | 2026-04-19 16:25:25.007551+00:00 | 127.0.0.1 |            129 | 127.0.0.1
                                                                        Creating never_speculating_read_executor - speculative retry is disabled or there are no extra replicas to speculate with [shard 0/sl:default] | 2026-04-19 16:25:25.007553+00:00 | 127.0.0.1 |            131 | 127.0.0.1
                                                                                                                                                                      read_data: querying locally [shard 0/sl:default] | 2026-04-19 16:25:25.007556+00:00 | 127.0.0.1 |            134 | 127.0.0.1
                                                                                                                         Start querying singular range {{-3485513579396041028, pk{000400000000}}} [shard 0/sl:default] | 2026-04-19 16:25:25.007562+00:00 | 127.0.0.1 |            139 | 127.0.0.1
                                                                                                                                   [reader concurrency semaphore sl:default] admitted immediately [shard 0/sl:default] | 2026-04-19 16:25:25.007564+00:00 | 127.0.0.1 |            142 | 127.0.0.1
                                                                                                                                         [reader concurrency semaphore sl:default] executing read [shard 0/sl:default] | 2026-04-19 16:25:25.007573+00:00 | 127.0.0.1 |            151 | 127.0.0.1
                      Reading key {-3485513579396041028, pk{000400000000}} from sstable w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Data.db [shard 0/sl:default] | 2026-04-19 16:25:25.007594+00:00 | 127.0.0.1 |            172 | 127.0.0.1
                                       page cache miss: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Partitions.db, page=0, readahead=1 [shard 0/sl:default] | 2026-04-19 16:25:25.007607+00:00 | 127.0.0.1 |            184 | 127.0.0.1
                                           page cache miss: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=488, readahead=1 [shard 0/sl:default] | 2026-04-19 16:25:25.016029+00:00 | 127.0.0.1 |           8607 | 127.0.0.1
                                                         page cache hit: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=488 [shard 0/sl:default] | 2026-04-19 16:25:25.016109+00:00 | 127.0.0.1 |           8687 | 127.0.0.1
                                           page cache miss: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=486, readahead=1 [shard 0/sl:default] | 2026-04-19 16:25:25.016111+00:00 | 127.0.0.1 |           8688 | 127.0.0.1
                                           page cache miss: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=285, readahead=1 [shard 0/sl:default] | 2026-04-19 16:25:25.016176+00:00 | 127.0.0.1 |           8754 | 127.0.0.1
                                                         page cache hit: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=488 [shard 0/sl:default] | 2026-04-19 16:25:25.016260+00:00 | 127.0.0.1 |           8838 | 127.0.0.1
                                                         page cache hit: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=486 [shard 0/sl:default] | 2026-04-19 16:25:25.016261+00:00 | 127.0.0.1 |           8839 | 127.0.0.1
                                                         page cache hit: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=285 [shard 0/sl:default] | 2026-04-19 16:25:25.016261+00:00 | 127.0.0.1 |           8839 | 127.0.0.1
                             w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Data.db: scheduling bulk DMA read of size 4107 at offset 206086656 [shard 0/sl:default] | 2026-04-19 16:25:25.016268+00:00 | 127.0.0.1 |           8846 | 127.0.0.1
 w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Data.db: finished bulk DMA read of size 4107 at offset 206086656, successfully read 4608 bytes [shard 0/sl:default] | 2026-04-19 16:25:25.016340+00:00 | 127.0.0.1 |           8918 | 127.0.0.1
                         Page stats: 1 partition(s) (1 live, 0 dead), 0 static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 1 cell(s) (1 live, 0 dead) [shard 0/sl:default] | 2026-04-19 16:25:25.016367+00:00 | 127.0.0.1 |           8945 | 127.0.0.1
                                                                                                                                                                                 Querying is done [shard 0/sl:default] | 2026-04-19 16:25:25.016385+00:00 | 127.0.0.1 |           8963 | 127.0.0.1
                                                                                                                                                             Done processing - preparing a result [shard 0/sl:default] | 2026-04-19 16:25:25.016401+00:00 | 127.0.0.1 |           8979 | 127.0.0.1
                                                                                                                                                                                                      Request complete |       2026-04-19 16:25:25.015989 | 127.0.0.1 |           8989 | 127.0.0.1
```

Closes scylladb/scylladb#29552
2026-04-20 17:53:56 +03:00
Marcin Maliszkiewicz
e414b2b0b9 test/cluster: scale failure_detector_timeout_in_ms by build mode
Six cluster test files override failure_detector_timeout_in_ms to 2000ms
for faster failure detection. In debug and sanitize builds, this causes
flaky node join failures. The following log analysis shows how.

The coordinator (server 614, IP 127.2.115.3) accepts the joining node
(server 615, host_id 53b01f0b, IP 127.2.115.2) into group0:

  20:10:57,049 [shard 0] raft_group0 - server 614 entered
    'join group0' transition state for 53b01f0b

The joining node begins receiving the raft snapshot 100ms later:

  20:10:57,150 [shard 0] raft_group0 - transfer snapshot from 9fa48539

It then spends ~280ms applying schema changes -- creating 6 keyspaces
and 12+ tables from the snapshot:

  20:10:57,511 [shard 0] migration_manager - Creating keyspace
    system_auth_v2
  ...
  20:10:57,788 [shard 0] migration_manager - Creating
    system_auth_v2.role_members

Meanwhile, the coordinator's failure detector pings the joining node.
Under debug+ASan load the RPC call times out after ~4.6 seconds:

  20:11:01,643 [shard 0] direct_failure_detector - unexpected exception
    when pinging 53b01f0b: seastar::rpc::timeout_error
    (rpc call timed out)

25ms later, the coordinator marks the joining node DOWN and removes it:

  20:11:01,668 [shard 0] raft_group0 - failure_detector_loop:
    Mark node 53b01f0b as DOWN
  20:11:01,717 [shard 0] raft_group0 - bootstrap: failed to accept
    53b01f0b

The joining node was still retrying the snapshot transfer at that point:

  20:11:01,745 [shard 0] raft_group0 - transfer snapshot from 9fa48539

It then receives the ban notification and aborts:

  20:11:01,844 [shard 0] raft_group0 - received notification of being
    banned from the cluster

Replace the hardcoded 2000ms with the failure_detector_timeout fixture
from conftest.py, which scales by MODES_TIMEOUT_FACTOR: 3x for
debug/sanitize (6000ms), 2x for dev (4000ms), 1x for release (2000ms).

Test measurements (before -> after fix):

  debug mode:
  test_replace_with_same_ip_twice           24.02s ->  25.02s
  test_banned_node_notification            217.22s -> 221.72s
  test_kill_coordinator_during_op          116.11s -> 127.13s
  test_node_failure_during_tablet_migration
    [streaming-source]                     183.25s -> 192.69s
  test_replace (4 tests)        skipped in debug (skip_in_debug)
  test_raft_replace_ignore_nodes  skipped in debug (run_in_dev only)

  dev mode:
  test_replace_different_ip                 10.51s ->  11.50s
  test_replace_different_ip_using_host_id   10.01s ->  12.01s
  test_replace_reuse_ip                     10.51s ->  12.03s
  test_replace_reuse_ip_using_host_id       13.01s ->  12.01s
  test_raft_replace_ignore_nodes            19.52s ->  19.52s
2026-04-20 15:28:34 +02:00
Marcin Maliszkiewicz
99ac36b353 test/cluster: add failure_detector_timeout fixture
Add a shared pytest fixture that scales the failure detector timeout
by build mode factor (e.g. 3x for debug/sanitize, 2x for dev).
2026-04-20 15:28:33 +02:00
Marcin Maliszkiewicz
c136b2e640 audit: drop sstring temporaries on the will_log() fast path
audit::will_log() is called for every CQL/Alternator request. With a
non-empty keyspace it does:

    _audited_keyspaces.find(sstring(keyspace))
    should_log_table(sstring(keyspace), sstring(table))

constructing three temporary sstrings from the std::string_view
arguments on every call. Now that the underlying associative containers
use std::less<> as comparator (previous commit), find() accepts the
string_view directly. Switch should_log_table() to take string_view as
well so the temporaries disappear entirely.
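
A self-contained illustration of the mechanism, with std::string
standing in for sstring:

```c++
#include <map>
#include <set>
#include <string>
#include <string_view>

// std::less<> is a transparent comparator: find()/contains() accept any
// type comparable with the key -- here std::string_view -- without
// materializing a temporary key object.
using string_set = std::set<std::string, std::less<>>;
using tables_map = std::map<std::string, string_set, std::less<>>;

bool should_log_table(const tables_map& audited,
                      std::string_view keyspace, std::string_view table) {
    auto it = audited.find(keyspace);          // no std::string temporary
    return it != audited.end() && it->second.contains(table);
}
```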

For short keyspace names the temporaries stay in SSO so allocs/op is
unchanged at 58.1, but each construction still costs ~60 instructions.

perf-simple-query --smp 1 --duration 15 --audit "table"
                  --audit-keyspaces "ks-non-existing"
                  --audit-categories "DCL,DDL,AUTH,DML,QUERY"

build: --mode=release --use-profile="" (no PGO)

Before (regression introduced in 9646ee05bd):
    instructions_per_op: 36952

After:
    instructions_per_op: 36768

Brings insns/op back to the pre-regression baseline 3d0582d51e
(insns/op ~36777) within the per-run noise of ~15 insns standard
deviation, eliminating the ~180 insns/op regression.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1616
2026-04-20 15:18:22 +02:00
Marcin Maliszkiewicz
724b9e66ea audit: enable heterogeneous lookup on audited keyspaces/tables
Replace the bare std::set<sstring>/std::map<sstring, std::set<sstring>>
member types with named aliases that use std::less<> as the comparator.
The transparent comparator enables heterogeneous lookup with
string_view keys.

This commit is a pure refactor with no behavioral change: the parser
return types, constructor parameters, observer template instantiations,
and start_audit() locals are all updated to use the aliases.
2026-04-20 15:14:58 +02:00
Marcin Maliszkiewicz
9f11920b15 Merge 'alternator: fix remaining problems with new Stream ARN format' from Nadav Har'El
This small series includes a few followups to the patch that changed Alternator Stream ARNs from our own UUID format to something that resembles Amazon's Stream ARNs (which the KCL library won't reject as bogus-looking).

The first patch is the most important one, fixing ListStreams's LastEvaluatedStreamArn to also use the new ARN format. It fixes SCYLLADB-539.

The following patches are additional cleanups and tests for the new ARN code.

Closes scylladb/scylladb#29474

* github.com:scylladb/scylladb:
  alternator: fix ListStreams paging if table is deleted during paging
  test/alternator: test DescribeStream on non-existent table
  alternator: ListStreams: on last page, avoid LastEvaluatedStreamArn
  alternator: remove dead code stream_shard_id
  alternator: fix ListStreams to return real ARN as LastEvaluatedStreamArn
2026-04-20 14:42:28 +02:00
Raphael S. Carvalho
a50e6215aa test/repair: Add tombstone GC safety tests for incremental repair
Add three cluster tests that verify no data resurrection occurs when
tombstone GC runs on the repaired sstable set under incremental repair
with tombstone_gc=repair mode.

All tests use propagation_delay_in_seconds=0 so that tombstones become
GC-eligible immediately after repair_time is committed (gc_before =
repair_time), allowing the scenarios to exercise the actual GC eligibility
path without artificial sleeps.

  (test_tombstone_gc_no_resurrection_basic_ordering)

Data D (ts=1) and tombstone T (ts=2) are written to all replicas and
flushed before repair.  Repair captures both in the repairing snapshot
and promotes them to repaired.  Once repair_time is committed, T is
GC-eligible (T.deletion_time < gc_before = repair_time).

The test verifies that compaction on the repaired set does NOT purge T,
because D is already in repaired (mark_sstable_as_repaired() completes
on all replicas before repair_time is committed to Raft) and clamps
max_purgeable to D.timestamp=1 < T.timestamp=2.

  (test_tombstone_gc_no_resurrection_hints_flush_failure)

The repair_flush_hints_batchlog_handler_bm_uninitialized injection causes
hints flush to fail on one node.  When hints flush fails, flush_time stays
at gc_clock::time_point{} (epoch).  This propagates as repair_time=epoch
committed to system.tablets, so gc_before = epoch - propagation_delay is
effectively the minimum possible time.  No tombstone has a deletion_time
older than epoch, so T is never GC-eligible from this repair.

The test verifies that repair_time does not advance to a meaningful value
after a failed hints flush, and that compaction on the repaired set does
not purge T (key remains deleted, no resurrection).

  (test_tombstone_gc_no_resurrection_propagation_delay)

Simulates a write D carrying an old CQL USING TIMESTAMP (ts_d = now-2h)
that was stored as a hint while a replica was down, and a tombstone T
with a higher timestamp (ts_t = now-90min, ts_t > ts_d) that was written
to all live replicas.  After the replica restarts, repair flushes hints
synchronously before taking the repairing snapshot, guaranteeing D is
delivered and captured in repairing before the snapshot.

After mark_sstable_as_repaired() promotes D to repaired, the coordinator
commits repair_time.  gc_before = repair_time > T.deletion_time so T is
GC-eligible.  The test verifies that compaction on the repaired set does
NOT purge T: D (ts_d < ts_t) is already in repaired, clamping
max_purgeable = ts_d < ts_t = T.timestamp, so T is not purgeable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-20 09:09:39 -03:00
Wojciech Mitros
6011cb8a4c db/view: track range tombstones in update stream during view update building
The view update builder ignored range tombstone changes from the update
stream once all existing mutation fragments were already consumed.
The old code assumed range tombstones 'remove nothing pre-existing, so
we can ignore it', but this failed to update _update_current_tombstone.
Consequently, when a range delete and an insert within that range appeared
in the same batch, the range tombstone was either not applied to the
inserted row or applied to a row outside the range it covered, causing
rows to incorrectly survive or be deleted in the materialized view.

Fix by handling is_range_tombstone_change() fragments in the update-only
branch, updating _update_current_tombstone so subsequent clustering rows
correctly have the range tombstone applied to them.
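
A generic sketch of the invariant with simplified stand-in types (the
real builder consumes mutation fragment streams):

```c++
#include <cstdint>
#include <vector>

// Stand-in fragment stream: a range_tombstone_change updates the currently
// active tombstone; a clustering row survives only if it is newer.
struct tombstone { int64_t ts = INT64_MIN; };
struct fragment {
    enum class kind { clustering_row, range_tombstone_change } k;
    int64_t row_ts = 0;       // valid for clustering_row
    tombstone new_tombstone;  // valid for range_tombstone_change
};

std::vector<int64_t> surviving_rows(const std::vector<fragment>& stream) {
    tombstone current; // plays the role of _update_current_tombstone
    std::vector<int64_t> alive;
    for (const auto& f : stream) {
        if (f.k == fragment::kind::range_tombstone_change) {
            current = f.new_tombstone; // the update the old code skipped
        } else if (f.row_ts > current.ts) {
            alive.push_back(f.row_ts);
        }
    }
    return alive;
}
```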

Fixes SCYLLADB-1555

Closes scylladb/scylladb#29483
2026-04-20 13:38:52 +02:00
Wojciech Mitros
073710a661 view: apply existing range tombstones after exhausting the update reader
When view_update_builder::on_results() hits the path where the update
fragment reader is already exhausted, it still needs to keep tracking
existing range tombstones and apply them to encountered rows.
Otherwise a row covered by an existing range tombstone can appear
alive while generating the view update and create a spurious view row.

Update the existing tombstone state even on the exhausted-reader path
and apply the effective tombstone to clustering rows before generating
the row tombstone update. Add a cqlpy regression test covering the
partition-delete-after-range-tombstone case.

Fixes: SCYLLADB-1554

Closes scylladb/scylladb#29481
2026-04-20 13:29:05 +02:00
Dario Mirovic
40740104ab test: use DROP KEYSPACE IF EXISTS in new_test_keyspace cleanup
The new_test_keyspace context manager in test/cluster/util.py uses
DROP KEYSPACE without IF EXISTS during cleanup. The Python driver
has a known bug (scylladb/python-driver#317) where connection pool
renewal after concurrent node bootstraps causes double statement
execution. The DROP succeeds server-side, but the response is lost
when the old pool is closed. The driver retries on the new pool and
gets a ConfigurationException: "Cannot drop non existing keyspace".

The CREATE KEYSPACE in create_new_test_keyspace already uses IF NOT
EXISTS as a workaround for the same driver bug. This patch applies
the same approach to fix DROP KEYSPACE.

Fixes SCYLLADB-1538

Closes scylladb/scylladb#29487
2026-04-20 12:51:17 +02:00
Botond Dénes
ad7647c3c7 test/commitlog: reduce resource usage in test_commitlog_handle_replayed_segments
The test was using max_size_mb = 8*1024 (8 GB) with 100 iterations,
causing it to create up to 260 files of 32 MB each per iteration via
fallocate. On a loaded CI machine this totals hundreds of GB of file
operations, easily exceeding the 15-minute test timeout (SCYLLADB-1496).

The test only needs enough files to verify that delete_segments keeps
the disk footprint within [shard_size, shard_size + seg_size]. Reduce
max_size_mb to 128 (4 files of 32 MB per iteration) and the iteration
count to 10, which is sufficient to exercise the serialized-deletion
and recycle logic without imposing excessive I/O load.

Closes scylladb/scylladb#29510
2026-04-20 11:02:25 +03:00
Ernest Zaslavsky
e5e6608f20 sstables_loader: prevent use-after-free on table drop during streaming
sstables_loader::load_and_stream holds a replica::table& reference via
the sstable_streamer for the entire streaming operation.  If the table
is dropped concurrently (e.g. DROP TABLE or DROP KEYSPACE), the
reference becomes dangling and the next access crashes with SEGV.

This was observed in a longevity-50gb-12h-master test run where a
keyspace was dropped while load_and_stream was still streaming SSTables
from a previous batch.

Fix by acquiring a stream_in_progress() phaser guard in load_and_stream
before creating the streamer.  table::stop() calls
_pending_streams_phaser.close() which blocks until all outstanding
guards are released, keeping the table alive for the duration of the
streaming operation.
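
The guard pattern in a nutshell, as a synchronous stand-in; Seastar's
actual phaser/gate is asynchronous and per-shard, so this sketch only
conveys the lifetime guarantee:

```c++
#include <condition_variable>
#include <mutex>
#include <stdexcept>

class operation_gate {
    std::mutex _m;
    std::condition_variable _cv;
    int _in_flight = 0;
    bool _closed = false;
public:
    class guard {
        operation_gate& _g;
    public:
        explicit guard(operation_gate& g) : _g(g) {
            std::lock_guard l(g._m);
            if (g._closed) {
                throw std::runtime_error("gate closed"); // table is stopping
            }
            ++g._in_flight;
        }
        guard(const guard&) = delete;
        ~guard() {
            std::lock_guard l(_g._m);
            if (--_g._in_flight == 0) {
                _g._cv.notify_all();
            }
        }
    };
    // Like _pending_streams_phaser.close(): refuse new guards and block
    // until every outstanding guard has been released.
    void close() {
        std::unique_lock l(_m);
        _closed = true;
        _cv.wait(l, [&] { return _in_flight == 0; });
    }
};
```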

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1352

Closes scylladb/scylladb#29403
2026-04-20 07:39:51 +03:00
Benny Halevy
34adb0e069 test/cluster/dtest: fix test_scrub_static_table flakiness
Pass jvm_args=["--smp", "1"] on both cluster.start() calls to
ensure consistent shard count across restarts, avoiding resharding
on restart. Also pass wait_for_binary_proto=True to cluster.start()
to ensure the CQL port is ready before connecting.

Fixes: SCYLLADB-824

Closes scylladb/scylladb#29548
2026-04-20 06:53:49 +03:00
Avi Kivity
d584bd7358 cql3: statement_restrictions: replace has_eq_restriction_on_column with precomputed set
has_eq_restriction_on_column() walked expression trees at prepare time to
find binary_operators with op==EQ that mention a given column on the LHS.
Its only caller is ORDER BY validation in select_statement, which checks
that clustering columns without an explicit ordering have an EQ restriction.

Replace the 50-line expression-walking free function with a precomputed
unordered_set<const column_definition*> (_columns_with_eq) populated during
the main predicate loop in analyze_statement_restrictions.  For single-column
EQ predicates the column is taken from on_column; for multi-column EQ like
(ck1, ck2) = (1, 2), all columns in on_clustering_key_prefix are included.

The member function becomes a single set::contains() call.
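
Schematically, with simplified stand-in types:

```c++
#include <unordered_set>

struct column_definition {}; // stand-in

class statement_restrictions_sketch {
    std::unordered_set<const column_definition*> _columns_with_eq;
public:
    // Called from the main predicate loop when an EQ predicate on `col`
    // (or a multi-column EQ covering it) is analyzed.
    void note_eq_restriction(const column_definition& col) {
        _columns_with_eq.insert(&col);
    }
    // Was a 50-line expression walk; now a single membership test.
    bool has_eq_restriction_on_column(const column_definition& col) const {
        return _columns_with_eq.contains(&col);
    }
};
```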
2026-04-19 20:57:09 +03:00
Avi Kivity
b7f86eaabc cql3: statement_restrictions: replace multi_column_range_accumulator_builder with direct predicate iteration
build_get_multi_column_clustering_bounds_fn() used expr::visit() to dispatch
each restriction through a 15-handler visitor struct.  Only the
binary_operator handler did real work; the conjunction handler just
recursed, and the remaining 13 handlers were dead-code on_internal_error
calls (the filter expression of each predicate is always a binary_operator).

Replace the visitor with a loop over predicates that does
as<binary_operator>(pred.filter) directly, building the same query-time
lambda inline.

Promote intersect_all() and process_in_values() from static methods of
the deleted struct to free functions in the anonymous namespace -- they
are still called from the query-time lambda.
2026-04-19 20:57:09 +03:00
Avi Kivity
ece9af229d cql3: statement_restrictions: use predicate fields in build_get_clustering_bounds_fn
Replace find_binop(..., is_multi_column) with pred.is_multi_column in
build_get_clustering_bounds_fn() and add_clustering_restrictions_to_idx_ck_prefix().

Replace is_clustering_order(binop) with pred.order == comparison_order::clustering
and iterate predicates directly instead of extracting filter expressions.

Remove the now-dead is_multi_column() free function.
2026-04-19 20:57:09 +03:00
Avi Kivity
72da1207d7 cql3: statement_restrictions: remove extract_single_column_restrictions_for_column
The previous commit made prepare_indexed_local() use the pre-built
predicate vectors instead of calling extract_single_column_restrictions_for_column().
That was the last production caller.

Remove the function definition (65 lines of expression-walking visitor)
and its declaration/doc-comment from the header.

Replace the unit test (expression_extract_column_restrictions) which
directly called the removed function with synthetic column_definitions,
with per_column_restriction_routing which exercises the same routing
logic through the public analyze_statement_restrictions() API.  The new
test verifies not just factor counts but the exact (column_name, oper_t)
pairs in each per-column entry, catching misrouted restrictions that a
count-only check would miss.
2026-04-19 20:57:09 +03:00
Avi Kivity
b093477cf7 cql3: statement_restrictions: use predicate vectors in prepare_indexed_local
Replace the extract_single_column_restrictions_for_column(_where, ...) call
in prepare_indexed_local() with a direct lookup in the pre-built predicate
vectors.

The old code walked the entire WHERE expression tree to extract binary
operators mentioning the indexed column, wrapped them in a conjunction,
translated column definitions to the index schema, then called
to_predicate_on_column() which walked the expression *again* to convert
back to predicates.

The new code selects the appropriate predicate vector map (PK, CK, or
non-PK) based on the indexed column's kind, looks up the column's
predicates directly, applies replace_column_def to each, and folds them
with make_conjunction -- producing the same result without any expression
tree walks.

This removes the last production caller of
extract_single_column_restrictions_for_column (unit tests in
statement_restrictions_test.cc still exercise it).
2026-04-19 20:57:09 +03:00
Avi Kivity
a725e39218 cql3: statement_restrictions: use predicate vector size for clustering prefix length
Replace the body of num_clustering_prefix_columns_that_need_not_be_filtered()
with a single return of _clustering_prefix_restrictions.size().

The old implementation called get_single_column_restrictions_map() to rebuild
a per-column map from the clustering expression tree, then iterated it in
schema order counting columns until it hit a gap, a needs-filtering predicate,
or a slice.  But _clustering_prefix_restrictions is already built with exactly
that same logic during the constructor (lines 1234-1248): it iterates CK
columns in schema order, appending predicates until it encounters a gap in
column_id, a predicate that needs_filtering, or a slice -- at which point it
stops.  So the vector's size is, by construction, the answer to the same
question the old code was re-deriving at query time.

This makes four helper functions dead code:

- get_single_column_restrictions_map(): walked the expression tree to build
  a map<column_definition*, expression> of per-column restrictions.  Was a
  ~15-line function that called get_sorted_column_defs() and
  extract_single_column_restrictions_for_column() for each column.

- get_the_only_column(): extracted the single column_value from a restriction
  expression, asserting it was single-column.  Called by the old loop body.

- is_single_column_restriction(): thin wrapper around
  get_single_column_restriction_column().

- get_single_column_restriction_column(): ~25-line function that walked an
  expression tree with for_each_expression<column_value> to determine whether
  all column_value nodes refer to the same column.  Called by the above two.

Remove all four functions and their forward declarations (-95 lines).
2026-04-19 20:57:08 +03:00
Avi Kivity
68c2e292ac cql3: statement_restrictions: replace do_find_idx and is_supported_by with predicate-based versions
Convert do_find_idx() from a member function that walks expression trees
via index_restrictions()/for_each_expression/extract_single_column_restrictions
to a static free function that iterates index_search_group spans using
are_predicates_supported_by().

Convert calculate_column_defs_for_filtering_and_erase_restrictions_used_for_index()
to use predicate vectors instead of expression-based is_supported_by().

Remove now-dead code: is_supported_by(), is_supported_by_helper(), score()
member function, and do_find_idx() member function.
2026-04-19 20:57:08 +03:00
Avi Kivity
c42397e995 cql3: statement_restrictions: remove expression-based has_supporting_index and index_supports_some_column
These functions are no longer called now that all index support checks
in the constructor use predicate-based alternatives. The expression-based
is_supported_by and is_supported_by_helper are still needed by choose_idx()
and calculate_column_defs_for_filtering_and_erase_restrictions_used_for_index().
2026-04-19 20:57:08 +03:00
Avi Kivity
1aafe0708a cql3: statement_restrictions: replace multi-column and PK index support checks with predicate-based versions
Replace clustering_columns_restrictions_have_supporting_index(),
multi_column_clustering_restrictions_are_supported_by(),
get_clustering_slice(), and partition_key_restrictions_have_supporting_index()
with predicate-based equivalents that use the already-accumulated mc_ck_preds
and sc_pk_pred_vectors locals.

The new multi_column_predicates_have_supporting_index() checks each
multi-column predicate's columns list directly against indexes, avoiding
expression tree walks through find_in_expression and bounds_slice.
2026-04-19 20:57:08 +03:00
Avi Kivity
fa6f239cc7 cql3: statement_restrictions: add predicate-based index support checking
Add `op` and `is_subscript` fields to `struct predicate` and populate them
in all predicate creation sites in `to_predicates()`. These fields record the
binary operator and whether the LHS is a subscript (map element access), which
are the two pieces of information needed to query index support.

Add `is_predicate_supported_by()` which mirrors `is_supported_by_helper()`
but operates on a single predicate's fields instead of walking the expression
tree.

Add a predicate-vector overload of `index_supports_some_column()` and use it
in the constructor to replace expression-based index support checks for
single-column partition key, clustering key, and non-primary-key restrictions.
The multi-column clustering key case still uses the existing expression-based
path.
2026-04-19 20:57:08 +03:00
Avi Kivity
25ba3bd649 cql3: statement_restrictions: use pre-built single-column maps for index support checks
Replace index_supports_some_column(expression, ...) with
index_supports_some_column(single_column_restrictions_map, ...) to
eliminate get_single_column_restrictions_map() tree walks when checking
index support.  The three call sites now use the maps already built
incrementally in the constructor loop:
_single_column_nonprimary_key_restrictions,
_single_column_clustering_key_restrictions, and
_single_column_partition_key_restrictions.

Also replace contains_multi_column_restriction() tree walk in
clustering_columns_restrictions_have_supporting_index() with
_has_multi_column.
2026-04-19 20:57:08 +03:00
Avi Kivity
fab90224b3 cql3: statement_restrictions: build clustering-prefix restrictions incrementally
Replace the extract_clustering_prefix_restrictions() tree walk with
incremental collection during the main loop.  Two new locals --
mc_ck_preds and sc_ck_preds -- accumulate multi-column and single-column
clustering key predicates respectively.  A short post-loop block
computes the longest contiguous prefix from sc_ck_preds (or uses
mc_ck_preds directly for multi-column), replacing the removed function.

Also remove the now-unused to_predicate_on_clustering_key_prefix(),
with_current_binary_operator() helper, and the
visitor_with_binary_operator_context concept.
2026-04-19 20:57:08 +03:00
Avi Kivity
3bd308986a cql3: statement_restrictions: build partition-range restrictions incrementally
Replace the extract_partition_range() tree walk with incremental
collection during the main loop.  Two new locals before the loop --
token_pred and pk_range_preds -- accumulate token and single-column
EQ/IN partition key predicates respectively.  A short post-loop block
materializes _partition_range_restrictions from these locals, replacing
the removed function.

This removes the last tree walk over partition-key restrictions.
2026-04-19 20:57:08 +03:00
Avi Kivity
db28411548 cql3: statement_restrictions: build clustering-key single-column restrictions map incrementally
Instead of accumulating all clustering-key restrictions into a
conjunction tree and then decomposing it by column via
get_single_column_restrictions_map() post-loop, build the
per-column map incrementally as each single-column clustering-key
predicate is processed.

The post-loop guard (!has_mc_clustering) is no longer needed:
multi-column predicates go through the is_multi_column branch
and never insert into this map, and mixing multi with single-column
is rejected with an exception.

This eliminates a post-loop tree walk over
_clustering_columns_restrictions.
2026-04-19 20:57:08 +03:00
Avi Kivity
a4608804d8 cql3: statement_restrictions: build partition-key single-column restrictions map incrementally
Instead of accumulating all partition-key restrictions into a
conjunction tree and then decomposing it by column via
get_single_column_restrictions_map() post-loop, build the
per-column map incrementally as each single-column partition-key
predicate is processed.

The post-loop guard (!has_token_restrictions()) is no longer needed:
token predicates go through the on_partition_key_token branch and
never insert into this map, and mixing token with non-token is
rejected with an exception.

This eliminates a post-loop tree walk over
_partition_key_restrictions.
2026-04-19 20:57:08 +03:00
Avi Kivity
e9b16a11ba cql3: statement_restrictions: build non-primary-key single-column restrictions map incrementally
Instead of accumulating all non-primary-key restrictions into a
conjunction tree and then decomposing it by column via
get_single_column_restrictions_map() post-loop, build the
per-column map incrementally as each non-primary-key predicate
is processed.

This eliminates a post-loop tree walk over _nonprimary_key_restrictions.
2026-04-19 20:57:08 +03:00
Avi Kivity
701366a8d1 cql3: statement_restrictions: use tracked has_mc_clustering for _has_multi_column
Replace the two post-loop find_binop(_clustering_columns_restrictions,
is_multi_column) tree walks and the contains_multi_column_restriction()
tree walk with the already-tracked local has_mc_clustering.

The redundant second assignment inside the _check_indexes block is
removed entirely.
2026-04-19 20:57:08 +03:00
Avi Kivity
da438507d0 cql3: statement_restrictions: track has-token state incrementally
Replace the two in-loop calls to has_token_restrictions() (which
walks the _partition_key_restrictions expression tree looking for
token function calls) with a local bool has_token, set to true
when a token predicate is processed.

The member function is retained since it's used outside the
constructor.

With this change, the constructor loop's non-error control flow
performs zero expression tree scanning.  The only remaining tree
walks are on error paths (get_sorted_column_defs,
get_columns_in_commons for formatting exception messages) and
structural (make_conjunction for building accumulated expressions).
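
The shape of the incremental-tracking pattern used throughout this
series, sketched with hypothetical pared-down predicate fields:

```c++
#include <stdexcept>
#include <vector>

struct predicate {          // hypothetical, pared-down fields
    bool on_token = false;  // restriction on token(pk)
    bool on_column = false; // restriction on a plain partition-key column
};

void analyze(const std::vector<predicate>& predicates) {
    bool has_token = false; // replaces the has_token_restrictions() walk
    bool has_pk_column = false;
    for (const auto& pred : predicates) {
        has_token |= pred.on_token;
        has_pk_column |= pred.on_column;
        if (has_token && has_pk_column) {
            // Rejected immediately; no re-scan of the accumulated
            // expression tree is ever needed inside the loop.
            throw std::invalid_argument(
                "cannot mix token and column partition-key restrictions");
        }
    }
}
```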
2026-04-19 20:57:07 +03:00
Avi Kivity
1344278a19 cql3: statement_restrictions: track partition-key-empty state incrementally
Replace the in-loop call to partition_key_restrictions_is_empty()
(which walks the _partition_key_restrictions expression tree via
is_empty_restriction()) with a local bool pk_is_empty, set to false
at the two sites where partition key restrictions are added.

The member function is retained since it's used outside the
constructor.
2026-04-19 20:57:07 +03:00
Avi Kivity
14812ea1e0 cql3: statement_restrictions: track first multi-column predicate incrementally
Replace find_in_expression<binary_operator>(_clustering_columns_restrictions,
always_true), which walks the accumulated expression tree to find the
first binary_operator, with a tracked pointer first_mc_pred set when
the first multi-column predicate is added. This eliminates the tree
scan, the null check, and the is_lower_bound/is_upper_bound lambdas,
replacing them with direct predicate field accesses: first_mc_pred->order,
first_mc_pred->is_lower_bound, first_mc_pred->is_upper_bound, and
first_mc_pred->filter for error messages.
2026-04-19 20:57:07 +03:00
Avi Kivity
ef005c10ba cql3: statement_restrictions: track last clustering column incrementally
Replace get_last_column_def(_clustering_columns_restrictions), which
walks the entire accumulated expression tree to collect and sort all
column definitions, with a local pointer ck_last_column that tracks
the column with the highest schema position as single-column
clustering restrictions are added.
2026-04-19 20:57:07 +03:00
Avi Kivity
88bd5ea1b7 cql3: statement_restrictions: track clustering-has-slice incrementally
Replace has_slice(_clustering_columns_restrictions), which walks the
accumulated expression tree looking for slice operators, with a local
bool ck_has_slice set when any clustering predicate with is_slice is
added. Updated at all three clustering insertion points: multi-column
first assignment, multi-column slice conjunction, and single-column
conjunction.
2026-04-19 20:57:07 +03:00
Avi Kivity
1071c39f17 cql3: statement_restrictions: track has-multi-column-clustering incrementally
Replace find_binop(_clustering_columns_restrictions, is_tuple_constructor),
which walks the accumulated expression tree looking for multi-column
restrictions, with a local bool has_mc_clustering set when a multi-column
predicate is first added. This serves both the multi-column branch
(checking existing restrictions are also multi-column) and the
single-column branch (checking no multi-column restrictions exist).
2026-04-19 20:57:07 +03:00
Avi Kivity
aa6a0ad326 cql3: statement_restrictions: track clustering-empty state incrementally
Replace is_empty_restriction(_clustering_columns_restrictions), which
recursively walks the accumulated expression tree, with a local bool
ck_is_empty that is set to false when a clustering restriction is
first added. Updated at both insertion points: multi-column first
assignment and single-column make_conjunction.
2026-04-19 20:57:07 +03:00
Avi Kivity
d4ff613c0a cql3: statement_restrictions: replace restr bridge variable with pred.filter
The constructor loop no longer needs to extract a binary_operator
reference from each predicate. All remaining uses (make_conjunction,
get_columns_in_commons, assignment to accumulated restriction members,
_where.push_back, and error formatting) accept expression directly,
which is what pred.filter already is. This eliminates the unnecessary
as<binary_operator> cast at the top of the loop.
2026-04-19 20:57:07 +03:00
Avi Kivity
44b18f3399 cql3: statement_restrictions: convert single-column branch to use predicate properties
In the single-column partition-key and clustering-key sub-branches,
replace direct binary_operator field inspections with pre-computed
predicate booleans: !pred.equality && !pred.is_in instead of
restr.op != EQ && restr.op != IN, pred.is_in instead of
find(restr, IN), and pred.is_slice instead of has_slice(restr).
Also fix a leftover restr.order in the multi-column branch error
message.
2026-04-19 20:57:07 +03:00
Avi Kivity
b0c5eed384 cql3: statement_restrictions: convert multi-column branch to use predicate properties
Replace direct operator comparisons with predicate boolean fields:
pred.equality, pred.is_in, pred.is_slice, pred.is_lower_bound,
pred.is_upper_bound, and pred.order.
2026-04-19 20:57:07 +03:00
Avi Kivity
afd68187ea cql3: statement_restrictions: convert constructor loop to iterate over predicates
Convert the constructor loop to first build predicates from the
prepared where clause, then iterate over the predicates.

The IS_NOT branch now uses pred.is_not_null_single_column and pred.on
instead of inspecting the expression directly. The branch conditions
for multi-column (pred.is_multi_column), token
(on_partition_key_token), and single-column (on_column) now use
predicate properties instead of expression helpers.

Remove extract_column_from_is_not_null_restriction() which is no
longer needed.
2026-04-19 20:57:07 +03:00
Avi Kivity
440d9f2d82 cql3: statement_restrictions: annotate predicates with operator properties
Add boolean fields to struct predicate that describe the operator:
equality, is_in, is_slice, is_upper_bound, is_lower_bound, and
comparison_order. Populate them in all to_predicates() return sites.

These fields will allow the constructor loop to inspect predicate
properties directly instead of re-examining the expression.
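
Roughly, with stand-in types (field names follow the commit text):

```c++
#include <string>

struct expression { std::string repr; };      // stand-in for cql3 expression
enum class comparison_order { none, clustering };

struct predicate {
    expression filter;           // the boolean factor from the WHERE clause
    bool equality = false;       // op is EQ
    bool is_in = false;          // op is IN
    bool is_slice = false;       // op is <, <=, > or >=
    bool is_upper_bound = false; // op is < or <=
    bool is_lower_bound = false; // op is > or >=
    comparison_order order = comparison_order::none;
};
```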
2026-04-19 20:57:07 +03:00
Avi Kivity
e0eb3bde8d cql3: statement_restrictions: annotate predicates with is_not_null and is_multi_column
To avoid having to dig deep into the expression, compute is_not_null
and is_multi_column early and store them in the predicate.
2026-04-19 20:57:06 +03:00
Avi Kivity
6892642176 cql3: statement_restrictions: complete preparation early
We want to move away from the unprepared domain to the prepared
domain to avoid confusion. Ideally we'd receive prepared expressions
via the constructor, but that is left for later.
2026-04-19 20:57:06 +03:00
Avi Kivity
ed5dd645e8 cql3: statement_restrictions: convert expressions to predicates without being directed at a specific column
Currently, possible_lhs_values accepts a column_definition parameter
that tells it which column we are interested in. This works
because callers pre-analyze the expression and only pass a
subexpression that contains the specified columns.

We wish to convert expressions to predicates early, and so won't
have the benefit of knowing which columns we're interested in.

Generally, this is simple: a binary operator contains a column on the
left-hand side, so use that. If the expression is on a token, use that.

When the expression is a boolean constant (not expressible by the
grammar, but it somehow found its way into the code), we invent a new
`on_row` designator meaning it's not about a specific column.
It will be useful one day when we allow things like
`WHERE some_boolean_function(c1, c2)` that aren't specific to any
single column.

Finally, we introduce helpers that, given such an expression decomposed
into predicates and a column_definition, extract the predicate related
to the given column. This mimics the possible_lhs_values API and allows
us to make minimal changes to callers, deferring that until later.

possible_lhs_values() is renamed to to_predicates() and loses the
column_definition parameter to indicate its new role.
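
A minimal stand-alone sketch of such an extraction helper, assuming
invented names for everything except the on_row designator:

  #include <variant>
  #include <vector>

  struct column_definition {};
  struct on_row {};   // "not about a specific column"

  struct predicate {
      std::variant<const column_definition*, on_row> on;
  };

  // Mimics the old possible_lhs_values(column, ...) contract on top
  // of the column-agnostic decomposition.
  const predicate* predicate_for(const std::vector<predicate>& preds,
                                 const column_definition& col) {
      for (const auto& p : preds) {
          auto c = std::get_if<const column_definition*>(&p.on);
          if (c && *c == &col) {
              return &p;
          }
      }
      return nullptr;
  }

  int main() {
      column_definition a, b;
      std::vector<predicate> preds{{&a}, {on_row{}}};
      return (predicate_for(preds, a) && !predicate_for(preds, b)) ? 0 : 1;
  }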
2026-04-19 20:57:06 +03:00
Avi Kivity
bfd1302311 cql3: statement_restrictions: refine possible_lhs_values() function_call processing
Currently, we are careful to call possible_lhs_values() for a token
function only when slice/equality operators are used. We wish to relax
this, so return nullptr (must filter) for the other cases instead of
raising an internal error.
2026-04-19 20:57:06 +03:00
Avi Kivity
736011b663 cql3: statement_restrictions: return nullptr for function solver if not token
Currently, possible_lhs_values() for a function call expression will
only be called when we're sure it's the token() function. But soon this
will no longer be the case. Return nullptr for non-token functions to
indicate we can't solve for a column value instead of an internal
error.
2026-04-19 20:57:06 +03:00
Avi Kivity
8faf62a1aa cql3: statement_restrictions: refine possible_lhs_values() subscript solving
Do more work at prepare time.
2026-04-19 20:57:06 +03:00
Avi Kivity
a28689a99a cql3: statement_restrictions: return nullptr from possible_lhs_values instead of on_internal_error
We're a first-resort call now, and there's a last-resort (evaluate).

Logically this should be part of the previous patch, but the rest of
the code is still careful enough not to call here when not expecting a
solution, so the split does not break bisectability.
2026-04-19 20:57:06 +03:00
Avi Kivity
370f3fd2e8 cql3: statement_restrictions: convert possible_lhs_values into a solver
Convert from an execute-time function to a prepare-time function
by returning a solver function instead of directly solving.

When not possible to solve, but still possible to evaluate (filter),
return nullptr.
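
The shape of the result, as a minimal stand-alone sketch (all names
are invented; only the solver-or-nullptr convention is the point):

  #include <cassert>
  #include <functional>
  #include <vector>

  struct query_options {};
  using value_set = std::vector<int>;
  using solver = std::function<value_set(const query_options&)>;

  solver to_solver(bool solvable) {
      if (!solvable) {
          return nullptr;   // can't solve; caller falls back to filtering
      }
      return [](const query_options&) { return value_set{42}; };
  }

  int main() {
      query_options opts;
      auto s = to_solver(true);
      assert(s && s(opts) == value_set{42});   // prepare once, solve per query
      assert(!to_solver(false));               // filter path
  }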
2026-04-19 20:57:06 +03:00
Avi Kivity
92a43557dc cql3: statement_restrictions: split _where to boolean factors in preparation for predicates conversion
Expressions are a tree-like structure, so a single expression is sufficient
(for complicated ones, a conjunction is used), but predicates are flat.
Prepare for conversion to predicates by storing the expressions that
will correspond to predicates, namely the boolean factors of the WHERE
clause.
2026-04-19 20:57:06 +03:00
Avi Kivity
694c1aed98 cql3: statement_restrictions: refactor IS NOT NULL processing
Move some code to a helper, but don't let it mutate state.
2026-04-19 20:57:06 +03:00
Avi Kivity
35f14544dc cql3: statement_restrictions: fold add_single_column_nonprimary_key_restriction() into its caller
The goal is to simplify flow-control where the order in which
variables are updated depends on their location in the source.
With functions, this is difficult.
2026-04-19 20:57:06 +03:00
Avi Kivity
1965741914 cql3: statement_restrictions: fold add_single_column_clustering_key_restriction() into its caller
The goal is to simplify flow-control where the order in which
variables are updated depends on their location in the source.
With functions, this is difficult.
2026-04-19 20:57:06 +03:00
Avi Kivity
1d631f7bac cql3: statement_restrictions: fold add_single_column_partition_key_restriction() into its caller
The goal is to simplify flow-control where the order in which
variables are updated depends on their location in the source.
With functions, this is difficult.
2026-04-19 20:57:05 +03:00
Avi Kivity
24cd98e454 cql3: statement_restrictions: fold add_token_partition_key_restriction() into its caller
The goal is to simplify flow-control where the order in which
variables are updated depends on their location in the source.
With functions, this is difficult.
2026-04-19 20:57:05 +03:00
Avi Kivity
be3239fc58 cql3: statement_restrictions: fold add_multi_column_clustering_key_restriction() into its caller
The goal is to simplify flow-control where the order in which
variables are updated depends on their location in the source.
With functions, this is difficult.
2026-04-19 20:57:05 +03:00
Avi Kivity
8990346c75 cql3: statement_restrictions: avoid early return in add_multi_column_clustering_key_restrictions
Prepare for inlining it into its caller, which doesn't work easily if there's
an early return.
2026-04-19 20:57:05 +03:00
Avi Kivity
fa130051a6 cql3: statement_restrictions: fold add_is_not_restriction() into its caller
The goal is to simplify flow-control where the order in which
variables are updated depends on their location in the source.
With functions, this is difficult.
2026-04-19 20:57:05 +03:00
Avi Kivity
63f9362c89 cql3: statement_restrictions: fold add_restriction() into its caller
The goal is to simplify flow-control where the order in which
variables are updated depends on their location in the source.
With functions, this is difficult.
2026-04-19 20:57:05 +03:00
Avi Kivity
9cbb1b851e cql3: statement_restrictions: remove possible_partition_token_values()
It's just a call to possible_lhs_values() with a different signature.

Now possible_lhs_values() is our only solver.
2026-04-19 20:57:05 +03:00
Avi Kivity
c1fc596203 cql3: statement_restrictions: remove possible_column_values
Replace with the now-identical possible_lhs_values. This paves the way
to having only one solver function (after we remove
possible_partition_token_values).
2026-04-19 20:57:05 +03:00
Avi Kivity
b26e6f7330 cql3: statement_restrictions: pass schema to possible_column_values()
This unifies the signature with possible_lhs_values(), paving the way
to deduplicating the two functions. We always have the schema and may as
well pass it.
2026-04-19 20:57:05 +03:00
Avi Kivity
c6f6e81fe5 cql3: statement_restrictions: remove fallback path in solve()
All query plans that try to solve for the possible values a column
(or token, or column-tuple) can take have been converted to set
analyzed_column::solve_for. Recognize that by removing the
fallback path.

This removes the last possible_column_values() call that isn't bound
(using std::bind_front), and will allow moving it to prepare time.
2026-04-19 20:57:05 +03:00
Avi Kivity
e0445269e5 cql3: statement_restrictions: reorder possible_lhs_column parameters
By moving query_options to the end, we can use std::bind_front to
bind everything else at prepare time, leaving a run-time function that
depends only on the query_options.
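
A stand-alone illustration of the std::bind_front mechanics (the
function and types below are invented stand-ins):

  #include <cassert>
  #include <functional>

  struct query_options { int page; };

  static int possible_lhs_column(int schema, int column, const query_options& opts) {
      return schema + column + opts.page;
  }

  int main() {
      // Everything but query_options is bound at prepare time...
      auto run_time_fn = std::bind_front(possible_lhs_column, 1, 2);
      // ...leaving a function of the query_options alone at run time.
      assert(run_time_fn(query_options{3}) == 6);
  }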
2026-04-19 20:57:05 +03:00
Avi Kivity
e42ad62561 cql3: statement_restrictions: prepare solver for multi-column restrictions
Multi-column restrictions (a, b) > (:v1, :v2) do not obey normal
comparison rules. For example, given

 (a, b) > (5, 1) AND a <= 5

We see that (a, b) = (5, 2) satisfies the constraint, but if we tried
to solve for the interval

 ( (5, 1), (5) ]

we'd have to conclude that (5, 1) <= (5).

It's possible to extend the CQL type system to support this, but that
would be a lot of work, and in fact the current code doesn't depend on
it (it solves these intersections in its own code path,
multi_column_range_accumulator_builder's prefix3cmp).

So, we just mark such solvers as non-comparable, and generate an
internal error if we try to compare them in make_conjunction.
2026-04-19 20:57:05 +03:00
Avi Kivity
96e8414963 cql3: statement_restrictions: add solver for token restriction on index
possible_column_values() knows how to find the values that the token can
take, so add a solve_for implementation for tokens.
2026-04-19 20:57:04 +03:00
Avi Kivity
135809d97b cql3: statement_restrictions: pre-analyze column in value_for()
Since we pre-analyze the column, return a built function, and remove
the corresponding lambda from the caller.
2026-04-19 20:57:04 +03:00
Avi Kivity
0a16d90acb cql3: statement_restrictions: don't handle boolean constants in multi_column_range_accumulator_builder
In statement_restrictions' constructor, we check that all the boolean factors
are relations. This means the code to handle a constant here is dead code.

Remove it; while it's good to handle it, it should be handled at the top
level, not in multi-column restriction processing.
2026-04-19 20:57:04 +03:00
Avi Kivity
56ae02d8a3 cql3: statement_restrictions: split range_from_raw_bounds into prepare phase and query phase
range_from_raw_bounds processes restrictions of the form

   (a, b) > SCYLLA_CLUSTERING_BOUND(?, ?)

indicating that comparisons respect whether columns are reversed or not.

Iterate over expressions during the prepare phase only, generating
"builder" functions to be executed during the query phase.
2026-04-19 20:57:04 +03:00
Avi Kivity
2c75123bbd cql3: statement_restrictions: adjust signature of range_from_raw_bounds
The get_clustering_bounds() family works in terms of vectors of
clustering ranges (to support IN) and in fact the only caller converts
it to a vector. Converting it immediately simplifies later patching.
2026-04-19 20:57:04 +03:00
Avi Kivity
e646b763e7 cql3: statement_restrictions: split multi_column_range_accumulator into prepare-time and query-time phases
multi_column_range_accumulator analyzes an expression containing
multi-column restrictions of the form (a, b) > (?, ?) and solves for
the set of intervals that satisfy those restrictions.

Split this into a prepare-time phase (that generates "builders",
functions that operate on the accumulator), and a query phase that
executes the builders. Importantly, the expression visitor ends up in
the prepare phase, so it can be merged with other parts of the analysis.

Helper functions of the visitor are made static, since they need to
run during the query phase but the visitor only exists during the
prepare phase.
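
A minimal stand-alone sketch of the builder split (all names invented):

  #include <functional>
  #include <vector>

  // The visitor runs once, at prepare time, and emits builders; the
  // query phase only replays them against a fresh accumulator.
  struct accumulator { std::vector<int> bounds; };
  struct query_options { int bound; };
  using builder = std::function<void(accumulator&, const query_options&)>;

  std::vector<builder> prepare_phase() {   // expression visitor lives here
      std::vector<builder> builders;
      builders.push_back([](accumulator& acc, const query_options& opts) {
          acc.bounds.push_back(opts.bound); // bound values arrive at query time
      });
      return builders;
  }

  accumulator query_phase(const std::vector<builder>& bs, const query_options& opts) {
      accumulator acc;
      for (const auto& b : bs) {
          b(acc, opts);
      }
      return acc;
  }

  int main() {
      auto builders = prepare_phase();        // once per prepared statement
      auto acc = query_phase(builders, {5});  // once per execution
      return acc.bounds == std::vector<int>{5} ? 0 : 1;
  }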
2026-04-19 20:57:04 +03:00
Avi Kivity
ea26186043 cql3: statement_restrictions: make get_multi_column_clustering_bounds a builder
Lay the groundwork for analyzing multi column clustering bounds by
splitting the function into prepare-time and execute-time parts.
To start with, all of the work is done at query time, but later
patches will move bits into prepare time.
2026-04-19 20:57:04 +03:00
Avi Kivity
c60e3d5cf7 cql3: statement_restrictions: multi-key clustering restrictions one layer deeper
For the multi column binary operator case, perform more of the work at
prepare time in preparation for consolidating the analysis.
2026-04-19 20:57:04 +03:00
Avi Kivity
b520e74128 cql3: statement_restrictions: push multi-column post-processing into get_multi_column_clustering_bounds()
Doing this splits the multi-column processing code into a preparation
phase and an evaluation phase in a single call, making it easier to
further split prepare/evaluate.
2026-04-19 20:57:04 +03:00
Avi Kivity
c4ab0ddb85 cql3: statement_restrictions: pre-analyze single-column clustering key restrictions
Change _clustering_prefix_restrictions and _idx_tbl_ck_prefix
(the latter is the equivalent of the former, for indexed queries)
to use predicates instead of expressions. This lets us do
more of the work of solving restrictions during prepare time.

We only handle single-column restrictions here. Multi-column
restrictions use the existing path.

We introduce two helpers:
 - value_set_to_singleton() converts a restriction solution to a singleton
   when we know that's the only possible answer
 - replace_column_def() overload for predicate, similar to the
   existing overload for expressions

There is a wart in get_single_column_clustering_bounds(): we arrive at
this point with the two vectors possibly pointing at different
columns. Previously, possible_lhs_values() did this check while solving.
We now check for it here.

The predicate::on variant gets another member, for clustering key prefixes.
Since everything is still handled by the legacy paths, we mostly
error out.
2026-04-19 20:57:04 +03:00
Avi Kivity
201ed53837 cql3: statement_restrictions: wrap value_for_index_partition_key()
To allow more work to be carried out during prepare time, wrap
the body in an std::function, which will be called at execution time.

Currently we still do the work at execution time, but the way is
prepared.
2026-04-19 20:57:04 +03:00
Avi Kivity
325497d460 cql3: statement_restrictions: hide value_for()
value_for() is a general function that solves for values that
satisfy an expression set to TRUE. This goes against our goal to
prepare solvers for all the expressions we use. Fortunately, it's only
called with one expression, which comes from statement_restrictions, so
we can add an accessor that provides the expression from our own state.
Later, we'll be able to do prepare-time work on it.
2026-04-19 20:57:04 +03:00
Avi Kivity
dcdd2f7e72 cql3: statement_restrictions: push down clustering prefix wrapper one level
This allows us to tackle each case separately.
2026-04-19 20:57:03 +03:00
Avi Kivity
1039ed9ed2 cql3: statement_restrictions: wrap functions that return clustering ranges
During prepare time, build functions for use during execution time.

Currently, the wrappers are very shallow, and practically all the
work is done at execution time. But the stage is set for more peeling.

The index clustering ranges had on_internal_error()s if an index
was not used. They're converted to returning a null function. If
executed (which is never supposed to happen), it will throw
a bad_function_call.
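
This is standard std::function behavior; a minimal stand-alone
demonstration:

  #include <cassert>
  #include <functional>

  int main() {
      std::function<int()> f;   // null function: no callable target
      assert(!f);               // converts to false while empty
      try {
          f();                  // calling an empty std::function...
      } catch (const std::bad_function_call&) {
          // ...throws std::bad_function_call, which is what the
          // never-executed index paths above rely on.
      }
  }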
2026-04-19 20:57:03 +03:00
Avi Kivity
620df7103f cql3: statement_restrictions: do not pass view schema back and forth
For indexed queries, statement_restrictions calculates _view_schema,
which is passed via get_view_schema() to indexed_select_statement(),
which passes it right back to statement_restrictions via one of three
functions to calculate clustering ranges.

Avoid the back-and-forth and use the stored value. Using a different
value would be broken.

This change allows unifying the signatures of the four functions that
get clustering ranges.
2026-04-19 20:57:03 +03:00
Avi Kivity
6fce090e30 cql3: statement_restrictions: pre-analyze token range restrictions
Convert token range restrictions to the predicate format we
introduced earlier, where we have a function to solve for the token
range rather than running the analysis at runtime. In truth, the
function still delegates to possible_partition_token_values(), which
does the analysis at runtime, but it's one step closer.

We add a new variant alternative to predicate::on, since a token isn't
a column and doesn't fit the existing alternative.
2026-04-19 20:57:03 +03:00
Avi Kivity
941011bb4a cql3: statement_restrictions: pre-analyze partition key columns
The expression tree for partition keys is analyzed during runtime:
in partition_range_from_singles() (for example), we call find_binop
and get_subscripted_column() to understand the expression structure.

This analysis is problematic because it has to match the analysis
done at prepare time, and the two have to evolve in lock step.

Here, we move the analysis to the prepare stage. This is done
by augmenting the expression into a new predicate struct. It
contains the original expression (as a fallback for paths not yet
converted), as well as solve_for, a function built at prepare time
that embeds all the necessary analysis.

We introduce the `predicate` type which is an augmentation
of boolean expressions. In addition to the expression, we remember
what column the expression is on, and a function that computes
what values the column can take that would make the expression
true.

The field that says what column the predicate is about is typed
as a variant since later on we will have predicates on non-columns
(the token, or a clustering prefix).

Note that currently the function engages in some run-time analysis of
its own, since it calls possible_lhs_values that itself does analysis,
but this is a step in the right direction.
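
A hedged sketch of the struct; apart from the names predicate, on and
solve_for, everything below is invented:

  #include <functional>
  #include <variant>
  #include <vector>

  struct column_definition {};
  struct expression {};
  struct query_options {};
  using value_set = std::vector<int>;

  struct predicate {
      expression filter;   // the original boolean factor (legacy fallback)
      // What the predicate constrains; the variant grows more
      // alternatives later (token, clustering prefix).
      std::variant<const column_definition*> on;
      // Built at prepare time: computes the values 'on' can take that
      // make the expression true.
      std::function<value_set(const query_options&)> solve_for;
  };

  int main() {
      column_definition c;
      predicate p{expression{}, &c, nullptr};
      return std::holds_alternative<const column_definition*>(p.on) ? 0 : 1;
  }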
2026-04-19 20:57:03 +03:00
Avi Kivity
c73f3ac55f cql3: statement_restrictions: do not collect subscripted partition key columns
An indexed SELECT of the from

SELECT ...
WHERE pk['sub'] = ?

is impossible because our indexes do not support frozen maps, and
partition key collections must be frozen. Stop collecting such constructs
for the purpose of determining the partition range. This reduces having
to deal with combinations of restrictions on the column and its entries
later on.

In case we start supporting indexes on frozen maps, leave an
on_internal_error to remind us.
2026-04-19 20:57:03 +03:00
Avi Kivity
531f137ed3 cql3: statement_restrictions: split _partition_range_restrictions into three cases
_partition_range_restrictions is a vector of expressions, one per
partition key column. It can be empty if there is no restriction on
the partition key that can be translated to a read command, and if the
restriction is on a token range, only the first element is used.

Separate the three cases into distinct structs. After this, additional
work can be done utilizing the specialization.
2026-04-19 20:57:03 +03:00
Avi Kivity
fcf7c4c90d cql3: statement_restrictions: move value_list, value_set to header file
They don't really need to be public, but will be used in intermediate
storage.
2026-04-19 20:57:03 +03:00
Avi Kivity
926886fcfb cql3: statement_restrictions: wrap get_partition_key_ranges
statement_restrictions::get_partition_key_ranges() re-interprets
the expressions used to specify the partition key. This means that
the analysis phase (determining what those expressions are and how
they are to be used) and the execution phase (using them) are in separate
places. This makes it very hard to refactor while preserving correctness.

As a first step in unifying the two phases, we move the selection
of the strategy (using token, cartesian product, or single partition)
from execution to analysis, by making the if-tree return a function to
be executed at execution time, rather than running the if-tree itself
at execution time.
2026-04-19 20:57:03 +03:00
Avi Kivity
eec0b20dbc cql3: statement_restrictions: prepare statement_restrictions for capturing this
Prevent copying/moving, which can change the address, and instead enforce
using shared_ptr. Most of the code is already using shared_ptr, so the
changes aren't very large.

To forbid non-shared_ptr construction, the constructors are annotated
with a private_tag tag class.
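
A stand-alone sketch of the idiom (class name invented):

  #include <memory>

  class only_shared {
      struct private_tag {};   // only members can name this type
  public:
      explicit only_shared(private_tag) {}
      only_shared(const only_shared&) = delete;
      only_shared& operator=(const only_shared&) = delete;

      static std::shared_ptr<only_shared> make() {
          // make_shared works here because we can spell private_tag;
          // code outside the class cannot, so it cannot construct one.
          return std::make_shared<only_shared>(private_tag{});
      }
  };

  int main() {
      auto p = only_shared::make();
      // only_shared direct{...};   // impossible: private_tag is inaccessible
      // auto copy = *p;            // deleted: the address must stay stable
      (void)p;
  }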
2026-04-19 20:57:03 +03:00
Avi Kivity
374be94faa test: statement_restrictions: add index_selection regression test
In preparation for refactoring statement_restrictions, add a simple
and an exhaustive regression test, encoding the index selection
algorithm into the test. We cannot change the index selection algorithm
because then mixed-node clusters will alter the sorting key mid-query
(if paging takes place).

Because the exhaustive test has such a large stack frame, and because
AddressSanitizer bloats stack frames, increase the stack size for
debug builds.
2026-04-19 20:57:01 +03:00
Artsiom Mishuta
dce0c24a02 test/alternator: replace bare pytest.skip() with typed skip helpers 2026-04-19 17:34:41 +02:00
Artsiom Mishuta
b078cd1e72 test: migrate new bare skips introduced by upstream after rebase
Migrate 3 bare skip sites that appeared in upstream/master after the
initial migration:

- test/cluster/test_strong_consistency.py: 2 @pytest.mark.skip →
  @pytest.mark.skip_bug (SCYLLADB-1056)
- test/cqlpy/conftest.py: pytest.skip() → skip_env() in
  skip_on_scylla_vnodes fixture
2026-04-19 17:34:41 +02:00
Artsiom Mishuta
9c4d3ce097 test/pylib: reject bare pytest.mark.skip and add codebase guards
Harden the skip_reason_plugin to reject bare @pytest.mark.skip at
collection time with pytest.UsageError instead of warnings.warn().

Add test/pylib_test/test_no_bare_skips.py with three guard tests:
- AST scan for bare pytest.skip() runtime calls
- Real pytest --collect-only against all Python test directories
2026-04-19 17:34:31 +02:00
Artsiom Mishuta
0b6b380b80 test: update comments referencing pytest.skip() to skip_env()
Update 7 comments/docstrings across 5 files that still referenced
pytest.skip() to reference the typed skip_env() wrapper for
consistency with the migrated code.
2026-04-19 11:14:03 +02:00
Artsiom Mishuta
b10028e556 test: migrate runtime pytest.skip() to typed skip_bug()
Migrate 2 runtime pytest.skip() calls referencing known bugs to use
the typed skip_bug() wrapper from test.pylib.skip_types:

- test/alternator/test_ttl.py: Streams on tablets (#23838)
- test/scylla_gdb/test_task_commands.py: coroutine task not found (#22501)
2026-04-19 11:10:42 +02:00
Artsiom Mishuta
8a80e2c3be test: migrate runtime pytest.skip() to typed skip_env()
Migrate runtime pytest.skip() calls across 34 files to use the typed
skip_env() wrapper from test.pylib.skip_types.

These sites skip at runtime because a required feature, config option,
library version, build mode, or runtime topology is not available.

Also fixes 'raise pytest.skip(...)' in test_audit.py — skip_env()
already raises internally, so the explicit raise was incorrect.

Each file gains one new import:
  from test.pylib.skip_types import skip_env
2026-04-19 11:09:29 +02:00
Artsiom Mishuta
fb0974a329 test: migrate bare @pytest.mark.skip to skip_not_implemented
Migrate 2 bare @pytest.mark.skip decorators (no reason string) to
@pytest.mark.skip_not_implemented with an explicit reason referencing
issue #3882 (COMPACT STORAGE not implemented).
2026-04-19 11:06:30 +02:00
Artsiom Mishuta
a39fb9d29a test: migrate @pytest.mark.skip to @pytest.mark.skip_slow
Migrate 4 @pytest.mark.skip decorator sites to @pytest.mark.skip_slow
across 3 test files where the skip reason indicates a slow test.
2026-04-19 11:06:30 +02:00
Artsiom Mishuta
638efedc3c test: migrate @pytest.mark.skip to @pytest.mark.skip_not_implemented
Migrate 10 @pytest.mark.skip decorator sites to
@pytest.mark.skip_not_implemented across 5 test files where the
skip reason indicates a feature not yet implemented.
2026-04-19 11:06:30 +02:00
Artsiom Mishuta
465636bc53 test: migrate @pytest.mark.skip to @pytest.mark.skip_bug for known bugs
Migrate 24 @pytest.mark.skip decorator sites to @pytest.mark.skip_bug
across 16 test files where the reason references a known bug or issue.
2026-04-19 11:06:30 +02:00
Nadav Har'El
0d05e3b4a4 alternator: fix ListStreams paging if table is deleted during paging
Currently, ListStreams paging works by looking in the list of tables
for ExclusiveStartStreamArn and starting there. But it's possible
that during the paging process, one of the tables got deleted and
ExclusiveStartStreamArn no longer points to an existing table. In
the current implementation this caused the paging to stop (thinking
it had reached the end).

The solution is simple: ListStreams will now sort the list of tables
by name (it anyway needs to be sorted by something to be consistent
across pages), and will look with std::upper_bound for the first
table *after* the ExclusiveStartStreamArn - we don't need to find
that table name itself.
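
A stand-alone illustration of why std::upper_bound tolerates a deleted
cursor (the data is invented):

  #include <algorithm>
  #include <cassert>
  #include <string>
  #include <utility>
  #include <vector>

  int main() {
      // Tables sorted by (keyspace, name); the paging cursor names a
      // table that was deleted between pages.
      std::vector<std::pair<std::string, std::string>> tables = {
          {"ks", "a"}, {"ks", "c"}, {"ks", "d"},
      };
      std::pair<std::string, std::string> cursor{"ks", "b"};   // "b" is gone
      // upper_bound yields the first table strictly after the cursor,
      // so paging resumes at {"ks", "c"} whether or not "b" exists.
      auto it = std::upper_bound(tables.begin(), tables.end(), cursor);
      assert(it != tables.end() && it->second == "c");
  }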

The patch also includes a test reproducing this bug. As usual, the
test passes on DynamoDB, fails on Alternator before this patch,
and passes with the patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-19 09:12:02 +03:00
Nadav Har'El
930fb4c330 test/alternator: test DescribeStream on non-existent table
We already had a test that DescribeStream called on a bogus ARN
returns a ValidationException. But if the stream is more legitimate-
looking but refers to a non-existent table (e.g., an ARN taken in the
past from a table that no longer exists), we should return
ResourceNotFoundException. In this patch we add a test that verifies
we indeed do this correctly.

Moreover, Alternator's current stream ARNs include both a keyspace
name and a table name, and either one being incorrect should lead
to ResourceNotFoundException, and indeed the new test validates
that it works as expected - there is no bug here (AI guessed we
have a bug in the missing *keyspace* case, but this guess was wrong).
2026-04-19 09:12:02 +03:00
Nadav Har'El
02d474fca8 alternator: ListStreams: on last page, avoid LastEvaluatedStreamArn
When ListStreams is on its last page and has run out of streams to list,
it shouldn't return a paging cookie (LastEvaluatedStreamArn) at all.
Before this patch it does, and forces the user to make another call
just to get another empty page, which is silly.

This patch includes a fix and a reproducer test (that, as usual, passes
on DynamoDB and fails on Alternator before the patch and succeeds
after).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-19 09:12:02 +03:00
Nadav Har'El
68b783103e alternator: remove dead code stream_shard_id
The class "stream_shard_id" was used in the past (with the old name
stream_arn) for representing stream ARNs. It was renamed
"stream_shard_id" under the mistaken believe that it will be used to
represent DynamoDB Streams "shards" - but it wasn't used for that
either (we have a separate "struct shard_id" in the code).

So this class is now dead code and can be removed.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-19 09:12:01 +03:00
Nadav Har'El
1ac910c2ab alternator: fix ListStreams to return real ARN as LastEvaluatedStreamArn
Alternator Streams' "ListStreams" does paging by returning a "cookie"
LastEvaluatedStreamArn from one request, that the user passes to the
next request as ExclusiveStartStreamArn.

In the past, Alternator's stream ARNs were UUIDs, but we recently
changed them to match DynamoDB's ARN format which the KCL library
requires. However, we didn't change ListStreams' cookie format,
and it remained UUIDs.

This, however, goes against the documentation of DynamoDB, which
states that LastEvaluatedStreamArn should be "the stream ARN of
the item where the operation stopped". It shouldn't be some weird
opaque cookie.

So in this patch we add a test that confirms that indeed, in DynamoDB
the LastEvaluatedStreamArn is really the last returned ARN and not
an opaque cookie. The new test passes on DynamoDB, and fails on
Alternator before the simple fix that this patch then does.

Fixes SCYLLADB-539.
2026-04-19 09:12:01 +03:00
Piotr Smaron
218f8adc8f transport: add per-service-level cql_requests_serving metric
Add a per-scheduling-group gauge that tracks the number of in-flight CQL
requests for each service level. The existing scylla_transport_requests_serving
metric is a single global per-shard counter; the new metric breaks it down
by scheduling group so operators can see which service level contributes
the most in-flight requests when debugging latency.

The metric is named cql_requests_serving (exposed as
scylla_transport_cql_requests_serving) following the cql_ prefix convention
used by all other per-scheduling-group transport metrics (cql_requests_count,
cql_request_bytes, cql_response_bytes, cql_pending_response_memory). Using
a cql_ prefix avoids Prometheus confusion with the global requests_serving
metric, which lacks the scheduling_group_name label.

The counter is incremented when a request enters process_request() and
decremented in the same 'leave' defer block as the global requests_serving,
ensuring the request is counted as in-flight until the response is sent.
2026-04-17 15:07:14 +02:00
Piotr Smaron
4988077249 transport: move requests_serving decrement to after response is sent
The requests_serving metric was decremented right after query processing
completed, but before the response was written to the client. This means
requests whose responses were queued in the write pipeline were no longer
counted as in-flight, understating the actual load.

Move the decrement into the 'leave' defer block, which fires after the
response is fully sent via _ready_to_respond. This makes the shedding
check (max_concurrent_requests_per_shard) more accurate: requests that
have finished processing but are still waiting in the response queue now
correctly count toward the in-flight limit.
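
The real code uses a seastar defer block; a plain-C++ stand-in showing
the ordering:

  #include <cassert>

  int requests_serving = 0;

  // Stand-in for the 'leave' defer block: the decrement runs when the
  // guard is destroyed, i.e. only after the response path completes.
  struct decrement_on_exit {
      int& counter;
      ~decrement_on_exit() { --counter; }
  };

  void process_request_like() {
      ++requests_serving;                        // in-flight from entry...
      decrement_on_exit leave{requests_serving};
      // ... process the query ...
      // ... write the response to the client ...
  }                                              // ...until the response is sent

  int main() {
      process_request_like();
      assert(requests_serving == 0);
  }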
2026-04-17 15:05:29 +02:00
Aleksandra Martyniuk
b4c0ad20cf service: fix indentation 2026-04-17 09:58:08 +02:00
Aleksandra Martyniuk
88c55cf7ed docs: update documentation 2026-04-17 09:58:08 +02:00
Aleksandra Martyniuk
2c0de7d9b3 test: test multi RF changes 2026-04-17 09:58:08 +02:00
Aleksandra Martyniuk
1b2b453782 service: tasks: allow aborting ongoing RF changes
Allow aborting an ongoing RF change using task manager.

RF change can only be aborted if:
- it is currently paused (existing behavior);
- it is a multi-RF change that still has replicas to be added.

In the second case, we set error for the request in system.topology_requests
and set next_replication to replication_v2. This makes load balancer
roll back the RF change.
2026-04-17 09:58:08 +02:00
Aleksandra Martyniuk
38bad5f316 cql3: allow changing RF by more than one when adding or removing a DC
rf_rack_valid_keyspaces relies on the fact that replicas of base
table and mv are streamed concurrently. This is no longer true
for the newly introduced method of adding a DC. Disable rf_rack_valid_keyspaces
in test_mv_first_replica_in_dc to force the old method.
2026-04-17 09:58:08 +02:00
Aleksandra Martyniuk
1bafc8394c service: handle multi_rf_change
Extend keyspace_rf_change handler to handle multi_rf_change.
multi_rf_change is allowed only if we add or remove DCs and
the keyspace uses a rack-list replication factor. The handler
adds the request id to topology::ongoing_rf_changes.
The request is further processed by load balancer.
2026-04-17 09:58:07 +02:00
Aleksandra Martyniuk
8fb91e245f service: implement make_rf_change_plan
In make_rf_change_plan, load balancer schedules necessary migrations,
considering the load of nodes and other pending tablet transitions.
Requests from ongoing_rf_changes are processed concurrently,
independently of one another. Within each request, racks are processed
concurrently.
No tablet replica will be removed until all required replicas are added.
While adding replicas to each rack, we always start with base tables
and won't proceed with views until they are done (while removing, the
other way around).

Node availability is checked at two levels for extending actions:

1) In prepare_per_rack_rf_change_plan: the entire RF change request is
   aborted if any node in the target dc+rack is down, or if there are
   no live (non-excluded) nodes at all. Shrinking is never aborted.

2) In make_rf_change_plan: extending is skipped for a given round if
   any normal, non-excluded node in the target dc+rack is missing from
   the balanced node set. Shrinking always proceeds regardless.

The resulting behavior per node state combination (extending only):
  - all up                  -> proceed
  - some excluded + some up -> proceed (excluded nodes are skipped)
  - any down node           -> abort
  - all excluded (no live)  -> abort

When the last step is finished:
- in system_schema.keyspaces:
  - next_replication is cleared;
  - new keyspace properties are saved (if request succeeded);
- request is removed from ongoing_rf_changes;
- the request is marked as done in system.topology_requests.
2026-04-17 09:58:07 +02:00
Aleksandra Martyniuk
89a17491db service: add keyspace_rf_change_plan to migration_plan
Add keyspace_rf_change_plan to migration_plan.

The keyspace_rf_change_plan consists of:
- completion - info about the request for which all migrations are done. Only one
  request can be completed at a time, even if more have finished migrations
  (the rest will be completed later). Based on it:
    - next_replication is cleared;
    - new keyspace properties are saved (only if succeeded);
    - request is removed from ongoing_rf_changes;
    - the request is marked as done in system.topology_requests.
- aborts - info about requests that cannot complete because the required
  rf change is impossible (e.g. no available nodes in a required rack).
  Multiple requests can be aborted in a single plan. Based on each:
    - next_replication is set to current_replication (rolling back);
    - the request is marked as aborted with an error in system.topology_requests.

The scheduled rebuilds will be kept in migration_plan::_migrations.

Based on that the canonical_mutations are generated.

Add update_topology_state_with_mixed_change and use it if any schema
changes are required, i.e. if plan contains keyspace_rf_change_plan::completion.
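
A hedged sketch of the plan's shape (field types invented; only the
completion/aborts split and their cardinalities come from this message):

  #include <optional>
  #include <string>
  #include <vector>

  struct rf_change_completion { std::string request_id; bool succeeded; };
  struct rf_change_abort { std::string request_id; std::string error; };

  struct keyspace_rf_change_plan {
      std::optional<rf_change_completion> completion;  // at most one per plan
      std::vector<rf_change_abort> aborts;             // any number per plan
  };

  int main() {
      keyspace_rf_change_plan plan;
      plan.aborts.push_back({"req-1", "no available nodes in a required rack"});
      return plan.completion ? 1 : 0;
  }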
2026-04-17 09:58:07 +02:00
Aleksandra Martyniuk
bcdab2e012 service: extend tablet_migration_info to handle rebuilds
Make tablet_migration_info::{src,dst} optional, so that it can be
reused by rebuild, for respectively leaving and pending replica.
2026-04-17 09:58:07 +02:00
Aleksandra Martyniuk
d41c5a7db4 service: split update_node_load_on_migration
Split update_node_load_on_migration into decrease_node_load and
increase_node_load - in the following changes for rebuilds we will
need only one of those at a time.
2026-04-17 09:58:07 +02:00
Aleksandra Martyniuk
dd83666733 service: rearrange keyspace_rf_change handler
In the following changes, the keyspace_rf_change handler will also
consider a change of RF by more than one. Rearrange the handler so that
it first chooses the kind of RF change and then creates the relevant
updates.

Do not wrap the code in a schedule_migration function, as we no longer
need a quick-return possibility.
2026-04-17 09:58:07 +02:00
Aleksandra Martyniuk
72bb3113ac db: add columns to system_schema.keyspaces
Add a new next_replication column to system_schema.keyspaces table.

While there is an ongoing RF change:
- next_replication keeps the target RF values;
- existing replication_v2 column keeps initial RF values - the ones we
  started the RF change with.

DESCRIBE KEYSPACE statement shows replication_v2.

When there is no ongoing RF change for this keyspace, its
next_replication is empty.

In this commit no data is kept in the new column.
2026-04-17 09:58:07 +02:00
Aleksandra Martyniuk
751af38f2a db: service: add ongoing_rf_changes to system.topology
The following changes will allow adding or removing all keyspace
replicas in a DC with a single ALTER KEYSPACE. For such operations,
the tablet load balancer needs to schedule rebuilds. To track
which RF change requests require rebuilds, we maintain a vector
of RF changes along with their ongoing rebuild phases.

Add a new ongoing_rf_changes column to system.topology to keep track
of those requests.

In this commit no data is kept in the new column.
2026-04-17 09:58:07 +02:00
Aleksandra Martyniuk
7cdf7d62a2 gms: add keyspace_multi_rf_change feature 2026-04-17 09:58:05 +02:00
Łukasz Paszkowski
4657d9e32c streaming: reject mutation fragments on critical disk utilization
The stream_mutation_fragments RPC handler did not check
is_in_critical_disk_utilization_mode before accepting incoming mutation
fragments. This meant load-and-stream (nodetool refresh --load-and-stream)
could push data onto a node at critical disk utilization, potentially
filling the disk completely.

Add a critical disk utilization check in the get_next_mutation_fragment
lambda, throwing critical_disk_utilization_exception when the node is in
critical mode. This mirrors the existing protection in stream_blob.cc.

Also remove the xfail marker from the corresponding test added in the
previous commit.
2026-04-17 09:31:26 +02:00
Łukasz Paszkowski
61877e9dfb test/cluster/storage: Add a reproducer for load-and-stream out-of-space rejection
Add `test_load_and_stream_rejected_on_critical_disk` which verifies
that `nodetool refresh --load-and-stream` is rejected when the target
node reaches critical disk utilization during streaming. The test is
marked xfail because the stream_mutation_fragments handler does not
yet check whether the node is in the critical disk utilization mode
(introduced in the next patch).

The test sets up a 3-node cluster, writes data and snapshots SSTables
on one node, wipes another node's data, and copies the snapshot to its
upload directory. It then starts load-and-stream and uses the
`write_components_writer_created` error injection to pause SSTable writing.
While paused, the test fills the disk past the critical threshold, then
releases the injection. The next streamed mutation fragment is rejected
with critical_disk_utilization_exception.

The test verifies that:

- The operation fails with the expected error.
- No data is persisted on the target node.
- Partial SSTable files created during streaming are deleted (via the
  implicit mark-for-deletion mechanism in the SSTable lifecycle).
2026-04-16 08:38:34 +02:00
Łukasz Paszkowski
8d34127684 sstables: clean up TemporaryHashes file in wipe()
The TemporaryHashes.db.tmp file is created during SSTable writing to
store intermediate bloom filter hashes and is deleted before the SSTable
is sealed. Since it is not tracked in the TOC, it is also absent from
_recognized_components and all_components().

When an SSTable write fails before sealing (e.g. streaming rejected due
to critical disk utilization), wipe() is called to clean up the partial
SSTable. However, wipe() only iterates over all_components(), so the
TemporaryHashes file was left behind as an orphan.

Previously, the only cleanup mechanism for this file was the
startup-time directory scanner in sstable_directory, which would not
help when the orphan needs to be cleaned up at runtime.

Explicitly remove the TemporaryHashes file in wipe(), ignoring ENOENT
for the common case where the file was already removed before sealing.
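
wipe() itself is asynchronous seastar code; as a plain-C++ illustration
of the ENOENT-tolerant semantics only:

  #include <filesystem>
  #include <iostream>
  #include <system_error>

  int main() {
      std::error_code ec;
      // The error_code overload returns false and leaves ec clear when
      // the file does not exist, matching the ignore-ENOENT behavior.
      bool removed = std::filesystem::remove("TemporaryHashes.db.tmp", ec);
      if (ec) {
          std::cerr << "unexpected error: " << ec.message() << '\n';
          return 1;
      }
      if (!removed) {
          // already gone: the common case when the sstable was sealed
      }
  }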
2026-04-16 08:38:34 +02:00
Łukasz Paszkowski
159675e975 sstables: add error injection point in write_components
Add a `write_components_writer_created` error injection point in
`sstable::write_components()` between writer creation and fragment
consumption.

This injection is needed by the out-of-space streaming test (added in
the next patch) to reliably pause SSTable writing at the right moment:
after the SSTable writer has been created and files exist on disk, but
before mutation fragments are consumed.

Pausing earlier (before writer creation) would not work because there
are no files on disk yet, while pausing later (after consuming fragments)
would be too late to reliably push the node into critical disk utilization.
2026-04-16 08:38:34 +02:00
Łukasz Paszkowski
d1a24aa16a test/cluster/storage: extract validate_data_existence to module scope
Move validate_data_existence out of test_user_writes_rejection into
module scope so it can be reused by other tests in the file. No
functional change.
2026-04-16 08:38:33 +02:00
Łukasz Paszkowski
9c82b76755 test/cluster: enable suppress_disk_space_threshold_checks in tests using data_file_capacity
Tests that override disk capacity via the data_file_capacity config
option trigger the disk space monitor's critical utilization mode and
as a consequence activate out-of-space prevention mechanisms.

This will cause bootstrap failures with critical_disk_utilization_exception
during mutation-based streaming introduced later in the series.

Enable the `suppress_disk_space_threshold_checks` error injection at
startup in the affected tests to prevent the disk space monitor from
interfering with the test-configured capacity values.

Affected tests:
- test_balance_empty_tablets (test/cluster/test_size_based_load_balancing.py)
- test_load_stats_on_coordinator_failover (test/cluster/test_tablet_stats.py)
2026-04-16 08:38:33 +02:00
Łukasz Paszkowski
3726e31c03 utils/disk_space_monitor: add error injection to suppress threshold checks
Add the `suppress_disk_space_threshold_checks` error injection point
to the disk space monitor. When enabled, the threshold listener
short-circuits without evaluating disk utilization.

This is useful for tests that override disk capacity via `data_file_capacity`,
where the real disk usage causes the monitor to incorrectly report
critical utilization and activate out-of-space prevention mechanisms.
2026-04-16 08:38:33 +02:00
Petr Gusev
8a16746e55 strong_consistency: fix crash when DROP TABLE races with in-flight DML
When DROP TABLE races with an in-flight DML on a strongly-consistent
table, the node aborts in groups_manager::acquire_server() because the
raft group has already been erased from _raft_groups.

A concurrent DROP TABLE may have already removed the table from database
registries and erased the raft group via schedule_raft_group_deletion.
The schema.table() in create_operation_ctx() might not fail though
because someone might be holding lw_shared_ptr<table>, so that the
table is dropped but the table object is still alive.

Fix by accepting table_id in acquire_server and checking that the table
still exists in the database via find_column_family before looking up
the raft group.  If the table has been dropped, find_column_family
throws no_such_column_family instead of the node aborting via
on_internal_error.  When the table does exist, acquire_server proceeds
to acquire state.gate; schedule_raft_group_deletion co_awaits
gate::close, so it will wait for the DML operation to complete before
erasing the group.
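
A stand-alone analog of the reordering (all types below are stand-ins
for the real ScyllaDB ones):

  #include <map>
  #include <stdexcept>

  struct no_such_column_family : std::runtime_error {
      using std::runtime_error::runtime_error;
  };

  // Check table existence first, so a dropped table surfaces as a
  // normal, catchable error rather than an internal-error abort.
  struct groups_manager_like {
      std::map<int, int> tables;        // table_id -> table
      std::map<int, int> raft_groups;   // table_id -> raft group

      int acquire_server(int table_id) {
          if (!tables.contains(table_id)) {
              throw no_such_column_family("table dropped concurrently");
          }
          // For a table that still exists, a missing raft group remains
          // the only internal-error case.
          return raft_groups.at(table_id);
      }
  };

  int main() {
      groups_manager_like m;
      m.tables[1] = 0;
      m.raft_groups[1] = 7;
      try {
          m.acquire_server(2);   // raced with DROP TABLE: throws, no abort
      } catch (const no_such_column_family&) {}
      return m.acquire_server(1) == 7 ? 0 : 1;
  }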

Fixes SCYLLADB-1450
2026-04-10 22:56:16 +02:00
Petr Gusev
82460e7a38 test: add regression test for DROP TABLE racing with in-flight DML
Add test_drop_table_during_insert that reproduces a crash when DROP TABLE
races with an in-flight INSERT on a strongly-consistent table.  The test
uses an error injection to pause INSERT between obtaining the ERM and
calling acquire_server, then drops the table (which destroys the raft
group), then resumes the INSERT.  Without a fix, the node aborts in
acquire_server via on_internal_error.

The test is marked as skip until the fix is in place.
2026-04-10 22:56:16 +02:00
263 changed files with 6879 additions and 3041 deletions

.github/CODEOWNERS
View File

@@ -32,8 +32,8 @@ counters* @nuivall
tests/counter_test* @nuivall
# DOCS
docs/* @annastuchlik @tzach
docs/alternator @annastuchlik @tzach @nyh
/docs/ @annastuchlik @tzach
/docs/alternator/ @annastuchlik @tzach @nyh
# GOSSIP
gms/* @tgrabiec @asias @kbr-scylla

View File

@@ -234,15 +234,11 @@ generate_scylla_version()
option(Scylla_USE_PRECOMPILED_HEADER "Use precompiled header for Scylla" ON)
add_library(scylla-precompiled-header STATIC exported_templates.cc)
target_include_directories(scylla-precompiled-header PRIVATE
"${CMAKE_CURRENT_SOURCE_DIR}"
"${scylla_gen_build_dir}")
target_link_libraries(scylla-precompiled-header PRIVATE
absl::headers
absl::btree
absl::hash
absl::raw_hash_set
idl
Seastar::seastar
Snappy::snappy
systemd

View File

@@ -78,7 +78,7 @@ fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=2026.2.0-dev
VERSION=2026.2.0-rc0
if test -f version
then

View File

@@ -1892,7 +1892,7 @@ future<executor::request_return_type> executor::update_table(client_state& clien
}
if (vector_index_updates->Size() > 1) {
// VectorIndexUpdates mirrors GlobalSecondaryIndexUpdates.
// Since DynamoDB artifically limits the latter to just a
// Since DynamoDB artificially limits the latter to just a
// single operation (one Create or one Delete), we also
// place the same artificial limit on VectorIndexUpdates,
// and throw the same LimitExceeded error if the client

View File

@@ -1354,7 +1354,7 @@ static future<executor::request_return_type> query_vector(
std::unordered_set<std::string> used_attribute_values;
// Parse the Select parameter and determine which attributes to return.
// For a vector index, the default Select is ALL_ATTRIBUTES (full items).
// ALL_PROJECTED_ATTRIBUTES is significantly more efficent because it
// ALL_PROJECTED_ATTRIBUTES is significantly more efficient because it
// returns what the vector store returned without looking up additional
// base-table data. Currently only the primary key attributes are projected
// but in the future we'll implement projecting additional attributes into

View File

@@ -167,46 +167,8 @@ static schema_ptr get_schema_from_arn(service::storage_proxy& proxy, const strea
}
}
// ShardId. Must be between 28 and 65 characters inclusive.
// UUID is 36 bytes as string (including dashes).
// Prepend a version/type marker (`S`) -> 37
class stream_shard_id : public utils::UUID {
public:
using UUID = utils::UUID;
static constexpr char marker = 'S';
stream_shard_id() = default;
stream_shard_id(const UUID& uuid)
: UUID(uuid)
{}
stream_shard_id(const table_id& tid)
: UUID(tid.uuid())
{}
stream_shard_id(std::string_view v)
: UUID(v.substr(1))
{
if (v[0] != marker) {
throw std::invalid_argument(std::string(v));
}
}
friend std::ostream& operator<<(std::ostream& os, const stream_shard_id& arn) {
const UUID& uuid = arn;
return os << marker << uuid;
}
friend std::istream& operator>>(std::istream& is, stream_shard_id& arn) {
std::string s;
is >> s;
arn = stream_shard_id(s);
return is;
}
};
} // namespace alternator
template<typename ValueType>
struct rapidjson::internal::TypeHelper<ValueType, alternator::stream_shard_id>
: public from_string_helper<ValueType, alternator::stream_shard_id>
{};
template<typename ValueType>
struct rapidjson::internal::TypeHelper<ValueType, alternator::stream_arn>
: public from_string_helper<ValueType, alternator::stream_arn>
@@ -218,7 +180,8 @@ future<alternator::executor::request_return_type> alternator::executor::list_str
_stats.api_operations.list_streams++;
auto limit = rjson::get_opt<int>(request, "Limit").value_or(100);
auto streams_start = rjson::get_opt<stream_shard_id>(request, "ExclusiveStartStreamArn");
auto streams_start = rjson::get_opt<stream_arn>(request, "ExclusiveStartStreamArn");
auto table = find_table(_proxy, request);
auto db = _proxy.data_dictionary();
@@ -244,34 +207,34 @@ future<alternator::executor::request_return_type> alternator::executor::list_str
cfs = db.get_tables();
}
// # 12601 (maybe?) - sort the set of tables on ID. This should ensure we never
// generate duplicates in a paged listing here. Can obviously miss things if they
// are added between paged calls and end up with a "smaller" UUID/ARN, but that
// is to be expected.
// We need to sort the tables to ensure a stable order for paging.
// We sort by keyspace and table name, which will also allow us to skip to
// the right position by ExclusiveStartStreamArn.
auto cmp = [](std::string_view ks1, std::string_view cf1, std::string_view ks2, std::string_view cf2) {
return ks1 == ks2 ? cf1 < cf2 : ks1 < ks2;
};
if (std::cmp_less(limit, cfs.size()) || streams_start) {
std::sort(cfs.begin(), cfs.end(), [](const data_dictionary::table& t1, const data_dictionary::table& t2) {
return t1.schema()->id().uuid() < t2.schema()->id().uuid();
});
std::sort(cfs.begin(), cfs.end(),
[&cmp](const data_dictionary::table& t1, const data_dictionary::table& t2) {
return cmp(t1.schema()->ks_name(), t1.schema()->cf_name(),
t2.schema()->ks_name(), t2.schema()->cf_name());
});
}
auto i = cfs.begin();
auto e = cfs.end();
if (streams_start) {
i = std::find_if(i, e, [&](const data_dictionary::table& t) {
return t.schema()->id().uuid() == streams_start
&& cdc::get_base_table(db.real_database(), *t.schema())
&& is_alternator_keyspace(t.schema()->ks_name())
;
});
if (i != e) {
++i;
}
i = std::upper_bound(i, e, *streams_start,
[&cmp](const stream_arn& arn, const data_dictionary::table& t) {
return cmp(arn.keyspace_name(), arn.table_name(),
t.schema()->ks_name(), t.schema()->cf_name());
});
}
auto ret = rjson::empty_object();
auto streams = rjson::empty_array();
std::optional<stream_shard_id> last;
std::optional<stream_arn> last;
for (;limit > 0 && i != e; ++i) {
auto s = i->schema();
@@ -282,19 +245,24 @@ future<alternator::executor::request_return_type> alternator::executor::list_str
}
if (cdc::is_log_for_some_table(db.real_database(), ks_name, cf_name)) {
rjson::value new_entry = rjson::empty_object();
last = i->schema()->id();
auto arn = stream_arn{ i->schema(), cdc::get_base_table(db.real_database(), *i->schema()) };
rjson::add(new_entry, "StreamArn", arn);
rjson::add(new_entry, "StreamLabel", rjson::from_string(stream_label(*s)));
rjson::add(new_entry, "TableName", rjson::from_string(cdc::base_name(s->cf_name())));
rjson::push_back(streams, std::move(new_entry));
last = std::move(arn);
--limit;
}
}
rjson::add(ret, "Streams", std::move(streams));
if (last) {
// Only emit LastEvaluatedStreamArn when we stopped because we hit the
// limit (limit == 0), meaning there may be more streams to list.
// If we exhausted all tables naturally (limit > 0), there are no more
// streams, so we must not emit a cookie.
if (last && limit == 0) {
rjson::add(ret, "LastEvaluatedStreamArn", *last);
}
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
@@ -614,7 +582,7 @@ void stream_id_range::prepare_for_iterating()
// the function returns `stream_id_range` that will allow iteration over children Streams shards for the Streams shard `parent`
// a child Streams shard is defined as a Streams shard that touches token range that was previously covered by `parent` Streams shard
// Streams shard contains a token, that represents end of the token range for that Streams shard (inclusive)
// begginning of the token range is defined by previous Streams shard's token + 1
// beginning of the token range is defined by previous Streams shard's token + 1
// NOTE: With vnodes, ranges of Streams' shards wrap, while with tablets the biggest allowed token number is always a range end.
// NOTE: both streams generation are guaranteed to cover whole range and be non-empty
// NOTE: it's possible to get more than one stream shard with the same token value (thus some of those stream shards will be empty) -

View File

@@ -856,7 +856,9 @@ rest_exclude_node(sharded<service::storage_service>& ss, std::unique_ptr<http::r
}
apilog.info("exclude_node: hosts={}", hosts);
co_await ss.local().mark_excluded(hosts);
co_await ss.local().run_with_no_api_lock([hosts = std::move(hosts)] (service::storage_service& ss) {
return ss.mark_excluded(hosts);
});
co_return json_void();
}
@@ -1731,7 +1733,9 @@ rest_create_vnode_tablet_migration(http_context& ctx, sharded<service::storage_s
throw std::runtime_error("vnodes-to-tablets migration requires all nodes to support the VNODES_TO_TABLETS_MIGRATIONS cluster feature");
}
auto keyspace = validate_keyspace(ctx, req);
co_await ss.local().prepare_for_tablets_migration(keyspace);
co_await ss.local().run_with_no_api_lock([keyspace] (service::storage_service& ss) {
return ss.prepare_for_tablets_migration(keyspace);
});
co_return json_void();
}
@@ -1743,7 +1747,9 @@ rest_get_vnode_tablet_migration(http_context& ctx, sharded<service::storage_serv
throw std::runtime_error("vnodes-to-tablets migration requires all nodes to support the VNODES_TO_TABLETS_MIGRATIONS cluster feature");
}
auto keyspace = validate_keyspace(ctx, req);
auto status = co_await ss.local().get_tablets_migration_status_with_node_details(keyspace);
auto status = co_await ss.local().run_with_no_api_lock([keyspace] (service::storage_service& ss) {
return ss.get_tablets_migration_status_with_node_details(keyspace);
});
ss::vnode_tablet_migration_status result;
result.keyspace = status.keyspace;
@@ -1768,7 +1774,9 @@ rest_set_vnode_tablet_migration_node_storage_mode(http_context& ctx, sharded<ser
}
auto mode_str = req->get_query_param("intended_mode");
auto mode = service::intended_storage_mode_from_string(mode_str);
co_await ss.local().set_node_intended_storage_mode(mode);
co_await ss.local().run_with_no_api_lock([mode] (service::storage_service& ss) {
return ss.set_node_intended_storage_mode(mode);
});
co_return json_void();
}
@@ -1782,7 +1790,9 @@ rest_finalize_vnode_tablet_migration(http_context& ctx, sharded<service::storage
auto keyspace = validate_keyspace(ctx, req);
validate_keyspace(ctx, keyspace);
co_await ss.local().finalize_tablets_migration(keyspace);
co_await ss.local().run_with_no_api_lock([keyspace] (service::storage_service& ss) {
return ss.finalize_tablets_migration(keyspace);
});
co_return json_void();
}
@@ -1859,90 +1869,106 @@ rest_bind(FuncType func, BindArgs&... args) {
return std::bind_front(func, std::ref(args)...);
}
// Hold the storage_service async gate for the duration of async REST
// handlers so stop() drains in-flight requests before teardown.
// Synchronous handlers don't yield and need no gate.
static seastar::httpd::future_json_function
gated(sharded<service::storage_service>& ss, seastar::httpd::future_json_function fn) {
return [fn = std::move(fn), &ss](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto holder = ss.local().hold_async_gate();
co_return co_await fn(std::move(req));
};
}
static seastar::httpd::json_request_function
gated(sharded<service::storage_service>&, seastar::httpd::json_request_function fn) {
return fn;
}
void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& ssc, service::raft_group0_client& group0_client) {
ss::get_token_endpoint.set(r, rest_bind(rest_get_token_endpoint, ctx, ss));
ss::get_release_version.set(r, rest_bind(rest_get_release_version, ss));
ss::get_scylla_release_version.set(r, rest_bind(rest_get_scylla_release_version, ss));
ss::get_schema_version.set(r, rest_bind(rest_get_schema_version, ss));
ss::get_range_to_endpoint_map.set(r, rest_bind(rest_get_range_to_endpoint_map, ctx, ss));
ss::get_pending_range_to_endpoint_map.set(r, rest_bind(rest_get_pending_range_to_endpoint_map, ctx));
ss::describe_ring.set(r, rest_bind(rest_describe_ring, ctx, ss));
ss::get_current_generation_number.set(r, rest_bind(rest_get_current_generation_number, ss));
ss::get_natural_endpoints.set(r, rest_bind(rest_get_natural_endpoints, ctx, ss));
ss::get_natural_endpoints_v2.set(r, rest_bind(rest_get_natural_endpoints_v2, ctx, ss));
ss::cdc_streams_check_and_repair.set(r, rest_bind(rest_cdc_streams_check_and_repair, ss));
ss::cleanup_all.set(r, rest_bind(rest_cleanup_all, ctx, ss));
ss::reset_cleanup_needed.set(r, rest_bind(rest_reset_cleanup_needed, ctx, ss));
ss::force_flush.set(r, rest_bind(rest_force_flush, ctx));
ss::force_keyspace_flush.set(r, rest_bind(rest_force_keyspace_flush, ctx));
ss::decommission.set(r, rest_bind(rest_decommission, ss, ssc));
ss::logstor_compaction.set(r, rest_bind(rest_logstor_compaction, ctx));
ss::logstor_flush.set(r, rest_bind(rest_logstor_flush, ctx));
ss::move.set(r, rest_bind(rest_move, ss));
ss::remove_node.set(r, rest_bind(rest_remove_node, ss));
ss::exclude_node.set(r, rest_bind(rest_exclude_node, ss));
ss::get_removal_status.set(r, rest_bind(rest_get_removal_status, ss));
ss::force_remove_completion.set(r, rest_bind(rest_force_remove_completion, ss));
ss::set_logging_level.set(r, rest_bind(rest_set_logging_level));
ss::get_logging_levels.set(r, rest_bind(rest_get_logging_levels));
ss::get_operation_mode.set(r, rest_bind(rest_get_operation_mode, ss));
ss::is_starting.set(r, rest_bind(rest_is_starting, ss));
ss::get_drain_progress.set(r, rest_bind(rest_get_drain_progress, ss));
ss::drain.set(r, rest_bind(rest_drain, ss));
ss::stop_gossiping.set(r, rest_bind(rest_stop_gossiping, ss));
ss::start_gossiping.set(r, rest_bind(rest_start_gossiping, ss));
ss::is_gossip_running.set(r, rest_bind(rest_is_gossip_running, ss));
ss::stop_daemon.set(r, rest_bind(rest_stop_daemon));
ss::is_initialized.set(r, rest_bind(rest_is_initialized, ss));
ss::join_ring.set(r, rest_bind(rest_join_ring));
ss::is_joined.set(r, rest_bind(rest_is_joined, ss));
ss::is_incremental_backups_enabled.set(r, rest_bind(rest_is_incremental_backups_enabled, ctx));
ss::set_incremental_backups_enabled.set(r, rest_bind(rest_set_incremental_backups_enabled, ctx));
ss::rebuild.set(r, rest_bind(rest_rebuild, ss));
ss::bulk_load.set(r, rest_bind(rest_bulk_load));
ss::bulk_load_async.set(r, rest_bind(rest_bulk_load_async));
ss::reschedule_failed_deletions.set(r, rest_bind(rest_reschedule_failed_deletions));
ss::sample_key_range.set(r, rest_bind(rest_sample_key_range));
ss::reset_local_schema.set(r, rest_bind(rest_reset_local_schema, ss));
ss::set_trace_probability.set(r, rest_bind(rest_set_trace_probability));
ss::get_trace_probability.set(r, rest_bind(rest_get_trace_probability));
ss::get_slow_query_info.set(r, rest_bind(rest_get_slow_query_info));
ss::set_slow_query.set(r, rest_bind(rest_set_slow_query));
ss::deliver_hints.set(r, rest_bind(rest_deliver_hints));
ss::get_cluster_name.set(r, rest_bind(rest_get_cluster_name, ss));
ss::get_partitioner_name.set(r, rest_bind(rest_get_partitioner_name, ss));
ss::get_tombstone_warn_threshold.set(r, rest_bind(rest_get_tombstone_warn_threshold));
ss::set_tombstone_warn_threshold.set(r, rest_bind(rest_set_tombstone_warn_threshold));
ss::get_tombstone_failure_threshold.set(r, rest_bind(rest_get_tombstone_failure_threshold));
ss::set_tombstone_failure_threshold.set(r, rest_bind(rest_set_tombstone_failure_threshold));
ss::get_batch_size_failure_threshold.set(r, rest_bind(rest_get_batch_size_failure_threshold));
ss::set_batch_size_failure_threshold.set(r, rest_bind(rest_set_batch_size_failure_threshold));
ss::set_hinted_handoff_throttle_in_kb.set(r, rest_bind(rest_set_hinted_handoff_throttle_in_kb));
ss::get_exceptions.set(r, rest_bind(rest_get_exceptions, ss));
ss::get_total_hints_in_progress.set(r, rest_bind(rest_get_total_hints_in_progress));
ss::get_total_hints.set(r, rest_bind(rest_get_total_hints));
ss::get_ownership.set(r, rest_bind(rest_get_ownership, ctx, ss));
ss::get_effective_ownership.set(r, rest_bind(rest_get_effective_ownership, ctx, ss));
ss::retrain_dict.set(r, rest_bind(rest_retrain_dict, ctx, ss, group0_client));
ss::estimate_compression_ratios.set(r, rest_bind(rest_estimate_compression_ratios, ctx, ss));
ss::sstable_info.set(r, rest_bind(rest_sstable_info, ctx));
ss::logstor_info.set(r, rest_bind(rest_logstor_info, ctx));
ss::reload_raft_topology_state.set(r, rest_bind(rest_reload_raft_topology_state, ss, group0_client));
ss::upgrade_to_raft_topology.set(r, rest_bind(rest_upgrade_to_raft_topology, ss));
ss::raft_topology_upgrade_status.set(r, rest_bind(rest_raft_topology_upgrade_status, ss));
ss::raft_topology_get_cmd_status.set(r, rest_bind(rest_raft_topology_get_cmd_status, ss));
ss::move_tablet.set(r, rest_bind(rest_move_tablet, ctx, ss));
ss::add_tablet_replica.set(r, rest_bind(rest_add_tablet_replica, ctx, ss));
ss::del_tablet_replica.set(r, rest_bind(rest_del_tablet_replica, ctx, ss));
ss::repair_tablet.set(r, rest_bind(rest_repair_tablet, ctx, ss));
ss::tablet_balancing_enable.set(r, rest_bind(rest_tablet_balancing_enable, ss));
ss::create_vnode_tablet_migration.set(r, rest_bind(rest_create_vnode_tablet_migration, ctx, ss));
ss::get_vnode_tablet_migration.set(r, rest_bind(rest_get_vnode_tablet_migration, ctx, ss));
ss::set_vnode_tablet_migration_node_storage_mode.set(r, rest_bind(rest_set_vnode_tablet_migration_node_storage_mode, ctx, ss));
ss::finalize_vnode_tablet_migration.set(r, rest_bind(rest_finalize_vnode_tablet_migration, ctx, ss));
ss::quiesce_topology.set(r, rest_bind(rest_quiesce_topology, ss));
sp::get_schema_versions.set(r, rest_bind(rest_get_schema_versions, ss));
ss::drop_quarantined_sstables.set(r, rest_bind(rest_drop_quarantined_sstables, ctx, ss));
ss::get_token_endpoint.set(r, gated(ss, rest_bind(rest_get_token_endpoint, ctx, ss)));
ss::get_release_version.set(r, gated(ss, rest_bind(rest_get_release_version, ss)));
ss::get_scylla_release_version.set(r, gated(ss, rest_bind(rest_get_scylla_release_version, ss)));
ss::get_schema_version.set(r, gated(ss, rest_bind(rest_get_schema_version, ss)));
ss::get_range_to_endpoint_map.set(r, gated(ss, rest_bind(rest_get_range_to_endpoint_map, ctx, ss)));
ss::get_pending_range_to_endpoint_map.set(r, gated(ss, rest_bind(rest_get_pending_range_to_endpoint_map, ctx)));
ss::describe_ring.set(r, gated(ss, rest_bind(rest_describe_ring, ctx, ss)));
ss::get_current_generation_number.set(r, gated(ss, rest_bind(rest_get_current_generation_number, ss)));
ss::get_natural_endpoints.set(r, gated(ss, rest_bind(rest_get_natural_endpoints, ctx, ss)));
ss::get_natural_endpoints_v2.set(r, gated(ss, rest_bind(rest_get_natural_endpoints_v2, ctx, ss)));
ss::cdc_streams_check_and_repair.set(r, gated(ss, rest_bind(rest_cdc_streams_check_and_repair, ss)));
ss::cleanup_all.set(r, gated(ss, rest_bind(rest_cleanup_all, ctx, ss)));
ss::reset_cleanup_needed.set(r, gated(ss, rest_bind(rest_reset_cleanup_needed, ctx, ss)));
ss::force_flush.set(r, gated(ss, rest_bind(rest_force_flush, ctx)));
ss::force_keyspace_flush.set(r, gated(ss, rest_bind(rest_force_keyspace_flush, ctx)));
ss::decommission.set(r, gated(ss, rest_bind(rest_decommission, ss, ssc)));
ss::logstor_compaction.set(r, gated(ss, rest_bind(rest_logstor_compaction, ctx)));
ss::logstor_flush.set(r, gated(ss, rest_bind(rest_logstor_flush, ctx)));
ss::move.set(r, gated(ss, rest_bind(rest_move, ss)));
ss::remove_node.set(r, gated(ss, rest_bind(rest_remove_node, ss)));
ss::exclude_node.set(r, gated(ss, rest_bind(rest_exclude_node, ss)));
ss::get_removal_status.set(r, gated(ss, rest_bind(rest_get_removal_status, ss)));
ss::force_remove_completion.set(r, gated(ss, rest_bind(rest_force_remove_completion, ss)));
ss::set_logging_level.set(r, gated(ss, rest_bind(rest_set_logging_level)));
ss::get_logging_levels.set(r, gated(ss, rest_bind(rest_get_logging_levels)));
ss::get_operation_mode.set(r, gated(ss, rest_bind(rest_get_operation_mode, ss)));
ss::is_starting.set(r, gated(ss, rest_bind(rest_is_starting, ss)));
ss::get_drain_progress.set(r, gated(ss, rest_bind(rest_get_drain_progress, ss)));
ss::drain.set(r, gated(ss, rest_bind(rest_drain, ss)));
ss::stop_gossiping.set(r, gated(ss, rest_bind(rest_stop_gossiping, ss)));
ss::start_gossiping.set(r, gated(ss, rest_bind(rest_start_gossiping, ss)));
ss::is_gossip_running.set(r, gated(ss, rest_bind(rest_is_gossip_running, ss)));
ss::stop_daemon.set(r, gated(ss, rest_bind(rest_stop_daemon)));
ss::is_initialized.set(r, gated(ss, rest_bind(rest_is_initialized, ss)));
ss::join_ring.set(r, gated(ss, rest_bind(rest_join_ring)));
ss::is_joined.set(r, gated(ss, rest_bind(rest_is_joined, ss)));
ss::is_incremental_backups_enabled.set(r, gated(ss, rest_bind(rest_is_incremental_backups_enabled, ctx)));
ss::set_incremental_backups_enabled.set(r, gated(ss, rest_bind(rest_set_incremental_backups_enabled, ctx)));
ss::rebuild.set(r, gated(ss, rest_bind(rest_rebuild, ss)));
ss::bulk_load.set(r, gated(ss, rest_bind(rest_bulk_load)));
ss::bulk_load_async.set(r, gated(ss, rest_bind(rest_bulk_load_async)));
ss::reschedule_failed_deletions.set(r, gated(ss, rest_bind(rest_reschedule_failed_deletions)));
ss::sample_key_range.set(r, gated(ss, rest_bind(rest_sample_key_range)));
ss::reset_local_schema.set(r, gated(ss, rest_bind(rest_reset_local_schema, ss)));
ss::set_trace_probability.set(r, gated(ss, rest_bind(rest_set_trace_probability)));
ss::get_trace_probability.set(r, gated(ss, rest_bind(rest_get_trace_probability)));
ss::get_slow_query_info.set(r, gated(ss, rest_bind(rest_get_slow_query_info)));
ss::set_slow_query.set(r, gated(ss, rest_bind(rest_set_slow_query)));
ss::deliver_hints.set(r, gated(ss, rest_bind(rest_deliver_hints)));
ss::get_cluster_name.set(r, gated(ss, rest_bind(rest_get_cluster_name, ss)));
ss::get_partitioner_name.set(r, gated(ss, rest_bind(rest_get_partitioner_name, ss)));
ss::get_tombstone_warn_threshold.set(r, gated(ss, rest_bind(rest_get_tombstone_warn_threshold)));
ss::set_tombstone_warn_threshold.set(r, gated(ss, rest_bind(rest_set_tombstone_warn_threshold)));
ss::get_tombstone_failure_threshold.set(r, gated(ss, rest_bind(rest_get_tombstone_failure_threshold)));
ss::set_tombstone_failure_threshold.set(r, gated(ss, rest_bind(rest_set_tombstone_failure_threshold)));
ss::get_batch_size_failure_threshold.set(r, gated(ss, rest_bind(rest_get_batch_size_failure_threshold)));
ss::set_batch_size_failure_threshold.set(r, gated(ss, rest_bind(rest_set_batch_size_failure_threshold)));
ss::set_hinted_handoff_throttle_in_kb.set(r, gated(ss, rest_bind(rest_set_hinted_handoff_throttle_in_kb)));
ss::get_exceptions.set(r, gated(ss, rest_bind(rest_get_exceptions, ss)));
ss::get_total_hints_in_progress.set(r, gated(ss, rest_bind(rest_get_total_hints_in_progress)));
ss::get_total_hints.set(r, gated(ss, rest_bind(rest_get_total_hints)));
ss::get_ownership.set(r, gated(ss, rest_bind(rest_get_ownership, ctx, ss)));
ss::get_effective_ownership.set(r, gated(ss, rest_bind(rest_get_effective_ownership, ctx, ss)));
ss::retrain_dict.set(r, gated(ss, rest_bind(rest_retrain_dict, ctx, ss, group0_client)));
ss::estimate_compression_ratios.set(r, gated(ss, rest_bind(rest_estimate_compression_ratios, ctx, ss)));
ss::sstable_info.set(r, gated(ss, rest_bind(rest_sstable_info, ctx)));
ss::logstor_info.set(r, gated(ss, rest_bind(rest_logstor_info, ctx)));
ss::reload_raft_topology_state.set(r, gated(ss, rest_bind(rest_reload_raft_topology_state, ss, group0_client)));
ss::upgrade_to_raft_topology.set(r, gated(ss, rest_bind(rest_upgrade_to_raft_topology, ss)));
ss::raft_topology_upgrade_status.set(r, gated(ss, rest_bind(rest_raft_topology_upgrade_status, ss)));
ss::raft_topology_get_cmd_status.set(r, gated(ss, rest_bind(rest_raft_topology_get_cmd_status, ss)));
ss::move_tablet.set(r, gated(ss, rest_bind(rest_move_tablet, ctx, ss)));
ss::add_tablet_replica.set(r, gated(ss, rest_bind(rest_add_tablet_replica, ctx, ss)));
ss::del_tablet_replica.set(r, gated(ss, rest_bind(rest_del_tablet_replica, ctx, ss)));
ss::repair_tablet.set(r, gated(ss, rest_bind(rest_repair_tablet, ctx, ss)));
ss::tablet_balancing_enable.set(r, gated(ss, rest_bind(rest_tablet_balancing_enable, ss)));
ss::create_vnode_tablet_migration.set(r, gated(ss, rest_bind(rest_create_vnode_tablet_migration, ctx, ss)));
ss::get_vnode_tablet_migration.set(r, gated(ss, rest_bind(rest_get_vnode_tablet_migration, ctx, ss)));
ss::set_vnode_tablet_migration_node_storage_mode.set(r, gated(ss, rest_bind(rest_set_vnode_tablet_migration_node_storage_mode, ctx, ss)));
ss::finalize_vnode_tablet_migration.set(r, gated(ss, rest_bind(rest_finalize_vnode_tablet_migration, ctx, ss)));
ss::quiesce_topology.set(r, gated(ss, rest_bind(rest_quiesce_topology, ss)));
sp::get_schema_versions.set(r, gated(ss, rest_bind(rest_get_schema_versions, ss)));
ss::drop_quarantined_sstables.set(r, gated(ss, rest_bind(rest_drop_quarantined_sstables, ctx, ss)));
}
void unset_storage_service(http_context& ctx, routes& r) {
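The hunk above re-registers every storage_service REST endpoint with a gated(ss, ...) wrapper in place of the bare rest_bind() call. A minimal sketch of what such a wrapper could look like, assuming it drains requests through a seastar::gate; the bare gate parameter and lambda shape here are illustrative, not the actual gated() signature (which takes the storage_service itself):

#include <seastar/core/gate.hh>
#include <utility>

// Illustrative only: a bare seastar::gate stands in for whatever
// shutdown gate the real gated() consults on the storage_service.
template <typename Handler>
auto gated(seastar::gate& g, Handler h) {
    return [&g, h = std::move(h)] (auto&&... args) {
        // hold() throws gate_closed_exception once close() has been
        // called, so new REST requests are rejected while in-flight
        // ones are allowed to drain.
        auto holder = g.hold();
        return h(std::forward<decltype(args)>(args)...)
            .finally([holder = std::move(holder)] {});
    };
}

Wrapping at registration time keeps the per-endpoint handlers themselves unchanged, which is presumably why the diff touches only the set() calls.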

View File

@@ -113,8 +113,8 @@ static category_set parse_audit_categories(const sstring& data) {
return result;
}
static std::map<sstring, std::set<sstring>> parse_audit_tables(const sstring& data) {
std::map<sstring, std::set<sstring>> result;
static audit::audited_tables_t parse_audit_tables(const sstring& data) {
audit::audited_tables_t result;
if (!data.empty()) {
std::vector<sstring> tokens;
boost::split(tokens, data, boost::is_any_of(","));
@@ -139,8 +139,8 @@ static std::map<sstring, std::set<sstring>> parse_audit_tables(const sstring& da
return result;
}
static std::set<sstring> parse_audit_keyspaces(const sstring& data) {
std::set<sstring> result;
static audit::audited_keyspaces_t parse_audit_keyspaces(const sstring& data) {
audit::audited_keyspaces_t result;
if (!data.empty()) {
std::vector<sstring> tokens;
boost::split(tokens, data, boost::is_any_of(","));
@@ -156,8 +156,8 @@ audit::audit(locator::shared_token_metadata& token_metadata,
cql3::query_processor& qp,
service::migration_manager& mm,
std::set<sstring>&& audit_modes,
std::set<sstring>&& audited_keyspaces,
std::map<sstring, std::set<sstring>>&& audited_tables,
audited_keyspaces_t&& audited_keyspaces,
audited_tables_t&& audited_tables,
category_set&& audited_categories,
const db::config& cfg)
: _token_metadata(token_metadata)
@@ -165,8 +165,8 @@ audit::audit(locator::shared_token_metadata& token_metadata,
, _audited_tables(std::move(audited_tables))
, _audited_categories(std::move(audited_categories))
, _cfg(cfg)
, _cfg_keyspaces_observer(cfg.audit_keyspaces.observe([this] (sstring const& new_value){ update_config<std::set<sstring>>(new_value, parse_audit_keyspaces, _audited_keyspaces); }))
, _cfg_tables_observer(cfg.audit_tables.observe([this] (sstring const& new_value){ update_config<std::map<sstring, std::set<sstring>>>(new_value, parse_audit_tables, _audited_tables); }))
, _cfg_keyspaces_observer(cfg.audit_keyspaces.observe([this] (sstring const& new_value){ update_config<audited_keyspaces_t>(new_value, parse_audit_keyspaces, _audited_keyspaces); }))
, _cfg_tables_observer(cfg.audit_tables.observe([this] (sstring const& new_value){ update_config<audited_tables_t>(new_value, parse_audit_tables, _audited_tables); }))
, _cfg_categories_observer(cfg.audit_categories.observe([this] (sstring const& new_value){ update_config<category_set>(new_value, parse_audit_categories, _audited_categories); }))
{
_storage_helper_ptr = create_storage_helper(std::move(audit_modes), qp, mm);
@@ -181,8 +181,8 @@ future<> audit::start_audit(const db::config& cfg, sharded<locator::shared_token
return make_ready_future<>();
}
category_set audited_categories = parse_audit_categories(cfg.audit_categories());
std::map<sstring, std::set<sstring>> audited_tables = parse_audit_tables(cfg.audit_tables());
std::set<sstring> audited_keyspaces = parse_audit_keyspaces(cfg.audit_keyspaces());
audit::audited_tables_t audited_tables = parse_audit_tables(cfg.audit_tables());
audit::audited_keyspaces_t audited_keyspaces = parse_audit_keyspaces(cfg.audit_keyspaces());
logger.info("Audit is enabled. Auditing to: \"{}\", with the following categories: \"{}\", keyspaces: \"{}\", and tables: \"{}\"",
cfg.audit(), cfg.audit_categories(), cfg.audit_keyspaces(), cfg.audit_tables());
@@ -304,7 +304,7 @@ future<> inspect_login(const sstring& username, socket_address client_ip, bool e
return audit::local_audit_instance().log_login(username, client_ip, error);
}
bool audit::should_log_table(const sstring& keyspace, const sstring& name) const {
bool audit::should_log_table(std::string_view keyspace, std::string_view name) const {
auto keyspace_it = _audited_tables.find(keyspace);
return keyspace_it != _audited_tables.cend() && keyspace_it->second.find(name) != keyspace_it->second.cend();
}
@@ -319,8 +319,8 @@ bool audit::will_log(statement_category cat, std::string_view keyspace, std::str
// so it is logged whenever the category matches.
return _audited_categories.contains(cat)
&& (keyspace.empty()
|| _audited_keyspaces.find(sstring(keyspace)) != _audited_keyspaces.cend()
|| should_log_table(sstring(keyspace), sstring(table))
|| _audited_keyspaces.find(keyspace) != _audited_keyspaces.cend()
|| should_log_table(keyspace, table)
|| cat == statement_category::AUTH
|| cat == statement_category::ADMIN
|| cat == statement_category::DCL);

View File

@@ -129,10 +129,15 @@ public:
class storage_helper;
class audit final : public seastar::async_sharded_service<audit> {
public:
// Transparent comparator (std::less<>) enables heterogeneous lookup with
// string_view keys.
using audited_keyspaces_t = std::set<sstring, std::less<>>;
using audited_tables_t = std::map<sstring, std::set<sstring, std::less<>>, std::less<>>;
private:
locator::shared_token_metadata& _token_metadata;
std::set<sstring> _audited_keyspaces;
// Maps keyspace name to set of table names in that keyspace
std::map<sstring, std::set<sstring>> _audited_tables;
audited_keyspaces_t _audited_keyspaces;
audited_tables_t _audited_tables;
category_set _audited_categories;
std::unique_ptr<storage_helper> _storage_helper_ptr;
@@ -145,7 +150,7 @@ class audit final : public seastar::async_sharded_service<audit> {
template<class T>
void update_config(const sstring & new_value, std::function<T(const sstring&)> parse_func, T& cfg_parameter);
bool should_log_table(const sstring& keyspace, const sstring& name) const;
bool should_log_table(std::string_view keyspace, std::string_view name) const;
public:
static seastar::sharded<audit>& audit_instance() {
// FIXME: leaked intentionally to avoid shutdown problems, see #293
@@ -164,8 +169,8 @@ public:
cql3::query_processor& qp,
service::migration_manager& mm,
std::set<sstring>&& audit_modes,
std::set<sstring>&& audited_keyspaces,
std::map<sstring, std::set<sstring>>&& audited_tables,
audited_keyspaces_t&& audited_keyspaces,
audited_tables_t&& audited_tables,
category_set&& audited_categories,
const db::config& cfg);
~audit();
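The std::less<> comment above is the crux of the audited_keyspaces_t / audited_tables_t change: a transparent comparator lets will_log() and should_log_table() probe the sets with string_view keys instead of materializing an sstring per lookup. A minimal self-contained sketch, with std::string standing in for sstring:

#include <set>
#include <string>
#include <string_view>

int main() {
    // std::less<> is transparent (defines is_transparent), so find()
    // accepts any type comparable with the key type; no temporary
    // std::string is constructed for the probe.
    std::set<std::string, std::less<>> keyspaces{"ks1", "ks2"};
    std::string_view probe = "ks1";
    bool audited = keyspaces.find(probe) != keyspaces.end();
    return audited ? 0 : 1;
}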

View File

@@ -1625,7 +1625,7 @@ struct process_change_visitor {
if (_enable_updating_state) {
if (_request_options.alternator && _alternator_schema_has_no_clustering_key && _clustering_row_states.empty()) {
// Alternator's table can be with or without clustering key. If the clustering key exists,
// delete request will be `clustered_row_delete` and will be hanlded there.
// delete request will be `clustered_row_delete` and will be handled there.
// If the clustering key doesn't exist, delete request will be `partition_delete` and will be handled here.
// The no-clustering-key case is slightly tricky, because insert of such item is handled by `clustered_row_cells`
// and has some value as clustering_key (the value currently seems to be empty bytes object).
@@ -1933,7 +1933,7 @@ public:
if (_options.alternator && !_alternator_clustering_keys_to_ignore.empty()) {
// we filter mutations for Alternator's changes here.
// We do it per mutation object (user might submit a batch of those in one go
// and some might be splitted because of different timestamps),
// and some might be split because of different timestamps),
// ignore key set is cleared afterwards.
// If single mutation object contains two separate changes to the same row
// and at least one of them is ignored, all of them will be ignored.

View File

@@ -240,7 +240,7 @@ static max_purgeable get_max_purgeable_timestamp(const compaction_group_view& ta
// and if the memtable also contains the key we're calculating max purgeable timestamp for.
// First condition helps to not penalize the common scenario where memtable only contains
// newer data.
if (memtable_min_timestamp <= compacting_max_timestamp && table_s.memtable_has_key(dk)) {
if (!table_s.skip_memtable_for_tombstone_gc() && memtable_min_timestamp <= compacting_max_timestamp && table_s.memtable_has_key(dk)) {
timestamp = memtable_min_timestamp;
source = max_purgeable::timestamp_source::memtable_possibly_shadowing_data;
}

View File

@@ -39,6 +39,9 @@ public:
virtual future<lw_shared_ptr<const sstables::sstable_set>> main_sstable_set() const = 0;
virtual future<lw_shared_ptr<const sstables::sstable_set>> maintenance_sstable_set() const = 0;
virtual lw_shared_ptr<const sstables::sstable_set> sstable_set_for_tombstone_gc() const = 0;
// Returns true when tombstone GC considers only the repaired sstable set, meaning the
// memtable does not need to be consulted (its data is always newer than any GC-eligible tombstone).
virtual bool skip_memtable_for_tombstone_gc() const noexcept = 0;
virtual std::unordered_set<sstables::shared_sstable> fully_expired_sstables(const std::vector<sstables::shared_sstable>& sstables, gc_clock::time_point compaction_time) const = 0;
virtual const std::vector<sstables::shared_sstable>& compacted_undeleted_sstables() const noexcept = 0;
virtual compaction_strategy& get_compaction_strategy() const noexcept = 0;
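A tiny sketch of the decision the two hunks above combine into, under the stated assumption that a repaired-set-only tombstone GC mode makes every memtable write newer than any GC-eligible tombstone (names are illustrative; the real check additionally probes memtable_has_key()):

#include <cstdint>

// Consult the memtable for max-purgeable only when it could actually
// hold data old enough to be shadowed by a purgeable tombstone.
bool may_consult_memtable(bool skip_memtable_for_tombstone_gc,
                          int64_t memtable_min_timestamp,
                          int64_t compacting_max_timestamp) {
    return !skip_memtable_for_tombstone_gc
        && memtable_min_timestamp <= compacting_max_timestamp;
}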

View File

@@ -406,7 +406,11 @@ commitlog_total_space_in_mb: -1
# In short, `ms` needs more CPU during sstable writes,
# but should behave better during reads,
# although it might behave worse for very long clustering keys.
#
# `ms` sstable format works even better with `column_index_size_in_kb` set to 1,
# so keep those two settings in sync (either both set, or both unset).
sstable_format: ms
column_index_size_in_kb: 1
# Auto-scaling of the promoted index prevents running out of memory
# when the promoted index grows too large (due to partitions with many rows

View File

@@ -2769,25 +2769,6 @@ def write_build_file(f,
f.write('build {}: rust_source {}\n'.format(cc, src))
obj = cc.replace('.cc', '.o')
compiles[obj] = cc
# Sources shared between scylla (compiled with PCH) and small tests
# (with custom deps and partial link sets) must not use the PCH,
# because -fpch-instantiate-templates injects symbol references that
# the small test link sets cannot satisfy.
small_test_srcs = set()
for test_binary, test_deps in deps.items():
if not test_binary.startswith('test/'):
continue
# Only exclude PCH for tests with truly small/partial link sets.
# Tests that include scylla_core or similar large dep sets link
# against enough objects to satisfy PCH-injected symbol refs.
if len(test_deps) > 50:
continue
for src in test_deps:
if src.endswith('.cc'):
small_test_srcs.add(src)
for src in small_test_srcs:
obj = '$builddir/' + mode + '/' + src.replace('.cc', '.o')
compiles_with_pch.discard(obj)
for obj in compiles:
src = compiles[obj]
seastar_dep = f'$builddir/{mode}/seastar/libseastar.{seastar_lib_ext}'

View File

@@ -1,84 +0,0 @@
/*
* Copyright (C) 2017-present ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* SPDX-License-Identifier: (LicenseRef-ScyllaDB-Source-Available-1.1 and Apache-2.0)
*/
#pragma once
#include "bytes.hh"
#include "utils/hash.hh"
#include "cql3/dialect.hh"
namespace cql3 {
typedef bytes cql_prepared_id_type;
/// \brief The key of the prepared statements cache
///
/// TODO: consolidate prepared_cache_key_type and the nested cache_key_type
/// the latter was introduced for unifying the CQL and Thrift prepared
/// statements so that they can be stored in the same cache.
class prepared_cache_key_type {
public:
// derive from cql_prepared_id_type so we can customize the formatter of
// cache_key_type
struct cache_key_type : public cql_prepared_id_type {
cache_key_type(cql_prepared_id_type&& id, cql3::dialect d) : cql_prepared_id_type(std::move(id)), dialect(d) {}
cql3::dialect dialect; // Not part of hash, but we don't expect collisions because of that
bool operator==(const cache_key_type& other) const = default;
};
private:
cache_key_type _key;
public:
explicit prepared_cache_key_type(cql_prepared_id_type cql_id, dialect d) : _key(std::move(cql_id), d) {}
cache_key_type& key() { return _key; }
const cache_key_type& key() const { return _key; }
static const cql_prepared_id_type& cql_id(const prepared_cache_key_type& key) {
return key.key();
}
bool operator==(const prepared_cache_key_type& other) const = default;
};
}
namespace std {
template<>
struct hash<cql3::prepared_cache_key_type::cache_key_type> final {
size_t operator()(const cql3::prepared_cache_key_type::cache_key_type& k) const {
return std::hash<cql3::cql_prepared_id_type>()(k);
}
};
template<>
struct hash<cql3::prepared_cache_key_type> final {
size_t operator()(const cql3::prepared_cache_key_type& k) const {
return std::hash<cql3::cql_prepared_id_type>()(k.key());
}
};
}
// for prepared_statements_cache log printouts
template <> struct fmt::formatter<cql3::prepared_cache_key_type::cache_key_type> {
constexpr auto parse(format_parse_context& ctx) { return ctx.begin(); }
auto format(const cql3::prepared_cache_key_type::cache_key_type& p, fmt::format_context& ctx) const {
return fmt::format_to(ctx.out(), "{{cql_id: {}, dialect: {}}}", static_cast<const cql3::cql_prepared_id_type&>(p), p.dialect);
}
};
template <> struct fmt::formatter<cql3::prepared_cache_key_type> {
constexpr auto parse(format_parse_context& ctx) { return ctx.begin(); }
auto format(const cql3::prepared_cache_key_type& p, fmt::format_context& ctx) const {
return fmt::format_to(ctx.out(), "{}", p.key());
}
};

View File

@@ -12,7 +12,6 @@
#include "utils/loading_cache.hh"
#include "utils/hash.hh"
#include "cql3/prepared_cache_key_type.hh"
#include "cql3/statements/prepared_statement.hh"
#include "cql3/column_specification.hh"
#include "cql3/dialect.hh"
@@ -28,6 +27,39 @@ struct prepared_cache_entry_size {
}
};
typedef bytes cql_prepared_id_type;
/// \brief The key of the prepared statements cache
///
/// TODO: consolidate prepared_cache_key_type and the nested cache_key_type
/// the latter was introduced for unifying the CQL and Thrift prepared
/// statements so that they can be stored in the same cache.
class prepared_cache_key_type {
public:
// derive from cql_prepared_id_type so we can customize the formatter of
// cache_key_type
struct cache_key_type : public cql_prepared_id_type {
cache_key_type(cql_prepared_id_type&& id, cql3::dialect d) : cql_prepared_id_type(std::move(id)), dialect(d) {}
cql3::dialect dialect; // Not part of hash, but we don't expect collisions because of that
bool operator==(const cache_key_type& other) const = default;
};
private:
cache_key_type _key;
public:
explicit prepared_cache_key_type(cql_prepared_id_type cql_id, dialect d) : _key(std::move(cql_id), d) {}
cache_key_type& key() { return _key; }
const cache_key_type& key() const { return _key; }
static const cql_prepared_id_type& cql_id(const prepared_cache_key_type& key) {
return key.key();
}
bool operator==(const prepared_cache_key_type& other) const = default;
};
class prepared_statements_cache {
public:
struct stats {
@@ -132,3 +164,35 @@ public:
}
};
}
namespace std {
template<>
struct hash<cql3::prepared_cache_key_type::cache_key_type> final {
size_t operator()(const cql3::prepared_cache_key_type::cache_key_type& k) const {
return std::hash<cql3::cql_prepared_id_type>()(k);
}
};
template<>
struct hash<cql3::prepared_cache_key_type> final {
size_t operator()(const cql3::prepared_cache_key_type& k) const {
return std::hash<cql3::cql_prepared_id_type>()(k.key());
}
};
}
// for prepared_statements_cache log printouts
template <> struct fmt::formatter<cql3::prepared_cache_key_type::cache_key_type> {
constexpr auto parse(format_parse_context& ctx) { return ctx.begin(); }
auto format(const cql3::prepared_cache_key_type::cache_key_type& p, fmt::format_context& ctx) const {
return fmt::format_to(ctx.out(), "{{cql_id: {}, dialect: {}}}", static_cast<const cql3::cql_prepared_id_type&>(p), p.dialect);
}
};
template <> struct fmt::formatter<cql3::prepared_cache_key_type> {
constexpr auto parse(format_parse_context& ctx) { return ctx.begin(); }
auto format(const cql3::prepared_cache_key_type& p, fmt::format_context& ctx) const {
return fmt::format_to(ctx.out(), "{}", p.key());
}
};

View File

@@ -17,9 +17,6 @@
#include <seastar/coroutine/as_future.hh>
#include <seastar/coroutine/try_future.hh>
#include "cql3/prepared_statements_cache.hh"
#include "cql3/authorized_prepared_statements_cache.hh"
#include "transport/messages/result_message.hh"
#include "service/storage_proxy.hh"
#include "service/migration_manager.hh"
#include "service/mapreduce_service.hh"
@@ -80,7 +77,7 @@ static service::query_state query_state_for_internal_call() {
return {service::client_state::for_internal_calls(), empty_service_permit()};
}
query_processor::query_processor(service::storage_proxy& proxy, data_dictionary::database db, service::migration_notifier& mn, vector_search::vector_store_client& vsc, query_processor::memory_config mcfg, cql_config& cql_cfg, const utils::loading_cache_config& auth_prep_cache_cfg, lang::manager& langm)
query_processor::query_processor(service::storage_proxy& proxy, data_dictionary::database db, service::migration_notifier& mn, vector_search::vector_store_client& vsc, query_processor::memory_config mcfg, cql_config& cql_cfg, utils::loading_cache_config auth_prep_cache_cfg, lang::manager& langm)
: _migration_subscriber{std::make_unique<migration_subscriber>(this)}
, _proxy(proxy)
, _db(db)
@@ -89,7 +86,7 @@ query_processor::query_processor(service::storage_proxy& proxy, data_dictionary:
, _mcfg(mcfg)
, _cql_config(cql_cfg)
, _prepared_cache(prep_cache_log, _mcfg.prepared_statment_cache_size)
, _authorized_prepared_cache(auth_prep_cache_cfg, authorized_prepared_statements_cache_log)
, _authorized_prepared_cache(std::move(auth_prep_cache_cfg), authorized_prepared_statements_cache_log)
, _auth_prepared_cache_cfg_cb([this] (uint32_t) { (void) _authorized_prepared_cache_config_action.trigger_later(); })
, _authorized_prepared_cache_config_action([this] { update_authorized_prepared_cache_config(); return make_ready_future<>(); })
, _authorized_prepared_cache_update_interval_in_ms_observer(_db.get_config().permissions_update_interval_in_ms.observe(_auth_prepared_cache_cfg_cb))
@@ -1077,7 +1074,7 @@ query_processor::execute_batch_without_checking_exception_message(
::shared_ptr<statements::batch_statement> batch,
service::query_state& query_state,
query_options& options,
std::unordered_map<prepared_cache_key_type, statements::prepared_statement::checked_weak_ptr> pending_authorization_entries) {
std::unordered_map<prepared_cache_key_type, authorized_prepared_statements_cache::value_type> pending_authorization_entries) {
auto access_future = co_await coroutine::as_future(batch->check_access(*this, query_state.get_client_state()));
bool failed = access_future.failed();
co_await audit::inspect(batch, query_state, options, failed);
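The auth_prep_cache_cfg change above is the standard sink-parameter idiom: take the config by value so lvalue callers pay one copy and rvalue callers none, then move it into the member. A minimal sketch with illustrative names:

#include <string>
#include <utility>

struct loading_cache_config { std::string section; /* ... */ };

struct authorized_cache {
    loading_cache_config _cfg;
    // By-value parameter + std::move replaces the old const& + copy:
    // callers passing a temporary now incur a move instead of a copy.
    explicit authorized_cache(loading_cache_config cfg)
        : _cfg(std::move(cfg)) {}
};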

View File

@@ -22,14 +22,13 @@
#include "cql3/statements/prepared_statement.hh"
#include "cql3/cql_statement.hh"
#include "cql3/dialect.hh"
#include "cql3/query_options.hh"
#include "cql3/stats.hh"
#include "exceptions/exceptions.hh"
#include "service/migration_listener.hh"
#include "mutation/timestamp.hh"
#include "transport/messages/result_message.hh"
#include "service/client_state.hh"
#include "service/broadcast_tables/experimental/query_result.hh"
#include "vector_search/vector_store_client.hh"
#include "utils/assert.hh"
#include "utils/observable.hh"
#include "utils/rolling_max_tracker.hh"
@@ -42,9 +41,6 @@
namespace lang { class manager; }
namespace vector_search {
class vector_store_client;
}
namespace service {
class migration_manager;
class query_state;
@@ -62,9 +58,6 @@ struct query;
namespace cql3 {
class prepared_statements_cache;
class authorized_prepared_statements_cache;
namespace statements {
class batch_statement;
class schema_altering_statement;
@@ -191,7 +184,7 @@ public:
static std::vector<std::unique_ptr<statements::raw::parsed_statement>> parse_statements(std::string_view queries, dialect d);
query_processor(service::storage_proxy& proxy, data_dictionary::database db, service::migration_notifier& mn, vector_search::vector_store_client& vsc,
memory_config mcfg, cql_config& cql_cfg, const utils::loading_cache_config& auth_prep_cache_cfg, lang::manager& langm);
memory_config mcfg, cql_config& cql_cfg, utils::loading_cache_config auth_prep_cache_cfg, lang::manager& langm);
~query_processor();
@@ -481,7 +474,7 @@ public:
::shared_ptr<statements::batch_statement> stmt,
service::query_state& query_state,
query_options& options,
std::unordered_map<prepared_cache_key_type, statements::prepared_statement::checked_weak_ptr> pending_authorization_entries) {
std::unordered_map<prepared_cache_key_type, authorized_prepared_statements_cache::value_type> pending_authorization_entries) {
return execute_batch_without_checking_exception_message(
std::move(stmt),
query_state,
@@ -497,7 +490,7 @@ public:
::shared_ptr<statements::batch_statement>,
service::query_state& query_state,
query_options& options,
std::unordered_map<prepared_cache_key_type, statements::prepared_statement::checked_weak_ptr> pending_authorization_entries);
std::unordered_map<prepared_cache_key_type, authorized_prepared_statements_cache::value_type> pending_authorization_entries);
future<service::broadcast_tables::query_result>
execute_broadcast_table_query(const service::broadcast_tables::query&);

File diff suppressed because it is too large

View File

@@ -23,15 +23,113 @@ namespace cql3 {
namespace restrictions {
/// A set of discrete values.
using value_list = std::vector<managed_bytes>; // Sorted and deduped using value comparator.
/// General set of values. Empty set and single-element sets are always value_list. interval is
/// never singular and never has start > end. The universal set is an interval with both bounds null.
using value_set = std::variant<value_list, interval<managed_bytes>>;
// For some boolean expression (say (X = 3) = TRUE), this represents a function that solves for X.
// (here, it would return 3). The expression is obtained by equating some factors of the WHERE
// clause to TRUE.
using solve_for_t = std::function<value_set (const query_options&)>;
struct on_row {
bool operator==(const on_row&) const = default;
};
struct on_column {
const column_definition* column;
bool operator==(const on_column&) const = default;
};
// Placeholder type indicating we're solving for the partition key token.
struct on_partition_key_token {
const ::schema* schema;
bool operator==(const on_partition_key_token&) const = default;
};
struct on_clustering_key_prefix {
std::vector<const column_definition*> columns;
bool operator==(const on_clustering_key_prefix&) const = default;
};
// A predicate on a column or a combination of columns. The WHERE clause analyzer
// will attempt to convert predicates (that return true or false for a particular row)
// to solvers (that return the set of column values that satisfy the predicate) when possible.
struct predicate {
// A function that returns the set of values that satisfy the filter. Can be unset,
// in which case the filter must be interpreted.
solve_for_t solve_for;
// The original filter for this column.
expr::expression filter;
// What column the predicate can be solved for
std::variant<
on_row, // cannot determine, so predicate is on entire row
on_column, // solving for a single column: e.g. c1 = 3
on_partition_key_token, // solving for the token, e.g. token(pk1, pk2) >= :var
on_clustering_key_prefix // solving for a clustering key prefix: e.g. (ck1, ck2) >= (3, 4)
> on;
// Whether the returned value_set will resolve to a single value.
bool is_singleton = false;
// Whether the returned value_set follows CQL comparison semantics
bool comparable = true;
bool is_multi_column = false;
bool is_not_null_single_column = false;
bool equality = false; // operator is EQ
bool is_in = false; // operator is IN
bool is_slice = false; // operator is LT/LTE/GT/GTE
bool is_upper_bound = false; // operator is LT/LTE
bool is_lower_bound = false; // operator is GT/GTE
expr::comparison_order order = expr::comparison_order::cql;
std::optional<expr::oper_t> op; // the binary operator, if any
bool is_subscript = false; // whether the LHS is a subscript (map element access)
};
/// In some cases checking if columns have indexes is undesired or even
/// impossible, because e.g. the query runs on a pseudo-table, which does not
/// have an index-manager, or even a table object.
using check_indexes = bool_class<class check_indexes_tag>;
// A function that returns the partition key ranges for a query. It is the solver of
// WHERE clause fragments such as WHERE token(pk) > 1 or WHERE pk1 IN :list1 AND pk2 IN :list2.
using get_partition_key_ranges_fn_t = std::function<dht::partition_range_vector (const query_options&)>;
// A function that returns the clustering key ranges for a query. It is the solver of
// WHERE clause fragments such as WHERE ck > 1 or WHERE (ck1, ck2) > (1, 2).
using get_clustering_bounds_fn_t = std::function<std::vector<query::clustering_range> (const query_options& options)>;
// A function that returns a singleton value, usable for a key (e.g. bytes_opt)
using get_singleton_value_fn_t = std::function<bytes_opt (const query_options&)>;
struct no_partition_range_restrictions {
};
struct token_range_restrictions {
predicate token_restrictions;
};
struct single_column_partition_range_restrictions {
std::vector<predicate> per_column_restrictions;
};
using partition_range_restrictions = std::variant<
no_partition_range_restrictions,
token_range_restrictions,
single_column_partition_range_restrictions>;
// A map of per-column predicate vectors, ordered by schema position.
using single_column_predicate_vectors = std::map<const column_definition*, std::vector<predicate>, expr::schema_pos_column_definition_comparator>;
/**
* The restrictions corresponding to the relations specified in the WHERE clause of a CQL query.
*/
class statement_restrictions {
struct private_tag {}; // Tag for private constructor
private:
schema_ptr _schema;
@@ -81,7 +179,7 @@ private:
bool _has_queriable_regular_index = false, _has_queriable_pk_index = false, _has_queriable_ck_index = false;
bool _has_multi_column; ///< True iff _clustering_columns_restrictions has a multi-column restriction.
std::optional<expr::expression> _where; ///< The entire WHERE clause.
std::vector<expr::expression> _where; ///< The entire WHERE clause (factorized).
/// Parts of _where defining the clustering slice.
///
@@ -96,7 +194,7 @@ private:
/// 4.4 elements other than the last have only EQ or IN atoms
/// 4.5 the last element has only EQ, IN, or is_slice() atoms
/// 5. if multi-column, then each element is a binary_operator
std::vector<expr::expression> _clustering_prefix_restrictions;
std::vector<predicate> _clustering_prefix_restrictions;
/// Like _clustering_prefix_restrictions, but for the indexing table (if this is an index-reading statement).
/// Recall that the index-table CK is (token, PK, CK) of the base table for a global index and (indexed column,
@@ -105,7 +203,7 @@ private:
/// Elements are conjunctions of single-column binary operators with the same LHS.
/// Element order follows the indexing-table clustering key.
/// In case of a global index the first element's (token restriction) RHS is a dummy value; it is filled in later.
std::optional<std::vector<expr::expression>> _idx_tbl_ck_prefix;
std::optional<std::vector<predicate>> _idx_tbl_ck_prefix;
/// Parts of _where defining the partition range.
///
@@ -113,16 +211,25 @@ private:
/// binary_operators on token. If single-column restrictions define the partition range, each element holds
/// restrictions for one partition column. Each partition column has a corresponding element, but the elements
/// are in arbitrary order.
std::vector<expr::expression> _partition_range_restrictions;
partition_range_restrictions _partition_range_restrictions;
bool _partition_range_is_simple; ///< False iff _partition_range_restrictions imply a Cartesian product.
check_indexes _check_indexes = check_indexes::yes;
/// Columns that appear on the LHS of an EQ restriction (not IN).
/// For multi-column EQ like (ck1, ck2) = (1, 2), all columns in the tuple are included.
std::unordered_set<const column_definition*> _columns_with_eq;
std::vector<const column_definition*> _column_defs_for_filtering;
schema_ptr _view_schema;
std::optional<secondary_index::index> _idx_opt;
expr::expression _idx_restrictions = expr::conjunction({});
get_partition_key_ranges_fn_t _get_partition_key_ranges_fn;
get_clustering_bounds_fn_t _get_clustering_bounds_fn;
get_clustering_bounds_fn_t _get_global_index_clustering_ranges_fn;
get_clustering_bounds_fn_t _get_global_index_token_clustering_ranges_fn;
get_clustering_bounds_fn_t _get_local_index_clustering_ranges_fn;
get_singleton_value_fn_t _value_for_index_partition_key_fn;
public:
/**
* Creates a new empty <code>StatementRestrictions</code>.
@@ -130,9 +237,10 @@ public:
* @param cfm the column family meta data
* @return a new empty <code>StatementRestrictions</code>.
*/
statement_restrictions(schema_ptr schema, bool allow_filtering);
statement_restrictions(private_tag, schema_ptr schema, bool allow_filtering);
friend statement_restrictions analyze_statement_restrictions(
public:
friend shared_ptr<const statement_restrictions> analyze_statement_restrictions(
data_dictionary::database db,
schema_ptr schema,
statements::statement_type type,
@@ -142,9 +250,15 @@ public:
bool for_view,
bool allow_filtering,
check_indexes do_check_indexes);
friend shared_ptr<const statement_restrictions> make_trivial_statement_restrictions(
schema_ptr schema,
bool allow_filtering);
private:
statement_restrictions(data_dictionary::database db,
// Important: objects of this class capture `this` extensively and so must remain non-copyable.
statement_restrictions(const statement_restrictions&) = delete;
statement_restrictions& operator=(const statement_restrictions&) = delete;
statement_restrictions(private_tag,
data_dictionary::database db,
schema_ptr schema,
statements::statement_type type,
const expr::expression& where_clause,
@@ -211,10 +325,7 @@ public:
bool has_token_restrictions() const;
// Checks whether the given column has an EQ restriction.
// EQ restriction is `col = ...` or `(col, col2) = ...`
// IN restriction is NOT an EQ restriction, this function will not look for IN restrictions.
// Uses column_defintion::operator== for comparison, columns with the same name but different schema will not be equal.
// Checks whether the given column has an EQ restriction (not IN).
bool has_eq_restriction_on_column(const column_definition&) const;
/**
@@ -224,12 +335,6 @@ public:
*/
std::vector<const column_definition*> get_column_defs_for_filtering(data_dictionary::database db) const;
/**
* Gives a score that the index has - index with the highest score will be chosen
* in find_idx()
*/
int score(const secondary_index::index& index) const;
/**
* Determines the index to be used with the restriction.
* @param db - the data_dictionary::database context (for extracting index manager)
@@ -250,18 +355,8 @@ public:
size_t partition_key_restrictions_size() const;
bool parition_key_restrictions_have_supporting_index(const secondary_index::secondary_index_manager& index_manager, expr::allow_local_index allow_local) const;
size_t clustering_columns_restrictions_size() const;
bool clustering_columns_restrictions_have_supporting_index(
const secondary_index::secondary_index_manager& index_manager,
expr::allow_local_index allow_local) const;
bool multi_column_clustering_restrictions_are_supported_by(const secondary_index::index& index) const;
bounds_slice get_clustering_slice() const;
/**
* Checks if the clustering key has some unrestricted components.
* @return <code>true</code> if the clustering key has some unrestricted components, <code>false</code> otherwise.
@@ -279,15 +374,6 @@ public:
schema_ptr get_view_schema() const { return _view_schema; }
private:
std::pair<std::optional<secondary_index::index>, expr::expression> do_find_idx(const secondary_index::secondary_index_manager& sim) const;
void add_restriction(const expr::binary_operator& restr, schema_ptr schema, bool allow_filtering, bool for_view);
void add_is_not_restriction(const expr::binary_operator& restr, schema_ptr schema, bool for_view);
void add_single_column_parition_key_restriction(const expr::binary_operator& restr, schema_ptr schema, bool allow_filtering, bool for_view);
void add_token_partition_key_restriction(const expr::binary_operator& restr);
void add_single_column_clustering_key_restriction(const expr::binary_operator& restr, schema_ptr schema, bool allow_filtering);
void add_multi_column_clustering_key_restriction(const expr::binary_operator& restr);
void add_single_column_nonprimary_key_restriction(const expr::binary_operator& restr);
void process_partition_key_restrictions(bool for_view, bool allow_filtering, statements::statement_type type);
/**
@@ -315,7 +401,17 @@ private:
void add_clustering_restrictions_to_idx_ck_prefix(const schema& idx_tbl_schema);
unsigned int num_clustering_prefix_columns_that_need_not_be_filtered() const;
void calculate_column_defs_for_filtering_and_erase_restrictions_used_for_index(data_dictionary::database db);
void calculate_column_defs_for_filtering_and_erase_restrictions_used_for_index(
data_dictionary::database db,
const single_column_predicate_vectors& sc_pk_pred_vectors,
const single_column_predicate_vectors& sc_ck_pred_vectors,
const single_column_predicate_vectors& sc_nonpk_pred_vectors);
get_partition_key_ranges_fn_t build_partition_key_ranges_fn() const;
get_clustering_bounds_fn_t build_get_clustering_bounds_fn() const;
get_clustering_bounds_fn_t build_get_global_index_clustering_ranges_fn() const;
get_clustering_bounds_fn_t build_get_global_index_token_clustering_ranges_fn() const;
get_clustering_bounds_fn_t build_get_local_index_clustering_ranges_fn() const;
get_singleton_value_fn_t build_value_for_index_partition_key_fn() const;
public:
/**
* Returns the specified range of the partition key.
@@ -389,7 +485,10 @@ public:
private:
/// Prepares internal data for evaluating index-table queries. Must be called before
/// get_local_index_clustering_ranges().
void prepare_indexed_local(const schema& idx_tbl_schema);
void prepare_indexed_local(const schema& idx_tbl_schema,
const single_column_predicate_vectors& sc_pk_pred_vectors,
const single_column_predicate_vectors& sc_ck_pred_vectors,
const single_column_predicate_vectors& sc_nonpk_pred_vectors);
/// Prepares internal data for evaluating index-table queries. Must be called before
/// get_global_index_clustering_ranges() or get_global_index_token_clustering_ranges().
@@ -398,15 +497,18 @@ private:
public:
/// Calculates clustering ranges for querying a global-index table.
std::vector<query::clustering_range> get_global_index_clustering_ranges(
const query_options& options, const schema& idx_tbl_schema) const;
const query_options& options) const;
/// Calculates clustering ranges for querying a global-index table for queries with token restrictions present.
std::vector<query::clustering_range> get_global_index_token_clustering_ranges(
const query_options& options, const schema& idx_tbl_schema) const;
const query_options& options) const;
/// Calculates clustering ranges for querying a local-index table.
std::vector<query::clustering_range> get_local_index_clustering_ranges(
const query_options& options, const schema& idx_tbl_schema) const;
const query_options& options) const;
/// Finds the value of partition key of the index table
bytes_opt value_for_index_partition_key(const query_options&) const;
sstring to_string() const;
@@ -416,7 +518,7 @@ public:
bool is_empty() const;
};
statement_restrictions analyze_statement_restrictions(
shared_ptr<const statement_restrictions> analyze_statement_restrictions(
data_dictionary::database db,
schema_ptr schema,
statements::statement_type type,
@@ -427,23 +529,14 @@ statement_restrictions analyze_statement_restrictions(
bool allow_filtering,
check_indexes do_check_indexes);
// Extracts all binary operators which have the given column on their left hand side.
// Extracts only single-column restrictions.
// Does not include multi-column restrictions.
// Does not include token() restrictions.
// Does not include boolean constant restrictions.
// For example "WHERE c = 1 AND (a, c) = (2, 1) AND token(p) < 2 AND FALSE" will return {"c = 1"}.
std::vector<expr::expression> extract_single_column_restrictions_for_column(const expr::expression&, const column_definition&);
shared_ptr<const statement_restrictions> make_trivial_statement_restrictions(
schema_ptr schema,
bool allow_filtering);
// Checks whether this expression is empty - doesn't restrict anything
bool is_empty_restriction(const expr::expression&);
// Finds the value of the given column in the expression
// In case of multiple possible values, calls on_internal_error
bytes_opt value_for(const column_definition&, const expr::expression&, const query_options&);
}
}
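The solve_for_t machinery introduced above inverts predicates into value sets. A minimal sketch of the idea for the factor X = 3, with stand-in types (std::vector<int> for value_list, an empty struct for interval<managed_bytes>):

#include <cassert>
#include <functional>
#include <variant>
#include <vector>

struct query_options {};                  // stand-in for cql3::query_options
struct interval { /* bounds omitted */ }; // stand-in for interval<managed_bytes>
using value_list = std::vector<int>;      // stand-in; sorted and deduped
using value_set = std::variant<value_list, interval>;
using solve_for_t = std::function<value_set(const query_options&)>;

int main() {
    // Solver for the WHERE-clause factor "X = 3": equated to TRUE, it
    // yields the single-element value list {3} for X.
    solve_for_t solve_x = [](const query_options&) -> value_set {
        return value_list{3};
    };
    query_options opts;
    assert(std::get<value_list>(solve_x(opts)) == value_list{3});
}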

View File

@@ -90,6 +90,20 @@ void cql3::statements::alter_keyspace_statement::validate(query_processor& qp, c
auto& current_rf_per_dc = ks.metadata()->strategy_options();
auto new_rf_per_dc = _attrs->get_replication_options();
new_rf_per_dc.erase(ks_prop_defs::REPLICATION_STRATEGY_CLASS_KEY);
// Check if multi-RF change is allowed: all DC changes must be 0->N or N->0.
auto all_changes_are_0_N = [&] {
for (const auto& [dc, new_rf] : new_rf_per_dc) {
auto old_rf_val = size_t(0);
if (auto it = current_rf_per_dc.find(dc); it != current_rf_per_dc.end()) {
old_rf_val = locator::get_replication_factor(it->second);
}
auto new_rf_val = locator::get_replication_factor(new_rf);
if (old_rf_val != new_rf_val && old_rf_val != 0 && new_rf_val != 0) {
return false;
}
}
return true;
};
unsigned total_abs_rfs_diff = 0;
for (const auto& [new_dc, new_rf] : new_rf_per_dc) {
auto old_rf = locator::replication_strategy_config_option(sstring("0"));
@@ -103,7 +117,9 @@ void cql3::statements::alter_keyspace_statement::validate(query_processor& qp, c
// first we need to report non-existing DCs, then if RFs aren't changed by too much.
continue;
}
if (total_abs_rfs_diff += get_abs_rf_diff(old_rf, new_rf); total_abs_rfs_diff >= 2) {
if (total_abs_rfs_diff += get_abs_rf_diff(old_rf, new_rf); total_abs_rfs_diff >= 2 &&
!(qp.proxy().features().keyspace_multi_rf_change && locator::uses_rack_list_exclusively(current_rf_per_dc)
&& locator::uses_rack_list_exclusively(new_ks->strategy_options()) && all_changes_are_0_N())) {
throw exceptions::invalid_request_exception("Only one DC's RF can be changed at a time and not by more than 1");
}
}
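The all_changes_are_0_N lambda above encodes the relaxed multi-DC rule: any number of DCs may go from unreplicated to replicated (0 -> N) or back (N -> 0) in one statement, while a nonzero-to-nonzero jump still falls under the one-DC-by-at-most-1 rule. A standalone sketch of just that helper, with simplified types:

#include <cassert>
#include <map>
#include <string>

// Simplified restatement of the rule: every DC whose RF changes must
// start or end at zero; a nonzero -> nonzero change disqualifies the
// whole ALTER from the multi-RF fast path.
bool all_changes_are_0_N(const std::map<std::string, unsigned>& old_rf,
                         const std::map<std::string, unsigned>& new_rf) {
    for (const auto& [dc, nrf] : new_rf) {
        unsigned orf = old_rf.count(dc) ? old_rf.at(dc) : 0;
        if (orf != nrf && orf != 0 && nrf != 0) {
            return false;
        }
    }
    return true;
}

int main() {
    std::map<std::string, unsigned> before{{"dc1", 3}};
    std::map<std::string, unsigned> grow{{"dc1", 3}, {"dc2", 3}}; // dc2: 0 -> 3
    std::map<std::string, unsigned> bump{{"dc1", 5}};             // dc1: 3 -> 5
    assert(all_changes_are_0_N(before, grow));
    assert(!all_changes_are_0_N(before, bump));
}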

View File

@@ -16,7 +16,6 @@
#include <seastar/core/execution_stage.hh>
#include "cas_request.hh"
#include "cql3/query_processor.hh"
#include "transport/messages/result_message.hh"
#include "service/storage_proxy.hh"
#include "tracing/trace_state.hh"
#include "utils/unique_view.hh"

View File

@@ -89,6 +89,10 @@ public:
const std::vector<single_statement>& statements() const { return _statements; }
audit::audit_info_ptr audit_info() const {
return audit::audit::create_audit_info(audit::statement_category::DML, sstring(), sstring(), true);
}
virtual bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
virtual uint32_t get_bound_terms() const override;

View File

@@ -22,7 +22,6 @@
#include "cql3/expr/evaluate.hh"
#include "cql3/query_options.hh"
#include "cql3/query_processor.hh"
#include "transport/messages/result_message.hh"
#include "cql3/values.hh"
#include "timeout_config.hh"
#include "service/broadcast_tables/experimental/lang.hh"

View File

@@ -14,7 +14,6 @@
#include "auth/service.hh"
#include "cql3/statements/prepared_statement.hh"
#include "cql3/query_processor.hh"
#include "unimplemented.hh"
#include "service/migration_manager.hh"
#include "service/storage_proxy.hh"
#include "transport/event.hh"

View File

@@ -411,10 +411,10 @@ bool ks_prop_defs::get_durable_writes() const {
lw_shared_ptr<data_dictionary::keyspace_metadata> ks_prop_defs::as_ks_metadata(sstring ks_name, const locator::token_metadata& tm, const gms::feature_service& feat, const db::config& cfg) {
auto sc = get_replication_strategy_class().value();
// if tablets options have not been specified, but tablets are globally enabled, set the value to 0 for N.T.S. only
// if tablets options have not been specified, but tablets are globally enabled, set the value to 0. The strategy will
// validate it and throw an error if it does not support tablets.
auto enable_tablets = feat.tablets && cfg.enable_tablets_by_default();
std::optional<unsigned> default_initial_tablets = enable_tablets && locator::abstract_replication_strategy::to_qualified_class_name(sc) == "org.apache.cassandra.locator.NetworkTopologyStrategy"
? std::optional<unsigned>(0) : std::nullopt;
std::optional<unsigned> default_initial_tablets = enable_tablets ? std::optional<unsigned>(0) : std::nullopt;
auto initial_tablets = get_initial_tablets(default_initial_tablets, cfg.enforce_tablets());
bool uses_tablets = initial_tablets.has_value();
bool rack_list_enabled = utils::get_local_injector().enter("create_with_numeric") ? false : feat.rack_list_rf;
@@ -440,7 +440,7 @@ lw_shared_ptr<data_dictionary::keyspace_metadata> ks_prop_defs::as_ks_metadata_u
sc = old->strategy_name();
options = old_options;
}
return data_dictionary::keyspace_metadata::new_keyspace(old->name(), *sc, options, initial_tablets, get_consistency_option(), get_boolean(KW_DURABLE_WRITES, true), get_storage_options());
return data_dictionary::keyspace_metadata::new_keyspace(old->name(), *sc, options, initial_tablets, get_consistency_option(), get_boolean(KW_DURABLE_WRITES, true), get_storage_options(), {}, old->next_strategy_options_opt());
}
namespace {

View File

@@ -626,7 +626,7 @@ modification_statement::prepare(data_dictionary::database db, prepare_context& c
// Since this cache is only meaningful for LWT queries, just clear the ids
// if it's not a conditional statement so that the AST nodes don't
// participate in the caching mechanism later.
if (!prepared_stmt->has_conditions() && prepared_stmt->_restrictions.has_value()) {
if (!prepared_stmt->has_conditions() && prepared_stmt->_restrictions) {
ctx.clear_pk_function_calls_cache();
}
prepared_stmt->_may_use_token_aware_routing = ctx.get_partition_key_bind_indexes(*schema).size() != 0;

View File

@@ -94,7 +94,7 @@ private:
std::optional<bool> _is_raw_counter_shard_write;
protected:
std::optional<restrictions::statement_restrictions> _restrictions;
shared_ptr<const restrictions::statement_restrictions> _restrictions;
public:
typedef std::optional<std::unordered_map<sstring, bytes_opt>> json_cache_opt;

View File

@@ -19,7 +19,7 @@ public:
uint32_t bound_terms,
lw_shared_ptr<const parameters> parameters,
::shared_ptr<selection::selection> selection,
::shared_ptr<restrictions::statement_restrictions> restrictions,
::shared_ptr<const restrictions::statement_restrictions> restrictions,
::shared_ptr<std::vector<size_t>> group_by_cell_indices,
bool is_reversed,
ordering_comparator_type ordering_comparator,

View File

@@ -109,7 +109,7 @@ public:
std::unique_ptr<prepared_statement> prepare(data_dictionary::database db, cql_stats& stats, const cql_config& cfg, bool for_view);
private:
std::vector<selection::prepared_selector> maybe_jsonize_select_clause(std::vector<selection::prepared_selector> select, data_dictionary::database db, schema_ptr schema);
::shared_ptr<restrictions::statement_restrictions> prepare_restrictions(
::shared_ptr<const restrictions::statement_restrictions> prepare_restrictions(
data_dictionary::database db,
schema_ptr schema,
prepare_context& ctx,

View File

@@ -1027,7 +1027,7 @@ view_indexed_table_select_statement::prepare(data_dictionary::database db,
uint32_t bound_terms,
lw_shared_ptr<const parameters> parameters,
::shared_ptr<selection::selection> selection,
::shared_ptr<restrictions::statement_restrictions> restrictions,
::shared_ptr<const restrictions::statement_restrictions> restrictions,
::shared_ptr<std::vector<size_t>> group_by_cell_indices,
bool is_reversed,
ordering_comparator_type ordering_comparator,
@@ -1139,7 +1139,7 @@ lw_shared_ptr<const service::pager::paging_state> view_indexed_table_select_stat
auto& last_base_pk = last_pos.partition;
auto* last_base_ck = last_pos.position.has_key() ? &last_pos.position.key() : nullptr;
bytes_opt indexed_column_value = restrictions::value_for(*cdef, _used_index_restrictions, options);
bytes_opt indexed_column_value = _restrictions->value_for_index_partition_key(options);
auto index_pk = [&]() {
if (_index.metadata().local()) {
@@ -1350,12 +1350,7 @@ dht::partition_range_vector view_indexed_table_select_statement::get_partition_r
dht::partition_range_vector view_indexed_table_select_statement::get_partition_ranges_for_global_index_posting_list(const query_options& options) const {
dht::partition_range_vector partition_ranges;
const column_definition* cdef = _schema->get_column_definition(to_bytes(_index.target_column()));
if (!cdef) {
throw exceptions::invalid_request_exception("Indexed column not found in schema");
}
bytes_opt value = restrictions::value_for(*cdef, _used_index_restrictions, options);
bytes_opt value = _restrictions->value_for_index_partition_key(options);
if (value) {
auto pk = partition_key::from_single_value(*_view_schema, *value);
auto dk = dht::decorate_key(*_view_schema, pk);
@@ -1374,11 +1369,11 @@ query::partition_slice view_indexed_table_select_statement::get_partition_slice_
// Only EQ restrictions on base partition key can be used in an index view query
if (pk_restrictions_is_single && _restrictions->partition_key_restrictions_is_all_eq()) {
partition_slice_builder.with_ranges(
_restrictions->get_global_index_clustering_ranges(options, *_view_schema));
_restrictions->get_global_index_clustering_ranges(options));
} else if (_restrictions->has_token_restrictions()) {
// Restrictions like token(p1, p2) < 0 have all partition key components restricted, but require special handling.
partition_slice_builder.with_ranges(
_restrictions->get_global_index_token_clustering_ranges(options, *_view_schema));
_restrictions->get_global_index_token_clustering_ranges(options));
}
}
@@ -1389,7 +1384,7 @@ query::partition_slice view_indexed_table_select_statement::get_partition_slice_
partition_slice_builder partition_slice_builder{*_view_schema};
partition_slice_builder.with_ranges(
_restrictions->get_local_index_clustering_ranges(options, *_view_schema));
_restrictions->get_local_index_clustering_ranges(options));
return partition_slice_builder.build();
}
@@ -1607,7 +1602,7 @@ public:
uint32_t bound_terms,
lw_shared_ptr<const parameters> parameters,
::shared_ptr<selection::selection> selection,
::shared_ptr<restrictions::statement_restrictions> restrictions,
::shared_ptr<const restrictions::statement_restrictions> restrictions,
::shared_ptr<std::vector<size_t>> group_by_cell_indices,
bool is_reversed,
ordering_comparator_type ordering_comparator,
@@ -1645,7 +1640,7 @@ private:
uint32_t bound_terms,
lw_shared_ptr<const select_statement::parameters> parameters,
::shared_ptr<selection::selection> selection,
::shared_ptr<restrictions::statement_restrictions> restrictions,
::shared_ptr<const restrictions::statement_restrictions> restrictions,
::shared_ptr<std::vector<size_t>> group_by_cell_indices,
bool is_reversed,
parallelized_select_statement::ordering_comparator_type ordering_comparator,
@@ -2076,7 +2071,7 @@ static select_statement::ordering_comparator_type get_similarity_ordering_compar
::shared_ptr<cql3::statements::select_statement> vector_indexed_table_select_statement::prepare(data_dictionary::database db, schema_ptr schema,
uint32_t bound_terms, lw_shared_ptr<const parameters> parameters, ::shared_ptr<selection::selection> selection,
::shared_ptr<restrictions::statement_restrictions> restrictions, ::shared_ptr<std::vector<size_t>> group_by_cell_indices, bool is_reversed,
::shared_ptr<const restrictions::statement_restrictions> restrictions, ::shared_ptr<std::vector<size_t>> group_by_cell_indices, bool is_reversed,
ordering_comparator_type ordering_comparator, prepared_ann_ordering_type prepared_ann_ordering, std::optional<expr::expression> limit,
std::optional<expr::expression> per_partition_limit, cql_stats& stats, const secondary_index::index& index, std::unique_ptr<attributes> attrs) {
@@ -2589,7 +2584,7 @@ std::unique_ptr<prepared_statement> select_statement::prepare(data_dictionary::d
return make_unique<prepared_statement>(audit_info(), std::move(stmt), ctx, std::move(partition_key_bind_indices), std::move(warnings));
}
::shared_ptr<restrictions::statement_restrictions>
::shared_ptr<const restrictions::statement_restrictions>
select_statement::prepare_restrictions(data_dictionary::database db,
schema_ptr schema,
prepare_context& ctx,
@@ -2599,8 +2594,8 @@ select_statement::prepare_restrictions(data_dictionary::database db,
restrictions::check_indexes do_check_indexes)
{
try {
return ::make_shared<restrictions::statement_restrictions>(restrictions::analyze_statement_restrictions(db, schema, statement_type::SELECT, _where_clause, ctx,
selection->contains_only_static_columns(), for_view, allow_filtering, do_check_indexes));
return restrictions::analyze_statement_restrictions(db, schema, statement_type::SELECT, _where_clause, ctx,
selection->contains_only_static_columns(), for_view, allow_filtering, do_check_indexes);
} catch (const exceptions::unrecognized_entity_exception& e) {
if (contains_alias(e.entity)) {
throw exceptions::invalid_request_exception(format("Aliases aren't allowed in the WHERE clause (name: '{}')", e.entity));

View File

@@ -200,7 +200,7 @@ public:
uint32_t bound_terms,
lw_shared_ptr<const parameters> parameters,
::shared_ptr<selection::selection> selection,
::shared_ptr<restrictions::statement_restrictions> restrictions,
::shared_ptr<const restrictions::statement_restrictions> restrictions,
::shared_ptr<std::vector<size_t>> group_by_cell_indices,
bool is_reversed,
ordering_comparator_type ordering_comparator,
@@ -372,7 +372,7 @@ public:
static ::shared_ptr<cql3::statements::select_statement> prepare(data_dictionary::database db, schema_ptr schema, uint32_t bound_terms,
lw_shared_ptr<const parameters> parameters, ::shared_ptr<selection::selection> selection,
::shared_ptr<restrictions::statement_restrictions> restrictions, ::shared_ptr<std::vector<size_t>> group_by_cell_indices, bool is_reversed,
::shared_ptr<const restrictions::statement_restrictions> restrictions, ::shared_ptr<std::vector<size_t>> group_by_cell_indices, bool is_reversed,
ordering_comparator_type ordering_comparator, prepared_ann_ordering_type prepared_ann_ordering, std::optional<expr::expression> limit,
std::optional<expr::expression> per_partition_limit, cql_stats& stats, const secondary_index::index& index, std::unique_ptr<cql3::attributes> attrs);

View File

@@ -9,6 +9,7 @@
#pragma once
#include "cql3/cql_statement.hh"
#include "cql3/query_processor.hh"
#include "raw/parsed_statement.hh"
#include "service/qos/qos_common.hh"
#include "service/query_state.hh"

View File

@@ -15,7 +15,6 @@
#include "cql3/cql_statement.hh"
#include "data_dictionary/data_dictionary.hh"
#include "cql3/query_processor.hh"
#include "unimplemented.hh"
#include "service/storage_proxy.hh"
#include <optional>
#include "validation.hh"

View File

@@ -66,7 +66,7 @@ public:
: update_statement(std::move(audit_info), statement_type::INSERT, bound_terms, s, std::move(attrs), stats)
, _value(std::move(v))
, _default_unset(default_unset) {
_restrictions = restrictions::statement_restrictions(s, false);
_restrictions = cql3::restrictions::make_trivial_statement_restrictions(s, false);
}
private:
virtual void execute_operations_for_key(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params, const json_cache_opt& json_cache) const override;

View File

@@ -224,10 +224,12 @@ keyspace_metadata::keyspace_metadata(std::string_view name,
bool durable_writes,
std::vector<schema_ptr> cf_defs,
user_types_metadata user_types,
storage_options storage_opts)
storage_options storage_opts,
std::optional<locator::replication_strategy_config_options> next_options)
: _name{name}
, _strategy_name{locator::abstract_replication_strategy::to_qualified_class_name(strategy_name.empty() ? "NetworkTopologyStrategy" : strategy_name)}
, _strategy_options{std::move(strategy_options)}
, _next_strategy_options{std::move(next_options)}
, _initial_tablets(initial_tablets)
, _durable_writes{durable_writes}
, _user_types{std::move(user_types)}
@@ -273,14 +275,15 @@ keyspace_metadata::new_keyspace(std::string_view name,
std::optional<consistency_config_option> consistency_option,
bool durables_writes,
storage_options storage_opts,
std::vector<schema_ptr> cf_defs)
std::vector<schema_ptr> cf_defs,
std::optional<locator::replication_strategy_config_options> next_options)
{
return ::make_lw_shared<keyspace_metadata>(name, strategy_name, options, initial_tablets, consistency_option, durables_writes, cf_defs, user_types_metadata{}, storage_opts);
return ::make_lw_shared<keyspace_metadata>(name, strategy_name, options, initial_tablets, consistency_option, durables_writes, cf_defs, user_types_metadata{}, storage_opts, next_options);
}
lw_shared_ptr<keyspace_metadata>
keyspace_metadata::new_keyspace(const keyspace_metadata& ksm) {
return new_keyspace(ksm.name(), ksm.strategy_name(), ksm.strategy_options(), ksm.initial_tablets(), ksm.consistency_option(), ksm.durable_writes(), ksm.get_storage_options());
return new_keyspace(ksm.name(), ksm.strategy_name(), ksm.strategy_options(), ksm.initial_tablets(), ksm.consistency_option(), ksm.durable_writes(), ksm.get_storage_options(), {}, ksm.next_strategy_options_opt());
}
void keyspace_metadata::add_user_type(const user_type ut) {
@@ -649,8 +652,8 @@ struct fmt::formatter<data_dictionary::user_types_metadata> {
};
auto fmt::formatter<data_dictionary::keyspace_metadata>::format(const data_dictionary::keyspace_metadata& m, fmt::format_context& ctx) const -> decltype(ctx.out()) {
fmt::format_to(ctx.out(), "KSMetaData{{name={}, strategyClass={}, strategyOptions={}, cfMetaData={}, durable_writes={}, tablets=",
m.name(), m.strategy_name(), m.strategy_options(), m.cf_meta_data(), m.durable_writes());
fmt::format_to(ctx.out(), "KSMetaData{{name={}, strategyClass={}, strategyOptions={}, nextStrategyOptions={}, cfMetaData={}, durable_writes={}, tablets=",
m.name(), m.strategy_name(), m.strategy_options(), m.next_strategy_options_opt(), m.cf_meta_data(), m.durable_writes());
if (m.initial_tablets()) {
if (auto initial_tablets = m.initial_tablets().value()) {
fmt::format_to(ctx.out(), "{{\"initial\":{}}}", initial_tablets);

View File

@@ -28,7 +28,9 @@ namespace data_dictionary {
class keyspace_metadata final {
sstring _name;
sstring _strategy_name;
// If _next_strategy_options has a value, there is an ongoing RF change for this keyspace.
locator::replication_strategy_config_options _strategy_options;
std::optional<locator::replication_strategy_config_options> _next_strategy_options;
std::optional<unsigned> _initial_tablets;
std::unordered_map<sstring, schema_ptr> _cf_meta_data;
bool _durable_writes;
@@ -44,7 +46,8 @@ public:
bool durable_writes,
std::vector<schema_ptr> cf_defs = std::vector<schema_ptr>{},
user_types_metadata user_types = user_types_metadata{},
storage_options storage_opts = storage_options{});
storage_options storage_opts = storage_options{},
std::optional<locator::replication_strategy_config_options> next_options = std::nullopt);
static lw_shared_ptr<keyspace_metadata>
new_keyspace(std::string_view name,
std::string_view strategy_name,
@@ -53,7 +56,8 @@ public:
std::optional<consistency_config_option> consistency_option,
bool durables_writes = true,
storage_options storage_opts = {},
std::vector<schema_ptr> cf_defs = {});
std::vector<schema_ptr> cf_defs = {},
std::optional<locator::replication_strategy_config_options> next_options = std::nullopt);
static lw_shared_ptr<keyspace_metadata>
new_keyspace(const keyspace_metadata& ksm);
void validate(const gms::feature_service&, const locator::topology&) const;
@@ -66,6 +70,18 @@ public:
const locator::replication_strategy_config_options& strategy_options() const {
return _strategy_options;
}
void set_strategy_options(const locator::replication_strategy_config_options& options) {
_strategy_options = options;
}
const std::optional<locator::replication_strategy_config_options>& next_strategy_options_opt() const {
return _next_strategy_options;
}
void set_next_strategy_options(const locator::replication_strategy_config_options& options) {
_next_strategy_options = options;
}
void clear_next_strategy_options() {
_next_strategy_options = std::nullopt;
}
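
For orientation, here is a minimal hypothetical sketch (the helper names begin_rf_change/finish_rf_change are illustrative, not from this patch) of how these accessors express the phases of an RF change:

// Hypothetical helpers, not part of the patch: _next_strategy_options marks
// an in-flight RF change and is cleared once the target options take effect.
inline void begin_rf_change(keyspace_metadata& ksm,
                            const locator::replication_strategy_config_options& target) {
    ksm.set_next_strategy_options(target);                      // change starts
}
inline void finish_rf_change(keyspace_metadata& ksm) {
    ksm.set_strategy_options(*ksm.next_strategy_options_opt()); // adopt the target
    ksm.clear_next_strategy_options();                          // change finished
}
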
locator::replication_strategy_config_options strategy_options_v1() const;
std::optional<unsigned> initial_tablets() const {
return _initial_tablets;

View File

@@ -330,14 +330,14 @@ const config_type& config_type_for<std::vector<db::config::error_injection_at_st
}
template <>
const config_type& config_type_for<enum_option<netw::dict_training_when>>() {
const config_type& config_type_for<enum_option<netw::dict_training_loop::when>>() {
static config_type ct(
"dictionary training conditions", printable_to_json<enum_option<netw::dict_training_when>>);
"dictionary training conditions", printable_to_json<enum_option<netw::dict_training_loop::when>>);
return ct;
}
template <>
const config_type& config_type_for<netw::algo_config>() {
const config_type& config_type_for<netw::advanced_rpc_compressor::tracker::algo_config>() {
static config_type ct(
"advanced rpc compressor config", printable_vector_to_json<enum_option<netw::compression_algorithm>>);
return ct;
@@ -530,9 +530,9 @@ struct convert<db::config::error_injection_at_startup> {
template <>
class convert<enum_option<netw::dict_training_when>> {
class convert<enum_option<netw::dict_training_loop::when>> {
public:
static bool decode(const Node& node, enum_option<netw::dict_training_when>& rhs) {
static bool decode(const Node& node, enum_option<netw::dict_training_loop::when>& rhs) {
std::string name;
if (!convert<std::string>::decode(node, name)) {
return false;
@@ -1110,7 +1110,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"Specifies RPC compression algorithms supported by this node. ")
, internode_compression_enable_advanced(this, "internode_compression_enable_advanced", liveness::MustRestart, value_status::Used, false,
"Enables the new implementation of RPC compression. If disabled, Scylla will fall back to the old implementation.")
, rpc_dict_training_when(this, "rpc_dict_training_when", liveness::LiveUpdate, value_status::Used, netw::dict_training_when::type::NEVER,
, rpc_dict_training_when(this, "rpc_dict_training_when", liveness::LiveUpdate, value_status::Used, netw::dict_training_loop::when::type::NEVER,
"Specifies when RPC compression dictionary training is performed by this node.\n"
"* `never` disables it unconditionally.\n"
"* `when_leader` enables it only whenever the node is the Raft leader.\n"
@@ -2025,8 +2025,8 @@ template struct utils::config_file::named_value<enum_option<db::experimental_fea
template struct utils::config_file::named_value<enum_option<db::replication_strategy_restriction_t>>;
template struct utils::config_file::named_value<enum_option<db::consistency_level_restriction_t>>;
template struct utils::config_file::named_value<enum_option<db::tablets_mode_t>>;
template struct utils::config_file::named_value<enum_option<netw::dict_training_when>>;
template struct utils::config_file::named_value<netw::algo_config>;
template struct utils::config_file::named_value<enum_option<netw::dict_training_loop::when>>;
template struct utils::config_file::named_value<netw::advanced_rpc_compressor::tracker::algo_config>;
template struct utils::config_file::named_value<std::vector<enum_option<db::experimental_features_t>>>;
template struct utils::config_file::named_value<std::vector<enum_option<db::replication_strategy_restriction_t>>>;
template struct utils::config_file::named_value<std::vector<enum_option<db::consistency_level_restriction_t>>>;
@@ -2094,7 +2094,7 @@ future<gms::inet_address> resolve(const config_file::named_value<sstring>& addre
}
}
co_return seastar::coroutine::exception(std::move(ex));
co_return coroutine::exception(std::move(ex));
}
static std::vector<seastar::metrics::relabel_config> get_relable_from_yaml(const YAML::Node& yaml, const std::string& name) {

View File

@@ -9,7 +9,6 @@
#pragma once
#include <filesystem>
#include <unordered_map>
#include <seastar/core/sstring.hh>
@@ -17,14 +16,15 @@
#include <seastar/util/program-options.hh>
#include <seastar/util/log.hh>
#include "locator/replication_strategy_type.hh"
#include "locator/abstract_replication_strategy.hh"
#include "seastarx.hh"
#include "utils/config_file.hh"
#include "utils/enum_option.hh"
#include "gms/inet_address.hh"
#include "db/hints/host_filter.hh"
#include "utils/error_injection.hh"
#include "message/rpc_compression_types.hh"
#include "message/dict_trainer.hh"
#include "message/advanced_rpc_compressor.hh"
#include "db/consistency_level_type.hh"
#include "db/tri_mode_restriction.hh"
#include "sstables/compressor.hh"
@@ -325,9 +325,9 @@ public:
named_value<uint32_t> internode_compression_zstd_min_message_size;
named_value<uint32_t> internode_compression_zstd_max_message_size;
named_value<bool> internode_compression_checksumming;
named_value<netw::algo_config> internode_compression_algorithms;
named_value<netw::advanced_rpc_compressor::tracker::algo_config> internode_compression_algorithms;
named_value<bool> internode_compression_enable_advanced;
named_value<enum_option<netw::dict_training_when>> rpc_dict_training_when;
named_value<enum_option<netw::dict_training_loop::when>> rpc_dict_training_when;
named_value<uint32_t> rpc_dict_training_min_time_seconds;
named_value<uint64_t> rpc_dict_training_min_bytes;
named_value<bool> inter_dc_tcp_nodelay;
@@ -739,8 +739,8 @@ extern template struct utils::config_file::named_value<enum_option<db::experimen
extern template struct utils::config_file::named_value<enum_option<db::replication_strategy_restriction_t>>;
extern template struct utils::config_file::named_value<enum_option<db::consistency_level_restriction_t>>;
extern template struct utils::config_file::named_value<enum_option<db::tablets_mode_t>>;
extern template struct utils::config_file::named_value<enum_option<netw::dict_training_when>>;
extern template struct utils::config_file::named_value<netw::algo_config>;
extern template struct utils::config_file::named_value<enum_option<netw::dict_training_loop::when>>;
extern template struct utils::config_file::named_value<netw::advanced_rpc_compressor::tracker::algo_config>;
extern template struct utils::config_file::named_value<std::vector<enum_option<db::experimental_features_t>>>;
extern template struct utils::config_file::named_value<std::vector<enum_option<db::replication_strategy_restriction_t>>>;
extern template struct utils::config_file::named_value<std::vector<enum_option<db::consistency_level_restriction_t>>>;

View File

@@ -277,7 +277,7 @@ filter_for_query(consistency_level cl,
host_id_vector_replica_set selected_endpoints;
// Pre-select endpoints based on client preference. If the endpoints
// Preselect endpoints based on client preference. If the endpoints
// selected this way aren't enough to satisfy CL requirements select the
// remaining ones according to the load-balancing strategy as before.
if (!preferred_endpoints.empty()) {

View File

@@ -33,6 +33,11 @@ enum class schema_feature {
// Per-table tablet options
TABLET_OPTIONS,
// When enabled, `system_schema.keyspaces` will keep three replication values:
// the initial, the current, and the target replication factor,
// which reflect the phases of a multi-step RF change.
KEYSPACE_MULTI_RF_CHANGE,
};
using schema_features = enum_set<super_enum<schema_feature,
@@ -43,7 +48,8 @@ using schema_features = enum_set<super_enum<schema_feature,
schema_feature::TABLE_DIGEST_INSENSITIVE_TO_EXPIRY,
schema_feature::GROUP0_SCHEMA_VERSIONING,
schema_feature::IN_MEMORY_TABLES,
schema_feature::TABLET_OPTIONS
schema_feature::TABLET_OPTIONS,
schema_feature::KEYSPACE_MULTI_RF_CHANGE
>>;
}

View File

@@ -216,6 +216,7 @@ schema_ptr keyspaces() {
{"durable_writes", boolean_type},
{"replication", map_type_impl::get_instance(utf8_type, utf8_type, false)},
{"replication_v2", map_type_impl::get_instance(utf8_type, utf8_type, false)}, // with rack list RF
{"next_replication", map_type_impl::get_instance(utf8_type, utf8_type, false)}, // target rack list RF for this RF change
},
// static columns
{},
@@ -1178,6 +1179,14 @@ utils::chunked_vector<mutation> make_create_keyspace_mutations(schema_features f
// If the maps are different, the upgrade must be already done.
store_map(m, ckey, "replication_v2", timestamp, cql3::statements::to_flattened_map(map));
}
if (features.contains<schema_feature::KEYSPACE_MULTI_RF_CHANGE>()) {
const auto& next_map_opt = keyspace->next_strategy_options_opt();
if (next_map_opt) {
auto next_map = *next_map_opt;
next_map["class"] = keyspace->strategy_name();
store_map(m, ckey, "next_replication", timestamp, cql3::statements::to_flattened_map(next_map));
}
}
if (features.contains<schema_feature::SCYLLA_KEYSPACES>()) {
schema_ptr scylla_keyspaces_s = scylla_keyspaces();
@@ -1251,6 +1260,7 @@ future<lw_shared_ptr<keyspace_metadata>> create_keyspace_metadata(
// (or screw up shared pointers)
const auto& replication = row.get_nonnull<map_type_impl::native_type>("replication");
const auto& replication_v2 = row.get<map_type_impl::native_type>("replication_v2");
const auto& next_replication = row.get<map_type_impl::native_type>("next_replication");
cql3::statements::property_definitions::map_type flat_strategy_options;
for (auto& p : replication_v2 ? *replication_v2 : replication) {
@@ -1259,6 +1269,17 @@ future<lw_shared_ptr<keyspace_metadata>> create_keyspace_metadata(
auto strategy_options = cql3::statements::from_flattened_map(flat_strategy_options);
auto strategy_name = std::get<sstring>(strategy_options["class"]);
strategy_options.erase("class");
std::optional<cql3::statements::property_definitions::extended_map_type> next_strategy_options = std::nullopt;
if (next_replication) {
cql3::statements::property_definitions::map_type flat_next_replication;
for (auto& p : *next_replication) {
flat_next_replication.emplace(value_cast<sstring>(p.first), value_cast<sstring>(p.second));
}
next_strategy_options = cql3::statements::from_flattened_map(flat_next_replication);
next_strategy_options->erase("class");
}
bool durable_writes = row.get_nonnull<bool>("durable_writes");
data_dictionary::storage_options storage_opts;
@@ -1284,7 +1305,7 @@ future<lw_shared_ptr<keyspace_metadata>> create_keyspace_metadata(
}
}
}
co_return keyspace_metadata::new_keyspace(keyspace_name, strategy_name, strategy_options, initial_tablets, consistency, durable_writes, storage_opts);
co_return keyspace_metadata::new_keyspace(keyspace_name, strategy_name, strategy_options, initial_tablets, consistency, durable_writes, storage_opts, {}, next_strategy_options);
}
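
To make the flattened-map convention above concrete, a standalone toy model (hypothetical values, plain std::map standing in for the property_definitions types): the strategy class rides inside the stored map and is stripped out again on load.

#include <cassert>
#include <map>
#include <string>

int main() {
    // The stored map carries the strategy class alongside the RF options...
    std::map<std::string, std::string> next_replication = {
        {"class", "NetworkTopologyStrategy"}, {"dc1", "3"}, {"dc2", "5"},
    };
    // ...and loading splits it back out, mirroring the erase("class") above.
    auto strategy = next_replication.at("class");
    next_replication.erase("class");
    assert(strategy == "NetworkTopologyStrategy");
    assert(next_replication.size() == 2); // only the per-DC options remain
}
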
template<typename V>

View File

@@ -300,6 +300,7 @@ schema_ptr system_keyspace::topology() {
.with_column("upgrade_state", utf8_type, column_kind::static_column)
.with_column("global_requests", set_type_impl::get_instance(timeuuid_type, true), column_kind::static_column)
.with_column("paused_rf_change_requests", set_type_impl::get_instance(timeuuid_type, true), column_kind::static_column)
.with_column("ongoing_rf_changes", set_type_impl::get_instance(timeuuid_type, true), column_kind::static_column)
.set_comment("Current state of topology change machine")
.with_hash_version()
.build();
@@ -3350,6 +3351,12 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
}
}
if (some_row.has("ongoing_rf_changes")) {
for (auto&& v : deserialize_set_column(*topology(), some_row, "ongoing_rf_changes")) {
ret.ongoing_rf_changes.insert(value_cast<utils::UUID>(v));
}
}
if (some_row.has("enabled_features")) {
ret.enabled_features = decode_features(deserialize_set_column(*topology(), some_row, "enabled_features"));
}

View File

@@ -15,11 +15,10 @@
#include <utility>
#include <vector>
#include "db/view/view_build_status.hh"
#include "gms/inet_address.hh"
#include "gms/generation-number.hh"
#include "gms/loaded_endpoint_state.hh"
#include "gms/gossiper.hh"
#include "schema/schema_fwd.hh"
#include "utils/UUID.hh"
#include "query/query-result-set.hh"
#include "db_clock.hh"
#include "mutation_query.hh"
#include "system_keyspace_view_types.hh"
@@ -37,10 +36,6 @@ namespace netw {
class shared_dict;
};
namespace query {
class result_set;
}
namespace sstables {
struct entry_descriptor;
class generation_type;

View File

@@ -29,8 +29,6 @@
#include "db/config.hh"
#include "db/view/base_info.hh"
#include "gms/gossiper.hh"
#include "query/query-result-set.hh"
#include "db/view/view_build_status.hh"
#include "db/view/view_consumer.hh"
#include "mutation/canonical_mutation.hh"
@@ -1586,9 +1584,11 @@ future<stop_iteration> view_update_builder::on_results() {
auto tombstone = std::max(_update_partition_tombstone, _update_current_tombstone);
if (tombstone && _existing && !_existing->is_end_of_partition()) {
// We don't care if it's a range tombstone, as we're only looking for existing entries that get deleted
if (_existing->is_clustering_row()) {
if (_existing->is_range_tombstone_change()) {
_existing_current_tombstone = _existing->as_range_tombstone_change().tombstone();
} else if (_existing->is_clustering_row()) {
auto existing = clustering_row(*_schema, _existing->as_clustering_row());
existing.apply(std::max(_existing_partition_tombstone, _existing_current_tombstone));
auto update = clustering_row(existing.key(), row_tombstone(std::move(tombstone)), row_marker(), ::row());
generate_update(std::move(update), { std::move(existing) });
} else if (_existing->is_static_row()) {
@@ -1599,9 +1599,10 @@ future<stop_iteration> view_update_builder::on_results() {
return should_stop_updates() ? stop() : advance_existings();
}
// If we have updates and it's a range tombstone, it removes nothing pre-existing, so we can ignore it
if (_update && !_update->is_end_of_partition()) {
if (_update->is_clustering_row()) {
if (_update->is_range_tombstone_change()) {
_update_current_tombstone = _update->as_range_tombstone_change().tombstone();
} else if (_update->is_clustering_row()) {
_update->mutate_as_clustering_row(*_schema, [&] (clustering_row& cr) mutable {
cr.apply(std::max(_update_partition_tombstone, _update_current_tombstone));
});
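// Tracking the latest range-tombstone-change on each stream keeps
// _existing_current_tombstone and _update_current_tombstone in step with the
// fragment streams, so rows consumed after the change are shadowed by the
// correct range tombstone.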

View File

@@ -13,7 +13,6 @@
#include <seastar/core/abort_source.hh>
#include <seastar/coroutine/parallel_for_each.hh>
#include <seastar/core/on_internal_error.hh>
#include "gms/gossiper.hh"
#include "db/view/view_building_coordinator.hh"
#include "db/view/view_build_status.hh"
#include "locator/tablets.hh"

View File

@@ -13,7 +13,7 @@
#include "db/system_keyspace.hh"
#include "locator/tablets.hh"
#include "mutation/canonical_mutation.hh"
#include "raft/raft_fwd.hh"
#include "raft/raft.hh"
#include "service/endpoint_lifecycle_subscriber.hh"
#include "service/raft/raft_group0.hh"
#include "service/raft/raft_group0_client.hh"

View File

@@ -21,8 +21,6 @@
#include "dht/token.hh"
#include "replica/database.hh"
#include "service/storage_proxy.hh"
#include "service/storage_service.hh"
#include "db/system_keyspace.hh"
#include "service/raft/raft_group0_client.hh"
#include "service/raft/raft_group0.hh"
#include "schema/schema_fwd.hh"

View File

@@ -17,7 +17,7 @@
#include <flat_set>
#include "locator/abstract_replication_strategy.hh"
#include "locator/tablets.hh"
#include "raft/raft_fwd.hh"
#include "raft/raft.hh"
#include <seastar/core/gate.hh>
#include "db/view/view_building_state.hh"
#include "sstables/shared_sstable.hh"

View File

@@ -240,6 +240,9 @@ future<> view_update_generator::process_staging_sstables(lw_shared_ptr<replica::
_progress_tracker->on_sstable_registration(sst);
}
utils::get_local_injector().inject("view_update_generator_pause_before_processing",
utils::wait_for_message(std::chrono::minutes(5))).get();
// Generate view updates from staging sstables
auto start_time = db_clock::now();
auto [result, input_size] = generate_updates_from_staging_sstables(table, sstables);

View File

@@ -20,7 +20,6 @@
#include "cdc/metadata.hh"
#include "db/config.hh"
#include "db/system_keyspace.hh"
#include "query/query-result-set.hh"
#include "db/virtual_table.hh"
#include "partition_slice_builder.hh"
#include "db/virtual_tables.hh"

View File

@@ -271,7 +271,7 @@ The json structure is as follows:
}
The `manifest` member contains the following attributes:
- `version` - respresenting the version of the manifest itself. It is incremented when members are added or removed from the manifest.
- `version` - representing the version of the manifest itself. It is incremented when members are added or removed from the manifest.
- `scope` - the scope of metadata stored in this manifest file. The following scopes are supported:
- `node` - the manifest describes all SSTables owned by this node in this snapshot.

View File

@@ -12,7 +12,9 @@ Schema:
CREATE TABLE system_schema.keyspaces (
keyspace_name text PRIMARY KEY,
durable_writes boolean,
replication frozen<map<text, text>>
replication frozen<map<text, text>>,
replication_v2 frozen<map<text, text>>,
next_replication frozen<map<text, text>>
)
```
@@ -31,6 +33,8 @@ Columns:
stored as a flattened map of the extended options map (see below).
For `SimpleStrategy` there is a single option `"replication_factor"` specifying the replication factor.
* `next_replication` - the target replication factor for the keyspace during an ongoing RF change.
If there is no ongoing RF change, the `next_replication` value is not set.
Extended options map used by NetworkTopologyStrategy is a map where values can be either strings or lists of strings.

View File

@@ -146,6 +146,25 @@ AWS Security Token Service (STS) or the EC2 Instance Metadata Service.
- When set, these values are used by the S3 client to sign requests.
- If not set, requests are sent unsigned, which may not be accepted by all servers.
.. _admin-oci-object-storage:
Using Oracle OCI Object Storage
=================================
Oracle Cloud Infrastructure (OCI) Object Storage is compatible with the Amazon
S3 API, so it works with ScyllaDB without any OCI-specific configuration.
To use OCI Object Storage, follow the same configuration as for AWS S3, and
specify your OCI S3-compatible endpoint.
Example:
.. code:: yaml
object_storage_endpoints:
- name: https://idedxcgnkfkt.compat.objectstorage.us-ashburn-1.oci.customer-oci.com:443
aws_region: us-ashburn-1
.. _admin-compression:
Compression

View File

@@ -231,6 +231,46 @@ Add New DC
Consider :ref:`upgrading rf_rack_valid_keyspaces option to enforce_rack_list option <keyspace-rf-rack-valid-to-enforce-rack-list>` to ensure all tablet keyspaces use rack lists.
If the keyspace uses rack list replication, update the replication factor in one ``ALTER KEYSPACE`` statement, under the following rules:
* Existing datacenters must keep their current replication factor.
* A new datacenter can be assigned a replication factor (**0 to N**).
* An existing datacenter can be removed (**N to 0**).
.. warning::
While adding a new datacenter and altering keyspaces, do **not** perform any reads or writes that involve the new datacenter.
In particular, avoid using global consistency levels (such as ``ALL``, ``EACH_QUORUM``) that would include the new datacenter in the operation.
Use ``LOCAL_*`` consistency levels (e.g., ``LOCAL_QUORUM``, ``LOCAL_ONE``) until the new datacenter is fully operational.
Before
.. code-block:: cql
DESCRIBE KEYSPACE mykeyspace4;
CREATE KEYSPACE mykeyspace4 WITH replication = { 'class' : 'NetworkTopologyStrategy', '<existing_dc>' : ['<existing_rack1>', '<existing_rack2>', '<existing_rack3>']} AND tablets = { 'enabled': true };
The following is **not** allowed because it changes the replication factor of ``<existing_dc>`` (adds ``<existing_rack4>``) and adds ``<new_dc>`` in the same statement:
.. code-block:: cql
ALTER KEYSPACE mykeyspace4 WITH replication = { 'class' : 'NetworkTopologyStrategy', '<existing_dc>' : ['<existing_rack1>', '<existing_rack2>', '<existing_rack3>', '<existing_rack4>'], '<new_dc>' : ['<new_rack1>', '<new_rack2>', '<new_rack3>']} AND tablets = { 'enabled': true };
Add all the nodes to the new datacenter and then:
.. code-block:: cql
ALTER KEYSPACE mykeyspace4 WITH replication = { 'class' : 'NetworkTopologyStrategy', '<existing_dc>' : ['<existing_rack1>', '<existing_rack2>', '<existing_rack3>'], '<new_dc>' : ['<new_rack1>', '<new_rack2>', '<new_rack3>']} AND tablets = { 'enabled': true };
After
.. code-block:: cql
DESCRIBE KEYSPACE mykeyspace4;
CREATE KEYSPACE mykeyspace4 WITH REPLICATION = {'class': 'NetworkTopologyStrategy', '<existing_dc>' : ['<existing_rack1>', '<existing_rack2>', '<existing_rack3>'], '<new_dc>' : ['<new_rack1>', '<new_rack2>', '<new_rack3>']} AND tablets = { 'enabled': true };
You can abort the keyspace alteration using :doc:`Task manager </operating-scylla/admin-tools/task-manager>`.
#. If any vnode keyspace was altered, run ``nodetool rebuild`` on each node in the new datacenter, specifying the existing datacenter name in the rebuild command.
For example:

View File

@@ -102,6 +102,34 @@ Procedure
Consider :ref:`upgrading rf_rack_valid_keyspaces option to enforce_rack_list option <keyspace-rf-rack-valid-to-enforce-rack-list>` to ensure all tablet keyspaces use rack lists.
If the keyspace uses rack list replication, update the replication factor in one ``ALTER KEYSPACE`` statement, under the following rules:
* Existing datacenters must keep their current replication factor.
* An existing datacenter can be removed (**N to 0**).
* A new datacenter can be assigned a replication factor (**0 to N**).
.. warning::
While removing a datacenter and altering keyspaces, do **not** perform any reads or writes that involve the datacenter being removed.
In particular, avoid using global consistency levels (such as ``ALL``, ``EACH_QUORUM``) that would include the decommissioned datacenter in the operation.
Use ``LOCAL_*`` consistency levels (e.g., ``LOCAL_QUORUM``, ``LOCAL_ONE``) until the datacenter is fully decommissioned.
.. code-block:: shell
cqlsh> DESCRIBE nba4
cqlsh> CREATE KEYSPACE nba4 WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'US-DC' : ['RAC1', 'RAC2', 'RAC3'], 'ASIA-DC' : ['RAC4', 'RAC5'], 'EUROPE-DC' : ['RAC6', 'RAC7', 'RAC8']} AND tablets = { 'enabled': true };
The following is **not** allowed because it changes the replication factor of ``EUROPE-DC`` (adds ``RAC9``) and removes ``ASIA-DC`` in the same statement:
.. code-block:: shell
cqlsh> ALTER KEYSPACE nba4 WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'US-DC' : ['RAC1', 'RAC2', 'RAC3'], 'ASIA-DC' : [], 'EUROPE-DC' : ['RAC6', 'RAC7', 'RAC8', 'RAC9']} AND tablets = { 'enabled': true };
Remove all replicas from the decommissioned datacenter:
.. code-block:: shell
cqlsh> ALTER KEYSPACE nba4 WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'US-DC' : ['RAC1', 'RAC2', 'RAC3'], 'ASIA-DC' : [], 'EUROPE-DC' : ['RAC6', 'RAC7', 'RAC8']} AND tablets = { 'enabled': true };
.. note::
If table audit is enabled, the ``audit`` keyspace is automatically created with ``NetworkTopologyStrategy``.
@@ -113,6 +141,10 @@ Procedure
Failure to do so will result in decommission errors such as "zero replica after the removal".
.. warning::
Removal of replicas from a datacenter cannot be aborted. To get back to the previous replication, wait until the ALTER KEYSPACE finishes and then add the replicas back by running another ALTER KEYSPACE statement.
#. Run :doc:`nodetool decommission </operating-scylla/nodetool-commands/decommission>` on every node in the data center that is to be removed.
Refer to :doc:`Remove a Node from a ScyllaDB Cluster - Down Scale </operating-scylla/procedures/cluster-management/remove-node>` for further information.

View File

@@ -7,6 +7,7 @@
#include <seastar/core/sstring.hh>
#include <seastar/core/seastar.hh>
#include <seastar/core/smp.hh>
#include "db/schema_features.hh"
#include "utils/log.hh"
#include "gms/feature.hh"
#include "gms/feature_service.hh"
@@ -179,6 +180,7 @@ db::schema_features feature_service::cluster_schema_features() const {
f.set<db::schema_feature::GROUP0_SCHEMA_VERSIONING>();
f.set_if<db::schema_feature::IN_MEMORY_TABLES>(bool(in_memory_tables));
f.set_if<db::schema_feature::TABLET_OPTIONS>(bool(tablet_options));
f.set_if<db::schema_feature::KEYSPACE_MULTI_RF_CHANGE>(bool(keyspace_multi_rf_change));
return f;
}

View File

@@ -182,6 +182,7 @@ public:
gms::feature writetime_ttl_individual_element { *this, "WRITETIME_TTL_INDIVIDUAL_ELEMENT"sv };
gms::feature arbitrary_tablet_boundaries { *this, "ARBITRARY_TABLET_BOUNDARIES"sv };
gms::feature large_data_virtual_tables { *this, "LARGE_DATA_VIRTUAL_TABLES"sv };
gms::feature keyspace_multi_rf_change { *this, "KEYSPACE_MULTI_RF_CHANGE"sv };
public:
const std::unordered_map<sstring, std::reference_wrapper<feature>>& registered_features() const;

View File

@@ -34,7 +34,6 @@
#include "locator/token_metadata.hh"
#include "locator/types.hh"
#include "gms/gossip_address_map.hh"
#include "gms/loaded_endpoint_state.hh"
namespace gms {
@@ -72,6 +71,11 @@ struct gossip_config {
utils::updateable_value<utils::UUID> recovery_leader;
};
struct loaded_endpoint_state {
gms::inet_address endpoint;
std::optional<locator::endpoint_dc_rack> opt_dc_rack;
};
/**
* This module is responsible for Gossiping information for the local endpoint. This abstraction
maintains the list of live and dead endpoints. Periodically, i.e. every second, this module

View File

@@ -1,23 +0,0 @@
/*
* Copyright (C) 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
#include <optional>
#include "gms/inet_address.hh"
#include "locator/types.hh"
namespace gms {
struct loaded_endpoint_state {
inet_address endpoint;
std::optional<locator::endpoint_dc_rack> opt_dc_rack;
};
} // namespace gms

View File

@@ -11,7 +11,7 @@
#include "query/query_id.hh"
#include "locator/host_id.hh"
#include "tasks/types.hh"
#include "service/session_id.hh"
#include "service/session.hh"
namespace utils {
class UUID final {
@@ -43,3 +43,4 @@ class host_id final {
};
} // namespace locator

View File

@@ -21,7 +21,6 @@
#include "utils/UUID_gen.hh"
#include "types/types.hh"
#include "utils/managed_string.hh"
#include "utils/rjson.hh"
#include <ranges>
#include <seastar/core/sstring.hh>
#include <boost/algorithm/string.hpp>

View File

@@ -284,3 +284,14 @@ future<> instance_cache::stop() {
}
}
namespace std {
template <>
struct equal_to<seastar::scheduling_group> {
bool operator()(const seastar::scheduling_group& sg1, const seastar::scheduling_group& sg2) const noexcept {
return sg1 == sg2;
}
};
}

View File

@@ -19,7 +19,6 @@
#include "utils/sequenced_set.hh"
#include "utils/simple_hashers.hh"
#include "tablets.hh"
#include "locator/replication_strategy_type.hh"
#include "data_dictionary/consistency_config_options.hh"
// forward declaration since replica/database.hh includes this file
@@ -39,6 +38,13 @@ extern logging::logger rslogger;
using inet_address = gms::inet_address;
using token = dht::token;
enum class replication_strategy_type {
simple,
local,
network_topology,
everywhere_topology,
};
using replication_strategy_config_option = std::variant<sstring, rack_list>;
using replication_strategy_config_options = std::map<sstring, replication_strategy_config_option>;

View File

@@ -381,6 +381,10 @@ public:
return _nodes.at(node)._du.capacity;
}
bool has_node(host_id node) const {
return _nodes.contains(node);
}
shard_id get_shard_count(host_id node) const {
if (!_nodes.contains(node)) {
return 0;

View File

@@ -1,20 +0,0 @@
/*
* Copyright (C) 2015-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
namespace locator {
enum class replication_strategy_type {
simple,
local,
network_topology,
everywhere_topology,
};
} // namespace locator

View File

@@ -12,7 +12,7 @@
#include "locator/token_metadata_fwd.hh"
#include "utils/small_vector.hh"
#include "locator/host_id.hh"
#include "service/session_id.hh"
#include "service/session.hh"
#include "dht/i_partitioner_fwd.hh"
#include "dht/token-sharding.hh"
#include "dht/ring_position.hh"
@@ -21,9 +21,10 @@
#include "utils/chunked_vector.hh"
#include "utils/hash.hh"
#include "utils/UUID.hh"
#include "raft/raft_fwd.hh"
#include "raft/raft.hh"
#include <ranges>
#include <seastar/core/reactor.hh>
#include <seastar/util/log.hh>
#include <seastar/core/sharded.hh>
#include <seastar/util/noncopyable_function.hh>
@@ -152,19 +153,27 @@ struct hash<locator::range_based_tablet_id> {
namespace locator {
/// Creates a new replica set with old_replica replaced by new_replica.
/// If there is no old_replica, the set is returned unchanged.
/// Returns a copy of the replica set with the following modifications:
/// - If both old_replica and new_replica are set, old_replica is substituted
/// with new_replica. If old_replica is not found in rs, the set is returned as-is.
/// - If only old_replica is set, it is removed from the result.
/// - If only new_replica is set, it is appended to the result.
inline
tablet_replica_set replace_replica(const tablet_replica_set& rs, tablet_replica old_replica, tablet_replica new_replica) {
tablet_replica_set replace_replica(const tablet_replica_set& rs, std::optional<tablet_replica> old_replica, std::optional<tablet_replica> new_replica) {
tablet_replica_set result;
result.reserve(rs.size());
for (auto&& r : rs) {
if (r == old_replica) {
result.push_back(new_replica);
if (old_replica.has_value() && r == old_replica.value()) {
if (new_replica.has_value()) {
result.push_back(new_replica.value());
}
} else {
result.push_back(r);
}
}
if (!old_replica.has_value() && new_replica.has_value()) {
result.push_back(new_replica.value());
}
return result;
}
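
The new optional-based contract can be seen in miniature with a standalone toy (ints in place of tablet_replica; illustrative only, not ScyllaDB code):

#include <cassert>
#include <optional>
#include <vector>

static std::vector<int> replace(const std::vector<int>& rs,
                                std::optional<int> old_r, std::optional<int> new_r) {
    std::vector<int> out;
    for (int r : rs) {
        if (old_r && r == *old_r) {
            if (new_r) out.push_back(*new_r);   // substitute old with new
        } else {
            out.push_back(r);
        }
    }
    if (!old_r && new_r) out.push_back(*new_r); // pure addition appends
    return out;
}

int main() {
    assert((replace({1, 2, 3}, 2, 9) == std::vector<int>{1, 9, 3}));            // migration
    assert((replace({1, 2, 3}, 2, std::nullopt) == std::vector<int>{1, 3}));    // removal
    assert((replace({1, 2, 3}, std::nullopt, 9) == std::vector<int>{1, 2, 3, 9})); // addition
}
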
@@ -382,8 +391,8 @@ bool is_post_cleanup(tablet_replica replica, const tablet_info& tinfo, const tab
struct tablet_migration_info {
locator::tablet_transition_kind kind;
locator::global_tablet_id tablet;
locator::tablet_replica src;
locator::tablet_replica dst;
std::optional<locator::tablet_replica> src;
std::optional<locator::tablet_replica> dst;
};
class tablet_map;

View File

@@ -942,7 +942,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
auto background_reclaim_scheduling_group = create_scheduling_group("background_reclaim", "bgre", 50).get();
// Maintenance supergroup -- the collection of background low-prio activites
// Maintenance supergroup -- the collection of background low-prio activities
auto maintenance_supergroup = create_scheduling_supergroup(200).get();
auto bandwidth_updater = io_throughput_updater("maintenance supergroup", maintenance_supergroup,
cfg->maintenance_io_throughput_mb_per_sec.is_set() ? cfg->maintenance_io_throughput_mb_per_sec : cfg->stream_io_throughput_mb_per_sec);

View File

@@ -11,10 +11,9 @@
#include <seastar/core/condition-variable.hh>
#include <seastar/rpc/rpc_types.hh>
#include <utility>
#include "rpc_compression_types.hh"
#include "utils/refcounted.hh"
#include "utils/updateable_value.hh"
#include "utils/enum_option.hh"
#include "shared_dict.hh"
namespace netw {
@@ -29,6 +28,103 @@ class dict_sampler;
using dict_ptr = lw_shared_ptr<foreign_ptr<lw_shared_ptr<shared_dict>>>;
class control_protocol_frame;
// An enum wrapper, describing supported RPC compression algorithms.
// Always contains a valid value -- the constructors won't allow
// an invalid/unknown enum variant to be constructed.
struct compression_algorithm {
using underlying = uint8_t;
enum class type : underlying {
RAW,
LZ4,
ZSTD,
COUNT,
} _value;
// Construct from an integer.
// Used to deserialize the algorithm from the first byte of the frame.
constexpr compression_algorithm(underlying x) {
if (x < std::to_underlying(type::RAW) || x >= std::to_underlying(type::COUNT)) {
throw std::runtime_error(fmt::format("Invalid value {} for enum compression_algorithm", static_cast<int>(x)));
}
_value = static_cast<type>(x);
}
// Construct from `type`. Makes sure that `type` has a valid value.
constexpr compression_algorithm(type x) : compression_algorithm(std::to_underlying(x)) {}
// These names are used in multiple places:
// RPC negotiation, in metric labels, and config.
static constexpr std::string_view names[] = {
"raw",
"lz4",
"zstd",
};
static_assert(std::size(names) == static_cast<int>(compression_algorithm::type::COUNT));
// Implements enum_option.
static auto map() {
std::unordered_map<std::string, type> ret;
for (size_t i = 0; i < std::size(names); ++i) {
ret.insert(std::make_pair<std::string, type>(std::string(names[i]), compression_algorithm(i).get()));
}
return ret;
}
constexpr std::string_view name() const noexcept { return names[idx()]; }
constexpr underlying idx() const noexcept { return std::to_underlying(_value); }
constexpr type get() const noexcept { return _value; }
constexpr static size_t count() { return static_cast<size_t>(type::COUNT); };
auto operator<=>(const compression_algorithm&) const = default;
};
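
A small usage sketch (illustrative only, assuming <cassert> and the members shown above) of the deserialization path this constructor guards:

// The first byte of a frame names the algorithm; out-of-range values throw
// in the constructor, so an unknown variant can never be materialized.
auto algo = netw::compression_algorithm(uint8_t{2}); // byte 2 -> zstd
assert(algo.get() == netw::compression_algorithm::type::ZSTD);
assert(algo.name() == "zstd");
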
// Represents a set of compression algorithms.
// Backed by a bitset.
// Used for convenience during algorithm negotiations.
class compression_algorithm_set {
uint8_t _bitset;
static_assert(std::numeric_limits<decltype(_bitset)>::digits > compression_algorithm::count());
constexpr compression_algorithm_set(uint8_t v) noexcept : _bitset(v) {}
public:
// Returns a set containing the given algorithm and all algorithms weaker (smaller in the enum order)
// than it.
constexpr static compression_algorithm_set this_or_lighter(compression_algorithm algo) noexcept {
auto x = 1 << algo.idx();
return {uint8_t(x + (x - 1))};
}
// Returns the strongest (greatest in the enum order) algorithm in the set.
constexpr compression_algorithm heaviest() const {
return {compression_algorithm::underlying(std::bit_width(_bitset) - 1)};
}
// The usual set operations.
constexpr static compression_algorithm_set singleton(compression_algorithm algo) noexcept {
return {uint8_t(1 << algo.idx())};
}
constexpr compression_algorithm_set intersection(compression_algorithm_set o) const noexcept {
return {uint8_t(_bitset & o._bitset)};
}
constexpr compression_algorithm_set difference(compression_algorithm_set o) const noexcept {
return {uint8_t(_bitset &~ o._bitset)};
}
constexpr compression_algorithm_set sum(compression_algorithm_set o) const noexcept {
return {uint8_t(_bitset | o._bitset)};
}
constexpr bool contains(compression_algorithm algo) const noexcept {
return _bitset & (1 << algo.idx());
}
constexpr bool operator==(const compression_algorithm_set&) const = default;
// Returns the contained bitset. Used for serialization.
constexpr uint8_t value() const noexcept {
return _bitset;
}
// Reconstructs the set from the output of `value()`. Used for deserialization.
constexpr static compression_algorithm_set from_value(uint8_t bitset) {
compression_algorithm_set x = bitset;
x.heaviest(); // This is a validation check. It will throw if the bitset contains some illegal/unknown bits.
return x;
}
};
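
And a hedged negotiation sketch built only from the set algebra above (the helper name pick_common is hypothetical, not from the patch):

// Each peer offers "this algorithm or anything lighter"; the connection then
// uses the strongest algorithm present on both sides.
inline netw::compression_algorithm pick_common(netw::compression_algorithm ours_best,
                                               netw::compression_algorithm theirs_best) {
    auto ours = netw::compression_algorithm_set::this_or_lighter(ours_best);
    auto theirs = netw::compression_algorithm_set::this_or_lighter(theirs_best);
    // e.g. zstd vs lz4: {raw,lz4,zstd} intersected with {raw,lz4} yields lz4.
    return ours.intersection(theirs).heaviest();
}
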
using algo_config = std::vector<enum_option<compression_algorithm>>;
// See docs/dev/advanced_rpc_compression.md,
// section `Negotiation` for more information about the protocol.
struct control_protocol {
@@ -152,7 +248,7 @@ struct per_algorithm_stats {
// prevent a misuse of the API (dangling references).
class advanced_rpc_compressor::tracker : public utils::refcounted {
public:
using algo_config = netw::algo_config;
using algo_config = algo_config;
struct config {
utils::updateable_value<uint32_t> zstd_min_msg_size{0};
utils::updateable_value<uint32_t> zstd_max_msg_size{std::numeric_limits<uint32_t>::max()};

View File

@@ -9,7 +9,7 @@
#pragma once
#include "shared_dict.hh"
#include "rpc_compression_types.hh"
#include "advanced_rpc_compressor.hh"
namespace netw {

View File

@@ -8,7 +8,6 @@
#pragma once
#include "rpc_compression_types.hh"
#include "utils/reservoir_sampling.hh"
#include "utils/updateable_value.hh"
#include <seastar/core/future.hh>
@@ -89,7 +88,28 @@ class dict_training_loop {
seastar::semaphore _pause{0};
seastar::abort_source _pause_as;
public:
using when = netw::dict_training_when;
struct when {
enum class type {
NEVER,
WHEN_LEADER,
ALWAYS,
COUNT,
};
static constexpr std::string_view names[] = {
"never",
"when_leader",
"always",
};
static_assert(std::size(names) == static_cast<size_t>(type::COUNT));
// Implements enum_option.
static std::unordered_map<std::string, type> map() {
std::unordered_map<std::string, type> ret;
for (size_t i = 0; i < std::size(names); ++i) {
ret.insert({std::string(names[i]), type(i)});
}
return ret;
}
};
void pause();
void unpause();
void cancel() noexcept;
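
For reference, a minimal sketch (assuming <cassert>; not part of the patch) of how the map() contract above lets enum_option resolve a config string:

// enum_option uses map() to translate YAML strings into enum values, e.g.:
auto m = netw::dict_training_loop::when::map();
assert(m.at("when_leader") == netw::dict_training_loop::when::type::WHEN_LEADER);
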

View File

@@ -54,11 +54,11 @@ dictionary_service::dictionary_service(
void dictionary_service::maybe_toggle_dict_training() {
auto when = _rpc_dict_training_when();
netw::dict_trainer_logger.debug("dictionary_service::maybe_toggle_dict_training(), called, _is_leader={}, when={}", _is_leader, when);
if (when == netw::dict_training_when::type::NEVER) {
if (when == netw::dict_training_loop::when::type::NEVER) {
_training_fiber.pause();
} else if (when == netw::dict_training_when::type::ALWAYS) {
} else if (when == netw::dict_training_loop::when::type::ALWAYS) {
_training_fiber.unpause();
} else if (when == netw::dict_training_when::type::WHEN_LEADER) {
} else if (when == netw::dict_training_loop::when::type::WHEN_LEADER) {
_is_leader ? _training_fiber.unpause() : _training_fiber.pause();
}
};

View File

@@ -40,7 +40,7 @@ namespace gms {
class dictionary_service {
db::system_keyspace& _sys_ks;
locator::host_id _our_host_id;
utils::updateable_value<enum_option<netw::dict_training_when>> _rpc_dict_training_when;
utils::updateable_value<enum_option<netw::dict_training_loop::when>> _rpc_dict_training_when;
service::raft_group0_client& _raft_group0_client;
abort_source& _as;
netw::dict_training_loop _training_fiber;
@@ -48,7 +48,7 @@ class dictionary_service {
bool _is_leader = false;
utils::observer<bool> _leadership_observer;
utils::observer<enum_option<netw::dict_training_when>> _when_observer;
utils::observer<enum_option<netw::dict_training_loop::when>> _when_observer;
std::optional<std::any> _feature_observer;
void maybe_toggle_dict_training();
@@ -61,7 +61,7 @@ public:
locator::host_id our_host_id = Uninitialized();
utils::updateable_value<uint32_t> rpc_dict_training_min_time_seconds = Uninitialized();
utils::updateable_value<uint64_t> rpc_dict_training_min_bytes = Uninitialized();
utils::updateable_value<enum_option<netw::dict_training_when>> rpc_dict_training_when = Uninitialized();
utils::updateable_value<enum_option<netw::dict_training_loop::when>> rpc_dict_training_when = Uninitialized();
};
// Note: the training fiber will start as soon as the relevant cluster feature is enabled.
dictionary_service(

View File

@@ -19,7 +19,6 @@
#include <seastar/coroutine/all.hh>
#include "message/messaging_service.hh"
#include "message/advanced_rpc_compressor.hh"
#include <seastar/core/sharded.hh>
#include "gms/gossiper.hh"
#include "service/storage_service.hh"

View File

@@ -19,11 +19,11 @@
#include "schema/schema_fwd.hh"
#include "streaming/stream_fwd.hh"
#include "locator/host_id.hh"
#include "service/session_id.hh"
#include "service/session.hh"
#include "service/maintenance_mode.hh"
#include "gms/gossip_address_map.hh"
#include "gms/generation-number.hh"
#include "tasks/types.hh"
#include "message/advanced_rpc_compressor.hh"
#include "utils/chunked_vector.hh"
#include <list>
@@ -120,8 +120,6 @@ namespace qos {
namespace netw {
class walltime_compressor_tracker;
/* All verb handler identifiers */
enum class messaging_verb : int32_t {
CLIENT_ID = 0,

View File

@@ -1,155 +0,0 @@
/*
* Copyright (C) 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
#include <bit>
#include <compare>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <limits>
#include <stdexcept>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>
#include <vector>
#include "utils/enum_option.hh"
namespace netw {
// An enum wrapper, describing supported RPC compression algorithms.
// Always contains a valid value -- the constructors won't allow
// an invalid/unknown enum variant to be constructed.
struct compression_algorithm {
using underlying = uint8_t;
enum class type : underlying {
RAW,
LZ4,
ZSTD,
COUNT,
} _value;
// Construct from an integer.
// Used to deserialize the algorithm from the first byte of the frame.
constexpr compression_algorithm(underlying x) {
if (x < std::to_underlying(type::RAW) || x >= std::to_underlying(type::COUNT)) {
throw std::runtime_error(std::string("Invalid value ") + std::to_string(unsigned(x)) + " for enum compression_algorithm");
}
_value = static_cast<type>(x);
}
// Construct from `type`. Makes sure that `type` has a valid value.
constexpr compression_algorithm(type x) : compression_algorithm(std::to_underlying(x)) {}
// These names are used in multiple places:
// RPC negotiation, in metric labels, and config.
static constexpr std::string_view names[] = {
"raw",
"lz4",
"zstd",
};
static_assert(std::size(names) == static_cast<int>(compression_algorithm::type::COUNT));
// Implements enum_option.
static auto map() {
std::unordered_map<std::string, type> ret;
for (size_t i = 0; i < std::size(names); ++i) {
ret.insert(std::make_pair(std::string(names[i]), compression_algorithm(i).get()));
}
return ret;
}
constexpr std::string_view name() const noexcept { return names[idx()]; }
constexpr underlying idx() const noexcept { return std::to_underlying(_value); }
constexpr type get() const noexcept { return _value; }
constexpr static size_t count() { return static_cast<size_t>(type::COUNT); }
bool operator<=>(const compression_algorithm&) const = default;
};
// Represents a set of compression algorithms.
// Backed by a bitset.
// Used for convenience during algorithm negotiations.
class compression_algorithm_set {
uint8_t _bitset;
static_assert(std::numeric_limits<decltype(_bitset)>::digits > compression_algorithm::count());
constexpr compression_algorithm_set(uint8_t v) noexcept : _bitset(v) {}
public:
// Returns a set containing the given algorithm and all algorithms weaker (smaller in the enum order)
// than it.
constexpr static compression_algorithm_set this_or_lighter(compression_algorithm algo) noexcept {
auto x = 1 << algo.idx();
return {uint8_t(x + (x - 1))};
}
// Returns the strongest (greatest in the enum order) algorithm in the set.
constexpr compression_algorithm heaviest() const {
return {compression_algorithm::underlying(std::bit_width(_bitset) - 1)};
}
// The usual set operations.
constexpr static compression_algorithm_set singleton(compression_algorithm algo) noexcept {
return {uint8_t(1 << algo.idx())};
}
constexpr compression_algorithm_set intersection(compression_algorithm_set o) const noexcept {
return {uint8_t(_bitset & o._bitset)};
}
constexpr compression_algorithm_set difference(compression_algorithm_set o) const noexcept {
return {uint8_t(_bitset &~ o._bitset)};
}
constexpr compression_algorithm_set sum(compression_algorithm_set o) const noexcept {
return {uint8_t(_bitset | o._bitset)};
}
constexpr bool contains(compression_algorithm algo) const noexcept {
return _bitset & (1 << algo.idx());
}
constexpr bool operator==(const compression_algorithm_set&) const = default;
// Returns the contained bitset. Used for serialization.
constexpr uint8_t value() const noexcept {
return _bitset;
}
// Reconstructs the set from the output of `value()`. Used for deserialization.
constexpr static compression_algorithm_set from_value(uint8_t bitset) {
compression_algorithm_set x = bitset;
x.heaviest(); // Validation: throws on illegal/unknown bits.
return x;
}
};
using algo_config = std::vector<enum_option<compression_algorithm>>;
struct dict_training_when {
enum class type {
NEVER,
WHEN_LEADER,
ALWAYS,
COUNT,
};
static constexpr std::string_view names[] = {
"never",
"when_leader",
"always",
};
static_assert(std::size(names) == static_cast<size_t>(type::COUNT));
// Implements enum_option.
static std::unordered_map<std::string, type> map() {
std::unordered_map<std::string, type> ret;
for (size_t i = 0; i < std::size(names); ++i) {
ret.insert({std::string(names[i]), type(i)});
}
return ret;
}
};
} // namespace netw

View File

@@ -16,6 +16,8 @@ Usage:
import argparse, os, sys
from typing import Sequence
from test.pylib.driver_utils import safe_driver_shutdown
def read_statements(path: str) -> list[tuple[int, str]]:
stms: list[tuple[int, str]] = []
with open(path, 'r', encoding='utf-8') as f:
@@ -56,7 +58,7 @@ def exec_statements(statements: list[tuple[int, str]], socket_path: str, timeout
print(f"ERROR executing statement from file line {lineno}: {s}\n{e}", file=sys.stderr)
return 1
finally:
cluster.shutdown()
safe_driver_shutdown(cluster)
return 0
def main(argv: Sequence[str]) -> int:

View File

@@ -1,27 +0,0 @@
/*
* Copyright (C) 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
// Lightweight forward-declaration header for commonly used raft types.
// Include this instead of raft/raft.hh when only the basic ID/index types
// are needed (e.g. in other header files), to avoid pulling in the full
// raft machinery (futures, abort_source, bytes_ostream, etc.).
#include "internal.hh"
namespace raft {
using server_id = internal::tagged_id<struct server_id_tag>;
using group_id = internal::tagged_id<struct group_id_tag>;
using term_t = internal::tagged_uint64<struct term_tag>;
using index_t = internal::tagged_uint64<struct index_tag>;
using read_id = internal::tagged_uint64<struct read_id_tag>;
class server;
} // namespace raft

View File

@@ -269,6 +269,10 @@ public:
// Gets the view a sstable currently belongs to.
compaction::compaction_group_view& view_for_sstable(const sstables::shared_sstable& sst) const;
utils::small_vector<compaction::compaction_group_view*, 3> all_views() const;
// Returns true iff v is the repaired view of this compaction group.
bool is_repaired_view(const compaction::compaction_group_view* v) const noexcept;
// Returns an sstable set containing only repaired sstables (those classified as repaired).
lw_shared_ptr<sstables::sstable_set> make_repaired_sstable_set() const;
seastar::condition_variable& get_staging_done_condition() noexcept {
return _staging_done_condition;
@@ -404,6 +408,8 @@ public:
// Make an sstable set spanning all sstables in the storage_group
lw_shared_ptr<const sstables::sstable_set> make_sstable_set() const;
// Like make_sstable_set(), but restricted to repaired sstables only across all compaction groups.
lw_shared_ptr<const sstables::sstable_set> make_repaired_sstable_set() const;
future<utils::chunked_vector<logstor::segment_snapshot>> take_logstor_snapshot() const;

View File

@@ -1006,7 +1006,7 @@ future<database::keyspace_change_per_shard> database::prepare_update_keyspace_on
co_await modify_keyspace_on_all_shards(sharded_db, [&] (replica::database& db) -> future<> {
auto& ks = db.find_keyspace(ksm.name());
auto new_ksm = ::make_lw_shared<keyspace_metadata>(ksm.name(), ksm.strategy_name(), ksm.strategy_options(), ksm.initial_tablets(), ksm.consistency_option(), ksm.durable_writes(),
ks.metadata()->cf_meta_data() | std::views::values | std::ranges::to<std::vector>(), ks.metadata()->user_types(), ksm.get_storage_options());
ks.metadata()->cf_meta_data() | std::views::values | std::ranges::to<std::vector>(), ks.metadata()->user_types(), ksm.get_storage_options(), ksm.next_strategy_options_opt());
auto change = co_await db.prepare_update_keyspace(ks, new_ksm, pending_token_metadata.local());
changes[this_shard_id()] = make_foreign(std::make_unique<keyspace_change>(std::move(change)));

View File

@@ -8,6 +8,7 @@
#pragma once
#include "locator/abstract_replication_strategy.hh"
#include "index/secondary_index_manager.hh"
#include <seastar/core/abort_source.hh>
#include <seastar/core/sstring.hh>
@@ -112,10 +113,6 @@ namespace gms {
class feature_service;
}
namespace locator {
class abstract_replication_strategy;
}
namespace alternator {
class table_stats;
}
@@ -760,6 +757,10 @@ private:
// groups during tablet split with overlapping token range, and we need to include them all in a single
// sstable set to allow safe tombstone gc.
lw_shared_ptr<const sstables::sstable_set> sstable_set_for_tombstone_gc(const compaction_group&) const;
// Like sstable_set_for_tombstone_gc(), but restricted to repaired sstables only across all compaction
// groups of the same tablet (storage group). Used by the tombstone_gc=repair optimization to avoid
// scanning unrepaired sstables when looking for GC-blocking shadows.
lw_shared_ptr<const sstables::sstable_set> make_repaired_sstable_set_for_tombstone_gc(const compaction_group&) const;
bool cache_enabled() const {
return _config.enable_cache && _schema->caching_options().enabled();

View File

@@ -69,13 +69,6 @@ struct segment_descriptor : public log_heap_hook<segment_descriptor_hist_options
}
};
} // namespace replica::logstor
template<>
size_t hist_key<replica::logstor::segment_descriptor>(const replica::logstor::segment_descriptor& desc);
namespace replica::logstor {
using segment_descriptor_hist = log_heap<segment_descriptor, segment_descriptor_hist_options>;
struct segment_set {

View File

@@ -1203,11 +1203,35 @@ future<utils::chunked_vector<logstor::segment_snapshot>> storage_group::take_log
co_return std::move(snp);
}
lw_shared_ptr<const sstables::sstable_set> storage_group::make_repaired_sstable_set() const {
if (_split_ready_groups.empty() && _merging_groups.empty()) {
return _main_cg->make_repaired_sstable_set();
}
const auto& schema = _main_cg->_t.schema();
std::vector<lw_shared_ptr<sstables::sstable_set>> underlying;
underlying.reserve(1 + _merging_groups.size() + _split_ready_groups.size());
underlying.emplace_back(_main_cg->make_repaired_sstable_set());
for (const auto& cg : _merging_groups) {
if (!cg->empty()) {
underlying.emplace_back(cg->make_repaired_sstable_set());
}
}
for (const auto& cg : _split_ready_groups) {
underlying.emplace_back(cg->make_repaired_sstable_set());
}
return make_lw_shared(sstables::make_compound_sstable_set(schema, std::move(underlying)));
}
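A rough std-only analogue of the composition above (the stub types and make_compound are invented for the sketch, not ScyllaDB API): each compaction group contributes its own repaired subset, and the storage group exposes their union as a single immutable set.

#include <cassert>
#include <memory>
#include <vector>

struct sstable_stub { int id; };
using sstable_set_stub = std::vector<std::shared_ptr<sstable_stub>>;

// Invented analogue of make_compound_sstable_set(): a one-shot union of
// per-group sets, handed out as a single read-only set.
sstable_set_stub make_compound(std::vector<sstable_set_stub> underlying) {
    sstable_set_stub out;
    for (auto& s : underlying) {
        out.insert(out.end(), s.begin(), s.end());
    }
    return out;
}

int main() {
    sstable_set_stub main_group{std::make_shared<sstable_stub>(1)};
    sstable_set_stub split_ready{std::make_shared<sstable_stub>(2)};
    auto all = make_compound({main_group, split_ready});
    assert(all.size() == 2);
}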
lw_shared_ptr<const sstables::sstable_set> table::sstable_set_for_tombstone_gc(const compaction_group& cg) const {
auto& sg = storage_group_for_id(cg.group_id());
return sg.make_sstable_set();
}
lw_shared_ptr<const sstables::sstable_set> table::make_repaired_sstable_set_for_tombstone_gc(const compaction_group& cg) const {
auto& sg = storage_group_for_id(cg.group_id());
return sg.make_repaired_sstable_set();
}
bool tablet_storage_group_manager::all_storage_groups_split() {
auto& tmap = tablet_map();
if (_split_ready_seq_number == tmap.resize_decision().sequence_number) {
@@ -3000,9 +3024,47 @@ public:
future<lw_shared_ptr<const sstables::sstable_set>> maintenance_sstable_set() const override {
return make_sstable_set_for_this_view(_cg.maintenance_sstables(), [this] { return *_cg.make_maintenance_sstable_set(); });
}
private:
// Returns true when tombstone GC is restricted to the repaired set:
// tombstone_gc=repair mode and this view is the repaired view.
//
// The optimization is safe for materialized view tables as well as base tables.
// The key invariant for MV: MV tablet repair calls flush_hints() before
// take_storage_snapshot(). flush_hints() creates a sync point that covers BOTH
// _hints_manager (base mutations) AND _hints_for_views_manager (view mutations).
// It waits until all pending hints — including any D_view hint stored in
// _hints_for_views_manager while the target node was down — have been replayed
// to the target node. Only then is take_storage_snapshot() called, which flushes
// the MV memtable and captures D_view in the repairing sstable. After repair
// completes, D_view is in the repaired set.
//
// If a subsequent base repair later replays a D_base hint that causes another
// D_view write (same key and timestamp), it is a no-op duplicate: the original
// D_view already in the repaired set still prevents T_mv from being purged.
//
// USING TIMESTAMP with timestamps predating (gc_before + propagation_delay) is
// explicitly UB and excluded from the safety argument.
bool is_tombstone_gc_repaired_only() const noexcept {
return _cg.is_repaired_view(this) &&
_t.schema()->tombstone_gc_options().mode() == tombstone_gc_mode::repair;
}
public:
lw_shared_ptr<const sstables::sstable_set> sstable_set_for_tombstone_gc() const override {
// Optimization: when tombstone_gc=repair and this is the repaired view, only check
// repaired sstables. The repair ordering guarantee ensures that by the time a tombstone
// becomes GC-eligible (repair_time committed to Raft), any data it shadows has already
// been promoted from repairing to repaired. Unrepaired data always has timestamps newer
// than any GC-eligible tombstone (legitimate writes; USING TIMESTAMP abuse is UB).
// For all other tombstone_gc modes this invariant does not hold, so we fall through to
// the full storage-group set.
if (is_tombstone_gc_repaired_only()) {
return _t.make_repaired_sstable_set_for_tombstone_gc(_cg);
}
return _t.sstable_set_for_tombstone_gc(_cg);
}
bool skip_memtable_for_tombstone_gc() const noexcept override {
return is_tombstone_gc_repaired_only();
}
std::unordered_set<sstables::shared_sstable> fully_expired_sstables(const std::vector<sstables::shared_sstable>& sstables, gc_clock::time_point query_time) const override {
return compaction::get_fully_expired_sstables(*this, sstables, query_time);
}
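A condensed std-only restatement of the gating condition (the enum and struct are invented stand-ins for the real wiring): the repaired-only restriction applies only when the table's tombstone_gc mode is repair and the view at hand is the repaired view; any other combination falls back to the full storage-group set.

#include <cassert>

enum class gc_mode { timeout, disabled, immediate, repair };

// Invented stand-in for the compaction_group_view wiring above.
struct view_ctx {
    bool is_repaired_view;
    gc_mode mode;
    bool repaired_only() const noexcept {
        return is_repaired_view && mode == gc_mode::repair;
    }
};

int main() {
    assert((view_ctx{true, gc_mode::repair}.repaired_only()));
    assert(!(view_ctx{true, gc_mode::timeout}.repaired_only()));   // wrong mode
    assert(!(view_ctx{false, gc_mode::repair}.repaired_only()));   // unrepaired view
}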
@@ -5419,6 +5481,21 @@ compaction::compaction_group_view& compaction_group::view_for_unrepaired_data()
return *_unrepaired_view;
}
bool compaction_group::is_repaired_view(const compaction::compaction_group_view* v) const noexcept {
return v == _repaired_view.get();
}
lw_shared_ptr<sstables::sstable_set> compaction_group::make_repaired_sstable_set() const {
auto set = make_lw_shared<sstables::sstable_set>(make_main_sstable_set());
auto sstables_repaired_at = get_sstables_repaired_at();
for (auto& sst : *_main_sstables->all()) {
if (repair::is_repaired(sstables_repaired_at, sst)) {
set->insert(sst);
}
}
return set;
}
compaction::compaction_group_view& compaction_group::view_for_sstable(const sstables::shared_sstable& sst) const {
switch (_repair_sstable_classifier(sst, get_sstables_repaired_at())) {
case repair_sstable_classification::unrepaired: return *_unrepaired_view;

View File

@@ -10,7 +10,7 @@
#include "mutation/mutation.hh"
#include "db/system_keyspace.hh"
#include "service/session_id.hh"
#include "service/session.hh"
#include "locator/tablets.hh"
namespace replica {

View File

@@ -493,7 +493,7 @@ std::unique_ptr<service::pager::query_pager> service::pager::query_pagers::pager
// If partition row limit is applied to paging, we still need to fall back
// to filtering the results to avoid extraneous rows on page breaks.
if (!filtering_restrictions && cmd->slice.partition_row_limit() < query::max_rows_if_set) {
filtering_restrictions = ::make_shared<cql3::restrictions::statement_restrictions>(s, true);
filtering_restrictions = cql3::restrictions::make_trivial_statement_restrictions(s, true);
}
if (filtering_restrictions) {
return std::make_unique<filtering_query_pager>(proxy, std::move(s), std::move(selection), state,

View File

@@ -16,7 +16,6 @@
#include "service/paxos/paxos_state.hh"
#include "service/query_state.hh"
#include "cql3/query_processor.hh"
#include "transport/messages/result_message.hh"
#include "cql3/untyped_result_set.hh"
#include "db/system_keyspace.hh"
#include "replica/database.hh"

View File

@@ -6,7 +6,6 @@
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
#include <unordered_set>
#include "service/raft/group0_fwd.hh"
namespace service {

View File

@@ -9,11 +9,7 @@
#pragma once
#include <iosfwd>
#include <variant>
#include <vector>
#include <seastar/core/timer.hh>
#include <seastar/core/lowres_clock.hh>
#include "raft/raft_fwd.hh"
#include "raft/raft.hh"
#include "gms/inet_address.hh"
namespace service {

View File

@@ -8,7 +8,7 @@
#pragma once
#include "service/session_id.hh"
#include "utils/UUID.hh"
#include <seastar/core/gate.hh>
#include <seastar/core/shared_future.hh>
@@ -19,6 +19,12 @@
namespace service {
using session_id = utils::tagged_uuid<struct session_id_tag>;
// We want it to be different than a default-constructed session_id to catch mistakes.
constexpr session_id default_session_id = session_id(
utils::UUID(0x81e7fc5a8d4411ee, 0x8577325096b39f47)); // timeuuid 2023-11-27 16:46:27.182089.0 UTC
/// Session is used to track execution of work related to some greater task, identified by session_id.
/// Work can enter the session using enter(), and is considered to be part of the session
/// as long as the guard returned by enter() is alive.

View File

@@ -1,21 +0,0 @@
/*
* Copyright (C) 2023-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
#include "utils/UUID.hh"
namespace service {
using session_id = utils::tagged_uuid<struct session_id_tag>;
// We want it to be different than a default-constructed session_id to catch mistakes.
constexpr session_id default_session_id = session_id(
utils::UUID(0x81e7fc5a8d4411ee, 0x8577325096b39f47)); // timeuuid 2023-11-27 16:46:27.182089.0 UTC
} // namespace service

View File

@@ -38,6 +38,7 @@
#include "replica/exceptions.hh"
#include "locator/host_id.hh"
#include "dht/token_range_endpoints.hh"
#include "service/storage_service.hh"
#include "service/cas_shard.hh"
#include "service/storage_proxy_fwd.hh"

View File

@@ -1342,6 +1342,11 @@ future<bool> storage_service::ongoing_rf_change(const group0_guard& guard, sstri
co_return true;
}
}
for (auto request_id : _topology_state_machine._topology.ongoing_rf_changes) {
if (co_await ongoing_ks_rf_change(request_id)) {
co_return true;
}
}
co_return false;
}
@@ -2426,7 +2431,7 @@ storage_service::prepare_replacement_info(std::unordered_set<gms::inet_address>
}
future<std::map<gms::inet_address, float>> storage_service::get_ownership() {
return run_with_no_api_lock([this] (storage_service& ss) {
return run_with_no_api_lock([] (storage_service& ss) {
const auto& tm = ss.get_token_metadata();
auto token_map = dht::token::describe_ownership(tm.sorted_tokens());
// describeOwnership returns tokens in an unspecified order, let's re-order them
@@ -2434,7 +2439,7 @@ future<std::map<gms::inet_address, float>> storage_service::get_ownership() {
for (auto entry : token_map) {
locator::host_id id = tm.get_endpoint(entry.first).value();
auto token_ownership = entry.second;
ownership[_address_map.get(id)] += token_ownership;
ownership[ss._address_map.get(id)] += token_ownership;
}
return ownership;
});
@@ -2843,12 +2848,8 @@ future<> storage_service::raft_removenode(locator::host_id host_id, locator::hos
}
future<> storage_service::mark_excluded(const std::vector<locator::host_id>& hosts) {
if (this_shard_id() != 0) {
// group0 is only set on shard 0.
co_return co_await container().invoke_on(0, [&] (auto& ss) {
return ss.mark_excluded(hosts);
});
}
// Callers forward to shard 0 via run_with_no_api_lock (group0 is only set on shard 0).
SCYLLA_ASSERT(this_shard_id() == 0);
while (true) {
auto guard = co_await _group0->client().start_operation(_group0_as, raft_timeout{});
@@ -3093,8 +3094,8 @@ future<sstring> storage_service::wait_for_topology_request_completion(utils::UUI
}
future<> storage_service::abort_topology_request(utils::UUID request_id) {
co_await container().invoke_on(0, [request_id, this] (storage_service& ss) {
return _topology_state_machine.abort_request(*ss._group0, ss._group0_as, ss._feature_service, request_id);
co_await container().invoke_on(0, [request_id] (storage_service& ss) {
return ss._topology_state_machine.abort_request(*ss._group0, ss._group0_as, ss._feature_service, request_id);
});
}
@@ -3107,13 +3108,13 @@ future<> storage_service::wait_for_topology_not_busy() {
}
}
future<> storage_service::abort_paused_rf_change(utils::UUID request_id) {
future<> storage_service::abort_rf_change(utils::UUID request_id) {
auto holder = _async_gate.hold();
if (this_shard_id() != 0) {
// group0 is only set on shard 0.
co_return co_await container().invoke_on(0, [&] (auto& ss) {
return ss.abort_paused_rf_change(request_id);
return ss.abort_rf_change(request_id);
});
}
@@ -3124,20 +3125,81 @@ future<> storage_service::abort_paused_rf_change(utils::UUID request_id) {
while (true) {
auto guard = co_await _group0->client().start_operation(_group0_as, raft_timeout{});
bool found = std::ranges::contains(_topology_state_machine._topology.paused_rf_change_requests, request_id);
if (!found) {
slogger.warn("RF change request with id '{}' is not paused, so it can't be aborted", request_id);
utils::chunked_vector<canonical_mutation> updates;
if (std::ranges::contains(_topology_state_machine._topology.paused_rf_change_requests, request_id)) { // keyspace_rf_change_kind::conversion_to_rack_list
updates.push_back(canonical_mutation(topology_mutation_builder(guard.write_timestamp())
.resume_rf_change_request(_topology_state_machine._topology.paused_rf_change_requests, request_id).build()));
updates.push_back(canonical_mutation(topology_request_tracking_mutation_builder(request_id)
.done("Aborted by user request")
.build()));
} else if (std::ranges::contains(_topology_state_machine._topology.ongoing_rf_changes, request_id)) { // keyspace_rf_change_kind::multi_rf_change
auto req_entry = co_await _sys_ks.local().get_topology_request_entry(request_id);
if (!req_entry.error.empty()) {
slogger.warn("RF change request with id '{}' was already aborted", request_id);
co_return;
}
sstring ks_name = *req_entry.new_keyspace_rf_change_ks_name;
if (!_db.local().has_keyspace(ks_name)) {
co_return;
}
auto& ks = _db.local().find_keyspace(ks_name);
// Check the tablet maps: if any tablet still has a missing replica
// (i.e., needs extending), we can abort. Otherwise, we're in the
// replica removal phase and aborting would require a rollback.
auto next_replication = ks.metadata()->next_strategy_options_opt().value()
| std::views::transform([] (const auto& pair) {
return std::make_pair(pair.first, std::get<locator::rack_list>(pair.second));
}) | std::ranges::to<std::unordered_map<sstring, std::vector<sstring>>>();
const auto& tm = *get_token_metadata_ptr();
bool has_missing_replica = false;
auto all_tables = ks.metadata()->tables();
auto all_views = ks.metadata()->views()
| std::views::transform([] (const auto& view) { return schema_ptr(view); })
| std::ranges::to<std::vector<schema_ptr>>();
all_tables.insert(all_tables.end(), all_views.begin(), all_views.end());
for (const auto& table : all_tables) {
if (!tm.tablets().has_tablet_map(table->id()) || !tm.tablets().is_base_table(table->id())) {
continue;
}
const auto& tmap = tm.tablets().get_tablet_map(table->id());
for (const auto& ti : tmap.tablets()) {
std::unordered_map<sstring, std::vector<sstring>> dc_to_racks;
for (const auto& r : ti.replicas) {
const auto& node_dc_rack = tm.get_topology().get_node(r.host).dc_rack();
dc_to_racks[node_dc_rack.dc].push_back(node_dc_rack.rack);
}
auto diff = subtract_replication(next_replication, dc_to_racks);
if (!diff.empty()) {
has_missing_replica = true;
break;
}
}
if (has_missing_replica) {
break;
}
}
if (has_missing_replica) {
auto ks_md = make_lw_shared<data_dictionary::keyspace_metadata>(*ks.metadata());
ks_md->set_next_strategy_options(ks_md->strategy_options());
auto schema_muts = prepare_keyspace_update_announcement(_db.local(), ks_md, guard.write_timestamp());
for (auto& m : schema_muts) {
updates.push_back(canonical_mutation(m));
}
updates.push_back(canonical_mutation(topology_request_tracking_mutation_builder(request_id)
.abort("Aborted by user request")
.build()));
} else {
slogger.warn("RF change request with id '{}' is ongoing, but it started removing replicas, so it can't be aborted", request_id);
co_return;
}
} else {
slogger.warn("RF change request with id '{}' can't be aborted", request_id);
co_return;
}
utils::chunked_vector<canonical_mutation> updates;
updates.push_back(canonical_mutation(topology_mutation_builder(guard.write_timestamp())
.resume_rf_change_request(_topology_state_machine._topology.paused_rf_change_requests, request_id).build()));
updates.push_back(canonical_mutation(topology_request_tracking_mutation_builder(request_id)
.done("Aborted by user request")
.build()));
topology_change change{std::move(updates)};
mixed_change change{std::move(updates)};
group0_command g0_cmd = _group0->client().prepare_command(std::move(change), guard,
format("aborting rf change request {}", request_id));
@@ -3895,11 +3957,8 @@ future<> storage_service::update_tablet_metadata(const locator::tablet_metadata_
}
future<> storage_service::prepare_for_tablets_migration(const sstring& ks_name) {
if (this_shard_id() != 0) {
co_return co_await container().invoke_on(0, [&] (auto& ss) {
return ss.prepare_for_tablets_migration(ks_name);
});
}
// Called via run_with_no_api_lock (forwards to shard 0).
SCYLLA_ASSERT(this_shard_id() == 0);
while (true) {
auto guard = co_await _group0->client().start_operation(_group0_as);
@@ -4039,11 +4098,8 @@ future<> storage_service::prepare_for_tablets_migration(const sstring& ks_name)
}
future<> storage_service::set_node_intended_storage_mode(intended_storage_mode mode) {
if (this_shard_id() != 0) {
co_return co_await container().invoke_on(0, [mode] (auto& ss) {
return ss.set_node_intended_storage_mode(mode);
});
}
// Called via run_with_no_api_lock (forwards to shard 0).
SCYLLA_ASSERT(this_shard_id() == 0);
auto& raft_server = _group0->group0_server();
auto holder = _group0->hold_group0_gate();
@@ -4139,11 +4195,8 @@ storage_service::migration_status storage_service::get_tablets_migration_status(
}
future<storage_service::keyspace_migration_status> storage_service::get_tablets_migration_status_with_node_details(const sstring& ks_name) {
if (this_shard_id() != 0) {
co_return co_await container().invoke_on(0, [&ks_name] (auto& ss) {
return ss.get_tablets_migration_status_with_node_details(ks_name);
});
}
// Called via run_with_no_api_lock (forwards to shard 0).
SCYLLA_ASSERT(this_shard_id() == 0);
keyspace_migration_status result;
result.keyspace = ks_name;
@@ -4204,11 +4257,8 @@ future<storage_service::keyspace_migration_status> storage_service::get_tablets_
}
future<> storage_service::finalize_tablets_migration(const sstring& ks_name) {
if (this_shard_id() != 0) {
co_return co_await container().invoke_on(0, [&ks_name] (auto& ss) {
return ss.finalize_tablets_migration(ks_name);
});
}
// Called via run_with_no_api_lock (forwards to shard 0).
SCYLLA_ASSERT(this_shard_id() == 0);
slogger.info("Finalizing vnodes-to-tablets migration for keyspace '{}'", ks_name);

View File

@@ -15,7 +15,6 @@
#include <seastar/core/shared_future.hh>
#include "absl-flat_hash_map.hh"
#include "gms/endpoint_state.hh"
#include "gms/gossip_address_map.hh"
#include "gms/i_endpoint_state_change_subscriber.hh"
#include "schema/schema_fwd.hh"
#include "service/client_routes.hh"
@@ -41,9 +40,11 @@
#include <seastar/core/shared_ptr.hh>
#include "cdc/generation_id.hh"
#include "db/system_keyspace.hh"
#include "raft/raft_fwd.hh"
#include "raft/raft.hh"
#include "node_ops/id.hh"
#include "raft/server.hh"
#include "db/view/view_building_state.hh"
#include "service/tablet_allocator.hh"
#include "service/tablet_operation.hh"
#include "mutation/timestamp.hh"
#include "utils/UUID.hh"
@@ -114,10 +115,6 @@ class tablet_mutation_builder;
namespace auth { class cache; }
namespace service {
class tablet_allocator;
}
namespace utils {
class disk_space_monitor;
}
@@ -783,13 +780,19 @@ private:
*/
future<> stream_ranges(std::unordered_map<sstring, std::unordered_multimap<dht::token_range, locator::host_id>> ranges_to_stream_by_keyspace);
// REST handlers are gated at the registration site (see gated() in
// api/storage_service.cc) so stop() drains in-flight requests before
// teardown. run_with_api_lock_internal and run_with_no_api_lock hold
// _async_gate on shard 0 as well, because REST requests arriving on
// any shard are forwarded there for execution.
template <typename Func>
auto run_with_api_lock_internal(storage_service& ss, Func&& func, sstring& operation) {
auto holder = ss._async_gate.hold();
if (!ss._operation_in_progress.empty()) {
throw std::runtime_error(format("Operation {} is in progress, try again", ss._operation_in_progress));
}
ss._operation_in_progress = std::move(operation);
return func(ss).finally([&ss] {
return func(ss).finally([&ss, holder = std::move(holder)] {
ss._operation_in_progress = sstring();
});
}
@@ -797,6 +800,10 @@ private:
public:
int32_t get_exception_count();
auto hold_async_gate() {
return _async_gate.hold();
}
template <typename Func>
auto run_with_api_lock(sstring operation, Func&& func) {
return container().invoke_on(0, [operation = std::move(operation),
@@ -807,8 +814,10 @@ public:
template <typename Func>
auto run_with_no_api_lock(Func&& func) {
return container().invoke_on(0, [func = std::forward<Func>(func)] (storage_service& ss) mutable {
return func(ss);
return container().invoke_on(0, [func = std::forward<Func>(func)] (storage_service& ss) mutable
-> futurize_t<std::invoke_result_t<Func, storage_service&>> {
auto holder = ss._async_gate.hold();
co_return co_await futurize_invoke(func, ss);
});
}
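A toy std-only restatement of the gate pattern described above (the gate type is invented; the real code uses seastar's gate via _async_gate): the holder is acquired before the wrapped function runs and released only after it finishes, so teardown cannot complete while a request is still in flight.

#include <cassert>
#include <utility>

// Toy counting gate: hold() bumps a counter, the holder releases on scope exit.
struct gate {
    int held = 0;
    struct holder {
        gate* g;
        explicit holder(gate& gg) : g(&gg) { ++g->held; }
        holder(const holder&) = delete;
        ~holder() { --g->held; }
    };
    holder hold() { return holder(*this); }
};

template <typename Func>
void run_gated(gate& g, Func&& f) {
    auto h = g.hold();        // keeps the service alive for the duration of f
    std::forward<Func>(f)();
}                             // holder released here, after f has completed

int main() {
    gate g;
    run_gated(g, [&] { assert(g.held == 1); });
    assert(g.held == 0);
}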
@@ -978,7 +987,7 @@ public:
future<> wait_for_topology_not_busy();
future<> abort_paused_rf_change(utils::UUID request_id);
future<> abort_rf_change(utils::UUID request_id);
private:
semaphore _do_sample_sstables_concurrency_limiter{1};

View File

@@ -154,7 +154,7 @@ auto coordinator::create_operation_ctx(const schema& schema, const dht::token& t
co_await utils::get_local_injector().inject("sc_coordinator_wait_before_acquire_server",
utils::wait_for_message(5min));
auto raft_server = co_await _groups_manager.acquire_server(raft_info.group_id, as);
auto raft_server = co_await _groups_manager.acquire_server(schema.id(), raft_info.group_id, as);
co_return operation_ctx {
.erm = std::move(erm),

View File

@@ -332,11 +332,27 @@ void groups_manager::update(token_metadata_ptr new_tm) {
schedule_raft_groups_deletion(false);
}
future<raft_server> groups_manager::acquire_server(raft::group_id group_id, abort_source& as) {
future<raft_server> groups_manager::acquire_server(table_id table_id, raft::group_id group_id, abort_source& as) {
if (!_features.strongly_consistent_tables) {
on_internal_error(logger, "strongly consistent tables are not enabled on this shard");
}
// A concurrent DROP TABLE may have already removed the table from database
// registries and erased the raft group from _raft_groups via
// schedule_raft_group_deletion. However, schema.table() in create_operation_ctx()
// might not fail in this case, because someone might still be holding an
// lw_shared_ptr<table>: the table has been dropped, but the table object
// is still alive.
//
// Check that the table still exists in the database to turn the
// fatal on_internal_error below into a clean no_such_column_family
// exception.
//
// When the table does exist, we proceed to acquire state.gate->hold().
// This prevents schedule_raft_group_deletion (which co_awaits gate::close)
// from erasing the group until the DML operation completes.
_db.find_column_family(table_id);
const auto it = _raft_groups.find(group_id);
if (it == _raft_groups.end()) {
on_internal_error(logger, format("raft group {} not found", group_id));

View File

@@ -11,10 +11,7 @@
#include "locator/abstract_replication_strategy.hh"
#include "message/messaging_service.hh"
#include "service/raft/raft_group_registry.hh"
namespace cql3 {
class query_processor;
}
#include "cql3/query_processor.hh"
namespace db {
class system_keyspace;
@@ -113,7 +110,7 @@ public:
void update(locator::token_metadata_ptr new_tm);
// The raft_server instance is used to submit write commands and perform read_barrier() before reads.
future<raft_server> acquire_server(raft::group_id group_id, abort_source& as);
future<raft_server> acquire_server(table_id table_id, raft::group_id group_id, abort_source& as);
// Called during node boot. Waits for all raft::server instances corresponding
// to the latest group0 state to start.

View File

@@ -31,6 +31,7 @@
#include <ranges>
#include <utility>
#include <fmt/ranges.h>
#include <seastar/core/on_internal_error.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <seastar/coroutine/switch_to.hh>
#include <absl/container/flat_hash_map.h>
@@ -533,6 +534,38 @@ struct hash<migration_tablet_set> {
namespace service {
// Subtract right from left. The result contains only keys from left.
std::unordered_map<sstring, std::vector<sstring>> subtract_replication(const std::unordered_map<sstring, std::vector<sstring>>& left, const std::unordered_map<sstring, std::vector<sstring>>& right) {
std::unordered_map<sstring, std::vector<sstring>> res;
for (const auto& [dc, rf_value] : left) {
auto it = right.find(dc);
if (it == right.end()) {
res[dc] = rf_value;
} else {
std::vector<sstring> diff = rf_value | std::views::filter([&] (const sstring& rack) {
return std::find(it->second.begin(), it->second.end(), rack) == it->second.end();
}) | std::ranges::to<std::vector<sstring>>();
if (!diff.empty()) {
res[dc] = diff;
}
}
}
return res;
}
bool rf_count_per_dc_equals(const locator::replication_strategy_config_options& current, const locator::replication_strategy_config_options& next) {
if (current.size() != next.size()) {
return false;
}
for (const auto& [dc, current_rf_value] : current) {
auto it = next.find(dc);
if (it == next.end() || get_replication_factor(it->second) != get_replication_factor(current_rf_value)) {
return false;
}
}
return true;
}
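A worked example of subtract_replication() above (std::string stands in for sstring; the values are made up): subtracting the current rack placement from the target placement yields exactly the racks that still need a replica, which is how the planner distinguishes needs_extending from needs_shrinking.

#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

using rep = std::unordered_map<std::string, std::vector<std::string>>;

// Same semantics as subtract_replication() above, with std types.
rep subtract(const rep& left, const rep& right) {
    rep res;
    for (const auto& [dc, racks] : left) {
        auto it = right.find(dc);
        if (it == right.end()) { res[dc] = racks; continue; }
        std::vector<std::string> diff;
        for (const auto& r : racks) {
            if (std::find(it->second.begin(), it->second.end(), r) == it->second.end()) {
                diff.push_back(r);
            }
        }
        if (!diff.empty()) { res[dc] = diff; }
    }
    return res;
}

int main() {
    rep target{{"dc1", {"r1", "r2", "r3"}}};   // desired rack list
    rep current{{"dc1", {"r1", "r2"}}};        // racks that already hold a replica
    auto missing = subtract(target, current);  // -> {"dc1": {"r3"}}: needs extending
    bool ok = missing.at("dc1") == std::vector<std::string>{"r3"};
    assert(ok);
}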
/// The algorithm aims to equalize tablet count on each shard.
/// This goal is based on the assumption that every shard has similar processing power and space capacity,
/// and that each tablet has equal consumption of those resources. So by equalizing tablet count per shard we
@@ -1050,17 +1083,22 @@ public:
return _topology != nullptr && _sys_ks != nullptr && !_topology->paused_rf_change_requests.empty();
}
bool ongoing_rf_change() const {
return _topology != nullptr && _sys_ks != nullptr && !_topology->ongoing_rf_changes.empty();
}
future<migration_plan> make_plan() {
const locator::topology& topo = _tm->get_topology();
migration_plan plan;
auto rack_list_colocation = ongoing_rack_list_colocation();
auto rf_change_prep = co_await prepare_per_rack_rf_change_plan(plan);
// Prepare plans for each DC separately and combine them to be executed in parallel.
for (auto&& dc : topo.get_datacenters()) {
if (_db.get_config().rf_rack_valid_keyspaces() || _db.get_config().enforce_rack_list() || rack_list_colocation) {
if (_db.get_config().rf_rack_valid_keyspaces() || _db.get_config().enforce_rack_list() || rack_list_colocation || !rf_change_prep.actions.empty()) {
for (auto rack : topo.get_datacenter_racks().at(dc) | std::views::keys) {
auto rack_plan = co_await make_plan(dc, rack);
auto rack_plan = co_await make_plan(dc, rack, rf_change_prep.actions[{dc, rack}]);
auto level = rack_plan.empty() ? seastar::log_level::debug : seastar::log_level::info;
lblogger.log(level, "Plan for {}/{}: {}", dc, rack, plan_summary(rack_plan));
plan.merge(std::move(rack_plan));
@@ -1450,6 +1488,387 @@ public:
co_return std::move(plan);
}
enum class rf_change_state {
ready, // RF change is ready (succeeded or failed).
needs_extending,
needs_shrinking,
};
using process_views = bool_class<struct process_views_tag>;
struct rf_change_action {
sstring keyspace;
rf_change_state state;
process_views pv = process_views::no;
};
using rf_change_actions = std::unordered_map<locator::endpoint_dc_rack, std::vector<rf_change_action>>;
struct rf_change_preparation {
rf_change_actions actions;
};
// Determines which dc+rack combinations need RF change actions for a given keyspace,
// by comparing current tablet replicas against the target replication configuration.
// Scans in priority order: extend tables, extend views, shrink views, shrink tables.
// Returns the first non-empty set of per-rack actions; colocated tables are skipped.
// An empty result means all tablets already match the target configuration.
future<rf_change_preparation> determine_rf_change_actions_per_rack(const sstring& ks_name, const std::vector<schema_ptr>& tables, const std::vector<schema_ptr>& views, const locator::replication_strategy_config_options& next) {
auto add_entry = [&ks_name] (rf_change_preparation& prep, const sstring& dc, const sstring& rack, rf_change_state state, process_views pv) {
locator::endpoint_dc_rack key{dc, rack};
auto& actions = prep.actions[key];
if (std::none_of(actions.begin(), actions.end(), [&](const rf_change_action& a) { return a.keyspace == ks_name; })) {
actions.push_back(rf_change_action{.keyspace = ks_name, .state = state, .pv = pv});
}
};
auto next_replication = next | std::views::transform([] (const auto& pair) {
return std::make_pair(pair.first, std::get<rack_list>(pair.second));
}) | std::ranges::to<std::unordered_map<sstring, std::vector<sstring>>>();
auto scan_tables = [&] (const std::vector<schema_ptr>& table_list, rf_change_state target_state, process_views pv) -> future<rf_change_preparation> {
rf_change_preparation prep;
for (const auto& table : table_list) {
if (!_tm->tablets().is_base_table(table->id())) {
continue;
}
const auto& tmap = _tm->tablets().get_tablet_map(table->id());
for (const tablet_info& ti : tmap.tablets()) {
std::unordered_map<sstring, std::vector<sstring>> dc_to_racks;
for (const auto& r : ti.replicas) {
const auto& node_dc_rack = _tm->get_topology().get_node(r.host).dc_rack();
dc_to_racks[node_dc_rack.dc].push_back(node_dc_rack.rack);
}
auto diff = (target_state == rf_change_state::needs_extending ?
subtract_replication(next_replication, dc_to_racks) : subtract_replication(dc_to_racks, next_replication))
| std::views::filter([] (const auto& pair) {
return !pair.second.empty();
}
) | std::ranges::to<std::unordered_map<sstring, std::vector<sstring>>>();
for (const auto& [dc, racks] : diff) {
for (const auto& rack : racks) {
add_entry(prep, dc, rack, target_state, pv);
}
}
co_await coroutine::maybe_yield();
}
}
co_return prep;
};
// Extend base tables.
if (auto prep = co_await scan_tables(tables, rf_change_state::needs_extending, process_views::no); !prep.actions.empty()) {
co_return prep;
}
if (utils::get_local_injector().enter("determine_rf_change_actions_per_rack_throw")) {
lblogger.info("determine_rf_change_actions_per_rack_throw: entered");
throw std::runtime_error("determine_rf_change_actions_per_rack_throw injection");
}
// Extend views.
if (auto prep = co_await scan_tables(views, rf_change_state::needs_extending, process_views::yes); !prep.actions.empty()) {
co_return prep;
}
// Shrink views.
if (auto prep = co_await scan_tables(views, rf_change_state::needs_shrinking, process_views::yes); !prep.actions.empty()) {
co_return prep;
}
// Shrink base tables.
if (auto prep = co_await scan_tables(tables, rf_change_state::needs_shrinking, process_views::no); !prep.actions.empty()) {
co_return prep;
}
co_return rf_change_preparation{};
}
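The scan order documented above (extend tables, then extend views, then shrink views, then shrink tables) amounts to "return the first phase that has any work". A compressed std-only sketch of that control flow, with invented phase stubs:

#include <functional>
#include <string>
#include <vector>

struct prep { std::vector<std::string> actions; };

// Invented stubs: each phase reports the per-rack actions it would schedule.
prep first_nonempty(const std::vector<std::function<prep()>>& phases) {
    for (const auto& phase : phases) {
        if (auto p = phase(); !p.actions.empty()) {
            return p;   // earlier phases win: extending runs before shrinking
        }
    }
    return {};
}

int main() {
    auto none = [] { return prep{}; };
    auto shrink_views = [] { return prep{{"shrink dc1/r3"}}; };
    auto p = first_nonempty({none, none, shrink_views, none});
    return p.actions.size() == 1 ? 0 : 1;
}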
future<rf_change_preparation> prepare_per_rack_rf_change_plan(migration_plan& mplan) {
lblogger.debug("In prepare_per_rack_rf_change_plan");
rf_change_preparation res;
keyspace_rf_change_plan plan;
if (!ongoing_rf_change()) {
co_return res;
}
for (const auto& request_id : _topology->ongoing_rf_changes) {
auto req_entry = co_await _sys_ks->get_topology_request_entry(request_id);
sstring ks_name = *req_entry.new_keyspace_rf_change_ks_name;
if (!_db.has_keyspace(ks_name)) {
if (!plan.completion) {
plan.completion = rf_change_completion_info{
.request_id = request_id,
.ks_name = ks_name,
.error = format("Keyspace {} not found", ks_name),
.saved_ks_props = req_entry.new_keyspace_rf_change_data.value(),
};
}
continue;
}
auto& ks = _db.find_keyspace(ks_name);
if (!ks.metadata()->next_strategy_options_opt()) {
on_internal_error(lblogger, format("There is an ongoing rf change request {} for keyspace {}, "
"but the keyspace does not have next replication settings", request_id, ks_name));
}
auto tables = ks.metadata()->tables();
auto views = ks.metadata()->views() | std::views::transform([] (const auto& view) { return schema_ptr(view); }) | std::ranges::to<std::vector<schema_ptr>>();
auto rf_change_prep = co_await determine_rf_change_actions_per_rack(ks_name, tables, views, *ks.metadata()->next_strategy_options_opt());
if (rf_change_prep.actions.empty()) {
if (!plan.completion) {
plan.completion = rf_change_completion_info{
.request_id = request_id,
.ks_name = ks_name,
.error = req_entry.error,
.saved_ks_props = req_entry.new_keyspace_rf_change_data.value()
};
}
continue;
}
// Check if any extending action targets a dc+rack with no available nodes.
// If so, the RF change can never complete and should be aborted.
sstring error_msg = "";
const auto& topo = _tm->get_topology();
const auto& dc_rack_nodes = topo.get_datacenter_rack_nodes();
for (const auto& [dc_rack, actions] : rf_change_prep.actions) {
bool needs_extending = std::ranges::any_of(actions, [] (const rf_change_action& a) {
return a.state == rf_change_state::needs_extending;
});
if (!needs_extending) {
break;
}
bool has_live_node = false;
bool has_down_node = false;
auto dc_it = dc_rack_nodes.find(dc_rack.dc);
if (dc_it != dc_rack_nodes.end()) {
auto rack_it = dc_it->second.find(dc_rack.rack);
if (rack_it != dc_it->second.end()) {
for (const auto& node_ref : rack_it->second) {
const auto& node = node_ref.get();
if (_skiplist.contains(node.host_id())) {
has_down_node = true;
break;
}
if (!node.is_excluded()) {
has_live_node = true;
}
}
}
}
if (has_down_node) {
lblogger.warn("RF change for keyspace {} requires extending to {}/{} but there are down nodes there; aborting",
ks_name, dc_rack.dc, dc_rack.rack);
error_msg = format("RF change aborted: there are down nodes in required rack {}/{}", dc_rack.dc, dc_rack.rack);
break;
}
if (!has_live_node) {
lblogger.warn("RF change for keyspace {} requires extending to {}/{} but no available nodes exist there; aborting",
ks_name, dc_rack.dc, dc_rack.rack);
error_msg = format("RF change aborted: no available nodes in required rack {}/{}", dc_rack.dc, dc_rack.rack);
break;
}
}
if (!error_msg.empty()) {
plan.aborts.push_back(rf_change_abort_info{
.request_id = request_id,
.ks_name = ks_name,
.error = error_msg,
.current_replication = ks.metadata()->strategy_options(),
});
continue;
}
for (auto& [dc_rack, actions] : rf_change_prep.actions) {
auto& dst = res.actions[dc_rack];
dst.insert(dst.end(), std::make_move_iterator(actions.begin()), std::make_move_iterator(actions.end()));
}
}
mplan.set_rf_change_plan(std::move(plan));
co_return res;
}
future<migration_plan> make_rf_change_plan(node_load_map& nodes, std::vector<rf_change_action> actions, sstring dc, sstring rack) {
lblogger.debug("In make_rf_change_plan");
migration_plan mplan;
keyspace_rf_change_plan plan;
auto nodes_by_load_dst = nodes | std::views::filter([&] (const auto& host_load) {
auto& [host, load] = host_load;
auto& node = *load.node;
return node.dc_rack().dc == dc && node.dc_rack().rack == rack;
}) | std::views::keys | std::ranges::to<std::vector<host_id>>();
bool has_extending = std::ranges::any_of(actions, [] (const rf_change_action& a) {
return a.state == rf_change_state::needs_extending;
});
if (has_extending) {
// Check that all normal, non-excluded nodes in the target dc/rack are present in the
// balanced node set. If any such node is missing, extending cannot safely proceed.
const auto& topo = _tm->get_topology();
const auto& dc_rack_nodes = topo.get_datacenter_rack_nodes();
bool missing_node = false;
auto dc_it = dc_rack_nodes.find(dc);
if (dc_it != dc_rack_nodes.end()) {
auto rack_it = dc_it->second.find(rack);
if (rack_it != dc_it->second.end()) {
for (const auto& node_ref : rack_it->second) {
const auto& node = node_ref.get();
if (node.is_normal() && !node.is_excluded() && !nodes.contains(node.host_id())) {
missing_node = true;
break;
}
}
}
}
if (missing_node || nodes_by_load_dst.empty()) {
lblogger.warn("Not all non-excluded nodes are available for RF change extending plan in dc {}, rack {}", dc, rack);
// Filter out extending actions since not all nodes are available.
// Shrinking actions can still proceed without target nodes.
std::erase_if(actions, [] (const rf_change_action& a) {
return a.state == rf_change_state::needs_extending;
});
if (actions.empty()) {
co_return mplan;
}
}
}
auto nodes_cmp = nodes_by_load_cmp(nodes);
auto nodes_dst_cmp = [&] (const host_id& a, const host_id& b) {
return nodes_cmp(b, a);
};
// Ascending load heap of candidate target nodes.
std::make_heap(nodes_by_load_dst.begin(), nodes_by_load_dst.end(), nodes_dst_cmp);
const locator::topology& topo = _tm->get_topology();
locator::endpoint_dc_rack location{dc, rack};
for (const auto& action : actions) {
const auto& ks_name = action.keyspace;
const auto& rf_change_state = action.state;
auto& ks = _db.find_keyspace(ks_name);
auto table_list = action.pv
? ks.metadata()->views() | std::views::transform([] (const auto& view) { return schema_ptr(view); }) | std::ranges::to<std::vector<schema_ptr>>()
: ks.metadata()->tables();
for (const auto& table_or_mv : table_list) {
const auto& tmap = _tm->tablets().get_tablet_map(table_or_mv->id());
co_await tmap.for_each_tablet([&] (tablet_id tid, const tablet_info& ti) -> future<> {
if (!_tm->tablets().is_base_table(table_or_mv->id())) {
return make_ready_future<>();
}
auto gid = locator::global_tablet_id{table_or_mv->id(), tid};
auto it = std::find_if(ti.replicas.begin(), ti.replicas.end(), [&] (const tablet_replica& r) {
return topo.get_node(r.host).dc_rack() == location;
});
auto replica = it != ti.replicas.end() ? std::optional<tablet_replica>{*it} : std::nullopt;
auto* tti = tmap.get_tablet_transition_info(tid);
bool pending_replica_in_this_rack = false;
bool leaving_replica_in_this_rack = false;
if (tti) {
auto leaving_replica = get_leaving_replica(ti, *tti);
leaving_replica_in_this_rack = leaving_replica.has_value() && topo.get_node(leaving_replica->host).dc_rack() == location;
pending_replica_in_this_rack = tti->pending_replica.has_value() && topo.get_node(tti->pending_replica->host).dc_rack() == location;
}
if ((rf_change_state == rf_change_state::needs_extending && (replica && !leaving_replica_in_this_rack)) ||
(rf_change_state == rf_change_state::needs_shrinking && (!replica && !pending_replica_in_this_rack))) {
return make_ready_future<>();
}
// Skip tablets that are already in transition.
if (tti) {
lblogger.debug("Skipped rf change extending for tablet={} which is already in transition={} stage={}", gid, tti->transition, tti->stage);
return make_ready_future<>();
}
// Skip tablets that are about to enter transition.
if (_scheduled_tablets.contains(gid)) {
return make_ready_future<>();
}
migration_tablet_set source_tablets {
.tablet_s = gid, // Ignore the merge co-location.
};
if (rf_change_state == rf_change_state::needs_extending) {
// Pick the least loaded node as target.
std::pop_heap(nodes_by_load_dst.begin(), nodes_by_load_dst.end(), nodes_dst_cmp);
auto target = nodes_by_load_dst.back();
lblogger.debug("target node: {}, avg_load={}", target, nodes[target].avg_load);
auto dst = global_shard_id {target, _load_sketch->get_least_loaded_shard(target)};
lblogger.trace("target shard: {}, tablets={}, load={}", dst.shard,
nodes[target].shards[dst.shard].tablet_count,
nodes[target].shard_load(dst.shard, _target_tablet_size));
tablet_replica pending_replica{
.host = target,
.shard = dst.shard,
};
auto next = ti.replicas;
next.push_back(pending_replica);
tablet_migration_info mig{
.kind = locator::tablet_transition_kind::rebuild_v2,
.tablet = gid,
.src = std::nullopt,
.dst = pending_replica,
};
auto mig_streaming_info = get_migration_streaming_info(topo, ti, mig);
pick(*_load_sketch, dst.host, dst.shard, source_tablets);
if (can_accept_load(nodes, mig_streaming_info)) {
lblogger.debug("Starting rebuild_v2 transition to {}.{} of tablet {}; new_replica = {}", dc, rack, gid, pending_replica);
apply_load(nodes, mig_streaming_info);
mark_as_scheduled(mig);
mplan.add(std::move(mig));
}
increase_node_load(nodes, dst, source_tablets);
std::push_heap(nodes_by_load_dst.begin(), nodes_by_load_dst.end(), nodes_dst_cmp);
} else {
auto next = ti.replicas | std::views::filter([&] (const tablet_replica& r) {
return r != *replica;
}) | std::ranges::to<tablet_replica_set>();
tablet_migration_info mig{
.kind = locator::tablet_transition_kind::rebuild_v2,
.tablet = gid,
.src = *replica,
.dst = std::nullopt,
};
auto mig_streaming_info = get_migration_streaming_info(topo, ti, mig);
// The node being shrunk may be excluded/down and lack complete tablet stats.
// Since we're removing a replica (not placing one), accurate load data isn't needed.
if (_load_sketch->has_node(replica->host)) {
unload(*_load_sketch, replica->host, replica->shard, source_tablets);
}
if (can_accept_load(nodes, mig_streaming_info)) {
apply_load(nodes, mig_streaming_info);
mark_as_scheduled(mig);
mplan.add(std::move(mig));
}
if (nodes.contains(replica->host)) {
decrease_node_load(nodes, *replica, source_tablets);
}
}
return make_ready_future<>();
});
}
}
mplan.set_rf_change_plan(std::move(plan));
co_return mplan;
}
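Target selection in the extending branch is a standard heap round-robin: pop the least loaded candidate, charge the new tablet to it, and push it back so its updated load competes in later rounds. A standalone sketch with invented loads (std only, not the ScyllaDB types):

#include <algorithm>
#include <cassert>
#include <unordered_map>
#include <vector>

int main() {
    std::unordered_map<int, int> load{{1, 5}, {2, 2}, {3, 9}};  // host -> tablet count
    std::vector<int> hosts{1, 2, 3};
    // Comparator inverted so make_heap/pop_heap yield the *least* loaded host.
    auto cmp = [&](int a, int b) { return load[a] > load[b]; };
    std::make_heap(hosts.begin(), hosts.end(), cmp);
    for (int placed = 0; placed < 4; ++placed) {
        std::pop_heap(hosts.begin(), hosts.end(), cmp);
        int target = hosts.back();   // current least loaded host
        ++load[target];              // place one tablet replica there
        std::push_heap(hosts.begin(), hosts.end(), cmp);
    }
    assert(load[3] == 9);            // the heavily loaded host is never picked
}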
// Returns true if a table has replicas of all its sibling tablets co-located.
// This is used for determining whether merge can be finalized, since co-location
// is a strict requirement for sibling tablets to be merged.
@@ -2658,14 +3077,13 @@ public:
src_shard.dusage->used -= tablet_sizes;
}
// Adjusts the load of the source and destination (host:shard) that were picked for the migration.
void update_node_load_on_migration(node_load_map& nodes, tablet_replica src, tablet_replica dst, const migration_tablet_set& tablet_set) {
void increase_node_load(node_load_map& nodes, tablet_replica replica, const migration_tablet_set& tablet_set) {
auto tablet_count = tablet_set.tablets().size();
auto tablet_sizes = tablet_set.tablet_set_disk_size;
auto table = tablet_set.tablets().front().table;
auto& dst_node = nodes[dst.host];
auto& dst_shard = dst_node.shards[dst.shard];
auto& dst_node = nodes[replica.host];
auto& dst_shard = dst_node.shards[replica.shard];
dst_shard.tablet_count += tablet_count;
dst_shard.tablet_count_per_table[table] += tablet_count;
dst_shard.tablet_sizes_per_table[table] += tablet_sizes;
@@ -2675,9 +3093,15 @@ public:
dst_node.tablet_count += tablet_count;
dst_node.dusage->used += tablet_sizes;
dst_node.update();
}
auto& src_node = nodes[src.host];
auto& src_shard = src_node.shards[src.shard];
void decrease_node_load(node_load_map& nodes, tablet_replica replica, const migration_tablet_set& tablet_set) {
auto tablet_count = tablet_set.tablets().size();
auto tablet_sizes = tablet_set.tablet_set_disk_size;
auto table = tablet_set.tablets().front().table;
auto& src_node = nodes[replica.host];
auto& src_shard = src_node.shards[replica.shard];
src_shard.tablet_count -= tablet_count;
src_shard.tablet_count_per_table[table] -= tablet_count;
src_shard.tablet_sizes_per_table[table] -= tablet_sizes;
@@ -2693,6 +3117,12 @@ public:
src_node.update();
}
// Adjusts the load of the source and destination (host:shard) that were picked for the migration.
void update_node_load_on_migration(node_load_map& nodes, tablet_replica src, tablet_replica dst, const migration_tablet_set& tablet_set) {
increase_node_load(nodes, dst, tablet_set);
decrease_node_load(nodes, src, tablet_set);
}
static void unload(locator::load_sketch& sketch, host_id host, shard_id shard, const migration_tablet_set& tablet_set) {
sketch.unload(host, shard, tablet_set.tablets().size(), tablet_set.tablet_set_disk_size);
}
@@ -3643,7 +4073,7 @@ public:
}
}
future<migration_plan> make_plan(dc_name dc, std::optional<sstring> rack = std::nullopt) {
future<migration_plan> make_plan(dc_name dc, std::optional<sstring> rack = std::nullopt, std::vector<rf_change_action> rf_change_actions = {}) {
migration_plan plan;
if (utils::get_local_injector().enter("tablet_migration_bypass")) {
@@ -3761,12 +4191,6 @@ public:
});
}
if (nodes.empty()) {
lblogger.debug("No nodes to balance.");
_current_stats->stop_balance++;
co_return plan;
}
// Detect finished drain.
for (auto i = nodes_to_drain.begin(); i != nodes_to_drain.end();) {
@@ -3841,7 +4265,6 @@ public:
}
lblogger.debug("No candidate nodes");
_current_stats->stop_no_candidates++;
co_return plan;
}
// We want to saturate the target node so we migrate several tablets in parallel, one for each shard
@@ -4003,7 +4426,7 @@ public:
print_node_stats(nodes, only_active::yes);
if (!nodes_to_drain.empty() || (_tm->tablets().balancing_enabled() && (shuffle || !is_balanced(min_load, max_load)))) {
if (has_dest_nodes && (!nodes_to_drain.empty() || (_tm->tablets().balancing_enabled() && (shuffle || !is_balanced(min_load, max_load)))) && !nodes.empty()) {
host_id target = *min_load_node;
lblogger.info("target node: {}, avg_load: {}, max: {}", target, min_load, max_load);
plan.merge(co_await make_internode_plan(nodes, nodes_to_drain, target));
@@ -4015,6 +4438,10 @@ public:
plan.merge(co_await make_intranode_plan(nodes, nodes_to_drain));
}
if (!rf_change_actions.empty() && rack.has_value()) {
plan.merge(co_await make_rf_change_plan(nodes, rf_change_actions, dc, rack.value()));
}
if (_tm->tablets().balancing_enabled() && plan.empty() && !ongoing_rack_list_colocation()) {
auto dc_merge_plan = co_await make_merge_colocation_plan(nodes);
auto level = dc_merge_plan.tablet_migration_count() > 0 ? seastar::log_level::info : seastar::log_level::debug;

View File

@@ -8,8 +8,10 @@
#pragma once
#include "locator/abstract_replication_strategy.hh"
#include "replica/database_fwd.hh"
#include "locator/tablets.hh"
#include "locator/abstract_replication_strategy.hh"
#include "tablet_allocator_fwd.hh"
#include "locator/token_metadata_fwd.hh"
#include <seastar/core/metrics.hh>
@@ -181,6 +183,34 @@ struct tablet_rack_list_colocation_plan {
}
};
struct rf_change_completion_info {
utils::UUID request_id;
sstring ks_name;
sstring error;
std::unordered_map<sstring, sstring> saved_ks_props;
};
struct rf_change_abort_info {
utils::UUID request_id;
sstring ks_name;
sstring error;
locator::replication_strategy_config_options current_replication;
};
struct keyspace_rf_change_plan {
std::optional<rf_change_completion_info> completion;
std::vector<rf_change_abort_info> aborts;
size_t size() const { return (completion ? 1 : 0) + aborts.size(); };
void merge(keyspace_rf_change_plan&& other) {
if (!completion) {
completion = std::move(other.completion);
}
std::move(other.aborts.begin(), other.aborts.end(), std::back_inserter(aborts));
}
};
class migration_plan {
public:
using migrations_vector = utils::chunked_vector<tablet_migration_info>;
@@ -189,19 +219,22 @@ private:
table_resize_plan _resize_plan;
tablet_repair_plan _repair_plan;
tablet_rack_list_colocation_plan _rack_list_colocation_plan;
keyspace_rf_change_plan _rf_change_plan;
bool _has_nodes_to_drain = false;
std::vector<drain_failure> _drain_failures;
public:
/// Returns true iff there are decommissioning nodes which own some tablet replicas.
bool has_nodes_to_drain() const { return _has_nodes_to_drain; }
bool requires_schema_changes() const { return _rf_change_plan.size() > 0; }
const migrations_vector& migrations() const { return _migrations; }
bool empty() const { return !size(); }
size_t size() const { return _migrations.size() + _resize_plan.size() + _repair_plan.size() + _rack_list_colocation_plan.size() + _drain_failures.size(); }
size_t size() const { return _migrations.size() + _resize_plan.size() + _repair_plan.size() + _rack_list_colocation_plan.size() + _drain_failures.size() + _rf_change_plan.size(); }
size_t tablet_migration_count() const { return _migrations.size(); }
size_t resize_decision_count() const { return _resize_plan.size(); }
size_t tablet_repair_count() const { return _repair_plan.size(); }
size_t tablet_rack_list_colocation_count() const { return _rack_list_colocation_plan.size(); }
size_t keyspace_rf_change_count() const { return _rf_change_plan.size(); }
const std::vector<drain_failure>& drain_failures() const { return _drain_failures; }
void add(tablet_migration_info info) {
@@ -225,6 +258,7 @@ public:
_resize_plan.merge(std::move(other._resize_plan));
_repair_plan.merge(std::move(other._repair_plan));
_rack_list_colocation_plan.merge(std::move(other._rack_list_colocation_plan));
_rf_change_plan.merge(std::move(other._rf_change_plan));
}
void set_has_nodes_to_drain(bool b) {
@@ -249,6 +283,12 @@ public:
_rack_list_colocation_plan = std::move(rack_list_colocation_plan);
}
const keyspace_rf_change_plan& rf_change_plan() const { return _rf_change_plan; }
void set_rf_change_plan(keyspace_rf_change_plan rf_change_plan) {
_rf_change_plan = std::move(rf_change_plan);
}
future<std::unordered_set<locator::global_tablet_id>> get_migration_tablet_ids() const;
};
@@ -317,6 +357,9 @@ future<bool> requires_rack_list_colocation(
db::system_keyspace* sys_ks,
utils::UUID request_id);
bool rf_count_per_dc_equals(const locator::replication_strategy_config_options& current, const locator::replication_strategy_config_options& next);
std::unordered_map<sstring, std::vector<sstring>> subtract_replication(const std::unordered_map<sstring, std::vector<sstring>>& left, const std::unordered_map<sstring, std::vector<sstring>>& right);
}
template <>

View File

@@ -452,7 +452,7 @@ future<std::optional<tasks::task_status>> global_topology_request_virtual_task::
}
future<> global_topology_request_virtual_task::abort(tasks::task_id id, tasks::virtual_task_hint) noexcept {
return _ss.abort_paused_rf_change(id.uuid());
return _ss.abort_rf_change(id.uuid());
}
future<std::vector<tasks::task_stats>> global_topology_request_virtual_task::get_stats() {

View File

@@ -414,6 +414,20 @@ class topology_coordinator : public endpoint_lifecycle_subscriber
}
};
future<> update_topology_state_with_mixed_change(
group0_guard guard, utils::chunked_vector<canonical_mutation>&& updates, const sstring& reason) {
try {
rtlogger.info("updating topology state with mixed change: {}", reason);
rtlogger.trace("update_topology_state mutations: {}", updates);
mixed_change change{std::move(updates)};
group0_command g0_cmd = _group0.client().prepare_command(std::move(change), guard, reason);
co_await _group0.client().add_entry(std::move(g0_cmd), std::move(guard), _as);
} catch (group0_concurrent_modification&) {
rtlogger.info("race while changing state: {}. Retrying", reason);
throw;
}
}
raft::server_id parse_replaced_node(const std::optional<request_param>& req_param) const {
return service::topology::parse_replaced_node(req_param);
}
@@ -961,6 +975,63 @@ class topology_coordinator : public endpoint_lifecycle_subscriber
}
}
enum class keyspace_rf_change_kind {
default_rf_change,
conversion_to_rack_list,
multi_rf_change
};
future<keyspace_rf_change_kind> choose_keyspace_rf_change_kind(utils::UUID req_id,
lw_shared_ptr<keyspace_metadata> old_ks_md,
lw_shared_ptr<keyspace_metadata> new_ks_md,
const std::vector<schema_ptr>& tables_with_mvs) {
const auto& new_replication_strategy_config = new_ks_md->strategy_options();
const auto& old_replication_strategy_config = old_ks_md->strategy_options();
auto check_needs_colocation = [&] () -> future<bool> {
bool rack_list_conversion = false;
for (const auto& [dc, rf_value] : new_replication_strategy_config) {
if (std::holds_alternative<locator::rack_list>(rf_value)) {
auto it = old_replication_strategy_config.find(dc);
if (it != old_replication_strategy_config.end() && std::holds_alternative<sstring>(it->second)) {
rack_list_conversion = true;
break;
}
}
}
co_return rack_list_conversion ? co_await requires_rack_list_colocation(_db, get_token_metadata_ptr(), &_sys_ks, req_id) : false;
};
auto all_changes_are_0_N = [&] {
auto all_dcs = old_replication_strategy_config | std::views::keys;
auto new_dcs = new_replication_strategy_config | std::views::keys;
std::set<sstring> dcs(all_dcs.begin(), all_dcs.end());
dcs.insert(new_dcs.begin(), new_dcs.end());
for (const auto& dc : dcs) {
auto old_it = old_replication_strategy_config.find(dc);
auto new_it = new_replication_strategy_config.find(dc);
size_t old_rf = (old_it != old_replication_strategy_config.end()) ? locator::get_replication_factor(old_it->second) : 0;
size_t new_rf = (new_it != new_replication_strategy_config.end()) ? locator::get_replication_factor(new_it->second) : 0;
if (old_rf == new_rf) {
continue;
}
if (old_rf != 0 && new_rf != 0) {
return false;
}
}
return true;
};
if (tables_with_mvs.empty()) {
co_return keyspace_rf_change_kind::default_rf_change;
}
if (co_await check_needs_colocation()) {
co_return keyspace_rf_change_kind::conversion_to_rack_list;
}
if (_feature_service.keyspace_multi_rf_change && locator::uses_rack_list_exclusively(old_replication_strategy_config) && locator::uses_rack_list_exclusively(new_replication_strategy_config) && !rf_count_per_dc_equals(old_replication_strategy_config, new_replication_strategy_config) && all_changes_are_0_N()) {
co_return keyspace_rf_change_kind::multi_rf_change;
}
co_return keyspace_rf_change_kind::default_rf_change;
}
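One way to read all_changes_are_0_N(): a keyspace qualifies for multi_rf_change only if every DC whose RF differs is either newly added (0 to N) or dropped entirely (N to 0); resizing a non-zero RF in place disqualifies it. A condensed std-only restatement with illustrative values:

#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <unordered_map>

using rf_map = std::unordered_map<std::string, size_t>;

// Same rule as all_changes_are_0_N above: every DC whose RF differs must
// go 0 -> N (DC added) or N -> 0 (DC dropped); in-place resizes disqualify.
bool all_changes_are_0_N(const rf_map& oldm, const rf_map& newm) {
    std::set<std::string> dcs;
    for (const auto& [dc, rf] : oldm) { (void)rf; dcs.insert(dc); }
    for (const auto& [dc, rf] : newm) { (void)rf; dcs.insert(dc); }
    for (const auto& dc : dcs) {
        size_t o = oldm.count(dc) ? oldm.at(dc) : 0;
        size_t n = newm.count(dc) ? newm.at(dc) : 0;
        if (o != n && o != 0 && n != 0) {
            return false;
        }
    }
    return true;
}

int main() {
    assert(all_changes_are_0_N({{"dc1", 3}}, {{"dc1", 3}, {"dc2", 3}}));  // add dc2: ok
    assert(!all_changes_are_0_N({{"dc1", 3}}, {{"dc1", 2}}));             // resize dc1: rejected
}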
// Precondition: there is no node request and no ongoing topology transition
// (checked under the guard we're holding).
future<> handle_global_request(group0_guard guard) {
@@ -1016,9 +1087,18 @@ class topology_coordinator : public endpoint_lifecycle_subscriber
saved_ks_props = *req_entry.new_keyspace_rf_change_data;
}
auto tbuilder_with_request_drop = [&] () {
topology_mutation_builder tbuilder(guard.write_timestamp());
tbuilder.set_transition_state(topology::transition_state::tablet_migration)
.set_version(_topo_sm._topology.version + 1)
.del_global_topology_request()
.del_global_topology_request_id()
.drop_first_global_topology_request_id(_topo_sm._topology.global_requests_queue, req_id);
return tbuilder;
};
utils::chunked_vector<canonical_mutation> updates;
sstring error;
bool needs_colocation = false;
if (_db.has_keyspace(ks_name)) {
try {
auto& ks = _db.find_keyspace(ks_name);
@@ -1030,82 +1110,93 @@ class topology_coordinator : public endpoint_lifecycle_subscriber
size_t unimportant_init_tablet_count = 2; // must be a power of 2
locator::tablet_map new_tablet_map{unimportant_init_tablet_count};
auto schedule_migrations = [&] () -> future<> {
    auto tables_with_mvs = ks.metadata()->tables();
    auto views = ks.metadata()->views();
    tables_with_mvs.insert(tables_with_mvs.end(), views.begin(), views.end());
    auto rf_change_kind = co_await choose_keyspace_rf_change_kind(req_id, ks.metadata(), ks_md, tables_with_mvs);
    // Decide how to execute the RF change: apply it in a single step, pause it
    // until rack-list colocation completes, or run the multi-step procedure.
    switch (rf_change_kind) {
    case keyspace_rf_change_kind::default_rf_change: {
        if (!tables_with_mvs.empty()) {
            auto table = tables_with_mvs.front();
            auto tablet_count = tmptr->tablets().get_tablet_map(table->id()).tablet_count();
            locator::replication_strategy_params params{ks_md->strategy_options(), tablet_count, ks.metadata()->consistency_option()};
            auto new_strategy = locator::abstract_replication_strategy::create_replication_strategy("NetworkTopologyStrategy", params, tmptr->get_topology());
            for (const auto& table_or_mv : tables_with_mvs) {
                if (!tmptr->tablets().is_base_table(table_or_mv->id())) {
                    // Apply the transition only on base tables.
                    // If this table has a base table then the transition will be applied on the base table, and
                    // the base table will coordinate the transition for the entire group.
                    continue;
                }
                auto old_tablets = co_await tmptr->tablets().get_tablet_map(table_or_mv->id()).clone_gently();
                new_tablet_map = co_await new_strategy->maybe_as_tablet_aware()->reallocate_tablets(table_or_mv, tmptr, co_await old_tablets.clone_gently());
                replica::tablet_mutation_builder tablet_mutation_builder(guard.write_timestamp(), table_or_mv->id());
                co_await new_tablet_map.for_each_tablet([&](locator::tablet_id tablet_id, const locator::tablet_info& tablet_info) -> future<> {
                    auto last_token = new_tablet_map.get_last_token(tablet_id);
                    auto old_tablet_info = old_tablets.get_tablet_info(last_token);
                    auto abandoning_replicas = locator::substract_sets(old_tablet_info.replicas, tablet_info.replicas);
                    auto new_replicas = locator::substract_sets(tablet_info.replicas, old_tablet_info.replicas);
                    if (abandoning_replicas.size() + new_replicas.size() > 1) {
                        throw std::runtime_error(fmt::format("Invalid state of a tablet {} of a table {}.{}. Expected replication factor: {}, but the tablet has replicas only on {}. "
                                "Try again later or use the \"Fixing invalid replica state with RF change\" procedure to fix the problem.", tablet_id, ks_name, table_or_mv->cf_name(),
                                ks.get_replication_strategy().get_replication_factor(*tmptr), old_tablet_info.replicas));
                    }
                    updates.emplace_back(co_await make_canonical_mutation_gently(
                        replica::tablet_mutation_builder(guard.write_timestamp(), table_or_mv->id())
                            .set_new_replicas(last_token, tablet_info.replicas)
                            .set_stage(last_token, locator::tablet_transition_stage::allow_write_both_read_old)
                            .set_transition(last_token, locator::choose_rebuild_transition_kind(_feature_service))
                            .build()
                    ));
                    // Calculate the abandoning replica and abort view building tasks on it
                    if (!abandoning_replicas.empty()) {
                        if (abandoning_replicas.size() != 1) {
                            on_internal_error(rtlogger, fmt::format("Keyspace RF abandons {} replicas for table {} and tablet id {}", abandoning_replicas.size(), table_or_mv->id(), tablet_id));
                        }
                        _vb_coordinator->abort_tasks(updates, guard, table_or_mv->id(), *abandoning_replicas.begin(), last_token);
                    }
                    co_await coroutine::maybe_yield();
                });
            }
        }
        auto schema_muts = prepare_keyspace_update_announcement(_db, ks_md, guard.write_timestamp());
        for (auto& m: schema_muts) {
            updates.emplace_back(m);
        }
        updates.push_back(canonical_mutation(tbuilder_with_request_drop().build()));
        updates.push_back(canonical_mutation(topology_request_tracking_mutation_builder(req_id)
                .done()
                .build()));
        break;
    }
    case keyspace_rf_change_kind::conversion_to_rack_list: {
        rtlogger.info("keyspace_rf_change for keyspace {} postponed for colocation", ks_name);
        topology_mutation_builder tbuilder = tbuilder_with_request_drop();
        tbuilder.pause_rf_change_request(req_id);
        updates.push_back(canonical_mutation(tbuilder.build()));
        break;
    }
    case keyspace_rf_change_kind::multi_rf_change: {
        rtlogger.info("keyspace_rf_change for keyspace {} will use multi-rf change procedure", ks_name);
        ks_md->set_next_strategy_options(ks_md->strategy_options());
        ks_md->set_strategy_options(ks.metadata()->strategy_options()); // start from the old strategy
        auto schema_muts = prepare_keyspace_update_announcement(_db, ks_md, guard.write_timestamp());
        for (auto& m: schema_muts) {
            updates.emplace_back(m);
        }
        topology_mutation_builder tbuilder = tbuilder_with_request_drop();
        tbuilder.start_rf_change_migrations(req_id);
        updates.push_back(canonical_mutation(tbuilder.build()));
        break;
    }
    }
};
co_await schedule_migrations();
} catch (const std::exception& e) {
error = e.what();
rtlogger.error("Couldn't process global_topology_request::keyspace_rf_change, desired new ks opts: {}, error: {}",
@@ -1116,22 +1207,12 @@ class topology_coordinator : public endpoint_lifecycle_subscriber
error = "Can't ALTER keyspace " + ks_name + ", keyspace doesn't exist";
}
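// On failure (an exception above or a missing keyspace), drop the request from
// the global request queue and record the error on the tracked request.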
if (error != "") {
    updates.push_back(canonical_mutation(tbuilder_with_request_drop().build()));
    updates.push_back(canonical_mutation(topology_request_tracking_mutation_builder(req_id)
            .done(error)
            .build()));
}
sstring reason = seastar::format("ALTER tablets KEYSPACE called with options: {}", saved_ks_props);
rtlogger.trace("do update {} reason {}", updates, reason);
@@ -1615,6 +1696,83 @@ class topology_coordinator : public endpoint_lifecycle_subscriber
.build());
}
// Updates the keyspace properties, removes system_schema.keyspaces::next_replication,
// finishes the RF change request, and removes the request from system.topology::ongoing_rf_changes.
void generate_rf_change_completion_update(utils::chunked_vector<canonical_mutation>& out, const group0_guard& guard, const rf_change_completion_info& completion) {
if (rtlogger.is_enabled(seastar::log_level::debug)) {
sstring props_str;
for (const auto& [key, value] : completion.saved_ks_props) {
props_str += fmt::format(" {}={};", key, value);
}
rtlogger.debug("generate_rf_change_completion_update: request_id={}, ks_name={}, error='{}', saved_ks_props:{}",
completion.request_id, completion.ks_name, completion.error, props_str);
}
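// On success, apply the saved keyspace properties as the new replication options;
// on failure, keep the current options. Either way, clear the pending
// next_replication from the keyspace metadata before announcing it.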
sstring error = completion.error;
if (_db.has_keyspace(completion.ks_name)) {
auto& ks = _db.find_keyspace(completion.ks_name);
if (error.empty()) {
cql3::statements::ks_prop_defs new_ks_props{std::map<sstring, sstring>{completion.saved_ks_props.begin(), completion.saved_ks_props.end()}};
new_ks_props.validate();
auto ks_md = new_ks_props.as_ks_metadata_update(ks.metadata(), *get_token_metadata_ptr(), _db.features(), _db.get_config());
ks_md->clear_next_strategy_options();
auto schema_muts = prepare_keyspace_update_announcement(_db, ks_md, guard.write_timestamp());
for (auto& m: schema_muts) {
out.emplace_back(m);
}
} else {
auto ks_md = make_lw_shared<data_dictionary::keyspace_metadata>(*ks.metadata());
ks_md->clear_next_strategy_options();
auto schema_muts = prepare_keyspace_update_announcement(_db, ks_md, guard.write_timestamp());
for (auto& m: schema_muts) {
out.emplace_back(m);
}
}
}
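// Regardless of whether the keyspace still exists, remove the entry from
// system.topology::ongoing_rf_changes and mark the tracked request as done
// (with the error, if any).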
out.emplace_back(topology_mutation_builder(guard.write_timestamp())
.finish_rf_change_migrations(_topo_sm._topology.ongoing_rf_changes, completion.request_id)
.build());
out.push_back(canonical_mutation(topology_request_tracking_mutation_builder(completion.request_id)
.done(error)
.build()));
}
// Sets next_replication back to current_replication and records the error on the topology request.
// Similar to storage_service::abort_rf_change, but for the ongoing_rf_changes case.
void generate_rf_change_abort_update(utils::chunked_vector<canonical_mutation>& out, const group0_guard& guard, const rf_change_abort_info& abort_info) {
rtlogger.debug("generate_rf_change_abort_update: request_id={}, ks_name={}, error='{}'", abort_info.request_id, abort_info.ks_name, abort_info.error);
if (!_db.has_keyspace(abort_info.ks_name)) {
return;
}
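// Point next_replication back at the current replication options and announce
// the updated keyspace metadata, then record the abort on the tracked request.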
auto& ks = _db.find_keyspace(abort_info.ks_name);
auto ks_md = make_lw_shared<data_dictionary::keyspace_metadata>(*ks.metadata());
ks_md->set_next_strategy_options(abort_info.current_replication);
auto schema_muts = prepare_keyspace_update_announcement(_db, ks_md, guard.write_timestamp());
for (auto& m : schema_muts) {
out.emplace_back(m);
}
out.push_back(canonical_mutation(topology_request_tracking_mutation_builder(abort_info.request_id)
.abort(abort_info.error)
.build()));
}
future<> generate_rf_change_updates(utils::chunked_vector<canonical_mutation>& out, const group0_guard& guard, const keyspace_rf_change_plan& rf_change_plan) {
for (const auto& abort_info : rf_change_plan.aborts) {
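// Yield between abort updates so a long list does not monopolize the shard.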
co_await coroutine::maybe_yield();
generate_rf_change_abort_update(out, guard, abort_info);
}
if (rf_change_plan.completion.has_value()) {
generate_rf_change_completion_update(out, guard, *rf_change_plan.completion);
}
}
future<> generate_migration_updates(utils::chunked_vector<canonical_mutation>& out, const group0_guard& guard, const migration_plan& plan) {
if (plan.resize_plan().finalize_resize.empty() || plan.has_nodes_to_drain()) {
// Schedule tablet migrations only if there are no pending resize finalizations
// or if there are nodes to drain.
@@ -1637,6 +1795,8 @@ class topology_coordinator : public endpoint_lifecycle_subscriber
if (auto request_to_resume = plan.rack_list_colocation_plan().request_to_resume(); request_to_resume) {
generate_rf_change_resume_update(out, guard, request_to_resume);
}
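// Emit any pending RF-change aborts, and the completion if one is due,
// alongside the migration updates.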
co_await generate_rf_change_updates(out, guard, plan.rf_change_plan());
}
auto sched_time = db_clock::now();
@@ -2225,9 +2385,11 @@ class topology_coordinator : public endpoint_lifecycle_subscriber
}
bool has_nodes_to_drain = false;
bool requires_schema_changes = false;
if (!preempt) {
auto plan = co_await _tablet_allocator.balance_tablets(get_token_metadata_ptr(), &_topo_sm._topology, &_sys_ks, {}, get_dead_nodes());
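// The plan is consumed by generate_migration_updates() below, so capture the
// properties that are still needed afterwards.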
has_nodes_to_drain = plan.has_nodes_to_drain();
requires_schema_changes = plan.requires_schema_changes();
if (!drain || plan.has_nodes_to_drain()) {
co_await generate_migration_updates(updates, guard, plan);
}
@@ -2243,7 +2405,11 @@ class topology_coordinator : public endpoint_lifecycle_subscriber
topology_mutation_builder(guard.write_timestamp())
.set_version(_topo_sm._topology.version + 1)
.build());
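// If the balancing plan also requires schema changes, commit the schema and
// topology mutations together via the mixed-change path.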
if (requires_schema_changes) {
    co_await update_topology_state_with_mixed_change(std::move(guard), std::move(updates), format("Tablet migration"));
} else {
    co_await update_topology_state(std::move(guard), std::move(updates), format("Tablet migration"));
}
}
if (needs_barrier) {
@@ -4134,7 +4300,11 @@ future<std::optional<group0_guard>> topology_coordinator::maybe_start_tablet_mig
.set_version(_topo_sm._topology.version + 1)
.build());
if (plan.requires_schema_changes()) {
    co_await update_topology_state_with_mixed_change(std::move(guard), std::move(updates), "Starting tablet migration");
} else {
    co_await update_topology_state(std::move(guard), std::move(updates), "Starting tablet migration");
}
co_return std::nullopt;
}

Some files were not shown because too many files have changed in this diff