As a final step for https://scylladb.atlassian.net/browse/SCYLLADB-461 we need to graduate Alternator Streams from experimental.
So let's remove `--experimental-features=alternator-streams` and map the obsolete config string to `UNUSED` for backward compatibility. Also, remove the related gating of the feature.
Finally, stop providing the config flag in test configs.
Fixes SCYLLADB-1680
Fixes#16367
To documentation tracked by https://scylladb.atlassian.net/browse/SCYLLADB-462 still remains.
This PR needs to hit 2026.2, so (only) if it branches before the PR is merged to `master`, we'd need to backport.
- (cherry picked from commit 870013b437)
- (cherry picked from commit 9a86044c63)
Parent PR: #29604Closesscylladb/scylladb#29817
* github.com:scylladb/scylladb:
test: Stop providing alternator-streams experimental flag
alternator: Graduate Alternator Streams from experimental
value_to_json() converts CQL values to JSON for vector search filters.
For decimal and varint types, it used rjson::parse() on the JSON string,
which parses through a double and silently loses precision for values
exceeding ~15 significant digits — producing wrong filter results.
Additionally, for decimal type we need an exact string representation
that preserves the original (unscaled, scale) pair, because partition
keys use byte-level identity: different serialized representations of
the same numeric value are distinct rows, so the filter must reproduce
the exact representation stored in the key.
Add big_decimal::to_string_canonical() which follows the Java BigDecimal
toString() spec (JDK 8+), producing a bijective string representation
that uses exponential notation for extreme scales instead of expanding
trailing zeros (which could cause OOM). This could replace to_string(),
but doing so has wider consequences (e.g. hash/equality contract for
decimal_type) described in SCYLLADB-1574. Use it in value_to_json() for
decimal_type, and use rjson::from_string() for varint_type, both
bypassing the lossy double parse path.
Tests cover the new to_string_canonical() and the filter fix, as well as
existing decimal type behavior (key representation, clustering order,
toJson) that we rely on and must not break. The CQL decimal type tests
(test_type_decimal.py) also pass against Cassandra.
Fixes: SCYLLADB-2107
Refs: https://scylladb.atlassian.net/browse/SCYLLADB-1574Closesscylladb/scylladb#29505
(cherry picked from commit 15493872b2)
Closesscylladb/scylladb#29957
- Fix intranode shard balancing to respect the size-based balance threshold, preventing unnecessary migrations when load difference between shards is negligible
- Add a regression test that verifies the threshold is respected for intranode balancing
The intranode shard balancing loop only stopped when the algorithm exhausted the migration candidates or when a migration would go against convergence (it would increase imbalance instead of decrease it). This caused unnecessary tablet migrations for negligible imbalances (e.g., 0.78% difference between shards).
The inter-node balancer already uses `is_balanced()` to stop when the relative load difference is within the configured `size_based_balance_threshold`, but this check was missing from the intranode path.
Apply the same `is_balanced()` threshold check that is already used for inter-node balancing to the intranode convergence loop. When the relative load difference between the most-loaded and least-loaded shards on a node is within the threshold, the balancer now stops without issuing further migrations.
The test creates a single node with 2 shards and 512 tablets:
1. **Balanced scenario** (257 vs 255 tablets, same size): relative diff = 0.78% < 1% threshold → verifies no intranode migration is emitted
2. **Unbalanced scenario** (307 vs 205 tablets, same size): relative diff = 33% >> 1% threshold → verifies intranode migration IS emitted
Fixes: SCYLLADB-2006
This is a performance improvement which reduces the number of intranode migrations issued, and needs to be backported to versions with size-based load balancing: 2026.1 and 2026.2
- (cherry picked from commit aaead10e5d)
- (cherry picked from commit 6856f51097)
Parent PR: #29756Closesscylladb/scylladb#29895
* github.com:scylladb/scylladb:
test: add test for intranode balance threshold in size-based mode
tablet_allocator: apply balance threshold to intranode shard balancing
Add test_tablet_routing_info_after_cas_shard_bounce that verifies
TABLETS_ROUTING_V1 payload is returned after an internal CAS shard
bounce.
The test simulates the transport-layer bounce: it creates a table whose
single tablet replica lands on a shard different from the test thread,
executes an LWT (which bounces), then transfers client_state via
client_state_for_another_shard (preserving _original_shard) and
re-executes on the tablet shard. The test asserts that check_locality()
correctly detects the misrouting and returns tablet routing info.
Refs SCYLLADB-2041
(cherry picked from commit 8a76ec7e65)
Verify that the load balancer does not issue intranode migrations when
the load difference between shards is within the size_based_balance_threshold,
and that it does issue migrations when the difference exceeds the threshold.
(cherry picked from commit 6856f51097)
The test uses a 10ms read timeout to exercise code paths that handle
timed-out reads without throwing C++ exceptions. As part of setup, it
inserts rows and flushes them to two SSTables, then runs a warm-up
SELECT to populate internal caches (e.g. the auth cache) before the
real test begins.
The reason for this warm-up read was the possibility that the first
read does additional operations (such as reading and caching
authentication) that might throw exceptions internally. I couldn't
verify that such exceptions actually happen in today's code, but
they might (re)appear in the future, so we should keep the warm-up
SELECT.
On slow CI machines (aarch64, debug build), that warm-up SELECT can
take longer than 10ms to read from the two SSTables. When it does, the
read times out: the coordinator receives 0 responses from the local
replica within the deadline and propagates a read_timeout_exception.
Since the exception is not caught, it escapes the test lambda, is
logged as "cql env callback failed", and causes Boost.Test to report a
C++ failure at the do_with_cql_env_thread call site. This matches the
CI failure seen in SCYLLADB-1774:
ERROR ... replica_read_timeout_no_exception: cql env callback failed,
error: exceptions::read_timeout_exception (Operation timed out for
replica_read_timeout_no_exception.tbl - received only 0 responses
from 1 CL=ONE.)
The CI log also shows that only 12 reads were admitted (the warm-up
read plus the 11 reads from the two prepare() calls and CREATE/INSERT
statements made earlier), and the current permit was stuck in
need_cpu state -- the reactor hadn't had a chance to schedule the read
before the 10ms window elapsed.
The fix catches read_timeout_exception from the warm-up SELECT and
retries until the read succeeds. The warm-up is required for
correctness: some lazy-init code paths (e.g. auth cache population)
use C++ exceptions for control flow internally. Those exceptions must
be absorbed before the cxx_exceptions baseline is sampled inside
execute_test(); otherwise they would appear in the delta and cause a
false test failure. Simply ignoring a timed-out warm-up is not safe,
because the lazy-init exceptions would then fire during the 1000 test
reads, inflating cxx_exceptions_after relative to
cxx_exceptions_before.
No other calls in setup are susceptible to the 10ms read timeout:
- CREATE KEYSPACE, CREATE TABLE, INSERT, and flush use the write
timeout (10s) and are not reads.
- e.prepare() goes through the query processor without reading table
data, so it is not subject to the read timeout.
- The semaphore manipulation in Test 2 is internal and has no timeout.
- All 1000 reads in execute_test() are expected to fail, so a timeout
there is the happy path, not a failure.
The 10ms timeout itself is fine for the test's purpose: it is
deliberately aggressive so that reads reliably time out on the hot path
being tested. The problem was only that the pre-test warm-up was not
guarded against the same timeout.
Fixes: SCYLLADB-1830
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#29731
(cherry picked from commit 1f15e05946)
Closesscylladb/scylladb#29760
- Handle dropped tables gracefully in the tablet load balancer's `get_schema_and_rs()` instead of aborting with `on_internal_error`
- The load balancer operates on a token metadata snapshot but accesses the live schema for table lookups. A DROP TABLE applied by another fiber between coroutine yield points can remove a table from the live schema while it still exists in the snapshot, causing an abort.
`get_schema_and_rs()` now returns `std::optional` and logs a warning in debug log level instead of aborting when a table is missing. All callers skip dropped tables:
- `make_sizing_plan`: skips to next table
- `make_resize_plan`: skips to next table (merge suppression is moot)
- `check_constraints`: returns `skip_info{}` with empty viable targets
- `get_rs`: returns `nullptr`, checked by `check_constraints`
The call chain is: `make_plan` → `make_internode_plan` → `check_constraints` → `get_rs` → `get_schema_and_rs`. The `make_internode_plan` coroutine has multiple `co_await` yield points (`maybe_yield`, `pick_candidate`) between building the candidate tablet list and checking replication constraints. A DROP TABLE schema mutation applied during any of these yields removes the table from `_db.get_tables_metadata()` while the candidate list still references it.
Added `test_load_balancing_with_dropped_table` which simulates the race by capturing a token metadata snapshot, dropping the table, then calling `balance_tablets` with the stale snapshot.
Fixes: SCYLLADB-1905
This fix needs to be backported to versions: 2025.4, 2026.1
- (cherry picked from commit 4987204f71)
- (cherry picked from commit 6b3e18c4a9)
Parent PR: #29585Closesscylladb/scylladb#29818
* github.com:scylladb/scylladb:
test: verify load balancer handles dropped tables gracefully
tablet_allocator: handle dropped tables gracefully in get_schema_and_rs
Add a Boost unit test that forces capacity-based balancing through
configuration and verifies that a drained and excluded node will be
drained of its tablets when tablet size stats are missing.
The test covers the regression where the allocator rejected the plan due
to incomplete tablet stats, even though forced capacity-based balancing
does not depend on tablet sizes.
(cherry picked from commit f7bc8f5fa7)
Add test_load_balancing_with_dropped_table that simulates the race between
DROP TABLE and the load balancer by capturing a token metadata snapshot
before dropping the table, then passing the stale snapshot to
balance_tablets(). Verifies it completes without aborting and produces no
migrations for the dropped table.
(cherry picked from commit 6b3e18c4a9)
Alternator Streams were experimental until 2026.2, when they became GA.
Stop requiring `--experimental-features=alternator-streams` by:
- Removing ALTERNATOR_STREAMS from the experimental feature enum
- Mapping "alternator-streams" to UNUSED for backward compatibility
- Removing the gating that disabled the ALTERNATOR_STREAMS gossip
feature when the experimental flag was absent
- Removing the runtime guard that rejected StreamSpecification requests
without the feature flag
- Updating config_test to reflect the new UNUSED mapping
The gms::feature alternator_streams is kept for rolling upgrade
compatibility with older nodes.
Fixes SCYLLADB-1680
(cherry picked from commit 870013b437)
insert() held no local strong ref to the prepared modification_statement
across the suspension in execute(). On a single shard:
1. Fiber A suspends inside _insert_stmt->execute().
2. DROP TABLE / DROP KEYSPACE on the target, or LRU eviction, removes
the prepared_statements_cache entry, releasing its strong ref.
3. Fiber B re-enters cache_table_info(), sees _prepared_stmt
(checked_weak_ptr) invalidated, and runs _insert_stmt = nullptr,
releasing the last strong ref. The modification_statement is freed.
4. Fiber A resumes inside execute() and touches freed *this.
Pin strong ref to _insert_stmt locally before the suspension.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1667
Backport: all supported branches, it's memory corruption bug, long present
Closesscylladb/scylladb#29588
* github.com:scylladb/scylladb:
test/boost: add dummy case to table_helper_test for non-injection modes
test/boost: add regression test for table_helper insert() UAF
utils/error_injection: add waiters() API
table_helper: fix use-after-free on prepared-statement invalidation
(cherry picked from commit efcc0b6376)
Closesscylladb/scylladb#29747
When `permissions_validity_in_ms` is set to 0, executing a prepared statement under authentication crashes with:
```
Assertion `caching_enabled()' failed.
at utils/loading_cache.hh:319
in authorized_prepared_statements_cache::insert
```
`loading_cache::get_ptr()` asserts when caching is disabled (expiry == 0), but `authorized_prepared_statements_cache::insert()` was using it purely for its side effect of populating the cache, which is meaningless when caching is off.
Add a new `loading_cache::insert(k, load)` method that is a no-op when caching is disabled and otherwise forwards to `get_ptr()`. Switch `authorized_prepared_statements_cache::insert()` to use it. This
completes the disabled-mode safety contract of the cache for the write side, mirroring the fallback that `get()` already provides for the read side.
Includes a regression test in `test/boost/loading_cache_test.cc` plus a positive test for the new `insert()` overload.
Fixes SCYLLADB-1699
The crash is introduced a long time ago. It is present on all the live versions, from 2025.1 onward. No client tickets, but it should be backported.
Closesscylladb/scylladb#29638
* github.com:scylladb/scylladb:
test: boost: regression test for loading_cache::insert with caching disabled
utils: loading_cache: add insert() that is a no-op when caching is disabled
(cherry picked from commit c00fee0316)
Closesscylladb/scylladb#29762
Commit 8d34127684 ("sstables: clean up TemporaryHashes file in wipe()")
unconditionally calls filename(..., component_type::TemporaryHashes)
inside filesystem_storage::wipe(). However, the TemporaryHashes
component is only registered in the component map of the 'ms' sstable
format. For older formats (ka, la, mc, md, me) the lookup goes through
sstable_version_constants::get_component_map(version).at(...) and throws
std::out_of_range.
The exception is then swallowed by the outer catch(...) in wipe(), which
just logs and ignores. As a side effect, the subsequent
remove_file(new_toc_name) is never reached and the TemporaryTOC
('*-TOC.txt.tmp') file is left as an orphan on disk after every unlink()
of a non-'ms' sstable.
Guard the lookup with get_component_map(version).contains() so the
cleanup is only attempted for formats that actually define the
component.
Add a regression test in test/boost/sstable_directory_test.cc that
creates an 'me'-format sstable, unlinks it and asserts that the sstable
directory is left empty. Without the fix the test fails with a leftover
'me-...-TOC.txt.tmp' file.
Fixes: SCYLLADB-1767
Closesscylladb/scylladb#29620
(cherry picked from commit 7e14ea5ac8)
Closesscylladb/scylladb#29692
lock_tables_metadata() acquires a write lock on tables_metadata._cf_lock
on every shard. It used invoke_on_all(), which dispatches lock
acquisitions to all shards in parallel via parallel_for_each +
smp::submit_to.
When two fibers call lock_tables_metadata() concurrently, this can
deadlock. parallel_for_each starts all iterations unconditionally:
even when the local shard's lock attempt blocks (because the other
fiber already holds it), SMP messages are still sent to remote shards.
Both fibers' lock-acquisition messages land in the per-shard SMP
queues. The SMP queue itself is FIFO, but process_incoming() drains
it and schedules each item as a reactor task via add_task(), which —
in debug and sanitize builds with SEASTAR_SHUFFLE_TASK_QUEUE — shuffles
each newly added task against all pending tasks in the same scheduling
group's reactor task queue. This means fiber A's lock acquisition can
be reordered past fiber B's (and past unrelated tasks) on a given shard.
If fiber A wins the lock on shard X while fiber B wins on shard Y, this
creates a classic cross-shard lock-ordering deadlock (circular wait).
In production builds without SEASTAR_SHUFFLE_TASK_QUEUE, the reactor
task queue is FIFO. Still, even in release builds, the SMP queues can
reorder messages even, so the deadlock is still possible, even if it's
much less likely. In debug and sanitize builds, the task-queue shuffle
makes the deadlock very likely whenever both fibers' lock-acquisition
tasks are pending simultaneously in the reactor task queue on any shard.
This deadlock was exposed by ce00d61917 ("db: implement large_data
virtual tables with feature flag gating", merged as 88a8324e68),
which introduced legacy_drop_table_on_all_shards as a second caller
of lock_tables_metadata(). When LARGE_DATA_VIRTUAL_TABLES is enabled
during topology_state_load (via feature_service::enable), two fibers
can race:
1. activate_large_data_virtual_tables() — calls
legacy_drop_table_on_all_shards() which calls
lock_tables_metadata() synchronously via .get()
2. reload_schema_in_bg() — fires as a background fiber from
TABLE_DIGEST_INSENSITIVE_TO_EXPIRY, eventually reaches
schema_applier::commit() which also calls lock_tables_metadata()
If both reach lock_tables_metadata() while the lock is free on all
shards, the parallel acquisition creates the deadlock opportunity.
The deadlock blocks topology_state_load() from completing, which
prevents the bootstrapping node from finishing its topology state
transitions. The coordinator's topology coordinator then waits for
the node to reach the expected state, but the node is stuck, so
eventually the read_barrier times out after 300 seconds.
Fix by acquiring the shard 0 lock first before attempting to
acquire any other lock. Whichever fiber wins shard 0 is
guaranteed to acquire all remaining shards before the other fiber
can proceed past shard 0, eliminating the circular-wait condition.
Tested manually with 2 approaches:
1. causing different shard locks to be acquired by different
lock_tables_metadata() calls by adding different sleeps depending
on the lock_tables_metadata() call and target shard - this reproduced
the issue consistently
2. matching the time point at which both fibers reach lock_tables_metadata()
adding a single sleep to one of the fibers - this heavily depends on
the machine so we can't create a universal reproducer this way, but
it did result in the observed failure on my machine after finding the
right sleep time
Also added a unit test for concurrent lock_tables_metadata() calls.
Fixes: SCYLLADB-1784
Fixes: SCYLLADB-1785
Fixes: SCYLLADB-1786
Closesscylladb/scylladb#29678
(cherry picked from commit ebaf536449)
Closesscylladb/scylladb#29709
Commit ce00d61917 ("db: implement large_data virtual tables with feature
flag gating") changed these two tests to construct their mutation with
a randomly generated partition key (simple_schema::make_pkey()) instead
of the previously fixed pk "pv", with the comment that this avoids a
"Failed to generate sharding metadata" error.
simple_schema::make_pkey() delegates to tests::generate_partition_key(),
which defaults to key_size{1, 128}, i.e. the partition key length is
uniformly random in [1, 128] bytes. That interacts badly with the fact
that both tests pick thresholds at exact byte boundaries of the MC
sstable row encoding:
- The large-data handler records a row's size as
_data_writer->offset() - current_pos
(sstables/mx/writer.cc: collect_row_stats()), i.e. the number of
bytes the row took on disk.
- For the first clustering row, the body includes a vint-encoded
prev_row_size = pos - _prev_row_start.
- _prev_row_start is captured at the start of the partition
(consume_new_partition()) before the partition key is written to
the data stream, so prev_row_size rolls in the partition key's
serialized length (2-byte prefix + pk bytes) + deletion_time +
static row size.
A random-size partition key therefore perturbs the first clustering
row's encoded size by 1-2 bytes across runs (the vint of prev_row_size
crosses the 128 boundary), flipping the test's byte-exact threshold
comparison. On seed 2104744000 this produced:
critical check row_size_count == expected.size() has failed [3 != 2]
Fix the two byte-exact-sensitive tests by reverting their partition key
to the fixed s.new_mutation("pv") used before ce00d61917. Under smp=1
(which these tests run with, per -c1 in the test invocation) a fixed
key is always shard-local, so no sharding-metadata issue arises here.
The other tests modified by ce00d61917 (test_sstable_log_too_many_rows,
test_sstable_log_too_many_dead_rows, test_sstable_too_many_collection_elements,
test_large_data_records_round_trip, etc.) assert on row/element counts
or use thresholds with enough slack that the partition key size does
not matter, and are left unchanged.
Add an explanatory comment to each fixed site so the pitfall is not
re-introduced by a future refactor.
Verified stable with:
./test.py --mode=dev test/boost/sstable_3_x_test.cc::test_sstable_write_large_row --repeat 100 --max-failures 1
./test.py --mode=dev test/boost/sstable_3_x_test.cc::test_sstable_write_large_cell --repeat 100 --max-failures 1
./test.py --mode=release test/boost/sstable_3_x_test.cc::test_sstable_write_large_row --repeat 100 --max-failures 1
./test.py --mode=release test/boost/sstable_3_x_test.cc::test_sstable_write_large_cell --repeat 100 --max-failures 1
All four invocations: 100/100 passed.
Fixes: SCYLLADB-1685
Closesscylladb/scylladb#29621
Fixes: SCYLLADB-1523
The returned file object does not increment file pos as is. One line fix.
Added test to make sure this read path works as expected.
Closesscylladb/scylladb#29456
When tombstone_gc=repair, the repaired compaction view's sstable_set_for_tombstone_gc()
previously returned all sstables across all three views (unrepaired, repairing, repaired).
This is correct but unnecessarily expensive: the unrepaired and repairing sets are never
the source of a GC-blocking shadow when tombstone_gc=repair, for base tables.
The key ordering guarantee that makes this safe is:
- topology_coordinator sends send_tablet_repair RPC and waits for it to complete.
Inside that RPC, mark_sstable_as_repaired() runs on all replicas, moving D from
repairing → repaired (repaired_at stamped on disk).
- Only after the RPC returns does the coordinator commit repair_time + sstables_repaired_at
to Raft.
- gc_before = repair_time - propagation_delay only advances once that Raft commit applies.
Therefore, when a tombstone T in the repaired set first becomes GC-eligible (its
deletion_time < gc_before), any data D it shadows is already in the repaired set on
every replica. This holds because:
- The memtable is flushed before the repairing snapshot is taken (take_storage_snapshot
calls sg->flush()), capturing all data present at repair time.
- Hints and batchlog are flushed before the snapshot, ensuring remotely-hinted writes
arrive before the snapshot boundary.
- Legitimate unrepaired data has timestamps close to 'now', always newer than any
GC-eligible tombstone (USING TIMESTAMP to write backdated data is user error / UB).
Excluding the repairing and unrepaired sets from the GC shadow check cannot cause any
tombstone to be wrongly collected. The memtable check is also skipped for the same
reason: memtable data is either newer than the GC-eligible tombstone, or was flushed
into the repairing/repaired set before gc_before advanced.
Safety restriction — materialized views:
The optimization IS applied to materialized view tables. Two possible paths could inject
D_view into the MV's unrepaired set after MV repair: view hints and staging via the
view-update-generator. Both are safe:
(1) View hints: flush_hints() creates a sync point covering BOTH _hints_manager (base
mutations) AND _hints_for_views_manager (view mutations). It waits until ALL pending view
hints — including D_view entries queued in _hints_for_views_manager while the target MV
replica was down — have been replayed to the target node before take_storage_snapshot() is
called. D_view therefore lands in the MV's repairing sstable and is promoted to repaired.
When a repaired compaction then checks for shadows it finds D_view in the repaired set,
keeping T_mv non-purgeable.
(2) View-update-generator staging path: Base table repair can write a missing D_base to a
replica via a staging sstable. The view-update-generator processes the staging sstable
ASYNCHRONOUSLY: it may fire arbitrarily later, even after MV repair has committed
repair_time and T_mv has been GC'd from the repaired set. However, the staging processor
calls stream_view_replica_updates() which performs a READ-BEFORE-WRITE via
as_mutation_source_excluding_staging(): it reads the CURRENT base table state before
building the view update. If T_base was written to the base table (as it always is before
the base replica can be repaired and the MV tombstone can become GC-eligible), the
view_update_builder sees T_base as the existing partition tombstone. D_base's row marker
(ts_d < ts_t) is expired by T_base, so the view update is a no-op: D_view is never
dispatched to the MV replica. No resurrection can occur regardless of how long staging is
delayed.
A potential sub-edge-case is T_base being purged BEFORE staging fires (leaving D_base as
the sole survivor, so stream_view_replica_updates would dispatch D_view). This is blocked
by an additional invariant: for tablet-based tables, the repair writer stamps repaired_at
on staging sstables (repair_writer_impl::create_writer sets mark_as_repaired = true and
perform_component_rewrite writes repaired_at = sstables_repaired_at + 1 on every staging
sstable). After base repair commits sstables_repaired_at to Raft, the staging sstable
satisfies is_repaired(sstables_repaired_at, staging_sst) and therefore appears in
make_repaired_sstable_set(). Any subsequent base repair that advances sstables_repaired_at
further still includes the staging sstable (its repaired_at ≤ new sstables_repaired_at).
D_base in the staging sstable thus shadows T_base in every repaired compaction's shadow
check, keeping T_base non-purgeable as long as D_base remains in staging.
A base table hint also cannot bypass this. A base hint is replayed as a base mutation. The
resulting view update is generated synchronously on the base replica and sent to the MV
replica via _hints_for_views_manager (path 1 above), not via staging.
USING TIMESTAMP with timestamps predating (gc_before + propagation_delay) is explicitly
UB and excluded from the safety argument.
For tombstone_gc modes other than repair (timeout, immediate, disabled) the invariant
does not hold for base tables either, so the full storage-group set is returned.
The expected gain is reduced bloom filter and memtable key-lookup I/O during repaired
compactions: the unrepaired set is typically the largest (it holds all recent writes),
yet for tombstone_gc=repair it never influences GC decisions.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-231.
Closesscylladb/scylladb#29310
* github.com:scylladb/scylladb:
compaction: Restrict tombstone GC sstable set to repaired sstables for tombstone_gc=repair mode
test/repair: Add tombstone GC safety tests for incremental repair
Three test cases in multishard_query_test.cc set the querier_cache entry
TTL to 2s and then assert, between pages of a stateful paged query, that
cached queriers are still present (population >= 1) and that
time_based_evictions stays 0.
The 2s TTL is not load-bearing for what these tests exercise — they are
checking the paging-cache handoff, not TTL semantics. But on busy CI
runners (SCYLLADB-1642 was observed on aarch64 release), scheduling
jitter between saving a reader and sampling the population can exceed
2s. When that happens, the TTL fires, both saved queriers are
time-evicted, population drops to 0, and the assertion
`require_greater_equal(saved_readers, 1u)` fails. The trailing
`require_equal(time_based_evictions, 0)` check never runs because the
earlier assertion has already aborted the iteration — which is why the
Jenkins failure surfaces only as a bare "C++ failure at seastar_test.cc:93".
Reproduced deterministically in test_read_with_partition_row_limits by
injecting a `seastar::sleep(2500ms)` between the save and the sample:
the hook then reports
population=0 inserts=2 drops=0 time_based_evictions=2 resource_based_evictions=0
and the assertion fires — matching the Jenkins symptoms exactly.
Bump the TTL to 60s in all three affected tests:
- test_read_with_partition_row_limits (confirmed repro for SCYLLADB-1642)
- test_read_all (same pattern, same invariants — suspect)
- test_read_all_multi_range (same pattern, same invariants — suspect)
Leave test_abandoned_read (1s TTL, actually tests TTL-driven eviction)
and test_evict_a_shard_reader_on_each_page (tests manual eviction via
evict_one(); its TTL is not load-bearing but the fix is deferred for a
separate review) unchanged.
Fixes: SCYLLADB-1642
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closesscylladb/scylladb#29564
With this change, you can add or remove a DC(s) in a single ALTER KEYSPACE statement. It requires the keyspace to use rack list replication factor.
In existing approach, during RF change all tablet replicas are rebuilt at once. This isn't the case now. In global_topology_request::keyspace_rf_change the request is added to a ongoing_rf_changes - a new column in system.topology table. In a new column in system_schema.keyspaces - next_replication - we keep the target RF.
In make_rf_change_plan, load balancer schedules necessary migrations, considering the load of nodes and other pending tablet transitions. Requests from ongoing_rf_changes are processed concurrently, independently from one another. In each request racks are processed concurrently. No tablet replica will be removed until all required replicas are added. While adding replicas to each rack we always start with base tables and won't proceed with views until they are done (while removing - the other way around). The intermediary steps aren't reflected in schema. When the Rf change is finished:
- in system_schema.keyspaces:
- next_replication is cleared;
- new keyspace properties are saved;
- request is removed from ongoing_rf_changes;
- the request is marked as done in system.topology_requests.
Until the request is done, DESCRIBE KEYSPACE shows the replication_v2.
If a request hasn't started to remove replicas, it can be aborted using task manager. system.topology_requests::error is set (but the request isn't marked as done) and next_replication = replication_v2. This will be interpreted by load balancer, that will start the rollback of the request. After the rollback is done, we set the relevant system.topology_requests entry as done (failed), clear the request id from system.topology::ongoing_rf_changes, and remove next_replication.
Fixes: SCYLLADB-567.
No backport needed; new feature.
Closesscylladb/scylladb#24421
* github.com:scylladb/scylladb:
service: fix indentation
docs: update documentation
test: test multi RF changes
service: tasks: allow aborting ongoing RF changes
cql3: allow changing RF by more than one when adding or removing a DC
service: handle multi_rf_change
service: implement make_rf_change_plan
service: add keyspace_rf_change_plan to migration_plan
service: extend tablet_migration_info to handle rebuilds
service: split update_node_load_on_migration
service: rearrange keyspace_rf_change handler
db: add columns to system_schema.keyspaces
db: service: add ongoing_rf_changes to system.topology
gms: add keyspace_multi_rf_change feature
The statement_restrictions code is responsible for analyzing the WHERE
clause, deciding on the query plan (which index to use), and extracting
the partition and clustering keys to use for the index.
Currently, it suffers from repetition in making its decisions: there are 15
calls to expr::visit in statement_restrictions.cc, and 14 find_binop calls. This
reduces to 2 visits (one nested in the other) and 6 find_binop calls. The analysis
of binary operators is done once, then reused.
The key data structure introduced is the predicate. While an expression
takes inputs from the row evaluated, constants, and bind variables, and
produces a boolean result, predicates ask which values for a column (or
a number of columns) are needed to satisfy (part of) the WHERE clause.
The WHERE clause is then expressed as a conjunction of such predicates.
The analyzer uses the predicates to select the index, then uses the predicates
to compute the partition and clustering keys.
The refactoring is composed of these parts (but patches from different parts
are interspersed):
1. an exhaustive regression test is added as the first commit, to ensure behavior doesn't change
2. move computation from query time to prepare time
3. introduce, gradually enrich, and use predicates to implement the statement_restrictions API
Major refactoring, and no bugs fixed, so definitely not backporting.
Closesscylladb/scylladb#29114
* github.com:scylladb/scylladb:
cql3: statement_restrictions: replace has_eq_restriction_on_column with precomputed set
cql3: statement_restrictions: replace multi_column_range_accumulator_builder with direct predicate iteration
cql3: statement_restrictions: use predicate fields in build_get_clustering_bounds_fn
cql3: statement_restrictions: remove extract_single_column_restrictions_for_column
cql3: statement_restrictions: use predicate vectors in prepare_indexed_local
cql3: statement_restrictions: use predicate vector size for clustering prefix length
cql3: statement_restrictions: replace do_find_idx and is_supported_by with predicate-based versions
cql3: statement_restrictions: remove expression-based has_supporting_index and index_supports_some_column
cql3: statement_restrictions: replace multi-column and PK index support checks with predicate-based versions
cql3: statement_restrictions: add predicate-based index support checking
cql3: statement_restrictions: use pre-built single-column maps for index support checks
cql3: statement_restrictions: build clustering-prefix restrictions incrementally
cql3: statement_restrictions: build partition-range restrictions incrementally
cql3: statement_restrictions: build clustering-key single-column restrictions map incrementally
cql3: statement_restrictions: build partition-key single-column restrictions map incrementally
cql3: statement_restrictions: build non-primary-key single-column restrictions map incrementally
cql3: statement_restrictions: use tracked has_mc_clustering for _has_multi_column
cql3: statement_restrictions: track has-token state incrementally
cql3: statement_restrictions: track partition-key-empty state incrementally
cql3: statement_restrictions: track first multi-column predicate incrementally
cql3: statement_restrictions: track last clustering column incrementally
cql3: statement_restrictions: track clustering-has-slice incrementally
cql3: statement_restrictions: track has-multi-column-clustering incrementally
cql3: statement_restrictions: track clustering-empty state incrementally
cql3: statement_restrictions: replace restr bridge variable with pred.filter
cql3: statement_restrictions: convert single-column branch to use predicate properties
cql3: statement_restrictions: convert multi-column branch to use predicate properties
cql3: statement_restrictions: convert constructor loop to iterate over predicates
cql3: statement_restrictions: annotate predicates with operator properties
cql3: statement_restrictions: annotate predicates with is_not_null and is_multi_column
cql3: statement_restrictions: complete preparation early
cql3: statement_restrictions: convert expressions to predicates without being directed at a specific column
cql3: statement_restrictions: refine possible_lhs_values() function_call processing
cql3: statement_restrictions: return nullptr for function solver if not token
cql3: statement_restrictions: refine possible_lhs_values() subscript solving
cql3: statement_restrictions: return nullptr from possible_lhs_values instead of on_internal_error
cql3: statement_restrictions: convert possible_lhs_values into a solver
cql3: statement_restrictions: split _where to boolean factors in preparation for predicates conversion
cql3: statement_restrictions: refactor IS NOT NULL processing
cql3: statement_restrictions: fold add_single_column_nonprimary_key_restriction() into its caller
cql3: statement_restrictions: fold add_single_column_clustering_key_restriction() into its caller
cql3: statement_restrictions: fold add_single_column_partition_key_restriction() into its caller
cql3: statement_restrictions: fold add_token_partition_key_restriction() into its caller
cql3: statement_restrictions: fold add_multi_column_clustering_key_restriction() into its caller
cql3: statement_restrictions: avoid early return in add_multi_column_clustering_key_restrictions
cql3: statement_restrictions: fold add_is_not_restriction() into its caller
cql3: statement_restrictions: fold add_restriction() into its caller
cql3: statement_restrictions: remove possible_partition_token_values()
cql3: statement_restrictions: remove possible_column_values
cql3: statement_restrictions: pass schema to possible_column_values()
cql3: statement_restrictions: remove fallback path in solve()
cql3: statement_restrictions: reorder possible_lhs_column parameters
cql3: statement_restrictions: prepare solver for multi-column restrictions
cql3: statement_restrictions: add solver for token restriction on index
cql3: statement_restrictions: pre-analyze column in value_for()
cql3: statement_restrictions: don't handle boolean constants in multi_column_range_accumulator_builder
cql3: statement_restrictions: split range_from_raw_bounds into prepare phase and query phase
cql3: statement_restrictions: adjust signature of range_from_raw_bounds
cql3: statement_restrictions: split multi_column_range_accumulator into prepare-time and query-time phases
cql3: statement_restrictions: make get_multi_column_clustering_bounds a builder
cql3: statement_restrictions: multi-key clustering restrictions one layer deeper
cql3: statement_restrictions: push multi-column post-processing into get_multi_column_clustering_bounds()
cql3: statement_restrictions: pre-analyze single-column clustering key restrictions
cql3: statement_restrictions: wrap value_for_index_partition_key()
cql3: statement_restrictions: hide value_for()
cql3: statement_restrictions: push down clustering prefix wrapper one level
cql3: statement_restrictions: wrap functions that return clustering ranges
cql3: statement_restrictions: do not pass view schema back and forth
cql3: statement_restrictions: pre-analyze token range restrictions
cql3: statement_restrictions: pre-analyze partition key columns
cql3: statement_restrictions: do not collect subscripted partition key columns
cql3: statement_restrictions: split _partition_range_restrictions into three cases
cql3: statement_restrictions: move value_list, value_set to header file
cql3: statement_restrictions: wrap get_partition_key_ranges
cql3: statement_restrictions: prepare statement_restrictions for capturing `this`
test: statement_restrictions: add index_selection regression test
When tombstone_gc=repair, the repaired compaction view's sstable_set_for_tombstone_gc()
previously returned all sstables across all three views (unrepaired, repairing, repaired).
This is correct but unnecessarily expensive: the unrepaired and repairing sets are never
the source of a GC-blocking shadow when tombstone_gc=repair, for base tables.
The key ordering guarantee that makes this safe is:
- topology_coordinator sends send_tablet_repair RPC and waits for it to complete.
Inside that RPC, mark_sstable_as_repaired() runs on all replicas, moving D from
repairing → repaired (repaired_at stamped on disk).
- Only after the RPC returns does the coordinator commit repair_time + sstables_repaired_at
to Raft.
- gc_before = repair_time - propagation_delay only advances once that Raft commit applies.
Therefore, when a tombstone T in the repaired set first becomes GC-eligible (its
deletion_time < gc_before), any data D it shadows is already in the repaired set on
every replica. This holds because:
- The memtable is flushed before the repairing snapshot is taken (take_storage_snapshot
calls sg->flush()), capturing all data present at repair time.
- Hints and batchlog are flushed before the snapshot, ensuring remotely-hinted writes
arrive before the snapshot boundary.
- Legitimate unrepaired data has timestamps close to 'now', always newer than any
GC-eligible tombstone (USING TIMESTAMP to write backdated data is user error / UB).
Excluding the repairing and unrepaired sets from the GC shadow check cannot cause any
tombstone to be wrongly collected. The memtable check is also skipped for the same
reason: memtable data is either newer than the GC-eligible tombstone, or was flushed
into the repairing/repaired set before gc_before advanced.
Safety restriction — materialized views:
The optimization IS applied to materialized view tables. Two possible paths could inject
D_view into the MV's unrepaired set after MV repair: view hints and staging via the
view-update-generator. Both are safe:
(1) View hints: flush_hints() creates a sync point covering BOTH _hints_manager (base
mutations) AND _hints_for_views_manager (view mutations). It waits until ALL pending view
hints — including D_view entries queued in _hints_for_views_manager while the target MV
replica was down — have been replayed to the target node before take_storage_snapshot() is
called. D_view therefore lands in the MV's repairing sstable and is promoted to repaired.
When a repaired compaction then checks for shadows it finds D_view in the repaired set,
keeping T_mv non-purgeable.
(2) View-update-generator staging path: Base table repair can write a missing D_base to a
replica via a staging sstable. The view-update-generator processes the staging sstable
ASYNCHRONOUSLY: it may fire arbitrarily later, even after MV repair has committed
repair_time and T_mv has been GC'd from the repaired set. However, the staging processor
calls stream_view_replica_updates() which performs a READ-BEFORE-WRITE via
as_mutation_source_excluding_staging(): it reads the CURRENT base table state before
building the view update. If T_base was written to the base table (as it always is before
the base replica can be repaired and the MV tombstone can become GC-eligible), the
view_update_builder sees T_base as the existing partition tombstone. D_base's row marker
(ts_d < ts_t) is expired by T_base, so the view update is a no-op: D_view is never
dispatched to the MV replica. No resurrection can occur regardless of how long staging is
delayed.
A potential sub-edge-case is T_base being purged BEFORE staging fires (leaving D_base as
the sole survivor, so stream_view_replica_updates would dispatch D_view). This is blocked
by an additional invariant: for tablet-based tables, the repair writer stamps repaired_at
on staging sstables (repair_writer_impl::create_writer sets mark_as_repaired = true and
perform_component_rewrite writes repaired_at = sstables_repaired_at + 1 on every staging
sstable). After base repair commits sstables_repaired_at to Raft, the staging sstable
satisfies is_repaired(sstables_repaired_at, staging_sst) and therefore appears in
make_repaired_sstable_set(). Any subsequent base repair that advances sstables_repaired_at
further still includes the staging sstable (its repaired_at ≤ new sstables_repaired_at).
D_base in the staging sstable thus shadows T_base in every repaired compaction's shadow
check, keeping T_base non-purgeable as long as D_base remains in staging.
A base table hint also cannot bypass this. A base hint is replayed as a base mutation. The
resulting view update is generated synchronously on the base replica and sent to the MV
replica via _hints_for_views_manager (path 1 above), not via staging.
USING TIMESTAMP with timestamps predating (gc_before + propagation_delay) is explicitly
UB and excluded from the safety argument.
For tombstone_gc modes other than repair (timeout, immediate, disabled) the invariant
does not hold for base tables either, so the full storage-group set is returned.
Implementation:
- Add compaction_group::is_repaired_view(v): pointer comparison against _repaired_view.
- Add compaction_group::make_repaired_sstable_set(): iterates _main_sstables and inserts
only sstables classified as repaired (repair::is_repaired(sstables_repaired_at, sst)).
- Add storage_group::make_repaired_sstable_set(): collects repaired sstables across all
compaction groups in the storage group.
- Add table::make_repaired_sstable_set_for_tombstone_gc(): collects repaired sstables from
all compaction groups across all storage groups (needed for multi-tablet tables).
- Add compaction_group_view::skip_memtable_for_tombstone_gc(): returns true iff the
repaired-only optimization is active; used by get_max_purgeable_timestamp() in
compaction.cc to bypass the memtable shadow check.
- is_tombstone_gc_repaired_only() private helper gates both methods: requires
is_repaired_view(this) && tombstone_gc_mode == repair. No is_view() exclusion.
- Add error injection "view_update_generator_pause_before_processing" in
process_staging_sstables() to support testing the staging-delay scenario.
- New test test_tombstone_gc_mv_optimization_safe_via_hints: stops servers[2], writes
D_base + T_base (view hints queued for servers[2]'s MV replica), restarts, runs MV
tablet repair (flush_hints delivers D_view + T_mv before snapshot), triggers repaired
compaction, and asserts the MV row is NOT visible — T_mv preserved because D_view
landed in the repaired set via the hints-before-snapshot path.
- New test test_tombstone_gc_mv_safe_staging_processor_delay: runs base repair before
writing T_base so D_base is staged on servers[0] via row-sync; blocks the
view-update-generator with an error injection; writes T_base + T_mv; runs MV repair
(fast path, T_mv GC-eligible); triggers repaired compaction (T_mv purged — no D_view
in repaired set); asserts no resurrection; releases injection; waits for staging to
complete; asserts no resurrection after a second flush+compaction. Demonstrates that
the read-before-write in stream_view_replica_updates() makes the optimization safe even
when staging fires after T_mv has been GC'd.
The expected gain is reduced bloom filter and memtable key-lookup I/O during repaired
compactions: the unrepaired set is typically the largest (it holds all recent writes),
yet for tombstone_gc=repair it never influences GC decisions.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The test was using max_size_mb = 8*1024 (8 GB) with 100 iterations,
causing it to create up to 260 files of 32 MB each per iteration via
fallocate. On a loaded CI machine this totals hundreds of GB of file
operations, easily exceeding the 15-minute test timeout (SCYLLADB-1496).
The test only needs enough files to verify that delete_segments keeps
the disk footprint within [shard_size, shard_size + seg_size]. Reduce
max_size_mb to 128 (8 files of 32 MB per iteration) and the iteration
count to 10, which is sufficient to exercise the serialized-deletion
and recycle logic without imposing excessive I/O load.
Closesscylladb/scylladb#29510
The previous commit made prepare_indexed_local() use the pre-built
predicate vectors instead of calling extract_single_column_restrictions_for_column().
That was the last production caller.
Remove the function definition (65 lines of expression-walking visitor)
and its declaration/doc-comment from the header.
Replace the unit test (expression_extract_column_restrictions) which
directly called the removed function with synthetic column_definitions,
with per_column_restriction_routing which exercises the same routing
logic through the public analyze_statement_restrictions() API. The new
test verifies not just factor counts but the exact (column_name, oper_t)
pairs in each per-column entry, catching misrouted restrictions that a
count-only check would miss.
Expressions are a tree-like structure so a single expression is sufficient
(for complicated ones, a conjunction is used), but predicates are flat.
Prepare for conversion to predicates by storing the expressions that
will correspond to predicates, namely the boolean factors of the WHERE
clause.
For indexed queries, statement_restrictions calculates _view_schema,
which is passed via get_view_schema() to indexed_select_statement(),
which passes it right back to statement_restrictions via one of three
functions to calculate clustering ranges.
Avoid the back-and-forth and use the stored value. Using a different
value would be broken.
This change allows unifying the signatures of the four functions that
get clustering ranges.
Prevent copying/moving, that can change the address, and instead enforce
using shared_ptr. Most of the code is already using shared_ptr, so the
changes aren't very large.
To forbid non-shared_ptr construction, the constructors are annotated
with a private_tag tag class.
In preparation for refactoring statement_restrictions, add a simple
and an exhaustive regression test, encoding the index selection
algorithm into the test. We cannot change the index selection algorithm
because then mixed-node clusters will alter the sorting key mid-query
(if paging takes place).
Because the exhaustive space has such a large stack frame, and
because Address Santizer bloats the stack frame, increase it
for debug builds.
This PR exposes vnodes-to-tablets migrations through the task manager API via a virtual task. This allows users to list, query status, and wait on ongoing migrations through a standard interface, consistent with other global operations such as tablet operations and topology requests are already exposed.
The virtual task exposes all migrations that are currently in progress. Each migrating keyspace appears as a separate task, identified by a deterministic name-based (v3) UUID derived from the keyspace name. Progress is reported as the number of nodes that have switched to tablets vs. the total. The number increases on the forward path and decreases on rollback.
The task is not abortable - rolling back a migration requires a manual procedure.
The `wait` API blocks until the migration either completes (returning `done`) or is rolled back (returning `suspended`).
Example output:
```
$ scylla nodetool tasks list vnodes_to_tablets_migration
task_id type kind scope state sequence_number keyspace table entity shard start_time end_time
1747b573-6cd6-312d-abb1-9b66c1c2d81f vnodes_to_tablets_migration cluster keyspace running 0 ks 0
$ scylla nodetool tasks status 1747b573-6cd6-312d-abb1-9b66c1c2d81f
id: 1747b573-6cd6-312d-abb1-9b66c1c2d81f
type: vnodes_to_tablets_migration
kind: cluster
scope: keyspace
state: running
is_abortable: false
start_time:
end_time:
error:
parent_id: none
sequence_number: 0
shard: 0
keyspace: ks
table:
entity:
progress_units: nodes
progress_total: 3
progress_completed: 0
```
Fixes SCYLLADB-1150.
New feature, no backport needed.
Closesscylladb/scylladb#29256
* github.com:scylladb/scylladb:
test: cluster: Verify vnodes-to-tablets migration virtual task
distributed_loader: Link resharding tasks to migration virtual task
distributed_loader: Make table_populator aware of migration rollbacks
service: Add virtual task for vnodes-to-tablets migrations
storage_service: Guard migration status against uninitialized group0
compaction: Add parent_id to table_resharding_compaction_task_impl
storage_service: Add keyspace-level migration status function
storage_service: Replace migration status string with enum
utils: Add UUID::is_name_based()
The UUID class already provides `is_timestamp()` for identifying
time-based (version 1) UUIDs. Add the analogous `is_name_based()`
predicate for version 3 (name-based) UUIDs, along with a test.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Garbage collected sstables created during incremental compaction are
deleted only at the end of the compaction, which increases the memory
footprint. This is inefficient, especially considering that the related
input sstables are released regularly during compaction.
This commit implements incremental release of GC sstables after each
output sstable is sealed. Unlike regular input sstables, GC sstables
use a different exhaustion predicate: a GC sstable is only released
when its token range no longer overlaps with any remaining input
sstable. This is because GC sstables hold tombstones that may shadow
data in still-alive overlapping input sstables; releasing them
prematurely would cause data resurrection.
Fixes#5563Closesscylladb/scylladb#28984
This PR removes the power-of-two token constraint from vnodes-to-tablets migrations, allowing clusters with randomly generated tokens to migrate without manual token reassignment.
Previously, migrations required vnode tokens to be a power of two and aligned. In practice, these conditions are not met with Scylla's default random token assignment, so the constraint is a blocker for real-world use. With the introduction of arbitrary tablet boundaries in PR #28459, the tablet layer can now support arbitrary tablet boundaries. This PR builds on that capability to allow arbitrary vnode tokens during migration.
When the highest vnode token does not coincide with the end of the token ring, the vnode wraps around, but tablets do not support that. This is handled by splitting it into two tablets: one covering the tail end of the ring and one covering the beginning.
Testing has been updated accordingly: existing cluster tests now use randomly generated tokens instead of precomputed power-of-two values, and a new Boost test validates the wrap-around tablet boundary logic.
Fixes SCYLLADB-724.
New feature, no backport is needed.
Closesscylladb/scylladb#29319
* github.com:scylladb/scylladb:
test: Use arbitrary tokens in vnodes->tablets migration tests
test: boost: Add test for wrap-around vnodes
storage_service: Support vnodes->tablets migrations w/ arbitrary tokens
storage_service: Hoist migration precondition
Add a `CHILD_SHARDS` filter to `DescribeStream` command.
When used, user need to pass a parent stream shard id as
json's ShardFilter.ShardId field. DescribeStream will
then return only list of stream shards, that are direct
descendants of passed parent stream shard.
Each stream shard cover a consecutive part of token space.
A stream shard Q is considered to be a child of stream shard W,
when at least one token belongs to token spaces from both streams.
The filtering algorithm itself is somewhat complicated - more details
in comments in streams.cc.
CHILD_SHARDS is a Amazon's functionality and is required by KCL.
Add unit tests.
Fixes: #25160Closesscylladb/scylladb#28189
Extend system_info_encryption to encrypt system.raft SSTables.
system.raft contains the Raft log, which may hold sensitive user data
(e.g. batched mutations), so it warrants the same treatment as
system.batchlog and system.paxos.
During upgrade, existing unencrypted system.raft SSTables remain
readable. Existing data is rewritten encrypted via compaction, or
immediately via nodetool upgradesstables -a.
Update the operator-facing system_info_encryption description to
mention system.raft and add a focused test that verifies the schema
extension is present on system.raft.
Fixes: CUSTOMER-268
Backport: 2026.1 - closes an encryption-at-rest coverage gap: system.raft may persist sensitive user-originated data unencrypted; backport to the current LTS.
Closesscylladb/scylladb#29242
`system.large_partitions`, `system.large_rows`, and `system.large_cells` store records keyed by SSTable name. When SSTables are migrated between shards or nodes (resharding, streaming, decommission), the records are lost because the destination never writes entries for the migrated SSTables.
This patch series moves the source of truth for large data records into the SSTable's scylla metadata component (new `LargeDataRecords` tag 13) and reimplements the three `system.large_*` tables as virtual tables that query live SSTables on demand. A cluster feature flag (`LARGE_DATA_VIRTUAL_TABLES`) gates the transition for safe rolling upgrades.
When the cluster feature is enabled, each node drops the old system large_* tables and starts serving the corresponding tables using virtual tables that represent the large data records now stored on the sstables.
Note that the virtual tables will be empty after upgrade until the sstables that contained large data are rewritten, therefore it is recommended to run upgrade sstables compaction or major compaction to repopulate the sstables scylla-metadata with large data records.
1. **keys: move key_to_str() to keys/keys.hh** — make the helper reusable across large_data_handler, virtual tables, and scylla-sstable
2. **sstables: add LargeDataRecords metadata type (tag 13)** — new struct with binary-serialized key fields, scylla-sstable JSON support, format documentation
3. **large_data_handler: rename partition_above_threshold to above_threshold_result** — generalize the struct for reuse
4. **large_data_handler: return above_threshold_result from maybe_record_large_cells** — separate booleans for cell size vs collection elements thresholds
5. **sstables: populate LargeDataRecords from writer** — bounded min-heaps (one per large_data_type), configurable top-N via `compaction_large_data_records_per_sstable`
6. **test: add LargeDataRecords round-trip unit tests** — verify write/read, top-N bounding, below-threshold behavior
7. **db: call initialize_virtual_tables from shard 0 only** — preparatory refactoring to enable cross-shard coordination
8. **db: implement large_data virtual tables with feature flag gating** — three virtual table classes, feature flag activation, legacy SSTable fallback, dual-threshold dedup, cross-shard collection
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1276
* Although this fixes a bug where large data entries are effectively lost when sstables are renamed or migrated, the changes are intrusive and do not warrant a backport
Closesscylladb/scylladb#29257
* github.com:scylladb/scylladb:
db: implement large_data virtual tables with feature flag gating
db: call initialize_virtual_tables from shard 0 only
test: add LargeDataRecords round-trip unit tests
sstables: populate LargeDataRecords from writer
large_data_handler: return above_threshold_result from maybe_record_large_cells
large_data_handler: rename partition_above_threshold to above_threshold_result
sstables: add LargeDataRecords metadata type (tag 13)
sstables: add fmt::formatter for large_data_type
keys: move key_to_str() to keys/keys.hh
Add a Boost test to verify that `prepare_for_tablets_migration()`
produces the correct tablet boundaries when a wrap-around vnode exists.
Tablets cannot wrap around the token ring as vnodes do; the last token
of the last tablet must always be MAX_TOKEN. When the last vnode token
does not coincide with MAX_TOKEN, the wrap-around vnode must be split
into two tablets.
The test is parameterized over both cases: unaligned (split expected)
and aligned (no split expected).
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The commitlog replayer groups segments by shard using a
std::unordered_multimap, then iterates per-shard segments via
equal_range(). However, equal_range() does not guarantee iteration
order for elements with the same key, so segments could be replayed
out of order within a shard.
Correct segment ordering is required for:
- Fragmented entry reconstruction, which accumulates fragments across
segments and depends on ascending order for efficient processing.
- Commitlog-based storage used by the strongly consistent tables
feature, which relies on replayed raft items being stored in order.
Fix by changing the data structure from
std::unordered_multimap<unsigned, commitlog::descriptor>
to
std::unordered_map<unsigned, utils::chunked_vector<commitlog::descriptor>>
Since the descriptors are inserted from a std::set ordered by ID, the
vector preserves insertion (and thus ID) order. The per-shard iteration
now simply iterates the vector, guaranteeing correct replay order.
Fixes: SCYLLADB-1411
Backport: It looks like this issue doesn't cause any trouble, and is required only by the strong consistent tables, so no backporting required.
Closesscylladb/scylladb#29372
* github.com:scylladb/scylladb:
commitlog: add test to verify segment replay order
commitlog: fix replay order by using ordered map per shard
Replace the physical system.large_partitions, system.large_rows, and
system.large_cells CQL tables with virtual tables that read from
LargeDataRecords stored in SSTable scylla metadata (tag 13).
The transition is gated by a new LARGE_DATA_VIRTUAL_TABLES cluster
feature flag:
- Before the feature is enabled: the old physical tables remain in
all_tables(), CQL writes are active, no virtual tables are registered.
This ensures safe rollback during rolling upgrades.
- After the feature is enabled: old physical tables are dropped from
disk via legacy_drop_table_on_all_shards(), virtual tables are
registered on all shards, and CQL writes are skipped via
skip_cql_writes() in cql_table_large_data_handler.
Key implementation details:
- Three virtual table classes (large_partitions_virtual_table,
large_rows_virtual_table, large_cells_virtual_table) extend
streaming_virtual_table with cross-shard record collection.
- generate_legacy_id() gains a version parameter; virtual tables
use version 1 to get different UUIDs than the old physical tables.
- compaction_time is derived from SSTable generation UUID at display
time via UUID_gen::unix_timestamp().
- Legacy SSTables without LargeDataRecords emit synthetic summary
rows based on above_threshold > 0 in LargeDataStats.
- The activation logic uses two paths: when the feature is already
enabled (test env, restart), it runs as a coroutine; when not yet
enabled, it registers a when_enabled callback that runs inside
seastar::async from feature_service::enable().
- sstable_3_x_test updated to use a simplified large_data_test_handler
and validate LargeDataRecords in SSTable metadata directly.
Add three new test cases to sstable_3_x_test.cc that verify the
LargeDataRecords metadata written by the SSTable writer can be read
back after open_data():
- test_large_data_records_round_trip: verifies partition_size, row_size,
and cell_size records are written with correct field semantics when
thresholds are exceeded
- test_large_data_records_top_n_bounded: verifies the bounded min-heap
keeps only the top-N largest entries per type
- test_large_data_records_none_when_below_threshold: verifies no records
are written when data is below all thresholds
Also wire large_data_records_per_sstable from db_config into the test
env's sstables_manager::config so that config changes propagate through
the updateable_value chain to configure_writer().
There are several reasons we want to do that.
One is that it will give us more flexibility in distributing the
load. We can subdivide tablets at any token, and achieve more
evenly-sized tablets. In particular, we can isolate large partitions
into separate tablets.
We can also split and merge incrementally individual tablets.
Currently, we do it for the whole table or nothing, which makes
splits and merges take longer and cause wide swings of the count.
This is not implemented in this PR yet, we still split/merge the whole table.
Another reason is vnode to tablets migration. We now could construct a
tablet map which matches exactly the vnode boundaries, so migration
can happen transparently from CQL-coordinator point of view.
Tablet count is still a power-of-two by default for newly created tables.
It may be different if tablet map is created by non-standard means,
or if per-table tablet option "pow2_count" is set to "false".
build/release/scylla perf-tablets:
Memory footprint for 131k tablets increased from 56 MiB to 58.1 MiB (+3.5%)
Before:
```
Generating tablet metadata
Total tablet count: 131072
Size of tablet_metadata in memory: 57456 KiB
Copied in 0.014346 [ms]
Cleared in 0.002698 [ms]
Saved in 1234.685303 [ms]
Read in 445.577881 [ms]
Read mutations in 299.596313 [ms] 128 mutations
Read required hosts in 247.482742 [ms]
Size of canonical mutations: 33.945053 [MiB]
Disk space used by system.tablets: 1.456761 [MiB]
Tablet metadata reload:
full 407.69ms
partial 2.65ms
```
After:
```
Generating tablet metadata
Total tablet count: 131072
Size of tablet_metadata in memory: 59504 KiB
Copied in 0.032475 [ms]
Cleared in 0.002965 [ms]
Saved in 1093.877441 [ms]
Read in 387.027100 [ms]
Read mutations in 255.752121 [ms] 128 mutations
Read required hosts in 211.202805 [ms]
Size of canonical mutations: 33.954453 [MiB]
Disk space used by system.tablets: 1.450162 [MiB]
Tablet metadata reload:
full 354.50ms
partial 2.19ms
```
Closesscylladb/scylladb#28459
* github.com:scylladb/scylladb:
test: boost: tablets: Add test for merge with arbitrary tablet count
tablets, database: Advertise 'arbitrary' layout in snapshot manifest
tablets: Introduce pow2_count per-table tablet option
tablets: Prepare for non-power-of-two tablet count
tablets: Implement merged tablet_map constructor on top of for_each_sibling_tablets()
tablets: Prepare resize_decision to hold data in decisions
tablets: table: Make storage_group handle arbitrary merge boundaries
tablets: Make stats update post-merge work with arbitrary merge boundaries
locator: tablets: Support arbitrary tablet boundaries
locator: tablets: Introduce tablet_map::get_split_token()
dht: Introduce get_uniform_tokens()
Currently, the manifest advertises "powof2", which is wrong for
arbitrary count and boundaries.
Introduce a new kind of layout called "arbitrary", and produce it if
the tablet map doesn't conform to "powof2" layout.
We should also produce tablet boundaries in this case, but that's
worked on in a different PR: https://github.com/scylladb/scylladb/pull/28525
This is a step towards more flexibility in managing tablets. A
prerequisite before we can split individual tablets, isolating hot
partitions, and evening-out tablet sizes by shifting boundaries.
After this patch, the system can handle tables with arbitrary tablet
count. Tablet allocator is still rounding up desired tablet count to
the nearest power of two when allocating tablets for a new table, so
unless the tablet map is allocated in some other way, the counts will
be still a power of two.
We plan to utilize arbitrary count when migrating from vnodes to
tablets, by creating a tablet map which matches vnode boundaries.
One of the reasons we don't give up on power-of-two by default yet is
that it creates an issue with merges. If tablet count is odd, one of
the tablets doesn't have a sibling and will not be merged. That can
obviously cause imbalance of token space and tablet sizes between
tablets. To limit the impact, this patch dynamically chooses which
tablet to isolate when initiating a merge. The largest tablet is
chosen, as that will minimize imbalance. Otherwise, if we always chose
the last tablet to isolate, its size would remain the same while other
tablets double in size with each odd-count merge, leading to
imbalance. The imbalance will still be there, but the difference in
tablet sizes is limited to 2x.
Example (3 tablets):
[0] owns 1/3 of tokens
[1] owns 1/3 of tokens
[2] owns 1/3 of tokens
After merge:
[0] owns 2/3 of tokens
[1] owns 1/3 of tokens
What we would like instead:
Step 1 (split [1]):
[0] owns 1/3 of tokens
[1] old 1.left, owns 1/6 of tokens
[2] old 1.right, owns 1/6 of tokens
[3] owns 1/3 of tokens
Step 2 (merge):
[0] owns 1/2 of tokens
[1] owns 1/2 of tokens
To do that, we need to be able to split individual tablets, but we're
not there yet.
There are several reasons we want to do that.
One is that it will give us more flexibility in distributing the
load. We can subdivide tablets at any points, and achieve more
evenly-sized tablets. In particular, we can isolate large partitions
into separate tablets.
Another reason is vnode-to-tablet migration. We could construct a
tablet map which matches exactly the vnode boundaries, so migration
can happen transparently from the CQL-coordinator's point of view.
Implementation details:
We store a vector of tokens which represent tablet boundaries in the
tablet_id_map. tablet_id keeps its meaning, it's an index into vector
of tablets. To avoid logarithmic lookup of tablet_id from the token,
we introduce a lookup structure with power-of-two aligned buckets, and
store the tablet_id of the tablet which owns the first token in the
bucket. This way, lookup needs to consider tablet id range which
overlaps with one bucket. If boundaries are more or less aligned,
there are around 1-2 tablets overlapping with a bucket, and the lookup
is still O(1).
Amount of memory used increased, but not significantly relative to old
size (because tablet_info is currently fat):
For 131'072 tablets:
Before:
Size of tablet_metadata in memory: 57456 KiB
After:
Size of tablet_metadata in memory: 59504 KiB
And reimplement existing split-related methods around it.
This way we avoid calling dht::compaction_group_of(), and
assuming anything about tablet boundaries or tablet count
being a power of two.
This will make later refactoring easier.
In partition_snapshot_row_cursor::maybe_refresh(), the !is_in_latest_version()
path calls lower_bound(_position) on the latest version's rows to find the
cursor's position in that version. When lower_bound returns null (the cursor
is positioned above all entries in the latest version in table order), the code
unconditionally sets _background_continuity = true and allows the subsequent
if(!it) block to erase the latest version's entry from the heap.
This is correct for forward traversal: null means there are no more entries
ahead, so removing the version from the heap is safe.
However, in reversed mode, null from lower_bound means the cursor is above
all entries in table order -- those entries are BELOW the cursor in query
order and will be visited LATER during reversed traversal. Erasing the heap
entry permanently loses them, causing live rows to be skipped.
The fix mirrors what prepare_heap() already does correctly: when lower_bound
returns null in reversed mode, use std::prev(rows.end()) to keep the last
entry in the heap instead of erasing it.
Add test_reversed_maybe_refresh_keeps_latest_version_entry to mvcc_test,
alongside the existing reversed cursor tests. The test creates a two-version
partition snapshot (v0 with range tombstones, v1 with a live row positioned
below all v0 entries in table order), and
traverses in reverse calling maybe_refresh() at each step -- directly
exercising the buggy code path. The test fails without the fix.
The bug was introduced by 6b7473be53 ("Handle non-evictable snapshots",
2022-11-21), which added null-iterator handling for non-evictable snapshots
(memtable snapshots lack the trailing dummy entry that evictable snapshots
have). prepare_heap() got correct reversed-mode handling at that time, but
maybe_refresh() received only forward-mode logic.
The bug is intermittent because multiple mechanisms cause iterators_valid()
to return false, forcing maybe_refresh() to take the full rebuild path via
prepare_heap() (which handles reversed mode correctly):
- Mutation cleaner merging versions in the background (changes change_mark)
- LSA segment compaction during reserve() (invalidates references)
- B-tree rebalancing on partition insertion (invalidates references)
- Debug mode's always-true need_preempt() creating many multi-version
partitions via preempted apply_monotonically()
A dtest reproducer confirmed the same root cause: with 100K overlapping range
tombstones creating a massively multi-version memtable partition (287K preemption
events), the reversed scan's latest_iterator was observed jumping discontinuously
during a version transition -- the latest version's heap entry was erased --
causing the query to walk the entire partition without finding the live row.
Fixes: SCYLLADB-1253
Closesscylladb/scylladb#29368