mirror of
https://github.com/scylladb/scylladb.git
synced 2026-05-02 06:05:53 +00:00
c25f3eced8d89d3e49fde05eef37fa8b1700573a
53536 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
c25f3eced8 |
gms/gossiper: fix use-after-move in do_send_ack2_msg
The second logger.debug() call accesses ack2_msg after it was moved
via std::move() in the co_await send_gossip_digest_ack2 call.
This is undefined behavior.
Fix by formatting ack2_msg to a string before the move, then using
that cached string in both debug log calls.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1778
Closes scylladb/scylladb#29227
(cherry picked from commit
|
||
|
|
d264fea176 |
replica/database: fix cross-shard deadlock in lock_tables_metadata()
lock_tables_metadata() acquires a write lock on tables_metadata._cf_lock on every shard. It used invoke_on_all(), which dispatches lock acquisitions to all shards in parallel via parallel_for_each + smp::submit_to. When two fibers call lock_tables_metadata() concurrently, this can deadlock. parallel_for_each starts all iterations unconditionally: even when the local shard's lock attempt blocks (because the other fiber already holds it), SMP messages are still sent to remote shards. Both fibers' lock-acquisition messages land in the per-shard SMP queues. The SMP queue itself is FIFO, but process_incoming() drains it and schedules each item as a reactor task via add_task(), which, in debug and sanitize builds with SEASTAR_SHUFFLE_TASK_QUEUE, shuffles each newly added task against all pending tasks in the same scheduling group's reactor task queue. This means fiber A's lock acquisition can be reordered past fiber B's (and past unrelated tasks) on a given shard. If fiber A wins the lock on shard X while fiber B wins on shard Y, this creates a classic cross-shard lock-ordering deadlock (circular wait). In production builds without SEASTAR_SHUFFLE_TASK_QUEUE, the reactor task queue is FIFO. Even in release builds, however, the SMP queues can reorder messages, so the deadlock remains possible, though much less likely. In debug and sanitize builds, the task-queue shuffle makes the deadlock very likely whenever both fibers' lock-acquisition tasks are pending simultaneously in the reactor task queue on any shard. This deadlock was exposed by |
||
|
|
9d942a5408 |
build: point seastar submodule at scylla-seastar.git
This allows us to backport seastar commits as the need arises. |
||
|
|
9622291e07 |
Merge 'test/cluster/test_incremental_repair: fix flaky coordinator-change scenario' from Avi Kivity
- Ensure servers[1] is not the topology coordinator before restarting it, preventing the leader death + re-election + re-repair sequence that masked the compaction-merge bug
- Add a retry loop that detects post-restart leadership transfer to servers[1] via direct coordinator query, retrying up to 5 times
Fixes: SCYLLADB-1743
Backporting to 2026.2, which sees the failure regularly.
Closes scylladb/scylladb#29671
* github.com:scylladb/scylladb:
test/cluster/test_incremental_repair: add retry for residual leadership race
test/cluster/test_incremental_repair: fix flaky coordinator-change scenario
(cherry picked from commit
|
||
|
|
b98470a860 | Update ScyllaDB version to: 2026.2.0-rc1 | ||
|
|
5231c77e8e | Update ScyllaDB version to: 2026.2.0-rc0 scylla-2026.2.0-rc0-candidate-20260426095453 scylla-2026.2.0-rc0 | ||
|
|
d5efd1f676 |
test/cluster: wait for Alternator readiness in server startup
server_add() only waits for CQL readiness before returning. The Alternator HTTP port may not be listening yet, causing ConnectionRefused in Alternator tests. Extend the ServerUpState enum and the startup loop to also check Alternator port readiness when configured. Whenever one or more Alternator ports are configured, each is checked for being connectable and queryable, similar to how CQL ports are probed. Fixes SCYLLADB-1701 Closes scylladb/scylladb#29625 |
||
|
|
d14d07a079 |
test: fix flaky test_sstable_write_large_{row,cell} by using a fixed partition key
Commit |
||
|
|
70261dc674 |
Merge 'test/cluster: scale failure_detector_timeout_in_ms by build mode' from Marcin Maliszkiewicz
The failure_detector_timeout_in_ms override of 2000ms in 6 cluster test files is too aggressive for debug/sanitize builds. During node joins, the coordinator's failure detector times out on RPC pings to the joining node while it is still applying schema snapshots, marks it DOWN, and bans it — causing flaky test failures. Scale the timeout by MODES_TIMEOUT_FACTOR (3x for debug/sanitize, 2x for dev, 1x for release) via a shared failure_detector_timeout fixture in conftest.py. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1587 Backport: no, elasticsearch analyser shows only a single failure Closes scylladb/scylladb#29522 * github.com:scylladb/scylladb: test/cluster: scale failure_detector_timeout_in_ms by build mode test/cluster: add failure_detector_timeout fixture |
||
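The scaling rule from the commit above can be sketched as follows. This is a minimal illustration, not the fixture from conftest.py: the factors and the 2000ms base are the ones quoted in the commit message, while the dictionary layout and the helper name `scaled_failure_detector_timeout` are assumptions made for the example.

```python
# Factors quoted in the commit message: 3x for debug/sanitize, 2x for dev, 1x for release.
# The dict layout is an assumption for this sketch.
MODES_TIMEOUT_FACTOR = {"debug": 3, "sanitize": 3, "dev": 2, "release": 1}

# The override the tests used before scaling was introduced.
BASE_FAILURE_DETECTOR_TIMEOUT_MS = 2000


def scaled_failure_detector_timeout(mode: str) -> int:
    """Return failure_detector_timeout_in_ms for a build mode (hypothetical helper)."""
    return BASE_FAILURE_DETECTOR_TIMEOUT_MS * MODES_TIMEOUT_FACTOR[mode]
```

Under these assumptions a debug or sanitize build gets 6000ms, a dev build 4000ms, and a release build keeps the original 2000ms.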
|
|
d280517e27 |
test/cluster/test_incremental_repair: fix flaky do_tablet_incremental_repair_and_ops
The log grep in get_sst_status searched from the beginning of the log (no from_mark), so the second-repair assertions were checking cumulative counts across both repairs rather than counts for the second repair alone. The expected values (sst_add==2, sst_mark==2) relied on this cumulative behaviour: 1 from the first repair + 1 from the second = 2. This works when the second repair encounters exactly one unrepaired sstable, but fails whenever the second repair sees two. The second repair can see two unrepaired sstables when the 100 keys inserted before it (via asyncio.gather) trigger a background auto-flush before take_storage_snapshot runs. take_storage_snapshot always flushes the memtable itself, so if an auto-flush already split the batch into two sstables on disk, the second repair's snapshot contains both and logs "Added sst" twice, making the cumulative count 3 instead of 2. Fix: take a log mark per-server before each repair call and pass it to get_sst_status so each check counts only the entries produced by that repair. The expected values become 1/0/1 and 1/1/1 respectively, independent of how many sstables happened to exist beforehand. get_sst_status gains an optional from_mark parameter (default None) which preserves existing call sites that intentionally grep from the start of the log. Fixes: SCYLLADB-1086 Closes scylladb/scylladb#29484 |
||
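The difference between grepping from the start of the log and grepping from a mark can be sketched as below. This is a simplified stand-in that works on an in-memory string; `count_from_mark` is a hypothetical helper, not the real get_sst_status, and the mark is modelled as a plain character offset.

```python
import re
from typing import Optional


def count_from_mark(log_text: str, pattern: str, from_mark: Optional[int] = None) -> int:
    """Count regex matches in log_text, starting at offset from_mark (whole log if None)."""
    start = 0 if from_mark is None else from_mark
    return len(re.findall(pattern, log_text[start:]))


log = "Added sst a\n"      # written during the first repair
mark = len(log)            # mark taken just before the second repair
log += "Added sst b\n"     # written during the second repair

cumulative = count_from_mark(log, r"Added sst")         # counts both repairs
second_only = count_from_mark(log, r"Added sst", mark)  # counts the second repair only
```

With no mark the count mixes in whatever earlier repairs logged; with the mark it reflects only the entries produced after the mark was taken.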
|
|
7634d3f7d4 |
test/cluster: fix flaky test_hints_consistency_during_replace
The test creates a sync point immediately after writing 100 rows with CL=ANY, without waiting for pending hint writes to complete. store_hint() is fire-and-forget: it submits do_store_hint() to a gate and returns immediately. do_store_hint() updates _last_written_rp only after writing to the commitlog. If create_sync_point() is called before all do_store_hint() coroutines complete, the captured replay position is stale, and await_sync_point() returns DONE before all hints are replayed, leaving some rows missing. Fix by waiting for the size_of_hints_in_progress metric to reach zero before creating the sync point, ensuring all in-flight hint writes have completed and _last_written_rp is up to date. This follows the same pattern already used in test_sync_point. Fixes: SCYLLADB-1560 Closes scylladb/scylladb#29623 |
||
|
|
b49cf6247f |
test: fix flaky test_read_repair_with_trace_logging by reading tracing with CL=ALL
Tracing events are written to system_traces.events with CL=ANY, so they are only guaranteed to be present on the local node of the query coordinator. Reading them back with the driver default (CL=LOCAL_ONE) may route the query to a replica that has not yet received all events, causing the assertion on 'digest mismatch, starting read repair' to fail intermittently. Fix execute_with_tracing() to read tracing via the ResponseFuture API with query_cl=ConsistencyLevel.ALL, so events from all replicas are merged before the caller inspects them. Fixes: SCYLLADB-1633 Closes scylladb/scylladb#29566 |
||
|
|
878f341338 |
test/cluster/test_view_building_coordinator: fix view_updates_drained predicate
The previous fix for the flakiness in test_file_streaming waited for the scylla_database_view_update_backlog metric to drop to 0 via wait_for(view_updates_drained, ...). However, the predicate returned True/False, while wait_for treats any non-None result as 'done' and keeps retrying only on None. So when the backlog was non-zero the predicate returned False, which wait_for interpreted as success and returned immediately - the test could then stop servers[0]/servers[1] before the view updates generated by new_server from the migrated staging sstable were actually delivered, leading to a partially populated MV (e.g. 431/1000 rows) and a failing assertion. Fix the predicate to return None instead of False when the backlog is not yet drained, so wait_for will actually retry until the metric reaches 0 (or the deadline is hit). Fixes SCYLLADB-1182 Closes scylladb/scylladb#29587 |
||
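The predicate pitfall described in the commit above can be reproduced with a small sketch. The `wait_for` below is a simplified stand-in written for this example, assuming only the behaviour stated in the commit message (retry on None, treat anything else as done); the backlog counter and both predicates are hypothetical.

```python
import asyncio
import time


async def wait_for(pred, deadline: float, period: float = 0.01):
    """Retry pred until it returns something other than None or the deadline passes."""
    while True:
        result = await pred()
        if result is not None:  # note: False also counts as "done"
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("predicate still returned None at the deadline")
        await asyncio.sleep(period)


async def demo():
    state = {"backlog": 3}  # hypothetical view update backlog

    async def buggy_predicate():
        # Returns False while the backlog is non-zero, which wait_for accepts as a result.
        return state["backlog"] == 0

    async def fixed_predicate():
        if state["backlog"] == 0:
            return True
        state["backlog"] -= 1  # pretend some view updates drain between polls
        return None            # None asks wait_for to retry

    buggy_result = await wait_for(buggy_predicate, time.monotonic() + 1)
    fixed_result = await wait_for(fixed_predicate, time.monotonic() + 1)
    return buggy_result, fixed_result, state["backlog"]
```

Running `asyncio.run(demo())` shows the buggy predicate returning immediately with False while the backlog is still 3, whereas the fixed predicate keeps polling until the backlog reaches 0.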
|
|
67b3ad94a0 |
test.py: enhance error output in case no tests were executed
By default, pytest reports an error if the provided file does not exist. When combined with xdist, however, it reports no error, because of how pytest works with xdist. test.py always passes the -n parameter, so if something goes wrong no error is produced; pytest only exits with code 5. This PR prints a warning when pytest's exit code is 5. Closes scylladb/scylladb#29584 |
||
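A check of this kind might look like the sketch below. It is not the code added to test.py: the function name and the warning text are assumptions, and only the meaning of exit code 5 (pytest's "no tests collected") is taken as given.

```python
import sys

NO_TESTS_COLLECTED = 5  # pytest's exit code when no tests were collected


def warn_if_nothing_ran(exit_code: int, stream=sys.stderr) -> bool:
    """Print a warning and return True when pytest collected no tests (hypothetical helper)."""
    if exit_code == NO_TESTS_COLLECTED:
        print("warning: pytest exited with code 5: no tests were collected; "
              "check that the given test paths exist", file=stream)
        return True
    return False
```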
|
|
c97ce32f47 |
Update position in dma_read(iovec) in create_file_for_seekable_source
Fixes: SCYLLADB-1523 The returned file object does not advance the file position as is. One-line fix. Added a test to make sure this read path works as expected. Closes scylladb/scylladb#29456 |
||
|
|
3468e8de8b |
test/mv/test_mv_staging: wait for cql after restart
Wait for cql on all hosts after restarting a server in the test. The problem that was observed is that the test restarts servers[1] and doesn't wait for the cql to be ready on it. On test teardown it drops the keyspace, trying to execute it on the host that is not ready, and fails. Fixes SCYLLADB-1632 Closes scylladb/scylladb#29562 |
||
|
|
3df951bc9c |
Merge 'audit: set audit_info for native-protocol BATCH messages' from Andrzej Jackowski
Commit
|
||
|
|
eb3326b417 |
Merge 'test.py: migrate all bare skips to typed skip markers' from Artsiom Mishuta
should be merged after #29235 Complete the typed skip markers migration started in the plugin PR. Every bare `@pytest.mark.skip` decorator and `pytest.skip()` runtime call across the test suite is replaced with a typed equivalent, making skip reasons machine-readable in JUnit XML and Allure reports. **62 files changed** across 8 commits, covering ~127 skip sites in total. Bare `pytest.skip` provides only a free-text reason string. CI dashboards (JUnit, Allure) cannot distinguish between a test skipped due to a known bug, a missing feature, a slow test, or an environment limitation. This makes it hard to track skip debt, prioritize fixes, or filter dashboards by skip category. The typed markers (`skip_bug`, `skip_not_implemented`, `skip_slow`, `skip_env`) introduced by the `skip_reason_plugin` solve this by embedding a `skip_type` field into every skip report entry. | Type | Count | Files | Description | |------|-------|-------|-------------| | `skip_bug` | 24 | 16 | Skip reason references a known bug/issue | | `skip_not_implemented` | 10 | 5 | Feature not yet implemented in Scylla | | `skip_slow` | 4 | 3 | Test too slow for regular CI runs | | `skip_not_implemented` (bare) | 2 | 1 | Bare `@pytest.mark.skip` with no reason (COMPACT STORAGE, #3882) | | Type | Count | Files | Description | |------|-------|-------|-------------| | `skip_env` | ~85 | 34 | Feature/config/topology not available at runtime | | `skip_bug` | 2 | 2 | Known bugs: Streams on tablets (#23838), coroutine task not found (#22501) | - **Comments**: 7 comments/docstrings across 5 files updated from `pytest.skip()` to `skip()` - **Plugin hardened**: `warnings.warn()` → `pytest.UsageError` for bare `@pytest.mark.skip` at collection time — bare skips are now a hard error, not a warning - **Guard tests**: New `test/pylib_test/test_no_bare_skips.py` with 3 tests that prevent regression: - AST scan for bare `@pytest.mark.skip` decorators - AST scan for bare `pytest.skip()` runtime calls - Real `pytest 
--collect-only` against all Python test directories Runtime skip sites use the convenience wrappers from `test.pylib.skip_types`: ```python from test.pylib.skip_types import skip_env ``` Usage: ```python skip_env("Tablets not enabled") ``` 1. **test: migrate @pytest.mark.skip to @pytest.mark.skip_bug for known bugs** — 24 decorator sites, 16 files 2. **test: migrate @pytest.mark.skip to @pytest.mark.skip_not_implemented** — 10 decorator sites, 5 files 3. **test: migrate @pytest.mark.skip to @pytest.mark.skip_slow** — 4 decorator sites, 3 files 4. **test: migrate bare @pytest.mark.skip to skip_not_implemented** — 2 bare decorators, 1 file 5. **test: migrate runtime pytest.skip() to typed skip_env()** — ~85 sites, 34 files 6. **test: migrate runtime pytest.skip() to typed skip_bug()** — 2 sites, 2 files 7. **test: update comments referencing pytest.skip() to skip()** — 7 comments, 5 files 8. **test/pylib: reject bare pytest.mark.skip and add codebase guards** — plugin hardening + 3 guard tests - All 60 plugin + guard tests pass (`test/pylib_test/`) - No bare `@pytest.mark.skip` or `pytest.skip()` calls remain in the codebase - `pytest --collect-only` succeeds across all test directories with the hardened plugin SCYLLADB-1349 Closes scylladb/scylladb#29305 * github.com:scylladb/scylladb: test/alternator: replace bare pytest.skip() with typed skip helpers test: migrate new bare skips introduced by upstream after rebase test/pylib: reject bare pytest.mark.skip and add codebase guards test: update comments referencing pytest.skip() to skip_env() test: migrate runtime pytest.skip() to typed skip_bug() test: migrate runtime pytest.skip() to typed skip_env() test: migrate bare @pytest.mark.skip to skip_not_implemented test: migrate @pytest.mark.skip to @pytest.mark.skip_slow test: migrate @pytest.mark.skip to @pytest.mark.skip_not_implemented test: migrate @pytest.mark.skip to @pytest.mark.skip_bug for known bugs |
||
|
|
e84e7dfb7a |
build: drop utils/rolling_max_tracker.hh from precompiled header
Added by mistake. Precompiled headers should only include library headers that rarely change, since any dependency change causes a full rebuild. Closes scylladb/scylladb#29560 |
||
|
|
3aced88586 |
Merge 'audit: decrease allocations / instructions on will_log() fast path' from Marcin Maliszkiewicz
Audit::will_log() runs on every CQL/Alternator request. Since |
||
|
|
4043d95810 |
Merge 'storage_service: fix REST API races during shutdown and cross-shard forwarding' from Piotr Smaron
REST route removal unregisters handlers but does not wait for requests that already entered storage_service. A request can therefore suspend inside an async operation, restart proceeds to tear the service down, and the coroutine later resumes against destroyed members such as _topology_state_machine, _group0, or _sys_ks — a use-after-destruction bug that surfaces as UBSAN dynamic-type failures (e.g. the crash seen from topology_state_load()). Fix this by holding storage_service::_async_gate from the entry boundary of every externally-triggered async operation so that stop() drains them before teardown begins. The gate is acquired in run_with_api_lock, run_with_no_api_lock, and in individual REST handlers that bypass those wrappers (reload_raft_topology_state, mark_excluded, removenode, schema reload, topology-request waits/abort, cleanup, ring/schema queries, SSTable dictionary training/publish, and sampling). Additionally, fix get_ownership() and abort_topology_request() which forward work to shard 0 but were still referencing the caller-shard's `this` pointer instead of the destination-shard instance, causing silent cross-shard access to shard-local state. Add a cluster regression test that repeatedly exercises the multi-shard ownership REST path to cover the forwarding fix. Fixes: SCYLLADB-1415 Should be backported to all branches, the code has been introduced around 2024.1 release. Closes scylladb/scylladb#29373 * github.com:scylladb/scylladb: storage_service: fix shard-0 forwarding in REST helpers storage_service: gate REST-facing async operations during shutdown storage_service: prepare for async gate in REST handlers |
||
|
|
cc39b54173 |
alternator: use stream_arn instead of std::string in list_streams
Use a `stream_arn` object instead of a raw `std::string` to store the last stream returned to the user. `stream_arn` is already used for parsing ARNs received from the user; `std::string` was used for returned ARNs only because the copy/move operations of `stream_arn` were buggy. Those have been fixed, so the usage is fixed as well. Fixes: SCYLLADB-1241 Closes scylladb/scylladb#29578 |
||
|
|
183c6d120e |
test: exclude pylib_test from default test runs
Add pylib_test to norecursedirs in pytest.ini so it is not collected during ./test.py or pytest test/ runs, but can still be run directly via 'pytest test/pylib_test'. Also fix pytest log cleanup: worker log files (pytest_gw*) were not being deleted on success because cleanup was restricted to the main process only. Now each process (main and workers) cleans up its own log file on success. Closes scylladb/scylladb#29551 |
||
|
|
dffb266b79 |
storage_service: fix shard-0 forwarding in REST helpers
get_ownership() and abort_topology_request() forward work to shard 0 via container().invoke_on(0, ...) but the lambda captured 'this' and accessed members through it instead of through the shard-0 'ss' parameter. This means the lambda used the caller-shard's instance, defeating the purpose of the forwarding. Use the 'ss' parameter consistently so the operations run against the correct shard-0 state. |
||
|
|
6a91d046f3 |
storage_service: gate REST-facing async operations during shutdown
Hold _async_gate in all REST-facing async operations so that stop() drains in-flight requests before teardown, preventing use-after-free crashes when REST calls race with shutdown. A centralized gated() wrapper in set_storage_service (api/storage_service.cc) automatically holds the gate for every REST handler registered there, so new handlers get shutdown-safety by default. run_with_api_lock_internal and run_with_no_api_lock hold _async_gate on shard 0 as well, because REST requests arriving on any shard are forwarded there for execution. Methods that previously self-forwarded to shard 0 (mark_excluded, prepare_for_tablets_migration, set_node_intended_storage_mode, get_tablets_migration_status, finalize_tablets_migration) now assert this_shard_id() == 0. Their REST handlers call them via run_with_no_api_lock, which performs the shard-0 hop and gate hold centrally. Fixes: SCYLLADB-1415 |
||
|
|
74dd33811e |
storage_service: prepare for async gate in REST handlers
Add hold_async_gate() public accessor for use by the REST registration layer in a followup commit. Convert run_with_no_api_lock to a coroutine so a followup commit can hold the async gate across the entire forwarded operation. No functional changes. |
||
|
|
18ceeaf3ef |
Merge 'Restrict tombstone GC sstable set to repaired sstables for tombstone_gc=repair mode' from Raphael Raph Carvalho
When tombstone_gc=repair, the repaired compaction view's sstable_set_for_tombstone_gc() previously returned all sstables across all three views (unrepaired, repairing, repaired). This is correct but unnecessarily expensive: the unrepaired and repairing sets are never the source of a GC-blocking shadow when tombstone_gc=repair, for base tables. The key ordering guarantee that makes this safe is: - topology_coordinator sends send_tablet_repair RPC and waits for it to complete. Inside that RPC, mark_sstable_as_repaired() runs on all replicas, moving D from repairing → repaired (repaired_at stamped on disk). - Only after the RPC returns does the coordinator commit repair_time + sstables_repaired_at to Raft. - gc_before = repair_time - propagation_delay only advances once that Raft commit applies. Therefore, when a tombstone T in the repaired set first becomes GC-eligible (its deletion_time < gc_before), any data D it shadows is already in the repaired set on every replica. This holds because: - The memtable is flushed before the repairing snapshot is taken (take_storage_snapshot calls sg->flush()), capturing all data present at repair time. - Hints and batchlog are flushed before the snapshot, ensuring remotely-hinted writes arrive before the snapshot boundary. - Legitimate unrepaired data has timestamps close to 'now', always newer than any GC-eligible tombstone (USING TIMESTAMP to write backdated data is user error / UB). Excluding the repairing and unrepaired sets from the GC shadow check cannot cause any tombstone to be wrongly collected. The memtable check is also skipped for the same reason: memtable data is either newer than the GC-eligible tombstone, or was flushed into the repairing/repaired set before gc_before advanced. Safety restriction — materialized views: The optimization IS applied to materialized view tables. Two possible paths could inject D_view into the MV's unrepaired set after MV repair: view hints and staging via the view-update-generator. 
Both are safe: (1) View hints: flush_hints() creates a sync point covering BOTH _hints_manager (base mutations) AND _hints_for_views_manager (view mutations). It waits until ALL pending view hints — including D_view entries queued in _hints_for_views_manager while the target MV replica was down — have been replayed to the target node before take_storage_snapshot() is called. D_view therefore lands in the MV's repairing sstable and is promoted to repaired. When a repaired compaction then checks for shadows it finds D_view in the repaired set, keeping T_mv non-purgeable. (2) View-update-generator staging path: Base table repair can write a missing D_base to a replica via a staging sstable. The view-update-generator processes the staging sstable ASYNCHRONOUSLY: it may fire arbitrarily later, even after MV repair has committed repair_time and T_mv has been GC'd from the repaired set. However, the staging processor calls stream_view_replica_updates() which performs a READ-BEFORE-WRITE via as_mutation_source_excluding_staging(): it reads the CURRENT base table state before building the view update. If T_base was written to the base table (as it always is before the base replica can be repaired and the MV tombstone can become GC-eligible), the view_update_builder sees T_base as the existing partition tombstone. D_base's row marker (ts_d < ts_t) is expired by T_base, so the view update is a no-op: D_view is never dispatched to the MV replica. No resurrection can occur regardless of how long staging is delayed. A potential sub-edge-case is T_base being purged BEFORE staging fires (leaving D_base as the sole survivor, so stream_view_replica_updates would dispatch D_view). This is blocked by an additional invariant: for tablet-based tables, the repair writer stamps repaired_at on staging sstables (repair_writer_impl::create_writer sets mark_as_repaired = true and perform_component_rewrite writes repaired_at = sstables_repaired_at + 1 on every staging sstable). 
After base repair commits sstables_repaired_at to Raft, the staging sstable satisfies is_repaired(sstables_repaired_at, staging_sst) and therefore appears in make_repaired_sstable_set(). Any subsequent base repair that advances sstables_repaired_at further still includes the staging sstable (its repaired_at ≤ new sstables_repaired_at). D_base in the staging sstable thus shadows T_base in every repaired compaction's shadow check, keeping T_base non-purgeable as long as D_base remains in staging. A base table hint also cannot bypass this. A base hint is replayed as a base mutation. The resulting view update is generated synchronously on the base replica and sent to the MV replica via _hints_for_views_manager (path 1 above), not via staging. USING TIMESTAMP with timestamps predating (gc_before + propagation_delay) is explicitly UB and excluded from the safety argument. For tombstone_gc modes other than repair (timeout, immediate, disabled) the invariant does not hold for base tables either, so the full storage-group set is returned. The expected gain is reduced bloom filter and memtable key-lookup I/O during repaired compactions: the unrepaired set is typically the largest (it holds all recent writes), yet for tombstone_gc=repair it never influences GC decisions. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-231. Closes scylladb/scylladb#29310 * github.com:scylladb/scylladb: compaction: Restrict tombstone GC sstable set to repaired sstables for tombstone_gc=repair mode test/repair: Add tombstone GC safety tests for incremental repair |
||
|
|
f5eb99f149 |
test: bump multishard_query_test querier_cache TTL to 60s to avoid flake
Three test cases in multishard_query_test.cc set the querier_cache entry TTL to 2s and then assert, between pages of a stateful paged query, that cached queriers are still present (population >= 1) and that time_based_evictions stays 0. The 2s TTL is not load-bearing for what these tests exercise — they are checking the paging-cache handoff, not TTL semantics. But on busy CI runners (SCYLLADB-1642 was observed on aarch64 release), scheduling jitter between saving a reader and sampling the population can exceed 2s. When that happens, the TTL fires, both saved queriers are time-evicted, population drops to 0, and the assertion `require_greater_equal(saved_readers, 1u)` fails. The trailing `require_equal(time_based_evictions, 0)` check never runs because the earlier assertion has already aborted the iteration — which is why the Jenkins failure surfaces only as a bare "C++ failure at seastar_test.cc:93". Reproduced deterministically in test_read_with_partition_row_limits by injecting a `seastar::sleep(2500ms)` between the save and the sample: the hook then reports population=0 inserts=2 drops=0 time_based_evictions=2 resource_based_evictions=0 and the assertion fires — matching the Jenkins symptoms exactly. Bump the TTL to 60s in all three affected tests: - test_read_with_partition_row_limits (confirmed repro for SCYLLADB-1642) - test_read_all (same pattern, same invariants — suspect) - test_read_all_multi_range (same pattern, same invariants — suspect) Leave test_abandoned_read (1s TTL, actually tests TTL-driven eviction) and test_evict_a_shard_reader_on_each_page (tests manual eviction via evict_one(); its TTL is not load-bearing but the fix is deferred for a separate review) unchanged. Fixes: SCYLLADB-1642 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Closes scylladb/scylladb#29564 |
||
|
|
cddde464ca |
Merge 'service: Support adding/removing a datacenter with tablets by changing RF' from Aleksandra Martyniuk
With this change, you can add or remove one or more DCs in a single ALTER KEYSPACE statement. It requires the keyspace to use a rack list replication factor. In the existing approach, all tablet replicas are rebuilt at once during an RF change. This is no longer the case. In global_topology_request::keyspace_rf_change the request is added to ongoing_rf_changes, a new column in the system.topology table. In next_replication, a new column in system_schema.keyspaces, we keep the target RF. In make_rf_change_plan, the load balancer schedules the necessary migrations, considering the load of nodes and other pending tablet transitions. Requests from ongoing_rf_changes are processed concurrently, independently of one another. In each request racks are processed concurrently. No tablet replica will be removed until all required replicas are added. While adding replicas to each rack we always start with base tables and won't proceed with views until they are done (while removing, the other way around). The intermediary steps aren't reflected in the schema. When the RF change is finished: - in system_schema.keyspaces: - next_replication is cleared; - new keyspace properties are saved; - request is removed from ongoing_rf_changes; - the request is marked as done in system.topology_requests. Until the request is done, DESCRIBE KEYSPACE shows the replication_v2. If a request hasn't started to remove replicas, it can be aborted using the task manager. system.topology_requests::error is set (but the request isn't marked as done) and next_replication = replication_v2. This is interpreted by the load balancer, which starts the rollback of the request. After the rollback is done, we set the relevant system.topology_requests entry as done (failed), clear the request id from system.topology::ongoing_rf_changes, and remove next_replication. Fixes: SCYLLADB-567. No backport needed; new feature. 
Closes scylladb/scylladb#24421 * github.com:scylladb/scylladb: service: fix indentation docs: update documentation test: test multi RF changes service: tasks: allow aborting ongoing RF changes cql3: allow changing RF by more than one when adding or removing a DC service: handle multi_rf_change service: implement make_rf_change_plan service: add keyspace_rf_change_plan to migration_plan service: extend tablet_migration_info to handle rebuilds service: split update_node_load_on_migration service: rearrange keyspace_rf_change handler db: add columns to system_schema.keyspaces db: service: add ongoing_rf_changes to system.topology gms: add keyspace_multi_rf_change feature |
||
|
|
b6cb025e9b |
test/audit: add reproducer for native-protocol batch not being audited
The existing test_batch sends a textual BEGIN BATCH ... APPLY BATCH as a QUERY message, which goes through the CQL parser and raw::batch_statement:: prepare() — a path that correctly sets audit_info. This missed the bug where native-protocol BATCH messages (opcode 0x0D), handled by process_batch_internal in transport/server.cc, construct a batch_statement without setting audit_info, causing audit to silently skip the batch. Add _test_batch_native_protocol which uses the driver's BatchStatement (both unprepared and prepared variants) to exercise this code path. Refs SCYLLADB-1652 |
||
|
|
f5bb9b6282 |
audit: set audit_info for native-protocol BATCH messages
Commit
|
||
|
|
5f93d57d6e |
test/audit: rename internal test methods to avoid CI misdetection
The CI heuristic picks up any function named test_* in changed files and tries to run it as a standalone pytest test. The AuditTester class methods (test_batch, test_dml, etc.) are not top-level pytest tests — they are internal helpers called from the actual test functions. Prefix them with underscore so CI does not mistake them for standalone tests. |
||
|
|
cf237e060a |
test: auth_cluster: use safe_driver_shutdown() for Cluster teardown
A handful of cassandra-driver Cluster.shutdown() call sites in the
auth_cluster tests were missed by the previous sweep that introduced
safe_driver_shutdown(), because the local variable holding the Cluster
is named "c" rather than "cluster".
Direct Cluster.shutdown() is racy: the driver's "Task Scheduler"
thread may raise RuntimeError ("cannot schedule new futures after
shutdown") during or after the call, occasionally failing tests.
safe_driver_shutdown() suppresses this expected RuntimeError and
joins the scheduler thread.
Replace the remaining c.shutdown() calls in:
- test/cluster/auth_cluster/test_startup_response.py
- test/cluster/auth_cluster/test_maintenance_socket.py
with safe_driver_shutdown(c) and add the corresponding import from
test.pylib.driver_utils.
No behavioral change to the tests; only the driver teardown is
hardened against a known driver-side race.
Fixes SCYLLADB-1662
Closes scylladb/scylladb#29576
|
||
|
|
6f7bf30a14 |
alternator: increase wait time to tablet sync
When a tablet count change is forced via a CQL command, the underlying tablet machinery takes some time to adjust. The original code waited at most 0.1s for tablet data to be synchronized. This appears to be insufficient on debug builds, so we add exponential backoff and increase the maximum waiting time. The code now waits 0.1s the first time and doubles the wait on each retry, up to a maximum of 6 waits, for a total of about 6s. Fixes: SCYLLADB-1655 Closes scylladb/scylladb#29573 |
||
|
|
74b523ea20 |
treewide: fix spelling errors.
Fix various spelling errors. Closes scylladb/scylladb#29574 |
||
|
|
cb8253067d |
Merge 'strong_consistency: fix crash when DROP TABLE races with in-flight DML' from Petr Gusev
When DROP TABLE races with an in-flight DML on a strongly-consistent table, the node aborts in `groups_manager::acquire_server()` because the raft group has already been erased from `_raft_groups`. A concurrent `DROP TABLE` may have already removed the table from database registries and erased the raft group via `schedule_raft_group_deletion`. The `schema.table()` in `create_operation_ctx()` might not fail though because someone might be holding `lw_shared_ptr<table>`, so that the table is dropped but the table object is still alive. Fix by accepting table_id in acquire_server and checking that the table still exists in the database via `find_column_family` before looking up the raft group. If the table has been dropped, find_column_family throws no_such_column_family instead of the node aborting via on_internal_error. When the table does exist, acquire_server proceeds to acquire state.gate; schedule_raft_group_deletion co_awaits gate::close, so it will wait for the DML operation to complete before erasing the group. backport: not needed (not released feature) Fixes SCYLLADB-1450 Closes scylladb/scylladb#29430 * github.com:scylladb/scylladb: strong_consistency: fix crash when DROP TABLE races with in-flight DML test: add regression test for DROP TABLE racing with in-flight DML |
||
|
|
bcda39f716 |
test: audit: use set diff to identify new audit rows
assert_entries_were_added asserted that new audit rows always appear at the tail of each per-node, event_time-sorted sequence. That invariant is not a property of the audit feature: audit writes are asynchronous with respect to query completion, and on a multi-node cluster QUORUM reads of audit.audit_log can reveal a row with an older event_time after a row with a newer one has already been observed. Replace the positional tail slice with a per-node set difference between the rows observed before and after the audited operation. The wait_for retry loop, noise filtering, and final by-value comparison against expected_entries are unchanged, so the test still verifies the real contract, that the expected audit entries appear, without relying on a visibility-ordering invariant that the audit log does not guarantee. Fixes SCYLLADB-1589 Closes scylladb/scylladb#29567 |
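To illustrate why a set difference is more robust than a tail slice here, the sketch below compares the two on a row that becomes visible late but carries an older event_time. It assumes audit rows are represented as hashable tuples keyed by node name; `new_rows_per_node` is a hypothetical helper, not the function in test/audit.

```python
def new_rows_per_node(before, after):
    """For each node, return the rows present in `after` but not in `before`.

    Both arguments map a node name to an iterable of hashable rows.
    """
    return {
        node: set(rows) - set(before.get(node, ()))
        for node, rows in after.items()
    }


# Rows are (event_time, operation); lists are sorted by event_time.
before = {"node1": [(10, "SELECT")]}
# A row with an OLDER event_time becomes visible only after the newer one.
after = {"node1": [(5, "INSERT"), (10, "SELECT")]}

tail_slice = after["node1"][len(before["node1"]):]   # positional approach
set_diff = new_rows_per_node(before, after)["node1"]  # set-difference approach
```

In this example the tail slice reports `(10, "SELECT")`, which was already observed, while the set difference correctly reports `(5, "INSERT")`.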
||
|
|
6165124fcc |
Merge 'cql3: statement_restrictions: analyze during prepare time' from Avi Kivity
The statement_restrictions code is responsible for analyzing the WHERE clause, deciding on the query plan (which index to use), and extracting the partition and clustering keys to use for the index.
Currently, it suffers from repetition in making its decisions: there are 15 calls to expr::visit in statement_restrictions.cc, and 14 find_binop calls. This reduces to 2 visits (one nested in the other) and 6 find_binop calls. The analysis of binary operators is done once, then reused.
The key data structure introduced is the predicate. While an expression takes inputs from the row evaluated, constants, and bind variables, and produces a boolean result, predicates ask which values for a column (or a number of columns) are needed to satisfy (part of) the WHERE clause. The WHERE clause is then expressed as a conjunction of such predicates. The analyzer uses the predicates to select the index, then uses the predicates to compute the partition and clustering keys.
The refactoring is composed of these parts (but patches from different parts are interspersed):
1. an exhaustive regression test is added as the first commit, to ensure behavior doesn't change
2. move computation from query time to prepare time
3. introduce, gradually enrich, and use predicates to implement the statement_restrictions API
Major refactoring, and no bugs fixed, so definitely not backporting.
Closes scylladb/scylladb#29114 * github.com:scylladb/scylladb: cql3: statement_restrictions: replace has_eq_restriction_on_column with precomputed set cql3: statement_restrictions: replace multi_column_range_accumulator_builder with direct predicate iteration cql3: statement_restrictions: use predicate fields in build_get_clustering_bounds_fn cql3: statement_restrictions: remove extract_single_column_restrictions_for_column cql3: statement_restrictions: use predicate vectors in prepare_indexed_local cql3: statement_restrictions: use predicate vector size for clustering prefix length cql3: statement_restrictions: replace do_find_idx and is_supported_by with predicate-based versions cql3: statement_restrictions: remove expression-based has_supporting_index and index_supports_some_column cql3: statement_restrictions: replace multi-column and PK index support checks with predicate-based versions cql3: statement_restrictions: add predicate-based index support checking cql3: statement_restrictions: use pre-built single-column maps for index support checks cql3: statement_restrictions: build clustering-prefix restrictions incrementally cql3: statement_restrictions: build partition-range restrictions incrementally cql3: statement_restrictions: build clustering-key single-column restrictions map incrementally cql3: statement_restrictions: build partition-key single-column restrictions map incrementally cql3: statement_restrictions: build non-primary-key single-column restrictions map incrementally cql3: statement_restrictions: use tracked has_mc_clustering for _has_multi_column cql3: statement_restrictions: track has-token state incrementally cql3: statement_restrictions: track partition-key-empty state incrementally cql3: statement_restrictions: track first multi-column predicate incrementally cql3: statement_restrictions: track last clustering column incrementally cql3: statement_restrictions: track clustering-has-slice incrementally cql3: statement_restrictions: track 
has-multi-column-clustering incrementally cql3: statement_restrictions: track clustering-empty state incrementally cql3: statement_restrictions: replace restr bridge variable with pred.filter cql3: statement_restrictions: convert single-column branch to use predicate properties cql3: statement_restrictions: convert multi-column branch to use predicate properties cql3: statement_restrictions: convert constructor loop to iterate over predicates cql3: statement_restrictions: annotate predicates with operator properties cql3: statement_restrictions: annotate predicates with is_not_null and is_multi_column cql3: statement_restrictions: complete preparation early cql3: statement_restrictions: convert expressions to predicates without being directed at a specific column cql3: statement_restrictions: refine possible_lhs_values() function_call processing cql3: statement_restrictions: return nullptr for function solver if not token cql3: statement_restrictions: refine possible_lhs_values() subscript solving cql3: statement_restrictions: return nullptr from possible_lhs_values instead of on_internal_error cql3: statement_restrictions: convert possible_lhs_values into a solver cql3: statement_restrictions: split _where to boolean factors in preparation for predicates conversion cql3: statement_restrictions: refactor IS NOT NULL processing cql3: statement_restrictions: fold add_single_column_nonprimary_key_restriction() into its caller cql3: statement_restrictions: fold add_single_column_clustering_key_restriction() into its caller cql3: statement_restrictions: fold add_single_column_partition_key_restriction() into its caller cql3: statement_restrictions: fold add_token_partition_key_restriction() into its caller cql3: statement_restrictions: fold add_multi_column_clustering_key_restriction() into its caller cql3: statement_restrictions: avoid early return in add_multi_column_clustering_key_restrictions cql3: statement_restrictions: fold add_is_not_restriction() into its 
caller cql3: statement_restrictions: fold add_restriction() into its caller cql3: statement_restrictions: remove possible_partition_token_values() cql3: statement_restrictions: remove possible_column_values cql3: statement_restrictions: pass schema to possible_column_values() cql3: statement_restrictions: remove fallback path in solve() cql3: statement_restrictions: reorder possible_lhs_column parameters cql3: statement_restrictions: prepare solver for multi-column restrictions cql3: statement_restrictions: add solver for token restriction on index cql3: statement_restrictions: pre-analyze column in value_for() cql3: statement_restrictions: don't handle boolean constants in multi_column_range_accumulator_builder cql3: statement_restrictions: split range_from_raw_bounds into prepare phase and query phase cql3: statement_restrictions: adjust signature of range_from_raw_bounds cql3: statement_restrictions: split multi_column_range_accumulator into prepare-time and query-time phases cql3: statement_restrictions: make get_multi_column_clustering_bounds a builder cql3: statement_restrictions: multi-key clustering restrictions one layer deeper cql3: statement_restrictions: push multi-column post-processing into get_multi_column_clustering_bounds() cql3: statement_restrictions: pre-analyze single-column clustering key restrictions cql3: statement_restrictions: wrap value_for_index_partition_key() cql3: statement_restrictions: hide value_for() cql3: statement_restrictions: push down clustering prefix wrapper one level cql3: statement_restrictions: wrap functions that return clustering ranges cql3: statement_restrictions: do not pass view schema back and forth cql3: statement_restrictions: pre-analyze token range restrictions cql3: statement_restrictions: pre-analyze partition key columns cql3: statement_restrictions: do not collect subscripted partition key columns cql3: statement_restrictions: split _partition_range_restrictions into three cases cql3: 
statement_restrictions: move value_list, value_set to header file cql3: statement_restrictions: wrap get_partition_key_ranges cql3: statement_restrictions: prepare statement_restrictions for capturing `this` test: statement_restrictions: add index_selection regression test |
||
|
|
d222e6e2a4 |
doc: document support for OCI Object Storage
This commit extends the object storage configuration section with support for OCI Object Storage. Fixes SCYLLADB-502 Closes scylladb/scylladb#29503 |
||
|
|
cfebe17592 |
sstables: fix segfault in parse_assert() when message is nullptr
parse_assert() accepts an optional `message` parameter that defaults to nullptr. When the assertion fails and message is nullptr, it is implicitly converted to sstring via the sstring(const char*) constructor, which calls strlen(nullptr) -- undefined behavior that manifests as a segfault in __strlen_evex. This turns what should be a graceful malformed_sstable_exception into a fatal crash.
In the case of CUSTOMER-279, a corrupt SSTable triggered parse_assert() during streaming (in continuous_data_consumer::fast_forward_to()), causing a crash loop on the affected node.
Fix by guarding the nullptr case with a ternary, passing an empty sstring() when message is null. on_parse_error() already handles the empty-message case by substituting "parse_assert() failed".
Fixes: SCYLLADB-1329
Closes scylladb/scylladb#29285
|
||
|
|
935e6a495d |
Merge 'transport: add per-service-level cql_requests_serving metric' from Piotr Smaron
The existing scylla_transport_requests_serving metric is a single global per-shard gauge counting outstanding CQL requests. When debugging latency spikes, it's useful to know which service level is contributing the most in-flight requests. This PR adds a new per-scheduling-group gauge scylla_transport_cql_requests_serving (with the scheduling_group_name label), using the existing cql_sg_stats per-SG infrastructure. The cql_ prefix is intentional — it follows the convention of all other per-SG transport metrics (cql_requests_count, cql_request_bytes, etc.) and avoids Prometheus confusion with the global requests_serving metric (which lacks the scheduling_group_name label). Fixes: SCYLLADB-1340 New feature, no backport. Closes scylladb/scylladb#29493 * github.com:scylladb/scylladb: transport: add per-service-level cql_requests_serving metric transport: move requests_serving decrement to after response is sent |
||
|
|
cd79b99112 |
test: fix flaky test_alter_tablets_rf_dc_drop by using read barrier
The test was reading system_schema.keyspaces from an arbitrary node that may not have applied the latest schema change yet. Pin the read to a specific node and issue a read barrier before querying, ensuring the node has up-to-date data. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1643. Closes scylladb/scylladb#29563 |
||
|
|
474e962e01 |
compaction: Restrict tombstone GC sstable set to repaired sstables for tombstone_gc=repair mode
When tombstone_gc=repair, the repaired compaction view's sstable_set_for_tombstone_gc() previously returned all sstables across all three views (unrepaired, repairing, repaired). This is correct but unnecessarily expensive: the unrepaired and repairing sets are never the source of a GC-blocking shadow when tombstone_gc=repair, for base tables.
The key ordering guarantee that makes this safe is:
- topology_coordinator sends send_tablet_repair RPC and waits for it to complete. Inside that RPC, mark_sstable_as_repaired() runs on all replicas, moving D from repairing → repaired (repaired_at stamped on disk).
- Only after the RPC returns does the coordinator commit repair_time + sstables_repaired_at to Raft.
- gc_before = repair_time - propagation_delay only advances once that Raft commit applies.
Therefore, when a tombstone T in the repaired set first becomes GC-eligible (its deletion_time < gc_before), any data D it shadows is already in the repaired set on every replica. This holds because:
- The memtable is flushed before the repairing snapshot is taken (take_storage_snapshot calls sg->flush()), capturing all data present at repair time.
- Hints and batchlog are flushed before the snapshot, ensuring remotely-hinted writes arrive before the snapshot boundary.
- Legitimate unrepaired data has timestamps close to 'now', always newer than any GC-eligible tombstone (USING TIMESTAMP to write backdated data is user error / UB).
Excluding the repairing and unrepaired sets from the GC shadow check cannot cause any tombstone to be wrongly collected. The memtable check is also skipped for the same reason: memtable data is either newer than the GC-eligible tombstone, or was flushed into the repairing/repaired set before gc_before advanced.
Safety restriction, materialized views:
The optimization IS applied to materialized view tables. Two possible paths could inject D_view into the MV's unrepaired set after MV repair: view hints and staging via the view-update-generator. Both are safe:
(1) View hints: flush_hints() creates a sync point covering BOTH _hints_manager (base mutations) AND _hints_for_views_manager (view mutations). It waits until ALL pending view hints (including D_view entries queued in _hints_for_views_manager while the target MV replica was down) have been replayed to the target node before take_storage_snapshot() is called. D_view therefore lands in the MV's repairing sstable and is promoted to repaired. When a repaired compaction then checks for shadows it finds D_view in the repaired set, keeping T_mv non-purgeable.
(2) View-update-generator staging path: Base table repair can write a missing D_base to a replica via a staging sstable. The view-update-generator processes the staging sstable ASYNCHRONOUSLY: it may fire arbitrarily later, even after MV repair has committed repair_time and T_mv has been GC'd from the repaired set. However, the staging processor calls stream_view_replica_updates() which performs a READ-BEFORE-WRITE via as_mutation_source_excluding_staging(): it reads the CURRENT base table state before building the view update. If T_base was written to the base table (as it always is before the base replica can be repaired and the MV tombstone can become GC-eligible), the view_update_builder sees T_base as the existing partition tombstone. D_base's row marker (ts_d < ts_t) is expired by T_base, so the view update is a no-op: D_view is never dispatched to the MV replica. No resurrection can occur regardless of how long staging is delayed.
A potential sub-edge-case is T_base being purged BEFORE staging fires (leaving D_base as the sole survivor, so stream_view_replica_updates would dispatch D_view). This is blocked by an additional invariant: for tablet-based tables, the repair writer stamps repaired_at on staging sstables (repair_writer_impl::create_writer sets mark_as_repaired = true and perform_component_rewrite writes repaired_at = sstables_repaired_at + 1 on every staging sstable). After base repair commits sstables_repaired_at to Raft, the staging sstable satisfies is_repaired(sstables_repaired_at, staging_sst) and therefore appears in make_repaired_sstable_set(). Any subsequent base repair that advances sstables_repaired_at further still includes the staging sstable (its repaired_at ≤ new sstables_repaired_at). D_base in the staging sstable thus shadows T_base in every repaired compaction's shadow check, keeping T_base non-purgeable as long as D_base remains in staging.
A base table hint also cannot bypass this. A base hint is replayed as a base mutation. The resulting view update is generated synchronously on the base replica and sent to the MV replica via _hints_for_views_manager (path 1 above), not via staging.
USING TIMESTAMP with timestamps predating (gc_before + propagation_delay) is explicitly UB and excluded from the safety argument.
For tombstone_gc modes other than repair (timeout, immediate, disabled) the invariant does not hold for base tables either, so the full storage-group set is returned.
Implementation:
- Add compaction_group::is_repaired_view(v): pointer comparison against _repaired_view.
- Add compaction_group::make_repaired_sstable_set(): iterates _main_sstables and inserts only sstables classified as repaired (repair::is_repaired(sstables_repaired_at, sst)).
- Add storage_group::make_repaired_sstable_set(): collects repaired sstables across all compaction groups in the storage group.
- Add table::make_repaired_sstable_set_for_tombstone_gc(): collects repaired sstables from all compaction groups across all storage groups (needed for multi-tablet tables).
- Add compaction_group_view::skip_memtable_for_tombstone_gc(): returns true iff the repaired-only optimization is active; used by get_max_purgeable_timestamp() in compaction.cc to bypass the memtable shadow check.
- is_tombstone_gc_repaired_only() private helper gates both methods: requires is_repaired_view(this) && tombstone_gc_mode == repair. No is_view() exclusion.
- Add error injection "view_update_generator_pause_before_processing" in process_staging_sstables() to support testing the staging-delay scenario.
- New test test_tombstone_gc_mv_optimization_safe_via_hints: stops servers[2], writes D_base + T_base (view hints queued for servers[2]'s MV replica), restarts, runs MV tablet repair (flush_hints delivers D_view + T_mv before snapshot), triggers repaired compaction, and asserts the MV row is NOT visible: T_mv preserved because D_view landed in the repaired set via the hints-before-snapshot path.
- New test test_tombstone_gc_mv_safe_staging_processor_delay: runs base repair before writing T_base so D_base is staged on servers[0] via row-sync; blocks the view-update-generator with an error injection; writes T_base + T_mv; runs MV repair (fast path, T_mv GC-eligible); triggers repaired compaction (T_mv purged, no D_view in repaired set); asserts no resurrection; releases injection; waits for staging to complete; asserts no resurrection after a second flush+compaction. Demonstrates that the read-before-write in stream_view_replica_updates() makes the optimization safe even when staging fires after T_mv has been GC'd.
The expected gain is reduced bloom filter and memtable key-lookup I/O during repaired compactions: the unrepaired set is typically the largest (it holds all recent writes), yet for tombstone_gc=repair it never influences GC decisions.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
||
|
|
a50aa7e689 |
test/cluster: wait for ready CQL in cross-rack merge test
test_tablet_merge_cross_rack_migrations() starts issuing DDL immediately after adding the new cross-rack nodes. In the failing runs the driver is still converging on the updated topology at that point, so the control connection sees incomplete peer metadata while schema changes are in flight. That leaves a race where CREATE TABLE is sent during topology churn and the test can surface a misleading AlreadyExists error even though the table creation has already been committed. Use get_ready_cql(servers) here so the test waits for inter-node visibility and CQL readiness before creating the keyspace and table. Fixes: SCYLLADB-1635 Closes scylladb/scylladb#29561 |
||
|
|
d18eb9479f |
cql/statement: Create keyspace_metadata with correct initial_tablets count
In `ks_prop_defs::as_ks_metadata(...)` a default initial tablets count
is set to 0 when tablets are enabled and the replication strategy
is NetworkTopologyStrategy.
This effectively sets _uses_tablets = false in abstract_replication_strategy
for the remaining strategies when no `tablets = {...}` options are specified.
As a consequence, it is possible to create vnode-based keyspaces even
when tablets are enforced with `tablets_mode_for_new_keyspaces`.
The patch sets a default initial tablets count to zero regardless of
the chosen replication strategy. Each replication strategy then
validates the options and raises a configuration exception when tablets
are not supported.
All tests are altered in the following way:
+ whenever it was correct, SimpleStrategy was replaced with NetworkTopologyStrategy
+ otherwise, tablets were explicitly disabled with ` AND tablets = {'enabled': false}`
Fixes https://github.com/scylladb/scylladb/issues/25340
Closes scylladb/scylladb#25342
|
||
|
|
69c58c6589 |
Merge 'streaming: add oos protection in mutation based streaming' from Łukasz Paszkowski
The mutation-fragment-based streaming path in `stream_session.cc` did not check whether the receiving node was in critical disk utilization mode before accepting incoming mutation fragments. This meant that operations like `nodetool refresh --load-and-stream`, which stream data through the `STREAM_MUTATION_FRAGMENTS` RPC handler, could push data onto a node that had already reached critical disk usage. The file-based streaming path in stream_blob.cc already had this protection, but the load&stream path was missing it. This patch adds a check for `is_in_critical_disk_utilization_mode()` in the `stream_mutation_fragments` handler in `stream_session.cc`, throwing a `replica::critical_disk_utilization_exception` when the node is at critical disk usage. This mirrors the existing protection in the blob streaming path and closes the gap that allowed data to be written to a node that should have been rejecting all incoming writes. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-901 The out of space prevention mechanism was introduced in 2025.4. The fix should be backported there and all later versions. Closes scylladb/scylladb#28873 * github.com:scylladb/scylladb: streaming: reject mutation fragments on critical disk utilization test/cluster/storage: Add a reproducer for load-and-stream out-of-space rejection sstables: clean up TemporaryHashes file in wipe() sstables: add error injection point in write_components test/cluster/storage: extract validate_data_existence to module scope test/cluster: enable suppress_disk_space_threshold_checks in tests using data_file_capacity utils/disk_space_monitor: add error injection to suppress threshold checks |
||
|
|
16ed338a89 |
Fix CODEOWNERS to cover nested docs subfolders
The `docs/*` pattern only matches files directly inside `docs/`, not files in nested subfolders like `docs/folder_b/test.md` or `docs/alternator/setup.md`. Those files currently have no code owner assigned. Replace with `/docs/` and `/docs/alternator/` which match the directories and all their subdirectories recursively, per GitHub's CODEOWNERS syntax. Ref: https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners Closes scylladb/scylladb#29521 |
||
|
|
5687a4840d |
conf: pair sstable_format=ms with column_index_size_in_kb=1
One of the advantages of Trie indexes (with sstable_format=ms) is that
the index is more compact, and more suitable for paging from disk
(fewer pages required per search). We can exploit it by setting
column_index_size_in_kb to 1 rather than 64, increasing the index file
size (and requiring more index pages to be loaded and parsed) in return
for smaller data file reads.
To test this, I created a 1M row partition with 300-byte rows, compacted
it into a single sstable, and tested reads to a single row.
With column_index_size_in_kb=64:
Rows.db file size 60k
3 pages read from Rows.db (4k each)
2x 32k read from Data.db
With column_index_size_in_kb=1:
Rows.db file size 2MB (33X)
5 pages read from Rows.db (4k each, 1.7X)
1x 4107 bytes read from Data.db (0.5X IOPS, 0.06X bandwidth)
Given that Rows.db will be typically cached, or at least all but one of the
levels (its size is 157X smaller than Data.db), we win on both IOPS
and bandwidth.
I would have expected the Data.db read to be closer to 1k, but this
is already an improvement.
Given that, set column_index_size_in_kb=1, but only for new clusters
where we also select sstable_format=ms.
Raw data (w1, w64 are working directories with different
column_index_size_in_kb):
```console
$ ls -l w*/data/bench/wide_partition-*/*{Rows,Data}.db
-rw-r--r-- 1 avi avi 314964958 Apr 19 16:17 w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Data.db
-rw-r--r-- 1 avi avi 2001227 Apr 19 16:17 w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db
-rw-r--r-- 1 avi avi 314963261 Apr 19 16:18 w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db
-rw-r--r-- 1 avi avi 59989 Apr 19 16:18 w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db
```
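The ratios quoted in the message above can be rechecked from the raw numbers. The short calculation below uses only figures that appear in the commit message (the `ls -l` sizes, the page counts, and the Data.db read sizes); the variable names are chosen for this example.

```python
# File sizes in bytes, taken from the `ls -l` listing above.
rows_db_1k = 2_001_227      # w1 Rows.db  (column_index_size_in_kb=1)
rows_db_64k = 59_989        # w64 Rows.db (column_index_size_in_kb=64)
data_db_1k = 314_964_958    # w1 Data.db

# Per-read figures quoted in the commit message.
rows_pages_1k, rows_pages_64k = 5, 3        # 4k pages read from Rows.db
data_reads_1k, data_reads_64k = 1, 2        # number of Data.db reads
data_bytes_1k = 4107                        # single read with 1 KiB index blocks
data_bytes_64k = 2 * 32 * 1024              # two 32k reads with 64 KiB index blocks

index_growth = rows_db_1k / rows_db_64k           # about 33X larger Rows.db
page_ratio = rows_pages_1k / rows_pages_64k       # about 1.7X more index pages
iops_ratio = data_reads_1k / data_reads_64k       # 0.5X Data.db IOPS
bandwidth_ratio = data_bytes_1k / data_bytes_64k  # about 0.06X Data.db bandwidth
data_to_index = data_db_1k / rows_db_1k           # Rows.db is about 157X smaller

print(f"{index_growth:.1f}X index size, {page_ratio:.1f}X index pages, "
      f"{iops_ratio:.1f}X IOPS, {bandwidth_ratio:.2f}X bandwidth, "
      f"Data.db/Rows.db = {data_to_index:.0f}")
```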
column_index_size_in_kb=64 trace:
```
cqlsh> SELECT * FROM bench.wide_partition WHERE pk = 0 AND ck = 654321 BYPASS CACHE;
pk | ck | v
----+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
0 | 654321 | 9OXdwmDHRapL2w5YruWLTOtiC3PKbyctSDdQ8YpuPKtWkSYBF10G7bKo2rdnxSAd52HLI21568YM7OwK05B6qAF7X2b6910qsJEA106QBEcFWQVybMCkxkpO4VDRcAVNLRgjB3vygcDBP17GBTb2s7l47UOloy3KtZ7J5YQgKcf7zlFSKGHa49vnRrzoXZCdYexOpix6jcSV2SiwRNqgv6XmYhx43ZwGa4zUtOe0eIKJj7KTxu5bzyWUWGW7US4NLFZRD8Vdb6EasIFkOfVKdiFp2LZHMXGRvtvdF93UTFUb
(1 rows)
Tracing session: 19219900-3bf3-11f1-bc43-c0a4e62b53d1
activity | timestamp | source | source_elapsed | client
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+-----------+----------------+-----------
Execute CQL3 query | 2026-04-19 16:24:30.992000 | 127.0.0.1 | 0 | 127.0.0.1
Parsing a statement [shard 0/sl:default] | 2026-04-19 16:24:30.992643+00:00 | 127.0.0.1 | 1 | 127.0.0.1
Processing a statement for authenticated user: anonymous [shard 0/sl:default] | 2026-04-19 16:24:30.992738+00:00 | 127.0.0.1 | 96 | 127.0.0.1
Executing read query (reversed false) [shard 0/sl:default] | 2026-04-19 16:24:30.992765+00:00 | 127.0.0.1 | 123 | 127.0.0.1
Creating read executor for token -3485513579396041028 with all: [cf134ebd-5f1b-4844-94e3-e5c7ad9421f0] targets: [cf134ebd-5f1b-4844-94e3-e5c7ad9421f0] repair decision: NONE [shard 0/sl:default] | 2026-04-19 16:24:30.992781+00:00 | 127.0.0.1 | 139 | 127.0.0.1
Creating never_speculating_read_executor - speculative retry is disabled or there are no extra replicas to speculate with [shard 0/sl:default] | 2026-04-19 16:24:30.992782+00:00 | 127.0.0.1 | 140 | 127.0.0.1
read_data: querying locally [shard 0/sl:default] | 2026-04-19 16:24:30.992795+00:00 | 127.0.0.1 | 153 | 127.0.0.1
Start querying singular range {{-3485513579396041028, pk{000400000000}}} [shard 0/sl:default] | 2026-04-19 16:24:30.992801+00:00 | 127.0.0.1 | 160 | 127.0.0.1
[reader concurrency semaphore sl:default] admitted immediately [shard 0/sl:default] | 2026-04-19 16:24:30.992805+00:00 | 127.0.0.1 | 163 | 127.0.0.1
[reader concurrency semaphore sl:default] executing read [shard 0/sl:default] | 2026-04-19 16:24:30.992814+00:00 | 127.0.0.1 | 172 | 127.0.0.1
Reading key {-3485513579396041028, pk{000400000000}} from sstable w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db [shard 0/sl:default] | 2026-04-19 16:24:30.992837+00:00 | 127.0.0.1 | 195 | 127.0.0.1
page cache miss: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Partitions.db, page=0, readahead=1 [shard 0/sl:default] | 2026-04-19 16:24:30.992851+00:00 | 127.0.0.1 | 209 | 127.0.0.1
page cache miss: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db, page=14, readahead=1 [shard 0/sl:default] | 2026-04-19 16:24:30.995294+00:00 | 127.0.0.1 | 2653 | 127.0.0.1
page cache hit: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db, page=14 [shard 0/sl:default] | 2026-04-19 16:24:30.995375+00:00 | 127.0.0.1 | 2733 | 127.0.0.1
page cache miss: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db, page=2, readahead=1 [shard 0/sl:default] | 2026-04-19 16:24:30.995376+00:00 | 127.0.0.1 | 2734 | 127.0.0.1
page cache hit: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db, page=14 [shard 0/sl:default] | 2026-04-19 16:24:30.995463+00:00 | 127.0.0.1 | 2821 | 127.0.0.1
page cache hit: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db, page=2 [shard 0/sl:default] | 2026-04-19 16:24:30.995463+00:00 | 127.0.0.1 | 2821 | 127.0.0.1
w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db: scheduling bulk DMA read of size 32768 at offset 206057984 [shard 0/sl:default] | 2026-04-19 16:24:30.995471+00:00 | 127.0.0.1 | 2829 | 127.0.0.1
w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db: scheduling bulk DMA read of size 32768 at offset 206090752 [shard 0/sl:default] | 2026-04-19 16:24:30.995475+00:00 | 127.0.0.1 | 2833 | 127.0.0.1
w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db: finished bulk DMA read of size 32768 at offset 206057984, successfully read 32768 bytes [shard 0/sl:default] | 2026-04-19 16:24:30.995586+00:00 | 127.0.0.1 | 2945 | 127.0.0.1
Page stats: 1 partition(s) (1 live, 0 dead), 0 static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 1 cell(s) (1 live, 0 dead) [shard 0/sl:default] | 2026-04-19 16:24:30.995637+00:00 | 127.0.0.1 | 2995 | 127.0.0.1
w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db: finished bulk DMA read of size 32768 at offset 206090752, successfully read 32768 bytes [shard 0/sl:default] | 2026-04-19 16:24:30.995645+00:00 | 127.0.0.1 | 3003 | 127.0.0.1
Querying is done [shard 0/sl:default] | 2026-04-19 16:24:30.995653+00:00 | 127.0.0.1 | 3012 | 127.0.0.1
Done processing - preparing a result [shard 0/sl:default] | 2026-04-19 16:24:30.995670+00:00 | 127.0.0.1 | 3028 | 127.0.0.1
Request complete | 2026-04-19 16:24:30.995039 | 127.0.0.1 | 3039 | 127.0.0.1
w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db: scheduling bulk DMA read of size 32768 at offset 206090752 [shard 0/sl:default] | 2026-04-19 16:22:43.107215+00:00 | 127.0.0.1 | 8685 | 127.0.0.1
```
column_index_size_in_kb=1 trace:
```
cqlsh> SELECT * FROM bench.wide_partition WHERE pk = 0 AND ck = 654321 BYPASS CACHE;
pk | ck | v
----+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
0 | 654321 | FIA7X52ZqYwvDxEGlmWJUSy1I94WTuWZTdLwXr9HBQ90RJLqYKr5nInTADSI6hzofwawaXphAQK07YMoyzFfRaGeKPQPKUb35XpLEGvLJ4xu9r4es8wUEHPXaFBGdMcWUkyDJSTYCFzZAPCzUHEuPJHMXVrI6UExWrIR0Xujg4GZa9UciU9rbEvrSBwSzoPEfbXJ6qZSGiTD8gcXz5kdAblLxsAeWug8tZqslsTu04HMLKfZ8WopQvHbpR6YlGSnM99CiBgz30LMmllULV4VA4u9kMpzsRV2IE2tKmJOddEl
(1 rows)
Tracing session: 3953a1f0-3bf3-11f1-b976-4a3dc2a7a57f
activity | timestamp | source | source_elapsed | client
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+-----------+----------------+-----------
Execute CQL3 query | 2026-04-19 16:25:25.007000 | 127.0.0.1 | 0 | 127.0.0.1
Parsing a statement [shard 0/sl:default] | 2026-04-19 16:25:25.007423+00:00 | 127.0.0.1 | 1 | 127.0.0.1
Processing a statement for authenticated user: anonymous [shard 0/sl:default] | 2026-04-19 16:25:25.007511+00:00 | 127.0.0.1 | 89 | 127.0.0.1
Executing read query (reversed false) [shard 0/sl:default] | 2026-04-19 16:25:25.007536+00:00 | 127.0.0.1 | 114 | 127.0.0.1
Creating read executor for token -3485513579396041028 with all: [e7bd75e7-6d2a-46dc-9f66-430524f40e0d] targets: [e7bd75e7-6d2a-46dc-9f66-430524f40e0d] repair decision: NONE [shard 0/sl:default] | 2026-04-19 16:25:25.007551+00:00 | 127.0.0.1 | 129 | 127.0.0.1
Creating never_speculating_read_executor - speculative retry is disabled or there are no extra replicas to speculate with [shard 0/sl:default] | 2026-04-19 16:25:25.007553+00:00 | 127.0.0.1 | 131 | 127.0.0.1
read_data: querying locally [shard 0/sl:default] | 2026-04-19 16:25:25.007556+00:00 | 127.0.0.1 | 134 | 127.0.0.1
Start querying singular range {{-3485513579396041028, pk{000400000000}}} [shard 0/sl:default] | 2026-04-19 16:25:25.007562+00:00 | 127.0.0.1 | 139 | 127.0.0.1
[reader concurrency semaphore sl:default] admitted immediately [shard 0/sl:default] | 2026-04-19 16:25:25.007564+00:00 | 127.0.0.1 | 142 | 127.0.0.1
[reader concurrency semaphore sl:default] executing read [shard 0/sl:default] | 2026-04-19 16:25:25.007573+00:00 | 127.0.0.1 | 151 | 127.0.0.1
Reading key {-3485513579396041028, pk{000400000000}} from sstable w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Data.db [shard 0/sl:default] | 2026-04-19 16:25:25.007594+00:00 | 127.0.0.1 | 172 | 127.0.0.1
page cache miss: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Partitions.db, page=0, readahead=1 [shard 0/sl:default] | 2026-04-19 16:25:25.007607+00:00 | 127.0.0.1 | 184 | 127.0.0.1
page cache miss: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=488, readahead=1 [shard 0/sl:default] | 2026-04-19 16:25:25.016029+00:00 | 127.0.0.1 | 8607 | 127.0.0.1
page cache hit: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=488 [shard 0/sl:default] | 2026-04-19 16:25:25.016109+00:00 | 127.0.0.1 | 8687 | 127.0.0.1
page cache miss: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=486, readahead=1 [shard 0/sl:default] | 2026-04-19 16:25:25.016111+00:00 | 127.0.0.1 | 8688 | 127.0.0.1
page cache miss: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=285, readahead=1 [shard 0/sl:default] | 2026-04-19 16:25:25.016176+00:00 | 127.0.0.1 | 8754 | 127.0.0.1
page cache hit: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=488 [shard 0/sl:default] | 2026-04-19 16:25:25.016260+00:00 | 127.0.0.1 | 8838 | 127.0.0.1
page cache hit: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=486 [shard 0/sl:default] | 2026-04-19 16:25:25.016261+00:00 | 127.0.0.1 | 8839 | 127.0.0.1
page cache hit: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=285 [shard 0/sl:default] | 2026-04-19 16:25:25.016261+00:00 | 127.0.0.1 | 8839 | 127.0.0.1
w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Data.db: scheduling bulk DMA read of size 4107 at offset 206086656 [shard 0/sl:default] | 2026-04-19 16:25:25.016268+00:00 | 127.0.0.1 | 8846 | 127.0.0.1
w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Data.db: finished bulk DMA read of size 4107 at offset 206086656, successfully read 4608 bytes [shard 0/sl:default] | 2026-04-19 16:25:25.016340+00:00 | 127.0.0.1 | 8918 | 127.0.0.1
Page stats: 1 partition(s) (1 live, 0 dead), 0 static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 1 cell(s) (1 live, 0 dead) [shard 0/sl:default] | 2026-04-19 16:25:25.016367+00:00 | 127.0.0.1 | 8945 | 127.0.0.1
Querying is done [shard 0/sl:default] | 2026-04-19 16:25:25.016385+00:00 | 127.0.0.1 | 8963 | 127.0.0.1
Done processing - preparing a result [shard 0/sl:default] | 2026-04-19 16:25:25.016401+00:00 | 127.0.0.1 | 8979 | 127.0.0.1
Request complete | 2026-04-19 16:25:25.015989 | 127.0.0.1 | 8989 | 127.0.0.1
```
Closes scylladb/scylladb#29552
|
||
|
|
e414b2b0b9 |
test/cluster: scale failure_detector_timeout_in_ms by build mode
Six cluster test files override failure_detector_timeout_in_ms to 2000ms
for faster failure detection. In debug and sanitize builds, this causes
flaky node join failures. The following log analysis shows how.
The coordinator (server 614, IP 127.2.115.3) accepts the joining node
(server 615, host_id 53b01f0b, IP 127.2.115.2) into group0:
  20:10:57,049 [shard 0] raft_group0 - server 614 entered
      'join group0' transition state for 53b01f0b
The joining node begins receiving the raft snapshot 100ms later:
  20:10:57,150 [shard 0] raft_group0 - transfer snapshot from 9fa48539
It then spends ~280ms applying schema changes -- creating 6 keyspaces
and 12+ tables from the snapshot:
  20:10:57,511 [shard 0] migration_manager - Creating keyspace
      system_auth_v2
  ...
  20:10:57,788 [shard 0] migration_manager - Creating
      system_auth_v2.role_members
Meanwhile, the coordinator's failure detector pings the joining node.
Under debug+ASan load the RPC call times out after ~4.6 seconds:
  20:11:01,643 [shard 0] direct_failure_detector - unexpected exception
      when pinging 53b01f0b: seastar::rpc::timeout_error
      (rpc call timed out)
25ms later, the coordinator marks the joining node DOWN and removes it:
  20:11:01,668 [shard 0] raft_group0 - failure_detector_loop:
      Mark node 53b01f0b as DOWN
  20:11:01,717 [shard 0] raft_group0 - bootstrap: failed to accept
      53b01f0b
The joining node was still retrying the snapshot transfer at that point:
  20:11:01,745 [shard 0] raft_group0 - transfer snapshot from 9fa48539
It then receives the ban notification and aborts:
  20:11:01,844 [shard 0] raft_group0 - received notification of being
      banned from the cluster
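The intervals quoted in the log analysis can be rechecked from the timestamps themselves. The following is a minimal sketch, not part of the patch: the helper name delta_ms is hypothetical, and measuring the ~4.6 second figure from the 'join group0' entry is an assumption about which interval is meant.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the log excerpts, e.g. "20:10:57,049".
LOG_TIME_FORMAT = "%H:%M:%S,%f"


def delta_ms(start: str, end: str) -> int:
    """Return the whole milliseconds elapsed between two log timestamps."""
    elapsed = (datetime.strptime(end, LOG_TIME_FORMAT)
               - datetime.strptime(start, LOG_TIME_FORMAT))
    # timedelta // timedelta is exact integer division, so no float rounding.
    return elapsed // timedelta(milliseconds=1)


if __name__ == "__main__":
    # Join accepted -> first snapshot transfer ("100ms later").
    print(delta_ms("20:10:57,049", "20:10:57,150"))
    # First -> last quoted schema change ("~280ms").
    print(delta_ms("20:10:57,511", "20:10:57,788"))
    # Join accepted -> ping timeout (assumed basis of "~4.6 seconds").
    print(delta_ms("20:10:57,049", "20:11:01,643"))
    # Ping timeout -> node marked DOWN ("25ms later").
    print(delta_ms("20:11:01,643", "20:11:01,668"))
```

The helper assumes both timestamps fall on the same day, which holds for every excerpt quoted here.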
Replace the hardcoded 2000ms with the failure_detector_timeout fixture
from conftest.py, which scales by MODES_TIMEOUT_FACTOR: 3x for
debug/sanitize (6000ms), 2x for dev (4000ms), 1x for release (2000ms).
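The scaling rule can be sketched as below. This is an illustration only, not the code in conftest.py: the dictionary contents are limited to the factors quoted in this message, the function name scaled_failure_detector_timeout and the fallback for unknown modes are assumptions, and the real fixture presumably wraps similar logic in a pytest fixture.

```python
# Stand-in for the real MODES_TIMEOUT_FACTOR table; only the factors
# quoted in the commit message are reproduced here.
MODES_TIMEOUT_FACTOR = {"debug": 3, "sanitize": 3, "dev": 2, "release": 1}

# The value the six test files previously hardcoded.
BASE_FAILURE_DETECTOR_TIMEOUT_MS = 2000


def scaled_failure_detector_timeout(build_mode: str) -> int:
    """Return the failure detector timeout in ms for a build mode."""
    # Falling back to factor 1 for unknown modes is an assumption of
    # this sketch, not documented behavior.
    factor = MODES_TIMEOUT_FACTOR.get(build_mode, 1)
    return BASE_FAILURE_DETECTOR_TIMEOUT_MS * factor


if __name__ == "__main__":
    for mode in ("debug", "sanitize", "dev", "release"):
        print(mode, scaled_failure_detector_timeout(mode))
```

With these factors the debug and sanitize timeout of 6000ms exceeds the ~4.6 second ping latency seen in the log analysis, while release keeps the original 2000ms.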
Test measurements (before -> after fix):
debug mode:
  test_replace_with_same_ip_twice          24.02s -> 25.02s
  test_banned_node_notification            217.22s -> 221.72s
  test_kill_coordinator_during_op          116.11s -> 127.13s
  test_node_failure_during_tablet_migration[streaming-source]
                                           183.25s -> 192.69s
  test_replace (4 tests)                   skipped in debug (skip_in_debug)
  test_raft_replace_ignore_nodes           skipped in debug (run_in_dev only)
dev mode:
  test_replace_different_ip                10.51s -> 11.50s
  test_replace_different_ip_using_host_id  10.01s -> 12.01s
  test_replace_reuse_ip                    10.51s -> 12.03s
  test_replace_reuse_ip_using_host_id      13.01s -> 12.01s
  test_raft_replace_ignore_nodes           19.52s -> 19.52s
|
||
|
|
99ac36b353 |
test/cluster: add failure_detector_timeout fixture
Add a shared pytest fixture that scales the failure detector timeout by build mode factor (e.g. 3x for debug/sanitize, 2x for dev). |