scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-25 01:02:20 +00:00

Author	SHA1	Message	Date
Ferenc Szili	21f0ef209b	test: add test for intranode balance threshold in size-based mode Verify that the load balancer does not issue intranode migrations when the load difference between shards is within the size_based_balance_threshold, and that it does issue migrations when the difference exceeds the threshold. (cherry picked from commit `6856f51097`)	2026-05-13 16:58:04 +00:00
Nadav Har'El	5de73f5480	test/cluster/auth_cluster: use CREATE ROLE IF NOT EXISTS to fix flaky test test_create_role_mixed_cluster calls servers_add(2) to bootstrap two old nodes concurrently, then adds a new node before issuing CREATE ROLE. The concurrent bootstraps trigger the well-known Python driver bug (scylladb/python-driver#317): two on_add notifications race in update_created_pools, causing a second pool to be created for a host whose pool was already established. If CREATE ROLE is in-flight on the old pool when it is closed, the driver retries on the new pool, executing the statement twice. The second execution fails with "Role ... already exists", making the test flaky. Fix by using CREATE ROLE IF NOT EXISTS. This is safe because unique_name() generates a timestamp+random suffix that is guaranteed to be unique; the role can "already exist" only due to the driver double-execution bug, never due to a real conflict. This is the same workaround that has been applied many times elsewhere in our test suite for exactly the same root cause: - CREATE KEYSPACE was changed to CREATE KEYSPACE IF NOT EXISTS (scylladb#18368, later generalised in scylladb#22399 via new_test_keyspace helpers) - DROP KEYSPACE was changed to DROP KEYSPACE IF EXISTS (scylladb#29487) Fixes: SCYLLADB-1811 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#29732 (cherry picked from commit `983eb5ab43`) Closes scylladb/scylladb#29743	2026-05-13 09:30:52 +03:00
Nadav Har'El	594e8f35b4	test: fix replica_read_timeout_no_exception flakiness on slow systems The test uses a 10ms read timeout to exercise code paths that handle timed-out reads without throwing C++ exceptions. As part of setup, it inserts rows and flushes them to two SSTables, then runs a warm-up SELECT to populate internal caches (e.g. the auth cache) before the real test begins. The reason for this warm-up read was the possibility that the first read does additional operations (such as reading and caching authentication) that might throw exceptions internally. I couldn't verify that such exceptions actually happen in today's code, but they might (re)appear in the future, so we should keep the warm-up SELECT. On slow CI machines (aarch64, debug build), that warm-up SELECT can take longer than 10ms to read from the two SSTables. When it does, the read times out: the coordinator receives 0 responses from the local replica within the deadline and propagates a read_timeout_exception. Since the exception is not caught, it escapes the test lambda, is logged as "cql env callback failed", and causes Boost.Test to report a C++ failure at the do_with_cql_env_thread call site. This matches the CI failure seen in SCYLLADB-1774: ERROR ... replica_read_timeout_no_exception: cql env callback failed, error: exceptions::read_timeout_exception (Operation timed out for replica_read_timeout_no_exception.tbl - received only 0 responses from 1 CL=ONE.) The CI log also shows that only 12 reads were admitted (the warm-up read plus the 11 reads from the two prepare() calls and CREATE/INSERT statements made earlier), and the current permit was stuck in need_cpu state -- the reactor hadn't had a chance to schedule the read before the 10ms window elapsed. The fix catches read_timeout_exception from the warm-up SELECT and retries until the read succeeds. The warm-up is required for correctness: some lazy-init code paths (e.g. 
auth cache population) use C++ exceptions for control flow internally. Those exceptions must be absorbed before the cxx_exceptions baseline is sampled inside execute_test(); otherwise they would appear in the delta and cause a false test failure. Simply ignoring a timed-out warm-up is not safe, because the lazy-init exceptions would then fire during the 1000 test reads, inflating cxx_exceptions_after relative to cxx_exceptions_before. No other calls in setup are susceptible to the 10ms read timeout: - CREATE KEYSPACE, CREATE TABLE, INSERT, and flush use the write timeout (10s) and are not reads. - e.prepare() goes through the query processor without reading table data, so it is not subject to the read timeout. - The semaphore manipulation in Test 2 is internal and has no timeout. - All 1000 reads in execute_test() are expected to fail, so a timeout there is the happy path, not a failure. The 10ms timeout itself is fine for the test's purpose: it is deliberately aggressive so that reads reliably time out on the hot path being tested. The problem was only that the pre-test warm-up was not guarded against the same timeout. Fixes: SCYLLADB-1830 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#29731 (cherry picked from commit `1f15e05946`) Closes scylladb/scylladb#29760	2026-05-13 09:28:04 +03:00
Ferenc Szili	b1fad45a6d	test: fix flaky test_tablets_split_merge_with_many_tables In debug mode, this test can time out during tablets merge. While the test already decreases the number of tables in debug mode (20 tables, instead of 200 for dev mode), this is not enough, and the test can still time out during merge. This change reduces the number of tables from 20 to 5 in debug mode. It also drops the log level for load_balancer to debug. This should make any potential future problems with this test easier to investigate. Fixes: SCYLLADB-1863 Closes scylladb/scylladb#29682 (cherry picked from commit `ec4b483e88`) Closes scylladb/scylladb#29786	2026-05-13 09:18:30 +03:00
Botond Dénes	a0a61fe81f	Merge '[Backport 2026.2] load_balancer: fix tablet allocator dropped table' from Scylladb[bot] - Handle dropped tables gracefully in the tablet load balancer's `get_schema_and_rs()` instead of aborting with `on_internal_error` - The load balancer operates on a token metadata snapshot but accesses the live schema for table lookups. A DROP TABLE applied by another fiber between coroutine yield points can remove a table from the live schema while it still exists in the snapshot, causing an abort. `get_schema_and_rs()` now returns `std::optional` and logs a warning in debug log level instead of aborting when a table is missing. All callers skip dropped tables: - `make_sizing_plan`: skips to next table - `make_resize_plan`: skips to next table (merge suppression is moot) - `check_constraints`: returns `skip_info{}` with empty viable targets - `get_rs`: returns `nullptr`, checked by `check_constraints` The call chain is: `make_plan` → `make_internode_plan` → `check_constraints` → `get_rs` → `get_schema_and_rs`. The `make_internode_plan` coroutine has multiple `co_await` yield points (`maybe_yield`, `pick_candidate`) between building the candidate tablet list and checking replication constraints. A DROP TABLE schema mutation applied during any of these yields removes the table from `_db.get_tables_metadata()` while the candidate list still references it. Added `test_load_balancing_with_dropped_table` which simulates the race by capturing a token metadata snapshot, dropping the table, then calling `balance_tablets` with the stale snapshot. Fixes: SCYLLADB-1905 This fix needs to be backported to versions: 2025.4, 2026.1 - (cherry picked from commit `4987204f71`) - (cherry picked from commit `6b3e18c4a9`) Parent PR: #29585 Closes scylladb/scylladb#29818 * github.com:scylladb/scylladb: test: verify load balancer handles dropped tables gracefully tablet_allocator: handle dropped tables gracefully in get_schema_and_rs	2026-05-13 09:16:30 +03:00
Piotr Szymaniak	4d00019eff	test/alternator: stop avoiding tablets in Streams tests Alternator Streams now supports tablets, so stop skipping the TTL Streams test in tablet mode and stop forcing vnodes in the Streams audit test. Refs SCYLLADB-463 Closes scylladb/scylladb#29697 (cherry picked from commit `459c1dc32f`) Closes scylladb/scylladb#29819	2026-05-13 09:15:20 +03:00
Botond Dénes	473320df18	Merge '[Backport 2026.2] load_balance: fix drain with forced capacity-based balancing' from Scylladb[bot] When `force_capacity_based_balancing` is enabled and a node is being drained/excluded, the tablet allocator incorrectly aborts balancing due to incomplete tablet stats - even though capacity-based balancing doesn't depend on tablet sizes. The tablet allocator normally waits for complete load stats before balancing. An exception exists for drained+excluded nodes (they're unreachable and won't return stats). However, when forced capacity-based balancing is active, this exception was not being applied, causing the balancer to reject the drain plan. Adjust the condition in `tablet_allocator.cc` so that the "ignore missing data for drained nodes" logic applies regardless of whether capacity-based balancing is forced. Added a Boost unit test that forces capacity-based balancing and verifies a drained/excluded node gets its tablets migrated even when tablet size stats are missing. This bug was introduced in 2026.1, so this needs to be backported to 2026.1 and 2026.2 Fixes: SCYLLADB-1953 - (cherry picked from commit `906d2b817e`) - (cherry picked from commit `f7bc8f5fa7`) Parent PR: #29791 Closes scylladb/scylladb#29866 * github.com:scylladb/scylladb: test: boost: add drain test for forced capacity-based balancing service: allow draining with forced capacity-based balancing	2026-05-13 09:05:42 +03:00
Botond Dénes	ceae68b487	schema: fix DESCRIBE showing NullCompactionStrategy when compaction is disabled When a table's compaction is disabled via 'enabled': 'false', the DESCRIBE output incorrectly showed NullCompactionStrategy instead of the actual strategy. This happened because schema_properties() called compaction_strategy(), which returns compaction_strategy_type::null when compaction is disabled. Fix it by using configured_compaction_strategy(), which always returns the real strategy type - consistent with how schema_tables.cc serializes it to disk. Fixes SCYLLADB-1353 Closes scylladb/scylladb#29804 (cherry picked from commit `8d6f031a4a`) Closes scylladb/scylladb#29867	2026-05-13 08:59:59 +03:00
Andrzej Jackowski	3df25f1952	test: wait for TTL scheduling sanity metric The test samples sl:default runtime before and after setup writes to prove that it measures the scheduling group used by regular CQL writes. The metric is exported in milliseconds, so a single 200-row batch may not be visible immediately, or may be too small in some environments. Keep the original 200-row table size, but wait up to 30 seconds for the metric to advance. If it does not, retry the same writes before TTL is enabled. The retries update the same keys, so the expiration part of the test still waits for exactly the original number of rows. In a local 100-run with N=200 rows, the observed delta of `ms_statement_before - ms_statement_before_write` was: min=4.0, max=16.0, mean=8.13, and median=8.0. Therefore, it looks possible that in a rare corner case the delta drops even to 0. Fixes SCYLLADB-1869 Closes scylladb/scylladb#29797 (cherry picked from commit `89261bf759`) Closes scylladb/scylladb#29868	2026-05-13 08:59:23 +03:00
Wojciech Mitros	1ed765a381	test: run test_mv_admission_control_exception on one shard In the test we perform 2 consecutive writes where the first write is supposed to increase the view update backlog above the mv admission control threshold and the second one is expected to be rejected because of that. On each node/shard we have 2 types of view update backlogs: 1. for deciding whether we should admit writes 2. for propagating the backlog information to other nodes/shards. For the second write to be rejected, it must be performed on a node and shard which updated its backlog of type 1. The view update backlog of type 2. is immediately increased on the base table replica. For this backlog to be registered as a backlog of type 1., it needs to be either carried by gossip (happening once every second) or by attaching it to a replica write response. We don't want to increase the runtime of tests unnecessarily, so we don't wait and we rely on the second mechanism. The response to the first base table write (the one causing increase in the backlog) carries the increased backlog to the coordinator of this write. So for the second write to observe the increased backlog, it needs to be coordinated on the same node+shard as the first write. We make sure that both writes are coordinated on the same node+shard by using prepared statements combined with setting the host in `run_async`. Both writes target the same partition and with prepared statements we route them directly to the correct shard. That was the idea, at least. In practice, for the driver to learn the correct shard, it first needs to learn the token->shard mapping from the server. For vnodes it can expect a shard by calculating the token of the affected partition, but for tablets, it had no opportunity to learn the tablet->shard mapping so the first write may route to any shard. Additionally, we aren't guaranteed that the driver established connections to all shards on all nodes at the point of any write. 
So if a connection finishes establishing between the two writes, this may also cause us to coordinate these 2 writes on different shards, leading to a missed view backlog growth and not-rejected second write. We fix this in this patch by running the test using one shard on each node. This way, as long as we perform both writes on the same node, they'll also be coordinated on the same shard. This also makes the prepared statement and BoundStatement unnecessary — we can use SimpleStatement with FallthroughRetryPolicy directly. Fixes: SCYLLADB-1957 Closes scylladb/scylladb#29862 (cherry picked from commit `f3cf20803b`) Closes scylladb/scylladb#29873	2026-05-13 08:56:27 +03:00
Ferenc Szili	12f4280d1e	test: boost: add drain test for forced capacity-based balancing Add a Boost unit test that forces capacity-based balancing through configuration and verifies that a drained and excluded node will be drained of its tablets when tablet size stats are missing. The test covers the regression where the allocator rejected the plan due to incomplete tablet stats, even though forced capacity-based balancing does not depend on tablet sizes. (cherry picked from commit `f7bc8f5fa7`)	2026-05-12 12:59:33 +00:00
Asias He	714003ef2e	repair: Reject repair requests where start and end tokens are equal When a user calls the repair API with identical startToken and endToken values, the code creates a wrapping interval (T, T]. This causes unwrap() to split it into (-inf, T] and (T, +inf), covering the entire token ring and triggering a full repair. Reject such requests early with an error message matching Cassandra's behavior: "Start and end tokens must be different." Fixes: CUSTOMER-368 Closes scylladb/scylladb#29821 (cherry picked from commit `0204372156`) Closes scylladb/scylladb#29836	2026-05-12 11:58:04 +03:00
Calle Wilund	be2f0a8601	storage_service: Disable snapshots after raft decommission Fixes: SCYLLADB-1936 In case we abort a decommission operation, the snapshot/backup mechanism needs to remain open. This change moves the disabling of snapshots to after raft_decommission. In the case of a cluster snapshot, our node's ownership (or not) of tables will be serialized by raft anyway, so it should remain consistent. In that case we at worst coordinate from a node in "leave" status. In the case of a local snapshot, ownership matters less; only the sstables on disk matter, and they should not change. In the case of backup, this operates on a snapshot, the state of which is not affected. Adds an injection point for testing. v2: - Added injection point to ensure test can abort decommission Closes scylladb/scylladb#29667 (cherry picked from commit `2cc1a2c406`) Closes scylladb/scylladb#29848	2026-05-12 11:42:14 +03:00
Ferenc Szili	be56bf031f	test: verify load balancer handles dropped tables gracefully Add test_load_balancing_with_dropped_table that simulates the race between DROP TABLE and the load balancer by capturing a token metadata snapshot before dropping the table, then passing the stale snapshot to balance_tablets(). Verifies it completes without aborting and produces no migrations for the dropped table. (cherry picked from commit `6b3e18c4a9`)	2026-05-10 22:37:42 +00:00
Patryk Jędrzejczak	4f87c9c510	Merge 'topology_coordinator: join tablet load stats refresh in stop()' from Andrzej Jackowski Commit `2b7aa32` (topology_coordinator: Refresh load stats after table is created or altered) registered topology_coordinator as a schema change listener and added on_create_column_family which fire-and-forgets _tablet_load_stats_refresh.trigger(). The triggered task runs on the gossip scheduling group via with_scheduling_group and accesses the topology_coordinator via 'this'. stop() unregisters the listener but does not wait for any in-flight refresh task. If a notification fires between _tablet_load_stats_refresh.join() in run() and unregister_listener in stop(), the scheduled task can outlive the topology_coordinator and access freed memory after run_topology_coordinator's coroutine frame is destroyed. Wait for the refresh to complete in stop() after unregistering the listener, ensuring no task can fire after destruction. Fixes SCYLLADB-1728 Backport to 2026.1 and 2026.2, because the issue was introduced in `2b7aa32` Closes scylladb/scylladb#29653 * https://github.com/scylladb/scylladb: test: tablet_stats: reproduce shutdown refresh race topology_coordinator: join tablet load stats refresh in stop() (cherry picked from commit `d9dd3bfe53`) Closes scylladb/scylladb#29686	2026-05-10 13:56:42 +03:00
Nadav Har'El	f9aae8c2f1	Merge 'test: fix race window test flakiness from residual re-repair' from Avi Kivity Fix the persistent flakiness in `test_incremental_repair_race_window_promotes_unrepaired_data` (SCYLLADB-1478, reopened). After restarting servers[1], the topology coordinator can initiate a residual re-repair when it sees tablets stuck in the `repair` stage. This re-repair flushes memtables on all replicas and marks post-repair data as repaired, contaminating the test state and masking the compaction-merge bug the test is designed to detect. The assertion then fails on the next retry because the previous attempt's re-repair left behind repaired sstables containing post-repair keys. 1. Propagating `current_key` through the exception — correctly advanced the key counter on retry, but the contaminated tablet metadata from the prior re-repair (repaired sstables with post-repair keys) was still present, causing assertion failures on the next attempt. 2. DROP TABLE + CREATE TABLE between retries — the tablet metadata (sstables_repaired_at, repair stage) is tied to the tablet identity, and recreating the table in the same keyspace still showed residual state issues. Instead of trying to clean up contaminated state, each retry creates a completely fresh keyspace (unique name via `create_new_test_keyspace`). This gives entirely new tablets with no residual repair metadata from prior attempts. Combined with broader detection of coordinator changes and residual re-repairs, the test reliably retries before any contamination can cause false failures. The detection is now comprehensive: - Broadened coordinator check: any coordinator change (`new_coord != coord`), not just migration to servers[1] - Re-repair detection at three points: post-restart, during the compaction poll, and after injection release — grep for `"Initiating tablet repair host="` in the coordinator log 1. 
`test: extract _setup_table_for_race_window helper` — pure code-movement refactor that extracts keyspace+table+data+repair1+data+flush into a reusable helper. Easily verifiable as a no-op behavioral change. 2. `test: fix race window test flakiness from residual re-repair` — the actual fix: broadened detection logic + re-repair grep at 3 points + fresh-keyspace retry on exception. Passed 1000 consecutive runs with the fix applied. Without the fix, about 2% flakiness was observed in debug mode. Fixes: SCYLLADB-1743 So far, we haven't observed flakiness of this test on branches, so not backporting yet. Will backport if seen. Closes scylladb/scylladb#29721 * github.com:scylladb/scylladb: test: fix race window test flakiness from residual re-repair test: extract _setup_table_for_race_window helper for race window test (cherry picked from commit `d33bb6ea00`) Closes scylladb/scylladb#29761	2026-05-08 12:24:23 +02:00
Piotr Dulikowski	104e9b3c32	Merge 'table_helper: fix use-after-free on prepared-statement invalidation' from Marcin Maliszkiewicz insert() held no local strong ref to the prepared modification_statement across the suspension in execute(). On a single shard: 1. Fiber A suspends inside _insert_stmt->execute(). 2. DROP TABLE / DROP KEYSPACE on the target, or LRU eviction, removes the prepared_statements_cache entry, releasing its strong ref. 3. Fiber B re-enters cache_table_info(), sees _prepared_stmt (checked_weak_ptr) invalidated, and runs _insert_stmt = nullptr, releasing the last strong ref. The modification_statement is freed. 4. Fiber A resumes inside execute() and touches freed this. Pin strong ref to _insert_stmt locally before the suspension. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1667 Backport: all supported branches, it's memory corruption bug, long present Closes scylladb/scylladb#29588 github.com:scylladb/scylladb: test/boost: add dummy case to table_helper_test for non-injection modes test/boost: add regression test for table_helper insert() UAF utils/error_injection: add waiters() API table_helper: fix use-after-free on prepared-statement invalidation (cherry picked from commit `efcc0b6376`) Closes scylladb/scylladb#29747	2026-05-08 10:47:42 +02:00
Wojciech Mitros	4fc4f4e9f9	test: propagate view update backlog before partition delete In the test_delete_partition_rows_from_table_with_mv case we perform a deletion of a large partition to verify that the deletion will self-throttle when generating many view updates. Before the deletion, we first build the materialized view, which causes the view update backlog to grow. The backlog should be back to empty when the view building finishes, and we do wait for that to happen, but the information about the backlog drop may not be propagated to the delete coordinator in time - the gossip interval is 1s and we perform no other writes between the nodes in the meantime, so we don't make use of the "piggyback" mechanism of propagating view backlog either. If the coordinator thinks that the backlog is high on the replica, it may reject the delete, failing this test. We change this in this patch - after the view is built, we perform an extra write from the coordinator. When the write finishes, the coordinator will have the up-to-date view backlog and can proceed with the DELETE. Additionally, we enable the "update_backlog_immediately" injection, which makes the node backlog (the highest backlog across shards) update immediately after each change. Fixes: SCYLLADB-1877 Closes scylladb/scylladb#29775 (cherry picked from commit `ab12083525`) Closes scylladb/scylladb#29793	2026-05-07 22:43:18 +03:00
Piotr Dulikowski	851c605b1d	Merge '[Backport 2026.2] vector_search: test: fix flaky test_dns_resolving_repeated' from Scylladb[bot] The `vector_store_client_test_dns_resolving_repeated` test was intermittently timing out on CI. The exact root cause is not fully understood, but the hypothesis is that a single trigger signal can be lost somewhere (not exactly known where). This is not an issue for the production code because the refresh trigger is called repeatedly whenever all configured nodes are unreachable. Fixes SCYLLADB-1794 Backport to 2026.1 and 2026.2, as the same CI flakiness can occur on these branches. - (cherry picked from commit `4722be1289`) - (cherry picked from commit `207de967fb`) Parent PR: #29752 Closes scylladb/scylladb#29784 * github.com:scylladb/scylladb: vector_search: test: default timeout in test_dns_resolving_repeated vector_search: test: fix flaky test_dns_resolving_repeated	2026-05-07 14:34:34 +02:00
Karol Nowacki	44249c0a75	vector_search: test: default timeout in test_dns_resolving_repeated Replace explicit 1-second timeouts in repeat_until() with the default STANDARD_WAIT (10s). The 1-second timeout could be too aggressive for loaded CI environments where lowres_clock granularity (~10ms) combined with OS scheduling delays and resource contention (-c2 -m2G) could cause the loop to expire before the DNS refresh task completes its cycle. This also unifies test timeouts across test cases. (cherry picked from commit `207de967fb`)	2026-05-06 20:48:55 +00:00
Karol Nowacki	e9240587f4	vector_search: test: fix flaky test_dns_resolving_repeated Move trigger_dns_resolver() inside the repeat_until loop instead of calling it once before the loop. The test was intermittently timing out on CI. The exact root cause is not fully understood, but the hypothesis is that a single trigger signal can be lost somewhere (not exactly known where). This is not an issue for the production code because the refresh trigger is called repeatedly, in every query where all configured nodes are unreachable. By triggering inside the loop, we ensure the signal is re-sent on each iteration until the resolver actually performs the refresh and picks up the new (failing) DNS resolution. This makes the test resilient to timing-dependent signal loss without changing production code. Fixes: SCYLLADB-1794 (cherry picked from commit `4722be1289`)	2026-05-06 20:48:54 +00:00
Marcin Maliszkiewicz	b39c7fa034	test: add reproducer for auth cache crash on missing permissions column (cherry picked from commit `5c5306c692`)	2026-05-06 20:47:30 +00:00
Marcin Maliszkiewicz	fb6d5368bb	Merge 'auth: fix shutdown and startup races in LDAP cache pruner' from Andrzej Jackowski The LDAP role manager's `_cache_pruner` background fiber periodically calls cache::reload_all_permissions(). Two races cause it to hit SCYLLA_ASSERT(_permission_loader): - Cross-shard race: The pruner used `_cache.container().invoke_on_all()` to reload permissions on every shard. Since both `service::start()` and `sharded<service>::stop()` execute per-shard in parallel, the pruner on one shard could call reload_all_permissions() on another shard before that shard set its loader (startup) or after it cleared its loader (shutdown). Each shard runs its own pruner instance, so reloading locally is sufficient; this also removes redundant N² reload calls. - Intra-shard race: `service::stop()` cleared the permission loader and stopped the role manager concurrently (via when_all_succeed). A mid-reload pruner could yield and then call the now-null loader. Fixed by stopping the role manager first so the pruner is fully drained before the loader is cleared. Fixes SCYLLADB-1679 Backport to 2026.2, introduced in `7eedf50c12` Closes scylladb/scylladb#29605 * github.com:scylladb/scylladb: auth: make shutdown the exact reverse of startup test: ldap: add test for pruner crash during shutdown auth: start authorizer and set permission loader before role manager auth: stop role manager before clearing permission loader auth: reload LDAP permission cache on local shard only (cherry picked from commit `b0f988afc4`) Closes scylladb/scylladb#29681	2026-05-06 14:33:33 +02:00
Marcin Maliszkiewicz	9e0c86b7fd	Merge 'utils: loading_cache: add `insert()` that is a no-op when caching is disabled' from Dario Mirovic When `permissions_validity_in_ms` is set to 0, executing a prepared statement under authentication crashes with: ``` Assertion `caching_enabled()' failed. at utils/loading_cache.hh:319 in authorized_prepared_statements_cache::insert ``` `loading_cache::get_ptr()` asserts when caching is disabled (expiry == 0), but `authorized_prepared_statements_cache::insert()` was using it purely for its side effect of populating the cache, which is meaningless when caching is off. Add a new `loading_cache::insert(k, load)` method that is a no-op when caching is disabled and otherwise forwards to `get_ptr()`. Switch `authorized_prepared_statements_cache::insert()` to use it. This completes the disabled-mode safety contract of the cache for the write side, mirroring the fallback that `get()` already provides for the read side. Includes a regression test in `test/boost/loading_cache_test.cc` plus a positive test for the new `insert()` overload. Fixes SCYLLADB-1699 The crash was introduced a long time ago. It is present on all the live versions, from 2025.1 onward. No client tickets, but it should be backported. Closes scylladb/scylladb#29638 * github.com:scylladb/scylladb: test: boost: regression test for loading_cache::insert with caching disabled utils: loading_cache: add insert() that is a no-op when caching is disabled (cherry picked from commit `c00fee0316`) Closes scylladb/scylladb#29762	2026-05-06 14:27:41 +02:00
Patryk Jędrzejczak	6d09897339	Merge 'Barrier and drain logging' from Gleb Natapov Add more logging to barrier and drain rpc to try and pinpoint https://github.com/scylladb/scylladb/issues/26281 Backport since we want to have it if it happens in the field. Fixes: SCYLLADB-1836 Refs: #26281 Closes scylladb/scylladb#29735 * https://github.com/scylladb/scylladb: session, raft_topology: add periodic warnings for hung drain and stale version waits session: add info-level logging to drain_closing_sessions raft_topology: log sub-step progress in local_topology_barrier raft_topology: log read_barrier progress in topology cmd handler (cherry picked from commit `b69d00b0a7`) Closes scylladb/scylladb#29763	2026-05-06 10:26:44 +02:00
Yaniv Michael Kaul	5c8662d606	raft/group0: fix destroy assertion on startup failure If start_server_for_group0() successfully registers a server in _raft_gr._servers but a subsequent step (e.g. enable_in_memory_state_machine()) throws, the server is never destroyed because abort_and_drain()/destroy() check std::get_if<raft::group_id>(&_group0) which was only set after the entire with_scheduling_group block completed. Move _group0.emplace<raft::group_id>() inside the lambda, immediately after start_server_for_group() succeeds, so that cleanup paths can always find and destroy the registered server. This fixes the assertion: "raft_group_registry - stop(): server for group ... is not destroyed" which manifests during shutdown after an upgrade where topology_state_load() fails due to netw::unknown_address. Backport: Yes, to 2026.1, 2026.2, as it causes a crash on upgrades Refs: SCYLLADB-1217 Refs: CUSTOMER-340 Refs: CUSTOMER-335 Fixes: SCYLLADB-1809 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> AI-assisted: Yes, Opencode/Opus 4.6 Closes scylladb/scylladb#29702 (cherry picked from commit `6179406467`) Closes scylladb/scylladb#29742	2026-05-05 10:48:13 +02:00
Aleksandr Bykov	148e05820b	test: fix flaky test_kill_coordinator_during_op The test hardcoded the expected number of coordinator elections (2, 3, 4, 5) for each phase. If a prior phase triggered an extra election, subsequent phases would wait for a count that was already reached or would never match. Fix by reading the current election count before each operation and expecting exactly one more, making each phase independent of prior history. Also add wait_for_no_pending_topology_transition() calls after each coordinator election to ensure the topology state machine has fully settled before proceeding with restarts and further operations. Decrease the failure detector timeout (failure_detector_timeout_in_ms) to 2000 ms on all test nodes so that coordinator crashes are detected faster, reducing test wallclock time and timeout-related flakiness. Enable raft_topology=trace logging on all test nodes to aid post-failure diagnosis. Add diagnostic logging in wait_new_coordinator_elected(). Fixes: SCYLLADB-1790 Closes scylladb/scylladb#29284 (cherry picked from commit `8afdae24d2`) Closes scylladb/scylladb#29723	2026-05-02 16:27:16 +03:00
Łukasz Paszkowski	1438830348	sstables: only wipe TemporaryHashes for sstable formats that have it Commit `8d34127684` ("sstables: clean up TemporaryHashes file in wipe()") unconditionally calls filename(..., component_type::TemporaryHashes) inside filesystem_storage::wipe(). However, the TemporaryHashes component is only registered in the component map of the 'ms' sstable format. For older formats (ka, la, mc, md, me) the lookup goes through sstable_version_constants::get_component_map(version).at(...) and throws std::out_of_range. The exception is then swallowed by the outer catch(...) in wipe(), which just logs and ignores. As a side effect, the subsequent remove_file(new_toc_name) is never reached and the TemporaryTOC ('*-TOC.txt.tmp') file is left as an orphan on disk after every unlink() of a non-'ms' sstable. Guard the lookup with get_component_map(version).contains() so the cleanup is only attempted for formats that actually define the component. Add a regression test in test/boost/sstable_directory_test.cc that creates an 'me'-format sstable, unlinks it and asserts that the sstable directory is left empty. Without the fix the test fails with a leftover 'me-...-TOC.txt.tmp' file. Fixes: SCYLLADB-1767 Closes scylladb/scylladb#29620 (cherry picked from commit `7e14ea5ac8`) Closes scylladb/scylladb#29692	2026-04-30 21:49:31 +03:00
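A Python analogue of the failure mode described above may help: an unguarded map lookup throws, a blanket exception handler swallows it, and the later cleanup step is silently skipped. The component maps and file names below are hypothetical placeholders, not the real contents of `sstable_version_constants`.

```python
# Hypothetical component maps: only the 'ms' format defines TemporaryHashes.
COMPONENT_MAPS = {
    "me": {"TOC": "TOC.txt"},
    "ms": {"TOC": "TOC.txt", "TemporaryHashes": "TemporaryHashes.db.tmp"},
}
TEMPORARY_TOC = "TOC.txt.tmp"


def wipe(version, files, guarded):
    """Return the files left on 'disk' after wiping temporary components."""
    remaining = set(files)
    component_map = COMPONENT_MAPS[version]
    try:
        # Unguarded, this lookup raises KeyError for formats without the
        # component (the analogue of std::out_of_range from .at()).
        if not guarded or "TemporaryHashes" in component_map:
            remaining.discard(component_map["TemporaryHashes"])
        # Only reached if the lookup above did not throw.
        remaining.discard(TEMPORARY_TOC)
    except Exception:
        pass  # mirrors the outer catch(...) that logs and ignores
    return remaining
```

With `guarded=False`, wiping an `me` sstable leaves `TOC.txt.tmp` behind. With `guarded=True`, the directory ends up empty.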
Wojciech Mitros	d264fea176	replica/database: fix cross-shard deadlock in lock_tables_metadata() lock_tables_metadata() acquires a write lock on tables_metadata._cf_lock on every shard. It used invoke_on_all(), which dispatches lock acquisitions to all shards in parallel via parallel_for_each + smp::submit_to. When two fibers call lock_tables_metadata() concurrently, this can deadlock. parallel_for_each starts all iterations unconditionally: even when the local shard's lock attempt blocks (because the other fiber already holds it), SMP messages are still sent to remote shards. Both fibers' lock-acquisition messages land in the per-shard SMP queues. The SMP queue itself is FIFO, but process_incoming() drains it and schedules each item as a reactor task via add_task(), which, in debug and sanitize builds with SEASTAR_SHUFFLE_TASK_QUEUE, shuffles each newly added task against all pending tasks in the same scheduling group's reactor task queue. This means fiber A's lock acquisition can be reordered past fiber B's (and past unrelated tasks) on a given shard. If fiber A wins the lock on shard X while fiber B wins on shard Y, this creates a classic cross-shard lock-ordering deadlock (circular wait). In production builds without SEASTAR_SHUFFLE_TASK_QUEUE, the reactor task queue is FIFO. Still, even in release builds, the SMP queues can reorder messages, so the deadlock remains possible, though much less likely. In debug and sanitize builds, the task-queue shuffle makes the deadlock very likely whenever both fibers' lock-acquisition tasks are pending simultaneously in the reactor task queue on any shard. This deadlock was exposed by `ce00d61917` ("db: implement large_data virtual tables with feature flag gating", merged as `88a8324e68`), which introduced legacy_drop_table_on_all_shards as a second caller of lock_tables_metadata(). When LARGE_DATA_VIRTUAL_TABLES is enabled during topology_state_load (via feature_service::enable), two fibers can race: 1.
activate_large_data_virtual_tables() — calls legacy_drop_table_on_all_shards() which calls lock_tables_metadata() synchronously via .get() 2. reload_schema_in_bg() — fires as a background fiber from TABLE_DIGEST_INSENSITIVE_TO_EXPIRY, eventually reaches schema_applier::commit() which also calls lock_tables_metadata() If both reach lock_tables_metadata() while the lock is free on all shards, the parallel acquisition creates the deadlock opportunity. The deadlock blocks topology_state_load() from completing, which prevents the bootstrapping node from finishing its topology state transitions. The coordinator's topology coordinator then waits for the node to reach the expected state, but the node is stuck, so eventually the read_barrier times out after 300 seconds. Fix by acquiring the shard 0 lock first before attempting to acquire any other lock. Whichever fiber wins shard 0 is guaranteed to acquire all remaining shards before the other fiber can proceed past shard 0, eliminating the circular-wait condition. Tested manually with 2 approaches: 1. causing different shard locks to be acquired by different lock_tables_metadata() calls by adding different sleeps depending on the lock_tables_metadata() call and target shard - this reproduced the issue consistently 2. matching the time point at which both fibers reach lock_tables_metadata() adding a single sleep to one of the fibers - this heavily depends on the machine so we can't create a universal reproducer this way, but it did result in the observed failure on my machine after finding the right sleep time Also added a unit test for concurrent lock_tables_metadata() calls. Fixes: SCYLLADB-1784 Fixes: SCYLLADB-1785 Fixes: SCYLLADB-1786 Closes scylladb/scylladb#29678 (cherry picked from commit `ebaf536449`) Closes scylladb/scylladb#29709	2026-04-30 21:08:15 +03:00
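The lock-ordering fix described in this commit can be modelled with `asyncio` locks standing in for per-shard locks. This is a sketch under simplifying assumptions (one event loop instead of Seastar shards, random sleeps instead of task-queue shuffling), and all names are illustrative rather than taken from the Scylla code.

```python
import asyncio
import random


async def acquire_with_jitter(lock):
    """Acquire a 'remote shard' lock after a random delay (simulated reordering)."""
    await asyncio.sleep(random.uniform(0, 0.002))
    await lock.acquire()


async def lock_all_shards(locks):
    """Take shard 0 first; only its holder may go on to take the other shards."""
    await locks[0].acquire()
    await asyncio.gather(*(acquire_with_jitter(lock) for lock in locks[1:]))


async def fiber(name, locks, log):
    await lock_all_shards(locks)
    log.append(name)              # critical section: all shard locks are held
    for lock in locks:
        lock.release()


async def run_rounds(shards=4, rounds=20):
    locks = [asyncio.Lock() for _ in range(shards)]
    log = []
    for _ in range(rounds):
        # The timeout turns a deadlock into a visible failure instead of a hang.
        await asyncio.wait_for(
            asyncio.gather(fiber("A", locks, log), fiber("B", locks, log)),
            timeout=5,
        )
    return log
```

The loser of shard 0 holds no other lock while it waits, so the circular wait between two callers cannot form, regardless of how the remaining acquisitions are ordered.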
Botond Dénes	9622291e07	Merge 'test/cluster/test_incremental_repair: fix flaky coordinator-change scenario' from Avi Kivity - Ensure servers[1] is not the topology coordinator before restarting it, preventing the leader death + re-election + re-repair sequence that masked the compaction-merge bug - Add a retry loop that detects post-restart leadership transfer to servers[1] via direct coordinator query, retrying up to 5 times Fixes: SCYLLADB-1743 Backporting to 2026.2, which sees the failure regularly. Closes scylladb/scylladb#29671 * github.com:scylladb/scylladb: test/cluster/test_incremental_repair: add retry for residual leadership race test/cluster/test_incremental_repair: fix flaky coordinator-change scenario (cherry picked from commit `3ea4af1c8c`) Closes scylladb/scylladb#29677	2026-04-30 08:46:36 +03:00
Piotr Szymaniak	d5efd1f676	test/cluster: wait for Alternator readiness in server startup server_add() only waits for CQL readiness before returning. The Alternator HTTP port may not be listening yet, causing ConnectionRefused errors in Alternator tests. Extend the ServerUpState enum and startup loop to also check Alternator port readiness when configured. When one or more Alternator ports are configured, each is verified to be connectable and queryable, similar to how CQL ports are probed. Fixes SCYLLADB-1701 Closes scylladb/scylladb#29625	2026-04-25 16:35:44 +03:00
Piotr Smaron	d14d07a079	test: fix flaky test_sstable_write_large_{row,cell} by using a fixed partition key Commit `ce00d61917` ("db: implement large_data virtual tables with feature flag gating") changed these two tests to construct their mutation with a randomly generated partition key (simple_schema::make_pkey()) instead of the previously fixed pk "pv", with the comment that this avoids a "Failed to generate sharding metadata" error. simple_schema::make_pkey() delegates to tests::generate_partition_key(), which defaults to key_size{1, 128}, i.e. the partition key length is uniformly random in [1, 128] bytes. That interacts badly with the fact that both tests pick thresholds at exact byte boundaries of the MC sstable row encoding: - The large-data handler records a row's size as _data_writer->offset() - current_pos (sstables/mx/writer.cc: collect_row_stats()), i.e. the number of bytes the row took on disk. - For the first clustering row, the body includes a vint-encoded prev_row_size = pos - _prev_row_start. - _prev_row_start is captured at the start of the partition (consume_new_partition()) before the partition key is written to the data stream, so prev_row_size rolls in the partition key's serialized length (2-byte prefix + pk bytes) + deletion_time + static row size. A random-size partition key therefore perturbs the first clustering row's encoded size by 1-2 bytes across runs (the vint of prev_row_size crosses the 128 boundary), flipping the test's byte-exact threshold comparison. On seed 2104744000 this produced: critical check row_size_count == expected.size() has failed [3 != 2] Fix the two byte-exact-sensitive tests by reverting their partition key to the fixed s.new_mutation("pv") used before `ce00d61917`. Under smp=1 (which these tests run with, per -c1 in the test invocation) a fixed key is always shard-local, so no sharding-metadata issue arises here. 
The other tests modified by `ce00d61917` (test_sstable_log_too_many_rows, test_sstable_log_too_many_dead_rows, test_sstable_too_many_collection_elements, test_large_data_records_round_trip, etc.) assert on row/element counts or use thresholds with enough slack that the partition key size does not matter, and are left unchanged. Add an explanatory comment to each fixed site so the pitfall is not re-introduced by a future refactor. Verified stable with: ./test.py --mode=dev test/boost/sstable_3_x_test.cc::test_sstable_write_large_row --repeat 100 --max-failures 1 ./test.py --mode=dev test/boost/sstable_3_x_test.cc::test_sstable_write_large_cell --repeat 100 --max-failures 1 ./test.py --mode=release test/boost/sstable_3_x_test.cc::test_sstable_write_large_row --repeat 100 --max-failures 1 ./test.py --mode=release test/boost/sstable_3_x_test.cc::test_sstable_write_large_cell --repeat 100 --max-failures 1 All four invocations: 100/100 passed. Fixes: SCYLLADB-1685 Closes scylladb/scylladb#29621	2026-04-25 16:32:02 +03:00
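To see why a random partition key length can flip a byte-exact threshold, it helps to compute how many bytes an unsigned vint occupies. The sketch below assumes the Cassandra-style encoding (7 payload bits in a one-byte value, at most 9 bytes in total). `other_bytes` is a hypothetical placeholder for the deletion_time and static-row contribution, not a value taken from the code.

```python
def unsigned_vint_size(value: int) -> int:
    """Encoded size in bytes of an unsigned vint (assumed Cassandra-style)."""
    if value < 0:
        raise ValueError("unsigned vint cannot encode a negative value")
    bits = max(1, value.bit_length())
    return min(9, (bits + 6) // 7)   # ceil(bits / 7), capped at 9 bytes


def prev_row_size_vint_bytes(pk_len: int, other_bytes: int) -> int:
    """Bytes taken by the vint of prev_row_size for the first clustering row.

    prev_row_size includes the serialized partition key (2-byte length
    prefix + key bytes) plus `other_bytes` (hypothetical placeholder).
    """
    prev_row_size = 2 + pk_len + other_bytes
    return unsigned_vint_size(prev_row_size)
```

With `other_bytes=10`, a 115-byte key gives `prev_row_size == 127` (a 1-byte vint), while a 116-byte key gives 128 (a 2-byte vint), so the row's on-disk size changes even though the row itself is identical.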
Botond Dénes	70261dc674	Merge 'test/cluster: scale failure_detector_timeout_in_ms by build mode' from Marcin Maliszkiewicz The failure_detector_timeout_in_ms override of 2000ms in 6 cluster test files is too aggressive for debug/sanitize builds. During node joins, the coordinator's failure detector times out on RPC pings to the joining node while it is still applying schema snapshots, marks it DOWN, and bans it — causing flaky test failures. Scale the timeout by MODES_TIMEOUT_FACTOR (3x for debug/sanitize, 2x for dev, 1x for release) via a shared failure_detector_timeout fixture in conftest.py. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1587 Backport: no, elasticsearch analyser shows only a single failure Closes scylladb/scylladb#29522 * github.com:scylladb/scylladb: test/cluster: scale failure_detector_timeout_in_ms by build mode test/cluster: add failure_detector_timeout fixture	2026-04-24 09:10:43 +03:00
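The scaling rule stated above (3x for debug/sanitize, 2x for dev, 1x for release) amounts to a lookup and a multiplication. The dictionary shape and the helper name below are assumptions for illustration; the real `MODES_TIMEOUT_FACTOR` and the conftest.py fixture may be structured differently.

```python
# Assumed shape; the factors themselves are the ones quoted in the commit message.
MODES_TIMEOUT_FACTOR = {"release": 1, "dev": 2, "debug": 3, "sanitize": 3}

BASE_FAILURE_DETECTOR_TIMEOUT_MS = 2000


def scaled_failure_detector_timeout(mode: str) -> int:
    """Return the failure detector timeout in ms for a given build mode."""
    return BASE_FAILURE_DETECTOR_TIMEOUT_MS * MODES_TIMEOUT_FACTOR[mode]
```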
Botond Dénes	d280517e27	test/cluster/test_incremental_repair: fix flaky do_tablet_incremental_repair_and_ops The log grep in get_sst_status searched from the beginning of the log (no from_mark), so the second-repair assertions were checking cumulative counts across both repairs rather than counts for the second repair alone. The expected values (sst_add==2, sst_mark==2) relied on this cumulative behaviour: 1 from the first repair + 1 from the second = 2. This works when the second repair encounters exactly one unrepaired sstable, but fails whenever the second repair sees two. The second repair can see two unrepaired sstables when the 100 keys inserted before it (via asyncio.gather) trigger a background auto-flush before take_storage_snapshot runs. take_storage_snapshot always flushes the memtable itself, so if an auto-flush already split the batch into two sstables on disk, the second repair's snapshot contains both and logs "Added sst" twice, making the cumulative count 3 instead of 2. Fix: take a log mark per-server before each repair call and pass it to get_sst_status so each check counts only the entries produced by that repair. The expected values become 1/0/1 and 1/1/1 respectively, independent of how many sstables happened to exist beforehand. get_sst_status gains an optional from_mark parameter (default None) which preserves existing call sites that intentionally grep from the start of the log. Fixes: SCYLLADB-1086 Closes scylladb/scylladb#29484	2026-04-23 17:17:16 +02:00
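The difference between grepping from the start of the log and grepping from a mark can be shown with a list of log lines. This is a simplified model: `take_mark` and `count_matches` are hypothetical helpers, not the signature of `get_sst_status`.

```python
def take_mark(log_lines):
    """Record the current end of the log; later searches can start from here."""
    return len(log_lines)


def count_matches(log_lines, pattern, from_mark=None):
    """Count lines containing `pattern`, optionally only those after a mark."""
    start = 0 if from_mark is None else from_mark
    return sum(1 for line in log_lines[start:] if pattern in line)
```

If the first repair logs "Added sst" once and the second logs it twice, the cumulative count is 3, while the count from a mark taken before the second repair is 2, which is the number that repair actually produced.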
Wojciech Mitros	7634d3f7d4	test/cluster: fix flaky test_hints_consistency_during_replace The test creates a sync point immediately after writing 100 rows with CL=ANY, without waiting for pending hint writes to complete. store_hint() is fire-and-forget: it submits do_store_hint() to a gate and returns immediately. do_store_hint() updates _last_written_rp only after writing to the commitlog. If create_sync_point() is called before all do_store_hint() coroutines complete, the captured replay position is stale, and await_sync_point() returns DONE before all hints are replayed, leaving some rows missing. Fix by waiting for the size_of_hints_in_progress metric to reach zero before creating the sync point, ensuring all in-flight hint writes have completed and _last_written_rp is up to date. This follows the same pattern already used in test_sync_point. Fixes: SCYLLADB-1560 Closes scylladb/scylladb#29623	2026-04-23 17:03:48 +02:00
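The stale sync point can be reproduced in a toy model where hint writes are fire-and-forget tasks. This is only a sketch of the race: the class and its attributes are illustrative stand-ins for the hints manager, not its real interface.

```python
import asyncio


class ToyHintManager:
    def __init__(self):
        self.last_written_rp = 0   # replay position of the last completed write
        self.in_progress = 0       # stand-in for the size_of_hints_in_progress metric

    def store_hint(self, rp):
        """Fire-and-forget: schedule the write and return immediately."""
        self.in_progress += 1
        asyncio.ensure_future(self._do_store_hint(rp))

    async def _do_store_hint(self, rp):
        await asyncio.sleep(0.01)  # simulated commitlog write
        self.last_written_rp = max(self.last_written_rp, rp)
        self.in_progress -= 1

    def create_sync_point(self):
        return self.last_written_rp


async def demo():
    manager = ToyHintManager()
    for rp in range(1, 101):
        manager.store_hint(rp)
    stale = manager.create_sync_point()   # taken before any write finished
    while manager.in_progress > 0:        # the fix: wait for the metric to hit zero
        await asyncio.sleep(0.001)
    fresh = manager.create_sync_point()
    return stale, fresh
```

In this model the first sync point captures position 0 and the second captures 100, which shows why a sync point taken too early can report completion before all hints are replayed.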
Botond Dénes	b49cf6247f	test: fix flaky test_read_repair_with_trace_logging by reading tracing with CL=ALL Tracing events are written to system_traces.events with CL=ANY, so they are only guaranteed to be present on the local node of the query coordinator. Reading them back with the driver default (CL=LOCAL_ONE) may route the query to a replica that has not yet received all events, causing the assertion on 'digest mismatch, starting read repair' to fail intermittently. Fix execute_with_tracing() to read tracing via the ResponseFuture API with query_cl=ConsistencyLevel.ALL, so events from all replicas are merged before the caller inspects them. Fixes: SCYLLADB-1633 Closes scylladb/scylladb#29566	2026-04-23 16:57:29 +02:00
Michał Jadwiszczak	878f341338	test/cluster/test_view_building_coordinator: fix view_updates_drained predicate The previous fix for the flakiness in test_file_streaming waited for the scylla_database_view_update_backlog metric to drop to 0 via wait_for(view_updates_drained, ...). However, the predicate returned True/False, while wait_for treats any non-None result as 'done' and keeps retrying only on None. So when the backlog was non-zero the predicate returned False, which wait_for interpreted as success and returned immediately - the test could then stop servers[0]/servers[1] before the view updates generated by new_server from the migrated staging sstable were actually delivered, leading to a partially populated MV (e.g. 431/1000 rows) and a failing assertion. Fix the predicate to return None instead of False when the backlog is not yet drained, so wait_for will actually retry until the metric reaches 0 (or the deadline is hit). Fixes SCYLLADB-1182 Closes scylladb/scylladb#29587	2026-04-23 17:52:22 +03:00
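The predicate bug is easy to reproduce with a simplified, synchronous stand-in for `wait_for` that follows the contract described above (any non-None result means "done"). The helper below is an assumption-laden sketch, not the test library's implementation, and `FakeBacklog` is a hypothetical metric source.

```python
import time


def wait_for(pred, deadline, period=0.001):
    """Retry pred() until it returns something other than None."""
    while True:
        result = pred()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("predicate never returned a non-None value")
        time.sleep(period)


class FakeBacklog:
    """Hypothetical metric that drains over successive reads."""

    def __init__(self, readings):
        self.readings = list(readings)
        self.calls = 0

    def read(self):
        value = self.readings[min(self.calls, len(self.readings) - 1)]
        self.calls += 1
        return value


def buggy_predicate(backlog):
    return backlog.read() == 0            # False counts as "done" for wait_for


def fixed_predicate(backlog):
    return True if backlog.read() == 0 else None   # None means "retry"
```

With readings `[5, 3, 0]`, the buggy predicate makes `wait_for` return `False` after a single read, while the fixed one keeps polling until the third read reports 0.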
Calle Wilund	c97ce32f47	Update position in dma_read(iovec) in create_file_for_seekable_source Fixes: SCYLLADB-1523 As is, the returned file object does not advance the file position in this read path. One-line fix. Added a test to make sure this read path works as expected. Closes scylladb/scylladb#29456	2026-04-23 14:54:20 +03:00
Michael Litvak	3468e8de8b	test/mv/test_mv_staging: wait for cql after restart Wait for cql on all hosts after restarting a server in the test. The problem that was observed is that the test restarts servers[1] and doesn't wait for the cql to be ready on it. On test teardown it drops the keyspace, trying to execute it on the host that is not ready, and fails. Fixes SCYLLADB-1632 Closes scylladb/scylladb#29562	2026-04-23 12:40:19 +02:00
Marcin Maliszkiewicz	3df951bc9c	Merge 'audit: set audit_info for native-protocol BATCH messages' from Andrzej Jackowski Commit `16b56c2451` ("Audit: avoid dynamic_cast on a hot path") moved audit info into batch_statement via set_audit_info(), but only wired it for the CQL-text BATCH path (raw::batch_statement::prepare()). Native-protocol BATCH messages (opcode 0x0D), handled by process_batch_internal in transport/server.cc, construct a batch_statement without setting audit_info. This causes audit to silently skip the entire batch. Set audit_info on the batch_statement so these batches are audited. Fixes SCYLLADB-1652 No backport - bug introduced recently. Closes scylladb/scylladb#29570 * github.com:scylladb/scylladb: test/audit: add reproducer for native-protocol batch not being audited audit: set audit_info for native-protocol BATCH messages test/audit: rename internal test methods to avoid CI misdetection	2026-04-22 18:56:28 +02:00
Botond Dénes	eb3326b417	Merge 'test.py: migrate all bare skips to typed skip markers' from Artsiom Mishuta should be merged after #29235 Complete the typed skip markers migration started in the plugin PR. Every bare `@pytest.mark.skip` decorator and `pytest.skip()` runtime call across the test suite is replaced with a typed equivalent, making skip reasons machine-readable in JUnit XML and Allure reports. 62 files changed across 8 commits, covering ~127 skip sites in total. Bare `pytest.skip` provides only a free-text reason string. CI dashboards (JUnit, Allure) cannot distinguish between a test skipped due to a known bug, a missing feature, a slow test, or an environment limitation. This makes it hard to track skip debt, prioritize fixes, or filter dashboards by skip category. The typed markers (`skip_bug`, `skip_not_implemented`, `skip_slow`, `skip_env`) introduced by the `skip_reason_plugin` solve this by embedding a `skip_type` field into every skip report entry. \| Type \| Count \| Files \| Description \| \|------\|-------\|-------\|-------------\| \| `skip_bug` \| 24 \| 16 \| Skip reason references a known bug/issue \| \| `skip_not_implemented` \| 10 \| 5 \| Feature not yet implemented in Scylla \| \| `skip_slow` \| 4 \| 3 \| Test too slow for regular CI runs \| \| `skip_not_implemented` (bare) \| 2 \| 1 \| Bare `@pytest.mark.skip` with no reason (COMPACT STORAGE, #3882) \| \| Type \| Count \| Files \| Description \| \|------\|-------\|-------\|-------------\| \| `skip_env` \| ~85 \| 34 \| Feature/config/topology not available at runtime \| \| `skip_bug` \| 2 \| 2 \| Known bugs: Streams on tablets (#23838), coroutine task not found (#22501) \| - Comments: 7 comments/docstrings across 5 files updated from `pytest.skip()` to `skip()` - Plugin hardened: `warnings.warn()` → `pytest.UsageError` for bare `@pytest.mark.skip` at collection time — bare skips are now a hard error, not a warning - Guard tests: New `test/pylib_test/test_no_bare_skips.py` with 3 tests that 
prevent regression: - AST scan for bare `@pytest.mark.skip` decorators - AST scan for bare `pytest.skip()` runtime calls - Real `pytest --collect-only` against all Python test directories Runtime skip sites use the convenience wrappers from `test.pylib.skip_types`: ```python from test.pylib.skip_types import skip_env ``` Usage: ```python skip_env("Tablets not enabled") ``` 1. test: migrate @pytest.mark.skip to @pytest.mark.skip_bug for known bugs — 24 decorator sites, 16 files 2. test: migrate @pytest.mark.skip to @pytest.mark.skip_not_implemented — 10 decorator sites, 5 files 3. test: migrate @pytest.mark.skip to @pytest.mark.skip_slow — 4 decorator sites, 3 files 4. test: migrate bare @pytest.mark.skip to skip_not_implemented — 2 bare decorators, 1 file 5. test: migrate runtime pytest.skip() to typed skip_env() — ~85 sites, 34 files 6. test: migrate runtime pytest.skip() to typed skip_bug() — 2 sites, 2 files 7. test: update comments referencing pytest.skip() to skip() — 7 comments, 5 files 8. 
test/pylib: reject bare pytest.mark.skip and add codebase guards — plugin hardening + 3 guard tests - All 60 plugin + guard tests pass (`test/pylib_test/`) - No bare `@pytest.mark.skip` or `pytest.skip()` calls remain in the codebase - `pytest --collect-only` succeeds across all test directories with the hardened plugin SCYLLADB-1349 Closes scylladb/scylladb#29305 * github.com:scylladb/scylladb: test/alternator: replace bare pytest.skip() with typed skip helpers test: migrate new bare skips introduced by upstream after rebase test/pylib: reject bare pytest.mark.skip and add codebase guards test: update comments referencing pytest.skip() to skip_env() test: migrate runtime pytest.skip() to typed skip_bug() test: migrate runtime pytest.skip() to typed skip_env() test: migrate bare @pytest.mark.skip to skip_not_implemented test: migrate @pytest.mark.skip to @pytest.mark.skip_slow test: migrate @pytest.mark.skip to @pytest.mark.skip_not_implemented test: migrate @pytest.mark.skip to @pytest.mark.skip_bug for known bugs	2026-04-22 15:48:27 +03:00
Artsiom Mishuta	183c6d120e	test: exclude pylib_test from default test runs Add pylib_test to norecursedirs in pytest.ini so it is not collected during ./test.py or pytest test/ runs, but can still be run directly via 'pytest test/pylib_test'. Also fix pytest log cleanup: worker log files (pytest_gw*) were not being deleted on success because cleanup was restricted to the main process only. Now each process (main and workers) cleans up its own log file on success. Closes scylladb/scylladb#29551	2026-04-22 11:38:40 +02:00
Botond Dénes	18ceeaf3ef	Merge 'Restrict tombstone GC sstable set to repaired sstables for tombstone_gc=repair mode' from Raphael Raph Carvalho When tombstone_gc=repair, the repaired compaction view's sstable_set_for_tombstone_gc() previously returned all sstables across all three views (unrepaired, repairing, repaired). This is correct but unnecessarily expensive: the unrepaired and repairing sets are never the source of a GC-blocking shadow when tombstone_gc=repair, for base tables. The key ordering guarantee that makes this safe is: - topology_coordinator sends send_tablet_repair RPC and waits for it to complete. Inside that RPC, mark_sstable_as_repaired() runs on all replicas, moving D from repairing → repaired (repaired_at stamped on disk). - Only after the RPC returns does the coordinator commit repair_time + sstables_repaired_at to Raft. - gc_before = repair_time - propagation_delay only advances once that Raft commit applies. Therefore, when a tombstone T in the repaired set first becomes GC-eligible (its deletion_time < gc_before), any data D it shadows is already in the repaired set on every replica. This holds because: - The memtable is flushed before the repairing snapshot is taken (take_storage_snapshot calls sg->flush()), capturing all data present at repair time. - Hints and batchlog are flushed before the snapshot, ensuring remotely-hinted writes arrive before the snapshot boundary. - Legitimate unrepaired data has timestamps close to 'now', always newer than any GC-eligible tombstone (USING TIMESTAMP to write backdated data is user error / UB). Excluding the repairing and unrepaired sets from the GC shadow check cannot cause any tombstone to be wrongly collected. The memtable check is also skipped for the same reason: memtable data is either newer than the GC-eligible tombstone, or was flushed into the repairing/repaired set before gc_before advanced. Safety restriction — materialized views: The optimization IS applied to materialized view tables. 
Two possible paths could inject D_view into the MV's unrepaired set after MV repair: view hints and staging via the view-update-generator. Both are safe: (1) View hints: flush_hints() creates a sync point covering BOTH _hints_manager (base mutations) AND _hints_for_views_manager (view mutations). It waits until ALL pending view hints — including D_view entries queued in _hints_for_views_manager while the target MV replica was down — have been replayed to the target node before take_storage_snapshot() is called. D_view therefore lands in the MV's repairing sstable and is promoted to repaired. When a repaired compaction then checks for shadows it finds D_view in the repaired set, keeping T_mv non-purgeable. (2) View-update-generator staging path: Base table repair can write a missing D_base to a replica via a staging sstable. The view-update-generator processes the staging sstable ASYNCHRONOUSLY: it may fire arbitrarily later, even after MV repair has committed repair_time and T_mv has been GC'd from the repaired set. However, the staging processor calls stream_view_replica_updates() which performs a READ-BEFORE-WRITE via as_mutation_source_excluding_staging(): it reads the CURRENT base table state before building the view update. If T_base was written to the base table (as it always is before the base replica can be repaired and the MV tombstone can become GC-eligible), the view_update_builder sees T_base as the existing partition tombstone. D_base's row marker (ts_d < ts_t) is expired by T_base, so the view update is a no-op: D_view is never dispatched to the MV replica. No resurrection can occur regardless of how long staging is delayed. A potential sub-edge-case is T_base being purged BEFORE staging fires (leaving D_base as the sole survivor, so stream_view_replica_updates would dispatch D_view). 
This is blocked by an additional invariant: for tablet-based tables, the repair writer stamps repaired_at on staging sstables (repair_writer_impl::create_writer sets mark_as_repaired = true and perform_component_rewrite writes repaired_at = sstables_repaired_at + 1 on every staging sstable). After base repair commits sstables_repaired_at to Raft, the staging sstable satisfies is_repaired(sstables_repaired_at, staging_sst) and therefore appears in make_repaired_sstable_set(). Any subsequent base repair that advances sstables_repaired_at further still includes the staging sstable (its repaired_at ≤ new sstables_repaired_at). D_base in the staging sstable thus shadows T_base in every repaired compaction's shadow check, keeping T_base non-purgeable as long as D_base remains in staging. A base table hint also cannot bypass this. A base hint is replayed as a base mutation. The resulting view update is generated synchronously on the base replica and sent to the MV replica via _hints_for_views_manager (path 1 above), not via staging. USING TIMESTAMP with timestamps predating (gc_before + propagation_delay) is explicitly UB and excluded from the safety argument. For tombstone_gc modes other than repair (timeout, immediate, disabled) the invariant does not hold for base tables either, so the full storage-group set is returned. The expected gain is reduced bloom filter and memtable key-lookup I/O during repaired compactions: the unrepaired set is typically the largest (it holds all recent writes), yet for tombstone_gc=repair it never influences GC decisions. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-231. Closes scylladb/scylladb#29310 * github.com:scylladb/scylladb: compaction: Restrict tombstone GC sstable set to repaired sstables for tombstone_gc=repair mode test/repair: Add tombstone GC safety tests for incremental repair	2026-04-22 10:21:37 +03:00
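The GC-eligibility condition used throughout the argument above (`gc_before = repair_time - propagation_delay`, with a tombstone eligible when `deletion_time < gc_before`) can be written down directly. This is a sketch of the arithmetic only; the function names are illustrative, and the real code works on its own time types.

```python
from datetime import datetime, timedelta, timezone


def gc_before(repair_time: datetime, propagation_delay: timedelta) -> datetime:
    """Point in time before which tombstones may be collected."""
    return repair_time - propagation_delay


def is_gc_eligible(deletion_time: datetime,
                   repair_time: datetime,
                   propagation_delay: timedelta) -> bool:
    """A tombstone is eligible once its deletion_time precedes gc_before."""
    return deletion_time < gc_before(repair_time, propagation_delay)
```

For example, with a repair committed at 12:00 and a one-hour propagation delay, a tombstone deleted at 10:30 is eligible, while one deleted at 11:30 is not.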
Avi Kivity	f5eb99f149	test: bump multishard_query_test querier_cache TTL to 60s to avoid flake Three test cases in multishard_query_test.cc set the querier_cache entry TTL to 2s and then assert, between pages of a stateful paged query, that cached queriers are still present (population >= 1) and that time_based_evictions stays 0. The 2s TTL is not load-bearing for what these tests exercise — they are checking the paging-cache handoff, not TTL semantics. But on busy CI runners (SCYLLADB-1642 was observed on aarch64 release), scheduling jitter between saving a reader and sampling the population can exceed 2s. When that happens, the TTL fires, both saved queriers are time-evicted, population drops to 0, and the assertion `require_greater_equal(saved_readers, 1u)` fails. The trailing `require_equal(time_based_evictions, 0)` check never runs because the earlier assertion has already aborted the iteration — which is why the Jenkins failure surfaces only as a bare "C++ failure at seastar_test.cc:93". Reproduced deterministically in test_read_with_partition_row_limits by injecting a `seastar::sleep(2500ms)` between the save and the sample: the hook then reports population=0 inserts=2 drops=0 time_based_evictions=2 resource_based_evictions=0 and the assertion fires — matching the Jenkins symptoms exactly. Bump the TTL to 60s in all three affected tests: - test_read_with_partition_row_limits (confirmed repro for SCYLLADB-1642) - test_read_all (same pattern, same invariants — suspect) - test_read_all_multi_range (same pattern, same invariants — suspect) Leave test_abandoned_read (1s TTL, actually tests TTL-driven eviction) and test_evict_a_shard_reader_on_each_page (tests manual eviction via evict_one(); its TTL is not load-bearing but the fix is deferred for a separate review) unchanged. Fixes: SCYLLADB-1642 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Closes scylladb/scylladb#29564	2026-04-22 09:48:59 +03:00
Tomasz Grabiec	cddde464ca	Merge 'service: Support adding/removing a datacenter with tablets by changing RF' from Aleksandra Martyniuk With this change, you can add or remove one or more DCs in a single ALTER KEYSPACE statement. It requires the keyspace to use a rack list replication factor. In the existing approach, during an RF change all tablet replicas are rebuilt at once. This isn't the case now. In global_topology_request::keyspace_rf_change the request is added to ongoing_rf_changes, a new column in the system.topology table. In a new column in system_schema.keyspaces - next_replication - we keep the target RF. In make_rf_change_plan, the load balancer schedules the necessary migrations, considering the load of nodes and other pending tablet transitions. Requests from ongoing_rf_changes are processed concurrently, independently from one another. In each request racks are processed concurrently. No tablet replica will be removed until all required replicas are added. While adding replicas to each rack we always start with base tables and won't proceed with views until they are done (while removing - the other way around). The intermediary steps aren't reflected in schema. When the RF change is finished: - in system_schema.keyspaces: - next_replication is cleared; - new keyspace properties are saved; - request is removed from ongoing_rf_changes; - the request is marked as done in system.topology_requests. Until the request is done, DESCRIBE KEYSPACE shows the replication_v2. If a request hasn't started to remove replicas, it can be aborted using the task manager. system.topology_requests::error is set (but the request isn't marked as done) and next_replication = replication_v2. This will be interpreted by the load balancer, which will start the rollback of the request. After the rollback is done, we set the relevant system.topology_requests entry as done (failed), clear the request id from system.topology::ongoing_rf_changes, and remove next_replication. Fixes: SCYLLADB-567.
No backport needed; new feature. Closes scylladb/scylladb#24421 * github.com:scylladb/scylladb: service: fix indentation docs: update documentation test: test multi RF changes service: tasks: allow aborting ongoing RF changes cql3: allow changing RF by more than one when adding or removing a DC service: handle multi_rf_change service: implement make_rf_change_plan service: add keyspace_rf_change_plan to migration_plan service: extend tablet_migration_info to handle rebuilds service: split update_node_load_on_migration service: rearrange keyspace_rf_change handler db: add columns to system_schema.keyspaces db: service: add ongoing_rf_changes to system.topology gms: add keyspace_multi_rf_change feature	2026-04-22 01:46:11 +02:00
Andrzej Jackowski	b6cb025e9b	test/audit: add reproducer for native-protocol batch not being audited The existing test_batch sends a textual BEGIN BATCH ... APPLY BATCH as a QUERY message, which goes through the CQL parser and raw::batch_statement:: prepare() — a path that correctly sets audit_info. This missed the bug where native-protocol BATCH messages (opcode 0x0D), handled by process_batch_internal in transport/server.cc, construct a batch_statement without setting audit_info, causing audit to silently skip the batch. Add _test_batch_native_protocol which uses the driver's BatchStatement (both unprepared and prepared variants) to exercise this code path. Refs SCYLLADB-1652	2026-04-21 21:52:26 +02:00
Andrzej Jackowski	5f93d57d6e	test/audit: rename internal test methods to avoid CI misdetection The CI heuristic picks up any function named test_* in changed files and tries to run it as a standalone pytest test. The AuditTester class methods (test_batch, test_dml, etc.) are not top-level pytest tests — they are internal helpers called from the actual test functions. Prefix them with underscore so CI does not mistake them for standalone tests.	2026-04-21 21:52:26 +02:00
Dario Mirovic	cf237e060a	test: auth_cluster: use safe_driver_shutdown() for Cluster teardown A handful of cassandra-driver Cluster.shutdown() call sites in the auth_cluster tests were missed by the previous sweep that introduced safe_driver_shutdown(), because the local variable holding the Cluster is named "c" rather than "cluster". Direct Cluster.shutdown() is racy: the driver's "Task Scheduler" thread may raise RuntimeError ("cannot schedule new futures after shutdown") during or after the call, occasionally failing tests. safe_driver_shutdown() suppresses this expected RuntimeError and joins the scheduler thread. Replace the remaining c.shutdown() calls in: - test/cluster/auth_cluster/test_startup_response.py - test/cluster/auth_cluster/test_maintenance_socket.py with safe_driver_shutdown(c) and add the corresponding import from test.pylib.driver_utils. No behavioral change to the tests; only the driver teardown is hardened against a known driver-side race. Fixes SCYLLADB-1662 Closes scylladb/scylladb#29576	2026-04-21 17:45:11 +02:00
Radosław Cybulski	6f7bf30a14	alternator: increase wait time to tablet sync When forcing a tablet count change via a CQL command, the underlying tablet machinery takes some time to adjust. The original code waited at most 0.1s for tablet data to be synchronized. This does not seem to be enough on debug builds, so we add exponential backoff and increase the maximum waiting time. Now the code waits 0.1s the first time and doubles the wait on each subsequent attempt, up to a maximum of 6 times, for a total of ~6s. Fixes: SCYLLADB-1655 Closes scylladb/scylladb#29573	2026-04-21 17:38:07 +02:00
Piotr Dulikowski	cb8253067d	Merge 'strong_consistency: fix crash when DROP TABLE races with in-flight DML' from Petr Gusev When DROP TABLE races with an in-flight DML on a strongly-consistent table, the node aborts in `groups_manager::acquire_server()` because the raft group has already been erased from `_raft_groups`. A concurrent `DROP TABLE` may have already removed the table from database registries and erased the raft group via `schedule_raft_group_deletion`. The `schema.table()` in `create_operation_ctx()` might not fail though because someone might be holding `lw_shared_ptr<table>`, so that the table is dropped but the table object is still alive. Fix by accepting table_id in acquire_server and checking that the table still exists in the database via `find_column_family` before looking up the raft group. If the table has been dropped, find_column_family throws no_such_column_family instead of the node aborting via on_internal_error. When the table does exist, acquire_server proceeds to acquire state.gate; schedule_raft_group_deletion co_awaits gate::close, so it will wait for the DML operation to complete before erasing the group. backport: not needed (not released feature) Fixes SCYLLADB-1450 Closes scylladb/scylladb#29430 * github.com:scylladb/scylladb: strong_consistency: fix crash when DROP TABLE races with in-flight DML test: add regression test for DROP TABLE racing with in-flight DML	2026-04-21 16:54:20 +02:00
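
The exponential backoff described in commit 6f7bf30a14 (wait 0.1s first, double each time, at most 6 waits) can be sketched as below. This is a minimal illustration only, not the code from that commit: the function name wait_with_backoff, its parameters, and the check callback are all hypothetical. With the defaults the waits are 0.1 + 0.2 + 0.4 + 0.8 + 1.6 + 3.2 = 6.3s, which matches the "~6s" in the message.

```python
import time


def wait_with_backoff(check, initial_delay=0.1, max_attempts=6, sleep=time.sleep):
    """Poll check() until it returns True, doubling the wait after each failure.

    Hypothetical helper for illustration; names and defaults are assumptions
    taken from the numbers in the commit message, not from the actual code.
    """
    delay = initial_delay
    for _ in range(max_attempts):
        if check():
            return True
        sleep(delay)   # wait before the next attempt
        delay *= 2     # exponential backoff: 0.1, 0.2, 0.4, ...
    # One final check after the last wait.
    return bool(check())
```

The sleep parameter is injectable so that a test can record the delays instead of actually waiting.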

11619 Commits