scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-24 00:32:15 +00:00

Author	SHA1	Message	Date
Dawid Pawlik	b6d5ff344b	test/cqlpy: add integration tests for `fulltext_index` Add `test_fulltext_index.py` covering the `fulltext_index` custom index: - Creation on text, varchar, and ascii columns - Rejection of non-text types (int, blob, vector) - Validation of analyzer and positions options - Rejection of unsupported option keys - Case-insensitive class name lookup - DESCRIBE INDEX output with and without options - No backing materialized view in `system_schema.views` - IF NOT EXISTS idempotent behavior - Metadata correctness in `system_schema.indexes`	2026-05-08 11:30:08 +02:00
Patryk Jędrzejczak	4c3a86c515	test/raft: fix duplicate check in connected::operator() The operator had a copy-paste bug: it checked disconnected.contains({id1, id2}) twice instead of checking both directions ({id1, id2} and {id2, id1}). Reduce the operator to a single directional check: {id1, id2}. It works for all current callers, and checking both directions correctly would break the new block_receive() function.	2026-05-08 11:18:02 +02:00
Patryk Jędrzejczak	ccd92c0b6b	test/raft: add tests for add_entry snapshot interactions Add six tests covering add_entry with wait_type::applied and wait_type::committed for three snapshot scenarios affected in the previous commit: 1. Snapshot at the entry's index (wait_for_entry, term_for returns snapshot term). 2. Snapshot past the entry's index (wait_for_entry, term_for returns nullopt). 3. Follower's waiter is resolved via drop_waiters when a snapshot is loaded. Without the fix in the previous commit, 4 of 6 tests fail: all 3 wait_type::applied tests and the wait_type::committed drop_waiters test. The remaining two tests pass because the changes don't affect them. We don't write tests covering the scenarios when add_entry should still throw commit_status_unknown (that is when the entry's term doesn't match the snapshot's term) because: - these tests would be very complicated, - a bug that would make these tests fail should also make the nemesis tests fail, as there would be an issue with linearizability.	2026-05-08 11:18:02 +02:00
Patryk Jędrzejczak	a7f204ee45	raft: do not throw commit_status_unknown from add_entry when possible Previously, when a snapshot load subsumed a committed entry before apply() was called locally, add_entry would throw commit_status_unknown -- even though the entry was known to be committed and included in the snapshot. This was overly pessimistic. Normal state machine implementations shouldn't care whether an entry was applied via apply() or via a snapshot load. Unnecessary commit_status_unknown caused flakiness of test_frequent_snapshotting and unnecessary retries in group0. Raft groups from strongly consistent tables couldn't hit unnecessary commit_status_unknown's because they use wait_type::committed and `enable_forwarding == false`. Three sites are changed: 1. wait_for_entry (truncation case): the snapshot-term match optimization that proved the entry was committed now applies to both wait_type::committed and wait_type::applied, not just committed. 2. wait_for_entry (snapshot covers entry): instead of throwing commit_status_unknown when the snapshot index >= entry index, return successfully. The entry's effects are included in the state machine's state via the snapshot. 3. drop_waiters: when called from load_snapshot, pass the snapshot term. Waiters whose term matches the snapshot term are resolved successfully (set_value) instead of failing with commit_status_unknown, since the Log Matching Property guarantees they were committed and included. This deflakes test_frequent_snapshotting: the test uses aggressive snapshot settings (snapshot_threshold=1) causing wait_for_entry to occasionally find the snapshot covering its entry. Previously this threw commit_status_unknown, failing the test. With this fix, wait_for_entry returns success. Note that apply() is never actually skipped in this test -- the leader always applies entries locally before taking a snapshot. 
The nemesis test is updated to handle the new behavior: call() detects when add_entry succeeded but the output channel was not written (apply() skipped locally) and returns apply_skipped instead of hanging. The linearizability checker in basic_generator_test counts skipped applies separately from failures. basic_generator_test exercises this path: skipped_applies > 0 occurs in some runs. Fixes: SCYLLADB-1264	2026-05-08 11:18:02 +02:00
Botond Dénes	a30ce98bc4	Merge 'test: speed up sstable compaction tests on remote storage (S3/GCS)' from Ernest Zaslavsky Several sstable_compaction_test cases run prohibitively slowly on S3 and GCS backends — some taking 4+ minutes — because they create hundreds of SSTables sequentially over high-latency HTTP connections and perform redundant validation (checksumming) round-trips on every one. The twcs_reshape_with_disjoint_set S3 variant was even disabled entirely because of this. The changes apply three complementary optimizations, per-test: Skip SSTable validation on remote storage. The compaction tests verify strategy logic, not data integrity. SSTable validation triggers additional read-back I/O which is cheap on local disk but expensive over HTTP. A `do_validate` flag now conditionally skips validation when the storage backend is not local. Parallelize SSTable creation with async coroutines. A new `make_sstable_containing_async` coroutine overload is added alongside the existing synchronous `make_sstable_containing`. Sequential creation loops are replaced with `parallel_for_each` using coroutine lambdas that call the async overload directly, overlapping S3/GCS uploads without spawning a dedicated Seastar thread per SSTable. The async validation path performs the same content checks as the synchronous version (mutation merging and `is_equal_to_compacted` assertions). Operations that depend on the created SSTables (e.g. `add_sstable_and_update_cache`, `owned_token_ranges` population) remain sequential. Reduce SSTable count for remote variants. Tests like twcs_reshape_with_disjoint_set and stcs_reshape_overlapping used a hardcoded count of 256. The count is now a function parameter (default 256 for local, 64 for S3/GCS), which is sufficient to exercise the compaction strategy logic while avoiding excessive remote I/O. 
Infrastructure changes: S3 endpoint max_connections raised from the default to 32 to support the higher upload concurrency, and trace-level logging added for s3, gcp_storage, http, and default_http_retry_strategy to aid future debugging. The previously disabled twcs_reshape_with_disjoint_set_s3_test is re-enabled with these optimizations. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1428 Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1843 No backport needed — this is a test-only performance improvement. Closes scylladb/scylladb#29416 * github.com:scylladb/scylladb: test: optimize compaction_strategy_cleanup_method for remote storage test: optimize stcs_reshape_overlapping for remote storage test: optimize twcs_reshape_with_disjoint_set for remote storage test: parallelize SSTable creation in cleanup_during_offstrategy_incremental test: parallelize SSTable creation in run_incremental_compaction_test test: parallelize SSTable creation in offstrategy_sstable_compaction test: parallelize SSTable creation in twcs_partition_estimate test: add trace-level logging for S3 and HTTP in compaction tests test: make sstable test utilities natively async The original make_memtable used seastar::thread::yield() for preemption, which required all callers to run inside a seastar::thread context. This prevented the utilities from being used directly in coroutines or parallel_for_each lambdas. Make the primary functions — make_memtable, make_sstable_containing, and verify_mutation — return future<> directly. Callers now .get() explicitly when in seastar::thread context, or co_await when in a coroutine. make_memtable now uses coroutine::maybe_yield() instead of seastar::thread::yield(). verify_mutation is converted to coroutines as well. 
Requested in: https://github.com/scylladb/scylladb/pull/29416#pullrequestreview-4112296282 test: move make_memtable out of external_updater in row_cache_test test: increase S3 max connections for compaction tests	2026-05-08 06:40:20 +03:00
Piotr Szymaniak	744848a85f	test/alternator: fix stream wait timeouts to use wall-clock time Both disable_stream and wait_for_active_stream used time.process_time() for their timeouts, but process_time measures CPU time, not wall-clock time. Since these loops spend most of their time sleeping and waiting on API calls, the timeouts could last far longer than intended. Use time.time() instead to enforce actual wall-clock deadlines.	2026-05-07 14:45:42 +02:00
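The difference between the two clocks in commit 744848a85f can be illustrated with a small sketch. `elapsed_during_sleep` and `wait_until` are hypothetical helpers written for this illustration, not the functions from the Alternator test suite; only `time.process_time()` and `time.time()` come from the commit message.

```python
import time


def elapsed_during_sleep(clock, sleep_s=0.2):
    """Return how much `clock` advances while the process sleeps for `sleep_s`."""
    start = clock()
    time.sleep(sleep_s)
    return clock() - start


def wait_until(predicate, timeout_s, poll_interval_s=0.01):
    """Poll `predicate` until it is true or a wall-clock deadline passes."""
    deadline = time.time() + timeout_s  # wall-clock deadline, as in the fix
    while time.time() < deadline:
        if predicate():
            return True
        time.sleep(poll_interval_s)  # sleeping consumes almost no CPU time
    return False


if __name__ == "__main__":
    # The wall clock advances by roughly the sleep duration.
    print("time.time():        ", elapsed_during_sleep(time.time))
    # CPU time barely moves while sleeping, so a deadline based on it
    # would take far longer than intended to expire.
    print("time.process_time():", elapsed_during_sleep(time.process_time))
```

A loop that mostly sleeps should therefore compute its deadline from a wall-clock source, which is what the sketch's `wait_until` does.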
Piotr Szymaniak	38bd068f78	alternator/streams: keep disabled streams usable and purge on re-enable Previously, disabling Alternator Streams would create a blank cdc::options with only enabled=false, which meant losing access also to stored Streams's data (including preimage and postimage). Now, when a stream is disabled: - The existing CDC options are preserved (only 'enabled' is flipped to false), so StreamViewType remains available. - DescribeStream enumerates all shards with EndingSequenceNumber set, indicating they are closed. - GetRecords omits NextShardIterator for disabled streams. - DescribeTable (supplement_table_stream_info) reports the stream ARN and StreamEnabled: false when the CDC log table still exists. - ListStreams uses get_base_table instead of is_log_for_some_table so that disabled streams whose log table still exists are listed. When a stream is re-enabled on an Alternator table that has an existing (disabled) CDC log table, the old log table is dropped and a fresh one is created with a new UUID, producing a new StreamArn. This is Alternator-specific behavior; CQL CDC tables continue to reuse the existing log table. The old stream data is lost immediately upon re-enable. DynamoDB keeps it readable for 24 hours. Tests: - test_streams_closed_read, test_streams_disabled_stream: remove xfail now that disabled streams are usable. - test_streams_reenable: new test verifying that re-enabling produces a new ARN and the old data is still readable via the old ARN (xfail because Scylla currently purges old data on re-enable). Fixes scylladb/scylladb#7239	2026-05-07 14:45:42 +02:00
Ferenc Szili	f7bc8f5fa7	test: boost: add drain test for forced capacity-based balancing Add a Boost unit test that forces capacity-based balancing through configuration and verifies that a drained and excluded node will be drained of its tablets when tablet size stats are missing. The test covers the regression where the allocator rejected the plan due to incomplete tablet stats, even though forced capacity-based balancing does not depend on tablet sizes.	2026-05-07 13:56:36 +02:00
Wojciech Mitros	ab12083525	test: propagate view update backlog before partition delete In the test_delete_partition_rows_from_table_with_mv case we perform a deletion of a large partition to verify that the deletion will self-throttle when generating many view updates. Before the deletion, we first build the materialized view, which causes the view update backlog to grow. The backlog should be back to empty when the view building finishes, and we do wait for that to happen, but the information about the backlog drop may not be propagated to the delete coordinator in time - the gossip interval is 1s and we perform no other writes between the nodes in the meantime, so we don't make use of the "piggyback" mechanism of propagating view backlog either. If the coordinator thinks that the backlog is high on the replica, it may reject the delete, failing this test. We change this in this patch - after the view is built, we perform an extra write from the coordinator. When the write finishes, the coordinator will have the up-to-date view backlog and can proceed with the DELETE. Additionally, we enable the "update_backlog_immediately" injection, which makes the node backlog (the highest backlog across shards) update immediately after each change. Fixes: SCYLLADB-1795 Closes scylladb/scylladb#29775	2026-05-07 11:33:13 +03:00
Andrzej Jackowski	eb241a7048	test: make preemptive abort coverage deterministic The test used a real-time sleep to move the queued permit into the preemptive-abort window. If the reactor did not get CPU for long enough, admission could run only after the permit's timeout had expired, making the expected abort path flaky. The test also exhausted memory together with count resources, so the queued permit could wait for memory. Preemptive abort is intentionally not applied to permits waiting for memory, so keep enough memory available and assert that the permit is queued only on count. Use an immediate preemptive-abort threshold and a long finite timeout to exercise admission-time abort without relying on scheduler timing. Fixes: SCYLLADB-1796 Closes scylladb/scylladb#29736	2026-05-07 09:59:53 +03:00
Ferenc Szili	ec4b483e88	test: fix flaky test_tablets_split_merge_with_many_tables In debug mode, this test can time out during tablets merge. While the test already decreases the number of tables in debug mode (20 tables, instead of 200 for dev mode), this is not enough, and the test can still time out during merge. This change reduces the number of tables from 20 to 5 in debug mode. It also lowers the log level for lead_balancer to debug. This should make any potential future problems with this test easier to investigate. Fixes: SCYLLADB-1717 Closes scylladb/scylladb#29682	2026-05-06 17:02:10 +03:00
Petr Gusev	cab043323d	test/cluster: fix test_lwt_fencing_upgrade flakiness during rolling upgrade Replace the naive host.is_up check with wait_for_cql_and_get_hosts() which actually executes a query against each host, ensuring the driver's connection pool is fully re-established before proceeding to stop the last server. The is_up flag is set asynchronously via gossip and doesn't guarantee the connection pool has live TCP connections. After a server restart, the flag may be True while the pool still holds stale connections. When the pool monitor later discovers them dead it briefly marks the host DOWN, causing NoHostAvailable if another server is being stopped concurrently. Fixes SCYLLADB-1840 Closes scylladb/scylladb#29769	2026-05-06 15:40:09 +03:00
Tomasz Grabiec	d6346e68c1	Merge 'prevent gossiper from marking nodes as down in tests unexpectedly' from Patryk Jędrzejczak This PR includes two changes that make gossiper much less likely to mark nodes as down in tests unexpectedly, and cause test flakiness in issues like SCYLLADB-864: - fixing false node conviction when echo succeeds, - increasing the failure_detector_timeout fixture. Fixes: SCYLLADB-864 No need for backport: related CI failures are rare, and merging #29522 made them even more unlikely (I haven't seen one since then, but it's still possible to reproduce locally on dev machines). Closes scylladb/scylladb#29755 * github.com:scylladb/scylladb: test/cluster: increase failure_detector_timeout gossiper: fix false node conviction when echo succeeds	2026-05-06 14:01:15 +02:00
Piotr Dulikowski	1dccfeb988	Merge 'vector_search: test: fix flaky test_dns_resolving_repeated' from Karol Nowacki The `vector_store_client_test_dns_resolving_repeated` test was intermittently timing out on CI. The exact root cause is not fully understood, but the hypothesis is that a single trigger signal can be lost somewhere (not exactly known where). This is not an issue for the production code because the refresh trigger is called multiple times whenever all configured nodes are unreachable. Fixes SCYLLADB-1794 Backport to 2026.1 and 2026.2, as the same CI flakiness can occur on these branches. Closes scylladb/scylladb#29752 * github.com:scylladb/scylladb: vector_search: test: default timeout in test_dns_resolving_repeated vector_search: test: fix flaky test_dns_resolving_repeated	2026-05-06 13:46:36 +02:00
Botond Dénes	8d22ef3058	Merge 'commitlog_test.py: Fix size check aliasing, and threshold calc and fix CL chunk size est.' from Calle Wilund Fixes: SCYLLADB-1815 If we're in a brand new chunk (no buffer yet allocated), we would miscalculate the actual size of an entry to write, possibly causing segment size overshoot. Break out some logic to share between this calc and new_buffer. Also remove redundant (and possibly wrong) constant in oversized allocation. As for the test: Checking segment sizes should not use a size filter that rounds (up) sizes. More importantly, the estimate for what is acceptable limit for commitlog disk usage should be aligned. Simplified the calc, and also made logging more useful in case of failure. Closes scylladb/scylladb#29753 * github.com:scylladb/scylladb: commitlog_test.py: Fix size check aliasing, and threshold calc. commitlog: Fix segment/chunk overhead maybe not included in next_position calculation	2026-05-06 13:48:41 +03:00
Piotr Dulikowski	321006ecbd	Merge 'auth: fix crash on ghost rows in role_permissions' from Marcin Maliszkiewicz The auth cache crashes when it encounters rows in role_permissions that have a live row marker but no permissions column. These “ghost rows” were created by the now-removed auth v2 migration, which used INSERT (creating row markers) instead of UPDATE. When permissions were later revoked, the row marker remained while the permissions column became null. An empty collection appears as null, since its lifetime is based only on its element's cells. As a result, when the cache reloads and expects the permissions column to exist, it hits a missing_column exception. The series removes dead code that was the primary crash site, adds has() guards to the remaining access paths, and includes a test reproducer. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1816 Backport: all supported versions 2026.1, 2025.4, 2025.1 Closes scylladb/scylladb#29757 * github.com:scylladb/scylladb: test: add reproducer for auth cache crash on missing permissions column auth: tolerate missing permissions column in authorize() auth: add defensive has() guard for role_attributes value column auth: remove unused permissions field from cache role_record	2026-05-06 12:00:17 +02:00
Yaniv Michael Kaul	7557c64f20	test/cqlpy: add tests for hyphenated column names Verify that double-quoted column names with hyphens (e.g. "my-col") work correctly for CREATE TABLE, INSERT, and SELECT. Also verify that unquoted hyphenated names are rejected with a syntax error.	2026-05-06 11:32:04 +03:00
Nadav Har'El	19555bc2cf	test/pylib: fix missing protocol_version=4 on control_cluster get_cql_up_state() creates two Cluster instances: a short-lived one used to probe CQL readiness, and a persistent control_cluster kept alive for the lifetime of the server. The probe cluster was created with protocol_version=4 (the highest version Scylla supports), but the control_cluster was not, causing the driver to do a superfluous version-negotiation round-trip on every server start. Fix by extracting the shared constructor arguments into a cluster_kwargs dict and using **cluster_kwargs for both calls, so the two Cluster instances are created with identical parameters. This deduplication can help avoid more instances of this bug, where someone modifies the options in one call but forgets to change the options in the other call. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 20:57:49 +03:00
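The deduplication described in commit 19555bc2cf can be sketched as follows. `FakeCluster` is a hypothetical stand-in for the driver's `Cluster` class, and `make_clusters` with its host and port arguments is invented for illustration; only the `cluster_kwargs` name and `protocol_version=4` come from the commit message.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class FakeCluster:
    """Hypothetical stand-in for a driver Cluster; it only records its arguments."""
    contact_points: Tuple[str, ...] = ()
    port: int = 9042
    protocol_version: Optional[int] = None  # None would mean "negotiate"


def make_clusters(host, port):
    # Build the shared constructor arguments once...
    cluster_kwargs = dict(contact_points=(host,), port=port, protocol_version=4)
    # ...and use them for both instances, so the two cannot drift apart.
    probe_cluster = FakeCluster(**cluster_kwargs)
    control_cluster = FakeCluster(**cluster_kwargs)
    return probe_cluster, control_cluster
```

With a single dict, changing an option in one place changes it for both the probe and the control instance.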
Nadav Har'El	f977621e40	scylla_cluster: guard poll_status() set_result() calls against cancelled future The poll_status() background thread resolves `serving_signal` by scheduling `f.set_result(...)` on the event loop via `call_soon_threadsafe`. In parallel, `_cleanup_notify_socket()` can cancel `serving_signal` at any time - for example when a server fails to start and `stop()` -> `shutdown_control_connection()` is called while the thread is still blocked in `recv()` (the socket close unblocks the `recv()` with an exception, sending it down the error path). When that race fires the scheduled `f.set_result(...)` callback runs after `cancel()` has already put the future into the cancelled state, raising `asyncio.InvalidStateError: Result is not allowed in cancelled state`. This bug predates the SERVING work, but the original CQL_ALTERNATOR_QUERIED default meant the notify socket was torn down quickly most of the time, making the window very narrow. Now that SERVING is the default the socket stays open throughout the full startup wait, widening the race significantly. Fix: replace every bare `f.set_result(v)` call with `lambda: f.done() or f.set_result(v)`, which is a no-op when the future is already done (cancelled, or resolved by another path). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 20:34:58 +03:00
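The race and the guard from commit f977621e40 can be reproduced in isolation. This is a minimal sketch, not the `poll_status()` code: `resolve_after_cancel` is a hypothetical helper that cancels a future first and then lets a thread schedule the result, with the loop's exception handler collecting whatever the callback raises.

```python
import asyncio
import threading


async def resolve_after_cancel(guarded):
    """Cancel a future, then schedule set_result() from another thread."""
    loop = asyncio.get_running_loop()
    errors = []
    # Exceptions raised inside scheduled callbacks go to this handler.
    loop.set_exception_handler(
        lambda _loop, ctx: errors.append(ctx.get("exception")))

    f = loop.create_future()
    f.cancel()  # plays the role of the cleanup path cancelling the future

    if guarded:
        # The pattern from the commit: a no-op if the future is already done.
        callback = lambda: f.done() or f.set_result(True)
    else:
        callback = lambda: f.set_result(True)

    t = threading.Thread(target=loop.call_soon_threadsafe, args=(callback,))
    t.start()
    t.join()
    await asyncio.sleep(0.01)  # give the scheduled callback a chance to run
    return errors


if __name__ == "__main__":
    print("guarded:  ", asyncio.run(resolve_after_cancel(True)))
    print("unguarded:", asyncio.run(resolve_after_cancel(False)))
```

The guarded variant relies on `or` short-circuiting: `f.done()` is true for a cancelled future, so `f.set_result()` is never evaluated.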
Nadav Har'El	ff33440c6c	test/cluster: avoid repeated CQL checks and leaks while waiting for SERVING With ServerUpState.SERVING now the default, server_add() and server_start() wait for sd_notify readiness after CQL is already up. During that window the startup polling loop was calling get_cql_alternator_up_state() on every iteration (every 100ms). Each successful call recreated self.control_cluster and self.control_connection without closing the previous ones, leaking driver connections and adding unnecessary CQL load to a node that was already known to be queryable. Fix in two places: - Startup loop: skip the get_cql_alternator_up_state() call once server_up_state has reached CQL_ALTERNATOR_QUERIED. After that point only the cheap non-blocking check_serving_notification() is needed. - get_cql_up_state(): guard control_cluster/control_connection creation with `if self.control_connection is None` so the persistent driver connection is only established once, even if the function is called multiple times. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 20:29:18 +03:00
Karol Nowacki	20b953ef8c	vector_search: test: migrate paging warnings tests to Python Move the paging warning related tests from C++ vector_store_client_test to Python test_vector_search_with_vector_store_mock.	2026-05-05 18:23:30 +02:00
Karol Nowacki	84787ce6a5	vector_search: test: migrate local_vector_index to Python Move the local vector index test from C++ vector_store_client_test to Python test_vector_search_with_vector_store_mock. The test creates a local vector index on ((pk1, pk2), embedding) and verifies that SELECT with partition key restriction and ANN ordering works correctly.	2026-05-05 18:23:30 +02:00
Karol Nowacki	0bb7e47090	vector_search: test: migrate vector_index_with_additional_filtering_column to Python Move the SCYLLADB-635 regression test from C++ vector_store_client_test to Python test_vector_search_with_vector_store_mock. The test creates a vector index on (embedding, ck1) and verifies that SELECT with ANN ordering works correctly when additional filtering columns are included in the index definition.	2026-05-05 18:23:30 +02:00
Karol Nowacki	5a8af3c727	vector_search: test: migrate cql_error_contains_http_error_description to Python Move the test that verifies HTTP error descriptions from the vector store are propagated through CQL InvalidRequest messages from the C++ vector_store_client_test to the Python test_vector_search_with_vector_store_mock. The test configures the mock to return HTTP 404 with 'index does not exist' and asserts the CQL SELECT raises InvalidRequest containing '404'.	2026-05-05 18:23:30 +02:00
Karol Nowacki	b672972c5f	vector_search: test: migrate pk in restriction test to Python Move vector search (ANN ordered select query) with IN restrictions on partition key from C++/Boost test suite to pytest (cqlpy). Add VectorStoreMock server as pytest fixture to simulate vector store responses.	2026-05-05 18:23:30 +02:00
Nadav Har'El	417b4e0765	test/cluster: fix check_serving_notification() inefficiency When the sd_notify future completed, check_serving_notification() correctly updated _received_serving to True but still returned False on that same call. The SERVING state was only recognized on the next polling iteration, 100ms later, for no reason. Return self._received_serving instead of False after updating it.	2026-05-05 18:56:37 +03:00
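The one-iteration delay fixed in commit 417b4e0765 can be sketched like this. `ServingWatcher` is a hypothetical class built around a `concurrent.futures.Future`; the real code's future type and surrounding state are not shown in the commit message, so only the method name `check_serving_notification` and the `_received_serving` attribute are taken from it.

```python
from concurrent.futures import Future


class ServingWatcher:
    """Hypothetical holder for a readiness future and a cached flag."""

    def __init__(self, serving_signal, fixed=True):
        self.serving_signal = serving_signal
        self._received_serving = False
        self._fixed = fixed  # False reproduces the old behaviour

    def check_serving_notification(self):
        """Non-blocking check: has the readiness signal arrived?"""
        if self._received_serving:
            return True
        if self.serving_signal.done():
            self._received_serving = bool(self.serving_signal.result())
            # Old code returned False here, so the caller only saw the
            # new state on its next polling iteration.
            return self._received_serving if self._fixed else False
        return False
```

With the fix, the call that observes the completed future already reports the new state, so the poller does not wait an extra interval.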
Nadav Har'El	67384dbb96	test/cluster: remove now-redundant expected_server_up_state=SERVING ServerUpState.SERVING is now the default for server_add() and server_start(), so the explicit argument in various tests is no longer needed. Remove it along with the unused ServerUpState imports and the docstring comments that explained why it was there. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 18:56:37 +03:00
Nadav Har'El	3734afe193	test/cluster: document that add/start waits for all ports to be ready Add docstrings to server_add(), server_start(), and servers_add() explaining that they wait for ServerUpState.SERVING before returning, which means Scylla has finished listening on all configured ports (including non-default ones). Note that server_add() and server_start() accept expected_server_up_state to return earlier if needed, while servers_add() always waits for SERVING. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 18:56:32 +03:00
Nadav Har'El	90eef72794	test/cluster: update remaining CQL_ALTERNATOR_QUERIED defaults to SERVING ScyllaServer.install_and_start() and ScyllaServer.start() still had ServerUpState.CQL_ALTERNATOR_QUERIED as their default for expected_server_up_state. In practice these defaults are never reached - both call sites in ScyllaCluster always pass the value explicitly, forwarding it from the higher-level add_server() and server_start() whose defaults were already fixed. Update them to SERVING anyway for consistency, so that the low-level methods agree with the policy established at the higher layers and won't silently revert to the wrong behavior if a new call site is added without an explicit argument. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 18:51:19 +03:00
Nadav Har'El	af03f0e8c4	test/cluster: fix server_add/server_start hanging when starting in maintenance mode When Scylla starts in maintenance mode it sends sd_notify("STATUS=entering maintenance mode") instead of sd_notify("STATUS=serving"), and does not open the standard CQL port. This caused two independent bugs after the default was changed to ServerUpState.SERVING: 1. poll_status() resolved serving_signal to False on the maintenance notification, so check_serving_notification() would never return True, and start() would time out waiting for SERVING. 2. The readiness check in start() was guarded by `server_up_state >= CQL_ALTERNATOR_QUERIED`, which is never reached in maintenance mode (the standard CQL port is not open). Even if bug 1 were fixed, SERVING would never be recognized. Fix both: - Treat STATUS=entering maintenance mode as a successful readiness signal in poll_status(), resolving serving_signal to True just like STATUS=serving. Both mean "all configured ports are now open". - Remove the CQL_ALTERNATOR_QUERIED precondition from the check_serving_notification() call in start(). The sd_notify signal is authoritative: Scylla sends it only when fully ready, regardless of which ports it opened. No CQL precondition is needed. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 18:51:18 +03:00
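The first fix in commit af03f0e8c4 amounts to accepting two status strings as readiness signals. A minimal sketch, assuming the notification arrives as text with one `KEY=value` entry per line; `is_ready_notification` and `READY_STATUSES` are hypothetical names, and only the two status strings come from the commit message.

```python
# Status strings quoted in the commit message; both mean
# "all configured ports are now open".
READY_STATUSES = frozenset({
    "STATUS=serving",
    "STATUS=entering maintenance mode",
})


def is_ready_notification(message):
    """Return True if any line of the notification is a readiness status."""
    return any(line.strip() in READY_STATUSES
               for line in message.splitlines())
```

Any other status line is treated here as "not ready yet", which is an assumption of the sketch rather than documented behaviour.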
Karol Nowacki	207de967fb	vector_search: test: default timeout in test_dns_resolving_repeated Replace explicit 1-second timeouts in repeat_until() with the default STANDARD_WAIT (10s). The 1-second timeout could be too aggressive for loaded CI environments where lowres_clock granularity (~10ms) combined with OS scheduling delays and resource contention (-c2 -m2G) could cause the loop to expire before the DNS refresh task completes its cycle. This also unifies test timeouts across test cases.	2026-05-05 17:23:39 +02:00
Karol Nowacki	4722be1289	vector_search: test: fix flaky test_dns_resolving_repeated Move trigger_dns_resolver() inside the repeat_until loop instead of calling it once before the loop. The test was intermittently timing out on CI. The exact root cause is not fully understood, but the hypothesis is that a single trigger signal can be lost somewhere (not exactly known where). This is not an issue for the production code because refresh trigger will be called multiple times - in every query where all configured nodes will be unreachable. By triggering inside the loop, we ensure the signal is re-sent on each iteration until the resolver actually performs the refresh and picks up the new (failing) DNS resolution. This makes the test resilient to timing-dependent signal loss without changing production code. Fixes: SCYLLADB-1794	2026-05-05 17:23:39 +02:00
Nadav Har'El	e014521565	test/cluster: make server_start() default to ServerUpState.SERVING For the same reason server_add() was changed to default to SERVING (see previous commit), server_start() had the same bug: after restarting a node that listens on non-default ports, the polling of the hardcoded CQL/Alternator ports could succeed before the custom ports were ready, causing intermittent failures. Apply the same fix to server_start() in manager_client.py, ScyllaCluster.server_start(), and the _cluster_server_start HTTP handler. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 18:18:32 +03:00
Nadav Har'El	f91525c5df	test/cluster: make server_add() default to ServerUpState.SERVING server_add() was defaulting to ServerUpState.CQL_ALTERNATOR_QUERIED, which polls the standard CQL and Alternator ports to determine when the server is ready. This is wrong when a test configures Scylla to listen on non-default ports: the polling succeeds on the default ports while the custom ports may not yet be ready, making such tests intermittently flaky. The correct behavior is ServerUpState.SERVING, which waits for Scylla's sd_notify("READY=1") signal. This signal is sent only after all configured listeners — including custom ports — are fully open, so it is the right readiness signal regardless of the port configuration. Up to now, the fix for each affected test was to pass expected_server_up_state=ServerUpState.SERVING explicitly once the flakiness was noticed (e.g. #29737). Change the default so that all future tests get the correct behavior automatically. Changed in manager_client.server_add(), ScyllaCluster.add_server(), and the _cluster_server_add HTTP handler. The multi-server servers_add() path already inherits the new default through add_server(). Fixes SCYLLADB-1822 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 18:18:32 +03:00
Marcin Maliszkiewicz	5c5306c692	test: add reproducer for auth cache crash on missing permissions column	2026-05-05 17:16:25 +02:00
Marcin Maliszkiewicz	c00fee0316	Merge 'utils: loading_cache: add `insert()` that is a no-op when caching is disabled' from Dario Mirovic When `permissions_validity_in_ms` is set to 0, executing a prepared statement under authentication crashes with: ``` Assertion `caching_enabled()' failed. at utils/loading_cache.hh:319 in authorized_prepared_statements_cache::insert ``` `loading_cache::get_ptr()` asserts when caching is disabled (expiry == 0), but `authorized_prepared_statements_cache::insert()` was using it purely for its side effect of populating the cache, which is meaningless when caching is off. Add a new `loading_cache::insert(k, load)` method that is a no-op when caching is disabled and otherwise forwards to `get_ptr()`. Switch `authorized_prepared_statements_cache::insert()` to use it. This completes the disabled-mode safety contract of the cache for the write side, mirroring the fallback that `get()` already provides for the read side. Includes a regression test in `test/boost/loading_cache_test.cc` plus a positive test for the new `insert()` overload. Fixes SCYLLADB-1699 The crash was introduced a long time ago and is present on all live versions, from 2025.1 onward. No client tickets, but it should be backported. Closes scylladb/scylladb#29638 * github.com:scylladb/scylladb: test: boost: regression test for loading_cache::insert with caching disabled utils: loading_cache: add insert() that is a no-op when caching is disabled	2026-05-05 15:33:49 +02:00
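The disabled-mode contract described in merge c00fee0316 can be modelled in a few lines. This is a Python sketch of the idea, not the C++ `loading_cache`: the class `LoadingCacheSketch` is hypothetical, expiry and eviction are not modelled, and the `get()` fallback of calling the loader directly is an assumption about what "fallback" means in the commit message.

```python
class LoadingCacheSketch:
    """Toy model of a loading cache that can be disabled with expiry == 0."""

    def __init__(self, expiry_ms):
        self._expiry_ms = expiry_ms
        self._entries = {}

    def caching_enabled(self):
        return self._expiry_ms != 0

    def get_ptr(self, key, load):
        # Mirrors the assertion that crashed when caching was disabled.
        assert self.caching_enabled(), "get_ptr() requires caching enabled"
        if key not in self._entries:
            self._entries[key] = load(key)
        return self._entries[key]

    def get(self, key, load):
        # Read side: fall back to loading directly when caching is off.
        if not self.caching_enabled():
            return load(key)
        return self.get_ptr(key, load)

    def insert(self, key, load):
        # Write side: a no-op when caching is off, otherwise populate.
        if not self.caching_enabled():
            return
        self.get_ptr(key, load)
```

Callers that only want the side effect of populating the cache use `insert()`, so they never reach the assertion in `get_ptr()` when caching is disabled.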
Patryk Jędrzejczak	9f692857be	test/cluster: increase failure_detector_timeout Scaling the timeout by build mode (#29522) turned out to be not sufficient. Nodes can still be unexpectedly marked as down, even with a 4s timeout in dev mode. I managed to reproduce SCYLLADB-864 in such conditions. Increasing failure_detector_timeout will proportionally slow down tests that use it. That's bad, but currently these tests' flakiness is a much bigger problem than the tests' slowness. Also, not many tests use this fixture, and we hope to make it unneeded eventually (see #28495).	2026-05-05 15:12:33 +02:00
Patryk Jędrzejczak	b69d00b0a7	Merge 'Barrier and drain logging' from Gleb Natapov Add more logging to barrier and drain rpc to try and pinpoint https://github.com/scylladb/scylladb/issues/26281 Backport since we want to have it if it happens in the field. Fixes: SCYLLADB-1821 Refs: #26281 Closes scylladb/scylladb#29735 * https://github.com/scylladb/scylladb: session, raft_topology: add periodic warnings for hung drain and stale version waits session: add info-level logging to drain_closing_sessions raft_topology: log sub-step progress in local_topology_barrier raft_topology: log read_barrier progress in topology cmd handler	2026-05-05 15:04:50 +02:00
Calle Wilund	5cdfdd9ba3	commitlog_test.py: Fix size check aliasing, and threshold calc. Fixes: SCYLLADB-1815 Checking segment sizes should not use a size filter that rounds (up) sizes. More importantly, the estimate for what is an acceptable limit for commitlog disk usage should be aligned. Simplified the calc, and also made logging more useful in case of failure.	2026-05-05 14:42:55 +02:00
Nadav Har'El	1f15e05946	test: fix replica_read_timeout_no_exception flakiness on slow systems The test uses a 10ms read timeout to exercise code paths that handle timed-out reads without throwing C++ exceptions. As part of setup, it inserts rows and flushes them to two SSTables, then runs a warm-up SELECT to populate internal caches (e.g. the auth cache) before the real test begins. The reason for this warm-up read was the possibility that the first read does additional operations (such as reading and caching authentication) that might throw exceptions internally. I couldn't verify that such exceptions actually happen in today's code, but they might (re)appear in the future, so we should keep the warm-up SELECT. On slow CI machines (aarch64, debug build), that warm-up SELECT can take longer than 10ms to read from the two SSTables. When it does, the read times out: the coordinator receives 0 responses from the local replica within the deadline and propagates a read_timeout_exception. Since the exception is not caught, it escapes the test lambda, is logged as "cql env callback failed", and causes Boost.Test to report a C++ failure at the do_with_cql_env_thread call site. This matches the CI failure seen in SCYLLADB-1774: ERROR ... replica_read_timeout_no_exception: cql env callback failed, error: exceptions::read_timeout_exception (Operation timed out for replica_read_timeout_no_exception.tbl - received only 0 responses from 1 CL=ONE.) The CI log also shows that only 12 reads were admitted (the warm-up read plus the 11 reads from the two prepare() calls and CREATE/INSERT statements made earlier), and the current permit was stuck in need_cpu state -- the reactor hadn't had a chance to schedule the read before the 10ms window elapsed. The fix catches read_timeout_exception from the warm-up SELECT and retries until the read succeeds. The warm-up is required for correctness: some lazy-init code paths (e.g. 
auth cache population) use C++ exceptions for control flow internally. Those exceptions must be absorbed before the cxx_exceptions baseline is sampled inside execute_test(); otherwise they would appear in the delta and cause a false test failure. Simply ignoring a timed-out warm-up is not safe, because the lazy-init exceptions would then fire during the 1000 test reads, inflating cxx_exceptions_after relative to cxx_exceptions_before. No other calls in setup are susceptible to the 10ms read timeout: - CREATE KEYSPACE, CREATE TABLE, INSERT, and flush use the write timeout (10s) and are not reads. - e.prepare() goes through the query processor without reading table data, so it is not subject to the read timeout. - The semaphore manipulation in Test 2 is internal and has no timeout. - All 1000 reads in execute_test() are expected to fail, so a timeout there is the happy path, not a failure. The 10ms timeout itself is fine for the test's purpose: it is deliberately aggressive so that reads reliably time out on the hot path being tested. The problem was only that the pre-test warm-up was not guarded against the same timeout. Fixes: SCYLLADB-1774 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#29731	2026-05-05 15:13:13 +03:00
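The retry-until-success warm-up described above can be sketched in Python. This is only an illustration of the control flow, not the C++ test code: `ReadTimeout`, `warm_up` and `make_slow_then_fast_read` are hypothetical names, with `ReadTimeout` standing in for `exceptions::read_timeout_exception`.

```python
class ReadTimeout(Exception):
    """Hypothetical stand-in for exceptions::read_timeout_exception."""


def warm_up(read):
    """Call read() until it succeeds and return (result, attempts)."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return read(), attempts
        except ReadTimeout:
            # A timed-out warm-up has not exercised the lazy-init paths yet,
            # so ignoring it is not enough; try again.
            continue


def make_slow_then_fast_read(timeouts):
    """Build a fake read that times out `timeouts` times, then succeeds."""
    remaining = [timeouts]

    def read():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ReadTimeout()
        return "rows"

    return read
```

Only after `warm_up()` returns would the exception-counter baseline be sampled, which matches the ordering the commit message requires.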
Botond Dénes	afd9a55891	Merge 'test/cluster: wait for custom listener readiness' from Piotr Smaron server_add() defaults to CQL_ALTERNATOR_QUERIED. That proves the regular CQL driver path is queryable, and regular Alternator ports listed in YAML config if any. It does not prove that every custom listener the test will connect to is already accepting raw TCP connections. test_proxy_protocol_ssl_shard_aware connects directly to the shard-aware TLS proxy-protocol CQL port immediately after server startup. Wait for ServerUpState.SERVING in the fixture so the custom proxy-protocol listener is registered before opening raw sockets. test_uninitialized_conns_semaphore opens a raw TCP connection to native_shard_aware_transport_port immediately after startup. The default readiness check can succeed through native_transport_port while the shard-aware listener is still being started, because CQL listeners are registered independently. Wait for ServerUpState.SERVING before opening raw sockets. test_perf_alternator_remote now asks server_add() to wait for SERVING and uses the returned server address directly. This removes the redundant running_servers() plus get_ready_cql() sequence noted in review. Fixes: SCYLLADB-1797 No backport as of now, only appeared on master. Closes scylladb/scylladb#29737 * github.com:scylladb/scylladb: test/cluster: avoid redundant perf alternator CQL wait test/cluster: wait for shard-aware CQL listener test/cluster: wait for proxy protocol ports to serve	2026-05-05 14:45:58 +03:00
Piotr Dulikowski	efcc0b6376	Merge 'table_helper: fix use-after-free on prepared-statement invalidation' from Marcin Maliszkiewicz insert() held no local strong ref to the prepared modification_statement across the suspension in execute(). On a single shard: 1. Fiber A suspends inside _insert_stmt->execute(). 2. DROP TABLE / DROP KEYSPACE on the target, or LRU eviction, removes the prepared_statements_cache entry, releasing its strong ref. 3. Fiber B re-enters cache_table_info(), sees _prepared_stmt (checked_weak_ptr) invalidated, and runs _insert_stmt = nullptr, releasing the last strong ref. The modification_statement is freed. 4. Fiber A resumes inside execute() and touches freed this. Pin a strong ref to _insert_stmt locally before the suspension. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1667 Backport: all supported branches, it's a memory corruption bug, long present Closes scylladb/scylladb#29588 github.com:scylladb/scylladb: test/boost: add dummy case to table_helper_test for non-injection modes test/boost: add regression test for table_helper insert() UAF utils/error_injection: add waiters() API table_helper: fix use-after-free on prepared-statement invalidation	2026-05-04 17:21:05 +02:00
Piotr Smaron	a3360ee385	test/nodetool: fix mock server port race by using a fixed port on a unique IP Symptom: the rest_api_mock subprocess exits with status 1 during fixture setup, e.g.: subprocess.CalledProcessError: Command '[..., 'rest_api_mock.py', '127.29.88.1', '34093']' returned non-zero exit status 1 Root cause: aiohttp's TCPSite.start() raises OSError(EADDRINUSE) and the process exits 1. The bind fails because of how the (ip, port) pair is chosen across modules within one test.py process: * Each test module leases a 127.x.y.z IP from the host registry. The registry recycles released IPs, so the same IP is shared across modules sequentially. * The original code picked the port via random.randint(10000, 65535). A previous module on the same IP could have left that port in TIME_WAIT (or worse, still actively in use) when a later module happened to pick the same port. SCYLLADB-1275 (PR 29314) tried to fix this by binding a probe socket to (ip, 0) to obtain an OS-assigned free port, closing the probe, then launching the mock server which would bind to that port. Two issues remained: 1. TOCTOU: between probe close and mock-server bind, any other process on the host could grab the just-freed port. 2. TIME_WAIT could still bite if the host registry recycled an IP and the OS reused the same port number for the probe. Fix: drop port discovery entirely. Use a fixed port (12345, matching the unshare-namespace path already in this fixture) on the unique IP from the host registry. Because IPs are unique per test module within one test.py process, the (ip, 12345) pair is unique to each module, so no port-collision dance is needed. reuse_address=True on TCPSite handles the residual TIME_WAIT case when the host registry recycles an IP within the same test.py process and the previous mock server's socket has not finished TIME_WAIT yet. reuse_port=True is dropped, as it was only useful while attempting to have multiple processes share a single port. 
This mirrors the design used in test/cqlpy/run.py: pick a unique IP, keep the port fixed. Fixes: SCYLLADB-1718 Closes scylladb/scylladb#29656	2026-05-04 15:33:19 +02:00
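The "unique IP, fixed port" approach above can be sketched with the standard library. This is an assumption-laden illustration rather than the fixture's code: the real mock server uses aiohttp's `TCPSite`, whereas `bind_mock_listener` below is a hypothetical helper that sets `SO_REUSEADDR` directly on a plain socket.

```python
import socket

MOCK_PORT = 12345  # fixed port named in the commit message


def bind_mock_listener(ip: str, port: int = MOCK_PORT) -> socket.socket:
    """Bind a listening TCP socket on (ip, port) with address reuse enabled.

    There is no probe-then-bind step, so there is no window in which another
    process can take the port between discovery and use.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Allows rebinding while an earlier socket on the same address is
        # still in TIME_WAIT.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind((ip, port))
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock
```

Uniqueness then rests entirely on the IP lease: as long as each module holds a distinct address, the `(ip, port)` pair cannot collide within one run.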
Gleb Natapov	e88ce09372	raft_topology: log sub-step progress in local_topology_barrier When a node processes a barrier_and_drain topology command, it performs two potentially long-running operations inside local_topology_barrier(): waiting for stale token metadata versions to be released (stale_versions_in_use) and draining closing sessions (drain_closing_sessions). Either of these can hang indefinitely -- for example, stale_versions_in_use blocks until all references to previous token metadata versions are released, which depends on in-flight requests completing. Previously, the only logging was a single 'done' message at the end, making it impossible to determine which sub-step was blocking when a barrier_and_drain RPC appeared stuck on a node. In a recent CI failure, a node never responded to barrier_and_drain during a removenode operation, and the logs showed the RPC was received but nothing about what it was waiting on internally. Add info-level logging before each blocking sub-step, including the topology version for correlation. This allows diagnosing hangs by showing whether the node is stuck waiting for stale metadata versions, stuck draining sessions, or never reached these steps at all.	2026-05-04 15:58:45 +03:00
Piotr Smaron	0a780d0ea1	test/cluster: avoid redundant perf alternator CQL wait server_add() already waits for the requested server-up state. For the remote perf-alternator test, request SERVING from server_add() and use the returned server address directly instead of asking for running servers and then calling get_ready_cql() again. This keeps the listener-readiness intent explicit while removing the redundant CQL readiness probe noted in review.	2026-05-04 14:09:28 +02:00
Piotr Smaron	c90012c22b	test/cluster: wait for shard-aware CQL listener server_add() defaults to CQL_ALTERNATOR_QUERIED. That proves the regular CQL driver path is queryable, and regular Alternator ports listed in YAML config if any. It does not prove that every CQL listener configured for the process is already accepting raw TCP connections. test_uninitialized_conns_semaphore opens a raw TCP connection to native_shard_aware_transport_port immediately after startup. The default readiness check can succeed through native_transport_port while the shard-aware listener is still being started, because CQL listeners are registered independently. Wait for ServerUpState.SERVING before opening raw sockets. Scylla sends that notification only after protocol servers are registered, so this closes the startup window without adding sleeps or local retry loops. Fixes: SCYLLADB-1797	2026-05-04 13:36:43 +02:00
Nadav Har'El	983eb5ab43	test/cluster/auth_cluster: use CREATE ROLE IF NOT EXISTS to fix flaky test test_create_role_mixed_cluster calls servers_add(2) to bootstrap two old nodes concurrently, then adds a new node before issuing CREATE ROLE. The concurrent bootstraps trigger the well-known Python driver bug (scylladb/python-driver#317): two on_add notifications race in update_created_pools, causing a second pool to be created for a host whose pool was already established. If CREATE ROLE is in-flight on the old pool when it is closed, the driver retries on the new pool, executing the statement twice. The second execution fails with "Role ... already exists", making the test flaky. Fix by using CREATE ROLE IF NOT EXISTS. This is safe because unique_name() generates a timestamp+random suffix that is guaranteed to be unique; the role can "already exist" only due to the driver double-execution bug, never due to a real conflict. This is the same workaround that has been applied many times elsewhere in our test suite for exactly the same root cause: - CREATE KEYSPACE was changed to CREATE KEYSPACE IF NOT EXISTS (scylladb#18368, later generalised in scylladb#22399 via new_test_keyspace helpers) - DROP KEYSPACE was changed to DROP KEYSPACE IF EXISTS (scylladb#29487) Fixes: SCYLLADB-1742 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#29732	2026-05-04 11:47:11 +02:00
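Why `IF NOT EXISTS` makes the statement safe under double execution can be shown with a toy model. Everything here is hypothetical: `RoleStore` is an in-memory stand-in for the server's role table and `execute_twice` imitates the driver re-running a statement that was already applied; neither is Scylla or Python driver API.

```python
class RoleAlreadyExists(Exception):
    """Hypothetical error raised when a plain CREATE ROLE hits an existing role."""


class RoleStore:
    """In-memory stand-in for the server-side set of roles."""

    def __init__(self) -> None:
        self._roles = set()

    def create_role(self, name: str, if_not_exists: bool = False) -> bool:
        """Return True if the role was created, False if it already existed."""
        if name in self._roles:
            if if_not_exists:
                return False  # second execution is a harmless no-op
            raise RoleAlreadyExists(name)
        self._roles.add(name)
        return True


def execute_twice(statement):
    """Imitate a retry that re-runs a statement which already succeeded."""
    return [statement(), statement()]
```

Because the test's role names are unique, the "already exists" branch can only be reached by such a duplicate execution, so treating it as success does not hide a real conflict.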
Yaniv Michael Kaul	6179406467	raft/group0: fix destroy assertion on startup failure If start_server_for_group0() successfully registers a server in _raft_gr._servers but a subsequent step (e.g. enable_in_memory_state_machine()) throws, the server is never destroyed because abort_and_drain()/destroy() check std::get_if<raft::group_id>(&_group0) which was only set after the entire with_scheduling_group block completed. Move _group0.emplace<raft::group_id>() inside the lambda, immediately after start_server_for_group() succeeds, so that cleanup paths can always find and destroy the registered server. This fixes the assertion: "raft_group_registry - stop(): server for group ... is not destroyed" which manifests during shutdown after an upgrade where topology_state_load() fails due to netw::unknown_address. Backport: Yes, to 2026.1, 2026.2, as it causes a crash on upgrades Refs: SCYLLADB-1217 Refs: CUSTOMER-340 Refs: CUSTOMER-335 Fixes: SCYLLADB-1801 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> AI-assisted: Yes, Opencode/Opus 4.6 Closes scylladb/scylladb#29702	2026-05-04 11:25:46 +02:00
Piotr Smaron	689117f706	test/cluster: wait for proxy protocol ports to serve server_add()'s default readiness only waits until CQL can be queried, but these tests immediately connect to custom proxy protocol listeners. Wait for SERVING so the shard-aware TLS proxy port is accepting connections before the test starts, matching the Alternator proxy protocol readiness fix.	2026-05-04 10:23:03 +02:00
Nadav Har'El	d33bb6ea00	Merge 'test: fix race window test flakiness from residual re-repair' from Avi Kivity Fix the persistent flakiness in `test_incremental_repair_race_window_promotes_unrepaired_data` (SCYLLADB-1478, reopened). After restarting servers[1], the topology coordinator can initiate a residual re-repair when it sees tablets stuck in the `repair` stage. This re-repair flushes memtables on all replicas and marks post-repair data as repaired, contaminating the test state and masking the compaction-merge bug the test is designed to detect. The assertion then fails on the next retry because the previous attempt's re-repair left behind repaired sstables containing post-repair keys. 1. Propagating `current_key` through the exception — correctly advanced the key counter on retry, but the contaminated tablet metadata from the prior re-repair (repaired sstables with post-repair keys) was still present, causing assertion failures on the next attempt. 2. DROP TABLE + CREATE TABLE between retries — the tablet metadata (sstables_repaired_at, repair stage) is tied to the tablet identity, and recreating the table in the same keyspace still showed residual state issues. Instead of trying to clean up contaminated state, each retry creates a completely fresh keyspace (unique name via `create_new_test_keyspace`). This gives entirely new tablets with no residual repair metadata from prior attempts. Combined with broader detection of coordinator changes and residual re-repairs, the test reliably retries before any contamination can cause false failures. The detection is now comprehensive: - Broadened coordinator check: any coordinator change (`new_coord != coord`), not just migration to servers[1] - Re-repair detection at three points: post-restart, during the compaction poll, and after injection release — grep for `"Initiating tablet repair host="` in the coordinator log 1. 
`test: extract _setup_table_for_race_window helper` — pure code-movement refactor that extracts keyspace+table+data+repair1+data+flush into a reusable helper. Easily verifiable as a no-op behavioral change. 2. `test: fix race window test flakiness from residual re-repair` — the actual fix: broadened detection logic + re-repair grep at 3 points + fresh-keyspace retry on exception. Passed 1000 consecutive runs with the fix applied. Without the fix, about 2% flakiness was observed in debug mode. Fixes: SCYLLADB-1478 So far, we haven't observed flakiness of this test on branches, so not backporting yet. Will backport if seen. Closes scylladb/scylladb#29721 * github.com:scylladb/scylladb: test: fix race window test flakiness from residual re-repair test: extract _setup_table_for_race_window helper for race window test	2026-05-03 14:47:19 +03:00


11801 Commits