scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-25 09:11:10 +00:00

Author	SHA1	Message	Date
Łukasz Paszkowski	6f364fd3b7	db: fix system.size_estimates to aggregate sstable estimates across all shards The estimate() function in the size_estimates virtual reader only considered sstables local to the shard that happened to own the keyspace's partition key token. Since sstables are distributed across shards, this caused partition count estimates to be approximately 1/smp_count of the actual value. This bug has been present since the virtual reader was introduced in `225648780d`. Use db.container().map_reduce0() to aggregate sstable estimates across all shards. Each shard contributes its local count and estimated_histogram, which are then merged to produce the correct total. Also fix the `test_partitions_estimate_full_overlap` test which becomes flaky (xpassing ~1% of runs) because autocompaction could merge the two overlapping sstables before the size estimate was read. Wrap the test body in nodetool.no_autocompaction_context to prevent this race. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1179 Refs https://github.com/scylladb/scylladb/issues/9083 Closes scylladb/scylladb#29286	2026-04-07 14:13:26 +03:00
Avi Kivity	8b4a91982b	cmake: add missing rolling_max_tracker_test and symmetric_key_test Added in `5b2a07b408` and `c596ae6eb1` without cmake integration. Closes scylladb/scylladb#29328	2026-04-07 14:09:00 +03:00
Avi Kivity	d01c9a425f	test: test_out_of_storage_prevention: fix invalid escape in regex Python warns that the sequence "\(" is an invalid escape and might be rejected in the future. Protect against that by using a raw string. Closes scylladb/scylladb#29334	2026-04-07 14:06:32 +03:00
Pavel Emelyanov	0ae781c008	Merge 'test: auth_test: coroutinize' from Avi Kivity Convert auth_test.cc to coroutines for improved readability. Each test is converted in its own commit. Some are trivial. Indentation is left broken in some commits to reduce the diff, then fixed up in the last commit. Code cleanup, so no backport. Closes scylladb/scylladb#29336 * github.com:scylladb/scylladb: auth_test: fix whitespace auth_test: coroutinize test_try_describe_schema_with_internals_and_passwords_as_anonymous_user auth_test: coroutinize test_try_login_after_creating_roles_with_hashed_password auth_test: coroutinize test_create_roles_with_hashed_password_and_log_in auth_test: coroutinize test_try_create_role_with_hashed_password_as_anonymous_user auth_test: coroutinize test_try_to_create_role_with_password_and_hashed_password auth_test: coroutinize test_try_to_create_role_with_hashed_password_and_password auth_test: coroutinize test_alter_with_workload_type auth_test: coroutinize test_alter_with_timeouts auth_test: coroutinize role_permissions_table_is_protected auth_test: coroutinize role_members_table_is_protected auth_test: coroutinize roles_table_is_protected auth_test: coroutinize test_password_authenticator_operations auth_test: coroutinize test_password_authenticator_attributes auth_test: coroutinize test_default_authenticator	2026-04-07 14:05:32 +03:00
Botond Dénes	513af59130	encryption: improve error message when KMS host is not configured When an SSTable was encrypted with a KMS host that is not present in scylla.yaml, the error thrown was: std::invalid_argument (No such host: <host-name>) This message is very obscure in general, and especially confusing when encountered while using the scylla-sstable tool: it gives no indication that the SSTable is encrypted, that a KMS host lookup is involved, or what the user needs to do to fix the problem. Replace it with a message that names the missing host and points directly to the relevant scylla.yaml section: Encryption host "<host-name>" is not defined in scylla.yaml. Make sure it is listed under the "kmip_hosts" section. The wording is intentionally kept neutral (not framed as an SSTable tool problem) because the same code path is exercised by production ScyllaDB when a node's configuration no longer contains a host referenced by an existing data file (e.g. after a config rollback or when restoring data from a different cluster). The production use-case takes precedence, but the message is equally actionable from the tool. Closes scylladb/scylladb#29228	2026-04-07 14:00:27 +03:00
Pavel Emelyanov	d6df5ef60a	Merge 'compaction_test: Make compaction tests backend‑agnostic and add S3/GCS support' from Ernest Zaslavsky This series updates the storage abstraction and extends the compaction tests to support object‑storage backends (S3 and GCS), while tightening several parts of the test environment. The changes include: - New exists/object_exists helpers across storage backends and clock fixes in the S3 client to make signature generation stable under test conditions. - A new get_storage_for_tests accessor and adjustments to the test environment to avoid premature teardown of the sstable registry. - Refactoring of compaction tests to remove direct sstable access, ensure proper schema setup, and avoid use of moved‑from objects. - Extraction of test_env‑based logic into reusable functions and addition of S3/GCS variants of the compaction tests. Not all tests were converted to be backend‑agnostic yet, and a few require further investigation before they can run cleanly against S3/GCS backends. These will be addressed in follow‑up work. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-704 (follow-up is still needed). No backport needed, since this change targets a future feature. Closes scylladb/scylladb#28790 * github.com:scylladb/scylladb: compaction_test: fix formatting after previous patches compaction_test: add S3/GCS variations to tests compaction_test: extract test_env-based tests into functions compaction_test: replace file_exists with storage::exists compaction_test: initialize tables with schema via make_table_for_tests compaction_test: use sstable APIs to manipulate component files compaction_test: fix use-after-move issue sstable_utils: add `get_storage` and `open_file` helpers test_env: delay unplugging sstable registry storage: add `exists` method to storage abstraction s3_client: use lowres_system_clock for aws_sigv4 s3_client: add `object_exists` helper gcs_client: add `object_exists` helper	2026-04-07 13:53:48 +03:00
Avi Kivity	bc10e1a171	test: fix flaky test_login by not retrying authentication failures The fix for SCYLLADB-1373 (`b4f652b7c1`) changed get_session() to use the default timeout=30 for the retry loop in patient_*_cql_connection (previously timeout=0.1). This correctly allowed retrying transient NoHostAvailable errors during node startup, but introduced a new flakiness in test_login and other auth tests. The failure chain: 1. test_login connects with bad credentials (e.g. user="doesntexist") 2. get_session() calls patient_exclusive_cql_connection(), which calls retry_till_success() with bypassed_exception=NoHostAvailable 3. The first attempt correctly fails: the server rejects the credentials with AuthenticationFailed, wrapped in NoHostAvailable 4. retry_till_success() catches NoHostAvailable indiscriminately and retries, not distinguishing between transient errors (node not ready) and permanent errors (bad credentials) 5. A subsequent retry attempt times out (connect_timeout=5), producing OperationTimedOut wrapped in NoHostAvailable 6. After 30 seconds, the last NoHostAvailable is raised -- now wrapping OperationTimedOut instead of the original AuthenticationFailed 7. The assertion `isinstance(..., AuthenticationFailed)` fails With the old timeout=0.1, the deadline was already exceeded after the first attempt, so the original AuthenticationFailed propagated. Fix: Add a `should_retry` predicate parameter to retry_till_success() and use it in patient_cql_connection() and patient_exclusive_cql_connection() to immediately re-raise NoHostAvailable when it wraps AuthenticationFailed. Retrying authentication failures is never useful since the credentials won't change between attempts. Fixes: SCYLLADB-1382 Closes scylladb/scylladb#29348	2026-04-07 10:17:31 +03:00
Avi Kivity	b4f652b7c1	test: fix flaky test_create_ks_auth by removing bad retry timeout get_session() was passing timeout=0.1 to patient_exclusive_cql_connection and patient_cql_connection, leaving only 0.1 seconds for the retry loop in retry_till_success(). Since each connection attempt can take up to 5 seconds (connect_timeout=5), the retry loop effectively got only one attempt with no chance to retry on transient NoHostAvailable errors. Use the default timeout=30 seconds, consistent with all other callers. Fixes: SCYLLADB-1373 Closes scylladb/scylladb#29332	2026-04-05 19:13:15 +03:00
Avi Kivity	2f0d178510	auth_test: fix whitespace Fix over-indented lines inside do_with_mc lambda bodies introduced during coroutinization.	2026-04-05 18:28:23 +03:00
Avi Kivity	7a24da9e88	auth_test: coroutinize test_try_describe_schema_with_internals_and_passwords_as_anonymous_user Use co_await instead of return for improved readability.	2026-04-05 18:26:30 +03:00
Avi Kivity	e1b52cf337	auth_test: coroutinize test_try_login_after_creating_roles_with_hashed_password Use co_await instead of return for improved readability.	2026-04-05 18:26:30 +03:00
Avi Kivity	24d36ad459	auth_test: coroutinize test_create_roles_with_hashed_password_and_log_in Use co_await instead of return for improved readability.	2026-04-05 18:26:30 +03:00
Avi Kivity	6f20129eec	auth_test: coroutinize test_try_create_role_with_hashed_password_as_anonymous_user Use co_await instead of return for improved readability.	2026-04-05 18:26:30 +03:00
Avi Kivity	cece181113	auth_test: coroutinize test_try_to_create_role_with_password_and_hashed_password Use co_await instead of return for improved readability.	2026-04-05 18:26:30 +03:00
Avi Kivity	752391f757	auth_test: coroutinize test_try_to_create_role_with_hashed_password_and_password Use co_await instead of return for improved readability.	2026-04-05 18:26:30 +03:00
Avi Kivity	287625b297	auth_test: coroutinize test_alter_with_workload_type Use co_await instead of return for improved readability.	2026-04-05 18:26:30 +03:00
Avi Kivity	4eeb5ef54d	auth_test: coroutinize test_alter_with_timeouts Use co_await instead of return for improved readability.	2026-04-05 18:26:30 +03:00
Avi Kivity	170c71b25d	auth_test: coroutinize role_permissions_table_is_protected Use co_await for improved readability.	2026-04-05 18:26:30 +03:00
Avi Kivity	13eccf519f	auth_test: coroutinize role_members_table_is_protected Use co_await for improved readability.	2026-04-05 18:26:30 +03:00
Avi Kivity	43ff3798ad	auth_test: coroutinize roles_table_is_protected Use co_await for improved readability.	2026-04-05 18:26:30 +03:00
Avi Kivity	c586eeb003	auth_test: coroutinize test_password_authenticator_operations Flatten continuation chains (.then()) into linear thread-style code with .get() calls for improved readability. Remove the now-unused require_throws helper template.	2026-04-05 18:26:25 +03:00
Avi Kivity	fbccfe5c9d	auth_test: coroutinize test_password_authenticator_attributes Use co_await instead of return+do_with_cql_env+make_ready_future for improved readability.	2026-04-05 17:28:09 +03:00
Avi Kivity	e3dee64003	auth_test: coroutinize test_default_authenticator Use co_await instead of return+do_with_cql_env+make_ready_future for improved readability.	2026-04-05 17:27:45 +03:00
Tomasz Grabiec	74542be5aa	test: pylib: Ignore exceptions in wait_for() ManagerClient::get_ready_cql() calls server_sees_others(), which waits for servers to see each other as alive in gossip. If one of the servers is still early in boot, RESTful API call to "gossiper/endpoint/live" may fail. It throws an exception, which currently terminates the wait_for() and propagates up, failing the test. Fix this by ignoring errors when polling inside wait_for. In case of timeout, we log the last exception. This should fix the problem not only in this case, for all uses of wait_for(). Example output: ``` pred = <function ManagerClient.server_sees_others.<locals>._sees_min_others at 0x7f022af9a140> deadline = 1775218828.9172852, period = 1.0, before_retry = None backoff_factor = 1.5, max_period = 1.0, label = None async def wait_for( pred: Callable[[], Awaitable[Optional[T]]], deadline: float, period: float = 0.1, before_retry: Optional[Callable[[], Any]] = None, backoff_factor: float = 1.5, max_period: float = 1.0, label: Optional[str] = None) -> T: tag = label or getattr(pred, '__name__', 'unlabeled') start = time.time() retries = 0 last_exception: Exception \| None = None while True: elapsed = time.time() - start if time.time() >= deadline: timeout_msg = f"wait_for({tag}) timed out after {elapsed:.2f}s ({retries} retries)" if last_exception is not None: timeout_msg += ( f"; last exception: {type(last_exception).__name__}: {last_exception}" ) raise AssertionError(timeout_msg) from last_exception raise AssertionError(timeout_msg) try: > res = await pred() test/pylib/util.py:80: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ async def _sees_min_others(): > raise Exception("asd") E Exception: asd test/pylib/manager_client.py:802: Exception The above exception was the direct cause of the following exception: manager = <test.pylib.manager_client.ManagerClient object at 0x7f022af7e7b0> @pytest.mark.asyncio async def test_auth_after_reset(manager: 
ManagerClient) -> None: servers = await manager.servers_add(3, config=auth_config, auto_rack_dc="dc1") > cql, _ = await manager.get_ready_cql(servers) test/cluster/auth_cluster/test_auth_after_reset.py:33: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ test/pylib/manager_client.py:137: in get_ready_cql await self.servers_see_each_other(servers) test/pylib/manager_client.py:820: in servers_see_each_other await asyncio.gather(others) test/pylib/manager_client.py:806: in server_sees_others await wait_for(_sees_min_others, time() + interval, period=.5) _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ pred = <function ManagerClient.server_sees_others.<locals>._sees_min_others at 0x7f022af9a140> deadline = 1775218828.9172852, period = 1.0, before_retry = None backoff_factor = 1.5, max_period = 1.0, label = None async def wait_for( pred: Callable[[], Awaitable[Optional[T]]], deadline: float, period: float = 0.1, before_retry: Optional[Callable[[], Any]] = None, backoff_factor: float = 1.5, max_period: float = 1.0, label: Optional[str] = None) -> T: tag = label or getattr(pred, '__name__', 'unlabeled') start = time.time() retries = 0 last_exception: Exception \| None = None while True: elapsed = time.time() - start if time.time() >= deadline: timeout_msg = f"wait_for({tag}) timed out after {elapsed:.2f}s ({retries} retries)" if last_exception is not None: timeout_msg += ( f"; last exception: {type(last_exception).__name__}: {last_exception}" ) > raise AssertionError(timeout_msg) from last_exception E AssertionError: wait_for(_sees_min_others) timed out after 45.30s (46 retries); last exception: Exception: asd test/pylib/util.py:76: AssertionError ``` Fixes a failure observed in test_auth_after_reset: ``` manager = <test.pylib.manager_client.ManagerClient object at 0x7fb3740e1630> @pytest.mark.asyncio async def test_auth_after_reset(manager: ManagerClient) -> None: servers = await manager.servers_add(3, 
config=auth_config, auto_rack_dc="dc1") cql, _ = await manager.get_ready_cql(servers) await cql.run_async("ALTER ROLE cassandra WITH PASSWORD = 'forgotten_pwd'") logging.info("Stopping cluster") await asyncio.gather([manager.server_stop_gracefully(server.server_id) for server in servers]) logging.info("Deleting sstables") for table in ["roles", "role_members", "role_attributes", "role_permissions"]: await asyncio.gather([manager.server_wipe_sstables(server.server_id, "system", table) for server in servers]) logging.info("Starting cluster") # Don't try connect to the servers yet, with deleted superuser it will be possible only after # quorum is reached. await asyncio.gather([manager.server_start(server.server_id, connect_driver=False) for server in servers]) logging.info("Waiting for CQL connection") await repeat_until_success(lambda: manager.driver_connect(auth_provider=PlainTextAuthProvider(username="cassandra", password="cassandra"))) > await manager.get_ready_cql(servers) test/cluster/auth_cluster/test_auth_after_reset.py:50: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ test/pylib/manager_client.py:137: in get_ready_cql await self.servers_see_each_other(servers) test/pylib/manager_client.py:819: in servers_see_each_other await asyncio.gather(*others) test/pylib/manager_client.py:805: in server_sees_others await wait_for(_sees_min_others, time() + interval, period=.5) test/pylib/util.py:71: in wait_for res = await pred() test/pylib/manager_client.py:802: in _sees_min_others alive_nodes = await self.api.get_alive_endpoints(server_ip) test/pylib/rest_client.py:243: in get_alive_endpoints data = await self.client.get_json(f"/gossiper/endpoint/live", host=node_ip) test/pylib/rest_client.py:99: in get_json ret = await self._fetch("GET", resource_uri, response_type = "json", host = host, _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ self = <test.pylib.rest_client.TCPRESTClient object at 
0x7fb2404a0650> method = 'GET', resource = '/gossiper/endpoint/live', response_type = 'json' host = '127.15.252.8', port = 10000, params = None, json = None, timeout = None allow_failed = False async def _fetch(self, method: str, resource: str, response_type: Optional[str] = None, host: Optional[str] = None, port: Optional[int] = None, params: Optional[Mapping[str, str]] = None, json: Optional[Mapping] = None, timeout: Optional[float] = None, allow_failed: bool = False) -> Any: # Can raise exception. See https://docs.aiohttp.org/en/latest/web_exceptions.html assert method in ["GET", "POST", "PUT", "DELETE"], f"Invalid HTTP request method {method}" assert response_type is None or response_type in ["text", "json"], \ f"Invalid response type requested {response_type} (expected 'text' or 'json')" # Build the URI port = port if port else self.default_port if hasattr(self, "default_port") else None port_str = f":{port}" if port else "" assert host is not None or hasattr(self, "default_host"), "_fetch: missing host for " \ "{method} {resource}" host_str = host if host is not None else self.default_host uri = self.uri_scheme + "://" + host_str + port_str + resource logging.debug(f"RESTClient fetching {method} {uri}") client_timeout = ClientTimeout(total = timeout if timeout is not None else 300) async with request(method, uri, connector = self.connector if hasattr(self, "connector") else None, params = params, json = json, timeout = client_timeout) as resp: if allow_failed: return await resp.json() if resp.status != 200: text = await resp.text() > raise HTTPError(uri, resp.status, params, json, text) E test.pylib.rest_client.HTTPError: HTTP error 404, uri: http://127.15.252.8:10000/gossiper/endpoint/live, params: None, json: None, body: E {"message": "Not found", "code": 404} test/pylib/rest_client.py:77: HTTPError ``` Fixes: SCYLLADB-1367 Closes scylladb/scylladb#29323	2026-04-05 13:52:26 +03:00
Ernest Zaslavsky	c7a74237b3	compaction_test: fix formatting after previous patches	2026-04-05 11:07:17 +03:00
Ernest Zaslavsky	101b4ad7fa	compaction_test: add S3/GCS variations to tests Add S3 and GCS variants of the compaction tests to expand coverage for keyspaces configured to use object_storage backends.	2026-04-05 11:07:17 +03:00
Ernest Zaslavsky	03bd3010bf	compaction_test: extract test_env-based tests into functions Move all test code that relies on test_env into standalone free functions so they can be reused by upcoming S3 and GCS test suites.	2026-04-05 11:07:17 +03:00
Ernest Zaslavsky	b18528e97e	compaction_test: replace file_exists with storage::exists Replace direct filesystem checks (file_exists) with the storage-agnostic exists() method in unsealed_sstable_compaction, sstable_clone_leaving_unsealed_dest_sstable, and failure_when_adding_new_sstable tests, making them compatible with object-storage backends (S3, GCS).	2026-04-05 11:07:17 +03:00
Ernest Zaslavsky	98492e4ea8	compaction_test: initialize tables with schema via make_table_for_tests Start using `table_for_tests::make_default_schema` so test tables are created with a real schema. This is required for object-storage backends, which cannot operate correctly without proper schema initialization.	2026-04-05 11:07:17 +03:00
Ernest Zaslavsky	5ba79e2ed4	compaction_test: use sstable APIs to manipulate component files Switch tests to use sstable member functions for file manipulation instead of opening files directly on the filesystem. This affects the helpers that emulate sstable corruption: we now overwrite the entire component file rather than just the first few kilobytes, which is sufficient for producing a corrupted sstable.	2026-04-05 11:07:17 +03:00
Ernest Zaslavsky	405c032f48	compaction_test: fix use-after-move issue We were moving `compaction_type_options` inside a loop, so on the second iteration the test received an already moved-from instance.	2026-04-05 11:07:17 +03:00
Ernest Zaslavsky	437a581b04	sstable_utils: add `get_storage` and `open_file` helpers Add a non-const `get_storage` accessor to expose underlying storage, and an `open_file` helper to access sstable component files directly. These are needed so compaction tests can read and write sstable components.	2026-04-05 11:07:17 +03:00
Ernest Zaslavsky	2ad2dbae03	test_env: delay unplugging sstable registry Unplugging the mock sstable_registry happened too early in the test environment. During sstable destruction, components may still need access to the registry, so the unplugging is moved to a later stage.	2026-04-05 11:07:17 +03:00
Andrzej Jackowski	8c0920202b	test: protect populate_range in row_cache_test from bad_alloc When test_exception_safety_of_update_from_memtable was converted from manual fail_after()/catch to with_allocation_failures() in `74db08165d`, the populate_range() call ended up inside the failure injection scope without a scoped_critical_alloc_section guard. The other two tests converted in the same commit (test_exception_safety_of_transitioning... and test_exception_safety_of_partition_scan) were correctly guarded. Without the guard, the allocation failure injector can sometimes target an allocation point inside the cleanup path of populate_range(). In a rare corner case, this triggers a bad_alloc in a noexcept context (reader_concurrency_semaphore::stop()), causing std::terminate. Fixes SCYLLADB-1346 Closes scylladb/scylladb#29321	2026-04-04 21:13:26 +03:00
Andrzej Jackowski	ec274cf7b6	test: add test_upgrade_preserves_ddl_audit_for_tables Verify that upgrading from 2025.1 to master does not silently drop DDL auditing for table-scoped audit configurations (SCYLLADB-1155). Test time in dev: 4s Refs: SCYLLADB-1155 Fixes: SCYLLADB-1305	2026-04-03 13:53:28 +02:00
Andrzej Jackowski	9c7b7ac3e3	test: audit: split validate helper so callers need not pass audit_settings The old execute_and_validate_audit_entry required every caller to pass audit_settings so it could decide internally whether to expect an entry. A test added later in this series needs to simply assert an entry was produced, without specifying audit_settings at all. Split into two methods: - execute_and_validate_new_audit_entry: unconditionally expects an audit entry. - execute_and_validate_if_category_enabled: checks audit_settings to decide whether to expect an entry or assert absence. Local wrapper functions and **kwargs forwarding are removed in favor of explicit arguments at each call site, and expected-error cases are handled inline with assert_invalid + assert_entries_were_added.	2026-04-03 13:52:47 +02:00
Andrzej Jackowski	189bff1d5c	test: audit: declare manager attribute in AuditTester base class AuditTester uses self.manager throughout but never declares it. The attribute is only assigned in the CQLAuditTester subclass __init__, so the type checker reports 'Attribute "manager" is unknown' on every self.manager reference in the base class. Add an __init__ to AuditTester that accepts and stores the manager instance, and update CQLAuditTester to forward it via super().__init__ instead of assigning self.manager directly.	2026-04-03 13:52:47 +02:00
Botond Dénes	2c22d69793	Merge 'Pytest: fix variable handling in GSServer (mock) and ensure docker service logs go to test log as well' from Calle Wilund Fixes: SCYLLADB-1106 * Small fix in scylla_cluster - remove debug print * Fix GSServer::unpublish so it does not raise if publish was not called beforehand * Improve dockerized_server so mock server logs echo to the test log to help diagnose CI failures (we don't collect log files from mocks, and in any case correlation will be much easier). No backport needed. Closes scylladb/scylladb#29112 * github.com:scylladb/scylladb: dockerized_service: Convert log reader to pipes and push to test log test::cluster::conftest::GSServer: Fix unpublish for when publish was not called scylla_cluster: Use thread safe future signalling scylla_cluster: Remove left-over debug printout	2026-04-03 06:38:05 +03:00
Raphael S. Carvalho	b6ebbbf036	test/cluster/test_tablets2: Fix test_split_stopped_on_shutdown race with stale log messages The test was failing because the call to: await log.wait_for('Stopping.ongoing compactions') was missing the 'from_mark=log_mark' argument. The log mark was updated (line: log_mark = await log.mark()) immediately after detecting 'splitting_mutation_writer_switch_wait: waiting', and just before launching the shutdown task. However, the wait_for call on the following line was scanning from the beginning of the log, not from that mark. As a result, the search immediately matched old 'Stopping N tasks for N ongoing compactions for table system.X due to table removal' messages emitted during initial server bootstrap (for system.large_partitions, system.large_rows, system.large_cells), rather than waiting for the shutdown to actually stop the user-table split compaction. This caused the test to prematurely send the message to the 'splitting_mutation_writer_switch_wait' injection. The split compaction was unblocked before the shutdown had aborted it, so it completed successfully. Since the split succeeded, 'Failed to complete splitting of table' was never logged. Meanwhile, 'storage_service_drain_wait' was blocking do_drain() waiting for a message. With the split already done, the test was stuck waiting for the expected failure log that would never come (600s timeout). At the same time, after 60s the 'storage_service_drain_wait' injection timed out internally, triggering on_internal_error() which -- with --abort-on-internal-error=1 -- crashed the server (exit code -6). Fix: pass from_mark=log_mark to the wait_for('Stopping.ongoing compactions') call so it only matches messages that appear after the shutdown has started, ensuring the test correctly synchronizes with the shutdown aborting the user-table split compaction before releasing the injection. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1319. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#29311	2026-04-03 06:28:51 +03:00
Andrei Chekun	6526a78334	test.py: fix nodetool mock server port collision Replace the random port selection with an OS-assigned port. We open a temporary TCP socket, bind it to (ip, 0) with SO_REUSEADDR, read back the port number the OS selected, then close the socket before launching rest_api_mock.py. Add reuse_address=True and reuse_port=True to TCPSite in rest_api_mock.py so the server itself can also reclaim a TIME_WAIT port if needed. Fixes: SCYLLADB-1275 Closes scylladb/scylladb#29314	2026-04-02 16:24:07 +02:00
Botond Dénes	eb78498e07	test: fix flaky test_timeout_is_applied_on_lookup by using eventually_true On slow/overloaded CI machines the lowres_clock timer may not have fired after the fixed 2x sleep, causing the assertion on get_abort_exception() to fail. Replace the fixed sleep with sleep(1x) + eventually_true() which retries with exponential backoff, matching the pattern already used in test_time_based_cache_eviction. Fixes: SCYLLADB-1311 Closes scylladb/scylladb#29299	2026-04-01 18:20:11 +03:00
Robert Bindar	e7527392c4	test: close clients if cluster teardown throws Make sure the driver is stopped even if cluster teardown throws, so that stale driver connections cannot enter infinite reconnect loops that exhaust CPU resources. Fixes: SCYLLADB-1189 Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#29230	2026-04-01 17:22:19 +03:00
Tomasz Grabiec	2ec47a8a21	tests: address_map_test: Fix flakiness in debug mode due to task reordering Debug mode shuffles task positions in the queue, so the following is possible: 1) shard 1 calls manual_clock::advance(). This expires timers on shard 1 and queues a background smp call to shard 0 which will expire timers there 2) the smp::submit_to(0, ...) from shard 1 called by the test submits the call 3) shard 0 creates tasks for both calls, but (2) is run first, and preempts the reactor 4) shard 1 sees the completion, completes m_svc.invoke_on(1, ..) 5) shard 0 inserts the completion from (4) before the task from (1) 6) the check on shard 0: m.find(id1) fails because the timer is not expired yet To fix that, wait for timer expiration on shard 0, so that the test doesn't depend on task execution order. Note: I was not able to reproduce the problem locally using test.py --mode debug --repeat 1000. It happens in Jenkins very rarely, which is expected, as the scenario that leads to it is quite unlikely. Fixes SCYLLADB-1265 Closes scylladb/scylladb#29290	2026-04-01 17:17:35 +03:00
Aleksandra Martyniuk	4d4ce074bb	test: node_ops_tasks_tree: reconnect driver after topology changes The test exercises all five node operations (bootstrap, replace, rebuild, removenode, decommission) and by the end only one node out of four remains alive. The CQL driver session, however, still holds stale references to the dead hosts in its connection pool and load-balancing policy state. When the new_test_keyspace context manager exits and attempts DROP KEYSPACE, the driver routes the query to the dead hosts first, gets ConnectionShutdown from each, and throws NoHostAvailable before ever trying the single live node. Fix by calling driver_connect() after the decommission step, which closes the old session and creates a fresh one connected only to the servers the test manager reports as running. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1313. Closes scylladb/scylladb#29306	2026-04-01 17:13:11 +03:00
Dario Mirovic	85127fded8	test: boost: test null data value to_parsable_string Add tests for null value in data_type::to_parsable_string(). We now explicitly return "null". Refs SCYLLADB-1350	2026-04-01 14:15:25 +02:00
Andrzej Jackowski	cccb014747	test: ldap: add regression test for double-free on unregistered message ID Sends a search via the raw LDAP handle (bypassing _msgid_to_promise registration), then triggers poll_results() through the public API to exercise the unregistered-ID branch. Refs: SCYLLADB-1344	2026-04-01 12:57:50 +02:00
Botond Dénes	0351756b15	Merge 'test: fix fuzzy_test timeout in release mode' from Piotr Smaron The multishard_query_test/fuzzy_test was timing out (SIGKILL after 15 minutes) in release mode CI. In release mode the test generates up to 64 partitions with up to 1000 clustering rows and 1000 range tombstones each. With deeply nested randomly-generated types (e.g. frozen<map<varint, frozen<map<frozen<tuple<...>>>>>>), this volume of data can exceed the 15-minute CI timeout. Reduce the release-mode clustering-row and range-tombstone distributions from 0-1000 to 0-200. This caps the worst case at ~12,800 rows -- still 2x the devel-mode maximum (0-100) and sufficient to exercise multi-partition paged scanning with many pages. Fixes: SCYLLADB-1270 No need to backport for now, only appeared on master. Closes scylladb/scylladb#29293 * github.com:scylladb/scylladb: test: clean up fuzzy_test_config and add comments test: fix fuzzy_test timeout in release mode	2026-04-01 11:50:15 +03:00
Avi Kivity	d438e35cdd	test/cluster: fix race in test_insert_failure_standalone audit log query get_audit_partitions_for_operation() returns None when no audit log rows are found. In _test_insert_failure_doesnt_report_success_assign_nodes, this None is passed to set(), causing TypeError: 'NoneType' object is not iterable. The audit log entry may not yet be visible immediately after executing the INSERT, so use wait_for() from test.pylib.util with exponential backoff to poll until the entry appears. Import it as wait_for_async to avoid shadowing the existing wait_for from test.cluster.dtest.dtest_class, which has a different signature (timeout vs deadline). Fixes SCYLLADB-1330 Closes scylladb/scylladb#29289	2026-04-01 10:59:02 +03:00
Michael Litvak	35547bfb6e	test: logstor: additional logstor tests	2026-03-31 18:45:08 +02:00
Michael Litvak	6ace823ee4	test: logstor: tablet split/merge and migration Add basic logstor tests for tablet split/merge and migration to verify that they work as expected.	2026-03-31 18:45:08 +02:00
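
The `should_retry` fix described in commit bc10e1a171 above can be illustrated with a minimal sketch. This is not the code from the repository: the exception classes are local stand-ins for the driver's `NoHostAvailable` and `AuthenticationFailed`, `is_transient` is a hypothetical predicate name, and the signature of `retry_till_success` is assumed from the commit message (only the `timeout`, `bypassed_exception` and `should_retry` parameters are mentioned there; `period` is added for illustration).

```python
import time
from typing import Callable, Optional, Type, TypeVar

T = TypeVar("T")


class AuthenticationFailed(Exception):
    """Stand-in for a permanent failure such as bad credentials."""


class NoHostAvailable(Exception):
    """Stand-in for the driver's wrapper exception; `errors` maps host -> cause."""

    def __init__(self, message: str, errors: dict):
        super().__init__(message)
        self.errors = errors


def retry_till_success(fun: Callable[[], T],
                       timeout: float = 30.0,
                       bypassed_exception: Type[Exception] = Exception,
                       should_retry: Optional[Callable[[Exception], bool]] = None,
                       period: float = 0.01) -> T:
    """Call fun() until it succeeds or the deadline passes.

    Only `bypassed_exception` is caught. If `should_retry` is given and
    returns False for a caught exception, it is re-raised immediately
    instead of being retried until the deadline.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return fun()
        except bypassed_exception as exc:
            if should_retry is not None and not should_retry(exc):
                raise  # permanent error: retrying cannot help
            if time.monotonic() >= deadline:
                raise  # out of time: surface the last error
            time.sleep(period)


def is_transient(exc: Exception) -> bool:
    """Hypothetical predicate: retry unless any wrapped cause is an auth failure."""
    errors = getattr(exc, "errors", {})
    return not any(isinstance(e, AuthenticationFailed) for e in errors.values())
```

With this shape, a call such as `retry_till_success(connect, bypassed_exception=NoHostAvailable, should_retry=is_transient)` still retries while a node is starting up, but a rejected login surfaces on the first attempt with its original `AuthenticationFailed` cause intact, which is what the assertion in test_login checks for.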


11801 Commits
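
The port-selection approach described in commit 6526a78334 (bind a temporary socket to port 0 with SO_REUSEADDR, read back the port the OS chose, then close the socket) can be sketched as follows. The function name `pick_free_port` and the default IP are assumptions for illustration; this is not the helper used in test.py.

```python
import socket


def pick_free_port(ip: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port on `ip`.

    Binding to port 0 makes the kernel pick a free port. The socket is
    closed on return so that another process can bind the port afterwards.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Allow the port to be rebound promptly after this socket closes.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind((ip, 0))
        return sock.getsockname()[1]
```

A short window remains between closing the temporary socket and the server binding the port, so a collision is still possible in principle, though much less likely than with a randomly chosen port number.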