scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-22 15:52:13 +00:00

Author	SHA1	Message	Date
Yaniv Michael Kaul	3e2b0f844c	docs/cql: fix missing opening quote in ALTER KEYSPACE example The dc2 key was missing its opening single quote: dc2' should be 'dc2'.	2026-05-06 11:32:04 +03:00
Yaniv Michael Kaul	815aad50af	docs/cql: fix INSERT example clause order (IF NOT EXISTS before USING) The grammar requires IF NOT EXISTS to appear before USING TTL, not after. The example had 'USING TTL 86400 IF NOT EXISTS' which produces a syntax error.	2026-05-06 11:32:04 +03:00
Nadav Har'El	19555bc2cf	test/pylib: fix missing protocol_version=4 on control_cluster get_cql_up_state() creates two Cluster instances: a short-lived one used to probe CQL readiness, and a persistent control_cluster kept alive for the lifetime of the server. The probe cluster was created with protocol_version=4 (the highest version Scylla supports), but the control_cluster was not, causing the driver to do a superfluous version-negotiation round-trip on every server start. Fix by extracting the shared constructor arguments into a cluster_kwargs dict and using **cluster_kwargs for both calls, so the two Cluster instances are created with identical parameters. This deduplication can help avoid more instances of this bug, where someone modifies the options in one call but forgets to change the options in the other call. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 20:57:49 +03:00
Nadav Har'El	f977621e40	scylla_cluster: guard poll_status() set_result() calls against cancelled future The poll_status() background thread resolves `serving_signal` by scheduling `f.set_result(...)` on the event loop via `call_soon_threadsafe`. In parallel, `_cleanup_notify_socket()` can cancel `serving_signal` at any time - for example when a server fails to start and `stop()` -> `shutdown_control_connection()` is called while the thread is still blocked in `recv()` (the socket close unblocks the `recv()` with an exception, sending it down the error path). When that race fires the scheduled `f.set_result(...)` callback runs after `cancel()` has already put the future into the cancelled state, raising `asyncio.InvalidStateError: Result is not allowed in cancelled state`. This bug predates the SERVING work, but the original CQL_ALTERNATOR_QUERIED default meant the notify socket was torn down quickly most of the time, making the window very narrow. Now that SERVING is the default the socket stays open throughout the full startup wait, widening the race significantly. Fix: replace every bare `f.set_result(v)` call with `lambda: f.done() or f.set_result(v)`, which is a no-op when the future is already done (cancelled, or resolved by another path). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 20:34:58 +03:00
Nadav Har'El	ff33440c6c	test/cluster: avoid repeated CQL checks and leaks while waiting for SERVING With ServerUpState.SERVING now the default, server_add() and server_start() wait for sd_notify readiness after CQL is already up. During that window the startup polling loop was calling get_cql_alternator_up_state() on every iteration (every 100ms). Each successful call recreated self.control_cluster and self.control_connection without closing the previous ones, leaking driver connections and adding unnecessary CQL load to a node that was already known to be queryable. Fix in two places: - Startup loop: skip the get_cql_alternator_up_state() call once server_up_state has reached CQL_ALTERNATOR_QUERIED. After that point only the cheap non-blocking check_serving_notification() is needed. - get_cql_up_state(): guard control_cluster/control_connection creation with `if self.control_connection is None` so the persistent driver connection is only established once, even if the function is called multiple times. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 20:29:18 +03:00
Karol Nowacki	20b953ef8c	vector_search: test: migrate paging warnings tests to Python Move the paging warning related tests from C++ vector_store_client_test to Python test_vector_search_with_vector_store_mock.	2026-05-05 18:23:30 +02:00
Karol Nowacki	84787ce6a5	vector_search: test: migrate local_vector_index to Python Move the local vector index test from C++ vector_store_client_test to Python test_vector_search_with_vector_store_mock. The test creates a local vector index on ((pk1, pk2), embedding) and verifies that SELECT with partition key restriction and ANN ordering works correctly.	2026-05-05 18:23:30 +02:00
Karol Nowacki	0bb7e47090	vector_search: test: migrate vector_index_with_additional_filtering_column to Python Move the SCYLLADB-635 regression test from C++ vector_store_client_test to Python test_vector_search_with_vector_store_mock. The test creates a vector index on (embedding, ck1) and verifies that SELECT with ANN ordering works correctly when additional filtering columns are included in the index definition.	2026-05-05 18:23:30 +02:00
Karol Nowacki	5a8af3c727	vector_search: test: migrate cql_error_contains_http_error_description to Python Move the test that verifies HTTP error descriptions from the vector store are propagated through CQL InvalidRequest messages from the C++ vector_store_client_test to the Python test_vector_search_with_vector_store_mock. The test configures the mock to return HTTP 404 with 'index does not exist' and asserts the CQL SELECT raises InvalidRequest containing '404'.	2026-05-05 18:23:30 +02:00
Karol Nowacki	b672972c5f	vector_search: test: migrate pk in restriction test to Python Move vector search (ANN ordered select query) with IN restrictions on partition key from C++/Boost test suite to pytest (cqlpy). Add VectorStoreMock server as pytest fixture to simulate vector store responses.	2026-05-05 18:23:30 +02:00
Nadav Har'El	417b4e0765	test/cluster: fix check_serving_notification() inefficiency When the sd_notify future completed, check_serving_notification() correctly updated _received_serving to True but still returned False on that same call. The SERVING state was only recognized on the next polling iteration, 100ms later, for no reason. Return self._received_serving instead of False after updating it.	2026-05-05 18:56:37 +03:00
Nadav Har'El	67384dbb96	test/cluster: remove now-redundant expected_server_up_state=SERVING ServerUpState.SERVING is now the default for server_add() and server_start(), so the explicit arguments in various tests are no longer needed. Remove them along with the unused ServerUpState imports and the docstring comments that explained why they were there. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 18:56:37 +03:00
Nadav Har'El	3734afe193	test/cluster: document that add/start waits for all ports to be ready Add docstrings to server_add(), server_start(), and servers_add() explaining that they wait for ServerUpState.SERVING before returning, which means Scylla has finished listening on all configured ports (including non-default ones). Note that server_add() and server_start() accept expected_server_up_state to return earlier if needed, while servers_add() always waits for SERVING. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 18:56:32 +03:00
Nadav Har'El	90eef72794	test/cluster: update remaining CQL_ALTERNATOR_QUERIED defaults to SERVING ScyllaServer.install_and_start() and ScyllaServer.start() still had ServerUpState.CQL_ALTERNATOR_QUERIED as their default for expected_server_up_state. In practice these defaults are never reached - both call sites in ScyllaCluster always pass the value explicitly, forwarding it from the higher-level add_server() and server_start() whose defaults were already fixed. Update them to SERVING anyway for consistency, so that the low-level methods agree with the policy established at the higher layers and won't silently revert to the wrong behavior if a new call site is added without an explicit argument. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 18:51:19 +03:00
Nadav Har'El	af03f0e8c4	test/cluster: fix server_add/server_start hanging when starting in maintenance mode When Scylla starts in maintenance mode it sends sd_notify("STATUS=entering maintenance mode") instead of sd_notify("STATUS=serving"), and does not open the standard CQL port. This caused two independent bugs after the default was changed to ServerUpState.SERVING: 1. poll_status() resolved serving_signal to False on the maintenance notification, so check_serving_notification() would never return True, and start() would time out waiting for SERVING. 2. The readiness check in start() was guarded by `server_up_state >= CQL_ALTERNATOR_QUERIED`, which is never reached in maintenance mode (the standard CQL port is not open). Even if bug 1 were fixed, SERVING would never be recognized. Fix both: - Treat STATUS=entering maintenance mode as a successful readiness signal in poll_status(), resolving serving_signal to True just like STATUS=serving. Both mean "all configured ports are now open". - Remove the CQL_ALTERNATOR_QUERIED precondition from the check_serving_notification() call in start(). The sd_notify signal is authoritative: Scylla sends it only when fully ready, regardless of which ports it opened. No CQL precondition is needed. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 18:51:18 +03:00
Nadav Har'El	597838c501	main: notify "entering maintenance mode" after the maintenance CQL server is ready The sd_notify "entering maintenance mode" status was emitted before start_cql() was called, so clients that waited for this notification could attempt to connect to the maintenance socket before it was actually accepting connections. Move the checkpoint() call to after start_cql(), matching how the normal startup path emits "serving" only after all configured listeners are open. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 18:51:17 +03:00
Karol Nowacki	207de967fb	vector_search: test: default timeout in test_dns_resolving_repeated Replace explicit 1-second timeouts in repeat_until() with the default STANDARD_WAIT (10s). The 1-second timeout could be too aggressive for loaded CI environments where lowres_clock granularity (~10ms) combined with OS scheduling delays and resource contention (-c2 -m2G) could cause the loop to expire before the DNS refresh task completes its cycle. This also unifies test timeouts across test cases.	2026-05-05 17:23:39 +02:00
Karol Nowacki	4722be1289	vector_search: test: fix flaky test_dns_resolving_repeated Move trigger_dns_resolver() inside the repeat_until loop instead of calling it once before the loop. The test was intermittently timing out on CI. The exact root cause is not fully understood, but the hypothesis is that a single trigger signal can be lost somewhere (not exactly known where). This is not an issue for the production code because refresh trigger will be called multiple times - in every query where all configured nodes will be unreachable. By triggering inside the loop, we ensure the signal is re-sent on each iteration until the resolver actually performs the refresh and picks up the new (failing) DNS resolution. This makes the test resilient to timing-dependent signal loss without changing production code. Fixes: SCYLLADB-1794	2026-05-05 17:23:39 +02:00
Nadav Har'El	e014521565	test/cluster: make server_start() default to ServerUpState.SERVING For the same reason server_add() was changed to default to SERVING (see previous commit), server_start() had the same bug: after restarting a node that listens on non-default ports, the polling of the hardcoded CQL/Alternator ports could succeed before the custom ports were ready, causing intermittent failures. Apply the same fix to server_start() in manager_client.py, ScyllaCluster.server_start(), and the _cluster_server_start HTTP handler. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 18:18:32 +03:00
Nadav Har'El	f91525c5df	test/cluster: make server_add() default to ServerUpState.SERVING server_add() was defaulting to ServerUpState.CQL_ALTERNATOR_QUERIED, which polls the standard CQL and Alternator ports to determine when the server is ready. This is wrong when a test configures Scylla to listen on non-default ports: the polling succeeds on the default ports while the custom ports may not yet be ready, making such tests intermittently flaky. The correct behavior is ServerUpState.SERVING, which waits for Scylla's sd_notify("READY=1") signal. This signal is sent only after all configured listeners — including custom ports — are fully open, so it is the right readiness signal regardless of the port configuration. Up to now, the fix for each affected test was to pass expected_server_up_state=ServerUpState.SERVING explicitly once the flakiness was noticed (e.g. #29737). Change the default so that all future tests get the correct behavior automatically. Changed in manager_client.server_add(), ScyllaCluster.add_server(), and the _cluster_server_add HTTP handler. The multi-server servers_add() path already inherits the new default through add_server(). Fixes SCYLLADB-1822 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 18:18:32 +03:00
Marcin Maliszkiewicz	5c5306c692	test: add reproducer for auth cache crash on missing permissions column	2026-05-05 17:16:25 +02:00
Marcin Maliszkiewicz	df69a5c79b	auth: tolerate missing permissions column in authorize() Ghost rows in role_permissions with a live row marker but no permissions column can occur when permissions created via INSERT (e.g. by the removed auth v2 migration) are later revoked. The row marker survives the revoke, leaving a row visible to queries but with permissions=null. Add a has() guard before accessing the permissions column, matching the pattern already used in list_all(). Return NONE permissions for such ghost rows instead of crashing.	2026-05-05 15:50:40 +02:00
Marcin Maliszkiewicz	c44625ebdf	auth: add defensive has() guard for role_attributes value column Add a has() check before accessing the value column in role_attributes to tolerate ghost rows with missing regular columns. In practice this is unlikely to be a problem since attributes are not typically revoked, but the guard is added for consistency and defensive programming.	2026-05-05 15:48:01 +02:00
Marcin Maliszkiewicz	797bc28aae	auth: remove unused permissions field from cache role_record The permissions field in role_record was populated by fetch_role() but never read. Authorization uses cached_permissions instead, which is loaded via the permission_loader callback. Remove the dead field and its fetch code. The removed code also did not check for missing columns before accessing the permissions set, which could crash on ghost rows left by the removed auth v2 migration. The migration used INSERT (creating row markers), and when permissions were later revoked, the row marker survived while the permissions column became null.	2026-05-05 15:48:01 +02:00
Marcin Maliszkiewicz	c00fee0316	Merge 'utils: loading_cache: add `insert()` that is a no-op when caching is disabled' from Dario Mirovic When `permissions_validity_in_ms` is set to 0, executing a prepared statement under authentication crashes with: ``` Assertion `caching_enabled()' failed. at utils/loading_cache.hh:319 in authorized_prepared_statements_cache::insert ``` `loading_cache::get_ptr()` asserts when caching is disabled (expiry == 0), but `authorized_prepared_statements_cache::insert()` was using it purely for its side effect of populating the cache, which is meaningless when caching is off. Add a new `loading_cache::insert(k, load)` method that is a no-op when caching is disabled and otherwise forwards to `get_ptr()`. Switch `authorized_prepared_statements_cache::insert()` to use it. This completes the disabled-mode safety contract of the cache for the write side, mirroring the fallback that `get()` already provides for the read side. Includes a regression test in `test/boost/loading_cache_test.cc` plus a positive test for the new `insert()` overload. Fixes SCYLLADB-1699 The crash was introduced a long time ago. It is present in all the live versions, from 2025.1 onward. No client tickets, but it should be backported. Closes scylladb/scylladb#29638 * github.com:scylladb/scylladb: test: boost: regression test for loading_cache::insert with caching disabled utils: loading_cache: add insert() that is a no-op when caching is disabled	2026-05-05 15:33:49 +02:00
Patryk Jędrzejczak	9f692857be	test/cluster: increase failure_detector_timeout Scaling the timeout by build mode (#29522) turned out to be not sufficient. Nodes can still be unexpectedly marked as down, even with a 4s timeout in dev mode. I managed to reproduce SCYLLADB-864 in such conditions. Increasing failure_detector_timeout will proportionally slow down tests that use it. That's bad, but currently these tests' flakiness is a much bigger problem than the tests' slowness. Also, not many tests use this fixture, and we hope to make it unneeded eventually (see #28495).	2026-05-05 15:12:33 +02:00
Patryk Jędrzejczak	efe0e39d85	gossiper: fix false node conviction when echo succeeds failure_detector_loop_for_node() could falsely convict a healthy node even when the echo succeeded. The code computed diff = now - last (time since last successful echo) and checked diff > max_duration unconditionally, regardless of whether the current echo failed or succeeded. This caused flakiness in tests that decrease the failure detector timeout. We currently run #CPUs tests concurrently, and since cluster tests start multiple nodes with 2 shards, multiple shards contend for one CPU. As a result, some tasks can become abnormally slow and block the failure detector loop execution for a few seconds. Fix by only checking diff > max_duration when the echo actually failed. Note that we send echo with the timeout equal to `max_duration` anyway, so the receiver will be marked as down if it really doesn't respond.	2026-05-05 15:12:32 +02:00
Patryk Jędrzejczak	b69d00b0a7	Merge 'Barrier and drain logging' from Gleb Natapov Add more logging to barrier and drain rpc to try and pinpoint https://github.com/scylladb/scylladb/issues/26281 Backport since we want to have it if it happens in the field. Fixes: SCYLLADB-1821 Refs: #26281 Closes scylladb/scylladb#29735 * https://github.com/scylladb/scylladb: session, raft_topology: add periodic warnings for hung drain and stale version waits session: add info-level logging to drain_closing_sessions raft_topology: log sub-step progress in local_topology_barrier raft_topology: log read_barrier progress in topology cmd handler	2026-05-05 15:04:50 +02:00
Calle Wilund	5cdfdd9ba3	commitlog_test.py: Fix size check aliasing, and threshold calc. Fixes: SCYLLADB-1815 Checking segment sizes should not use a size filter that rounds (up) sizes. More importantly, the estimate for what is an acceptable limit for commitlog disk usage should be aligned. Simplified the calc, and also made logging more useful in case of failure.	2026-05-05 14:42:55 +02:00
Nadav Har'El	b70beb3e13	alternator: improve CreateTable/UpdateTable schema agreement timeout CreateTable and UpdateTable call wait_for_schema_agreement() after announcing the schema change, to ensure all live nodes have applied the new schema before returning to the user. This wait has a hard-coded 10 second timeout, and on some overloaded test machines we saw it not completing in time, causing tests to become flaky. This patch increases this timeout from 10 seconds to 30 seconds. It's still hard-coded and not configurable via alternator_timeout_in_ms because it is unlikely any user will want to change it - it just needs to be long. The patch also improves the behavior when a schema-agreement timeout happens: 1. Provide an InternalServerError with more descriptive text. 2. This InternalServerError tells the user that the result of the operation is unknown, so the user will repeat the CreateTable and will get a ResourceInUseException because the table exists. In that case too, we need to wait for schema agreement. So we added this missing wait. Fixes SCYLLADB-1804 Refs #5052 (claiming CreateTable shouldn't wait at all) Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 15:41:06 +03:00
Calle Wilund	8d65a03951	commitlog: Fix segment/chunk overhead maybe not included in next_position calculation Refs: SCYLLADB-1757 Refs: SCYLLADB-1815 If we're in a brand new chunk (no buffer yet allocated), we would miscalculate the actual size of an entry to write, possibly causing segment size overshoot. Break out some logic to share between this calc and new_buffer. Also remove redundant (and possibly wrong) constant in oversized allocation.	2026-05-05 14:39:06 +02:00
Nadav Har'El	1f15e05946	test: fix replica_read_timeout_no_exception flakiness on slow systems The test uses a 10ms read timeout to exercise code paths that handle timed-out reads without throwing C++ exceptions. As part of setup, it inserts rows and flushes them to two SSTables, then runs a warm-up SELECT to populate internal caches (e.g. the auth cache) before the real test begins. The reason for this warm-up read was the possibility that the first read does additional operations (such as reading and caching authentication) that might throw exceptions internally. I couldn't verify that such exceptions actually happen in today's code, but they might (re)appear in the future, so we should keep the warm-up SELECT. On slow CI machines (aarch64, debug build), that warm-up SELECT can take longer than 10ms to read from the two SSTables. When it does, the read times out: the coordinator receives 0 responses from the local replica within the deadline and propagates a read_timeout_exception. Since the exception is not caught, it escapes the test lambda, is logged as "cql env callback failed", and causes Boost.Test to report a C++ failure at the do_with_cql_env_thread call site. This matches the CI failure seen in SCYLLADB-1774: ERROR ... replica_read_timeout_no_exception: cql env callback failed, error: exceptions::read_timeout_exception (Operation timed out for replica_read_timeout_no_exception.tbl - received only 0 responses from 1 CL=ONE.) The CI log also shows that only 12 reads were admitted (the warm-up read plus the 11 reads from the two prepare() calls and CREATE/INSERT statements made earlier), and the current permit was stuck in need_cpu state -- the reactor hadn't had a chance to schedule the read before the 10ms window elapsed. The fix catches read_timeout_exception from the warm-up SELECT and retries until the read succeeds. The warm-up is required for correctness: some lazy-init code paths (e.g. auth cache population) use C++ exceptions for control flow internally. Those exceptions must be absorbed before the cxx_exceptions baseline is sampled inside execute_test(); otherwise they would appear in the delta and cause a false test failure. Simply ignoring a timed-out warm-up is not safe, because the lazy-init exceptions would then fire during the 1000 test reads, inflating cxx_exceptions_after relative to cxx_exceptions_before. No other calls in setup are susceptible to the 10ms read timeout: - CREATE KEYSPACE, CREATE TABLE, INSERT, and flush use the write timeout (10s) and are not reads. - e.prepare() goes through the query processor without reading table data, so it is not subject to the read timeout. - The semaphore manipulation in Test 2 is internal and has no timeout. - All 1000 reads in execute_test() are expected to fail, so a timeout there is the happy path, not a failure. The 10ms timeout itself is fine for the test's purpose: it is deliberately aggressive so that reads reliably time out on the hot path being tested. The problem was only that the pre-test warm-up was not guarded against the same timeout. Fixes: SCYLLADB-1774 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#29731	2026-05-05 15:13:13 +03:00
Botond Dénes	afd9a55891	Merge 'test/cluster: wait for custom listener readiness' from Piotr Smaron server_add() defaults to CQL_ALTERNATOR_QUERIED. That proves the regular CQL driver path is queryable, and regular Alternator ports listed in YAML config if any. It does not prove that every custom listener the test will connect to is already accepting raw TCP connections. test_proxy_protocol_ssl_shard_aware connects directly to the shard-aware TLS proxy-protocol CQL port immediately after server startup. Wait for ServerUpState.SERVING in the fixture so the custom proxy-protocol listener is registered before opening raw sockets. test_uninitialized_conns_semaphore opens a raw TCP connection to native_shard_aware_transport_port immediately after startup. The default readiness check can succeed through native_transport_port while the shard-aware listener is still being started, because CQL listeners are registered independently. Wait for ServerUpState.SERVING before opening raw sockets. test_perf_alternator_remote now asks server_add() to wait for SERVING and uses the returned server address directly. This removes the redundant running_servers() plus get_ready_cql() sequence noted in review. Fixes: SCYLLADB-1797 No backport as of now, only appeared on master. Closes scylladb/scylladb#29737 * github.com:scylladb/scylladb: test/cluster: avoid redundant perf alternator CQL wait test/cluster: wait for shard-aware CQL listener test/cluster: wait for proxy protocol ports to serve	2026-05-05 14:45:58 +03:00
Nadav Har'El	5895dff03b	migration_manager: unique timeout exception for wait_for_schema_agreement() Before this patch, if wait_for_schema_agreement() timed out, it threw a generic std::runtime_error, making it inconvenient for callers to catch only this error. So in this patch we create and use a new exception type, schema_agreement_timeout, based on seastar::timed_out_error. Although wait_for_schema_agreement(), added in commit `a429018a8a`, was a utility function used in a dozen places, it has become less interesting after we introduced schema changes over Raft, and over the years most of the callers of this function were removed, except one in view.cc which uses an infinite timeout, so doesn't care about the timeout exception type. In the next patch we want to add a new caller which does care about the timeout exception type - hence this patch. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 10:38:38 +03:00
Piotr Dulikowski	efcc0b6376	Merge 'table_helper: fix use-after-free on prepared-statement invalidation' from Marcin Maliszkiewicz insert() held no local strong ref to the prepared modification_statement across the suspension in execute(). On a single shard: 1. Fiber A suspends inside _insert_stmt->execute(). 2. DROP TABLE / DROP KEYSPACE on the target, or LRU eviction, removes the prepared_statements_cache entry, releasing its strong ref. 3. Fiber B re-enters cache_table_info(), sees _prepared_stmt (checked_weak_ptr) invalidated, and runs _insert_stmt = nullptr, releasing the last strong ref. The modification_statement is freed. 4. Fiber A resumes inside execute() and touches freed this. Pin strong ref to _insert_stmt locally before the suspension. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1667 Backport: all supported branches, it's memory corruption bug, long present Closes scylladb/scylladb#29588 github.com:scylladb/scylladb: test/boost: add dummy case to table_helper_test for non-injection modes test/boost: add regression test for table_helper insert() UAF utils/error_injection: add waiters() API table_helper: fix use-after-free on prepared-statement invalidation	2026-05-04 17:21:05 +02:00
Piotr Smaron	a3360ee385	test/nodetool: fix mock server port race by using a fixed port on a unique IP Symptom: the rest_api_mock subprocess exits with status 1 during fixture setup, e.g.: subprocess.CalledProcessError: Command '[..., 'rest_api_mock.py', '127.29.88.1', '34093']' returned non-zero exit status 1 Root cause: aiohttp's TCPSite.start() raises OSError(EADDRINUSE) and the process exits 1. The bind fails because of how the (ip, port) pair is chosen across modules within one test.py process: * Each test module leases a 127.x.y.z IP from the host registry. The registry recycles released IPs, so the same IP is shared across modules sequentially. * The original code picked the port via random.randint(10000, 65535). A previous module on the same IP could have left that port in TIME_WAIT (or worse, still actively in use) when a later module happened to pick the same port. SCYLLADB-1275 (PR 29314) tried to fix this by binding a probe socket to (ip, 0) to obtain an OS-assigned free port, closing the probe, then launching the mock server which would bind to that port. Two issues remained: 1. TOCTOU: between probe close and mock-server bind, any other process on the host could grab the just-freed port. 2. TIME_WAIT could still bite if the host registry recycled an IP and the OS reused the same port number for the probe. Fix: drop port discovery entirely. Use a fixed port (12345, matching the unshare-namespace path already in this fixture) on the unique IP from the host registry. Because IPs are unique per test module within one test.py process, the (ip, 12345) pair is unique to each module, so no port-collision dance is needed. reuse_address=True on TCPSite handles the residual TIME_WAIT case when the host registry recycles an IP within the same test.py process and the previous mock server's socket has not finished TIME_WAIT yet. reuse_port=True is dropped, as it was only useful while attempting to have multiple processes share a single port. This mirrors the design used in test/cqlpy/run.py: pick a unique IP, keep the port fixed. Fixes: SCYLLADB-1718 Closes scylladb/scylladb#29656	2026-05-04 15:33:19 +02:00
Gleb Natapov	d2b695aa64	session, raft_topology: add periodic warnings for hung drain and stale version waits Add periodic warning timers (every 5 minutes) to help diagnose hangs in barrier_and_drain: - drain_closing_sessions(): warn if semaphore acquisition or session gate close is taking too long, reporting the gate count to show how many guards are still alive. - local_topology_barrier(): warn if stale_versions_in_use() is taking too long, reporting the current stale version trackers. - session::gate_count(): new public accessor for diagnostic purposes. These warnings help distinguish between the two possible hang points in barrier_and_drain (stale versions vs session drain) and provide ongoing visibility into what's blocking progress.	2026-05-04 15:58:45 +03:00
Gleb Natapov	385915c101	session: add info-level logging to drain_closing_sessions drain_closing_sessions() is called as part of the barrier_and_drain topology command and can block on two things: acquiring the drain semaphore (if another drain is in progress) and waiting for individual sessions to close (which blocks until all session guards are released). Previously, all logging in this function was at debug level, making it invisible in production logs. When barrier_and_drain hangs, there is no way to tell whether the function is waiting for the semaphore, waiting for a specific session to close, or was never called. Promote logging to info level and add messages at each blocking point: before/after semaphore acquisition (with count of sessions to drain), before/after each individual session close (with session id), and at function completion. This makes it possible to identify the exact session blocking a topology operation from the node log alone.	2026-05-04 15:58:45 +03:00
Gleb Natapov	e88ce09372	raft_topology: log sub-step progress in local_topology_barrier When a node processes a barrier_and_drain topology command, it performs two potentially long-running operations inside local_topology_barrier(): waiting for stale token metadata versions to be released (stale_versions_in_use) and draining closing sessions (drain_closing_sessions). Either of these can hang indefinitely -- for example, stale_versions_in_use blocks until all references to previous token metadata versions are released, which depends on in-flight requests completing. Previously, the only logging was a single 'done' message at the end, making it impossible to determine which sub-step was blocking when a barrier_and_drain RPC appeared stuck on a node. In a recent CI failure, a node never responded to barrier_and_drain during a removenode operation, and the logs showed the RPC was received but nothing about what it was waiting on internally. Add info-level logging before each blocking sub-step, including the topology version for correlation. This allows diagnosing hangs by showing whether the node is stuck waiting for stale metadata versions, stuck draining sessions, or never reached these steps at all.	2026-05-04 15:58:45 +03:00
Piotr Smaron	0a780d0ea1	test/cluster: avoid redundant perf alternator CQL wait server_add() already waits for the requested server-up state. For the remote perf-alternator test, request SERVING from server_add() and use the returned server address directly instead of asking for running servers and then calling get_ready_cql() again. This keeps the listener-readiness intent explicit while removing the redundant CQL readiness probe noted in review.	2026-05-04 14:09:28 +02:00
Piotr Smaron	c90012c22b	test/cluster: wait for shard-aware CQL listener server_add() defaults to CQL_ALTERNATOR_QUERIED. That proves the regular CQL driver path is queryable, along with any regular Alternator ports listed in the YAML config. It does not prove that every CQL listener configured for the process is already accepting raw TCP connections. test_uninitialized_conns_semaphore opens a raw TCP connection to native_shard_aware_transport_port immediately after startup. The default readiness check can succeed through native_transport_port while the shard-aware listener is still being started, because CQL listeners are registered independently. Wait for ServerUpState.SERVING before opening raw sockets. Scylla sends that notification only after protocol servers are registered, so this closes the startup window without adding sleeps or local retry loops. Fixes: SCYLLADB-1797	2026-05-04 13:36:43 +02:00
Nadav Har'El	983eb5ab43	test/cluster/auth_cluster: use CREATE ROLE IF NOT EXISTS to fix flaky test test_create_role_mixed_cluster calls servers_add(2) to bootstrap two old nodes concurrently, then adds a new node before issuing CREATE ROLE. The concurrent bootstraps trigger the well-known Python driver bug (scylladb/python-driver#317): two on_add notifications race in update_created_pools, causing a second pool to be created for a host whose pool was already established. If CREATE ROLE is in-flight on the old pool when it is closed, the driver retries on the new pool, executing the statement twice. The second execution fails with "Role ... already exists", making the test flaky. Fix by using CREATE ROLE IF NOT EXISTS. This is safe because unique_name() generates a timestamp+random suffix that is guaranteed to be unique; the role can "already exist" only due to the driver double-execution bug, never due to a real conflict. This is the same workaround that has been applied many times elsewhere in our test suite for exactly the same root cause: - CREATE KEYSPACE was changed to CREATE KEYSPACE IF NOT EXISTS (scylladb#18368, later generalised in scylladb#22399 via new_test_keyspace helpers) - DROP KEYSPACE was changed to DROP KEYSPACE IF EXISTS (scylladb#29487) Fixes: SCYLLADB-1742 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#29732	2026-05-04 11:47:11 +02:00
Yaniv Michael Kaul	6179406467	raft/group0: fix destroy assertion on startup failure If start_server_for_group0() successfully registers a server in _raft_gr._servers but a subsequent step (e.g. enable_in_memory_state_machine()) throws, the server is never destroyed because abort_and_drain()/destroy() check std::get_if<raft::group_id>(&_group0) which was only set after the entire with_scheduling_group block completed. Move _group0.emplace<raft::group_id>() inside the lambda, immediately after start_server_for_group() succeeds, so that cleanup paths can always find and destroy the registered server. This fixes the assertion: "raft_group_registry - stop(): server for group ... is not destroyed" which manifests during shutdown after an upgrade where topology_state_load() fails due to netw::unknown_address. Backport: Yes, to 2026.1, 2026.2, as it causes a crash on upgrades Refs: SCYLLADB-1217 Refs: CUSTOMER-340 Refs: CUSTOMER-335 Fixes: SCYLLADB-1801 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> AI-assisted: Yes, Opencode/Opus 4.6 Closes scylladb/scylladb#29702	2026-05-04 11:25:46 +02:00
Piotr Smaron	689117f706	test/cluster: wait for proxy protocol ports to serve server_add()'s default readiness only waits until CQL can be queried, but these tests immediately connect to custom proxy protocol listeners. Wait for SERVING so the shard-aware TLS proxy port is accepting connections before the test starts, matching the Alternator proxy protocol readiness fix.	2026-05-04 10:23:03 +02:00
Nadav Har'El	d33bb6ea00	Merge 'test: fix race window test flakiness from residual re-repair' from Avi Kivity Fix the persistent flakiness in `test_incremental_repair_race_window_promotes_unrepaired_data` (SCYLLADB-1478, reopened). After restarting servers[1], the topology coordinator can initiate a residual re-repair when it sees tablets stuck in the `repair` stage. This re-repair flushes memtables on all replicas and marks post-repair data as repaired, contaminating the test state and masking the compaction-merge bug the test is designed to detect. The assertion then fails on the next retry because the previous attempt's re-repair left behind repaired sstables containing post-repair keys. 1. Propagating `current_key` through the exception: correctly advanced the key counter on retry, but the contaminated tablet metadata from the prior re-repair (repaired sstables with post-repair keys) was still present, causing assertion failures on the next attempt. 2. DROP TABLE + CREATE TABLE between retries: the tablet metadata (sstables_repaired_at, repair stage) is tied to the tablet identity, and recreating the table in the same keyspace still showed residual state issues. Instead of trying to clean up contaminated state, each retry creates a completely fresh keyspace (unique name via `create_new_test_keyspace`). This gives entirely new tablets with no residual repair metadata from prior attempts. Combined with broader detection of coordinator changes and residual re-repairs, the test reliably retries before any contamination can cause false failures. The detection is now comprehensive: - Broadened coordinator check: any coordinator change (`new_coord != coord`), not just migration to servers[1] - Re-repair detection at three points: post-restart, during the compaction poll, and after injection release (grep for `"Initiating tablet repair host="` in the coordinator log) 1. `test: extract _setup_table_for_race_window helper`: pure code-movement refactor that extracts keyspace+table+data+repair1+data+flush into a reusable helper. Easily verifiable as a no-op behavioral change. 2. `test: fix race window test flakiness from residual re-repair`: the actual fix: broadened detection logic + re-repair grep at 3 points + fresh-keyspace retry on exception. Passed 1000 consecutive runs with the fix applied. Without the fix, about 2% flakiness was observed in debug mode. Fixes: SCYLLADB-1478 So far, we haven't observed flakiness of this test on branches, so not backporting yet. Will backport if seen. Closes scylladb/scylladb#29721 * github.com:scylladb/scylladb: test: fix race window test flakiness from residual re-repair test: extract _setup_table_for_race_window helper for race window test	2026-05-03 14:47:19 +03:00
Gleb Natapov	11b838e71e	raft_topology: log read_barrier progress in topology cmd handler When a raft topology command (e.g. barrier_and_drain) is received by a node, the handler first performs a raft read_barrier to ensure it sees the latest topology state. This read_barrier can hang indefinitely if raft cannot achieve quorum, but there was no logging around it, making it impossible to tell whether the handler was stuck at this step or somewhere else. Add info-level logging before and after the read_barrier call in raft_topology_cmd_handler, including the command type, index, and term. This allows diagnosing hangs by showing whether the node entered the read_barrier and whether it completed, narrowing down the root cause when a topology command RPC appears stuck on the receiver side.	2026-05-03 13:56:25 +03:00
Aleksandr Bykov	8afdae24d2	test: fix flaky test_kill_coordinator_during_op The test hardcoded the expected number of coordinator elections (2, 3, 4, 5) for each phase. If a prior phase triggered an extra election, subsequent phases would wait for a count that was already reached or would never match. Fix by reading the current election count before each operation and expecting exactly one more, making each phase independent of prior history. Also add wait_for_no_pending_topology_transition() calls after each coordinator election to ensure the topology state machine has fully settled before proceeding with restarts and further operations. Decrease the failure detector timeout (failure_detector_timeout_in_ms) to 2000 ms on all test nodes so that coordinator crashes are detected faster, reducing test wallclock time and timeout-related flakiness. Enable raft_topology=trace logging on all test nodes to aid post-failure diagnosis. Add diagnostic logging in wait_new_coordinator_elected(). Fixes: SCYLLADB-1089 Closes scylladb/scylladb#29284	2026-04-30 21:27:56 +03:00
Avi Kivity	795478fa7a	test: fix race window test flakiness from residual re-repair The test_incremental_repair_race_window_promotes_unrepaired_data test was still flaky because: 1. Only coordinator changes TO servers[1] were detected, but ANY coordinator change can trigger a residual re-repair that flushes memtables on all replicas and marks post-repair data as repaired. 2. Even without a coordinator change, the topology coordinator can initiate a residual re-repair when it sees tablets stuck in the repair stage after the servers[1] restart. This re-repair contaminates the repaired set with post-repair data, masking the compaction-merge bug the test detects. Fix by: - Broadening the coordinator check from == servers[1] to != coord - Adding re-repair detection (grep for 'Initiating tablet repair host=') at three points: post-restart, during the compaction poll, and after injection release - On retry, creating a completely fresh keyspace+table via _setup_table_for_race_window() so the new attempt starts with clean tablet metadata uncontaminated by prior re-repairs Fixes: SCYLLADB-1478	2026-04-30 18:40:18 +03:00
Avi Kivity	12d5e758ed	test: extract _setup_table_for_race_window helper for race window test Move the keyspace+table setup logic for test_incremental_repair_race_window_promotes_unrepaired_data into a dedicated helper function _setup_table_for_race_window(). The helper creates a fresh keyspace (unique name via create_new_test_keyspace), the table, configures STCS min_threshold=2, inserts baseline keys, runs repair 1, inserts keys for repair 2, and flushes. This is a pure refactor with no behavioral change: the test function now calls the helper once instead of inlining the setup. The extraction enables a subsequent commit to call the helper again on retry when a leadership transfer is detected.	2026-04-30 18:37:42 +03:00
Dario Mirovic	3875d79ac6	test: boost: regression test for loading_cache::insert with caching disabled Add two test cases for the new loading_cache::insert() method: * test_loading_cache_insert verifies that insert() populates the cache and invokes the loader exactly once per key when caching is enabled. * test_loading_cache_insert_caching_disabled is a regression test for SCYLLADB-1699: when the cache is constructed with expiry == 0 (caching disabled), insert() must be a no-op rather than asserting in loading_cache::get_ptr() via caching_enabled(). The loader must not be invoked and the cache must remain empty. Refs SCYLLADB-1699	2026-04-30 16:52:51 +02:00
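
The behavior described in the last entry above (insert() populating the cache when caching is enabled, and being a no-op when expiry == 0) can be modelled with a small toy. This is only a sketch in Python, not the C++ loading_cache from the repository: the class name LoadingCacheSketch, its constructor arguments, and the size() helper are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class LoadingCacheSketch:
    """Toy model of a loading cache; NOT the ScyllaDB implementation."""

    def __init__(self, expiry_seconds: float, loader: Callable[[str], str]) -> None:
        self._expiry = expiry_seconds
        self._loader = loader
        # key -> (loaded value, time of load)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def caching_enabled(self) -> bool:
        # An expiry of zero means caching is disabled.
        return self._expiry > 0

    def insert(self, key: str) -> None:
        # With caching disabled, insert() does nothing: the loader is not
        # called and no entry is stored.
        if not self.caching_enabled():
            return
        # With caching enabled, the loader runs only for keys not yet present.
        if key not in self._entries:
            self._entries[key] = (self._loader(key), time.monotonic())

    def size(self) -> int:
        return len(self._entries)


calls = []


def loader(key: str) -> str:
    # Record each invocation so the number of loads can be checked.
    calls.append(key)
    return key.upper()


enabled = LoadingCacheSketch(60.0, loader)
enabled.insert("a")
enabled.insert("a")  # second insert of the same key does not reload

disabled = LoadingCacheSketch(0.0, loader)
disabled.insert("b")  # no-op: caching is disabled
```

In this toy, the two checks from the commit message map to: the loader being called once for key "a" on the enabled cache, and the disabled cache staying empty with no loader call for "b".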

53948 Commits