gms/gossiper: fix use-after-move in do_send_ack2_msg

The second logger.debug() call on line 405 accesses ack2_msg after it was moved via std::move() in the co_await call on line 404. This is undefined behavior. Fix by formatting ack2_msg to a string before the move, then using that cached string in both debug log calls.
5 changed files with 742 additions and 343 deletions
--- a/gms/gossiper.cc
+++ b/gms/gossiper.cc
@@ -400,9 +400,10 @@ future<> gossiper::do_send_ack2_msg(locator::host_id from, utils::chunked_vector
        }
    }
    gms::gossip_digest_ack2 ack2_msg(std::move(delta_ep_state_map));
-    logger.debug("Calling do_send_ack2_msg to node {}, ack_msg_digest={}, ack2_msg={}", from, ack_msg_digest, ack2_msg);
+    auto ack2_msg_str = fmt::format("{}", ack2_msg);
+    logger.debug("Calling do_send_ack2_msg to node {}, ack_msg_digest={}, ack2_msg={}", from, ack_msg_digest, ack2_msg_str);
    co_await ser::gossip_rpc_verbs::send_gossip_digest_ack2(&_messaging, from, std::move(ack2_msg));
-    logger.debug("finished do_send_ack2_msg to node {}, ack_msg_digest={}, ack2_msg={}", from, ack_msg_digest, ack2_msg);
+    logger.debug("finished do_send_ack2_msg to node {}, ack_msg_digest={}, ack2_msg={}", from, ack_msg_digest, ack2_msg_str);
 }

 // Depends on
--- a/test/cluster/dtest/audit_test.py
+++ b/test/cluster/dtest/audit_test.py
--- a/test/cluster/test_config.yaml
+++ b/test/cluster/test_config.yaml
@@ -44,6 +44,7 @@ run_in_dev:
  - dtest/bypass_cache_test
  - dtest/auth_roles_test
  - dtest/audit_test
+  - audit/test_audit
  - dtest/commitlog_test
  - dtest/cfid_test
  - dtest/rebuild_test
--- a/test/pylib/manager_client.py
+++ b/test/pylib/manager_client.py
@@ -60,6 +60,7 @@ class ManagerClient:
        self.con_gen = con_gen
        self.ccluster: Optional[CassandraCluster] = None
        self.cql: Optional[CassandraSession] = None
+        self.exclusive_clusters: List[CassandraCluster] = []
        # A client for communicating with ScyllaClusterManager (server)
        self.sock_path = sock_path
        self.client_for_asyncio_loop = {asyncio.get_running_loop(): UnixRESTClient(sock_path)}
@@ -113,6 +114,9 @@ class ManagerClient:

    def driver_close(self) -> None:
        """Disconnect from cluster"""
+        for cluster in self.exclusive_clusters:
+            cluster.shutdown()
+        self.exclusive_clusters.clear()
        if self.ccluster is not None:
            logger.debug("shutting down driver")
            safe_driver_shutdown(self.ccluster)
@@ -134,9 +138,12 @@ class ManagerClient:
        hosts = await wait_for_cql_and_get_hosts(cql, servers, time() + 60)
        return cql, hosts

-    async def get_cql_exclusive(self, server: ServerInfo):
-        cql = self.con_gen([server.ip_addr], self.port, self.use_ssl, self.auth_provider,
-                                     WhiteListRoundRobinPolicy([server.ip_addr])).connect()
+    async def get_cql_exclusive(self, server: ServerInfo, auth_provider: Optional[AuthProvider] = None):
+        cluster = self.con_gen([server.ip_addr], self.port, self.use_ssl,
+                               auth_provider if auth_provider else self.auth_provider,
+                               WhiteListRoundRobinPolicy([server.ip_addr]))
+        self.exclusive_clusters.append(cluster)
+        cql = cluster.connect()
        await wait_for_cql_and_get_hosts(cql, [server], time() + 60)
        return cql

--- a/test/pylib/scylla_cluster.py
+++ b/test/pylib/scylla_cluster.py
@@ -1394,7 +1394,11 @@ class ScyllaCluster:
                                   f"the test must drop all keyspaces it creates.")
        for server in itertools.chain(self.running.values(), self.stopped.values()):
            server.write_log_marker(f"------ Ending test {name} ------\n")
-            if not server.log_file.closed:
+            # Only close log files when the cluster is dirty (will be destroyed).
+            # If the cluster is clean and will be reused, keep the log file open
+            # so that write_log_marker() and take_log_savepoint() work in the
+            # next test's before_test().
+            if self.is_dirty and not server.log_file.closed:
                server.log_file.close()

    async def server_stop(self, server_id: ServerNum, gracefully: bool) -> None:
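The manager_client.py change above follows a common pattern: record every resource at creation time so that a single close call can release all of them. A minimal sketch of that pattern follows. It assumes a hypothetical `FakeCluster` stand-in and a hypothetical `ExclusiveConnectionTracker` class instead of the real driver and `ManagerClient`; only the `exclusive_clusters` list and the shutdown loop mirror the diff.

```python
from typing import Callable, List


class FakeCluster:
    """Hypothetical stand-in for a driver Cluster object (not the real driver API)."""

    def __init__(self, host: str) -> None:
        self.host = host
        self.is_shutdown = False

    def connect(self) -> str:
        # A real driver would return a session; a string is enough for this sketch.
        return f"session:{self.host}"

    def shutdown(self) -> None:
        self.is_shutdown = True


class ExclusiveConnectionTracker:
    """Hypothetical class: records every cluster it creates so close() can shut them down."""

    def __init__(self, con_gen: Callable[[str], FakeCluster]) -> None:
        self.con_gen = con_gen
        self.exclusive_clusters: List[FakeCluster] = []

    def get_exclusive(self, host: str) -> str:
        cluster = self.con_gen(host)
        # Record the cluster before connecting, so it is tracked even if connect() raises.
        self.exclusive_clusters.append(cluster)
        return cluster.connect()

    def close(self) -> None:
        # Shut down every tracked cluster, then forget them.
        for cluster in self.exclusive_clusters:
            cluster.shutdown()
        self.exclusive_clusters.clear()


tracker = ExclusiveConnectionTracker(FakeCluster)
tracker.get_exclusive("127.0.0.1")
tracker.get_exclusive("127.0.0.2")
created = list(tracker.exclusive_clusters)
tracker.close()
```

Appending before `connect()` matches the order used in the diff, so a cluster whose connection attempt fails is still shut down later.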
Author	SHA1	Message	Date
Marcin Maliszkiewicz	90027db532	gms/gossiper: fix use-after-move in do_send_ack2_msg The second logger.debug() call on line 405 accesses ack2_msg after it was moved via std::move() in the co_await call on line 404. This is undefined behavior. Fix by formatting ack2_msg to a string before the move, then using that cached string in both debug log calls.	2026-03-25 13:04:05 +02:00
Dario Mirovic	d2c44722e1	test: cluster: fix log clear race condition in test_audit.py assert_entries_were_added: - takes a "before" snapshot of the audit log - yields to execute a statement - takes an "after" snapshot of the audit log - computes new rows by diffing "after" minus "before" If an audit entry generated by prepare() arrives between the snapshot and the diff, it inflates the new row count and the test fails with assert 2 <= 1. Fix by: - Adding clear_audit_logs() at the end of prepare(), after all setup - Waiting for the "completed re-reading configuration file" log message after server_update_config - Draining pending syslog lines before clearing the buffer Refs SCYLLADB-573	2026-03-19 16:12:13 +01:00
Dario Mirovic	821f8696a7	test: pylib: shut down exclusive cql connections in ManagerClient get_cql_exclusive() creates a Cluster object per call, but never records it. driver_close() cannot shut it down. The cluster's internal scheduler thread then tries to submit work to an already shut down executor. This causes the error "RuntimeError: cannot schedule new futures after shutdown". Fix this by tracking every exclusive Cluster in a list and shutting them all down in driver_close(). Refs SCYLLADB-573	2026-03-19 16:12:13 +01:00
Dario Mirovic	d94999f87b	test: cluster: fix multinode audit entry comparison in test_audit.py assert_entries_were_added computes new audit rows by slicing the "after" list at the length of the "before" list: rows_after[len(rows_before):]. This assumes new rows always appear at the tail of the combined sorted list. In a multinode setup, each node generates its own event_time timestamps. A new row from node A can sort before an old row from node B, breaking the tail assumption. The assertion "new rows are not the last rows in the audit table" then fires. Fix this by splitting the before/after lists per node and computing the new rows tail independently for each node. This guarantees that per node ordering, which is monotonic, is respected, and the combined new rows are sorted afterwards. Refs SCYLLADB-573	2026-03-19 16:12:13 +01:00
Dario Mirovic	249a6cec1b	test: cluster: dtest: remove old audit tests Since audit tests have been migrated to test/cluster/test_audit.py, old tests in test/cluster/dtest/audit_test.py have to be removed. Refs SCYLLADB-573	2026-03-19 16:12:13 +01:00
Dario Mirovic	adc790a8bf	test: cluster: group migrated audit tests for cluster reuse This patch reorganizes the execution flow of the test functions. They are grouped to enable cluster reuse between specific test functions. One of the main contributors to the test execution time is the cluster preparation. This patch significantly reduces the total test execution time by having way less new cluster preparation calls and more cluster reuse. Performance increase on the developer machine is around 38%: - before: 4m 29s - after: 2m 47s Fixes SCYLLADB-573	2026-03-19 16:11:47 +01:00
Dario Mirovic	967b7ff6bf	test: cluster: enable migrated audit tests and make them work Move audit tests from test/cluster/dtest to test/cluster. The test/cluster environment has less overhead, and the audit tests are heavy and take a long time to execute. This patch is part of an effort to improve audit test suite performance. This patch refactors the tests so that they execute correctly, and enables them. A follow up patch will remove the audit tests in test/cluster/dtest. All the tests are confirmed to be running after the change. No dead code present. Test test_audit_categories_invalid is not parametrized anymore. It never used the parametrized helper class, so it just ran the same logic three times. This is why there are now 74, and not 76, test executions. Refs SCYLLADB-573	2026-03-19 16:07:28 +01:00
Dario Mirovic	8367509b3b	test: pylib: manager_client: specify AuthProvider in get_cql_exclusive This patch allows ManagerClient.get_cql_exclusive to accept AuthProvider as parameter. This will be used in a follow up patch which migrates audit test suite to test/cluster and requires this functionality for some tests. Refs SCYLLADB-573	2026-03-19 15:35:24 +01:00
Dario Mirovic	0a7a69345c	test: pylib: scylla cluster after_test log fix Before any test, a pool of ScyllaCluster objects is created. At the beginning of a test suite, a ScyllaClusterManager is created, and given a reference to the pool. At the end of a test suite, the ScyllaClusterManager is destroyed. Before each test case: - ManagerClient is constructed and connected to the ScyllaClusterManager of that test suite - A ScyllaCluster object is fetched from the pool - If the pool is empty, a new ScyllaCluster object is created - If the pool is not empty, a cached ScyllaCluster object is returned After each test case: - Return ScyllaCluster object from ManagerClient to the pool - If the cluster is dirty, the pool destroys it - If the cluster is clean, the pool caches it - ManagerClient is destroyed Many actions mark a cluster as dirty. Normal test execution will always make the cluster be destroyed upon returning to the pool. ManagerClient.mark_clean is not used in the tests. When it is used, the flow with cluster reuse happens. The bug is that the log file is closed even if cluster is not dirty. This causes an error when trying to log to a reused cluster server. The solution in this patch is to not close the log file if the cluster is not dirty. Upon cluster reuse the log file will be open and functional. Another approach would be to reopen the log file if closed, but this approach seems more clean. Refs SCYLLADB-573	2026-03-19 15:35:24 +01:00
Dario Mirovic	899ae71349	test: audit: copy audit test from dtest This patch just copies the audit test suite from dtest and disables it in the test config file. Later patches will update the code and enable the test suite. Refs SCYLLADB-573	2026-03-19 15:35:24 +01:00
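
The multinode comparison fix described in commit d94999f87b can be illustrated with a small sketch. It is not the code from test_audit.py: the `AuditRow` shape and the `new_rows_per_node` helper are hypothetical names, and integer timestamps stand in for event_time. It shows why slicing the combined sorted list at `len(rows_before)` can pick the wrong row, and how taking the tail per node avoids that.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class AuditRow(NamedTuple):
    # Hypothetical minimal row shape; real audit rows carry more columns.
    node: str
    event_time: int
    operation: str


def group_by_node(rows: List[AuditRow]) -> Dict[str, List[AuditRow]]:
    """Group rows per node, each group sorted by that node's own timestamps."""
    grouped: Dict[str, List[AuditRow]] = defaultdict(list)
    for row in sorted(rows, key=lambda r: r.event_time):
        grouped[row.node].append(row)
    return grouped


def new_rows_per_node(before: List[AuditRow], after: List[AuditRow]) -> List[AuditRow]:
    """Take the new tail independently for each node, then sort the combined result."""
    before_by_node = group_by_node(before)
    new_rows: List[AuditRow] = []
    for node, rows in group_by_node(after).items():
        already_seen = len(before_by_node.get(node, []))
        new_rows.extend(rows[already_seen:])
    return sorted(new_rows, key=lambda r: r.event_time)


before = [AuditRow("A", 10, "old-a"), AuditRow("B", 30, "old-b")]
# The new row from node A sorts before the old row from node B.
after = before + [AuditRow("A", 20, "new-a")]

naive = sorted(after, key=lambda r: r.event_time)[len(before):]
per_node = new_rows_per_node(before, after)
```

Here `naive` holds the old row from node B, while `per_node` holds only the new row from node A. The sketch relies on the same assumption the commit message states: ordering within a single node is monotonic.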