Compare commits

10 Commits

Author SHA1 Message Date
Marcin Maliszkiewicz
90027db532 gms/gossiper: fix use-after-move in do_send_ack2_msg
The second logger.debug() call on line 405 accesses ack2_msg after
it was moved via std::move() in the co_await call on line 404.
The moved-from object is in an unspecified state, so this is a
use-after-move bug.

Fix by formatting ack2_msg to a string before the move, then using
that cached string in both debug log calls.
2026-03-25 13:04:05 +02:00
Dario Mirovic
d2c44722e1 test: cluster: fix log clear race condition in test_audit.py
assert_entries_were_added:
- takes a "before" snapshot of the audit log
- yields to execute a statement
- takes an "after" snapshot of the audit log
- computes new rows by diffing "after" minus "before"

If an audit entry generated by prepare() arrives after the "before"
snapshot is taken but before the "after" snapshot, it inflates the
new row count and the test fails with assert 2 <= 1.

Fix by:
- Adding clear_audit_logs() at the end of prepare(), after all setup
- Waiting for the "completed re-reading configuration file" log message
  after server_update_config
- Draining pending syslog lines before clearing the buffer
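
The race and the fix can be sketched with an in-memory stand-in for the audit log. This is a simplified illustration, not the code from test_audit.py: FakeAuditLog, its pending queue, and count_new_rows are hypothetical names, and the "delayed" flag only models an entry that is emitted by prepare() but becomes visible later.

```python
class FakeAuditLog:
    """Hypothetical in-memory stand-in for the audit log."""

    def __init__(self):
        self.rows = []      # entries already visible to a snapshot
        self.pending = []   # entries emitted but not yet visible

    def emit(self, entry, delayed=False):
        (self.pending if delayed else self.rows).append(entry)

    def drain(self):
        # Make every pending entry visible.
        self.rows.extend(self.pending)
        self.pending.clear()

    def clear(self):
        # Drain pending entries first, then clear, as the fix describes.
        self.drain()
        self.rows.clear()


def count_new_rows(log, statement):
    """Snapshot, run the statement, snapshot again, and diff."""
    before = list(log.rows)
    statement()
    log.drain()  # late entries show up before the "after" snapshot
    after = list(log.rows)
    return len(after) - len(before)


# Without the fix: a late entry from prepare() is counted as new.
racy_log = FakeAuditLog()
racy_log.emit("prepare: CREATE KEYSPACE", delayed=True)
racy_count = count_new_rows(racy_log, lambda: racy_log.emit("SELECT"))

# With the fix: prepare() ends by draining and clearing the log.
fixed_log = FakeAuditLog()
fixed_log.emit("prepare: CREATE KEYSPACE", delayed=True)
fixed_log.clear()
fixed_count = count_new_rows(fixed_log, lambda: fixed_log.emit("SELECT"))
```

Here racy_count is 2 and fixed_count is 1, which mirrors the "assert 2 <= 1" failure described above.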

Refs SCYLLADB-573
2026-03-19 16:12:13 +01:00
Dario Mirovic
821f8696a7 test: pylib: shut down exclusive cql connections in ManagerClient
get_cql_exclusive() creates a Cluster object per call but never
records it, so driver_close() cannot shut it down. The cluster's
internal scheduler thread then tries to submit work to an executor
that has already been shut down. This causes RuntimeError:

RuntimeError: cannot schedule new futures after shutdown

Fix this by tracking every exclusive Cluster in a list and shutting
them all down in driver_close().
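
As a rough illustration, the error message matches what Python's concurrent.futures.ThreadPoolExecutor raises when work is submitted after shutdown; that the driver's executor is of this kind is an assumption here. FakeCluster and Client below are hypothetical stand-ins, not the driver's or ManagerClient's real classes; they only show the track-and-shut-down pattern.

```python
from concurrent.futures import ThreadPoolExecutor


def submit_after_shutdown():
    """Return the error message raised by a late submit()."""
    executor = ThreadPoolExecutor(max_workers=1)
    executor.shutdown(wait=True)
    try:
        executor.submit(print, "late work")
    except RuntimeError as exc:
        return str(exc)
    return None


class FakeCluster:
    """Hypothetical stand-in for a driver Cluster object."""

    def __init__(self):
        self.is_shutdown = False

    def shutdown(self):
        self.is_shutdown = True


class Client:
    """Hypothetical client that tracks every exclusive cluster it creates."""

    def __init__(self):
        self.exclusive_clusters = []

    def get_exclusive(self):
        cluster = FakeCluster()
        self.exclusive_clusters.append(cluster)  # record it for later shutdown
        return cluster

    def driver_close(self):
        for cluster in self.exclusive_clusters:
            cluster.shutdown()
        self.exclusive_clusters.clear()


message = submit_after_shutdown()

client = Client()
created = [client.get_exclusive() for _ in range(3)]
client.driver_close()
```

After driver_close(), every cluster in created reports is_shutdown as True and the tracking list is empty, so no background thread is left to submit work late.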

Refs SCYLLADB-573
2026-03-19 16:12:13 +01:00
Dario Mirovic
d94999f87b test: cluster: fix multinode audit entry comparison in test_audit.py
assert_entries_were_added computes new audit rows by slicing the "after"
list at the length of the "before" list: rows_after[len(rows_before):].
This assumes new rows always appear at the tail of the combined sorted
list. In a multinode setup, each node generates its own event_time
timestamps. A new row from node A can sort before an old row from node
B, breaking the tail assumption. The assertion "new rows are not the
last rows in the audit table" then fires.

Fix this by splitting the before/after lists per node and computing the
tail of new rows independently for each node. Per-node ordering is
monotonic, so the tail assumption holds within each node. The combined
new rows are sorted afterwards.
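
A minimal sketch of the per-node approach, assuming each audit row can be reduced to a (node, event_time, statement) tuple. The function names and the row shape are illustrative and are not taken from test_audit.py.

```python
from collections import defaultdict


def group_by_node(rows):
    """Group rows per node, each group sorted by event_time."""
    grouped = defaultdict(list)
    for row in sorted(rows, key=lambda r: r[1]):
        grouped[row[0]].append(row)
    return grouped


def new_rows_global_tail(rows_before, rows_after):
    """Old approach: slice the tail of the combined sorted list."""
    ordered = sorted(rows_after, key=lambda r: r[1])
    return ordered[len(rows_before):]


def new_rows_per_node(rows_before, rows_after):
    """New approach: slice the tail separately for each node."""
    before = group_by_node(rows_before)
    after = group_by_node(rows_after)
    new_rows = []
    for node, node_rows in after.items():
        new_rows.extend(node_rows[len(before.get(node, [])):])
    return sorted(new_rows, key=lambda r: r[1])


rows_before = [("A", 10, "old-a"), ("B", 20, "old-b")]
# Node A logs a new row whose timestamp sorts before node B's old row.
rows_after = rows_before + [("A", 15, "new-a")]

wrong = new_rows_global_tail(rows_before, rows_after)
right = new_rows_per_node(rows_before, rows_after)
```

With this input, the global tail slice returns node B's old row, while the per-node slice returns only the new row from node A.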

Refs SCYLLADB-573
2026-03-19 16:12:13 +01:00
Dario Mirovic
249a6cec1b test: cluster: dtest: remove old audit tests
Since audit tests have been migrated to test/cluster/test_audit.py,
old tests in test/cluster/dtest/audit_test.py have to be removed.

Refs SCYLLADB-573
2026-03-19 16:12:13 +01:00
Dario Mirovic
adc790a8bf test: cluster: group migrated audit tests for cluster reuse
This patch reorganizes the execution flow of the test functions.
They are grouped to enable cluster reuse between specific test
functions. Cluster preparation is one of the main contributors to
the test execution time. This patch significantly reduces the total
test execution time by making far fewer cluster preparation calls
and reusing clusters more often.

Execution time on the developer machine drops by around 38%:
- before: 4m 29s
- after: 2m 47s
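
The figure can be checked from the two timings above. This is just the arithmetic, reading "around 38%" as the relative reduction in wall-clock time:

```python
before_s = 4 * 60 + 29  # 4m 29s = 269 seconds
after_s = 2 * 60 + 47   # 2m 47s = 167 seconds

# Relative reduction in execution time, as a percentage.
reduction_pct = 100 * (before_s - after_s) / before_s
```

reduction_pct comes out just under 38.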

Fixes SCYLLADB-573
2026-03-19 16:11:47 +01:00
Dario Mirovic
967b7ff6bf test: cluster: enable migrated audit tests and make them work
Move audit tests from test/cluster/dtest to test/cluster.
The test/cluster environment has less overhead, and the audit
tests are heavy and take a long time to run. This patch is part
of an effort to improve audit test suite performance.

This patch refactors the tests so that they execute correctly
and enables them. A follow-up patch will remove the audit tests
in test/cluster/dtest.

All the tests are confirmed to be running after the change.
No dead code present.

Test test_audit_categories_invalid is not parametrized anymore.
It never used the parametrized helper class, so it just ran
the same logic three times. This is why there are now 74,
and not 76, test executions.

Refs SCYLLADB-573
2026-03-19 16:07:28 +01:00
Dario Mirovic
8367509b3b test: pylib: manager_client: specify AuthProvider in get_cql_exclusive
This patch allows ManagerClient.get_cql_exclusive to accept an
AuthProvider as a parameter. This will be used in a follow-up patch
which migrates the audit test suite to test/cluster and requires this
functionality for some tests.

Refs SCYLLADB-573
2026-03-19 15:35:24 +01:00
Dario Mirovic
0a7a69345c test: pylib: scylla cluster after_test log fix
Before any test, a pool of ScyllaCluster objects is created.

At the beginning of a test suite, a ScyllaClusterManager is created,
and given a reference to the pool.
At the end of a test suite, the ScyllaClusterManager is destroyed.

Before each test case:
- ManagerClient is constructed and connected to the ScyllaClusterManager
  of that test suite
- A ScyllaCluster object is fetched from the pool
  - If the pool is empty, a new ScyllaCluster object is created
  - If the pool is not empty, a cached ScyllaCluster object is returned

After each test case:
- Return ScyllaCluster object from ManagerClient to the pool
  - If the cluster is dirty, the pool destroys it
  - If the cluster is clean, the pool caches it
- ManagerClient is destroyed

Many actions mark a cluster as dirty. In normal test execution the
cluster is therefore always destroyed when it is returned to the pool.
ManagerClient.mark_clean is not used in the tests. When it is used,
the cluster is reused.

The bug is that the log file is closed even if the cluster is not dirty.
This causes an error when trying to log to a server of a reused cluster.

The solution in this patch is to not close the log file if the cluster
is not dirty. Upon cluster reuse the log file will be open and functional.

An alternative would be to reopen the log file if it is closed, but
keeping it open seems cleaner.
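
The lifecycle above can be sketched with small stand-in classes. Everything here is hypothetical (FakeServer, FakeCluster, FakePool, run_test, and the "Starting test" marker); only the "close the log file only when the cluster is dirty" condition and the "Ending test" marker follow the patch.

```python
import io


class FakeServer:
    """Hypothetical server whose log file is an in-memory buffer."""

    def __init__(self):
        self.log_file = io.StringIO()

    def write_log_marker(self, text):
        # Raises ValueError if the log file has been closed.
        self.log_file.write(text)


class FakeCluster:
    def __init__(self):
        self.is_dirty = False
        self.servers = [FakeServer()]

    def before_test(self, name):
        for server in self.servers:
            server.write_log_marker(f"------ Starting test {name} ------\n")

    def after_test(self, name):
        for server in self.servers:
            server.write_log_marker(f"------ Ending test {name} ------\n")
            # Close the log only if the cluster is about to be destroyed.
            if self.is_dirty and not server.log_file.closed:
                server.log_file.close()


class FakePool:
    def __init__(self):
        self.cached = []

    def get(self):
        return self.cached.pop() if self.cached else FakeCluster()

    def put(self, cluster):
        if not cluster.is_dirty:
            self.cached.append(cluster)  # clean clusters are cached for reuse


def run_test(pool, name, dirty=False):
    cluster = pool.get()
    cluster.before_test(name)
    cluster.is_dirty = dirty
    cluster.after_test(name)
    pool.put(cluster)
    return cluster


pool = FakePool()
first = run_test(pool, "t1")
second = run_test(pool, "t2")              # reuses the clean cluster
third = run_test(pool, "t3", dirty=True)   # dirty, so it is not cached
fourth = run_test(pool, "t4")              # gets a fresh cluster
```

If after_test closed the log file unconditionally, the before_test call for "t2" would raise ValueError on the reused cluster, which is the kind of failure the patch avoids.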

Refs SCYLLADB-573
2026-03-19 15:35:24 +01:00
Dario Mirovic
899ae71349 test: audit: copy audit test from dtest
This patch just copies the audit test suite from dtest and
disables it in the test config file. Later patches will
update the code and enable the test suite.

Refs SCYLLADB-573
2026-03-19 15:35:24 +01:00
5 changed files with 742 additions and 343 deletions

@@ -400,9 +400,10 @@ future<> gossiper::do_send_ack2_msg(locator::host_id from, utils::chunked_vector
         }
     }
     gms::gossip_digest_ack2 ack2_msg(std::move(delta_ep_state_map));
-    logger.debug("Calling do_send_ack2_msg to node {}, ack_msg_digest={}, ack2_msg={}", from, ack_msg_digest, ack2_msg);
+    auto ack2_msg_str = fmt::format("{}", ack2_msg);
+    logger.debug("Calling do_send_ack2_msg to node {}, ack_msg_digest={}, ack2_msg={}", from, ack_msg_digest, ack2_msg_str);
     co_await ser::gossip_rpc_verbs::send_gossip_digest_ack2(&_messaging, from, std::move(ack2_msg));
-    logger.debug("finished do_send_ack2_msg to node {}, ack_msg_digest={}, ack2_msg={}", from, ack_msg_digest, ack2_msg);
+    logger.debug("finished do_send_ack2_msg to node {}, ack_msg_digest={}, ack2_msg={}", from, ack_msg_digest, ack2_msg_str);
 }
 // Depends on

File diff suppressed because it is too large

@@ -44,6 +44,7 @@ run_in_dev:
- dtest/bypass_cache_test
- dtest/auth_roles_test
- dtest/audit_test
- audit/test_audit
- dtest/commitlog_test
- dtest/cfid_test
- dtest/rebuild_test

@@ -60,6 +60,7 @@ class ManagerClient:
         self.con_gen = con_gen
         self.ccluster: Optional[CassandraCluster] = None
         self.cql: Optional[CassandraSession] = None
+        self.exclusive_clusters: List[CassandraCluster] = []
         # A client for communicating with ScyllaClusterManager (server)
         self.sock_path = sock_path
         self.client_for_asyncio_loop = {asyncio.get_running_loop(): UnixRESTClient(sock_path)}
@@ -113,6 +114,9 @@ class ManagerClient:
     def driver_close(self) -> None:
         """Disconnect from cluster"""
+        for cluster in self.exclusive_clusters:
+            cluster.shutdown()
+        self.exclusive_clusters.clear()
         if self.ccluster is not None:
             logger.debug("shutting down driver")
             safe_driver_shutdown(self.ccluster)
@@ -134,9 +138,12 @@ class ManagerClient:
         hosts = await wait_for_cql_and_get_hosts(cql, servers, time() + 60)
         return cql, hosts
-    async def get_cql_exclusive(self, server: ServerInfo):
-        cql = self.con_gen([server.ip_addr], self.port, self.use_ssl, self.auth_provider,
-                           WhiteListRoundRobinPolicy([server.ip_addr])).connect()
+    async def get_cql_exclusive(self, server: ServerInfo, auth_provider: Optional[AuthProvider] = None):
+        cluster = self.con_gen([server.ip_addr], self.port, self.use_ssl,
+                               auth_provider if auth_provider else self.auth_provider,
+                               WhiteListRoundRobinPolicy([server.ip_addr]))
+        self.exclusive_clusters.append(cluster)
+        cql = cluster.connect()
         await wait_for_cql_and_get_hosts(cql, [server], time() + 60)
         return cql

@@ -1394,7 +1394,11 @@ class ScyllaCluster:
                     f"the test must drop all keyspaces it creates.")
         for server in itertools.chain(self.running.values(), self.stopped.values()):
             server.write_log_marker(f"------ Ending test {name} ------\n")
-            if not server.log_file.closed:
+            # Only close log files when the cluster is dirty (will be destroyed).
+            # If the cluster is clean and will be reused, keep the log file open
+            # so that write_log_marker() and take_log_savepoint() work in the
+            # next test's before_test().
+            if self.is_dirty and not server.log_file.closed:
                 server.log_file.close()
     async def server_stop(self, server_id: ServerNum, gracefully: bool) -> None: