Files
scylladb/test/cluster
Patryk Jędrzejczak 0693091aff test: test_restart_leaving_replica_during_cleanup: reconnect driver after restart
The test can currently fail like this:
```
>           await cql.run_async(f"ALTER TABLE {ks}.test WITH tablets = {{'min_tablet_count': 1}}")
E           cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.158.27.9:9042 datacenter1>: <Error from server: code=0000 [Server error] message="Failed to apply group 0 change due to concurrent modification">})
```
The following happens:
- node A is restarted and becomes the group0 leader,
- the driver sends the ALTER TABLE request to node B,
- the request hits group 0 concurrent modification error 10 times and fails
  because node A performs tablet migrations at the the same time.

What is unexpected is that even though the driver session uses the default
retry policy, the driver doesn't retry the request on node A. The request
is guaranteed to succeed on node A because it's the only node adding group0
entries.

The driver doesn't retry the request on node A because of a missing
`wait_for_cql_and_get_hosts` call. We add it in this commit. We also reconnect
the driver just in case to prevent hitting scylladb/python-driver#295.

Moreover, we can revert the workaround from
4c9efc08d8, as the fix from this commit also
prevents DROP KEYSPACE failures.

The commit has been tested in byo with `_concurrent_ddl_retries{0}` to
verify that node A really can't hit group 0 concurrent modification error
and always receives the ALTER TABLE request from the driver. All 300 runs in
each build mode passed.

Fixes #25938

Closes scylladb/scylladb#28632
2026-02-16 12:56:18 +01:00
..