Compare commits

..

4 Commits
master ... next

Author SHA1 Message Date
Avi Kivity
b708e5d7c9 Merge 'test: fix race condition in test_crashed_node_substitution' from Sergey Zolotukhin
`test_crashed_node_substitution` intermittently failed:
```python
   assert len(gossiper_eps) == (len(server_eps) + 1)
```
The test crashed the node right after a single ACK2 handshake (`finished do_send_ack2_msg`), assuming the node state was visible to all peers. However, since gossip is eventually consistent, the update may not have propagated yet, so some nodes did not see the failed node.

This change: Wait until the gossiper state is visible on peers before continuing the test and asserting.

Fixes: [SCYLLADB-1256](https://scylladb.atlassian.net/browse/SCYLLADB-1256).

backport: this issue may affect CI for all branches, so should be backported to all versions.

[SCYLLADB-1256]: https://scylladb.atlassian.net/browse/SCYLLADB-1256?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#29254

* github.com:scylladb/scylladb:
  test: test_crashed_node_substitution: add docstring and fix whitespace
  test: fix race condition in test_crashed_node_substitution
2026-03-26 21:40:33 +02:00
Petr Gusev
c38e312321 test_lwt_fencing_upgrade: fix quorum failure due to gossip lag
If lwt_workload() sends an update immediately after a
rolling restart, the coordinator might still see a replica as
down due to gossip lagging behind. Concurrently restarting another
node leaves only one available replica, failing the
LOCAL_QUORUM requirement for learn or eventually consistent
sp::query() in sp::cas() and resulting in
a mutation_write_failure_exception.

We fix this problem by waiting for the restarted server
to see 2 other peers. The server_change_version
doesn't do that by default -- it passes
wait_others=0 to server_start().

Fixes SCYLLADB-1136

Closes scylladb/scylladb#29234
2026-03-26 21:25:53 +02:00
bitpathfinder
627a8294ed test: test_crashed_node_substitution: add docstring and fix whitespace
Add a description of the test's intent and scenario; remove extra blanks.
2026-03-26 18:40:17 +01:00
bitpathfinder
5a086ae9b7 test: fix race condition in test_crashed_node_substitution
`test_crashed_node_substitution` intermittently failed:
```
    assert len(gossiper_eps) == (len(server_eps) + 1)
```
The test crashed the node right after a single ACK2 handshake
("finished do_send_ack2_msg"), assuming the node state was
visible to all peers. However, since gossip is eventually
consistent, the update may not have propagated yet, so some
nodes did not see the failed node.

This change: Wait until the gossiper state is visible on
peers before continuing the test and asserting.

Fixes: SCYLLADB-1256.
2026-03-26 18:25:05 +01:00
2 changed files with 40 additions and 2 deletions

View File

@@ -438,6 +438,7 @@ async def test_lwt_fencing_upgrade(manager: ManagerClient, scylla_2025_1: Scylla
await wait_for(all_hosts_are_alive, deadline=time.time() + 60, period=0.1)
logger.info(f"Upgrading {s.server_id}")
await manager.server_change_version(s.server_id, scylla_binary)
await manager.server_sees_others(s.server_id, 2, interval=60.0)
logger.info("Done upgrading servers")

View File

@@ -8,7 +8,10 @@ import asyncio
import time
import pytest
import logging
from functools import partial
from test.pylib.manager_client import ManagerClient
from test.pylib.util import wait_for
from test.pylib.internal_types import ServerInfo
logger = logging.getLogger(__name__)
@@ -16,6 +19,26 @@ logger = logging.getLogger(__name__)
@pytest.mark.asyncio
@pytest.mark.skip_mode(mode='release', reason='error injections are not supported in release mode')
async def test_crashed_node_substitution(manager: ManagerClient):
"""Test that a node which crashed after starting gossip but before joining group0
(an 'orphan' node) is eventually removed from gossip by the gossiper_orphan_remover_fiber.
The scenario:
1. Start 3 nodes with the 'fast_orphan_removal_fiber' injection enabled. This freezes
the gossiper_orphan_remover_fiber on each node before it enters its polling loop,
so it cannot remove any orphan until explicitly unblocked.
2. Start a 4th node with the 'crash_before_group0_join' injection enabled. This node
starts gossip normally but blocks inside pre_server_start(), just before sending
the join RPC to the topology coordinator. It never joins group0.
3. Wait until the 4th node's gossip state has fully propagated to all 3 running peers,
then trigger its crash via the injection. At this point all peers see it as an orphan:
present in gossip but absent from the group0 topology.
4. Assert the orphan is visible in gossip (live or down) on the surviving nodes.
5. Unblock the gossiper_orphan_remover_fiber on all 3 nodes (via message_injection) and
enable the 'speedup_orphan_removal' injection so the fiber removes the orphan immediately
without waiting for the normal 60-second age threshold.
6. Wait for the 'Finished to force remove node' log line confirming removal, then assert
the orphan is no longer present in gossip.
"""
servers = await manager.servers_add(3, config={
'error_injections_at_startup': ['fast_orphan_removal_fiber']
})
@@ -30,10 +53,24 @@ async def test_crashed_node_substitution(manager: ManagerClient):
log = await manager.server_open_log(failed_server.server_id)
await log.wait_for("finished do_send_ack2_msg")
failed_id = await manager.get_host_id(failed_server.server_id)
# Wait until the failed server's gossip state has propagated to all running peers.
# "finished do_send_ack2_msg" only guarantees that one peer completed a gossip round
# with the failed server; other nodes learn about it only in subsequent gossip rounds.
# Querying gossip before propagation completes would cause the assertion below to fail
# because the orphan node would not yet appear as live or down on every peer.
async def gossip_has_node(server: ServerInfo):
live = await manager.api.client.get_json("/gossiper/endpoint/live", host=server.ip_addr)
down = await manager.api.client.get_json("/gossiper/endpoint/down", host=server.ip_addr)
return True if failed_server.ip_addr in live + down else None
for s in servers:
await wait_for(partial(gossip_has_node, s), deadline=time.time() + 30)
await manager.api.message_injection(failed_server.ip_addr, 'crash_before_group0_join')
await task
live_eps = await manager.api.client.get_json("/gossiper/endpoint/live", host=servers[0].ip_addr)
down_eps = await manager.api.client.get_json("/gossiper/endpoint/down", host=servers[0].ip_addr)