Merge 'test: fix race condition in test_crashed_node_substitution' from Sergey Zolotukhin

`test_crashed_node_substitution` intermittently failed:

```python
assert len(gossiper_eps) == (len(server_eps) + 1)
```

The test crashed the node right after a single ACK2 handshake (`finished do_send_ack2_msg`), assuming the node state was visible to all peers. However, since gossip is eventually consistent, the update may not have propagated yet, so some nodes did not see the failed node.

This change: Wait until the gossiper state is visible on peers before continuing the test and asserting.

Fixes: [SCYLLADB-1256](https://scylladb.atlassian.net/browse/SCYLLADB-1256).

backport: this issue may affect CI for all branches, so should be backported to all versions.

[SCYLLADB-1256]: https://scylladb.atlassian.net/browse/SCYLLADB-1256?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#29254

* github.com:scylladb/scylladb:
  test: test_crashed_node_substitution: add docstring and fix whitespace
  test: fix race condition in test_crashed_node_substitution
test_lwt_fencing_upgrade: fix quorum failure due to gossip lag
2 changed files with 40 additions and 2 deletions
--- a/test/cluster/test_fencing.py
+++ b/test/cluster/test_fencing.py
@@ -438,6 +438,7 @@ async def test_lwt_fencing_upgrade(manager: ManagerClient, scylla_2025_1: Scylla
                await wait_for(all_hosts_are_alive, deadline=time.time() + 60, period=0.1)
            logger.info(f"Upgrading {s.server_id}")
            await manager.server_change_version(s.server_id, scylla_binary)
            await manager.server_sees_others(s.server_id, 2, interval=60.0)
        logger.info("Done upgrading servers")
--- a/test/cluster/test_gossiper_orphan_remover.py
+++ b/test/cluster/test_gossiper_orphan_remover.py
@@ -8,7 +8,10 @@ import asyncio
 import time
 import pytest
 import logging
 from functools import partial
 from test.pylib.manager_client import ManagerClient
 from test.pylib.util import wait_for
 from test.pylib.internal_types import ServerInfo
 logger = logging.getLogger(__name__)
@@ -16,6 +19,26 @@ logger = logging.getLogger(__name__)
@pytest.mark.asyncio
@pytest.mark.skip_mode(mode='release', reason='error injections are not supported in release mode')
 async def test_crashed_node_substitution(manager: ManagerClient):
    """Test that a node which crashed after starting gossip but before joining group0
    (an 'orphan' node) is eventually removed from gossip by the gossiper_orphan_remover_fiber.
    The scenario:
    1. Start 3 nodes with the 'fast_orphan_removal_fiber' injection enabled. This freezes
       the gossiper_orphan_remover_fiber on each node before it enters its polling loop,
       so it cannot remove any orphan until explicitly unblocked.
    2. Start a 4th node with the 'crash_before_group0_join' injection enabled. This node
       starts gossip normally but blocks inside pre_server_start(), just before sending
       the join RPC to the topology coordinator. It never joins group0.
    3. Wait until the 4th node's gossip state has fully propagated to all 3 running peers,
       then trigger its crash via the injection. At this point all peers see it as an orphan:
       present in gossip but absent from the group0 topology.
    4. Assert the orphan is visible in gossip (live or down) on the surviving nodes.
    5. Unblock the gossiper_orphan_remover_fiber on all 3 nodes (via message_injection) and
       enable the 'speedup_orphan_removal' injection so the fiber removes the orphan immediately
       without waiting for the normal 60-second age threshold.
    6. Wait for the 'Finished to force remove node' log line confirming removal, then assert
       the orphan is no longer present in gossip.
    """
    servers = await manager.servers_add(3, config={
        'error_injections_at_startup': ['fast_orphan_removal_fiber']
    })
@@ -30,6 +53,20 @@ async def test_crashed_node_substitution(manager: ManagerClient):
    log = await manager.server_open_log(failed_server.server_id)
    await log.wait_for("finished do_send_ack2_msg")
    failed_id = await manager.get_host_id(failed_server.server_id)
    # Wait until the failed server's gossip state has propagated to all running peers.
    # "finished do_send_ack2_msg" only guarantees that one peer completed a gossip round
    # with the failed server; other nodes learn about it only in subsequent gossip rounds.
    # Querying gossip before propagation completes would cause the assertion below to fail
    # because the orphan node would not yet appear as live or down on every peer.
    async def gossip_has_node(server: ServerInfo):
        live = await manager.api.client.get_json("/gossiper/endpoint/live", host=server.ip_addr)
        down = await manager.api.client.get_json("/gossiper/endpoint/down", host=server.ip_addr)
        return True if failed_server.ip_addr in live + down else None
    for s in servers:
        await wait_for(partial(gossip_has_node, s), deadline=time.time() + 30)
    await manager.api.message_injection(failed_server.ip_addr, 'crash_before_group0_join')
    await task
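The wait added in the hunk above follows a poll-until-visible pattern: each peer is queried repeatedly until it reports the new node, and only then does the test proceed. The following is a minimal, self-contained sketch of that pattern. `wait_until`, `gossip_views`, `peer_sees`, and `propagate` are hypothetical stand-ins written for illustration; they are not the real `test.pylib.util.wait_for` or the ScyllaDB REST API, and the in-memory dictionary only simulates gossip propagation.

```python
import asyncio
import time
from functools import partial


async def wait_until(pred, deadline, period=0.1):
    """Hypothetical helper: poll `pred` until it returns a non-None value
    or the wall-clock deadline passes."""
    while True:
        result = await pred()
        if result is not None:
            return result
        if time.time() >= deadline:
            raise TimeoutError("condition not met before deadline")
        await asyncio.sleep(period)


# Simulated per-peer gossip views (assumption: stands in for the
# live/down endpoint lists each peer would report).
gossip_views = {"n1": {"10.0.0.4"}, "n2": set(), "n3": set()}


async def peer_sees(peer, addr):
    # Return None (not False) while the condition is unmet, so the poller retries.
    return True if addr in gossip_views[peer] else None


async def propagate(addr):
    # Simulate later gossip rounds reaching the remaining peers one by one.
    for peer in ("n2", "n3"):
        await asyncio.sleep(0.02)
        gossip_views[peer].add(addr)


async def main():
    addr = "10.0.0.4"
    task = asyncio.create_task(propagate(addr))
    # Only n1 has seen the node so far; wait for every peer before continuing.
    for peer in gossip_views:
        await wait_until(partial(peer_sees, peer, addr),
                         deadline=time.time() + 5, period=0.01)
    await task
    return all(addr in view for view in gossip_views.values())


all_visible = asyncio.run(main())
```

Waiting on each peer in turn is sufficient here because visibility only ever grows in this simulation; once the loop finishes, every peer has reported the node at least once.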
Author	SHA1	Message	Date
Avi Kivity	b708e5d7c9	Merge 'test: fix race condition in test_crashed_node_substitution' from Sergey Zolotukhin	2026-03-26 21:40:33 +02:00
Petr Gusev	c38e312321	test_lwt_fencing_upgrade: fix quorum failure due to gossip lag If lwt_workload() sends an update immediately after a rolling restart, the coordinator might still see a replica as down due to gossip lagging behind. Concurrently restarting another node leaves only one available replica, failing the LOCAL_QUORUM requirement for learn or eventually consistent sp::query() in sp::cas() and resulting in a mutation_write_failure_exception. We fix this problem by waiting for the restarted server to see 2 other peers. The server_change_version doesn't do that by default -- it passes wait_others=0 to server_start(). Fixes SCYLLADB-1136 Closes scylladb/scylladb#29234	2026-03-26 21:25:53 +02:00
bitpathfinder	627a8294ed	test: test_crashed_node_substitution: add docstring and fix whitespace Add a description of the test's intent and scenario; remove extra blanks.	2026-03-26 18:40:17 +01:00
bitpathfinder	5a086ae9b7	test: fix race condition in test_crashed_node_substitution `test_crashed_node_substitution` intermittently failed: ``` assert len(gossiper_eps) == (len(server_eps) + 1) ``` The test crashed the node right after a single ACK2 handshake ("finished do_send_ack2_msg"), assuming the node state was visible to all peers. However, since gossip is eventually consistent, the update may not have propagated yet, so some nodes did not see the failed node. This change: Wait until the gossiper state is visible on peers before continuing the test and asserting. Fixes: SCYLLADB-1256.	2026-03-26 18:25:05 +01:00
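
The quorum failure described in commit c38e312321 comes down to simple arithmetic, which can be sketched as follows. This is an illustration only: the replication factor of 3 is an assumption (the commit message does not state it), and `quorum` and `write_can_succeed` are hypothetical helpers, not ScyllaDB functions.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def write_can_succeed(replication_factor: int, seen_down: int) -> bool:
    """True if the replicas the coordinator believes are up still form a quorum."""
    return replication_factor - seen_down >= quorum(replication_factor)


RF = 3  # assumption: three replicas in the local datacenter

# One node is restarting, and gossip on the coordinator is up to date.
only_restarting = write_can_succeed(RF, seen_down=1)

# One node is restarting, and the coordinator still sees the previously
# restarted node as down because gossip is lagging.
stale_plus_restarting = write_can_succeed(RF, seen_down=2)
```

Under these assumptions the first case leaves two replicas, which meets the quorum of two, while the second leaves only one and fails. Waiting for the restarted server to see its two peers before moving on to the next restart is intended to avoid the second case.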