gossiper: check for a race condition in do_apply_state_locally

In do_apply_state_locally, a race condition can occur if a task is
suspended at a preemption point while the node entry is not locked.
During this time, the host may be removed from _endpoint_state_map.
When the task resumes, this can lead to inserting an entry with an
empty host ID into the map, causing various errors, including a node
crash.

This change:
1. Adds a check after locking the map entry: if a gossip ACK update
   does not contain a host ID, we verify that an entry with that host ID
   still exists in the gossiper’s _endpoint_state_map.
2. Removes xfail from the test_gossiper_race test since the issue is now
   fixed.
3. Adds exception handling in `do_shadow_round` to skip responses from
   nodes that sent an empty host ID.
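
The race and the re-check after locking can be illustrated with a
minimal asyncio sketch. This is not the gossiper's C++ code: the
`Gossiper` class, `EmptyHostId`, `apply_state_locally`,
`remove_endpoint`, and `demo` are hypothetical names chosen for
illustration, and a plain dict stands in for `_endpoint_state_map`.
The `except` branch in `demo` only loosely mirrors how a caller such as
`do_shadow_round` might skip an update that carries no host ID.

```python
import asyncio


class EmptyHostId(Exception):
    """Hypothetical error: update has no host ID and the entry is gone."""


class Gossiper:
    """Toy model; a dict stands in for _endpoint_state_map."""

    def __init__(self):
        self.endpoint_state_map = {}  # endpoint -> state dict
        self._locks = {}              # endpoint -> asyncio.Lock

    def _lock(self, endpoint):
        return self._locks.setdefault(endpoint, asyncio.Lock())

    async def remove_endpoint(self, endpoint):
        async with self._lock(endpoint):
            self.endpoint_state_map.pop(endpoint, None)

    async def apply_state_locally(self, endpoint, update):
        # Preemption point while the entry is NOT locked: another task
        # may remove the endpoint before this one resumes.
        await asyncio.sleep(0)
        async with self._lock(endpoint):
            # Re-check after locking: without a host ID in the update,
            # the entry must still exist, otherwise we would insert an
            # entry with an empty host ID.
            if update.get("host_id") is None and endpoint not in self.endpoint_state_map:
                raise EmptyHostId(endpoint)
            entry = self.endpoint_state_map.setdefault(endpoint, {})
            entry.update({k: v for k, v in update.items() if v is not None})


async def demo():
    g = Gossiper()
    g.endpoint_state_map["10.0.0.1"] = {"host_id": "h1"}
    # Start applying an update that carries no host ID.
    task = asyncio.create_task(
        g.apply_state_locally("10.0.0.1", {"status": "NORMAL"})
    )
    await asyncio.sleep(0)               # let the task reach its preemption point
    await g.remove_endpoint("10.0.0.1")  # endpoint disappears in the meantime
    try:
        await task
        outcome = "applied"
    except EmptyHostId:
        outcome = "skipped"              # caller skips the update instead of crashing
    return outcome, dict(g.endpoint_state_map)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the update is skipped and the map stays empty; removing
the re-check would instead leave an entry without a host ID in the map.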

This re-applies commit 13392a40d4, which was reverted in 46aa59fe49,
after fixing the issues that caused CI to fail.

Fixes: scylladb/scylladb#25702
Fixes: scylladb/scylladb#25621

Ref: scylladb/scylla-enterprise#5613
(cherry picked from commit f08df7c9d7)
Sergey Zolotukhin
2025-08-28 14:30:30 +02:00
committed by GitHub Action
parent e8b903979e
commit e157e8577e
2 changed files with 12 additions and 2 deletions


@@ -15,7 +15,6 @@ from test.pylib.manager_client import ManagerClient
 @pytest.mark.asyncio
 @skip_mode('release', 'error injections are not supported in release mode')
-@pytest.mark.xfail(reason="https://github.com/scylladb/scylladb/issues/25621")
 async def test_gossiper_race_on_decommission(manager: ManagerClient):
     """
     Test for gossiper race scenario (https://github.com/scylladb/scylladb/issues/25621):