mirror of
https://github.com/scylladb/scylladb.git
synced 2026-05-01 21:55:50 +00:00
When a node is being replaced, it enters a "left" state while still owning tokens. Before this patch, this is also when we start draining hints targeted to this node, so the hints may get sent before the token ownership is migrated to another replica, and these hints may get lost.

In this patch we postpone the hint draining for "left" nodes until we know that the target nodes no longer own any tokens, so they're no longer referenced in topology. I'm calling such nodes "released".

Before this change, when we started draining hints, we knew the IP addresses of the target nodes. We lose this information after entering the "left" state, so when draining hints after a node is "released", we can only drain hints targeted to a host_id, not hints targeted to a specific IP. We may have hints targeted to IPs if the migration from IP-based to host_id-based hints didn't happen yet. Since 2024.2.0, the migration happens when a cluster feature is enabled, so such hints can only exist if we perform a direct upgrade from a version before 2024.2.0 to a version that has this change (2025.4.0+).

To avoid losing hints completely when such an upgrade is combined with a node removal/replacement, we still drain hints when the node enters the "left" state if the migration of hints to host_id wasn't performed yet. For these drains, the problematic scenario can't occur because it only affects tablets, and when upgrading from a version before 2024.2.0, no tablets can exist yet. If we perform such a drain, we no longer need to drain hints when entering the "released" state, so we only drain when entering that state if the migration was already completed.

With this setup, we'll always drain hints at least once when a node is leaving. However, if the migration to host_ids finishes between entering the "left" state and the "released" state, we'll attempt to drain the hints twice.
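
The drain rule described above can be summarized in a small sketch. This is only an illustration of the decision, not the code from this patch: `node_event`, `should_drain_hints` and `drain_attempts` are hypothetical names, and the real implementation works on the topology and hint manager state rather than on plain booleans.

```cpp
// Hypothetical model of the two points at which a leaving node's hints
// may be drained. These names are illustrative and do not exist in ScyllaDB.
enum class node_event { left, released };

// Decide whether to start a drain on the given event.
// - On "left": drain only if hints have NOT been migrated to host_ids yet,
//   because IP-addressed hints can no longer be matched once the IP is lost.
// - On "released": drain only if the migration HAS completed.
constexpr bool should_drain_hints(node_event ev, bool migrated_to_host_id) {
    return ev == node_event::left ? !migrated_to_host_id : migrated_to_host_id;
}

// Number of drain attempts for one leaving node, given the migration status
// observed at each of the two events. The migration only moves forward, so
// (migrated at "left", not migrated at "released") cannot happen in practice.
constexpr int drain_attempts(bool migrated_at_left, bool migrated_at_released) {
    return (should_drain_hints(node_event::left, migrated_at_left) ? 1 : 0)
         + (should_drain_hints(node_event::released, migrated_at_released) ? 1 : 0);
}
```

Under these assumptions, `drain_attempts(false, false)` and `drain_attempts(true, true)` both give one attempt, and `drain_attempts(false, true)` gives two, which is the case where the migration finishes between the two states.
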
Draining twice shouldn't be a problem though, because each `drain_for()` is performed under the `_drain_lock`, and after a `hint_endpoint_manager` is drained, it's removed, so we won't try to drain it twice.

This patch also includes a test verifying that hints are properly replayed after a node replace.

Fixes https://github.com/scylladb/scylladb/issues/24980

Closes scylladb/scylladb#24981