raft topology: preserve IP -> ID mapping of a replacing node on restart

We currently do it only for a bootstrapping node, which is a bug. The missing IP can cause an internal error, for example, in the following scenario: - replace fails during streaming, - all live nodes are shut down before the rollback of replace completes, - all live nodes are restarted, - live nodes start hitting internal error in all operations that require IP of the replacing node (like client requests or REST API requests coming from nodetool). We fix the bug here, but we do it separately for replace with different IP and replace with the same IP. For replace with different IP, we persist the IP -> host ID mapping in `system.peers` just like for bootstrap. That's necessary, since there is no other way to determine IP of the replacing node on restart. For replace with the same IP, we can't do the same. This would require deleting the row corresponding to the node being replaced from `system.peers`. That's fine in theory, as that node is permanently banned, so its IP shouldn't be needed. Unfortunately, we have many places in the code where we assume that IP of a topology member is always present in the address map or that a topology member is always present in the gossiper endpoint set. Examples of such places: - nodetool operations, - REST API endpoints, - `db::hints::manager::store_hint`, - `group0_voter_handler::update_nodes`. We could fix all those places and verify that drivers work properly when they see a node in the token metadata, but not in `system.peers`. However, that would be too risky to backport. We take a different approach. We recover IP of the replacing node on restart based on the state of the topology state machine and `system.peers` just after loading `system.peers`. We rely on the fact that group 0 is set up at this point. The only case where this assumption is incorrect is a restart in the Raft-based recovery procedure. However, hitting this problem then seems improbable, and even if it happens, we can restart the node again after ensuring that no client and REST API requests come before replace is rolled back on the new topology coordinator. Hence, it's not worth to complicate the fix (by e.g. looking at the persistent topology state instead of the in-memory state machine).
2026-05-30 19:46:48 +00:00 · 2025-11-28 14:13:32 +01:00
parent 2e33234e91
commit fc4c2df2ce
1 changed files with 27 additions and 0 deletions
--- a/service/storage_service.cc
+++ b/service/storage_service.cc
@@ -512,6 +512,17 @@ future<> storage_service::raft_topology_update_ip(locator::host_id id, gms::inet
                sys_ks_futures.push_back(_sys_ks.local().update_peer_info(ip, id, {}));
            }
        break;
+        case node_state::replacing:
+            // Save the mapping just like for bootstrap above, but only in the case of replace with different IP, so
+            // that we don't have to delete the row of the node being replaced. For replace with the same IP, the
+            // mapping is recovered on restart based on the state of the topology state machine and system.peers.
+            if (!is_me(ip)) {
+                auto replaced_id = std::get<replace_param>(t.req_param.at(raft_id)).replaced_id;
+                if (const auto it = host_id_to_ip_map.find(locator::host_id(replaced_id.uuid())); it == host_id_to_ip_map.end() || it->second != ip) {
+                    sys_ks_futures.push_back(_sys_ks.local().update_peer_info(ip, id, {}));
+                }
+            }
+        break;
        default:
        break;
    }
@@ -3074,6 +3085,22 @@ future<> storage_service::join_cluster(sharded<service::storage_proxy>& proxy,
    set_mode(mode::STARTING);

    std::unordered_map<locator::host_id, gms::loaded_endpoint_state> loaded_endpoints = co_await _sys_ks.local().load_endpoint_state();
+    if (_group0->joined_group0() && raft_topology_change_enabled()) {
+        // Recover the endpoint state of the node replacing with the same IP. Its IP mapping is not in system.peers.
+        const auto& topo = _topology_state_machine._topology;
+        for (const auto& [id, replica_state]: topo.transition_nodes) {
+            auto host_id = locator::host_id(id.uuid());
+            if (replica_state.state == node_state::replacing && !loaded_endpoints.contains(host_id)) {
+                auto replaced_id = std::get<replace_param>(topo.req_param.at(id)).replaced_id;
+                if (const auto it = loaded_endpoints.find(locator::host_id(replaced_id.uuid())); it != loaded_endpoints.end()) {
+                    auto ip = it->second.endpoint;
+                    slogger.info("Adding node {}/{} that is replacing {}/{} to loaded_endpoints", host_id, ip, replaced_id, ip);
+                    gms::loaded_endpoint_state st{.endpoint = ip};
+                    loaded_endpoints.emplace(host_id, std::move(st));
+                }
+            }
+        }
+    }

    // Seeds are now only used as the initial contact point nodes. If the
    // loaded_endpoints are empty which means this node is a completely new