raft topology: preserve IP -> ID mapping of a replacing node on restart

We currently do it only for a bootstrapping node, which is a bug. The
missing IP can cause an internal error, for example, in the following
scenario:
- replace fails during streaming,
- all live nodes are shut down before the rollback of replace completes,
- all live nodes are restarted,
- live nodes start hitting internal error in all operations that
  require IP of the replacing node (like client requests or REST API
  requests coming from nodetool).

We fix the bug here, but we do it separately for replace with different
IP and replace with the same IP.

For replace with different IP, we persist the IP -> host ID mapping
in `system.peers` just like for bootstrap. That's necessary, since there
is no other way to determine IP of the replacing node on restart.

For replace with the same IP, we can't do the same. This would require
deleting the row corresponding to the node being replaced from
`system.peers`. That's fine in theory, as that node is permanently
banned, so its IP shouldn't be needed. Unfortunately, we have many
places in the code where we assume that IP of a topology member is always
present in the address map or that a topology member is always present in
the gossiper endpoint set. Examples of such places:
- nodetool operations,
- REST API endpoints,
- `db::hints::manager::store_hint`,
- `group0_voter_handler::update_nodes`.

We could fix all those places and verify that drivers work properly when
they see a node in the token metadata, but not in `system.peers`.
However, that would be too risky to backport.

We take a different approach. We recover IP of the replacing node on
restart based on the state of the topology state machine and
`system.peers` just after loading `system.peers`.

We rely on the fact that group 0 is set up at this point. The only case
where this assumption is incorrect is a restart in the Raft-based
recovery procedure. However, hitting this problem then seems improbable,
and even if it happens, we can restart the node again after ensuring
that no client and REST API requests come before replace is rolled back
on the new topology coordinator. Hence, it's not worth to complicate the
fix (by e.g. looking at the persistent topology state instead of the
in-memory state machine).
This commit is contained in:
Patryk Jędrzejczak
2025-11-28 14:13:32 +01:00
parent 2e33234e91
commit fc4c2df2ce

View File

@@ -512,6 +512,17 @@ future<> storage_service::raft_topology_update_ip(locator::host_id id, gms::inet
sys_ks_futures.push_back(_sys_ks.local().update_peer_info(ip, id, {}));
}
break;
case node_state::replacing:
// Save the mapping just like for bootstrap above, but only in the case of replace with different IP, so
// that we don't have to delete the row of the node being replaced. For replace with the same IP, the
// mapping is recovered on restart based on the state of the topology state machine and system.peers.
if (!is_me(ip)) {
auto replaced_id = std::get<replace_param>(t.req_param.at(raft_id)).replaced_id;
if (const auto it = host_id_to_ip_map.find(locator::host_id(replaced_id.uuid())); it == host_id_to_ip_map.end() || it->second != ip) {
sys_ks_futures.push_back(_sys_ks.local().update_peer_info(ip, id, {}));
}
}
break;
default:
break;
}
@@ -3074,6 +3085,22 @@ future<> storage_service::join_cluster(sharded<service::storage_proxy>& proxy,
set_mode(mode::STARTING);
std::unordered_map<locator::host_id, gms::loaded_endpoint_state> loaded_endpoints = co_await _sys_ks.local().load_endpoint_state();
if (_group0->joined_group0() && raft_topology_change_enabled()) {
// Recover the endpoint state of the node replacing with the same IP. Its IP mapping is not in system.peers.
const auto& topo = _topology_state_machine._topology;
for (const auto& [id, replica_state]: topo.transition_nodes) {
auto host_id = locator::host_id(id.uuid());
if (replica_state.state == node_state::replacing && !loaded_endpoints.contains(host_id)) {
auto replaced_id = std::get<replace_param>(topo.req_param.at(id)).replaced_id;
if (const auto it = loaded_endpoints.find(locator::host_id(replaced_id.uuid())); it != loaded_endpoints.end()) {
auto ip = it->second.endpoint;
slogger.info("Adding node {}/{} that is replacing {}/{} to loaded_endpoints", host_id, ip, replaced_id, ip);
gms::loaded_endpoint_state st{.endpoint = ip};
loaded_endpoints.emplace(host_id, std::move(st));
}
}
}
}
// Seeds are now only used as the initial contact point nodes. If the
// loaded_endpoints are empty which means this node is a completely new