From fefa35987bdffdcd34fd817dcd65f87a619b634c Mon Sep 17 00:00:00 2001
From: Asias He
Date: Thu, 6 Aug 2020 11:47:08 +0800
Subject: [PATCH] storage_service: Avoid updating tokens in system.peers for nodes to be removed

Consider:

1) Start n1,n2,n3
2) Stop n3
3) Start n4 to replace n3 but list n4 as seed node
4) Node n4 finishes the replace operation
5) Restart n2
6) Run SELECT * from system.peers on node or node 1.

    cqlsh> SELECT * from system.peers ;

    peer      | data_center | host_id | preferred_ip | rack | release_version | rpc_address | schema_version | supported_features | tokens
    127.0.0.3 | null        | null    | null         | null | null            | null        | null           | null               | {'-90410082611643223', '5874059110445936121'}

The replaced old node 127.0.0.3 shows up in system.peers.

(Note: since commit 399d79fc6f1413f68f88e617386d9c4f54da1889 (init: do
not allow replace-address for seeds), step 3 will be rejected. Assume
we use a version without it.)

The problem is that, after the restart, n2 sees n3 in gossip with
status SHUTDOWN. The storage_service::handle_state_normal callback is
called for 127.0.0.3. Since n4 uses different tokens than n3 (a seed
node does not bootstrap, so it uses new tokens instead of the tokens of
n3, which is being replaced), owned_tokens will be set.

We see logs like:

    [shard 0] storage_service - handle_state_normal: New node 127.0.0.3 at token 5874059110445936121
    [shard 0] storage_service - Host ID collision for cbec60e5-4060-428e-8d40-9db154572df7 between 127.0.0.4 and 127.0.0.3; ignored 127.0.0.3

As a result, db::system_keyspace::update_tokens is wrongly called to
write tokens to system.peers for 127.0.0.3:

    if (!owned_tokens.empty()) {
        db::system_keyspace::update_tokens(endpoint, owned_tokens)
    }

To fix, we should skip calling db::system_keyspace::update_tokens if
the node is present in endpoints_to_remove.

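For illustration only, the guard described above can be sketched as a
standalone predicate. This is a minimal sketch under stated
assumptions, not the Scylla implementation: the aliases endpoint_t and
token_t and the functions should_update_tokens and check_scenario are
hypothetical stand-ins, and endpoints_to_remove is assumed to be a
set-like container of endpoints.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_set>

// Hypothetical stand-ins for the real Scylla types (illustration only).
using endpoint_t = std::string;   // stands in for inet_address
using token_t = std::int64_t;     // stands in for a token value

// Returns true when the tokens of `endpoint` should be written to
// system.peers: the node must own tokens AND must not be scheduled
// for removal in the same handle_state_normal pass.
bool should_update_tokens(const endpoint_t& endpoint,
                          const std::unordered_set<token_t>& owned_tokens,
                          const std::unordered_set<endpoint_t>& endpoints_to_remove) {
    return !owned_tokens.empty() && endpoints_to_remove.count(endpoint) == 0;
}

// Replays the scenario from the commit message against the sketch.
void check_scenario() {
    const std::unordered_set<token_t> n3_tokens{
        -90410082611643223LL, 5874059110445936121LL};
    const std::unordered_set<endpoint_t> to_remove{"127.0.0.3"};
    const std::unordered_set<endpoint_t> none;

    // Old check (tokens non-empty only) would have written 127.0.0.3.
    assert(!n3_tokens.empty());
    // New check skips the write because 127.0.0.3 is being removed.
    assert(!should_update_tokens("127.0.0.3", n3_tokens, to_remove));
    // A node that is not being removed is still written.
    assert(should_update_tokens("127.0.0.3", n3_tokens, none));
}
```

The predicate mirrors the one-line condition change in the diff below;
keeping the removal check next to the emptiness check avoids touching
the surrounding update_tokens call.
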
Refs: #4652
Refs: #6397
---
 service/storage_service.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/service/storage_service.cc b/service/storage_service.cc
index b6eec00310..c2ea1dc945 100644
--- a/service/storage_service.cc
+++ b/service/storage_service.cc
@@ -1174,7 +1174,7 @@ void storage_service::handle_state_normal(inet_address endpoint) {
         remove_endpoint(ep);
     }
     slogger.debug("handle_state_normal: endpoint={} owned_tokens = {}", endpoint, owned_tokens);
-    if (!owned_tokens.empty()) {
+    if (!owned_tokens.empty() && !endpoints_to_remove.count(endpoint)) {
         db::system_keyspace::update_tokens(endpoint, owned_tokens).then_wrapped([endpoint] (auto&& f) {
             try {
                 f.get();