load_balancer: fix std::out_of_range when decommissioning with empty nodes

Consider the following:

The tablet load balancer is working on:

- node1: an empty node (no tablets) with a large disk capacity
- node2: an empty node (no tablets) with a lower disk capacity than node1
- node3: a node being decommissioned, which contains tablet replicas

In load_balancer::make_internode_plan() the initial destination
node/shard is selected like this:

// Pick best target shard.
auto dst = global_shard_id {target, _load_sketch->get_least_loaded_shard(target)};

load_sketch::get_least_loaded_shard(host_id) calls ensure_node() which
adds the host to load_sketch's internal hash maps in case the node was
not yet seen by load_sketch.
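The lazy-insertion pattern described above can be sketched as follows. This is an illustrative toy, not ScyllaDB's actual load_sketch: the struct, field, and parameter names here are assumptions, and only the insert-on-first-query behavior mirrors what the text describes.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Toy model of the lazy-insertion behavior: a host enters the sketch's
// internal map only when ensure_node() is called for it, i.e. only when
// some code path actually queried that host.
struct toy_load_sketch {
    using host_id = int;
    std::unordered_map<host_id, std::vector<std::size_t>> nodes; // host -> per-shard load

    void ensure_node(host_id h, std::size_t shard_count) {
        // try_emplace is a no-op if the host is already present.
        nodes.try_emplace(h, std::vector<std::size_t>(shard_count, 0));
    }

    std::size_t get_least_loaded_shard(host_id h, std::size_t shard_count) {
        ensure_node(h, shard_count); // inserts h if it was never seen before
        const auto& shards = nodes.at(h);
        std::size_t best = 0;
        for (std::size_t s = 1; s < shards.size(); ++s) {
            if (shards[s] < shards[best]) {
                best = s;
            }
        }
        return best;
    }
};
```

The consequence: a host that was never passed to get_least_loaded_shard() is simply absent from the map, which is exactly the precondition for the crash described below.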

Let's assume dst is a shard on node2 (the lower-capacity empty node).

Later in load_balancer::make_internode_plan() we will call
pick_candidate() to try to find a better destination node than the
initial one:

// May choose a different source shard than src.shard or different destination host/shard than dst.
auto candidate = co_await pick_candidate(nodes, src_node_info, target_info, src, dst, nodes_by_load_dst,
                                            drain_skipped);
auto source_tablets = candidate.tablets;
src = candidate.src;
dst = candidate.dst;

If pick_candidate() selects a different empty destination node (here
node1, due to its larger capacity), and that node has not yet been seen
by load_sketch (because it was empty), a subsequent call to
load_sketch::pick() will look the node up via
std::unordered_map::at(). Since the node is not in the map, at() throws
a std::out_of_range exception, crashing the load balancer.
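The failure mode is a plain standard-library guarantee: std::unordered_map::at() throws std::out_of_range for a missing key. A minimal standalone illustration (not ScyllaDB code; names are invented for the sketch):

```cpp
#include <stdexcept>
#include <unordered_map>

// Looking up a host that was never inserted, the way pick() would for a
// node that ensure_node() never saw. at() throws std::out_of_range.
bool at_throws_for_missing_key() {
    std::unordered_map<int, int> nodes; // stands in for load_sketch's _nodes
    try {
        nodes.at(42); // host 42 was never inserted
    } catch (const std::out_of_range&) {
        return true;
    }
    return false;
}
```

In the balancer this exception propagates out of pick(), which is what takes the load balancer down.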

This problem is fixed by changing load_sketch::populate() to initialize
its internal maps up front with all the nodes selected by populate()'s
argument filters, rather than relying on lazy insertion.
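The shape of the fix can be sketched like this. All names and signatures here are hypothetical stand-ins, not ScyllaDB's actual API; the point is only that populate() registers every filtered host eagerly, so pick() can never see an unregistered one.

```cpp
#include <functional>
#include <unordered_map>
#include <vector>

using host_id = int;

// Hypothetical sketch of the fixed behavior: every host that passes the
// filter is inserted into the map during populate(), even if it holds
// no tablets, so later lookups cannot miss.
struct fixed_load_sketch {
    std::unordered_map<host_id, long> nodes; // host -> total load

    void ensure_node(host_id h) {
        nodes.try_emplace(h, 0);
    }

    void populate(const std::vector<host_id>& hosts,
                  const std::function<bool(host_id)>& filter) {
        for (host_id h : hosts) {
            if (!filter || filter(h)) {
                ensure_node(h); // eager insert, even for empty nodes
            }
        }
    }

    long pick(host_id h) {
        return nodes.at(h); // safe for any host populate() selected
    }
};
```

With this, a destination chosen later by pick_candidate() is always already present in the sketch, regardless of whether get_least_loaded_shard() was ever called for it.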

Fixes: #26203

Closes scylladb/scylladb#26207

(cherry picked from commit c6c9c316a7)

Closes scylladb/scylladb#26240
Author: Ferenc Szili
Date: 2025-09-23 11:28:52 +02:00
Committed-by: Botond Dénes
Parent: c7091b61e4
Commit: 99b69092ef
2 changed files with 47 additions and 0 deletions


@@ -3726,6 +3726,43 @@ static void execute_tablet_for_new_rf_test(calculate_tablet_replicas_for_new_rf_
}
}
SEASTAR_THREAD_TEST_CASE(test_ensure_node_for_load_sketch) {
    // This test reproduces the balancer crash when a node is drained and there is more than one
    // empty destination node. If one of these destination nodes has a lower capacity than the other,
    // and the initial target node selected is the one with lower capacity, pick_candidate() will then
    // change the target node to the one with higher capacity. The problem is that
    // load_sketch::get_least_loaded_shard() and consequently load_sketch::ensure_node() have not yet
    // been called for the new, larger target (only for the initial, smaller one), and load_sketch will
    // not have the larger node in its _nodes member hash map. This will cause a std::out_of_range
    // exception when load_sketch::pick() is called with the host_id of the larger node.
    do_with_cql_env_thread([] (auto& e) {
        logging::logger_registry().set_logger_level("load_balancer", logging::log_level::debug);

        topology_builder topo(e);

        auto host1 = topo.add_node(node_state::normal, 2);
        const uint64_t node1_capacity = 50UL * 1024UL * 1024UL * 1024UL;
        topo.get_shared_load_stats().set_capacity(host1, node1_capacity);

        auto ks_name = add_keyspace(e, {{topo.dc(), 1}}, 4);
        add_table(e, ks_name).get();

        auto host2 = topo.add_node(node_state::normal, 2);
        const uint64_t node2_capacity = 70UL * 1024UL * 1024UL * 1024UL;
        topo.get_shared_load_stats().set_capacity(host2, node2_capacity);

        auto host3 = topo.add_node(node_state::normal, 2);
        const uint64_t node3_capacity = 60UL * 1024UL * 1024UL * 1024UL;
        topo.get_shared_load_stats().set_capacity(host3, node3_capacity);

        topo.set_node_state(host1, node_state::removing);

        auto& talloc = e.get_tablet_allocator().local();
        auto& stm = e.shared_token_metadata().local();
        talloc.balance_tablets(stm.get(), topo.get_shared_load_stats().get()).get();
    }).get();
}

SEASTAR_THREAD_TEST_CASE(test_calculate_tablet_replicas_for_new_rf_upsize_one_dc) {
calculate_tablet_replicas_for_new_rf_config config;
config.ring_points = {