Files
Aleksandra Martyniuk d874d355c2 service: skip load_sketch unload for excluded nodes on RF shrink
When an RF change shrinks replicas on a DC and the node being shrunk is
excluded, refresh_tablet_load_stats() only provides load_stats for that
node if it has a cached snapshot from when the node was still up. If the
snapshot is missing or predates the tables being shrunk (e.g. they were
created after the node went down), stats stay incomplete. In that case
load_sketch::unload() called from make_rf_change_plan() throws:

    Can't provide accurate load computation with incomplete load_stats
    for host: <uuid>

Since an excluded node is not expected to come back, load_stats will
never become complete, and the topology coordinator retries the plan
infinitely, hanging ALTER KEYSPACE.

Add a check for excluded nodes and skip unload() for them: we are
removing the replica, so accurate load data for that node is not
needed. For all other node states the throw-and-retry behavior is
preserved.

Modify test_excludenode_shrink_rf to always trigger the bug: a new
error injection 'force_down_node_load_stats_invalid' forces the
invalid-stats path in refresh_tablet_load_stats() for a down node, so
the test does not depend on whether the load-stats refresher happened
to cache the excluded node's stats while it was still up.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1702.

Closes scylladb/scylladb#29622
2026-05-15 17:46:28 +02:00
..
2026-04-12 19:46:33 +03:00
2026-04-12 19:46:33 +03:00
2026-04-12 19:46:33 +03:00
2026-04-12 19:46:33 +03:00
2026-04-12 19:46:33 +03:00
2026-04-12 19:46:33 +03:00