locator: tablets: Distribute data evenly among primary replicas during restore

Most likely 817fdad uncovered the fact that our choice of
primary replica was resonating with tablet allocation, so we ended up
picking the same replica as primary within a scope instead of rotating
the primary role among all replicas in the scope.
As a result, restoring into a 9-node cluster with
primary_replica_only=true could, for instance, put all data onto 3 nodes,
leaving the other 6 unused. The dataset was only balanced afterwards, by
the subsequent repair step.

Split from bhalevy/load-balance-primary-replica

Fixes #27281

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Robert Bindar
2025-12-06 09:11:32 +02:00
parent 92c988514c
commit d88036db48

@@ -576,7 +576,23 @@ std::optional<tablet_replica> maybe_get_primary_replica(tablet_id id, const tabl
     tablet_replica_set replica_set_copy = replica_set;
     std::ranges::sort(replica_set_copy, tablet_replica_comparator(topo));
     const auto replicas = replica_set_copy | std::views::filter(std::move(filter)) | std::ranges::to<tablet_replica_set>();
-    return !replicas.empty() ? std::make_optional(replicas.at(size_t(id) % replicas.size())) : std::nullopt;
+    if (replicas.empty()) {
+        tablet_logger.debug("No replicas in scope for tablet {}", id);
+        return std::nullopt;
+    }
+    // Once the filter was pushed down, we got here a set of replicas which live within the selected scope.
+    // replicas[ (id + id / size) % size ] is used to distribute the load evenly across all replicas in the scope.
+    // Let's take for instance a cluster with 1 dc, 3 racks, 9 nodes, rf=3, scope=dc. The `replicas` array here will contain 3
+    // replicas - one from each rack.
+    // Tablet id=0 will end up choosing replicas[0]
+    // Tablet id=1 will end up choosing replicas[1]
+    // Tablet id=2 will end up choosing replicas[2]
+    // We want tablet id=3 to choose replicas[1] and subsequently tablet id=6 to pick replicas[2] to ensure even distribution.
+    auto primary = replicas.at((size_t(id) + size_t(id) / replicas.size()) % replicas.size());
+    tablet_logger.debug("Primary replica of tablet {} is {}: replicas in scope: {}", id, primary, fmt::join(replicas, ","));
+    return primary;
 }
 tablet_replica tablet_map::get_primary_replica(tablet_id id, const locator::topology& topo) const {
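
The effect of the index change above can be illustrated in isolation. The following is a minimal sketch, not code from the tree: `old_index`, `new_index` and `distinct_primaries` are hypothetical names, and the placement rule (tablet `id` has its replica in every rack on node `id % racks`) is a toy assumption chosen only to reproduce the kind of resonance the commit message describes. It does not model the real tablet allocator.

```cpp
#include <cstddef>
#include <cstdint>

// Index formula used before this change: plain modulo over the in-scope replicas.
constexpr std::size_t old_index(std::size_t id, std::size_t size) {
    return id % size;
}

// Index formula introduced by this change: the extra id / size term shifts
// the starting offset by one every `size` tablets.
constexpr std::size_t new_index(std::size_t id, std::size_t size) {
    return (id + id / size) % size;
}

// Counts how many distinct (rack, node) pairs end up as primary for tablets
// 0..tablets-1.
// ASSUMPTION (toy model, not the real allocator): there are `racks` racks with
// `racks` nodes each, and tablet `id` has its replica in every rack on node
// id % racks. A 64-bit mask is used, so this only works for racks <= 8.
constexpr std::size_t distinct_primaries(std::size_t (*pick)(std::size_t, std::size_t),
                                         std::size_t tablets, std::size_t racks) {
    std::uint64_t used = 0;
    for (std::size_t id = 0; id < tablets; ++id) {
        const std::size_t rack = pick(id, racks); // which in-scope replica becomes primary
        const std::size_t node = id % racks;      // toy placement inside that rack
        used |= std::uint64_t{1} << (rack * racks + node);
    }
    std::size_t count = 0;
    for (; used != 0; used &= used - 1) {         // clear the lowest set bit each round
        ++count;
    }
    return count;
}

// Under the toy model, 9 tablets on a 3x3 cluster:
static_assert(distinct_primaries(old_index, 9, 3) == 3, "old formula reuses 3 nodes");
static_assert(distinct_primaries(new_index, 9, 3) == 9, "new formula reaches all 9 nodes");
```

Under this toy placement the old formula always picks rack `id % 3` and node `id % 3`, so only three nodes ever become primary. The new formula yields the rack sequence 0, 1, 2, 1, 2, 0, 2, 0, 1 for tablets 0 through 8, which pairs every rack with every node index exactly once.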