In order to avoid per-table tablet load imbalance balance from forming
in the cluster after adding nodes, the load balancer now picks the
candidate tablet at random. This should keep the per-table
distribution on the target node similar to the distribution on the
source nodes.
Currently, candidate selection picks the first tablet in the
unordered_set, so the distribution depends on hashing in the unordered
set. Due to the way hash is calculated, table id dominates the hash
and a single table can be chosen more often for migration away. This
can result in imbalance of tablets for any given table after
bootstrapping a new node.
For example, consider the following results of a simulation which
starts with a 6-node cluster and does a sequence of node bootstraps
and decommissions. One table has 4096 tablets and RF=1, and the other
has 256 tablets and RF=2. Before the patch, the smaller table has
node overcommit of 2.34 in the worst topology state, while after the
patch it has overcommit of 1.65. overcommit is calculated as max load
(tablet count per node) dividied by perfect average load (all tablets / nodes):
Run #861, params: {iterations=6, nodes=6, tablets1=4096 (10.7/sh), tablets2=256 (1.3/sh), rf1=1, rf2=2, shards=64}
Overcommit : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
Overcommit : worst: {table1={shard=1.23, node=1.10}, table2={shard=9.85, node=1.65}}
Overcommit (old) : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
Overcommit (old) : worst: {table1={shard=1.31, node=1.12}, table2={shard=64.00, node=2.34}}
The worst state before the patch had the following distribution of tablets for the smaller table:
Load on host ba7f866d...: total=171, min=1, max=7, spread=6, avg=2.67, overcommit=2.62
Load on host 4049ae8d...: total=102, min=0, max=6, spread=6, avg=1.59, overcommit=3.76
Load on host 3b499995...: total=89, min=0, max=4, spread=4, avg=1.39, overcommit=2.88
Load on host ad33bede...: total=63, min=0, max=3, spread=3, avg=0.98, overcommit=3.05
Load on host 0c2e65dc...: total=57, min=0, max=3, spread=3, avg=0.89, overcommit=3.37
Load on host 3f2d32d4...: total=27, min=0, max=2, spread=2, avg=0.42, overcommit=4.74
Load on host 9de9f71b...: total=3, min=0, max=1, spread=1, avg=0.05, overcommit=21.33
One node has as many as 171 tablets of that table and another one has as few as 3.
After the patch, the worst distribution looks like this:
Load on host 94a02049...: total=121, min=1, max=6, spread=5, avg=1.89, overcommit=3.17
Load on host 65ac6145...: total=87, min=0, max=5, spread=5, avg=1.36, overcommit=3.68
Load on host 856a66d1...: total=80, min=0, max=5, spread=5, avg=1.25, overcommit=4.00
Load on host e3ac4a41...: total=77, min=0, max=4, spread=4, avg=1.20, overcommit=3.32
Load on host 81af623f...: total=66, min=0, max=4, spread=4, avg=1.03, overcommit=3.88
Load on host 4a038569...: total=47, min=0, max=2, spread=2, avg=0.73, overcommit=2.72
Load on host c6ab3fe9...: total=34, min=0, max=3, spread=3, avg=0.53, overcommit=5.65
Most-loaded node has 121 tablets and least loaded node has 34 tablets.
It's still not good, a better distribution is possible, but it's an improvement.
Refs #16824
(cherry picked from commit 3be6120e3b)
(cherry picked from commit c9bcb5e400)
(cherry picked from commit 7b1eea794b)
(cherry picked from commit 603abddca9)
Refs #18885Closesscylladb/scylladb#19036
* github.com:scylladb/scylladb:
tablets: load balancer: Use random selection of candidates when moving tablets
test: perf: Add test for tablet load balancer effectiveness
load_sketch: Extract get_shard_minmax()
load_sketch: Allow populating only for a given table