0ede8d154b introduced the dev doc for size
based load balancing, but also added spelling errors.
This PR fixed these errors.
Closes scylladb/scylladb#28196
6.3 KiB
Size based load balancing
Until size based load balancing was introduced, ScyllaDB performed disk based balancing based on disk capacity and tablet count with the assumption that every tablet uses the same amount of disk space. This means that the number of tablets located on a node was proportional to the gross disk capacity of that node. Because the used disk space of different tablets can vary greatly, this could create imbalance in disk utilization.
Size based load balancing aims to achieve better disk utilization across nodes in a cluster. The load balancer will continuously gather information about available disk space and tablet sizes from all the nodes. It then incrementally computes tablet migration plans which equalize disk utilization across the cluster.
Basic operation
The load balancer runs as a background task on the same shard that the topology
coordinator is running on. The topology coordinator collects information needed
for the load balancer to make decisions about which tablets to migrate. This
information is collected periodically by the coordinator from all the nodes in
the cluster in the form of the data structure load_stats. The information in
this struct relevant for the load balancer is stored in its member tablet_stats,
which contains:
tablet_sizes: the disk size in bytes of all the tablets on the given node.effective_capacity: contains the sum of available disk space and all the tablet sizes on the given node.
The balancer will use this information to compute the disk load (disk utilization) on every node and shard. It will then migrate tablets from the most to the least loaded nodes and shards.
Table balance
The secondary goal of the load balancer is to achieve table balance. This means that the balancer needs to equalize the following ratios:
- disk used by the table on a shard / total disk used by the table in the rack
- effective capacity of a shard / effective capacity of the rack
Otherwise, we could have imbalance where a shard contains more data from a table, which can potentially overload the CPU of that shard in case the given table is hot.
load_stats reconciliation
Because the load_stats collection interval is 1 minute, by the time the balancer
starts using the information in load_stats, that information can be stale. This can
be caused by tablet migrations or table resize (split or merge). To get around this,
we need to update the information about tablet sizes in load_stats after a migration
or resize. Issued tablet migrations update this information in load_stats by also
migrating the tablet size from one host to another. This tablet size migration
reconciliation is performed during the end_migration stage of the tablet migration.
Tablet size reconciliation for split or merge is performed during tablet resize
finalization. For a split, the reconciliation will divide the tablet size, and create
two tablets in place of the original tablet pre-split. For merge, it will accumulate
the tablet sizes of the tablets pre-merge, create a merged tablet size and remove the
tablet sizes pre-merge from load_stats.
Force capacity based balancing
The load balancer has the ability to fall back on capacity based balancing. This can be
enforced by a config parameter force_capacity_based_balancing. During capacity based
balancing, the balancer will not look up the actual tablet sizes from load_stats,
and will instead assume each tablet size is equal to default_target_tablet_sizes. It will
also use the gross disk capacity (sent in load_stats struct in the capacity member),
instead of effective capacity.
Excluding nodes with incomplete tablet sizes
Even with tablet size reconciliation (during migration and table resize), it is still
possible for the tablet size information in load_stats to not match the current tablet
information found in the tablet map. In order to avoid problems with the balancer
making decisions based on incomplete or incorrect data, size based load balancing
will not balance nodes which have incomplete tablet sizes. Instead, it will ignore
these nodes (these nodes will not be selected as sources or destinations for tablet
migrations), and will wait for correct tablet sizes to arrive after the next load_stats
refresh by the topology coordinator.
One exception to this are nodes which have been excluded from the cluster. These nodes
are down and therefore are not able to send fresh load_stats. But they have to be drained
of their tablets (via tablet rebuild), and the balancer must do this even with incomplete
tablet data. So, only excluded nodes are allowed to have missing tablet sizes.
Size based balancing cluster feature
During rolling upgrades, it can take hours or even days until all the nodes in a cluster
have been upgraded. This means that the non-upgraded nodes will send load_stats without
tablet sizes. Considering the balancer ignores these nodes during balancing, we would
have a problem with some of the nodes not being load balanced for extended periods of
time. To avoid this, size based load balancing is only enabled after the cluster feature
is enabled. Until that time, the balancer will fall back on capacity based balancing.
Minimal tablet size
After a table is created and it begins to receive data, there can be a period of time where the data in some of the tablets has been flushed, while others have not. During this time, the load balancer will migrate seemingly empty (but actually not yet flushed) tablet away from the nodes where they have been created. Later, when all the tablets have been flushed, the balancer will migrate the tablets again. In order to avoid these unnecessary early migrations, we introduce the idea of minimal tablet size. The balancer will treat any tablet smaller than minimal tablet size as having a size of minimal tablet size. This reduces early migrations.
Balance threshold percentage
The balancer considers if a set nodes are balanced by computing the delta of disk load between the most loaded and least loaded node, and dividing it with the load of the most loaded node:
delta = (most_loaded - least_loaded) / most_loaded
If this computed value is below a threshold, the nodes are considered balanced. This threshold
can be configured with the size_based_balance_threshold_percentage config option.