scylladb/docs/dev/size-based-load-balancing.md

# Size based load balancing

Until size based load balancing was introduced, ScyllaDB performed disk based balancing
based on disk capacity and tablet count with the assumption that every tablet uses the
same amount of disk space. This means that the number of tablets located on a node was
proportional to the gross disk capacity of that node. Because the used disk space of
different tablets can vary greatly, this could create imbalance in disk utilization.

Size based load balancing aims to achieve better disk utilization across nodes in a
cluster. The load balancer will continuously gather information about available disk
space and tablet sizes from all the nodes. It then incrementally computes tablet
migration plans which equalize disk utilization across the cluster.

# Basic operation

The load balancer runs as a background task on the same shard that the topology
coordinator is running on. The topology coordinator collects information needed
for the load balancer to make decisions about which tablets to migrate. This
information is collected periodically by the coordinator from all the nodes in
the cluster in the form of the data structure ``load_stats``. The information in
this struct relevant for the load balancer is stored in its member tablet_stats,
which contains:

- ``tablet_sizes``: the disk size in bytes of all the tablets on the given node.
- ``effective_capacity``: contains the sum of available disk space and all the tablet sizes on the given node.

The balancer will use this information to compute the disk load (disk utilization)
on every node and shard. It will then migrate tablets from the most to the least
loaded nodes and shards.

# Table balance

The secondary goal of the load balancer is to achieve table balance. This means that
the balancer needs to equalize the following ratios:

- disk used by the table on a shard / total disk used by the table in the rack
- effective capacity of a shard / effective capacity of the rack

Otherwise, we could have imbalance where a shard contains more data from a table,
which can potentially overload the CPU of that shard in case the given table is hot.

# ``load_stats`` reconciliation

Because the ``load_stats`` collection interval is 1 minute, by the time the balancer
starts using the information in ``load_stats``, that information can be stale. This can
be caused by tablet migrations or table resize (split or merge). To get around this,
we need to update the information about tablet sizes in ``load_stats`` after a migration
or resize. Issued tablet migrations update this information in ``load_stats`` by also
migrating the tablet size from one host to another. This tablet size migration
reconciliation is performed during the end_migration stage of the tablet migration.
Tablet size reconciliation for split or merge is performed during tablet resize
finalization. For a split, the reconciliation will divide the tablet size, and create
two tablets in place of the original tablet pre-split. For merge, it will accumulate
the tablet sizes of the tablets pre-merge, create a merged tablet size and remove the
tablet sizes pre-merge from ``load_stats``.

# Force capacity based balancing

The load balancer has the ability to fall back on capacity based balancing. This can be
enforced by a config parameter force_capacity_based_balancing. During capacity based
balancing, the balancer will not look up the actual tablet sizes from ``load_stats``,
and will instead assume each tablet size is equal to default_target_tablet_sizes. It will
also use the gross disk capacity (sent in ``load_stats`` struct in the capacity member),
instead of effective capacity.

# Excluding nodes with incomplete tablet sizes

Even with tablet size reconciliation (during migration and table resize), it is still
possible for the tablet size information in ``load_stats`` to not match the current tablet
information found in the tablet map. In order to avoid problems with the balancer
making decisions based on incomplete or incorrect data, size based load balancing
will not balance nodes which have incomplete tablet sizes. Instead, it will ignore
these nodes (these nodes will not be selected as sources or destinations for tablet
migrations), and will wait for correct tablet sizes to arrive after the next ``load_stats``
refresh by the topology coordinator.

One exception to this are nodes which have been excluded from the cluster. These nodes
are down and therefore are not able to send fresh ``load_stats``. But they have to be drained
of their tablets (via tablet rebuild), and the balancer must do this even with incomplete
tablet data. So, only excluded nodes are allowed to have missing tablet sizes.

# Size based balancing cluster feature

During rolling upgrades, it can take hours or even days until all the nodes in a cluster
have been upgraded. This means that the non-upgraded nodes will send ``load_stats`` without
tablet sizes. Considering the balancer ignores these nodes during balancing, we would
have a problem with some of the nodes not being load balanced for extended periods of
time. To avoid this, size based load balancing is only enabled after the cluster feature
is enabled. Until that time, the balancer will fall back on capacity based balancing.

# Minimal tablet size

After a table is created and it begins to receive data, there can be a period of time
where the data in some of the tablets has been flushed, while others have not. During
this time, the load balancer will migrate seemingly empty (but actually not yet flushed)
tablet away from the nodes where they have been created. Later, when all the tablets
have been flushed, the balancer will migrate the tablets again. In order to avoid these
unnecessary early migrations, we introduce the idea of minimal tablet size. The balancer
will treat any tablet smaller than minimal tablet size as having a size of minimal
tablet size. This reduces early migrations.

# Balance threshold percentage

The balancer considers if a set nodes are balanced by computing the delta of disk load
between the most loaded and least loaded node, and dividing it with the load of the most
loaded node:

    delta = (most_loaded - least_loaded) / most_loaded

If this computed value is below a threshold, the nodes are considered balanced. This threshold
can be configured with the ``size_based_balance_threshold_percentage`` config option.