Tablet load balancer tries to equalize tablet load between shards by
moving tablets. Currently, the tablet load balancer assumes that each
tablet has the same hotness. This may not be true, and some tables may
be hotter than others. If some nodes end up getting more tablets of
the hot table, we can end up with request load imbalance and reduced
performance.
In 79d0711c7e we implemented a
mitigation for the problem by randomly choosing the table whose tablet
replica should be moved. This should improve fairness of
movement. However, this proved to not be enough to get a good
distribution of tablets.
This change improves candidate selection to not relay on randomness
but rather evaluating candidates with respect to the impact on load
imbalance. Also, if there is no good candidate, we consider picking
other source shards, not the most-loaded one. This is helpful because
when finishing node drain we get just a few candidates per shard, all
of which may belong to a single table, and the destination may already
be overloaded with that table. Another shard may contain tablets of
another table which is not yet overloaded on the destination. And
shards may be of similar load, so it doesn't matter much which shard
we choose to unload.
We also consider other destinations, not the least-loaded one. This
helps when draining nodes and the source node has few shard
candidates. Shards on the destination may have similar load so there
is more than one good destinatin candidate. By limiting ourselves to a
single shard, we increase the chance that we're overload the table on
that shard.
The algorithm was evaluated using "scylla perf-load-balancing", which
simulates a sequeunce of 8 node bootstraps and decommissions for
different node and shard counts, RF, and tablet counts.
For example, for the following parameters:
params: {iterations=8, nodes=5, tablets1=128 (2.4/sh), tablets2=512 (9.6/sh), rf1=3, rf2=3, shards=32}
The results are:
Before:
Overcommit (old) : init : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
Overcommit (old) : worst: {table1={shard=4.00 (best=1.25), node=1.81}, table2={shard=1.25 (best=1.04), node=1.11}}
Overcommit (old) : last : {table1={shard=2.50 (best=1.25), node=1.41}, table2={shard=1.25 (best=1.04), node=1.05}}
After:
Overcommit : init : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
Overcommit : worst: {table1={shard=1.50 (best=1.25), node=1.02}, table2={shard=1.12 (best=1.04), node=1.01}}
Overcommit : last : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
So worst shard overcommit for table1 was reduced from 4 to 1.5. Overcommit
of 4 means that the most-loaded shard has 4 times more tablets than
the average per-shard load in the cluster.
Also, node overcommit for table1 was reduced from 1.81 to 1.02.
The magnitude of improvement depends greatly on test configurtion, so on topology and tablet distribution.
The algorithm is not perfect, it finds a local optimum. In the above
test, overcommit of 1.5 is not the best possible (1.25).
One of the reason why the current algorithm doesn't achieve best
distribution is that it works with a single movement at a time and
replication constraints limit the choice of destinations. Viable
destinations for remaining candidates may by only on nodes which are
not least-loaded, and we won't be able to fill the least loaded
node. Doing so would require more complex movement involving moving a
tablet from one of the destination nodes which doesn't have a replica
on the least loaded node and then replacing it with the candidate from
the source node.
Another limitation is that the algorithm can only fix balance by
moving tablets away from most loaded nodes, and it does so due to
imbalance between nodes. So it cannot fix the imbalance which is
already present on the nodes if there is not much to move due to
similar load between nodes. It is designed to not make the imbalance
worse, so it works good if we started in a good shape.
Fixes https://github.com/scylladb/scylladb/issues/16824
Closes scylladb/scylladb#19779
* github.com:scylladb/scylladb:
test: perf: tablet_load_balancing: Test with higher shard and tablet counts
tablets: load_balancer: Avoid quadratic complexity when finding best candidate
tablets: load_balancer: Maintain load sketch properly during intra-node migration
tablets: load_balancer: Use "drained" flag
test: perf: tablet_load_balancing: Report load balancer stats
tablets: load_balancer: Move load_balancer_stats_manager to header file
tablets: load_balancer: Split evaluate_candidate() into src and dst part
tablets: load_balancer: Optimize evaluate_candidate()
tablets: load_balancer: Add more statistics
tablets: load_balancer: Track load per table on cluster level
tablets: load_balancer: Track load per table on node level
tablets: load_balancer: Use a single load sketch for tracking all nodes
locator: load_sketch: Introduce populate_dc()
tablets: load_balancer: Modify target load sketch only when emitting migration
locator: load_sketch: Introduce get_most_loaded_shard()
locator: load_sketch: Introduce get_least_loaded_shard()
locator: load_sketch: Optimize pick()/unload()
locator: load_sketch: Introduce load_type
test: perf: tablet_load_balancing: Report total tablet counts
test: perf: tablet_load_balancing: Print run parameters in the single simulation case too
test: perf: tablet_load_balancing: Report time it took to schedule migrations
tablets: load_balancer: Log table load stats after each migration
tablets: load_balancer: Log per-shard load distribution in debug level
tablets: load_balancer: Improve per-table balance
tablets: load_balancer: Extract check_convergence()
tablets: load_balancer: Extract nodes_by_load_cmp
tablets: load_balancer: Maintain tablet count per table
tablets: load_balancer: Reuse src_node_info
test: perf: tablet_load_balancing: Print warnings about bad overcommit
test: perf: tablet_load_balancing: Allow running a single simulation
test: perf: tablet_load_balancing: Report best possible shard overcommit
test: perf: tablet_load_balancing: Report global shard overcommit
Scylla
What is Scylla?
Scylla is the real-time big data database that is API-compatible with Apache Cassandra and Amazon DynamoDB. Scylla embraces a shared-nothing approach that increases throughput and storage capacity to realize order-of-magnitude performance improvements and reduce hardware costs.
For more information, please see the ScyllaDB web site.
Build Prerequisites
Scylla is fairly fussy about its build environment, requiring very recent versions of the C++20 compiler and of many libraries to build. The document HACKING.md includes detailed information on building and developing Scylla, but to get Scylla building quickly on (almost) any build machine, Scylla offers a frozen toolchain, This is a pre-configured Docker image which includes recent versions of all the required compilers, libraries and build tools. Using the frozen toolchain allows you to avoid changing anything in your build machine to meet Scylla's requirements - you just need to meet the frozen toolchain's prerequisites (mostly, Docker or Podman being available).
Building Scylla
Building Scylla with the frozen toolchain dbuild is as easy as:
$ git submodule update --init --force --recursive
$ ./tools/toolchain/dbuild ./configure.py
$ ./tools/toolchain/dbuild ninja build/release/scylla
For further information, please see:
- Developer documentation for more information on building Scylla.
- Build documentation on how to build Scylla binaries, tests, and packages.
- Docker image build documentation for information on how to build Docker images.
Running Scylla
To start Scylla server, run:
$ ./tools/toolchain/dbuild ./build/release/scylla --workdir tmp --smp 1 --developer-mode 1
This will start a Scylla node with one CPU core allocated to it and data files stored in the tmp directory.
The --developer-mode is needed to disable the various checks Scylla performs at startup to ensure the machine is configured for maximum performance (not relevant on development workstations).
Please note that you need to run Scylla with dbuild if you built it with the frozen toolchain.
For more run options, run:
$ ./tools/toolchain/dbuild ./build/release/scylla --help
Testing
See test.py manual.
Scylla APIs and compatibility
By default, Scylla is compatible with Apache Cassandra and its API - CQL. There is also support for the API of Amazon DynamoDB™, which needs to be enabled and configured in order to be used. For more information on how to enable the DynamoDB™ API in Scylla, and the current compatibility of this feature as well as Scylla-specific extensions, see Alternator and Getting started with Alternator.
Documentation
Documentation can be found here. Seastar documentation can be found here. User documentation can be found here.
Training
Training material and online courses can be found at Scylla University. The courses are free, self-paced and include hands-on examples. They cover a variety of topics including Scylla data modeling, administration, architecture, basic NoSQL concepts, using drivers for application development, Scylla setup, failover, compactions, multi-datacenters and how Scylla integrates with third-party applications.
Contributing to Scylla
If you want to report a bug or submit a pull request or a patch, please read the contribution guidelines.
If you are a developer working on Scylla, please read the developer guidelines.
Contact
- The community forum and Slack channel are for users to discuss configuration, management, and operations of the ScyllaDB open source.
- The developers mailing list is for developers and people interested in following the development of ScyllaDB to discuss technical topics.