mirror of https://github.com/scylladb/scylladb.git synced 2026-06-04 14:03:06 +00:00

Avi Kivity 99d0aaa7d2 Merge 'tablets: load_balancer: Improve per-table balance' from Tomasz Grabiec

Tablet load balancer tries to equalize tablet load between shards by
moving tablets. Currently, the tablet load balancer assumes that each
tablet has the same hotness. This may not be true, and some tables may
be hotter than others. If some nodes end up getting more tablets of
the hot table, we can end up with request load imbalance and reduced
performance.

In 79d0711c7e we implemented a
mitigation for the problem by randomly choosing the table whose tablet
replica should be moved. This should improve fairness of
movement. However, this proved to not be enough to get a good
distribution of tablets.

This change improves candidate selection to not rely on randomness
but rather to evaluate candidates with respect to their impact on load
imbalance. Also, if there is no good candidate, we consider picking
other source shards, not the most-loaded one. This is helpful because
when finishing node drain we get just a few candidates per shard, all
of which may belong to a single table, and the destination may already
be overloaded with that table. Another shard may contain tablets of
another table which is not yet overloaded on the destination. And
shards may be of similar load, so it doesn't matter much which shard
we choose to unload.

We also consider other destinations, not the least-loaded one. This
helps when draining nodes and the source node has few shard
candidates. Shards on the destination may have similar load so there
is more than one good destination candidate. By limiting ourselves to a
single shard, we increase the chance that we'll overload that shard
with the table.
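
The selection idea described above can be sketched as follows. This is a
minimal illustration, not the load balancer's actual code: the function
name pick_migration, the {table: {shard: count}} layout, and the scoring
rule (prefer the move with the largest per-table gap between source and
destination) are all assumptions made for the example.

```python
from itertools import product

def pick_migration(table_load, sources, destinations):
    """Pick the (table, src, dst) move with the largest per-table gap.

    table_load is a hypothetical {table: {shard: tablet_count}} mapping.
    The gap is the table's count on src minus its count on dst; a larger
    gap means the move does more to even out that table.
    Returns ((table, src, dst), gap), or None if no candidate exists.
    """
    best = None
    best_gap = None
    for table, per_shard in table_load.items():
        # Consider every source and destination, not only the extremes.
        for src, dst in product(sources, destinations):
            if src == dst or per_shard.get(src, 0) == 0:
                continue  # nothing of this table to move from src
            gap = per_shard.get(src, 0) - per_shard.get(dst, 0)
            if best_gap is None or gap > best_gap:
                best, best_gap = (table, src, dst), gap
    return None if best is None else (best, best_gap)

load = {
    "t1": {"s0": 3, "s1": 3, "d0": 3, "d1": 1},
    "t2": {"s0": 10, "s1": 9, "d0": 9, "d1": 10},
}
choice = pick_migration(load, ["s0", "s1"], ["d0", "d1"])
```

Here the sketch picks a t1 tablet going to d1, because d1 holds the
fewest t1 tablets, even though d1 is not the least-loaded shard overall.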

The algorithm was evaluated using "scylla perf-load-balancing", which
simulates a sequence of 8 node bootstraps and decommissions for
different node and shard counts, RF, and tablet counts.

For example, for the following parameters:

  params: {iterations=8, nodes=5, tablets1=128 (2.4/sh), tablets2=512 (9.6/sh), rf1=3, rf2=3, shards=32}

The results are:

Before:

  Overcommit (old) : init : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
  Overcommit (old) : worst: {table1={shard=4.00 (best=1.25), node=1.81}, table2={shard=1.25 (best=1.04), node=1.11}}
  Overcommit (old) : last : {table1={shard=2.50 (best=1.25), node=1.41}, table2={shard=1.25 (best=1.04), node=1.05}}

After:

  Overcommit       : init : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
  Overcommit       : worst: {table1={shard=1.50 (best=1.25), node=1.02}, table2={shard=1.12 (best=1.04), node=1.01}}
  Overcommit       : last : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}

So worst shard overcommit for table1 was reduced from 4 to 1.5. Overcommit
of 4 means that the most-loaded shard has 4 times more tablets than
the average per-shard load in the cluster.
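
As a worked check of these numbers, the sketch below computes overcommit
as max load divided by average load, following the definition above. The
"best possible" formula (ceiling of the average divided by the average)
is an assumption, not taken from the tool's source; it does reproduce
the reported best=1.25 and best=1.04 for the parameters shown. The
function names are hypothetical.

```python
import math

def overcommit(loads):
    """Ratio of the most-loaded entry to the average load."""
    avg = sum(loads) / len(loads)
    return max(loads) / avg

def best_shard_overcommit(tablets, rf, nodes, shards_per_node):
    """Lowest achievable shard overcommit for one table (assumed formula).

    Tablet replicas are indivisible, so the most-loaded shard holds at
    least ceil(average) replicas.
    """
    replicas = tablets * rf
    avg = replicas / (nodes * shards_per_node)
    return math.ceil(avg) / avg

# Parameters from the run above: nodes=5, shards=32, rf=3.
best_table1 = best_shard_overcommit(128, 3, 5, 32)  # average 2.4 per shard
best_table2 = best_shard_overcommit(512, 3, 5, 32)  # average 9.6 per shard
print(round(best_table1, 2), round(best_table2, 2))
```

For table1, 128 tablets with RF=3 give 384 replicas over 160 shards, an
average of 2.4 per shard, which matches the "2.4/sh" in the parameters.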

Also, node overcommit for table1 was reduced from 1.81 to 1.02.

The magnitude of improvement depends greatly on the test configuration, that is, on topology and tablet distribution.

The algorithm is not perfect; it finds a local optimum. In the above
test, the worst overcommit of 1.5 falls short of the best possible (1.25).

One of the reasons why the current algorithm doesn't achieve the best
distribution is that it works with a single movement at a time and
replication constraints limit the choice of destinations. Viable
destinations for remaining candidates may be only on nodes which are
not least-loaded, and we won't be able to fill the least loaded
node. Doing so would require more complex movement involving moving a
tablet from one of the destination nodes which doesn't have a replica
on the least loaded node and then replacing it with the candidate from
the source node.

Another limitation is that the algorithm can only fix balance by
moving tablets away from most loaded nodes, and it does so due to
imbalance between nodes. So it cannot fix the imbalance which is
already present on the nodes if there is not much to move due to
similar load between nodes. It is designed to not make the imbalance
worse, so it works well if we started in good shape.

Fixes https://github.com/scylladb/scylladb/issues/16824

Closes scylladb/scylladb#19779

* github.com:scylladb/scylladb:
  test: perf: tablet_load_balancing: Test with higher shard and tablet counts
  tablets: load_balancer: Avoid quadratic complexity when finding best candidate
  tablets: load_balancer: Maintain load sketch properly during intra-node migration
  tablets: load_balancer: Use "drained" flag
  test: perf: tablet_load_balancing: Report load balancer stats
  tablets: load_balancer: Move load_balancer_stats_manager to header file
  tablets: load_balancer: Split evaluate_candidate() into src and dst part
  tablets: load_balancer: Optimize evaluate_candidate()
  tablets: load_balancer: Add more statistics
  tablets: load_balancer: Track load per table on cluster level
  tablets: load_balancer: Track load per table on node level
  tablets: load_balancer: Use a single load sketch for tracking all nodes
  locator: load_sketch: Introduce populate_dc()
  tablets: load_balancer: Modify target load sketch only when emitting migration
  locator: load_sketch: Introduce get_most_loaded_shard()
  locator: load_sketch: Introduce get_least_loaded_shard()
  locator: load_sketch: Optimize pick()/unload()
  locator: load_sketch: Introduce load_type
  test: perf: tablet_load_balancing: Report total tablet counts
  test: perf: tablet_load_balancing: Print run parameters in the single simulation case too
  test: perf: tablet_load_balancing: Report time it took to schedule migrations
  tablets: load_balancer: Log table load stats after each migration
  tablets: load_balancer: Log per-shard load distribution in debug level
  tablets: load_balancer: Improve per-table balance
  tablets: load_balancer: Extract check_convergence()
  tablets: load_balancer: Extract nodes_by_load_cmp
  tablets: load_balancer: Maintain tablet count per table
  tablets: load_balancer: Reuse src_node_info
  test: perf: tablet_load_balancing: Print warnings about bad overcommit
  test: perf: tablet_load_balancing: Allow running a single simulation
  test: perf: tablet_load_balancing: Report best possible shard overcommit
  test: perf: tablet_load_balancing: Report global shard overcommit

2024-08-01 21:12:14 +03:00

.github

Merge 'github: disable scheduled workflow on forks' from Kefu Chai

2024-07-24 07:50:39 +03:00

abseil @ d7aaad83b4

build: bring abseil submodule back

2024-05-05 23:31:09 +03:00

alternator

alternator: exclude CDC log table from ListTables

2024-07-30 10:43:29 +03:00

api

api: Unset cache_service endpoints on stop

2024-07-24 18:51:32 +03:00

auth

raft: use the abort source reference in raft group0 client interface

2024-07-31 09:18:54 +02:00

bin

install.sh: use the native nodetool directly

2024-04-25 22:52:00 +03:00

cdc

token: make kind-based ctor private

2024-07-20 21:21:42 +03:00

cmake

scylla-gdb.py: add $coro_frame()

2024-07-10 21:46:27 +03:00

compaction

replica: remove rwlock for protecting iteration over storage group map

2024-07-09 16:59:24 -03:00

conf

conf: scylla.yaml: enable_tablets: expand documentation

2024-06-27 14:41:43 +03:00

cql3

cql3/statement: use compile-time format string

2024-07-28 21:54:43 +03:00

data_dictionary

data_dictionary: keyspace_metadata: format: print also initial_tablets

2024-05-31 10:09:58 +03:00

Merge 'Remove gossiper argument from storage_service::join_cluster()' from Pavel Emelyanov

2024-08-01 10:18:14 +02:00

debug

…

dht

token: initialize non-key tokens with min() value

2024-07-20 21:21:42 +03:00

direct_failure_detector

direct_failure_detector: increase ping timeout and make it tunable

2024-05-07 23:40:23 +02:00

dist

build: cmake: use per-mode build dir

2024-07-28 18:11:37 +03:00

docs

doc: enable publishing docs for branch-6.1

2024-07-31 12:48:51 +02:00

exceptions

exceptions/exceptions.hh: Wrap #include <concepts> within an #ifdef

2024-07-17 22:09:41 +03:00

gms

db: service: add request type column to topology_requests

2024-07-23 13:35:01 +02:00

idl

forward_service: rename to mapreduce_service

2024-07-03 19:29:47 +03:00

index

code-cleanup: add missing header guards

2024-07-09 18:31:35 +03:00

lang

code-cleanup: add missing header guards

2024-07-09 18:31:35 +03:00

licenses

…

locator

locator: load_sketch: Introduce populate_dc()

2024-07-31 11:38:17 +02:00

message

tasks: implement task_manager::virtual_task::impl::get_children

2024-07-23 13:35:01 +02:00

mutation

code-cleanup: add missing header guards

2024-07-09 18:31:35 +03:00

mutation_writer

treewide: rename flat_mutation_reader_v2 to mutation_reader

2024-06-21 07:12:06 +03:00

node_ops

db: node_ops: filter topology request entries

2024-07-23 13:35:02 +02:00

raft

raft: add more raft metrics to make debug easier

2024-07-01 10:55:22 +02:00

readers

readers: define query::partition_slice before using it in default argument

2024-06-27 19:36:13 +03:00

redis

code: Switch to sched group in request_stop_server()

2024-05-24 18:00:01 +03:00

reloc

reloc: create $BUILDDIR for getting its path

2024-05-01 09:52:17 +03:00

repair

tasks: keep virtual tasks in task manager

2024-07-23 13:35:01 +02:00

replica

db: fix waiting for counter update operations on table stop

2024-08-01 09:39:49 +02:00

rust

rust: disable incremental build for release build

2024-06-20 12:01:14 +03:00

schema

schema/schema: fix column names in index description

2024-07-09 22:37:05 +02:00

scripts

scripts/open-coredump.sh: allow complete bypass of S3 server

2024-07-18 21:43:53 +03:00

seastar @ a7d81328fb

Update seastar submodule

2024-07-28 21:04:45 +03:00

service

Merge 'tablets: load_balancer: Improve per-table balance' from Tomasz Grabiec

2024-08-01 21:12:14 +03:00

sstables

sstables: fix a typo in comment

2024-07-31 13:58:09 +03:00

streaming

code-cleanup: add missing header guards

2024-07-09 18:31:35 +03:00

swagger-ui @ 12f1da1082

…

tasks

db: node_ops: filter topology request entries

2024-07-23 13:35:02 +02:00

test

Merge 'tablets: load_balancer: Improve per-table balance' from Tomasz Grabiec

2024-08-01 21:12:14 +03:00

tools

Update ./tools/java submodule

2024-07-22 17:12:09 +03:00

tracing

cql3: Define prepared_statement weak pointer as const

2024-05-25 16:40:35 +03:00

transport

transport: move the cql_server::~cql_server() into .cc

2024-07-10 12:52:51 +08:00

types

treewide: include seastar headers with brackets

2024-06-21 19:20:27 +03:00

unified

cqlsh: update cqlsh submodule

2024-06-26 12:07:21 +03:00

utils

s3/client: add client::upload_file()

2024-07-23 14:39:30 +08:00

.dockerignore

…

.gitattributes

gitattributes: Mark swagger .js files as binary

2024-06-19 15:07:56 +03:00

.gitignore

git: add build.ninja.new to .gitignore

2024-06-24 16:48:50 +03:00

.gitmodules

build: bring abseil submodule back

2024-05-05 23:31:09 +03:00

.gitorderfile

…

.mailmap

…

absl-flat_hash_map.cc

…

absl-flat_hash_map.hh

…

amplify.yml

…

backlog_controller.hh

…

build_mode.hh

…

bytes_ostream.hh

./: not include unused headers

2024-03-20 09:16:46 +02:00

bytes.cc

bytes: drop unused operator<<

2024-06-25 12:11:28 +03:00

bytes.hh

bytes: drop unused operator<<

2024-06-25 12:11:28 +03:00

cache_mutation_reader.hh

treewide: rename flat_mutation_reader_v2 to mutation_reader

2024-06-21 07:12:06 +03:00

cache_temperature.hh

…

cartesian_product.hh

…

cell_locking.hh

…

checked-file-impl.hh

…

client_data.cc

…

client_data.hh

transport: do not return client_type from cql_server::connection::make_client_key()

2024-06-07 09:23:06 +08:00

clocks-impl.cc

…

clocks-impl.hh

…

clustering_bounds_comparator.hh

clustering_bounds_comparator: drop operator<< for bound_kind

2024-06-11 18:01:06 +02:00

clustering_interval_set.hh

treewide: replace formatter<std::string_view> with formatter<string_view>

2024-04-19 07:44:07 +03:00

clustering_key_filter.hh

…

clustering_ranges_walker.hh

treewide: rename flat_mutation_reader_v2 to mutation_reader

2024-06-21 07:12:06 +03:00

CMakeLists.txt

node_ops: add task manager module and node_ops_virtual_task

2024-07-23 13:35:01 +02:00

collection_mutation.cc

collection_mutation: improve collection_mutation_view formatting

2024-05-02 18:42:41 +03:00

collection_mutation.hh

treewide: replace formatter<std::string_view> with formatter<string_view>

2024-04-19 07:44:07 +03:00

column_computation.hh

…

combine.hh

…

compound_compat.hh

treewide: replace formatter<std::string_view> with formatter<string_view>

2024-04-19 07:44:07 +03:00

compound.hh

./: not include unused headers

2024-03-20 09:16:46 +02:00

compress.cc

…

compress.hh

compress, auth: include used headers

2024-05-30 09:16:23 +03:00

concrete_types.hh

…

configure.py

node_ops: add task manager module and node_ops_virtual_task

2024-07-23 13:35:01 +02:00

CONTRIBUTING.md

…

converting_mutation_partition_applier.cc

…

converting_mutation_partition_applier.hh

…

counters.cc

…

counters.hh

treewide: replace formatter<std::string_view> with formatter<string_view>

2024-04-19 07:44:07 +03:00

coverage_excludes.txt

…

coverage_sources.list

…

cql_serialization_format.hh

…

db_clock.hh

treewide: replace formatter<std::string_view> with formatter<string_view>

2024-04-19 07:44:07 +03:00

debug.cc

…

debug.hh

…

default.nix

treewide: drop thrift support

2024-06-07 06:44:59 +08:00

Doxyfile

…

duration.cc

…

duration.hh

…

encoding_stats.hh

…

enum_set.hh

…

fix_system_distributed_tables.py

…

flake.lock

…

flake.nix

…

frozen_schema.cc

…

frozen_schema.hh

…

full_position.hh

…

gc_clock.hh

…

gdbinit

…

gen_segmented_compress_params.py

…

generic_server.cc

generic_server: Fix indentation after previous patch

2024-05-03 12:29:08 +03:00

generic_server.hh

…

HACKING.md

HACKING.md: fix typo in "--overprovisioned" option name

2024-06-25 12:11:28 +03:00

hashing_partition_visitor.hh

…

idl-compiler.py

idl-compiler: generate async serialization functions for stub members

2024-05-02 19:27:56 +03:00

inet_address_vectors.hh

…

init.cc

…

init.hh

…

install-dependencies.sh

toolchain: change optimized clang install method to standard one

2024-07-09 14:22:42 +03:00

install.sh

dist: support nonroot and offline mode for scylla-housekeeping

2024-07-23 07:57:32 +03:00

interval.hh

treewide: replace std::result_of_t with std::invoke_result_t

2024-05-26 16:45:42 +03:00

keys.cc

clustering_bounds_comparator: drop operator<< for bound_kind

2024-06-11 18:01:06 +02:00

keys.hh

treewide: replace formatter<std::string_view> with formatter<string_view>

2024-04-19 07:44:07 +03:00

LICENSE.AGPL

…

log.hh

…

main.cc

storage_service: Remote gossiper argument from join_cluster()

2024-07-26 16:29:58 +03:00

map_difference.hh

…

marshal_exception.hh

./: not include unused headers

2024-03-20 09:16:46 +02:00

multishard_mutation_query.cc

treewide: rename flat_mutation_reader_v2 to mutation_reader

2024-06-21 07:12:06 +03:00

multishard_mutation_query.hh

…

mutation_query.cc

…

mutation_query.hh

treewide: Use partition_slice::is_reversed()

2024-03-13 08:52:46 +02:00

noexcept_traits.hh

…

NOTICE.txt

…

ORIGIN

…

partition_builder.hh

…

partition_range_compat.hh

…

partition_slice_builder.cc

…

partition_slice_builder.hh

…

partition_snapshot_reader.hh

treewide: rename flat_mutation_reader_v2 to mutation_reader

2024-06-21 07:12:06 +03:00

partition_snapshot_row_cursor.hh

treewide: replace formatter<std::string_view> with formatter<string_view>

2024-04-19 07:44:07 +03:00

protocol_server.hh

protocol_server: Keep scheduling group on board

2024-05-24 17:54:29 +03:00

querier.cc

treewide: rename flat_mutation_reader_v2 to mutation_reader

2024-06-21 07:12:06 +03:00

querier.hh

treewide: rename flat_mutation_reader_v2 to mutation_reader

2024-06-21 07:12:06 +03:00

query_id.hh

…

query_ranges_to_vnodes.cc

./: not include unused headers

2024-03-20 09:16:46 +02:00

query_ranges_to_vnodes.hh

./: not include unused headers

2024-03-20 09:16:46 +02:00

query_result_merger.hh

…

query-request.hh

forward_service: rename to mapreduce_service

2024-07-03 19:29:47 +03:00

query-result-reader.hh

…

query-result-set.cc

…

query-result-set.hh

…

query-result-writer.hh

./: not include unused headers

2024-03-20 09:16:46 +02:00

query-result.hh

…

query.cc

forward_service: rename to mapreduce_service

2024-07-03 19:29:47 +03:00

read_context.hh

treewide: rename flat_mutation_reader_v2 to mutation_reader

2024-06-21 07:12:06 +03:00

reader_concurrency_semaphore.cc

reader_concurrency_semaphore: execution_loop(): move maybe_admit_waiters() to the inner loop

2024-07-04 17:47:52 +03:00

reader_concurrency_semaphore.hh

reader_concurrency_semaphore: wire in the configurable cpu concurrency

2024-06-27 09:57:11 -04:00

reader_permit.hh

treewide: replace formatter<std::string_view> with formatter<string_view>

2024-04-19 07:44:07 +03:00

README.md

README.md: add badges for cron jobs

2024-06-23 19:24:40 +03:00

real_dirty_memory_accounter.hh

…

release.cc

release: introduce doc_link()

2024-05-08 09:41:17 -04:00

release.hh

release: introduce doc_link()

2024-05-08 09:41:17 -04:00

reversibly_mergeable.hh

…

row_cache.cc

treewide: rename flat_mutation_reader_v2 to mutation_reader

2024-06-21 07:12:06 +03:00

row_cache.hh

treewide: rename flat_mutation_reader_v2 to mutation_reader

2024-06-21 07:12:06 +03:00

schema_mutations.cc

schema_mutations: add fmt::formatter for schema_mutations

2024-03-15 09:49:56 +02:00

schema_mutations.hh

treewide: replace formatter<std::string_view> with formatter<string_view>

2024-04-19 07:44:07 +03:00

schema_upgrader.hh

…

scylla_post_install.sh

…

scylla-gdb.py

Merge 'replica: remove rwlock for protecting iteration over storage group map' from Raphael "Raph" Carvalho

2024-07-12 15:45:36 +03:00

SCYLLA-VERSION-GEN

Update ScyllaDB version to: 6.2.0-dev

2024-07-18 16:07:07 +03:00

seastarx.hh

…

serialization_visitors.hh

…

serializer_impl.hh

serializer_impl, sstables: fix build failure due to missing includes

2024-04-23 12:03:51 +03:00

serializer.cc

…

serializer.hh

…

service_permit.hh

…

setup.py

…

shell.nix

…

sstables_loader.cc

sstables-loader: Run loading in its scheduling group

2024-05-28 11:07:58 +03:00

sstables_loader.hh

sstables-loader: Add scheduling group to constructor

2024-05-28 11:07:22 +03:00

supervisor.hh

./: not include unused headers

2024-03-20 09:16:46 +02:00

table_helper.cc

treewide: drop thrift support

2024-06-07 06:44:59 +08:00

table_helper.hh

…

test.py

Merge '[test.py] add --extra-scylla-cmdline-options argument for test.py' from Artsiom Mishuta

2024-06-28 11:11:29 +02:00

timeout_config.cc

…

timeout_config.hh

treewide: drop thrift support

2024-06-07 06:44:59 +08:00

timestamp.hh

…

tombstone_gc_extension.hh

./: not include unused headers

2024-03-20 09:16:46 +02:00

tombstone_gc_options.cc

treewide: replace formatter<std::string_view> with formatter<string_view>

2024-04-19 07:44:07 +03:00

tombstone_gc_options.hh

treewide: replace formatter<std::string_view> with formatter<string_view>

2024-04-19 07:44:07 +03:00

tombstone_gc.cc

token: move ordering operator inline

2024-07-20 21:21:42 +03:00

tombstone_gc.hh

cql3: statements: change default tombstone_gc mode for tablets

2024-04-24 10:42:10 +02:00

tox.ini

…

ubsan-suppressions.supp

…

unimplemented.cc

treewide: drop thrift support

2024-06-07 06:44:59 +08:00

unimplemented.hh

treewide: drop thrift support

2024-06-07 06:44:59 +08:00

validation.cc

…

validation.hh

…

version.hh

…

view_info.hh

treewide: replace formatter<std::string_view> with formatter<string_view>

2024-04-19 07:44:07 +03:00

vint-serialization.cc

…

vint-serialization.hh

…

zstd.cc

zstd: include external header with brackets

2024-07-04 10:42:29 +03:00

README.md

Scylla

What is Scylla?

Scylla is the real-time big data database that is API-compatible with Apache Cassandra and Amazon DynamoDB. Scylla embraces a shared-nothing approach that increases throughput and storage capacity to realize order-of-magnitude performance improvements and reduce hardware costs.

For more information, please see the ScyllaDB web site.

Build Prerequisites

Scylla is fairly fussy about its build environment, requiring a very recent C++20 compiler and recent versions of many libraries to build. The document HACKING.md includes detailed information on building and developing Scylla, but to get Scylla building quickly on (almost) any build machine, Scylla offers a frozen toolchain. This is a pre-configured Docker image which includes recent versions of all the required compilers, libraries and build tools. Using the frozen toolchain allows you to avoid changing anything in your build machine to meet Scylla's requirements - you just need to meet the frozen toolchain's prerequisites (mostly, Docker or Podman being available).

Building Scylla

Building Scylla with the frozen toolchain dbuild is as easy as:

$ git submodule update --init --force --recursive
$ ./tools/toolchain/dbuild ./configure.py
$ ./tools/toolchain/dbuild ninja build/release/scylla

For further information, please see:

Developer documentation for more information on building Scylla.
Build documentation on how to build Scylla binaries, tests, and packages.
Docker image build documentation for information on how to build Docker images.

Running Scylla

To start Scylla server, run:

$ ./tools/toolchain/dbuild ./build/release/scylla --workdir tmp --smp 1 --developer-mode 1

This will start a Scylla node with one CPU core allocated to it and data files stored in the tmp directory. The --developer-mode option is needed to disable the various checks Scylla performs at startup to ensure the machine is configured for maximum performance (not relevant on development workstations). Please note that you need to run Scylla with dbuild if you built it with the frozen toolchain.

For more run options, run:

$ ./tools/toolchain/dbuild ./build/release/scylla --help

Testing

See test.py manual.

Scylla APIs and compatibility

By default, Scylla is compatible with Apache Cassandra and its API - CQL. There is also support for the API of Amazon DynamoDB™, which needs to be enabled and configured in order to be used. For more information on how to enable the DynamoDB™ API in Scylla, and the current compatibility of this feature as well as Scylla-specific extensions, see Alternator and Getting started with Alternator.

Documentation

Documentation can be found here. Seastar documentation can be found here. User documentation can be found here.

Training

Training material and online courses can be found at Scylla University. The courses are free, self-paced and include hands-on examples. They cover a variety of topics including Scylla data modeling, administration, architecture, basic NoSQL concepts, using drivers for application development, Scylla setup, failover, compactions, multi-datacenters and how Scylla integrates with third-party applications.

Contributing to Scylla

If you want to report a bug or submit a pull request or a patch, please read the contribution guidelines.

If you are a developer working on Scylla, please read the developer guidelines.

Contact

The community forum and Slack channel are for users to discuss configuration, management, and operations of the ScyllaDB open source.
The developers mailing list is for developers and people interested in following the development of ScyllaDB to discuss technical topics.

Languages

C++ 72.1%

Python 26.7%

CMake 0.3%

GAP 0.3%

Shell 0.3%