Before this patch, the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
1) tablets have the same size
2) shards have the same capacity
That can cause imbalance of utilization if shards have different
capacity, which can happen in heterogeneous clusters with different
instance types. One of the causes for capacity difference is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This makes those shards have more disk capacity, and
more CPU power.
After this patch, the load balancer equalizes shard's storage
utilization, so it no longer assumes that shards have the same
capacity. It still assumes that each tablet has equal size. So it's a
middle step towards full size-aware balancing.
One consequence is that to be able to balance, the load balancer need
to know about every node's capacity, which is collected with the same
RPC which collects load_stats for average tablet size. This is not a
significant set back because migrations cannot proceed anyway if nodes
are down due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless due
to above, so not implemented.
Also, per-shard goal for tablet count is still the same for all nodes in the cluster,
so nodes with less capacity will be below limit and nodes with more capacity will
be slightly above limit. This shouldn't be a significant problem in practice, we could
compensate for this by increasing the limit.
Refs #23042Closesscylladb/scylladb#23079
* github.com:scylladb/scylladb:
tablets: Make load balancing capacity-aware
topology_coordinator: Fix confusing log message
topology_coordinator: Refresh load stats after adding a new node
topology_coordinator: Allow capacity stats to be refreshed with some nodes down
topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places
test: boost: tablets_test: Always provide capacity in load_stats
test: perf_load_balancing: Set node capacity
test: perf_load_balancing: Convert to topology_builder
config, disk_space_monitor: Allow overriding capacity via config
storage_service, tablets: Collect per-node capacity in load_stats
(cherry picked from commit b1d9f80d85)
Merge co-location can emit migrations across racks even when RF=#racks,
reducing availability and affecting consistency of base-view pairing.
Given replica set of sibling tablets T0 and T1 below:
[T0: (rack1,rack3,rack2)]
[T1: (rack2,rack1,rack3)]
Merge will co-locate T1:rack2 into T0:rack1, T1 will be temporarily only at
only a subset of racks, reducing availability.
This is the main problem fixed by this patch.
It also lays the ground for consistent base-view replica pairing,
which is rack-based. For tables on which views can be created we plan
to enforce the constraint that replicas don't move across racks and
that all tablets use the same set of racks (RF=#racks). This patch
avoids moving replicas across racks unless it's necessary, so if the
constraint is satisfied before merge, there will be no co-locating
migrations across racks. This constraint of RF=#racks is not enforced
yet, it requires more extensive changes.
Fixes#22994.
Refs #17265.
This patch is based on Raphael's work done in PR #23081. The main differences are:
1) Instead of sorting replicas by rack, we try to find
replicas in sibling tablets which belong to the same rack.
This is similar to how we match replicas within the same host.
It reduces number of across-rack migrations even if RF!=#racks,
which the original patch didn't handle.
Unlike the original patch, it also avoids rack-overloaded in case
RF!=#racks
2) We emit across-rack co-locating migrations if we have no other choice
in order to finalize the merge
This is ok, since views are not supported with tablets yet. Later,
we will disallow this for tables which have views, and we will
allow creating views in the first place only when no such migrations
can happen (RF=#racks).
3) Added boost unit test which checks that rack overload is avoided during merge
in case RF<#racks
4) Moved logging of across-rack migration to debug level
5) Exposed metric for across-rack co-locating migrations
(cherry picked from commit af949f3b6a)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>
Main problem:
If we're draining the last node in a DC, we won't have a chance to
evaluate candidates and notice that constraints cannot be satisfied (N
< RF). Draining will succeed and node will be removed with replicas
still present on that node. This will cause later draining in the same
DC to fail when we will have 2 replicas which need relocaiton for a
given tablet.
The expected behvior is for draining to fail, because we cannot keep
the RF in the DC. This is consistent, for example, with what happens
when removing a node in a 2-node cluster with RF=2.
Fixes#21826
Secondary problem:
We allowed tablet_draining transition to be exited with undrained nodes, leaving replicas on nodes in the "left" state.
Third problem:
We removed DOWN nodes from the candidate node set, even when draining. This is not safe because it may lead to overload. This also makes the "main problem" more likely by extending it to the scenario when the DC is DOWN.
The overload part in not a problem in practice currently, since migrations will block on global topology barrier if there are DOWN nodes.
Closesscylladb/scylladb#21928
* github.com:scylladb/scylladb:
tablets: load_balancer: Fail when draining with no candidate nodes
tablets: load_balancer: Ignore skip_list when draining
tablets: topology_coordinator: Keep tablet_draining transition if nodes are not drained
If we're draining the last node in a DC, we won't have a chance to
evaluate candidates and notice that constraints cannot be satisfied (N
< RF). Draining will succeed and node will be removed with replicas
still present on that node. This will cause later draining in the same
DC to fail when we will have 2 replicas which need relocaiton for a
given tablet.
The expected behvior is for draining to fail, because we cannot keep
the RF in the DC. This is consistent, for example, with what happens
when removing a node in a 2-node cluster with RF=2.
Fixes#21826
When doing normal load balancing, we can ignore DOWN nodes in the node
set and just balance the UP nodes among themselves because it's ok to
equalize load just in that set, it improves the situation.
It's dangerous to do that when draining because that can lead to
overloading of the UP nodes. In the worst case, we can have only one
non-drained node in the UP set, which would receive all the tablets of
the drained node, doubling its load.
It's safer to let the drain fail or stall. This is decided by topology
coordinator, currently we will fail (on barrier) and rollback.
In this change, tablet_virtual_task starts supporting tablet
migration, in addition to tablet repair. Both tablet operations
reuse the same virtual_task because their task data is retrieved
similarly. However, it changes nothing from the task manager
API users' perspective. They can list running migrations or check
their statuses all the same as if migration had its own virtual_task.
Users can see running migration tasks - finished tasks are not
presented with the task manager API. However, the result
of the migration (whether it succeeded or failed) would be
presented to users, if they use wait API.
If a migration was reverted, it will appear to users as failed.
We assume that the migration was reverted, when its destination
does not contain a tablet replica.
Fixes: https://github.com/scylladb/scylladb/issues/21365.
No backport, new feature
Closesscylladb/scylladb#21729
* github.com:scylladb/scylladb:
test: boost: check migration_task_info in tablet_test.cc
replica: add repair related fields to tablet_map_to_mutation
test: add tests to check the failed migration virtual tasks
test: add tests to check the list of migration virtual tasks
test: add tests to check migration virtual tasks status
test: topology_tasks: generalize repair task functions
service: extend tablet_virtual_task::abort
service: extend tablet_virtual_task::wait
service: extend tablet_virtual_task::get_status_helper
service: extend tablet_virtual_task::contains
service: extend tablet_virtual_task::get_stats
service: tasks: make get_table_id a method of virtual_task_hint
service: tasks: extend virtual_task_hint
replica: service: add migration_task_info column to system.tablets
locator: extend tablet_task_info to cover migration tasks
locator: rename tablet_task_info methods
This change goes thru locator:topology to use node&
instead of node* where nullptr is not possible. There are
places where the node object is used in unordered_set, in
those cases the node is wrapped in std::reference_wrapper.
Fixesscylladb/scylladb#20357Closesscylladb/scylladb#21863
When tablet scheduler drains nodes, it chooses target location based
on "badness" metric. Nodes with lowest score are preferred. Before the
patch, the score which was used was the number of tablets on that node
post-movement. This way we populate least-loaded node first. But this
works only if nodes have equal number of shards. If nodes have different
capacity, then number of tablets is not a good metric, because we don't
aim to equalize per-node count, but per-shard count. We assume that each
shard has equal capacity.
Because of this bug, during decommission, the nodes with fewer shards
would be preferred to receive replicas, which may lead to overloading
of those nodes. This imbalance would be later fixed by the normal load
balancing logic, but it's still problematic.
Fixes#21783Closesscylladb/scylladb#21860
This implements the ability for the balancer to co-locate sibling
tablets on the same shard. Co-location is low in priority, so
regular load balancer is preferred over it. Previous changes
allowed balancer to move co-located sibling tablets together,
to not undo the co-location work done so far.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
set_resize_plan() breaks commutativity since it may override the
resize plans done earlier, for example, when adding co-location
migrations in the DC plan.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This implements the ability to resize the tablet map for merge if
the balancer emits the decision to finalize the merge when all
sibling replicas are colocated for a table. But the co-location
plan is not implemented in the balancer yet, so this is still
not in use.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This transition state will be reused by merge completion, so let's
rename it to tablet_resize_finalization.
The completion handling path will also be reused, so let's rename
functions involved similarly.
The old name "tablet split finalization" is deprecated but still
recognized and points to the correct transition. Otherwise, the
reverse lookup would fail when populating topology system table
which last state was split finalization.
NOTE:
I thought of adding a new tablet_merge_finalization, but it would
complicate things since more than one table could be ready for
either split or merge, so you need a generic transition state
for handling resize completion.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The initial_tablet_count is respected while the table is in "growing mode".
The table implicitly enters this mode when created, since we expect the
table to be populated thereafter.
We say that a table leaves this mode if it required a split above the initial
tablet count. After that, we can rely purely on the average size to say that
a table is shrinking and requires merge.
This is not perfect and we may want to leave the mode too if we detect
the table is shrinking (or even not growing for some significant amount
of time), before any split happened.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If table is undergoing merge, co-located replicas of sibling tablets will
be treated by balancer as if they were a single migration candidate.
The reason for that is that the balancer must not undo the co-location work
done previously on behalf of merge decision.
Sibling tablets will be put in the same migration plan, but note that
each tablet is still migrated independently in the state machine.
The balancer will exclude both co-located tablets from the candidate list
if either haven't finished migration yet. It achieves that by pretending
migration of sibling tablets succeeded, allowing it to note that tablets
are co-located even though either can still be migrating.
The load inversion convergence check also happens after picking a
candidate now, since the balancer must be aware that co-located
tablets are being migrated together and we want to avoid oscillations.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We have a check that moving a tablet from A to B won't violate
replication constraints.
The contraints might not be the same for two sibling tablets that have co-located
replicas.
Example:
nodes = {A, B, C, D}
tablet1 = {A, B, C}
tablet2 = {A, B, D}
viable target for {tablet1, B} is D.
viable target for {tablet2, B} is C.
When co-located replicas share a viable target, then a migration can be emitted to
preserve co-location.
To allow decommission when co-located replicas don't share a viable target, a skip
info will be returned for each tablet, even though that means breaking this
co-location. Decommission is higher in priority.
Also, doing some preparation for integration of migration_tablet_set.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The load inversion convergence check should be able to know when two
tablets are being migrated instead of one, to avoid oscillations.
This will be wired when migration_tablet_set is wired.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This new type will allow the load balancer to treat co-located tablets
as a single candidate (will treat them as if they were already merged),
allowing co-located replicas to be migrated together (in the same
migration plan).
The type is a variant of global_tablet_id and colocated_tablets
(which holds the global_tablet_id of the sibling tablets).
It will be eventually wired after some more preparation. It will allow
for minimal amount of changes in the balancer code.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This extraction will make it easier later when co-located tablets
are introduced in load balancer.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.
please note, because `mutation/mutation.hh` does not include
`seastar/coroutine/maybe_yield.hh` anymore, and quite a few source
files were relying on this header to bring in the declaration of
`maybe_yield()`, we have to include this header in the places where
this symbol is used. the same applies to `seastar/core/when_all.hh`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This adds a new tablet migration kind: repair. It allows tablet repair
scheduler to use this migration kind to schedule repair jobs.
The current repair scheduler implementation does the following:
- A tablet is picked to be repaired when the time since last repair is
bigger than a threshold (auto repair mode) or it is requested by user
(manual repair mode)
- The tablet repair can be scheduled along with tablet migration and
rebuild. It runs in the tablet_migration track.
- Repair jobs are scheduled in a smart way so that at any point in time,
there are no more than configured jobs per shard, which is similar to
scylla manager's control.
In this patch, both the manual repair and the auto repair are not
enabled yet.
The stream_weight for repair migration is set to 2, because it requires
more work than just moving the tablet around. The stream_weight for all
other migrations are set to 1.
With the tablets feature always enabled (Unless gossip toopology
changes are forced), the enable_tablets option now controls only
the default for newly created keyspaces.
Even when set to `false`, tablets are still enabled as a
feature and the user may explicitly enable tablets
using `CREATE KEYSPACE <name> WITH tablets = {'enabled': true}`
Note: best viewed with `git show -w`
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This reverts commit c286434e4c, reversing
changes made to 6712fcc316.
The commit causes memtable_test to be very flaky in debug mode.
Specifically, subtests test_exceptions_in_flush_on_sstable_open
and test_exceptions_in_flush_on_sstable_write).
With the tablets feature always enabled (Unless gossip toopology
changes are forced), the enable_tablets option now controls only
the default for newly created keyspaces.
Even when set to `false`, tablets are still enabled as a
feature and the user may explicitly enable tablets
using `CREATE KEYSPACE <name> WITH tablets = {'enabled': true}`
Note: best viewed with `git show -w`
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We're going to remove the interface `data_dictionary::keyspace_element`.
As `schema::keyspace_name()` is an implementation of one of the methods
specified by that interface, we replace its uses by `schema::ks_name()`.
`schema::keyspace_name()` was an alias for it, so no semantic change
has occured.
before this change, we rely on `using namespace seastar` to use
`seastar::format()` without qualifying the `format()` with its
namespace. this works fine until we changed the parameter type
of format string `seastar::format()` from `const char*` to
`fmt::format_string<...>`. this change practically invited
`seastar::format()` to the club of `std::format()` and `fmt::format()`,
where all members accept a templated parameter as its `fmt`
parameter. and `seastar::format()` is not the best candidate anymore.
despite that argument-dependent lookup (ADT for short) favors the
function which is in the same namespace as its parameter, but
`using namespace` makes `seastar::format()` more competitive,
so both `std::format()` and `seastar::format()` are considered
as the condidates.
that is what is happening scylladb in quite a few caller sites of
`format()`, hence ADT is not able to tell which function the winner
in the name lookup:
```
/__w/scylladb/scylladb/mutation/mutation_fragment_stream_validator.cc:265:12: error: call to 'format' is ambiguous
265 | return format("{} ({}.{} {})", _name_view, s.ks_name(), s.cf_name(), s.id());
| ^~~~~~
/usr/bin/../lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/format:4290:5: note: candidate function [with _Args = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
4290 | format(format_string<_Args...> __fmt, _Args&&... __args)
| ^
/__w/scylladb/scylladb/seastar/include/seastar/core/print.hh:143:1: note: candidate function [with A = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
143 | format(fmt::format_string<A...> fmt, A&&... a) {
| ^
```
in this change, we
change all `format()` to either `fmt::format()` or `seastar::format()`
with following rules:
- if the caller expects an `sstring` or `std::string_view`, change to
`seastar::format()`
- if the caller expects an `std::string`, change to `fmt::format()`.
because, `sstring::operator std::basic_string` would incur a deep
copy.
we will need another change to enable scylladb to compile with the
latest seastar. namely, to pass the format string as a templated
parameter down to helper functions which format their parameters.
to miminize the scope of this change, let's include that change when
bumping up the seastar submodule. as that change will depend on
the seastar change.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
In one of the following patches, we introduce support for zero-token
nodes. From that point, getting all nodes and getting all token
owners isn't equivalent. In this patch, we ensure that we consider
only token owners when we want to consider only token owners (for
example, in the replication logic), and we consider all nodes when
we want to consider all nodes (for example, in the topology logic).
The main purpose of this patch is to make the PR introducing
zero-token nodes easier to review. The patch that introduces
zero-token nodes is already complicated. We don't want trivial
changes from this patch to make noise there.
This patch introduces changes needed for zero-token nodes only in the
Raft-based topology and in the recovery mode. Zero-token nodes are
unsupported in the gossip-based topology outside recovery.
Some functions added to `token_metadata` and `topology` are
inefficient because they compute a new data structure in every call.
They are never called in the hot path, so it's not a serious problem.
Nevertheless, we should improve it somehow. Note that it's not
obvious how to do it because we don't want to make `token_metadata`
store topology-related data. Similarly, we don't want to make
`topology` store token-related data. We can think of an improvement
in a follow-up.
We don't remove unused `topology::get_datacenter_rack_nodes` and
`topology::get_datacenter_nodes`. These function can be useful in the
future. Also, `topology::_dc_nodes` is used internally in `topology`.
Keep lw_shared_ptr<tablet_map> in the tablet map and use COW semantics.
To prevent accidental changes to shared tablet_map instances, all
modifications to a tablet_map have to go through a new
`mutate_tablet_map()` method, which implements the copy-modify-swap
idiom.
fmt 11 enforces the constness of `format()` member function, if it
is not marked with `const`, the tree fails to build with fmt 11, like:
```
/usr/include/fmt/base.h:1393:23: error: no matching member function for call to 'format'
1393 | ctx.advance_to(cf.format(*static_cast<qualified_type*>(arg), ctx));
| ~~~^~~~~~
/usr/include/fmt/base.h:1374:21: note: in instantiation of function template specialization 'fmt::detail::value<fmt::context>::format_custom_arg<service::migration_badness, fmt::formatter<service::migration_badness>>' requested here
1374 | custom.format = format_custom_arg<
| ^
/home/kefu/dev/scylladb/service/tablet_allocator.cc:170:14: note: in instantiation of function template specialization 'fmt::format_to<fmt::basic_appender<char>, const locator::global_tablet_id &, const locator::tablet_replica &, const locator::tablet_replica &, const service::migration_badness &, 0>' requested here
170 | fmt::format_to(ctx.out(), "{{tablet: {}, {} -> {}, badness: {}", candidate.tablet, candidate.src,
| ^
/home/kefu/dev/scylladb/service/tablet_allocator.cc:161:10: note: candidate function template not viable: 'this' argument has type 'const fmt::formatter<service::migration_badness>', but method is not marked const
161 | auto format(const service::migration_badness& badness, FormatContext& ctx) {
| ^
```
so, in this change, we mark these two `format()` member functions const.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20013
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.
Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.
To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.
[1] 66ef711d68Closesscylladb/scylladb#20006
If the source and destination shards picked for migration based on
global tablet balance do not have a good candidate in terms of effect
on per-table balance, the algorithm explores other source shards and
destinations. This has quadratic complexity in terms of shard count in
the worst case, when there are no good candidates.
Since we can have up to ~200 shards, this can slow down scheduling
significantly. I saw total scheduling time of 5 min in the following run:
scylla perf-load-balancing -c1 -m1G --iterations=8 \
--nodes=4 --tablets1=1024 --tablets2=8096 \
--rf1=2 --rf2=3 --shards=256
To improve, change the apprach to first find the best source shard and
then best target shard, sequentially. So it's now linear in terms of
shard count.
After the change, the total scheduling time in that run is down to 4s.
Minimizing source and destination metrics piece-wise minimizes the
combined metric, so badness of the best candidate doesn't suffer after
this change.
Affects only intra-node migration. The code was recording destination
shard as taken and did not un-take it in case we skipped the migration
due to lack of candidates.
Noticed during code review. Impact is minor, since even if this leads
to suboptimal balance, the next scheduling round should fix it.
Also, the source shard was not unloaded, but that should have no
impact on decisions. But to be future-proof, better to maintain the
load accurately in case the algorithm is extended with more steps.