Resize is no longer driven only by the average tablet size. Log the
average tablet size as information, not as the reason, and log the
true reason for the target tablet count.
Hints have a common meaning for all strategies, so the logic belongs
more naturally in make_sizing_plan().
As a side effect, we can reuse the shard capacity computation across
tables, which reduces computational complexity from O(tables * nodes)
to O(tables * DCs + nodes).
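A minimal sketch of that reuse, with illustrative names rather than the
actual load balancer code: shard capacity is aggregated once per DC by
walking the nodes, and each table's sizing then only consults the
per-DC totals:

    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    struct node_info {
        std::string dc;
        uint64_t shard_count;
    };

    // Walk all nodes once (O(nodes)); afterwards each table only needs
    // the per-DC totals (O(DCs) per table).
    std::map<std::string, uint64_t> shards_per_dc(const std::vector<node_info>& nodes) {
        std::map<std::string, uint64_t> totals;
        for (const auto& n : nodes) {
            totals[n.dc] += n.shard_count;
        }
        return totals;
    }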
The limit is enforced by controlling the average per-shard tablet
replica count in a given DC, which in turn is controlled by the
per-table tablet count. This is effective in respecting the limit on
individual shards as long as tablet replicas are distributed evenly
between shards.
There is no attempt to move tablets around in order to enforce limits
on individual shards in case of imbalance between shards.
If the average per-shard tablet count exceeds the limit, all tables
which contribute to it (have replicas in the DC) are scaled down
by the same factor. Due to rounding up to the nearest power of 2,
we may overshoot the per-shard goal by at most a factor of 2.
If different DCs want different scale factors for a given table, the
lowest scale factor is chosen.
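A minimal sketch of the scale-down rule described above (illustrative
helper names, not the actual code):

    #include <algorithm>
    #include <bit>
    #include <cstdint>

    // If the DC's average per-shard replica count exceeds the goal, every
    // contributing table is scaled down by the same factor.
    double dc_scale_factor(uint64_t replicas_in_dc, uint64_t shards_in_dc,
                           uint64_t per_shard_goal) {
        double avg = double(replicas_in_dc) / double(shards_in_dc);
        return avg > per_shard_goal ? per_shard_goal / avg : 1.0;
    }

    // A table uses the lowest scale factor requested by any DC; rounding up
    // to the nearest power of 2 may overshoot the goal by at most 2x.
    uint64_t scaled_tablet_count(uint64_t current, double lowest_scale) {
        uint64_t target = std::max<uint64_t>(1, uint64_t(current * lowest_scale));
        return std::bit_ceil(target);
    }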
The limit is configurable. It's a cluster-wide setting which controls
how many tablet replicas per shard in total we consider to be still
ok. It controls tablet allocator behavior when choosing the initial
tablet count. Even though it's technically a per-node config item, we
don't support different limits per node; all nodes must have the same
value of that config. It's similar in that regard to other scheduler
config items like tablets_initial_scale_factor and
target_tablet_size_in_bytes.
This makes decisions made by the scheduler consistent with decisions
made on table creation, with regard to tablet count.
We want to avoid over-allocation of tablets when a table is created,
which would then be reduced by the scheduler's scaling logic. Not just
to avoid wasteful migrations post table creation, but to respect the
per-shard goal. To respect the per-shard goal, the algorithm will no
longer be as simple as looking at hints, and we want to share the
algorithm between the scheduler and the initial tablet allocator. So
invoke the scheduler to get the tablet count when a table is created.
This is in preparation for using the sizing plan during table creation
where we never have size stats, and hints are the only determining
factor for target tablet count.
Resize plan making will now happen in two stages:
1) Determine desired tablet counts per table (sizing plan)
2) Schedule resize decisions
We need an intermediate step in the resize plan making which gives us
the planned tablet counts, so that we can plug this part of the
algorithm into initial tablet allocation on table construction.
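A minimal sketch of the two-stage shape, with illustrative types (the
real make_sizing_plan() works over all tables and DCs):

    #include <cstdint>

    enum class resize_kind { none, split, merge };

    struct table_sizing {
        uint64_t current_tablet_count;
        uint64_t desired_tablet_count;  // stage 1: output of the sizing plan
    };

    // Stage 2: turn the sizing plan into a resize decision. Initial tablet
    // allocation reuses only stage 1 and takes desired_tablet_count directly.
    resize_kind schedule_resize(const table_sizing& t) {
        if (t.desired_tablet_count > t.current_tablet_count) {
            return resize_kind::split;
        }
        if (t.desired_tablet_count < t.current_tablet_count) {
            return resize_kind::merge;
        }
        return resize_kind::none;
    }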
We want decisions made by the scheduler to be consistent with
decisions made on table creation. We want to avoid over-allocation of
tablets when a table is created, which would then be reduced by the
scheduler. Not just to avoid wasteful migrations post table creation,
but to respect the per-shard goal. To respect the per-shard goal, the
algorithm will no longer be as simple as looking at hints, and we want
to share the algorithm between the scheduler and the initial tablet
allocator.
Also, this sizing plan will be later plugged into a virtual table for
observability.
Logic is preserved since target tablet size is constant for all
tables.
Dropping d.target_max_tablet_size() will allow us to move it
to the load_balancer scope.
Demote do-nothing decisions to debug level, but keep them at info
if we did decide to do something (such as migrate a tablet). Information
about more major events (like split/merge) is kept at info level.
One log line that logs node information now also logs the datacenter,
which was previously supplied by a log line that is now debug-only.
Closes scylladb/scylladb#22783
Do not merge tablets if that would drop the tablet_count
below the minimum provided by hints.
Split tablets if the current tablet_count is less than
the minimum tablet count calculated using the table's tablet options.
TODO: override min_tablet_count if the tablet count per shard
is greater than the maximum allowed. In this case
the tables' tablet counts should be scaled down proportionally.
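A minimal sketch of how the hint-based minimum bounds the decision
(illustrative names; merging is assumed to halve the tablet count):

    #include <cstdint>

    enum class resize { none, split, merge };

    resize decide(uint64_t current, uint64_t size_based_target, uint64_t min_from_hints) {
        if (current < min_from_hints || size_based_target > current) {
            return resize::split;
        }
        // Never merge if halving the count would drop below the hinted minimum.
        if (size_based_target < current && current / 2 >= min_from_hints) {
            return resize::merge;
        }
        return resize::none;
    }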
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than target_max_tablet_size. We need the target as well as
the max and min tablet sizes, so there is no sense in keeping only
the max and deriving the target and the minimum from it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use the keyspace initial_tablets for min_tablet_count if the latter
isn't set; then take the maximum of the option-based tablet counts:
- min_tablet_count
- expected_data_size_in_gb / target_tablet_size
- min_per_shard_tablet_count (via calculate_initial_tablets_from_topology)
If none of the hints produce a positive tablet_count,
fall back to calculate_initial_tablets_from_topology * initial_scale.
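A minimal sketch of the hint combination described above (names follow
this message; the real signatures differ):

    #include <algorithm>
    #include <cstdint>
    #include <optional>

    uint64_t initial_tablet_count(std::optional<uint64_t> min_tablet_count_option,
                                  std::optional<uint64_t> keyspace_initial_tablets,
                                  uint64_t expected_data_size_in_gb,
                                  uint64_t target_tablet_size_in_gb,
                                  uint64_t min_per_shard_based,      // via calculate_initial_tablets_from_topology
                                  uint64_t topology_based_fallback)  // calculate_initial_tablets_from_topology * initial_scale
    {
        // Keyspace initial_tablets stands in for min_tablet_count when the
        // table option isn't set.
        uint64_t min_tablet_count =
                min_tablet_count_option.value_or(keyspace_initial_tablets.value_or(0));
        uint64_t size_based = target_tablet_size_in_gb
                ? expected_data_size_in_gb / target_tablet_size_in_gb : 0;
        uint64_t hinted = std::max({min_tablet_count, size_based, min_per_shard_based});
        // If none of the hints produce a positive count, fall back to the
        // topology-based default.
        return hinted > 0 ? hinted : topology_based_fallback;
    }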
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Main problem:
If we're draining the last node in a DC, we won't have a chance to
evaluate candidates and notice that constraints cannot be satisfied (N
< RF). Draining will succeed and node will be removed with replicas
still present on that node. This will cause later draining in the same
DC to fail when we have 2 replicas which need relocation for a
given tablet.
The expected behavior is for draining to fail, because we cannot keep
the RF in the DC. This is consistent, for example, with what happens
when removing a node in a 2-node cluster with RF=2.
Fixes #21826
Secondary problem:
We allowed the tablet_draining transition to be exited with undrained
nodes, leaving replicas on nodes in the "left" state.
Third problem:
We removed DOWN nodes from the candidate node set, even when draining.
This is not safe because it may lead to overload. This also makes the
"main problem" more likely by extending it to the scenario when the DC
is DOWN.
The overload part is not a problem in practice currently, since
migrations will block on the global topology barrier if there are DOWN
nodes.
Closes scylladb/scylladb#21928
* github.com:scylladb/scylladb:
tablets: load_balancer: Fail when draining with no candidate nodes
tablets: load_balancer: Ignore skip_list when draining
tablets: topology_coordinator: Keep tablet_draining transition if nodes are not drained
If we're draining the last node in a DC, we won't have a chance to
evaluate candidates and notice that constraints cannot be satisfied (N
< RF). Draining will succeed and node will be removed with replicas
still present on that node. This will cause later draining in the same
DC to fail when we have 2 replicas which need relocation for a
given tablet.
The expected behavior is for draining to fail, because we cannot keep
the RF in the DC. This is consistent, for example, with what happens
when removing a node in a 2-node cluster with RF=2.
Fixes #21826
When doing normal load balancing, we can ignore DOWN nodes in the node
set and just balance the UP nodes among themselves because it's ok to
equalize load just in that set, it improves the situation.
It's dangerous to do that when draining because that can lead to
overloading of the UP nodes. In the worst case, we can have only one
non-drained node in the UP set, which would receive all the tablets of
the drained node, doubling its load.
It's safer to let the drain fail or stall. This is decided by the
topology coordinator; currently we will fail (on barrier) and roll back.
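A minimal sketch of the difference in candidate selection (illustrative
types; the real code works on the balancer's node state):

    #include <stdexcept>
    #include <vector>

    struct candidate {
        bool is_up;
        bool in_skip_list;
    };

    std::vector<candidate> pick_candidates(const std::vector<candidate>& nodes, bool draining) {
        std::vector<candidate> out;
        for (const auto& n : nodes) {
            // Regular balancing may restrict itself to UP, non-skipped nodes;
            // draining must consider all of them to avoid overloading the UP set.
            if (!draining && (!n.is_up || n.in_skip_list)) {
                continue;
            }
            out.push_back(n);
        }
        if (draining && out.empty()) {
            throw std::runtime_error("draining failed: no candidate nodes");
        }
        return out;
    }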
In this change, tablet_virtual_task starts supporting tablet
migration, in addition to tablet repair. Both tablet operations
reuse the same virtual_task because their task data is retrieved
similarly. However, it changes nothing from the task manager
API users' perspective. They can list running migrations or check
their statuses all the same as if migration had its own virtual_task.
Users can see running migration tasks - finished tasks are not
presented with the task manager API. However, the result
of the migration (whether it succeeded or failed) will be
presented to users if they use the wait API.
If a migration was reverted, it will appear to users as failed.
We assume that the migration was reverted when its destination
does not contain a tablet replica.
Fixes: https://github.com/scylladb/scylladb/issues/21365.
No backport, new feature
Closes scylladb/scylladb#21729
* github.com:scylladb/scylladb:
test: boost: check migration_task_info in tablet_test.cc
replica: add repair related fields to tablet_map_to_mutation
test: add tests to check the failed migration virtual tasks
test: add tests to check the list of migration virtual tasks
test: add tests to check migration virtual tasks status
test: topology_tasks: generalize repair task functions
service: extend tablet_virtual_task::abort
service: extend tablet_virtual_task::wait
service: extend tablet_virtual_task::get_status_helper
service: extend tablet_virtual_task::contains
service: extend tablet_virtual_task::get_stats
service: tasks: make get_table_id a method of virtual_task_hint
service: tasks: extend virtual_task_hint
replica: service: add migration_task_info column to system.tablets
locator: extend tablet_task_info to cover migration tasks
locator: rename tablet_task_info methods
This change goes through locator::topology to use node&
instead of node* where nullptr is not possible. There are
places where the node object is used in an unordered_set; in
those cases the node is wrapped in std::reference_wrapper.
Fixes scylladb/scylladb#20357
Closes scylladb/scylladb#21863
When tablet scheduler drains nodes, it chooses target location based
on "badness" metric. Nodes with lowest score are preferred. Before the
patch, the score used was the number of tablets on that node
post-movement. This way we populate the least-loaded node first. But
this works only if nodes have an equal number of shards. If nodes have
different capacities, then the number of tablets is not a good metric,
because we don't aim to equalize the per-node count, but the per-shard
count. We assume that each shard has equal capacity.
Because of this bug, during decommission, the nodes with fewer shards
would be preferred to receive replicas, which may lead to overloading
of those nodes. This imbalance would be later fixed by the normal load
balancing logic, but it's still problematic.
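A minimal sketch of the changed score (illustrative): with equal
per-node counts, a node with fewer shards now scores worse because its
per-shard load is higher.

    #include <cstdint>

    // Post-movement per-shard load; lower is better for receiving replicas.
    double drain_target_score(uint64_t tablets_on_node_after_move, uint64_t shard_count) {
        return double(tablets_on_node_after_move) / double(shard_count);
    }
    // Example: 40 tablets on a 4-shard node scores 10.0, while 40 tablets
    // on a 16-shard node scores 2.5, so the larger node is preferred.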
Fixes #21783
Closes scylladb/scylladb#21860
This implements the ability for the balancer to co-locate sibling
tablets on the same shard. Co-location is low in priority, so
regular load balancing is preferred over it. Previous changes
allowed the balancer to move co-located sibling tablets together,
so as not to undo the co-location work done so far.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
set_resize_plan() breaks commutativity since it may override the
resize plans done earlier, for example, when adding co-location
migrations in the DC plan.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This implements the ability to resize the tablet map for merge if
the balancer emits the decision to finalize the merge when all
sibling replicas are colocated for a table. But the co-location
plan is not implemented in the balancer yet, so this is still
not in use.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This transition state will be reused by merge completion, so let's
rename it to tablet_resize_finalization.
The completion handling path will also be reused, so let's rename
functions involved similarly.
The old name "tablet split finalization" is deprecated but still
recognized and points to the correct transition. Otherwise, the
reverse lookup would fail when populating the topology system table
whose last state was split finalization.
NOTE:
I thought of adding a new tablet_merge_finalization, but it would
complicate things since more than one table could be ready for
either split or merge, so you need a generic transition state
for handling resize completion.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The initial_tablet_count is respected while the table is in "growing mode".
The table implicitly enters this mode when created, since we expect the
table to be populated thereafter.
We say that a table leaves this mode if it required a split above the initial
tablet count. After that, we can rely purely on the average size to say that
a table is shrinking and requires merge.
This is not perfect and we may want to leave the mode too if we detect
the table is shrinking (or even not growing for some significant amount
of time), before any split happened.
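A minimal sketch of the rule (illustrative state; the real tracking
lives with the table's resize state):

    #include <cstdint>

    struct table_growth_state {
        uint64_t initial_tablet_count;
        bool growing = true;  // entered implicitly when the table is created
    };

    // A split that takes the table above its initial tablet count ends
    // growing mode; after that, average size alone may justify a merge.
    void on_split(table_growth_state& t, uint64_t new_tablet_count) {
        if (new_tablet_count > t.initial_tablet_count) {
            t.growing = false;
        }
    }

    bool merge_allowed(const table_growth_state& t) {
        return !t.growing;
    }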
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If a table is undergoing merge, co-located replicas of sibling tablets
will be treated by the balancer as if they were a single migration
candidate.
The reason for that is that the balancer must not undo the co-location
work done previously on behalf of the merge decision.
Sibling tablets will be put in the same migration plan, but note that
each tablet is still migrated independently in the state machine.
The balancer will exclude both co-located tablets from the candidate
list if either hasn't finished migration yet. It achieves that by
pretending the migration of sibling tablets succeeded, allowing it to
note that tablets are co-located even though either can still be
migrating.
The load inversion convergence check also happens after picking a
candidate now, since the balancer must be aware that co-located
tablets are being migrated together and we want to avoid oscillations.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We have a check that moving a tablet from A to B won't violate
replication constraints.
The constraints might not be the same for two sibling tablets that
have co-located replicas.
Example:
nodes = {A, B, C, D}
tablet1 = {A, B, C}
tablet2 = {A, B, D}
viable target for {tablet1, B} is D.
viable target for {tablet2, B} is C.
When co-located replicas share a viable target, then a migration can be emitted to
preserve co-location.
To allow decommission when co-located replicas don't share a viable target, a skip
info will be returned for each tablet, even though that means breaking this
co-location. Decommission is higher in priority.
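A minimal sketch of the shared-target check (illustrative types),
matching the example above: viable targets {D} and {C} don't intersect,
so no co-location-preserving migration exists there.

    #include <optional>
    #include <set>
    #include <string>

    using node_set = std::set<std::string>;

    // Returns a target both sibling replicas can move to, if any.
    std::optional<std::string> shared_viable_target(const node_set& viable_for_t1,
                                                    const node_set& viable_for_t2) {
        for (const auto& n : viable_for_t1) {
            if (viable_for_t2.count(n)) {
                return n;   // co-location can be preserved
            }
        }
        // No shared target: during decommission, break co-location and
        // return per-tablet skip info instead.
        return std::nullopt;
    }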
Also, doing some preparation for integration of migration_tablet_set.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The load inversion convergence check should be able to know when two
tablets are being migrated instead of one, to avoid oscillations.
This will be wired when migration_tablet_set is wired.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This new type will allow the load balancer to treat co-located tablets
as a single candidate (will treat them as if they were already merged),
allowing co-located replicas to be migrated together (in the same
migration plan).
The type is a variant of global_tablet_id and colocated_tablets
(which holds the global_tablet_id of the sibling tablets).
It will eventually be wired after some more preparation. It will allow
for a minimal amount of changes in the balancer code.
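A minimal sketch of the shape described above (the stand-in
global_tablet_id is illustrative; the real one lives in locator/):

    #include <variant>

    struct global_tablet_id {
        long table;
        unsigned tablet;
    };

    // Sibling tablets that are co-located and must be migrated together.
    struct colocated_tablets {
        global_tablet_id left;
        global_tablet_id right;
    };

    // A single migration candidate: either one tablet or a co-located pair.
    using migration_tablet_set = std::variant<global_tablet_id, colocated_tablets>;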
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This extraction will make it easier later when co-located tablets
are introduced in load balancer.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.
please note, because `mutation/mutation.hh` does not include
`seastar/coroutine/maybe_yield.hh` anymore, and quite a few source
files were relying on this header to bring in the declaration of
`maybe_yield()`, we have to include this header in the places where
this symbol is used. the same applies to `seastar/core/when_all.hh`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This adds a new tablet migration kind: repair. It allows tablet repair
scheduler to use this migration kind to schedule repair jobs.
The current repair scheduler implementation does the following:
- A tablet is picked to be repaired when the time since last repair is
bigger than a threshold (auto repair mode) or it is requested by the user
(manual repair mode)
- The tablet repair can be scheduled along with tablet migration and
rebuild. It runs in the tablet_migration track.
- Repair jobs are scheduled in a smart way so that at any point in time,
there are no more than the configured number of jobs per shard, which
is similar to Scylla Manager's control.
In this patch, both the manual repair and the auto repair are not
enabled yet.
The stream_weight for repair migration is set to 2, because it requires
more work than just moving the tablet around. The stream_weight for all
other migrations is set to 1.
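A minimal sketch of the scheduling constraints listed above
(illustrative names and config):

    #include <chrono>
    #include <cstdint>

    struct repair_scheduler_config {
        std::chrono::seconds auto_repair_threshold;
        uint64_t max_repair_jobs_per_shard;
    };

    bool should_schedule_repair(std::chrono::seconds since_last_repair,
                                bool user_requested,
                                uint64_t running_jobs_on_shard,
                                const repair_scheduler_config& cfg) {
        if (running_jobs_on_shard >= cfg.max_repair_jobs_per_shard) {
            return false;  // per-shard cap, similar to Scylla Manager's control
        }
        return user_requested || since_last_repair > cfg.auto_repair_threshold;
    }

    // Repair streams more data than a plain move, hence the heavier weight.
    constexpr unsigned stream_weight(bool is_repair) { return is_repair ? 2 : 1; }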
With the tablets feature always enabled (unless gossip topology
changes are forced), the enable_tablets option now controls only
the default for newly created keyspaces.
Even when set to `false`, tablets are still enabled as a
feature and the user may explicitly enable tablets
using `CREATE KEYSPACE <name> WITH tablets = {'enabled': true}`
Note: best viewed with `git show -w`
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>