Commit Graph

184 Commits

Michael Litvak
97b7c03709 tablet: scheduler: Do not emit conflicting migrations in merge colocation
The tablet scheduler should not emit conflicting migrations for the same
tablet. This was addressed initially in scylladb/scylladb#26038 but the
check is missing in the merge colocation plan, so add it there as well.

Without this check, the merge colocation plan could generate a
conflicting migration for a tablet that is already scheduled for
migration, as the test demonstrates.

This can cause correctness problems, because if the load balancer
generates two migrations for a single tablet, both will be written as
mutations, and the resulting mutation could contain mixed cells from
both migrations.
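
A minimal sketch of the kind of guard involved, with hypothetical,
simplified types (the real check consults the tablet transition state
in the scheduler):

```
#include <cstdint>
#include <unordered_set>

// Hypothetical stand-ins for the scheduler's types.
using tablet_id = uint64_t;
struct migration { tablet_id tablet; /* src, dst, ... */ };

struct migration_plan {
    std::unordered_set<tablet_id> scheduled; // tablets already in transition or planned

    // Emit a migration only if the tablet has none scheduled yet; otherwise
    // two migrations for one tablet would both be written as mutations, and
    // the resulting mutation could mix cells from both.
    bool try_add(const migration& m) {
        return scheduled.insert(m.tablet).second;
    }
};
```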

Fixes scylladb/scylladb#27304

Closes scylladb/scylladb#27312
2025-11-28 11:17:12 +01:00
Tomasz Grabiec
f8879d797d tablet_allocator: Avoid load balancer failure when replacing the last node in a rack
Introduced in 9ebdeb2

The problem is specific to node replace and rack-list RF. The
culprit is in the part of the load balancer which determines a rack's
shard count. If we're replacing the last node in a rack, the rack
contains no normal nodes, so shards_per_rack has no entry for the rack
even though the table still has replicas on it. This throws
std::out_of_range, which fails the tablet draining stage and with it
the node replace.
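
The failure mode is the classic map-lookup trap; a minimal sketch with
hypothetical names (the real code sits in the load balancer's rack
shard-count computation):

```
#include <string>
#include <unordered_map>

using shard_count = unsigned;

// shards_per_rack only has entries for racks that contain normal nodes.
// When the last node in a rack is being replaced, the rack is absent even
// though the table still has replicas there.
shard_count shards_in_rack(const std::unordered_map<std::string, shard_count>& shards_per_rack,
                           const std::string& rack) {
    // Before: shards_per_rack.at(rack) throws std::out_of_range for the
    // drained rack, failing the whole tablet draining stage.
    auto it = shards_per_rack.find(rack);
    return it == shards_per_rack.end() ? 0 : it->second;
}
```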

No backport because the problem exists only on master.

Fixes #26768

Closes scylladb/scylladb#26783
2025-11-05 15:49:51 +03:00
Petr Gusev
03d6829783 test_tablets_lwt: add test_tablets_merge_waits_for_lwt 2025-10-22 11:33:20 +02:00
Tomasz Grabiec
e4e79be295 Merge 'tablet_allocator: allow merges in base tables if rf-rack-valid=true' from Piotr Dulikowski
Tablet merge of base tables is only safe if there is at most one replica in each rack. For more details on why that is the case, see scylladb/scylladb#17265. If rf-rack-valid-keyspaces is turned on, this condition is satisfied, so allow merges in that case.

Fixes: scylladb/scylladb#26273

Marked for backport to 2025.4, where MVs stop being experimental.

Closes scylladb/scylladb#26278

* github.com:scylladb/scylladb:
  test: mv: add a test for tablet merge
  tablet_allocator, tests: remove allow_tablet_merge_with_views injection
  tablet_allocator: allow merges in base tables if rf-rack-valid=true
2025-10-21 00:18:30 +02:00
Piotr Dulikowski
359ed964e3 tablet_allocator, tests: remove allow_tablet_merge_with_views injection
The `allow_tablet_merge_with_views` error injection was previously used
to allow merging tablets in a table which has materialized views
attached to it. Now, the error injection is not needed because this is
allowed under the rf-rack-valid condition, which is enabled by default
in tests.

Remove the error injection from the code and adjust the tests not to use
it.
2025-10-16 14:07:37 +02:00
Piotr Dulikowski
189ad96728 tablet_allocator: allow merges in base tables if rf-rack-valid=true
Tablet merge of base tables is only safe if there is at most one replica
in each rack. For more details on why that is the case, see
scylladb/scylladb#17265. If rf-rack-valid-keyspaces is turned on,
this condition is satisfied, so allow merges in that case.

Fixes: scylladb/scylladb#26273
2025-10-16 13:02:05 +02:00
Gleb Natapov
c255740989 schema: Allow configuring consistency setting for a keyspace
We want to add strongly consistent tables as an option. There will be
two kinds of strongly consistent tables: globally consistent and locally
consistent. The former means that requests from all DCs will be globally
linearisable, while the latter means that only requests to the same DC
will be linearisable. To allow configuring all the possibilities, the
patch adds a new parameter to the keyspace definition, "consistency",
which can be set to `eventual`, `global` or `local`. A non-eventual
setting is supported for tablets-enabled keyspaces only. Since we want
to start by implementing local consistency, configuring global
consistency will result in an error for now.
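
A sketch of the validation the new option implies, under assumed names
(the real parsing lives in the schema code):

```
#include <stdexcept>
#include <string>

enum class keyspace_consistency { eventual, global, local };

// Hypothetical validation: a non-eventual setting requires a
// tablets-enabled keyspace, and `global` is rejected until implemented.
keyspace_consistency parse_consistency(const std::string& value, bool uses_tablets) {
    keyspace_consistency c;
    if (value == "eventual")    c = keyspace_consistency::eventual;
    else if (value == "global") c = keyspace_consistency::global;
    else if (value == "local")  c = keyspace_consistency::local;
    else throw std::invalid_argument("consistency must be eventual, global or local");

    if (c != keyspace_consistency::eventual && !uses_tablets) {
        throw std::invalid_argument("non-eventual consistency requires a tablets-enabled keyspace");
    }
    if (c == keyspace_consistency::global) {
        throw std::invalid_argument("global consistency is not implemented yet");
    }
    return c;
}
```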
2025-10-16 13:34:49 +03:00
Tomasz Grabiec
9ebdeb261f tablets: load_balancer: Recognize that tablets are confined to racks when computing desired tablet count
The old logic assumes that replicas are spread across the whole DC when
determining how many tablets we need to have at least 10 tablets per
shard. If replicas are actually confined to a subset of racks, that
will produce too high a count and overshoot the actual per-shard
count in those racks.

A similar problem happens for scaling down the tablet count, when we try
to keep the per-shard tablet count below the goal. It should be tracked
per rack rather than per DC, since racks can differ in how loaded they
are by the RF if it's a rack-list.
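
A sketch of the per-rack arithmetic, with a hypothetical helper
(10 tablets per shard is the goal cited above):

```
#include <cstdint>

// Desired tablet count so that each shard in the rack holds at least
// `min_per_shard` tablet replicas. Computed per rack, not per DC: with a
// rack-list RF the table's replicas may be confined to a subset of racks,
// and sizing against the whole DC's shard count overshoots.
uint64_t desired_tablets_for_rack(uint64_t shards_in_rack,
                                  uint64_t replicas_per_tablet_in_rack,
                                  uint64_t min_per_shard = 10) {
    // Each tablet contributes `replicas_per_tablet_in_rack` replicas here;
    // round up so the goal is met.
    return (min_per_shard * shards_in_rack + replicas_per_tablet_in_rack - 1)
            / replicas_per_tablet_in_rack;
}
```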
2025-10-02 19:45:00 +02:00
Tomasz Grabiec
e5b7452af2 tablet_allocator: Respect binding replicas to racks 2025-10-02 19:42:39 +02:00
Benny Halevy
da6e2fdb1b locator: Pass topology to replication strategy constructor 2025-10-01 16:06:28 +02:00
Benny Halevy
aaddff5211 tablets: tablet_map_to_mutations: accept process_func
Prepare for generating several mutations for the
tablet_map by calling process_func for each generated mutation.

This allows the caller to directly freeze those mutations
one at a time into a vector of frozen mutations, or similarly
convert them into canonical mutations.

Next patch will split large tablet mutations to prevent stalls.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:15:38 +03:00
Michał Jadwiszczak
3bbbbf419b test/cluster/test_view_building_coordinator: add reproducer for staging sstables with tablet merge
The test verifies that staging sstables are processed correctly after
tablet merge.

Refs scylladb/scylladb#26244

Closes scylladb/scylladb#26286
2025-09-30 09:05:31 +02:00
Petr Gusev
8adbb6c4dd tablets: disallow chains of colocated tables 2025-09-26 16:52:43 +02:00
Avi Kivity
2239474a87 Merge 'tablets: scheduler: Balance racks separately when rf_rack_valid_keyspaces is true' from Tomasz Grabiec
Greatly improves the performance of plan making, because we don't
consider candidates in other racks, most of which would fail to be
selected due to replication constraints (no rack overload). It also
(though this is minor) reduces the overhead of candidate evaluation,
as we don't have to evaluate rack load.

Enabled only for rf_rack_valid_keyspaces, because such setups guarantee
that we will not need to (indeed, must not) move tablets across racks,
so we don't need to execute the general algorithm for the whole DC.

Tested with perf-load-balancing, which performs a single scale-out
operation on a cluster which initially has 10 nodes 88 shards each, 2
racks, RF=2, 70 tables, 256 tablets per table. Scale out adds 6 new
nodes (same shard count). Time to rebalance the cluster (plan making
only, sum of all iterations, no streaming):

Before:  16 min 25 s
After:    0 min 25 s

Before, plan making cost (single incremental iteration) alternated
between fast (0.1 [s]) and slow (14.1 [s]):

  testlog - Rebalance iteration 7 took 14.156 [s]: mig=88, bad=88, first_bad=17741, eval=93874484, skiplist=0, skip: (load=0, rack=17653, node=0)
  testlog - Rebalance iteration 8 took 0.143 [s]: mig=88, bad=88, first_bad=88, eval=865407, skiplist=0, skip: (load=0, rack=0, node=0)

The slow run chose min and max nodes in different racks, hence the
fast path failed to find any candidates and we switched to exhaustive
search of candidates in other nodes.

After, all iterations are fast (0.1 [s] per rack, 0.2 [s] per plan-making). The plan is twice as large because it combines the output of two subsequent (pre-patch) plan-making calls.

Fixes #26016

Closes scylladb/scylladb#26017

* github.com:scylladb/scylladb:
  test: perf: perf-load-balancing: Add parallel-scaleout scenario
  test: perf: perf-load-balancing: Convert to tool_app_template
  tablets: scheduler: Balance racks separately when rf_rack_valid_keyspaces is true
2025-09-23 22:45:35 +03:00
Tomasz Grabiec
981592bca5 tablet: scheduler: Do not emit conflicting migrations in the plan
Plan-making is invoked independently for different DCs (and in the
future, racks) and then plans are merged. It could be that the same
tablets are selected for migration in different DCs. Only one
migration will prevail and be committed to group0, so it's not a
correctness problem. Next cycle will recognize that the tablet is in
transition and will not be selected by plan-maker. But it makes
plan-making less efficient.

It may also surprise consumers of the plan, like we saw in #25912.

So we should make plan-maker be aware of already scheduled transitions
and not consider those tablets as candidates.

Fixes #26038

Closes scylladb/scylladb#26048
2025-09-23 22:40:08 +03:00
Tomasz Grabiec
c9f0a9d0eb tablets: scheduler: Balance racks separately when rf_rack_valid_keyspaces is true
Greatly improves the performance of plan making, because we don't
consider candidates in other racks, most of which would fail to be
selected due to replication constraints (no rack overload). It also
(though this is minor) reduces the overhead of candidate evaluation,
as we don't have to evaluate rack load.

Enabled only for rf_rack_valid_keyspaces, because such setups guarantee
that we will not need to (indeed, must not) move tablets across racks,
so we don't need to execute the general algorithm for the whole DC.

Tested with perf-load-balancing, which performs a single scale-out
operation on a cluster which initially has 10 nodes 88 shards each, 2
racks, RF=2, 70 tables, 256 tablets per table. Scale out adds 6 new
nodes (same shard count). Time to rebalance the cluster (plan making
only, sum of all iterations, no streaming):

Before: 16 min 25 s
After: 0 min 25 s

Before, plan making cost (single incremental iteration) alternated
between fast (0.1 [s]) and slow (14.1 [s]):

  Rebalance iteration 7 took 14.156 [s]: mig=88, bad=88, first_bad=17741, eval=93874484, skiplist=0, skip: (load=0, rack=17653, node=0)
  Rebalance iteration 8 took 0.143 [s]: mig=88, bad=88, first_bad=88, eval=865407, skiplist=0, skip: (load=0, rack=0, node=0)

The slow run chose min and max nodes in different racks, hence the
fast path failed to find any candidates and we switched to exhaustive
search of candidates in other nodes.

After, all iterations are fast (0.1 [s] per rack, 0.2 [s] per plan-making).
The plan is twice as large because it combines the output of two subsequent (pre-patch)
plan-making calls.

Fixes #26016
2025-09-23 00:30:37 +02:00
Ferenc Szili
d9f272dbdd load_balancer: fix badness object creation
The load balancer introduced the idea of badness, which is a measure of
how a tablet migration affects table balance on the source and
destination. This is an abbreviated definition of the badness struct:

struct migration_badness {
    double src_shard_badness = 0;
    double src_node_badness = 0;
    double dst_shard_badness = 0;
    double dst_node_badness = 0;

    ...

    double node_badness() const {
        return std::max(src_node_badness, dst_node_badness);
    }

    double shard_badness() const {
        return std::max(src_shard_badness, dst_shard_badness);
    }
};

A negative value for any of these 4 members signifies a good
migration (improves table balance), and a positive value signifies a
bad migration.

In two places in the balancer, badness for source and destination is
computed independently in two objects of type migration_badness
(src_badness and dst_badness), and later combined into a single object
similar to this:

return migration_badness{
    src_badness.shard_badness(),
    src_badness.node_badness(),
    dst_badness.shard_badness(),
    dst_badness.node_badness()
};

This is a problem: when, for instance, the source shard badness is good
(less than 0), shard_badness() returns 0 because of std::max() against
the zero-initialized dst fields. This way the actual computed badness is
not set in the final object, which can lead to incorrect decisions made
later by the balancer, when it searches for the best migration among a
set of candidates.
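
A sketch of the corrected combination, reusing the abbreviated struct
above: copy the raw per-side fields instead of going through the
max-based accessors, so a good (negative) side is not clipped to 0:

```
// src_badness has only its src_* fields set (dst_* stay 0), and vice versa
// for dst_badness. Calling shard_badness()/node_badness() on them applies
// std::max() against the zero-initialized fields and drops negative values.
migration_badness combine(const migration_badness& src_badness,
                          const migration_badness& dst_badness) {
    return migration_badness{
        src_badness.src_shard_badness,
        src_badness.src_node_badness,
        dst_badness.dst_shard_badness,
        dst_badness.dst_node_badness,
    };
}
```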

Closes scylladb/scylladb#26091
2025-09-21 21:37:23 +02:00
Tomasz Grabiec
ddbcea3e2a tablets: scheduler: Run plan-maker in maintenance scheduling group
Currently, it runs in the gossiper scheduling group, because it's
invoked by the topology coordinator. That scheduling group has the
same amount of shares as the user workload. Plan-making can take a
significant amount of time during rebalancing, and we don't want that
to impact user workload which happens to run on the same shard.

Reduce impact by running in the maintenance scheduling group.

Fixes #26037

Closes scylladb/scylladb#26046
2025-09-21 18:44:57 +03:00
Piotr Dulikowski
5f55787e50 Merge 'CDC with tablets' from Michael Litvak
Initial implementation of CDC support in tablets-enabled keyspaces.

The design is described in https://docs.google.com/document/d/1qO5f2q5QoN5z1-rYOQFu6tqVLD3Ha6pphXKEqbtSNiU/edit?usp=sharing
It is followed closely for the most part, except for "Deciding when to change streams" - instead, streams are changed synchronously with tablet split / merge.
Instead of the stream switching algorithm with the double writes, we use a scheme similar to the previous method for vnodes - we add the new streams with a timestamp that is sufficiently far into the future.

In this PR we:
* add new group0-based internal system tables for tablet stream metadata, and load it into the in-memory CDC metadata
* add virtual tables for CDC consumers
* make the write coordinator choose a stream by looking up the appropriate stream in the CDC metadata
* enable creating tables with CDC enabled in tablets-enabled keyspaces; tablets are allocated for the CDC table, and a stream is created per tablet (see the sketch below)
* on tablet resize (split / merge), have the topology coordinator create a new stream set with a new stream for each new tablet
* co-locate the CDC tablets with the base tablets
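
A minimal sketch of the stream-per-tablet idea, with hypothetical,
simplified types (real stream identifiers encode more than a token):

```
#include <cstdint>
#include <vector>

using token = int64_t;
struct tablet { token last_token; };
struct cdc_stream { token routing_token; uint64_t active_from_ts; };

// Build one stream per tablet, routed by a token inside the tablet's range
// so the CDC log write lands with the co-located base tablet. On
// split/merge, the new set is created with a timestamp sufficiently far in
// the future, mirroring the old vnode scheme.
std::vector<cdc_stream> make_streams(const std::vector<tablet>& tablets,
                                     uint64_t switch_timestamp) {
    std::vector<cdc_stream> streams;
    streams.reserve(tablets.size());
    for (const auto& t : tablets) {
        streams.push_back({t.last_token, switch_timestamp});
    }
    return streams;
}
```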

Fixes https://github.com/scylladb/scylladb/issues/22576

backport not needed - new feature

update dtests: https://github.com/scylladb/scylla-dtest/pull/5897
update java cdc library: https://github.com/scylladb/scylla-cdc-java/pull/102
update rust cdc library: https://github.com/scylladb/scylla-cdc-rust/pull/136

Closes scylladb/scylladb#23795

* github.com:scylladb/scylladb:
  docs/dev: update CDC dev docs for tablets
  doc: update CDC docs for tablets
  test: cluster_events: enable add_cdc and drop_cdc
  test/cql: enable cql cdc tests to run with tablets
  test: test_cdc_with_alter: adjust for cdc with tablets
  test/cqlpy: adjust cdc tests for tablets
  test/cluster/test_cdc_with_tablets: introduce cdc with tablets tests
  cdc: enable cdc with tablets
  topology coordinator: change streams on tablet split/merge
  cdc: virtual tables for cdc with tablets
  cdc: generate_stream_diff helper function
  cdc: choose stream in tablets enabled keyspaces
  cdc: rename get_stream to get_vnode_stream
  cdc: load tablet streams metadata from tables
  cdc: helper functions for reading metadata from tables
  cdc: colocate cdc table with base
  cdc: remove streams when dropping CDC table
  cdc: create streams when allocating tablets
  migration_listener: add on_before_allocate_tablet_map notification
  cdc: notify when creating or dropping cdc table
  cdc: move cdc table creation to pre_create
  cdc: add internal tables for cdc with tablets
  cdc: add cdc_with_tablets feature flag
  cdc: add is_log_schema helper
2025-09-18 13:39:37 +02:00
Wojciech Mitros
f17beba834 load_balancer: include dead nodes when calculating rack load
The load balancer aims to preserve a balance in rack loads when
generating tablet migrations. However, this balance might get broken
when dead nodes are present. Currently, these nodes aren't included in
rack load calculations, even if they own tablet replicas. As a result,
the load balancer treats racks with dead nodes as racks with a lower
load, so it generates migrations to these racks.

This is incorrect, because a dead node might come back alive, which
would result in having multiple tablet replicas on the same rack. It's
also inefficient even if we know that the node won't come back - when
it's being replaced or removed. In that case we know we are going to
rebuild the lost tablet replicas, so migrating tablets to this rack
just doubles the work. Allowing such migrations would also require
adjustments in the materialized view pairing code, because we'd
temporarily allow having multiple tablet replicas on the same rack.

So in this patch we include dead nodes when calculating rack loads in
the load balancer. The dead nodes still aren't treated as potential
migration sources or destinations.

We also add a test which verifies that no migrations are performed, by
doing a node replace with an mv workload in parallel. Before the patch
we'd get pairing errors; after the patch, no pairing errors are
detected.
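
A minimal sketch of the accounting change, with hypothetical,
simplified types:

```
#include <cstdint>
#include <vector>

struct node {
    uint64_t tablet_replicas;
    bool alive;
};

// Rack load now counts replicas on dead nodes too: a dead node may come
// back, and if it's being replaced or removed its replicas will be rebuilt
// in the same rack anyway. Dead nodes are still excluded as migration
// sources and destinations elsewhere in the balancer.
uint64_t rack_load(const std::vector<node>& rack_nodes) {
    uint64_t load = 0;
    for (const auto& n : rack_nodes) {
        load += n.tablet_replicas; // regardless of n.alive
    }
    return load;
}
```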

Fixes https://github.com/scylladb/scylladb/issues/24485

Closes scylladb/scylladb#26028
2025-09-17 20:49:18 +02:00
Michael Litvak
7f2cd06bdc migration_listener: add on_before_allocate_tablet_map notification
Add a new notification, on_before_allocate_tablet_map, that is called
when creating a tablet map for a new table and passes the tablet map.

This will be useful for CDC, for example: when creating tablets for
a new table we want to create CDC streams for each tablet in the same
operation, and we need the tablet map with the tablet count and
tokens for each tablet, because the CDC streams are based on them.

We need to slightly change the tablet allocation code for this to work
with colocated tables: previously, when we created the tablet map
of a colocated table, we didn't have a reference to the base tablet map,
but now we need it so we can pass it to the notification.
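
A sketch of the hook's shape (assumed signature; the real interface is
the migration_listener in the source tree):

```
// Hypothetical, simplified shapes. The notification fires while creating a
// new table's tablet map, handing consumers (e.g. CDC) the map with its
// final tablet count and token boundaries.
struct tablet_map; // per-table tablet count, tokens, replicas
struct schema;

class migration_listener {
public:
    virtual ~migration_listener() = default;
    virtual void on_before_allocate_tablet_map(const schema& table,
                                               const tablet_map& map) = 0;
};
```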
2025-09-17 14:47:11 +02:00
Asias He
f9021777d8 compaction: Add tablet incremental repair support
This patch adds incremental_repair support in compaction.

- The sstables are split into a repaired and an unrepaired set.

- The repaired and unrepaired sets compact separately.

- The repaired_at from the sstable and sstables_repaired_at from the
  system.tablets table are used to decide whether an sstable is repaired
  or not.

- Different compaction tasks, e.g., minor, major, scrub, split, are
  serialized with tablet repair.
2025-08-18 11:01:21 +08:00
Asias He
b226ad2f11 tablet_allocator: Add tablet_force_tablet_count_increase and decrease
It is useful to be able to increase and decrease the tablet count in
tests, for tablet split and merge testing.
2025-08-11 10:10:08 +08:00
Benny Halevy
3feb759943 everywhere: use utils::chunked_vector for list of mutations
Currently, we use std::vector<*mutation> to keep
a list of mutations for processing.
This can lead to large allocations, e.g. when the vector
size is a function of the number of tables.

Use a chunked vector instead to prevent oversized allocations.
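
A minimal sketch of the chunked-vector idea (the real type is
utils::chunked_vector in the tree; details here are simplified):

```
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Storage is split into fixed-size chunks, so no single allocation grows
// with the element count and none exceeds the 128k basic allocation unit.
template <typename T, std::size_t MaxChunkBytes = 128 * 1024>
class chunked_vector_sketch {
    static constexpr std::size_t chunk_elems =
        std::max<std::size_t>(1, MaxChunkBytes / sizeof(T));
    std::vector<std::vector<T>> _chunks;
    std::size_t _size = 0;
public:
    void push_back(T v) {
        if (_chunks.empty() || _chunks.back().size() == chunk_elems) {
            _chunks.emplace_back();
            _chunks.back().reserve(chunk_elems);
        }
        _chunks.back().push_back(std::move(v));
        ++_size;
    }
    T& operator[](std::size_t i) { return _chunks[i / chunk_elems][i % chunk_elems]; }
    std::size_t size() const { return _size; }
};
```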

`perf-simple-query --smp 1` results obtained for fixed 400MHz frequency
and PGO disabled:

Before (read path):
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...

89055.97 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39417 insns/op,   18003 cycles/op,        0 errors)
103372.72 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39380 insns/op,   17300 cycles/op,        0 errors)
98942.27 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39413 insns/op,   17336 cycles/op,        0 errors)
103752.93 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39407 insns/op,   17252 cycles/op,        0 errors)
102516.77 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39403 insns/op,   17288 cycles/op,        0 errors)
throughput:
	mean=   99528.13 standard-deviation=6155.71
	median= 102516.77 median-absolute-deviation=3844.59
	maximum=103752.93 minimum=89055.97
instructions_per_op:
	mean=   39403.99 standard-deviation=14.25
	median= 39406.75 median-absolute-deviation=9.30
	maximum=39416.63 minimum=39380.39
cpu_cycles_per_op:
	mean=   17435.81 standard-deviation=318.24
	median= 17300.40 median-absolute-deviation=147.59
	maximum=18002.53 minimum=17251.75
```

After (read path)
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
59755.04 tps ( 66.2 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39466 insns/op,   22834 cycles/op,        0 errors)
71854.16 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39417 insns/op,   17883 cycles/op,        0 errors)
82149.45 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39411 insns/op,   17409 cycles/op,        0 errors)
49640.04 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.3 tasks/op,   39474 insns/op,   19975 cycles/op,        0 errors)
54963.22 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.3 tasks/op,   39474 insns/op,   18235 cycles/op,        0 errors)
throughput:
	mean=   63672.38 standard-deviation=13195.12
	median= 59755.04 median-absolute-deviation=8709.16
	maximum=82149.45 minimum=49640.04
instructions_per_op:
	mean=   39448.38 standard-deviation=31.60
	median= 39466.17 median-absolute-deviation=25.75
	maximum=39474.12 minimum=39411.42
cpu_cycles_per_op:
	mean=   19267.01 standard-deviation=2217.03
	median= 18234.80 median-absolute-deviation=1384.25
	maximum=22834.26 minimum=17408.67
```

`perf-simple-query --smp 1 --write` results obtained for fixed 400MHz frequency
and PGO disabled:

Before (write path):
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no}
Disabling auto compaction
63736.96 tps ( 59.4 allocs/op,  16.4 logallocs/op,  14.3 tasks/op,   49667 insns/op,   19924 cycles/op,        0 errors)
64109.41 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   49992 insns/op,   20084 cycles/op,        0 errors)
56950.47 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50005 insns/op,   20501 cycles/op,        0 errors)
44858.42 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50014 insns/op,   21947 cycles/op,        0 errors)
28592.87 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50027 insns/op,   27659 cycles/op,        0 errors)
throughput:
	mean=   51649.63 standard-deviation=15059.74
	median= 56950.47 median-absolute-deviation=12087.33
	maximum=64109.41 minimum=28592.87
instructions_per_op:
	mean=   49941.18 standard-deviation=153.76
	median= 50005.24 median-absolute-deviation=73.01
	maximum=50027.07 minimum=49667.05
cpu_cycles_per_op:
	mean=   22023.01 standard-deviation=3249.92
	median= 20500.74 median-absolute-deviation=1938.76
	maximum=27658.75 minimum=19924.32
```

After (write path):
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no}
Disabling auto compaction
53395.93 tps ( 59.4 allocs/op,  16.5 logallocs/op,  14.3 tasks/op,   50326 insns/op,   21252 cycles/op,        0 errors)
46527.83 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50704 insns/op,   21555 cycles/op,        0 errors)
55846.30 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50731 insns/op,   21060 cycles/op,        0 errors)
55669.30 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50735 insns/op,   21521 cycles/op,        0 errors)
52130.17 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50757 insns/op,   21334 cycles/op,        0 errors)
throughput:
	mean=   52713.91 standard-deviation=3795.38
	median= 53395.93 median-absolute-deviation=2955.40
	maximum=55846.30 minimum=46527.83
instructions_per_op:
	mean=   50650.57 standard-deviation=182.46
	median= 50731.38 median-absolute-deviation=84.09
	maximum=50756.62 minimum=50325.87
cpu_cycles_per_op:
	mean=   21344.42 standard-deviation=202.86
	median= 21334.00 median-absolute-deviation=176.37
	maximum=21554.61 minimum=21060.24
```

Fixes #24815

Improvement for rare corner cases. No backport required

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#24919
2025-07-13 19:13:11 +03:00
Avi Kivity
0138afa63b service: tablet_allocator: avoid large contiguous vector in make_repair_plan()
make_repair_plan() allocates a temporary vector which can grow larger
than our 128k basic allocation unit. Use a chunked vector to avoid
stalls due to large allocations.

Fixes #24713.

Closes scylladb/scylladb#24801
2025-07-09 12:50:02 +02:00
Michael Litvak
018b61f658 tablets: allocator: create co-located tables in a single operation
Co-located base and child tables may be created together in a single
operation. The tablet allocator in this case needs to handle them
together and not each table independently, because we need to have the
base schema and tablet map when creating the child tablet map.

We do this by registering the tablet allocator with the migration
notification on_before_create_column_families, which announces multiple
new tables. There we allocate tablets for all the new base tables, and
for the new child tables we create their maps from their base tables,
which are either new tables or existing ones.
2025-07-01 13:20:19 +03:00
Michael Litvak
ddf02c9489 tablets: replace all_tables method
The method all_tables in tablet_metadata is used for iterating over all
tables in the tablet metadata with their tablet maps.

Now that we have co-located tables, we need to distinguish which
tables we want to iterate over. In some cases we want to iterate
over each group of co-located tables, treating them as one unit; in
other cases we want to iterate over all tables, regardless of whether
they are part of a co-located group and have a base table.

We replace all_tables with new methods that can be used for each of the
cases.
2025-07-01 13:20:18 +03:00
Michael Litvak
255ca569e3 tablets: split when all co-located tablets are ready
For a group of co-located tablets, they must be split together
atomically, so finalize tablet split only when all tablets in the group
are ready.
2025-07-01 13:20:18 +03:00
Michael Litvak
0dcb9f2ed6 tablets: load balancer: sizing plan for table groups
We update the sizing plan to work with table groups instead of single
tables, using the base table as a representative of a table group.

The resize decision is made based on the combined per-table tablet
hints, and considering the size of all tables in the group. We calculate
the average tablet size of all tablets in the group and compare it with
the target tablet size.

The target tablet size is changed to be some function of the group size,
because we may want to have a lower target tablet size when we have
multiple co-located tablets, in order to reduce the migration size.
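
A sketch of the group-level sizing arithmetic, with hypothetical,
simplified inputs (co-located tables share one tablet map, hence one
tablet count):

```
#include <cstdint>
#include <vector>

struct table_size { uint64_t total_bytes; };

// Average tablet size across the whole group: combined data size of all
// co-located tables divided by the shared tablet count. The resize
// decision (split/merge) compares this against the target tablet size.
uint64_t group_avg_tablet_size(const std::vector<table_size>& group,
                               uint64_t tablet_count) {
    uint64_t total = 0;
    for (const auto& t : group) {
        total += t.total_bytes;
    }
    return tablet_count ? total / tablet_count : 0;
}
```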
2025-07-01 13:20:18 +03:00
Michael Litvak
ac5f4da905 tablets: load balancer: handle co-located tablets
Tablets of co-located tables are always co-located and migrated
together, so they are considered as an atomic unit for the tablets load
balancer.

We change the load balancer to work with table groups as migration
candidates instead of single tables, using the base table of a group as
a representative of the group.

For the purpose of load calculations, a group of co-located tablets is
considered like a single tablet, because its combined target tablet
size is the same as a single tablet's.
2025-07-01 13:20:18 +03:00
Michael Litvak
3db8f6fd37 tablets: allocate co-located tablets
When allocating tablets for a new table, add the option to create a
co-located tablet map with an existing base table.

The co-located tablet map is created with the base_table value set.
2025-07-01 13:20:18 +03:00
Aleksandra Martyniuk
1f4edd8683 test_tablet_tasks: use injection to revoke resize
Currently, test_tablet_resize_revoked tries to trigger split revoke
by deleting some rows. This method isn't deterministic, so the test
is flaky.

Use error injection to trigger resize revoke.

Fixes: #22570.

Closes scylladb/scylladb#23966
2025-04-30 07:04:57 +03:00
Avi Kivity
9559e53f55 Merge 'Adjust tablet-mon.py for capacity-aware load balancing' from Tomasz Grabiec
After the load balancer was made capacity-aware, it no longer equalizes tablet count per shard, but rather the utilization of each shard's storage. This makes the old presentation mode not useful in assessing whether balance was reached, since nodes with less capacity will get fewer tablets when in a balanced state. This PR adds a new default presentation mode which scales tablet size by its storage utilization, so that tablets which have equal shard utilization take equal space on the graph.

To facilitate that, a new virtual table was added: system.load_per_node, which allows the tool to learn about load balancer's view on per-node capacity. It can also serve as a debugging interface to get a view of current balance according to the load-balancer.

Closes scylladb/scylladb#23584

* github.com:scylladb/scylladb:
  tablet-mon.py: Add presentation mode which scales tablet size by its storage utilization
  tablet-mon.py: Center tablet id text properly in the vertical axis
  tablet-mon.py: Show migration stage tag in table mode only when migrating
  virtual-tables: Introduce system.load_per_node
  virtual_tables: memtable_filling_virtual_table: Propagate permit to execute()
  docs: virtual-tables: Fix instructions
  service: tablets: Keep load_stats inside tablet_allocator
2025-04-10 14:59:08 +03:00
Tomasz Grabiec
76bc11c78c service: tablets: Keep load_stats inside tablet_allocator
So that virtual tables can pick them up.

It's a better place to keep them than in topology_coordinator.
2025-04-09 20:21:51 +02:00
Aleksandra Martyniuk
acd32b24d3 locator: service: move to rebuild_v2 transition if cluster is upgraded
If the cluster is upgraded to a version containing the rebuild_v2
transition kind, move to this transition kind instead of rebuild.
2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk
4a847df55c locator: service: add rebuild_repair tablet transition stage
Currently, in the streaming stage of rebuild tablet transition,
we stream tablet data from all replicas.
This patch series splits the streaming stage into two phases:
- repair phase, where we repair the tablet;
- streaming phase, where we stream tablet data from one replica.

rebuild_repair is a stage that will be used to perform the repair
phase. It executes the tablet repair on tablet_info::replicas.
A primary replica out of migration_streraming_info::read_from is
the repair master. If the repair succeeds, we move to streaming
tablet transition stage, and to cleanup_target - if it fails.

The repair bypasses the tablet repair scheduler and it does not update
the repair_time.

A transition to the rebuild_repair stage will be added in the following
patches.
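
A sketch of the resulting stage sequence, with hypothetical,
simplified names:

```
// Simplified stage flow for rebuild_v2: a repair phase (tablet repair
// across tablet_info::replicas, mastered by a primary replica from
// migration_streaming_info::read_from) followed by streaming from a
// single replica; on repair failure we go to cleanup_target.
enum class transition_stage {
    rebuild_repair, // repair phase; bypasses the tablet repair scheduler
    streaming,      // stream tablet data from one replica
    cleanup_target, // entered if the repair fails
};

transition_stage after_repair(bool repair_succeeded) {
    return repair_succeeded ? transition_stage::streaming
                            : transition_stage::cleanup_target;
}
```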
2025-04-08 10:42:02 +02:00
Botond Dénes
1198213000 Merge 'tablets: Make tablet allocation equalize per-shard load ' from Tomasz Grabiec
Before, it was equalizing per-node load (tablet count), which is wrong
in heterogeneous clusters. Nodes with fewer shards will end up with
overloaded shards.

Refs #23378

Closes scylladb/scylladb#23478

* github.com:scylladb/scylladb:
  tablets: Make tablet allocation equalize per-shard load
  tablets: load_balancer: Fix reporting of total load per node
2025-04-03 16:32:53 +03:00
Tomasz Grabiec
d6232a4f5f tablets: load_balancer: Fix reporting of total load per node
Load is now utilization, not count, so we should report the average
per-shard load, which is equivalent to the node's utilization.
2025-03-27 23:28:20 +01:00
Lakshmi Narayanan Sreethar
8cabc66f07 load_balancer: make repair plans only when there is no pending resize finalization
Do not make repair plans if any table has pending resize finalization.
This is to ensure that the finalization doesn't get delayed by repair
tasks.

Refs #21762

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-03-27 10:16:34 +05:30
Raphael S. Carvalho
e9944f0b7c service: Introduce rack-aware co-location migrations for tablet merge
Merge co-location can emit migrations across racks even when RF=#racks,
reducing availability and affecting consistency of base-view pairing.

Given replica set of sibling tablets T0 and T1 below:
[T0: (rack1,rack3,rack2)]
[T1: (rack2,rack1,rack3)]

Merge will co-locate T1:rack2 into T0:rack1, so T1 will temporarily be
present at only a subset of racks, reducing availability.

This is the main problem fixed by this patch.

It also lays the ground for consistent base-view replica pairing,
which is rack-based. For tables on which views can be created we plan
to enforce the constraint that replicas don't move across racks and
that all tablets use the same set of racks (RF=#racks). This patch
avoids moving replicas across racks unless it's necessary, so if the
constraint is satisfied before merge, there will be no co-locating
migrations across racks. This constraint of RF=#racks is not enforced
yet, it requires more extensive changes.

Fixes #22994.
Refs #17265.

This patch is based on Raphael's work done in PR #23081. The main differences are:

1) Instead of sorting replicas by rack, we try to find
    replicas in sibling tablets which belong to the same rack
    (see the sketch after this list).
    This is similar to how we match replicas within the same host.
    It reduces the number of across-rack migrations even if RF!=#racks,
    which the original patch didn't handle.
    Unlike the original patch, it also avoids rack overload in case
    RF!=#racks.

2) We emit across-rack co-locating migrations if we have no other choice
   in order to finalize the merge

   This is ok, since views are not supported with tablets yet. Later,
   we will disallow this for tables which have views, and we will
   allow creating views in the first place only when no such migrations
   can happen (RF=#racks).

3) Added boost unit test which checks that rack overload is avoided during merge
   in case RF<#racks

4) Moved logging of across-rack migration to debug level

5) Exposed metric for across-rack co-locating migrations
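
A minimal sketch of the rack-matching in point 1, with hypothetical,
simplified types:

```
#include <optional>
#include <string>
#include <vector>

struct replica { std::string host; std::string rack; };

// For a replica of sibling tablet T1, prefer a co-location target among
// T0's replicas in the same rack, so the merged tablet doesn't move a
// replica across racks (which would temporarily shrink the set of racks
// holding T1). If there is no match, the caller may still emit an
// across-rack migration when the merge cannot otherwise be finalized.
std::optional<replica> pick_colocation_target(const replica& t1_replica,
                                              const std::vector<replica>& t0_replicas) {
    for (const auto& r : t0_replicas) {
        if (r.rack == t1_replica.rack) {
            return r;
        }
    }
    return std::nullopt;
}
```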

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>

Closes scylladb/scylladb#23247
2025-03-16 22:45:00 +02:00
Tomasz Grabiec
c4714180cc tablets: Make load balancing capacity-aware
Before this patch the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
 1) tablets have the same size
 2) shards have the same capacity

That can cause imbalance of utilization if shards have different
capacities, which can happen in heterogeneous clusters with different
instance types. One of the causes for capacity differences is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This leaves those shards with more disk capacity and
more CPU power.

After this patch, the load balancer equalizes shards' storage
utilization, so it no longer assumes that shards have the same
capacity. It still assumes that each tablet has equal size, so it's a
middle step towards full size-aware balancing.

One consequence is that to be able to balance, the load balancer needs
to know every node's capacity, which is collected with the same
RPC that collects load_stats for average tablet size. This is not a
significant setback because migrations cannot proceed anyway if nodes
are down, due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless due
to the above, so it's not implemented.
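
A sketch of the balanced quantity, with hypothetical, simplified types
(capacity and average tablet size come from the load_stats RPC
mentioned above):

```
#include <cstdint>

struct shard_load {
    uint64_t tablet_count;
    uint64_t capacity_bytes;
};

// Utilization, not tablet count, is equalized. The "equal tablet size"
// assumption remains: size is approximated by the cluster-wide average.
double shard_utilization(const shard_load& s, uint64_t avg_tablet_size) {
    return s.capacity_bytes
        ? double(s.tablet_count) * double(avg_tablet_size) / double(s.capacity_bytes)
        : 0.0;
}
```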
2025-03-06 13:35:38 +01:00
Tomasz Grabiec
1a7023c85a config, tablets: Allow tablets_initial_scale_factor to be a fraction
We may want fewer than 1 tablet per shard in large clusters.

The per-table option is a fraction, so for consistency, this should be
too.
2025-02-19 16:29:08 +01:00
Tomasz Grabiec
7e4a61953d tablets: load_balancer: Pick initial_scale_factor from config
So that it can be live-updated.
2025-02-19 16:29:08 +01:00
Tomasz Grabiec
41789962ef tablets, load_balancer: Fix and improve logging of resize decisions
Resize is no longer only due to avg tablet size. Log avg tablet size as an
information, not the reason, and log the true reason for target tablet
count.
2025-02-19 16:29:07 +01:00
Tomasz Grabiec
d1ccbee7f9 tablets, load_balancer: Log reason for target tablet count
Helps in debugging.
2025-02-19 16:29:07 +01:00
Tomasz Grabiec
029505b179 tablets: load_balancer: Move hints processing to tablet scheduler
Hints have common meaning for all strategies, so the logic
belongs more to make_sizing_plan().

As a side effect, we can reuse shard capacity computation across
tables, which reduces computational complexity from O(tables*nodes) to
O(tables * DCs + nodes)
2025-02-19 16:29:07 +01:00
Tomasz Grabiec
f1bda8d4c1 tablets: load_balancer: Scale down tablet count to respect per-shard tablet count goal
The limit is enforced by controlling average per-shard tablet replica
count in a given DC, which is controlled by per-table tablet
count. This is effective in respecting the limit on individual shards
as long as tablet replicas are distributed evenly between shards.

There is no attempt to move tablets around in order to enforce limits
on individual shards in case of imbalance between shards.

If the average per-shard tablet count exceeds the limit, all tables
which contribute to it (have replicas in the DC) are scaled down
by the same factor. Due to rounding up to the nearest power of 2,
we may overshoot the per-shard goal by at most a factor of 2.

If different DCs want different scale factors of a given table, the
lowest scale factor is chosen for a given table.
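
A sketch of the scale-down arithmetic, with a hypothetical helper (the
power-of-2 rounding is why the goal may be overshot by at most 2x):

```
#include <algorithm>
#include <bit>
#include <cstdint>

// If the DC's average per-shard tablet replica count exceeds the goal,
// scale the table's tablet count down by goal/load and round up to the
// nearest power of 2.
uint64_t scaled_tablet_count(uint64_t current_count,
                             double per_shard_load,
                             double per_shard_goal) {
    if (per_shard_load <= per_shard_goal || current_count <= 1) {
        return current_count;
    }
    double scale = per_shard_goal / per_shard_load; // < 1
    auto target = uint64_t(double(current_count) * scale);
    return std::bit_ceil(std::max<uint64_t>(target, 1));
}
```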

The limit is configurable. It's a global per-cluster config which
controls how many tablet replicas per shard in total we consider to be
still ok. It controls tablet allocator behavior, when choosing initial
tablet count. Even though it's a per-node config, we don't support
different limits per node. All nodes must have the same value of that
config. It's similar in that regard to other scheduler config items
like tablets_initial_scale_factor and target_tablet_size_in_bytes.
2025-02-19 16:29:07 +01:00
Tomasz Grabiec
94b5165ac7 tablets: Use scheduler's make_sizing_plan() to decide about tablet count of a new table
This makes decisions made by the scheduler consistent with decisions
made on table creation, with regard to tablet count.

We want to avoid over-allocation of tablets when a table is created,
which would then be reduced by the scheduler's scaling logic. Not just
to avoid wasteful migrations post table creation, but to respect the
per-shard goal. To respect the per-shard goal, the algorithm will no
longer be as simple as looking at hints, and we want to share the
algorithm between the scheduler and the initial tablet allocator. So
invoke the scheduler to get the tablet count when a table is created.
2025-02-19 14:40:07 +01:00
Tomasz Grabiec
dd68c1e526 tablets: load_balancer: Determine desired count from size separately from count from options
For debugging purposes. Later we will want to know which rule
determined the count.
2025-02-19 14:40:07 +01:00
Tomasz Grabiec
e4c5e2ab55 tablets: load_balancer: Determine resize decision from target tablet count
The flow is simpler this way, since the decision cannot now be
mismatched with target tablet count.
2025-02-19 14:40:07 +01:00