Merge 'Balance tablets within nodes (intra-node migration)' from Tomasz Grabiec
This is needed to avoid severe imbalance between shards, which can
happen when a table grows and is split. Load can be equal across nodes
while still being skewed across shards, so inter-node migration cannot
fix the imbalance. Also, if RF=N then there is not even a possibility of
moving tablets between nodes to fix the imbalance.
The only way to bring the system to balance is to move tablets within the nodes.
The system is currently not prepared for intra-node migration. Request coordination
is host-based, while for intra-node migration it should (also) be shard-based.
The solution employed here is to keep the coordination between nodes as-is.
During intra-node migration, the storage_proxy-level coordinator is not aware of
the migration (there is no pending host). The replica-side request handler acts as a
second-level coordinator which routes requests to shards, similar to how
the first-level coordinator routes them to hosts.
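The two routing levels described above can be illustrated with a minimal
sketch. All names below (tablet_replica, hosts_to_contact, local_shards)
are hypothetical and exist only for this illustration; they are not the
actual storage_proxy or replica-side code.

```cpp
#include <set>
#include <vector>

// Hypothetical toy types; none of these are ScyllaDB's real classes.
using host_id = int;
using shard_id = unsigned;

struct tablet_replica {
    host_id host;
    shard_id shard;
};

// First level: the coordinator only looks at distinct hosts.
// During intra-node migration both the leaving and the pending replica
// share a host, so the host set is unchanged and no pending host appears.
std::set<host_id> hosts_to_contact(const std::vector<tablet_replica>& replicas) {
    std::set<host_id> hosts;
    for (const auto& r : replicas) {
        hosts.insert(r.host);
    }
    return hosts;
}

// Second level: the replica-side handler selects the local shards
// which should receive the request on host `self`.
std::set<shard_id> local_shards(const std::vector<tablet_replica>& replicas, host_id self) {
    std::set<shard_id> shards;
    for (const auto& r : replicas) {
        if (r.host == self) {
            shards.insert(r.shard);
        }
    }
    return shards;
}
```

With replicas {host 1, shard 0}, {host 1, shard 3} and {host 2, shard 5},
the first level contacts hosts 1 and 2, and only host 1 fans the request
out to two shards.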
The tablet sharder is adjusted to handle intra-node migration, where a
tablet can have two replicas on the same host. For reads, the sharder
uses the read selector to resolve the conflict. For writes, the write
selector is used.
The old shard_of() API is kept to represent the shard for reads, and a
new method is introduced to query the shards for writing:
shard_for_writes(). All writers should be switched to that API, which
is not done in this patch yet.
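A minimal sketch of how the selectors could resolve the two-replica
conflict follows. The enum values, the intra_node_migration struct and
the vector return type are assumptions made for illustration; only the
names shard_for_reads() and shard_for_writes() come from this series,
and their real signatures may differ.

```cpp
#include <vector>

using shard_id = unsigned;

// Assumed selector values, modeled on the description above.
enum class read_selector { previous, next };
enum class write_selector { previous, both, next };

// Toy view of one tablet migrating between two shards of the same host.
struct intra_node_migration {
    shard_id leaving;   // shard which owns the tablet before migration
    shard_id pending;   // shard which owns the tablet after migration
    read_selector reads;
    write_selector writes;
};

// Reads always resolve to exactly one shard.
shard_id shard_for_reads(const intra_node_migration& m) {
    return m.reads == read_selector::previous ? m.leaving : m.pending;
}

// Writes may have to reach both shards while the migration is in flight.
std::vector<shard_id> shard_for_writes(const intra_node_migration& m) {
    switch (m.writes) {
    case write_selector::previous: return {m.leaving};
    case write_selector::both: return {m.leaving, m.pending};
    case write_selector::next: return {m.pending};
    }
    return {};
}
```

This also shows why a single-shard API is not sufficient for writers:
while the write selector is in the "both" state, a writer which asks
only for one shard would miss one of the two replicas.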
The request handler on the replica side acts as a second-level
coordinator, using the sharder to determine routing to shards. A given
sharder has the scope of a single topology version, a single
effective_replication_map_ptr, which should be kept alive during
writes.
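The keep-alive requirement can be sketched with plain shared ownership.
The replication_map struct and make_write() below are hypothetical
stand-ins, assuming effective_replication_map_ptr behaves like a
reference-counted pointer; the real code uses Seastar primitives rather
than std::shared_ptr and std::function.

```cpp
#include <functional>
#include <memory>
#include <utility>
#include <vector>

using shard_id = unsigned;

// Hypothetical stand-in for a replication map tied to one topology version.
struct replication_map {
    long topology_version;
    std::vector<shard_id> write_shards;
};
using replication_map_ptr = std::shared_ptr<const replication_map>;

// Builds a deferred write which owns a reference to the map it was
// routed with. The routing decision stays consistent with one topology
// version even if the node installs a newer map before the write runs.
std::function<std::vector<shard_id>()> make_write(replication_map_ptr erm) {
    return [erm = std::move(erm)] { return erm->write_shards; };
}
```

In this sketch the captured pointer is what keeps the old version alive:
replacing the node's current map does not invalidate a write which has
already been routed.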
perf-simple-query test results show no signs of regression:
Command (writes): perf-simple-query -c1 -m1G --write --tablets --duration=10
Before:
> 83294.81 tps ( 59.5 allocs/op, 14.3 tasks/op, 53725 insns/op, 0 errors)
> 87756.72 tps ( 59.5 allocs/op, 14.3 tasks/op, 54049 insns/op, 0 errors)
> 86428.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54208 insns/op, 0 errors)
> 86211.38 tps ( 59.7 allocs/op, 14.3 tasks/op, 54219 insns/op, 0 errors)
> 86559.89 tps ( 59.6 allocs/op, 14.3 tasks/op, 54188 insns/op, 0 errors)
> 86609.39 tps ( 59.6 allocs/op, 14.3 tasks/op, 54117 insns/op, 0 errors)
> 87464.06 tps ( 59.5 allocs/op, 14.3 tasks/op, 54039 insns/op, 0 errors)
> 86185.43 tps ( 59.6 allocs/op, 14.3 tasks/op, 54169 insns/op, 0 errors)
> 86254.71 tps ( 59.6 allocs/op, 14.3 tasks/op, 54139 insns/op, 0 errors)
> 83395.35 tps ( 60.2 allocs/op, 14.4 tasks/op, 54693 insns/op, 0 errors)
>
> median 86428.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54208 insns/op, 0 errors)
> median absolute deviation: 243.04
> maximum: 87756.72
> minimum: 83294.81
>
After:
> 85523.06 tps ( 59.5 allocs/op, 14.3 tasks/op, 53872 insns/op, 0 errors)
> 89362.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54226 insns/op, 0 errors)
> 88167.55 tps ( 59.7 allocs/op, 14.3 tasks/op, 54400 insns/op, 0 errors)
> 87044.40 tps ( 59.7 allocs/op, 14.3 tasks/op, 54310 insns/op, 0 errors)
> 88344.50 tps ( 59.6 allocs/op, 14.3 tasks/op, 54289 insns/op, 0 errors)
> 88355.06 tps ( 59.6 allocs/op, 14.3 tasks/op, 54242 insns/op, 0 errors)
> 88725.46 tps ( 59.6 allocs/op, 14.3 tasks/op, 54230 insns/op, 0 errors)
> 88640.08 tps ( 59.6 allocs/op, 14.3 tasks/op, 54210 insns/op, 0 errors)
> 90306.31 tps ( 59.4 allocs/op, 14.3 tasks/op, 54043 insns/op, 0 errors)
> 87343.62 tps ( 59.8 allocs/op, 14.3 tasks/op, 54496 insns/op, 0 errors)
>
> median 88355.06 tps ( 59.6 allocs/op, 14.3 tasks/op, 54242 insns/op, 0 errors)
> median absolute deviation: 1007.41
> maximum: 90306.31
> minimum: 85523.06
Command (reads): perf-simple-query -c1 -m1G --tablets --duration=10
Before:
> 95860.18 tps ( 63.1 allocs/op, 14.1 tasks/op, 42476 insns/op, 0 errors)
> 97537.69 tps ( 63.1 allocs/op, 14.1 tasks/op, 42454 insns/op, 0 errors)
> 97549.23 tps ( 63.1 allocs/op, 14.1 tasks/op, 42470 insns/op, 0 errors)
> 97511.29 tps ( 63.1 allocs/op, 14.1 tasks/op, 42470 insns/op, 0 errors)
> 97227.32 tps ( 63.1 allocs/op, 14.1 tasks/op, 42471 insns/op, 0 errors)
> 94031.94 tps ( 63.1 allocs/op, 14.1 tasks/op, 42441 insns/op, 0 errors)
> 96978.04 tps ( 63.1 allocs/op, 14.1 tasks/op, 42462 insns/op, 0 errors)
> 96401.70 tps ( 63.1 allocs/op, 14.1 tasks/op, 42473 insns/op, 0 errors)
> 96573.77 tps ( 63.1 allocs/op, 14.1 tasks/op, 42440 insns/op, 0 errors)
> 96340.54 tps ( 63.1 allocs/op, 14.1 tasks/op, 42468 insns/op, 0 errors)
>
> median 96978.04 tps ( 63.1 allocs/op, 14.1 tasks/op, 42462 insns/op, 0 errors)
> median absolute deviation: 571.20
> maximum: 97549.23
> minimum: 94031.94
>
After:
> 99794.67 tps ( 63.1 allocs/op, 14.1 tasks/op, 42471 insns/op, 0 errors)
> 101244.99 tps ( 63.1 allocs/op, 14.1 tasks/op, 42472 insns/op, 0 errors)
> 101128.37 tps ( 63.1 allocs/op, 14.1 tasks/op, 42485 insns/op, 0 errors)
> 101065.27 tps ( 63.1 allocs/op, 14.1 tasks/op, 42465 insns/op, 0 errors)
> 101212.98 tps ( 63.1 allocs/op, 14.1 tasks/op, 42456 insns/op, 0 errors)
> 101413.31 tps ( 63.1 allocs/op, 14.1 tasks/op, 42463 insns/op, 0 errors)
> 101464.92 tps ( 63.1 allocs/op, 14.1 tasks/op, 42466 insns/op, 0 errors)
> 101086.74 tps ( 63.1 allocs/op, 14.1 tasks/op, 42488 insns/op, 0 errors)
> 101559.09 tps ( 63.1 allocs/op, 14.1 tasks/op, 42468 insns/op, 0 errors)
> 100742.58 tps ( 63.1 allocs/op, 14.1 tasks/op, 42491 insns/op, 0 errors)
>
> median 101212.98 tps ( 63.1 allocs/op, 14.1 tasks/op, 42456 insns/op, 0 errors)
> median absolute deviation: 200.33
> maximum: 101559.09
> minimum: 99794.67
>
Fixes #16594
Closes scylladb/scylladb#18026
* github.com:scylladb/scylladb:
Implement fast streaming for intra-node migration
test: tablets_test: Test sharding during intra-node migration
test: tablets_test: Check sharding also on the pending host
test: py: tablets: Test writes concurrent with migration
test: py: tablets: Test crash during intra-node migration
api, storage_service: Introduce API to wait for topology to quiesce
dht, replica: Remove deprecated sharder APIs
test: Avoid using deprecated sharded API
db: do_apply_many() avoid deprecated sharded API
replica: mutation_dump: Avoid deprecated sharder API
repair: Avoid deprecated sharder API
table: Remove optimization which returns empty reader when key is not owned by the shard
dht: is_single_shard: Avoid deprecated sharder API
dht: split_range_to_single_shard: Work with static_sharder only
dht: ring_position_range_sharder: Avoid deprecated sharder APIs
dht: token: Avoid use of deprecated sharder API by switching to static_sharder
selective_token_sharder: Avoid use of deprecated sharder API
docs: Document tablet sharding vs tablet replica placement
readers/multishard.cc: use shard_for_reads() instead of shard_of()
multishard_mutation_query.cc: use shard_for_reads() instead of shard_of()
storage_proxy: Extract common code to apply mutations on many shards according to sharder
storage_proxy: Prepare per-partition rate-limiting for intra-node migration
storage_proxy: Avoid shard_of() use in mutate_counter_on_leader_and_replicate()
storage_proxy: Prepare mutate_hint() for intra-node tablet migration
commitlog_replayer: Avoid deprecated sharder::shard_of()
lwt: Avoid deprecated sharder::shard_of()
compaction: Avoid deprecated sharder::shard_of()
dht: Extract dht::static_sharder
replica: Deprecate table::shard_of()
locator: Deprecate effective_replication_map::shard_of()
dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard
tests: tablets: py: Add intra-node migration test
tests: tablets: Test that drained nodes are not balanced internally
tests: tablets: Add checks of replica set validity to test_load_balancing_with_random_load
tests: tablets: Verify that disabling balancing results in no intra-node migrations
tests: tablets: Check that nodes are internally balanced
tests: tablets: Improve debuggability by showing which rows are missing
tablets, storage_service: Support intra-node migration in move_tablet() API
tablet_allocator: Generate intra-node migration plan
tablet_allocator: Extract make_internode_plan()
tablet_allocator: Maintain candidate list and shard tablet count for target nodes
tablet_allocator: Lift apply_load/can_accept_load lambdas to member functions
tablets, streaming: Implement tablet streaming for intra-node migration
dht, auto_refreshing_sharder: Allow overriding write selector
multishard_writer: Handle intra-node migration
storage_proxy: Handle intra-node tablet migration for writes
tablets: Get rid of tablet_map::get_shard()
tablets: Avoid tablet_map::get_shard in cleanup
tablets: test: Use sharder instead of tablet_map::get_shard()
tablets: tablet_sharder: Allow working with non-local host
sharding: Prepare for intra-node-migration
docs: Document sharder use for tablets
tablets: Introduce tablet transition kind for intra-node migration
tests: tablets: Fix use-after-move of skiplist in rebalance_tablets()
sstables, gdb: Track readers in a linked list
raft topology: Fix global token metadata barrier to not fence ahead of what is drained