Files

Tomasz Grabiec 7e2875d648 Merge 'Add tablet merge support' from Raphael Raph Carvalho

The goal of merge is to reduce the tablet count for a shrinking table, similar to how split increases the count while the table is growing. The load balancer's decision to merge was already implemented (it came with the infrastructure introduced for split), but it wasn't handled until now.

The initial tablet count is respected while the table is in "growing mode". The table leaves this mode when, for example, it needs to split above the initial tablet count. After the table leaves the mode, the average tablet size can be trusted to determine that the table is shrinking. A merge decision is emitted if the average tablet size is 50% of the target. Hysteresis is applied to avoid oscillation between splits and merges.
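
A minimal sketch of the decision rule described above, for illustration only; it is not the balancer's code. The function name, the `Resize` enum and the 2x split threshold are assumptions; only the 50% merge threshold and the growing-mode gate come from this description.

```python
from enum import Enum


class Resize(Enum):
    NONE = "none"
    SPLIT = "split"
    MERGE = "merge"


# Assumption: split fires once the average reaches 2x the target size.
# Only the 50% merge threshold is stated in the description.
SPLIT_FACTOR = 2.0
MERGE_FACTOR = 0.5


def resize_decision(avg_tablet_size, target_size, growing_mode):
    """Pick a resize decision for one table from its average tablet size."""
    if avg_tablet_size >= target_size * SPLIT_FACTOR:
        return Resize.SPLIT
    # While the table is in growing mode, a small average size is not
    # trusted as a sign of shrinking, so no merge is emitted.
    if not growing_mode and avg_tablet_size <= target_size * MERGE_FACTOR:
        return Resize.MERGE
    return Resize.NONE
```

With these thresholds, a merge at 50% of the target doubles the average to the target size, which is below the assumed split threshold, and a split at 2x halves it back to the target, which is above the merge threshold. That gap is what keeps the two decisions from oscillating.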

Similar to split, the decision to merge is recorded in the tablet map's resize_type field with the string "merge". This matters on coordinator failover, so that the new coordinator continues from where the old one left off.

Unlike split, the preparation phase during merge is not done by the replica (with split compactions), but rather by the coordinator, which co-locates sibling tablets on the same shard of the same node. Sibling tablets are tablets that cover contiguous token ranges and will become one after merge. The concept is based on the power-of-two constraint and token contiguity. For example, in a table with 4 tablets, tablets with ids 0 and 1 are siblings, and so are 2 and 3.
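
The sibling relation can be sketched as follows. This is an illustrative Python helper, not the `tablet_map::sibling_tablets()` introduced by this series; the signature and the error handling are assumptions.

```python
def sibling_tablets(tablet_id, tablet_count):
    """Return the (even, odd) pair of sibling ids that tablet_id belongs to."""
    # The tablet count is a power of two, so ids pair up as (0,1), (2,3), ...
    if tablet_count < 2 or tablet_count & (tablet_count - 1):
        raise ValueError("tablet_count must be a power of two and at least 2")
    if not 0 <= tablet_id < tablet_count:
        raise ValueError("tablet_id out of range")
    even = tablet_id & ~1  # clear the lowest bit to get the even sibling
    return even, even + 1
```

For the 4-tablet example above, `sibling_tablets(1, 4)` gives `(0, 1)` and `sibling_tablets(2, 4)` gives `(2, 3)`.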

The algorithm for co-locating sibling tablets is very simple. The balancer is responsible for it, and it emits migrations so that the "odd" tablet follows the "even" one. For example, tablet 1 will be migrated to where tablet 0 lives. Co-location has low priority: delaying a merge is acceptable, but delaying e.g. decommission or even regular load balancing is not ideal, as that can translate into temporary imbalance that impacts user activity. So co-location migrations happen only when there is no more important work to do.
While regular balancing has higher priority, it will not undo the co-location work done so far. It achieves that by treating co-located tablets as if they were already merged. The load inversion convergence check was adjusted so that the balancer understands when two tablets are being migrated instead of one, to avoid oscillations.
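
The "odd follows even" rule can be sketched like this. It is a simplified illustration: it assumes a single replica per tablet and ignores priorities and load, and the `Replica`, `Migration` and `colocation_migrations` names are hypothetical, not the balancer's types.

```python
from typing import NamedTuple


class Replica(NamedTuple):
    host: str
    shard: int


class Migration(NamedTuple):
    tablet_id: int
    src: Replica
    dst: Replica


def colocation_migrations(placement):
    """placement[i] is the replica of tablet i; returns the moves still needed."""
    migrations = []
    for even in range(0, len(placement) - 1, 2):
        odd = even + 1
        # The odd tablet moves to wherever its even sibling lives.
        if placement[odd] != placement[even]:
            migrations.append(Migration(odd, placement[odd], placement[even]))
    return migrations
```

An empty result means every sibling pair already shares a node and shard, which in this sketch corresponds to the co-location work for the table being complete.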

When the balancer completes co-location work for a table undergoing merge, it puts the id of the table into the resize_plan, which tells the topology coordinator that the table is ready for merge. With all sibling tablets co-located, the coordinator can resize the tablet map (reduce it by a factor of 2) and record the new map in group0. All the replicas react to it (on token metadata update) by merging the storage (memtable(s) + sstables) of sibling tablets into one.
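
The map resize step can be pictured with the sketch below. It models the tablet map as a plain list of (host, shard) tuples with one replica per tablet, which is a simplifying assumption; `merge_tablet_map` is a hypothetical name and does not reflect the actual coordinator code.

```python
def merge_tablet_map(placement):
    """Halve a tablet map: new tablet i replaces old siblings 2i and 2i+1."""
    count = len(placement)
    if count < 2 or count & (count - 1):
        raise ValueError("tablet count must be a power of two and at least 2")
    merged = []
    for even in range(0, count, 2):
        # Merge is only valid once both siblings live on the same node and shard.
        if placement[even] != placement[even + 1]:
            raise ValueError(f"tablets {even} and {even + 1} are not co-located")
        merged.append(placement[even])
    return merged
```

For example, a 4-tablet map `[("n1", 0), ("n1", 0), ("n2", 1), ("n2", 1)]` becomes the 2-tablet map `[("n1", 0), ("n2", 1)]`.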

Fixes #18181.

system test details:

test: https://github.com/pehala/scylla-cluster-tests/blob/tablets_split_merge/tablets_split_merge_test.py
yaml file: https://github.com/pehala/scylla-cluster-tests/blob/tablets_split_merge/test-cases/features/tablets/tablets-split-merge-test.yaml

instance type: i3.8xlarge
nodes: 3
target tablet size: 0.5G (scaled down by 10, to make it easier to trigger splits and merges)
description: multiple cycles of growing and shrinking the data set in order to trigger splits and merges.
data_set_size: ~100G
initial_tablets: 64, so it grew to 128 tablets on split, and back to 64 on merge.

latency of reads and writes that happened in parallel to split and merge:
```
$ for i in scylla-bench*; do cat $i | grep "Mode\|99th:\|99\.9th:"; done
Mode:			 write
  99.9th:	 3.145727ms
  99th:		 1.998847ms
  99.9th:	 3.145727ms
  99th:		 2.031615ms
Mode:			 read
  99.9th:	 3.145727ms
  99th:		 2.031615ms
  99.9th:	 3.145727ms
  99th:		 2.031615ms
Mode:			 write
  99.9th:	 3.047423ms
  99th:		 1.933311ms
  99.9th:	 3.047423ms
  99th:		 1.933311ms
Mode:			 read
  99.9th:	 3.145727ms
  99th:		 1.900543ms
  99.9th:	 3.145727ms
  99th:		 1.900543ms
Mode:			 write
  99.9th:	 5.079039ms
  99th:		 3.604479ms
  99.9th:	 35.389439ms
  99th:		 25.624575ms
Mode:			 write
  99.9th:	 3.047423ms
  99th:		 1.998847ms
  99.9th:	 3.047423ms
  99th:		 1.998847ms
Mode:			 read
  99.9th:	 3.080191ms
  99th:		 2.031615ms
  99.9th:	 3.112959ms
  99th:		 2.031615ms
```

Closes scylladb/scylladb#20572

* github.com:scylladb/scylladb:
  docs: Document tablet merging
  tests/boost: Add test to verify correctness of balancer decisions during merge
  tests/topology_experimental_raft: Add tablet merge test
  service: Handle exception when retrying split
  service: Co-locate sibling tablets for a table undergoing merge
  gms: Add cluster feature for tablet merge
  service: Make merge of resize plan commutative
  replica: Implement merging of compaction groups on merge completion
  replica: Handle tablet merge completion
  service: Implement tablet map resize for merge
  locator: Introduce merge_tablet_info()
  service: Rename topology::transition_state::tablet_split_finalization
  service: Respect initial_tablet_count if table is in growing mode
  service: Wire migration_tablet_set into the load balancer
  locator: Add tablet_map::sibling_tablets()
  service: Introduce sorted_replicas_for_tablet_load()
  locator/tablets: Extend tablet_replica equality comparator to three-way
  service: Introduce alias to per-table candidate map type
  service: Add replication constraint check variant for migration_tablet_set
  service: Add convergence check variant for migration_tablet_set
  service: Add migration helpers for migration_tablet_set
  service/tablet_allocator: Introduce migration_tablet_set
  service: Introduce migration_plan::add(migrations_vector)
  locator/tablets: Introduce tablet_map::for_each_sibling_tablets()
  locator/tablets: Introduce tablet_map::needs_merge()
  locator/tablets: Introduce resize_decision::initial_decision()
  locator/tablets: Fix return type of three-way comparison operators
  service: Extract update of node load on migrations
  service: Extract converge check for intra-node migration
  service: Extract erase of tablet replicas from candidate list
  scripts/tablet-mon: Allow visualization of tablet id

2024-12-06 18:06:20 +01:00

api_v2.md

docs: dev: correct a typo

2023-03-31 17:19:08 +03:00

backport.md

Typos: fix typos in documentation

2023-12-07 11:10:17 +02:00

building.md

Merge 'Fixes for docs/dev/building.md' from Kamil Braun

2023-02-26 19:27:33 +02:00

cdc.md

system_tables: Compute schema version automatically

2024-11-15 19:16:41 +01:00

code-coverage.md

Add code coverage documentation

2024-01-18 11:11:34 +02:00

commitlog-file-format.md

docs: Add entry on commitlog file format v4

2024-09-03 16:38:28 +00:00

compaction_controller.md

docs: dev: write mathematical expressions in LaTeX

2023-03-29 15:07:14 +03:00

compilation-time-analysis.md

doc/dev: add document about analyzing build time

2023-09-01 11:33:36 +03:00

cql3-type-mapping.md

…

cql-extensions-internal.md

…

debugging.md

docs: Extend debugging with info about exploring ELF notes

2024-08-05 09:49:52 +03:00

describe_schema.md

docs/dev: Document semantics of describing CDC tables

2024-10-31 11:25:19 +01:00

docker-hub.md

doc: remove outdated JMX references

2024-10-07 13:55:15 +03:00

hinted_handoff_design.md

docs: Update Hinted Handoff documentation

2024-04-28 01:22:59 +02:00

IDL.md

Typos: more/less then -> more/less than

2024-02-13 17:16:15 +02:00

isolation.md

docs: isolation.md: add section on RPC call isolation

2024-05-21 03:12:22 -04:00

logging.md

…

lua-type-mapping.md

…

maintainer.md

docs/dev/maintainer.md: clarify "Updating submodule references"

2024-09-05 13:57:32 +03:00

metrics.md

…

migrating-from-users-to-roles.md

…

modules.md

forward_service: rename to mapreduce_service

2024-07-03 19:29:47 +03:00

mvcc.md

doc: Introduce docs/dev/mvcc.md

2023-01-27 19:15:39 +01:00

object_storage.md

docs: promote object storage configuration to user-facing documentation

2024-10-22 18:26:19 +08:00

paged-queries.md

treewide: rename flat_mutation_reader_v2 to mutation_reader

2024-06-21 07:12:06 +03:00

parallel_aggregations.md

…

per-partition-rate-limit.md

Typos: fix typos in documentation

2023-12-07 11:10:17 +02:00

protocol-extensions.md

docs: fix misspellings

2024-01-26 13:14:21 +02:00

protocols.md

treewide: drop thrift support

2024-06-07 06:44:59 +08:00

raft-in-scylla.md

raft: clean up the documentation

2024-12-05 13:44:11 +01:00

reader-concurrency-semaphore.md

docs/dev/reader-concurrency-semaphore.md: fix formatting of diagnostics dump

2024-11-27 12:13:16 +03:00

README.md

docs: fix misspellings

2024-01-26 13:14:21 +02:00

redis.md

docs: fix typos in dev documents

2024-05-27 12:28:34 +03:00

repair_based_node_ops.md

docs/dev/repair_based_node_ops: better formatting

2023-05-25 08:31:43 +03:00

reverse-reads.md

reverse-reads.md: Drop legacy reverse format information

2024-08-13 10:07:12 +02:00

review-checklist.md

…

row_cache.md

doc: Introduce docs/dev/mvcc.md

2023-01-27 19:15:39 +01:00

row_level_repair.md

…

rust.md

rust: use depfile and Cargo.lock to avoid building rust when unnecessary

2023-01-12 14:44:11 +02:00

secondary_index.md

…

service_levels.md

docs/dev/service_levels: replace unspecified workload type with NULL

2024-09-24 11:43:29 +03:00

sstable-scylla-format.md

Merge 'sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions' from Tomasz Grabiec

2024-10-28 21:13:23 +02:00

sstables-directory-structure.md

…

system_keyspace.md

docs: Update system_keyspace.md for tablet repair related info

2024-11-20 09:42:41 +08:00

system_schema_keyspace.md

…

task_manager.md

docs: update task manager docs

2024-11-26 09:57:41 +01:00

testing.md

test: rename "cql-pytest" to "cqlpy"

2024-11-06 16:48:36 +02:00

timestamp-conflict-resolution.md

docs: use less slangy language

2024-03-13 13:33:37 +02:00

topology-over-raft.md

Merge 'Add tablet merge support' from Raphael Raph Carvalho

2024-12-06 18:06:20 +01:00

tracing.md

…

virtual-tables.md

…

README.md

Scylla developer documentation

This folder contains developer-oriented documentation concerning the ScyllaDB codebase. We also have a wiki, which contains additional developer-oriented documentation. There is currently no clear definition of what goes where, so when looking for something be sure to check both.

Seastar documentation can be found here.

User documentation can be found on docs.scylladb.com

For information on how to build Scylla and how to contribute, see HACKING.md and CONTRIBUTING.md.

Index

Module list and dependencies