Add a new docs/dev document for the strongly consistent tables feature. For now, it only contains information about the Raft metadata persistence, but it should be updated as more of the strong-consistency components are added.
70 lines
3.2 KiB
Markdown
70 lines
3.2 KiB
Markdown
# Introduction
|
||
|
||
This document describes the implementation details and design choices for the
|
||
strongly-consistent tables feature
|
||
|
||
The feature is heavily based on the existing implementation of Raft in scylla, which
|
||
is described in [docs/dev/raft-in-scylla.md](raft-in-scylla.md).
|
||
|
||
# Raft metadata persistence
|
||
|
||
## Group0 persistence context
|
||
|
||
The Raft groups for strongly consistent tables differ from Raft group0 particularly
|
||
in the extend of where their Raft group members can be located. For group0, all
|
||
group members (Raft servers) are on shard 0. For groups for strongly consistent tablets,
|
||
the group members may be located on any shard. In the future, they will even be able
|
||
to move alongside their corresponding tablets.
|
||
|
||
That's why, when adding the Raft metadata persistence layer for strongly consistent tables,
|
||
we can't reuse the existing approach for group 0. Group0's persistence stores all Raft state
|
||
on shard 0. This approach can't be used for strongly consistent tables, because raft groups
|
||
for strongly consistent tables can occupy many different shards and their metadata may be
|
||
updated often. Storing all data on a single shard would at the same time make this shard
|
||
a bottleneck and it would require performing cross-shard operations for most strongly
|
||
consistent writes, which would also diminish their performance on its own.
|
||
|
||
Instead, we want to store the metadata for a Raft group on the same shard where this group's
|
||
server is located, avoiding any cross-shard operations and evenly distributing the work
|
||
related to writing metadata to all shards.
|
||
|
||
## Strongly consistent table persistence
|
||
|
||
We introduce a separate set of Raft system tables for strongly consistent tablets:
|
||
|
||
- `system.raft_groups`
|
||
- `system.raft_groups_snapshots`
|
||
- `system.raft_groups_snapshot_config`
|
||
|
||
These tables mirror the logical contents of the existing `system.raft`, `system.raft_snapshots`,
|
||
`system.raft_snapshot_config` tables, but their partition key is a composite `(shard, group_id)`
|
||
rather than just `group_id`.
|
||
|
||
To make “(shard, group_id) belongs to shard X” true at the storage layer, we use:
|
||
|
||
- a dedicated partitioner (`service::strong_consistency::raft_groups_partitioner`)
|
||
which encodes the shard into the token, and
|
||
- a dedicated sharder (`service::strong_consistency::raft_groups_sharder`) which extracts
|
||
that shard from the token.
|
||
|
||
As a result, reads and writes for a given group’s persistence are routed to the same shard
|
||
where the Raft server instance runs.
|
||
|
||
## Token encoding
|
||
|
||
The partitioner encodes the destination shard in the token’s high bits:
|
||
|
||
- token layout: `[shard: 16 bits][group_id_hash: 48 bits]`
|
||
- the shard value is constrained to fit the `smallint` column used in the schema.
|
||
it also needs to be non-negative, so it's effectively limited to range `[0, 32767]`
|
||
- the lower 48 bits are derived by hashing the `group_id` (timeuuid)
|
||
|
||
The key property is that shard extraction is a pure bit operation and does not depend on
|
||
the cluster’s shard count.
|
||
|
||
## No direct migration support
|
||
|
||
`raft_groups_sharder::shard_for_writes()` returns up to one shard - it does not support
|
||
migrations using double writes. Instead, for a given Raft group, when a tablet is migrated,
|
||
the Raft metadata needs to be erased from the former location and added in the new location.
|