docs: add strong consistency doc
Add a new docs/dev document for the strongly consistent tables feature. For now, it only contains information about the Raft metadata persistence, but it should be updated as more of the strong-consistency components are added.
This commit is contained in:
69
docs/dev/strong_consistency.md
Normal file
69
docs/dev/strong_consistency.md
Normal file
@@ -0,0 +1,69 @@
|
||||
# Introduction
|
||||
|
||||
This document describes the implementation details and design choices for the
|
||||
strongly-consistent tables feature
|
||||
|
||||
The feature is heavily based on the existing implementation of Raft in scylla, which
|
||||
is described in [docs/dev/raft-in-scylla.md](raft-in-scylla.md).
|
||||
|
||||
# Raft metadata persistence
|
||||
|
||||
## Group0 persistence context
|
||||
|
||||
The Raft groups for strongly consistent tables differ from Raft group0 particularly
|
||||
in the extend of where their Raft group members can be located. For group0, all
|
||||
group members (Raft servers) are on shard 0. For groups for strongly consistent tablets,
|
||||
the group members may be located on any shard. In the future, they will even be able
|
||||
to move alongside their corresponding tablets.
|
||||
|
||||
That's why, when adding the Raft metadata persistence layer for strongly consistent tables,
|
||||
we can't reuse the existing approach for group 0. Group0's persistence stores all Raft state
|
||||
on shard 0. This approach can't be used for strongly consistent tables, because raft groups
|
||||
for strongly consistent tables can occupy many different shards and their metadata may be
|
||||
updated often. Storing all data on a single shard would at the same time make this shard
|
||||
a bottleneck and it would require performing cross-shard operations for most strongly
|
||||
consistent writes, which would also diminish their performance on its own.
|
||||
|
||||
Instead, we want to store the metadata for a Raft group on the same shard where this group's
|
||||
server is located, avoiding any cross-shard operations and evenly distributing the work
|
||||
related to writing metadata to all shards.
|
||||
|
||||
## Strongly consistent table persistence
|
||||
|
||||
We introduce a separate set of Raft system tables for strongly consistent tablets:
|
||||
|
||||
- `system.raft_groups`
|
||||
- `system.raft_groups_snapshots`
|
||||
- `system.raft_groups_snapshot_config`
|
||||
|
||||
These tables mirror the logical contents of the existing `system.raft`, `system.raft_snapshots`,
|
||||
`system.raft_snapshot_config` tables, but their partition key is a composite `(shard, group_id)`
|
||||
rather than just `group_id`.
|
||||
|
||||
To make “(shard, group_id) belongs to shard X” true at the storage layer, we use:
|
||||
|
||||
- a dedicated partitioner (`service::strong_consistency::raft_groups_partitioner`)
|
||||
which encodes the shard into the token, and
|
||||
- a dedicated sharder (`service::strong_consistency::raft_groups_sharder`) which extracts
|
||||
that shard from the token.
|
||||
|
||||
As a result, reads and writes for a given group’s persistence are routed to the same shard
|
||||
where the Raft server instance runs.
|
||||
|
||||
## Token encoding
|
||||
|
||||
The partitioner encodes the destination shard in the token’s high bits:
|
||||
|
||||
- token layout: `[shard: 16 bits][group_id_hash: 48 bits]`
|
||||
- the shard value is constrained to fit the `smallint` column used in the schema.
|
||||
it also needs to be non-negative, so it's effectively limited to range `[0, 32767]`
|
||||
- the lower 48 bits are derived by hashing the `group_id` (timeuuid)
|
||||
|
||||
The key property is that shard extraction is a pure bit operation and does not depend on
|
||||
the cluster’s shard count.
|
||||
|
||||
## No direct migration support
|
||||
|
||||
`raft_groups_sharder::shard_for_writes()` returns up to one shard - it does not support
|
||||
migrations using double writes. Instead, for a given Raft group, when a tablet is migrated,
|
||||
the Raft metadata needs to be erased from the former location and added in the new location.
|
||||
Reference in New Issue
Block a user