docs: add strong consistency doc

Add a new docs/dev document for the strongly consistent tables feature.
For now, it only contains information about the Raft metadata persistence,
but it should be updated as more of the strong-consistency components are
added.
This commit is contained in:
Wojciech Mitros
2026-02-12 03:52:03 +01:00
parent 97dc88d6b6
commit d1ff8f1db3

View File

@@ -0,0 +1,69 @@
# Introduction
This document describes the implementation details and design choices for the
strongly-consistent tables feature
The feature is heavily based on the existing implementation of Raft in scylla, which
is described in [docs/dev/raft-in-scylla.md](raft-in-scylla.md).
# Raft metadata persistence
## Group0 persistence context
The Raft groups for strongly consistent tables differ from Raft group0 particularly
in the extend of where their Raft group members can be located. For group0, all
group members (Raft servers) are on shard 0. For groups for strongly consistent tablets,
the group members may be located on any shard. In the future, they will even be able
to move alongside their corresponding tablets.
That's why, when adding the Raft metadata persistence layer for strongly consistent tables,
we can't reuse the existing approach for group 0. Group0's persistence stores all Raft state
on shard 0. This approach can't be used for strongly consistent tables, because raft groups
for strongly consistent tables can occupy many different shards and their metadata may be
updated often. Storing all data on a single shard would at the same time make this shard
a bottleneck and it would require performing cross-shard operations for most strongly
consistent writes, which would also diminish their performance on its own.
Instead, we want to store the metadata for a Raft group on the same shard where this group's
server is located, avoiding any cross-shard operations and evenly distributing the work
related to writing metadata to all shards.
## Strongly consistent table persistence
We introduce a separate set of Raft system tables for strongly consistent tablets:
- `system.raft_groups`
- `system.raft_groups_snapshots`
- `system.raft_groups_snapshot_config`
These tables mirror the logical contents of the existing `system.raft`, `system.raft_snapshots`,
`system.raft_snapshot_config` tables, but their partition key is a composite `(shard, group_id)`
rather than just `group_id`.
To make “(shard, group_id) belongs to shard X” true at the storage layer, we use:
- a dedicated partitioner (`service::strong_consistency::raft_groups_partitioner`)
which encodes the shard into the token, and
- a dedicated sharder (`service::strong_consistency::raft_groups_sharder`) which extracts
that shard from the token.
As a result, reads and writes for a given groups persistence are routed to the same shard
where the Raft server instance runs.
## Token encoding
The partitioner encodes the destination shard in the tokens high bits:
- token layout: `[shard: 16 bits][group_id_hash: 48 bits]`
- the shard value is constrained to fit the `smallint` column used in the schema.
it also needs to be non-negative, so it's effectively limited to range `[0, 32767]`
- the lower 48 bits are derived by hashing the `group_id` (timeuuid)
The key property is that shard extraction is a pure bit operation and does not depend on
the clusters shard count.
## No direct migration support
`raft_groups_sharder::shard_for_writes()` returns up to one shard - it does not support
migrations using double writes. Instead, for a given Raft group, when a tablet is migrated,
the Raft metadata needs to be erased from the former location and added in the new location.