diff --git a/docs/dev/strong_consistency.md b/docs/dev/strong_consistency.md new file mode 100644 index 0000000000..83445df992 --- /dev/null +++ b/docs/dev/strong_consistency.md @@ -0,0 +1,69 @@ +# Introduction + +This document describes the implementation details and design choices for the +strongly-consistent tables feature + +The feature is heavily based on the existing implementation of Raft in scylla, which +is described in [docs/dev/raft-in-scylla.md](raft-in-scylla.md). + +# Raft metadata persistence + +## Group0 persistence context + +The Raft groups for strongly consistent tables differ from Raft group0 particularly +in the extend of where their Raft group members can be located. For group0, all +group members (Raft servers) are on shard 0. For groups for strongly consistent tablets, +the group members may be located on any shard. In the future, they will even be able +to move alongside their corresponding tablets. + +That's why, when adding the Raft metadata persistence layer for strongly consistent tables, +we can't reuse the existing approach for group 0. Group0's persistence stores all Raft state +on shard 0. This approach can't be used for strongly consistent tables, because raft groups +for strongly consistent tables can occupy many different shards and their metadata may be +updated often. Storing all data on a single shard would at the same time make this shard +a bottleneck and it would require performing cross-shard operations for most strongly +consistent writes, which would also diminish their performance on its own. + +Instead, we want to store the metadata for a Raft group on the same shard where this group's +server is located, avoiding any cross-shard operations and evenly distributing the work +related to writing metadata to all shards. + +## Strongly consistent table persistence + +We introduce a separate set of Raft system tables for strongly consistent tablets: + +- `system.raft_groups` +- `system.raft_groups_snapshots` +- `system.raft_groups_snapshot_config` + +These tables mirror the logical contents of the existing `system.raft`, `system.raft_snapshots`, +`system.raft_snapshot_config` tables, but their partition key is a composite `(shard, group_id)` +rather than just `group_id`. + +To make “(shard, group_id) belongs to shard X” true at the storage layer, we use: + +- a dedicated partitioner (`service::strong_consistency::raft_groups_partitioner`) + which encodes the shard into the token, and +- a dedicated sharder (`service::strong_consistency::raft_groups_sharder`) which extracts + that shard from the token. + +As a result, reads and writes for a given group’s persistence are routed to the same shard +where the Raft server instance runs. + +## Token encoding + +The partitioner encodes the destination shard in the token’s high bits: + +- token layout: `[shard: 16 bits][group_id_hash: 48 bits]` +- the shard value is constrained to fit the `smallint` column used in the schema. + it also needs to be non-negative, so it's effectively limited to range `[0, 32767]` +- the lower 48 bits are derived by hashing the `group_id` (timeuuid) + +The key property is that shard extraction is a pure bit operation and does not depend on +the cluster’s shard count. + +## No direct migration support + +`raft_groups_sharder::shard_for_writes()` returns up to one shard - it does not support +migrations using double writes. Instead, for a given Raft group, when a tablet is migrated, +the Raft metadata needs to be erased from the former location and added in the new location.