In this series we introduce new system tables and use them to store the raft metadata for strongly consistent tables. In contrast to the previously used raft group0 tables, the new tables can store data on any shard. They also allow specifying the shard where each partition should reside, which enables the tablets of strongly consistent tables to have their raft group metadata co-located on the same shard as the tablet replica.

The new tables have almost the same schemas as the raft group0 tables, but with an additional column in their partition keys: the shard that specifies where the data should be located. When a tablet and its corresponding raft group server reside on some shard, all reads and writes to the metadata tables are now issued using that shard in addition to the group_id.

The extra partition key column is used by the new partitioner and sharder, which implement this special shard routing: the partitioner encodes the shard in the token, and the sharder decodes the shard from the token. This approach avoids any additional lookups (for the tablet mapping) during operations on the new tables, and it doesn't require keeping any state. It also doesn't interact negatively with resharding: as long as tablets (and their corresponding raft metadata) occupy some shard, we do not allow starting the node with a shard count lower than the id of that shard, and when the shard count is increased, the routing does not change, just as tablet allocation doesn't change.

To use the new tables, a new implementation of `raft::persistence` is added. Currently, it's almost an exact copy of `raft_sys_table_storage` that just uses the new tables, but in the future we can modify it with changes specific to metadata (or mutation) storage for strongly consistent tables.
The new storage is used in the `groups_manager`, which, combined with the removal of some `this_shard_id() == 0` checks, allows strongly consistent tables to be used on all shards.

This approach to making sure that reads/writes to the new tables end up on the correct shards won in the balance of complexity/usability/performance against a few other approaches we considered:

1. Making the Raft server read/write directly to the database on its shard, skipping the sharder, while using the default partitioner/sharder. This approach would let us avoid changing the schema, and there would be no problems for reads and writes performed by the Raft server itself. However, we would be inserting data into tables in conflict with the placement determined by the sharder. As a result, any read going through the sharder could miss the rows it was supposed to read. Even when reading all shards to find a specific value, there is a risk of polluting the cache: rows loaded on incorrect shards may persist in the cache for an unknown amount of time, and the cache may also mistakenly remember that a row is missing even though it's actually present, just on an incorrect shard. Some of these issues could be worked around with another sharder that always returns `this_shard_id()` when asked about a shard, but it's not clear how such a sharder would implement a method like `token_for_next_shard`, or how much simpler it would be compared to the current "identity" sharder.

2. Using a sharder that depends on the current allocation of tablets on the node. This approach relies on knowing the group_id -> shard mapping at any point in time in the cluster. We'd also need to either add a custom partitioner that encodes the group_id in the token, or track the token(group_id) -> shard mapping. This approach has the benefit over the one used in the series of keeping the partition key as just the group_id. However, it requires more logic and access to the live state of the node in the sharder, and it's not static: the same token may be sharded differently depending on the state of the node. This shouldn't occur in practice, but if we changed the state of the node before adjusting the table data, we would be unable to access or fix the stale data without artificially also changing the state of the node.

3. Using metadata tables co-located with the strongly consistent tables. This approach could simplify metadata migrations in the future, but it would require additional schema management of all co-located metadata tables, and it's not even obvious what could serve as the partition key in these tables: some metadata is per-raft-group, so we couldn't reuse the partition key of the strongly consistent table for it, and finding and remembering a partition key that routes to a specific shard is not a simple task. Finally, splits and merges will most likely need special handling for metadata anyway, so we wouldn't even make use of a co-located table's splits and merges.

Fixes [SCYLLADB-361](https://scylladb.atlassian.net/browse/SCYLLADB-361)

Closes scylladb/scylladb#28509

* github.com:scylladb/scylladb:
  docs: add strong consistency doc
  test/cluster: add tests for strongly-consistent tables' metadata persistence
  raft: enable multi-shard raft groups for strongly consistent tablets
  test/raft: add unit tests for raft_groups_storage
  raft: add raft_groups_storage persistence class
  db: add system tables for strongly consistent tables' raft groups
  dht: add fixed_shard_partitioner and fixed_shard_sharder
  raft: add group_id -> shard mapping to raft_group_registry
  schema: add with_sharder overload accepting static_sharder reference
Scylla developer documentation
This folder contains developer-oriented documentation concerning the ScyllaDB codebase. We also have a wiki, which contains additional developer-oriented documentation. There is currently no clear definition of what goes where, so when looking for something be sure to check both.
Seastar documentation can be found here.
User documentation can be found on docs.scylladb.com
For information on how to build Scylla and how to contribute, visit HACKING.md and CONTRIBUTING.md.