Files
scylladb/schema
Avi Kivity 85bd6d0114 Merge 'Add multiple-shard persistent metadata storage for strongly consistent tables' from Wojciech Mitros
In this series we introduce new system tables and use them for storing the raft metadata
for strongly consistent tables. In contrast to the previously used raft group0 tables, the
new tables can store data on any shard. The tables also allow specifying the shard where
each partition should reside, which enables the tablets of strongly consistent tables to have
their raft group metadata co-located on the same shard as the tablet replica.

The new tables have almost the same schemas as the raft group0 tables. However, they
have an additional column in their partition keys. The additional column is the shard
that specifies where the data should be located. While a tablet and its corresponding
raft group server resides on some shard, it now writes and reads all requests to the
metadata tables using its shard in addition to the group_id.

The extra partition key column is used by the new partitioner and sharder which allow
this special shard routing. The partitioner encodes the shard in the token and the
sharder decodes the shard from the token. This approach for routing avoids any
additional lookups (for the tablet mapping) during operations on the new tables
and it also doesn't require keeping any state. It also doesn't interact negatively
with resharding - as long as tablets (and their corresponding raft metadata) occupy
some shard, we do not allow starting the node with a shard count lower than the
id of this shard. When increasing the shard count, the routing does not change,
similarly to how tablet allocation doesn't change.

To use the new tables, a new implementation of `raft::persistence` is added. Currently,
it's almost an exact copy of the `raft_sys_table_storage` which just uses the new tables,
but in the future we can modify it with changes specific to metadata (or mutation)
storage for strongly consistent tables. The new storage is used in the `groups_manager`,
which combined with the removal of some `this_shard_id() == 0` checks, allows strongly
consistent tables to be used on all shards.

This approach for making sure that the reads/writes to the new tables end up on the correct shards
won in the balance of complexity/usability/performance against a few other approaches we've considered.
They include:
1. Making the Raft server read/write directly to the database, skipping the sharder, on its shard, while using
the default partitioner/sharder. This approach could let us avoid changing the schema and there should be
no problems for reads and writes performed by the Raft server. However, in this approach we would input
data in tables conflicting with the placement determined by the sharder. As a result, any read going through
the sharder could miss the rows it was supposed to read. Even when reading all shards to find a specific value,
there is a risk of polluting the cache - the rows loaded on incorrect shards may persist in the cache for an unknown
amount of time. The cache may also mistakenly remember that a row is missing, even though it's actually present,
just on an incorrect shard.
Some of the issues with this approach could be worked around using another sharder which always returns
this_shard_id() when asked about a shard. It's not clear how such a sharder would implement a method like
`token_for_next_shard`, and how much simpler it would be compared to the current "identity" sharder.
2. Using a sharder depending on the current allocation of tablets on the node. This approach relies on the
knowledge of group_id -> shard mapping at any point in time in the cluster. For this approach we'd also
need to either add a custom partitioner which encodes the group_id in the token, or we'd need to track the
token(group_id) -> shard mapping. This approach has the benefit over the one used in the series of keeping
the partition key as just group_id. However, it requires more logic, and the access to the live state of the node
in the sharder, and it's not static - the same token may be sharded differently depending on the state of the
node - it shouldn't occur in practice, but if we changed the state of the node before adjusting the table data,
we would be unable to access/fix the stale data without artificially also changing the state of the node.
3. Using metadata tables co-located to the strongly consistent tables. This approach could simplify the
metadata migrations in the future, however it would require additional schema management of all co-located
metadata tables, and it's not even obvious what could be used as the partition key in these tables - some
metadata is per-raft-group, so we couldn't reuse the partition key of the strongly consistent table for it. And
finding and remembering a partition key that is routed to a specific shard is not a simple task. Finally, splits
and merges will most likely need special handling for metadata anyway, so we wouldn't even make use of
co-located table's splits and merges.

Fixes [SCYLLADB-361](https://scylladb.atlassian.net/browse/SCYLLADB-361)

[SCYLLADB-361]: https://scylladb.atlassian.net/browse/SCYLLADB-361?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#28509

* github.com:scylladb/scylladb:
  docs: add strong consistency doc
  test/cluster: add tests for strongly-consistent tables' metadata persistence
  raft: enable multi-shard raft groups for strongly consistent tablets
  test/raft: add unit tests for raft_groups_storage
  raft: add raft_groups_storage persistence class
  db: add system tables for strongly consistent tables' raft groups
  dht: add fixed_shard_partitioner and fixed_shard_sharder
  raft: add group_id -> shard mapping to raft_group_registry
  schema: add with_sharder overload accepting static_sharder reference
2026-03-04 08:55:43 +02:00
..