mirror of
https://github.com/scylladb/scylladb.git
synced 2026-05-23 16:22:15 +00:00
Small adjustments and improvements to the documentation in the raft section. Fixing Markdown lint warnings: - MD004/ul-style: Unordered list style [Expected: dash; Actual: asterisk] - MD007/ul-indent: Unordered list indentation [Expected: 0; Actual: 2] - MD032/blanks-around-lists: Lists should be surrounded by blank lines - MD036/no-emphasis-as-heading: Emphasis used instead of a heading - MD046/code-block-style: Code block style [Expected: fenced; Actual: indented] Closes scylladb/scylladb#21780
286 lines
14 KiB
Markdown
286 lines
14 KiB
Markdown
Date: 2022-05-06
|
|
|
|
# Intro
|
|
|
|
When we began implementing Raft we wanted to create a reusable and
|
|
well tested component which we could utilize for data, schema and topology
|
|
operations. This is why the Raft library in raft/ has the only dependency -
|
|
on seastar. It provides its own design documents and a README.
|
|
|
|
# Raft application in Scylla
|
|
|
|
In order to use the library, the client (Scylla server) needs to
|
|
provide implementations for three key interfaces:
|
|
|
|
- persistence - to persist Raft state
|
|
- rpc - to exchange messages with instances of the library
|
|
on other machines
|
|
- the client state machine - to execute commands once
|
|
they were replicated and committed on a majority of nodes.
|
|
|
|
Depending on the application (data, topology, or schema) Scylla can use
|
|
separate instantiations of the library with different parameters.
|
|
The term we use commonly for these instantiations is Raft groups.
|
|
|
|
Some applications may require multiple instantiations (groups) which share
|
|
the implementations of persistence, rpc, and client state machine. An
|
|
example is data replication where we partition the entire key range,
|
|
and use a separate replication group for one or several partitions.
|
|
|
|
Each group must be identified with a UUID based group id, which
|
|
works as a key both internally, when persisting the group state
|
|
or loading it at boot time, and externally, when communicating
|
|
with Raft peers of the group on other nodes.
|
|
|
|
For example, to persist the changes to schema and topology, Scylla
|
|
can (and does) use node-local system tables.
|
|
There are three such tables:
|
|
|
|
- `raft`, the main table which stores the Raft log for each
|
|
group. The table partition key is group id, so each log
|
|
forms its own partition. Since the table is local, this
|
|
works fine with many groups.
|
|
- `raft_snapshots`, a supporting table storing the so-called
|
|
snapshot descriptors,
|
|
- `raft_snapshot_config`, a normalized part of raft
|
|
`raft_snapshots`, storing the cluster configuration
|
|
at the time of taking the snapshot. May be out of date with
|
|
the real cluster configuration, e.g. when configuration
|
|
change happens after a snapshot and is only present in
|
|
`raft` log table.
|
|
|
|
The system tables are capable of storing the state of an arbitrary
|
|
number of groups, provided the group is happy with the relatively
|
|
low performance the system-table based persistence can provide.
|
|
|
|
Raft RPC implementation is the same for all groups. It runs on top
|
|
of the standard Scylla messaging service and uses group id as key
|
|
to route messages to the right group on a given host.
|
|
Raft persists peer addresses in its configuration, right next to
|
|
peer identifiers, and the format of the address is not restricted.
|
|
Currently IP addresses are used, just like in the rest of Scylla.
|
|
|
|
An own client state machine is expected to exist for each group. Right
|
|
now Scylla implements only one client state machine - the so-called
|
|
"schema state machine", created for the only group that contains
|
|
all cluster members - the group 0.
|
|
|
|
A helper service which can be used to quickly get from a group id
|
|
to its Raft instance is the group registry. It's
|
|
a sharded service which starts early at Scylla boot
|
|
and runs on all shards. The purpose of this component is to
|
|
isolate Raft service consumers within Scylla (e.g. the migration
|
|
manager for schema changes) from the logic of establishing and
|
|
maintaining the group at node start or during topology changes.
|
|
|
|
# The group 0
|
|
|
|
The main and the only group running in Scylla now is Group 0. It
|
|
maintains the schema state machine - as stored in the system
|
|
keyspace and the schema cache. Each cluster node is a member of
|
|
this group. The group's Raft server runs on shard 0 on each node.
|
|
Joining and leaving the group is integrated into topology
|
|
operations, and the group is started whenever a node starts.
|
|
|
|
## Establishing group 0 in a fresh cluster
|
|
|
|
When a Scylla node starts for the first time, it must join
|
|
group 0. If there is an existing cluster, joining can be done by
|
|
sending a request to add the starting node to the group
|
|
configuration.
|
|
|
|
But what should the node do if there is no running cluster yet?
|
|
|
|
A special distributed algorithm which persists its state locally
|
|
on each node is responsible for establishing an initial Raft
|
|
configuration in a fresh cluster. We often refer to it as
|
|
"discovery" algorithm.
|
|
|
|
Raft group 0 has an id (UUID) just like any other group. After a
|
|
node boots, this id is persisted in `scylla_local` system table.
|
|
If this id is present, the node can start a Raft instance for
|
|
the group using the last saved state in `raft`, `raft_snapshots`
|
|
and `raft_snapshot_config` system tables, which are all retrievable
|
|
by group id.
|
|
If a persisted id is missing, it means the node is bootstrapping
|
|
and haven't joined Raft yet.
|
|
|
|
To find out an existing cluster, or possibly create a new one,
|
|
an instance of 'discovery' state machine is created. This state
|
|
machine stores its intermediate state in `discovery` system table.
|
|
|
|
The machine is initialized with initial contact points: the seeds
|
|
parameter of `scylla.yaml` file. Then the machine runs a discovery
|
|
loop, during which it contacts all initial seeds, and, by
|
|
induction, all nodes known to the contacted nodes. The nodes
|
|
exchange the peer information until a transitive closure of all
|
|
peers is built - i.e. there are no new nodes discovered in the
|
|
last loop iteration and all previously discovered nodes have been
|
|
contacted.
|
|
|
|
In the process, the discovery state machine may find a node
|
|
with an existing Raft group, in which case it sends a configuration
|
|
change request to join. Otherwise the node with the lowest Raft id
|
|
establishes a new group.
|
|
|
|
As you can see from this description, Raft in Scylla supports
|
|
speedy, concurrent bootstraps of multiple nodes. As soon as
|
|
streaming is ready to support it, all Scylla topology changes
|
|
can be switched to a concurrent algorithm.
|
|
|
|
# Using Group 0 to perform schema changes
|
|
|
|
The main goal of adding Raft to Scylla is providing linearizability
|
|
of schema changes. The group 0 Raft log is used to share all schema
|
|
changes information across all cluster nodes, and perform schema changes
|
|
in the same order on every node. However, while Raft guarantees
|
|
that if an operation succeeds, it's stored in the logs on the
|
|
majority of nodes, it doesn't provide an API for strictly ordered
|
|
schema changes out of the box, for two reasons:
|
|
|
|
- Raft only stores a log of the operations. The operations
|
|
themselves - the schema changes - are applied to the client
|
|
state machine once they are committed to the log. In order
|
|
to construct a new operation it's necessary to read the
|
|
latest state of the client state machine.
|
|
E.g. if a client wants to create a table, it's necessary to
|
|
learn first that the table hasn't been created already. We
|
|
could store entire CQL statements in the log and read
|
|
the state of the schema when the command is already committed
|
|
to the Raft log and is being applied: but that would make
|
|
reporting errors back to the client difficult. Besides, that
|
|
would require that each participant of the cluster performs
|
|
the same reads and checks, that would be an unwanted
|
|
overhead.
|
|
Instead, we read the schema state when constructing the raft
|
|
command, on the node which received the DDL statement from
|
|
the client. With this approach, two nodes could read their
|
|
local state machines, decide that the table does not exist
|
|
yet, and append identical commands for creating the table to
|
|
the Raft log. The second command will fail to apply to the
|
|
client state machine.
|
|
- if a leader or network fails when committing an operation to
|
|
Raft log, the client has no way of knowing its status. E.g. a
|
|
network can time out, and establishing whether or not the
|
|
majority of Raft nodes store the command in its log or have applied it
|
|
may be difficult.
|
|
|
|
For the two reasons above, Scylla uses an additional
|
|
algorithm for all schema changes which it propagates through Raft
|
|
which provides an ordered, at-most-once semantics of command
|
|
application.
|
|
|
|
This algorithm is using an own system table - `group0_history` - to
|
|
store its state.
|
|
|
|
The algorithm is adding a unique, monotonic, persistent state id
|
|
to each command that is committed to Raft log of the group 0 state
|
|
machine. This id is used
|
|
to ensure that no two concurrent commands based on the same (and
|
|
outdated) state of the client state machine are applied, and to
|
|
verify that a command is committed to Raft log in a situation of
|
|
uncertainty and retry the command if it failed.
|
|
|
|
Here's how it works.
|
|
|
|
Each change to the schema state machine is "signed", or augmented, with a
|
|
state id - a unique monotonic identifier of the change. The state
|
|
id is based on TIMEUUID. When the command is applied, the state id is
|
|
persisted in `group0_history` table. In addition to the new state id,
|
|
the previous state id is also saved in the command to check that
|
|
the command is applied against the correct state of the client
|
|
state machine.
|
|
|
|
In order to perform a schema change, the client, which can be any Scylla
|
|
node, performs the following steps:
|
|
|
|
1. Takes a lock around its local client state machine, to protect against
|
|
concurrent operations of the state machine by different threads of
|
|
this node.
|
|
1. Issues a raft read barrier. This ensures this local client
|
|
state machine has all the commands added to Group 0 on other
|
|
nodes, up to the time of the barrier.
|
|
1. Reads `group0_history` table to find out the latest existing state id.
|
|
1. Creates a new state id, which would identify the new command. It
|
|
is strictly greater than the previous state id.
|
|
1. Reads the local state machine, validates and constructs a new command.
|
|
E.g. at this point, if we're creating a new keyspace, we may find
|
|
that the keyspace already exists and return an error to the user.
|
|
Otherwise we'll create a new mutation which adds the keyspace
|
|
to the system table.
|
|
1. Creates a command to store in Raft log, which bundles together
|
|
the mutation of the client state machine, the previous and the new
|
|
state ids.
|
|
1. When Raft commits the command to its log, each node will apply it
|
|
to its local schema state machine. At this point, verify that
|
|
the previous state id of the command matches the latest state
|
|
id of the local state machine, i.e. there was no race with another
|
|
Raft command from a different client, and if there is a match,
|
|
apply the command and append the new state id to `group0_history`,
|
|
otherwise turn the command into a no-op.
|
|
1. The client which generated the command waits for the
|
|
command to apply locally. If there was a timeout or error,
|
|
it can retry the command with the same state id until it gets
|
|
committed. If the state id is out of order, e.g. because there
|
|
were another state id committed since, or this state id is
|
|
itself already committed, applying the command turns into a
|
|
no-op. The state ids make the commands idempotent.
|
|
Once the command applies successfully, we can read the local
|
|
state machine and find out what happened with it using
|
|
`group0_history`. If the state id is recorded in the history,
|
|
the command is really executed, otherwise it turned into a no-op,
|
|
so the whole procedure needs to be restarted.
|
|
|
|
# Group 0 schema versioning
|
|
|
|
Historically Scylla was using hashes (called "digests") of schema mutations
|
|
calculated on each node to ensure that schema is in sync between nodes.
|
|
There was a global schema digest calculated from schema mutations of all
|
|
non-system tables, gossiped by each node as an application state
|
|
(`gms::application_state::SCHEMA`). Furthermore, each distributed table had its
|
|
own table schema version, calculated from schema mutations for this table
|
|
(`schema::version()`).
|
|
|
|
When a node noticed that another node is gossiping a different global digest
|
|
than its own, it would pull all schema mutations from that other node.
|
|
|
|
Whenever a replica receives a write or read to a table, the operation comes
|
|
with a version attached for this table by the operation's coordinator. If the
|
|
version is different or unknown, the replica would ensure that its schema
|
|
is at least as up-to-date as the coordinator's schema for this table before
|
|
handling the operation, pulling mutations for this table if necessary
|
|
(`get_schema_for_write`).
|
|
|
|
This hash-based versioning had its place in an eventually consistent world of
|
|
schema changes where different nodes might apply schema changes out of order.
|
|
But it has downsides, some of them described in issues scylladb/scylladb#7620,
|
|
scylladb/scylladb#13957. In particular, the more schema changes we performed,
|
|
the longer it would take to calculate the hash of entire schema (due to
|
|
tombstones), and at some point schema changes would slow down significantly.
|
|
|
|
With schema changes on group 0, we decided to replace these hashes with values
|
|
that are calculated in a single place -- by the sender of the schema change.
|
|
Other nodes persist the obtained global schema version and table versions when
|
|
applying the resulting group 0 command.
|
|
|
|
For the global schema version, each schema change command -- which is a vector
|
|
of mutations -- is extended with a mutation for the `system.scylla_local`
|
|
table, which is a string-string key-value store. This mutation writes the
|
|
version under `group0_schema_version` key. Whenever a node updates its
|
|
in-memory schema version (`update_schema_version_and_announce`), in particular
|
|
after each schema change, it uses the version obtained from this table if it's
|
|
present, instead of calculating a hash.
|
|
|
|
For each table creation or alteration schema change command, the vector of
|
|
mutations contains a mutation for the `system_schema.scylla_tables` table
|
|
that adds/modifies a row corresponding to the created/altered table. This row
|
|
contains a `version` cell that contains the version and a boolean
|
|
`committed_by_group0` cell that is set to `true`. Whenever a node updates its
|
|
in-memory version for this table, in particular after each schema change
|
|
touching this table, if it sees that `committed_by_group0 == true`, it will use
|
|
the provided version instead of calculating a hash.
|
|
|
|
When performing schema changes in Raft Recovery mode we're writing a tombstone
|
|
for the `system.scylla_local` entry and we write `committed_by_group0 == false`
|
|
for the `system_schema.scylla_tables` entries, forcing the old behavior.
|