scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-03 13:37:04 +00:00

Files

Tomasz Grabiec c89b1953f8 Merge "Enforce linearizability of group 0 operations using state IDs" from Kamil

We introduce a new table, `system.group0_history`.

This table will contain a history of all group 0 changes applied through
Raft. Each change has an associated unique ID, which also identifies
the state of all group 0 tables (including schema tables) after this
change is applied, assuming that all such changes are serialized through
Raft (they will be eventually).

In addition to the mutations that modify group 0 tables, group 0 commands
contain a "previous state ID" and a "new state ID".

The group 0 state machine will only modify state during command
application if the provided "previous state ID" is equal to the
last state ID present in the history table. Otherwise, the command will
be a no-op.

To ensure linearizability of group 0 changes, the performer of the
change must first read the last state ID, and only then read the state
and send a command for the state machine. If a concurrent change
races with this command and manages to modify the state, we will detect
that the last state ID does not match during `apply`; all calls to
`apply` are serialized, and `apply` adds the new entry to the history
table at the end, after modifying the group 0 state.
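
The check performed during `apply` can be pictured with a minimal,
single-threaded sketch. Everything below is hypothetical and only
illustrates the idea: `state_id` is a plain integer rather than the real
ID type, `group0_command_sketch` and `group0_state_machine_sketch` are
made-up names, a `std::map` stands in for the group 0 tables, and the
value 0 stands for "no history yet".

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a state ID (an integer, for illustration only).
using state_id = std::uint64_t;

// Hypothetical command: one key/value write plus the two state IDs.
struct group0_command_sketch {
    state_id prev_state_id;  // state ID the performer observed before reading state
    state_id new_state_id;   // state ID to append to the history on success
    std::string key;
    std::string value;
};

class group0_state_machine_sketch {
    std::map<std::string, std::string> _state;  // stands in for group 0 tables
    std::vector<state_id> _history{0};          // stands in for system.group0_history
public:
    state_id last_state_id() const { return _history.back(); }

    // Returns true if the command modified state, false if it was a no-op.
    bool apply(const group0_command_sketch& cmd) {
        if (cmd.prev_state_id != last_state_id()) {
            return false;  // a concurrent change won the race: do nothing
        }
        assert(cmd.new_state_id != cmd.prev_state_id);
        _state[cmd.key] = cmd.value;           // modify the state first...
        _history.push_back(cmd.new_state_id);  // ...then record the new state ID
        return true;
    }

    std::string get(const std::string& key) const {
        auto it = _state.find(key);
        return it == _state.end() ? std::string() : it->second;
    }
};
```

Two commands built against the same observed state ID cannot both take
effect in this sketch: the first one advances the history, so the second
one no longer matches and becomes a no-op.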

The details of this mechanism are abstracted away with `group0_guard`.
To perform a group 0 change, one needs to call `announce`, which
requires a `group0_guard` to be passed in. The only way to obtain a
`group0_guard` is by calling `start_group0_operation`, which underneath
performs a read barrier on group 0, obtains the last state ID from the
history table, and constructs a new state ID that the change will append
to the history table. The read barrier ensures that all previously
completed changes are visible to this operation. The caller can then
perform any necessary validation, construct mutations which modify group 0
state, and finally call `announce`.
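
As a rough illustration of why the guard can only come from
`start_group0_operation`, here is a sketch of a move-only guard type with
a private constructor. The names `guard_sketch` and `manager_sketch` are
hypothetical, the read barrier is only indicated by a comment, and state
IDs are plain integers; this is not the actual `migration_manager`
interface.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

using state_id = std::uint64_t;  // hypothetical integer stand-in for a state ID

class manager_sketch;

// Move-only token. Its constructor is private, so callers cannot forge one.
class guard_sketch {
    state_id _observed;
    state_id _new_id;
    friend class manager_sketch;
    guard_sketch(state_id observed, state_id new_id)
        : _observed(observed), _new_id(new_id) {}
public:
    guard_sketch(guard_sketch&&) = default;
    guard_sketch(const guard_sketch&) = delete;
    state_id observed_state_id() const { return _observed; }
    state_id new_state_id() const { return _new_id; }
};

class manager_sketch {
    state_id _last = 0;  // stands in for the last entry of the history table
public:
    // Stands in for start_group0_operation.
    guard_sketch start_operation() {
        // The real operation performs a read barrier on group 0 here.
        // Assumption: the new ID is simply last + 1 in this sketch.
        return guard_sketch(_last, _last + 1);
    }

    // Stands in for announce: the guard is consumed by the call.
    bool announce(guard_sketch g) {
        assert(g.new_state_id() != g.observed_state_id());
        if (g.observed_state_id() != _last) {
            return false;  // state changed since the guard was obtained
        }
        _last = g.new_state_id();
        return true;
    }

    state_id last_state_id() const { return _last; }
};
```

Because `announce` takes the guard by value, a caller has to give it up
with `std::move`, which mirrors the "obtain guard, validate, build
mutations, announce" sequence described above.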

The guard also provides a timestamp which is used by the caller
to construct the mutations. The timestamp is obtained from the new state ID.
We ensure that it is greater than the timestamp of the last state ID.
Thus, if the change is successful, the applied mutations will have greater
timestamps than the previously applied mutations.
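
One simple way to obtain such a timestamp is sketched below. This is an
assumption for illustration: the message only states that the new
timestamp is greater than the previous one, not how that is achieved.
The helper `next_state_timestamp` and the `timestamp_us` alias are
hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical alias: microseconds since some epoch.
using timestamp_us = std::int64_t;

// Choose the timestamp for the new state ID: use the current clock reading,
// but bump it when needed so it is strictly greater than the timestamp of
// the last state ID (for example, if the local clock is behind).
inline timestamp_us next_state_timestamp(timestamp_us last, timestamp_us now) {
    assert(last >= 0 && now >= 0);
    return std::max(now, last + 1);
}
```

With this rule a successful change always carries mutations that are
newer than any previously applied ones, even when the performer's clock
lags.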

We also add two locks. The more important one, used to ensure
correctness, is `read_apply_mutex`. It is held when modifying group 0
state (in `apply` and `transfer_snapshot`) and when reading it (it's
taken when obtaining a `group0_guard` and released before a command is
sent in `announce`). Its goal is to ensure that we don't read partial
state, which could happen without it because group 0 state consists of
many parts and `apply` (or `transfer_snapshot`) potentially modifies all
of them. Note: this doesn't give us 100% protection; if we crash in the
middle of `apply` (or `transfer_snapshot`), then after restart we may
read partial state. To remove this possibility we need to ensure that
commands which were being applied before restart but not finished are
re-applied after restart, before anyone can read the state. I left a
TODO in `apply`.

The second lock, `operation_mutex`, is used to improve liveness. It is
taken when obtaining a `group0_guard` and released after a command is
applied (compare to `read_apply_mutex` which is released before a
command is sent). It is not taken inside `apply` or `transfer_snapshot`.
This lock ensures that multiple fibers running on the same node do not
attempt to modify group 0 concurrently; this would cause some of them to
fail (due to the concurrent modification protection described above).
This is mostly important during first boot of the first node, when
services start for the first time and try to create their internal
tables. This lock serializes these attempts, ensuring that all of them
succeed.
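
The different lifetimes of the two locks can be sketched with standard
C++ mutexes. This is only an analogy under stated assumptions: the real
code runs on fibers rather than threads, the order in which the two
locks are taken is assumed, and the names `locks_sketch`,
`guard_locks_sketch`, `start_operation`, `apply_command` and `announce`
below are hypothetical.

```cpp
#include <cassert>
#include <mutex>
#include <utility>

struct locks_sketch {
    std::mutex operation_mutex;   // liveness: one local operation at a time
    std::mutex read_apply_mutex;  // correctness: no reads of partial state
};

// What the guard holds while the caller reads state and builds mutations.
struct guard_locks_sketch {
    std::unique_lock<std::mutex> operation_lock;
    std::unique_lock<std::mutex> read_apply_lock;
};

// Both locks are taken when the guard is obtained.
inline guard_locks_sketch start_operation(locks_sketch& l) {
    std::unique_lock<std::mutex> op(l.operation_mutex);
    std::unique_lock<std::mutex> ra(l.read_apply_mutex);
    return {std::move(op), std::move(ra)};
}

// Stands in for the state machine's apply: takes only read_apply_mutex.
inline void apply_command(locks_sketch& l, int& state, int value) {
    std::lock_guard<std::mutex> hold(l.read_apply_mutex);
    state = value;
}

inline void announce(locks_sketch& l, guard_locks_sketch g, int& state, int value) {
    assert(g.operation_lock.owns_lock() && g.read_apply_lock.owns_lock());
    g.read_apply_lock.unlock();     // released before the command is sent
    apply_command(l, state, value); // apply re-takes read_apply_mutex itself
    // operation_lock is released when g is destroyed, after the command applied
}
```

If `announce` kept `read_apply_mutex` while the command was being
applied, `apply_command` would block on it forever, which is why that
lock is dropped before the command is sent while `operation_mutex` stays
held until the end.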

* kbr/schema-state-ids-v4:
service: migration_manager: `announce`: take a description parameter
service: raft: check and update state IDs during group 0 operations
service: raft: group0_state_machine: introduce `group0_command`
service: migration_manager: allow using MIGRATION_REQUEST verb to fetch group 0 history table
service: migration_manager: convert migration request handler to coroutine
db: system_keyspace: introduce `system.group0_history` table
treewide: require `group0_guard` when performing schema changes
service: migration_manager: introduce `group0_guard`
service: raft: pass `storage_proxy&` to `group0_state_machine`
service: raft: raft_state_machine: pass `snapshot_descriptor` to `transfer_snapshot`
service: raft: rename `schema_raft_state_machine` to `group0_state_machine`
service: migration_manager: rename `schema_read_barrier` to `start_group0_operation`
service: migration_manager: `announce`: split raft and non-raft paths to separate functions
treewide: pass mutation timestamp from call sites into `migration_manager::prepare_*` functions
service: migration_manager: put notifier call inside `async`
service: migration_manager: remove some unused and disabled code
db: system_distributed_keyspace: use current time when creating mutations in `start()`
redis: keyspace_utils: `create_keyspace_if_not_exists_impl`: call `announce` twice only

2022-01-25 09:52:30 +02:00

auth.cc

Merge 'alternator: move uses of replica module to data_dictionary' from Avi Kivity

2022-01-19 11:34:25 +02:00

auth.hh

treewide: use Software Package Data Exchange (SPDX) license identifiers

2022-01-18 12:15:18 +01:00

conditions.cc

treewide: use Software Package Data Exchange (SPDX) license identifiers

2022-01-18 12:15:18 +01:00

conditions.hh

treewide: use Software Package Data Exchange (SPDX) license identifiers