Tomasz Grabiec c89b1953f8 Merge "Enforce linearizability of group 0 operations using state IDs" from Kamil
We introduce a new table, `system.group0_history`.

This table will contain a history of all group 0 changes applied through
Raft. Each change has an associated unique ID, which also identifies
the state of all group 0 tables (including schema tables) after this
change is applied, assuming that all such changes are serialized through
Raft (they will be eventually).

Group 0 commands, in addition to the mutations which modify group 0 tables,
contain a "previous state ID" and a "new state ID".

The group 0 state machine will only modify state during command
application if the provided "previous state ID" is equal to the
last state ID present in the history table. Otherwise, the command will
be a no-op.
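
The check described above can be sketched as follows. This is a minimal
illustration, not the code from this series: `toy_state_machine`, `command`
and the integer `state_id` are hypothetical stand-ins, and the history table
is modelled as an in-memory vector.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a state ID; assumed here to be a plain integer.
using state_id = std::uint64_t;

// Hypothetical model of a group 0 command.
struct command {
    state_id prev_state_id;  // state the performer observed
    state_id new_state_id;   // state ID to record if the command applies
    std::string mutation;    // placeholder for the group 0 mutations
};

struct toy_state_machine {
    std::vector<state_id> history;   // models system.group0_history
    std::vector<std::string> state;  // models the group 0 tables

    // 0 stands for "no change applied yet" in this sketch.
    state_id last_state_id() const {
        return history.empty() ? 0 : history.back();
    }

    // Returns true if the command modified state, false if it was a no-op.
    bool apply(const command& cmd) {
        if (cmd.prev_state_id != last_state_id()) {
            return false;  // a concurrent change won the race
        }
        state.push_back(cmd.mutation);
        // The history entry is added last, after the state is modified.
        history.push_back(cmd.new_state_id);
        return true;
    }
};
```

Two commands built against the same previous state ID cannot both take
effect: the second one sees a different last state ID and is dropped.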

To ensure linearizability of group 0 changes, the performer of the
change must first read the last state ID, and only then read the state
and send a command to the state machine. If a concurrent change
races with this command and manages to modify the state, we will detect
that the last state ID does not match during `apply`; all calls to
`apply` are serialized, and `apply` adds the new entry to the history
table at the end, after modifying the group 0 state.

The details of this mechanism are abstracted away with `group0_guard`.
To perform a group 0 change, one needs to call `announce`, which
requires a `group0_guard` to be passed in. The only way to obtain a
`group0_guard` is by calling `start_group0_operation`, which underneath
performs a read barrier on group 0, obtains the last state ID from the
history table, and constructs a new state ID that the change will append
to the history table. The read barrier ensures that all previously
completed changes are visible to this operation. The caller can then
perform any necessary validation, construct mutations which modify group 0
state, and finally call `announce`.
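
The calling pattern can be illustrated with a small model. Everything below
is an assumption made for illustration: `toy_guard`, `toy_group0_client`, the
integer state IDs and the synchronous `announce` are hypothetical, and the
read barrier is only marked by a comment. The one property it is meant to
show is that a guard can only be obtained from `start_operation` and is
consumed by `announce`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using state_id = std::uint64_t;  // hypothetical: real state IDs are not integers

class toy_group0_client;

// Move-only token; the private constructor means only toy_group0_client can make one.
class toy_guard {
    state_id _observed;
    state_id _new_id;
    toy_guard(state_id observed, state_id new_id)
        : _observed(observed), _new_id(new_id) {}
    friend class toy_group0_client;
public:
    toy_guard(toy_guard&&) = default;
    state_id observed_state_id() const { return _observed; }
    state_id new_state_id() const { return _new_id; }
};

class toy_group0_client {
    std::vector<state_id> _history;    // models the history table
    std::vector<std::string> _schema;  // models group 0 state

    state_id last_state_id() const {
        return _history.empty() ? 0 : _history.back();
    }
public:
    toy_guard start_operation() {
        // A read barrier on group 0 would be performed here.
        state_id last = last_state_id();
        // Simplification: the sketch does not make new IDs globally unique.
        return toy_guard(last, last + 1);
    }

    // Consumes the guard; returns false if a concurrent change was applied first.
    bool announce(std::string mutation, toy_guard guard) {
        if (guard.observed_state_id() != last_state_id()) {
            return false;
        }
        _schema.push_back(std::move(mutation));
        _history.push_back(guard.new_state_id());
        return true;
    }

    std::size_t schema_size() const { return _schema.size(); }
};
```

If two guards are obtained from the same state, only the first `announce`
succeeds; the loser has to obtain a fresh guard, redo its validation and try
again.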

The guard also provides a timestamp which is used by the caller
to construct the mutations. The timestamp is obtained from the new state ID.
We ensure that it is greater than the timestamp of the last state ID.
Thus, if the change is successful, the applied mutations will have greater
timestamps than the previously applied mutations.
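
One way to get this ordering, shown here only as a sketch, is to clamp the
new timestamp so that it is strictly above the previous one even when the
local clock lags. `toy_state_id` and `next_state_id` are hypothetical names,
and the clamp rule is an assumption rather than a description of the actual
generator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

using timestamp_type = std::int64_t;  // microseconds, for this sketch

// Hypothetical state ID that carries nothing but its timestamp.
struct toy_state_id {
    timestamp_type micros;
};

// Returns an ID whose timestamp is strictly greater than that of `last`,
// using the current time when the clock is already ahead.
inline toy_state_id next_state_id(toy_state_id last, timestamp_type now_micros) {
    return toy_state_id{std::max(now_micros, last.micros + 1)};
}
```

With a last timestamp of 100, a clock reading of 50 yields 101 and a clock
reading of 200 yields 200.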

We also add two locks. The more important one, used to ensure
correctness, is `read_apply_mutex`. It is held when modifying group 0
state (in `apply` and `transfer_snapshot`) and when reading it (it's
taken when obtaining a `group0_guard` and released before a command is
sent in `announce`). Its goal is to ensure that we don't read partial
state, which could happen without it because group 0 state consists of
many parts and `apply` (or `transfer_snapshot`) potentially modifies all
of them. Note: this doesn't give us 100% protection; if we crash in the
middle of `apply` (or `transfer_snapshot`), then after restart we may
read partial state. To remove this possibility we need to ensure that
commands which were being applied before restart but not finished are
re-applied after restart, before anyone can read the state. I left a
TODO in `apply`.

The second lock, `operation_mutex`, is used to improve liveness. It is
taken when obtaining a `group0_guard` and released after a command is
applied (compare to `read_apply_mutex` which is released before a
command is sent). It is not taken inside `apply` or `transfer_snapshot`.
This lock ensures that multiple fibers running on the same node do not
attempt to modify group 0 concurrently, which would cause some of them to
fail (due to the concurrent modification protection described above).
This is mostly important during the first boot of the first node, when
services start for the first time and try to create their internal
tables. This lock serializes these attempts, ensuring that all of them
succeed.
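
The different lifetimes of the two locks can be sketched with standard
mutexes. This is an illustration under stated assumptions: `std::mutex`
stands in for whatever primitive the real code uses, `toy_locked_group0` and
its members are hypothetical, and the Raft round trip is replaced by a direct
call to `apply`.

```cpp
#include <cassert>
#include <mutex>
#include <utility>

struct toy_locked_group0 {
    std::mutex read_apply_mutex;  // protects reads and writes of group 0 state
    std::mutex operation_mutex;   // serializes whole operations on this node
    int state = 0;                // placeholder for group 0 state

    struct guard {
        std::unique_lock<std::mutex> operation_lock;
        std::unique_lock<std::mutex> read_apply_lock;
        int observed_state;
    };

    // Takes both locks; the caller may read state while holding the guard.
    guard start_operation() {
        std::unique_lock<std::mutex> op(operation_mutex);
        std::unique_lock<std::mutex> ra(read_apply_mutex);
        int observed = state;
        return guard{std::move(op), std::move(ra), observed};
    }

    // Takes only read_apply_mutex, never operation_mutex.
    void apply(int new_state) {
        std::lock_guard<std::mutex> ra(read_apply_mutex);
        state = new_state;
    }

    void announce(guard g, int new_state) {
        // read_apply_mutex is released before the command is sent, so that
        // apply can take it.
        g.read_apply_lock.unlock();
        apply(new_state);  // stands in for sending the command through Raft
        // operation_mutex is released when g is destroyed, after apply.
    }
};
```

Holding `read_apply_mutex` across the send would deadlock against `apply` in
this model, which is why it is dropped first; `operation_mutex` can be held
until the end because `apply` never takes it.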

* kbr/schema-state-ids-v4:
  service: migration_manager: `announce`: take a description parameter
  service: raft: check and update state IDs during group 0 operations
  service: raft: group0_state_machine: introduce `group0_command`
  service: migration_manager: allow using MIGRATION_REQUEST verb to fetch group 0 history table
  service: migration_manager: convert migration request handler to coroutine
  db: system_keyspace: introduce `system.group0_history` table
  treewide: require `group0_guard` when performing schema changes
  service: migration_manager: introduce `group0_guard`
  service: raft: pass `storage_proxy&` to `group0_state_machine`
  service: raft: raft_state_machine: pass `snapshot_descriptor` to `transfer_snapshot`
  service: raft: rename `schema_raft_state_machine` to `group0_state_machine`
  service: migration_manager: rename `schema_read_barrier` to `start_group0_operation`
  service: migration_manager: `announce`: split raft and non-raft paths to separate functions
  treewide: pass mutation timestamp from call sites into `migration_manager::prepare_*` functions
  service: migration_manager: put notifier call inside `async`
  service: migration_manager: remove some unused and disabled code
  db: system_distributed_keyspace: use current time when creating mutations in `start()`
  redis: keyspace_utils: `create_keyspace_if_not_exists_impl`: call `announce` twice only
2022-01-25 09:52:30 +02:00