Writing into the group0 raft group on the client side involves locking
the state machine, choosing a state ID and checking for its presence
after the operation completes. The code that does this currently resides
in the migration manager, since it is the only user of group0 so far. In
the near future we will have more clients for group0 and they will all
need the same logic, so the patch moves it to a separate class,
raft_group0_client, that any future user of group0 can use to write
into it.
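The write path described above can be modeled as a small toy (all names here are illustrative, not Scylla's actual API): take the state-machine lock, pick a fresh state ID, submit the command, and afterwards check whether that ID landed in the history; if it did not, a concurrent modification won the race.

```cpp
#include <cassert>
#include <mutex>
#include <set>
#include <string>

// Hypothetical stand-in for the logic moved into raft_group0_client.
struct group0_client_model {
    std::mutex lock;                // "locking the state machine"
    std::set<std::string> history;  // state IDs that were actually applied
    int seq = 0;

    // apply_command models the raft round trip; depending on concurrent
    // activity it may or may not apply our command.
    template <typename Apply>
    bool write(Apply apply_command) {
        std::lock_guard<std::mutex> g(lock);
        std::string state_id = "id-" + std::to_string(++seq); // choose a state ID
        apply_command(history, state_id);
        return history.count(state_id) != 0; // check presence after completion
    }
};
```

The point of the extraction is that this lock/choose/check sequence is identical for every future group0 client, so it lives in one place.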
Message-Id: <YoYAJwdTdbX+iCUn@scylladb.com>
After fcb8d040 ("treewide: use Software Package Data Exchange
(SPDX) license identifiers"), many dual-licensed files were
left with empty comments on top. Remove them to avoid visual
noise.
Closes #10562
The main target here is system_keyspace::update_schema_version(), which
is now static but needs to have a system_keyspace as its "this". The
migration manager is one of the places that call that method indirectly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Check that group 0 history grows iff a schema change does not throw
`group0_concurrent_modification`. Check that the CQL DDL statement retry
mechanism works as expected.
Schema changes on top of Raft do not allow concurrent changes.
If two changes are attempted concurrently, one of them gets a
`group0_concurrent_modification` exception.
Catch the exception in the CQL DDL statement execution function and retry.
In addition, the descriptions of CQL DDL statements in the group 0
history table were improved.
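The retry mechanism can be sketched as a simple loop (a minimal model, not the actual Scylla code; the exception type here is a local stand-in):

```cpp
#include <cassert>
#include <stdexcept>

// Stand-in for Scylla's group0_concurrent_modification exception.
struct group0_concurrent_modification : std::runtime_error {
    group0_concurrent_modification()
        : std::runtime_error("concurrent group 0 modification") {}
};

// Sketch of the DDL execution retry loop: re-run the schema change
// whenever a concurrent group 0 modification preempts it. Each retry
// re-reads the current state, so eventually one attempt wins.
template <typename Func>
auto execute_with_retry(Func do_schema_change) {
    while (true) {
        try {
            return do_schema_change();
        } catch (const group0_concurrent_modification&) {
            // Another node (or fiber) won the race; retry from scratch.
        }
    }
}
```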
We perform a bunch of schema changes with different values of
`migration_manager::_group0_history_gc_duration` and check if entries
are cleared according to this setting.
When performing a change through group 0 (which right now only covers
schema changes), clear entries from group 0 history table which are older
than one week.
This is done by including an appropriate range tombstone in the group 0
history table mutation.
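The effect of that range tombstone can be modeled with an ordinary ordered map (a toy, not Scylla's schema): appending a new history entry also drops everything strictly older than the GC duration, just as the real mutation carries a range tombstone alongside the new row.

```cpp
#include <cassert>
#include <chrono>
#include <map>

using timestamp = std::chrono::seconds;

// Toy model of the history GC: insert the new entry, then prune entries
// strictly older than gc_duration (the role played by the range tombstone
// in the real group 0 history table mutation).
void append_with_gc(std::map<timestamp, int>& history, timestamp now,
                    int entry, timestamp gc_duration) {
    history[now] = entry;
    // Erase all keys strictly less than now - gc_duration.
    history.erase(history.begin(), history.lower_bound(now - gc_duration));
}
```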
The description parameter is used for the group 0 history mutation.
The default is empty, in which case the mutation will leave
the description column as `null`.
I filled the parameter in some easy places as an example and left the
rest for a follow-up.
This is how it looks now in a fresh cluster with a single statement
performed by the user:
cqlsh> select * from system.group0_history ;
key | state_id | description
---------+--------------------------------------+------------------------------------------------------
history | 9ec29cac-7547-11ec-cfd6-77bb9e31c952 | CQL DDL statement
history | 9beb2526-7547-11ec-7b3e-3b198c757ef2 | null
history | 9be937b6-7547-11ec-3b19-97e88bd1ca6f | null
history | 9be784ca-7547-11ec-f297-f40f0073038e | null
history | 9be52e14-7547-11ec-f7c5-af15a1a2de8c | null
history | 9be335dc-7547-11ec-0b6d-f9798d005fb0 | null
history | 9be160c2-7547-11ec-e0ea-29f4272345de | null
history | 9bdf300e-7547-11ec-3d3f-e577a2e31ffd | null
history | 9bdd2ea8-7547-11ec-c25d-8e297b77380e | null
history | 9bdb925a-7547-11ec-d754-aa2cc394a22c | null
history | 9bd8d830-7547-11ec-1550-5fd155e6cd86 | null
history | 9bd36666-7547-11ec-230c-8702bc785cb9 | Add new columns to system_distributed.service_levels
history | 9bd0a156-7547-11ec-a834-85eac94fd3b8 | Create system_distributed(_everywhere) tables
history | 9bcfef18-7547-11ec-76d9-c23dfa1b3e6a | Create system_distributed_everywhere keyspace
history | 9bcec89a-7547-11ec-e1b4-34e0010b4183 | Create system_distributed keyspace
The group 0 state machine will only modify state during command
application if the provided "previous state ID" is equal to the
last state ID present in the history table. Otherwise, the command will
be a no-op.
To ensure linearizability of group 0 changes, the performer of the
change must first read the last state ID, only then read the state
and send a command for the state machine. If a concurrent change
races with this command and manages to modify the state, we will detect
that the last state ID does not match during `apply`; all calls to
`apply` are serialized, and `apply` adds the new entry to the history
table at the end, after modifying the group 0 state.
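The conditional-apply rule above can be captured in a few lines (names are illustrative, not Scylla's actual types): a command carries the state ID its issuer observed, and `apply` is a no-op unless that ID is still the newest entry in the history.

```cpp
#include <cassert>
#include <string>
#include <vector>

// The command payload: the "previous state ID" read by the issuer, the new
// state ID to append on success, and a toy state change.
struct command {
    std::string prev_state_id;
    std::string new_state_id;
    int delta;
};

// Minimal model of the group 0 state machine's conditional apply.
struct group0_state_machine_model {
    std::vector<std::string> history; // newest entry at the back
    int state = 0;

    // Returns true if the command took effect.
    bool apply(const command& cmd) {
        if (history.empty() ? !cmd.prev_state_id.empty()
                            : history.back() != cmd.prev_state_id) {
            return false; // a concurrent change slipped in: no-op
        }
        state += cmd.delta;                  // modify group 0 state first...
        history.push_back(cmd.new_state_id); // ...then append to the history
        return true;
    }
};
```

Because all `apply` calls are serialized and the history append happens last, a racing command that observed a stale state ID is rejected deterministically.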
The details of this mechanism are abstracted away with `group0_guard`.
To perform a group 0 change, one needs to call `announce`, which
requires a `group0_guard` to be passed in. The only way to obtain a
`group0_guard` is by calling `start_group0_operation`, which underneath
performs a read barrier on group 0, obtains the last state ID from the
history table, and constructs a new state ID that the change will append
to the history table. The read barrier ensures that all previously
completed changes are visible to this operation. The caller can then
perform any necessary validation, construct mutations which modify group
0 state, and finally call `announce`.
The guard also provides a timestamp which is used by the caller
to construct the mutations. The timestamp is obtained from the new state ID.
We ensure that it is greater than the timestamp of the last state ID.
Thus, if the change is successful, the applied mutations will have greater
timestamps than the previously applied mutations.
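The timestamp rule amounts to a one-line monotonicity clamp (a hypothetical helper, sketched here for illustration): the guard's timestamp is derived from the new state ID, bumped if necessary so it strictly exceeds the previous state ID's timestamp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Ensure the guard's timestamp strictly exceeds the last applied one,
// even if the clock behind the new state ID went backwards.
int64_t guard_timestamp(int64_t new_state_id_ts, int64_t last_state_id_ts) {
    return std::max(new_state_id_ts, last_state_id_ts + 1);
}
```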
We also add two locks. The more important one, used to ensure
correctness, is `read_apply_mutex`. It is held when modifying group 0
state (in `apply` and `transfer_snapshot`) and when reading it (it's
taken when obtaining a `group0_guard` and released before a command is
sent in `announce`). Its goal is to ensure that we don't read partial
state, which could happen without it because group 0 state consists of
many parts and `apply` (or `transfer_snapshot`) potentially modifies all
of them. Note: this doesn't give us 100% protection; if we crash in the
middle of `apply` (or `transfer_snapshot`), then after restart we may
read partial state. To remove this possibility we need to ensure that
commands which were being applied before restart but not finished are
re-applied after restart, before anyone can read the state. I left a
TODO in `apply`.
The second lock, `operation_mutex`, is used to improve liveness. It is
taken when obtaining a `group0_guard` and released after a command is
applied (compare to `read_apply_mutex` which is released before a
command is sent). It is not taken inside `apply` or `transfer_snapshot`.
This lock ensures that multiple fibers running on the same node do not
attempt to modify group0 concurrently - this would cause some of them to
fail (due to the concurrent modification protection described above).
This is mostly important during first boot of the first node, when
services start for the first time and try to create their internal
tables. This lock serializes these attempts, ensuring that all of them
succeed.
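The acquire/release ordering of the two locks can be traced in a single-threaded sketch (the real code uses seastar fiber-level locks on one shard; `std::mutex` here only illustrates the ordering, and the trace strings are made up):

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <vector>

// Toy trace of one group 0 operation under the two-lock protocol.
std::vector<std::string> run_group0_operation() {
    std::vector<std::string> trace;
    std::mutex operation_mutex, read_apply_mutex;

    std::unique_lock<std::mutex> op(operation_mutex); // taken when obtaining the guard
    trace.push_back("lock operation_mutex");
    std::unique_lock<std::mutex> ra(read_apply_mutex); // also taken for the guard
    trace.push_back("lock read_apply_mutex");
    trace.push_back("read last state ID and group 0 state");
    ra.unlock(); // released before the command is sent in announce
    trace.push_back("unlock read_apply_mutex");
    trace.push_back("send command; apply() retakes read_apply_mutex");
    op.unlock(); // released only after the command is applied
    trace.push_back("unlock operation_mutex");
    return trace;
}
```

Holding `operation_mutex` across the whole operation is what serializes same-node fibers, while `read_apply_mutex` only brackets the state read so `apply` can take it independently.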
`announce` now takes a `group0_guard` by value. A `group0_guard` can only
be obtained through `migration_manager::start_group0_operation` and
moved; it cannot be constructed outside `migration_manager`.
The guard will be a method of ensuring linearizability for group 0
operations.
This object will be used to "guard" group 0 operations. Obtaining it
will be necessary to perform a group 0 change (such as modifying the
schema), which will be enforced by the type system.
The initial implementation is a stub and only provides a timestamp which
will be used by callers to create mutations for group 0 changes. The
next commit will change all call sites to use the guard as intended.
The final implementation, coming later, will ensure linearizability of
group 0 operations.
1. Generalize the name so it mentions group 0, which schema will be a
strict subset of.
2. Remove the fact that it performs a "read barrier" from the name. The
function will be used in general to ensure linearizability of group0
operations - both reads and writes. "Read barrier" is Raft-specific
terminology, so it can be thought of as an implementation detail.
The functions which prepare schema change mutations (such as
`prepare_new_column_family_announcement`) would use internally
generated timestamps for these mutations. When schema changes are
managed by group 0 we want to ensure that timestamps of mutations
applied through Raft are monotonic. We will generate these timestamps at
call sites and pass them into the `prepare_` functions. This commit
prepares the APIs.
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes were applied mechanically with a script, except for
licenses/README.md.
Closes #9937
This series greatly reduces the gossiper's dependence on `seastar::async` (though not yet completely).
`i_endpoint_state_change_subscriber` callbacks are converted to return futures (again, to get rid of `seastar::async` dependency), all users are adjusted appropriately (e.g. `storage_service`, `cdc::generation_service`, `streaming::stream_manager`, `view_update_backlog_broker` and `migration_manager`).
This includes futurizing and coroutinizing the whole function call chain up to the `i_endpoint_state_change_subscriber` callback functions.
To aid the conversion process, a non-`seastar::async` dependent variant of `utils::atomic_vector::for_each` is introduced (`for_each_futurized`). A different name is used to clearly distinguish converted and non-converted code, so that the last step (remove `seastar::async()` wrappers around callback-calling code in gossiper) is easier. This is left for a follow-up series, though.
Tests: unit(dev)
Closes #9844
* github.com:scylladb/scylla:
service: storage_service: coroutinize `set_gossip_tokens`
service: storage_service: coroutinize `leave_ring`
service: storage_service: coroutinize `handle_state_left`
service: storage_service: coroutinize `handle_state_leaving`
service: storage_service: coroutinize `handle_state_removing`
service: storage_service: coroutinize `do_drain`
service: storage_service: coroutinize `shutdown_protocol_servers`
service: storage_service: coroutinize `excise`
service: storage_service: coroutinize `remove_endpoint`
service: storage_service: coroutinize `handle_state_replacing`
service: storage_service: coroutinize `handle_state_normal`
service: storage_service: coroutinize `update_peer_info`
service: storage_service: coroutinize `do_update_system_peers_table`
service: storage_service: coroutinize `update_table`
service: storage_service: coroutinize `handle_state_bootstrap`
service: storage_service: futurize `notify_*` functions
service: storage_service: coroutinize `handle_state_replacing_update_pending_ranges`
repair: row_level_repair_gossip_helper: coroutinize `remove_row_level_repair`
locator: reconnectable_snitch_helper: coroutinize `reconnect`
gms: i_endpoint_state_change_subscriber: make callbacks to return futures
utils: atomic_vector: introduce future-returning `for_each` function
utils: atomic_vector: rename `for_each` to `thread_for_each`
gms: gossiper: coroutinize `start_gossiping`
gms: gossiper: coroutinize `force_remove_endpoint`
gms: gossiper: coroutinize `do_status_check`
gms: gossiper: coroutinize `remove_endpoint`
This was needed to fix issue #2129, which only manifested itself with
auto_bootstrap set to false. The option is now ignored and we always
wait for schema to sync during boot.
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.
References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.
scylla-gdb.py is adjusted to look for both the new and old names.
Currently a keyspace mutation is included in the schema mutation list
just before announcement. Move the inclusion to a separate function. It
will be used later, when instead of announcing the new schema the
mutation array will be returned.
Currently the storage service acts as glue between the database schema
value and the migration manager's "passive_announce" call. This
interposition is not required: the migration manager can do all the
management itself, and the linkage can be done in main.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Because today migration_manager::stop is called at drain time.
Keep the .stop for the next patch, but since it's called when the
whole migration_manager stops, guard it against re-entrance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Both calls are now private. Also, the non-maybe one can become void
and handle pull exceptions by itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is in preparation for starting schema pulls upon on_join, on_alive
and on_change notifications in the next patch. The migration manager
already has a gossiper reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The patch also removes the usage of map_reduce(), because it is no longer
needed after 6191fd7701, which drops futures from the view mutation
building path. The patch preserves the yielding point that map_reduce()
provided, though, by calling coroutine::maybe_yield() explicitly.
Message-Id: <YZoV3GzJsxR9AZfl@scylladb.com>