Commit Graph

29931 Commits

Author SHA1 Message Date
Piotr Sarna
99c5bec0e2 test: add a context manager for error injection to alternator
With the new context manager it's now easier to request an error
to be injected via the REST API. Note that error injection is only
enabled in certain build modes (dev, debug, sanitize),
so the test case will be skipped if this mechanism cannot be used.
2022-01-31 14:21:55 +01:00
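
For illustration, a sketch of what such a context manager could look like. This is a minimal sketch assuming hypothetical endpoint paths and helper names, not the actual test-suite API:

```python
import contextlib
import pytest
import requests

@contextlib.contextmanager
def scylla_inject_error(rest_api_url, error_name, one_shot=False):
    # Ask Scylla to start injecting the named error via the REST API.
    resp = requests.post(
        f"{rest_api_url}/v2/error_injection/injection/{error_name}",
        params={"one_shot": str(one_shot).lower()})
    if resp.status_code != 200:
        # Error injection is compiled out in this build mode (e.g. release);
        # skip the test case, as described above.
        pytest.skip("error injection not enabled in this build")
    try:
        yield
    finally:
        # Always disable the injection, even if the test body failed.
        requests.delete(
            f"{rest_api_url}/v2/error_injection/injection/{error_name}")
```
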
Piotr Sarna
d50ed944f2 alternator: add error injection to BatchGetItem
When error injection is enabled at compile time, it's now possible
to inject an error into BatchGetItem in order to produce a partial
read, i.e. one in which only some of the items were retrieved successfully.
2022-01-31 12:56:00 +01:00
Piotr Sarna
31f4f062a2 alternator: fill UnprocessedKeys for failed batch reads
The DynamoDB protocol specifies that when getting items in a batch
fails only partially, the unprocessed keys can be returned so that
the user can perform a retry.
Alternator used to fail the whole request if any of the reads failed,
but now it instead produces the list of unprocessed keys
and returns it to the user, as long as at least one read was
successful.

NOTE: tested manually by compiling Scylla with an error injection
that fails every nth request. It's rather hard to come up with an
automated test case for this scenario.

Fixes #9984
2022-01-31 12:56:00 +01:00
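
From the client's side, the retry loop this change enables looks roughly like the following boto3 sketch (table names and the endpoint are illustrative):

```python
import boto3

def batch_get_all(request_items, endpoint_url, region="us-east-1"):
    # With Alternator, point boto3 at the Scylla endpoint instead of AWS.
    client = boto3.client("dynamodb", region_name=region,
                          endpoint_url=endpoint_url)
    results = {}
    while request_items:
        resp = client.batch_get_item(RequestItems=request_items)
        for table, items in resp["Responses"].items():
            results.setdefault(table, []).extend(items)
        # Keys that failed in this round come back here; retry only those
        # instead of failing the whole request.
        request_items = resp.get("UnprocessedKeys", {})
    return results
```

A production client would also back off between iterations, as the DynamoDB documentation recommends.
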
Nadav Har'El
59fe6a402c test/cql-pytest: use unique keys instead of random keys
Some of the tests in test/cql-pytest share the same table but use
different keys to ensure they don't collide. Before this patch we used a
random key, which was usually fine, but we recently noticed that the
pytest-randomly plugin may cause different tests to run through the *same*
sequence of random numbers and ruin our intent that different tests use
different keys.

So instead of using a *random* key, let's use a *unique* key. We can
achieve this uniqueness trivially - using a counter variable - because
anyway the uniqueness is only needed inside a single temporary table -
which is different in every run.

Another benefit is that it will now be clearer that the tests are
deterministic and not random - the intent of a random_string() key
was never to randomly walk the entire key space (random_string()
anyway had a pretty narrow idea of what a random string looks like) -
it was just to get a unique key.

Refs #9988 (fixes it for cql-pytest, but not for test/alternator)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-01-31 09:01:23 +02:00
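
A minimal sketch of the unique-key scheme (the helper name is illustrative):

```python
import itertools

# A process-wide counter: each call yields a new, never-repeated value,
# no matter how pytest-randomly reseeds the random module.
_unique_counter = itertools.count()

def unique_key(prefix="key"):
    # Uniqueness within one run is enough, because every run gets a
    # fresh temporary table.
    return f"{prefix}{next(_unique_counter)}"
```
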
Tomasz Grabiec
b734615f51 util: cached_file: Fix corruption after memory reclamation was triggered from population
If memory reclamation is triggered inside _cache.emplace(), the _cache
btree can get corrupted. Reclaimers erase from it, and emplace()
assumes that the tree is not modified during its execution. It first
locates the target node and then does memory allocation.

Fix by running emplace() under allocating section, which disables
memory reclamation.

The bug manifests as assertion failures, e.g.:

./utils/bptree.hh:1699: void bplus::node<unsigned long, cached_file::cached_page, cached_file::page_idx_less_comparator, 12, bplus::key_search::linear, bplus::with_debug::no>::refill(Less) [Key = unsigned long, T = cached_file::cached_page, Less = cached_file::page_idx_less_comparator, NodeSize = 12, Search = bplus::key_search::linear, Debug = bplus::with_debug::no]: Assertion `p._kids[i].n == this' failed.

Fixes #9915

Message-Id: <20220130175639.15258-1-tgrabiec@scylladb.com>
2022-01-30 19:57:35 +02:00
Benny Halevy
3cee0f8bd9 shared_token_metadata: mutate_token_metadata: bump cloned copy ring_version
Currently this is done only in
storage_service::get_mutable_token_metadata_ptr
but it needs to be done here as well for code paths
calling mutate_token_metadata directly.

Currently, it is only called from network_topology_strategy_test.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220130152157.2596086-1-bhalevy@scylladb.com>
2022-01-30 18:15:08 +02:00
Piotr Sarna
471205bdcf test/alternator: use a global random generator for all test cases
It was observed (perhaps it depends on the Python implementation)
that an identical seed was used for multiple test cases,
which violated the assumption that generated values are in fact
unique. Using a global generator instead makes sure that it is
seeded only once.

Tests: unit(dev) # alternator tests used to fail for me locally
  before this patch was applied
Message-Id: <315d372b4363f449d04b57f7a7d701dcb9a6160a.1643365856.git.sarna@scylladb.com>
2022-01-30 16:40:20 +02:00
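
The pattern, sketched with a `random_string` helper standing in for the test utilities (names are illustrative):

```python
import random
import string

# One module-level generator, seeded exactly once at import time and
# shared by all test cases, instead of per-test generators that may end
# up with identical seeds.
_rng = random.Random()

def random_string(length=10):
    return "".join(_rng.choices(string.ascii_lowercase, k=length))
```
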
Tomasz Grabiec
3e31126bdf Merge "Brush up the initial tokens generation code" from Pavel Emelyanov
On start the storage_service sets up initial tokens. Some dangling
variables, checks and code duplication had accumulated over time.

* xemul/br-storage-service-bootstrap-leftovers:
  dht: Use db::config to generate initial tokens
  database, dht: Move get_initial_tokens()
  storage_service: Factor out random/config tokens generation
  storage_service: No extra get_replace_address checks
  storage_service: Remove write-only local variable
2022-01-28 15:54:45 +01:00
Pavel Emelyanov
89a7c750ea Merge "Deglobalize repair_meta_map" from Benny
This series moves the static thread_local repair_meta_map instances
into the repair_service shards.

Refs #9809

Test: unit(release) (including scylla-gdb)
Dtest: repair_additional_test.py::TestRepairAdditional::{test_repair_disjoint_row_2nodes,test_repair_joint_row_3nodes_2_diff_shard_count} replace_address_test.py::TestReplaceAddress::test_serve_writes_during_bootstrap[rbo_enabled](release)

* git@github.com:bhalevy/scylla.git deglobalize-repair_meta_map-v1
  repair_service: deglobalize get_next_repair_meta_id
  repair_service: deglobalize repair_meta_map
  repair_service: pass reference to service to row_level_repair_gossip_helper
  repair_meta: define repair_meta_ptr
  repair_meta: move static repair_meta map functions out of line
  repair_meta: make get_set_diff a free function
  repair: repair_meta: no need to keep sharded<netw::messaging_service>
  repair: repair_meta: derive subordinate services from repair_service
  repair: pass repair_service to repair_meta
2022-01-28 14:12:33 +02:00
Avi Kivity
34252eda26 Update seastar submodule
* seastar 5524f229b...0d250d15a (6):
  > core: memory: Avoid current_backtrace() on alloc failure when logging suppressed
Fixes #9982
  > Merge "Enhance io-tester and its rate-limited job" from Pavel E
  > queue: pop: assert that the queue is not empty
  > io_queue: properly declare io_queue_for_tests
  > reactor: Fix off-by-end-of-line misprint in legacy configuration
  > fair_queue: Fix move constructor
2022-01-28 14:12:33 +02:00
Tomasz Grabiec
7ee79fa770 logalloc: Add more logging
Message-Id: <20220127232009.314402-1-tgrabiec@scylladb.com>
2022-01-28 14:12:33 +02:00
Pavel Emelyanov
1525c04db3 dht: Use db::config to generate initial tokens
The replica::database is passed into the helper just to get the
config from it. Better to use the config directly without messing
with the database.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-01-27 16:41:29 +03:00
Pavel Emelyanov
77532a6a36 database, dht: Move get_initial_tokens()
The helper in question has nothing to do with replica/database and
is only used by dht to convert a config option to a set of tokens.
The helper deserves to live where it's needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-01-27 16:41:29 +03:00
Pavel Emelyanov
50170366ea storage_service: Factor out random/config tokens generation
There's a place in the normal node start path that parses the
initial_token option or generates num_tokens random tokens. This code
has been used almost unchanged since being ported from its Java version.
Later, dht::get_bootstrap_token() appeared with the same internal logic.

This patch generalizes these two places. Logging messages are unified
too (dtests don't seem to check those).

The change also improves a corner case. The normal node startup code doesn't
check whether initial_token is empty while num_tokens is 0, generating an
empty bootstrap_tokens set. It then fails later with an obscure
'remove_endpoint should be used instead' message.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-01-27 16:41:29 +03:00
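
In pseudocode, the generalized logic might look like this sketch (field and helper names are assumptions, not the actual storage_service code):

```python
import random

def random_token() -> int:
    # Murmur3 token range, for illustration.
    return random.randint(-2**63, 2**63 - 1)

def get_bootstrap_tokens(cfg):
    # Prefer explicitly configured tokens.
    if cfg.initial_token:
        return {int(t) for t in cfg.initial_token.split(",")}
    # The corner case described above: fail early with a clear message
    # instead of producing an empty token set and an obscure error later.
    if cfg.num_tokens == 0:
        raise ValueError("neither initial_token nor num_tokens is configured")
    return {random_token() for _ in range(cfg.num_tokens)}
```
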
Pavel Emelyanov
7b521405e4 storage_service: No extra get_replace_address checks
get_replace_address() returns optional<inet_address>, but
in many cases it's used under an if (is_replacing()) branch which,
in turn, returns bool(get_replace_address()), so the code is only
executed if the returned optional is engaged.

The extra checks can be removed, making the code a tiny bit shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-01-27 16:41:29 +03:00
Pavel Emelyanov
330f2cfcfc storage_service: Remove write-only local variable
The set of tokens used to be used after being filled, but now
it's write-only.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-01-27 16:41:25 +03:00
Benny Halevy
f8db9e1bd8 repair_service: deglobalize get_next_repair_meta_id
Rather than using a static uint32_t next_id,
move the next_id variable into repair_service shard 0
and manage it there.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-27 11:34:21 +02:00
Benny Halevy
90ba9013be repair_service: deglobalize repair_meta_map
Move the static repair_meta_map into the repair_service
and expose it from there.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-27 11:01:47 +02:00
Benny Halevy
e6b6fdc9a0 repair_service: pass reference to service to row_level_repair_gossip_helper
Note that we can't pass the repair_service container()
from its ctor since it's not populated until all shards start.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-27 11:00:26 +02:00
Benny Halevy
3008ecfd4e repair_meta: define repair_meta_ptr
Keep repair_meta in repair_meta_map as shared_ptr<repair_meta>
rather than lw_shared_ptr<repair_meta>, so the map can be defined
in the header file using only a forward-declared repair_meta class.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-27 09:18:14 +02:00
Benny Halevy
fdc0a9602c repair_meta: move static repair_meta map functions out of line
Define the static {get,insert,remove}_repair_meta functions outside
the repair_meta class definition, on the way to moving them,
along with repair_meta_map itself, into repair_service.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-27 09:15:09 +02:00
Benny Halevy
b5427cc6d1 repair_meta: make get_set_diff a free function
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-27 09:13:09 +02:00
Benny Halevy
224e7497e0 repair: repair_meta: no need to keep sharded<netw::messaging_service>
All repair_meta needs is the local instance.
If need be, it's a peering service, so container()
can be used.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-27 09:13:09 +02:00
Benny Halevy
c4ac92b2b7 repair: repair_meta: derive subordinate services from repair_service
Use repair_service as the authoritative source for
the database, messaging_service, system_distributed_keyspace,
and view_update_generator, similar to repair_info.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-27 09:12:53 +02:00
Benny Halevy
a71d6333e4 repair: pass repair_service to repair_meta
Prepare for holding the repair_meta_map in repair_service.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-27 09:12:51 +02:00
Tomasz Grabiec
ba6c02b38a Merge "Clear old entries from group 0 history when performing schema changes" from Kamil
When performing a change through group 0 (which right now means schema
changes), clear entries from group 0 history table which are older
than one week.

This is done by including an appropriate range tombstone in the group 0
history table mutation.

* kbr/g0-history-gc-v2:
  idl: group0_state_machine: fix license blurb
  test: unit test for clearing old entries in group0 history
  service: migration_manager: clear old entries from group 0 history when announcing
2022-01-26 16:12:40 +01:00
Avi Kivity
df22396a34 Merge 'scylla_raid_setup: use mdmonitor only when RAID level > 0' from Takuya ASADA
We found that mdadm's monitor mode does not work on RAID0; this is
not a bug but expected behavior, according to a RHEL developer.
Therefore, we should stop enabling mdmonitor when RAID0 is specified.

Fixes #9540

----

This reverts 0d8f932 and introduces the correct fix.

Closes #9970

* github.com:scylladb/scylla:
  scylla_raid_setup: use mdmonitor only when RAID level > 0
  Revert "scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8"
2022-01-26 15:34:47 +02:00
Takuya ASADA
32f2eb63ac scylla_raid_setup: use mdmonitor only when RAID level > 0
We found that mdadm's monitor mode does not work on RAID0; this is
not a bug but expected behavior, according to a RHEL developer.
Therefore, we should stop enabling mdmonitor when RAID0 is specified.

Fixes #9540
2022-01-26 22:33:07 +09:00
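
The resulting conditional amounts to something like this sketch (function names are illustrative, not the actual scylla_raid_setup code):

```python
import subprocess

def maybe_enable_mdmonitor(raid_level: int):
    # mdadm --monitor intentionally does nothing for RAID0, so only
    # enable the service when the array has redundancy to monitor.
    if raid_level > 0:
        subprocess.run(
            ["systemctl", "enable", "--now", "mdmonitor.service"],
            check=True)
```
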
Takuya ASADA
cd57815fff Revert "scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8"
This reverts commit 0d8f932f0b,
because a RHEL developer explained that this is not a bug but expected
behavior (mdadm --monitor does not start when the RAID level is 0).
see: https://bugzilla.redhat.com/show_bug.cgi?id=2031936

So we should stop downgrading the mdadm package and instead modify our
script not to enable mdmonitor.service on RAID0, using it only for RAID5.
2022-01-26 22:33:06 +09:00
Gleb Natapov
579dcf187a raft: allow an option to persist commit index
Raft does not need to persist the commit index, since a restarted node will
either learn it from an append message from a leader or (if the entire cluster
is restarted and hence there is no leader) a new leader will figure it out
after contacting a quorum. But some users may want to bring their local state
machine to a state as up-to-date as it was before the restart, as soon as
possible and without any external communication.

For them, this patch introduces a new persistence API that allows saving
and restoring the last seen committed index.

Message-Id: <YfFD53oS2j1My0p/@scylladb.com>
2022-01-26 14:06:39 +01:00
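
The shape of the optional API, approximated in Python (the real interface is C++; the method names here are assumptions):

```python
class Persistence:
    def store_commit_idx(self, idx: int) -> None:
        # Optionally persist the last seen committed index. A no-op
        # implementation stays correct; persisting merely lets the local
        # state machine catch up without external communication.
        ...

    def load_commit_idx(self) -> int:
        # Return the persisted commit index, or 0 if none was saved.
        ...
```
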
Calle Wilund
43f51e9639 commitlog: Ensure we don't run continuation (task switch) with queues modified
Fixes #9955

In #9348 we handled the problem of failing to delete segment files on disk, and
the need to recompute disk footprint to keep data flow consistent across intermittent
failures. However, because _reserve_segments and _recycled_segments are queues, we
have to empty them to inspect the contents. One would think it is ok for these
queues to be empty for a while, whilst we do some recaclulating, including
disk listing -> continuation switching. But then one (i.e. I) misses the fact
that these queues use the pop_eventually mechanism, which does _not_ handle
a scenario where we push something into an empty queue, thus triggering the
future that resumes a waiting task, but then pop the element immediately, before
the waiting task is run. In fact, _iff_ one does this, not only will things break,
they will in fact start creating undefined behaviour, because the underlying
std::queue<T, circular_buffer> will _not_ do any bounds checks on the pop/push
operations -> we will pop an empty queue, immediately making it non-empty, but
using undefined memory (with luck null/zeroes).

Strictly speakging, seastar::queue::pop_eventually should be fixed to handle
the scenario, but nontheless we can fix the usage here as well, by simply copy
objects and do the calculation "in background" while we potentially start
popping queue again.

Closes #9966
2022-01-26 13:51:01 +02:00
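
The hazard is not seastar-specific. Here is a self-contained asyncio illustration of the same unsafe pattern (a deliberately naive queue, not seastar's); running it raises IndexError where the C++ equivalent would be undefined behaviour:

```python
import asyncio
import collections

class NaiveQueue:
    """Reproduces the unsafe pattern: the sleeping popper assumes the
    queue is still non-empty by the time it finally runs."""
    def __init__(self):
        self._items = collections.deque()
        self._nonempty = asyncio.Event()

    def push(self, item):
        self._items.append(item)
        self._nonempty.set()          # wakes any waiter...

    def pop_now(self):
        item = self._items.popleft()  # ...but we may pop first
        if not self._items:
            self._nonempty.clear()
        return item

    async def pop_eventually(self):
        await self._nonempty.wait()
        # No bounds check, just like std::queue: pops an empty queue.
        return self._items.popleft()

async def main():
    q = NaiveQueue()
    waiter = asyncio.create_task(q.pop_eventually())
    await asyncio.sleep(0)  # let the waiter block on the event
    q.push("segment")
    q.pop_now()             # pop before the waiter gets to run
    await waiter            # IndexError: pop from an empty deque

asyncio.run(main())
```
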
Avi Kivity
f5cd6ec419 Update tools/python3 submodule (relicensed to Apache License 2.0)
* tools/python3 8a77e76...f725ec7 (2):
  > Relicense to Apache 2.0
  > treewide: use Software Package Data Exchange (SPDX) license identifiers
2022-01-25 18:50:39 +02:00
Kamil Braun
f3c0c73d36 idl: group0_state_machine: fix license blurb 2022-01-25 17:48:46 +01:00
Kamil Braun
bf91dcd1e3 idl: group0_state_machine: fix license blurb 2022-01-25 13:14:47 +01:00
Kamil Braun
b863a63b08 test: unit test for clearing old entries in group0 history
We perform a bunch of schema changes with different values of
`migration_manager::_group0_history_gc_duration` and check if entries
are cleared according to this setting.
2022-01-25 13:13:35 +01:00
Kamil Braun
e9083433a8 service: migration_manager: clear old entries from group 0 history when announcing
When performing a change through group 0 (which right now only covers
schema changes), clear entries from group 0 history table which are older
than one week.

This is done by including an appropriate range tombstone in the group 0
history table mutation.
2022-01-25 13:11:14 +01:00
Botond Dénes
eb42213db4 compact_mutation: close active range tombstone on page end
The compactor recently acquired the ability to consume a v2 stream. The
v2 spec requires that all streams end with a null tombstone.
`range_tombstone_assembler`, the component the compactor uses for
converting the v2 input into its v1 output enforces this with a check on
`consume_end_of_partition()`. Normally the producer of the stream the
compactor is consuming takes care of closing the active tombstone before
the stream ends. The compactor however (or its consumer) can decide to
end the consume early, e.g. to cut the current page. When this happens
the compactor must take care of closing the tombstone itself.
Furthermore it has to keep this tombstone around to re-open it on the
next page.
This patch implements this mechanism which was left out of 134601a15e.
It also adds a unit test which reproduces the problems caused by the
missing mechanism.
The compactor now tracks the last clustering position emitted. When the
page ends, this position will be used as the position of the closing
range tombstone change. This ensures the range tombstone only covers the
actually emitted range.

Fixes: #9907

Tests: unit(dev), dtest(paging_test.py, paging_additional_test.py)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220114053215.481860-1-bdenes@scylladb.com>
2022-01-25 09:52:30 +02:00
Gleb Natapov
e56e96ac5a raft: do not add new wait entries after abort
Abort signals stopped_error on all awaited entries, but if an entry is
added after this, it will be destroyed without signaling, causing its
waiter to get a broken_promise.

Fixes #9688

Message-Id: <Ye6xJjTDooKSuZ87@scylladb.com>
2022-01-25 09:52:30 +02:00
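
The fix, sketched with asyncio futures standing in for seastar promises (an analogy, not the raft code itself):

```python
import asyncio

class WaitList:
    def __init__(self):
        self._waiters = {}     # log index -> future
        self._aborted = False

    def abort(self):
        self._aborted = True
        for fut in self._waiters.values():
            fut.set_exception(RuntimeError("stopped"))  # stopped_error
        self._waiters.clear()

    def wait_for(self, index):
        fut = asyncio.get_running_loop().create_future()
        if self._aborted:
            # The fix: fail new waiters immediately instead of letting
            # them be destroyed unsignalled (broken_promise).
            fut.set_exception(RuntimeError("stopped"))
        else:
            self._waiters[index] = fut
        return fut
```
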
Tomasz Grabiec
c89b1953f8 Merge "Enforce linearizability of group 0 operations using state IDs" from Kamil
We introduce a new table, `system.group0_history`.

This table will contain a history of all group 0 changes applied through
Raft. Associated with each change is a unique ID, which also identifies
the state of all group 0 tables (including schema tables) after this
change is applied, assuming that all such changes are serialized through
Raft (they will be eventually).

Group 0 commands, in addition to the mutations which modify group 0 tables,
contain a "previous state ID" and a "new state ID".

The group 0 state machine will only modify state during command
application if the provided "previous state ID" is equal to the
last state ID present in the history table. Otherwise, the command will
be a no-op.

To ensure linearizability of group 0 changes, the performer of the
change must first read the last state ID, only then read the state
and send a command for the state machine. If a concurrent change
races with this command and manages to modify the state, we will detect
that the last state ID does not match during `apply`; all calls to
`apply` are serialized, and `apply` adds the new entry to the history
table at the end, after modifying the group 0 state.

The details of this mechanism are abstracted away with `group0_guard`.
To perform a group 0 change, one needs to call `announce`, which
requires a `group0_guard` to be passed in. The only way to obtain a
`group0_guard` is by calling `start_group0_operation`, which underneath
performs a read barrier on group 0, obtains the last state ID from the
history table, and constructs a new state ID that the change will append
to the history table. The read barrier ensures that all previously
completed changes are visible to this operation. The caller can then
perform any necessary validation, construct mutations which modify group 0
state, and finally call `announce`.

The guard also provides a timestamp which is used by the caller
to construct the mutations. The timestamp is obtained from the new state ID.
We ensure that it is greater than the timestamp of the last state ID.
Thus, if the change is successful, the applied mutations will have greater
timestamps than the previously applied mutations.

We also add two locks. The more important one, used to ensure
correctness, is `read_apply_mutex`. It is held when modifying group 0
state (in `apply` and `transfer_snapshot`) and when reading it (it's
taken when obtaining a `group0_guard` and released before a command is
sent in `announce`). Its goal is to ensure that we don't read partial
state, which could happen without it because group 0 state consists of
many parts and `apply` (or `transfer_snapshot`) potentially modifies all
of them. Note: this doesn't give us 100% protection; if we crash in the
middle of `apply` (or `transfer_snapshot`), then after restart we may
read partial state. To remove this possibility we need to ensure that
commands which were being applied before restart but not finished are
re-applied after restart, before anyone can read the state. I left a
TODO in `apply`.

The second lock, `operation_mutex`, is used to improve liveness. It is
taken when obtaining a `group0_guard` and released after a command is
applied (compare to `read_apply_mutex` which is released before a
command is sent). It is not taken inside `apply` or `transfer_snapshot`.
This lock ensures that multiple fibers running on the same node do not
attempt to modify group0 concurrently - this would cause some of them to
fail (due to the concurrent modification protection described above).
This is mostly important during first boot of the first node, when
services start for the first time and try to create their internal
tables. This lock serializes these attempts, ensuring that all of them
succeed.

* kbr/schema-state-ids-v4:
  service: migration_manager: `announce`: take a description parameter
  service: raft: check and update state IDs during group 0 operations
  service: raft: group0_state_machine: introduce `group0_command`
  service: migration_manager: allow using MIGRATION_REQUEST verb to fetch group 0 history table
  service: migration_manager: convert migration request handler to coroutine
  db: system_keyspace: introduce `system.group0_history` table
  treewide: require `group0_guard` when performing schema changes
  service: migration_manager: introduce `group0_guard`
  service: raft: pass `storage_proxy&` to `group0_state_machine`
  service: raft: raft_state_machine: pass `snapshot_descriptor` to `transfer_snapshot`
  service: raft: rename `schema_raft_state_machine` to `group0_state_machine`
  service: migration_manager: rename `schema_read_barrier` to `start_group0_operation`
  service: migration_manager: `announce`: split raft and non-raft paths to separate functions
  treewide: pass mutation timestamp from call sites into `migration_manager::prepare_*` functions
  service: migration_manager: put notifier call inside `async`
  service: migration_manager: remove some unused and disabled code
  db: system_distributed_keyspace: use current time when creating mutations in `start()`
  redis: keyspace_utils: `create_keyspace_if_not_exists_impl`: call `announce` twice only
2022-01-25 09:52:30 +02:00
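
In pseudocode, the check performed on `apply` boils down to a compare-and-append over the history table (a sketch; the real implementation is the C++ `group0_state_machine`):

```python
def apply(history, command, apply_mutations):
    # All applies are serialized, so reading the history tail is race-free.
    if command.prev_state_id != history[-1]:
        # A concurrent change won the race; the command becomes a no-op.
        return
    apply_mutations(command.mutations)
    # Appending last makes the new state ID visible only after the state
    # itself has been modified.
    history.append(command.new_state_id)
```
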
Avi Kivity
a105b09475 build: prepare for Scylla 5.0
We decided to name the next version Scylla 5.0, in honor of
Raft-based schema management.
2022-01-25 09:52:30 +02:00
Avi Kivity
277303a722 build_indexes_virtual_reader: convert to flat_mutation_reader_v2
Since it doesn't handle range tombstones in any way, the conversion
consists of just using the new type names.

Closes #9948
2022-01-25 09:52:30 +02:00
Avi Kivity
007145e033 validation: complete transition to data_dictionary module
The API was converted in 00de5f4876,
but some #includes remain. Remove them.

Closes #9947
2022-01-25 09:52:30 +02:00
Avi Kivity
e74f570eda alternator: streams: fix use-after-free of data_dictionary in describe_stream()
In 4aa9e86924 ("Merge 'alternator: move uses of replica module to
data_dictionary' from Avi Kivity"), we changed alternator to use
data_dictionary instead of replica::database. However,
data_dictionary::database objects are different from replica::database
objects in that they don't have a stable address and need to be
captured by value (they are pointer-like). One capture in
describe_stream() was capturing a data_dictionary::database
by reference and so caused a use-after-free when the previous
continuation was deallocated.

Fix by capturing by value.

Fixes #9952.

Closes #9954
2022-01-25 09:52:30 +02:00
Kamil Braun
044e05b0d9 service: migration_manager: announce: take a description parameter
The description parameter is used for the group 0 history mutation.
The default is empty, in which case the mutation will leave
the description column as `null`.
I filled the parameter in some easy places as an example and left the
rest for a follow-up.

This is how it looks now in a fresh cluster with a single statement
performed by the user:

cqlsh> select * from system.group0_history ;

 key     | state_id                             | description
---------+--------------------------------------+------------------------------------------------------
 history | 9ec29cac-7547-11ec-cfd6-77bb9e31c952 |                                    CQL DDL statement
 history | 9beb2526-7547-11ec-7b3e-3b198c757ef2 |                                                 null
 history | 9be937b6-7547-11ec-3b19-97e88bd1ca6f |                                                 null
 history | 9be784ca-7547-11ec-f297-f40f0073038e |                                                 null
 history | 9be52e14-7547-11ec-f7c5-af15a1a2de8c |                                                 null
 history | 9be335dc-7547-11ec-0b6d-f9798d005fb0 |                                                 null
 history | 9be160c2-7547-11ec-e0ea-29f4272345de |                                                 null
 history | 9bdf300e-7547-11ec-3d3f-e577a2e31ffd |                                                 null
 history | 9bdd2ea8-7547-11ec-c25d-8e297b77380e |                                                 null
 history | 9bdb925a-7547-11ec-d754-aa2cc394a22c |                                                 null
 history | 9bd8d830-7547-11ec-1550-5fd155e6cd86 |                                                 null
 history | 9bd36666-7547-11ec-230c-8702bc785cb9 | Add new columns to system_distributed.service_levels
 history | 9bd0a156-7547-11ec-a834-85eac94fd3b8 |        Create system_distributed(_everywhere) tables
 history | 9bcfef18-7547-11ec-76d9-c23dfa1b3e6a |        Create system_distributed_everywhere keyspace
 history | 9bcec89a-7547-11ec-e1b4-34e0010b4183 |                   Create system_distributed keyspace
2022-01-24 15:20:37 +01:00
Kamil Braun
6a00e790c7 service: raft: check and update state IDs during group 0 operations
The group 0 state machine will only modify state during command
application if the provided "previous state ID" is equal to the
last state ID present in the history table. Otherwise, the command will
be a no-op.

To ensure linearizability of group 0 changes, the performer of the
change must first read the last state ID, only then read the state
and send a command for the state machine. If a concurrent change
races with this command and manages to modify the state, we will detect
that the last state ID does not match during `apply`; all calls to
`apply` are serialized, and `apply` adds the new entry to the history
table at the end, after modifying the group 0 state.

The details of this mechanism are abstracted away with `group0_guard`.
To perform a group 0 change, one needs to call `announce`, which
requires a `group0_guard` to be passed in. The only way to obtain a
`group0_guard` is by calling `start_group0_operation`, which underneath
performs a read barrier on group 0, obtains the last state ID from the
history table, and constructs a new state ID that the change will append
to the history table. The read barrier ensures that all previously
completed changes are visible to this operation. The caller can then
perform any necessary validation, construct mutations which modify group
0 state, and finally call `announce`.

The guard also provides a timestamp which is used by the caller
to construct the mutations. The timestamp is obtained from the new state ID.
We ensure that it is greater than the timestamp of the last state ID.
Thus, if the change is successful, the applied mutations will have greater
timestamps than the previously applied mutations.

We also add two locks. The more important one, used to ensure
correctness, is `read_apply_mutex`. It is held when modifying group 0
state (in `apply` and `transfer_snapshot`) and when reading it (it's
taken when obtaining a `group0_guard` and released before a command is
sent in `announce`). Its goal is to ensure that we don't read partial
state, which could happen without it because group 0 state consists of
many parts and `apply` (or `transfer_snapshot`) potentially modifies all
of them. Note: this doesn't give us 100% protection; if we crash in the
middle of `apply` (or `transfer_snapshot`), then after restart we may
read partial state. To remove this possibility we need to ensure that
commands which were being applied before restart but not finished are
re-applied after restart, before anyone can read the state. I left a
TODO in `apply`.

The second lock, `operation_mutex`, is used to improve liveness. It is
taken when obtaining a `group0_guard` and released after a command is
applied (compare to `read_apply_mutex` which is released before a
command is sent). It is not taken inside `apply` or `transfer_snapshot`.
This lock ensures that multiple fibers running on the same node do not
attempt to modify group0 concurrently - this would cause some of them to
fail (due to the concurrent modification protection described above).
This is mostly important during first boot of the first node, when
services start for the first time and try to create their internal
tables. This lock serializes these attempts, ensuring that all of them
succeed.
2022-01-24 15:20:37 +01:00
Kamil Braun
509ac2130f service: raft: group0_state_machine: introduce group0_command
Objects of this type will be serialized and sent as commands to the
group 0 state machine. They contain a set of mutations which modify
group 0 tables (at this point: schema tables and group 0 history table),
the 'previous state ID' which is the last state ID present in the
history table when the operation described by this command has started,
and the 'new state ID' which will be appended to the history table if
this change is successful (successful = the previous state ID is still
equal to the last state ID in the history table at the moment of
application). It also contains the address of the node which constructed
this command.

The state ID mechanism will be described in more detail in a later
commit.
2022-01-24 15:20:37 +01:00
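
Approximated as a Python dataclass (the real definition is a C++ IDL type; field names follow the commit text but are assumptions):

```python
from dataclasses import dataclass
from uuid import UUID

@dataclass
class Group0Command:
    mutations: list        # changes to group 0 tables (schema + history)
    prev_state_id: UUID    # last history state ID when the operation began
    new_state_id: UUID     # appended to the history table on success
    creator_addr: str      # address of the node that built this command
```
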
Kamil Braun
cc0c54ea15 service: migration_manager: allow using MIGRATION_REQUEST verb to fetch group 0 history table
The MIGRATION_REQUEST verb is currently used to pull the contents of
schema tables (in the form of mutations) when nodes synchronize schemas.
We will (ab)use the verb to fetch additional data, such as the contents
of the group 0 history table, for purposes of group 0 snapshot transfer.

We extend `schema_pull_options` with a flag specifying that the puller
requests the additional data associated with group 0 snapshots. This
flag is `false` by default, so existing schema pulls will do what they
did before. If the flag is `true`, the migration request handler will
include the contents of group 0 history table.

Note that if a request is sent with the flag set to `true`, that means
the entire cluster must have enabled the Raft feature, which also means
that the handler knows of the flag.
2022-01-24 15:20:37 +01:00
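
The extended options, sketched as a dataclass (the flag name is an assumption):

```python
from dataclasses import dataclass

@dataclass
class SchemaPullOptions:
    # False by default, so existing schema pulls behave exactly as
    # before; when True, the migration request handler also returns the
    # group 0 history table contents.
    group0_snapshot_transfer: bool = False
```
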
Kamil Braun
a944dd44ee service: migration_manager: convert migration request handler to coroutine 2022-01-24 15:20:37 +01:00
Kamil Braun
fad72daeb4 db: system_keyspace: introduce system.group0_history table
This table will contain a history of all group 0 changes applied through
Raft. Associated with each change is a unique ID, which also identifies
the state of all group 0 tables (including schema tables) after this
change is applied, assuming that all such changes are serialized through
Raft (they will be eventually).

We will use these state IDs to check if a given change is still
valid at the moment it is applied (in `group0_state_machine::apply`),
i.e. that there wasn't a concurrent change that happened between
creating this change and applying it (which may invalidate it).
2022-01-24 15:20:37 +01:00
Kamil Braun
a664ac7ba5 treewide: require group0_guard when performing schema changes
`announce` now takes a `group0_guard` by value. `group0_guard` can only
be obtained through `migration_manager::start_group0_operation` and
moved, it cannot be constructed outside `migration_manager`.

The guard will be a method of ensuring linearizability for group 0
operations.
2022-01-24 15:20:35 +01:00