Commit Graph

244 Commits

Mikołaj Grzebieluch
b2d22d665e raft topology: include entire cdc_generations_v3 table in cdc_generation_mutations snapshot
Topology snapshots contain only the mutation of the current CDC generation data but don't
contain any previous or future generations. If a new generation of data is being
broadcast but hasn't been entirely applied yet, the applied part won't be sent
in a snapshot. In this scenario, new or delayed nodes can never get the applied part.

Send entire cdc_generations_v3 table in the snapshot to resolve this problem.

As a follow-up, a mechanism to remove old CDC generations will be introduced.
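
A minimal sketch of the fix, with hypothetical names (`topology_snapshot`, `read_all_mutations`) standing in for the real Scylla types:

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for Scylla's canonical mutations and the
// raft topology snapshot payload.
struct canonical_mutation { std::string bytes; };

struct topology_snapshot {
    std::vector<canonical_mutation> topology_mutations;
    std::vector<canonical_mutation> cdc_generation_mutations;
};

// Placeholder for reading every partition of the given system table.
std::vector<canonical_mutation> read_all_mutations(const std::string& /*table*/) {
    return {};
}

topology_snapshot make_topology_snapshot() {
    return topology_snapshot{
        .topology_mutations = read_all_mutations("topology"),
        // Previously: only the current generation's mutation. Now the
        // whole table, so lagging nodes also receive generations that
        // were only partially applied when the snapshot was taken.
        .cdc_generation_mutations = read_all_mutations("cdc_generations_v3"),
    };
}
```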
2023-07-07 13:11:52 +02:00
Mikołaj Grzebieluch
e6b0403326 raft: introduce write_mutations command
This command is used to send mutations over raft.

In later commits, if a `topology_change` command doesn't fit within the maximum
command size, it will be split into smaller mutations and sent over multiple raft
commands.
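
A rough sketch, under assumed names, of how an oversized set of mutations could be chunked into several write_mutations commands:

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct canonical_mutation { std::string bytes; };

// Hypothetical raft command carrying a batch of mutations.
struct write_mutations_cmd {
    std::vector<canonical_mutation> mutations;
};

// Split a set of mutations that exceeds max_command_size into several
// write_mutations commands, each staying under the limit.
std::vector<write_mutations_cmd> split_into_commands(
        const std::vector<canonical_mutation>& muts, size_t max_command_size) {
    std::vector<write_mutations_cmd> cmds;
    write_mutations_cmd current;
    size_t current_size = 0;
    for (const auto& m : muts) {
        if (!current.mutations.empty() && current_size + m.bytes.size() > max_command_size) {
            cmds.push_back(std::move(current));
            current = {};
            current_size = 0;
        }
        current.mutations.push_back(m);
        current_size += m.bytes.size();
    }
    if (!current.mutations.empty()) {
        cmds.push_back(std::move(current));
    }
    return cmds;
}
```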
2023-07-04 16:12:50 +02:00
Petr Gusev
5a3384f495 storage_proxy.cc: add and use global_token_metadata_barrier
fence_old_reads is removed since it's replaced by this barrier.
2023-06-15 15:52:50 +04:00
Petr Gusev
96a1c661bd raft_topology: add cmd_index to raft commands
In this commit we add logic to protect against
raft command reordering. This way we can be
sure that the topology state
(_topology_state_machine._topology) on all the
nodes processing the command is consistent
with the topology state on the topology change
coordinator. In particular, this allows
us to simply use _topology.version as the current
version in barrier_and_drain instead of passing it
along with the command as a parameter.

The topology coordinator maintains an index of the last
command it has sent to the cluster. This index is
incremented for each command and sent along with it.
The receiving node compares it with the last index
it received in the same term and returns an error
if it's not greater. We are protected
against the topology change coordinator migrating
to another node by the already existing
term check: if the term in the command
doesn't match the current term, we return an error.
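
The reordering check, reduced to a sketch (names are illustrative):

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical receiver-side state for topology commands.
struct topology_cmd_handler {
    uint64_t _current_term = 0;
    uint64_t _last_cmd_index = 0; // last index accepted in _current_term

    void handle(uint64_t cmd_term, uint64_t cmd_index) {
        if (cmd_term != _current_term) {
            // Command from a stale (or unknown) coordinator term.
            throw std::runtime_error("term mismatch");
        }
        if (cmd_index <= _last_cmd_index) {
            // Reordered or duplicated command: reject it so our view
            // stays consistent with the coordinator's.
            throw std::runtime_error("raft command reordering detected");
        }
        _last_cmd_index = cmd_index;
        // ... apply the command; _topology.version can now be used
        // directly as the current version in barrier_and_drain.
    }
};
```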
2023-06-15 15:52:50 +04:00
Petr Gusev
94605e4839 storage_proxy.cc: add fencing to read RPCs
At the call site we use the version captured in
read_executor/erm/token_metadata. In the handlers
we use apply_fence twice, just like in the mutation RPC.

Fencing was also added to local query calls, such as
query_result_local in make_data_request. This covers
the case where the query coordinator was isolated from
the topology change coordinator and didn't receive
barrier_and_drain.
2023-06-15 15:52:50 +04:00
Petr Gusev
46f73fcaa6 storage_proxy: add fencing for mutation
At the call site, we use the version captured
in erm/token_metadata. In the handler, we use
double checking: applying the fence after the local
write guarantees that no mutation
succeeds on the coordinator if the fence version
has been updated on the replica during the write.

Fencing was also added to mutate_locally calls
on the request coordinator, for the case
where this coordinator was isolated from the
topology change coordinator and missed the
barrier_and_drain command.
2023-06-15 15:52:49 +04:00
Petr Gusev
7fe707570a storage_service: fix indentation 2023-06-15 15:48:00 +04:00
Petr Gusev
d34da12240 storage_proxy: add fencing_token and related infrastructure
A new stale_topology_exception was introduced;
it's raised in apply_fence when an RPC comes
with a stale fencing_token.

An overload of apply_fence taking a future will be
used to wrap the storage_proxy methods which
need to be fenced.
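
A condensed sketch of the moving parts, under assumed names and with plain std::future standing in for seastar futures:

```cpp
#include <cstdint>
#include <future>
#include <stdexcept>
#include <string>

struct fencing_token { uint64_t topology_version; };

struct stale_topology_exception : std::runtime_error {
    stale_topology_exception(uint64_t caller, uint64_t local)
        : std::runtime_error("stale topology: caller version " + std::to_string(caller)
                             + " < local fence version " + std::to_string(local)) {}
};

// Throws when an RPC arrives with a fencing_token older than the local
// fence version (which is bumped by the 'fence' raft command).
void apply_fence(fencing_token token, uint64_t local_fence_version) {
    if (token.topology_version < local_fence_version) {
        throw stale_topology_exception(token.topology_version, local_fence_version);
    }
}

// Overload used to wrap storage_proxy methods that need fencing: checks
// the fence again after the wrapped operation completes, which is how
// the double check in the write path rejects results that raced with a
// topology change.
template <typename T>
T apply_fence(std::future<T> f, fencing_token token, uint64_t local_fence_version) {
    T result = f.get();
    apply_fence(token, local_fence_version);
    return result;
}
```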
2023-06-15 15:48:00 +04:00
Petr Gusev
f6b019c229 raft topology: add fence_version
It's stored outside of the topology table,
since it's updated not through RAFT, but
with a new 'fence' raft command.
The current value is cached in shared_token_metadata.
An initial fence version is loaded in main
during storage_service initialisation.
2023-06-15 15:48:00 +04:00
Petr Gusev
4f99302c2b raft_topology: add barrier_and_drain cmd
We use utils::phased_barrier. A new phase
is started each time the version is updated.
We track all instances of token_metadata;
when an instance is destroyed, the
corresponding phased_barrier::operation is
released.
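
A toy model of the mechanism (the real utils::phased_barrier is future-based):

```cpp
#include <memory>

class phased_barrier {
    struct phase {};
    std::shared_ptr<phase> _current = std::make_shared<phase>();
public:
    // An operation pins the phase it was started in; each token_metadata
    // instance holds one and releases it on destruction.
    using operation = std::shared_ptr<phase>;
    operation start() { return _current; }

    // Start a new phase. Returns true once no operation from the
    // previous phase is still alive (the real version returns a future
    // that resolves at that point).
    bool advance_and_check() {
        std::weak_ptr<phase> old = _current;
        _current = std::make_shared<phase>();
        return old.expired();
    }
};
```

barrier_and_drain then advances the barrier after the version update and completes once the operations held by old token_metadata instances have all been released.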
2023-06-15 15:48:00 +04:00
Petr Gusev
3a88c7769f tracing::trace_info: pass by ref
sizeof(std::optional<tracing::trace_info>) == 64 bytes,
so passing by reference should be more efficient.
2023-05-30 14:32:10 +04:00
Petr Gusev
48600049fc storage_proxy: pass inet_address_vector_replica_set by ref
sizeof(inet_address_vector_replica_set) == 96 bytes and
it has a complex move constructor.
2023-05-30 14:04:53 +04:00
Petr Gusev
896e3bb425 raft: add [[ref]] attribute 2023-05-30 13:14:19 +04:00
Petr Gusev
4ff1adaef9 repair: add [[ref]] attribute 2023-05-30 13:14:19 +04:00
Petr Gusev
282d66d15d forward_request: add [[ref]] attribute 2023-05-30 13:14:19 +04:00
Petr Gusev
db4030f792 storage_proxy: paxos:: add [[ref]] attribute
read_command, partition_key and paxos::proposal
are marked with [[ref]]. partition_key contains
dynamic allocations and can be big. proposal
contains frozen_mutation, so it also
contains dynamic allocations.

The call sites are fine; they already passed
by reference.
2023-05-30 13:14:19 +04:00
Petr Gusev
f2cba20945 storage_proxy: read_XXX:: make read_command [[ref]]
We had redundant copies at the call sites of
these methods. Class read_command does not
contain dynamic allocations, but it's quite
big by itself (368 bytes).
2023-05-30 13:14:19 +04:00
Petr Gusev
ffb4e39e40 storage_proxy: hint_mutation:: make frozen_mutation [[ref]]
We had a redundant copy in hint_mutation::apply_remotely.
This frozen_mutation is dynamically allocated and
can be arbitrarily large.
2023-05-30 13:14:19 +04:00
Petr Gusev
5adbb6cde2 storage_proxy: mutation:: make frozen_mutation [[ref]]
We had a redundant copy in the receive_mutation_handler
forward_fn callback. This frozen_mutation is
dynamically allocated and can be arbitrarily large.

Fixes: #12504
2023-05-30 13:14:19 +04:00
Benny Halevy
adfb79ba3e raft, idl: restore internal::tagged_uint64 type
Change f5f566bdd8 introduced
tagged_integer and replaced raft::internal::tagged_uint64
with utils::tagged_integer.

However, the idl type for raft::internal::tagged_uint64
was not marked as final, but utils::tagged_integer is, breaking
on-the-wire compatibility.

This change defines the different raft tagged_uint64
types in idl/raft_storage.idl.hh as non-final
to restore the way they were serialized prior to
f5f566bdd8.
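
Roughly what a restored declaration looks like in the IDL (a sketch; the type and member names here are illustrative): leaving out `final` keeps the extensible serialization envelope these types had before f5f566bdd8.

```cpp
// idl/raft_storage.idl.hh (sketch)
namespace raft {

// Non-final: serialized with the extensibility envelope, restoring the
// pre-f5f566bdd8 on-disk/on-the-wire format.
class term_id {
    uint64_t id();
};

// Marking it 'final' instead would drop the envelope and change the
// serialized form:
//   class term_id final { uint64_t id(); };

} // namespace raft
```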

Fixes #13752

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-09 12:38:20 +03:00
Benny Halevy
d3a59fdefd idl: gossip_digest: include required headers
To be self-sufficient, before the next patch
that will affect tagged_integer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-09 06:51:26 +03:00
Benny Halevy
5dc7b7811c gms: gossip_digest: use generation_type and version_type
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-23 08:48:01 +03:00
Benny Halevy
4cdad8bc8b gms: heart_beat_state: use generation_type and version_type
Define default constructor as heart_beat_state(gms::generation_type(0))

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-23 08:48:01 +03:00
Benny Halevy
b638571cb0 gms: versioned_value: use version_type
Adjust scylla-gdb.get_gms_version_value
to get the versioned_value version as version_type
(utils::tagged_integer).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-23 08:48:01 +03:00
Benny Halevy
f5f566bdd8 utils: add tagged_integer
A generic template for defining strongly typed
integer types.

Use it here to replace raft::internal::tagged_uint64.
Will be used for defining gms generation and version
as strong and distinguishable types in following patches.
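
A minimal sketch of what such a template can look like (illustrative, not the exact utils::tagged_integer definition):

```cpp
#include <compare>
#include <cstdint>

// A tag type parameter makes each integer type distinct and
// non-convertible, even though the representation is the same.
template <typename Tag, typename ValueType = int64_t>
class tagged_integer {
    ValueType _value{};
public:
    tagged_integer() = default;
    explicit tagged_integer(ValueType v) : _value(v) {}
    ValueType value() const { return _value; }
    auto operator<=>(const tagged_integer&) const = default;
};

// Distinct tags yield distinguishable strong types:
struct generation_tag {};
struct version_tag {};
using generation_type = tagged_integer<generation_tag>;
using version_type = tagged_integer<version_tag>;

// generation_type{42} == version_type{42}     // would not compile
// generation_type{42} == generation_type{42}  // fine
```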

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-23 08:37:32 +03:00
Benny Halevy
c5d819ce60 gms: versioned_value: make members private
and provide accessor functions to get them.

1. So they can't be modified by mistake, as the versioned value is
   immutable. A new value must have a higher version.
2. Before making the version a strong gms::version_type.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-23 08:37:32 +03:00
Kamil Braun
8afb15700b storage_service: include current CDC generation data in topology snapshots
Note that we don't need to include earlier CDC generations, just the
current (i.e. latest) one.

We might observe a problem when nodes are being bootstrapped in quick
succession - I left a FIXME describing the problem and possible
solutions.
2023-04-20 16:36:41 +02:00
Kefu Chai
023e985a6c build: cmake: add missing source files to idl and service
they were added recently, but cmake failed to sync with configure.py.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-03-26 14:01:21 +08:00
Gleb Natapov
d69a887366 storage_service: raft topology: introduce snapshot transfer code for the topology table 2023-03-23 16:29:56 +02:00
Gleb Natapov
6a4d773b7e raft topology: add RAFT_TOPOLOGY_CMD verb that will be used by topology coordinator to communicate with nodes
Empty for now. Will be used later by the topology coordinator to
communicate with other nodes to instruct them to start streaming,
or start to fence read/writes.
2023-03-23 16:29:56 +02:00
Gleb Natapov
16d61e791f service: raft: introduce topology_change group0 command
Also extend group0_command to be able to send the new command type.
The command consists of a mutation array.
2023-03-21 16:06:43 +02:00
Avi Kivity
6822e3b88a query_id: extract into new header
query_id currently lives in query-request.hh, a busy place
with lots of dependencies. In turn it gets pulled in by
uuid.idl.hh, which is also very central. This makes
test/raft/randomized_nemesis_test.cc, which is nominally
only dependent on Raft, rebuild on random header file changes.

Fix by extracting into a new header.

Closes #13042
2023-03-01 10:25:25 +02:00
Kefu Chai
c0824c6c25 build: cmake: put generated sources into ${scylla_gen_build_dir}
to be aligned with the convention of configure.py

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-17 18:38:44 +08:00
Kefu Chai
7b431748a8 build: cmake: extract idl library out
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-17 18:38:44 +08:00
Botond Dénes
ef50170120 Merge 'build: cmake: sync with configure (2/n)' from Kefu Chai
* build: cmake: extract idl out
* build: cmake: link cql3 against xxHash
* build: cmake: correct the check in Findlibdeflate.cmake
* build: cmake find_package(libdeflate) earlier
* build: cmake: set more properties to alternator library
* build: cmake: include generate_cql_grammar
* build: cmake: find xxHash package
* build: cmake: add build mode support

Closes #12866

* github.com:scylladb/scylladb:
  build: cmake: correct generate_cql_grammar
  build: cmake: extract idl out
  build: cmake: link cql3 against xxHash
  build: cmake: correct the check in Findlibdeflate.cmake
  build: cmake: find_package(libdeflate) earlier
  build: cmake: set more properties to alternator library
  build: cmake: include generate_cql_grammar
  build: cmake: find xxHash package
  build: cmake: add build mode support
2023-02-16 07:11:26 +02:00
Kefu Chai
2718963a2a build: cmake: extract idl out
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-16 00:07:40 +08:00
Avi Kivity
69a385fd9d Introduce schema/ module
Schema related files are moved there. This excludes schema files that
also interact with mutations, because the mutation module depends on
the schema. Those files will have to go into a separate module.

Closes #12858
2023-02-15 11:01:50 +02:00
Avi Kivity
c5e4bf51bd Introduce mutation/ module
Move mutation-related files to a new mutation/ directory. The names
are kept in the global namespace to reduce churn; the names are
unambiguous in any case.

mutation_reader remains in the readers/ module.

mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this
patch.

This is a step forward towards librarization or modularization of the
source base.

Closes #12788
2023-02-14 11:19:03 +02:00
Michał Sala
bbbe12af43 forward_service: fix timeout support in parallel aggregates
The `forward_request` verb carried information about timeouts using
`lowres_clock::time_point` (which came from the local steady clock,
`seastar::lowres_clock`). The time point was produced on one node and
later compared against another node's `lowres_clock`. That behavior
was wrong (`lowres_clock::time_point`s produced with different
`lowres_clock`s cannot be compared) and could lead to delayed or
premature timeouts.

To fix this issue, `lowres_clock::time_point` was replaced with
`lowres_system_clock::time_point` in the `forward_request` verb.
The representation to which both time point types serialize is the same
(a 64-bit integer denoting the count of elapsed nanoseconds), so it was
possible to do an in-place switch of those types using the logic suggested
by @avikivity:
    - using steady_clock is just broken, so we aren't taking anything
        from users by breaking it further
    - once all nodes are upgraded, it magically starts to work
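
A small illustration of the distinction, with std::chrono clocks standing in for seastar's lowres clocks: both time points serialize to the same 64-bit count, but only the system-clock count refers to an epoch shared across nodes.

```cpp
#include <chrono>
#include <cstdint>

// Steady-clock time points count from an arbitrary node-local origin
// (often boot time), so a serialized value is meaningless on another
// machine.
int64_t serialize_steady(std::chrono::steady_clock::time_point tp) {
    return tp.time_since_epoch().count();
}

// System-clock time points count from the shared wall-clock epoch, so
// (clock skew aside) the serialized value can be interpreted remotely.
// The wire representation is the same 64-bit integer, which is what
// allowed switching the types in place.
int64_t serialize_system(std::chrono::system_clock::time_point tp) {
    return tp.time_since_epoch().count();
}
```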

Closes #12529
2023-01-16 12:08:13 +02:00
Aleksandra Martyniuk
60e298fda1 repair: change utils::UUID to node_ops_id
The type of the node operations id is changed from utils::UUID
to node_ops_id. This way the id of node operations can be easily
distinguished from the ids of other entities.

Closes #11673
2022-12-20 17:04:47 +02:00
Gleb Natapov
022a825b33 raft: introduce not_a_member error and return it when non member tries to do add/modify_config
Currently, if a node that is outside of the config tries to add an entry
or modify the config, a transient error is returned, which causes the node
to retry. But the error is not transient. If a node tries to do one of
the operations above, it means it was part of the cluster at some point;
since a node with the same id should not be added back to a cluster,
if it is not in the cluster now it never will be.

Return a new not_a_member error to the caller instead.
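
A sketch of the check, with illustrative types:

```cpp
#include <set>
#include <stdexcept>
#include <string>

using server_id = std::string; // illustrative; real raft ids are UUIDs

// Permanent error: a removed server id can never rejoin, so the caller
// should stop retrying instead of treating this as transient.
struct not_a_member : std::runtime_error {
    using std::runtime_error::runtime_error;
};

void check_can_modify(const std::set<server_id>& current_config, const server_id& caller) {
    if (!current_config.contains(caller)) {
        throw not_a_member("caller is not a member of the current configuration");
    }
    // ... proceed with add_entry / modify_config
}
```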

Message-Id: <Y42mTOx8bNNrHqpd@scylladb.com>
2022-12-05 17:11:04 +01:00
Kamil Braun
cbdcc944b5 service/raft: specialized verb for failure detector pinger
We used the GOSSIP_ECHO verb to perform failure detection. Now we use
a special verb, DIRECT_FD_PING, introduced for this purpose.

There are multiple reasons to do so.

One minor reason: we want to use the same connection as other Raft
verbs: if we can't deliver Raft append_entries or vote messages
somewhere, that endpoint should be marked dead; if we can, the
endpoint should be marked alive. So putting pings on the same
connection as the other Raft verbs is important when dealing with
weird situations where some connections are available but others are
not. Observe that in `do_get_rpc_client_idx`, we put the new verb in
the right place.

Another minor reason: we remove the awkward gossiper `echo_pinger`
abstraction which required storing and updating gossiper generation
numbers. This also removes one dependency from Raft service code to
gossiper.

Major reason 1: the gossip echo handler has a weird mechanism where a
replacing node returns errors during the replace operation to some of
the nodes. In Raft however, we want to mark servers as alive when they
are alive, including a server running on a node that's replacing
another node.

Major reason 2, related to the previous one: when server B is
replacing server A with the same IP, the failure detector will try to
ping both servers. Both servers are mapped to the same IP by the
address map, so pings to both servers will reach server B. We want
server B to respond to the pings destined for server B, but not to
pings destined for server A, so the sender can mark B alive but keep A
marked dead.

To do this, we include the destination's Raft ID in our RPCs. The
destination compares the received ID with its own. If it's different,
it returns a `wrong_destination` response, and the failure detector
knows that the ping did not reach the destination (it reached someone
else).

Yet another reason: removes "Not ready to respond gossip echo
message" log spam during replace.
2022-12-01 20:54:18 +01:00
Kamil Braun
0c9cb5c5bf Merge 'raft: wait for the next tick before retrying' from Gusev Petr
When `modify_config` or `add_entry` is forwarded to the leader, it may
reach the node at an "inappropriate" time and result in an exception. There
are two reasons for this: the leader is changing, or, in the case of
`modify_config`, another `modify_config` is currently in progress. In both
cases the command is retried, but before this patch there was no delay
before retrying, which could lead to a tight loop.

The patch adds a new exception type `transient_error`. When the client
receives it, it is obliged to retry the request after some delay.
Previously, leader-side exceptions were converted to `not_a_leader`,
which is strange, especially for `conf_change_in_progress`.

Fixes: #11564

Closes #11769

* github.com:scylladb/scylladb:
  raft: refactor: remove duplicate code on retry delays
  raft: use wait_for_next_tick in read_barrier
  raft: wait for the next tick before retrying
2022-11-16 18:20:54 +01:00
Petr Gusev
5e15c3c9bd raft: wait for the next tick before retrying
When modify_config or add_entry is forwarded
to the leader, it may reach the node at an
"inappropriate" time and result in an exception.
There are two reasons for this: the leader is
changing, or, in the case of modify_config, another
modify_config is currently in progress. In
both cases the command is retried, but before
this patch there was no delay before retrying,
which could lead to a tight loop.

The patch adds a new exception type, transient_error.
When the client node receives it, it is obliged to retry
the request, possibly after some delay. Previously, leader-side
exceptions were converted to a not_a_leader exception,
which is strange, especially for conf_change_in_progress.

We add a delay before retrying in modify_config
and add_entry if the client hasn't received any new
information about the leader since the last attempt.
This can happen if the server
responds with a transient_error with an empty leader
and the current node has not yet learned the new leader.
We accept a possible excessive delay if the newly elected leader
is the same as the previous one, as this is supposed to be rare.
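
A sketch of the client-side policy, with std::this_thread::sleep_for standing in for waiting until the next raft tick:

```cpp
#include <chrono>
#include <functional>
#include <optional>
#include <string>
#include <thread>

using server_id = std::string;

// Thrown by the server when the request arrived at an inappropriate
// time; `leader` is empty when no new leader is known yet.
struct transient_error { std::optional<server_id> leader; };

// Sketch of the retry loop for add_entry / modify_config forwarding.
void retry_until_done(const std::function<void(const std::optional<server_id>&)>& forward,
                      std::chrono::milliseconds tick) {
    std::optional<server_id> known_leader;
    for (;;) {
        try {
            forward(known_leader); // forward the command to the leader
            return;                // success
        } catch (const transient_error& e) {
            // Only skip the delay if we actually learned a new leader;
            // otherwise wait for the next tick to avoid a tight loop.
            bool learned_new_leader = e.leader.has_value() && e.leader != known_leader;
            if (e.leader) {
                known_leader = e.leader;
            }
            if (!learned_new_leader) {
                std::this_thread::sleep_for(tick);
            }
        }
    }
}
```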

Fixes: #11564
2022-11-15 11:49:26 +04:00
Aleksandra Martyniuk
e2c7c1495d repair: change UUID to task_id
Change type of repair id from utils::UUID to task_id to distinguish
them from ids of other entities.
2022-10-31 10:07:08 +01:00
Konstantin Osipov
3e46c32d7b raft: (discovery) do not use raft::server_address to carry IP data
We plan to remove IP information from Raft addresses.
raft::server_address is used in Raft configuration and
also in discovery, which is a separate algorithm, as a handy data
structure, to avoid having new entities in RPC.

Since we plan to remove IP addresses from Raft configuration,
using raft::server_address in discovery and still storing
IPs in it would create ambiguity: in some uses raft::server_address
would store an IP, and in others - would not.

So switch to an own data structure for the purposes of discovery,
discovery_peer, which contains a pair ip, raft server id.
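
Roughly (with illustrative types):

```cpp
#include <string>

struct raft_server_id { std::string uuid; }; // illustrative
struct inet_address { std::string addr; };   // illustrative

// Discovery-only structure: carries the (ip, raft server id) pair that
// raft::server_address will eventually stop carrying.
struct discovery_peer {
    raft_server_id id;
    inet_address ip;
};
```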

Note to reviewers: ideally we should switch to URIs
in discovery_peer right away. Otherwise we may have to
deal with incompatible changes in discovery when adding URI
support to Scylla.
2022-10-10 16:24:33 +03:00
Benny Halevy
480b4759a9 idl: streaming: include stream_fwd.hh
To keep the idl definition of plan_id from
getting out of sync with the one in stream_fwd.hh.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11720
2022-10-06 13:49:26 +02:00
Mikołaj Grzebieluch
db88525774 raft: broadcast_tables: add execution of intermediate language
Extended `group0_command` to enable transmission of `raft::broadcast_tables::query`.
Added an `add_entry_unguarded` method in `raft_group0_client` for dispatching raft
commands without a `group0_guard`.
Queries on the group0_kv_store are executed in `group_0_state_machine::apply`,
but for now don't return results. They don't use the previous state id, so they will
block concurrent schema changes, but those changes won't block queries.

In this version, snapshots are ignored.
2022-09-08 15:25:36 +02:00
Mikołaj Grzebieluch
c541d5c363 raft: broadcast_tables: add definition of intermediate language
In broadcast tables, a raft command contains a whole program to be executed.
Sending and parsing the entire CQL statement on each node is inefficient,
so we decided to compile it to an intermediate language which can be
easily serialized.

This patch adds a definition of such a language. For now, only the following
types of statements can be compiled:
* select value where key = CONST from system.broadcast_kv_store;
* update system.broadcast_kv_store set value = CONST where key = CONST;
* update system.broadcast_kv_store set value = CONST where key = CONST if value = CONST;
where CONST is a string literal.
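
A sketch of what the intermediate representation could look like (illustrative names, not the real IDL definitions): each supported statement shape becomes a small serializable struct, and a query is a variant of them.

```cpp
#include <optional>
#include <string>
#include <variant>

namespace broadcast_tables {

// select value where key = CONST from system.broadcast_kv_store;
struct select_query {
    std::string key;
};

// update system.broadcast_kv_store set value = CONST where key = CONST
// [if value = CONST];
struct update_query {
    std::string key;
    std::string new_value;
    // Set only for the conditional form: "... if value = CONST".
    std::optional<std::string> value_condition;
};

using query = std::variant<select_query, update_query>;

} // namespace broadcast_tables
```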
2022-09-08 14:03:51 +02:00
Tomasz Grabiec
1d0264e1a9 Merge 'Implement Raft upgrade procedure' from Kamil Braun
Start with a cluster with Raft disabled, end up with a cluster that performs
schema operations using group 0.

Design doc:
https://docs.google.com/document/d/1PvZ4NzK3S0ohMhyVNZZ-kCxjkK5URmz1VP65rrkTOCQ/
(TODO: replace this with .md file - we can do it as a follow-up)

The procedure, on a high level, works as follows:
- join group 0
- wait until every peer joined group 0 (peers are taken from `system.peers`
  table)
- enter `synchronize` upgrade state, in which group 0 operations are disabled
- wait until all members of group 0 entered `synchronize` state or some member
  entered the final state
- synchronize schema by comparing versions and pulling if necessary
- enter the final state (`use_new_procedures`), in which group 0 is used for
  schema operations.
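
The states involved, roughly as an enum (a sketch: `synchronize` and `use_new_procedures` come from the description above, while the initial-state name is a placeholder):

```cpp
enum class group0_upgrade_state {
    use_old_procedures, // placeholder name: before and while joining group 0
    synchronize,        // group 0 operations disabled; schemas being synced
    use_new_procedures, // final state: schema operations go through group 0
};
```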

With the procedure comes a recovery mode in case the upgrade procedure gets
stuck (and it may if we lose a node during recovery - the procedure, to
correctly establish a single group 0 cluster, requires contacting every node).

This recovery mode can also be used to recover clusters with group 0 already
established if they permanently lose a majority of nodes - killing two birds with
one stone. Details in the last commit message.

Read the design doc, then read the commits in topological order
for best reviewing experience.

---

I did some manual tests: upgrading a cluster, using the cluster to add nodes,
remove nodes (both with `decommission` and `removenode`), replacing nodes.
Performing recovery.

As a follow-up, we'll need to implement tests using the new framework (after
it's ready). It will be easy to test upgrades and recovery even with a single
Scylla version - we start with a cluster with the RAFT flag disabled, then
rolling-restart while enabling the flag (and recovery is done through simple
CQL statements).

Closes #10835

* github.com:scylladb/scylladb:
  service/raft: raft_group0: implement upgrade procedure
  service/raft: raft_group0: extract `tracker` from `persistent_discovery::run`
  service/raft: raft_group0: introduce local loggers for group 0 and upgrade
  service/raft: raft_group0: introduce GET_GROUP0_UPGRADE_STATE verb
  service/raft: raft_group0_client: prepare for upgrade procedure
  service/raft: introduce `group0_upgrade_state`
  db: system_keyspace: introduce `load_peers`
  idl-compiler: introduce cancellable verbs
  message: messaging_service: cancellable version of `send_schema_check`
2022-08-25 11:32:06 +03:00