Commit Graph

253 Commits

Author SHA1 Message Date
Benny Halevy
357d57c82d raft: group0_state_machine: transfer_snapshot: make abortable
Use an abort_source in group0_state_machine
to abort an ongoing transfer_snapshot operation
on group0_state_machine::abort()
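The pattern can be sketched outside Seastar with a hand-rolled flag (the `abort_flag` and `transfer_snapshot` names below are illustrative, not the actual Seastar/ScyllaDB API):

```cpp
#include <atomic>
#include <cassert>
#include <stdexcept>

// Minimal stand-in for an abort source: abort() on the state machine
// requests abortion, and the long-running transfer checks for it
// between steps.
class abort_flag {
    std::atomic<bool> _aborted{false};
public:
    void request_abort() { _aborted.store(true); }
    bool abort_requested() const { return _aborted.load(); }
    void check() const {
        if (abort_requested()) throw std::runtime_error("aborted");
    }
};

// Hypothetical snapshot transfer: polls the flag between chunks, so a
// concurrent request_abort() stops it promptly instead of letting it
// run to completion.
int transfer_snapshot(abort_flag& as, int chunks) {
    int sent = 0;
    for (int i = 0; i < chunks; ++i) {
        as.check();   // throws if abort was requested
        ++sent;       // ...send one chunk...
    }
    return sent;
}
```

Seastar's real abort_source also supports subscriptions, so waiting continuations get notified rather than having to poll.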

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-03 16:32:08 +03:00
Kamil Braun
b835acf853 Merge 'Cluster features on raft: topology coordinator + check on boot' from Piotr Dulikowski
This PR implements the functionality of the raft-based cluster features
needed to safely manage and enable cluster features, according to the
cluster features on raft design doc.

Enabling features is a two phase process, performed by the topology
coordinator when it notices that there are no topology changes in
progress and there are some not-yet enabled features that are declared
to be supported by all nodes:

1. First, a global barrier is performed to make sure that all nodes saw
   and persisted the same state of the `system.topology` table as the
   coordinator and see the same supported features of all nodes. When
   booting, nodes are now forbidden to revoke support for a feature if all
   nodes declare support for it, so a successful barrier makes sure that
   no node will restart and disable the features.
2. After a successful barrier, the features are marked as enabled in the
   `system.topology` table.

The whole procedure is a group 0 operation and fails if the topology
table is modified in the meantime (e.g. some node changes its supported
features set).
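The decision the coordinator makes above can be roughly illustrated in plain C++ (hypothetical helper names, not the actual topology code):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

using feature = std::string;

// Phase 1 precondition: the features every node declares support for,
// i.e. the intersection of the per-node supported sets.
std::set<feature> features_supported_by_all(
        const std::map<std::string, std::set<feature>>& supported_per_node) {
    std::set<feature> result;
    bool first = true;
    for (const auto& [node, feats] : supported_per_node) {
        if (first) { result = feats; first = false; continue; }
        std::set<feature> intersection;
        for (const auto& f : result) {
            if (feats.count(f)) intersection.insert(f);
        }
        result = std::move(intersection);
    }
    return result;
}

// Phase 2 input: of those, the features not yet marked enabled. The
// real coordinator only marks them enabled after a successful global
// barrier guarantees no node can still revoke support.
std::set<feature> not_yet_enabled(const std::set<feature>& supported_by_all,
                                  const std::set<feature>& enabled) {
    std::set<feature> result;
    for (const auto& f : supported_by_all) {
        if (!enabled.count(f)) result.insert(f);
    }
    return result;
}
```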

For now, the implementation relies on the gossip shadow round check to
protect from nodes without all features joining the cluster. In a
followup, a new joining procedure will be implemented which involves the
topology coordinator and lets it verify the joining node's cluster features
before the new node is added to group 0 and to the cluster.

A set of tests for the new implementation is introduced, containing the
same tests as for the non-raft-based cluster feature implementation plus
one additional test, specific to this implementation.

Closes #14722

* github.com:scylladb/scylladb:
  test: topology_experimental_raft: cluster feature tests
  test: topology: fix a skipped test
  storage_service: add injection to prevent enabling features
  storage_service: initialize enabled features from first node
  topology_state_machine: add size(), is_empty()
  group0_state_machine: enable features when applying cmds/snapshots
  persistent_feature_enabler: attach to gossip only if not using raft
  feature_service: enable and check raft cluster features on startup
  storage_service: provide raft_topology_change_enabled flag from outside
  storage_service: enable features in topology coordinator
  storage_service: add barrier_after_feature_update
  topology_coordinator: exec_global_command: make it optional to retake the guard
  topology_state_machine: add calculate_not_yet_enabled_features
2023-08-02 12:32:27 +02:00
Piotr Dulikowski
082af79111 storage_service: add barrier_after_feature_update
Adds a variant of the existing `barrier` topology command which requires
all participating nodes to confirm that they have updated their feature
sets after boot and won't remove any features from them until restart. A
successful global barrier of this type gives the topology coordinator a
guarantee that it can safely enable features that were supported by all
nodes at the moment of the barrier.
2023-08-01 14:33:20 +02:00
Tomasz Grabiec
0239ba4527 Merge 'fencing: handle counter_mutations' from Gusev Petr
In this PR we add proper fencing handling to the `counter_mutation` verb.

As for regular mutations, we do the check twice in `handle_counter_mutation`, before and after applying the mutations. The latter is important in case the fence was moved while we were handling the request - some post-fence actions might have already happened by this time, so we can't treat the request as successful. For example, if the topology change coordinator was switching to `write_both_read_new`, streaming might have already started and missed this update.

In `mutate_counters` we can use a single `fencing_token` for all leaders, since all the erms are processed without yields and should underneath share the same `token_metadata`.

We don't pass fencing token for replication explicitly in `replicate_counter_from_leader` since `mutate_counter_on_leader_and_replicate` doesn't capture erm and if the drain on the coordinator timed out the erm for replication might be different and we should use the corresponding (maybe the new one) topology version for outgoing write replication requests. This delayed replication is similar to any other background activity (e.g. writing hints) - it takes the current erm and the current `token_metadata` version for outgoing requests.
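The before/after check described above can be sketched like this (hypothetical names; the real code works on Seastar futures and token_metadata versions):

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// A fencing token carries the topology version the coordinator saw.
struct fencing_token { uint64_t topology_version; };

struct stale_topology_error : std::runtime_error {
    stale_topology_error() : std::runtime_error("stale topology") {}
};

// The replica rejects a request whose token is older than its current
// fence version, and re-checks after applying: the fence may have moved
// while the request was in flight, in which case post-fence actions
// (e.g. streaming) may have missed this write, so it must not be
// reported as successful.
template <typename Apply>
void apply_with_fence(const uint64_t& current_fence_version,
                      fencing_token token, Apply apply) {
    if (token.topology_version < current_fence_version)
        throw stale_topology_error();   // check before applying
    apply();
    if (token.topology_version < current_fence_version)
        throw stale_topology_error();   // re-check: fence may have moved
}
```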

Closes #14564

* github.com:scylladb/scylladb:
  counter_mutation: add fencing
  encode_replica_exception_for_rpc: handle the case when result type is a single exception_variant
  counter_mutation: add replica::exception_variant to signature
2023-08-01 12:41:22 +02:00
Tomasz Grabiec
6d545b2f9e storage_service: Implement stream_tablet RPC
Performs streaming of data for a single tablet between two tablet
replicas. The node which gets the RPC is the receiving replica.
2023-07-25 21:08:51 +02:00
Petr Gusev
116444a01b counter_mutation: add fencing
As for regular mutations, we do the check
twice in handle_counter_mutation, before
and after applying the mutations. The latter
is important in case the fence was moved while
we were handling the request - some post-fence
actions might have already happened at this
time, so we can't treat the request as successful.
For example, if topology change coordinator was
switching to write_both_read_new, streaming
might have already started and missed this update.

In mutate_counters we can use a single fencing_token
for all leaders, since all the erms are processed
without yields and should underneath share the
same token_metadata.

We don't pass fencing token for replication explicitly in
replicate_counter_from_leader since
mutate_counter_on_leader_and_replicate doesn't capture erm
and if the drain on the coordinator timed out the erm for
replication might be different and we should use the
corresponding (maybe the new one) topology version for
outgoing write replication requests. This delayed
replication is similar to any other background activity
(e.g. writing hints) - it takes the current erm and
the current token_metadata version for outgoing requests.
2023-07-25 12:10:03 +04:00
Petr Gusev
f2cbdc7f18 counter_mutation: add replica::exception_variant to signature
We are going to add fencing for counter mutations,
this means handle_counter_mutation will sometimes throw
stale_topology_exception. RPC doesn't marshall exceptions
transparently, exceptions thrown by server are delivered
to the client as a general remote_verb_error, which is not
very helpful.

The common practice is to embed exceptions into handler
result type. In this commit we use already existing
exception_variant as an exception container. We mark
exception_variant with [[version]] attribute in the idl
file, this should handle the case when the old replica
(without exception_variant in the signature) is replying
to the new one.
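The "embed the exception in the result type" practice can be illustrated with a plain std::variant (hypothetical types here, not the actual idl-generated ones):

```cpp
#include <cassert>
#include <string>
#include <variant>

// Since RPC turns any server-side throw into a generic
// remote_verb_error, the handler returns the exception as a value
// inside its result type instead of throwing it.
struct stale_topology_exception { std::string msg; };
struct mutation_result { int applied; };

using handler_result = std::variant<mutation_result, stale_topology_exception>;

// Sketch of a handler: on a fenced-out request, the error travels back
// to the client with its concrete type intact.
handler_result handle_counter_mutation(bool fenced_out) {
    if (fenced_out)
        return stale_topology_exception{"fencing token is stale"};
    return mutation_result{1};
}
```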
2023-07-25 12:09:19 +04:00
Petr Gusev
5fb8da4181 hints: add fencing
In this commit we just pass a fencing_token
through the hint_mutation RPC verb.

The hints manager uses either
storage_proxy::send_hint_to_all_replicas or
storage_proxy::send_hint_to_endpoint to send a hint.
Both methods capture the current erm and use the
corresponding fencing token from it in the
mutation or hint_mutation RPC verb. If these
verbs are fenced out, the server's stale_topology_exception
is translated to a mutation_write_failure_exception
on the client with an appropriate error message.
The hint manager will attempt to resend the failed
hint from the commitlog segment after a delay.
However, if delivery is unsuccessful, the hint will
be discarded after gc_grace_seconds.

Closes #14580
2023-07-24 18:12:48 +02:00
Patryk Jędrzejczak
a21c4abad7 replica: add abort_requested_exception to exception_variant
If migration_manager::get_schema_for_write is called after
migration_manager::drain, it throws abort_requested_exception.
This exception is not present in replica::exception_variant, which
means that RPC doesn't preserve information about its type. If it is
thrown on the replica side, it is deserialized as std::runtime_error
on the coordinator. Therefore, abstract_read_resolver::error logs
information about this exception, even though we don't want it (aborts
are triggered on shutdown and timeouts).

To solve this issue, we add abort_requested_exception to
replica::exception_variant and, in the next commits, refactor
storage_proxy::handle_read so that abort_requested_exception thrown in
migration_manager::get_schema_for_write is properly serialized. Thanks
to this change, unchanged abstract_read_resolver::error correctly
handles abort_requested_exception thrown on the replica side by not
reporting it.
2023-07-13 16:57:10 +02:00
Mikołaj Grzebieluch
b2d22d665e raft topology: include entire cdc_generations_v3 table in cdc_generation_mutations snapshot
Topology snapshots contain only the mutation of the current CDC generation
data but don't contain any previous or future generations. If a new
generation of data is being broadcast but hasn't been entirely applied yet,
the applied part won't be sent in a snapshot. In this scenario, new or
delayed nodes can never get the applied part.

Send the entire cdc_generations_v3 table in the snapshot to resolve this problem.

As a follow-up, a mechanism to remove old CDC generations will be introduced.
2023-07-07 13:11:52 +02:00
Mikołaj Grzebieluch
e6b0403326 raft: introduce write_mutations command
This command is used to send mutations over raft.

In later commits if `topology_change` doesn't fit the max command size,
it will be split into smaller mutations and sent over multiple raft
commands.
2023-07-04 16:12:50 +02:00
Petr Gusev
5a3384f495 storage_proxy.cc: add and use global_token_metadata_barrier
fence_old_reads is removed since it's replaced by this barrier.
2023-06-15 15:52:50 +04:00
Petr Gusev
96a1c661bd raft_topology: add cmd_index to raft commands
In this commit we add logic to protect against
raft commands reordering. This way we can be
sure that the topology state
(_topology_state_machine._topology) on all the
nodes processing the command is consistent
with the topology state on the topology change
coordinator. In particular, this allows
us to simply use _topology.version as the current
version in barrier_and_drain instead of passing it
along with the command as a parameter.

Topology coordinator maintains an index of the last
command it has sent to the cluster. This index is
incremented for each command and sent along with it.
The receiving node compares it with the last index
it received in the same term and returns an error
if it's not greater. We are protected
against topology change coordinator migrating
to other node by the already existing
terms check: if the term from the command
doesn't match the current term we return an error.
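A simplified model of the (term, index) check described above (names are ours, not the actual raft_topology code):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// A command carries the coordinator's term and a monotonically
// increasing command index.
struct cmd_id { uint64_t term; uint64_t index; };

// The receiving node rejects a command unless its term matches the
// current term and its index is strictly greater than the last index
// seen in that term. This protects against command reordering; the term
// check protects against a stale coordinator that migrated away.
class reorder_guard {
    std::optional<cmd_id> _last;
public:
    // Returns true if the command may be applied.
    bool accept(cmd_id id, uint64_t current_term) {
        if (id.term != current_term)
            return false;   // command from a coordinator in an old term
        if (_last && _last->term == id.term && id.index <= _last->index)
            return false;   // reordered or duplicated command
        _last = id;
        return true;
    }
};
```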
2023-06-15 15:52:50 +04:00
Petr Gusev
94605e4839 storage_proxy.cc: add fencing to read RPCs
On the call site we use the version captured in
read_executor/erm/token_metadata. In the handlers
we use apply_fence twice just like in mutation RPC.

Fencing was also added to local query calls, such as
query_result_local in make_data_request. This is for
the case when the query coordinator was isolated from
the topology change coordinator and didn't receive
barrier_and_drain.
2023-06-15 15:52:50 +04:00
Petr Gusev
46f73fcaa6 storage_proxy: add fencing for mutation
At the call site, we use the version captured
in erm/token_metadata. In the handler, we use
double checking: apply_fence after the local
write guarantees that no mutations
succeed on coordinators if the fence version
has been updated on the replica during the write.

Fencing was also added to mutate_locally calls
on the request coordinator, for the case where
this coordinator was isolated from the
topology change coordinator and missed the
barrier_and_drain command.
2023-06-15 15:52:49 +04:00
Petr Gusev
7fe707570a storage_service: fix indentation 2023-06-15 15:48:00 +04:00
Petr Gusev
d34da12240 storage_proxy: add fencing_token and related infrastructure
A new stale_topology_exception was introduced;
it's raised in apply_fence when an RPC comes
with a stale fencing_token.

An overload of apply_fence with future will be
used to wrap the storage_proxy methods which
need to be fenced.
2023-06-15 15:48:00 +04:00
Petr Gusev
f6b019c229 raft topology: add fence_version
It's stored outside of the topology table,
since it's updated not through RAFT, but
with a new 'fence' raft command.
The current value is cached in shared_token_metadata.
An initial fence version is loaded in main
during storage_service initialisation.
2023-06-15 15:48:00 +04:00
Petr Gusev
4f99302c2b raft_topology: add barrier_and_drain cmd
We use utils::phased_barrier. The new phase
is started each time the version is updated.
We track all instances of token_metadata;
when an instance is destroyed the
corresponding phased_barrier::operation is
released.
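The idea can be sketched with shared ownership standing in for phased_barrier::operation (a simplification of utils::phased_barrier, which is future-based):

```cpp
#include <cassert>
#include <memory>

// Each token_metadata holder pins the current phase via an "operation";
// starting a new phase lets the barrier observe when every holder of
// the old phase has been released.
class phased_barrier {
    std::shared_ptr<int> _phase = std::make_shared<int>(0);
public:
    using operation = std::shared_ptr<int>;

    // Pin the current phase for as long as the returned handle lives.
    operation start_operation() { return _phase; }

    // Begin a new phase; the returned weak handle to the old phase
    // expires once all its operations are released, which is when the
    // barrier may complete.
    std::weak_ptr<int> advance() {
        std::weak_ptr<int> old = _phase;
        _phase = std::make_shared<int>(*_phase + 1);
        return old;
    }
};
```

The real utils::phased_barrier resolves a future at that point instead of requiring the caller to poll a weak pointer.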
2023-06-15 15:48:00 +04:00
Petr Gusev
3a88c7769f tracing::trace_info: pass by ref
sizeof(std::optional<tracing::trace_info>) == 64 bytes,
so it should be more efficient.
2023-05-30 14:32:10 +04:00
Petr Gusev
48600049fc storage_proxy: pass inet_address_vector_replica_set by ref
sizeof(inet_address_vector_replica_set) == 96 bytes and
it has complex move constructor.
2023-05-30 14:04:53 +04:00
Petr Gusev
896e3bb425 raft: add [[ref]] attribute 2023-05-30 13:14:19 +04:00
Petr Gusev
4ff1adaef9 repair: add [[ref]] attribute 2023-05-30 13:14:19 +04:00
Petr Gusev
282d66d15d forward_request: add [[ref]] attribute 2023-05-30 13:14:19 +04:00
Petr Gusev
db4030f792 storage_proxy: paxos:: add [[ref]] attribute
read_command, partition_key and paxos::proposal
are marked with [[ref]]. partition_key contains
dynamic allocations and can be big. proposal
contains frozen_mutation, so it also
contains dynamic allocations.

The call sites are fine; they already pass
these by reference.
2023-05-30 13:14:19 +04:00
Petr Gusev
f2cba20945 storage_proxy: read_XXX:: make read_command [[ref]]
We had redundant copies at the call sites of
these methods. Class read_command does not
contain dynamic allocations, but it's quite
big by itself (368 bytes).
2023-05-30 13:14:19 +04:00
Petr Gusev
ffb4e39e40 storage_proxy: hint_mutation:: make frozen_mutation [[ref]]
We had a redundant copy in hint_mutation::apply_remotely.
This frozen_mutation is dynamically allocated and
can be arbitrarily large.
2023-05-30 13:14:19 +04:00
Petr Gusev
5adbb6cde2 storage_proxy: mutation:: make frozen_mutation [[ref]]
We had a redundant copy in receive_mutation_handler
forward_fn callback. This frozen_mutation is
dynamically allocated and can be arbitrarily large.

Fixes: #12504
2023-05-30 13:14:19 +04:00
Benny Halevy
adfb79ba3e raft, idl: restore internal::tagged_uint64 type
Change f5f566bdd8 introduced
tagged_integer and replaced raft::internal::tagged_uint64
with utils::tagged_integer.

However, the idl type for raft::internal::tagged_uint64
was not marked as final, but utils::tagged_integer is, breaking
on-the-wire compatibility.

This change defines the different raft tagged_uint64
types in idl/raft_storage.idl.hh as non-final
to restore the way they were serialized prior to
f5f566bdd8.

Fixes #13752

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-09 12:38:20 +03:00
Benny Halevy
d3a59fdefd idl: gossip_digest: include required headers
To be self-sufficient, before the next patch
that will affect tagged_integer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-09 06:51:26 +03:00
Benny Halevy
5dc7b7811c gms: gossip_digest: use generation_type and version_type
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-23 08:48:01 +03:00
Benny Halevy
4cdad8bc8b gms: heart_beat_state: use generation_type and version_type
Define default constructor as heart_beat_state(gms::generation_type(0))

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-23 08:48:01 +03:00
Benny Halevy
b638571cb0 gms: versioned_value: use version_type
Adjust scylla-gdb.get_gms_version_value
to get the versioned_value version as version_type
(utils::tagged_integer).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-23 08:48:01 +03:00
Benny Halevy
f5f566bdd8 utils: add tagged_integer
A generic template for defining strongly typed
integer types.

Use it here to replace raft::internal::tagged_uint64.
Will be used for defining gms generation and version
as strong and distinguishable types in following patches.
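A minimal sketch of such a strong-integer template (simplified relative to utils::tagged_integer):

```cpp
#include <cassert>
#include <cstdint>

// The Tag parameter makes each instantiation a distinct type, so e.g.
// a generation and a version over the same underlying integer cannot
// be mixed up at compile time.
template <typename Tag, typename T = int64_t>
class tagged_integer {
    T _value{};
public:
    tagged_integer() = default;
    explicit tagged_integer(T v) : _value(v) {}
    T value() const { return _value; }
    bool operator==(const tagged_integer& o) const { return _value == o._value; }
    bool operator<(const tagged_integer& o) const { return _value < o._value; }
};

// Illustrative strong types in the spirit of the gms patches:
struct generation_tag {};
struct version_tag {};
using generation_type = tagged_integer<generation_tag>;
using version_type = tagged_integer<version_tag>;
// generation_type g = version_type(1); // would not compile: distinct types
```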

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-23 08:37:32 +03:00
Benny Halevy
c5d819ce60 gms: versioned_value: make members private
and provide accessor functions to get them.

1. So they can't be modified by mistake, as the versioned value is
   immutable. A new value must have a higher version.
2. Before making the version a strong gms::version_type.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-23 08:37:32 +03:00
Kamil Braun
8afb15700b storage_service: include current CDC generation data in topology snapshots
Note that we don't need to include earlier CDC generations, just the
current (i.e. latest) one.

We might observe a problem when nodes are being bootstrapped in quick
succession - I left a FIXME describing the problem and possible
solutions.
2023-04-20 16:36:41 +02:00
Kefu Chai
023e985a6c build: cmake: add missing source files to idl and service
they were added recently, but cmake failed to sync with configure.py.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-03-26 14:01:21 +08:00
Gleb Natapov
d69a887366 storage_service: raft topology: introduce snapshot transfer code for the topology table 2023-03-23 16:29:56 +02:00
Gleb Natapov
6a4d773b7e raft topology: add RAFT_TOPOLOGY_CMD verb that will be used by topology coordinator to communicate with nodes
Empty for now. Will be used later by the topology coordinator to
communicate with other nodes to instruct them to start streaming
or to start fencing reads/writes.
2023-03-23 16:29:56 +02:00
Gleb Natapov
16d61e791f service: raft: introduce topology_change group0 command
Also extend group0_command to be able to send a new command type. The
command consists of a mutation array.
2023-03-21 16:06:43 +02:00
Avi Kivity
6822e3b88a query_id: extract into new header
query_id currently lives in query-request.hh, a busy place
with lots of dependencies. In turn it gets pulled in by
uuid.idl.hh, which is also very central. This makes
test/raft/randomized_nemesis_test.cc, which is nominally
only dependent on Raft, rebuild on random header file changes.

Fix by extracting into a new header.

Closes #13042
2023-03-01 10:25:25 +02:00
Kefu Chai
c0824c6c25 build: cmake: put generated sources into ${scylla_gen_build_dir}
to be aligned with the convention of configure.py

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-17 18:38:44 +08:00
Kefu Chai
7b431748a8 build: cmake: extract idl library out
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-17 18:38:44 +08:00
Botond Dénes
ef50170120 Merge 'build: cmake: sync with configure (2/n)' from Kefu Chai
* build: cmake: extract idl out
* build: cmake: link cql3 against xxHash
* build: cmake: correct the check in Findlibdeflate.cmake
* build: cmake find_package(libdeflate) earlier
* build: cmake: set more properties to alternator library
* build: cmake: include generate_cql_grammar
* build: cmake: find xxHash package
* build: cmake: add build mode support

Closes #12866

* github.com:scylladb/scylladb:
  build: cmake: correct generate_cql_grammar
  build: cmake: extract idl out
  build: cmake: link cql3 against xxHash
  build: cmake: correct the check in Findlibdeflate.cmake
  build: cmake: find_package(libdeflate) earlier
  build: cmake: set more properties to alternator library
  build: cmake: include generate_cql_grammar
  build: cmake: find xxHash package
  build: cmake: add build mode support
2023-02-16 07:11:26 +02:00
Kefu Chai
2718963a2a build: cmake: extract idl out
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-16 00:07:40 +08:00
Avi Kivity
69a385fd9d Introduce schema/ module
Schema related files are moved there. This excludes schema files that
also interact with mutations, because the mutation module depends on
the schema. Those files will have to go into a separate module.

Closes #12858
2023-02-15 11:01:50 +02:00
Avi Kivity
c5e4bf51bd Introduce mutation/ module
Move mutation-related files to a new mutation/ directory. The names
are kept in the global namespace to reduce churn; the names are
unambiguous in any case.

mutation_reader remains in the readers/ module.

mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this
patch.

This is a step forward towards librarization or modularization of the
source base.

Closes #12788
2023-02-14 11:19:03 +02:00
Michał Sala
bbbe12af43 forward_service: fix timeout support in parallel aggregates
`forward_request` verb carried information about timeouts using
`lowres_clock::time_point` (that came from local steady clock
`seastar::lowres_clock`). The time point was produced on one node and
later compared against another node's `lowres_clock`. That behavior
was wrong (`lowres_clock::time_point`s produced with different
`lowres_clock`s cannot be compared) and could lead to delayed or
premature timeout.

To fix this issue, `lowres_clock::time_point` was replaced with
`lowres_system_clock::time_point` in `forward_request` verb.
The representation to which both time point types serialize is the same
(64-bit integer denoting the count of elapsed nanoseconds), so it was
possible to do an in-place switch of those types using logic suggested
by @avikivity:
    - using steady_clock is just broken, so we aren't taking anything
        from users by breaking it further
    - once all nodes are upgraded, it magically starts to work
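The wire-compatibility argument can be illustrated in plain std::chrono (our sketch, not the idl code):

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Both time point types serialize to the same wire representation: a
// 64-bit count of elapsed nanoseconds. That is what made the in-place
// type switch possible. A wall-clock (system_clock-style) time point is
// meaningful across nodes; a steady/monotonic one is only meaningful on
// the machine that produced it.
using ns = std::chrono::nanoseconds;
using wall_time = std::chrono::time_point<std::chrono::system_clock, ns>;

int64_t serialize(wall_time tp) {
    return tp.time_since_epoch().count();
}

wall_time deserialize(int64_t count) {
    return wall_time(ns(count));
}
```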

Closes #12529
2023-01-16 12:08:13 +02:00
Aleksandra Martyniuk
60e298fda1 repair: change utils::UUID to node_ops_id
Type of the id of node operations is changed from utils::UUID
to node_ops_id. This way the id of node operations can be easily
distinguished from the ids of other entities.

Closes #11673
2022-12-20 17:04:47 +02:00
Gleb Natapov
022a825b33 raft: introduce not_a_member error and return it when non member tries to do add/modify_config
Currently, if a node that is outside of the config tries to add an entry
or modify the config, a transient error is returned and this causes the
node to retry. But the error is not transient: if a node tries to do one
of the operations above, it means it was part of the cluster at some
point, and since a node with the same id should not be added back to a
cluster, if it is not in the cluster now it never will be.

Return a new not_a_member error to the caller instead.

Message-Id: <Y42mTOx8bNNrHqpd@scylladb.com>
2022-12-05 17:11:04 +01:00