Commit Graph

253 Commits

Author SHA1 Message Date
Benny Halevy
357d57c82d raft: group0_state_machine: transfer_snapshot: make abortable
Use an abort_source in group0_state_machine
to abort an ongoing transfer_snapshot operation
on group0_state_machine::abort()
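The pattern can be sketched outside Seastar with a hand-rolled flag (the `abort_flag` and `transfer_snapshot` names below are illustrative, not the actual Seastar/ScyllaDB API):

```cpp
#include <atomic>
#include <cassert>
#include <stdexcept>

// Minimal stand-in for an abort source: abort() on the state machine
// requests abortion, and the long-running transfer checks for it
// between steps.
class abort_flag {
    std::atomic<bool> _aborted{false};
public:
    void request_abort() { _aborted.store(true); }
    bool abort_requested() const { return _aborted.load(); }
    void check() const {
        if (abort_requested()) throw std::runtime_error("aborted");
    }
};

// Hypothetical snapshot transfer: polls the flag between chunks, so a
// concurrent request_abort() stops it promptly instead of letting it
// run to completion.
int transfer_snapshot(abort_flag& as, int chunks) {
    int sent = 0;
    for (int i = 0; i < chunks; ++i) {
        as.check();   // throws if abort was requested
        ++sent;       // ...send one chunk...
    }
    return sent;
}
```

Seastar's real abort_source also supports subscriptions, so waiting continuations get notified rather than having to poll.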

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-03 16:32:08 +03:00
Kamil Braun
b835acf853 Merge 'Cluster features on raft: topology coordinator + check on boot' from Piotr Dulikowski
This PR implements the functionality of the raft-based cluster features
needed to safely manage and enable cluster features, according to the
cluster features on raft design doc.

Enabling features is a two phase process, performed by the topology
coordinator when it notices that there are no topology changes in
progress and there are some not-yet enabled features that are declared
to be supported by all nodes:

1. First, a global barrier is performed to make sure that all nodes saw
   and persisted the same state of the `system.topology` table as the
   coordinator and see the same supported features of all nodes. When
   booting, nodes are now forbidden to revoke support for a feature if all
   nodes declare support for it, so a successful barrier makes sure that
   no node will restart and disable the features.
2. After a successful barrier, the features are marked as enabled in the
   `system.topology` table.

The whole procedure is a group 0 operation and fails if the topology
table is modified in the meantime (e.g. some node changes its supported
features set).
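The decision the coordinator makes above can be roughly illustrated in plain C++ (hypothetical helper names, not the actual topology code):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

using feature = std::string;

// Phase 1 precondition: the features every node declares support for,
// i.e. the intersection of the per-node supported sets.
std::set<feature> features_supported_by_all(
        const std::map<std::string, std::set<feature>>& supported_per_node) {
    std::set<feature> result;
    bool first = true;
    for (const auto& [node, feats] : supported_per_node) {
        if (first) { result = feats; first = false; continue; }
        std::set<feature> intersection;
        for (const auto& f : result) {
            if (feats.count(f)) intersection.insert(f);
        }
        result = std::move(intersection);
    }
    return result;
}

// Phase 2 input: of those, the features not yet marked enabled. The
// real coordinator only marks them enabled after a successful global
// barrier guarantees no node can still revoke support.
std::set<feature> not_yet_enabled(const std::set<feature>& supported_by_all,
                                  const std::set<feature>& enabled) {
    std::set<feature> result;
    for (const auto& f : supported_by_all) {
        if (!enabled.count(f)) result.insert(f);
    }
    return result;
}
```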

For now, the implementation relies on the gossip shadow round check to
protect from nodes without all features joining the cluster. In a
followup, a new joining procedure will be implemented which involves the
topology coordinator and lets it verify the joining node's cluster features
before the new node is added to group 0 and to the cluster.

A set of tests for the new implementation is introduced, containing the
same tests as for the non-raft-based cluster feature implementation plus
one additional test, specific to this implementation.

Closes #14722

* github.com:scylladb/scylladb:
  test: topology_experimental_raft: cluster feature tests
  test: topology: fix a skipped test
  storage_service: add injection to prevent enabling features
  storage_service: initialize enabled features from first node
  topology_state_machine: add size(), is_empty()
  group0_state_machine: enable features when applying cmds/snapshots
  persistent_feature_enabler: attach to gossip only if not using raft
  feature_service: enable and check raft cluster features on startup
  storage_service: provide raft_topology_change_enabled flag from outside
  storage_service: enable features in topology coordinator
  storage_service: add barrier_after_feature_update
  topology_coordinator: exec_global_command: make it optional to retake the guard
  topology_state_machine: add calculate_not_yet_enabled_features
2023-08-02 12:32:27 +02:00
Piotr Dulikowski
082af79111 storage_service: add barrier_after_feature_update
Adds a variant of the existing `barrier` topology command which requires
all participating nodes to confirm that they have updated their feature
sets after boot and won't remove any features from them until restart. A
successful global barrier of this type gives the topology coordinator a
guarantee that it can safely enable features that were supported by all
nodes at the moment of the barrier.
2023-08-01 14:33:20 +02:00
Tomasz Grabiec
0239ba4527 Merge 'fencing: handle counter_mutations' from Gusev Petr
In this PR we add proper fencing handling to the `counter_mutation` verb.

As for regular mutations, we do the check twice in `handle_counter_mutation`, before and after applying the mutations. The latter is important in case the fence was moved while we were handling the request - some post-fence actions might have already happened by this time, so we can't treat the request as successful. For example, if the topology change coordinator was switching to `write_both_read_new`, streaming might have already started and missed this update.

In `mutate_counters` we can use a single `fencing_token` for all leaders, since all the erms are processed without yields and should underneath share the same `token_metadata`.

We don't pass fencing token for replication explicitly in `replicate_counter_from_leader` since `mutate_counter_on_leader_and_replicate` doesn't capture erm and if the drain on the coordinator timed out the erm for replication might be different and we should use the corresponding (maybe the new one) topology version for outgoing write replication requests. This delayed replication is similar to any other background activity (e.g. writing hints) - it takes the current erm and the current `token_metadata` version for outgoing requests.
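The before/after check described above can be sketched like this (hypothetical names; the real code works on Seastar futures and token_metadata versions):

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// A fencing token carries the topology version the coordinator saw.
struct fencing_token { uint64_t topology_version; };

struct stale_topology_error : std::runtime_error {
    stale_topology_error() : std::runtime_error("stale topology") {}
};

// The replica rejects a request whose token is older than its current
// fence version, and re-checks after applying: the fence may have moved
// while the request was in flight, in which case post-fence actions
// (e.g. streaming) may have missed this write, so it must not be
// reported as successful.
template <typename Apply>
void apply_with_fence(const uint64_t& current_fence_version,
                      fencing_token token, Apply apply) {
    if (token.topology_version < current_fence_version)
        throw stale_topology_error();   // check before applying
    apply();
    if (token.topology_version < current_fence_version)
        throw stale_topology_error();   // re-check: fence may have moved
}
```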

Closes #14564

* github.com:scylladb/scylladb:
  counter_mutation: add fencing
  encode_replica_exception_for_rpc: handle the case when result type is a single exception_variant
  counter_mutation: add replica::exception_variant to signature
2023-08-01 12:41:22 +02:00
Tomasz Grabiec
6d545b2f9e storage_service: Implement stream_tablet RPC
Performs streaming of data for a single tablet between two tablet
replicas. The node which gets the RPC is the receiving replica.
2023-07-25 21:08:51 +02:00
Petr Gusev
116444a01b counter_mutation: add fencing
As for regular mutations, we do the check
twice in handle_counter_mutation, before
and after applying the mutations. The latter
is important in case the fence was moved while
we were handling the request - some post-fence
actions might have already happened at this
time, so we can't treat the request as successful.
For example, if topology change coordinator was
switching to write_both_read_new, streaming
might have already started and missed this update.

In mutate_counters we can use a single fencing_token
for all leaders, since all the erms are processed
without yields and should underneath share the
same token_metadata.

We don't pass fencing token for replication explicitly in
replicate_counter_from_leader since
mutate_counter_on_leader_and_replicate doesn't capture erm
and if the drain on the coordinator timed out the erm for
replication might be different and we should use the
corresponding (maybe the new one) topology version for
outgoing write replication requests. This delayed
replication is similar to any other background activity
(e.g. writing hints) - it takes the current erm and
the current token_metadata version for outgoing requests.
2023-07-25 12:10:03 +04:00
Petr Gusev
f2cbdc7f18 counter_mutation: add replica::exception_variant to signature
We are going to add fencing for counter mutations,
this means handle_counter_mutation will sometimes throw
stale_topology_exception. RPC doesn't marshall exceptions
transparently, exceptions thrown by server are delivered
to the client as a general remote_verb_error, which is not
very helpful.

The common practice is to embed exceptions into handler
result type. In this commit we use already existing
exception_variant as an exception container. We mark
exception_variant with [[version]] attribute in the idl
file, this should handle the case when the old replica
(without exception_variant in the signature) is replying
to the new one.
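The "embed the exception in the result type" practice can be illustrated with a plain std::variant (hypothetical types here, not the actual idl-generated ones):

```cpp
#include <cassert>
#include <string>
#include <variant>

// Since RPC turns any server-side throw into a generic
// remote_verb_error, the handler returns the exception as a value
// inside its result type instead of throwing it.
struct stale_topology_exception { std::string msg; };
struct mutation_result { int applied; };

using handler_result = std::variant<mutation_result, stale_topology_exception>;

// Sketch of a handler: on a fenced-out request, the error travels back
// to the client with its concrete type intact.
handler_result handle_counter_mutation(bool fenced_out) {
    if (fenced_out)
        return stale_topology_exception{"fencing token is stale"};
    return mutation_result{1};
}
```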
2023-07-25 12:09:19 +04:00
Petr Gusev
5fb8da4181 hints: add fencing
In this commit we just pass a fencing_token
through the hint_mutation RPC verb.

The hints manager uses either
storage_proxy::send_hint_to_all_replicas or
storage_proxy::send_hint_to_endpoint to send a hint.
Both methods capture the current erm and use the
corresponding fencing token from it in the
mutation or hint_mutation RPC verb. If these
verbs are fenced out, the server's stale_topology_exception
is translated to a mutation_write_failure_exception
on the client with an appropriate error message.
The hint manager will attempt to resend the failed
hint from the commitlog segment after a delay.
However, if delivery is unsuccessful, the hint will
be discarded after gc_grace_seconds.

Closes #14580
2023-07-24 18:12:48 +02:00
Patryk Jędrzejczak
a21c4abad7 replica: add abort_requested_exception to exception_variant
If migration_manager::get_schema_for_write is called after
migration_manager::drain, it throws abort_requested_exception.
This exception is not present in replica::exception_variant, which
means that RPC doesn't preserve information about its type. If it is
thrown on the replica side, it is deserialized as std::runtime_error
on the coordinator. Therefore, abstract_read_resolver::error logs
information about this exception, even though we don't want it (aborts
are triggered on shutdown and timeouts).

To solve this issue, we add abort_requested_exception to
replica::exception_variant and, in the next commits, refactor
storage_proxy::handle_read so that abort_requested_exception thrown in
migration_manager::get_schema_for_write is properly serialized. Thanks
to this change, unchanged abstract_read_resolver::error correctly
handles abort_requested_exception thrown on the replica side by not
reporting it.
2023-07-13 16:57:10 +02:00
Mikołaj Grzebieluch
b2d22d665e raft topology: include entire cdc_generations_v3 table in cdc_generation_mutations snapshot
Topology snapshots contain only the mutation of the current CDC generation
data but don't contain any previous or future generations. If a new
generation of data is being broadcast but hasn't been entirely applied yet,
the applied part won't be sent in a snapshot. In this scenario, new or
delayed nodes can never get the applied part.

Send the entire cdc_generations_v3 table in the snapshot to resolve this problem.

As a follow-up, a mechanism to remove old CDC generations will be introduced.
2023-07-07 13:11:52 +02:00
Mikołaj Grzebieluch
e6b0403326 raft: introduce write_mutations command
This command is used to send mutations over raft.

In later commits if `topology_change` doesn't fit the max command size,
it will be split into smaller mutations and sent over multiple raft
commands.
2023-07-04 16:12:50 +02:00
Petr Gusev
5a3384f495 storage_proxy.cc: add and use global_token_metadata_barrier
fence_old_reads is removed since it's replaced by this barrier.
2023-06-15 15:52:50 +04:00
Petr Gusev
96a1c661bd raft_topology: add cmd_index to raft commands
In this commit we add logic to protect against
raft commands reordering. This way we can be
sure that the topology state
(_topology_state_machine._topology) on all the
nodes processing the command is consistent
with the topology state on the topology change
coordinator. In particular, this allows
us to simply use _topology.version as the current
version in barrier_and_drain instead of passing it
along with the command as a parameter.

Topology coordinator maintains an index of the last
command it has sent to the cluster. This index is
incremented for each command and sent along with it.
The receiving node compares it with the last index
it received in the same term and returns an error
if it's not greater. We are protected
against topology change coordinator migrating
to other node by the already existing
terms check: if the term from the command
doesn't match the current term we return an error.
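A simplified model of the (term, index) check described above (names are ours, not the actual raft_topology code):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// A command carries the coordinator's term and a monotonically
// increasing command index.
struct cmd_id { uint64_t term; uint64_t index; };

// The receiving node rejects a command unless its term matches the
// current term and its index is strictly greater than the last index
// seen in that term. This protects against command reordering; the term
// check protects against a stale coordinator that migrated away.
class reorder_guard {
    std::optional<cmd_id> _last;
public:
    // Returns true if the command may be applied.
    bool accept(cmd_id id, uint64_t current_term) {
        if (id.term != current_term)
            return false;   // command from a coordinator in an old term
        if (_last && _last->term == id.term && id.index <= _last->index)
            return false;   // reordered or duplicated command
        _last = id;
        return true;
    }
};
```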
2023-06-15 15:52:50 +04:00
Petr Gusev
94605e4839 storage_proxy.cc: add fencing to read RPCs
On the call site we use the version captured in
read_executor/erm/token_metadata. In the handlers
we use apply_fence twice just like in mutation RPC.

Fencing was also added to local query calls, such as
query_result_local in make_data_request. This is for
the case when the query coordinator was isolated from
the topology change coordinator and didn't receive
barrier_and_drain.
2023-06-15 15:52:50 +04:00
Petr Gusev
46f73fcaa6 storage_proxy: add fencing for mutation
At the call site, we use the version captured
in erm/token_metadata. In the handler, we use
double checking: apply_fence after the local
write guarantees that no mutations
succeed on coordinators if the fence version
has been updated on the replica during the write.

Fencing was also added to mutate_locally calls
on the request coordinator, for the case where
this coordinator was isolated from the
topology change coordinator and missed the
barrier_and_drain command.
2023-06-15 15:52:49 +04:00
Petr Gusev
7fe707570a storage_service: fix indentation 2023-06-15 15:48:00 +04:00
Petr Gusev
d34da12240 storage_proxy: add fencing_token and related infrastructure
A new stale_topology_exception was introduced;
it's raised in apply_fence when an RPC comes
with a stale fencing_token.

An overload of apply_fence with future will be
used to wrap the storage_proxy methods which
need to be fenced.
2023-06-15 15:48:00 +04:00
Petr Gusev
f6b019c229 raft topology: add fence_version
It's stored outside of the topology table,
since it's updated not through RAFT, but
with a new 'fence' raft command.
The current value is cached in shared_token_metadata.
An initial fence version is loaded in main
during storage_service initialisation.
2023-06-15 15:48:00 +04:00
Petr Gusev
4f99302c2b raft_topology: add barrier_and_drain cmd
We use utils::phased_barrier. The new phase
is started each time the version is updated.
We track all instances of token_metadata;
when an instance is destroyed the
corresponding phased_barrier::operation is
released.
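The idea can be sketched with shared ownership standing in for phased_barrier::operation (a simplification of utils::phased_barrier, which is future-based):

```cpp
#include <cassert>
#include <memory>

// Each token_metadata holder pins the current phase via an "operation";
// starting a new phase lets the barrier observe when every holder of
// the old phase has been released.
class phased_barrier {
    std::shared_ptr<int> _phase = std::make_shared<int>(0);
public:
    using operation = std::shared_ptr<int>;

    // Pin the current phase for as long as the returned handle lives.
    operation start_operation() { return _phase; }

    // Begin a new phase; the returned weak handle to the old phase
    // expires once all its operations are released, which is when the
    // barrier may complete.
    std::weak_ptr<int> advance() {
        std::weak_ptr<int> old = _phase;
        _phase = std::make_shared<int>(*_phase + 1);
        return old;
    }
};
```

The real utils::phased_barrier resolves a future at that point instead of requiring the caller to poll a weak pointer.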
2023-06-15 15:48:00 +04:00
Petr Gusev
3a88c7769f tracing::trace_info: pass by ref
sizeof(std::optional<tracing::trace_info>) == 64 bytes,
so it should be more efficient.
2023-05-30 14:32:10 +04:00
Petr Gusev
48600049fc storage_proxy: pass inet_address_vector_replica_set by ref
sizeof(inet_address_vector_replica_set) == 96 bytes and
it has complex move constructor.
2023-05-30 14:04:53 +04:00
Petr Gusev
896e3bb425 raft: add [[ref]] attribute 2023-05-30 13:14:19 +04:00
Petr Gusev
4ff1adaef9 repair: add [[ref]] attribute 2023-05-30 13:14:19 +04:00
Petr Gusev
282d66d15d forward_request: add [[ref]] attribute 2023-05-30 13:14:19 +04:00
Petr Gusev
db4030f792 storage_proxy: paxos:: add [[ref]] attribute
read_command, partition_key and paxos::proposal
are marked with [[ref]]. partition_key contains
dynamic allocations and can be big. proposal
contains frozen_mutation, so it also
contains dynamic allocations.

The call sites are fine; they already pass
these by reference.
2023-05-30 13:14:19 +04:00
Petr Gusev
f2cba20945 storage_proxy: read_XXX:: make read_command [[ref]]
We had redundant copies at the call sites of
these methods. Class read_command does not
contain dynamic allocations, but it's quite
big by itself (368 bytes).
2023-05-30 13:14:19 +04:00
Petr Gusev
ffb4e39e40 storage_proxy: hint_mutation:: make frozen_mutation [[ref]]
We had a redundant copy in hint_mutation::apply_remotely.
This frozen_mutation is dynamically allocated and
can be arbitrarily large.
2023-05-30 13:14:19 +04:00
Petr Gusev
5adbb6cde2 storage_proxy: mutation:: make frozen_mutation [[ref]]
We had a redundant copy in receive_mutation_handler
forward_fn callback. This frozen_mutation is
dynamically allocated and can be arbitrarily large.

Fixes: #12504
2023-05-30 13:14:19 +04:00
Benny Halevy
adfb79ba3e raft, idl: restore internal::tagged_uint64 type
Change f5f566bdd8 introduced
tagged_integer and replaced raft::internal::tagged_uint64
with utils::tagged_integer.

However, the idl type for raft::internal::tagged_uint64
was not marked as final, but utils::tagged_integer is, breaking
on-the-wire compatibility.

This change defines the different raft tagged_uint64
types in idl/raft_storage.idl.hh as non-final
to restore the way they were serialized prior to
f5f566bdd8.

Fixes #13752

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-09 12:38:20 +03:00
Benny Halevy
d3a59fdefd idl: gossip_digest: include required headers
To be self-sufficient, before the next patch
that will affect tagged_integer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-09 06:51:26 +03:00
Benny Halevy
5dc7b7811c gms: gossip_digest: use generation_type and version_type
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-23 08:48:01 +03:00
Benny Halevy
4cdad8bc8b gms: heart_beat_state: use generation_type and version_type
Define default constructor as heart_beat_state(gms::generation_type(0))

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-23 08:48:01 +03:00
Benny Halevy
b638571cb0 gms: versioned_value: use version_type
Adjust scylla-gdb.get_gms_version_value
to get the versioned_value version as version_type
(utils::tagged_integer).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-23 08:48:01 +03:00
Benny Halevy
f5f566bdd8 utils: add tagged_integer
A generic template for defining strongly typed
integer types.

Use it here to replace raft::internal::tagged_uint64.
Will be used for defining gms generation and version
as strong and distinguishable types in following patches.
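A minimal sketch of such a strong-integer template (simplified relative to utils::tagged_integer):

```cpp
#include <cassert>
#include <cstdint>

// The Tag parameter makes each instantiation a distinct type, so e.g.
// a generation and a version over the same underlying integer cannot
// be mixed up at compile time.
template <typename Tag, typename T = int64_t>
class tagged_integer {
    T _value{};
public:
    tagged_integer() = default;
    explicit tagged_integer(T v) : _value(v) {}
    T value() const { return _value; }
    bool operator==(const tagged_integer& o) const { return _value == o._value; }
    bool operator<(const tagged_integer& o) const { return _value < o._value; }
};

// Illustrative strong types in the spirit of the gms patches:
struct generation_tag {};
struct version_tag {};
using generation_type = tagged_integer<generation_tag>;
using version_type = tagged_integer<version_tag>;
// generation_type g = version_type(1); // would not compile: distinct types
```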

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-23 08:37:32 +03:00
Benny Halevy
c5d819ce60 gms: versioned_value: make members private
and provide accessor functions to get them.

1. So they can't be modified by mistake, as the versioned value is
   immutable. A new value must have a higher version.
2. Before making the version a strong gms::version_type.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-23 08:37:32 +03:00
Kamil Braun
8afb15700b storage_service: include current CDC generation data in topology snapshots
Note that we don't need to include earlier CDC generations, just the
current (i.e. latest) one.

We might observe a problem when nodes are being bootstrapped in quick
succession - I left a FIXME describing the problem and possible
solutions.
2023-04-20 16:36:41 +02:00
Kefu Chai
023e985a6c build: cmake: add missing source files to idl and service
they were added recently, but cmake failed to sync with configure.py.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-03-26 14:01:21 +08:00
Gleb Natapov
d69a887366 storage_service: raft topology: introduce snapshot transfer code for the topology table 2023-03-23 16:29:56 +02:00
Gleb Natapov
6a4d773b7e raft topology: add RAFT_TOPOLOGY_CMD verb that will be used by topology coordinator to communicate with nodes
Empty for now. Will be used later by the topology coordinator to
communicate with other nodes to instruct them to start streaming
or to start fencing reads/writes.
2023-03-23 16:29:56 +02:00
Gleb Natapov
16d61e791f service: raft: introduce topology_change group0 command
Also extend group0_command to be able to send a new command type. The
command consists of a mutation array.
2023-03-21 16:06:43 +02:00
Avi Kivity
6822e3b88a query_id: extract into new header
query_id currently lives in query-request.hh, a busy place
with lots of dependencies. In turn it gets pulled in by
uuid.idl.hh, which is also very central. This makes
test/raft/randomized_nemesis_test.cc, which is nominally
only dependent on Raft, rebuild on random header file changes.

Fix by extracting into a new header.

Closes #13042
2023-03-01 10:25:25 +02:00
Kefu Chai
c0824c6c25 build: cmake: put generated sources into ${scylla_gen_build_dir}
to be aligned with the convention of configure.py

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-17 18:38:44 +08:00
Kefu Chai
7b431748a8 build: cmake: extract idl library out
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-17 18:38:44 +08:00
Botond Dénes
ef50170120 Merge 'build: cmake: sync with configure (2/n)' from Kefu Chai
* build: cmake: extract idl out
* build: cmake: link cql3 against xxHash
* build: cmake: correct the check in Findlibdeflate.cmake
* build: cmake find_package(libdeflate) earlier
* build: cmake: set more properties to alternator library
* build: cmake: include generate_cql_grammar
* build: cmake: find xxHash package
* build: cmake: add build mode support

Closes #12866

* github.com:scylladb/scylladb:
  build: cmake: correct generate_cql_grammar
  build: cmake: extract idl out
  build: cmake: link cql3 against xxHash
  build: cmake: correct the check in Findlibdeflate.cmake
  build: cmake: find_package(libdeflate) earlier
  build: cmake: set more properties to alternator library
  build: cmake: include generate_cql_grammar
  build: cmake: find xxHash package
  build: cmake: add build mode support
2023-02-16 07:11:26 +02:00
Kefu Chai
2718963a2a build: cmake: extract idl out
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-16 00:07:40 +08:00
Avi Kivity
69a385fd9d Introduce schema/ module
Schema related files are moved there. This excludes schema files that
also interact with mutations, because the mutation module depends on
the schema. Those files will have to go into a separate module.

Closes #12858
2023-02-15 11:01:50 +02:00
Avi Kivity
c5e4bf51bd Introduce mutation/ module
Move mutation-related files to a new mutation/ directory. The names
are kept in the global namespace to reduce churn; the names are
unambiguous in any case.

mutation_reader remains in the readers/ module.

mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this
patch.

This is a step forward towards librarization or modularization of the
source base.

Closes #12788
2023-02-14 11:19:03 +02:00
Michał Sala
bbbe12af43 forward_service: fix timeout support in parallel aggregates
`forward_request` verb carried information about timeouts using
`lowres_clock::time_point` (that came from local steady clock
`seastar::lowres_clock`). The time point was produced on one node and
later compared against another node's `lowres_clock`. That behavior
was wrong (`lowres_clock::time_point`s produced with different
`lowres_clock`s cannot be compared) and could lead to delayed or
premature timeout.

To fix this issue, `lowres_clock::time_point` was replaced with
`lowres_system_clock::time_point` in `forward_request` verb.
The representation to which both time point types serialize is the same
(64-bit integer denoting the count of elapsed nanoseconds), so it was
possible to do an in-place switch of those types using logic suggested
by @avikivity:
    - using steady_clock is just broken, so we aren't taking anything
        from users by breaking it further
    - once all nodes are upgraded, it magically starts to work
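The wire-compatibility argument can be illustrated in plain std::chrono (our sketch, not the idl code):

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Both time point types serialize to the same wire representation: a
// 64-bit count of elapsed nanoseconds. That is what made the in-place
// type switch possible. A wall-clock (system_clock-style) time point is
// meaningful across nodes; a steady/monotonic one is only meaningful on
// the machine that produced it.
using ns = std::chrono::nanoseconds;
using wall_time = std::chrono::time_point<std::chrono::system_clock, ns>;

int64_t serialize(wall_time tp) {
    return tp.time_since_epoch().count();
}

wall_time deserialize(int64_t count) {
    return wall_time(ns(count));
}
```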

Closes #12529
2023-01-16 12:08:13 +02:00
Aleksandra Martyniuk
60e298fda1 repair: change utils::UUID to node_ops_id
Type of the id of node operations is changed from utils::UUID
to node_ops_id. This way the id of node operations can be easily
distinguished from the ids of other entities.

Closes #11673
2022-12-20 17:04:47 +02:00
Gleb Natapov
022a825b33 raft: introduce not_a_member error and return it when non member tries to do add/modify_config
Currently, if a node that is outside of the config tries to add an entry
or modify the config, a transient error is returned and this causes the
node to retry. But the error is not transient: if a node tries to do one
of the operations above, it means it was part of the cluster at some
point, and since a node with the same id should not be added back to a
cluster, if it is not in the cluster now it never will be.

Return a new not_a_member error to the caller instead.

Message-Id: <Y42mTOx8bNNrHqpd@scylladb.com>
2022-12-05 17:11:04 +01:00