scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-26 11:30:36 +00:00

Author	SHA1	Message	Date
Mikołaj Grzebieluch	b2d22d665e	raft topology: include entire cdc_generations_v3 table in cdc_generation_mutations snapshot Topology snapshots contain only the mutation of the current CDC generation data but don't contain any previous or future generations. If a new generation of data is being broadcast but hasn't been entirely applied yet, the applied part won't be sent in a snapshot. In this scenario, new or delayed nodes can never get the applied part. Send the entire cdc_generations_v3 table in the snapshot to resolve this problem. As a follow-up, a mechanism to remove old CDC generations will be introduced.	2023-07-07 13:11:52 +02:00
Mikołaj Grzebieluch	e6b0403326	raft: introduce `write_mutations` command This command is used to send mutations over raft. In later commits if `topology_change` doesn't fit the max command size, it will be split into smaller mutations and sent over multiple raft commands.	2023-07-04 16:12:50 +02:00
Petr Gusev	5a3384f495	storage_proxy.cc: add and use global_token_metadata_barrier fence_old_reads is removed since it's replaced by this barrier.	2023-06-15 15:52:50 +04:00
Petr Gusev	96a1c661bd	raft_topology: add cmd_index to raft commands In this commit we add logic to protect against raft commands reordering. This way we can be sure that the topology state (_topology_state_machine._topology) on all the nodes processing the command is consistent with the topology state on the topology change coordinator. In particular, this allows us to simply use _topology.version as the current version in barrier_and_drain instead of passing it along with the command as a parameter. Topology coordinator maintains an index of the last command it has sent to the cluster. This index is incremented for each command and sent along with it. The receiving node compares it with the last index it received in the same term and returns an error if it's not greater. We are protected against topology change coordinator migrating to other node by the already existing terms check: if the term from the command doesn't match the current term we return an error.	2023-06-15 15:52:50 +04:00
Petr Gusev	94605e4839	storage_proxy.cc: add fencing to read RPCs On the call site we use the version captured in read_executor/erm/token_metadata. In the handlers we use apply_fence twice just like in mutation RPC. Fencing was also added to local query calls, such as query_result_local in make_data_request. This is for the case when query coordinator was isolated from topology change coordinator and didn't receive barrier_and_drain.	2023-06-15 15:52:50 +04:00
Petr Gusev	46f73fcaa6	storage_proxy: add fencing for mutation At the call site, we use the version, captured in erm/token_metadata. In the handler, we use double checking, apply_fence after the local write guarantees that no mutations succeed on coordinators if the fence version has been updated on the replica during the write. Fencing was also added to mutate_locally calls on request coordinator, for the case if this coordinator was isolated from the topology change coordinator and missed the barrier_and_drain command.	2023-06-15 15:52:49 +04:00
Petr Gusev	7fe707570a	storage_service: fix indentation	2023-06-15 15:48:00 +04:00
Petr Gusev	d34da12240	storage_proxy: add fencing_token and related infrastructure A new stale_topology_exception was introduced, it's raised in apply_fence when an RPC comes with a stale fencing_token. An overload of apply_fence with future will be used to wrap the storage_proxy methods which need to be fenced.	2023-06-15 15:48:00 +04:00
Petr Gusev	f6b019c229	raft topology: add fence_version It's stored outside of topology table, since it's updated not through RAFT, but with a new 'fence' raft command. The current value is cached in shared_token_metadata. An initial fence version is loaded in main during storage_service initialisation.	2023-06-15 15:48:00 +04:00
Petr Gusev	4f99302c2b	raft_topology: add barrier_and_drain cmd We use utils::phased_barrier. The new phase is started each time the version is updated. We track all instances of token_metadata, when an instance is destroyed the corresponding phased_barrier::operation is released.	2023-06-15 15:48:00 +04:00
Petr Gusev	3a88c7769f	tracing::trace_info: pass by ref sizeof(std::optional<tracing::trace_info>) == 64 bytes, so it should be more efficient.	2023-05-30 14:32:10 +04:00
Petr Gusev	48600049fc	storage_proxy: pass inet_address_vector_replica_set by ref sizeof(inet_address_vector_replica_set) == 96 bytes and it has complex move constructor.	2023-05-30 14:04:53 +04:00
Petr Gusev	896e3bb425	raft: add [[ref]] attribute	2023-05-30 13:14:19 +04:00
Petr Gusev	4ff1adaef9	repair: add [[ref]] attribute	2023-05-30 13:14:19 +04:00
Petr Gusev	282d66d15d	forward_request: add [[ref]] attribute	2023-05-30 13:14:19 +04:00
Petr Gusev	db4030f792	storage_proxy: paxos:: add [[ref]] attribute read_command, partition_key and paxos::proposal are marked with [[ref]]. partition_key contains dynamic allocations and can be big. proposal contains frozen_mutation, so it also contains dynamic allocations. The call sites are fine, they already pass by reference.	2023-05-30 13:14:19 +04:00
Petr Gusev	f2cba20945	storage_proxy: read_XXX:: make read_command [[ref]] We had redundant copies at the call sites of these methods. Class read_command does not contain dynamic allocations, but it's quite big by itself (368 bytes).	2023-05-30 13:14:19 +04:00
Petr Gusev	ffb4e39e40	storage_proxy: hint_mutation:: make frozen_mutation [[ref]] We had a redundant copy in hint_mutation::apply_remotely. This frozen_mutation is dynamically allocated and can be arbitrarily large.	2023-05-30 13:14:19 +04:00
Petr Gusev	5adbb6cde2	storage_proxy: mutation:: make frozen_mutation [[ref]] We had a redundant copy in receive_mutation_handler forward_fn callback. This frozen_mutation is dynamically allocated and can be arbitrarily large. Fixes: #12504	2023-05-30 13:14:19 +04:00
Benny Halevy	adfb79ba3e	raft, idl: restore internal::tagged_uint64 type Change `f5f566bdd8` introduced tagged_integer and replaced raft::internal::tagged_uint64 with utils::tagged_integer. However, the idl type for raft::internal::tagged_uint64 was not marked as final, but utils::tagged_integer is, breaking the on-the-wire compatibility. This change defines the different raft tagged_uint64 types in idl/raft_storage.idl.hh as non-final to restore the way they were serialized prior to `f5f566bdd8` Fixes #13752 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-05-09 12:38:20 +03:00
Benny Halevy	d3a59fdefd	idl: gossip_digest: include required headers To be self-sufficient, before the next patch that will affect tagged_integer. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-05-09 06:51:26 +03:00
Benny Halevy	5dc7b7811c	gms: gossip_digest: use generation_type and version_type Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-04-23 08:48:01 +03:00
Benny Halevy	4cdad8bc8b	gms: heart_beat_state: use generation_type and version_type Define default constructor as heart_beat_state(gms::generation_type(0)) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-04-23 08:48:01 +03:00
Benny Halevy	b638571cb0	gms: versioned_value: use version_type Adjust scylla-gdb.get_gms_version_value to get the versioned_value version as version_type (utils::tagged_integer). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-04-23 08:48:01 +03:00
Benny Halevy	f5f566bdd8	utils: add tagged_integer A generic template for defining strongly typed integer types. Use it here to replace raft::internal::tagged_uint64. Will be used for defining gms generation and version as strong and distinguishable types in following patches. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-04-23 08:37:32 +03:00
Benny Halevy	c5d819ce60	gms: versioned_value: make members private and provide accessor functions to get them. 1. So they can't be modified by mistake, as the versioned value is immutable. A new value must have a higher version. 2. Before making the version a strong gms::version_type. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-04-23 08:37:32 +03:00
Kamil Braun	8afb15700b	storage_service: include current CDC generation data in topology snapshots Note that we don't need to include earlier CDC generations, just the current (i.e. latest) one. We might observe a problem when nodes are being bootstrapped in quick succession - I left a FIXME describing the problem and possible solutions.	2023-04-20 16:36:41 +02:00
Kefu Chai	023e985a6c	build: cmake: add missing source files to idl and service they were added recently, but cmake failed to sync with configure.py. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-03-26 14:01:21 +08:00
Gleb Natapov	d69a887366	storage_service: raft topology: introduce snapshot transfer code for the topology table	2023-03-23 16:29:56 +02:00
Gleb Natapov	6a4d773b7e	raft topology: add RAFT_TOPOLOGY_CMD verb that will be used by topology coordinator to communicate with nodes Empty for now. Will be used later by the topology coordinator to communicate with other nodes to instruct them to start streaming, or start to fence reads/writes.	2023-03-23 16:29:56 +02:00
Gleb Natapov	16d61e791f	service: raft: introduce topology_change group0 command Also extend group0_command to be able to send new command type. The command consists of a mutation array.	2023-03-21 16:06:43 +02:00
Avi Kivity	6822e3b88a	query_id: extract into new header query_id currently lives in query-request.hh, a busy place with lots of dependencies. In turn it gets pulled by uuid.idl.hh, which is also very central. This makes test/raft/randomized_nemesis_test.cc, which is nominally only dependent on Raft, rebuild on random header file changes. Fix by extracting into a new header. Closes #13042	2023-03-01 10:25:25 +02:00
Kefu Chai	c0824c6c25	build: cmake: put generated sources into ${scylla_gen_build_dir} to be aligned with the convention of configure.py Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-02-17 18:38:44 +08:00
Kefu Chai	7b431748a8	build: cmake: extract idl library out Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-02-17 18:38:44 +08:00
Botond Dénes	ef50170120	Merge 'build: cmake: sync with configure (2/n)' from Kefu Chai * build: cmake: extract idl out * build: cmake: link cql3 against xxHash * build: cmake: correct the check in Findlibdeflate.cmake * build: cmake find_package(libdeflate) earlier * build: cmake: set more properties to alternator library * build: cmake: include generate_cql_grammar * build: cmake: find xxHash package * build: cmake: add build mode support Closes #12866 * github.com:scylladb/scylladb: build: cmake: correct generate_cql_grammar build: cmake: extract idl out build: cmake: link cql3 against xxHash build: cmake: correct the check in Findlibdeflate.cmake build: cmake: find_package(libdeflate) earlier build: cmake: set more properties to alternator library build: cmake: include generate_cql_grammar build: cmake: find xxHash package build: cmake: add build mode support	2023-02-16 07:11:26 +02:00
Kefu Chai	2718963a2a	build: cmake: extract idl out Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-02-16 00:07:40 +08:00
Avi Kivity	69a385fd9d	Introduce schema/ module Schema related files are moved there. This excludes schema files that also interact with mutations, because the mutation module depends on the schema. Those files will have to go into a separate module. Closes #12858	2023-02-15 11:01:50 +02:00
Avi Kivity	c5e4bf51bd	Introduce mutation/ module Move mutation-related files to a new mutation/ directory. The names are kept in the global namespace to reduce churn; the names are unambiguous in any case. mutation_reader remains in the readers/ module. mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this patch. This is a step forward towards librarization or modularization of the source base. Closes #12788	2023-02-14 11:19:03 +02:00
Michał Sala	bbbe12af43	forward_service: fix timeout support in parallel aggregates `forward_request` verb carried information about timeouts using `lowres_clock::time_point` (that came from local steady clock `seastar::lowres_clock`). The time point was produced on one node and later compared against other node `lowres_clock`. That behavior was wrong (`lowres_clock::time_point`s produced with different `lowres_clock`s cannot be compared) and could lead to delayed or premature timeout. To fix this issue, `lowres_clock::time_point` was replaced with `lowres_system_clock::time_point` in `forward_request` verb. Representation to which both time point types serialize is the same (64-bit integer denoting the count of elapsed nanoseconds), so it was possible to do an in-place switch of those types using logic suggested by @avikivity: - using steady_clock is just broken, so we aren't taking anything from users by breaking it further - once all nodes are upgraded, it magically starts to work Closes #12529	2023-01-16 12:08:13 +02:00
Aleksandra Martyniuk	60e298fda1	repair: change utils::UUID to node_ops_id Type of the id of node operations is changed from utils::UUID to node_ops_id. This way the id of node operations would be easily distinguished from the ids of other entities. Closes #11673	2022-12-20 17:04:47 +02:00
Gleb Natapov	022a825b33	raft: introduce not_a_member error and return it when non member tries to do add/modify_config Currently if a node that is outside of the config tries to add an entry or modify config transient error is returned and this causes the node to retry. But the error is not transient. If a node tries to do one of the operations above it means it was part of the cluster at some point, but since a node with the same id should not be added back to a cluster if it is not in the cluster now it will never be. Return a new error not_a_member to a caller instead. Message-Id: <Y42mTOx8bNNrHqpd@scylladb.com>	2022-12-05 17:11:04 +01:00
Kamil Braun	cbdcc944b5	service/raft: specialized verb for failure detector pinger We used GOSSIP_ECHO verb to perform failure detection. Now we use a special verb DIRECT_FD_PING introduced for this purpose. There are multiple reasons to do so. One minor reason: we want to use the same connection as other Raft verbs: if we can't deliver Raft append_entries or vote messages somewhere, that endpoint should be marked dead; if we can, the endpoint should be marked alive. So putting pings on the same connection as the other Raft verbs is important when dealing with weird situations where some connections are available but others are not. Observe that in `do_get_rpc_client_idx`, we put the new verb in the right place. Another minor reason: we remove the awkward gossiper `echo_pinger` abstraction which required storing and updating gossiper generation numbers. This also removes one dependency from Raft service code to gossiper. Major reason 1: the gossip echo handler has a weird mechanism where a replacing node returns errors during the replace operation to some of the nodes. In Raft however, we want to mark servers as alive when they are alive, including a server running on a node that's replacing another node. Major reason 2, related to the previous one: when server B is replacing server A with the same IP, the failure detector will try to ping both servers. Both servers are mapped to the same IP by the address map, so pings to both servers will reach server B. We want server B to respond to the pings destined for server B, but not to pings destined for server A, so the sender can mark B alive but keep A marked dead. To do this, we include the destination's Raft ID in our RPCs. The destination compares the received ID with its own. If it's different, it returns a `wrong_destination` response, and the failure detector knows that the ping did not reach the destination (it reached someone else). 
Yet another reason: removes "Not ready to respond gossip echo message" log spam during replace.	2022-12-01 20:54:18 +01:00
Kamil Braun	0c9cb5c5bf	Merge 'raft: wait for the next tick before retrying' from Gusev Petr When `modify_config` or `add_entry` is forwarded to the leader, it may reach the node at an "inappropriate" time and result in an exception. There are two reasons for it - the leader is changing and, in case of `modify_config`, another `modify_config` is currently in progress. In both cases the command is retried, but before this patch there was no delay before retrying, which could lead to a tight loop. The patch adds a new exception type `transient_error`. When the client receives it, it is obliged to retry the request after some delay. Previously leader-side exceptions were converted to `not_a_leader`, which is strange, especially for `conf_change_in_progress`. Fixes: #11564 Closes #11769 * github.com:scylladb/scylladb: raft: refactor: remove duplicate code on retries delays raft: use wait_for_next_tick in read_barrier raft: wait for the next tick before retrying	2022-11-16 18:20:54 +01:00
Petr Gusev	5e15c3c9bd	raft: wait for the next tick before retrying When modify_config or add_entry is forwarded to the leader, it may reach the node at an "inappropriate" time and result in an exception. There are two reasons for it - the leader is changing and, in case of modify_config, another modify_config is currently in progress. In both cases the command is retried, but before this patch there was no delay before retrying, which could lead to a tight loop. The patch adds a new exception type transient_error. When the client node receives it, it is obliged to retry the request, possibly after some delay. Previously, leader-side exceptions were converted to not_a_leader exception, which is strange, especially for conf_change_in_progress. We add a delay before retrying in modify_config and add_entry if the client hasn't received any new information about the leader since the last attempt. This can happen if the server responds with a transient_error with an empty leader and the current node has not yet learned the new leader. We neglect the excessive delay when the newly elected leader is the same as the previous one; this is supposed to be rare. Fixes: #11564	2022-11-15 11:49:26 +04:00
Aleksandra Martyniuk	e2c7c1495d	repair: change UUID to task_id Change type of repair id from utils::UUID to task_id to distinguish them from ids of other entities.	2022-10-31 10:07:08 +01:00
Konstantin Osipov	3e46c32d7b	raft: (discovery) do not use raft::server_address to carry IP data We plan to remove IP information from Raft addresses. raft::server_address is used in Raft configuration and also in discovery, which is a separate algorithm, as a handy data structure, to avoid having new entities in RPC. Since we plan to remove IP addresses from Raft configuration, using raft::server_address in discovery and still storing IPs in it would create ambiguity: in some uses raft::server_address would store an IP, and in others - would not. So switch to an own data structure for the purposes of discovery, discovery_peer, which contains a pair ip, raft server id. Note to reviewers: ideally we should switch to URIs in discovery_peer right away. Otherwise we may have to deal with incompatible changes in discovery when adding URI support to Scylla.	2022-10-10 16:24:33 +03:00
Benny Halevy	480b4759a9	idl: streaming: include stream_fwd.hh To keep the idl definition of plan_id from getting out of sync with the one in stream_fwd.hh. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11720	2022-10-06 13:49:26 +02:00
Mikołaj Grzebieluch	db88525774	raft: broadcast_tables: add execution of intermediate language Extended `group0_command` to enable transmission of `raft::broadcast_tables::query`. Added `add_entry_unguarded` method in `raft_group0_client` for dispatching raft commands without `group0_guard`. Queries on group0_kv_store are executed in `group_0_state_machine::apply`, but for now don't return results. They don't use previous state id, so they will block concurrent schema changes, but these changes won't block queries. In this version snapshots are ignored.	2022-09-08 15:25:36 +02:00
Mikołaj Grzebieluch	c541d5c363	raft: broadcast_tables: add definition of intermediate language In broadcast tables, raft command contains a whole program to be executed. Sending and parsing on each node entire CQL statement is inefficient, thus we decided to compile it to an intermediate language which can be easily serializable. This patch adds a definition of such a language. For now, only the following types of statements can be compiled: * select value where key = CONST from system.broadcast_kv_store; * update system.broadcast_kv_store set value = CONST where key = CONST; * update system.broadcast_kv_store set value = CONST where key = CONST if value = CONST; where CONST is string literal.	2022-09-08 14:03:51 +02:00
Tomasz Grabiec	1d0264e1a9	Merge 'Implement Raft upgrade procedure' from Kamil Braun Start with a cluster with Raft disabled, end up with a cluster that performs schema operations using group 0. Design doc: https://docs.google.com/document/d/1PvZ4NzK3S0ohMhyVNZZ-kCxjkK5URmz1VP65rrkTOCQ/ (TODO: replace this with .md file - we can do it as a follow-up) The procedure, on a high level, works as follows: - join group 0 - wait until every peer joined group 0 (peers are taken from `system.peers` table) - enter `synchronize` upgrade state, in which group 0 operations are disabled - wait until all members of group 0 entered `synchronize` state or some member entered the final state - synchronize schema by comparing versions and pulling if necessary - enter the final state (`use_new_procedures`), in which group 0 is used for schema operations. With the procedure comes a recovery mode in case the upgrade procedure gets stuck (and it may if we lose a node during recovery - the procedure, to correctly establish a single group 0 cluster, requires contacting every node). This recovery mode can also be used to recover clusters with group 0 already established if they permanently lose a majority of nodes - killing two birds with one stone. Details in the last commit message. Read the design doc, then read the commits in topological order for best reviewing experience. --- I did some manual tests: upgrading a cluster, using the cluster to add nodes, remove nodes (both with `decommission` and `removenode`), replacing nodes. Performing recovery. As a follow-up, we'll need to implement tests using the new framework (after it's ready). It will be easy to test upgrades and recovery even with a single Scylla version - we start with a cluster with the RAFT flag disabled, then rolling-restart while enabling the flag (and recovery is done through simple CQL statements). 
Closes #10835 * github.com:scylladb/scylladb: service/raft: raft_group0: implement upgrade procedure service/raft: raft_group0: extract `tracker` from `persistent_discovery::run` service/raft: raft_group0: introduce local loggers for group 0 and upgrade service/raft: raft_group0: introduce GET_GROUP0_UPGRADE_STATE verb service/raft: raft_group0_client: prepare for upgrade procedure service/raft: introduce `group0_upgrade_state` db: system_keyspace: introduce `load_peers` idl-compiler: introduce cancellable verbs message: messaging_service: cancellable version of `send_schema_check`	2022-08-25 11:32:06 +03:00
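
The reordering protection described in commit 96a1c661bd (a per-coordinator command index checked together with the Raft term) can be illustrated with a minimal sketch. Everything named here (`topology_cmd`, `cmd_receiver`, `cmd_status`) is hypothetical and is not taken from the ScyllaDB sources; the sketch only mirrors the rule stated in the commit message: reject a command whose term differs from the current term, and reject a command whose index is not greater than the last index received in the same term.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical types for illustration only; not ScyllaDB's actual API.
enum class cmd_status { ok, term_mismatch, reordered };

struct topology_cmd {
    std::uint64_t term;   // Raft term in which the coordinator sent the command
    std::uint64_t index;  // counter the coordinator increments for each command
};

class cmd_receiver {
    std::uint64_t _current_term;
    std::uint64_t _last_term = 0;               // term of the last accepted command
    std::optional<std::uint64_t> _last_index;   // index of the last accepted command
public:
    explicit cmd_receiver(std::uint64_t current_term) : _current_term(current_term) {}

    // Called when the node learns about a new Raft term.
    void set_term(std::uint64_t term) { _current_term = term; }

    cmd_status accept(const topology_cmd& cmd) {
        // A command from a different term comes from a stale (or future) coordinator.
        if (cmd.term != _current_term) {
            return cmd_status::term_mismatch;
        }
        // Within one term, indices must be strictly increasing.
        if (_last_index && _last_term == cmd.term && cmd.index <= *_last_index) {
            return cmd_status::reordered;
        }
        _last_term = cmd.term;
        _last_index = cmd.index;
        return cmd_status::ok;
    }
};
```

With a receiver in term 3, commands `{3, 1}` and `{3, 2}` are accepted in that order, a repeated or late `{3, 1}` is rejected as reordered, and any command carrying term 2 is rejected by the term check. After the term moves to 4, indices may restart because the comparison applies only within one term.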

244 Commits
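
The fencing commits d34da12240 and 46f73fcaa6 describe checking a fencing token before and after a local write, so that a write fails if the fence version advanced while it was in flight. The following is a rough sketch of that double check under stated assumptions: the type names (`fencing_token_sketch`, `stale_topology_error`, `replica_sketch`) are made up for illustration, and the comparison "token version lower than fence version means stale" is an assumption about what "stale" means, not a statement of the actual ScyllaDB implementation.

```cpp
#include <cstdint>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for the exception raised on a stale token.
struct stale_topology_error : std::runtime_error {
    stale_topology_error() : std::runtime_error("stale topology version") {}
};

// Hypothetical token: the topology version the request coordinator captured.
struct fencing_token_sketch {
    std::int64_t topology_version;
};

class replica_sketch {
    std::int64_t _fence_version = 0;
public:
    // The fence version only moves forward.
    void set_fence_version(std::int64_t v) {
        if (v > _fence_version) {
            _fence_version = v;
        }
    }

    // Assumption: a token is stale when its version is below the fence version.
    void apply_fence(fencing_token_sketch token) const {
        if (token.topology_version < _fence_version) {
            throw stale_topology_error();
        }
    }

    // Double check: fence before the write and again after it, so a write
    // that raced with a fence update is reported as failed to the caller.
    template <typename Write>
    void fenced_write(fencing_token_sketch token, Write&& write) {
        apply_fence(token);
        std::forward<Write>(write)();
        apply_fence(token);
    }
};
```

The second `apply_fence` call is the point of the design: the write may already have been applied locally, but the coordinator is told it failed, so no mutation counts as successful once the replica has moved past the coordinator's topology version.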