scylladb

Author	SHA1	Message	Date
Gleb Natapov	03a266d73b	raft: make read_barrier work on a follower as well as on a leader This patch implements a Raft extension that allows performing linearisable reads by accessing the local state machine. The extension is described in section 6.4 of the PhD. To sum it up: to perform a read barrier, a follower needs to ask the leader for the last committed index that it knows about. The leader must make sure that it is still a leader before answering by communicating with a quorum. When the follower gets the index back, it waits for it to be applied, and that completes the read_barrier invocation. The patch adds three new RPCs: read_barrier, read_barrier_reply and execute_read_barrier_on_leader. The last one is the one a follower uses to ask a leader about the safe index it can read. The first two are used by a leader to communicate with a quorum.	2021-08-25 08:57:13 +03:00
Piotr Dulikowski	0d74dee683	Revert "messaging_service: add verbs for hint sync points" This reverts commit `82c419870a`. This commit removes the HINT_SYNC_POINT_CREATE and HINT_SYNC_POINT_CHECK rpc verbs. The upcoming HTTP API for waiting for hint replay will be restricted to waiting for hints on the node handling the request, so there is no need for new verbs.	2021-08-09 09:24:36 +02:00
Calle Wilund	b8b5f69111	messaging_service: Bind to listen address, not broadcast Refs #8418 Broadcast can (apparently) be an address not actually on the machine, but on the other side of NAT. Thus binding the local side of an outgoing connection there will fail. Bind instead to listen_address (or broadcast, if listen_to_broadcast). This requires routing + NAT to make the connection appear to the remote node as coming from the broadcast address, so that the connection is allowed (if using partial encryption). Note: this has only been verified to a limited extent. I would suggest verifying various multi rack/dc setups before relying on it. Closes #8974	2021-07-15 13:18:10 +03:00
Avi Kivity	9059514335	build, treewide: enable -Wpessimizing-move warning This warning prevents using std::move() where it can hurt - on an unnamed temporary or a named automatic variable being returned from a function. In both cases the value could be constructed directly in its final destination, but std::move() prevents it. Fix the handful of cases (all trivial), and enable the warning. Closes #8992	2021-07-08 17:52:34 +03:00
Benny Halevy	51bc6c8b5a	messaging_service: do_start_listen: improve info log accuracy Make sure to log the info message when we actually start listening. Also, print a log message when listening on the broadcast address. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-30 16:25:21 +03:00
Benny Halevy	df442d4d24	messaging_service: never listen on port 0 We never want to listen on port 0, even if configured so. When the listen port is set to 0, the OS will choose the port randomly, which makes it useless for communicating with other nodes in the cluster, since we don't support that. Also, it causes the listen_ports_conf_test internode_ssl_test to fail since it expects to disable listening on storage_port or ssl_storage_port when set to 0, as seen in https://github.com/scylladb/scylla-dtest/issues/2174. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-30 16:24:54 +03:00
Pavel Emelyanov	3552e99ce7	scylla-gdb: Bring scylla netw back to work The netw command tries to access the netw::_the_messaging_service that was removed long ago. The correct place for the messaging service is in debug:: namespace. The scylla-gdb test checks that, but the netw command sees that the ptr in question is not initialized, thinks it's not yet sharded::start()-ed and exits without errors. tests: unit(gdb) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210624135107.12375-1-xemul@scylladb.com>	2021-06-24 20:59:27 +03:00
Avi Kivity	a55b434a2b	treewide: extend copyright statements to present day	2021-06-06 19:18:49 +03:00
Konstantin Osipov	ac43941f17	rpc: don't include an unused header (raft_services.hh) Message-Id: <20210525183919.1395607-7-kostja@scylladb.com>	2021-05-26 11:07:44 +03:00
Asias He	425e3b1182	gossip: Introduce direct failure detector Currently, gossip uses the updates of the gossip heartbeat from gossip messages to decide if a node is up or down. This means if a node is actually down but the gossip messages are delayed in the network, the marking of node down can be delayed. For example, a node sends 20 gossip messages in 20 seconds before it is dead. Each message is delayed 15 seconds by the network for some reason. A node receives those delayed messages one after another. Those delayed messages will prevent this node from being marked as down, because a heartbeat update is received just before the threshold to mark a node down is triggered, which is around 20 seconds by default. As a result, this node will not be marked as down in 20 * 15 seconds = 300 seconds, much longer than the ~20 seconds node down detection time in normal cases. In this patch, a new failure detector is implemented. - Direct detection The existing failure detector can get gossip heartbeat updates indirectly. For example: Node A can talk to Node B Node B can talk to Node C Node A can not talk to Node C, due to network issues Node A will not mark Node C to be down because Node A can get heart beat of Node C from node B indirectly. This indirect detection is not very useful because when Node A decides if it should send requests to Node C, the requests from Node A to C will fail while Node A thinks it can communicate with Node C. This patch changes the failure detection to be direct. It uses the existing gossip echo message to detect directly. Gossip echo messages will be sent to peer nodes periodically. A peer node will be marked as down if a timeout threshold has been met. Since the failure detection is peer to peer, it avoids the delayed message issue mentioned above. - Parallel detection The old failure detector uses shard zero only. This new failure detector utilizes all the shards to perform the failure detection, each shard handling a subset of live nodes. 
For example, if the cluster has 32 nodes and each node has 16 shards, each shard will handle only 2 nodes. In a 16-node cluster where each node has 16 shards, each shard will handle only one peer node. A gossip message will be sent to peer nodes every 2 seconds. The extra echo message traffic produced compared to the old failure detector is negligible. - Deterministic detection Users can configure the failure_detector_timeout_in_ms to set the threshold to mark a node down. It is the maximum time between two successful echo messages before gossip marks a node down. It is easier to understand than the old phi_convict_threshold. - Compatible This patch only uses the existing gossip echo message. Nodes with or without this patch can work together. Fixes #8488 Closes #8036	2021-05-24 10:47:06 +03:00
Avi Kivity	cea5493cb7	storage_proxy, treewide: introduce names for vectors of inet_address storage_proxy works with vectors of inet_addresses for replica sets and for topology changes (pending endpoints, dead nodes). This patch introduces new names for these (without changing the underlying type - it's still std::vector<gms::inet_address>). This is so that the following patch, that changes those types to utils::small_vector, will be less noisy and highlight the real changes that take place.	2021-05-05 18:36:48 +03:00
Avi Kivity	6ffd813b7b	Merge 'hints: delay repair until hints are replayed' from Piotr Dulikowski Both hinted handoff and repair are meant to improve the consistency of the cluster's data. HH does this by storing records of failed replica writes and replaying them later, while repair goes through all data on all participating replicas and makes sure the same data is stored on all nodes. The former is generally cheaper and sometimes (but not always) can bring back full consistency on its own; repair, while being more costly, is a sure way to bring back current data to full consistency. When hinted handoff and repair are running at the same time, some of the work can be unnecessarily duplicated. For example, if a row is repaired first, then hints towards it become unnecessary. However, repair needs to do less work if data already has good consistency, so if hints finish first, then the repair will be shorter. This PR introduces a possibility to wait for hints to be replayed before continuing with user-issued repair. The coordinator of the repair operation asks all nodes participating in the repair operation (including itself) to mark a point at the end of all hint queues pointing towards other nodes participating in repair. Then, it waits until hint replay in all those queues reaches the marked point, or the configured timeout is reached. This operation is currently opt-in and can be turned on by setting the `wait_for_hint_replay_before_repair_in_ms` config option to a positive value. 
Fixes #8102 Tests: - unit(dev) - some manual tests: - shutting down repair coordinator during hints replay, - shutting down node participating in repair during hints replay, Closes #8452 * github.com:scylladb/scylla: repair: introduce abort_source for repair abort repair: introduce abort_source for shutdown storage_proxy: add abort_source to wait_for_hints_to_be_replayed storage_proxy: stop waiting for hints replay when node goes down hints: dismiss segment waiters when hint queue can't send repair: plug in waiting for hints to be sent before repair repair: add get_hosts_participating_in_repair storage_proxy: coordinate waiting for hints to be sent config: add wait_for_hint_replay_before_repair option storage_proxy: implement verbs for hint sync points messaging_service: add verbs for hint sync points storage_proxy: add functions for syncing with hints queue db/hints: make it possible to wait until current hints are sent db/hints: add a metric for counting processed files db/hints: allow to forcefully update segment list on flush	2021-05-03 18:47:27 +03:00
Pavel Solodovnikov	4c351ff260	raft: switch `group_id` type from `uint64_t` to `utils::UUID` Introduce a tagged id struct for `group_id`. Raft code would want to generate quite a lot of unique raft groups in the future (e.g. tablets). UUID is designed exactly for that (e.g. larger capacity than `uint64_t`, obviously, and also has built-in procedures to generate random ids). Also, this is a preparation to make "raft group 0" use a random ID instead of a literal fixed `0` as a group id. The purpose is that every scylla cluster must have a unique ID for "raft group 0" since we don't want the nodes from some other cluster to disrupt the current cluster. This can happen if, for some reason, a foreign node happens to contact a node in our cluster. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20210429170630.533596-3-pa.solodovnikov@scylladb.com>	2021-05-02 16:39:54 +03:00
Eliran Sinvani	0320110b04	messaging service: be more verbose when shutting down servers and clients We encountered a phenomenon where shutting down the messaging service doesn't complete, leaving the shutdown process stuck. Since we couldn't pinpoint where exactly the shutdown went wrong, here we add some verbosity to the shutdown stages so we can more accurately pinpoint the culprit. Closes #8560	2021-04-29 12:28:17 +03:00
Piotr Dulikowski	82c419870a	messaging_service: add verbs for hint sync points Adds two verbs: HINT_SYNC_POINT_CREATE and HINT_SYNC_POINT_CHECK. Those will make it possible to create a sync point and regularly poll to check its existence.	2021-04-27 15:06:39 +02:00
Avi Kivity	0af7a22c21	repair: remove partition_checksum and related code `80ebedd242` made row-level repair mandatory, so there remain no callers to partition_checksum. Remove it. Closes #8537	2021-04-22 18:56:53 +03:00
Asias He	9ea57dff21	gossip: Relax failure detector update We currently only update the failure detector for a node when a higher version of application state is received. Since gossip syn messages do not contain application state, this means we do not update the failure detector upon receiving gossip syn messages, even if a message from a peer node is received, which implies the peer node is alive. This patch relaxes the failure detector update rule to update the failure detector for the sender of gossip messages directly. Refs #8296 Closes #8476	2021-04-14 13:16:00 +02:00
Avi Kivity	82c76832df	treewide: don't include "db/system_distributed_keyspace.hh" from headers This just causes unneeded and slower recompilations. Instead replace with forward declarations, or includes of smaller headers that were incidentally brought in by the one removed. The .cc files that really need it gain the include, but they are few. Ref #1. Closes #8403	2021-04-04 14:00:26 +03:00
Avi Kivity	40b60e8f09	Merge 'repair: Switch to use NODE_OPS_CMD for replace operation' from Asias He In commit `c82250e0cf` (gossip: Allow deferring advertise of local node to be up), the replacing node is changed to postpone responding to the gossip echo message, to avoid other nodes sending read requests to the replacing node. It works as follows: 1) replacing node does not respond to echo messages, so that other nodes do not mark the replacing node as alive 2) replacing node advertises hibernate state so other nodes know the replacing node is replacing 3) replacing node responds to echo messages so other nodes can mark the replacing node as alive This is problematic because after step 2, the existing nodes in the cluster will start to send writes to the replacing node, but at this time it is possible that existing nodes haven't marked the replacing node as alive, thus failing the write request unnecessarily. For instance, we saw the following errors in issue #8013 (Cassandra stress fails to achieve consistency when only one of the nodes is down) ``` scylla: [shard 1] consistency - Live nodes 2 do not satisfy ConsistencyLevel (2 required, 1 pending, live_endpoints={127.0.0.2, 127.0.0.1}, pending_endpoints={127.0.0.3}) [shard 0] gossip - Fail to send EchoMessage to 127.0.0.3: std::runtime_error (Not ready to respond gossip echo message) c-s: java.io.IOException: Operation x10 on key(s) [4c4f4d37324c35304c30]: Error executing: (UnavailableException): Not enough replicas available for query at consistency QUORUM (2 required but only 1 alive ``` To solve this problem, we can do the replacing operation in multiple stages. 
One solution is to introduce a new gossip status state as proposed here: gossip: Introduce STATUS_PREPARE_REPLACE #7416 1) replacing node does not respond to echo messages 2) replacing node advertises prepare_replace state (Remove replacing node from natural endpoint, but do not put in pending list yet) 3) replacing node responds to echo messages 4) replacing node advertises hibernate state (Put replacing node in pending list) Since we now have the node ops verb introduced in `829b4c1438` (repair: Make removenode safe by default), we can do the multiple stages without introducing a new gossip status state. This patch uses the NODE_OPS_CMD infrastructure to implement the replace operation. Improvements: 1) It solves the race between marking the replacing node alive and sending writes to the replacing node 2) The cluster reverts to the state before the replace operation automatically in case of error. As a result, it solves the issue where, when the replacing node fails in the middle of the operation, the replacing node stays in HIBERNATE status forever. 3) The gossip status of the node to be replaced is not changed until the replace operation is successful. HIBERNATE gossip status is not used anymore. 4) Users can now pass a list of dead nodes to ignore explicitly. Fixes #8013 Closes #8330 * github.com:scylladb/scylla: repair: Switch to use NODE_OPS_CMD for replace operation gossip: Add advertise_to_nodes gossip: Add helper to wait for a node to be up gossip: Add is_normal_ring_member helper	2021-04-04 12:54:09 +03:00
Asias He	bdb95233e8	gossip: Add advertise_to_nodes gossiper::advertise_to_nodes() is added to allow responding to gossip echo messages only for specified nodes and with the current gossip generation number for those nodes. This helps avoid a restarted node being marked as alive during a pending replace operation. After this patch, when a node sends an echo message, the gossip generation number is sent in the echo message. Since the generation number changes after a restart, the receiver of the echo message can compare the generation number to tell if the node has restarted. Refs #8013	2021-04-01 09:38:54 +08:00
Gleb Natapov	9d6bf7f351	raft: introduce leader stepdown procedure Section 3.10 of the PhD describes two cases for which the extension can be helpful: 1. Sometimes the leader must step down. For example, it may need to reboot for maintenance, or it may be removed from the cluster. When it steps down, the cluster will be idle for an election timeout until another server times out and wins an election. This brief unavailability can be avoided by having the leader transfer its leadership to another server before it steps down. 2. In some cases, one or more servers may be more suitable to lead the cluster than others. For example, a server with high load would not make a good leader, or in a WAN deployment, servers in a primary datacenter may be preferred in order to minimize the latency between clients and the leader. Other consensus algorithms may be able to accommodate these preferences during leader election, but Raft needs a server with a sufficiently up-to-date log to become leader, which might not be the most preferred one. Instead, a leader in Raft can periodically check to see whether one of its available followers would be more suitable, and if so, transfer its leadership to that server. (If only human leaders were so graceful.) The patch here implements the extension and employs it automatically when a leader removes itself from a cluster.	2021-03-22 10:28:43 +02:00
Konstantin Osipov	4afa662d62	raft: respond with snapshot_reply to send_snapshot RPC Raft send_snapshot RPC is actually two-way, the follower responds with snapshot_reply message. This message until now was, however, muted by RPC. Do not mute snapshot_reply any more: - to make it obvious the RPC is two way - to feed the follower response directly into leader's FSM and thus ensure that FSM testing results produced when using a test transport are representative of the real world uses of raft::rpc.	2021-03-18 16:56:42 +03:00
Calle Wilund	a0745f9498	messaging_service: Enforce dc/rack membership iff required for non-tls connections When internode_encryption is "rack" or "dc", we should enforce that incoming connections are from the appropriate address spaces iff answering on a non-tls socket. This is implemented by having two protocol handlers. One for tls/full notls, and one for mixed (needs checking) connections. The latter will ask snitch if the remote address is kosher, and refuse the connection otherwise. Note: requires seastar patches: "rpc: Make is possible for rpc server instance to refuse connection" "RPC: (client) retain local address and use on stream creation" Note that ip-level checks are not exhaustive. If a user is also using "require_client_auth" with the dc/rack tls setting, we should warn them that there is a possibility that someone could spoof their way past the authentication. Closes #8051	2021-03-17 09:59:22 +02:00
Asias He	7018377bd7	messaging_service: Move gossip ack message verb to gossip group Fix a scheduling group leak: INFO [shard 0] gossip - gossiper::run sg=gossip INFO [shard 0] gossip - gossiper::handle_ack_msg sg=statement INFO [shard 0] gossip - gossiper::handle_syn_msg sg=gossip INFO [shard 0] gossip - gossiper::handle_ack2_msg sg=gossip After the fix: INFO [shard 0] gossip - gossiper::run sg=gossip INFO [shard 0] gossip - gossiper::handle_ack_msg sg=gossip INFO [shard 0] gossip - gossiper::handle_syn_msg sg=gossip INFO [shard 0] gossip - gossiper::handle_ack2_msg sg=gossip Fixes #7986 Closes #8129	2021-02-23 10:10:00 +02:00
Avi Kivity	789233228b	messaging: don't inherit from seastar::rpc::protocol messaging_service's rpc_protocol_server_wrapper inherits from seastar::rpc::protocol::server as a way to avoid a is unfortunate, as protocol.hh wasn't designed for inheritance, and is not marked final. Avoid this inheritance by hiding the class as a member. This causes a lot of boilerplate code, which is unfortunate, but this random inheritance is bad practice and should be avoided. Closes #8084	2021-02-16 16:04:44 +02:00
Pavel Solodovnikov	d8dfdfba1e	raft: pass `group_id` as an argument to raft rpc messages This will be used later to filter the requests which belong to the schema raft group and route them to shard 0. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-02-11 16:25:33 +03:00
Pavel Solodovnikov	1a979dbba2	raft: add Raft RPC verbs to `messaging_service` and wire up the RPC calls All RPC module APIs except for `send_snapshot` should resolve as soon as the message is sent, so these messages are passed via `send_message_oneway_timeout`. `send_snapshot` message is sent via `send_message_timeout` and returns a `future<>`, which resolves when snapshot transfer finishes or fails with an exception. All necessary functions to wire the new Raft RPC verbs are also provided (such as `register` and `unregister` handlers). Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-01-30 01:11:17 +03:00
Asias He	829b4c1438	repair: Make removenode safe by default Currently removenode works like below: - The coordinator node advertises the node to be removed in REMOVING_TOKEN status in gossip - Existing nodes learn the node in REMOVING_TOKEN status - Existing nodes sync data for the ranges they own - Existing nodes send notification to the coordinator - The coordinator node waits for notification and announces the node in REMOVED_TOKEN Current problems: - Existing nodes do not tell the coordinator if the data sync is ok or failed. - The coordinator can not abort the removenode operation in case of error - Failed removenode operation will make the node to be removed stay in REMOVING_TOKEN forever. - The removenode runs in best effort mode which may cause data consistency issues. It means if a node that owns the range after the removenode operation is down during the operation, the removenode operation will continue to succeed without requiring that node to perform data syncing. This can cause data consistency issues. For example, Five nodes in the cluster, RF = 3, for a range, n1, n2, n3 are the old replicas, n2 is being removed, after the removenode operation, the new replicas are n1, n5, n3. If n3 is down during the removenode operation, only n1 will be used to sync data with the new owner n5. This will break QUORUM read consistency if n1 happens to miss some writes. Improvements in this patch: - This patch makes the removenode safe by default. We require all nodes in the cluster to participate in the removenode operation and sync data if needed. We fail the removenode operation if any of them is down or fails. If the user wants the removenode operation to succeed even if some of the nodes are not available, the user has to explicitly pass a list of nodes that can be skipped for the operation. 
$ nodetool removenode --ignore-dead-nodes <list_of_dead_nodes_to_ignore> <host_id> Example restful api: $ curl -X POST "http://127.0.0.1:10000/storage_service/remove_node/?host_id=7bd303e9-4c7b-4915-84f6-343d0dbd9a49&ignore_nodes=127.0.0.3,127.0.0.5" - The coordinator can abort data sync on existing nodes For example, if one of the nodes fails to sync data, it makes no sense for other nodes to continue to sync data because the whole operation will fail anyway. - The coordinator can decide which nodes to ignore and pass the decision to other nodes Previously, there was no way for the coordinator to tell existing nodes to run in strict mode or best effort mode. Users had to modify the config file or run a restful api cmd on all the nodes to select strict or best effort mode. With this patch, the cluster wide configuration is eliminated. Fixes #7359 Closes #7626	2020-12-10 10:14:39 +02:00
Benny Halevy	e28d80ec0c	messaging: msg_addr: mark methods noexcept Based on gms::inet_address. With that, gossiper::get_msg_addr can be marked noexcept (and const while at it). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-11-01 16:46:18 +02:00
Tomasz Grabiec	14fdd2f501	Merge "Gossip echo message improvement" from Asias This series improves gossip echo message handling in a loaded cluster. Refs: #7197 * git://github.com/asias/scylla.git gossip_echo_improve_7197: gossiper: Handle echo message on any shard gossiper: Increase echo message timeout gossiper: Remove unused _last_processed_message_at	2020-09-24 15:13:55 +02:00
Asias He	c7cb638e95	gossiper: Increase echo message timeout Gossip echo message is used to confirm a node is up. In a heavily loaded slow cluster, a node might take a long time to receive a heart beat update, then the node uses the echo message to confirm the peer node is really up. If the echo message times out too early, the peer node will not be marked as up. This is bad because a live node is marked as down and this could happen on multiple nodes in the cluster, which causes a cluster wide unavailability issue. In order to prevent multiple nodes from being marked as down, it is better to be conservative and less restrictive on the echo message timeout. Note, echo message is not used to detect a node down. Increasing the echo timeout does not have any impact on marking a node down in a timely manner. Refs: #7197	2020-09-24 09:50:09 +08:00
Pavel Emelyanov	2fde6bbfe7	messaging_service: Report still registered services as errors On stop -- unregister the CLIENT_ID verb, which is registered in constructor, then check for any remaining ones. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-09-17 09:52:57 +03:00
Pavel Emelyanov	623f61e63e	messaging_service: Unglobal messaging service instance Remove the global messaging_service, keep it on the main stack. But also store a pointer on it in debug namespace for debugging. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-08-19 20:50:53 +03:00
Pavel Emelyanov	4ea63b2211	gossiper: Share the messaging service with snitch And make snitch use gossiper's messaging, not global Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-08-19 20:50:52 +03:00
Pavel Emelyanov	878c50b9ad	main: Keep reference on global messaging service This is the preparation for moving the message service to main -- keep a reference and eventually pass one to subsystems depending on messaging. Once they are ready, the reference will be turned into an instance. For now only push the reference into the messaging service init/exit itself, other subsystems will be patched next. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-08-19 13:08:12 +03:00
Pavel Emelyanov	bdfb77492f	init: The messaging_service::stop is back (not really) Introduce back the .stop() method that will be used to really stop the service. For now do not do sharded::stop, as its users are not yet stopping, so this prevents use-after-free on messaging service. For now the .stop() is empty, but will be in charge of checking if all the other users had unregistered their handlers from rpc. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-08-19 13:08:12 +03:00
Pavel Emelyanov	c28aeaee2e	messaging_service: Move initialization to messaging/ Now the init_messaging_service() only deals with messaging service and related internal stuff, so it can sit in its own module. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-08-19 13:08:12 +03:00
Pavel Emelyanov	5b169e8d16	messaging_service: Construct using config This is the continuation of the previous patch -- change the primary constructor to work with config. This, in turn, will decouple the messaging service from database::config. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-08-19 13:08:12 +03:00
Pavel Emelyanov	304a414e39	messaging_service: Introduce and use config This service constructor uses and copies many simple values, it would be much simpler to group them on config. It also helps the next patches to simplify the messaging service initialization and to keep the defaults (for testing) in one place. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-08-19 13:08:12 +03:00
Pavel Emelyanov	1c8ea817cd	messaging_service: Rename stop() to shutdown() On today's stop() the messaging service is not really stopped as other services still (may) use it and have registered handlers in it. Inside the .stop() only the rpc servers are brought down, so the better name for this method would be shutdown(). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-08-19 13:08:12 +03:00
Pavel Emelyanov	e6fb2b58fc	messaging_service: Cleanup visibility of stopping methods Just a cleanup. These internal stoppers must be private, also there are too many public specifiers in the class description around them. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-08-19 13:08:12 +03:00
Avi Kivity	3b1ff90a1a	Merge "Get rid of seed concept in gossip" from Asias " gossip: Get rid of seed concept The concept of seed and the different behaviour between seed nodes and non seed nodes generate a lot of confusion, complication and error for users. For example, how to add a seed node into a cluster, how to promote a non seed node to a seed node, how to choose seed nodes in a multiple DC setup, edit config files for seeds, why a seed node does not bootstrap. If we remove the concept of seed, it will get much easier for users. After this series, seed config option is only used once when a new node joins a cluster. Major changes: Seed nodes are only used as the initial contact point nodes. Seed nodes now perform bootstrap. The only exception is the first node in the cluster. The unsafe auto_bootstrap option is now ignored. Gossip shadow round now talks to all nodes instead of just seed nodes. Refs: #6845 Tests: update_cluster_layout_tests.py + manual test " * 'gossip_no_seed_v2' of github.com:asias/scylla: gossip: Get rid of seed concept gossip: Introduce GOSSIP_GET_ENDPOINT_STATES verb gossip: Add do_apply_state_locally helper gossip: Do not talk to seed node explicitly gossip: Talk to live endpoints in a shuffled fashion	2020-08-17 09:50:51 +03:00
Avi Kivity	257c17a87a	Merge "Don't depend on seastar::make_(lw_)?shared idiosyncrasies" from Rafael " While working on another patch I was getting odd compiler errors saying that a call to ::make_shared was ambiguous. The reason was that seastar has both: template <typename T, typename... A> shared_ptr<T> make_shared(A&&... a); template <typename T> shared_ptr<T> make_shared(T&& a); The second variant doesn't exist in std::make_shared. This series drops the dependency in scylla, so that a future change can make seastar::make_shared a bit more like std::make_shared. " * 'espindola/make_shared' of https://github.com/espindola/scylla: Everywhere: Explicitly instantiate make_lw_shared Everywhere: Add a make_shared_schema helper Everywhere: Explicitly instantiate make_shared cql3: Add a create_multi_column_relation helper main: Return a shared_ptr from defer_verbose_shutdown	2020-08-02 19:51:24 +03:00
Avi Kivity	3f84d41880	Merge "messaging: make verb handler registering independent of current scheduling group" from Botond " `0c6bbc8` refactored `get_rpc_client_idx()` to select different clients for statement verbs depending on the current scheduling group. The goal was to allow statement verbs to be sent on different connections depending on the current scheduling group. The new connections use per-connection isolation. For backward compatibility the already existing connections fall-back to per-handler isolation used previously. The old statement connection, called the default statement connection, also used this. `get_rpc_client_idx()` was changed to select the default statement connection when the current scheduling group is the statement group, and a non-default connection otherwise. This inadvertently broke `scheduling_group_for_verb()` which also used this method to get the scheduling group to be used to isolate a verb at handle register time. This method needs the default client idx for each verb, but if verb registering is run under the system group it instead got the non-default one, resulting in the per-handler isolation not being set-up for the default statement connection, resulting in default statement verb handlers running in whatever scheduling group the process loop of the rpc is running in, which is the system scheduling group. This caused all sorts of problems, even beyond user queries running in the system group. Also as of `0c6bbc8` queries on the replicas are classified based on the scheduling group they are running on, so user reads also ended up using the system concurrency semaphore. In particular this caused severe problems with ranges scans, which in some cases ended up using different semaphores per page resulting in a crash. 
This could happen because when the page was read locally the code would run in the statement scheduling group, but when the request arrived from a remote coordinator via rpc, it was read in the system scheduling group. This caused a mismatch between the semaphore the saved reader was created with and the one the new page was read with. The result was that in some cases when looking up a paused reader from the wrong semaphore, a reader belonging to another read was returned, creating a disconnect between the lifecycle of the readers and that of the slice and range they were referencing. This series fixes the underlying problem of the scheduling group influencing the verb handler registration, as well as adding some additional defenses if this semaphore mismatch ever happens in the future. Inactive read handles are now unique across all semaphores, meaning that it is not possible anymore that a handle succeeds in looking up a reader when used with the wrong semaphore. The range scan algorithm now also makes sure there is no semaphore mismatch between the one used for the current page and that of the saved reader from the previous page. I manually checked that each individual defense added is already preventing the crash from happening. Fixes: #6613 Fixes: #6907 Fixes: #6908 Tests: unit(dev), manual(run the crash reproducer, observe no crash) " * 'query-classification-regressions/v1' of https://github.com/denesb/scylla: multishard_mutation_query: use cached semaphore messaging: make verb handler registering independent of current scheduling group multishard_mutation_query: validate the semaphore of the looked-up reader reader_concurrency_semaphore: make inactive read handles unique across semaphores reader_concurrency_semaphore: add name() accessor reader_concurrency_semaphore: allow passing name to no-limit constructor	2020-07-27 13:56:52 +03:00
Botond Dénes	0df4c2fd3b	messaging: make verb handler registering independent of current scheduling group `0c6bbc8` refactored `get_rpc_client_idx()` to select different clients for statement verbs depending on the current scheduling group. The goal was to allow statement verbs to be sent on different connections depending on the current scheduling group. The new connections use per-connection isolation. For backward compatibility the already existing connections fall back to the per-handler isolation used previously. The old statement connection, called the default statement connection, also used this. `get_rpc_client_idx()` was changed to select the default statement connection when the current scheduling group is the statement group, and a non-default connection otherwise. This inadvertently broke `scheduling_group_for_verb()`, which also used this method to get the scheduling group used to isolate a verb at handler registration time. This method needs the default client idx for each verb, but if verb registration runs under the system group it got the non-default one instead, so the per-handler isolation was not set up for the default statement connection. As a result, default statement verb handlers ran in whatever scheduling group the process loop of the rpc runs in, which is the system scheduling group. This caused all sorts of problems, even beyond user queries running in the system group. Also as of `0c6bbc8` queries on the replicas are classified based on the scheduling group they are running on, so user reads also ended up using the system concurrency semaphore.	2020-07-27 10:11:21 +03:00
Asias He	cd7d64f588	gossip: Introduce GOSSIP_GET_ENDPOINT_STATES verb The new verb is used to replace the current gossip shadow round implementation. The current shadow round implementation reuses the gossip syn and ack async messages, which has plenty of drawbacks. It is hard to tell whether a specific peer node has responded to the syn messages. Delayed responses from the shadow round can be applied to the normal gossip states even after the shadow round is done. The syn and ack message handlers are full of special cases due to the shadow round. All gossip application states, including the ones that are not relevant, are sent back. The gossip application states are applied and the gossip listeners are called as if in normal gossip operation. It is completely unnecessary to call the gossip listeners in the shadow round. This patch introduces a new synchronous verb to request exactly the gossip application states the shadow round needs, and applies the application states without calling the gossip listeners. This patch makes the shadow round easier to reason about, more robust and more efficient. Refs: #6845 Tests: update_cluster_layout_tests.py	2020-07-27 09:15:11 +08:00
Pavel Emelyanov	7a7b1b3108	messaging: Add missing handlers unregistration helpers Handlers for each verb have both -- register and unregister helpers, but unregistration ones for some verbs are missing, so here they are. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-07-22 16:31:57 +03:00
Rafael Ávila de Espíndola	e15c8ee667	Everywhere: Explicitly instantiate make_lw_shared seastar::make_lw_shared has an overload taking a T&&. There is no such overload in std::make_shared: https://en.cppreference.com/w/cpp/memory/shared_ptr/make_shared This means that we have to move from make_lw_shared(T(...)) to make_lw_shared<T>(...) if we don't want to depend on the idiosyncrasies of seastar::make_lw_shared. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-07-21 10:33:49 -07:00
Pavel Emelyanov	8618a02815	migration_manager: Remove db/schema_tables.hh inclusion into header The schema_tables.hh -> migration_manager.hh couple seems to work as one of "single header for everything", creating big bloat for many seemingly unrelated .hh's. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-07-17 17:54:43 +03:00
Asias He	67f6da6466	repair: Switch to btree_set for repair_hash. In one of the longevity tests, we observed a 1.3s reactor stall which came from repair_meta::get_full_row_hashes_source_op. It traced back to a call to std::unordered_set::insert() which triggered a big memory allocation and reclaim. I measured std::unordered_set, absl::flat_hash_set, absl::node_hash_set and absl::btree_set. The absl::btree_set was the only one for which the seastar oversized allocation checker did not warn in my tests, where around 300K repair hashes were inserted into the container. - unordered_set: hash_sets=295634, time=333029199 ns - flat_hash_set: hash_sets=295634, time=312484711 ns - node_hash_set: hash_sets=295634, time=346195835 ns - btree_set: hash_sets=295634, time=341379801 ns The btree_set is a bit slower than unordered_set but it does not make huge memory allocations. I did not measure any real difference in the total time to finish repair of the same dataset with unordered_set and btree_set. To fix, switch to the absl btree_set container. Fixes #6190	2020-07-09 11:35:18 +03:00

348 Commits