Commit Graph

324 Commits

Avi Kivity
789233228b messaging: don't inherit from seastar::rpc::protocol
messaging_service's rpc_protocol_server_wrapper inherits from
seastar::rpc::protocol::server. This is unfortunate, as protocol.hh
wasn't designed for inheritance, and the class is not marked final.

Avoid this inheritance by hiding the class as a member. This causes
a lot of boilerplate code, which is unfortunate, but this random
inheritance is bad practice and should be avoided.
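The replacement pattern can be sketched in isolation; `library_server` and `server_wrapper` below are illustrative stand-ins, not the actual seastar or scylla types:

```cpp
#include <cassert>

// Stand-in for a library class that was not designed to be a base class
// (its header not meant for inheritance, the class not marked final) --
// analogous to seastar::rpc::protocol<...>::server above.
struct library_server {
    int port;
    int start() { return port; }   // pretend this binds and listens
};

// Instead of `struct wrapper : library_server { ... }`, hold the library
// class as a member and forward the calls we need. More boilerplate, but
// no dependence on inheriting from a class that wasn't designed for it.
class server_wrapper {
    library_server _server;
public:
    explicit server_wrapper(int port) : _server{port} {}
    int start() { return _server.start(); }   // forwarding boilerplate
    int port() const { return _server.port; }
};
```

The forwarding members are exactly the "boilerplate code" the commit message accepts as the price of avoiding the inheritance.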

Closes #8084
2021-02-16 16:04:44 +02:00
Pavel Solodovnikov
d8dfdfba1e raft: pass group_id as an argument to raft rpc messages
This will be used later to filter the requests which belong
to the schema raft group and route them to shard 0.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-02-11 16:25:33 +03:00
Pavel Solodovnikov
1a979dbba2 raft: add Raft RPC verbs to messaging_service and wire up the RPC calls
All RPC module APIs except for `send_snapshot` should resolve as
soon as the message is sent, so these messages are passed via
`send_message_oneway_timeout`.

The `send_snapshot` message is sent via `send_message_timeout` and
returns a `future<>`, which resolves when the snapshot transfer
finishes or fails with an exception.

All necessary functions to wire the new Raft RPC verbs are also
provided (such as `register` and `unregister` handlers).

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-30 01:11:17 +03:00
Asias He
829b4c1438 repair: Make removenode safe by default
Currently removenode works like below:

- The coordinator node advertises the node to be removed in
  REMOVING_TOKEN status in gossip

- Existing nodes learn the node in REMOVING_TOKEN status

- Existing nodes sync data for the ranges they own

- Existing nodes send notification to the coordinator

- The coordinator node waits for notification and announce the node in
  REMOVED_TOKEN

Current problems:

- Existing nodes do not tell the coordinator whether the data sync
  succeeded or failed.

- The coordinator cannot abort the removenode operation in case of error.

- A failed removenode operation leaves the node being removed in
  REMOVING_TOKEN status forever.

- The removenode operation runs in best-effort mode, which may cause
  data consistency issues.

  If a node that will own part of the range after the removenode
  operation is down during the operation, the operation will still
  succeed without requiring that node to perform data syncing. This can
  cause data consistency issues.

  For example, Five nodes in the cluster, RF = 3, for a range, n1, n2,
  n3 is the old replicas, n2 is being removed, after the removenode
  operation, the new replicas are n1, n5, n3. If n3 is down during the
  removenode operation, only n1 will be used to sync data with the new
  owner n5. This will break QUORUM read consistency if n1 happens to
  miss some writes.

Improvements in this patch:

- This patch makes the removenode safe by default.

We require all nodes in the cluster to participate in the removenode operation and
sync data if needed. We fail the removenode operation if any of them is down or
fails.

If the user wants the removenode operation to succeed even when some of the
nodes are unavailable, the user has to explicitly pass a list of nodes that
can be skipped for the operation.

$ nodetool removenode --ignore-dead-nodes <list_of_dead_nodes_to_ignore> <host_id>

Example restful api:

$ curl -X POST "http://127.0.0.1:10000/storage_service/remove_node/?host_id=7bd303e9-4c7b-4915-84f6-343d0dbd9a49&ignore_nodes=127.0.0.3,127.0.0.5"

- The coordinator can abort data sync on existing nodes

For example, if one of the nodes fails to sync data, it makes no sense for
the other nodes to continue syncing, because the whole operation will fail
anyway.

- The coordinator can decide which nodes to ignore and pass the decision
  to other nodes

Previously, there was no way for the coordinator to tell existing nodes
to run in strict mode or best-effort mode. Users had to modify the
config file or run a RESTful API command on all the nodes to select
strict or best-effort mode. With this patch, that cluster-wide
configuration is eliminated.

Fixes #7359

Closes #7626
2020-12-10 10:14:39 +02:00
Benny Halevy
e28d80ec0c messaging: msg_addr: mark methods noexcept
Based on gms::inet_address.

With that, gossiper::get_msg_addr can be marked noexcept (and const while at it).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-01 16:46:18 +02:00
Tomasz Grabiec
14fdd2f501 Merge "Gossip echo message improvement" from Asias
This series improves gossip echo message handling in a loaded cluster.

Refs: #7197

* git://github.com/asias/scylla.git gossip_echo_improve_7197:
  gossiper: Handle echo message on any shard
  gossiper: Increase echo message timeout
  gossiper: Remove unused _last_processed_message_at
2020-09-24 15:13:55 +02:00
Asias He
c7cb638e95 gossiper: Increase echo message timeout
The gossip echo message is used to confirm a node is up. In a heavily
loaded, slow cluster, a node might take a long time to receive a heart
beat update; the node then uses the echo message to confirm the peer
node is really up.

If the echo message times out too early, the peer node will not be
marked as up. This is bad because a live node is marked as down, and it
could happen on multiple nodes in the cluster, causing a cluster-wide
unavailability issue. To prevent multiple nodes from being marked as
down, it is better to be conservative and less restrictive with the
echo message timeout.

Note that the echo message is not used to detect that a node is down.
Increasing the echo timeout does not affect marking a node down in a
timely manner.

Refs: #7197
2020-09-24 09:50:09 +08:00
Pavel Emelyanov
2fde6bbfe7 messaging_service: Report still registered services as errors
On stop -- unregister the CLIENT_ID verb, which is registered
in the constructor, then check for any remaining ones.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-17 09:52:57 +03:00
Pavel Emelyanov
623f61e63e messaging_service: Unglobal messaging service instance
Remove the global messaging_service and keep it on the main stack,
but also store a pointer to it in the debug namespace for debugging.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:53 +03:00
Pavel Emelyanov
4ea63b2211 gossiper: Share the messaging service with snitch
And make snitch use gossiper's messaging, not global

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:52 +03:00
Pavel Emelyanov
878c50b9ad main: Keep reference on global messaging service
This is the preparation for moving the message service to main -- keep
a reference and eventually pass one to subsystems depending on messaging.
Once they are ready, the reference will be turned into an instance.

For now only push the reference into the messaging service init/exit
itself, other subsystems will be patched next.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
bdfb77492f init: The messaging_service::stop is back (not really)
Introduce back the .stop() method that will be used to really stop
the service. For now do not do sharded::stop, as its users are not
yet stopping, so this prevents use-after-free on messaging service.

For now the .stop() is empty, but it will be in charge of checking that
all the other users have unregistered their handlers from rpc.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
c28aeaee2e messaging_service: Move initialization to messaging/
Now init_messaging_service() only deals with the messaging service
and related internal stuff, so it can sit in its own module.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
5b169e8d16 messaging_service: Construct using config
This is the continuation of the previous patch -- change the primary
constructor to work with config. This, in turn, will decouple the
messaging service from database::config.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
304a414e39 messaging_service: Introduce and use config
This service constructor uses and copies many simple values; it is much
simpler to group them in a config. It also helps the next patches
simplify the messaging service initialization and keep the defaults
(for testing) in one place.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
1c8ea817cd messaging_service: Rename stop() to shutdown()
With today's stop() the messaging service is not really stopped, as other
services still (may) use it and have handlers registered in it. Inside
.stop() only the rpc servers are brought down, so a better name for
this method is shutdown().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
e6fb2b58fc messaging_service: Cleanup visibility of stopping methods
Just a cleanup. These internal stoppers must be private; there are also
too many public specifiers in the class declaration around them.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Avi Kivity
3b1ff90a1a Merge "Get rid of seed concept in gossip" from Asias
"
gossip: Get rid of seed concept

The concept of seed and the different behaviour between seed nodes and
non-seed nodes generate a lot of confusion, complication and errors for
users: for example, how to add a seed node into a cluster, how to
promote a non-seed node to a seed node, how to choose seed nodes in a
multi-DC setup, how to edit config files for seeds, and why a seed node
does not bootstrap.

If we remove the concept of seed, things get much easier for users.
After this series, the seed config option is only used once, when a new
node joins a cluster.

Major changes:

Seed nodes are only used as the initial contact point nodes.

Seed nodes now perform bootstrap. The only exception is the first node
in the cluster.

The unsafe auto_bootstrap option is now ignored.

Gossip shadow round now talks to all nodes instead of just seed nodes.

Refs: #6845
Tests: update_cluster_layout_tests.py + manual test
"

* 'gossip_no_seed_v2' of github.com:asias/scylla:
  gossip: Get rid of seed concept
  gossip: Introduce GOSSIP_GET_ENDPOINT_STATES verb
  gossip: Add do_apply_state_locally helper
  gossip: Do not talk to seed node explicitly
  gossip: Talk to live endpoints in a shuffled fashion
2020-08-17 09:50:51 +03:00
Avi Kivity
257c17a87a Merge "Don't depend on seastar::make_(lw_)?shared idiosyncrasies" from Rafael
"
While working on another patch I was getting odd compiler errors
saying that a call to ::make_shared was ambiguous. The reason was that
seastar has both:

template <typename T, typename... A>
shared_ptr<T> make_shared(A&&... a);

template <typename T>
shared_ptr<T> make_shared(T&& a);

The second variant doesn't exist in std::make_shared.

This series drops the dependency in scylla, so that a future change
can make seastar::make_shared a bit more like std::make_shared.
"

* 'espindola/make_shared' of https://github.com/espindola/scylla:
  Everywhere: Explicitly instantiate make_lw_shared
  Everywhere: Add a make_shared_schema helper
  Everywhere: Explicitly instantiate make_shared
  cql3: Add a create_multi_column_relation helper
  main: Return a shared_ptr from defer_verbose_shutdown
2020-08-02 19:51:24 +03:00
Avi Kivity
3f84d41880 Merge "messaging: make verb handler registering independent of current scheduling group" from Botond
"
0c6bbc8 refactored `get_rpc_client_idx()` to select different clients
for statement verbs depending on the current scheduling group.
The goal was to allow statement verbs to be sent on different
connections depending on the current scheduling group. The new
connections use per-connection isolation. For backward compatibility the
already existing connections fall back to the per-handler isolation used
previously. The old statement connection, called the default statement
connection, also used this. `get_rpc_client_idx()` was changed to select
the default statement connection when the current scheduling group is
the statement group, and a non-default connection otherwise.

This inadvertently broke `scheduling_group_for_verb()`, which also used
this method to get the scheduling group used to isolate a verb at
handler registration time. This method needs the default client idx for
each verb, but if verb registering runs under the system group it
instead got the non-default one. As a result, the per-handler isolation
was not set up for the default statement connection, and default
statement verb handlers ran in whatever scheduling group the process
loop of the rpc is running in, which is the system scheduling group.

This caused all sorts of problems, even beyond user queries running in
the system group. Also as of 0c6bbc8 queries on the replicas are
classified based on the scheduling group they are running on, so user
reads also ended up using the system concurrency semaphore.

In particular this caused severe problems with ranges scans, which in
some cases ended up using different semaphores per page resulting in a
crash. This could happen because when the page was read locally the code
would run in the statement scheduling group, but when the request
arrived from a remote coordinator via rpc, it was read in a system
scheduling group. This caused a mismatch between the semaphore the saved
reader was created with and the one the new page was read with. The
result was that in some cases when looking up a paused reader from the
wrong semaphore, a reader belonging to another read was returned,
creating a disconnect between the lifecycle of readers and that of
the slice and range they were referencing.

This series fixes the underlying problem of the scheduling group
influencing the verb handler registration, as well as adding some
additional defenses if this semaphore mismatch ever happens in the
future. Inactive read handles are now unique across all semaphores,
meaning that it is not possible anymore that a handle succeeds in
looking up a reader when used with the wrong semaphore. The range scan
algorithm now also makes sure there is no semaphore mismatch between the
one used for the current page and that of the saved reader from the
previous page.

I manually checked that each individual defense added is already
preventing the crash from happening.

Fixes: #6613
Fixes: #6907
Fixes: #6908

Tests: unit(dev), manual(run the crash reproducer, observe no crash)
"

* 'query-classification-regressions/v1' of https://github.com/denesb/scylla:
  multishard_mutation_query: use cached semaphore
  messaging: make verb handler registering independent of current scheduling group
  multishard_mutation_query: validate the semaphore of the looked-up reader
  reader_concurrency_semaphore: make inactive read handles unique across semaphores
  reader_concurrency_semaphore: add name() accessor
  reader_concurrency_semaphore: allow passing name to no-limit constructor
2020-07-27 13:56:52 +03:00
Botond Dénes
0df4c2fd3b messaging: make verb handler registering independent of current scheduling group
0c6bbc8 refactored `get_rpc_client_idx()` to select different clients
for statement verbs depending on the current scheduling group.
The goal was to allow statement verbs to be sent on different
connections depending on the current scheduling group. The new
connections use per-connection isolation. For backward compatibility the
already existing connections fall back to the per-handler isolation used
previously. The old statement connection, called the default statement
connection, also used this. `get_rpc_client_idx()` was changed to select
the default statement connection when the current scheduling group is
the statement group, and a non-default connection otherwise.

This inadvertently broke `scheduling_group_for_verb()`, which also used
this method to get the scheduling group used to isolate a verb at
handler registration time. This method needs the default client idx for
each verb, but if verb registering runs under the system group it
instead got the non-default one. As a result, the per-handler isolation
was not set up for the default statement connection, and default
statement verb handlers ran in whatever scheduling group the process
loop of the rpc is running in, which is the system scheduling group.

This caused all sorts of problems, even beyond user queries running in
the system group. Also as of 0c6bbc8 queries on the replicas are
classified based on the scheduling group they are running on, so user
reads also ended up using the system concurrency semaphore.
2020-07-27 10:11:21 +03:00
Asias He
cd7d64f588 gossip: Introduce GOSSIP_GET_ENDPOINT_STATES verb
The new verb replaces the current gossip shadow round implementation.
The current shadow round implementation reuses the gossip syn and ack
async messages, which has plenty of drawbacks. It is hard to tell
whether the syn message to a specific peer node has been responded to.
Delayed responses from the shadow round can be applied to the normal
gossip states even after the shadow round is done. The syn and ack
message handlers are full of special cases due to the shadow round. All
gossip application states, including ones that are not relevant, are
sent back. The gossip application states are applied and the gossip
listeners are called as if in normal gossip operation, even though
calling the gossip listeners in the shadow round is completely
unnecessary.

This patch introduces a new verb that requests exactly the gossip
application states the shadow round needs, using a synchronous verb, and
applies those states without calling the gossip listeners. This makes
the shadow round easier to reason about, more robust, and more
efficient.

Refs: #6845
Tests: update_cluster_layout_tests.py
2020-07-27 09:15:11 +08:00
Pavel Emelyanov
7a7b1b3108 messaging: Add missing handlers unregistration helpers
Each verb has both a register and an unregister helper, but the
unregistration helpers for some verbs are missing, so this patch adds them.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-22 16:31:57 +03:00
Rafael Ávila de Espíndola
e15c8ee667 Everywhere: Explicitly instantiate make_lw_shared
seastar::make_lw_shared has a constructor taking a T&&. There is no
such constructor in std::make_shared:

https://en.cppreference.com/w/cpp/memory/shared_ptr/make_shared

This means that we have to move from

    make_lw_shared(T(...))

to

    make_lw_shared<T>(...)

if we don't want to depend on the idiosyncrasies of
seastar::make_lw_shared.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-07-21 10:33:49 -07:00
Pavel Emelyanov
8618a02815 migration_manager: Remove db/schema_tables.hh inclusion into header
The schema_tables.hh -> migration_manager.hh couple seems to work as a
"single header for everything", creating big bloat for many seemingly
unrelated .hh's.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-17 17:54:43 +03:00
Asias He
67f6da6466 repair: Switch to btree_set for repair_hash.
In one of the longevity tests, we observed a 1.3s reactor stall that came
from repair_meta::get_full_row_hashes_source_op. It traced back to a call
to std::unordered_set::insert(), which triggered a big memory allocation
and reclaim.

I measured std::unordered_set, absl::flat_hash_set, absl::node_hash_set
and absl::btree_set. absl::btree_set was the only one that seastar's
oversized allocation checker did not warn about in my tests, where
around 300K repair hashes were inserted into the container.

- unordered_set:
hash_sets=295634, time=333029199 ns

- flat_hash_set:
hash_sets=295634, time=312484711 ns

- node_hash_set:
hash_sets=295634, time=346195835 ns

- btree_set:
hash_sets=295634, time=341379801 ns

The btree_set is a bit slower than unordered_set, but it does not make
huge memory allocations. I did not measure the real difference in total
time to finish a repair of the same dataset with unordered_set versus
btree_set.

To fix, switch to the absl btree_set container.
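A minimal sketch of the container property being relied on. The actual patch uses absl::btree_set (which allocates small fixed-size nodes); `std::set`, which allocates per element, stands in here so the example needs only the standard library, and `repair_hash` is an illustrative stand-in. Either way, no single insert ever forces one huge contiguous reallocation the way a growing hash table's rehash can:

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Illustrative stand-in for the repair hash value type.
struct repair_hash {
    uint64_t hash = 0;
    bool operator<(const repair_hash& o) const { return hash < o.hash; }
};

// Insert n distinct hashes into a node-based ordered set. Each insert
// costs one small node allocation; there is no rehash step that would
// allocate a buffer proportional to the whole container.
inline std::size_t insert_hashes(std::size_t n) {
    std::set<repair_hash> hashes;
    for (uint64_t i = 0; i < n; ++i) {
        // Odd multiplier is invertible mod 2^64, so all keys are distinct.
        hashes.insert(repair_hash{i * 0x9e3779b97f4a7c15ULL});
    }
    return hashes.size();
}
```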

Fixes #6190
2020-07-09 11:35:18 +03:00
Rafael Ávila de Espíndola
af44684418 messaging_service: Don't return variadic futures from make_sink_and_source_for_* 2020-06-29 16:50:45 -07:00
Avi Kivity
e5be3352cf database, streaming, messaging: drop streaming memtables
Before Scylla 3.0, we used to send streaming mutations using
individual RPC requests and flush them together using dedicated
streaming memtables. This mechanism is no longer in use and all
versions that use it have long reached end-of-life.

Remove this code.
2020-06-25 15:25:54 +02:00
Eliran Sinvani
14520e843a messaging service: fix execution order in messaging_service constructor
The messaging service constructor's body does two main things, in this
order:
1. It registers the CLIENT_ID verb with rpc.
2. It initializes the scheduling mechanism in charge of locating the
right scheduling group for each verb.

The registration function uses the scheduling mechanism to determine
the scheduling group for the verb.
This commit simply reverses the order of execution.
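A minimal reconstruction of the ordering bug and its fix, with hypothetical names standing in for the messaging_service internals:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch: registering a verb consults the scheduling
// lookup, so the lookup must be populated before any registration runs.
class service {
    std::map<std::string, int> _sched_for_verb;  // verb -> scheduling group
    std::map<std::string, int> _registered;      // verb -> group it got

    void init_scheduling() {
        _sched_for_verb["CLIENT_ID"] = 1;
    }
    void register_verb(const std::string& verb) {
        auto it = _sched_for_verb.find(verb);
        // Before the fix this ran first, found nothing, and registered
        // the verb with the wrong (fallback) group, here modeled as -1.
        _registered[verb] = (it == _sched_for_verb.end()) ? -1 : it->second;
    }
public:
    service() {
        init_scheduling();          // the fix: initialize first...
        register_verb("CLIENT_ID"); // ...then register
    }
    int group_of(const std::string& v) const { return _registered.at(v); }
};
```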

Fixes #6628
2020-06-11 12:14:10 +03:00
Pavel Emelyanov
67d5fad65f storage_service: Remove some inclusions of its header
GC pass over .cc files. Some really do not need it; some need it only for
features/gossiper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-01 09:08:40 +03:00
Botond Dénes
16d8cdadc9 messaging_service: introduce the tenant concept
Tenants get their own connections for statement verbs and are further
isolated from each other by different scheduling groups. A tenant is
identified by a scheduling group and a name. When selecting the client
index for a statement verb, we look up the tenant whose scheduling group
matches the current one. This scheduling group is persisted across the
RPC call, using the name to identify the tenant on the remote end, where
a reverse lookup (name -> scheduling group) happens.
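The tenant selection described above can be sketched as follows, with hypothetical types standing in for scheduling groups and the real messaging_service code:

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// A tenant is a (scheduling group, name) pair. `int` stands in for
// seastar::scheduling_group here; names and signatures are illustrative.
struct tenant {
    int scheduling_group;
    std::string name;
};

// Client side: pick the tenant whose scheduling group matches the
// current one; fall back to the first (default) tenant otherwise.
inline const tenant& select_tenant(const std::vector<tenant>& tenants,
                                   int current_group) {
    for (const auto& t : tenants) {
        if (t.scheduling_group == current_group) return t;
    }
    return tenants.front();  // default tenant
}

// Server side: reverse lookup, name -> scheduling group, for the name
// carried across the RPC call.
inline std::optional<int> group_for_name(const std::vector<tenant>& tenants,
                                         const std::string& name) {
    for (const auto& t : tenants) {
        if (t.name == name) return t.scheduling_group;
    }
    return std::nullopt;
}
```

With the two configured tenants from the commit, `$user` (default) and `$system`, a request running in an unknown scheduling group falls back to `$user`.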

Instead of a single scheduling group to be used for all statement verbs,
messaging_service::scheduling_config now contains a list of tenants. The
first among these is the default tenant, the one we use when the current
scheduling group doesn't match that of any configured tenant.
To make this mapping easier, we reshuffle the client index assignment,
such that statement and statement-ack verbs have the idx 2 and 3
respectively, instead of 0 and 3.

The tenant configuration is set at messaging service construction
time and cannot be changed afterwards. Adding such a capability should be
easy
but is not needed for query classification, the current user of the
tenant concept.

Currently two tenants are configured: $user (default tenant) and
$system.
2020-05-28 11:34:32 +03:00
Avi Kivity
db8974fef3 messaging_service: de-static-ify _scheduling_info_for_connection_index
Per-user SLA means we have connection classifications determined dynamically,
as SLAs are added or removed. This means the classification information cannot
be static.

Fix by making it a non-static vector (instead of a static array), allowing it
to be extended. The scheduling group member pointer is replaced by a scheduling
group as a member pointer won't work anymore - we won't have a member to refer
to.
2020-05-28 10:40:08 +03:00
Avi Kivity
10dd08c9b0 messaging_service: supply and interpret rpc isolation_cookies
On the client side, we supply an isolation cookie based on the connection
index. On the server side, we convert the isolation cookie back to a
scheduling_group.

This has two advantages:
 - rpc processes the entire connection using the scheduling group, so that code
   is also isolated and accounted for
 - we can later add per-user connections; the previous approach of looking at the
   verb to decide the scheduling_group doesn't help because we don't have a set of
   verbs per user

With this, the main group sees <0.1% usage under simple read and write loads.
2020-05-28 10:40:08 +03:00
Avi Kivity
dbce57fa3c messaging_service: extract connection_index -> scheduling_group translation
Move it from a function-local static to a class static variable. We will want
to extend it in two ways:
 - add more information per connection index (like the rpc isolation cookie)
 - support adding more connections for per-user SLA

As a first step, make it an array of structures and make it accessible to all
of messaging_service.
2020-05-28 10:40:08 +03:00
Asias He
c02fea5f04 repair: Ignore table removed in sync_data_using_repair
Commit 75cf255c67 (repair: Ignore keyspace
that is removed in sync_data_using_repair) is not enough to fix the
issue, because when the repair master checks whether the table is
dropped, the table might not have been dropped yet on the repair master.

To fix, the repair master should check whether the follower failed the
repair because the table was dropped, by checking the error returned
from the follower.
With this patch, we would see

WARN  2020-04-14 11:19:00,417 [shard 0] repair - repair id 1 on shard 0
completed successfully, keyspace=ks, ignoring dropped tables={cf}

when the table is dropped during bootstrap.

Tests: update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_new_node_while_schema_changes_test

Fixes: #5942
2020-05-24 13:39:59 +03:00
Calle Wilund
08d069f78d messaging_service: Use reloadable TLS certificates
Changes the messaging service rpc to use reloadable tls
certificates iff tls is enabled.

Note that this means that the service cannot start
listening at construction time if TLS is active,
and users need to call start_listen_ex to initialize
and actually start the service.

Since the "normal" messaging service is actually started
from gms, this route too is made a continuation.
2020-05-04 11:32:21 +00:00
Botond Dénes
7dabf75682 service: messaging_service: resolve rpc set_logger deprecation warning
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200407091413.310764-1-bdenes@scylladb.com>
2020-04-22 10:05:35 +03:00
Asias He
13a9c5eaf7 repair: Send reason for node operations
Since 956b092012 (Merge "Repair based node
operation" from Asias), repair is used by other node operations like
bootstrap, decommission and so on.

Send the reason for the repair, so that we can handle the materialized
view update correctly according to the reason for the operation. We want
to trigger the view update only if the repair is used by a repair
operation. Otherwise, the view table would be handled twice: 1) when the
view table is synced using repair, and 2) when the base table is synced
using repair and a view table update is triggered.

Fixes #5930
Fixes #5998
2020-04-13 13:47:26 +03:00
Avi Kivity
88ade3110f treewide: replace calls to engine().some_api() with some_api()
This removes the need to include reactor.hh, a source of compile
time bloat.

In some places, the call is qualified with seastar:: in order
to resolve ambiguities with a local name.

Includes are adjusted to make everything compile. We end up
having 14 translation units including reactor.hh, primarily for
deprecated things like reactor::at_exit().

Ref #1
2020-04-05 12:46:04 +03:00
Gleb Natapov
8a408ac5a8 lwt: remove entries from system.paxos table after successful learn stage
The learning stage of the PAXOS protocol leaves behind an entry in the
system.paxos table with the last learned value (which can be large). In
case not all participants learned it successfully, the next round on the
same key may complete the learning using this info. But if all nodes
learned the value, the entry no longer serves any useful purpose.

The patch adds another round, "prune", which is executed in the
background (limited to 1000 simultaneous instances) and removes the
entry when all nodes replied successfully to the "learn" round. It uses
the ballot's timestamp to do the deletion, so as not to interfere with
the next round. Since the deletion happens very close to the previous
writes, it will likely happen in the memtable and never reach an
sstable, which reduces memtable flush and compaction overhead.

Fixes #5779

Message-Id: <20200330154853.GA31074@scylladb.com>
2020-03-30 21:02:14 +03:00
Rafael Ávila de Espíndola
c5795e8199 everywhere: Replace engine().cpu_id() with this_shard_id()
This is a bit simpler and might allow removing a few includes of
reactor.hh.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200326194656.74041-1-espindola@scylladb.com>
2020-03-27 11:40:03 +03:00
Gleb Natapov
5753ab7195 lwt: drop invoke_on in paxos_state prepare and accept
Since lwt requests now run on an owning shard, there is no longer a
need for a cross-shard call at the paxos_state level. RPC calls may
still arrive at the wrong shard, so we need to make the cross-shard call
there.
2020-01-13 10:26:02 +02:00
Benny Halevy
9ec98324ed messaging_service: unregister_handler: return rpc unregister_handler future
Now that seastar returns it.

Fixes https://github.com/scylladb/scylla/issues/5228

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191212143214.99328-1-bhalevy@scylladb.com>
2019-12-12 16:38:36 +02:00
Benny Halevy
105c8ef5a9 messaging_service: wait on unregister_handler
Prepare for returning future<> from seastar rpc
unregister_handler.

Refs https://github.com/scylladb/scylla/issues/5228

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191208153924.1953-1-bhalevy@scylladb.com>
2019-12-11 14:17:41 +02:00
Piotr Dulikowski
adfa7d7b8d messaging_service: don't move unsigned values in handlers
Performing std::move on integral types is pointless. This commit gets
rid of moves of values of `unsigned` type in rpc handlers.
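A minimal illustration of why the moves were dropped; `forward_shard_*` are hypothetical names, not functions from the codebase:

```cpp
#include <cstdint>
#include <utility>

// For an integral type, std::move is just a cast to an rvalue
// reference; the subsequent "move" is an ordinary copy of the value,
// so writing it adds noise without changing what the compiler emits.
inline uint32_t forward_shard_copy(uint32_t shard) {
    return shard;
}
inline uint32_t forward_shard_moved(uint32_t shard) {
    return std::move(shard);  // behaves identically to the line above
}
```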
2019-12-05 00:58:31 +01:00
Piotr Dulikowski
2e802ca650 hh: add HINT_MUTATION verb
Introduce a new verb dedicated for receiving and sending hints:
HINT_MUTATION. It is handled on the streaming connection, which is
separate from the one used for handling mutations sent by coordinator
during a write.

The intent of using a separate connection is to increase fariness while
handling hints and user requests - this way, a situation can be avoided
in which one type of requests saturate the connection, negatively
impacting the other one.
2019-12-05 00:51:49 +01:00
Vladimir Davydov
bf5f864d80 paxos: piggyback result query on prepare response
Current LWT implementation uses at least three network round trips:
 - first, execute PAXOS prepare phase
 - second, query the current value of the updated key
 - third, propose the change to participating replicas

(there's also learn phase, but we don't wait for it to complete).

The idea behind the optimization implemented by this patch is simple:
piggyback the current value of the updated key on the prepare response
to eliminate one round trip.

To generate less network traffic, only the closest to the coordinator
replica sends data while other participating replicas send digests which
are used to check data consistency.

Note, this patch changes the API of some RPC calls used by PAXOS, but
this should be okay as long as the feature is in the early development
stage and marked experimental.

To assess the impact of this optimization on LWT performance, I ran a
simple benchmark that starts a number of concurrent clients each of
which updates its own key (uncontended case) stored in a cluster of
three AWS i3.2xlarge nodes located in the same region (us-west-1) and
measures the aggregate bandwidth and latency. The test uses shard-aware
gocql driver. Here are the results:

                latency 99% (ms)    bandwidth (rq/s)    timeouts (rq/s)
    clients     before  after       before  after       before  after
          1          2      2          626    637            0      0
          5          4      3         2616   2843            0      0
         10          3      3         4493   4767            0      0
         50          7      7        10567  10833            0      0
        100         15     15        12265  12934            0      0
        200         48     30        13593  14317            0      0
        400        185     60        14796  15549            0      0
        600        290     94        14416  15669            0      0
        800        568    118        14077  15820            2      0
       1000        710    118        13088  15830            9      0
       2000       1388    232        13342  15658           85      0
       3000       1110    363        13282  15422          233      0
       4000       1735    454        13387  15385          329      0

That is, this optimization improves max LWT bandwidth by about 15%
and allows running 3-4x more clients while maintaining the same level
of system responsiveness.
2019-11-24 11:35:29 +02:00
Vladimir Davydov
3d1d4b018f paxos: remove unnecessary move constructor invocations
invoke_on() guarantees that captured objects won't be destroyed until
the future returned by the invoked function is resolved, so there's no
need to move the key, token, and proposal when calling the
paxos_state::*_impl helpers.
2019-11-24 11:35:29 +02:00
Gleb Natapov
8d6201a23b lwt: Add RPC verbs needed for paxos implementation
The Paxos protocol has three stages: prepare, accept, learn. This patch
adds an rpc verb for each of those stages. To be term-compatible with
Cassandra, the patch calls those stages prepare, propose, and commit.
2019-10-27 23:21:51 +03:00
Avi Kivity
ba64ec78cf messaging_service: use rpc::tuple instead of variadic futures for rpc
Since variadic future<> is deprecated, switch to rpc::tuple for multiple
return values in rpc calls. This is more or less mechanical translation.
2019-09-26 12:09:31 +02:00