scylladb

Author	SHA1	Message	Date
Avi Kivity	dbe347811c	Merge "materialized views: Apply backpressure from view replicas" from Duarte " As the amount of pending view updates increases we know that there’s a mismatch between the rate at which the base receives writes and the rate at which the view retires them. We react by applying backpressure to decrease the rate of incoming base writes, allowing the slow view replicas to catch up. We want to delay the client’s next writes to a base replica and we use the base’s backlog of view updates to derive this delay. To validate this approach we tested a 3-node Scylla cluster on GCE, using n1-standard-4 instances with NVMEs. A loader running on an n1-standard-8 instance ran cassandra-stress with 100 threads. With the delay function d(x) set to 1s, we see no base write timeouts. With the delay function as defined in the series, we see that backlogs stabilize at some (arbitrary) point, as predicted, but this stabilization co-exists with base write timeouts. However, the system overall behaves better than the current version, with the 100 view update limit, and also better than the version without such limit or any backpressure. More work is necessary to further stabilize the system. Namely, we want to keep delaying until we see the backlog is decreasing. This will require us to add more delay beyond the stabilization point, which in turn should minimize the base write timeouts, and will also minimize the amount of memory the backlog takes at each base replica. 
Design document: https://docs.google.com/document/d/1J6GeLBvN8_c3SbLVp8YsOXHcLc9nOLlRY7pC6MH3JWo Fixes #2538 " Reviewed-by: Nadav Har'El <nyh@scylladb.com> * 'materialized-views/backpressure/v2' of https://github.com/duarten/scylla: (32 commits) service/storage_proxy: Release mutation as early as possible service/storage_proxy: Delay replica writes based on view update backlog service/storage_proxy: Get the backlog of a particular base replica service/storage_proxy: Add counters for delayed base writes main: Start and stop the view_update_backlog_broker service: Distribute a node's view update backlog service: Advertise view update backlog over gossip service/storage_proxy: Send view update backlog from replicas service/storage_proxy: Prepare to receive replica view update backlog service/storage_proxy: Expose local view update backlog tests/view_schema_test: Add simple test for db::view::node_update_backlog db/view: Introduce node_update_backlog class db/hints: Initialize current backlog database: Add counter for current view backlog database: Expose current memory view update backlog idl: Add db::view::update_backlog db/view: Add view_update_backlog database: Wait on view update semaphore for view building service/storage_proxy: Use near-infinite timeouts for view updates database: generate_and_propagate_view_updates no longer needs a timeout ... (cherry picked from commit `b66f59aa3d`)	2018-12-20 19:11:56 +02:00
Asias He	10cf97375e	streaming: Expose reason for streaming On receiving a mutation_fragment or a mutation triggered by a streaming operation, we pass an enum stream_reason to notify the receiver what the streaming is used for. So the receiver can decide further operation, e.g., send view updates, beyond applying the streaming data on disk. Fixes #3276 Message-Id: <f15ebcdee25e87a033dcdd066770114a499881c0.1539498866.git.asias@scylladb.com> (cherry picked from commit `7f826d3343`)	2018-11-15 17:45:31 +02:00
Calle Wilund	76ff2e5c3d	messaging_service: Make rpc streaming sink respect tls connection Fixes #3787 Message service streaming sink was created using direct call to rpc::client::make_sink. This in turn needs a new socket, which it creates completely ignoring what underlying transport is active for the client in question. Fix by retaining the tls credential pointer in the client wrapper, and using this in a sink method to determine whether to create a new tls socket, or just go ahead with a plain one. Message-Id: <20181010003249.30526-1-calle@scylladb.com> (cherry picked from commit `3cb50c861d`)	2018-10-23 07:36:21 +00:00
Avi Kivity	4553238653	messaging: fix unbounded allocation in TLS RPC server The non-TLS RPC server has an rpc::resource_limits configuration that limits its memory consumption, but the TLS server does not. That means a many-node TLS configuration can OOM if all nodes gang up on a single replica. Fix by passing the limits to the TLS server too. Fixes #3757. Message-Id: <20180907192607.19802-1-avi@scylladb.com>	2018-09-10 12:11:16 +01:00
Duarte Nunes	a025bf6a7d	Merge seastar upstream Seastar introduced a "compat" namespace, which conflicts with Scylla's own "compat" namespaces. The merge thus includes changes to scope uses of Scylla's "compat" namespaces. * seastar 8ad870f...9bb1611 (5): > util/variant_utils: Ensure variant_cast behaves well with rvalues > util/std-compat: Fix infinite recursion > doc/tutorial: Undo namespace changes > util/variant_utils: Add cast_variant() > Add compatbility with C++17's library types Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-08-14 13:07:09 +01:00
Avi Kivity	c4013f6fe1	messaging: categorize more streaming/repair verbs as streaming Since the messaging service will assign a scheduling group based on the client index, it's more important now to get the verbs categorized correctly. Re-categorize REPLICATION_FINISHED, REPAIR_CHECKSUM_RANGE, and most importantly STREAM_MUTATION_FRAGMENTS to the repair/streaming oriented connections so we get the correct scheduling.	2018-07-15 15:44:10 +03:00
Avi Kivity	ff3d7839ab	messaging: remove default when computing rpc client index A default means that when adding new verbs, we may forget to categorize a verb correctly. Without the default, the compiler will complain due to -Wswitch.	2018-07-15 15:40:29 +03:00
Avi Kivity	fe2db68be8	messaging: convert do_get_rpc_client_idx into a switch A switch is more readable for multiple choice with no clearly preferred choice.	2018-07-15 15:26:50 +03:00
Avi Kivity	3b1e04091c	messaging: choose connection index via a look-up table Looking up is faster than a bunch of if()s.	2018-07-15 15:21:06 +03:00
Avi Kivity	8ee807321f	Merge "scylla streaming with rpc streaming" from Asias " This work is on top of Gleb's rpc streaming which was merged recently. What this series does is to replace scylla streaming service's data plane to use the new rpc streaming instead of the old rpc verb to send the mutations for scylla streaming. Other parts of scylla streaming, the control plane, are not changed. In my test, to bootstrap a new node to the existing one node cluster, smp 2, scylla stores data on ramdisk to minimize disk io impact. I saw a 2x improvement in streaming bandwidth. Before: [shard 0] stream_session - [Stream #2ae92320-5fc8-11e8-911a-000000000000] Streaming plan for Bootstrap-ks3-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1570312 KiB, 109521.02 KiB/s [shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks3 succeeded, took 14.338 seconds After: [shard 0] stream_session - [Stream #e5589ac0-5fc7-11e8-b463-000000000000] Streaming plan for Bootstrap-ks3-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1546875 KiB, 220415.36 KiB/s [shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks3 succeeded, took 7.018 seconds Tests: dtest update_cluster_layout_tests.py Fixes: #3591 " * tag 'asias/scylla_streaming_with_rpc_streaming_v8' of github.com:scylladb/seastar-dev: streaming: Add rpc streaming support storage_service: Introduce STREAM_WITH_RPC_STREAM feature streaming: Add estimate_partitions to send_info messaging_service: Add streaming with rpc streaming support messaging_service: Add streaming_domain database: Add add_sstable_and_update_cache database: Add make_streaming_sstable_for_write	2018-07-15 12:36:52 +03:00
Avi Kivity	8c993e0728	messaging: tag RPC services with scheduling groups Assign a scheduling_group for each RPC service. Assignment is done by connection (get_rpc_client_idx()) - all verbs on the same connection are assigned the same group. While this may seem arbitrary, it avoids priority inversion; if two verbs on the same connection have different scheduling groups, the verb with the low shares may cause a backlog and stall the connection, including following requests from verbs that ought to have higher shares. The scheduling_group parameters are encapsulated in different classes as they are passed around to avoid adding dependencies. Message-Id: <20180708140433.6426-1-avi@scylladb.com>	2018-07-13 13:57:08 +02:00
Asias He	ddfb4590ce	messaging_service: Add streaming with rpc streaming support Preparation for adding rpc streaming in scylla streaming. - register_stream_mutation_fragments is used to register the rpc streaming verb - make_sink_and_source_for_stream_mutation_fragments is used to get the sink and source object for the sender - make_sink_for_stream_mutation_fragments is used to get a sink object for the receiver	2018-07-13 08:36:46 +08:00
Asias He	671e1b08fe	messaging_service: Add streaming_domain The rpc streaming needs a streaming_domain id for the same logical server. Choose one for our messaging service.	2018-07-13 08:36:46 +08:00
Gleb Natapov	646e400918	Provide available memory size to messaging_service object during creation	2018-06-11 15:34:13 +03:00
Avi Kivity	dd12214628	messaging_service: move msg_addr into its own header file Make it possible to use msg_addr without depending on messaging_service.hh.	2018-03-12 20:05:23 +02:00
Avi Kivity	cd668061fc	storage_service: remove system_keyspace.hh include Re-distribute include among the files that really need it.	2018-03-11 18:53:49 +02:00
Duarte Nunes	440ea56010	message/messaging_service: Specify algorithm when requesting digest While not strictly needed, specify which algorithm to use when requesting a digest from a remote node. This is more flexible than relying on a cluster wide feature, although that's what we'll do in subsequent patches. It also makes the verb more consistent with the data request. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-02-01 01:02:50 +00:00
Glauber Costa	08a0c3714c	allow request-specific read timeouts in storage proxy reads Timeouts are a global property. However, for tables in keyspaces like the system keyspace, we don't want to uphold that timeout--in fact, we want no timeout there at all. We already apply such configuration for requests waiting in the queued sstable queue: system keyspace requests won't be removed. However, the storage proxy will insert its own timeouts in those requests, causing them to fail. This patch changes the storage proxy read layer so that the timeout is applied based on the column family configuration, which is in turn inherited from the keyspace configuration. This matches our usual way of passing db parameters down. In terms of implementation, we can either move the timeout inside the abstract read executor or keep it external. The former is a bit cleaner, the latter has the nice property that all executors generated will share the exact same timeout point. In this patch, we chose the latter. We are also careful to propagate the timeout information to the replica. So even if we are talking about the local replica, when we add the request to the concurrency queue, we will do it in accordance with the timeout specified by the storage proxy layer. After this patch, Scylla is able to start just fine with very low timeouts--since read timeouts in the system keyspace are now ignored. Fixes #2462 Implementation notes, and general comments about open discussion in 2462: * Because we are not bypassing the timeout, just setting it high enough, I consider the concerns about the batchlog moot: if we fail for any other reason that will be propagated. Last case, because the timeout is per-CF, we could do what we do for the dirty memory manager and move the batchlog alone to use a different timeout setting. * Storage proxy likes specifying its timeouts as a time_point, whereas when we get low enough as to deal with the read_concurrency_config, we are talking about deltas. 
So at some point we need to convert time_points to durations. We do that in the database query functions. v2: - use per-request instead of per-table timeouts. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-12 07:43:21 -05:00
Vlad Zolotarov	be6f8be9cb	messaging_service: fix multi-NIC support Don't enforce the outgoing connections from the 'listen_address' interface only. If 'local_address' is given to connect() it will enforce it to use a particular interface to connect from, even if the destination address should be accessed from a different interface. If we don't specify the 'local_address' the source interface will be chosen according to the routing configuration. Fixes #3066 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1513372688-21595-1-git-send-email-vladz@scylladb.com>	2017-12-17 10:51:20 +02:00
Gleb Natapov	16964de1f3	storage_proxy: fail read/write requests early if it cannot be completed due to errors If errors make reaching CL impossible a request can be aborted earlier without waiting for timeout.	2017-12-05 16:46:25 +02:00
Duarte Nunes	1fbe9dc851	message/messaging_service: Close all server sockets We were stopping the loop prematurely. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20171127181417.8167-1-duarte@scylladb.com>	2017-11-28 11:08:08 +02:00
Asias He	8fa35d6ddf	messaging_service: Get rid of timeout and retry logic for streaming verb With the "Use range_streamer everywhere" (`7217b7ab36`) series, all users of streaming now stream relatively small ranges and can retry streaming at a higher level. There are problems with timeout and retry at RPC verb level in streaming: 1) Timeout can be false negative. 2) We cannot cancel the send operations which are already called. When user aborts the streaming, the retry logic keeps running for a long time. This patch removes all the timeout and retry logic for streaming verbs. After this, the timeout is the job of TCP, the retry is the job of the upper layer. Message-Id: <df20303c1fa728dcfdf06430417cf2bd7a843b00.1503994267.git.asias@scylladb.com>	2017-08-29 17:20:00 +03:00
Avi Kivity	3edec66903	Revert "repair: Make send_repair_checksum_range timeout" This reverts commit `98757069a5`. We have the failure detector which will detect an unresponsive node and fail the RPC. Adding a timeout can just introduce false positives.	2017-08-06 13:09:36 +03:00
Asias He	98757069a5	repair: Make send_repair_checksum_range timeout If the verb never returns, the repair will hang forever. Make it use the timeout version of send_message. Fixes #2662	2017-08-02 21:41:50 +08:00
Duarte Nunes	85e85ec72e	Don't catch polymorphic exceptions by value It makes gcc a very sad compiler. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170726172053.5639-2-duarte@scylladb.com>	2017-07-27 09:39:58 +03:00
Asias He	0ba4e73068	streaming: Introduce the failed parameter for complete message Use this flag to notify the peer that the session has failed so that the peer can close the failed session more quickly. The flag is used as a rpc::optional so it is compatible with the old version of the verb.	2017-07-18 11:24:31 +08:00
Tomasz Grabiec	07ed512060	migration_manager: Give empty response to schema pulls from incompatible nodes The old nodes which are still using v2 schema tables will fail to apply our response, with error messages complaining about not being able to locate schema of certain versions (new schema tables). This change inhibits such errors by responding with an empty mutation list.	2017-07-07 19:09:57 +02:00
Avi Kivity	c4ae2206c7	messaging: respect inter_dc_tcp_nodelay configuration parameter We respect it partially (client side only) for now. Fixes #6. Message-Id: <20170623172048.23103-1-avi@scylladb.com>	2017-06-24 21:49:27 +02:00
Gleb Natapov	23c51b3e57	messaging_service: connection drop notifier Allow registering callbacks that will be called when connection is going down.	2017-06-13 09:57:14 +03:00
Gleb Natapov	69c5526301	messaging_service: return cache hit ratio as part of data read	2017-06-13 09:57:14 +03:00
Avi Kivity	ebaeefa02b	Merge seastar upstream (seastar namespace) - introduced "seastarx.hh" header, which does a "using namespace seastar"; - 'net' namespace conflicts with seastar::net, renamed to 'netw'. - 'transport' namespace conflicts with seastar::transport, renamed to cql_transport. - "logger" global variables now conflict with logger global type, renamed to xlogger. - other minor changes	2017-05-21 12:26:15 +03:00
Calle Wilund	d5f57bd047	messaging_service: Move log printout to actual listen start Fixes #1845 Log printout was before we actually had evaluated endpoint to create, thus never included SSL info. Message-Id: <1487766738-27797-1-git-send-email-calle@scylladb.com>	2017-02-22 17:08:21 +01:00
Paweł Dziepak	bf60b7844b	messaging_service: add COUNTER_MUTATION verb This verb is going to be used for coordinator<->leader communication during counter updates.	2017-02-02 10:35:14 +00:00
Amnon Heiman	45b6070832	Merge seastar upstream * seastar 397685c...c1dbd89 (13): > lowres_clock: drop cache-line alignment for _timer > net/packet: add missing include > Merge "Adding histogram and description support" from Amnon > reactor: Fix the error: cannot bind 'std::unique_ptr' lvalue to 'std::unique_ptr&&' > Set the option '--server' of tests/tcp_sctp_client to be required > core/memory: Remove superfluous assignment > core/memory: Remove dead code > core/reactor: Use logger instead of cerr > fix inverted logic in overprovision parameter > rpc: fix timeout checking condition > rpc: use lowres_clock instead of high resolution one > semaphore: make semaphore's clock configurable > rpc: detect timedout outgoing packets earlier Includes treewide change to accommodate rpc changing its timeout clock to lowres_clock. Includes fixup from Amnon: collectd api should use the metrics getters As part of the preparation for the change in the metrics layer, this changes the way the collectd api uses the metrics value to use the getters instead of calling the member directly. This will be important when the internal implementation changes from union to variant. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1485457657-17634-1-git-send-email-amnon@scylladb.com>	2017-02-01 14:39:08 +02:00
Paweł Dziepak	1a52569f7d	storage_proxy: pass maximum result size to replicas We may want to change the default individual result size limit in the future. If it is provided by the coordinator and not hardcoded in the replicas this can be done without causing data query digest mismatches or wasteful mutation query results.	2016-12-22 17:16:23 +01:00
Gleb Natapov	0a2dd39c75	messaging_service: move MUTATION_DONE messages to separate connection If a node gets more MUTATION requests than it can handle via RPC it will stop reading from this RPC connection, but this will prevent it from getting MUTATION_DONE responses for requests it coordinates because currently MUTATION and MUTATION_DONE messages share the same connection. To solve this problem this patch moves MUTATION_DONE messages to a separate connection. Fixes: #1843 Message-Id: <20161201155942.GC11581@scylladb.com>	2016-12-21 11:10:15 +02:00
Asias He	937f28d2f1	Convert to use dht::partition_range_vector and dht::token_range_vector	2016-12-19 14:08:50 +08:00
Asias He	e5485f3ea6	Get rid of query::partition_range Use dht::partition_range instead	2016-12-19 08:09:25 +08:00
Asias He	85034c1b57	Convert to use dht::partition_range	2016-12-19 08:04:30 +08:00
Asias He	d1178fa299	Convert to use dht::token_range	2016-12-19 08:04:29 +08:00
Avi Kivity	18078bea9b	storage_proxy: avoid calculating digest when only one replica is contacted If we're talking to just one replica, the digest is not going to be used, so better not to calculate it at all. The optimization helps with LOCAL_ONE queries where the result is large, but does not contain large blobs (many small rows). This patch adds a digest_algorithm parameter to the READ_DATA verb that can take on two values: none and MD5 (default), and sets it to none when we're reading from one replica. In the future we may add other values for more hardware-friendly digest algorithms. Message-Id: <1479380600-19206-1-git-send-email-avi@scylladb.com>	2016-11-17 13:04:30 +02:00
Avi Kivity	a35136533d	Convert ring_position and token ranges to be nonwrapping Wrapping ranges are a pain, so we are moving wrap handling to the edges. Since cql can't generate wrapping ranges, this means thrift and the ring maintenance code; also range->ring transformations need to merge the first and last ranges. Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>	2016-11-02 21:04:11 +02:00
Avi Kivity	c94fb1bf12	build: reduce inclusions of messaging_service.hh Remove inclusions from header files (primary offender is fb_utilities.hh) and introduce new messaging_service_fwd.hh to reduce rebuilds when the messaging service changes. Message-Id: <1475584615-22836-1-git-send-email-avi@scylladb.com>	2016-10-05 11:46:49 +03:00
Gleb Natapov	c95df8f053	messaging_service: use correct value for listen_to_bc_address in a constructor used by tests Also make sure to not listen on the same exact address twice in case listen_address == broadcast_address. Scylla configuration code does not allow such a thing to be configured, but better to be safe. Message-Id: <20160927102316.GO32178@scylladb.com>	2016-09-27 11:27:23 +01:00
Gleb Natapov	26ae8e8365	implement listen_on_broadcast_address option When using multiple physical network interfaces, set this to true to listen on broadcast_address in addition to the listen_address, allowing nodes to communicate in both interfaces. Ignore this property if the network configuration automatically routes between the public and private networks such as EC2. Message-Id: <20160921094810.GA28654@scylladb.com>	2016-09-26 08:49:54 +03:00
Gleb Natapov	a2cdddb795	storage_proxy: forward mutation write with correct timeout value Now that mutation handler knows how much time is left for mutation write to be handled it can use this knowledge to set correct timeout for forwarded mutations. Message-Id: <20160828080637.GE9243@scylladb.com>	2016-08-29 13:06:36 +03:00
Vlad Zolotarov	4c16df9e4c	service: instrument MUTATE flow with tracing Store the trace state in the abstract_write_response_handler. Instrument send_mutation RPC to receive an additional rpc::optional parameter that will contain optional<trace_info> value. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:58 +03:00
Paweł Dziepak	7e06499458	repair: convert hashing to streamed_mutations This patch makes hashing for repair calculate checksums in a way that doesn't require rebuilding whole mutation. Unfortunately, such checksums are incompatible with the old ones so the old way for computing checksums is preserved for compatibility reasons. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-13 09:51:23 +01:00
Gleb Natapov	726b79ea91	messaging_service: enable internode_compression option Use LZ4 for internode compression if enabled. Message-Id: <20160711141734.GZ18455@scylladb.com>	2016-07-11 18:30:21 +03:00
Paweł Dziepak	32a5de7a1f	db: handle receiving fragmented mutations If mutations are fragmented during streaming, special care must be taken so that isolation guarantees are not broken. Mutations received with flag "fragmented" set are applied to a memtable that is used only by that particular streaming task and the sstables created by flushing such memtables are not made visible until the task is complete. Also, in case the streaming fails, all data is dropped. This means that fragmented mutations cannot benefit from coalescing of writes from multiple streaming plans, hence separate way of handling them so that there is no loss of performance for small partitions. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:18:35 +01:00
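
The isolation rule described in the last entry (`32a5de7a1f`) can be illustrated with a toy model. This is only a sketch under stated assumptions, not Scylla's implementation: the `Table` class, its fields, and the method names are all hypothetical, and plain Python lists stand in for memtables and sstables.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Hypothetical stand-in for a table; only `visible` data is readable."""
    visible: list = field(default_factory=list)          # published data
    shared_memtable: list = field(default_factory=list)  # whole mutations, any plan
    per_plan: dict = field(default_factory=dict)         # plan_id -> private staging

    def apply_streaming_mutation(self, plan_id, mutation, fragmented):
        if fragmented:
            # Fragments stay in a memtable private to their streaming plan,
            # so readers never observe a partially received partition.
            self.per_plan.setdefault(plan_id, []).append(mutation)
        else:
            # Whole mutations from all plans are coalesced in one memtable.
            self.shared_memtable.append(mutation)

    def flush_shared(self):
        # Whole mutations become readable as soon as they are flushed.
        self.visible.extend(self.shared_memtable)
        self.shared_memtable.clear()

    def finish(self, plan_id, success):
        # Publish the staged fragments only on success; drop them on failure.
        staged = self.per_plan.pop(plan_id, [])
        if success:
            self.visible.extend(staged)
```

In this model a failed plan leaves no trace in `visible`, while small, unfragmented mutations keep the cheaper shared path, which mirrors the trade-off the commit message describes.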

245 Commits