Right now, storage_proxy's mutate_stage violates isolation by running
in a plain execution_stage without a scheduling_group. This means do_mutate()
will run under the main scheduling_group, at least until we reach the database
apply execution stage, which does run under the correct scheduling group.
Fix by moving to an inheriting execution stage; this works because the
messaging service will tell RPC to set the correct execution stage for us. We
could explicitly specify statement_scheduling_group, but inheriting the
scheduling group allows us to have multiple statement scheduling groups later.
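A minimal sketch of the difference, assuming seastar's
inheriting_concrete_execution_stage; the real mutate_stage arguments in
storage_proxy differ and the stand-in callable below is hypothetical:

    #include <seastar/core/execution_stage.hh>
    #include <seastar/core/future.hh>

    // A plain concrete_execution_stage runs all queued work under the scheduling
    // group it was created in; an inheriting stage keeps one instance per
    // scheduling group, so work queued from the statement scheduling group also
    // runs under that group.
    static thread_local seastar::inheriting_concrete_execution_stage<seastar::future<>, int>
            mutate_stage("storage_proxy_mutate", [] (int arg) {
        return seastar::make_ready_future<>(); // hypothetical stand-in for do_mutate()
    });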
After ac27d1c93b, if a read executor has just enough targets to
achieve the request's CL and a connection to one of them is dropped
during execution, a ReadFailed error is returned immediately and the
client does not get a chance to issue a speculative read (retry). This
patch changes the code to not return the ReadFailed error immediately,
but to wait for the timeout instead, giving the client a chance to issue
a speculative read in case the read executor does not have additional
targets to send speculative reads to by itself.
Fixes #3699.
Message-Id: <20180819131646.GK2326@scylladb.com>
std::random_device() uses the relatively slow /dev/urandom, and we rarely if
ever intend to use it directly - we normally want to use it to seed a faster
random_engine (a pseudo-random number generator).
In many places in the code, we first created a random_device variable, and then
used it to create a random_engine variable. However, this practice created the
risk of a programmer accidentally using the random_device object instead of the
random_engine object, because both have the same API; this hurts performance.
This risk materialized in just two places in the code, utils/uuid.cc and
gms/gossiper.cc. A patch for uuid.cc was sent previously by Pawel and is
not included in this patch, and the fix for gossiper.{cc,hh} is included here.
To avoid risking the same mistake in the future, this patch switches across the
code to an idiom where the random_device object is not *named*, so it cannot be
accidentally used. We use the following idiom:
std::default_random_engine _engine{std::random_device{}()};
Here std::random_device{}() creates the random device (/dev/urandom) and pulls
a random integer from it. It then uses this seed to create the random_engine
(the pseudo-random number generator). The std::random_device{} object is
temporary and unnamed, and cannot be unintentionally used directly.
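For illustration, here is the old, risky pattern next to the new idiom
(standalone example, not code from this patch):

    #include <random>

    // Old pattern: the named random_device can be used by mistake, because it
    // has the same call operator as the engine:
    //     std::random_device rd;
    //     std::default_random_engine engine(rd());
    //     auto x = rd() % 6;   // oops - reads /dev/urandom on every call

    // New idiom: the random_device is an unnamed temporary, used only to seed:
    std::default_random_engine engine{std::random_device{}()};

    int roll_die() {
        return std::uniform_int_distribution<int>(1, 6)(engine);
    }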
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180726154958.4405-1-nyh@scylladb.com>
Count operations which were started on one shard and
were performed on another, due to a non-shard-aware driver
and/or RPC.
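A self-contained illustration of the bookkeeping (the names are made up; this
is not the actual Scylla counter):

    #include <cstdint>

    struct stats {
        uint64_t ops_started_on_wrong_shard = 0;
    };

    // handling_shard is the shard that received the request; owning_shard is
    // the shard that owns the data (computed from the token).
    void account_cross_shard(stats& s, unsigned handling_shard, unsigned owning_shard) {
        if (handling_shard != owning_shard) {
            // A non-shard-aware driver or inter-node RPC delivered the operation
            // to the wrong shard, so it must be forwarded and is counted here.
            ++s.ops_started_on_wrong_shard;
        }
    }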
Message-Id: <20180723155118.8545-1-avi@scylladb.com>
`query_partition_key_range()` does the final result merging and trimming
(if necessary) to make sure we don't send more rows to the client than
requested. This merging and trimming is done by a continuation attached
to `query_partition_key_range_concurrent()`, which does the actual
querying. The continuation captures by value the `row_limit` and
`partition_limit` fields of the `query::read_command` object of the
query. This has an unexpected consequence. The lambda object is
constructed after the call to `query_partition_key_range_concurrent()`
returns. If this call doesn't defer, any modifications made to the read
command object by `query_partition_key_range_concurrent()` will already
be visible to the lambda. This is undesirable because
`query_partition_key_range_concurrent()` updates the read command object
directly as the vnodes are traversed, which in turn results in the
lambda doing the final trimming according to an already-decremented
`row_limit`. This causes the paging logic to declare the query exhausted
prematurely because the page will not be full.
To avoid all this, make a copy of the relevant limit fields before
`query_partition_key_range_concurrent()` is called and pass these copies
to the continuation, thus ensuring that the final trimming is done
according to the original page limits.
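A self-contained illustration of the capture-timing pitfall and the fix
(simplified; not the actual storage_proxy code):

    #include <cstdio>

    struct read_command { int row_limit; };

    // Stands in for query_partition_key_range_concurrent(): decrements the
    // limit in place as results are accumulated.
    void run_query(read_command& cmd) { cmd.row_limit -= 7; }

    int main() {
        read_command cmd{100};
        run_query(cmd);              // runs before the lambda is constructed
        auto buggy = [limit = cmd.row_limit] { std::printf("trim to %d\n", limit); };
        buggy();                     // trims to 93 - the page looks short

        read_command cmd2{100};
        int row_limit = cmd2.row_limit;  // copy the limit up front
        run_query(cmd2);
        auto fixed = [row_limit] { std::printf("trim to %d\n", row_limit); };
        fixed();                     // trims to 100, as originally requested
    }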
Spotted while investigating a dtest failure on my 1865/range-scans/v2
branch. On that branch the way range scans are executed on replicas is
completely refactored. These changes apparently reduce the number of
continuations in the read path to the point where an entire page can be
filled without deferring, thus causing the problem to surface.
Fixes #3605.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f11e80a6bf8089d49ba3c112b25a69edf1a92231.1531743940.git.bdenes@scylladb.com>
Currently rpc::closed_error is not counted towards replica failure
during a read, and thus the read operation waits for the timeout even
if one of the nodes dies. Fix this by counting rpc::closed_error towards
failed attempts.
Fixes #3590.
Message-Id: <20180708123522.GC28899@scylladb.com>
Require a timeout parameter for storage_proxy::mutate_begin() and
all its callers (all the way to thrift and cql modification_statement
and batch_statement).
This should fix spurious debug-mode test failures, where overcommit
and general debug slowness result in the default timeouts being
exceeded. Since the tests use infinite timeouts, they should not
time out any more.
Tests: unit (release), with an extra patch that aborts
when a non-infinite timeout is detected.
Message-Id: <20180707204424.17116-1-avi@scylladb.com>
Initializing the write response id to the same value on each reboot may
cause a stale id to be taken for an active one if a node restarts after
sending only a couple of write requests and before receiving replies.
On the next reboot it will start assigning ids from the same value, and
receiving old replies will confuse it. Mitigate this by setting the
initial id to the wall clock value in milliseconds. This will not solve
the problem completely, but will mitigate it.
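A minimal sketch of the mitigation (the function name is an assumption):

    #include <chrono>
    #include <cstdint>

    // Seed the response id counter from the wall clock (milliseconds since the
    // epoch) so that ids do not restart from the same value after a reboot.
    static uint64_t initial_response_id() {
        return std::chrono::duration_cast<std::chrono::milliseconds>(
                std::chrono::system_clock::now().time_since_epoch()).count();
    }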
In theory we should not get a write reply from a node we did not send a
write to, but in practice a stale reply can be received if a node
reboots between sending a write and getting a reply. Do not assert; log
a warning instead and ignore the reply.
Fixes: #3153
Instead of having one static space limit for all directories,
space_watchdog now keeps a per-device limit, shared among
hints managers residing on the same disks.
References #3516
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
Previously max_shard_disk_space_size was unconditionally initialized
with the capacity of hints_directory. However, it is likely that
hints_directory doesn't exist at all if hinted handoff is not enabled,
which results in Scylla failing to boot.
So, max_shard_disk_space_size is now initialized with the capacity
of hints_for_views directory, which is always present.
This commit also moves max_shard_disk_space_size to the .cc file
where it belongs - resource_manager.cc.
Tests: unit (release)
Message-Id: <9f7b86b6452af328c05c5c6c55bfad3382e12445.1528977363.git.sarna@scylladb.com>
Now that more than one instance of hints manager can be present
at the same time, registering metrics is moved out of the constructor
to prevent 'registering metrics twice' errors.
Constants related to managing resources are moved to the newly created
resource_manager class. Later, this class will be used to manage
(potentially shared) resources of hints managers.
This commit extracts metrics related to writes from the stats structure,
so it can be easily replaced later, e.g. for materialized view metrics.
References #3385
References #3416
This commit initializes and enables hinted handoff for materialized
views, even if HH is not explicitly turned on in the config.
User writes still use hinted handoff only if it is explicitly enabled,
while materialized views are allowed to use it unconditionally
in order to store failed replica updates somewhere.
Fixes #3383
This commit makes view replica updates internally use consistency
level ANY, so in case an update fails it will fall back to hinted
handoff.
References #3383
When a node is decommissioned/removed it will drain all its hints, and all
remote nodes that have hints for it will drain their hints to this node.
What does "drain" mean? The node that "drains" hints to a specific
destination will ignore failures and will continue sending hints till the end
of the current segment, erase it and move to the next one, until there are
no more segments left.
After all hints are drained the corresponding hints directory is removed.
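A rough sketch of the drain loop described above, assuming seastar's do_until
and remove_file; the segment list and send_all_hints() helper are made up:

    // Drain all pending hint segments for one destination: ignore send
    // failures, delete each segment file, stop when no segments are left.
    seastar::future<> drain_for(gms::inet_address ep) {
        return seastar::do_until([this] { return _segments.empty(); }, [this, ep] {
            auto seg = _segments.front();
            return send_all_hints(seg, ep).handle_exception([] (std::exception_ptr) {
                // draining ignores failures and keeps going
            }).then([this, seg] {
                _segments.pop_front();
                return seastar::remove_file(seg); // seg assumed to be the file name
            });
        });
    }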
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
The mutation forwarding intermediary (src_addr) may not always know
about the schema which was used by the original coordinator. I think
this may be the cause of the "Schema version ... not found" error seen
in one of the clusters which entered some pathological state:
storage_proxy - Failed to apply mutation from 1.1.1.1#5: std::_Nested_exception<schema_version_loading_failed> (Failed to load schema version 32893223-a911-3a01-ad70-df1eb2a15db1): std::runtime_error (Schema version 32893223-a911-3a01-ad70-df1eb2a15db1 not found)
Fixes #3393.
Message-Id: <1524639030-1696-1-git-send-email-tgrabiec@scylladb.com>
Make the read-repair decision on the first page of a paged query and use
it for all the remaining pages. This helps querier-cache hit rates, as
reads will be sent to the same nodes consistently throughout the query.
As yet more parameters and return values are about to be added to all
storage_proxy::query_* methods, we need a way that scales better than
changing the signatures every time. To this end we aggregate all
non-mandatory query parameters into `coordinator_query_options` and all
return values into `coordinator_query_result`.
This way new fields can simply be added to the respective structs while
the signatures of the methods themselves and their client code
remain unchanged.
This patch implements the last_replicas-returning part of the query()
signature changes for singular queries. It allows client code to save
the last returned replicas and pass them to query() on the next page
as the preferred-replicas parameter, thus facilitating the read requests
for the next page hitting the same replicas.
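A minimal sketch of the intended usage; only query(), coordinator_query_options
and coordinator_query_result come from these patches, the other names and
fields are assumptions:

    // One page of a paged read: pass the replicas used for the previous page as
    // the preferred replicas, and remember which replicas ended up being used.
    coordinator_query_options opts;
    opts.preferred_replicas = paging_state.last_replicas;
    return proxy.query(schema, cmd, ranges, cl, opts).then(
            [&paging_state] (coordinator_query_result res) {
        paging_state.last_replicas = std::move(res.last_replicas);
        return std::move(res.query_result); // hypothetical field name
    });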
Propagate the preferred_replicas to db::filter_for_query() and consider
them when selecting the endpoints. The algorithm for selecting the
endpoints is as follows (a sketch follows the list):
* Compute the intersection of the endpoint candidates and the
preferred endpoints.
* If this yields a set of endpoints that already satisfies the CL
requirements use this set.
* Otherwise select the remaining endpoints according to the
load-balancing strategy, just like before.
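A simplified, self-contained stand-in for this selection order (not the actual
db::filter_for_query() signature; the candidates are assumed to already be in
load-balancing order):

    #include <algorithm>
    #include <iterator>
    #include <string>
    #include <vector>

    using endpoint = std::string;

    std::vector<endpoint> filter_for_query(const std::vector<endpoint>& candidates,
                                           const std::vector<endpoint>& preferred,
                                           size_t cl_required) {
        std::vector<endpoint> selected;
        // 1. Intersect the candidates with the preferred (previously used) replicas.
        std::copy_if(candidates.begin(), candidates.end(), std::back_inserter(selected),
                [&] (const endpoint& e) {
            return std::find(preferred.begin(), preferred.end(), e) != preferred.end();
        });
        // 2. If the preferred replicas alone satisfy the CL, use them as-is.
        if (selected.size() >= cl_required) {
            return selected;
        }
        // 3. Otherwise top up from the remaining candidates, in load-balancing order.
        for (const auto& e : candidates) {
            if (selected.size() == cl_required) {
                break;
            }
            if (std::find(selected.begin(), selected.end(), e) == selected.end()) {
                selected.push_back(e);
            }
        }
        return selected;
    }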
preferred_replicas are added to the parameters and last_replicas are
added to the return type. The preferred replicas will be used as a hint
for the selection of the replicas to send the read requests to. The last
replicas (returned) are the replicas actually selected for the read.
This will allow queries to consistently hit the same replicas for each
page thus reusing readers created on these replicas.
For convenience a query() overload is provided that doesn't take or
return the preferred and last replicas.
This patch only adds the parameters and propagates them down to
query_singular() and query_partition_key_range(). The code to actually
use these preferred-replicas will be added in later patches.
The reason for separating this is to reduce noise and improve
reviewability of those functional changes later.
Set the option that enables the underlying memtable and cache readers
to request caching of a cell's hash, for requests that require a
digest.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We add a cluster feature that indicates whether the xxHash algorithm is
supported, and allow nodes to switch to it. We use a cluster feature
because older versions are not ready to receive a different digest
algorithm than MD5 when answering a data request.
If we should ever add a new hash algorithm, we would also need to
add a new cluster feature for that algorithm. The alternative would be
to add code so a coordinator could negotiate which digest algorithm to
use with the set of replicas it is contacting.
Fixes #2884
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
While not strictly needed, specify which algorithm to use when requesting
a digest from a remote node. This is more flexible than relying on a
cluster-wide feature, although that's what we'll do in subsequent
patches. It also makes the verb more consistent with the data request.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Introduce the digest_algorithm() function, which encapsulates the
decision of which digest algorithm to use. Right now it is set to MD5,
but future patches will change this.
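A minimal sketch of the idea; only the function name comes from the patch, the
enum is an assumption:

    #include <cstdint>

    enum class digest_kind : uint8_t { MD5, xxHash }; // hypothetical enum

    // Encapsulates the choice of digest algorithm; hard-wired to MD5 for now,
    // later patches make this depend on cluster features.
    digest_kind digest_algorithm() {
        return digest_kind::MD5;
    }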
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Introduce class result_options to carry result options through the
request pipeline, which at this point means the result type and the
digest algorithm. This class allows us to encapsulate the concrete
digest algorithm to use.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Timeouts are a global property. However, for tables in keyspaces like
the system keyspace, we don't want to uphold that timeout--in fact, we
want no timeout there at all.
We already apply such configuration for requests waiting in the queued
sstable queue: system keyspace requests won't be removed. However, the
storage proxy will insert its own timeouts in those requests, causing
them to fail.
This patch changes the storage proxy read layer so that the timeout is
applied based on the column family configuration, which is in turn
inherited from the keyspace configuration. This matches our usual
way of passing db parameters down.
In terms of implementation, we can either move the timeout inside the
abstract read executor or keep it external. The former is a bit cleaner,
but the latter has the nice property that all executors generated will
share the exact same timeout point. In this patch, we chose the latter.
We are also careful to propagate the timeout information to the replica.
So even if we are talking about the local replica, when we add the
request to the concurrency queue, we will do it in accordance with the
timeout specified by the storage proxy layer.
After this patch, Scylla is able to start just fine with very low
timeouts--since read timeouts in the system keyspace are now ignored.
Fixes#2462
Implementation notes, and general comments about the open discussion in 2462:
* Because we are not bypassing the timeout, just setting it high enough,
  I consider the concerns about the batchlog moot: if we fail for any
  other reason, that will be propagated. As a last resort, because the
  timeout is per-CF, we could do what we do for the dirty memory manager
  and move the batchlog alone to use a different timeout setting.
* The storage proxy likes specifying its timeouts as a time_point, whereas
  when we get low enough to deal with the read_concurrency_config, we are
  talking about deltas. So at some point we need to convert time_points
  to durations. We do that in the database query functions (see the
  sketch below).
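For example, the conversion boils down to something like this (the clock and
variable names are illustrative):

    #include <chrono>

    using clock_type = std::chrono::steady_clock;

    // Storage proxy side: an absolute deadline.
    clock_type::time_point timeout = clock_type::now() + std::chrono::seconds(5);

    // Database query side: the read_concurrency_config wants a delta, so
    // subtract "now" at the point where the query is issued.
    auto remaining = timeout - clock_type::now();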
v2:
- use per-request instead of per-table timeouts.
Signed-off-by: Glauber Costa <glauber@scylladb.com>