* seastar 0773e98...6fbd792 (2):
> tls: Only run our "verify" function in client session
> Merge "Clean the metric definition" from Amnon
Includes patch from Amnon adjusting the metrics registration due to seastar
API changes.
Use seastar's new metrics_registration framework:
- Change the registration syntax.
- Add a long description for each counter.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
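A rough sketch of what registration under the new framework looks like (the class and counter names here are invented for illustration; the exact factory is make_derive or make_counter and the header path varies with the seastar version):

    #include <cstdint>
    #include <seastar/core/metrics.hh>   // "core/metrics.hh" in older seastar trees

    namespace sm = seastar::metrics;

    class coordinator_stats {
        uint64_t _write_timeouts = 0;                      // hypothetical counter
        sm::metric_groups _metrics;
    public:
        coordinator_stats() {
            _metrics.add_group("storage_proxy_coordinator", {
                sm::make_counter("write_timeouts", _write_timeouts,
                    sm::description("Counts write requests that timed out while "
                                    "waiting for replica responses")),
            });
        }
    };

Unregistration happens automatically when the metric_groups member is destroyed.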
Instead of putting all statistics under the same "storage_proxy" category
separate them into 2 groups according to where the corresponding counters
are updated:
- "storage_proxy_replica"
- "storage_proxy_coordinator"
Fixes #1763
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Timeout errors are flooding the log when a local mutate times
out. We don't log remote mutate timeouts, so for consistency we won't
log local ones either.
There is a database counter for timed-out writes which can be
consulted in order to check whether they're occurring.
Perhaps this would be better solved by a generic log message
throttling/coalescing mechanism, but that's not ready yet.
Write requests which timed out may still occupy memory for a while due
to the local write. That write should time out soon as well, but there is a
window in which it has not yet. If we don't delay the timeout response,
the request would be seen as not consuming any memory too early. This
in turn would cause the CQL server to admit more requests than we
want, in some cases causing OOM or exceeding memory limits and causing
excessive cache eviction.
Fixes#1756.
Currently the counter uses _response_handlers.size(), but after later
patches we may have an active (timed out) write with no response
handler, so count live instances instead.
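A minimal sketch of the "count live instances" idea (names are hypothetical, not the actual handler class):

    #include <cstdint>

    struct write_stats {
        uint64_t background_writes = 0;          // hypothetical counter
    };

    class write_response_handler_sketch {
        write_stats& _stats;
    public:
        explicit write_response_handler_sketch(write_stats& s) : _stats(s) {
            ++_stats.background_writes;          // counted while the instance is alive...
        }
        ~write_response_handler_sketch() {
            --_stats.background_writes;          // ...and dropped only on destruction,
        }                                        // not on removal from _response_handlers
        write_response_handler_sketch(const write_response_handler_sketch&) = delete;
        write_response_handler_sketch& operator=(const write_response_handler_sketch&) = delete;
    };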
storage_proxy has an optimization where it tries to query multiple token
ranges concurrently to satisfy very large requests (an optimization which is
likely meaningless when paging is enabled, as it always should be). However,
the rows-per-range code severely underestimates the number of rows per range,
resulting in a large number of "read-ahead" internal queries being performed,
the results of most of which are discarded.
Fix by disabling this code. We should likely remove it completely, but let's
start with a band-aid that can be backported.
Fixes#1863.
Message-Id: <20161120165741.2488-1-avi@scylladb.com>
If we're talking to just one replica, the digest is not going to be used,
so better not to calculate it at all. The optimization helps with
LOCAL_ONE queries where the result is large, but does not contain large
blobs (many small rows).
This patch adds a digest_algorithm parameter to the READ_DATA verb that
can take on two values: none and MD5 (default), and sets it to none when
we're reading from one replica.
In the future we may add other values for more hardware-friendly digest
algorithms.
Message-Id: <1479380600-19206-1-git-send-email-avi@scylladb.com>
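A minimal sketch of the selection rule described above (the enum values follow the description; the function name is invented):

    #include <cstddef>
    #include <cstdint>

    // Which digest the replica should compute for a READ_DATA request.
    enum class digest_algorithm : uint8_t {
        none = 0,   // skip digest computation entirely
        MD5  = 1,   // the default, matching pre-existing behaviour
    };

    // With a single replica there is nothing to compare a digest against.
    inline digest_algorithm digest_for(std::size_t replica_count) {
        return replica_count == 1 ? digest_algorithm::none : digest_algorithm::MD5;
    }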
Now that the histogram has its own unit expressed in its template
parameter, there is no reason to convert it to nano just so we may need
to convert it back if the histogram needs another unit.
This patch will keep everything as a duration until last moment, and
then we'll convert when needed.
This was suggested by Amnon.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <218efa83e1c4ddc6806c51913d4e5f82dc6d231e.1479139020.git.glauber@scylladb.com>
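The idea, sketched with plain std::chrono (not the actual histogram code):

    #include <chrono>
    #include <cstdint>

    using latency_clock = std::chrono::steady_clock;

    struct timed_operation {
        latency_clock::time_point start = latency_clock::now();

        // keep the value as a duration for as long as possible...
        latency_clock::duration elapsed() const {
            return latency_clock::now() - start;
        }
    };

    // ...and convert only at the point of use, in whatever unit the consumer needs
    inline int64_t as_micros(latency_clock::duration d) {
        return std::chrono::duration_cast<std::chrono::microseconds>(d).count();
    }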
Current speculation target selection logic has several bugs in a multi-DC
setup. It may select a non-local target for a LOCAL CL, and it may select
more than one target to speculate on, one of which is non-local.
Examples:
1. Two datacenters: DC1 RF 2, DC2 RF 2 and a read with LOCAL_QUORUM.
In this scenario db::filter_for_query() will return both replicas from the
local DC and the speculation target selection logic will pick one more which
will be in a different DC.
2. Two datacenters: DC1 RF 2, DC2 RF 2 and a read with LOCAL_ONE + RRD.DC_LOCAL.
In this scenario db::filter_for_query() will return all nodes in the local DC and
there will already be enough nodes to speculate, but the current logic will add
one node from a non-local DC as a speculation target.
The patch below fixes both of those scenarios.
Message-Id: <20161103154637.GS7766@scylladb.com>
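An illustrative sketch of the intended rule, under the assumption that the speculation step runs on the list returned by db::filter_for_query() (all types and names here are simplified stand-ins):

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <vector>

    struct replica {
        std::string dc;
        std::string ip;
    };

    // Return at most one extra replica to speculate on.  For DC-local consistency
    // levels the extra target must come from the local DC, and nothing is added
    // when the target list already contains a spare node.
    std::vector<replica> extra_speculation_targets(const std::vector<replica>& targets,
                                                   const std::vector<replica>& all_live,
                                                   const std::string& local_dc,
                                                   bool dc_local_cl,
                                                   std::size_t needed_for_cl) {
        if (targets.size() > needed_for_cl) {
            return {};   // scenario 2 above: enough targets already, add nothing
        }
        for (const auto& r : all_live) {
            bool already_target = std::any_of(targets.begin(), targets.end(),
                                              [&](const replica& t) { return t.ip == r.ip; });
            if (already_target || (dc_local_cl && r.dc != local_dc)) {
                continue;   // scenario 1 above: never pick a remote node for a DC-local CL
            }
            return {r};
        }
        return {};
    }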
Instead of asking a shard for cmd->partition_limit and cmd->row_limit,
just ask it for the number of partitions and rows still needed to
satisfy the query. This removes the need to trim the shard's result.
Since every shard might cause the row_limit quota to be satisfied, every
shard might be the last one we need. Hence it is better to process shards
sequentially, stopping if the quota is reached or the range is exhausted.
The original code tried to yield to reduce latency, but this is now
unnecessary, as we're doing a lot less work per iteration (if it becomes
necessary, we should do it on the replica shard, not the coordinating shard).
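A simplified, synchronous sketch of the coordinating loop described above (the real code is future-based; the names are invented):

    #include <algorithm>
    #include <cstdint>

    struct query_limits {
        uint64_t partitions;
        uint64_t rows;
    };

    // QueryShard is any callable taking (shard_id, remaining limits) and returning
    // the limits actually consumed by that shard.
    template <typename QueryShard>
    query_limits query_shards_sequentially(unsigned shard_count,
                                           query_limits remaining,
                                           QueryShard query_shard) {
        for (unsigned shard = 0; shard < shard_count; ++shard) {
            if (remaining.partitions == 0 || remaining.rows == 0) {
                break;   // quota satisfied: later shards are never asked at all
            }
            query_limits consumed = query_shard(shard, remaining);
            remaining.partitions -= std::min(consumed.partitions, remaining.partitions);
            remaining.rows       -= std::min(consumed.rows, remaining.rows);
        }
        return remaining;   // whatever is still unsatisfied once the range is exhausted
    }

Because each shard is asked only for what is still missing, no trimming of a shard's result is needed afterwards.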
Wrapping ranges are a pain, so we are moving wrap handling to the edges.
Since cql can't generate wrapping ranges, this means thrift and the ring
maintenance code; also range->ring transformations need to merge the first
and last ranges.
Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>
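The edge-unwrapping itself is simple; a toy sketch with an integer token model (not the real dht types):

    #include <cstdint>
    #include <vector>

    struct token_range {    // grossly simplified, for illustration only
        int64_t start;
        int64_t end;
    };

    // A wrapping range (start > end) crosses the edge of the ring; split it into
    // two ordinary ranges so code past this point never sees a wrap.
    std::vector<token_range> unwrap(token_range r, int64_t ring_min, int64_t ring_max) {
        if (r.start <= r.end) {
            return {r};
        }
        return {{r.start, ring_max}, {ring_min, r.end}};
    }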
If we have a range query involving a wrapping range (i.e., from thrift),
and mutations from both halves of the result are involved, then
we will return the results in the wrong order (and potentially the wrong
partitions) since we order by token, so the results from the second half
of the wrapping range end up before the first.
Fix by splitting the two queries, and merging the second half with lower
priority compared to the first half.
Note: this will be fixed in a better way once we have the sharding iterator,
as then we can query sequentially.
Fixes#1761.
Message-Id: <1476262693-30162-1-git-send-email-avi@scylladb.com>
This object, similarly to a global_schema_ptr, allows dynamically
creating trace_state_ptr objects on different shards in the context
of the original tracing session.
This object creates a secondary tracing session object from the
original trace_state_ptr object when a trace_state_ptr object is needed
on a "remote" shard, similarly to what we do when we need it on a remote
Node.
Fixes #1678
Fixes #1647
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1474387767-21910-1-git-send-email-vladz@cloudius-systems.com>
This patch changes the storage_proxy so it passes along a
trace_state_ptr to the layers below, when querying locally or
receiving a remote query request.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"This series introduces a "slow query logging" feature that
allows logging the queries that take more than a specified
threshold time to complete.
Once such a query is detected, it will be logged in the system_traces.node_slow_log table.
In addition, all traces for that query that have been collected on a Coordinator
are going to be written as well.
If the handling time on a replica in the context of a query exceeds the (same) threshold,
its traces are going to be written too.
The row in the node_slow_log table contains the session_id of the corresponding tracing session,
thereby allowing the user to query the system_traces tables for the corresponding trace
records.
The schema of the node_slow_log table is as follows:
CREATE TABLE system_traces.node_slow_log (
    node_ip inet,
    shard int,
    session_id uuid,
    date timestamp,
    start_time timeuuid,
    command text,
    duration int,
    parameters map<text, text>,
    source_ip inet,
    table_names set<text>,
    username text,
    PRIMARY KEY (start_time, node_ip, shard))
    WITH default_time_to_live = 86400
where
- node_ip: IP of the coordinator Node.
- shard: shard ID on a Coordinator where the query was handled.
- session_id: ID of a corresponding tracing session.
- date: the time when the query began.
- start_time: a time-based UUID for this query (needed mostly for the primary key).
- command: a query string.
- duration: the time it took to handle this query (in microseconds).
- parameters: a map of query parameters (like in system_traces.sessions).
- source_ip: IP of a Client that sent this query.
- table_names: a set of "<keyspace>.<table name>" strings representing column
families used in this query.
- username: a user name used for this query.
The good thing is that most of the data we need is already
collected by the regular tracing framework. The only missing pieces
are the username and the table names, so this series makes the framework collect them too.
The whole feature is integrated in the Tracing framework. The main
changes to the framework that were made are as follows:
- Store the constant capabilities of the tracing session in an enum_set, e.g.:
- primary/secondary.
- write on close.
- Introduce two new capabilities to a tracing session of a specific query:
- full tracing: collect all traces for this query (as it is before this series).
- log slow query: log this query if its duration is above the threshold.
These two capabilities may be defined independently.
- Add the logic that handles the "log slow query"-only case:
- Build the parameters<sstring, sstring> map only if the "duration" is above
the given threshold.
- The same applies to writing the trace entries.
- In a not-only "log slow query" case:
- Write the node_slow_log entry.
- Extend the trace_info struct to pass slow query threshold and TTL to the replica
Node.
In addition to the above, this series adds the capability to configure the slow query logging
threshold and the TTL for the node_slow_log records.
The heaviest patch in the series is the last one. The series contains a few cosmetic (renaming)
patches that are meant to align the naming of the existing methods with the ones the last one
is going to add."
Now that mutation handler knows how much time is left for mutation
write to be handled it can use this knowledge to set correct timeout
for forwarded mutations.
Message-Id: <20160828080637.GE9243@scylladb.com>
- Instead of keeping separate booleans introduce a trace_state_props_set enum_set and
pass it around instead of separate booleans.
- Change the trace_info to hold this value in addition to write_on_close. Initialize
the corresponding bit in the enum_set based on the write_on_close value in the trace_info
constructor for backward compatibility.
- Separate a trace_state constructor into two:
- For a primary session object.
- For a secondary session object.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
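A minimal sketch of the enum_set idea using std::bitset (the real enum_set utility differs; the values follow the capabilities listed above):

    #include <bitset>
    #include <cstddef>

    enum class trace_state_props {
        primary, write_on_close, full_tracing, log_slow_query, COUNT
    };

    class trace_state_props_set_sketch {
        std::bitset<static_cast<std::size_t>(trace_state_props::COUNT)> _bits;
    public:
        void set(trace_state_props p)            { _bits.set(static_cast<std::size_t>(p)); }
        bool contains(trace_state_props p) const { return _bits.test(static_cast<std::size_t>(p)); }
    };

Passing one such set around replaces a growing list of separate boolean parameters.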
This patch makes the storage_proxy return an empty result when the
query doesn't define any clustering ranges (default or specific).
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This reverts commit 141ea49e05.
There was a confusion around the meaning of "partition limit".
Parts of our code interpreted it just as "maximum number of partitions".
This is also how Cassandra behaves.
However, the other parts of the code, including data query, interpreted
it as "maximum number of live partitions" or otherwise skipped dead
partitions resulting in #1447.
A decision has been made to stick to the "maximum number of live
partitions" interpretation everywhere. The consequences are, among
others, that the patch reverted by this one is no longer correct.
While the actual series fixing the interpretations of partition limit
and getting rid of the confusion is yet to come, the purpose of this
revert is to make backporting easier (as the patch being reverted
hasn't made it to branch-1.3 yet).
Having a trace_state_ptr at the storage_proxy level is needed to trace code bits at this level.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Store the trace state in the abstract_write_response_handler.
Instrument the send_mutation RPC to receive an additional
rpc::optional parameter that will contain an optional<trace_info>
value.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
From now on trace_state::trace() can receive a sprint-ready
format string with arguments that will be applied only during
the flush event.
This patch also optimizes the way the source address is evaluated:
do it only once instead of twice when tracing is requested.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
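A toy sketch of the deferred-formatting idea, using std::snprintf as a stand-in for sprint (argument types are assumed printf-compatible; none of these names are the real API):

    #include <cstdio>
    #include <functional>
    #include <string>
    #include <vector>

    class deferred_trace_buffer {
        std::vector<std::function<std::string()>> _entries;
    public:
        // Only a closure is stored at trace time; the (potentially expensive)
        // formatting happens when the buffer is flushed.
        template <typename... Args>
        void trace(const char* fmt, Args... args) {
            _entries.emplace_back([=] {
                char buf[256];
                std::snprintf(buf, sizeof(buf), fmt, args...);
                return std::string(buf);
            });
        }

        std::vector<std::string> render_on_flush() {
            std::vector<std::string> out;
            out.reserve(_entries.size());
            for (auto& e : _entries) {
                out.push_back(e());
            }
            _entries.clear();
            return out;
        }
    };

If the session is never flushed, the formatting cost for its traces is never paid.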
In names of functions and variables:
s/flush_/write_/
s/store_/write_/
In a i_tracing_backend_helper:
s/flush()/kick()/
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
If mutations are fragmented during streaming, special care must be
taken so that isolation guarantees are not broken.
Mutations received with the "fragmented" flag set are applied to a memtable
that is used only by that particular streaming task, and the sstables
created by flushing such memtables are not made visible until the task
is complete. Also, in case the streaming fails, all data is dropped.
This means that fragmented mutations cannot benefit from coalescing of
writes from multiple streaming plans, hence the separate way of handling
them so that there is no loss of performance for small partitions.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
plan_id is needed to keep track of the origin of mutations so that if
they are fragmented all fragments are made visible at the same time,
when that particular streaming plan_id completes.
Basically, each streaming plan that sends big (fragmented) mutations is
going to have its own memtables and a list of sstables which will get
flushed and made visible when that plan completes (or dropped if it
fails).
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
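A condensed sketch of the bookkeeping described above (stub types; the real plan id is a UUID and visibility is tracked per table):

    #include <cstdint>
    #include <iterator>
    #include <map>
    #include <utility>
    #include <vector>

    struct sstable_stub {};                  // stand-in for the real sstable type
    using plan_id = uint64_t;                // stand-in for the streaming plan UUID

    class streaming_plan_store_sketch {
        std::map<plan_id, std::vector<sstable_stub>> _pending;   // invisible to reads
        std::vector<sstable_stub> _visible;                      // what readers may see
    public:
        void add_flushed(plan_id plan, sstable_stub sst) {
            _pending[plan].push_back(std::move(sst));
        }
        void plan_completed(plan_id plan) {
            auto it = _pending.find(plan);
            if (it == _pending.end()) {
                return;
            }
            // all fragments of the plan become visible at the same time
            _visible.insert(_visible.end(),
                            std::make_move_iterator(it->second.begin()),
                            std::make_move_iterator(it->second.end()));
            _pending.erase(it);
        }
        void plan_failed(plan_id plan) {
            _pending.erase(plan);            // streaming failed: drop everything received
        }
    };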
This patch adds a per-partition row limit. It ensures both local
queries and the reconciliation logic abide by this limit.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the way we fetch each replica's last row to
determine if we got incomplete information from any of them. Instead
of fetching the last rows up front, we fetch them on demand only if we
actually trigger the code that needs them. We now get the last row from
the versions vector of vectors.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch extracts to a function the code that actually determines
the last row of a partition based on the direction of the query.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Users may specify a time after which a speculative retry should happen
instead of relying on cf statistics. Use the provided value in the
speculative executor.
Message-Id: <20160616104422.GH5961@scylladb.com>
When read repair writes diffs back to replicas, it is enough to wait
for the requested CL to guarantee read monotonicity. This patch makes read
repair writes reuse the regular mutate functionality, which already tracks
CL status. This is done by changing the write response handler to not hold
the mutation directly, but instead hold a container that, depending on
whether this is a read repair write or a regular one, can provide a different
mutation per destination.
Message-Id: <20160613124727.GL1096@scylladb.com>
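A sketch of the container idea (stub types; the real classes live in the storage_proxy write path):

    #include <map>
    #include <string>
    #include <utility>

    struct mutation_stub {};             // stand-in for the real mutation type
    using endpoint = std::string;        // stand-in for the replica address type

    // The response handler holds one of these instead of a mutation.
    class mutation_holder_sketch {
    public:
        virtual ~mutation_holder_sketch() = default;
        virtual const mutation_stub& mutation_for(const endpoint& target) = 0;
    };

    // Regular write: every destination gets the same mutation.
    class shared_mutation_sketch : public mutation_holder_sketch {
        mutation_stub _m;
    public:
        const mutation_stub& mutation_for(const endpoint&) override { return _m; }
    };

    // Read-repair write: each destination gets its own diff.
    class per_destination_mutation_sketch : public mutation_holder_sketch {
        std::map<endpoint, mutation_stub> _diffs;
    public:
        explicit per_destination_mutation_sketch(std::map<endpoint, mutation_stub> diffs)
            : _diffs(std::move(diffs)) {}
        const mutation_stub& mutation_for(const endpoint& target) override {
            return _diffs.at(target);
        }
    };

The CL-tracking machinery in the handler stays the same either way; only where the bytes for each destination come from differs.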