Commit Graph

21 Commits

Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes were applied mechanically with a script, except to
licenses/README.md.
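
For illustration, a dual-licensed C++ source file header after this change looks
roughly like the following (the exact copyright line varies per file, so treat
this as a sketch rather than a verbatim sample):

    /*
     * Copyright (C) 2022-present ScyllaDB
     *
     * SPDX-License-Identifier: (AGPL-3.0-or-later and Apache-2.0)
     */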

Closes #9937
2022-01-18 12:15:18 +01:00
Avi Kivity
a55b434a2b treewide: extend copyright statements to present day 2021-06-06 19:18:49 +03:00
Piotr Sarna
2e544a0c89 storage_proxy: add metrics for too many in-flight hints failures
When there are too many in-flight hints, writes start returning
overloaded exceptions. We're missing metrics for that, and they could
be useful when judging whether the system is in an overloaded state.
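
A minimal sketch of how such a counter is typically exposed through the Seastar
metrics API (the metric and member names here are assumptions, not necessarily
the ones used by the patch):

    #include <seastar/core/metrics.hh>
    #include <cstdint>

    namespace sm = seastar::metrics;

    struct hint_stats {
        uint64_t write_overload_errors = 0;   // bumped when a write fails because too many hints are in flight
        sm::metric_groups metrics;

        void register_metrics() {
            metrics.add_group("storage_proxy", {
                sm::make_counter("hints_write_overload_errors", write_overload_errors,
                                 sm::description("Number of writes failed because too many hints were in flight")),
            });
        }
    };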
2020-11-10 16:26:18 +01:00
Amnon Heiman
6e1f042b93 storage_proxy: use time_estimated_histogram for latencies
This patch changes storage_proxy to use time_estimated_histogram.

Besides the type, it changes how values are inserted and how the
histogram is used by the API.

An example of how a metric looks after the change:
scylla_storage_proxy_coordinator_write_latency_bucket{le="640.000000",scheduling_group_name="statement",shard="0",type="histogram"} 0
scylla_storage_proxy_coordinator_write_latency_bucket{le="768.000000",scheduling_group_name="statement",shard="0",type="histogram"} 0
scylla_storage_proxy_coordinator_write_latency_bucket{le="896.000000",scheduling_group_name="statement",shard="0",type="histogram"} 0
scylla_storage_proxy_coordinator_write_latency_bucket{le="1024.000000",scheduling_group_name="statement",shard="0",type="histogram"} 0
scylla_storage_proxy_coordinator_write_latency_bucket{le="1280.000000",scheduling_group_name="statement",shard="0",type="histogram"} 0
scylla_storage_proxy_coordinator_write_latency_bucket{le="1536.000000",scheduling_group_name="statement",shard="0",type="histogram"} 0
scylla_storage_proxy_coordinator_write_latency_bucket{le="1792.000000",scheduling_group_name="statement",shard="0",type="histogram"} 2
scylla_storage_proxy_coordinator_write_latency_bucket{le="2048.000000",scheduling_group_name="statement",shard="0",type="histogram"} 2
scylla_storage_proxy_coordinator_write_latency_bucket{le="2560.000000",scheduling_group_name="statement",shard="0",type="histogram"} 3
scylla_storage_proxy_coordinator_write_latency_bucket{le="3072.000000",scheduling_group_name="statement",shard="0",type="histogram"} 5
scylla_storage_proxy_coordinator_write_latency_bucket{le="3584.000000",scheduling_group_name="statement",shard="0",type="histogram"} 5
scylla_storage_proxy_coordinator_write_latency_bucket{le="4096.000000",scheduling_group_name="statement",shard="0",type="histogram"} 7
scylla_storage_proxy_coordinator_write_latency_bucket{le="5120.000000",scheduling_group_name="statement",shard="0",type="histogram"} 8
scylla_storage_proxy_coordinator_write_latency_bucket{le="6144.000000",scheduling_group_name="statement",shard="0",type="histogram"} 9
scylla_storage_proxy_coordinator_write_latency_bucket{le="7168.000000",scheduling_group_name="statement",shard="0",type="histogram"} 11
scylla_storage_proxy_coordinator_write_latency_bucket{le="8192.000000",scheduling_group_name="statement",shard="0",type="histogram"} 11
scylla_storage_proxy_coordinator_write_latency_bucket{le="10240.000000",scheduling_group_name="statement",shard="0",type="histogram"} 19
scylla_storage_proxy_coordinator_write_latency_bucket{le="12288.000000",scheduling_group_name="statement",shard="0",type="histogram"} 49
scylla_storage_proxy_coordinator_write_latency_bucket{le="14336.000000",scheduling_group_name="statement",shard="0",type="histogram"} 132
scylla_storage_proxy_coordinator_write_latency_bucket{le="16384.000000",scheduling_group_name="statement",shard="0",type="histogram"} 294
scylla_storage_proxy_coordinator_write_latency_bucket{le="20480.000000",scheduling_group_name="statement",shard="0",type="histogram"} 1035
scylla_storage_proxy_coordinator_write_latency_bucket{le="24576.000000",scheduling_group_name="statement",shard="0",type="histogram"} 2790
scylla_storage_proxy_coordinator_write_latency_bucket{le="28672.000000",scheduling_group_name="statement",shard="0",type="histogram"} 5788
scylla_storage_proxy_coordinator_write_latency_bucket{le="32768.000000",scheduling_group_name="statement",shard="0",type="histogram"} 9815
scylla_storage_proxy_coordinator_write_latency_bucket{le="40960.000000",scheduling_group_name="statement",shard="0",type="histogram"} 19821
scylla_storage_proxy_coordinator_write_latency_bucket{le="49152.000000",scheduling_group_name="statement",shard="0",type="histogram"} 30063
scylla_storage_proxy_coordinator_write_latency_bucket{le="57344.000000",scheduling_group_name="statement",shard="0",type="histogram"} 38642
scylla_storage_proxy_coordinator_write_latency_bucket{le="65536.000000",scheduling_group_name="statement",shard="0",type="histogram"} 44987
scylla_storage_proxy_coordinator_write_latency_bucket{le="81920.000000",scheduling_group_name="statement",shard="0",type="histogram"} 51821
scylla_storage_proxy_coordinator_write_latency_bucket{le="98304.000000",scheduling_group_name="statement",shard="0",type="histogram"} 54197
scylla_storage_proxy_coordinator_write_latency_bucket{le="114688.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55054
scylla_storage_proxy_coordinator_write_latency_bucket{le="131072.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55363
scylla_storage_proxy_coordinator_write_latency_bucket{le="163840.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55520
scylla_storage_proxy_coordinator_write_latency_bucket{le="196608.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55545
scylla_storage_proxy_coordinator_write_latency_bucket{le="229376.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="262144.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="327680.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="393216.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="458752.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="524288.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="655360.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="786432.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="917504.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="1048576.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="1310720.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="1572864.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="1835008.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="2097152.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="2621440.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="3145728.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="3670016.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="4194304.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="5242880.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="6291456.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="7340032.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="8388608.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="10485760.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="12582912.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="14680064.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="16777216.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="20971520.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="25165824.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="29360128.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="33554432.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="+Inf",scheduling_group_name="statement",shard="0",type="histogram"} 55549
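
The bucket boundaries above follow a layout in which each power-of-two range is
split into four equal sub-buckets, so bucket widths grow roughly exponentially
while the bucket lookup stays a cheap bit operation. A small standalone C++
sketch (not the ScyllaDB implementation) that reproduces the "le" boundaries
shown:

    #include <algorithm>
    #include <bit>
    #include <cstdint>
    #include <iostream>

    // Upper bound ("le" label) of the bucket a value v > 0 falls into, assuming
    // each power-of-two range is divided into 4 equal sub-buckets
    // (..., 640, 768, 896, 1024, 1280, 1536, ...).
    uint64_t bucket_upper_bound(uint64_t v) {
        uint64_t base = uint64_t(1) << (std::bit_width(v) - 1);  // largest power of two <= v
        uint64_t step = std::max<uint64_t>(base / 4, 1);         // 4 sub-buckets per octave
        return (v + step - 1) / step * step;                     // round up to the next boundary
    }

    int main() {
        for (uint64_t v : {700, 1500, 12000, 60000}) {
            std::cout << v << " -> le=" << bucket_upper_bound(v) << "\n";  // 768, 1536, 12288, 65536
        }
    }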

Fixes #4746

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-06-15 08:23:02 +03:00
Gleb Natapov
d555fb60d7 lwt: add counters for background and foreground paxos operations
Paxos may leave an operation running in the background after returning a
result to the caller. Let's add counters for background/foreground Paxos
handlers so that it is easier to detect memory-related issues.
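
A minimal sketch of the counting pattern described (member names are
illustrative, not the actual ScyllaDB code):

    #include <cstdint>

    struct paxos_stats {
        uint64_t foreground_ops = 0;   // handlers a client is still waiting on
        uint64_t background_ops = 0;   // handlers still running after the result was returned
    };

    struct paxos_handler {
        paxos_stats& stats;
        bool in_background = false;

        explicit paxos_handler(paxos_stats& s) : stats(s) { ++stats.foreground_ops; }

        // Called when the result has been returned to the caller but work continues.
        void move_to_background() {
            --stats.foreground_ops;
            ++stats.background_ops;
            in_background = true;
        }

        ~paxos_handler() {
            if (in_background) { --stats.background_ops; } else { --stats.foreground_ops; }
        }
    };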

Message-Id: <20200510092942.GA24506@scylladb.com>
2020-05-11 14:37:00 +02:00
Pavel Emelyanov
513ce1e6a5 storage_proxy_stats: Make get_ep_stat() noexcept
The .get_ep_stat(ep) call can throw when registering metrics (we have an
issue for it, #5697). This is not expected by its callers; in particular,
abstract_write_response_handler::timeout_cb breaks in the middle and doesn't
call on_timeout() and _proxy->remove_response_handler(), which results in a
response handler that is never removed or released. In turn, the unreleased
response handler never sets the _ready future on which response_wait()
waits -> stuck.

Although the issue with .get_ep_stat() should be fixed, an exception in it
mustn't lead to deadlocks, so the fix is to make get_ep_stat() noexcept by
catching the exception and returning a dummy stat object instead, letting
the caller(s) finish.
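
A simplified sketch of that shape (the map and stats types are stand-ins for
the real storage_proxy code):

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    struct write_stats { uint64_t writes = 0; };

    class storage_proxy_stats {
        std::unordered_map<std::string, write_stats> _ep_stats;
        write_stats _dummy;   // returned when creation fails, so callers can still finish
    public:
        write_stats& get_ep_stat(const std::string& ep) noexcept {
            try {
                // In the real code, creating a new per-endpoint entry registers
                // metrics, which can throw (#5697); operator[] stands in for that.
                return _ep_stats[ep];
            } catch (...) {
                // Swallow the error instead of letting it escape into timeout_cb,
                // which would otherwise leak the response handler.
                return _dummy;
            }
        }
    };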

Fixes #5985
Tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200430163639.5242-1-xemul@scylladb.com>
2020-04-30 19:40:08 +03:00
Gleb Natapov
8a408ac5a8 lwt: remove entries from system.paxos table after successful learn stage
The learning stage of the PAXOS protocol leaves behind an entry in the
system.paxos table with the last learned value (which can be large). If not
all participants learned it successfully, the next round on the same key may
complete the learning using this info. But if all nodes learned the value,
the entry no longer serves a useful purpose.

The patch adds another round, "prune", which is executed in the background
(limited to 1000 simultaneous instances) and removes the entry if all nodes
replied successfully to the "learn" round. It uses the ballot's timestamp to
do the deletion, so as not to interfere with the next round. Since the
deletion happens very close to the previous writes, it will likely happen in
the memtable and never reach an sstable, which reduces memtable flush and
compaction overhead.
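
Conceptually, the concurrency cap works like the sketch below. The real code
uses Seastar futures and semaphores; plain std primitives are used here only to
illustrate limiting background prunes to 1000 at a time, and skipping when the
limit is hit is just one possible policy:

    #include <semaphore>
    #include <thread>

    std::counting_semaphore<1000> prune_slots{1000};   // at most 1000 prunes in flight

    void maybe_prune(/* key, ballot */) {
        if (!prune_slots.try_acquire()) {
            return;   // over the limit: skip; the stale system.paxos entry is harmless
        }
        std::thread([] {
            // Delete the system.paxos entry using the ballot's timestamp as the
            // write timestamp, so the tombstone cannot shadow a later round's writes.
            prune_slots.release();
        }).detach();
    }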

Fixes #5779

Message-Id: <20200330154853.GA31074@scylladb.com>
2020-03-30 21:02:14 +03:00
Konstantin Osipov
94ee511f6a lwt: implement cas_failed_read_round_optimization metric
Presently, lightweight transactions piggyback the old row value on the
prepare round response. If one of the participants did not provide the old
value, or the values from peers don't match, we perform a full read round,
which will repair the Paxos table and, if necessary, the base table at all
participants.

Capture the fact that read optimization has failed in a metric.
Message-Id: <20200304192955.84208-2-kostja@scylladb.com>
2020-03-05 12:20:45 +01:00
Piotr Sarna
70c9889ef7 storage_proxy: remove dead metrics code
This patch removes an implementation of register_split_metrics_for,
which is not used anywhere in the codebase.

Message-Id: <e83f3e9d109113fe0553919032f005d4ab3a3023.1581851904.git.sarna@scylladb.com>
2020-02-16 17:00:45 +02:00
Gleb Natapov
ed3e423922 lwt: add counter for a case where timeout is sent prematurely
There is a case in the current PAXOS implementation where a timeout is
returned because the code cannot guarantee whether the value was accepted
or not in case of contention. The counter will help to correlate this
condition with failed requests.
Message-Id: <20200211160653.30317-2-gleb@scylladb.com>
2020-02-16 11:22:30 +02:00
Eliran Sinvani
971711a546 storage proxy: migrate to per scheduling group statistics
This commit builds on top of the previously introduced per-scheduling-group
statistics template and employs it to achieve per-scheduling-group statistics
in storage_proxy.

Some of the statistics also have meaning as global, per-shard values: those
are the ones used to decide whether to throttle write requests. This is
handled by creating a global stats struct that holds them and by changing
the stat updates to also update the global copy.

One point that complicated this is an already existing aggregation over the
per-shard stats, which have now become per-scheduling-group, per-shard stats,
turning the aggregation into a two-dimensional one.

One thing this commit doesn't handle is validating that an individual
statistic didn't "cross a scheduling group boundary"; such validation is
possible and can easily be added in the future. There is a subtlety to doing
so: if the operation did cross into another scheduling group, two connected
statistics can lose balance, for example written bytes and completed write
transactions.
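
An illustrative layout in plain C++ (max_groups and current_group_index() are
stand-ins for Seastar's scheduling-group facilities, and all names are
assumptions):

    #include <cstdint>
    #include <vector>

    constexpr unsigned max_groups = 16;              // stand-in for the scheduling-group limit
    unsigned current_group_index() { return 0; }     // stand-in: derived from the current scheduling group

    struct write_stats {
        uint64_t writes = 0;                         // per scheduling group, per shard
    };

    struct global_write_stats {
        uint64_t background_write_bytes = 0;         // per shard only: still drives write throttling
    };

    class proxy_stats {
        std::vector<write_stats> _per_sg{max_groups};
        global_write_stats _global;
    public:
        void on_write(uint64_t bytes) {
            ++_per_sg[current_group_index()].writes; // per-scheduling-group counter
            _global.background_write_bytes += bytes; // global copy updated alongside
        }
        // What used to be a per-shard aggregation is now two-dimensional:
        // sum over scheduling groups here, and over shards in the monitoring layer.
        uint64_t total_writes() const {
            uint64_t sum = 0;
            for (const auto& s : _per_sg) { sum += s.writes; }
            return sum;
        }
    };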

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2020-01-30 15:01:44 +01:00
Eliran Sinvani
8cfc2aad57 internalize storage proxy statistics metric registration
The storage proxy statistics structure did not contain a method for
registering the statistics with metric groups; instead, each user had to
register some of the metrics itself. There is no real reason for separating
the metrics registration from the statistics data, and even less
justification for doing so only for part of the stats, as was the case here.
This commit internalizes the metrics registration in the storage_proxy stats
structures.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2020-01-30 15:01:40 +01:00
Vladimir Davydov
c27ab87410 storage_proxy: add cas request accounting
This patch implements accounting of Cassandra's metrics related to
lightweight transactions, namely:

  cas_read_latency              transactional read latency (histogram)
  cas_write_latency             transactional write latency (histogram)
  cas_read_timeouts             number of transactional read timeouts
  cas_write_timeouts            number of transactional write timeouts
  cas_read_unavailable          number of transactional read
                                unavailable errors
  cas_write_unavailable         number of transactional write
                                unavailable errors
  cas_read_unfinished_commit    number of transaction commit attempts
                                that occurred on read
  cas_write_unfinished_commit   number of transaction commit attempts
                                that occurred on write
  cas_write_condition_not_met   number of transaction preconditions
                                that did not match current values
  cas_read_contention           how many contended reads were
                                encountered (histogram)
  cas_write_contention          how many contended writes were
                                encountered (histogram)
2019-10-29 19:25:47 +03:00
Konstantin Osipov
56f3bda4c7 metrics: introduce a metric for non-local reads
A read which arrives at a non-replica and has to be forwarded to a replica
by the coordinator is accounted for in its own metric,
reads_coordinator_outside_replica_set.
Most often such a read is produced by a driver which is unaware of the token
distribution on the ring.

If a read was forwarded to another replica due to heat-weighted load
balancing or a query preference set by the user, it is not accounted for in
the metric.

In case of a multi-partition read (a query using an IN statement,
e.g. x in (1, 2, 3)), if any of the keys is read from a non-local node, the
read is accounted as non-local.
The rationale is that if the user is careful and sends IN queries only to the
same vnode, they are rewarded with the counter staying at zero, while if they
send multi-partition IN queries without any precautions, they will see the
metric go up, which gives them a starting point for investigating performance
problems.
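
A sketch of that accounting rule (types are simplified; the real code works on
tokens and node addresses):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct read_stats {
        uint64_t reads_coordinator_outside_replica_set = 0;
    };

    using node_id = int;

    // Called once per (possibly multi-partition) read, with the replica set of
    // every partition it touches.
    void account_read(read_stats& stats, node_id self,
                      const std::vector<std::vector<node_id>>& replica_sets) {
        bool any_non_local = std::any_of(replica_sets.begin(), replica_sets.end(),
            [&](const std::vector<node_id>& replicas) {
                return std::find(replicas.begin(), replicas.end(), self) == replicas.end();
            });
        if (any_non_local) {
            // One partition outside the local replica set is enough to mark the
            // whole (e.g. IN-based) read as non-local.
            ++stats.reads_coordinator_outside_replica_set;
        }
    }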

Closes #4338
2019-07-08 19:23:38 +03:00
Konstantin Osipov
da1d1b74da metrics: account writes forwarded by a coordinator in their own metric.
Add a metric to account for writes which arrive at a non-replica and have
to be forwarded by the coordinator to a replica.

The name of the added metric is 'writes_coordinator_outside_replica_set'.

Do not account for forwarded read-repair writes, since they are already
accounted for by the reads_coordinator_outside_replica_set metric, added in
a subsequent patch.

In scope of #4338.
2019-07-08 18:17:48 +03:00
Nadav Har'El
ccf731a820 Materialized views: add metric for current flow-control delay
The materialized views flow control mechanism works by adding a certain
delay to each client request, designed to slow down the client to the rate at
which we can complete the background view work. Until now we could observe
this mechanism only indirectly, by whether or not it succeeded in keeping the
view backlog bounded; but we had no way to directly observe the delay that
we decided to add. In fact, we had a bug where this delay was constantly
zero, and we didn't even notice :-)

So in this patch we add a new metric,
scylla_storage_proxy_coordinator_last_mv_flow_control_delay

The metric is a floating point number, in units of seconds.

This metric is somewhat peculiar in that it always contains the *last* delay
used for some request; unlike other metrics, it doesn't measure the "current"
value of something. Moreover, it can jump wildly, because there is no
guarantee that each request's delay will be identical (in particular,
different requests may involve different base replicas which have different
view backlogs, and so decide on different delays). In the future we may want
to supplement this metric with some sort of delay histogram, but even
this simple metric is already useful to debug certain scenarios and
understand whether the materialized-views flow control is working or not.
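
A minimal sketch of exposing such a value as a Seastar gauge (member names and
registration details are assumptions; the exported name additionally gets the
application and group prefixes):

    #include <seastar/core/metrics.hh>
    #include <chrono>

    namespace sm = seastar::metrics;

    class mv_flow_control {
        std::chrono::duration<double> _last_delay{0};   // last delay added to a request, in seconds
        sm::metric_groups _metrics;
    public:
        void register_metrics() {
            _metrics.add_group("storage_proxy", {
                sm::make_gauge("coordinator_last_mv_flow_control_delay",
                               [this] { return _last_delay.count(); },   // floating point, seconds
                               sm::description("Delay added to the last request by the "
                                               "materialized-view flow control")),
            });
        }
        void on_delay(std::chrono::duration<double> d) { _last_delay = d; }
    };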

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190227133630.26328-1-nyh@scylladb.com>
2019-03-20 09:14:59 -03:00
Duarte Nunes
819b6f3406 service/storage_proxy: Add counters for delayed base writes
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-19 22:38:30 +00:00
Gleb Natapov
207b57a892 storage_proxy: count number of timed out write attempts after CL is reached
It is useful to have this counter to investigate the reason for read
repairs. A non-zero value means that writes were lost after CL was reached
and RR is expected.

Message-Id: <20181009120900.GF22665@scylladb.com>
2018-10-09 15:17:07 +03:00
Tomasz Grabiec
82270c8699 storage_proxy: Fix misqualification of reads as foreground or background in some cases
The foreground reads metric is derived from the number of live read
executors minus the number of background reads. Background reads are
counted down when their resolver times out. However, a read executor
may still be around for a while, resulting in such reads being
accounted as foreground.

Usually, the gap in which this happens is short, because executor
reference holders time out quickly as well. It's not always the case,
though. For instance, a local read executor doesn't time out quickly
when the target shard has an overloaded CPU, and it takes a while
before the request goes through all the queues, even if IO is not
involved. Observed in #3628.

Fixes #3734.

Another problem is that all reads which received CL responses are
accounted as background until all replicas respond, but if such a read
needs reconciliation, it's still practically a foreground read and
should be accounted as such. Found during code review.

Fixes #3745.

This patch fixes both issues by rearranging the accounting to track
foreground reads instead of background reads, and by considering all
reads foreground until the resulting promise is resolved.
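
A sketch of the rearranged accounting (names are assumed, not the actual read
executor code):

    #include <cstdint>

    struct read_stats {
        uint64_t reads = 0;
        uint64_t foreground_reads = 0;
    };

    class read_executor {
        read_stats& _stats;
        bool _resolved = false;
    public:
        explicit read_executor(read_stats& s) : _stats(s) {
            ++_stats.reads;
            ++_stats.foreground_reads;   // foreground from the start...
        }
        // ...until the result promise resolves, even if CL was already reached
        // and reconciliation is still in progress.
        void on_result_resolved() {
            if (!_resolved) {
                --_stats.foreground_reads;
                _resolved = true;
            }
        }
        // An executor lingering after a timeout no longer skews the metric.
        ~read_executor() { on_result_resolved(); }
    };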

Message-Id: <1535999620-25784-1-git-send-email-tgrabiec@scylladb.com>
2018-09-05 20:42:51 +03:00
Avi Kivity
bea1f715dc storage_proxy: count cross-shard operations
Count operations which were started on one shard and were performed on
another, due to a non-shard-aware driver and/or RPC.
Message-Id: <20180723155118.8545-1-avi@scylladb.com>
2018-07-25 16:21:04 +01:00
Piotr Sarna
1d590b3ca4 storage_proxy: decouple write_stats from stats
This commit extracts the write-related metrics from the stats structure into
a separate write_stats structure, so that it can easily be replaced later,
e.g. for materialized view metrics.

References #3385
References #3416
2018-05-22 16:52:58 +02:00