mirror of
https://github.com/scylladb/scylladb.git
synced 2026-04-23 10:00:35 +00:00
Today I realised that although we have per-table metrics, they are not *really* available by default. I was surprised to find that we don't have (as far as I can tell) a document explaining why this is so, or how to enable them anyway. Moreover, the more I investigated this issue, the more I realised how little I know about Scylla's metrics - how they are calculated, how they are collected, their different types, and so on. So I sat down to figure out everything I wanted to learn about Scylla metrics, and then wrote it all down in a new document, docs/metrics.md.

There are some missing pieces in this document marked by TODO, and probably additional missing pieces that I'm not aware of, but I think this is already a good start and can be (and should be) improved on later.

We really need to have more of these documents describing various Scylla subsystems to new developers - what each subsystem does, why it does what it does, where the code is, and so on. I am facing these problems every day as a seasoned developer - I can't even imagine what our new developers face when trying to understand a subsystem they are not yet familiar with.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180920131103.20590-1-nyh@scylladb.com>
149 lines
7.9 KiB
Markdown
# Scylla Metrics
Scylla exposes dozens of metrics that are valuable for understanding the
performance of a node and for diagnosing performance problems when they
occur. Among other things, you can see request counts, disk, CPU and network
activity, memory usage of different types, activity in individual tables,
and much more.

Scylla's metrics are implemented using Seastar's metrics infrastructure.
Scylla's code continuously updates metrics in in-memory variables, and
exposes them through an HTTP request, http://scyllanode:9180/metrics.
The response to this request is a text document listing the metrics and
their values at the time of the query. This protocol, and the format of
the response, were defined by the Prometheus metric collection system and
are described in detail here: https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exposition_formats.md

> Note that the REST API on port 9180 is devoted only to publishing metrics.
> Scylla also has a separate and more powerful REST API on port 10000.

This very simple REST API is useful for quick scripting and development work,
but in Scylla production you'd usually want to collect metrics from multiple
Scylla nodes, keep a history of each metric over time, and provide a
graphical UI for viewing graphs of these histories. For this purpose,
we provide the separate scylla-grafana-monitoring project - see
https://github.com/scylladb/scylla-grafana-monitoring on how to install and
use it. The scylla-grafana-monitoring project allows you to continuously
collect metrics from several Scylla nodes into a Prometheus metric-collection
server, and then to visualize these metrics using Grafana and a web browser.

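For the quick-scripting use case mentioned above, the following is a minimal Python sketch of reading the metrics endpoint. The function names `parse_metrics` and `fetch_metrics` are made up for this illustration, the host name `scyllanode` is the placeholder used in this document, and the parser handles only the simple `name{labels} value` lines of the Prometheus text format (it does not unescape label values).

```python
import re
import urllib.request

# One sample line looks like: name{label="value",...} 123
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text-format output into (name, labels, value) tuples.

    Comment lines (starting with '#') and lines that do not look like a
    sample are skipped. Escapes inside label values are left as-is.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(url="http://scyllanode:9180/metrics"):
    """Fetch and parse the metrics of one node (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))


# Offline demonstration on a single sample line in the format shown
# in this document.
example = 'scylla_cql_reads{instance="sid",shard="0",type="derive"} 20\n'
parsed = parse_metrics(example)
```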
## Metric labels: instance and shard
Different Scylla nodes will have different values for each metric (e.g.,
`scylla_cql_reads`, the total number of CQL read requests). Moreover, Scylla
is sharded, meaning that inside each node each core works on its own data
and keeps its own separate metrics. So in the metrics output, each metric
identifier contains, beyond the metric's name, additional labels that
qualify which node and which shard this metric comes from. For example:
```
scylla_cql_reads{instance="sid",shard="0",type="derive"} 20
```
In this case, the metric comes from a node (which we call "instance") whose
name is "sid", and from shard 0.

The appearance of the instance and shard ids on each metric is what allows
a single server (e.g. the Prometheus server mentioned above) to collect
metrics from many Scylla nodes and their shards. The visualization tool
(e.g., Grafana) can then show the metrics of different nodes and shards
separately, or calculate and display various sums - e.g., the sum
over all shards of each node, or the total sum over all shards and all nodes.

The "instance" label is superfluous for this goal - the Prometheus server
knows which node it got each metric from - so we plan to remove it in the
future - see https://github.com/scylladb/seastar/issues/477

The "type" label should be ignored - it appears for historic reasons
(it was used by collectd) and is planned to be removed in the future.

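The per-node and total sums described above can be sketched in a few lines of Python. This is only an illustration of the aggregation a visualization tool performs: the function name `sum_by_instance`, the second node name "nancy" and all the values are hypothetical.

```python
from collections import defaultdict


def sum_by_instance(samples):
    """Sum one metric over all shards of each node.

    `samples` is an iterable of (labels, value) pairs for a single metric,
    where `labels` is a dict containing at least "instance" and "shard".
    """
    per_node = defaultdict(float)
    for labels, value in samples:
        per_node[labels["instance"]] += value
    return dict(per_node)


# Hypothetical values of scylla_cql_reads from two nodes.
samples = [
    ({"instance": "sid", "shard": "0"}, 20.0),
    ({"instance": "sid", "shard": "1"}, 22.0),
    ({"instance": "nancy", "shard": "0"}, 5.0),
]

per_node = sum_by_instance(samples)   # sum over the shards of each node
total = sum(per_node.values())        # sum over all shards and all nodes
```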
## Per-table metrics
Most of Scylla's metrics are global (within each shard). Scylla also supports
per-table metrics, which are maintained separately for each table in the
database.

On a deployment with a large number of tables, this can result in a very
large number of metrics in each response, and overwhelm Scylla's HTTP
server and/or the Prometheus server collecting these metrics. For this
reason, the per-table metrics are currently **disabled** by default:
The per-table metrics are defined in the `table::set_metrics()` function,
and only added when the `enable_keyspace_column_family_metrics` flag is
enabled (and it is disabled by default).

To enable this flag and the per-table metrics, you can pass the parameter
`--enable-keyspace-column-family-metrics 1` on the Scylla command line, or
set this parameter in Scylla's configuration file.

We are planning to rethink this approach in the future. In particular,
it's not great that we currently need to restart Scylla to make these
metrics available. Scylla already maintains these per-table metrics in
per-table memory variables, and we just need a way to optionally expose
them through the HTTP request.

To tell the metrics of the different tables apart, each metric's identifier
contains the "ks" (*keyspace*) and "cf" (*column family* - the old name
for table) as labels. For example,

```
scylla_column_family_pending_compaction{instance="sid",cf="IndexInfo",ks="system",shard="0",type="gauge"} 0.000000
```

Here we can see the "scylla_column_family_pending_compaction" metric
measured in shard 0 of node "sid", for the table "IndexInfo" in keyspace
"system".

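To get a feel for why per-table metrics can overwhelm the HTTP server or Prometheus, here is a back-of-the-envelope Python sketch. Each per-table metric is reported once per table and per shard, so the number of lines grows with the product of the three counts. The function name and all the numbers below are hypothetical and chosen only for illustration.

```python
def per_table_series(tables, shards, metrics_per_table):
    """Number of per-table metric lines one node reports in one response."""
    return tables * shards * metrics_per_table


# Hypothetical deployment: 1000 tables, 16 shards per node,
# and 30 different per-table metrics.
series = per_table_series(tables=1000, shards=16, metrics_per_table=30)
```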
## Types of metrics
Scylla metrics fall under three types: "counter", "gauge" and "histogram".

Most metrics are of the "counter" type. A counter metric tracks a cumulative
value over objects or events that existed throughout the lifetime of the
node. For example, the "total number of requests processed so far", or
"the total number of bytes written to disk".

When visualizing counter metrics, it is often useful to look at the
*derivative*, or rate of change, of the number, instead of at the cumulative
number itself. Note that Scylla only provides the cumulative number - the
visualization tool used by the user (such as Grafana mentioned earlier) is
responsible for calculating the rate of change, by taking two measurements
of the cumulative value at two different times and dividing the difference
between the values by the time difference. For example, by
subtracting the "total number of requests" values queried one second apart,
we can show the number of requests handled during that second.

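The rate calculation described above amounts to a single division. The following Python sketch uses a hypothetical function name and hypothetical sample values, and it deliberately ignores the case of a node restart, after which a counter starts again from zero.

```python
def counter_rate(value1, time1, value2, time2):
    """Rate of change of a counter between two measurements.

    value1 and value2 are the cumulative values read at times time1 and
    time2 (in seconds); the result is in units per second.
    """
    if time2 <= time1:
        raise ValueError("the second measurement must be taken later")
    return (value2 - value1) / (time2 - time1)


# Hypothetical readings of a "total number of requests" counter,
# taken two seconds apart.
requests_per_second = counter_rate(1000, 10.0, 1500, 12.0)
```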
> In some contexts, we call counter metrics "derive" metrics. We do this
> mainly for historic reasons, because of our previous focus on the "collectd"
> metric collection daemon - which Scylla still supports but is no longer
> our recommended choice. Collectd has both "derive" and "counter" metrics
> with a subtle difference: Both indicate cumulative values, but "counter"
> is a sum of non-negative values, while "derive" is a sum of values which
> may be negative. This distinction is not important in Scylla: all our
> cumulative metrics are sums of non-negative values, and are monotonically
> increasing. So in this document we picked the term "counter" and use it
> exclusively.

In contrast to counter metrics, which accumulate a measurement throughout the
lifetime of the node, a **gauge** metric measures the state of objects
currently existing in the system. For example, the number of requests being
processed *right now*, the size of some queue, the amount of memory devoted
now to the row cache, or the amount of disk used now for the data storage.

Gauge metrics are less common than counter metrics. When visualizing them,
one usually wants to look at the metric itself rather than its rate of
change. However, even for gauge metrics it is sometimes useful to visualize
their derivative - for example, a user might want to visualize the rate of
change of the amount of disk storage.

Internally, Scylla calculates many of the gauge metrics just like it calculates
counter metrics - as a cumulative value: For example, Scylla maintains a
metric of the number of requests being processed *right now* by adding 1 to
the metric when starting to process a request, and subtracting 1 when the
request's processing is complete. This metric is nevertheless labeled "gauge"
because it provides a metric over currently-existing objects in the system
(requests being processed), not a sum of historic information.

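The add-1/subtract-1 bookkeeping described above can be illustrated with a short Python sketch. This is not Scylla's code (Scylla is written in C++ on top of Seastar); the class and attribute names are hypothetical and only show how a counter and a gauge can be maintained side by side.

```python
from contextlib import contextmanager


class RequestStats:
    """Toy model of one counter and one gauge updated by request handling."""

    def __init__(self):
        self.requests_total = 0       # counter: only ever increases
        self.requests_in_flight = 0   # gauge: goes up and down

    @contextmanager
    def track(self):
        """Wrap the processing of one request."""
        self.requests_total += 1
        self.requests_in_flight += 1      # request started
        try:
            yield
        finally:
            self.requests_in_flight -= 1  # request finished (or failed)


stats = RequestStats()
with stats.track():
    in_flight_during_request = stats.requests_in_flight
in_flight_after_request = stats.requests_in_flight
```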
TODO: histogram metrics. They are described in the Prometheus document linked
above.

## List of metrics
Looking at the response to http://scyllanode:9180/metrics is the best
way to see the list of metrics currently exposed by Scylla, because it
includes a textual description in a comment above each metric.

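As a sketch of how those descriptions can be pulled out of a response, the following Python function collects the `# HELP` comment lines of the Prometheus text format into a dictionary. The function name is made up for this illustration, and the description text in the sample is hypothetical, not copied from Scylla's actual output.

```python
def metric_descriptions(text):
    """Map each metric name to the description in its '# HELP' comment."""
    descriptions = {}
    for line in text.splitlines():
        # A help line has the form: # HELP <metric name> <description>
        parts = line.split(None, 3)
        if len(parts) >= 3 and parts[0] == "#" and parts[1] == "HELP":
            descriptions[parts[2]] = parts[3].strip() if len(parts) == 4 else ""
    return descriptions


# Hypothetical excerpt of a metrics response.
sample = (
    "# HELP scylla_cql_reads Total number of CQL read requests\n"
    "# TYPE scylla_cql_reads counter\n"
    'scylla_cql_reads{shard="0"} 20\n'
)
descriptions = metric_descriptions(sample)
```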
TODO: mention source files in which a developer should add new metrics.