Dump a diagnostics report on each shard when receiving a SIGQUIT. The
report is logged with a dedicated logger, called diagnostics.
The report has multiple parts:
* seastar memory diagnostics, similar to that printed by the scylla
memory command (from scylla-gdb.py).
* reader concurrency semaphore diagnostics for each semaphore.
Example report:
INFO 2024-11-27 01:31:55,882 [shard 0:main] diagnostics - Diagnostics dump requested via SIGQUIT:
Dumping seastar memory diagnostics
Used memory: 3988M
Free memory: 58M
Total memory: 4G
Hard failures: 0
LSA
allocated: 4M
used: 16
free: 4G
Cache:
total: 1M
used: 642K
free: 398K
Memtables:
total: 3M
Regular:
real dirty: 0B
virt dirty: 0B
System:
real dirty: 3M
virt dirty: 3M
Replica:
Read Concurrency Semaphores:
user: 0/100, 0B/81M, queued: 0
streaming: 0/10, 0B/81M, queued: 0
system: 0/10, 0B/81M, queued: 0
compaction: 0/unlimited, 0B/unlimited
view update: 0/50, 0B/40M, queued: 0
Execution Stages:
apply stage:
Total: 0
Tables - Ongoing Operations:
Pending writes (top 10):
0 Total (all)
Pending reads (top 10):
0 Total (all)
Pending streams (top 10):
0 Total (all)
Small pools:
objsz spansz usedobj memory unused wst%
8 4K 858 16K 9K 58
10 4K 5 8K 8K 99
12 4K 5 8K 8K 99
14 4K 0 0B 0B 0
16 4K 2k 44K 15K 35
32 4K 4k 136K 16K 11
32 4K 8k 280K 24K 8
32 4K 3k 92K 6K 6
32 4K 4k 140K 21K 14
48 4K 3k 180K 25K 14
48 4K 2k 120K 27K 22
64 4K 2k 156K 18K 11
64 4K 19k 1M 11K 0
80 4K 3k 236K 16K 6
96 4K 6k 572K 49K 8
112 4K 2k 276K 72K 25
128 4K 477 80K 20K 25
160 4K 194 60K 30K 49
192 4K 1k 232K 39K 16
224 4K 2k 468K 15K 3
256 4K 182 100K 55K 54
320 8K 349 152K 43K 28
384 8K 332 288K 164K 56
448 4K 243 180K 74K 40
512 4K 256 244K 116K 47
640 16K 185 192K 76K 39
768 16K 394 432K 137K 31
896 8K 54 192K 144K 75
1024 4K 288 432K 144K 33
1280 32K 92 256K 140K 54
1536 32K 11 128K 111K 86
1792 16K 10 144K 126K 87
2048 8K 487 1M 90K 8
2560 64K 113 384K 100K 26
3072 64K 9 256K 228K 89
3584 32K 3 288K 277K 96
4096 16K 129 912K 396K 43
5120 128K 21 384K 275K 71
6144 128K 4 512K 486K 94
7168 64K 3 576K 553K 96
8192 32K 373 3M 56K 1
10240 64K 6 832K 770K 92
12288 64K 17 960K 756K 78
14336 128K 2 1M 1M 97
16384 64K 14 1M 992K 81
Page spans:
index size free used spans
0 4K 4K 5M 1k
1 8K 8K 2M 213
2 16K 16K 2M 106
3 32K 64K 6M 200
4 64K 64K 4M 71
5 128K 384K 3934M 31k
6 256K 1M 256K 5
7 512K 512K 512K 2
8 1M 2M 0B 2
9 2M 2M 2M 2
10 4M 4M 0B 1
11 8M 16M 0B 2
12 16M 32M 0B 2
13 32M 0B 32M 1
14 64M 0B 0B 0
15 128M 0B 0B 0
16 256M 0B 0B 0
17 512M 0B 0B 0
18 1G 0B 0B 0
19 2G 0B 0B 0
20 4G 0B 0B 0
21 8G 0B 0B 0
22 16G 0B 0B 0
23 32G 0B 0B 0
24 64G 0B 0B 0
25 128G 0B 0B 0
26 256G 0B 0B 0
27 512G 0B 0B 0
28 1T 0B 0B 0
29 2T 0B 0B 0
30 4T 0B 0B 0
31 8T 0B 0B 0
INFO 2024-11-27 01:31:55,882 [shard 0:main] diagnostics - Diagnostics dump requested via SIGQUIT:
Semaphore user with 0/100 count and 0/84850769 memory resources: user request, dumping permit diagnostics:
permits count memory table/operation/state
0 0 0B total
Stats:
permit_based_evictions: 0
time_based_evictions: 0
inactive_reads: 0
total_successful_reads: 0
total_failed_reads: 0
total_reads_shed_due_to_overload: 0
total_reads_killed_due_to_kill_limit: 0
reads_admitted: 0
reads_enqueued_for_admission: 0
reads_enqueued_for_memory: 0
reads_admitted_immediately: 0
reads_queued_because_ready_list: 0
reads_queued_because_need_cpu_permits: 0
reads_queued_because_memory_resources: 0
reads_queued_because_count_resources: 0
reads_queued_with_eviction: 0
total_permits: 0
current_permits: 0
need_cpu_permits: 0
awaits_permits: 0
disk_reads: 0
sstables_read: 0
INFO 2024-11-27 01:31:55,882 [shard 0:main] diagnostics - Diagnostics dump requested via SIGQUIT:
Semaphore streaming with 0/10 count and 0/84850769 memory resources: user request, dumping permit diagnostics:
permits count memory table/operation/state
0 0 0B total
Stats:
permit_based_evictions: 0
time_based_evictions: 0
inactive_reads: 0
total_successful_reads: 6
total_failed_reads: 0
total_reads_shed_due_to_overload: 0
total_reads_killed_due_to_kill_limit: 0
reads_admitted: 6
reads_enqueued_for_admission: 0
reads_enqueued_for_memory: 0
reads_admitted_immediately: 6
reads_queued_because_ready_list: 0
reads_queued_because_need_cpu_permits: 0
reads_queued_because_memory_resources: 0
reads_queued_because_count_resources: 0
reads_queued_with_eviction: 0
total_permits: 6
current_permits: 0
need_cpu_permits: 0
awaits_permits: 0
disk_reads: 0
sstables_read: 0
INFO 2024-11-27 01:31:55,882 [shard 0:main] diagnostics - Diagnostics dump requested via SIGQUIT:
Semaphore compaction with 0/2147483647 count and 0/9223372036854775807 memory resources: user request, dumping permit diagnostics:
permits count memory table/operation/state
0 0 0B total
Stats:
permit_based_evictions: 0
time_based_evictions: 0
inactive_reads: 0
total_successful_reads: 0
total_failed_reads: 0
total_reads_shed_due_to_overload: 0
total_reads_killed_due_to_kill_limit: 0
reads_admitted: 0
reads_enqueued_for_admission: 0
reads_enqueued_for_memory: 0
reads_admitted_immediately: 0
reads_queued_because_ready_list: 0
reads_queued_because_need_cpu_permits: 0
reads_queued_because_memory_resources: 0
reads_queued_because_count_resources: 0
reads_queued_with_eviction: 0
total_permits: 27
current_permits: 0
need_cpu_permits: 0
awaits_permits: 0
disk_reads: 0
sstables_read: 0
INFO 2024-11-27 01:31:55,882 [shard 0:main] diagnostics - Diagnostics dump requested via SIGQUIT:
Semaphore system with 0/10 count and 0/84850769 memory resources: user request, dumping permit diagnostics:
permits count memory table/operation/state
1 0 0B *.*/view_builder/active
1 0 0B total
Stats:
permit_based_evictions: 0
time_based_evictions: 0
inactive_reads: 0
total_successful_reads: 234
total_failed_reads: 0
total_reads_shed_due_to_overload: 0
total_reads_killed_due_to_kill_limit: 0
reads_admitted: 234
reads_enqueued_for_admission: 154
reads_enqueued_for_memory: 0
reads_admitted_immediately: 80
reads_queued_because_ready_list: 154
reads_queued_because_need_cpu_permits: 0
reads_queued_because_memory_resources: 0
reads_queued_because_count_resources: 0
reads_queued_with_eviction: 0
total_permits: 235
current_permits: 1
need_cpu_permits: 0
awaits_permits: 0
disk_reads: 0
sstables_read: 0
INFO 2024-11-27 01:31:55,882 [shard 0:main] diagnostics - Diagnostics dump requested via SIGQUIT:
Semaphore view_update with 0/50 count and 0/42425384 memory resources: user request, dumping permit diagnostics:
permits count memory table/operation/state
0 0 0B total
Stats:
permit_based_evictions: 0
time_based_evictions: 0
inactive_reads: 0
total_successful_reads: 0
total_failed_reads: 0
total_reads_shed_due_to_overload: 0
total_reads_killed_due_to_kill_limit: 0
reads_admitted: 0
reads_enqueued_for_admission: 0
reads_enqueued_for_memory: 0
reads_admitted_immediately: 0
reads_queued_because_ready_list: 0
reads_queued_because_need_cpu_permits: 0
reads_queued_because_memory_resources: 0
reads_queued_because_count_resources: 0
reads_queued_with_eviction: 0
total_permits: 0
current_permits: 0
need_cpu_permits: 0
awaits_permits: 0
disk_reads: 0
sstables_read: 0
Fixes: scylladb/scylladb#7400
Closes scylladb/scylladb#21692
115 lines
5.9 KiB
ReStructuredText
=========================
ScyllaDB Diagnostic Tools
=========================

ScyllaDB has a wide selection of tools and information sources available for diagnosing problems.
This document covers both built-in and standalone tools that can help you diagnose a problem with ScyllaDB.
This document focuses on enumerating the available tools and information sources, rather than providing a guide on how to diagnose generic or specific issues.

Logs
----

Logs are the most obvious source of information about why ScyllaDB is misbehaving.
On production systems, ScyllaDB logs to syslog; thus logs can usually be viewed via ``journalctl``.
See `Logging </getting-started/logging/>`_ for more information on how to access the logs.

ScyllaDB has the following log levels: ``trace``, ``debug``, ``info``, ``warn``, ``error``.
By default, only logs with level ``info`` or above are logged. Some administrators might even set this to ``warn`` to reduce the amount of logs.
ScyllaDB has many different loggers; usually there is one for each subsystem or module.
You can change the log level of a certain logger to ``debug`` or ``trace`` to get more visibility into the respective subsystem.
This can be done in one of the following ways:

* configuration file (``scylla.yaml``):

.. code-block:: yaml

    logger_log_level:
        mylogger: debug

* command line:

.. code-block:: shell

    --logger-log-level mylogger=debug

* nodetool:

.. code-block:: shell

    $ nodetool setlogginglevel mylogger debug

* REST API:

.. code-block:: shell

    $ scylla-api-client system logger/{name} POST --name mylogger --level debug

The first two methods require a restart; the latter two take effect at runtime.
Note that setting the log level of even a single logger to ``debug`` or below might generate a huge amount of log traffic.
Try to time the log-level bump to when an event of interest starts, and revert it quickly afterward to avoid saturating your log.
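
The advice above (raise the level only around the event of interest, then revert) can be sketched as a small shell helper.
This is only an illustration under stated assumptions: ``with_log_level`` is a hypothetical name, ``nodetool`` is assumed to be on the ``PATH``, and the helper reverts to ``info`` (the default level) rather than to whatever level the logger had before.

.. code-block:: shell

    # Hypothetical helper: raise a logger's level, run a command, then revert.
    # Usage: with_log_level LOGGER LEVEL COMMAND [ARGS...]
    with_log_level() {
        wll_logger=$1
        wll_level=$2
        shift 2
        # Raise the level; give up if nodetool cannot reach the node.
        nodetool setlogginglevel "$wll_logger" "$wll_level" || return 1
        # Run the command during which the event of interest should occur.
        "$@"
        wll_status=$?
        # Assumption: the logger was at the default level (info) before.
        nodetool setlogginglevel "$wll_logger" info
        return "$wll_status"
    }

    # Example: keep "mylogger" at debug for 30 seconds, then revert.
    #   with_log_level mylogger debug sleep 30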

Monitoring
----------

ScyllaDB has a comprehensive monitoring and alerting solution, displaying the different counters from ScyllaDB and the underlying OS, as well as alerts for common problems and pitfalls.
Note that by default, monitoring shows information about the entire cluster, and even when you select a certain node, it aggregates the counters from the node's shards.
Sometimes counter values aggregated over all the shards, or even over multiple or all nodes, are what you want.
Just be aware of the aggregation and know that you can always select the nodes and shards of interest, or display counters by node and shard (disable aggregation).
See `ScyllaDB Monitoring Stack <https://monitoring.docs.scylladb.com/stable/>`_ for more details.

Tracing
-------

Tracing allows you to retrieve the internal log of events happening in the context of a single query.
Therefore, tracing is only useful to diagnose problems related to a certain query and cannot be used to diagnose generic problems.
That said, when it comes to diagnosing problems with a certain query, tracing is an excellent tool, allowing you to have a peek at what happens when that query is processed, including the timestamp of each event.
For more details, see `Tracing </using-scylla/tracing>`_.

Nodetool
--------

Although ``nodetool`` is primarily an administration tool, it has various commands that retrieve and display useful information about the state of a certain ScyllaDB node.
Look for commands with "stats", "info", "describe", "get", "histogram" in their names.
For a comprehensive list of all available nodetool commands, see the `Nodetool Reference </operating-scylla/nodetool>`_.

REST API
--------

ScyllaDB has a REST API which is a superset of all ``nodetool`` commands, in the sense that it is the backend serving all of them.
It has many more endpoints, many of which can supply valuable information about the internal state of ScyllaDB.
For more information, see `REST API </operating-scylla/rest>`_.
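
Because the REST API is a plain HTTP interface, it can also be driven with a generic client such as ``curl``.
The following is a minimal sketch, not an official client. It assumes the API listens on ``localhost:10000`` (your deployment may use a different address or port) and that the ``system/logger/{name}`` endpoint accepts the level as a ``level`` query parameter. All function names are hypothetical.

.. code-block:: shell

    # Base URL of the REST API; localhost:10000 is an assumption.
    SCYLLA_API="${SCYLLA_API:-http://localhost:10000}"

    # Print the URL of the per-logger endpoint for logger $1.
    logger_url() {
        printf '%s/system/logger/%s\n' "$SCYLLA_API" "$1"
    }

    # Query the current level of logger $1.
    get_log_level() {
        curl -sS -X GET "$(logger_url "$1")"
    }

    # Set logger $1 to level $2 (assumed to be passed as a query parameter).
    set_log_level() {
        curl -sS -X POST "$(logger_url "$1")?level=$2"
    }

    # Example:
    #   set_log_level mylogger debug
    #   get_log_level mylogger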

System Tables
-------------

ScyllaDB has various internal system tables containing valuable information on its state.
Some of these are virtual tables, tables whose content is derived from in-memory state, rather than on-disk storage as is the case for regular tables. Virtual tables look like and act like regular tables.
For a complete list of all internal tables (including virtual ones), see `System Keyspace <https://github.com/scylladb/scylladb/blob/master/docs/dev/system_keyspace.md>`_.

Diagnostics dump on SIGQUIT
---------------------------

Sending ``SIGQUIT`` to ScyllaDB triggers a diagnostics dump to the logs, written via the dedicated ``diagnostics`` logger.
A dump is produced on every shard. Each shard's dump has multiple parts:

* A summary of the state of the memory allocator and some other stats. The format is very similar to that used by ``scylla memory`` (from ``scylla-gdb.py``). See `Debugging Out Of Memory (OOM) crashes <https://github.com/scylladb/scylladb/blob/master/docs/dev/debugging.md#debugging-out-of-memory-oom-crashes>`_.
* A reader concurrency semaphore diagnostics dump for each semaphore in the database. See `reader_concurrency_semaphore.md <https://github.com/scylladb/scylladb/blob/master/docs/dev/reader-concurrency-semaphore.md#reader-concurrency-semaphore-diagnostic-dumps>`_ for more details.
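
As an illustration of how such a dump might be consumed, the sketch below condenses the per-semaphore header lines into one line each.
It assumes the header has the form ``Semaphore <name> with <used>/<total> count and <used>/<total> memory resources: ...``, which may differ between versions. The name ``semaphore_summary``, the ``scylla-server`` unit name and the ``scylla`` process name are assumptions as well.

.. code-block:: shell

    # Hypothetical helper: read a log on stdin and print, for every
    # semaphore header line, "<name> <count used/total> <memory used/total>".
    semaphore_summary() {
        sed -n -E 's/^[[:space:]]*Semaphore ([^ ]+) with ([0-9]+\/[0-9]+) count and ([0-9]+\/[0-9]+) memory resources.*/\1 \2 \3/p'
    }

    # Example (assumes a systemd unit named scylla-server and a process
    # named scylla):
    #   sudo kill -QUIT "$(pidof scylla)"
    #   journalctl -u scylla-server --since "1 minute ago" | semaphore_summary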

Other Tools
-----------

ScyllaDB has various other tools, mainly to work with sstables.
If you are diagnosing a problem that is related to sstables misbehaving or being corrupt, you may find these useful:

* `sstabledump </operating-scylla/admin-tools/sstabledump/>`_
* `ScyllaDB SStable </operating-scylla/admin-tools/scylla-sstable/>`_
* `ScyllaDB Types </operating-scylla/admin-tools/scylla-types/>`_

GDB
---

GDB is the ultimate tool for extracting any information from live ScyllaDB processes or coredumps.
However, it requires intimate knowledge of ScyllaDB internals to be useful.
For more details on how to debug ScyllaDB, see `Debugging <https://github.com/scylladb/scylladb/blob/master/docs/dev/debugging.md>`_.