Files
scylladb/docs/operating-scylla/diagnostics.rst
Botond Dénes 055a36ae55 main: dump diagnostics on SIGQUIT
Dump a diagnostics report on each shard when receiving a SIGQUIT. The
report is logged with a dedicated logger, called diagnostics.
The report has multiple parts:
* seastar memory diagnostics, similar to that printed by the scylla
  memory command (from scylla-gdb.py).
* reader concurrency semaphore diagnostics for each semaphore.

Example report:

    INFO  2024-11-27 01:31:55,882 [shard 0:main] diagnostics - Diagnostics dump requested via SIGQUIT:
    Dumping seastar memory diagnostics
    Used memory:   3988M
    Free memory:   58M
    Total memory:  4G
    Hard failures: 0

    LSA
      allocated: 4M
      used:      16
      free:      4G

    Cache:
      total: 1M
      used:  642K
      free:  398K

    Memtables:
     total: 3M
     Regular:
      real dirty: 0B
      virt dirty: 0B
     System:
      real dirty: 3M
      virt dirty: 3M

    Replica:
      Read Concurrency Semaphores:
        user: 0/100, 0B/81M, queued: 0
        streaming: 0/10, 0B/81M, queued: 0
        system: 0/10, 0B/81M, queued: 0
        compaction: 0/unlimited, 0B/unlimited
        view update: 0/50, 0B/40M, queued: 0
      Execution Stages:
        apply stage:
             Total: 0
      Tables - Ongoing Operations:
        Pending writes (top 10):
          0 Total (all)
        Pending reads (top 10):
          0 Total (all)
        Pending streams (top 10):
          0 Total (all)

    Small pools:
    objsz spansz usedobj memory unused wst%
        8     4K     858    16K     9K   58
       10     4K       5     8K     8K   99
       12     4K       5     8K     8K   99
       14     4K       0     0B     0B    0
       16     4K      2k    44K    15K   35
       32     4K      4k   136K    16K   11
       32     4K      8k   280K    24K    8
       32     4K      3k    92K     6K    6
       32     4K      4k   140K    21K   14
       48     4K      3k   180K    25K   14
       48     4K      2k   120K    27K   22
       64     4K      2k   156K    18K   11
       64     4K     19k     1M    11K    0
       80     4K      3k   236K    16K    6
       96     4K      6k   572K    49K    8
      112     4K      2k   276K    72K   25
      128     4K     477    80K    20K   25
      160     4K     194    60K    30K   49
      192     4K      1k   232K    39K   16
      224     4K      2k   468K    15K    3
      256     4K     182   100K    55K   54
      320     8K     349   152K    43K   28
      384     8K     332   288K   164K   56
      448     4K     243   180K    74K   40
      512     4K     256   244K   116K   47
      640    16K     185   192K    76K   39
      768    16K     394   432K   137K   31
      896     8K      54   192K   144K   75
     1024     4K     288   432K   144K   33
     1280    32K      92   256K   140K   54
     1536    32K      11   128K   111K   86
     1792    16K      10   144K   126K   87
     2048     8K     487     1M    90K    8
     2560    64K     113   384K   100K   26
     3072    64K       9   256K   228K   89
     3584    32K       3   288K   277K   96
     4096    16K     129   912K   396K   43
     5120   128K      21   384K   275K   71
     6144   128K       4   512K   486K   94
     7168    64K       3   576K   553K   96
     8192    32K     373     3M    56K    1
    10240    64K       6   832K   770K   92
    12288    64K      17   960K   756K   78
    14336   128K       2     1M     1M   97
    16384    64K      14     1M   992K   81

    Page spans:
    index  size  free  used spans
        0    4K    4K    5M    1k
        1    8K    8K    2M   213
        2   16K   16K    2M   106
        3   32K   64K    6M   200
        4   64K   64K    4M    71
        5  128K  384K 3934M   31k
        6  256K    1M  256K     5
        7  512K  512K  512K     2
        8    1M    2M    0B     2
        9    2M    2M    2M     2
       10    4M    4M    0B     1
       11    8M   16M    0B     2
       12   16M   32M    0B     2
       13   32M    0B   32M     1
       14   64M    0B    0B     0
       15  128M    0B    0B     0
       16  256M    0B    0B     0
       17  512M    0B    0B     0
       18    1G    0B    0B     0
       19    2G    0B    0B     0
       20    4G    0B    0B     0
       21    8G    0B    0B     0
       22   16G    0B    0B     0
       23   32G    0B    0B     0
       24   64G    0B    0B     0
       25  128G    0B    0B     0
       26  256G    0B    0B     0
       27  512G    0B    0B     0
       28    1T    0B    0B     0
       29    2T    0B    0B     0
       30    4T    0B    0B     0
       31    8T    0B    0B     0

    INFO  2024-11-27 01:31:55,882 [shard 0:main] diagnostics - Diagnostics dump requested via SIGQUIT:
    Semaphore user with 0/100 count and 0/84850769 memory resources: user request, dumping permit diagnostics:

    permits	count	memory	table/operation/state

    0	0	0B	total

    Stats:
    permit_based_evictions: 0
    time_based_evictions: 0
    inactive_reads: 0
    total_successful_reads: 0
    total_failed_reads: 0
    total_reads_shed_due_to_overload: 0
    total_reads_killed_due_to_kill_limit: 0
    reads_admitted: 0
    reads_enqueued_for_admission: 0
    reads_enqueued_for_memory: 0
    reads_admitted_immediately: 0
    reads_queued_because_ready_list: 0
    reads_queued_because_need_cpu_permits: 0
    reads_queued_because_memory_resources: 0
    reads_queued_because_count_resources: 0
    reads_queued_with_eviction: 0
    total_permits: 0
    current_permits: 0
    need_cpu_permits: 0
    awaits_permits: 0
    disk_reads: 0
    sstables_read: 0
    INFO  2024-11-27 01:31:55,882 [shard 0:main] diagnostics - Diagnostics dump requested via SIGQUIT:
    Semaphore streaming with 0/10 count and 0/84850769 memory resources: user request, dumping permit diagnostics:

    permits	count	memory	table/operation/state

    0	0	0B	total

    Stats:
    permit_based_evictions: 0
    time_based_evictions: 0
    inactive_reads: 0
    total_successful_reads: 6
    total_failed_reads: 0
    total_reads_shed_due_to_overload: 0
    total_reads_killed_due_to_kill_limit: 0
    reads_admitted: 6
    reads_enqueued_for_admission: 0
    reads_enqueued_for_memory: 0
    reads_admitted_immediately: 6
    reads_queued_because_ready_list: 0
    reads_queued_because_need_cpu_permits: 0
    reads_queued_because_memory_resources: 0
    reads_queued_because_count_resources: 0
    reads_queued_with_eviction: 0
    total_permits: 6
    current_permits: 0
    need_cpu_permits: 0
    awaits_permits: 0
    disk_reads: 0
    sstables_read: 0
    INFO  2024-11-27 01:31:55,882 [shard 0:main] diagnostics - Diagnostics dump requested via SIGQUIT:
    Semaphore compaction with 0/2147483647 count and 0/9223372036854775807 memory resources: user request, dumping permit diagnostics:

    permits	count	memory	table/operation/state

    0	0	0B	total

    Stats:
    permit_based_evictions: 0
    time_based_evictions: 0
    inactive_reads: 0
    total_successful_reads: 0
    total_failed_reads: 0
    total_reads_shed_due_to_overload: 0
    total_reads_killed_due_to_kill_limit: 0
    reads_admitted: 0
    reads_enqueued_for_admission: 0
    reads_enqueued_for_memory: 0
    reads_admitted_immediately: 0
    reads_queued_because_ready_list: 0
    reads_queued_because_need_cpu_permits: 0
    reads_queued_because_memory_resources: 0
    reads_queued_because_count_resources: 0
    reads_queued_with_eviction: 0
    total_permits: 27
    current_permits: 0
    need_cpu_permits: 0
    awaits_permits: 0
    disk_reads: 0
    sstables_read: 0
    INFO  2024-11-27 01:31:55,882 [shard 0:main] diagnostics - Diagnostics dump requested via SIGQUIT:
    Semaphore system with 0/10 count and 0/84850769 memory resources: user request, dumping permit diagnostics:

    permits	count	memory	table/operation/state
    1	0	0B	*.*/view_builder/active

    1	0	0B	total

    Stats:
    permit_based_evictions: 0
    time_based_evictions: 0
    inactive_reads: 0
    total_successful_reads: 234
    total_failed_reads: 0
    total_reads_shed_due_to_overload: 0
    total_reads_killed_due_to_kill_limit: 0
    reads_admitted: 234
    reads_enqueued_for_admission: 154
    reads_enqueued_for_memory: 0
    reads_admitted_immediately: 80
    reads_queued_because_ready_list: 154
    reads_queued_because_need_cpu_permits: 0
    reads_queued_because_memory_resources: 0
    reads_queued_because_count_resources: 0
    reads_queued_with_eviction: 0
    total_permits: 235
    current_permits: 1
    need_cpu_permits: 0
    awaits_permits: 0
    disk_reads: 0
    sstables_read: 0
    INFO  2024-11-27 01:31:55,882 [shard 0:main] diagnostics - Diagnostics dump requested via SIGQUIT:
    Semaphore view_update with 0/50 count and 0/42425384 memory resources: user request, dumping permit diagnostics:

    permits	count	memory	table/operation/state

    0	0	0B	total

    Stats:
    permit_based_evictions: 0
    time_based_evictions: 0
    inactive_reads: 0
    total_successful_reads: 0
    total_failed_reads: 0
    total_reads_shed_due_to_overload: 0
    total_reads_killed_due_to_kill_limit: 0
    reads_admitted: 0
    reads_enqueued_for_admission: 0
    reads_enqueued_for_memory: 0
    reads_admitted_immediately: 0
    reads_queued_because_ready_list: 0
    reads_queued_because_need_cpu_permits: 0
    reads_queued_because_memory_resources: 0
    reads_queued_because_count_resources: 0
    reads_queued_with_eviction: 0
    total_permits: 0
    current_permits: 0
    need_cpu_permits: 0
    awaits_permits: 0
    disk_reads: 0
    sstables_read: 0

Fixes: scylladb/scylladb#7400

Closes scylladb/scylladb#21692
2024-11-28 18:52:29 +02:00

115 lines
5.9 KiB
ReStructuredText

=========================
ScyllaDB Diagnostic Tools
=========================
ScyllaDB has a wide selection of tools and information sources available for diagnosing problems.
This document covers both built-in and standalone tools that can help you diagnose a problem with ScyllaDB.
This document focuses on enumerating the available tools and information sources, rather than providing a guide on how to diagnose generic or specific issues.
Logs
----
The most obvious source of information to find out more about why ScyllaDB is misbehaving.
On production systems, ScyllaDB logs to syslog; thus logs can usually be viewed via ``journalctl``.
See `Logging </getting-started/logging/>`_ on more information on how to access the logs.
ScyllaDB has the following log levels: ``trace``, ``debug``, ``info``, ``warn``, ``error``.
By default only logs with level ``info`` or above are logged. Some administrators might even set this to ``warn`` to reduce the amount of logs.
ScyllaDB has many different loggers, usually there is one for each subsystem or module.
You can change the log-level of a certain logger to ``debug`` or ``trace``, to get more visibility into the respective subsystem.
This can be done in one of the following ways:
* configuration file (``scylla.yaml``):
.. code-block:: yaml
logger_log_level:
mylogger: debug
* command line:
.. code-block:: shell
--logger-log-level mylogger=debug
* nodetool:
.. code-block:: shell
$ nodetool setlogginglevel mylogger debug
* REST API:
.. code-block:: shell
$ scylla-api-client system logger/{name} POST --name mylogger --level debug
The first two methods require a restart, the latter two work at runtime.
Note that setting the log-level of even a single logger to ``debug`` or below might generate a huge amount of log traffic.
Try to time the log-level bump to when an event of interest start and revert it quickly afterward to avoid saturating your log.
Monitoring
----------
ScyllaDB has a comprehensive monitoring and alerting solution, displaying the different counters from ScyllaDB and the underlying OS, as well as alerts for common problems and pitfalls.
Note that by default monitoring shows information about the entire cluster and even when selecting a certain node, it aggregates the counters from the node's shards.
Sometimes counter values aggregated over all the shards or even over multiple or all nodes is what you want.
Just be aware of the aggregation and know that you can always select the nodes and shards of interest, or display counters by node and shard (disable aggregation).
See `ScyllaDB Monitoring Stack <https://monitoring.docs.scylladb.com/stable/>`_ for more details.
Tracing
-------
Tracing allows you to retrieve the internal log of events happening in the context of a single query.
Therefore, tracing is only useful to diagnose problems related to a certain query and cannot be used to diagnose generic problems.
That said, when it comes to diagnosing problems with a certain query, tracing is an excellent tool, allowing you to have a peek at what happens when that query is processed, including the timestamp of each event.
For more details, see `Tracing </using-scylla/tracing>`_.
Nodetool
--------
Although ``nodetool`` is primarily an administration tool, it has various commands that retrieve and display useful information about the state of a certain ScyllaDB node.
Look for commands with "stats", "info", "describe", "get", "histogram" in their names.
For a comprehensive list of all available nodetool commands, see the `Nodetool Reference </operating-scylla/nodetool>`_.
REST API
--------
ScyllaDB has a REST API which is a superset of all ``nodetool`` commands, in the sense that it is the backend serving all of them.
It has many more endpoints, many of which can supply valuable information about the internal state of ScyllaDB.
For more information, see `REST API </operating-scylla/rest>`_.
System Tables
-------------
ScyllaDB has various internal system tables containing valuable information on its state.
Some of these are virtual tables, tables whose content is derived from in-memory state, rather than on-disk storage as is the case for regular tables. Virtual tables look like and act like regular tables.
For a complete list of all internal tables (including virtual ones), see `System Keyspace <https://github.com/scylladb/scylladb/blob/master/docs/dev/system_keyspace.md>`_.
Diagnostics dump on SIGQUIT
---------------------------
Sending ``SIGQUIT`` to ScyllaDB will result in a diagnostics dump to the logs. The diagnostics logs are logged via the dedicated ``diagnostics`` logger.
The diagnostics is dumped on every shard. The diagnostics dump of a shard has multiple parts:
* A summary of the state of the memory allocator and some other stats. The format is very similar to that used by ``scylla memory`` (from ``scylla-gdb.py``). See `Debugging Out Of Memory (OOM) crashes <https://github.com/scylladb/scylladb/blob/master/docs/dev/debugging.md#debugging-out-of-memory-oom-crashes>`_.
* Reader semaphore concurrency diagnostics dump for each semaphore in the database, see `reader_concurrency_semaphore.md <https://github.com/scylladb/scylladb/blob/master/docs/dev/reader-concurrency-semaphore.md#reader-concurrency-semaphore-diagnostic-dumps>`_ for more details.
Other Tools
-----------
ScyllaDB has various other tools, mainly to work with sstables.
If you are diagnosing a problem that is related to sstables misbehaving or being corrupt, you may find these useful:
* `sstabledump </operating-scylla/admin-tools/sstabledump/>`_
* `ScyllaDB SStable </operating-scylla/admin-tools/scylla-sstable/>`_
* `ScyllaDB Types </operating-scylla/admin-tools/scylla-types/>`_
GDB
---
The ultimate tool to extract any information from live ScyllaDB processes or coredumps.
However, it requires intimate knowledge of ScyllaDB internals to be useful.
For more details on how to debug scylla, see `Debugging <https://github.com/scylladb/scylladb/blob/master/docs/dev/debugging.md>`_.