.. _mgr-prometheus:

=================
Prometheus Module
=================

The Prometheus module provides a Prometheus exporter that passes on Ceph
performance counters from their collection point in ceph-mgr.  Ceph-mgr
receives MMgrReport messages from all MgrClient processes (mons and OSDs,
for instance) with performance counter schema data and actual counter data,
and keeps a circular buffer of the last N samples.  This module creates an
HTTP endpoint (like all Prometheus exporters) and retrieves the latest sample
of every counter when polled (or "scraped" in Prometheus terminology).
The HTTP path and query parameters are ignored; all extant counters
for all reporting entities are returned in text exposition format.
(See the Prometheus `documentation <https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-details>`_.)
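
To illustrate what a scrape returns, the following minimal Python sketch
parses a few lines of text exposition format into name, labels and value.
The sample payload (including its help text) and the helper name
``parse_samples`` are made up for this example. The parser only handles
simple ``name{labels} value`` lines without escaped quotes or timestamps,
so it is not a complete implementation of the format.

```python
import re

# Hypothetical scrape output; a real scrape returns many more series.
SAMPLE = """\
# HELP ceph_osd_up OSD status up
# TYPE ceph_osd_up untyped
ceph_osd_up{ceph_daemon="osd.0"} 1.0
ceph_osd_up{ceph_daemon="osd.1"} 0.0
ceph_health_status 0.0
"""

# metric name, optional {label="value",...} part, then the sample value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append(
            (match.group("name"), labels, float(match.group("value")))
        )
    return samples
```

Calling ``parse_samples(SAMPLE)`` yields one tuple per sample line, with an
empty label dictionary for series that carry no labels.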

Enabling prometheus output
==========================

The *prometheus* module is enabled with::

  ceph mgr module enable prometheus

Configuration
-------------

By default, the module accepts HTTP requests on port ``9283`` on all
IPv4 and IPv6 addresses on the host.  Both the port and the listen address
are configurable with ``ceph config-key set``, using the keys
``mgr/prometheus/server_addr`` and ``mgr/prometheus/server_port``.
This port is registered with Prometheus's `registry <https://github.com/prometheus/prometheus/wiki/Default-port-allocations>`_.

RBD IO statistics
-----------------

The module can optionally collect per-image RBD IO statistics by enabling
dynamic OSD performance counters. The statistics are gathered for all images
in the pools that are specified in the ``mgr/prometheus/rbd_stats_pools``
configuration parameter. The parameter is a comma- or space-separated list
of ``pool[/namespace]`` entries. If the namespace is not specified, the
statistics are collected for all namespaces in the pool.
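
The ``pool[/namespace]`` list format described above can be sketched in
Python as follows. This is an illustration of the format only, not the
module's own parsing code; the function name ``parse_rbd_stats_pools`` and
the use of ``None`` to mean "all namespaces" are assumptions made for this
example.

```python
import re


def parse_rbd_stats_pools(value):
    """Parse a comma- and/or space-separated list of pool[/namespace] entries.

    Returns a dict mapping each pool name to a set of namespaces, where
    None stands for "all namespaces in the pool".
    """
    pools = {}
    for entry in re.split(r"[\s,]+", value.strip()):
        if not entry:
            continue  # an empty setting yields no pools
        pool, sep, namespace = entry.partition("/")
        # No "/" in the entry means no namespace was specified.
        pools.setdefault(pool, set()).add(namespace if sep else None)
    return pools
```

For example, ``parse_rbd_stats_pools("rbd, images/ns1 images/ns2")`` maps
``rbd`` to all namespaces and ``images`` to the two namespaces ``ns1`` and
``ns2``.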

The module builds the list of all available images by scanning the specified
pools and namespaces, and refreshes it periodically. The refresh period is
configurable via the ``mgr/prometheus/rbd_stats_pools_refresh_interval``
parameter (in seconds) and defaults to 300 seconds (5 minutes). The module
forces an earlier refresh if it detects statistics from a previously unknown
RBD image.

Statistic names and labels
==========================

The names of the stats are exactly as Ceph names them, with the
illegal characters ``.``, ``-`` and ``::`` translated to ``_``,
and ``ceph_`` prefixed to all names.
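
The translation rule above can be expressed as a short Python sketch. It
follows only the description in this section and is not the module's actual
implementation; the function name ``to_prometheus_name`` and the example
counter names are chosen for illustration.

```python
import re


def to_prometheus_name(ceph_name):
    """Translate a Ceph counter name into a Prometheus-style metric name."""
    # "::" is listed first in the alternation so that it becomes a single
    # underscore; "." and "-" are then replaced one character at a time.
    return "ceph_" + re.sub(r"::|[.-]", "_", ceph_name)
```

With this sketch, ``to_prometheus_name("osd.op_r_latency")`` returns
``ceph_osd_op_r_latency``.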

All *daemon* statistics have a ``ceph_daemon`` label such as "osd.123"
that identifies the type and ID of the daemon they come from.  Some
statistics can come from different types of daemon, so when querying
e.g. an OSD's RocksDB stats, you would probably want to filter
on ``ceph_daemon`` starting with "osd" to avoid mixing in the monitor
RocksDB stats.

The *cluster* statistics (i.e. those global to the Ceph cluster)
have labels appropriate to what they report on.  For example,
metrics relating to pools have a ``pool_id`` label.

The long-running averages that represent the histograms from core Ceph
are represented by a pair of ``<name>_sum`` and ``<name>_count`` metrics.
This is similar to how histograms are represented in `Prometheus <https://prometheus.io/docs/concepts/metric_types/#histogram>`_
and they can also be treated `similarly <https://prometheus.io/docs/practices/histograms/>`_.
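
As a worked illustration of how a ``<name>_sum`` and ``<name>_count`` pair
is typically used, the Python sketch below computes the average value over
the interval between two scrapes. The function name and the sample numbers
are hypothetical. In a Prometheus query, the same idea is usually written as
``rate(<name>_sum[5m]) / rate(<name>_count[5m])``.

```python
def average_over_interval(sum_start, count_start, sum_end, count_end):
    """Average of the samples recorded between two scrapes of a sum/count pair."""
    delta_count = count_end - count_start
    if delta_count <= 0:
        # No new samples in the interval, or the counters were reset.
        return None
    return (sum_end - sum_start) / delta_count
```

For example, if the sum grows from 10.0 to 16.0 while the count grows from
100 to 400, then 300 new samples accumulated a total of 6.0, which is an
average of 0.02 per sample.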

Pool and OSD metadata series
----------------------------

Special series are output to enable displaying and querying on
certain metadata fields.

Pools have a ``ceph_pool_metadata`` series like this:

::

    ceph_pool_metadata{pool_id="2",name="cephfs_metadata_a"} 1.0

OSDs have a ``ceph_osd_metadata`` series like this:

::

    ceph_osd_metadata{cluster_addr="172.21.9.34:6802/19096",device_class="ssd",ceph_daemon="osd.0",public_addr="172.21.9.34:6801/19096",weight="1.0"} 1.0

Correlating drive statistics with node_exporter
-----------------------------------------------

The Prometheus output from Ceph is designed to be used in conjunction
with the generic host monitoring from the Prometheus node_exporter.

To enable correlation of Ceph OSD statistics with node_exporter's
drive statistics, special series are output like this:

::

    ceph_disk_occupation{ceph_daemon="osd.0",device="sdd", exported_instance="myhost"}

To use this to get disk statistics by OSD ID, use either the ``and`` operator or
the ``*`` operator in your Prometheus query. All metadata metrics (like
``ceph_disk_occupation``) have the value 1, so they act as a neutral element
with ``*``. Using ``*`` allows the use of the ``group_left`` and ``group_right``
grouping modifiers, so that the resulting metric has additional labels from
one side of the query.

See the
`prometheus documentation`__ for more information about constructing queries.

__ https://prometheus.io/docs/prometheus/latest/querying/basics

The goal is to run a query like

::

    rate(node_disk_bytes_written[30s]) and on (device,instance) ceph_disk_occupation{ceph_daemon="osd.0"}

Out of the box, the above query will not return any metrics because the
``instance`` labels of the two metrics don't match. The ``instance`` label of
``ceph_disk_occupation`` will be the currently active MGR node.

The following two sections outline two approaches to remedy this.

Use label_replace
=================

The ``label_replace`` function (cf.
`label_replace documentation <https://prometheus.io/docs/prometheus/latest/querying/functions/#label_replace>`_)
can add a label to, or alter a label of, a metric within a query.

To correlate an OSD with the write rate of its disk, the following query can be used:

::

    label_replace(rate(node_disk_bytes_written[30s]), "exported_instance", "$1", "instance", "(.*):.*") and on (device,exported_instance) ceph_disk_occupation{ceph_daemon="osd.0"}
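
To make the effect of the regular expression ``(.*):.*`` concrete, the
following Python sketch emulates ``label_replace`` for the labels of a
single series. It is a simplified model rather than Prometheus code: it
relies on the fact that Prometheus regular expressions are fully anchored,
supports only ``$1``-style references in the replacement, and the sample
label values are hypothetical.

```python
import re


def label_replace(labels, dst_label, replacement, src_label, regex):
    """Emulate PromQL label_replace on one series' label dictionary."""
    # The regex must match the whole source value, hence fullmatch().
    match = re.fullmatch(regex, labels.get(src_label, ""))
    if match is None:
        return dict(labels)  # no match: the series is returned unchanged
    # PromQL writes capture groups as $1, $2, ...; Python's re uses \1, \2.
    template = re.sub(r"\$(\d+)", r"\\\1", replacement)
    new_labels = dict(labels)
    new_labels[dst_label] = match.expand(template)
    return new_labels
```

Applied to a series with ``instance="myhost:9100"``, the call
``label_replace(labels, "exported_instance", "$1", "instance", "(.*):.*")``
adds ``exported_instance="myhost"``, which is the value that
``ceph_disk_occupation`` carries in the example above.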

Configuring Prometheus server
=============================

honor_labels
------------

To enable Ceph to output properly labeled data relating to any host,
use the ``honor_labels`` setting when adding the ceph-mgr endpoints
to your Prometheus configuration.

This allows Ceph to export the proper ``instance`` label without Prometheus
overwriting it. Without this setting, Prometheus applies an ``instance`` label
that includes the hostname and port of the endpoint that the series came from.
Because Ceph clusters have multiple manager daemons, this results in an
``instance`` label that changes spuriously when the active manager daemon
changes.

If this is undesirable, a custom ``instance`` label can be set in the
Prometheus target configuration: you might wish to set it to the hostname
of your first mgr daemon, or something completely arbitrary like "ceph_cluster".

node_exporter hostname labels
-----------------------------

Set your ``instance`` labels to match what appears in Ceph's OSD metadata
in the ``instance`` field.  This is generally the short hostname of the node.

This is only necessary if you want to correlate Ceph stats with host stats,
but you may find it useful to do so in all cases, in case you want to do
the correlation in the future.

Example configuration
---------------------

This example shows a single-node configuration running ceph-mgr and
node_exporter on a server called ``senta04``. Note that this requires adding
the appropriate ``instance`` label to every ``node_exporter`` target
individually.

This is just an example: there are other ways to configure Prometheus
scrape targets and label rewrite rules.

prometheus.yml
~~~~~~~~~~~~~~

::

    global:
      scrape_interval:     15s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: 'node'
        file_sd_configs:
          - files:
            - node_targets.yml
      - job_name: 'ceph'
        honor_labels: true
        file_sd_configs:
          - files:
            - ceph_targets.yml


ceph_targets.yml
~~~~~~~~~~~~~~~~


::

    [
        {
            "targets": [ "senta04.mydomain.com:9283" ],
            "labels": {}
        }
    ]


node_targets.yml
~~~~~~~~~~~~~~~~

::

    [
        {
            "targets": [ "senta04.mydomain.com:9100" ],
            "labels": {
                "instance": "senta04"
            }
        }
    ]
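
Because every ``node_exporter`` target needs its own ``instance`` label, the
targets file may be easier to generate than to write by hand. The Python
sketch below is one hypothetical way to do so: it assumes that the short
hostname is the part of the fully qualified name before the first dot and
that node_exporter listens on port 9100, as in the example above. The
function name ``node_targets`` is made up for this illustration.

```python
import json


def node_targets(fqdns, port=9100):
    """Build file_sd entries whose instance label is the short hostname."""
    return [
        {
            "targets": ["{}:{}".format(fqdn, port)],
            # Assumption: the short hostname is the text before the first dot.
            "labels": {"instance": fqdn.split(".", 1)[0]},
        }
        for fqdn in fqdns
    ]


if __name__ == "__main__":
    # Prints JSON suitable for use as node_targets.yml.
    print(json.dumps(node_targets(["senta04.mydomain.com"]), indent=4))
```

For ``senta04.mydomain.com`` this produces the same target and ``instance``
label as the hand-written ``node_targets.yml`` shown above.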


Notes
=====

Counters and gauges are exported; currently histograms and long-running
averages are not.  It's possible that Ceph's 2-D histograms could be
reduced to two separate 1-D histograms, and that long-running averages
could be exported as Prometheus' Summary type.

Timestamps, as with many Prometheus exporters, are established by
the server's scrape time (Prometheus expects that it is polling the
actual counter process synchronously).  It is possible to supply a
timestamp along with the stat report, but the Prometheus team strongly
advises against this.  This means that timestamps will be delayed by
an unpredictable amount; it's not clear if this will be problematic,
but it's worth knowing about.