.. _mgr-prometheus:

=================
Prometheus Module
=================

Provides a Prometheus exporter to pass on Ceph performance counters
from the collection point in ceph-mgr. Ceph-mgr receives MMgrReport
messages from all MgrClient processes (mons and OSDs, for instance)
with performance counter schema data and actual counter data, and keeps
a circular buffer of the last N samples. This module creates an HTTP
endpoint (like all Prometheus exporters) and retrieves the latest sample
of every counter when polled (or "scraped" in Prometheus terminology).
The HTTP path and query parameters are ignored; all extant counters
for all reporting entities are returned in text exposition format.
(See the Prometheus `documentation <https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-details>`_.)

Enabling prometheus output
==========================

The *prometheus* module is enabled with::

    ceph mgr module enable prometheus

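To confirm that the module is running, you can list the manager modules (the
exact output format varies by release; ``prometheus`` should appear among the
enabled modules)::

    ceph mgr module ls
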
Configuration
-------------

.. note::

    The Prometheus manager module needs to be restarted for configuration changes to be applied.

.. mgr_module:: prometheus
.. confval:: server_addr
.. confval:: server_port
.. confval:: scrape_interval
.. confval:: cache
.. confval:: stale_cache_strategy
.. confval:: rbd_stats_pools
.. confval:: rbd_stats_pools_refresh_interval
.. confval:: standby_behaviour
.. confval:: standby_error_status_code

By default the module will accept HTTP requests on port ``9283`` on all IPv4
and IPv6 addresses on the host. The port and listen address are both
configurable with ``ceph config set``, with keys
``mgr/prometheus/server_addr`` and ``mgr/prometheus/server_port``. This port
is registered with Prometheus's `registry
<https://github.com/prometheus/prometheus/wiki/Default-port-allocations>`_.

::

    ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
    ceph config set mgr mgr/prometheus/server_port 9283

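To check that the exporter is responding, you can fetch the metrics by hand.
The example below assumes the default port and that it is run on (or can reach)
the host running the active manager daemon::

    curl -s http://localhost:9283/metrics | head
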
.. warning::

    The :confval:`mgr/prometheus/scrape_interval` of this module should always
    match the scrape interval configured in Prometheus; otherwise the cache
    may not behave as expected.

    The scrape interval in the module is used for caching purposes
    and to determine when the cache is stale.

    A scrape interval below 10 seconds is not recommended. The recommended
    scrape interval is 15 seconds, although in some cases it may be useful
    to increase it.

To set a different scrape interval in the Prometheus module, set
``scrape_interval`` to the desired value::

    ceph config set mgr mgr/prometheus/scrape_interval 20

On large clusters (>1000 OSDs), the time to fetch the metrics may become
significant. Without the cache, the Prometheus manager module could, especially
in conjunction with multiple Prometheus instances, overload the manager and lead
to unresponsive or crashing Ceph manager instances. Hence, the cache is enabled
by default. This means that there is a possibility that the cache becomes
stale. The cache is considered stale when the time to fetch the metrics from
Ceph exceeds the configured :confval:`mgr/prometheus/scrape_interval`.

If that is the case, **a warning will be logged** and the module will either

* respond with a 503 HTTP status code (service unavailable), or
* return the content of the cache, even though it might be stale.

This behavior can be configured. By default, the module responds with a 503
HTTP status code (service unavailable). You can set other options using the
``ceph config set`` commands.

To tell the module to respond with possibly stale data, set it to ``return``::

    ceph config set mgr mgr/prometheus/stale_cache_strategy return

To tell the module to respond with "service unavailable", set it to ``fail``::

    ceph config set mgr mgr/prometheus/stale_cache_strategy fail

If you are confident that you don't require the cache, you can disable it::

    ceph config set mgr mgr/prometheus/cache false

If you are using the prometheus module behind a reverse proxy or
load balancer, you can simplify discovering the active instance by switching
to ``error``-mode::

    ceph config set mgr mgr/prometheus/standby_behaviour error

If set, the prometheus module will respond with an HTTP error when ``/`` is
requested from a standby instance. The default error code is 500, but you can
configure the HTTP response code with::

    ceph config set mgr mgr/prometheus/standby_error_status_code 503

Valid error codes are in the range 400 to 599.

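With ``error``-mode enabled, a proxy or load balancer health check can tell the
active manager apart from the standbys by HTTP status code alone. A minimal
sketch (the hostname is a placeholder; the active instance returns ``200``,
while a standby returns the configured error code)::

    curl -s -o /dev/null -w '%{http_code}\n' http://mgr-host.example.com:9283/
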
To switch back to the default behaviour, simply set the config key to ``default``::

    ceph config set mgr mgr/prometheus/standby_behaviour default

Ceph Health Checks
------------------

The mgr/prometheus module also tracks and maintains a history of Ceph health checks,
exposing them to the Prometheus server as discrete metrics. This allows Prometheus
alert rules to be configured for specific health check events.

The metrics take the following form:

::

    # HELP ceph_health_detail healthcheck status by type (0=inactive, 1=active)
    # TYPE ceph_health_detail gauge
    ceph_health_detail{name="OSDMAP_FLAGS",severity="HEALTH_WARN"} 0.0
    ceph_health_detail{name="OSD_DOWN",severity="HEALTH_WARN"} 1.0
    ceph_health_detail{name="PG_DEGRADED",severity="HEALTH_WARN"} 1.0

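As an illustration, a Prometheus alerting rule can fire on a specific health
check by matching the ``name`` label of ``ceph_health_detail``. This is only a
sketch; the alert name, duration, and severity label are arbitrary choices::

    groups:
      - name: ceph-health
        rules:
          - alert: CephOSDDown
            expr: ceph_health_detail{name="OSD_DOWN"} == 1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Ceph reports one or more OSDs down"
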
The health check history is made available through the following commands:

::

    healthcheck history ls [--format {plain|json|json-pretty}]
    healthcheck history clear

The ``ls`` command provides an overview of the health checks that the cluster has
encountered, either over its lifetime or since the last ``clear`` command was issued,
as shown in the example below:

::

    [ceph: root@c8-node1 /]# ceph healthcheck history ls
    Healthcheck Name          First Seen (UTC)      Last seen (UTC)       Count  Active
    OSDMAP_FLAGS              2021/09/16 03:17:47   2021/09/16 22:07:40       2    No
    OSD_DOWN                  2021/09/17 00:11:59   2021/09/17 00:11:59       1   Yes
    PG_DEGRADED               2021/09/17 00:11:59   2021/09/17 00:11:59       1   Yes
    3 health check(s) listed


.. _prometheus-rbd-io-statistics:

RBD IO statistics
-----------------

The module can optionally collect RBD per-image IO statistics by enabling
dynamic OSD performance counters. The statistics are gathered for all images
in the pools that are specified in the ``mgr/prometheus/rbd_stats_pools``
configuration parameter. The parameter is a comma- or space-separated list
of ``pool[/namespace]`` entries. If the namespace is not specified, the
statistics are collected for all namespaces in the pool.

Example to activate the RBD-enabled pools ``pool1``, ``pool2`` and ``poolN``::

    ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2,poolN"

The module builds the list of all available images by scanning the specified
pools and namespaces, and refreshes it periodically. The period is
configurable via the ``mgr/prometheus/rbd_stats_pools_refresh_interval``
parameter (in seconds) and defaults to 300 seconds (5 minutes). The module will
force an earlier refresh if it detects statistics from a previously unknown
RBD image.

Example to set the sync interval to 10 minutes::

    ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 600

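Once enabled, per-image series can be queried like any other metric. The query
below is a sketch: the metric and label names used here (``ceph_rbd_write_bytes``
with a ``pool`` label) should be verified against the module's actual output on
your cluster::

    rate(ceph_rbd_write_bytes{pool="pool1"}[1m])
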
Statistic names and labels
==========================

The names of the stats are exactly as Ceph names them, with
illegal characters ``.``, ``-`` and ``::`` translated to ``_``,
and ``ceph_`` prefixed to all names.


All *daemon* statistics have a ``ceph_daemon`` label such as "osd.123"
that identifies the type and ID of the daemon they come from. Some
statistics can come from different types of daemon, so when querying
e.g. an OSD's RocksDB stats, you would probably want to filter
on ``ceph_daemon`` values starting with "osd" to avoid mixing in the monitor
RocksDB stats.

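For example, to restrict RocksDB-related series to OSD daemons only, a selector
along these lines can be used (the ``__name__`` regex is illustrative; adjust it
to the metric names actually present in your output)::

    {__name__=~"ceph_rocksdb_.*", ceph_daemon=~"osd.*"}
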

The *cluster* statistics (i.e. those global to the Ceph cluster)
have labels appropriate to what they report on. For example,
metrics relating to pools have a ``pool_id`` label.


The long-running averages that represent the histograms from core Ceph
are represented by a pair of ``<name>_sum`` and ``<name>_count`` metrics.
This is similar to how histograms are represented in `Prometheus <https://prometheus.io/docs/concepts/metric_types/#histogram>`_
and they can also be treated `similarly <https://prometheus.io/docs/practices/histograms/>`_.

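For instance, an average can be derived from such a pair with the usual
``rate(..._sum) / rate(..._count)`` pattern. The metric name below is an
assumption (an OSD read-latency counter); substitute one that exists in your
output::

    rate(ceph_osd_op_r_latency_sum[5m]) / rate(ceph_osd_op_r_latency_count[5m])
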
Pool and OSD metadata series
----------------------------

Special series are output to enable displaying and querying on
certain metadata fields.

Pools have a ``ceph_pool_metadata`` series like this:

::

    ceph_pool_metadata{pool_id="2",name="cephfs_metadata_a"} 1.0

OSDs have a ``ceph_osd_metadata`` series like this:

::

    ceph_osd_metadata{cluster_addr="172.21.9.34:6802/19096",device_class="ssd",ceph_daemon="osd.0",public_addr="172.21.9.34:6801/19096",weight="1.0"} 1.0


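Because these metadata series have the value 1, they can be multiplied onto
other series to attach their labels. The sketch below joins a pool statistic
onto ``ceph_pool_metadata`` by ``pool_id`` to pull in the pool ``name`` label;
the pool metric used here (``ceph_pool_objects``) is illustrative::

    ceph_pool_objects * on (pool_id) group_left(name) ceph_pool_metadata
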
Correlating drive statistics with node_exporter
-----------------------------------------------

The prometheus output from Ceph is designed to be used in conjunction
with the generic host monitoring from the Prometheus node_exporter.

To enable correlation of Ceph OSD statistics with node_exporter's
drive statistics, special series are output like this:

::

    ceph_disk_occupation_human{ceph_daemon="osd.0", device="sdd", exported_instance="myhost"}

To use this to get disk statistics by OSD ID, use either the ``and`` operator or
the ``*`` operator in your Prometheus query. All metadata metrics (like
``ceph_disk_occupation_human``) have the value 1, so they act neutrally with ``*``.
Using ``*`` allows the ``group_left`` and ``group_right`` grouping modifiers to be
used, so that the resulting metric has additional labels from one side of the query.

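As a sketch, the ``*`` form of such a query might look like the following; note
that, just as with the ``and`` query shown further down, the ``instance`` labels
of the two metrics must be made to match first (see the approaches described
below)::

    rate(node_disk_bytes_written[30s])
      * on (device, instance) group_left(ceph_daemon)
    ceph_disk_occupation_human{ceph_daemon="osd.0"}
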
See the
`prometheus documentation`__ for more information about constructing queries.

__ https://prometheus.io/docs/prometheus/latest/querying/basics

The goal is to run a query like the following:

::

    rate(node_disk_bytes_written[30s]) and
    on (device,instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}

Out of the box the above query will not return any metrics since the ``instance`` labels of
both metrics don't match. The ``instance`` label of ``ceph_disk_occupation_human``
will be the currently active MGR node.

The following two sections outline two approaches to remedy this.

.. note::

    If you need to group on the ``ceph_daemon`` label instead of the ``device`` and
    ``instance`` labels, using ``ceph_disk_occupation_human`` may not work reliably.
    It is advised that you use ``ceph_disk_occupation`` instead.

    The difference is that ``ceph_disk_occupation_human`` may group several OSDs
    into the value of a single ``ceph_daemon`` label in cases where multiple OSDs
    share a disk.

Use label_replace
=================

The ``label_replace`` function (see the
`label_replace documentation <https://prometheus.io/docs/prometheus/latest/querying/functions/#label_replace>`_)
can add a label to, or alter a label of, a metric within a query.

To correlate an OSD and its disk's write rate, the following query can be used:

::

    label_replace(
        rate(node_disk_bytes_written[30s]),
        "exported_instance",
        "$1",
        "instance",
        "(.*):.*"
    ) and on (device, exported_instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}

Configuring Prometheus server
=============================

honor_labels
------------

To enable Ceph to output properly-labeled data relating to any host,
use the ``honor_labels`` setting when adding the ceph-mgr endpoints
to your Prometheus configuration.

This allows Ceph to export the proper ``instance`` label without Prometheus
overwriting it. Without this setting, Prometheus applies an ``instance`` label
that includes the hostname and port of the endpoint that the series came from.
Because Ceph clusters have multiple manager daemons, this results in an
``instance`` label that changes spuriously when the active manager daemon
changes.

If this is undesirable, a custom ``instance`` label can be set in the
Prometheus target configuration: you might wish to set it to the hostname
of your first mgr daemon, or something completely arbitrary like "ceph_cluster".

node_exporter hostname labels
-----------------------------

Set your ``instance`` labels to match what appears in Ceph's OSD metadata
in the ``instance`` field. This is generally the short hostname of the node.

This is only necessary if you want to correlate Ceph stats with host stats,
but you may find it useful to do so anyway, in case you want to perform the
correlation in the future.

Example configuration
---------------------

This example shows a single node configuration running ceph-mgr and
node_exporter on a server called ``senta04``. Note that this requires
adding an appropriate and unique ``instance`` label to each ``node_exporter`` target.

This is just an example: there are other ways to configure Prometheus
scrape targets and label rewrite rules.

prometheus.yml
~~~~~~~~~~~~~~

::

    global:
        scrape_interval: 15s
        evaluation_interval: 15s

    scrape_configs:
      - job_name: 'node'
        file_sd_configs:
          - files:
            - node_targets.yml
      - job_name: 'ceph'
        honor_labels: true
        file_sd_configs:
          - files:
            - ceph_targets.yml


ceph_targets.yml
~~~~~~~~~~~~~~~~

::

    [
        {
            "targets": [ "senta04.mydomain.com:9283" ],
            "labels": {}
        }
    ]


node_targets.yml
~~~~~~~~~~~~~~~~

::

    [
        {
            "targets": [ "senta04.mydomain.com:9100" ],
            "labels": {
                "instance": "senta04"
            }
        }
    ]


Notes
=====

Counters and gauges are exported; currently histograms and long-running
averages are not. It's possible that Ceph's 2-D histograms could be
reduced to two separate 1-D histograms, and that long-running averages
could be exported as Prometheus' Summary type.

Timestamps, as with many Prometheus exporters, are established by
the server's scrape time (Prometheus expects that it is polling the
actual counter process synchronously). It is possible to supply a
timestamp along with the stat report, but the Prometheus team strongly
advises against this. This means that timestamps will be delayed by
an unpredictable amount; it's not clear if this will be problematic,
but it's worth knowing about.