.. _mgr-prometheus:

=================
Prometheus Module
=================

Provides a Prometheus exporter to pass on Ceph performance counters
from the collection point in ceph-mgr. Ceph-mgr receives MMgrReport
messages from all MgrClient processes (mons and OSDs, for instance)
with performance counter schema data and actual counter data, and keeps
a circular buffer of the last N samples. This module creates an HTTP
endpoint (like all Prometheus exporters) and retrieves the latest sample
of every counter when polled (or "scraped" in Prometheus terminology).
The HTTP path and query parameters are ignored; all extant counters
for all reporting entities are returned in text exposition format.
(See the Prometheus `documentation <https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-details>`_.)
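
Once the module is enabled (see below), the latest samples can be fetched with
any HTTP client. A minimal illustration, assuming the default port ``9283``
described later in this document and the active manager host (output abridged;
a real response contains many series):

::

    $ curl -s http://localhost:9283/metrics
    # HELP ceph_health_detail healthcheck status by type (0=inactive, 1=active)
    # TYPE ceph_health_detail gauge
    ceph_health_detail{name="OSDMAP_FLAGS",severity="HEALTH_WARN"} 0.0
    ...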

Enabling prometheus output
==========================

The *prometheus* module is enabled with:

.. prompt:: bash $

   ceph mgr module enable prometheus

Configuration
-------------

.. note::

    The Prometheus manager module needs to be restarted for configuration changes to be applied.

.. mgr_module:: prometheus
.. confval:: server_addr
.. confval:: server_port
.. confval:: scrape_interval
.. confval:: cache
.. confval:: stale_cache_strategy
.. confval:: rbd_stats_pools
.. confval:: rbd_stats_pools_refresh_interval
.. confval:: standby_behaviour
.. confval:: standby_error_status_code
.. confval:: exclude_perf_counters

By default the module will accept HTTP requests on port ``9283`` on all IPv4
and IPv6 addresses on the host. The port and listen address are both
configurable with ``ceph config set``, with keys
``mgr/prometheus/server_addr`` and ``mgr/prometheus/server_port``. This port
is registered with Prometheus's `registry
<https://github.com/prometheus/prometheus/wiki/Default-port-allocations>`_.

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
   ceph config set mgr mgr/prometheus/server_port 9283

.. warning::

    The :confval:`mgr/prometheus/scrape_interval` of this module should always be set to match
    the Prometheus scrape interval to work properly and not cause any issues.

The scrape interval in the module is used for caching purposes
and to determine when a cache is stale.

It is not recommended to use a scrape interval below 10 seconds. The
recommended scrape interval is 15 seconds, though in some cases it
may be useful to increase it.

To set a different scrape interval in the Prometheus module, set
``scrape_interval`` to the desired value:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/scrape_interval 20

On large clusters (>1000 OSDs), the time to fetch the metrics may become
significant. Without the cache, the Prometheus manager module could, especially
in conjunction with multiple Prometheus instances, overload the manager and lead
to unresponsive or crashing Ceph manager instances. Hence, the cache is enabled
by default. This means that there is a possibility that the cache becomes
stale. The cache is considered stale when the time to fetch the metrics from
Ceph exceeds the configured :confval:`mgr/prometheus/scrape_interval`.

If that is the case, **a warning will be logged** and the module will either

* respond with a 503 HTTP status code (service unavailable), or
* return the content of the cache, even though it might be stale.

This behavior can be configured. By default, the module will respond with a 503
HTTP status code (service unavailable). You can set other options using the
``ceph config set`` commands.

To tell the module to respond with possibly stale data, set it to ``return``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/stale_cache_strategy return

To tell the module to respond with "service unavailable", set it to ``fail``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/stale_cache_strategy fail

If you are confident that you don't require the cache, you can disable it:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/cache false

If you are using the prometheus module behind some kind of reverse proxy or
load balancer, you can simplify discovering the active instance by switching
to ``error``-mode:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/standby_behaviour error

If set, the prometheus module will respond with an HTTP error when ``/`` is
requested from the standby instance. The default error code is 500, but you can
configure the HTTP response code with:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/standby_error_status_code 503

Valid error codes are in the range 400-599.
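
For example, a load balancer can rely on the standby's error response in its
health checks so that scrapes are only routed to the active manager. A minimal
sketch for HAProxy, assuming three manager hosts with illustrative hostnames:

::

    backend ceph_mgr_prometheus
        option httpchk GET /
        http-check expect status 200
        server mgr1 mgr1.example.com:9283 check
        server mgr2 mgr2.example.com:9283 check
        server mgr3 mgr3.example.com:9283 check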

To switch back to the default behaviour, simply set the config key to ``default``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/standby_behaviour default

Ceph Health Checks
------------------

The mgr/prometheus module also tracks and maintains a history of Ceph health checks,
exposing them to the Prometheus server as discrete metrics. This allows Prometheus
alert rules to be configured for specific health check events.

The metrics take the following form:

::

    # HELP ceph_health_detail healthcheck status by type (0=inactive, 1=active)
    # TYPE ceph_health_detail gauge
    ceph_health_detail{name="OSDMAP_FLAGS",severity="HEALTH_WARN"} 0.0
    ceph_health_detail{name="OSD_DOWN",severity="HEALTH_WARN"} 1.0
    ceph_health_detail{name="PG_DEGRADED",severity="HEALTH_WARN"} 1.0

The health check history is made available through the following commands:

::

    healthcheck history ls [--format {plain|json|json-pretty}]
    healthcheck history clear

The ``ls`` command provides an overview of the health checks that the cluster has
encountered, or those encountered since the last ``clear`` command was issued, as
shown in the example below:

::

    [ceph: root@c8-node1 /]# ceph healthcheck history ls
    Healthcheck Name          First Seen (UTC)      Last seen (UTC)       Count  Active
    OSDMAP_FLAGS              2021/09/16 03:17:47   2021/09/16 22:07:40   2      No
    OSD_DOWN                  2021/09/17 00:11:59   2021/09/17 00:11:59   1      Yes
    PG_DEGRADED               2021/09/17 00:11:59   2021/09/17 00:11:59   1      Yes
    3 health check(s) listed

.. _prometheus-rbd-io-statistics:

RBD IO statistics
-----------------

The module can optionally collect RBD per-image IO statistics by enabling
dynamic OSD performance counters. The statistics are gathered for all images
in the pools that are specified in the ``mgr/prometheus/rbd_stats_pools``
configuration parameter. The parameter is a comma- or space-separated list
of ``pool[/namespace]`` entries. If the namespace is not specified, the
statistics are collected for all namespaces in the pool.

For example, to activate the RBD-enabled pools ``pool1``, ``pool2`` and ``poolN``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2,poolN"
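
Statistics can also be limited to specific namespaces. For example (pool and
namespace names are illustrative), to collect statistics for all of ``pool1``
but only namespace ``ns1`` of ``pool2``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1 pool2/ns1"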

The wildcard can be used to indicate all pools or namespaces:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/rbd_stats_pools "*"

The module builds the list of all available images by scanning the specified
pools and namespaces, and refreshes it periodically. The refresh period is
configurable via the ``mgr/prometheus/rbd_stats_pools_refresh_interval``
parameter (in seconds) and defaults to 300 seconds (5 minutes). The module will
force an earlier refresh if it detects statistics from a previously unknown
RBD image.

For example, to increase the refresh interval to 10 minutes:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 600

Ceph daemon performance counters metrics
-----------------------------------------

With the introduction of the ``ceph-exporter`` daemon, the prometheus module will no longer export Ceph daemon
perf counters as Prometheus metrics by default. However, one may re-enable exporting these metrics by setting
the module option ``exclude_perf_counters`` to ``false``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/exclude_perf_counters false

Statistic names and labels
==========================

The names of the stats are exactly as Ceph names them, with
illegal characters ``.``, ``-`` and ``::`` translated to ``_``,
and ``ceph_`` prefixed to all names.
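
For example, an OSD performance counter such as ``osd.numpg`` would appear
(as an illustration; exact counter names vary by release) as:

::

    ceph_osd_numpg{ceph_daemon="osd.0"} 123.0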

All *daemon* statistics have a ``ceph_daemon`` label such as "osd.123"
that identifies the type and ID of the daemon they come from. Some
statistics can come from different types of daemon, so when querying,
for example, an OSD's RocksDB stats, you would probably want to filter
on ``ceph_daemon`` values starting with "osd" to avoid mixing in the
monitor's RocksDB stats.

The *cluster* statistics (i.e. those global to the Ceph cluster)
have labels appropriate to what they report on. For example,
metrics relating to pools have a ``pool_id`` label.

The long running averages that represent the histograms from core Ceph
are represented by a pair of ``<name>_sum`` and ``<name>_count`` metrics.
This is similar to how histograms are represented in `Prometheus <https://prometheus.io/docs/concepts/metric_types/#histogram>`_
and they can also be treated `similarly <https://prometheus.io/docs/practices/histograms/>`_.
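
For example, a mean can be derived by dividing the two series. A sketch,
assuming the OSD read-operation latency counters are being exported (see the
note on ``exclude_perf_counters`` above; metric names may vary by release):

::

    rate(ceph_osd_op_r_latency_sum[5m]) / rate(ceph_osd_op_r_latency_count[5m])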

Pool and OSD metadata series
----------------------------

Special series are output to enable displaying and querying on
certain metadata fields.

Pools have a ``ceph_pool_metadata`` field like this:

::

    ceph_pool_metadata{pool_id="2",name="cephfs_metadata_a"} 1.0
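
Metadata series such as this are typically joined onto other metrics to attach
human-readable names. A sketch, assuming a per-pool usage metric such as
``ceph_pool_stored`` is being exported:

::

    ceph_pool_stored * on (pool_id) group_left (name) ceph_pool_metadata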

OSDs have a ``ceph_osd_metadata`` field like this:

::

    ceph_osd_metadata{cluster_addr="172.21.9.34:6802/19096",device_class="ssd",ceph_daemon="osd.0",public_addr="172.21.9.34:6801/19096",weight="1.0"} 1.0


Correlating drive statistics with node_exporter
-----------------------------------------------

The prometheus output from Ceph is designed to be used in conjunction
with the generic host monitoring from the Prometheus node_exporter.

To enable correlation of Ceph OSD statistics with node_exporter's
drive statistics, special series are output like this:

::

    ceph_disk_occupation_human{ceph_daemon="osd.0", device="sdd", exported_instance="myhost"}

To use this to get disk statistics by OSD ID, use either the ``and`` operator or
the ``*`` operator in your Prometheus query. All metadata metrics (like
``ceph_disk_occupation_human``) have the value 1, so they act as the neutral
element with ``*``. Using ``*`` allows the use of the ``group_left`` and
``group_right`` grouping modifiers, so that the resulting metric has additional
labels from one side of the query.
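
For instance, a sketch of the ``*`` form with ``group_left``, which attaches the
``ceph_daemon`` label to the node_exporter series (subject to the same
``instance`` label caveat discussed below):

::

    rate(node_disk_written_bytes_total[30s])
      * on (device, instance) group_left (ceph_daemon) ceph_disk_occupation_human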

See the
`prometheus documentation`__ for more information about constructing queries.

__ https://prometheus.io/docs/prometheus/latest/querying/basics

The goal is to run a query like

::

    rate(node_disk_written_bytes_total[30s]) and
      on (device,instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}

Out of the box, the above query will not return any metrics, since the ``instance``
labels of the two metrics don't match. The ``instance`` label of
``ceph_disk_occupation_human`` will be the currently active MGR node.

The following two sections outline two approaches to remedy this.

.. note::

    If you need to group on the ``ceph_daemon`` label instead of the ``device`` and
    ``instance`` labels, using ``ceph_disk_occupation_human`` may not work reliably.
    It is advised that you use ``ceph_disk_occupation`` instead.

    The difference is that ``ceph_disk_occupation_human`` may group several OSDs
    into the value of a single ``ceph_daemon`` label in cases where multiple OSDs
    share a disk.

Use label_replace
=================

The ``label_replace`` function (see the
`label_replace documentation <https://prometheus.io/docs/prometheus/latest/querying/functions/#label_replace>`_)
can add a label to, or alter a label of, a metric within a query.

To correlate an OSD with the write rate of its disk, the following query can be used:

::

    label_replace(
        rate(node_disk_written_bytes_total[30s]),
        "exported_instance",
        "$1",
        "instance",
        "(.*):.*"
    ) and on (device, exported_instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}

Configuring Prometheus server
=============================

honor_labels
------------

To enable Ceph to output properly-labeled data relating to any host,
use the ``honor_labels`` setting when adding the ceph-mgr endpoints
to your Prometheus configuration.

This allows Ceph to export the proper ``instance`` label without Prometheus
overwriting it. Without this setting, Prometheus applies an ``instance`` label
that includes the hostname and port of the endpoint that the series came from.
Because Ceph clusters have multiple manager daemons, this results in an
``instance`` label that changes spuriously when the active manager daemon
changes.

If this is undesirable, a custom ``instance`` label can be set in the
Prometheus target configuration: you might wish to set it to the hostname
of your first mgr daemon, or something completely arbitrary like "ceph_cluster".

node_exporter hostname labels
-----------------------------

Set your ``instance`` labels to match what appears in Ceph's OSD metadata
in the ``instance`` field. This is generally the short hostname of the node.

This is only necessary if you want to correlate Ceph stats with host stats,
but you may find it useful to do so in all cases, in case you want to do
the correlation in the future.

Example configuration
---------------------

This example shows a single node configuration running ceph-mgr and
node_exporter on a server called ``senta04``. Note that this requires one
to add an appropriate and unique ``instance`` label to each ``node_exporter`` target.

This is just an example: there are other ways to configure Prometheus
scrape targets and label rewrite rules.

prometheus.yml
~~~~~~~~~~~~~~

::

    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: 'node'
        file_sd_configs:
          - files:
              - node_targets.yml
      - job_name: 'ceph'
        honor_labels: true
        file_sd_configs:
          - files:
              - ceph_targets.yml

ceph_targets.yml
~~~~~~~~~~~~~~~~

::

    [
        {
            "targets": [ "senta04.mydomain.com:9283" ],
            "labels": {}
        }
    ]

node_targets.yml
~~~~~~~~~~~~~~~~

::

    [
        {
            "targets": [ "senta04.mydomain.com:9100" ],
            "labels": {
                "instance": "senta04"
            }
        }
    ]

Notes
=====

Counters and gauges are exported; currently histograms and long-running
averages are not. It's possible that Ceph's 2-D histograms could be
reduced to two separate 1-D histograms, and that long-running averages
could be exported as Prometheus' Summary type.

Timestamps, as with many Prometheus exporters, are established by
the server's scrape time (Prometheus expects that it is polling the
actual counter process synchronously). It is possible to supply a
timestamp along with the stat report, but the Prometheus team strongly
advises against this. This means that timestamps will be delayed by
an unpredictable amount; it's not clear if this will be problematic,
but it's worth knowing about.