.. _mgr-prometheus:

=================
Prometheus Module
=================

Provides a Prometheus exporter to pass on Ceph performance counters
from the collection point in ceph-mgr. Ceph-mgr receives MMgrReport
messages from all MgrClient processes (mons and OSDs, for instance)
with performance counter schema data and actual counter data, and keeps
a circular buffer of the last N samples. This module creates an HTTP
endpoint (like all Prometheus exporters) and retrieves the latest sample
of every counter when polled (or "scraped" in Prometheus terminology).
The HTTP path and query parameters are ignored; all extant counters
for all reporting entities are returned in text exposition format.
(See the Prometheus `documentation <https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-details>`_.)

Enabling prometheus output
==========================

The *prometheus* module is enabled with:

.. prompt:: bash $

   ceph mgr module enable prometheus

Configuration
-------------

.. note::

    The Prometheus manager module needs to be restarted for configuration changes to be applied.

.. mgr_module:: prometheus
.. confval:: server_addr
.. confval:: server_port
.. confval:: scrape_interval
.. confval:: cache
.. confval:: stale_cache_strategy
.. confval:: rbd_stats_pools
.. confval:: rbd_stats_pools_refresh_interval
.. confval:: standby_behaviour
.. confval:: standby_error_status_code

By default the module will accept HTTP requests on port ``9283`` on all IPv4
and IPv6 addresses on the host. The port and listen address are both
configurable with ``ceph config set``, with keys
``mgr/prometheus/server_addr`` and ``mgr/prometheus/server_port``. This port
is registered with Prometheus's `registry
<https://github.com/prometheus/prometheus/wiki/Default-port-allocations>`_.

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
   ceph config set mgr mgr/prometheus/server_port 9283

.. warning::

    The :confval:`mgr/prometheus/scrape_interval` of this module should always be set to match
    the Prometheus scrape interval, otherwise caching and staleness detection will not work
    correctly.

The scrape interval in the module is used for caching purposes
and to determine when a cache is stale.

Using a scrape interval below 10 seconds is not recommended. The recommended
scrape interval is 15 seconds, though in some cases it might be useful to
increase it.

To set a different scrape interval in the Prometheus module, set
``scrape_interval`` to the desired value:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/scrape_interval 20

On large clusters (>1000 OSDs), the time to fetch the metrics may become
significant. Without the cache, the Prometheus manager module could, especially
in conjunction with multiple Prometheus instances, overload the manager and lead
to unresponsive or crashing Ceph manager instances. Hence, the cache is enabled
by default. This means that there is a possibility that the cache becomes
stale. The cache is considered stale when the time to fetch the metrics from
Ceph exceeds the configured :confval:`mgr/prometheus/scrape_interval`.

If that is the case, **a warning will be logged** and the module will either

* respond with a 503 HTTP status code (service unavailable), or
* return the content of the cache, even though it might be stale.

This behavior can be configured. By default, the module responds with a 503
HTTP status code (service unavailable). You can set other options using the
``ceph config set`` commands.

To tell the module to respond with possibly stale data, set it to ``return``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/stale_cache_strategy return

To tell the module to respond with "service unavailable", set it to ``fail``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/stale_cache_strategy fail

If you are confident that you don't require the cache, you can disable it:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/cache false

If you are using the prometheus module behind a reverse proxy or load
balancer, you can simplify discovering the active instance by switching
to ``error``-mode:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/standby_behaviour error

If set, the prometheus module will respond with an HTTP error when requesting ``/``
from the standby instance. The default error code is 500, but you can configure
the HTTP response code with:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/standby_error_status_code 503

Valid error codes are in the range 400-599.

To switch back to the default behaviour, set the config key to ``default``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/standby_behaviour default
Ceph Health Checks
------------------

The mgr/prometheus module also tracks and maintains a history of Ceph health checks,
exposing them to the Prometheus server as discrete metrics. This allows Prometheus
alert rules to be configured for specific health check events.

The metrics take the following form:

::

    # HELP ceph_health_detail healthcheck status by type (0=inactive, 1=active)
    # TYPE ceph_health_detail gauge
    ceph_health_detail{name="OSDMAP_FLAGS",severity="HEALTH_WARN"} 0.0
    ceph_health_detail{name="OSD_DOWN",severity="HEALTH_WARN"} 1.0
    ceph_health_detail{name="PG_DEGRADED",severity="HEALTH_WARN"} 1.0

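For example, a Prometheus alert rule can fire while a specific health check is
active by testing the corresponding series against ``1``. A minimal sketch
(adjust the ``name`` label to the health check you are interested in):

::

    ceph_health_detail{name="OSD_DOWN"} == 1

Such an expression can be used as the ``expr`` of a Prometheus alerting rule.
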
The health check history is made available through the following commands:

::

    healthcheck history ls [--format {plain|json|json-pretty}]
    healthcheck history clear

The ``ls`` command provides an overview of the health checks that the cluster has
encountered, or of those seen since the last ``clear`` command was issued. For example:

::

    [ceph: root@c8-node1 /]# ceph healthcheck history ls
    Healthcheck Name   First Seen (UTC)      Last seen (UTC)       Count   Active
    OSDMAP_FLAGS       2021/09/16 03:17:47   2021/09/16 22:07:40   2       No
    OSD_DOWN           2021/09/17 00:11:59   2021/09/17 00:11:59   1       Yes
    PG_DEGRADED        2021/09/17 00:11:59   2021/09/17 00:11:59   1       Yes
    3 health check(s) listed


.. _prometheus-rbd-io-statistics:

RBD IO statistics
-----------------

The module can optionally collect RBD per-image IO statistics by enabling
dynamic OSD performance counters. The statistics are gathered for all images
in the pools that are specified in the ``mgr/prometheus/rbd_stats_pools``
configuration parameter. The parameter is a comma- or space-separated list
of ``pool[/namespace]`` entries. If the namespace is not specified, the
statistics are collected for all namespaces in the pool.

Example to activate the RBD-enabled pools ``pool1``, ``pool2`` and ``poolN``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2,poolN"

The wildcard can be used to indicate all pools or namespaces:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/rbd_stats_pools "*"

The module builds the list of all available images by scanning the specified
pools and namespaces and refreshes it periodically. The period is
configurable via the ``mgr/prometheus/rbd_stats_pools_refresh_interval``
parameter (in seconds) and is 300 seconds (5 minutes) by default. The module
forces an earlier refresh if it detects statistics from a previously unknown
RBD image.

Example to increase the refresh interval to 10 minutes:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 600
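
Once enabled, the per-image counters can be queried like any other Prometheus
series. As a sketch (assuming the module exposes a per-image write counter
named ``ceph_rbd_write_bytes`` with ``pool``, ``namespace`` and ``image``
labels), the ten images with the highest write throughput could be listed
with:

::

    topk(10, rate(ceph_rbd_write_bytes[5m]))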

Ceph daemon performance counters metrics
-----------------------------------------

With the introduction of the ``ceph-exporter`` daemon, the prometheus module no
longer exports Ceph daemon perf counters as Prometheus metrics by default.
However, one may re-enable exporting these metrics by setting the module option
``exclude_perf_counters`` to ``false``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/exclude_perf_counters false

Statistic names and labels
==========================

The names of the stats are exactly as Ceph names them, with
illegal characters ``.``, ``-`` and ``::`` translated to ``_``,
and ``ceph_`` prefixed to all names.


All *daemon* statistics have a ``ceph_daemon`` label such as ``osd.123``
that identifies the type and ID of the daemon they come from. Some
statistics can come from different types of daemon, so when querying
e.g. an OSD's RocksDB stats, you would probably want to filter
on ``ceph_daemon`` starting with "osd" to avoid mixing in the monitor
RocksDB stats.
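
For example, a sketch that restricts a query to OSD daemons by matching
``ceph_daemon`` against a regular expression (assuming a RocksDB perf counter
such as ``ceph_rocksdb_submit_latency_count`` is being exported):

::

    ceph_rocksdb_submit_latency_count{ceph_daemon=~"osd.*"}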


The *cluster* statistics (i.e. those global to the Ceph cluster)
have labels appropriate to what they report on. For example,
metrics relating to pools have a ``pool_id`` label.


The long running averages that represent the histograms from core Ceph
are represented by a pair of ``<name>_sum`` and ``<name>_count`` metrics.
This is similar to how histograms are represented in `Prometheus <https://prometheus.io/docs/concepts/metric_types/#histogram>`_
and they can also be treated `similarly <https://prometheus.io/docs/practices/histograms/>`_.
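
For example, a recent average can be derived from such a pair by dividing the
rate of the ``_sum`` series by the rate of the ``_count`` series. A sketch,
assuming an OSD operation latency pair named ``ceph_osd_op_latency_sum`` and
``ceph_osd_op_latency_count`` is available:

::

    rate(ceph_osd_op_latency_sum[5m]) / rate(ceph_osd_op_latency_count[5m])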

Pool and OSD metadata series
----------------------------

Special series are output to enable displaying and querying on
certain metadata fields.

Pools have a ``ceph_pool_metadata`` field like this:

::

    ceph_pool_metadata{pool_id="2",name="cephfs_metadata_a"} 1.0

OSDs have a ``ceph_osd_metadata`` field like this:

::

    ceph_osd_metadata{cluster_addr="172.21.9.34:6802/19096",device_class="ssd",ceph_daemon="osd.0",public_addr="172.21.9.34:6801/19096",weight="1.0"} 1.0
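
Because these metadata series always have the value ``1``, they can be
multiplied into other series to attach their labels. A sketch (assuming a
per-pool usage metric such as ``ceph_pool_stored`` that carries a matching
``pool_id`` label), attaching the pool ``name`` to a pool metric:

::

    ceph_pool_stored * on (pool_id) group_left (name) ceph_pool_metadata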


Correlating drive statistics with node_exporter
-----------------------------------------------

The prometheus output from Ceph is designed to be used in conjunction
with the generic host monitoring from the Prometheus node_exporter.

To enable correlation of Ceph OSD statistics with node_exporter's
drive statistics, special series are output like this:

::

    ceph_disk_occupation_human{ceph_daemon="osd.0", device="sdd", exported_instance="myhost"}

To use this to get disk statistics by OSD ID, use either the ``and`` operator or
the ``*`` operator in your prometheus query. All metadata metrics (like
``ceph_disk_occupation_human``) have the value 1 so they act neutrally with ``*``.
Using ``*`` allows the use of ``group_left`` and ``group_right`` grouping
modifiers, so that the resulting metric has additional labels from one side
of the query.

See the
`prometheus documentation`__ for more information about constructing queries.

__ https://prometheus.io/docs/prometheus/latest/querying/basics

The goal is to run a query like:

::

    rate(node_disk_written_bytes_total[30s]) and
    on (device,instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}

Out of the box the above query will not return any metrics since the ``instance`` labels of
both metrics don't match. The ``instance`` label of ``ceph_disk_occupation_human``
will be the currently active MGR node.

The following two sections outline two approaches to remedy this.

.. note::

    If you need to group on the ``ceph_daemon`` label instead of the ``device`` and
    ``instance`` labels, using ``ceph_disk_occupation_human`` may not work reliably.
    It is advised that you use ``ceph_disk_occupation`` instead.

    The difference is that ``ceph_disk_occupation_human`` may group several OSDs
    into the value of a single ``ceph_daemon`` label in cases where multiple OSDs
    share a disk.

Use label_replace
=================

The ``label_replace`` function (see the
`label_replace documentation <https://prometheus.io/docs/prometheus/latest/querying/functions/#label_replace>`_)
can add a label to, or alter a label of, a metric within a query.

To correlate an OSD with the write rate of its disks, the following query can be used:

::

    label_replace(
        rate(node_disk_written_bytes_total[30s]),
        "exported_instance",
        "$1",
        "instance",
        "(.*):.*"
    ) and on (device, exported_instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}
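
The same approach can be combined with the ``*`` operator and ``group_left``
(described above), so that the resulting series also carry the ``ceph_daemon``
label. A sketch building on the previous query:

::

    label_replace(
        rate(node_disk_written_bytes_total[30s]),
        "exported_instance",
        "$1",
        "instance",
        "(.*):.*"
    ) * on (device, exported_instance) group_left (ceph_daemon) ceph_disk_occupation_human

As noted earlier, use ``ceph_disk_occupation`` instead of
``ceph_disk_occupation_human`` if you need to group on the ``ceph_daemon`` label.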

Configuring Prometheus server
=============================

honor_labels
------------

To enable Ceph to output properly-labeled data relating to any host,
use the ``honor_labels`` setting when adding the ceph-mgr endpoints
to your Prometheus configuration.

This allows Ceph to export the proper ``instance`` label without Prometheus
overwriting it. Without this setting, Prometheus applies an ``instance`` label
that includes the hostname and port of the endpoint that the series came from.
Because Ceph clusters have multiple manager daemons, this results in an
``instance`` label that changes spuriously when the active manager daemon
changes.

If this is undesirable, a custom ``instance`` label can be set in the
Prometheus target configuration: you might wish to set it to the hostname
of your first mgr daemon, or something completely arbitrary like "ceph_cluster".

node_exporter hostname labels
-----------------------------

Set your ``instance`` labels to match what appears in Ceph's OSD metadata
in the ``instance`` field. This is generally the short hostname of the node.

This is only necessary if you want to correlate Ceph stats with host stats,
but you may find it useful to do so anyway, in case you want to do the
correlation in the future.

Example configuration
---------------------

This example shows a single node configuration running ceph-mgr and
node_exporter on a server called ``senta04``. Note that this requires one
to add an appropriate and unique ``instance`` label to each ``node_exporter`` target.

This is just an example: there are other ways to configure prometheus
scrape targets and label rewrite rules.

prometheus.yml
~~~~~~~~~~~~~~

::

    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: 'node'
        file_sd_configs:
          - files:
            - node_targets.yml
      - job_name: 'ceph'
        honor_labels: true
        file_sd_configs:
          - files:
            - ceph_targets.yml


ceph_targets.yml
~~~~~~~~~~~~~~~~

::

    [
        {
            "targets": [ "senta04.mydomain.com:9283" ],
            "labels": {}
        }
    ]


node_targets.yml
~~~~~~~~~~~~~~~~

::

    [
        {
            "targets": [ "senta04.mydomain.com:9100" ],
            "labels": {
                "instance": "senta04"
            }
        }
    ]


Notes
=====

Counters and gauges are exported; currently histograms and long-running
averages are not. It's possible that Ceph's 2-D histograms could be
reduced to two separate 1-D histograms, and that long-running averages
could be exported as Prometheus' Summary type.

Timestamps, as with many Prometheus exporters, are established by
the server's scrape time (Prometheus expects that it is polling the
actual counter process synchronously). It is possible to supply a
timestamp along with the stat report, but the Prometheus team strongly
advises against this. This means that timestamps will be delayed by
an unpredictable amount; it's not clear if this will be problematic,
but it's worth knowing about.