.. _mgr-prometheus:

=================
Prometheus Module
=================

Provides a Prometheus exporter to pass on Ceph performance counters
from the collection point in ceph-mgr. Ceph-mgr receives MMgrReport
messages from all MgrClient processes (mons and OSDs, for instance)
with performance counter schema data and actual counter data, and keeps
a circular buffer of the last N samples. This module creates an HTTP
endpoint (like all Prometheus exporters) and retrieves the latest sample
of every counter when polled (or "scraped" in Prometheus terminology).
The HTTP path and query parameters are ignored; all extant counters
for all reporting entities are returned in text exposition format.
(See the Prometheus `documentation <https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-details>`_.)

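Once the module has been enabled (see the next section), a scrape can be
simulated by hand to inspect the exposition output. This is a minimal sketch
that assumes the module is listening on the default port on the local host:

.. prompt:: bash $

   curl http://localhost:9283/metrics
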
Enabling prometheus output
==========================

The *prometheus* module is enabled with:

.. prompt:: bash $

   ceph mgr module enable prometheus

Configuration
-------------

.. note::

   The Prometheus manager module needs to be restarted for configuration changes to be applied.

.. mgr_module:: prometheus
.. confval:: server_addr
.. confval:: server_port
.. confval:: scrape_interval
.. confval:: cache
.. confval:: stale_cache_strategy
.. confval:: rbd_stats_pools
.. confval:: rbd_stats_pools_refresh_interval
.. confval:: standby_behaviour
.. confval:: standby_error_status_code

By default the module will accept HTTP requests on port ``9283`` on all IPv4
and IPv6 addresses on the host. The port and listen address are both
configurable with ``ceph config set``, with keys
``mgr/prometheus/server_addr`` and ``mgr/prometheus/server_port``. This port
is registered with Prometheus's `registry
<https://github.com/prometheus/prometheus/wiki/Default-port-allocations>`_.

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
   ceph config set mgr mgr/prometheus/server_port 9283

.. warning::

   The :confval:`mgr/prometheus/scrape_interval` of this module should always
   be set to match the Prometheus scrape interval so that the module works
   properly and does not cause issues.

The scrape interval in the module is used for caching purposes
and to determine when a cache is stale.

Using a scrape interval below 10 seconds is not recommended. A scrape
interval of 15 seconds is recommended, though in some cases it may be
useful to increase it.

To set a different scrape interval in the Prometheus module, set
``scrape_interval`` to the desired value:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/scrape_interval 20

On large clusters (>1000 OSDs), the time to fetch the metrics may become
significant. Without the cache, the Prometheus manager module could, especially
in conjunction with multiple Prometheus instances, overload the manager and lead
to unresponsive or crashing Ceph manager instances. Hence, the cache is enabled
by default. This means that there is a possibility that the cache becomes
stale. The cache is considered stale when the time to fetch the metrics from
Ceph exceeds the configured :confval:`mgr/prometheus/scrape_interval`.

If that is the case, **a warning will be logged** and the module will either

* respond with a 503 HTTP status code (service unavailable), or
* return the contents of the cache, even though they might be stale.

This behavior can be configured. By default, the module will return a 503 HTTP
status code (service unavailable). You can set other options using the ``ceph
config set`` commands.

To tell the module to respond with possibly stale data, set it to ``return``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/stale_cache_strategy return

To tell the module to respond with "service unavailable", set it to ``fail``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/stale_cache_strategy fail

If you are confident that you don't require the cache, you can disable it:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/cache false

If you are using the prometheus module behind some kind of reverse proxy or
load balancer, you can simplify discovering the active instance by switching
to ``error``-mode:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/standby_behaviour error

If set, the prometheus module will respond with an HTTP error when ``/`` is
requested from the standby instance. The default error code is 500, but you
can configure the HTTP response code with:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/standby_error_status_code 503

Valid error codes are in the range 400-599.

To switch back to the default behaviour, simply set the config key to ``default``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/standby_behaviour default

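As an illustration, a load balancer health check can then key off that status
code to find the active manager. The following HAProxy snippet is a sketch
only; the backend name and manager hostnames are hypothetical:

::

    backend ceph_mgr_prometheus
        # Probe "/"; with standby_behaviour=error only the active
        # mgr instance answers with HTTP 200.
        option httpchk GET /
        http-check expect status 200
        server mgr1 mgr1.example.com:9283 check
        server mgr2 mgr2.example.com:9283 check
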
Ceph Health Checks
------------------

The mgr/prometheus module also tracks and maintains a history of Ceph health checks,
exposing them to the Prometheus server as discrete metrics. This allows Prometheus
alert rules to be configured for specific health check events.

The metrics take the following form:

::

    # HELP ceph_health_detail healthcheck status by type (0=inactive, 1=active)
    # TYPE ceph_health_detail gauge
    ceph_health_detail{name="OSDMAP_FLAGS",severity="HEALTH_WARN"} 0.0
    ceph_health_detail{name="OSD_DOWN",severity="HEALTH_WARN"} 1.0
    ceph_health_detail{name="PG_DEGRADED",severity="HEALTH_WARN"} 1.0

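For example, a Prometheus alerting rule can fire on a specific health check.
This rule file is a sketch; the group name, duration, and severity label are
illustrative:

::

    groups:
      - name: ceph-health
        rules:
          - alert: CephOSDDown
            expr: ceph_health_detail{name="OSD_DOWN"} == 1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "One or more OSDs have been down for 5 minutes"
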
The health check history is made available through the following commands:

::

    healthcheck history ls [--format {plain|json|json-pretty}]
    healthcheck history clear

The ``ls`` command provides an overview of the health checks that the cluster
has encountered, either overall or since the last ``clear`` command was
issued. For example:

::

    [ceph: root@c8-node1 /]# ceph healthcheck history ls
    Healthcheck Name          First Seen (UTC)      Last seen (UTC)       Count  Active
    OSDMAP_FLAGS              2021/09/16 03:17:47   2021/09/16 22:07:40   2      No
    OSD_DOWN                  2021/09/17 00:11:59   2021/09/17 00:11:59   1      Yes
    PG_DEGRADED               2021/09/17 00:11:59   2021/09/17 00:11:59   1      Yes
    3 health check(s) listed

.. _prometheus-rbd-io-statistics:

RBD IO statistics
-----------------

The module can optionally collect RBD per-image IO statistics by enabling
dynamic OSD performance counters. The statistics are gathered for all images
in the pools that are specified in the ``mgr/prometheus/rbd_stats_pools``
configuration parameter. The parameter is a comma or space separated list
of ``pool[/namespace]`` entries. If the namespace is not specified the
statistics are collected for all namespaces in the pool.

For example, to activate the RBD-enabled pools ``pool1``, ``pool2`` and ``poolN``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2,poolN"

The wildcard can be used to indicate all pools or namespaces:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/rbd_stats_pools "*"

The module builds the list of all available images by scanning the specified
pools and namespaces and refreshes it periodically. The period is
configurable via the ``mgr/prometheus/rbd_stats_pools_refresh_interval``
parameter (in seconds) and is 300 seconds (5 minutes) by default. The module
will force a refresh earlier if it detects statistics from a previously
unknown RBD image.

For example, to increase the sync interval to 10 minutes:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 600

Ceph daemon performance counter metrics
----------------------------------------

With the introduction of the ``ceph-exporter`` daemon, the prometheus module
will no longer export Ceph daemon perf counters as prometheus metrics by
default. However, one may re-enable exporting these metrics by setting the
module option ``exclude_perf_counters`` to ``false``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/exclude_perf_counters false

Statistic names and labels
==========================

The names of the stats are exactly as Ceph names them, with
illegal characters ``.``, ``-`` and ``::`` translated to ``_``,
and ``ceph_`` prefixed to all names.


All *daemon* statistics have a ``ceph_daemon`` label such as "osd.123"
that identifies the type and ID of the daemon they come from. Some
statistics can come from different types of daemon, so when querying
e.g. an OSD's RocksDB stats, you would probably want to filter
on ``ceph_daemon`` starting with "osd" to avoid mixing in the monitor
RocksDB stats.

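For example, to restrict a query to OSD daemons only, match ``ceph_daemon``
with a regular expression (the metric name here is purely illustrative;
substitute any counter exported by the module):

::

    ceph_rocksdb_submit_latency_sum{ceph_daemon=~"osd.*"}
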
The *cluster* statistics (i.e. those global to the Ceph cluster)
have labels appropriate to what they report on. For example,
metrics relating to pools have a ``pool_id`` label.

The long-running averages that represent the histograms from core Ceph
are represented by a pair of ``<name>_sum`` and ``<name>_count`` metrics.
This is similar to how histograms are represented in `Prometheus <https://prometheus.io/docs/concepts/metric_types/#histogram>`_
and they can also be treated `similarly <https://prometheus.io/docs/practices/histograms/>`_.

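For example, the recent average value of such a pair can be computed in the
usual Prometheus way; ``ceph_osd_op_r_latency`` is used here for illustration:

::

    rate(ceph_osd_op_r_latency_sum[5m]) / rate(ceph_osd_op_r_latency_count[5m])
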
Pool and OSD metadata series
----------------------------

Special series are output to enable displaying and querying on
certain metadata fields.

Pools have a ``ceph_pool_metadata`` series like this:

::

    ceph_pool_metadata{pool_id="2",name="cephfs_metadata_a"} 1.0

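Metadata series can be joined onto numeric series to add human-readable
labels. For example, the following attaches the pool ``name`` label to a
per-pool statistic (``ceph_pool_stored`` is used here for illustration):

::

    ceph_pool_stored * on (pool_id) group_left(name) ceph_pool_metadata
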
OSDs have a ``ceph_osd_metadata`` series like this:

::

    ceph_osd_metadata{cluster_addr="172.21.9.34:6802/19096",device_class="ssd",ceph_daemon="osd.0",public_addr="172.21.9.34:6801/19096",weight="1.0"} 1.0

Correlating drive statistics with node_exporter
-----------------------------------------------

The prometheus output from Ceph is designed to be used in conjunction
with the generic host monitoring from the Prometheus node_exporter.

To enable correlation of Ceph OSD statistics with node_exporter's
drive statistics, special series are output like this:

::

    ceph_disk_occupation_human{ceph_daemon="osd.0", device="sdd", exported_instance="myhost"}

To use this to get disk statistics by OSD ID, use either the ``and`` operator
or the ``*`` operator in your prometheus query. All metadata metrics (like
``ceph_disk_occupation_human``) have the value 1, so they act neutrally with
``*``. Using ``*`` allows the use of the ``group_left`` and ``group_right``
grouping modifiers, so that the resulting metric has additional labels from
one side of the query.

See the
`prometheus documentation`__ for more information about constructing queries.

__ https://prometheus.io/docs/prometheus/latest/querying/basics

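As an illustration of the ``*`` form, the following attaches the
``ceph_daemon`` label from the metadata series to node_exporter's disk series;
it assumes the ``instance`` labels of both series already match (see below):

::

    rate(node_disk_written_bytes_total[30s]) * on (device, instance)
      group_left(ceph_daemon) ceph_disk_occupation_human
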
The goal is to run a query like

::

    rate(node_disk_written_bytes_total[30s]) and
      on (device, instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}

Out of the box the above query will not return any metrics since the
``instance`` labels of both metrics don't match. The ``instance`` label of
``ceph_disk_occupation_human`` will be the currently active MGR node.

The following two sections outline two approaches to remedy this.

.. note::

   If you need to group on the ``ceph_daemon`` label instead of the ``device``
   and ``instance`` labels, using ``ceph_disk_occupation_human`` may not work
   reliably. It is advised that you use ``ceph_disk_occupation`` instead.

   The difference is that ``ceph_disk_occupation_human`` may group several
   OSDs into the value of a single ``ceph_daemon`` label in cases where
   multiple OSDs share a disk.

Use label_replace
=================

The ``label_replace`` function (see the
`label_replace documentation <https://prometheus.io/docs/prometheus/latest/querying/functions/#label_replace>`_)
can add a label to, or alter a label of, a metric within a query.

To correlate an OSD and its disk's write rate, the following query can be used:

::

    label_replace(
      rate(node_disk_written_bytes_total[30s]),
      "exported_instance",
      "$1",
      "instance",
      "(.*):.*"
    ) and on (device, exported_instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}

Configuring Prometheus server
=============================

honor_labels
------------

To enable Ceph to output properly-labeled data relating to any host,
use the ``honor_labels`` setting when adding the ceph-mgr endpoints
to your prometheus configuration.

This allows Ceph to export the proper ``instance`` label without Prometheus
overwriting it. Without this setting, Prometheus applies an ``instance`` label
that includes the hostname and port of the endpoint that the series came from.
Because Ceph clusters have multiple manager daemons, this results in an
``instance`` label that changes spuriously when the active manager daemon
changes.

If this is undesirable, a custom ``instance`` label can be set in the
Prometheus target configuration: you might wish to set it to the hostname
of your first mgr daemon, or something completely arbitrary like "ceph_cluster".

node_exporter hostname labels
-----------------------------

Set your ``instance`` labels to match what appears in Ceph's OSD metadata
in the ``instance`` field. This is generally the short hostname of the node.

This is only necessary if you want to correlate Ceph stats with host stats,
but you may find it useful to do so in all cases, in case you want to do
the correlation in the future.

Example configuration
---------------------

This example shows a single node configuration running ceph-mgr and
node_exporter on a server called ``senta04``. Note that this requires one
to add an appropriate and unique ``instance`` label to each ``node_exporter`` target.

This is just an example: there are other ways to configure prometheus
scrape targets and label rewrite rules.

prometheus.yml
~~~~~~~~~~~~~~

::

    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: 'node'
        file_sd_configs:
          - files:
              - node_targets.yml
      - job_name: 'ceph'
        honor_labels: true
        file_sd_configs:
          - files:
              - ceph_targets.yml


ceph_targets.yml
~~~~~~~~~~~~~~~~

::

    [
        {
            "targets": [ "senta04.mydomain.com:9283" ],
            "labels": {}
        }
    ]


node_targets.yml
~~~~~~~~~~~~~~~~

::

    [
        {
            "targets": [ "senta04.mydomain.com:9100" ],
            "labels": {
                "instance": "senta04"
            }
        }
    ]

Notes
=====

Counters and gauges are exported; currently histograms and long-running
averages are not. It's possible that Ceph's 2-D histograms could be
reduced to two separate 1-D histograms, and that long-running averages
could be exported as Prometheus' Summary type.

Timestamps, as with many Prometheus exporters, are established by
the server's scrape time (Prometheus expects that it is polling the
actual counter process synchronously). It is possible to supply a
timestamp along with the stat report, but the Prometheus team strongly
advises against this. This means that timestamps will be delayed by
an unpredictable amount; it's not clear if this will be problematic,
but it's worth knowing about.