.. _mgr-prometheus:

=================
Prometheus Module
=================

Provides a Prometheus exporter to pass on Ceph performance counters
from the collection point in ceph-mgr. Ceph-mgr receives MMgrReport
messages from all MgrClient processes (mons and OSDs, for instance)
with performance counter schema data and actual counter data, and keeps
a circular buffer of the last N samples. This module creates an HTTP
endpoint (like all Prometheus exporters) and retrieves the latest sample
of every counter when polled (or "scraped" in Prometheus terminology).
The HTTP path and query parameters are ignored; all extant counters
for all reporting entities are returned in text exposition format.
(See the Prometheus `documentation <https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-details>`_.)

Enabling prometheus output
==========================

The *prometheus* module is enabled with::

    ceph mgr module enable prometheus

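To confirm that the module is running, you can list the manager modules (the
exact output format varies by release; ``prometheus`` should appear among the
enabled modules)::

    ceph mgr module ls
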
Configuration
-------------

.. note::

    The Prometheus manager module needs to be restarted for configuration changes to be applied.

.. mgr_module:: prometheus
.. confval:: server_addr
.. confval:: server_port
.. confval:: scrape_interval
.. confval:: cache
.. confval:: stale_cache_strategy
.. confval:: rbd_stats_pools
.. confval:: rbd_stats_pools_refresh_interval
.. confval:: standby_behaviour
.. confval:: standby_error_status_code

By default the module will accept HTTP requests on port ``9283`` on all IPv4
and IPv6 addresses on the host. The port and listen address are both
configurable with ``ceph config set``, with keys
``mgr/prometheus/server_addr`` and ``mgr/prometheus/server_port``. This port
is registered with Prometheus's `registry
<https://github.com/prometheus/prometheus/wiki/Default-port-allocations>`_.

::

    ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
    ceph config set mgr mgr/prometheus/server_port 9283

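To check that the exporter is responding, you can fetch the metrics by hand.
The example below assumes the default port and that it is run on (or can reach)
the host running the active manager daemon::

    curl -s http://localhost:9283/metrics | head
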
.. warning::

    The :confval:`mgr/prometheus/scrape_interval` of this module should always
    match the scrape interval configured in Prometheus; otherwise the cache
    may not behave as expected.

    The scrape interval in the module is used for caching purposes
    and to determine when the cache is stale.

    A scrape interval below 10 seconds is not recommended. The recommended
    scrape interval is 15 seconds, although in some cases it may be useful
    to increase it.

To set a different scrape interval in the Prometheus module, set
``scrape_interval`` to the desired value::

    ceph config set mgr mgr/prometheus/scrape_interval 20

On large clusters (>1000 OSDs), the time to fetch the metrics may become
significant. Without the cache, the Prometheus manager module could, especially
in conjunction with multiple Prometheus instances, overload the manager and lead
to unresponsive or crashing Ceph manager instances. Hence, the cache is enabled
by default. This means that there is a possibility that the cache becomes
stale. The cache is considered stale when the time to fetch the metrics from
Ceph exceeds the configured :confval:`mgr/prometheus/scrape_interval`.

If that is the case, **a warning will be logged** and the module will either

* respond with a 503 HTTP status code (service unavailable), or
* return the content of the cache, even though it might be stale.

This behavior can be configured. By default, the module responds with a 503
HTTP status code (service unavailable). You can set other options using the
``ceph config set`` commands.

To tell the module to respond with possibly stale data, set it to ``return``::

    ceph config set mgr mgr/prometheus/stale_cache_strategy return

To tell the module to respond with "service unavailable", set it to ``fail``::

    ceph config set mgr mgr/prometheus/stale_cache_strategy fail

If you are confident that you don't require the cache, you can disable it::

    ceph config set mgr mgr/prometheus/cache false

If you are using the prometheus module behind a reverse proxy or
load balancer, you can simplify discovering the active instance by switching
to ``error``-mode::

    ceph config set mgr mgr/prometheus/standby_behaviour error

If set, the prometheus module will respond with an HTTP error when ``/`` is
requested from a standby instance. The default error code is 500, but you can
configure the HTTP response code with::

    ceph config set mgr mgr/prometheus/standby_error_status_code 503

Valid error codes are in the range 400 to 599.

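With ``error``-mode enabled, a proxy or load balancer health check can tell the
active manager apart from the standbys by HTTP status code alone. A minimal
sketch (the hostname is a placeholder; the active instance returns ``200``,
while a standby returns the configured error code)::

    curl -s -o /dev/null -w '%{http_code}\n' http://mgr-host.example.com:9283/
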
To switch back to the default behaviour, simply set the config key to ``default``::

    ceph config set mgr mgr/prometheus/standby_behaviour default

Ceph Health Checks
------------------

The mgr/prometheus module also tracks and maintains a history of Ceph health checks,
exposing them to the Prometheus server as discrete metrics. This allows Prometheus
alert rules to be configured for specific health check events.

The metrics take the following form:

::

    # HELP ceph_health_detail healthcheck status by type (0=inactive, 1=active)
    # TYPE ceph_health_detail gauge
    ceph_health_detail{name="OSDMAP_FLAGS",severity="HEALTH_WARN"} 0.0
    ceph_health_detail{name="OSD_DOWN",severity="HEALTH_WARN"} 1.0
    ceph_health_detail{name="PG_DEGRADED",severity="HEALTH_WARN"} 1.0

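As an illustration, a Prometheus alerting rule can fire on a specific health
check by matching the ``name`` label of ``ceph_health_detail``. This is only a
sketch; the alert name, duration, and severity label are arbitrary choices::

    groups:
      - name: ceph-health
        rules:
          - alert: CephOSDDown
            expr: ceph_health_detail{name="OSD_DOWN"} == 1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Ceph reports one or more OSDs down"
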
The health check history is made available through the following commands:

::

    healthcheck history ls [--format {plain|json|json-pretty}]
    healthcheck history clear

The ``ls`` command provides an overview of the health checks that the cluster has
encountered, either over its lifetime or since the last ``clear`` command was issued,
as shown in the example below:

::

    [ceph: root@c8-node1 /]# ceph healthcheck history ls
    Healthcheck Name          First Seen (UTC)      Last seen (UTC)       Count  Active
    OSDMAP_FLAGS              2021/09/16 03:17:47   2021/09/16 22:07:40       2    No
    OSD_DOWN                  2021/09/17 00:11:59   2021/09/17 00:11:59       1   Yes
    PG_DEGRADED               2021/09/17 00:11:59   2021/09/17 00:11:59       1   Yes
    3 health check(s) listed


.. _prometheus-rbd-io-statistics:

RBD IO statistics
-----------------

The module can optionally collect RBD per-image IO statistics by enabling
dynamic OSD performance counters. The statistics are gathered for all images
in the pools that are specified in the ``mgr/prometheus/rbd_stats_pools``
configuration parameter. The parameter is a comma- or space-separated list
of ``pool[/namespace]`` entries. If the namespace is not specified, the
statistics are collected for all namespaces in the pool.

Example to activate the RBD-enabled pools ``pool1``, ``pool2`` and ``poolN``::

    ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2,poolN"

The module builds the list of all available images by scanning the specified
pools and namespaces, and refreshes it periodically. The period is
configurable via the ``mgr/prometheus/rbd_stats_pools_refresh_interval``
parameter (in seconds) and defaults to 300 seconds (5 minutes). The module will
force an earlier refresh if it detects statistics from a previously unknown
RBD image.

Example to set the sync interval to 10 minutes::

    ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 600

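Once enabled, per-image series can be queried like any other metric. The query
below is a sketch: the metric and label names used here (``ceph_rbd_write_bytes``
with a ``pool`` label) should be verified against the module's actual output on
your cluster::

    rate(ceph_rbd_write_bytes{pool="pool1"}[1m])
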
Statistic names and labels
==========================

The names of the stats are exactly as Ceph names them, with
illegal characters ``.``, ``-`` and ``::`` translated to ``_``,
and ``ceph_`` prefixed to all names.


All *daemon* statistics have a ``ceph_daemon`` label such as "osd.123"
that identifies the type and ID of the daemon they come from. Some
statistics can come from different types of daemon, so when querying
e.g. an OSD's RocksDB stats, you would probably want to filter
on ``ceph_daemon`` values starting with "osd" to avoid mixing in the monitor
RocksDB stats.

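For example, to restrict RocksDB-related series to OSD daemons only, a selector
along these lines can be used (the ``__name__`` regex is illustrative; adjust it
to the metric names actually present in your output)::

    {__name__=~"ceph_rocksdb_.*", ceph_daemon=~"osd.*"}
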

The *cluster* statistics (i.e. those global to the Ceph cluster)
have labels appropriate to what they report on. For example,
metrics relating to pools have a ``pool_id`` label.


The long-running averages that represent the histograms from core Ceph
are represented by a pair of ``<name>_sum`` and ``<name>_count`` metrics.
This is similar to how histograms are represented in `Prometheus <https://prometheus.io/docs/concepts/metric_types/#histogram>`_
and they can also be treated `similarly <https://prometheus.io/docs/practices/histograms/>`_.

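For instance, an average can be derived from such a pair with the usual
``rate(..._sum) / rate(..._count)`` pattern. The metric name below is an
assumption (an OSD read-latency counter); substitute one that exists in your
output::

    rate(ceph_osd_op_r_latency_sum[5m]) / rate(ceph_osd_op_r_latency_count[5m])
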
Pool and OSD metadata series
----------------------------

Special series are output to enable displaying and querying on
certain metadata fields.

Pools have a ``ceph_pool_metadata`` series like this:

::

    ceph_pool_metadata{pool_id="2",name="cephfs_metadata_a"} 1.0

OSDs have a ``ceph_osd_metadata`` series like this:

::

    ceph_osd_metadata{cluster_addr="172.21.9.34:6802/19096",device_class="ssd",ceph_daemon="osd.0",public_addr="172.21.9.34:6801/19096",weight="1.0"} 1.0


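Because these metadata series have the value 1, they can be multiplied onto
other series to attach their labels. The sketch below joins a pool statistic
onto ``ceph_pool_metadata`` by ``pool_id`` to pull in the pool ``name`` label;
the pool metric used here (``ceph_pool_objects``) is illustrative::

    ceph_pool_objects * on (pool_id) group_left(name) ceph_pool_metadata
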
Correlating drive statistics with node_exporter
-----------------------------------------------

The prometheus output from Ceph is designed to be used in conjunction
with the generic host monitoring from the Prometheus node_exporter.

To enable correlation of Ceph OSD statistics with node_exporter's
drive statistics, special series are output like this:

::

    ceph_disk_occupation_human{ceph_daemon="osd.0", device="sdd", exported_instance="myhost"}

To use this to get disk statistics by OSD ID, use either the ``and`` operator or
the ``*`` operator in your Prometheus query. All metadata metrics (like
``ceph_disk_occupation_human``) have the value 1, so they act neutrally with ``*``.
Using ``*`` allows the ``group_left`` and ``group_right`` grouping modifiers to be
used, so that the resulting metric has additional labels from one side of the query.

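As a sketch, the ``*`` form of such a query might look like the following; note
that, just as with the ``and`` query shown further down, the ``instance`` labels
of the two metrics must be made to match first (see the approaches described
below)::

    rate(node_disk_bytes_written[30s])
      * on (device, instance) group_left(ceph_daemon)
    ceph_disk_occupation_human{ceph_daemon="osd.0"}
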
See the
`prometheus documentation`__ for more information about constructing queries.

__ https://prometheus.io/docs/prometheus/latest/querying/basics

The goal is to run a query like the following:

::

    rate(node_disk_bytes_written[30s]) and
    on (device,instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}

Out of the box the above query will not return any metrics since the ``instance`` labels of
both metrics don't match. The ``instance`` label of ``ceph_disk_occupation_human``
will be the currently active MGR node.

The following two sections outline two approaches to remedy this.

.. note::

    If you need to group on the ``ceph_daemon`` label instead of the ``device`` and
    ``instance`` labels, using ``ceph_disk_occupation_human`` may not work reliably.
    It is advised that you use ``ceph_disk_occupation`` instead.

    The difference is that ``ceph_disk_occupation_human`` may group several OSDs
    into the value of a single ``ceph_daemon`` label in cases where multiple OSDs
    share a disk.

Use label_replace
=================

The ``label_replace`` function (see the
`label_replace documentation <https://prometheus.io/docs/prometheus/latest/querying/functions/#label_replace>`_)
can add a label to, or alter a label of, a metric within a query.

To correlate an OSD and its disk's write rate, the following query can be used:

::

    label_replace(
        rate(node_disk_bytes_written[30s]),
        "exported_instance",
        "$1",
        "instance",
        "(.*):.*"
    ) and on (device, exported_instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}

Configuring Prometheus server
=============================

honor_labels
------------

To enable Ceph to output properly-labeled data relating to any host,
use the ``honor_labels`` setting when adding the ceph-mgr endpoints
to your Prometheus configuration.

This allows Ceph to export the proper ``instance`` label without Prometheus
overwriting it. Without this setting, Prometheus applies an ``instance`` label
that includes the hostname and port of the endpoint that the series came from.
Because Ceph clusters have multiple manager daemons, this results in an
``instance`` label that changes spuriously when the active manager daemon
changes.

If this is undesirable, a custom ``instance`` label can be set in the
Prometheus target configuration: you might wish to set it to the hostname
of your first mgr daemon, or something completely arbitrary like "ceph_cluster".

node_exporter hostname labels
-----------------------------

Set your ``instance`` labels to match what appears in Ceph's OSD metadata
in the ``instance`` field. This is generally the short hostname of the node.

This is only necessary if you want to correlate Ceph stats with host stats,
but you may find it useful to do so anyway, in case you want to perform the
correlation in the future.

Example configuration
---------------------

This example shows a single node configuration running ceph-mgr and
node_exporter on a server called ``senta04``. Note that this requires
adding an appropriate and unique ``instance`` label to each ``node_exporter`` target.

This is just an example: there are other ways to configure Prometheus
scrape targets and label rewrite rules.

prometheus.yml
~~~~~~~~~~~~~~

::

    global:
        scrape_interval: 15s
        evaluation_interval: 15s

    scrape_configs:
      - job_name: 'node'
        file_sd_configs:
          - files:
            - node_targets.yml
      - job_name: 'ceph'
        honor_labels: true
        file_sd_configs:
          - files:
            - ceph_targets.yml


ceph_targets.yml
~~~~~~~~~~~~~~~~

::

    [
        {
            "targets": [ "senta04.mydomain.com:9283" ],
            "labels": {}
        }
    ]


node_targets.yml
~~~~~~~~~~~~~~~~

::

    [
        {
            "targets": [ "senta04.mydomain.com:9100" ],
            "labels": {
                "instance": "senta04"
            }
        }
    ]


Notes
=====

Counters and gauges are exported; currently histograms and long-running
averages are not. It's possible that Ceph's 2-D histograms could be
reduced to two separate 1-D histograms, and that long-running averages
could be exported as Prometheus' Summary type.

Timestamps, as with many Prometheus exporters, are established by
the server's scrape time (Prometheus expects that it is polling the
actual counter process synchronously). It is possible to supply a
timestamp along with the stat report, but the Prometheus team strongly
advises against this. This means that timestamps will be delayed by
an unpredictable amount; it's not clear if this will be problematic,
but it's worth knowing about.