.. _mgr-prometheus:

=================
Prometheus Module
=================

Provides a Prometheus exporter to pass on Ceph performance counters
from the collection point in ceph-mgr. Ceph-mgr receives MMgrReport
messages from all MgrClient processes (mons and OSDs, for instance)
with performance counter schema data and actual counter data, and keeps
a circular buffer of the last N samples. This module creates an HTTP
endpoint (like all Prometheus exporters) and retrieves the latest sample
of every counter when polled (or "scraped" in Prometheus terminology).
The HTTP path and query parameters are ignored; all extant counters
for all reporting entities are returned in text exposition format.
(See the Prometheus `documentation <https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-details>`_.)

Enabling prometheus output
==========================

The *prometheus* module is enabled with:

.. prompt:: bash $

   ceph mgr module enable prometheus

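Once the module is enabled, a quick way to confirm that the exporter is
responding is to fetch the metrics endpoint directly. This sketch assumes the
default port and that you are on the active manager host; adjust the host name
for your environment:

.. prompt:: bash $

   curl http://localhost:9283/metrics | head
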
Configuration
-------------

.. note::

   The Prometheus manager module needs to be restarted for configuration
   changes to be applied.

.. mgr_module:: prometheus
.. confval:: server_addr
.. confval:: server_port
.. confval:: scrape_interval
.. confval:: cache
.. confval:: stale_cache_strategy
.. confval:: rbd_stats_pools
.. confval:: rbd_stats_pools_refresh_interval
.. confval:: standby_behaviour
.. confval:: standby_error_status_code

By default the module will accept HTTP requests on port ``9283`` on all IPv4
and IPv6 addresses on the host. The port and listen address are both
configurable with ``ceph config set``, with keys
``mgr/prometheus/server_addr`` and ``mgr/prometheus/server_port``. This port
is registered with Prometheus's `registry
<https://github.com/prometheus/prometheus/wiki/Default-port-allocations>`_.

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
   ceph config set mgr mgr/prometheus/server_port 9283

.. warning::

   The :confval:`scrape_interval` of this module should always be set to match
   the Prometheus scrape interval; otherwise caching will not work properly
   and may cause issues.

The scrape interval in the module is used for caching purposes
and to determine when a cache is stale.

A scrape interval below 10 seconds is not recommended. The recommended
scrape interval is 15 seconds, though in some cases it may be useful to
increase it.

To set a different scrape interval in the Prometheus module, set
``scrape_interval`` to the desired value:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/scrape_interval 20

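If you change the module's scrape interval, keep the Prometheus server's
interval in step. A job-level override in ``prometheus.yml`` might look like
this (a sketch only; the job name, target host, and 20-second interval are
illustrative):

::

    scrape_configs:
      - job_name: 'ceph'
        scrape_interval: 20s
        honor_labels: true
        static_configs:
          - targets: ['mgr-host.example.com:9283']
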
On large clusters (>1000 OSDs), the time to fetch the metrics may become
significant. Without the cache, the Prometheus manager module could, especially
in conjunction with multiple Prometheus instances, overload the manager and lead
to unresponsive or crashing Ceph manager instances. Hence, the cache is enabled
by default. This means that there is a possibility that the cache becomes
stale. The cache is considered stale when the time to fetch the metrics from
Ceph exceeds the configured :confval:`scrape_interval`.

If that is the case, **a warning will be logged** and the module will either

* respond with a 503 HTTP status code (service unavailable), or
* return the content of the cache, even though it might be stale.

This behavior can be configured. By default, the module will return a 503 HTTP
status code (service unavailable). You can set other options using the ``ceph
config set`` commands.

To tell the module to respond with possibly stale data, set it to ``return``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/stale_cache_strategy return

To tell the module to respond with "service unavailable", set it to ``fail``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/stale_cache_strategy fail

If you are confident that you don't require the cache, you can disable it:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/cache false

If you are using the prometheus module behind a reverse proxy or
load balancer, you can simplify discovering the active instance by switching
to ``error``-mode:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/standby_behaviour error

If set, the prometheus module will respond with an HTTP error when ``/`` is
requested from the standby instance. The default error code is 500, but you
can configure the HTTP response code with:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/standby_error_status_code 503

Valid error codes are in the range 400-599.

To switch back to the default behaviour, simply set the config key to ``default``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/standby_behaviour default

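With ``error``-mode enabled, a load balancer health check can distinguish the
active manager from a standby by status code alone. For example (the host name
below is illustrative; substitute one of your manager hosts):

.. prompt:: bash $

   curl -s -o /dev/null -w '%{http_code}\n' http://mgr-standby.example.com:9283/
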
.. _prometheus-rbd-io-statistics:

Ceph Health Checks
------------------

The mgr/prometheus module also tracks and maintains a history of Ceph health
checks, exposing them to the Prometheus server as discrete metrics. This
allows Prometheus alert rules to be configured for specific health check
events.

The metrics take the following form:

::

    # HELP ceph_health_detail healthcheck status by type (0=inactive, 1=active)
    # TYPE ceph_health_detail gauge
    ceph_health_detail{name="OSDMAP_FLAGS",severity="HEALTH_WARN"} 0.0
    ceph_health_detail{name="OSD_DOWN",severity="HEALTH_WARN"} 1.0
    ceph_health_detail{name="PG_DEGRADED",severity="HEALTH_WARN"} 1.0

The health check history is made available through the following commands:

::

    healthcheck history ls [--format {plain|json|json-pretty}]
    healthcheck history clear

The ``ls`` command provides an overview of the health checks that the cluster
has encountered since the last ``clear`` command was issued. For example:

::

    [ceph: root@c8-node1 /]# ceph healthcheck history ls
    Healthcheck Name    First Seen (UTC)     Last seen (UTC)      Count  Active
    OSDMAP_FLAGS        2021/09/16 03:17:47  2021/09/16 22:07:40      2      No
    OSD_DOWN            2021/09/17 00:11:59  2021/09/17 00:11:59      1     Yes
    PG_DEGRADED         2021/09/17 00:11:59  2021/09/17 00:11:59      1     Yes
    3 health check(s) listed

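Building on the ``ceph_health_detail`` metric above, an alerting rule for a
specific health check might be sketched like this (the alert name, ``for``
duration, and label/annotation values are illustrative):

::

    groups:
      - name: ceph-health
        rules:
          - alert: CephOSDDown
            expr: ceph_health_detail{name="OSD_DOWN"} == 1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "One or more Ceph OSDs are down"
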
RBD IO statistics
-----------------

The module can optionally collect RBD per-image IO statistics by enabling
dynamic OSD performance counters. The statistics are gathered for all images
in the pools that are specified in the ``mgr/prometheus/rbd_stats_pools``
configuration parameter. The parameter is a comma- or space-separated list
of ``pool[/namespace]`` entries. If the namespace is not specified the
statistics are collected for all namespaces in the pool.

Example to activate the RBD-enabled pools ``pool1``, ``pool2`` and ``poolN``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2,poolN"

The wildcard can be used to indicate all pools or namespaces:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/rbd_stats_pools "*"

The module builds the list of all available images by scanning the specified
pools and namespaces, and refreshes it periodically. The period is
configurable via the ``mgr/prometheus/rbd_stats_pools_refresh_interval``
parameter (in seconds) and is 300 seconds (5 minutes) by default. The module
will force an earlier refresh if it detects statistics from a previously
unknown RBD image.

Example to increase the sync interval to 10 minutes:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 600

Ceph daemon performance counters metrics
----------------------------------------

With the introduction of the ``ceph-exporter`` daemon, the prometheus module
will no longer export Ceph daemon perf counters as prometheus metrics by
default. However, you may re-enable exporting these metrics by setting the
module option ``exclude_perf_counters`` to ``false``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/exclude_perf_counters false

Statistic names and labels
==========================

The names of the stats are exactly as Ceph names them, with
the illegal characters ``.``, ``-`` and ``::`` translated to ``_``,
and ``ceph_`` prefixed to all names.

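As a quick illustration of this translation (the counter name below is just an
example), the same substitutions can be reproduced with a shell one-liner:

.. prompt:: bash $

   echo "bluestore.kv_flush_lat" | sed -e 's/::/_/g' -e 's/[.-]/_/g' -e 's/^/ceph_/'

This prints ``ceph_bluestore_kv_flush_lat``, the metric name you would query
in Prometheus.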
All *daemon* statistics have a ``ceph_daemon`` label such as "osd.123"
that identifies the type and ID of the daemon they come from. Some
statistics can come from different types of daemon, so when querying
e.g. an OSD's RocksDB stats, you would probably want to filter
on ``ceph_daemon`` starting with "osd" to avoid mixing in the monitor
RocksDB stats.

The *cluster* statistics (i.e. those global to the Ceph cluster)
have labels appropriate to what they report on. For example,
metrics relating to pools have a ``pool_id`` label.

The long running averages that represent the histograms from core Ceph
are represented by a pair of ``<name>_sum`` and ``<name>_count`` metrics.
This is similar to how histograms are represented in `Prometheus
<https://prometheus.io/docs/concepts/metric_types/#histogram>`_
and they can also be treated `similarly
<https://prometheus.io/docs/practices/histograms/>`_.

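For example, a recent average can be derived from such a ``_sum``/``_count``
pair in PromQL (the metric name here is illustrative; substitute one that your
cluster actually exports):

::

    rate(ceph_osd_op_r_latency_sum[5m]) / rate(ceph_osd_op_r_latency_count[5m])
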
Pool and OSD metadata series
----------------------------

Special series are output to enable displaying and querying on
certain metadata fields.

Pools have a ``ceph_pool_metadata`` field like this:

::

    ceph_pool_metadata{pool_id="2",name="cephfs_metadata_a"} 1.0

OSDs have a ``ceph_osd_metadata`` field like this:

::

    ceph_osd_metadata{cluster_addr="172.21.9.34:6802/19096",device_class="ssd",ceph_daemon="osd.0",public_addr="172.21.9.34:6801/19096",weight="1.0"} 1.0

Correlating drive statistics with node_exporter
-----------------------------------------------

The prometheus output from Ceph is designed to be used in conjunction
with the generic host monitoring from the Prometheus node_exporter.

To enable correlation of Ceph OSD statistics with node_exporter's
drive statistics, special series are output like this:

::

    ceph_disk_occupation_human{ceph_daemon="osd.0", device="sdd", exported_instance="myhost"}

To use this to get disk statistics by OSD ID, use either the ``and`` operator
or the ``*`` operator in your prometheus query. All metadata metrics (like
``ceph_disk_occupation_human``) have the value 1, so they act neutrally with
``*``. Using ``*`` allows the use of the ``group_left`` and ``group_right``
grouping modifiers, so that the resulting metric has additional labels from
one side of the query.

See the
`prometheus documentation`__ for more information about constructing queries.

__ https://prometheus.io/docs/prometheus/latest/querying/basics

The goal is to run a query like

::

    rate(node_disk_written_bytes_total[30s]) and
    on (device,instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}

Out of the box the above query will not return any metrics since the ``instance`` labels of
both metrics don't match. The ``instance`` label of ``ceph_disk_occupation_human``
will be the currently active MGR node.

The following two sections outline two approaches to remedy this.

.. note::

   If you need to group on the ``ceph_daemon`` label instead of the ``device``
   and ``instance`` labels, using ``ceph_disk_occupation_human`` may not work
   reliably. It is advised that you use ``ceph_disk_occupation`` instead.

   The difference is that ``ceph_disk_occupation_human`` may group several
   OSDs into the value of a single ``ceph_daemon`` label in cases where
   multiple OSDs share a disk.

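As a sketch of the ``*`` approach, a query that carries the ``ceph_daemon``
label over onto the disk write rate might look like this (it still requires
the ``instance`` labels of both metrics to line up):

::

    rate(node_disk_written_bytes_total[30s]) * on (device, instance)
      group_left (ceph_daemon) ceph_disk_occupation_human
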
Use label_replace
=================

The ``label_replace`` function (see the
`label_replace documentation <https://prometheus.io/docs/prometheus/latest/querying/functions/#label_replace>`_)
can add a label to, or alter a label of, a metric within a query.

To correlate an OSD and its disk's write rate, the following query can be used:

::

    label_replace(
        rate(node_disk_written_bytes_total[30s]),
        "exported_instance",
        "$1",
        "instance",
        "(.*):.*"
    ) and on (device, exported_instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}

Configuring Prometheus server
=============================

honor_labels
------------

To enable Ceph to output properly-labeled data relating to any host,
use the ``honor_labels`` setting when adding the ceph-mgr endpoints
to your prometheus configuration.

This allows Ceph to export the proper ``instance`` label without prometheus
overwriting it. Without this setting, Prometheus applies an ``instance`` label
that includes the hostname and port of the endpoint that the series came from.
Because Ceph clusters have multiple manager daemons, this results in an
``instance`` label that changes spuriously when the active manager daemon
changes.

If this is undesirable, a custom ``instance`` label can be set in the
Prometheus target configuration: you might wish to set it to the hostname
of your first mgr daemon, or something completely arbitrary like "ceph_cluster".

node_exporter hostname labels
-----------------------------

Set your ``instance`` labels to match what appears in Ceph's OSD metadata
in the ``instance`` field. This is generally the short hostname of the node.

This is only necessary if you want to correlate Ceph stats with host stats,
but you may find it useful to do it in all cases in case you want to do
the correlation in the future.

Example configuration
---------------------

This example shows a single node configuration running ceph-mgr and
node_exporter on a server called ``senta04``. Note that this requires one
to add an appropriate and unique ``instance`` label to each ``node_exporter`` target.

This is just an example: there are other ways to configure prometheus
scrape targets and label rewrite rules.

prometheus.yml
~~~~~~~~~~~~~~

::

    global:
        scrape_interval: 15s
        evaluation_interval: 15s

    scrape_configs:
      - job_name: 'node'
        file_sd_configs:
          - files:
            - node_targets.yml
      - job_name: 'ceph'
        honor_labels: true
        file_sd_configs:
          - files:
            - ceph_targets.yml

ceph_targets.yml
~~~~~~~~~~~~~~~~

::

    [
        {
            "targets": [ "senta04.mydomain.com:9283" ],
            "labels": {}
        }
    ]

node_targets.yml
~~~~~~~~~~~~~~~~

::

    [
        {
            "targets": [ "senta04.mydomain.com:9100" ],
            "labels": {
                "instance": "senta04"
            }
        }
    ]

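Before reloading Prometheus with a configuration like the one above, it can be
worth validating it first. This assumes ``promtool`` from the Prometheus
distribution is installed and that the configuration file is in the current
directory:

.. prompt:: bash $

   promtool check config prometheus.yml
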
Notes
=====

Counters and gauges are exported; currently histograms and long-running
averages are not. It's possible that Ceph's 2-D histograms could be
reduced to two separate 1-D histograms, and that long-running averages
could be exported as Prometheus' Summary type.

Timestamps, as with many Prometheus exporters, are established by
the server's scrape time (Prometheus expects that it is polling the
actual counter process synchronously). It is possible to supply a
timestamp along with the stat report, but the Prometheus team strongly
advises against this. This means that timestamps will be delayed by
an unpredictable amount; it's not clear if this will be problematic,
but it's worth knowing about.