Provides a Prometheus exporter to pass on Ceph performance counters
from the collection point in ceph-mgr. Ceph-mgr receives MMgrReport
messages from all MgrClient processes (mons and OSDs, for instance)
with performance counter schema data and actual counter data, and keeps
a circular buffer of the last N samples. This module creates an HTTP
endpoint (like all Prometheus exporters) and retrieves the latest sample
of every counter when polled (or "scraped" in Prometheus terminology).
The HTTP path and query parameters are ignored; all extant counters
for all reporting entities are returned in text exposition format.
(See the Prometheus `documentation <https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-details>`_.)
Enabling prometheus output
==========================

The *prometheus* module is enabled with::

  ceph mgr module enable prometheus
The Prometheus manager module needs to be restarted for configuration changes to be applied.
.. mgr_module:: prometheus
.. confval:: server_addr
.. confval:: server_port
.. confval:: scrape_interval
.. confval:: cache
.. confval:: stale_cache_strategy
.. confval:: rbd_stats_pools
.. confval:: rbd_stats_pools_refresh_interval
.. confval:: standby_behaviour
.. confval:: standby_error_status_code
By default the module will accept HTTP requests on port ``9283`` on all IPv4
and IPv6 addresses on the host. The port and listen address are both
configurable with ``ceph config set``, with keys
``mgr/prometheus/server_addr`` and ``mgr/prometheus/server_port``. This port
is registered with Prometheus's `registry
<https://github.com/prometheus/prometheus/wiki/Default-port-allocations>`_.
For example::

  ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
  ceph config set mgr mgr/prometheus/server_port 9283
The :confval:`scrape_interval` of this module should always be set to match
the Prometheus scrape interval, so that the module works properly and does
not cause any issues.

The scrape interval in the module is used for caching purposes
and to determine when a cache is stale.

A scrape interval below 10 seconds is not recommended; 15 seconds is the
recommended value, though in some cases it might be useful to increase the
scrape interval.
To set a different scrape interval in the Prometheus module, set
``scrape_interval`` to the desired value::

  ceph config set mgr mgr/prometheus/scrape_interval 20
On large clusters (>1000 OSDs), the time to fetch the metrics may become
significant. Without the cache, the Prometheus manager module could, especially
in conjunction with multiple Prometheus instances, overload the manager and lead
to unresponsive or crashing Ceph manager instances. Hence, the cache is enabled
by default. This means that there is a possibility that the cache becomes
stale. The cache is considered stale when the time to fetch the metrics from
Ceph exceeds the configured :confval:`scrape_interval`.
If that is the case, **a warning will be logged** and the module will either

* respond with a 503 HTTP status code (service unavailable), or
* return the content of the cache, even though it might be stale.

This behavior can be configured. By default, the module will return a 503 HTTP
status code (service unavailable). You can set other options using the ``ceph
config set`` command.

To tell the module to respond with possibly stale data, set it to ``return``::

  ceph config set mgr mgr/prometheus/stale_cache_strategy return

To tell the module to respond with "service unavailable", set it to ``fail``::

  ceph config set mgr mgr/prometheus/stale_cache_strategy fail
If you are confident that you don't require the cache, you can disable it::

  ceph config set mgr mgr/prometheus/cache false
If you are using the prometheus module behind some kind of reverse proxy or
load balancer, you can simplify discovering the active instance by switching
to ``error`` mode::

  ceph config set mgr mgr/prometheus/standby_behaviour error
If set, the prometheus module will respond with an HTTP error when requesting ``/``
from the standby instance. The default error code is 500, but you can configure
the HTTP response code with::

  ceph config set mgr mgr/prometheus/standby_error_status_code 503
Valid error codes are between 400 and 599.

To switch back to the default behaviour, simply set the config key to ``default``::

  ceph config set mgr mgr/prometheus/standby_behaviour default
.. _prometheus-rbd-io-statistics:
Ceph health checks
------------------

The mgr/prometheus module also tracks and maintains a history of Ceph health checks,
exposing them to the Prometheus server as discrete metrics. This allows Prometheus
alert rules to be configured for specific health check events.
The metrics take the following form::

    # HELP ceph_health_detail healthcheck status by type (0=inactive, 1=active)
    # TYPE ceph_health_detail gauge
    ceph_health_detail{name="OSDMAP_FLAGS",severity="HEALTH_WARN"} 0.0
    ceph_health_detail{name="OSD_DOWN",severity="HEALTH_WARN"} 1.0
    ceph_health_detail{name="PG_DEGRADED",severity="HEALTH_WARN"} 1.0
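As a sketch of how such an alert rule might look (the rule name, duration and
severity below are illustrative choices, not Ceph defaults), a Prometheus
alerting rule that fires while the ``OSD_DOWN`` health check is active could
be written as::

    groups:
      - name: ceph-health
        rules:
          - alert: CephOsdDown
            expr: ceph_health_detail{name="OSD_DOWN"} == 1
            for: 5m
            labels:
              severity: warning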
The health check history is made available through the following commands::

    healthcheck history ls [--format {plain|json|json-pretty}]
    healthcheck history clear
The ``ls`` command provides an overview of the health checks that the cluster
has encountered, either in total or since the last ``clear`` command was
issued. For example::

    [ceph: root@c8-node1 /]# ceph healthcheck history ls
    Healthcheck Name          First Seen (UTC)      Last seen (UTC)       Count  Active
    OSDMAP_FLAGS              2021/09/16 03:17:47   2021/09/16 22:07:40       2    No
    OSD_DOWN                  2021/09/17 00:11:59   2021/09/17 00:11:59       1   Yes
    PG_DEGRADED               2021/09/17 00:11:59   2021/09/17 00:11:59       1   Yes
    3 health check(s) listed
RBD IO statistics
-----------------

The module can optionally collect RBD per-image IO statistics by enabling
dynamic OSD performance counters. The statistics are gathered for all images
in the pools that are specified in the ``mgr/prometheus/rbd_stats_pools``
configuration parameter. The parameter is a comma- or space-separated list
of ``pool[/namespace]`` entries. If the namespace is not specified the
statistics are collected for all namespaces in the pool.
Example to activate the RBD-enabled pools ``pool1``, ``pool2`` and ``poolN``::

  ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2,poolN"
The module builds the list of all available images by scanning the specified
pools and namespaces, and refreshes it periodically. The period is
configurable via the ``mgr/prometheus/rbd_stats_pools_refresh_interval``
parameter (in seconds) and is 300 seconds (5 minutes) by default. The module
will force a refresh earlier if it detects statistics from a previously unknown
RBD image.
Example to turn up the sync interval to 10 minutes::

  ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 600
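As an illustration, the per-image counters gathered this way can be queried
like any other series. Assuming a counter named ``ceph_rbd_write_bytes``
labeled by ``pool``, ``namespace`` and ``image`` (the image name below is
hypothetical), the recent write throughput of one image could be obtained
with::

    rate(ceph_rbd_write_bytes{pool="pool1", image="image1"}[1m])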
Statistic names and labels
==========================

The names of the stats are exactly as Ceph names them, with
illegal characters ``.``, ``-`` and ``::`` translated to ``_``,
and ``ceph_`` prefixed to all names.
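For example, a BlueStore daemon counter named ``bluestore.onode_hits`` (the
counter name here is only an illustration) would be exported under a name
like::

    ceph_bluestore_onode_hits

with the ``.`` translated to ``_`` and the ``ceph_`` prefix added.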
All *daemon* statistics have a ``ceph_daemon`` label such as "osd.123"
that identifies the type and ID of the daemon they come from. Some
statistics can come from different types of daemon, so when querying
e.g. an OSD's RocksDB stats, you would probably want to filter
on ceph_daemon starting with "osd" to avoid mixing in the monitor
RocksDB stats.
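As a sketch, such a filter can be expressed in PromQL with a regex match on
the ``ceph_daemon`` label (the metric name below is illustrative)::

    ceph_rocksdb_submit_latency_sum{ceph_daemon=~"osd.*"}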
The *cluster* statistics (i.e. those global to the Ceph cluster)
have labels appropriate to what they report on. For example,
metrics relating to pools have a ``pool_id`` label.
The long-running averages that represent the histograms from core Ceph
are represented by a pair of ``<name>_sum`` and ``<name>_count`` metrics.
This is similar to how histograms are represented in `Prometheus <https://prometheus.io/docs/concepts/metric_types/#histogram>`_
and they can also be treated `similarly <https://prometheus.io/docs/practices/histograms/>`_.
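For example, a recent average can be derived from such a pair by dividing the
rates of the two series. Assuming an OSD read-latency pair named
``ceph_osd_op_r_latency_sum`` / ``ceph_osd_op_r_latency_count``, this would
look like::

    rate(ceph_osd_op_r_latency_sum[5m]) / rate(ceph_osd_op_r_latency_count[5m])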
Pool and OSD metadata series
----------------------------

Special series are output to enable displaying and querying on
certain metadata fields.

Pools have a ``ceph_pool_metadata`` field like this::

    ceph_pool_metadata{pool_id="2",name="cephfs_metadata_a"} 1.0
OSDs have a ``ceph_osd_metadata`` field like this::

    ceph_osd_metadata{cluster_addr="172.21.9.34:6802/19096",device_class="ssd",ceph_daemon="osd.0",public_addr="172.21.9.34:6801/19096",weight="1.0"} 1.0
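Because these metadata series always have the value 1, they can be multiplied
onto other series to attach their labels. For example (``ceph_pool_objects``
is used here only for illustration), the pool name can be joined onto a
per-pool metric like this::

    ceph_pool_objects * on (pool_id) group_left(name) ceph_pool_metadata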
Correlating drive statistics with node_exporter
-----------------------------------------------

The prometheus output from Ceph is designed to be used in conjunction
with the generic host monitoring from the Prometheus node_exporter.

To enable correlation of Ceph OSD statistics with node_exporter's
drive statistics, special series are output like this::

    ceph_disk_occupation_human{ceph_daemon="osd.0", device="sdd", exported_instance="myhost"}
To use this to get disk statistics by OSD ID, use either the ``and`` operator or
the ``*`` operator in your prometheus query. All metadata metrics (like
``ceph_disk_occupation_human``) have the value 1, so they act neutrally with
``*``. Using ``*`` allows the use of the ``group_left`` and ``group_right``
grouping modifiers, so that the resulting metric has additional labels from
one side of the query.
See the `prometheus documentation`__ for more information about constructing queries.

__ https://prometheus.io/docs/prometheus/latest/querying/basics
The goal is to run a query like::

    rate(node_disk_bytes_written[30s]) and
    on (device,instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}
Out of the box the above query will not return any metrics since the ``instance`` labels of
both metrics don't match. The ``instance`` label of ``ceph_disk_occupation_human``
will be the currently active MGR node.

The following two sections outline two approaches to remedy this.
.. note::

    If you need to group on the ``ceph_daemon`` label instead of the ``device``
    and ``instance`` labels, using ``ceph_disk_occupation_human`` may not work
    reliably. It is advised that you use ``ceph_disk_occupation`` instead.

    The difference is that ``ceph_disk_occupation_human`` may group several
    OSDs into the value of a single ``ceph_daemon`` label in cases where
    multiple OSDs share a disk.
The ``label_replace`` function (see the
`label_replace documentation <https://prometheus.io/docs/prometheus/latest/querying/functions/#label_replace>`_)
can add a label to, or alter a label of, a metric within a query.
To correlate an OSD and its disk's write rate, the following query can be used::

    label_replace(
        rate(node_disk_bytes_written[30s]),
        "exported_instance",
        "$1",
        "instance",
        "(.*):.*"
    ) and on (device, exported_instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}
Configuring Prometheus server
=============================
honor_labels
------------

To enable Ceph to output properly-labeled data relating to any host,
use the ``honor_labels`` setting when adding the ceph-mgr endpoints
to your Prometheus configuration.
This allows Ceph to export the proper ``instance`` label without Prometheus
overwriting it. Without this setting, Prometheus applies an ``instance`` label
that includes the hostname and port of the endpoint that the series came from.
Because Ceph clusters have multiple manager daemons, this results in an
``instance`` label that changes spuriously when the active manager daemon
changes.

If this is undesirable, a custom ``instance`` label can be set in the
Prometheus target configuration: you might wish to set it to the hostname
of your first mgr daemon, or something completely arbitrary like "ceph_cluster".
node_exporter hostname labels
-----------------------------

Set your ``instance`` labels to match what appears in Ceph's OSD metadata
in the ``instance`` field. This is generally the short hostname of the node.

This is only necessary if you want to correlate Ceph stats with host stats,
but you may find it useful to do so in all cases, in case you want to do
the correlation in the future.
Example configuration
---------------------

This example shows a single node configuration running ceph-mgr and
node_exporter on a server called ``senta04``. Note that this requires one
to add an appropriate and unique ``instance`` label to each ``node_exporter`` target.

This is just an example: there are other ways to configure prometheus
scrape targets and label rewrite rules.

prometheus.yml
~~~~~~~~~~~~~~

::

    global:
      scrape_interval:     15s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: 'node'
        file_sd_configs:
          - files:
            - node_targets.yml
      - job_name: 'ceph'
        honor_labels: true
        file_sd_configs:
          - files:
            - ceph_targets.yml

ceph_targets.yml
~~~~~~~~~~~~~~~~

::

    [
        {
            "targets": [ "senta04.mydomain.com:9283" ],
            "labels": {}
        }
    ]

node_targets.yml
~~~~~~~~~~~~~~~~

::

    [
        {
            "targets": [ "senta04.mydomain.com:9100" ],
            "labels": {
                "instance": "senta04"
            }
        }
    ]
Notes
=====

Counters and gauges are exported; currently histograms and long-running
averages are not. It's possible that Ceph's 2-D histograms could be
reduced to two separate 1-D histograms, and that long-running averages
could be exported as Prometheus' Summary type.
Timestamps, as with many Prometheus exporters, are established by
the server's scrape time (Prometheus expects that it is polling the
actual counter process synchronously). It is possible to supply a
timestamp along with the stat report, but the Prometheus team strongly
advises against this. This means that timestamps will be delayed by
an unpredictable amount; it's not clear if this will be problematic,
but it's worth knowing about.