.. _mgr-prometheus:

=================
Prometheus Module
=================

Provides a Prometheus exporter to pass on Ceph performance counters
from the collection point in ceph-mgr. Ceph-mgr receives MMgrReport
messages from all MgrClient processes (mons and OSDs, for instance)
with performance counter schema data and actual counter data, and keeps
a circular buffer of the last N samples. This module creates an HTTP
endpoint (like all Prometheus exporters) and retrieves the latest sample
of every counter when polled (or "scraped" in Prometheus terminology).
The HTTP path and query parameters are ignored; all extant counters
for all reporting entities are returned in text exposition format.
(See the Prometheus `documentation <https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-details>`_.)

Enabling prometheus output
==========================

The *prometheus* module is enabled with::

    ceph mgr module enable prometheus

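Once enabled, you can confirm that the module is up and find the endpoint it
serves with ``ceph mgr services``, which lists the URIs published by the
active manager; the output will look something like this (the hostname will
vary)::

    ceph mgr services
    {
        "prometheus": "http://myhost:9283/"
    }
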
Configuration
-------------

.. note::

    The Prometheus manager module needs to be restarted for configuration
    changes to be applied.

By default the module will accept HTTP requests on port ``9283`` on all IPv4
and IPv6 addresses on the host. The port and listen address are both
configurable with ``ceph config set``, with keys
``mgr/prometheus/server_addr`` and ``mgr/prometheus/server_port``. This port
is registered with Prometheus's `registry
<https://github.com/prometheus/prometheus/wiki/Default-port-allocations>`_.

::

    ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
    ceph config set mgr mgr/prometheus/server_port 9283

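To quickly verify that the exporter is reachable on the configured address,
you can fetch the metrics endpoint directly; the hostname and port below are
the defaults, so adjust them to your deployment::

    curl http://localhost:9283/metrics | head
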
.. warning::

    The ``scrape_interval`` of this module should always be set to match
    the Prometheus scrape interval to work properly and not cause any issues.

The Prometheus manager module is, by default, configured with a scrape interval
of 15 seconds. The scrape interval in the module is used for caching purposes
and to determine when a cache is stale.

It is not recommended to use a scrape interval below 10 seconds. It is
recommended to use 15 seconds as the scrape interval, though in some cases it
might be useful to increase it.

To set a different scrape interval in the Prometheus module, set
``scrape_interval`` to the desired value::

    ceph config set mgr mgr/prometheus/scrape_interval 20

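The matching interval on the Prometheus side is configured per scrape job. A
minimal fragment, assuming a manager reachable at the placeholder address
``ceph-mgr-host:9283``, might look like this::

    scrape_configs:
      - job_name: 'ceph'
        scrape_interval: 20s
        honor_labels: true
        static_configs:
          - targets: ['ceph-mgr-host:9283']
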
On large clusters (>1000 OSDs), the time to fetch the metrics may become
significant. Without the cache, the Prometheus manager module could, especially
in conjunction with multiple Prometheus instances, overload the manager and lead
to unresponsive or crashing Ceph manager instances. Hence, the cache is enabled
by default. This means that there is a possibility that the cache becomes
stale. The cache is considered stale when the time to fetch the metrics from
Ceph exceeds the configured ``mgr/prometheus/scrape_interval``.

If that is the case, **a warning will be logged** and the module will either

* respond with a 503 HTTP status code (service unavailable), or
* return the content of the cache, even though it might be stale.

This behavior can be configured. By default, it will return a 503 HTTP status
code (service unavailable). You can set other options using the ``ceph config
set`` commands.

To tell the module to respond with possibly stale data, set it to ``return``::

    ceph config set mgr mgr/prometheus/stale_cache_strategy return

To tell the module to respond with "service unavailable", set it to ``fail``::

    ceph config set mgr mgr/prometheus/stale_cache_strategy fail

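With the default ``fail`` strategy, a stale cache surfaces as an HTTP 503 on
scrape; you can observe the status code directly (hostname and port are again
the defaults)::

    curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9283/metrics
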
If you are confident that you don't require the cache, you can disable it::

    ceph config set mgr mgr/prometheus/cache false

.. _prometheus-rbd-io-statistics:

RBD IO statistics
-----------------

The module can optionally collect RBD per-image IO statistics by enabling
dynamic OSD performance counters. The statistics are gathered for all images
in the pools that are specified in the ``mgr/prometheus/rbd_stats_pools``
configuration parameter. The parameter is a comma or space separated list
of ``pool[/namespace]`` entries. If the namespace is not specified the
statistics are collected for all namespaces in the pool.

Example to activate the RBD-enabled pools ``pool1``, ``pool2`` and ``poolN``::

    ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2,poolN"

The module builds the list of all available images by scanning the specified
pools and namespaces, and refreshes it periodically. The period is
configurable via the ``mgr/prometheus/rbd_stats_pools_refresh_interval``
parameter (in seconds) and is 300 seconds (5 minutes) by default. The module
will force an earlier refresh if it detects statistics from a previously
unknown RBD image.

Example to increase the refresh interval to 10 minutes::

    ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 600

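Once enabled, the per-image statistics are exported as series labelled by
pool, namespace and image. For example, the ten images with the highest write
throughput could be charted with the query below; the metric name
``ceph_rbd_write_bytes`` matches what current releases export, but verify it
against your endpoint::

    topk(10, rate(ceph_rbd_write_bytes[1m]))
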
Statistic names and labels
==========================

The names of the stats are exactly as Ceph names them, with
illegal characters ``.``, ``-`` and ``::`` translated to ``_``,
and ``ceph_`` prefixed to all names.

All *daemon* statistics have a ``ceph_daemon`` label such as "osd.123"
that identifies the type and ID of the daemon they come from. Some
statistics can come from different types of daemon, so when querying
e.g. an OSD's RocksDB stats, you would probably want to filter
on ``ceph_daemon`` starting with "osd" to avoid mixing in the monitor
RocksDB stats.

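For example, a regex matcher on ``ceph_daemon`` restricts a query to OSDs.
The metric name below is an assumption based on the naming scheme described
above, so check your endpoint for the exact name::

    ceph_rocksdb_submit_latency_sum{ceph_daemon=~"osd.*"}
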
The *cluster* statistics (i.e. those global to the Ceph cluster)
have labels appropriate to what they report on. For example,
metrics relating to pools have a ``pool_id`` label.

The long running averages that represent the histograms from core Ceph
are represented by a pair of ``<name>_sum`` and ``<name>_count`` metrics.
This is similar to how histograms are represented in `Prometheus <https://prometheus.io/docs/concepts/metric_types/#histogram>`_
and they can also be treated `similarly <https://prometheus.io/docs/practices/histograms/>`_.

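A recent average can be derived by dividing the rates of such a pair. A
sketch using the OSD read latency counters, which recent releases export
under these names::

    rate(ceph_osd_op_r_latency_sum[5m]) / rate(ceph_osd_op_r_latency_count[5m])
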
Pool and OSD metadata series
----------------------------

Special series are output to enable displaying and querying on
certain metadata fields.

Pools have a ``ceph_pool_metadata`` field like this:

::

    ceph_pool_metadata{pool_id="2",name="cephfs_metadata_a"} 1.0

OSDs have a ``ceph_osd_metadata`` field like this:

::

    ceph_osd_metadata{cluster_addr="172.21.9.34:6802/19096",device_class="ssd",ceph_daemon="osd.0",public_addr="172.21.9.34:6801/19096",weight="1.0"} 1.0

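Because these metadata series always have the value 1, they can be multiplied
onto numeric series to attach extra labels without changing the values. For
example, to label per-pool statistics with pool names (``ceph_pool_objects``
is one of the per-pool series current releases export)::

    ceph_pool_objects * on (pool_id) group_left(name) ceph_pool_metadata
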
Correlating drive statistics with node_exporter
-----------------------------------------------

The prometheus output from Ceph is designed to be used in conjunction
with the generic host monitoring from the Prometheus node_exporter.

To enable correlation of Ceph OSD statistics with node_exporter's
drive statistics, special series are output like this:

::

    ceph_disk_occupation{ceph_daemon="osd.0",device="sdd", exported_instance="myhost"}

To use this to get disk statistics by OSD ID, use either the ``and`` operator or
the ``*`` operator in your prometheus query. All metadata metrics (like
``ceph_disk_occupation``) have the value 1 so they act neutral with ``*``. Using ``*``
allows the use of ``group_left`` and ``group_right`` grouping modifiers, so that
the resulting metric has additional labels from one side of the query.

See the
`prometheus documentation`__ for more information about constructing queries.

__ https://prometheus.io/docs/prometheus/latest/querying/basics

The goal is to run a query like

::

    rate(node_disk_bytes_written[30s]) and on (device,instance) ceph_disk_occupation{ceph_daemon="osd.0"}

Out of the box the above query will not return any metrics since the ``instance`` labels of
both metrics don't match. The ``instance`` label of ``ceph_disk_occupation``
will be the currently active MGR node.

The following two sections outline two approaches to remedy this.

Use label_replace
=================

The ``label_replace`` function (see the
`label_replace documentation <https://prometheus.io/docs/prometheus/latest/querying/functions/#label_replace>`_)
can add a label to, or alter a label of, a metric within a query.

To correlate an OSD and its disk's write rate, the following query can be used:

::

    label_replace(rate(node_disk_bytes_written[30s]), "exported_instance", "$1", "instance", "(.*):.*") and on (device,exported_instance) ceph_disk_occupation{ceph_daemon="osd.0"}

Configuring Prometheus server
=============================

honor_labels
------------

To enable Ceph to output properly-labeled data relating to any host,
use the ``honor_labels`` setting when adding the ceph-mgr endpoints
to your prometheus configuration.

This allows Ceph to export the proper ``instance`` label without Prometheus
overwriting it. Without this setting, Prometheus applies an ``instance`` label
that includes the hostname and port of the endpoint that the series came from.
Because Ceph clusters have multiple manager daemons, this results in an
``instance`` label that changes spuriously when the active manager daemon
changes.

If this is undesirable a custom ``instance`` label can be set in the
Prometheus target configuration: you might wish to set it to the hostname
of your first mgr daemon, or something completely arbitrary like "ceph_cluster".

node_exporter hostname labels
-----------------------------

Set your ``instance`` labels to match what appears in Ceph's OSD metadata
in the ``instance`` field. This is generally the short hostname of the node.

This is only necessary if you want to correlate Ceph stats with host stats,
but you may find it useful to do in all cases, in case you want to perform
the correlation in the future.

Example configuration
---------------------

This example shows a single node configuration running ceph-mgr and
node_exporter on a server called ``senta04``. Note that this requires one
to add an appropriate and unique ``instance`` label to each ``node_exporter`` target.

This is just an example: there are other ways to configure prometheus
scrape targets and label rewrite rules.

prometheus.yml
~~~~~~~~~~~~~~

::

    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: 'node'
        file_sd_configs:
          - files:
            - node_targets.yml
      - job_name: 'ceph'
        honor_labels: true
        file_sd_configs:
          - files:
            - ceph_targets.yml

ceph_targets.yml
~~~~~~~~~~~~~~~~

::

    [
        {
            "targets": [ "senta04.mydomain.com:9283" ],
            "labels": {}
        }
    ]

node_targets.yml
~~~~~~~~~~~~~~~~

::

    [
        {
            "targets": [ "senta04.mydomain.com:9100" ],
            "labels": {
                "instance": "senta04"
            }
        }
    ]

Notes
=====

Counters and gauges are exported; currently histograms and long-running
averages are not. It's possible that Ceph's 2-D histograms could be
reduced to two separate 1-D histograms, and that long-running averages
could be exported as Prometheus' Summary type.

Timestamps, as with many Prometheus exporters, are established by
the server's scrape time (Prometheus expects that it is polling the
actual counter process synchronously). It is possible to supply a
timestamp along with the stat report, but the Prometheus team strongly
advises against this. This means that timestamps will be delayed by
an unpredictable amount; it's not clear if this will be problematic,
but it's worth knowing about.