.. _mgr-cephadm-monitoring:

Monitoring Services
===================

Ceph Dashboard uses `Prometheus <https://prometheus.io/>`_, `Grafana
<https://grafana.com/>`_, and related tools to store and visualize detailed
metrics on cluster utilization and performance. Ceph users have three options:

#. Have cephadm deploy and configure these services. This is the default
   when bootstrapping a new cluster unless the ``--skip-monitoring-stack``
   option is used.
#. Deploy and configure these services manually. This is recommended for users
   with existing prometheus services in their environment (and in cases where
   Ceph is running in Kubernetes with Rook).
#. Skip the monitoring stack completely. Some Ceph dashboard graphs will
   not be available.

The monitoring stack consists of `Prometheus <https://prometheus.io/>`_,
Prometheus exporters (:ref:`mgr-prometheus`, `Node exporter
<https://prometheus.io/docs/guides/node-exporter/>`_), `Prometheus Alert
Manager <https://prometheus.io/docs/alerting/alertmanager/>`_ and `Grafana
<https://grafana.com/>`_.

.. note::

   Prometheus' security model presumes that untrusted users have access to the
   Prometheus HTTP endpoint and logs. Untrusted users have access to all the
   (meta)data Prometheus collects that is contained in the database, plus a
   variety of operational and debugging information.

   However, Prometheus' HTTP API is limited to read-only operations.
   Configurations can *not* be changed using the API and secrets are not
   exposed. Moreover, Prometheus has some built-in measures to mitigate the
   impact of denial-of-service attacks.

   Please see `Prometheus' security model
   <https://prometheus.io/docs/operating/security/>`_ for more detailed
   information.

Deploying monitoring with cephadm
---------------------------------

The default behavior of ``cephadm`` is to deploy a basic monitoring stack.
However, you may have a Ceph cluster without a monitoring stack and want to
add one, for example because you passed the ``--skip-monitoring-stack``
option to ``cephadm`` during the installation of the cluster, or because you
converted an existing cluster (which had no monitoring stack) to cephadm
management.

To set up monitoring on a Ceph cluster that has no monitoring, follow the
steps below:

#. Deploy a node-exporter service on every node of the cluster. The
   node-exporter provides host-level metrics like CPU and memory utilization:

   .. prompt:: bash #

      ceph orch apply node-exporter

#. Deploy alertmanager:

   .. prompt:: bash #

      ceph orch apply alertmanager

#. Deploy Prometheus. A single Prometheus instance is sufficient, but
   for high availability (HA) you might want to deploy two:

   .. prompt:: bash #

      ceph orch apply prometheus

   or

   .. prompt:: bash #

      ceph orch apply prometheus --placement 'count:2'

#. Deploy Grafana:

   .. prompt:: bash #

      ceph orch apply grafana

.. _cephadm-monitoring-centralized-logs:

Centralized Logging in Ceph
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ceph provides centralized logging with Loki and Promtail. Centralized Log
Management (CLM) consolidates all log data and pushes it to a central
repository with an accessible and easy-to-use interface. Centralized logging
is designed to make your life easier. Some of the advantages are:

#. **Linear event timeline**: it is easier to troubleshoot issues by analyzing a single chain of events than thousands of separate logs from a hundred nodes.
#. **Real-time live log monitoring**: it is impractical to follow logs from thousands of different sources.
#. **Flexible retention policies**: with per-daemon logs, log rotation is usually set to a short interval (1-2 weeks) to save disk usage.
#. **Increased security & backup**: logs can contain sensitive information and expose usage patterns. Additionally, centralized logging allows for HA, etc.

Centralized Logging in Ceph is implemented using two new services: ``loki`` and ``promtail``.

Loki: a log aggregation system used to query logs. It can be configured as a
datasource in Grafana.

Promtail: an agent that gathers logs from the system and makes them available
to Loki.

These two services are not deployed by default in a Ceph cluster. To enable
centralized logging, follow the steps in :ref:`centralized-logging`.

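As a sketch only (the linked section is the authoritative procedure), both
services are regular orchestrator services and can be described with service
specifications; the placements below are illustrative assumptions, not
defaults:

.. code-block:: yaml

   service_type: loki
   placement:
     count: 1
   ---
   service_type: promtail
   placement:
     host_pattern: '*'

Applying such a file with ``ceph orch apply -i <file>`` would deploy a single
Loki instance and a Promtail agent on every host.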
.. _cephadm-monitoring-networks-ports:

Networks and Ports
~~~~~~~~~~~~~~~~~~

All monitoring services can have the network and port they bind to configured
with a yaml service specification.

An example spec file:

.. code-block:: yaml

   service_type: grafana
   service_name: grafana
   placement:
     count: 1
   networks:
   - 192.169.142.0/24
   spec:
     port: 4200

Using custom images
~~~~~~~~~~~~~~~~~~~

It is possible to install or upgrade monitoring components based on other
images. To do so, the name of the image to be used needs to be stored in the
configuration first. The following configuration options are available:

- ``container_image_prometheus``
- ``container_image_grafana``
- ``container_image_alertmanager``
- ``container_image_node_exporter``

Custom images can be set with the ``ceph config`` command:

.. code-block:: bash

   ceph config set mgr mgr/cephadm/<option_name> <value>

For example:

.. code-block:: bash

   ceph config set mgr mgr/cephadm/container_image_prometheus prom/prometheus:v1.4.1

If there were already running monitoring stack daemon(s) of the type whose
image you've changed, you must redeploy the daemon(s) in order to have them
actually use the new image.

For example, if you had changed the Prometheus image:

.. prompt:: bash #

   ceph orch redeploy prometheus


.. note::

   By setting a custom image, the default value will be overridden (but not
   overwritten). The default value changes when updates become available.
   By setting a custom image, you will not be able to update the component
   you have set the custom image for automatically. You will need to
   manually update the configuration (image name and tag) to be able to
   install updates.

   If you choose to go with the recommendations instead, you can reset the
   custom image you have set before. After that, the default value will be
   used again. Use ``ceph config rm`` to reset the configuration option:

   .. code-block:: bash

      ceph config rm mgr mgr/cephadm/<option_name>

   For example:

   .. code-block:: bash

      ceph config rm mgr mgr/cephadm/container_image_prometheus

.. _cephadm-overwrite-jinja2-templates:

Using custom configuration files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By overriding cephadm templates, it is possible to completely customize the
configuration files for monitoring services.

Internally, cephadm already uses `Jinja2
<https://jinja.palletsprojects.com/en/2.11.x/>`_ templates to generate the
configuration files for all monitoring components. To customize the
configuration of Prometheus, Grafana, or the Alertmanager, you can store a
Jinja2 template for each service, which will be used for configuration
generation instead. This template will be evaluated every time a service of
that kind is deployed or reconfigured. That way, the custom configuration is
preserved and automatically applied on future deployments of these services.

.. note::

   The configuration of the custom template is also preserved when the default
   configuration of cephadm changes. If the updated configuration is to be used,
   the custom template needs to be migrated *manually* after each upgrade of Ceph.

Option names
""""""""""""

The following templates for files that will be generated by cephadm can be
overridden. These are the names to be used when storing with ``ceph config-key
set``:

- ``services/alertmanager/alertmanager.yml``
- ``services/grafana/ceph-dashboard.yml``
- ``services/grafana/grafana.ini``
- ``services/prometheus/prometheus.yml``
- ``services/prometheus/alerting/custom_alerts.yml``

You can look up the file templates that are currently used by cephadm in
``src/pybind/mgr/cephadm/templates``:

- ``services/alertmanager/alertmanager.yml.j2``
- ``services/grafana/ceph-dashboard.yml.j2``
- ``services/grafana/grafana.ini.j2``
- ``services/prometheus/prometheus.yml.j2``

Usage
"""""

The following command applies a single line value:

.. code-block:: bash

   ceph config-key set mgr/cephadm/<option_name> <value>

To set the contents of a file as the template, use the ``-i`` argument:

.. code-block:: bash

   ceph config-key set mgr/cephadm/<option_name> -i $PWD/<filename>

.. note::

   When using files as input to ``config-key``, an absolute path to the file
   must be used.


The configuration file for the service then needs to be recreated.
This is done using ``reconfig``. For more details see the following example.

Example
"""""""

.. code-block:: bash

   # set the contents of ./prometheus.yml.j2 as template
   ceph config-key set mgr/cephadm/services/prometheus/prometheus.yml \
     -i $PWD/prometheus.yml.j2

   # reconfig the prometheus service
   ceph orch reconfig prometheus

.. code-block:: bash

   # set additional custom alerting rules for Prometheus
   ceph config-key set mgr/cephadm/services/prometheus/alerting/custom_alerts.yml \
     -i $PWD/custom_alerts.yml

   # Note that custom alerting rules are not parsed by Jinja and hence escaping
   # will not be an issue.

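For illustration, a ``custom_alerts.yml`` file contains standard Prometheus
alerting rules. The rule below is a hypothetical example, not a default
shipped with Ceph; ``ceph_health_status`` is a metric exported by the
:ref:`mgr-prometheus` module (0 means ``HEALTH_OK``):

.. code-block:: yaml

   groups:
     - name: custom
       rules:
         - alert: CephHealthNotOk
           # fires when the cluster has been in WARN or ERR for 5 minutes
           expr: ceph_health_status >= 1
           for: 5m
           labels:
             severity: warning
           annotations:
             summary: "Ceph cluster health is not OK"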
Deploying monitoring without cephadm
------------------------------------

If you have an existing Prometheus monitoring infrastructure, or would like
to manage it yourself, you need to configure it to integrate with your Ceph
cluster.

* Enable the prometheus module in the ceph-mgr daemon:

  .. code-block:: bash

     ceph mgr module enable prometheus

  By default, ceph-mgr presents Prometheus metrics on port 9283 on each host
  running a ceph-mgr daemon. Configure Prometheus to scrape these endpoints.

* To enable the dashboard's Prometheus-based alerting, see :ref:`dashboard-alerting`.

* To enable dashboard integration with Grafana, see :ref:`dashboard-grafana`.

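A minimal scrape configuration for the ceph-mgr endpoints might look like the
following; the hostnames are placeholders, and ``honor_labels`` preserves the
instance labels set by Ceph rather than overwriting them with the scrape
target:

.. code-block:: yaml

   scrape_configs:
     - job_name: 'ceph'
       honor_labels: true
       static_configs:
         - targets:
             - 'mgr-host-1.example.com:9283'
             - 'mgr-host-2.example.com:9283'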
Disabling monitoring
--------------------

To disable monitoring and remove the software that supports it, run the
following commands:

.. code-block:: console

   $ ceph orch rm grafana
   $ ceph orch rm prometheus --force   # this will delete metrics data collected so far
   $ ceph orch rm node-exporter
   $ ceph orch rm alertmanager
   $ ceph mgr module disable prometheus

See also :ref:`orch-rm`.

Setting up RBD-Image monitoring
-------------------------------

For performance reasons, monitoring of RBD images is disabled by default. For
more information, please see :ref:`prometheus-rbd-io-statistics`. If disabled,
the overview and details dashboards will stay empty in Grafana and the metrics
will not be visible in Prometheus.

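RBD image monitoring is enabled by telling the prometheus module which pools
to collect per-image IO statistics for, via the
``mgr/prometheus/rbd_stats_pools`` option; the pool names here are
placeholders (see :ref:`prometheus-rbd-io-statistics` for the full syntax):

.. prompt:: bash #

   ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2"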
Setting up Grafana
------------------

Manually setting the Grafana URL
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Cephadm automatically configures Prometheus, Grafana, and Alertmanager in
all cases except one.

In some setups, the Dashboard user's browser might not be able to access the
Grafana URL that is configured in Ceph Dashboard. This can happen when the
cluster and the accessing user are in different DNS zones.

If this is the case, you can use a configuration option for Ceph Dashboard
to set the URL that the user's browser will use to access Grafana. This
value will never be altered by cephadm. To set this configuration option,
issue the following command:

.. prompt:: bash $

   ceph dashboard set-grafana-frontend-api-url <grafana-server-api>

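Assuming your Ceph release provides the matching ``get`` command (Dashboard
settings generally come in get/set pairs), you can inspect the currently
configured value:

.. prompt:: bash $

   ceph dashboard get-grafana-frontend-api-url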
It might take a minute or two for services to be deployed. After the
services have been deployed, you should see something like this when you
issue the command ``ceph orch ls``:

.. code-block:: console

   $ ceph orch ls
   NAME           RUNNING  REFRESHED  IMAGE NAME                                      IMAGE ID      SPEC
   alertmanager   1/1      6s ago     docker.io/prom/alertmanager:latest              0881eb8f169f  present
   crash          2/2      6s ago     docker.io/ceph/daemon-base:latest-master-devel  mix           present
   grafana        1/1      0s ago     docker.io/pcuzner/ceph-grafana-el8:latest       f77afcf0bcf6  absent
   node-exporter  2/2      6s ago     docker.io/prom/node-exporter:latest             e5a616e4b9cf  present
   prometheus     1/1      6s ago     docker.io/prom/prometheus:latest                e935122ab143  present

Configuring SSL/TLS for Grafana
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``cephadm`` deploys Grafana using the certificate defined in the ceph
key/value store. If no certificate is specified, ``cephadm`` generates a
self-signed certificate during the deployment of the Grafana service.

A custom certificate can be configured using the following commands:

.. prompt:: bash #

   ceph config-key set mgr/cephadm/grafana_key -i $PWD/key.pem
   ceph config-key set mgr/cephadm/grafana_crt -i $PWD/certificate.pem

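If you only need a certificate for testing, a self-signed key/certificate
pair can be generated with ``openssl``; the subject CN below is a placeholder
for your Grafana host:

.. prompt:: bash #

   openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
       -keyout key.pem -out certificate.pem \
       -subj "/CN=grafana.example.com"

Browsers will warn about self-signed certificates; for production, use a
certificate issued by a CA that your clients trust.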
If you have already deployed Grafana, run ``reconfig`` on the service to
update its configuration:

.. prompt:: bash #

   ceph orch reconfig grafana

The ``reconfig`` command also sets the proper URL for Ceph Dashboard.

Setting the initial admin password
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, Grafana will not create an initial admin user. In order to
create the admin user, please create a file ``grafana.yaml`` with this
content:

.. code-block:: yaml

   service_type: grafana
   spec:
     initial_admin_password: mypassword

Then apply this specification:

.. code-block:: bash

   ceph orch apply -i grafana.yaml
   ceph orch redeploy grafana

Grafana will now create an admin user called ``admin`` with the
given password.


Setting up Alertmanager
-----------------------

Adding Alertmanager webhooks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To add new webhooks to the Alertmanager configuration, add additional
webhook URLs like so:

.. code-block:: yaml

   service_type: alertmanager
   spec:
     user_data:
       default_webhook_urls:
         - "https://foo"
         - "https://bar"

Here ``default_webhook_urls`` is a list of additional URLs that are
added to the default receivers' ``<webhook_configs>`` configuration.

Run ``reconfig`` on the service to update its configuration:

.. prompt:: bash #

   ceph orch reconfig alertmanager

Turn on Certificate Validation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you are using certificates for Alertmanager and want to make sure
these certificates are verified, you should set the ``secure`` option to
``true`` in your Alertmanager spec (this defaults to ``false``):

.. code-block:: yaml

   service_type: alertmanager
   spec:
     secure: true

If you already had Alertmanager daemons running before applying the spec,
you must reconfigure them to update their configuration:

.. prompt:: bash #

   ceph orch reconfig alertmanager

446 Further Reading
447 ---------------
448
449 * :ref:`mgr-prometheus`