.. _mgr-cephadm-monitoring:

Monitoring Services
===================

Ceph Dashboard uses `Prometheus <https://prometheus.io/>`_, `Grafana
<https://grafana.com/>`_, and related tools to store and visualize detailed
metrics on cluster utilization and performance. Ceph users have three options:

#. Have cephadm deploy and configure these services. This is the default
   when bootstrapping a new cluster unless the ``--skip-monitoring-stack``
   option is used.
#. Deploy and configure these services manually. This is recommended for users
   with existing prometheus services in their environment (and in cases where
   Ceph is running in Kubernetes with Rook).
#. Skip the monitoring stack completely. Some Ceph dashboard graphs will
   not be available.

The monitoring stack consists of `Prometheus <https://prometheus.io/>`_,
Prometheus exporters (:ref:`mgr-prometheus`, `Node exporter
<https://prometheus.io/docs/guides/node-exporter/>`_), `Prometheus Alert
Manager <https://prometheus.io/docs/alerting/alertmanager/>`_ and `Grafana
<https://grafana.com/>`_.

.. note::

   Prometheus' security model presumes that untrusted users have access to the
   Prometheus HTTP endpoint and logs. Untrusted users have access to all the
   (meta)data Prometheus collects that is contained in the database, plus a
   variety of operational and debugging information.

   However, Prometheus' HTTP API is limited to read-only operations.
   Configurations can *not* be changed using the API and secrets are not
   exposed. Moreover, Prometheus has some built-in measures to mitigate the
   impact of denial-of-service attacks.

   Please see `Prometheus' security model
   <https://prometheus.io/docs/operating/security/>`_ for more detailed
   information.

Deploying monitoring with cephadm
---------------------------------

The default behavior of ``cephadm`` is to deploy a basic monitoring stack.
However, you may have a Ceph cluster without a monitoring stack and want to
add one, for example because you passed the ``--skip-monitoring-stack``
option to ``cephadm`` during the installation of the cluster, or because you
converted an existing cluster (which had no monitoring stack) to cephadm
management.

To set up monitoring on a Ceph cluster that has no monitoring, follow the
steps below:

#. Deploy a node-exporter service on every node of the cluster. The
   node-exporter provides host-level metrics like CPU and memory utilization:

   .. prompt:: bash #

      ceph orch apply node-exporter

#. Deploy alertmanager:

   .. prompt:: bash #

      ceph orch apply alertmanager

#. Deploy Prometheus. A single Prometheus instance is sufficient, but
   for high availability (HA) you might want to deploy two:

   .. prompt:: bash #

      ceph orch apply prometheus

   or

   .. prompt:: bash #

      ceph orch apply prometheus --placement 'count:2'

#. Deploy Grafana:

   .. prompt:: bash #

      ceph orch apply grafana

.. _cephadm-monitoring-centralized-logs:

Centralized Logging in Ceph
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ceph provides centralized logging with Loki and Promtail. Centralized Log
Management (CLM) consolidates all log data and pushes it to a central
repository with an accessible and easy-to-use interface. Centralized logging
is designed to make your life easier. Some of the advantages are:

#. **Linear event timeline**: it is easier to troubleshoot issues by analyzing a single chain of events than thousands of separate logs from a hundred nodes.
#. **Real-time live log monitoring**: it is impractical to follow logs from thousands of different sources.
#. **Flexible retention policies**: with per-daemon logs, log rotation is usually set to a short interval (1-2 weeks) to save disk usage.
#. **Increased security & backup**: logs can contain sensitive information and expose usage patterns. Additionally, centralized logging allows for HA, etc.

Centralized Logging in Ceph is implemented using two new services: ``loki`` and ``promtail``.

Loki: a log aggregation system used to query logs. It can be configured as a
datasource in Grafana.

Promtail: an agent that gathers logs from the system and makes them available
to Loki.

These two services are not deployed by default in a Ceph cluster. To enable
centralized logging, follow the steps in :ref:`centralized-logging`.

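As a sketch only (the linked section is the authoritative procedure), both
services are regular orchestrator services and can be described with service
specifications; the placements below are illustrative assumptions, not
defaults:

.. code-block:: yaml

   service_type: loki
   placement:
     count: 1
   ---
   service_type: promtail
   placement:
     host_pattern: '*'

Applying such a file with ``ceph orch apply -i <file>`` would deploy a single
Loki instance and a Promtail agent on every host.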
.. _cephadm-monitoring-networks-ports:

Networks and Ports
~~~~~~~~~~~~~~~~~~

All monitoring services can have the network and port they bind to configured
with a yaml service specification.

An example spec file:

.. code-block:: yaml

   service_type: grafana
   service_name: grafana
   placement:
     count: 1
   networks:
   - 192.169.142.0/24
   spec:
     port: 4200

Using custom images
~~~~~~~~~~~~~~~~~~~

It is possible to install or upgrade monitoring components based on other
images. To do so, the name of the image to be used needs to be stored in the
configuration first. The following configuration options are available:

- ``container_image_prometheus``
- ``container_image_grafana``
- ``container_image_alertmanager``
- ``container_image_node_exporter``

Custom images can be set with the ``ceph config`` command:

.. code-block:: bash

   ceph config set mgr mgr/cephadm/<option_name> <value>

For example:

.. code-block:: bash

   ceph config set mgr mgr/cephadm/container_image_prometheus prom/prometheus:v1.4.1

If there were already running monitoring stack daemon(s) of the type whose
image you've changed, you must redeploy the daemon(s) in order to have them
actually use the new image.

For example, if you had changed the Prometheus image:

.. prompt:: bash #

   ceph orch redeploy prometheus


.. note::

   By setting a custom image, the default value will be overridden (but not
   overwritten). The default value changes when updates become available.
   By setting a custom image, you will not be able to update the component
   you have set the custom image for automatically. You will need to
   manually update the configuration (image name and tag) to be able to
   install updates.

   If you choose to go with the recommendations instead, you can reset the
   custom image you have set before. After that, the default value will be
   used again. Use ``ceph config rm`` to reset the configuration option:

   .. code-block:: bash

      ceph config rm mgr mgr/cephadm/<option_name>

   For example:

   .. code-block:: bash

      ceph config rm mgr mgr/cephadm/container_image_prometheus

.. _cephadm-overwrite-jinja2-templates:

Using custom configuration files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By overriding cephadm templates, it is possible to completely customize the
configuration files for monitoring services.

Internally, cephadm already uses `Jinja2
<https://jinja.palletsprojects.com/en/2.11.x/>`_ templates to generate the
configuration files for all monitoring components. To customize the
configuration of Prometheus, Grafana, or the Alertmanager, you can store a
Jinja2 template for each service, which will be used for configuration
generation instead. This template will be evaluated every time a service of
that kind is deployed or reconfigured. That way, the custom configuration is
preserved and automatically applied on future deployments of these services.

.. note::

   The configuration of the custom template is also preserved when the default
   configuration of cephadm changes. If the updated configuration is to be used,
   the custom template needs to be migrated *manually* after each upgrade of Ceph.

Option names
""""""""""""

The following templates for files that will be generated by cephadm can be
overridden. These are the names to be used when storing with ``ceph config-key
set``:

- ``services/alertmanager/alertmanager.yml``
- ``services/grafana/ceph-dashboard.yml``
- ``services/grafana/grafana.ini``
- ``services/prometheus/prometheus.yml``
- ``services/prometheus/alerting/custom_alerts.yml``

You can look up the file templates that are currently used by cephadm in
``src/pybind/mgr/cephadm/templates``:

- ``services/alertmanager/alertmanager.yml.j2``
- ``services/grafana/ceph-dashboard.yml.j2``
- ``services/grafana/grafana.ini.j2``
- ``services/prometheus/prometheus.yml.j2``

Usage
"""""

The following command applies a single line value:

.. code-block:: bash

   ceph config-key set mgr/cephadm/<option_name> <value>

To set the contents of a file as the template, use the ``-i`` argument:

.. code-block:: bash

   ceph config-key set mgr/cephadm/<option_name> -i $PWD/<filename>

.. note::

   When using files as input to ``config-key``, an absolute path to the file
   must be used.


The configuration file for the service then needs to be recreated.
This is done using ``reconfig``. For more details see the following example.

Example
"""""""

.. code-block:: bash

   # set the contents of ./prometheus.yml.j2 as template
   ceph config-key set mgr/cephadm/services/prometheus/prometheus.yml \
     -i $PWD/prometheus.yml.j2

   # reconfig the prometheus service
   ceph orch reconfig prometheus

.. code-block:: bash

   # set additional custom alerting rules for Prometheus
   ceph config-key set mgr/cephadm/services/prometheus/alerting/custom_alerts.yml \
     -i $PWD/custom_alerts.yml

   # Note that custom alerting rules are not parsed by Jinja and hence escaping
   # will not be an issue.

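For illustration, a ``custom_alerts.yml`` file contains standard Prometheus
alerting rules. The rule below is a hypothetical example, not a default
shipped with Ceph; ``ceph_health_status`` is a metric exported by the
:ref:`mgr-prometheus` module (0 means ``HEALTH_OK``):

.. code-block:: yaml

   groups:
     - name: custom
       rules:
         - alert: CephHealthNotOk
           # fires when the cluster has been in WARN or ERR for 5 minutes
           expr: ceph_health_status >= 1
           for: 5m
           labels:
             severity: warning
           annotations:
             summary: "Ceph cluster health is not OK"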
Deploying monitoring without cephadm
------------------------------------

If you have an existing Prometheus monitoring infrastructure, or would like
to manage it yourself, you need to configure it to integrate with your Ceph
cluster.

* Enable the prometheus module in the ceph-mgr daemon:

  .. code-block:: bash

     ceph mgr module enable prometheus

  By default, ceph-mgr presents Prometheus metrics on port 9283 on each host
  running a ceph-mgr daemon. Configure Prometheus to scrape these endpoints.

* To enable the dashboard's Prometheus-based alerting, see :ref:`dashboard-alerting`.

* To enable dashboard integration with Grafana, see :ref:`dashboard-grafana`.

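A minimal scrape configuration for the ceph-mgr endpoints might look like the
following; the hostnames are placeholders, and ``honor_labels`` preserves the
instance labels set by Ceph rather than overwriting them with the scrape
target:

.. code-block:: yaml

   scrape_configs:
     - job_name: 'ceph'
       honor_labels: true
       static_configs:
         - targets:
             - 'mgr-host-1.example.com:9283'
             - 'mgr-host-2.example.com:9283'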
Disabling monitoring
--------------------

To disable monitoring and remove the software that supports it, run the
following commands:

.. code-block:: console

   $ ceph orch rm grafana
   $ ceph orch rm prometheus --force   # this will delete metrics data collected so far
   $ ceph orch rm node-exporter
   $ ceph orch rm alertmanager
   $ ceph mgr module disable prometheus

See also :ref:`orch-rm`.

Setting up RBD-Image monitoring
-------------------------------

For performance reasons, monitoring of RBD images is disabled by default. For
more information, please see :ref:`prometheus-rbd-io-statistics`. If disabled,
the overview and details dashboards will stay empty in Grafana and the metrics
will not be visible in Prometheus.

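RBD image monitoring is enabled by telling the prometheus module which pools
to collect per-image IO statistics for, via the
``mgr/prometheus/rbd_stats_pools`` option; the pool names here are
placeholders (see :ref:`prometheus-rbd-io-statistics` for the full syntax):

.. prompt:: bash #

   ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2"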
Setting up Grafana
------------------

Manually setting the Grafana URL
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Cephadm automatically configures Prometheus, Grafana, and Alertmanager in
all cases except one.

In some setups, the Dashboard user's browser might not be able to access the
Grafana URL that is configured in Ceph Dashboard. This can happen when the
cluster and the accessing user are in different DNS zones.

If this is the case, you can use a configuration option for Ceph Dashboard
to set the URL that the user's browser will use to access Grafana. This
value will never be altered by cephadm. To set this configuration option,
issue the following command:

.. prompt:: bash $

   ceph dashboard set-grafana-frontend-api-url <grafana-server-api>

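Assuming your Ceph release provides the matching ``get`` command (Dashboard
settings generally come in get/set pairs), you can inspect the currently
configured value:

.. prompt:: bash $

   ceph dashboard get-grafana-frontend-api-url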
It might take a minute or two for services to be deployed. After the
services have been deployed, you should see something like this when you
issue the command ``ceph orch ls``:

.. code-block:: console

   $ ceph orch ls
   NAME           RUNNING  REFRESHED  IMAGE NAME                                      IMAGE ID      SPEC
   alertmanager   1/1      6s ago     docker.io/prom/alertmanager:latest              0881eb8f169f  present
   crash          2/2      6s ago     docker.io/ceph/daemon-base:latest-master-devel  mix           present
   grafana        1/1      0s ago     docker.io/pcuzner/ceph-grafana-el8:latest       f77afcf0bcf6  absent
   node-exporter  2/2      6s ago     docker.io/prom/node-exporter:latest             e5a616e4b9cf  present
   prometheus     1/1      6s ago     docker.io/prom/prometheus:latest                e935122ab143  present

Configuring SSL/TLS for Grafana
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``cephadm`` deploys Grafana using the certificate defined in the ceph
key/value store. If no certificate is specified, ``cephadm`` generates a
self-signed certificate during the deployment of the Grafana service.

A custom certificate can be configured using the following commands:

.. prompt:: bash #

   ceph config-key set mgr/cephadm/grafana_key -i $PWD/key.pem
   ceph config-key set mgr/cephadm/grafana_crt -i $PWD/certificate.pem

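If you only need a certificate for testing, a self-signed key/certificate
pair can be generated with ``openssl``; the subject CN below is a placeholder
for your Grafana host:

.. prompt:: bash #

   openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
       -keyout key.pem -out certificate.pem \
       -subj "/CN=grafana.example.com"

Browsers will warn about self-signed certificates; for production, use a
certificate issued by a CA that your clients trust.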
If you have already deployed Grafana, run ``reconfig`` on the service to
update its configuration:

.. prompt:: bash #

   ceph orch reconfig grafana

The ``reconfig`` command also sets the proper URL for Ceph Dashboard.

Setting the initial admin password
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, Grafana will not create an initial admin user. In order to
create the admin user, please create a file ``grafana.yaml`` with this
content:

.. code-block:: yaml

   service_type: grafana
   spec:
     initial_admin_password: mypassword

Then apply this specification:

.. code-block:: bash

   ceph orch apply -i grafana.yaml
   ceph orch redeploy grafana

Grafana will now create an admin user called ``admin`` with the
given password.


Setting up Alertmanager
-----------------------

Adding Alertmanager webhooks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To add new webhooks to the Alertmanager configuration, add additional
webhook URLs like so:

.. code-block:: yaml

   service_type: alertmanager
   spec:
     user_data:
       default_webhook_urls:
         - "https://foo"
         - "https://bar"

Here ``default_webhook_urls`` is a list of additional URLs that are
added to the default receivers' ``<webhook_configs>`` configuration.

Run ``reconfig`` on the service to update its configuration:

.. prompt:: bash #

   ceph orch reconfig alertmanager

Turn on Certificate Validation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you are using certificates for Alertmanager and want to make sure
these certificates are verified, you should set the ``secure`` option to
``true`` in your Alertmanager spec (this defaults to ``false``):

.. code-block:: yaml

   service_type: alertmanager
   spec:
     secure: true

If you already had Alertmanager daemons running before applying the spec,
you must reconfigure them to update their configuration:

.. prompt:: bash #

   ceph orch reconfig alertmanager

446 Further Reading
447 ---------------
448
449 * :ref:`mgr-prometheus`