.. _mgr-cephadm-monitoring:

Monitoring Services
===================

Ceph Dashboard uses `Prometheus <https://prometheus.io/>`_, `Grafana
<https://grafana.com/>`_, and related tools to store and visualize detailed
metrics on cluster utilization and performance. Ceph users have three options:

#. Have cephadm deploy and configure these services. This is the default
   when bootstrapping a new cluster unless the ``--skip-monitoring-stack``
   option is used.
#. Deploy and configure these services manually. This is recommended for users
   with existing prometheus services in their environment (and in cases where
   Ceph is running in Kubernetes with Rook).
#. Skip the monitoring stack completely. Some Ceph dashboard graphs will
   not be available.

The monitoring stack consists of `Prometheus <https://prometheus.io/>`_,
Prometheus exporters (:ref:`mgr-prometheus`, `Node exporter
<https://prometheus.io/docs/guides/node-exporter/>`_), `Prometheus Alert
Manager <https://prometheus.io/docs/alerting/alertmanager/>`_ and `Grafana
<https://grafana.com/>`_.

.. note::

   Prometheus' security model presumes that untrusted users have access to the
   Prometheus HTTP endpoint and logs. Untrusted users have access to all the
   (meta)data Prometheus collects that is contained in the database, plus a
   variety of operational and debugging information.

   However, Prometheus' HTTP API is limited to read-only operations.
   Configurations can *not* be changed using the API and secrets are not
   exposed. Moreover, Prometheus has some built-in measures to mitigate the
   impact of denial of service attacks.

   Please see `Prometheus' Security model
   <https://prometheus.io/docs/operating/security/>`_ for more detailed
   information.

Deploying monitoring with cephadm
---------------------------------

The default behavior of ``cephadm`` is to deploy a basic monitoring stack. It
is however possible that you have a Ceph cluster without a monitoring stack,
and you would like to add a monitoring stack to it. (Here are some ways that
you might have come to have a Ceph cluster without a monitoring stack: You
might have passed the ``--skip-monitoring-stack`` option to ``cephadm`` during
the installation of the cluster, or you might have converted an existing
cluster (which had no monitoring stack) to cephadm management.)

To set up monitoring on a Ceph cluster that has no monitoring, follow the
steps below:

#. Deploy a node-exporter service on every node of the cluster. The
   node-exporter provides host-level metrics like CPU and memory utilization:

   .. prompt:: bash #

      ceph orch apply node-exporter

#. Deploy alertmanager:

   .. prompt:: bash #

      ceph orch apply alertmanager

#. Deploy Prometheus. A single Prometheus instance is sufficient, but
   for high availability (HA) you might want to deploy two:

   .. prompt:: bash #

      ceph orch apply prometheus

   or

   .. prompt:: bash #

      ceph orch apply prometheus --placement 'count:2'

#. Deploy grafana:

   .. prompt:: bash #

      ceph orch apply grafana

.. _cephadm-monitoring-networks-ports:

Networks and Ports
~~~~~~~~~~~~~~~~~~

All monitoring services can have the network and port they bind to configured
with a yaml service specification.

Example spec file:

.. code-block:: yaml

    service_type: grafana
    service_name: grafana
    placement:
      count: 1
    networks:
    - 192.169.142.0/24
    spec:
      port: 4200
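
A spec like this can be saved to a file and applied with ``ceph orch apply``
(the file name used here is just a placeholder):

.. prompt:: bash #

   ceph orch apply -i monitoring-spec.yaml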

Using custom images
~~~~~~~~~~~~~~~~~~~

It is possible to install or upgrade monitoring components based on other
images. To do so, the name of the image to be used needs to be stored in the
configuration first. The following configuration options are available.

- ``container_image_prometheus``
- ``container_image_grafana``
- ``container_image_alertmanager``
- ``container_image_node_exporter``

Custom images can be set with the ``ceph config`` command:

.. code-block:: bash

     ceph config set mgr mgr/cephadm/<option_name> <value>

For example:

.. code-block:: bash

     ceph config set mgr mgr/cephadm/container_image_prometheus prom/prometheus:v1.4.1

If any monitoring stack daemons of the type whose image you have changed are
already running, you must redeploy them in order to have them actually use the
new image.

For example, if you had changed the Prometheus image:

.. prompt:: bash #

   ceph orch redeploy prometheus

.. note::

   Setting a custom image overrides the default value (but does not
   overwrite it). The default value changes when updates become available.
   If you set a custom image, the component you set it for will not be
   updated automatically. You will need to update the configuration
   (image name and tag) manually to be able to install updates.

   If you want to return to the default image instead, you can reset the
   custom image you set before. After that, the default value will be
   used again. Use ``ceph config rm`` to reset the configuration option:

   .. code-block:: bash

        ceph config rm mgr mgr/cephadm/<option_name>

   For example:

   .. code-block:: bash

        ceph config rm mgr mgr/cephadm/container_image_prometheus

.. _cephadm-overwrite-jinja2-templates:

Using custom configuration files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By overriding cephadm templates, it is possible to completely customize the
configuration files for monitoring services.

Internally, cephadm already uses `Jinja2
<https://jinja.palletsprojects.com/en/2.11.x/>`_ templates to generate the
configuration files for all monitoring components. To customize the
configuration of Prometheus, Grafana or the Alertmanager, you can store a
Jinja2 template for each service that will be used for configuration
generation instead. This template will be evaluated every time a service of
that kind is deployed or reconfigured. That way, the custom configuration is
preserved and automatically applied on future deployments of these services.

.. note::

   The configuration of the custom template is also preserved when the default
   configuration of cephadm changes. If the updated configuration is to be used,
   the custom template needs to be migrated *manually* after each upgrade of Ceph.

Option names
""""""""""""

The following templates for files that will be generated by cephadm can be
overridden. These are the names to use when storing them with ``ceph
config-key set``:

- ``services/alertmanager/alertmanager.yml``
- ``services/grafana/ceph-dashboard.yml``
- ``services/grafana/grafana.ini``
- ``services/prometheus/prometheus.yml``

You can look up the file templates that are currently used by cephadm in
``src/pybind/mgr/cephadm/templates``:

- ``services/alertmanager/alertmanager.yml.j2``
- ``services/grafana/ceph-dashboard.yml.j2``
- ``services/grafana/grafana.ini.j2``
- ``services/prometheus/prometheus.yml.j2``

Usage
"""""

The following command applies a single-line value:

.. code-block:: bash

  ceph config-key set mgr/cephadm/<option_name> <value>

To set the contents of a file as the template, use the ``-i`` argument:

.. code-block:: bash

  ceph config-key set mgr/cephadm/<option_name> -i $PWD/<filename>

.. note::

   When using files as input to ``config-key``, an absolute path to the file
   must be used.

Then the configuration file for the service needs to be recreated.
This is done using ``reconfig``. For more details see the following example.

Example
"""""""

.. code-block:: bash

  # set the contents of ./prometheus.yml.j2 as template
  ceph config-key set mgr/cephadm/services/prometheus/prometheus.yml \
    -i $PWD/prometheus.yml.j2

  # reconfig the prometheus service
  ceph orch reconfig prometheus
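
Keep in mind that the stored template is used *instead of* the default one, so
the generated configuration comes entirely from your template; for Prometheus
this means it must also contain the scrape jobs for the Ceph exporters. As a
purely illustrative sketch, a custom template might tweak the global settings
and add one extra static scrape job; the job name and target below are
placeholders and this is not based on the template shipped with cephadm:

.. code-block:: yaml

  # hypothetical excerpt of a custom prometheus.yml.j2
  global:
    scrape_interval: 15s
    evaluation_interval: 15s

  scrape_configs:
    # additional scrape job for an external exporter (placeholder target)
    - job_name: 'my-external-exporter'
      static_configs:
        - targets: ['exporter.example.com:9100']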

Deploying monitoring without cephadm
------------------------------------

If you have an existing Prometheus monitoring infrastructure, or would like
to manage it yourself, you need to configure it to integrate with your Ceph
cluster.

* Enable the prometheus module in the ceph-mgr daemon:

  .. code-block:: bash

     ceph mgr module enable prometheus

  By default, ceph-mgr presents Prometheus metrics on port 9283 on each host
  running a ceph-mgr daemon. Configure Prometheus to scrape these endpoints;
  a sketch of such a scrape configuration follows this list.

* To enable the dashboard's Prometheus-based alerting, see :ref:`dashboard-alerting`.

* To enable dashboard integration with Grafana, see :ref:`dashboard-grafana`.
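
A minimal, hypothetical ``scrape_configs`` entry for an external Prometheus
could look like the following. The job name and mgr host names are
placeholders; 9283 is the default port of the prometheus manager module
mentioned above:

.. code-block:: yaml

  scrape_configs:
    # scrape the prometheus module on every host running a ceph-mgr daemon
    - job_name: 'ceph'
      honor_labels: true   # keep the labels exported by ceph-mgr
      static_configs:
        - targets:
          - 'mgr-host-1.example.com:9283'
          - 'mgr-host-2.example.com:9283'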

Disabling monitoring
--------------------

To disable monitoring and remove the software that supports it, run the
following commands:

.. code-block:: console

  $ ceph orch rm grafana
  $ ceph orch rm prometheus --force   # this will delete metrics data collected so far
  $ ceph orch rm node-exporter
  $ ceph orch rm alertmanager
  $ ceph mgr module disable prometheus

See also :ref:`orch-rm`.

Setting up RBD-Image monitoring
-------------------------------

For performance reasons, monitoring of RBD images is disabled by default. For
more information please see :ref:`prometheus-rbd-io-statistics`. If disabled,
the overview and details dashboards will stay empty in Grafana and the metrics
will not be visible in Prometheus.
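
To enable it, tell the prometheus manager module which pools to gather RBD IO
statistics for. A hedged example, where the pool name ``rbd`` is a placeholder
(see :ref:`prometheus-rbd-io-statistics` for the full syntax):

.. prompt:: bash #

   ceph config set mgr mgr/prometheus/rbd_stats_pools "rbd"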

Setting up Grafana
------------------

Manually setting the Grafana URL
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Cephadm automatically configures Prometheus, Grafana, and Alertmanager in
all cases except one.

In some setups, the Dashboard user's browser might not be able to access the
Grafana URL that is configured in Ceph Dashboard. This can happen when the
cluster and the accessing user are in different DNS zones.

If this is the case, you can use a configuration option for Ceph Dashboard
to set the URL that the user's browser will use to access Grafana. This
value will never be altered by cephadm. To set this configuration option,
issue the following command:

.. prompt:: bash $

   ceph dashboard set-grafana-frontend-api-url <grafana-server-api>
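
For example, assuming Grafana is reachable by the user's browser at
``grafana.example.com`` (a placeholder host name) on Grafana's default port
3000, the command might look like this:

.. prompt:: bash $

   ceph dashboard set-grafana-frontend-api-url https://grafana.example.com:3000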

It might take a minute or two for services to be deployed. After the services
have been deployed, you should see something like this when you issue the
command ``ceph orch ls``:

.. code-block:: console

  $ ceph orch ls
  NAME           RUNNING  REFRESHED  IMAGE NAME                                       IMAGE ID      SPEC
  alertmanager   1/1      6s ago     docker.io/prom/alertmanager:latest               0881eb8f169f  present
  crash          2/2      6s ago     docker.io/ceph/daemon-base:latest-master-devel   mix           present
  grafana        1/1      0s ago     docker.io/pcuzner/ceph-grafana-el8:latest        f77afcf0bcf6  absent
  node-exporter  2/2      6s ago     docker.io/prom/node-exporter:latest              e5a616e4b9cf  present
  prometheus     1/1      6s ago     docker.io/prom/prometheus:latest                 e935122ab143  present

Configuring SSL/TLS for Grafana
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``cephadm`` deploys Grafana using the certificate defined in the ceph
key/value store. If no certificate is specified, ``cephadm`` generates a
self-signed certificate during the deployment of the Grafana service.

A custom certificate can be configured using the following commands:

.. prompt:: bash #

   ceph config-key set mgr/cephadm/grafana_key -i $PWD/key.pem
   ceph config-key set mgr/cephadm/grafana_crt -i $PWD/certificate.pem
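
If you first need a certificate for testing, a self-signed one can be
generated with ``openssl``. This is just a sketch; the subject name is a
placeholder:

.. prompt:: bash #

   openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
     -keyout key.pem -out certificate.pem \
     -subj "/CN=grafana.example.com"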

If you have already deployed Grafana, run ``reconfig`` on the service to
update its configuration:

.. prompt:: bash #

   ceph orch reconfig grafana

The ``reconfig`` command also sets the proper URL for Ceph Dashboard.

Setting the initial admin password
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, Grafana will not create an initial admin user. In order to create
the admin user, please create a file ``grafana.yaml`` with this content:

.. code-block:: yaml

  service_type: grafana
  spec:
    initial_admin_password: mypassword

Then apply this specification:

.. code-block:: bash

  ceph orch apply -i grafana.yaml
  ceph orch redeploy grafana

Grafana will now create an admin user called ``admin`` with the
given password.

Setting up Alertmanager
-----------------------

Adding Alertmanager webhooks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To add new webhooks to the Alertmanager configuration, add additional
webhook URLs like so:

.. code-block:: yaml

  service_type: alertmanager
  spec:
    user_data:
      default_webhook_urls:
      - "https://foo"
      - "https://bar"

``default_webhook_urls`` is a list of additional URLs that are added to the
default receivers' ``<webhook_configs>`` configuration.

Run ``reconfig`` on the service to update its configuration:

.. prompt:: bash #

   ceph orch reconfig alertmanager

Further Reading
---------------

* :ref:`mgr-prometheus`