.. _mgr-prometheus:

=================
Prometheus Module
=================

Provides a Prometheus exporter to pass on Ceph performance counters
from the collection point in ceph-mgr. Ceph-mgr receives MMgrReport
messages from all MgrClient processes (mons and OSDs, for instance)
with performance counter schema data and actual counter data, and keeps
a circular buffer of the last N samples. This module creates an HTTP
endpoint (like all Prometheus exporters) and retrieves the latest sample
of every counter when polled (or "scraped" in Prometheus terminology).
The HTTP path and query parameters are ignored; all extant counters
for all reporting entities are returned in text exposition format.
(See the Prometheus `documentation <https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-details>`_.)

Enabling prometheus output
==========================

The *prometheus* module is enabled with:

.. prompt:: bash $

   ceph mgr module enable prometheus

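Once the module is enabled, a quick way to confirm that the exporter is
responding is to fetch the metrics endpoint directly. This sketch assumes the
default port and that you are on the active manager host; adjust the host name
for your environment:

.. prompt:: bash $

   curl http://localhost:9283/metrics | head
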
Configuration
-------------

.. note::

   The Prometheus manager module needs to be restarted for configuration
   changes to be applied.

.. mgr_module:: prometheus
.. confval:: server_addr
.. confval:: server_port
.. confval:: scrape_interval
.. confval:: cache
.. confval:: stale_cache_strategy
.. confval:: rbd_stats_pools
.. confval:: rbd_stats_pools_refresh_interval
.. confval:: standby_behaviour
.. confval:: standby_error_status_code

By default the module will accept HTTP requests on port ``9283`` on all IPv4
and IPv6 addresses on the host. The port and listen address are both
configurable with ``ceph config set``, with keys
``mgr/prometheus/server_addr`` and ``mgr/prometheus/server_port``. This port
is registered with Prometheus's `registry
<https://github.com/prometheus/prometheus/wiki/Default-port-allocations>`_.

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
   ceph config set mgr mgr/prometheus/server_port 9283

.. warning::

   The :confval:`scrape_interval` of this module should always be set to match
   the Prometheus scrape interval; otherwise caching will not work properly
   and may cause issues.

The scrape interval in the module is used for caching purposes
and to determine when a cache is stale.

A scrape interval below 10 seconds is not recommended. The recommended
scrape interval is 15 seconds, though in some cases it may be useful to
increase it.

To set a different scrape interval in the Prometheus module, set
``scrape_interval`` to the desired value:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/scrape_interval 20

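If you change the module's scrape interval, keep the Prometheus server's
interval in step. A job-level override in ``prometheus.yml`` might look like
this (a sketch only; the job name, target host, and 20-second interval are
illustrative):

::

    scrape_configs:
      - job_name: 'ceph'
        scrape_interval: 20s
        honor_labels: true
        static_configs:
          - targets: ['mgr-host.example.com:9283']
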
On large clusters (>1000 OSDs), the time to fetch the metrics may become
significant. Without the cache, the Prometheus manager module could, especially
in conjunction with multiple Prometheus instances, overload the manager and lead
to unresponsive or crashing Ceph manager instances. Hence, the cache is enabled
by default. This means that there is a possibility that the cache becomes
stale. The cache is considered stale when the time to fetch the metrics from
Ceph exceeds the configured :confval:`scrape_interval`.

If that is the case, **a warning will be logged** and the module will either

* respond with a 503 HTTP status code (service unavailable), or
* return the content of the cache, even though it might be stale.

This behavior can be configured. By default, the module will return a 503 HTTP
status code (service unavailable). You can set other options using the ``ceph
config set`` commands.

To tell the module to respond with possibly stale data, set it to ``return``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/stale_cache_strategy return

To tell the module to respond with "service unavailable", set it to ``fail``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/stale_cache_strategy fail

If you are confident that you don't require the cache, you can disable it:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/cache false

If you are using the prometheus module behind a reverse proxy or
load balancer, you can simplify discovering the active instance by switching
to ``error``-mode:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/standby_behaviour error

If set, the prometheus module will respond with an HTTP error when ``/`` is
requested from the standby instance. The default error code is 500, but you
can configure the HTTP response code with:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/standby_error_status_code 503

Valid error codes are in the range 400-599.

To switch back to the default behaviour, simply set the config key to ``default``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/standby_behaviour default

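With ``error``-mode enabled, a load balancer health check can distinguish the
active manager from a standby by status code alone. For example (the host name
below is illustrative; substitute one of your manager hosts):

.. prompt:: bash $

   curl -s -o /dev/null -w '%{http_code}\n' http://mgr-standby.example.com:9283/
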
.. _prometheus-rbd-io-statistics:

Ceph Health Checks
------------------

The mgr/prometheus module also tracks and maintains a history of Ceph health
checks, exposing them to the Prometheus server as discrete metrics. This
allows Prometheus alert rules to be configured for specific health check
events.

The metrics take the following form:

::

    # HELP ceph_health_detail healthcheck status by type (0=inactive, 1=active)
    # TYPE ceph_health_detail gauge
    ceph_health_detail{name="OSDMAP_FLAGS",severity="HEALTH_WARN"} 0.0
    ceph_health_detail{name="OSD_DOWN",severity="HEALTH_WARN"} 1.0
    ceph_health_detail{name="PG_DEGRADED",severity="HEALTH_WARN"} 1.0

The health check history is made available through the following commands:

::

    healthcheck history ls [--format {plain|json|json-pretty}]
    healthcheck history clear

The ``ls`` command provides an overview of the health checks that the cluster
has encountered since the last ``clear`` command was issued. For example:

::

    [ceph: root@c8-node1 /]# ceph healthcheck history ls
    Healthcheck Name    First Seen (UTC)     Last seen (UTC)      Count  Active
    OSDMAP_FLAGS        2021/09/16 03:17:47  2021/09/16 22:07:40      2      No
    OSD_DOWN            2021/09/17 00:11:59  2021/09/17 00:11:59      1     Yes
    PG_DEGRADED         2021/09/17 00:11:59  2021/09/17 00:11:59      1     Yes
    3 health check(s) listed

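Building on the ``ceph_health_detail`` metric above, an alerting rule for a
specific health check might be sketched like this (the alert name, ``for``
duration, and label/annotation values are illustrative):

::

    groups:
      - name: ceph-health
        rules:
          - alert: CephOSDDown
            expr: ceph_health_detail{name="OSD_DOWN"} == 1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "One or more Ceph OSDs are down"
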
RBD IO statistics
-----------------

The module can optionally collect RBD per-image IO statistics by enabling
dynamic OSD performance counters. The statistics are gathered for all images
in the pools that are specified in the ``mgr/prometheus/rbd_stats_pools``
configuration parameter. The parameter is a comma- or space-separated list
of ``pool[/namespace]`` entries. If the namespace is not specified the
statistics are collected for all namespaces in the pool.

Example to activate the RBD-enabled pools ``pool1``, ``pool2`` and ``poolN``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2,poolN"

The wildcard can be used to indicate all pools or namespaces:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/rbd_stats_pools "*"

The module builds the list of all available images by scanning the specified
pools and namespaces, and refreshes it periodically. The period is
configurable via the ``mgr/prometheus/rbd_stats_pools_refresh_interval``
parameter (in seconds) and is 300 seconds (5 minutes) by default. The module
will force an earlier refresh if it detects statistics from a previously
unknown RBD image.

Example to increase the sync interval to 10 minutes:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 600

Ceph daemon performance counters metrics
----------------------------------------

With the introduction of the ``ceph-exporter`` daemon, the prometheus module
will no longer export Ceph daemon perf counters as prometheus metrics by
default. However, you may re-enable exporting these metrics by setting the
module option ``exclude_perf_counters`` to ``false``:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/exclude_perf_counters false

Statistic names and labels
==========================

The names of the stats are exactly as Ceph names them, with
the illegal characters ``.``, ``-`` and ``::`` translated to ``_``,
and ``ceph_`` prefixed to all names.

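As a quick illustration of this translation (the counter name below is just an
example), the same substitutions can be reproduced with a shell one-liner:

.. prompt:: bash $

   echo "bluestore.kv_flush_lat" | sed -e 's/::/_/g' -e 's/[.-]/_/g' -e 's/^/ceph_/'

This prints ``ceph_bluestore_kv_flush_lat``, the metric name you would query
in Prometheus.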
All *daemon* statistics have a ``ceph_daemon`` label such as "osd.123"
that identifies the type and ID of the daemon they come from. Some
statistics can come from different types of daemon, so when querying
e.g. an OSD's RocksDB stats, you would probably want to filter
on ``ceph_daemon`` starting with "osd" to avoid mixing in the monitor
RocksDB stats.

The *cluster* statistics (i.e. those global to the Ceph cluster)
have labels appropriate to what they report on. For example,
metrics relating to pools have a ``pool_id`` label.

The long running averages that represent the histograms from core Ceph
are represented by a pair of ``<name>_sum`` and ``<name>_count`` metrics.
This is similar to how histograms are represented in `Prometheus
<https://prometheus.io/docs/concepts/metric_types/#histogram>`_
and they can also be treated `similarly
<https://prometheus.io/docs/practices/histograms/>`_.

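For example, a recent average can be derived from such a ``_sum``/``_count``
pair in PromQL (the metric name here is illustrative; substitute one that your
cluster actually exports):

::

    rate(ceph_osd_op_r_latency_sum[5m]) / rate(ceph_osd_op_r_latency_count[5m])
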
Pool and OSD metadata series
----------------------------

Special series are output to enable displaying and querying on
certain metadata fields.

Pools have a ``ceph_pool_metadata`` field like this:

::

    ceph_pool_metadata{pool_id="2",name="cephfs_metadata_a"} 1.0

OSDs have a ``ceph_osd_metadata`` field like this:

::

    ceph_osd_metadata{cluster_addr="172.21.9.34:6802/19096",device_class="ssd",ceph_daemon="osd.0",public_addr="172.21.9.34:6801/19096",weight="1.0"} 1.0

Correlating drive statistics with node_exporter
-----------------------------------------------

The prometheus output from Ceph is designed to be used in conjunction
with the generic host monitoring from the Prometheus node_exporter.

To enable correlation of Ceph OSD statistics with node_exporter's
drive statistics, special series are output like this:

::

    ceph_disk_occupation_human{ceph_daemon="osd.0", device="sdd", exported_instance="myhost"}

To use this to get disk statistics by OSD ID, use either the ``and`` operator
or the ``*`` operator in your prometheus query. All metadata metrics (like
``ceph_disk_occupation_human``) have the value 1, so they act neutrally with
``*``. Using ``*`` allows the use of the ``group_left`` and ``group_right``
grouping modifiers, so that the resulting metric has additional labels from
one side of the query.

See the
`prometheus documentation`__ for more information about constructing queries.

__ https://prometheus.io/docs/prometheus/latest/querying/basics

The goal is to run a query like

::

    rate(node_disk_written_bytes_total[30s]) and
    on (device,instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}

Out of the box the above query will not return any metrics since the ``instance`` labels of
both metrics don't match. The ``instance`` label of ``ceph_disk_occupation_human``
will be the currently active MGR node.

The following two sections outline two approaches to remedy this.

.. note::

   If you need to group on the ``ceph_daemon`` label instead of the ``device``
   and ``instance`` labels, using ``ceph_disk_occupation_human`` may not work
   reliably. It is advised that you use ``ceph_disk_occupation`` instead.

   The difference is that ``ceph_disk_occupation_human`` may group several
   OSDs into the value of a single ``ceph_daemon`` label in cases where
   multiple OSDs share a disk.

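As a sketch of the ``*`` approach, a query that carries the ``ceph_daemon``
label over onto the disk write rate might look like this (it still requires
the ``instance`` labels of both metrics to line up):

::

    rate(node_disk_written_bytes_total[30s]) * on (device, instance)
      group_left (ceph_daemon) ceph_disk_occupation_human
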
Use label_replace
=================

The ``label_replace`` function (see the
`label_replace documentation <https://prometheus.io/docs/prometheus/latest/querying/functions/#label_replace>`_)
can add a label to, or alter a label of, a metric within a query.

To correlate an OSD and its disk's write rate, the following query can be used:

::

    label_replace(
        rate(node_disk_written_bytes_total[30s]),
        "exported_instance",
        "$1",
        "instance",
        "(.*):.*"
    ) and on (device, exported_instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}

Configuring Prometheus server
=============================

honor_labels
------------

To enable Ceph to output properly-labeled data relating to any host,
use the ``honor_labels`` setting when adding the ceph-mgr endpoints
to your prometheus configuration.

This allows Ceph to export the proper ``instance`` label without prometheus
overwriting it. Without this setting, Prometheus applies an ``instance`` label
that includes the hostname and port of the endpoint that the series came from.
Because Ceph clusters have multiple manager daemons, this results in an
``instance`` label that changes spuriously when the active manager daemon
changes.

If this is undesirable, a custom ``instance`` label can be set in the
Prometheus target configuration: you might wish to set it to the hostname
of your first mgr daemon, or something completely arbitrary like "ceph_cluster".

node_exporter hostname labels
-----------------------------

Set your ``instance`` labels to match what appears in Ceph's OSD metadata
in the ``instance`` field. This is generally the short hostname of the node.

This is only necessary if you want to correlate Ceph stats with host stats,
but you may find it useful to do it in all cases in case you want to do
the correlation in the future.

Example configuration
---------------------

This example shows a single node configuration running ceph-mgr and
node_exporter on a server called ``senta04``. Note that this requires one
to add an appropriate and unique ``instance`` label to each ``node_exporter`` target.

This is just an example: there are other ways to configure prometheus
scrape targets and label rewrite rules.

prometheus.yml
~~~~~~~~~~~~~~

::

    global:
        scrape_interval: 15s
        evaluation_interval: 15s

    scrape_configs:
      - job_name: 'node'
        file_sd_configs:
          - files:
            - node_targets.yml
      - job_name: 'ceph'
        honor_labels: true
        file_sd_configs:
          - files:
            - ceph_targets.yml

ceph_targets.yml
~~~~~~~~~~~~~~~~

::

    [
        {
            "targets": [ "senta04.mydomain.com:9283" ],
            "labels": {}
        }
    ]

node_targets.yml
~~~~~~~~~~~~~~~~

::

    [
        {
            "targets": [ "senta04.mydomain.com:9100" ],
            "labels": {
                "instance": "senta04"
            }
        }
    ]

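Before reloading Prometheus with a configuration like the one above, it can be
worth validating it first. This assumes ``promtool`` from the Prometheus
distribution is installed and that the configuration file is in the current
directory:

.. prompt:: bash $

   promtool check config prometheus.yml
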
Notes
=====

Counters and gauges are exported; currently histograms and long-running
averages are not. It's possible that Ceph's 2-D histograms could be
reduced to two separate 1-D histograms, and that long-running averages
could be exported as Prometheus' Summary type.

Timestamps, as with many Prometheus exporters, are established by
the server's scrape time (Prometheus expects that it is polling the
actual counter process synchronously). It is possible to supply a
timestamp along with the stat report, but the Prometheus team strongly
advises against this. This means that timestamps will be delayed by
an unpredictable amount; it's not clear if this will be problematic,
but it's worth knowing about.