.. _mgr-prometheus:

=================
Prometheus Module
=================

Provides a Prometheus exporter to pass on Ceph performance counters
from the collection point in ceph-mgr. Ceph-mgr receives MMgrReport
messages from all MgrClient processes (mons and OSDs, for instance)
with performance counter schema data and actual counter data, and keeps
a circular buffer of the last N samples. This module creates an HTTP
endpoint (like all Prometheus exporters) and retrieves the latest sample
of every counter when polled (or "scraped" in Prometheus terminology).
The HTTP path and query parameters are ignored; all extant counters
for all reporting entities are returned in text exposition format.
(See the Prometheus `documentation <https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-details>`_.)

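For a quick check that the exporter is serving data, the endpoint can be
scraped by hand. A minimal sketch, assuming the default port ``9283`` on the
active ceph-mgr host (the exact set of metrics returned varies by cluster)::

    curl http://localhost:9283/metrics

The response is plain text with one sample per line, for example
``ceph_osd_up{ceph_daemon="osd.0"} 1.0``.
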
Enabling prometheus output
==========================

The *prometheus* module is enabled with::

    ceph mgr module enable prometheus

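You can verify that the module is loaded by listing the enabled manager
modules::

    ceph mgr module ls
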
Configuration
-------------

By default the module will accept HTTP requests on port ``9283`` on all
IPv4 and IPv6 addresses on the host. The port and listen address are both
configurable with ``ceph config set``, using the keys
``mgr/prometheus/server_addr`` and ``mgr/prometheus/server_port``.
This port is registered with Prometheus's `registry <https://github.com/prometheus/prometheus/wiki/Default-port-allocations>`_.

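For example, to restrict the exporter to one address and move it to a
non-default port (the address and port below are only illustrative)::

    ceph config set mgr mgr/prometheus/server_addr 192.168.0.10
    ceph config set mgr mgr/prometheus/server_port 9284
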
.. _prometheus-rbd-io-statistics:

RBD IO statistics
-----------------

The module can optionally collect RBD per-image IO statistics by enabling
dynamic OSD performance counters. The statistics are gathered for all images
in the pools that are specified in the ``mgr/prometheus/rbd_stats_pools``
configuration parameter. The parameter is a comma- or space-separated list
of ``pool[/namespace]`` entries. If the namespace is not specified the
statistics are collected for all namespaces in the pool.

Example to activate the RBD-enabled pools ``pool1``, ``pool2`` and ``poolN``::

    ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2,poolN"

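To restrict collection to a single namespace, append it to the pool name.
A sketch, using an illustrative namespace ``ns1`` of ``pool1`` while still
covering all namespaces of ``pool2``::

    ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1/ns1,pool2"
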
The module builds the list of all available images by scanning the specified
pools and namespaces, and refreshes it periodically. The period is
configurable via the ``mgr/prometheus/rbd_stats_pools_refresh_interval``
parameter (in seconds) and is 300 seconds (5 minutes) by default. The module
will force an earlier refresh if it detects statistics from a previously
unknown RBD image.

Example to set the sync interval to 10 minutes::

    ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 600

Statistic names and labels
==========================

The names of the stats are exactly as Ceph names them, with
illegal characters ``.``, ``-`` and ``::`` translated to ``_``,
and ``ceph_`` prefixed to all names.

All *daemon* statistics have a ``ceph_daemon`` label such as "osd.123"
that identifies the type and ID of the daemon they come from. Some
statistics can come from different types of daemon, so when querying
e.g. an OSD's RocksDB stats, you would probably want to filter
on ``ceph_daemon`` starting with "osd" to avoid mixing in the monitor
RocksDB stats.

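For example, an illustrative way to restrict such a query to OSDs is a
regular-expression match on ``ceph_daemon`` (the metric name below is a
placeholder for whichever RocksDB counter you are interested in)::

    ceph_rocksdb_submit_latency_sum{ceph_daemon=~"osd\\..*"}
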
The *cluster* statistics (i.e. those global to the Ceph cluster)
have labels appropriate to what they report on. For example,
metrics relating to pools have a ``pool_id`` label.

The long-running averages that represent the histograms from core Ceph
are represented by a pair of ``<name>_sum`` and ``<name>_count`` metrics.
This is similar to how histograms are represented in `Prometheus <https://prometheus.io/docs/concepts/metric_types/#histogram>`_
and they can also be treated `similarly <https://prometheus.io/docs/practices/histograms/>`_.

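For instance, a ``_sum``/``_count`` pair can be converted into a recent
average in the usual Prometheus fashion. A sketch, assuming an OSD read
latency pair named ``ceph_osd_op_r_latency_sum`` and
``ceph_osd_op_r_latency_count`` (check your cluster for the exact names)::

    rate(ceph_osd_op_r_latency_sum[5m]) / rate(ceph_osd_op_r_latency_count[5m])
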
Pool and OSD metadata series
----------------------------

Special series are output to enable displaying and querying on
certain metadata fields.

Pools have a ``ceph_pool_metadata`` field like this:

::

    ceph_pool_metadata{pool_id="2",name="cephfs_metadata_a"} 1.0

OSDs have a ``ceph_osd_metadata`` field like this:

::

    ceph_osd_metadata{cluster_addr="172.21.9.34:6802/19096",device_class="ssd",ceph_daemon="osd.0",public_addr="172.21.9.34:6801/19096",weight="1.0"} 1.0

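These metadata series can be joined onto other metrics to pull in extra
labels. A sketch, assuming a per-pool usage metric named
``ceph_pool_bytes_used`` (illustrative; any metric carrying a ``pool_id``
label works), that attaches the pool name via ``group_left``::

    ceph_pool_bytes_used * on (pool_id) group_left(name) ceph_pool_metadata
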
Correlating drive statistics with node_exporter
-----------------------------------------------

The Prometheus output from Ceph is designed to be used in conjunction
with the generic host monitoring from the Prometheus node_exporter.

To enable correlation of Ceph OSD statistics with node_exporter's
drive statistics, special series are output like this:

::

    ceph_disk_occupation{ceph_daemon="osd.0",device="sdd",exported_instance="myhost"}

To use this to get disk statistics by OSD ID, use either the ``and`` operator or
the ``*`` operator in your Prometheus query. All metadata metrics (like
``ceph_disk_occupation``) have the value 1, so they act neutrally with ``*``.
Using ``*`` allows the use of the ``group_left`` and ``group_right`` grouping
modifiers, so that the resulting metric has additional labels from one side
of the query.

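As a sketch of the ``*`` form, the following query multiplies a
node_exporter rate by the (always 1) occupation series and copies the
``ceph_daemon`` label onto the result via ``group_left``. It assumes the
``instance`` labels of the two series already match, which, as explained
below, is not the case out of the box::

    rate(node_disk_bytes_written[30s]) * on (device, instance) group_left(ceph_daemon) ceph_disk_occupation
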
See the
`prometheus documentation`__ for more information about constructing queries.

__ https://prometheus.io/docs/prometheus/latest/querying/basics

The goal is to run a query like

::

    rate(node_disk_bytes_written[30s]) and on (device,instance) ceph_disk_occupation{ceph_daemon="osd.0"}

Out of the box the above query will not return any metrics, since the
``instance`` labels of the two metrics don't match. The ``instance`` label of
``ceph_disk_occupation`` will be the currently active MGR node.

The following two sections outline two approaches to remedy this.

Use label_replace
=================

The ``label_replace`` function (see the
`label_replace documentation <https://prometheus.io/docs/prometheus/latest/querying/functions/#label_replace>`_)
can add a label to, or alter a label of, a metric within a query.

To correlate an OSD with the write rate of its disk, the following query can
be used:

::

    label_replace(rate(node_disk_bytes_written[30s]), "exported_instance", "$1", "instance", "(.*):.*") and on (device,exported_instance) ceph_disk_occupation{ceph_daemon="osd.0"}

Configuring Prometheus server
=============================

honor_labels
------------

To enable Ceph to output properly-labeled data relating to any host,
use the ``honor_labels`` setting when adding the ceph-mgr endpoints
to your Prometheus configuration.

This allows Ceph to export the proper ``instance`` label without Prometheus
overwriting it. Without this setting, Prometheus applies an ``instance`` label
that includes the hostname and port of the endpoint that the series came from.
Because Ceph clusters have multiple manager daemons, this results in an
``instance`` label that changes spuriously when the active manager daemon
changes.

If this is undesirable a custom ``instance`` label can be set in the
Prometheus target configuration: you might wish to set it to the hostname
of your first mgr daemon, or something completely arbitrary like "ceph_cluster".

node_exporter hostname labels
-----------------------------

Set your ``instance`` labels to match what appears in Ceph's OSD metadata
in the ``instance`` field. This is generally the short hostname of the node.

This is only necessary if you want to correlate Ceph stats with host stats,
but you may find it useful to do so in all cases, in case you want to do
the correlation in the future.

Example configuration
---------------------

This example shows a single node configuration running ceph-mgr and
node_exporter on a server called ``senta04``. Note that this requires adding
the appropriate ``instance`` label to every ``node_exporter`` target
individually.

This is just an example: there are other ways to configure Prometheus
scrape targets and label rewrite rules.

prometheus.yml
~~~~~~~~~~~~~~

::

    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: 'node'
        file_sd_configs:
          - files:
              - node_targets.yml
      - job_name: 'ceph'
        honor_labels: true
        file_sd_configs:
          - files:
              - ceph_targets.yml

ceph_targets.yml
~~~~~~~~~~~~~~~~

::

    [
        {
            "targets": [ "senta04.mydomain.com:9283" ],
            "labels": {}
        }
    ]

node_targets.yml
~~~~~~~~~~~~~~~~

::

    [
        {
            "targets": [ "senta04.mydomain.com:9100" ],
            "labels": {
                "instance": "senta04"
            }
        }
    ]

Notes
=====

Counters and gauges are exported; currently histograms are not, and
long-running averages are exported only as the ``_sum``/``_count`` pairs
described above. It's possible that Ceph's 2-D histograms could be reduced
to two separate 1-D histograms, and that long-running averages could be
exported as Prometheus' Summary type.

Timestamps, as with many Prometheus exporters, are established by
the server's scrape time (Prometheus expects that it is polling the
actual counter process synchronously). It is possible to supply a
timestamp along with the stat report, but the Prometheus team strongly
advises against this. This means that timestamps will be delayed by
an unpredictable amount; it's not clear if this will be problematic,
but it's worth knowing about.