=================
Prometheus plugin
=================

Provides a Prometheus exporter to pass on Ceph performance counters
from the collection point in ceph-mgr.  Ceph-mgr receives MMgrReport
messages from all MgrClient processes (mons and OSDs, for instance)
with performance counter schema data and actual counter data, and keeps
a circular buffer of the last N samples.  This plugin creates an HTTP
endpoint (like all Prometheus exporters) and retrieves the latest sample
of every counter when polled (or "scraped" in Prometheus terminology).
The HTTP path and query parameters are ignored; all extant counters
for all reporting entities are returned in text exposition format.
(See the Prometheus `documentation <https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-details>`_.)

Enabling prometheus output
==========================

The *prometheus* module is enabled with::

  ceph mgr module enable prometheus

Configuration
-------------

By default the module will accept HTTP requests on port ``9283`` on all
IPv4 and IPv6 addresses on the host.  The port and listen address are both
configurable with ``ceph config-key set``, with keys
``mgr/prometheus/server_addr`` and ``mgr/prometheus/server_port``.
This port is registered with Prometheus's `registry <https://github.com/prometheus/prometheus/wiki/Default-port-allocations>`_.

Statistic names and labels
==========================

The names of the stats are exactly as Ceph names them, with
illegal characters ``.``, ``-`` and ``::`` translated to ``_``, 
and ``ceph_`` prefixed to all names.
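
The translation rule above can be sketched in a few lines of Python. This is
an illustration only, not the module's actual code: the function name
``prometheus_name`` and the sample counter names are assumptions made for
this example.

.. code-block:: python

    def prometheus_name(ceph_name: str) -> str:
        """Translate a Ceph counter name into a Prometheus-style name."""
        sanitized = ceph_name
        # Handle "::" first so that it becomes one "_" rather than two.
        for illegal in ("::", ".", "-"):
            sanitized = sanitized.replace(illegal, "_")
        # Every exported name carries the "ceph_" prefix.
        return "ceph_" + sanitized


    # Hypothetical counter names, used only to exercise the rule.
    simple = prometheus_name("osd.op_w")
    mixed = prometheus_name("a::b-c.d")

Under these assumptions ``simple`` is ``ceph_osd_op_w`` and ``mixed`` is
``ceph_a_b_c_d``.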


All *daemon* statistics have a ``ceph_daemon`` label such as "osd.123"
that identifies the type and ID of the daemon they come from.  Some
statistics can come from different types of daemon, so when querying
e.g. an OSD's RocksDB stats, you would probably want to filter
on ``ceph_daemon`` starting with "osd" to avoid mixing in the monitor
RocksDB stats.


The *cluster* statistics (i.e. those global to the Ceph cluster)
have labels appropriate to what they report on.  For example, 
metrics relating to pools have a ``pool_id`` label.

Pool and OSD metadata series
----------------------------

Special series are output to enable displaying and querying on
certain metadata fields.

Pools have a ``ceph_pool_metadata`` field like this:

::

    ceph_pool_metadata{pool_id="2",name="cephfs_metadata_a"} 1.0

OSDs have a ``ceph_osd_metadata`` field like this:

::

    ceph_osd_metadata{cluster_addr="172.21.9.34:6802/19096",device_class="ssd",ceph_daemon="osd.0",public_addr="172.21.9.34:6801/19096",weight="1.0"} 1.0
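
As a rough illustration of how a consumer might read such a metadata series,
the following Python sketch extracts the labels from the sample line above
and checks whether it belongs to an OSD.  The helper ``parse_labels`` is
hypothetical and deliberately minimal: it assumes label values contain no
escaped quotes, which holds for the example but not for the exposition
format in general.

.. code-block:: python

    import re

    # Matches one label="value" pair inside the braces of a sample line.
    LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


    def parse_labels(sample_line: str) -> dict:
        """Return the labels of one text-exposition sample line as a dict."""
        start = sample_line.index("{")
        end = sample_line.rindex("}")
        return dict(LABEL_RE.findall(sample_line[start + 1:end]))


    line = ('ceph_osd_metadata{cluster_addr="172.21.9.34:6802/19096",'
            'device_class="ssd",ceph_daemon="osd.0",'
            'public_addr="172.21.9.34:6801/19096",weight="1.0"} 1.0')
    labels = parse_labels(line)
    # Filter on the daemon type, as suggested for daemon statistics.
    is_osd = labels["ceph_daemon"].startswith("osd.")

In practice a Prometheus server does this parsing for you; the sketch is
only meant to show which labels the series carries.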


Correlating drive statistics with node_exporter
-----------------------------------------------

The prometheus output from Ceph is designed to be used in conjunction
with the generic host monitoring from the Prometheus node_exporter.

To enable correlation of Ceph OSD statistics with node_exporter's 
drive statistics, special series are output like this:

::

    ceph_disk_occupation{ceph_daemon="osd.0",device="sdd", exported_instance="myhost"}

To use this to get disk statistics by OSD ID, use either the ``and`` operator or
the ``*`` operator in your prometheus query. All metadata metrics (like
``ceph_disk_occupation``) have the value 1, so they act as a neutral element
with ``*``. Using ``*`` allows the use of the ``group_left`` and
``group_right`` grouping modifiers, so that the resulting metric has
additional labels from one side of the query.

See the
`prometheus documentation`__ for more information about constructing queries.

__ https://prometheus.io/docs/prometheus/latest/querying/basics

The goal is to run a query like

::

    rate(node_disk_bytes_written[30s]) and on (device,instance) ceph_disk_occupation{ceph_daemon="osd.0"}

Out of the box the above query will not return any metrics since the ``instance`` labels of
both metrics don't match. The ``instance`` label of ``ceph_disk_occupation``
will be the currently active MGR node.

The following two sections outline two approaches to remedy this.

Use label_replace
=================

The ``label_replace`` function (see the
`label_replace documentation <https://prometheus.io/docs/prometheus/latest/querying/functions/#label_replace>`_)
can add a label to, or alter a label of, a metric within a query.

To correlate an OSD with its disk's write rate, the following query can be used:

::

    label_replace(rate(node_disk_bytes_written[30s]), "exported_instance", "$1", "instance", "(.*):.*") and on (device,exported_instance) ceph_disk_occupation{ceph_daemon="osd.0"}
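
To make the effect of the query above concrete, the following Python sketch
emulates ``label_replace`` on a single set of labels and then checks the
``on (device,exported_instance)`` match.  It is an approximation for
illustration, not Prometheus code: the sample ``instance`` value
``myhost:9100`` is an assumption, and ``re.fullmatch`` is used because
Prometheus anchors the regular expression at both ends.

.. code-block:: python

    import re


    def label_replace(labels, dst_label, replacement, src_label, regex):
        """Rough emulation of PromQL label_replace for one label set."""
        match = re.fullmatch(regex, labels.get(src_label, ""))
        result = dict(labels)
        if match is None:
            # No match: the series is returned unchanged.
            return result
        # PromQL writes capture groups as $1; Python expects \g<1>.
        template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
        result[dst_label] = match.expand(template)
        return result


    # Assumed label sets for one node_exporter and one Ceph series.
    node_series = {"device": "sdd", "instance": "myhost:9100"}
    ceph_series = {"ceph_daemon": "osd.0", "device": "sdd",
                   "exported_instance": "myhost"}

    rewritten = label_replace(node_series, "exported_instance", "$1",
                              "instance", "(.*):.*")
    # Both series now agree on the labels listed in the on(...) clause.
    joinable = all(rewritten[key] == ceph_series[key]
                   for key in ("device", "exported_instance"))

Here the port is stripped from ``instance`` and the remaining hostname is
copied into ``exported_instance``, which is what lets the two series match.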

Configuring Prometheus server
=============================

honor_labels
------------

To enable Ceph to output properly-labelled data relating to any host,
use the ``honor_labels`` setting when adding the ceph-mgr endpoints
to your prometheus configuration.

This allows Ceph to export the proper ``instance`` label without Prometheus
overwriting it. Without this setting, Prometheus applies an ``instance`` label
that includes the hostname and port of the endpoint that the series came from.
Because Ceph clusters have multiple manager daemons, this results in an
``instance`` label that changes spuriously when the active manager daemon
changes.

node_exporter hostname labels
-----------------------------

Set your ``instance`` labels to match what appears in Ceph's OSD metadata
in the ``instance`` field.  This is generally the short hostname of the node.

This is only necessary if you want to correlate Ceph stats with host stats,
but you may find it useful to do it anyway so that the correlation is
possible in the future.

Example configuration
---------------------

This example shows a single-node configuration running ceph-mgr and
node_exporter on a server called ``senta04``. Note that this requires adding the
appropriate instance label to every ``node_exporter`` target individually.

This is just an example: there are other ways to configure prometheus
scrape targets and label rewrite rules.

prometheus.yml
~~~~~~~~~~~~~~

::

    global:
      scrape_interval:     15s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: 'node'
        file_sd_configs:
          - files:
            - node_targets.yml
      - job_name: 'ceph'
        honor_labels: true
        file_sd_configs:
          - files:
            - ceph_targets.yml


ceph_targets.yml
~~~~~~~~~~~~~~~~


::

    [
        {
            "targets": [ "senta04.mydomain.com:9283" ],
            "labels": {}
        }
    ]


node_targets.yml
~~~~~~~~~~~~~~~~

::

    [
        {
            "targets": [ "senta04.mydomain.com:9100" ],
            "labels": {
                "instance": "senta04"
            }
        }
    ]


Notes
=====

Counters and gauges are exported; currently histograms and long-running 
averages are not.  It's possible that Ceph's 2-D histograms could be 
reduced to two separate 1-D histograms, and that long-running averages
could be exported as Prometheus' Summary type.

Timestamps, as with many Prometheus exporters, are established by
the server's scrape time (Prometheus expects that it is polling the
actual counter process synchronously).  It is possible to supply a
timestamp along with the stat report, but the Prometheus team strongly
advises against this.  This means that timestamps will be delayed by
an unpredictable amount; it's not clear if this will be problematic,
but it's worth knowing about.