.. _mgr-prometheus:

=================
Prometheus Module
=================

Provides a Prometheus exporter to pass on Ceph performance counters
from the collection point in ceph-mgr. Ceph-mgr receives MMgrReport
messages from all MgrClient processes (mons and OSDs, for instance)
with performance counter schema data and actual counter data, and keeps
a circular buffer of the last N samples. This module creates an HTTP
endpoint (like all Prometheus exporters) and retrieves the latest sample
of every counter when polled (or "scraped" in Prometheus terminology).
The HTTP path and query parameters are ignored; all extant counters
for all reporting entities are returned in text exposition format.
(See the Prometheus `documentation <https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-details>`_.)

Enabling prometheus output
==========================

The *prometheus* module is enabled with::

    ceph mgr module enable prometheus

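Once enabled, you can confirm that the module is up and find the endpoint it
serves with ``ceph mgr services``, which lists the URIs published by the
active manager; the output will look something like this (the hostname will
vary)::

    ceph mgr services
    {
        "prometheus": "http://myhost:9283/"
    }
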
Configuration
-------------

.. note::

    The Prometheus manager module needs to be restarted for configuration
    changes to be applied.

By default the module will accept HTTP requests on port ``9283`` on all IPv4
and IPv6 addresses on the host. The port and listen address are both
configurable with ``ceph config set``, with keys
``mgr/prometheus/server_addr`` and ``mgr/prometheus/server_port``. This port
is registered with Prometheus's `registry
<https://github.com/prometheus/prometheus/wiki/Default-port-allocations>`_.

::

    ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
    ceph config set mgr mgr/prometheus/server_port 9283

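To quickly verify that the exporter is reachable on the configured address,
you can fetch the metrics endpoint directly; the hostname and port below are
the defaults, so adjust them to your deployment::

    curl http://localhost:9283/metrics | head
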
.. warning::

    The ``scrape_interval`` of this module should always be set to match
    the Prometheus scrape interval to work properly and not cause any issues.

The Prometheus manager module is, by default, configured with a scrape interval
of 15 seconds. The scrape interval in the module is used for caching purposes
and to determine when a cache is stale.

It is not recommended to use a scrape interval below 10 seconds. It is
recommended to use 15 seconds as the scrape interval, though in some cases it
might be useful to increase it.

To set a different scrape interval in the Prometheus module, set
``scrape_interval`` to the desired value::

    ceph config set mgr mgr/prometheus/scrape_interval 20

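The matching interval on the Prometheus side is configured per scrape job. A
minimal fragment, assuming a manager reachable at the placeholder address
``ceph-mgr-host:9283``, might look like this::

    scrape_configs:
      - job_name: 'ceph'
        scrape_interval: 20s
        honor_labels: true
        static_configs:
          - targets: ['ceph-mgr-host:9283']
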
On large clusters (>1000 OSDs), the time to fetch the metrics may become
significant. Without the cache, the Prometheus manager module could, especially
in conjunction with multiple Prometheus instances, overload the manager and lead
to unresponsive or crashing Ceph manager instances. Hence, the cache is enabled
by default. This means that there is a possibility that the cache becomes
stale. The cache is considered stale when the time to fetch the metrics from
Ceph exceeds the configured ``mgr/prometheus/scrape_interval``.

If that is the case, **a warning will be logged** and the module will either

* respond with a 503 HTTP status code (service unavailable), or
* return the content of the cache, even though it might be stale.

This behavior can be configured. By default, it will return a 503 HTTP status
code (service unavailable). You can set other options using the ``ceph config
set`` commands.

To tell the module to respond with possibly stale data, set it to ``return``::

    ceph config set mgr mgr/prometheus/stale_cache_strategy return

To tell the module to respond with "service unavailable", set it to ``fail``::

    ceph config set mgr mgr/prometheus/stale_cache_strategy fail

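With the default ``fail`` strategy, a stale cache surfaces as an HTTP 503 on
scrape; you can observe the status code directly (hostname and port are again
the defaults)::

    curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9283/metrics
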
If you are confident that you don't require the cache, you can disable it::

    ceph config set mgr mgr/prometheus/cache false

.. _prometheus-rbd-io-statistics:

RBD IO statistics
-----------------

The module can optionally collect RBD per-image IO statistics by enabling
dynamic OSD performance counters. The statistics are gathered for all images
in the pools that are specified in the ``mgr/prometheus/rbd_stats_pools``
configuration parameter. The parameter is a comma or space separated list
of ``pool[/namespace]`` entries. If the namespace is not specified the
statistics are collected for all namespaces in the pool.

Example to activate the RBD-enabled pools ``pool1``, ``pool2`` and ``poolN``::

    ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2,poolN"

The module builds the list of all available images by scanning the specified
pools and namespaces, and refreshes it periodically. The period is
configurable via the ``mgr/prometheus/rbd_stats_pools_refresh_interval``
parameter (in seconds) and is 300 seconds (5 minutes) by default. The module
will force an earlier refresh if it detects statistics from a previously
unknown RBD image.

Example to increase the refresh interval to 10 minutes::

    ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 600

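Once enabled, the per-image statistics are exported as series labelled by
pool, namespace and image. For example, the ten images with the highest write
throughput could be charted with the query below; the metric name
``ceph_rbd_write_bytes`` matches what current releases export, but verify it
against your endpoint::

    topk(10, rate(ceph_rbd_write_bytes[1m]))
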
Statistic names and labels
==========================

The names of the stats are exactly as Ceph names them, with
illegal characters ``.``, ``-`` and ``::`` translated to ``_``,
and ``ceph_`` prefixed to all names.

All *daemon* statistics have a ``ceph_daemon`` label such as "osd.123"
that identifies the type and ID of the daemon they come from. Some
statistics can come from different types of daemon, so when querying
e.g. an OSD's RocksDB stats, you would probably want to filter
on ``ceph_daemon`` starting with "osd" to avoid mixing in the monitor
RocksDB stats.

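For example, a regex matcher on ``ceph_daemon`` restricts a query to OSDs.
The metric name below is an assumption based on the naming scheme described
above, so check your endpoint for the exact name::

    ceph_rocksdb_submit_latency_sum{ceph_daemon=~"osd.*"}
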
The *cluster* statistics (i.e. those global to the Ceph cluster)
have labels appropriate to what they report on. For example,
metrics relating to pools have a ``pool_id`` label.

The long running averages that represent the histograms from core Ceph
are represented by a pair of ``<name>_sum`` and ``<name>_count`` metrics.
This is similar to how histograms are represented in `Prometheus <https://prometheus.io/docs/concepts/metric_types/#histogram>`_
and they can also be treated `similarly <https://prometheus.io/docs/practices/histograms/>`_.

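A recent average can be derived by dividing the rates of such a pair. A
sketch using the OSD read latency counters, which recent releases export
under these names::

    rate(ceph_osd_op_r_latency_sum[5m]) / rate(ceph_osd_op_r_latency_count[5m])
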
Pool and OSD metadata series
----------------------------

Special series are output to enable displaying and querying on
certain metadata fields.

Pools have a ``ceph_pool_metadata`` field like this:

::

    ceph_pool_metadata{pool_id="2",name="cephfs_metadata_a"} 1.0

OSDs have a ``ceph_osd_metadata`` field like this:

::

    ceph_osd_metadata{cluster_addr="172.21.9.34:6802/19096",device_class="ssd",ceph_daemon="osd.0",public_addr="172.21.9.34:6801/19096",weight="1.0"} 1.0

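Because these metadata series always have the value 1, they can be multiplied
onto numeric series to attach extra labels without changing the values. For
example, to label per-pool statistics with pool names (``ceph_pool_objects``
is one of the per-pool series current releases export)::

    ceph_pool_objects * on (pool_id) group_left(name) ceph_pool_metadata
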
Correlating drive statistics with node_exporter
-----------------------------------------------

The prometheus output from Ceph is designed to be used in conjunction
with the generic host monitoring from the Prometheus node_exporter.

To enable correlation of Ceph OSD statistics with node_exporter's
drive statistics, special series are output like this:

::

    ceph_disk_occupation{ceph_daemon="osd.0",device="sdd", exported_instance="myhost"}

To use this to get disk statistics by OSD ID, use either the ``and`` operator or
the ``*`` operator in your prometheus query. All metadata metrics (like
``ceph_disk_occupation``) have the value 1 so they act neutral with ``*``. Using ``*``
allows the use of ``group_left`` and ``group_right`` grouping modifiers, so that
the resulting metric has additional labels from one side of the query.

See the
`prometheus documentation`__ for more information about constructing queries.

__ https://prometheus.io/docs/prometheus/latest/querying/basics

The goal is to run a query like

::

    rate(node_disk_bytes_written[30s]) and on (device,instance) ceph_disk_occupation{ceph_daemon="osd.0"}

Out of the box the above query will not return any metrics since the ``instance`` labels of
both metrics don't match. The ``instance`` label of ``ceph_disk_occupation``
will be the currently active MGR node.

The following two sections outline two approaches to remedy this.

Use label_replace
=================

The ``label_replace`` function (see the
`label_replace documentation <https://prometheus.io/docs/prometheus/latest/querying/functions/#label_replace>`_)
can add a label to, or alter a label of, a metric within a query.

To correlate an OSD and its disk's write rate, the following query can be used:

::

    label_replace(rate(node_disk_bytes_written[30s]), "exported_instance", "$1", "instance", "(.*):.*") and on (device,exported_instance) ceph_disk_occupation{ceph_daemon="osd.0"}

Configuring Prometheus server
=============================

honor_labels
------------

To enable Ceph to output properly-labeled data relating to any host,
use the ``honor_labels`` setting when adding the ceph-mgr endpoints
to your prometheus configuration.

This allows Ceph to export the proper ``instance`` label without Prometheus
overwriting it. Without this setting, Prometheus applies an ``instance`` label
that includes the hostname and port of the endpoint that the series came from.
Because Ceph clusters have multiple manager daemons, this results in an
``instance`` label that changes spuriously when the active manager daemon
changes.

If this is undesirable a custom ``instance`` label can be set in the
Prometheus target configuration: you might wish to set it to the hostname
of your first mgr daemon, or something completely arbitrary like "ceph_cluster".

node_exporter hostname labels
-----------------------------

Set your ``instance`` labels to match what appears in Ceph's OSD metadata
in the ``instance`` field. This is generally the short hostname of the node.

This is only necessary if you want to correlate Ceph stats with host stats,
but you may find it useful to do in all cases, in case you want to perform
the correlation in the future.

Example configuration
---------------------

This example shows a single node configuration running ceph-mgr and
node_exporter on a server called ``senta04``. Note that this requires one
to add an appropriate and unique ``instance`` label to each ``node_exporter`` target.

This is just an example: there are other ways to configure prometheus
scrape targets and label rewrite rules.

prometheus.yml
~~~~~~~~~~~~~~

::

    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: 'node'
        file_sd_configs:
          - files:
            - node_targets.yml
      - job_name: 'ceph'
        honor_labels: true
        file_sd_configs:
          - files:
            - ceph_targets.yml

ceph_targets.yml
~~~~~~~~~~~~~~~~

::

    [
        {
            "targets": [ "senta04.mydomain.com:9283" ],
            "labels": {}
        }
    ]

node_targets.yml
~~~~~~~~~~~~~~~~

::

    [
        {
            "targets": [ "senta04.mydomain.com:9100" ],
            "labels": {
                "instance": "senta04"
            }
        }
    ]

Notes
=====

Counters and gauges are exported; currently histograms and long-running
averages are not. It's possible that Ceph's 2-D histograms could be
reduced to two separate 1-D histograms, and that long-running averages
could be exported as Prometheus' Summary type.

Timestamps, as with many Prometheus exporters, are established by
the server's scrape time (Prometheus expects that it is polling the
actual counter process synchronously). It is possible to supply a
timestamp along with the stat report, but the Prometheus team strongly
advises against this. This means that timestamps will be delayed by
an unpredictable amount; it's not clear if this will be problematic,
but it's worth knowing about.