======================
 Monitoring a Cluster
======================

Once you have a running cluster, you may use the ``ceph`` tool to monitor your
cluster. Monitoring a cluster typically involves checking OSD status, monitor
status, placement group status, and metadata server status.

Using the command line
======================

Interactive mode
----------------

To run the ``ceph`` tool in interactive mode, type ``ceph`` at the command line
with no arguments. For example::

    ceph
    ceph> health
    ceph> status
    ceph> quorum_status
    ceph> mon_status

Non-default paths
-----------------

If you specified non-default locations for your configuration or keyring,
you may specify their locations::

    ceph -c /path/to/conf -k /path/to/keyring health

Checking a Cluster's Status
===========================

After you start your cluster, and before you start reading and/or
writing data, check your cluster's status first.

To check a cluster's status, execute the following::

    ceph status

Or::

    ceph -s

In interactive mode, type ``status`` and press **Enter**. ::

    ceph> status

Ceph will print the cluster status. For example, a tiny Ceph demonstration
cluster with one of each service may print the following:

::

  cluster:
    id:     477e46f1-ae41-4e43-9c8f-72c918ab0a20
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c
    mgr: x(active)
    mds: cephfs_a-1/1/1 up {0=a=up:active}, 2 up:standby
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   2 pools, 16 pgs
    objects: 21 objects, 2.19K
    usage:   546 GB used, 384 GB / 931 GB avail
    pgs:     16 active+clean

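
For scripting, the same status is available in machine-readable form. As a
minimal sketch (not part of the Ceph documentation), the snippet below parses
the JSON that ``ceph status --format json`` emits; the ``health`` -> ``status``
key layout is an assumption based on Luminous-era output, so verify it against
your own cluster before relying on it.

```python
import json

def is_healthy(status_json: str) -> bool:
    """Return True when the parsed status document reports HEALTH_OK."""
    # Assumed key layout: {"health": {"status": "HEALTH_OK", ...}, ...}
    status = json.loads(status_json)
    return status["health"]["status"] == "HEALTH_OK"

# Abbreviated example document, mirroring the sample cluster above:
sample = ('{"fsid": "477e46f1-ae41-4e43-9c8f-72c918ab0a20",'
          ' "health": {"status": "HEALTH_OK"}}')
print(is_healthy(sample))
```

A monitoring script could feed the output of ``ceph status --format json``
straight into such a check instead of scraping the human-readable text.
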

.. topic:: How Ceph Calculates Data Usage

   The ``usage`` value reflects the *actual* amount of raw storage used. The
   ``xxx GB / xxx GB`` value means the amount available (the lesser number)
   out of the overall storage capacity of the cluster. The notional number
   reflects the size of the stored data before it is replicated, cloned or
   snapshotted. Therefore, the amount of data actually stored typically
   exceeds the notional amount stored, because Ceph creates replicas of the
   data and may also use storage capacity for cloning and snapshotting.

Watching a Cluster
==================

In addition to local logging by each daemon, Ceph clusters maintain
a *cluster log* that records high level events about the whole system.
This is logged to disk on monitor servers (as ``/var/log/ceph/ceph.log`` by
default), but can also be monitored via the command line.

To follow the cluster log, use the following command::

    ceph -w

Ceph will print the status of the system, followed by each log message as it
is emitted. For example:

::

  cluster:
    id:     477e46f1-ae41-4e43-9c8f-72c918ab0a20
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c
    mgr: x(active)
    mds: cephfs_a-1/1/1 up {0=a=up:active}, 2 up:standby
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   2 pools, 16 pgs
    objects: 21 objects, 2.19K
    usage:   546 GB used, 384 GB / 931 GB avail
    pgs:     16 active+clean


  2017-07-24 08:15:11.329298 mon.a mon.0 172.21.9.34:6789/0 23 : cluster [INF] osd.0 172.21.9.34:6806/20527 boot
  2017-07-24 08:15:14.258143 mon.a mon.0 172.21.9.34:6789/0 39 : cluster [INF] Activating manager daemon x
  2017-07-24 08:15:15.446025 mon.a mon.0 172.21.9.34:6789/0 47 : cluster [INF] Manager daemon x is now available


In addition to using ``ceph -w`` to print log lines as they are emitted,
use ``ceph log last [n]`` to see the most recent ``n`` lines from the cluster
log.
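
Log lines like the samples above follow a regular shape, which makes them easy
to dissect in a script. The field breakdown below is inferred from the sample
output on this page, not from a formal grammar, so treat it as a sketch:

```python
import re

# Assumed layout: timestamp, daemon name, rank, address, sequence number,
# a colon, the log channel, a severity tag, then the message.
LOG_LINE = re.compile(
    r'(?P<stamp>\S+ \S+) '           # date and time
    r'(?P<name>\S+) (?P<rank>\S+) '  # daemon name and rank (e.g. mon.a mon.0)
    r'(?P<addr>\S+) '                # network address
    r'(?P<seq>\d+) : '               # sequence number
    r'(?P<channel>\w+) '             # log channel (e.g. cluster)
    r'\[(?P<level>\w+)\] '           # severity: INF, WRN, ERR
    r'(?P<message>.*)'               # the event itself
)

line = ('2017-07-24 08:15:14.258143 mon.a mon.0 172.21.9.34:6789/0 39 : '
        'cluster [INF] Activating manager daemon x')
m = LOG_LINE.match(line)
print(m.group('level'), m.group('message'))
```

Filtering on the ``level`` group is a quick way to surface only ``[WRN]`` and
``[ERR]`` events from a tail of ``/var/log/ceph/ceph.log``.
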

Monitoring Health Checks
========================

Ceph continuously runs various *health checks* against its own status. When
a health check fails, this is reflected in the output of ``ceph status`` (or
``ceph health``). In addition, messages are sent to the cluster log to
indicate when a check fails, and when the cluster recovers.

For example, when an OSD goes down, the ``health`` section of the status
output may be updated as follows:

::

    health: HEALTH_WARN
            1 osds down
            Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded

At this time, cluster log messages are also emitted to record the failure of
the health checks:

::

    2017-07-25 10:08:58.265945 mon.a mon.0 172.21.9.34:6789/0 91 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
    2017-07-25 10:09:01.302624 mon.a mon.0 172.21.9.34:6789/0 94 : cluster [WRN] Health check failed: Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded (PG_DEGRADED)

When the OSD comes back online, the cluster log records the cluster's return
to a healthy state:

::

    2017-07-25 10:11:11.526841 mon.a mon.0 172.21.9.34:6789/0 109 : cluster [WRN] Health check update: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized (PG_DEGRADED)
    2017-07-25 10:11:13.535493 mon.a mon.0 172.21.9.34:6789/0 110 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized)
    2017-07-25 10:11:13.535577 mon.a mon.0 172.21.9.34:6789/0 111 : cluster [INF] Cluster is now healthy

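
The failed/cleared message pairs above suggest a simple way to track which
check codes are currently firing. The patterns below are inferred from the
sample log lines on this page (the code in parentheses after "Health check
failed", the leading code after "Health check cleared"); they are a sketch,
not a parser for every message the monitors can emit:

```python
import re

# Patterns inferred from the sample messages above (an assumption, not a spec).
FAILED = re.compile(r'Health check failed: .*\((?P<code>[A-Z_]+)\)$')
CLEARED = re.compile(r'Health check cleared: (?P<code>[A-Z_]+)')

def active_checks(messages):
    """Replay log messages and return the set of still-failing check codes."""
    active = set()
    for msg in messages:
        if (m := FAILED.search(msg)):
            active.add(m.group('code'))
        elif (m := CLEARED.search(msg)):
            active.discard(m.group('code'))
    return active

msgs = [
    'Health check failed: 1 osds down (OSD_DOWN)',
    'Health check failed: Degraded data redundancy: 21/63 objects degraded '
    '(33.333%), 16 pgs degraded (PG_DEGRADED)',
    'Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2 pgs degraded)',
]
print(sorted(active_checks(msgs)))
```

In practice ``ceph health detail`` already reports the active checks directly;
the replay above is only useful when working from archived cluster logs.
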
Network Performance Checks
--------------------------

Ceph OSDs send heartbeat ping messages amongst themselves to monitor daemon
availability. Ceph also uses the response times to monitor network
performance. While it is possible that a busy OSD could delay a ping
response, we can assume that if a network switch fails, multiple delays will
be detected between distinct pairs of OSDs.

By default we will warn about ping times which exceed 1 second (1000
milliseconds)::

    HEALTH_WARN Long heartbeat ping times on back interface seen, longest is 1118.001 msec

The health detail will show which combinations of OSDs are seeing the delays
and by how much. There is a limit of 10 detail line items. ::

    [WRN] OSD_SLOW_PING_TIME_BACK: Long heartbeat ping times on back interface seen, longest is 1118.001 msec
        Slow heartbeat ping on back interface from osd.0 to osd.1 1118.001 msec
        Slow heartbeat ping on back interface from osd.0 to osd.2 1030.123 msec
        Slow heartbeat ping on back interface from osd.2 to osd.1 1015.321 msec
        Slow heartbeat ping on back interface from osd.1 to osd.0 1010.456 msec

To see even more detail and a complete dump of network performance
information, use the ``dump_osd_network`` command. Typically, this would be
sent to a mgr, but it can be limited to a particular OSD's interactions by
issuing it to that OSD. The current threshold, which defaults to 1 second
(1000 milliseconds), can be overridden as an argument in milliseconds.

The following command will show all gathered network performance data by
specifying a threshold of 0 and sending to the mgr.

192 | ||
193 | :: | |
194 | ||
195 | $ ceph daemon /var/run/ceph/ceph-mgr.x.asok dump_osd_network 0 | |
196 | { | |
197 | "threshold": 0, | |
198 | "entries": [ | |
199 | { | |
200 | "last update": "Wed Sep 4 17:04:49 2019", | |
201 | "stale": false, | |
202 | "from osd": 2, | |
203 | "to osd": 0, | |
204 | "interface": "front", | |
205 | "average": { | |
206 | "1min": 1.023, | |
207 | "5min": 0.860, | |
208 | "15min": 0.883 | |
209 | }, | |
210 | "min": { | |
211 | "1min": 0.818, | |
212 | "5min": 0.607, | |
213 | "15min": 0.607 | |
214 | }, | |
215 | "max": { | |
216 | "1min": 1.164, | |
217 | "5min": 1.173, | |
218 | "15min": 1.544 | |
219 | }, | |
220 | "last": 0.924 | |
221 | }, | |
222 | { | |
223 | "last update": "Wed Sep 4 17:04:49 2019", | |
224 | "stale": false, | |
225 | "from osd": 2, | |
226 | "to osd": 0, | |
227 | "interface": "back", | |
228 | "average": { | |
229 | "1min": 0.968, | |
230 | "5min": 0.897, | |
231 | "15min": 0.830 | |
232 | }, | |
233 | "min": { | |
234 | "1min": 0.860, | |
235 | "5min": 0.563, | |
236 | "15min": 0.502 | |
237 | }, | |
238 | "max": { | |
239 | "1min": 1.171, | |
240 | "5min": 1.216, | |
241 | "15min": 1.456 | |
242 | }, | |
243 | "last": 0.845 | |
244 | }, | |
245 | { | |
246 | "last update": "Wed Sep 4 17:04:48 2019", | |
247 | "stale": false, | |
248 | "from osd": 0, | |
249 | "to osd": 1, | |
250 | "interface": "front", | |
251 | "average": { | |
252 | "1min": 0.965, | |
253 | "5min": 0.811, | |
254 | "15min": 0.850 | |
255 | }, | |
256 | "min": { | |
257 | "1min": 0.650, | |
258 | "5min": 0.488, | |
259 | "15min": 0.466 | |
260 | }, | |
261 | "max": { | |
262 | "1min": 1.252, | |
263 | "5min": 1.252, | |
264 | "15min": 1.362 | |
265 | }, | |
266 | "last": 0.791 | |
267 | }, | |
268 | ... | |
269 | ||
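
A dump like the one above can be filtered programmatically. As a sketch, the
helper below keeps only the non-stale OSD pairs whose 1-minute average exceeds
a threshold; the field names (``from osd``, ``to osd``, ``average``/``1min``)
are taken from the sample output, not from an API specification:

```python
def slow_pairs(entries, threshold_ms=1000.0):
    """Return (from, to, interface, 1min avg) for entries over the threshold."""
    return [
        (e["from osd"], e["to osd"], e["interface"], e["average"]["1min"])
        for e in entries
        if not e["stale"] and e["average"]["1min"] > threshold_ms
    ]

# Two abbreviated entries modeled on the dump above (values in milliseconds):
entries = [
    {"from osd": 2, "to osd": 0, "interface": "front", "stale": False,
     "average": {"1min": 1.023, "5min": 0.860, "15min": 0.883}},
    {"from osd": 0, "to osd": 1, "interface": "front", "stale": False,
     "average": {"1min": 0.965, "5min": 0.811, "15min": 0.850}},
]
print(slow_pairs(entries, threshold_ms=1.0))
```

With the default 1000 ms threshold neither sample entry would be flagged;
lowering the threshold, as in the call above, surfaces the slower pair.
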

Detecting configuration issues
==============================

In addition to the health checks that Ceph continuously runs on its
own status, there are some configuration issues that may only be detected
by an external tool.

Use the `ceph-medic`_ tool to run these additional checks on your Ceph
cluster's configuration.

Checking a Cluster's Usage Stats
================================

To check a cluster's data usage and data distribution among pools, you can
use the ``df`` option. It is similar to Linux ``df``. Execute
the following::

    ceph df

The **RAW STORAGE** section of the output provides an overview of the
amount of storage that is managed by your cluster.

- **CLASS:** The class of OSD device (or the total for the cluster).
- **SIZE:** The amount of storage capacity managed by the cluster.
- **AVAIL:** The amount of free space available in the cluster.
- **USED:** The amount of raw storage consumed by user data.
- **RAW USED:** The amount of raw storage consumed by user data, internal
  overhead, or reserved capacity.
- **%RAW USED:** The percentage of raw storage used. Use this number in
  conjunction with the ``full ratio`` and ``near full ratio`` to ensure that
  you are not reaching your cluster's capacity. See `Storage Capacity`_ for
  additional details.

The **POOLS** section of the output provides a list of pools and the notional
usage of each pool. The output from this section **DOES NOT** reflect replicas,
clones or snapshots. For example, if you store an object with 1MB of data, the
notional usage will be 1MB, but the actual usage may be 2MB or more depending
on the number of replicas, clones and snapshots.

- **NAME:** The name of the pool.
- **ID:** The pool ID.
- **USED:** The notional amount of data stored in kilobytes, unless the number
  appends **M** for megabytes or **G** for gigabytes.
- **%USED:** The notional percentage of storage used per pool.
- **MAX AVAIL:** An estimate of the notional amount of data that can be written
  to this pool.
- **OBJECTS:** The notional number of objects stored per pool.

.. note:: The numbers in the **POOLS** section are notional. They are not
   inclusive of the number of replicas, snapshots or clones. As a result,
   the sum of the **USED** and **%USED** amounts will not add up to the
   **USED** and **%USED** amounts in the **RAW** section of the
   output.

.. note:: The **MAX AVAIL** value is a complicated function of the
   replication or erasure code used, the CRUSH rule that maps storage
   to devices, the utilization of those devices, and the configured
   ``mon_osd_full_ratio``.

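
The notional-vs-raw relationship described above can be made concrete with a
little arithmetic. This is an illustrative sketch only (it ignores journals,
metadata, and allocation overhead, so real raw usage will be somewhat higher):

```python
def raw_used_estimate(notional_bytes, replicas=3):
    """Lower bound on raw usage for a replicated pool: each byte is stored
    once per replica."""
    return notional_bytes * replicas

def pct_raw_used(raw_used, raw_total):
    """The %RAW USED figure: raw consumption over total raw capacity."""
    return 100.0 * raw_used / raw_total

one_mb = 1 << 20
# With 3x replication, 1 MB of user data occupies at least 3 MB of raw storage.
print(raw_used_estimate(one_mb))
# The sample cluster earlier on this page: 546 GB used of 931 GB total.
print(round(pct_raw_used(546, 931), 1))
```

This is why the pool-level **USED** figures never sum to the raw **USED**
figure: the raw number multiplies each pool's notional usage by its
replication (or erasure-coding) overhead.
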

Checking OSD Status
===================

You can check OSDs to ensure they are ``up`` and ``in`` by executing::

    ceph osd stat

Or::

    ceph osd dump

You can also view OSDs according to their position in the CRUSH map. ::

    ceph osd tree

Ceph will print out a CRUSH tree with a host, its OSDs, whether they are up
and their weight. ::

    #ID CLASS WEIGHT  TYPE NAME             STATUS REWEIGHT PRI-AFF
     -1       3.00000 pool default
     -3       3.00000 rack mainrack
     -2       3.00000 host osd-host
      0   ssd 1.00000         osd.0             up  1.00000 1.00000
      1   ssd 1.00000         osd.1             up  1.00000 1.00000
      2   ssd 1.00000         osd.2             up  1.00000 1.00000

For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_.

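For automation, ``ceph osd dump --format json`` emits the same information in
JSON. The sketch below assumes the dump contains an ``osds`` array with 0/1
``up`` and ``in`` flags per OSD; that layout is based on Luminous-era output
and should be verified against your release:

```python
import json

def osd_counts(dump_json: str):
    """Return (total, up, in) counts from an osd dump JSON document."""
    # Assumed layout: {"osds": [{"osd": 0, "up": 1, "in": 1}, ...], ...}
    osds = json.loads(dump_json)["osds"]
    return (len(osds),
            sum(o["up"] for o in osds),
            sum(o["in"] for o in osds))

# Abbreviated example with one OSD down:
sample = ('{"osds": [{"osd": 0, "up": 1, "in": 1},'
          ' {"osd": 1, "up": 0, "in": 1},'
          ' {"osd": 2, "up": 1, "in": 1}]}')
print(osd_counts(sample))
```

Comparing the three counts is the scripted equivalent of eyeballing the
``osd: 3 osds: 3 up, 3 in`` line in ``ceph status``.
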
Checking Monitor Status
=======================

If your cluster has multiple monitors (likely), you should check the monitor
quorum status after you start the cluster and before reading and/or writing
data. A quorum must be present when multiple monitors are running. You should
also check monitor status periodically to ensure that they are running.

To display the monitor map, execute the following::

    ceph mon stat

Or::

    ceph mon dump

To check the quorum status for the monitor cluster, execute the following::

    ceph quorum_status

Ceph will return the quorum status. For example, a Ceph cluster consisting of
three monitors may return the following:

.. code-block:: javascript

    { "election_epoch": 10,
      "quorum": [
            0,
            1,
            2],
      "quorum_names": [
            "a",
            "b",
            "c"],
      "quorum_leader_name": "a",
      "monmap": { "epoch": 1,
          "fsid": "444b489c-4f16-4b75-83f0-cb8097468898",
          "modified": "2011-12-12 13:28:27.505520",
          "created": "2011-12-12 13:28:27.505520",
          "features": {"persistent": [
                "kraken",
                "luminous",
                "mimic"],
            "optional": []
          },
          "mons": [
                { "rank": 0,
                  "name": "a",
                  "addr": "127.0.0.1:6789/0",
                  "public_addr": "127.0.0.1:6789/0"},
                { "rank": 1,
                  "name": "b",
                  "addr": "127.0.0.1:6790/0",
                  "public_addr": "127.0.0.1:6790/0"},
                { "rank": 2,
                  "name": "c",
                  "addr": "127.0.0.1:6791/0",
                  "public_addr": "127.0.0.1:6791/0"}
               ]
        }
    }

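Because a quorum requires a strict majority of the monitors in the monmap, the
output above can be checked mechanically. This sketch uses only the field
names visible in the sample document (``quorum`` and ``monmap``/``mons``):

```python
def has_majority(quorum_status: dict) -> bool:
    """True when more than half of the monmap's monitors are in quorum."""
    in_quorum = len(quorum_status["quorum"])
    total = len(quorum_status["monmap"]["mons"])
    return in_quorum > total // 2

# Abbreviated form of the sample output above: three monitors, all in quorum.
sample = {
    "quorum": [0, 1, 2],
    "quorum_names": ["a", "b", "c"],
    "quorum_leader_name": "a",
    "monmap": {"mons": [{"rank": 0, "name": "a"},
                        {"rank": 1, "name": "b"},
                        {"rank": 2, "name": "c"}]},
}
print(has_majority(sample))
```

With three monitors, losing one still leaves a 2-of-3 majority; losing two
drops the check to false, which is exactly when the cluster stops serving.
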
Checking MDS Status
===================

Metadata servers provide metadata services for CephFS. Metadata servers have
two sets of states: ``up | down`` and ``active | inactive``. To ensure your
metadata servers are ``up`` and ``active``, execute the following::

    ceph mds stat

To display details of the metadata cluster, execute the following::

    ceph fs dump


Checking Placement Group States
===============================

Placement groups map objects to OSDs. When you monitor your
placement groups, you will want them to be ``active`` and ``clean``.
For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_.

.. _Monitoring OSDs and Placement Groups: ../monitoring-osd-pg


Using the Admin Socket
======================

The Ceph admin socket allows you to query a daemon via a socket interface.
By default, Ceph sockets reside under ``/var/run/ceph``. To access a daemon
via the admin socket, log in to the host running the daemon and use the
following command::

    ceph daemon {daemon-name}
    ceph daemon {path-to-socket-file}

For example, the following are equivalent::

    ceph daemon osd.0 foo
    ceph daemon /var/run/ceph/ceph-osd.0.asok foo

To view the available admin socket commands, execute the following command::

    ceph daemon {daemon-name} help

The admin socket command enables you to show and set your configuration at
runtime. See `Viewing a Configuration at Runtime`_ for details.

Additionally, you can set configuration values at runtime directly. (That is,
the admin socket bypasses the monitor, unlike ``ceph tell {daemon-type}.{id}
config set``, which relies on the monitor but doesn't require you to log in
directly to the host in question.)

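Since admin socket commands print JSON, they are easy to wrap in a script run
on the daemon's host. The helper below is a hypothetical sketch (not part of
Ceph); it simply shells out to ``ceph daemon`` and parses the result, with the
argv construction split out so it can be inspected without a live cluster:

```python
import json
import subprocess

def build_cmd(daemon: str, *args: str) -> list:
    """Build the argv for ``ceph daemon {daemon} {command...}``."""
    return ["ceph", "daemon", daemon, *args]

def admin_socket(daemon: str, *args: str):
    """Run an admin socket command on this host and parse its JSON output.

    Requires the ``ceph`` CLI and a running daemon; must be executed on the
    host where the daemon's socket lives.
    """
    out = subprocess.check_output(build_cmd(daemon, *args))
    return json.loads(out)

# For example, fetching one setting (a real command on recent releases,
# though verify it exists on yours): admin_socket("osd.0", "config", "get",
# "debug_osd"). The argv it would run:
print(build_cmd("osd.0", "config", "get", "debug_osd"))
```
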
.. _Viewing a Configuration at Runtime: ../../configuration/ceph-conf#viewing-a-configuration-at-runtime
.. _Storage Capacity: ../../configuration/mon-config-ref#storage-capacity
.. _ceph-medic: http://docs.ceph.com/ceph-medic/master/