ceph> health
ceph> status
ceph> quorum_status
- ceph> mon_status
+ ceph> mon stat
Non-default paths
-----------------
Ceph OSDs send heartbeat ping messages amongst themselves to monitor daemon availability. We
also use the response times to monitor network performance.
While it is possible that a busy OSD could delay a ping response, we can assume
-that if a network switch fails mutiple delays will be detected between distinct pairs of OSDs.
+that if a network switch fails multiple delays will be detected between distinct pairs of OSDs.
By default we will warn about ping times which exceed 1 second (1000 milliseconds).
::
- HEALTH_WARN Long heartbeat ping times on back interface seen, longest is 1118.001 msec
+ HEALTH_WARN Slow OSD heartbeats on back (longest 1118.001ms)
The health detail will add the combination of OSDs are seeing the delays and by how much. There is a limit of 10
detail line items.
::
- [WRN] OSD_SLOW_PING_TIME_BACK: Long heartbeat ping times on back interface seen, longest is 1118.001 msec
- Slow heartbeat ping on back interface from osd.0 to osd.1 1118.001 msec
- Slow heartbeat ping on back interface from osd.0 to osd.2 1030.123 msec
- Slow heartbeat ping on back interface from osd.2 to osd.1 1015.321 msec
- Slow heartbeat ping on back interface from osd.1 to osd.0 1010.456 msec
+ [WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 1118.001ms)
+ Slow OSD heartbeats on back from osd.0 [dc1,rack1] to osd.1 [dc1,rack1] 1118.001 msec possibly improving
+ Slow OSD heartbeats on back from osd.0 [dc1,rack1] to osd.2 [dc1,rack2] 1030.123 msec
+ Slow OSD heartbeats on back from osd.2 [dc1,rack2] to osd.1 [dc1,rack1] 1015.321 msec
+ Slow OSD heartbeats on back from osd.1 [dc1,rack1] to osd.0 [dc1,rack1] 1010.456 msec
To see even more detail and a complete dump of network performance information the ``dump_osd_network`` command can be used. Typically, this would be
sent to a mgr, but it can be limited to a particular OSD's interactions by issuing it to any OSD. The current threshold which defaults to 1 second
...
+
+Muting health checks
+--------------------
+
+Health checks can be muted so that they do not affect the overall
+reported status of the cluster. Alerts are specified using the health
+check code (see :ref:`health-checks`)::
+
+ ceph health mute <code>
+
+For example, if there is a health warning, muting it will make the
+cluster report an overall status of ``HEALTH_OK``. For example, to
+mute an ``OSD_DOWN`` alert,::
+
+ ceph health mute OSD_DOWN
+
+Mutes are reported as part of the short and long form of the ``ceph health`` command.
+For example, in the above scenario, the cluster would report::
+
+ $ ceph health
+ HEALTH_OK (muted: OSD_DOWN)
+ $ ceph health detail
+ HEALTH_OK (muted: OSD_DOWN)
+ (MUTED) OSD_DOWN 1 osds down
+ osd.1 is down
+
+A mute can be explicitly removed with::
+
+ ceph health unmute <code>
+
+For example,::
+
+ ceph health unmute OSD_DOWN
+
+A health check mute may optionally have a TTL (time to live)
+associated with it, such that the mute will automatically expire
+after the specified period of time has elapsed. The TTL is specified as an optional
+duration argument, e.g.::
+
+ ceph health mute OSD_DOWN 4h # mute for 4 hours
+ ceph health mute MON_DOWN 15m # mute for 15 minutes
+
+Normally, if a muted health alert is resolved (e.g., in the example
+above, the OSD comes back up), the mute goes away. If the alert comes
+back later, it will be reported in the usual way.
+
+It is possible to make a mute "sticky" such that the mute will remain even if the
+alert clears. For example,::
+
+ ceph health mute OSD_DOWN 1h --sticky # ignore any/all down OSDs for next hour
+
+Most health mutes also disappear if the extent of an alert gets worse. For example,
+if there is one OSD down, and the alert is muted, the mute will disappear if one
+or more additional OSDs go down. This is true for any health alert that involves
+a count indicating how much or how many of something is triggering the warning or
+error.
+
+
Detecting configuration issues
==============================