.. _devices:

Device Management
=================

Ceph tracks which hardware storage devices (e.g., HDDs, SSDs) are consumed by
which daemons, and collects health metrics about those devices in order to
provide tools to predict and/or automatically respond to hardware failure.

Device tracking
---------------

You can query which storage devices are in use with::

  ceph device ls

You can also list devices by daemon or by host::

  ceph device ls-by-daemon <daemon>
  ceph device ls-by-host <host>

For any individual device, you can query information about its
location and how it is being consumed with::

  ceph device info <devid>

Identifying physical devices
----------------------------

You can blink the drive LEDs on hardware enclosures to make the replacement of
failed disks easier and less error-prone.  Use the following command::

  ceph device light on|off <devid> [ident|fault] [--force]

The ``<devid>`` parameter is the device identifier. You can obtain it
using the following command::

  ceph device ls

The ``[ident|fault]`` parameter selects which light to blink.
By default, the ``ident`` (identification) light is used.

.. note::
   This command requires the Cephadm or the Rook `orchestrator <https://docs.ceph.com/docs/master/mgr/orchestrator/#orchestrator-cli-module>`_ module to be enabled.
   To check whether an orchestrator module is enabled, run the following command::

     ceph orch status
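
As a minimal sketch of how the light command might be used during a disk
replacement, the following shell function turns a light on, waits, and then
turns it off again.  The function name ``blink_device``, the ``CEPH`` override
variable, and the 60-second default are illustrative assumptions, not part of
Ceph::

  # Set CEPH=echo to preview the commands instead of running them.
  CEPH="${CEPH:-ceph}"

  # Usage: blink_device <devid> [ident|fault] [seconds]
  blink_device() {
    devid="$1"
    light="${2:-ident}"
    seconds="${3:-60}"
    # Turn the light on; stop here if the command fails.
    "$CEPH" device light on "$devid" "$light" || return 1
    sleep "$seconds"
    # Turn the same light off again.
    "$CEPH" device light off "$devid" "$light"
  }

Calling ``blink_device <devid> fault 120`` would blink the fault light for two
minutes.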

Enabling monitoring
-------------------

Ceph can also monitor health metrics associated with your devices.  For
example, SATA hard disks implement a standard called SMART that
provides a wide range of internal metrics about the device's usage and
health, like the number of hours powered on, number of power cycles,
or unrecoverable read errors.  Other device types like SAS and NVMe
implement a similar set of metrics (via slightly different standards).
All of these can be collected by Ceph via the ``smartctl`` tool.

You can enable or disable health monitoring with::

  ceph device monitoring on

or::

  ceph device monitoring off


Scraping
--------

If monitoring is enabled, device health metrics are scraped automatically at
regular intervals.  That interval can be configured with::

  ceph config set mgr mgr/devicehealth/scrape_frequency <seconds>

The default is to scrape once every 24 hours.
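
Because the interval is given in seconds, a small helper can make the setting
easier to read.  This is only a sketch: the function name
``set_scrape_hours`` and the ``CEPH`` override variable are illustrative
assumptions, not part of Ceph::

  # Set CEPH=echo to preview the command instead of running it.
  CEPH="${CEPH:-ceph}"

  # Usage: set_scrape_hours <hours>
  set_scrape_hours() {
    # Convert hours to seconds before passing the value to ceph.
    "$CEPH" config set mgr mgr/devicehealth/scrape_frequency "$(( $1 * 3600 ))"
  }

For example, ``set_scrape_hours 12`` would request a scrape every 43200
seconds.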

You can manually trigger a scrape of all devices with::

  ceph device scrape-health-metrics

A single device can be scraped with::

  ceph device scrape-health-metrics <device-id>

Or a single daemon's devices can be scraped with::

  ceph device scrape-daemon-health-metrics <who>

The stored health metrics for a device can be retrieved (optionally
for a specific timestamp) with::

  ceph device get-health-metrics <devid> [sample-timestamp]
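
The scrape and retrieval commands above can be combined, for example to keep
a copy of a device's metrics for later inspection.  The following is a rough
sketch; the function name ``save_health_metrics``, the ``CEPH`` override
variable, and the ``.metrics`` file suffix are illustrative assumptions::

  # Set CEPH=echo to preview the commands instead of running them.
  CEPH="${CEPH:-ceph}"

  # Usage: save_health_metrics <devid> [output-dir]
  save_health_metrics() {
    devid="$1"
    outdir="${2:-.}"
    # Trigger a fresh scrape of this device; stop here if it fails.
    "$CEPH" device scrape-health-metrics "$devid" || return 1
    # Store whatever the retrieval command prints in a per-device file.
    "$CEPH" device get-health-metrics "$devid" > "$outdir/$devid.metrics"
  }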

Failure prediction
------------------

Ceph can predict life expectancy and device failures based on the
health metrics it collects.  There are three modes:

* *none*: disable device failure prediction.
* *local*: use a pre-trained prediction model from the ceph-mgr daemon.
* *cloud*: share device health and performance metrics with an external
  cloud service run by ProphetStor, using either their free service or
  a paid service with more accurate predictions.

The prediction mode can be configured with::

  ceph config set global device_failure_prediction_mode <mode>

Prediction normally runs in the background on a periodic basis, so it
may take some time before life expectancy values are populated.  You
can see the life expectancy of all devices in output from::

  ceph device ls

You can also query the metadata for a specific device with::

  ceph device info <devid>

You can explicitly force prediction of a device's life expectancy with::

  ceph device predict-life-expectancy <devid>

If you are not using Ceph's internal device failure prediction but
have some external source of information about device failures, you
can inform Ceph of a device's life expectancy with::

  ceph device set-life-expectancy <devid> <from> [<to>]

Life expectancies are expressed as a time interval so that
uncertainty can be expressed in the form of a wide interval. The
interval end can also be left unspecified.

Health alerts
-------------

The ``mgr/devicehealth/warn_threshold`` option controls how soon an
expected device failure must be before a health warning is generated.

The stored life expectancy of all devices can be checked, and any
appropriate health alerts generated, with::

  ceph device check-health

Automatic Mitigation
--------------------

If the ``mgr/devicehealth/self_heal`` option is enabled (it is by
default), the module automatically migrates data away from devices that
are expected to fail soon by marking them "out".

The ``mgr/devicehealth/mark_out_threshold`` option controls how soon an
expected device failure must be before an OSD is automatically marked
"out".
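
The relationship between the warning and mark-out thresholds can be sketched
as follows.  This illustrates the behavior described above and is not the
module's actual implementation; the function name ``classify_device`` is
hypothetical, and the sketch assumes that all values are expressed in seconds
and that ``self_heal`` is enabled::

  # Usage: classify_device <seconds-until-failure> <warn> <mark-out>
  classify_device() {
    if [ "$1" -lt "$3" ]; then
      # Failure is expected sooner than the mark-out threshold.
      echo "mark-out"
    elif [ "$1" -lt "$2" ]; then
      # Failure is expected sooner than the warning threshold.
      echo "warn"
    else
      echo "ok"
    fi
  }

For example, with an illustrative warning threshold of 12 weeks (7257600
seconds) and a mark-out threshold of 4 weeks (2419200 seconds),
``classify_device 3628800 7257600 2419200`` (a failure expected in 6 weeks)
prints ``warn``.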