Device Management
=================

Device management allows Ceph to address hardware failure. Ceph tracks hardware
storage devices (HDDs, SSDs) to see which devices are managed by which daemons.
Ceph also collects health metrics about these devices. By doing so, Ceph can
provide tools that predict hardware failure and can automatically respond to
hardware failure.

Device tracking
---------------

To see a list of the storage devices that are in use, run the following
command:

.. prompt:: bash $

   ceph device ls


Alternatively, to list devices by daemon or by host, run a command of one of
the following forms:

.. prompt:: bash $

   ceph device ls-by-daemon <daemon>
   ceph device ls-by-host <host>


To see information about the location of a specific device and about how the
device is being consumed, run a command of the following form:

.. prompt:: bash $

   ceph device info <devid>

Identifying physical devices
----------------------------

To make the replacement of failed disks easier and less error-prone, you can
(in some cases) "blink" the drive's LEDs on hardware enclosures by running a
command of the following form::

   device light on|off <devid> [ident|fault] [--force]

.. note:: Using this command to blink the lights might not work. Whether it
   works will depend upon such factors as your kernel revision, your SES
   firmware, or the setup of your HBA.

The ``<devid>`` parameter is the device identification. To retrieve this
information, run the following command:

.. prompt:: bash $

   ceph device ls


The ``[ident|fault]`` parameter determines which kind of light will blink. By
default, the `identification` light is used.

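
For example, to turn on the fault light of a specific drive, one could run the
following (the device ID here is the sample ID used elsewhere in this
document; substitute the ID reported by ``ceph device ls``)::

   device light on SanDisk_X400_M.2_2280_512GB_162924424784 fault
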

.. note:: This command works only if the Cephadm or the Rook `orchestrator
   <https://docs.ceph.com/docs/master/mgr/orchestrator/#orchestrator-cli-module>`_
   module is enabled. To see which orchestrator module is enabled, run the
   following command:

   .. prompt:: bash $

      ceph orch status


The command that makes the drive's LEDs blink is `lsmcli`. To customize this
command, configure it via a Jinja2 template by running commands of the
following forms::

   ceph config-key set mgr/cephadm/blink_device_light_cmd "<template>"
   ceph config-key set mgr/cephadm/<host>/blink_device_light_cmd "lsmcli local-disk-{{ ident_fault }}-led-{{'on' if on else 'off'}} --path '{{ path or dev }}'"


The following arguments can be used to customize the Jinja2 template:

* ``on``

  A boolean value.

* ``ident_fault``

  A string that contains `ident` or `fault`.

* ``dev``

  A string that contains the device ID: for example,
  `SanDisk_X400_M.2_2280_512GB_162924424784`.

* ``path``

  A string that contains the device path: for example, `/dev/sda`.

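
As a sketch of how these arguments fit together, here is a plain-Python
rendering of the sample per-host template shown above, with example argument
values (the values are assumptions chosen purely for illustration):

```python
# Illustrative sketch (not part of Ceph): what the sample per-host template
# renders to for one set of argument values.
on = True
ident_fault = "ident"   # either "ident" or "fault"
dev = "SanDisk_X400_M.2_2280_512GB_162924424784"
path = "/dev/sda"

# Plain-Python equivalent of the Jinja2 expression:
#   lsmcli local-disk-{{ ident_fault }}-led-{{'on' if on else 'off'}} --path '{{ path or dev }}'
cmd = (
    f"lsmcli local-disk-{ident_fault}-led-{'on' if on else 'off'}"
    f" --path '{path or dev}'"
)
print(cmd)  # lsmcli local-disk-ident-led-on --path '/dev/sda'
```

Note that ``{{ path or dev }}`` falls back to the device ID when no device
path is available.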

.. _enabling-monitoring:

Enabling monitoring
-------------------

Ceph can also monitor the health metrics associated with your device. For
example, SATA drives implement a standard called SMART that provides a wide
range of internal metrics about the device's usage and health (for example:
the number of hours powered on, the number of power cycles, the number of
unrecoverable read errors). Other device types such as SAS and NVMe present a
similar set of metrics (via slightly different standards). All of these
metrics can be collected by Ceph via the ``smartctl`` tool.


You can enable or disable health monitoring by running one of the following
commands:

.. prompt:: bash $

   ceph device monitoring on
   ceph device monitoring off

Scraping
--------

If monitoring is enabled, device metrics will be scraped automatically at
regular intervals. To configure that interval, run a command of the following
form:

.. prompt:: bash $

   ceph config set mgr mgr/devicehealth/scrape_frequency <seconds>


By default, device metrics are scraped once every 24 hours.

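
For example, to scrape device metrics every twelve hours instead (an
illustrative value; the interval is given in seconds, and 12 × 3600 = 43200):

.. prompt:: bash $

   ceph config set mgr mgr/devicehealth/scrape_frequency 43200
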

To manually scrape all devices, run the following command:

.. prompt:: bash $

   ceph device scrape-health-metrics


To scrape a single device, run a command of the following form:

.. prompt:: bash $

   ceph device scrape-health-metrics <device-id>


To scrape a single daemon's devices, run a command of the following form:

.. prompt:: bash $

   ceph device scrape-daemon-health-metrics <who>


To retrieve the stored health metrics for a device (optionally for a specific
timestamp), run a command of the following form:

.. prompt:: bash $

   ceph device get-health-metrics <devid> [sample]

Failure prediction
------------------

Ceph can predict drive life expectancy and device failures by analyzing the
health metrics that it collects. The prediction modes are as follows:

* *none*: disable device failure prediction.
* *local*: use a pre-trained prediction model from the ``ceph-mgr`` daemon.


To configure the prediction mode, run a command of the following form:

.. prompt:: bash $

   ceph config set global device_failure_prediction_mode <mode>


Under normal conditions, failure prediction runs periodically in the
background. For this reason, life expectancy values might be populated only
after a significant amount of time has passed. The life expectancy of all
devices is displayed in the output of the following command:

.. prompt:: bash $

   ceph device ls


To see the metadata of a specific device, run a command of the following form:

.. prompt:: bash $

   ceph device info <devid>


To explicitly force prediction of a specific device's life expectancy, run a
command of the following form:

.. prompt:: bash $

   ceph device predict-life-expectancy <devid>


In addition to Ceph's internal device failure prediction, you might have an
external source of information about device failures. To inform Ceph of a
specific device's life expectancy, run a command of the following form:

.. prompt:: bash $

   ceph device set-life-expectancy <devid> <from> [<to>]


Life expectancies are expressed as a time interval. This means that the
uncertainty of the life expectancy can be expressed in the form of a range of
time, perhaps a wide one. The interval's end can be left unspecified.

Health alerts
-------------

The ``mgr/devicehealth/warn_threshold`` configuration option controls the
health check for an expected device failure. If a device is expected to fail
within the specified time interval, an alert is raised.

To check the stored life expectancy of all devices and generate any
appropriate health alerts, run the following command:

.. prompt:: bash $

   ceph device check-health

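
The threshold check described above can be sketched as follows. This is an
illustrative model only, not Ceph's implementation; the function name and the
example threshold value are assumptions made for illustration:

```python
# Illustrative model of a warn_threshold-style check (not Ceph source code).
from datetime import datetime, timedelta

def should_warn(now, predicted_failure, warn_threshold_seconds):
    """Raise a health warning if the device is expected to fail
    within warn_threshold_seconds from now."""
    return predicted_failure - now <= timedelta(seconds=warn_threshold_seconds)

now = datetime(2024, 1, 1)
soon = now + timedelta(days=10)
later = now + timedelta(days=90)
threshold = 4 * 7 * 24 * 3600        # example: four weeks, in seconds
print(should_warn(now, soon, threshold))    # True: failure is expected soon
print(should_warn(now, later, threshold))   # False: outside the window
```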

Automatic Migration
-------------------

The ``mgr/devicehealth/self_heal`` option (enabled by default) automatically
migrates data away from devices that are expected to fail soon. If this option
is enabled, the module marks such devices ``out`` so that automatic migration
will occur.

.. note:: The ``mon_osd_min_up_ratio`` configuration option can help prevent
   this process from cascading to total failure. If the "self heal" module
   marks ``out`` so many OSDs that the ratio value of ``mon_osd_min_up_ratio``
   is exceeded, then the cluster raises the ``DEVICE_HEALTH_TOOMANY`` health
   check. For instructions on what to do in this situation, see
   :ref:`DEVICE_HEALTH_TOOMANY<rados_health_checks_device_health_toomany>`.

The ``mgr/devicehealth/mark_out_threshold`` configuration option specifies the
time interval for automatic migration. If a device is expected to fail within
the specified time interval, it will be automatically marked ``out``.
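
The mark-out selection can be modeled in the same spirit. This is an
illustrative sketch only; the function and device names are assumptions, and
the real module works from the health metrics and predictions it has
collected:

```python
# Illustrative model of a mark_out_threshold-style selection (not Ceph source).
from datetime import datetime, timedelta

def devices_to_mark_out(now, predicted_failures, threshold_seconds):
    """Return the device IDs whose predicted failure time falls
    within the mark-out window."""
    window = timedelta(seconds=threshold_seconds)
    return [dev for dev, when in predicted_failures.items()
            if when - now <= window]

now = datetime(2024, 1, 1)
predictions = {
    "osd0-disk": now + timedelta(weeks=1),    # expected to fail soon
    "osd1-disk": now + timedelta(weeks=12),   # expected to last a while
}
# With an example four-week window, only the first device is selected.
print(devices_to_mark_out(now, predictions, 4 * 7 * 24 * 3600))
# ['osd0-disk']
```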