Device Management
=================

Device management allows Ceph to address hardware failure. Ceph tracks hardware
storage devices (HDDs, SSDs) to see which devices are managed by which daemons.
Ceph also collects health metrics about these devices. By doing so, Ceph can
provide tools that predict hardware failure and can automatically respond to
hardware failure.

Device tracking
---------------

To see a list of the storage devices that are in use, run the following
command:

.. prompt:: bash $

   ceph device ls


Alternatively, to list devices by daemon or by host, run a command of one of
the following forms:

.. prompt:: bash $

   ceph device ls-by-daemon <daemon>
   ceph device ls-by-host <host>


To see information about the location of a specific device and about how the
device is being consumed, run a command of the following form:

.. prompt:: bash $

   ceph device info <devid>

Identifying physical devices
----------------------------

To make the replacement of failed disks easier and less error-prone, you can
(in some cases) "blink" the drive's LEDs on hardware enclosures by running a
command of the following form::

   device light on|off <devid> [ident|fault] [--force]

.. note:: Using this command to blink the lights might not work. Whether it
   works will depend upon such factors as your kernel revision, your SES
   firmware, or the setup of your HBA.

The ``<devid>`` parameter is the device identification. To retrieve this
information, run the following command:

.. prompt:: bash $

   ceph device ls


The ``[ident|fault]`` parameter determines which kind of light will blink. By
default, the `identification` light is used.

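
For example, to turn on the fault light of a specific drive, one could run the
following (the device ID here is the sample ID used elsewhere in this
document; substitute the ID reported by ``ceph device ls``)::

   device light on SanDisk_X400_M.2_2280_512GB_162924424784 fault
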

.. note:: This command works only if the Cephadm or the Rook `orchestrator
   <https://docs.ceph.com/docs/master/mgr/orchestrator/#orchestrator-cli-module>`_
   module is enabled. To see which orchestrator module is enabled, run the
   following command:

   .. prompt:: bash $

      ceph orch status


The command that makes the drive's LEDs blink is `lsmcli`. To customize this
command, configure it via a Jinja2 template by running commands of the
following forms::

   ceph config-key set mgr/cephadm/blink_device_light_cmd "<template>"
   ceph config-key set mgr/cephadm/<host>/blink_device_light_cmd "lsmcli local-disk-{{ ident_fault }}-led-{{'on' if on else 'off'}} --path '{{ path or dev }}'"


The following arguments can be used to customize the Jinja2 template:

* ``on``

  A boolean value.

* ``ident_fault``

  A string that contains `ident` or `fault`.

* ``dev``

  A string that contains the device ID: for example,
  `SanDisk_X400_M.2_2280_512GB_162924424784`.

* ``path``

  A string that contains the device path: for example, `/dev/sda`.

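
As a sketch of how these arguments fit together, here is a plain-Python
rendering of the sample per-host template shown above, with example argument
values (the values are assumptions chosen purely for illustration):

```python
# Illustrative sketch (not part of Ceph): what the sample per-host template
# renders to for one set of argument values.
on = True
ident_fault = "ident"   # either "ident" or "fault"
dev = "SanDisk_X400_M.2_2280_512GB_162924424784"
path = "/dev/sda"

# Plain-Python equivalent of the Jinja2 expression:
#   lsmcli local-disk-{{ ident_fault }}-led-{{'on' if on else 'off'}} --path '{{ path or dev }}'
cmd = (
    f"lsmcli local-disk-{ident_fault}-led-{'on' if on else 'off'}"
    f" --path '{path or dev}'"
)
print(cmd)  # lsmcli local-disk-ident-led-on --path '/dev/sda'
```

Note that ``{{ path or dev }}`` falls back to the device ID when no device
path is available.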

.. _enabling-monitoring:

Enabling monitoring
-------------------

Ceph can also monitor the health metrics associated with your device. For
example, SATA drives implement a standard called SMART that provides a wide
range of internal metrics about the device's usage and health (for example:
the number of hours powered on, the number of power cycles, the number of
unrecoverable read errors). Other device types such as SAS and NVMe present a
similar set of metrics (via slightly different standards). All of these
metrics can be collected by Ceph via the ``smartctl`` tool.


You can enable or disable health monitoring by running one of the following
commands:

.. prompt:: bash $

   ceph device monitoring on
   ceph device monitoring off

Scraping
--------

If monitoring is enabled, device metrics will be scraped automatically at
regular intervals. To configure that interval, run a command of the following
form:

.. prompt:: bash $

   ceph config set mgr mgr/devicehealth/scrape_frequency <seconds>


By default, device metrics are scraped once every 24 hours.

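
For example, to scrape device metrics every twelve hours instead (an
illustrative value; the interval is given in seconds, and 12 × 3600 = 43200):

.. prompt:: bash $

   ceph config set mgr mgr/devicehealth/scrape_frequency 43200
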

To manually scrape all devices, run the following command:

.. prompt:: bash $

   ceph device scrape-health-metrics


To scrape a single device, run a command of the following form:

.. prompt:: bash $

   ceph device scrape-health-metrics <device-id>


To scrape a single daemon's devices, run a command of the following form:

.. prompt:: bash $

   ceph device scrape-daemon-health-metrics <who>


To retrieve the stored health metrics for a device (optionally for a specific
timestamp), run a command of the following form:

.. prompt:: bash $

   ceph device get-health-metrics <devid> [sample]

Failure prediction
------------------

Ceph can predict drive life expectancy and device failures by analyzing the
health metrics that it collects. The prediction modes are as follows:

* *none*: disable device failure prediction.
* *local*: use a pre-trained prediction model from the ``ceph-mgr`` daemon.


To configure the prediction mode, run a command of the following form:

.. prompt:: bash $

   ceph config set global device_failure_prediction_mode <mode>


Under normal conditions, failure prediction runs periodically in the
background. For this reason, life expectancy values might be populated only
after a significant amount of time has passed. The life expectancy of all
devices is displayed in the output of the following command:

.. prompt:: bash $

   ceph device ls


To see the metadata of a specific device, run a command of the following form:

.. prompt:: bash $

   ceph device info <devid>


To explicitly force prediction of a specific device's life expectancy, run a
command of the following form:

.. prompt:: bash $

   ceph device predict-life-expectancy <devid>


In addition to Ceph's internal device failure prediction, you might have an
external source of information about device failures. To inform Ceph of a
specific device's life expectancy, run a command of the following form:

.. prompt:: bash $

   ceph device set-life-expectancy <devid> <from> [<to>]


Life expectancies are expressed as a time interval. This means that the
uncertainty of the life expectancy can be expressed in the form of a range of
time, perhaps a wide one. The interval's end can be left unspecified.

Health alerts
-------------

The ``mgr/devicehealth/warn_threshold`` configuration option controls the
health check for an expected device failure. If a device is expected to fail
within the specified time interval, an alert is raised.

To check the stored life expectancy of all devices and generate any
appropriate health alerts, run the following command:

.. prompt:: bash $

   ceph device check-health

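
The threshold check described above can be sketched as follows. This is an
illustrative model only, not Ceph's implementation; the function name and the
example threshold value are assumptions made for illustration:

```python
# Illustrative model of a warn_threshold-style check (not Ceph source code).
from datetime import datetime, timedelta

def should_warn(now, predicted_failure, warn_threshold_seconds):
    """Raise a health warning if the device is expected to fail
    within warn_threshold_seconds from now."""
    return predicted_failure - now <= timedelta(seconds=warn_threshold_seconds)

now = datetime(2024, 1, 1)
soon = now + timedelta(days=10)
later = now + timedelta(days=90)
threshold = 4 * 7 * 24 * 3600        # example: four weeks, in seconds
print(should_warn(now, soon, threshold))    # True: failure is expected soon
print(should_warn(now, later, threshold))   # False: outside the window
```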

Automatic Migration
-------------------

The ``mgr/devicehealth/self_heal`` option (enabled by default) automatically
migrates data away from devices that are expected to fail soon. If this option
is enabled, the module marks such devices ``out`` so that automatic migration
will occur.

.. note:: The ``mon_osd_min_up_ratio`` configuration option can help prevent
   this process from cascading to total failure. If the "self heal" module
   marks ``out`` so many OSDs that the ratio value of ``mon_osd_min_up_ratio``
   is exceeded, then the cluster raises the ``DEVICE_HEALTH_TOOMANY`` health
   check. For instructions on what to do in this situation, see
   :ref:`DEVICE_HEALTH_TOOMANY<rados_health_checks_device_health_toomany>`.

The ``mgr/devicehealth/mark_out_threshold`` configuration option specifies the
time interval for automatic migration. If a device is expected to fail within
the specified time interval, it will be automatically marked ``out``.
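
The mark-out selection can be modeled in the same spirit. This is an
illustrative sketch only; the function and device names are assumptions, and
the real module works from the health metrics and predictions it has
collected:

```python
# Illustrative model of a mark_out_threshold-style selection (not Ceph source).
from datetime import datetime, timedelta

def devices_to_mark_out(now, predicted_failures, threshold_seconds):
    """Return the device IDs whose predicted failure time falls
    within the mark-out window."""
    window = timedelta(seconds=threshold_seconds)
    return [dev for dev, when in predicted_failures.items()
            if when - now <= window]

now = datetime(2024, 1, 1)
predictions = {
    "osd0-disk": now + timedelta(weeks=1),    # expected to fail soon
    "osd1-disk": now + timedelta(weeks=12),   # expected to last a while
}
# With an example four-week window, only the first device is selected.
print(devices_to_mark_out(now, predictions, 4 * 7 * 24 * 3600))
# ['osd0-disk']
```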