.. _devices:

Device Management
=================

Ceph tracks which hardware storage devices (e.g., HDDs, SSDs) are consumed by
which daemons, and collects health metrics about those devices in order to
provide tools to predict and/or automatically respond to hardware failure.

Device tracking
---------------

You can query which storage devices are in use with::

  ceph device ls

You can also list devices by daemon or by host::

  ceph device ls-by-daemon <daemon>
  ceph device ls-by-host <host>

For any individual device, you can query information about its
location and how it is being consumed with::

  ceph device info <devid>
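
For orientation, the listing below is only an illustrative sketch of
``ceph device ls`` output; the device IDs, hosts, and daemons are made-up
examples, and the exact columns can vary between releases::

  DEVICE                                     HOST:DEV   DAEMONS  LIFE EXPECTANCY
  SanDisk_X400_M.2_2280_512GB_162924424784   node1:sda  osd.0
  SanDisk_X400_M.2_2280_512GB_162924424785   node2:sdb  osd.1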

Identifying physical devices
----------------------------

You can blink the drive LEDs on hardware enclosures to make the replacement of
failed disks easy and less error-prone. Use the following command::

  ceph device light on|off <devid> [ident|fault] [--force]

The ``<devid>`` parameter is the device identification. You can obtain this
information using the following command::

  ceph device ls

The ``[ident|fault]`` parameter is used to set the kind of light to blink.
By default, the `identification` light is used.
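
For example, to blink the identification LED of a particular drive and then
turn it off again, one might run something like the following (the device ID
is just an illustration; substitute one reported by ``ceph device ls``)::

  ceph device light on SanDisk_X400_M.2_2280_512GB_162924424784 ident
  ceph device light off SanDisk_X400_M.2_2280_512GB_162924424784 ident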

.. note::
   This command requires the Cephadm or the Rook `orchestrator <https://docs.ceph.com/docs/master/mgr/orchestrator/#orchestrator-cli-module>`_ module to be enabled.
   You can check which orchestrator module is enabled by running the following command::

     ceph orch status

Behind the scenes, the command used to blink the drive LEDs is `lsmcli`. If you
need to customize this command, you can configure it via a Jinja2 template::

  ceph config-key set mgr/cephadm/blink_device_light_cmd "<template>"
  ceph config-key set mgr/cephadm/<host>/blink_device_light_cmd "lsmcli local-disk-{{ ident_fault }}-led-{{'on' if on else 'off'}} --path '{{ path or dev }}'"

The Jinja2 template is rendered using the following arguments:

* ``on``
  A boolean value.
* ``ident_fault``
  A string containing `ident` or `fault`.
* ``dev``
  A string containing the device ID, e.g. `SanDisk_X400_M.2_2280_512GB_162924424784`.
* ``path``
  A string containing the device path, e.g. `/dev/sda`.
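
For illustration, with the example template above and the arguments
``on=True``, ``ident_fault='ident'`` and ``path='/dev/sda'``, the rendered
command would be::

  lsmcli local-disk-ident-led-on --path '/dev/sda'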

.. _enabling-monitoring:

Enabling monitoring
-------------------

Ceph can also monitor health metrics associated with your devices. For
example, SATA hard disks implement a standard called SMART that
provides a wide range of internal metrics about the device's usage and
health, like the number of hours powered on, number of power cycles,
or unrecoverable read errors. Other device types like SAS and NVMe
implement a similar set of metrics (via slightly different standards).
All of these can be collected by Ceph via the ``smartctl`` tool.
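
If metrics never appear for a device, it can help to first confirm outside of
Ceph that the device reports SMART data at all, for example with a manual
``smartctl`` run on the host (the device path here is only an example)::

  sudo smartctl -a /dev/sda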

You can enable or disable health monitoring with::

  ceph device monitoring on

or::

  ceph device monitoring off


Scraping
--------

If monitoring is enabled, metrics will automatically be scraped at regular
intervals. That interval can be configured with::

  ceph config set mgr mgr/devicehealth/scrape_frequency <seconds>

The default is to scrape once every 24 hours.
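
For example, to scrape twice a day instead, set the interval to 12 hours
(12 * 3600 = 43200 seconds)::

  ceph config set mgr mgr/devicehealth/scrape_frequency 43200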

You can manually trigger a scrape of all devices with::

  ceph device scrape-health-metrics

A single device can be scraped with::

  ceph device scrape-health-metrics <device-id>

Or a single daemon's devices can be scraped with::

  ceph device scrape-daemon-health-metrics <who>

The stored health metrics for a device can be retrieved (optionally
for a specific timestamp) with::

  ceph device get-health-metrics <devid> [sample-timestamp]
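
The metrics are returned as JSON. As a rough sketch (the device ID is only an
example, and the exact structure of the output may vary between releases), the
timestamps of the stored samples for a device could be listed with a tool such
as ``jq``::

  ceph device get-health-metrics SanDisk_X400_M.2_2280_512GB_162924424784 | jq 'keys'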

Failure prediction
------------------

Ceph can predict life expectancy and device failures based on the
health metrics it collects. There are two modes:

* *none*: disable device failure prediction.
* *local*: use a pre-trained prediction model from the ceph-mgr daemon.

The prediction mode can be configured with::

  ceph config set global device_failure_prediction_mode <mode>
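
For example, to enable the local prediction mode described above::

  ceph config set global device_failure_prediction_mode local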

Prediction normally runs in the background on a periodic basis, so it
may take some time before life expectancy values are populated. You
can see the life expectancy of all devices in output from::

  ceph device ls

You can also query the metadata for a specific device with::

  ceph device info <devid>

You can explicitly force prediction of a device's life expectancy with::

  ceph device predict-life-expectancy <devid>

If you are not using Ceph's internal device failure prediction but
have some external source of information about device failures, you
can inform Ceph of a device's life expectancy with::

  ceph device set-life-expectancy <devid> <from> [<to>]

Life expectancies are expressed as a time interval so that
uncertainty can be expressed in the form of a wide interval. The
interval end can also be left unspecified.
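
As a sketch, if an external tool expects a drive to fail some time in the
first half of 2021, that expectation could be recorded as an interval like the
following (the device ID is illustrative, and the exact timestamp format
accepted may vary by release; plain ISO-style dates are assumed here)::

  ceph device set-life-expectancy SanDisk_X400_M.2_2280_512GB_162924424784 2021-01-01 2021-07-01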

Health alerts
-------------

The ``mgr/devicehealth/warn_threshold`` option controls how soon an expected
device failure must be before we generate a health warning: a warning is
raised for any device predicted to fail within this window.
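
For instance, to warn about any device that is predicted to fail within the
next four weeks (assuming, as with ``scrape_frequency``, that the value is
given in seconds; 4 weeks is 2419200 seconds)::

  ceph config set mgr mgr/devicehealth/warn_threshold 2419200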

The stored life expectancy of all devices can be checked, and any
appropriate health alerts generated, with::

  ceph device check-health

Automatic Mitigation
--------------------

If the ``mgr/devicehealth/self_heal`` option is enabled (it is by
default), then for devices that are expected to fail soon the module
will automatically migrate data away from them by marking the devices
"out".

The ``mgr/devicehealth/mark_out_threshold`` option controls how soon an
expected device failure must be before we automatically mark an OSD
"out".