Ceph tracks which hardware storage devices (e.g., HDDs, SSDs) are consumed by
which daemons, and collects health metrics about those devices in order to
provide tools to predict and/or automatically respond to hardware failure.
You can query which storage devices are in use with::

  ceph device ls
You can also list devices by daemon or by host::

  ceph device ls-by-daemon <daemon>
  ceph device ls-by-host <host>
For any individual device, you can query information about its
location and how it is being consumed with::

  ceph device info <devid>
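As an illustration of the inventory this provides, the sketch below filters a hypothetical JSON device listing (as might be produced with ``--format=json``). The field names ``devid``, ``location``, and ``daemons``, and the sample values, are assumptions for the example, not a documented schema:

```python
import json

# Illustrative sample only: the exact JSON shape emitted by Ceph may
# differ between releases; "devid", "location", and "daemons" are the
# fields this sketch assumes, and the values are invented.
sample = json.loads("""
[
  {"devid": "SEAGATE_ST31000_XYZ123",
   "location": [{"host": "node1", "dev": "sda"}],
   "daemons": ["osd.0"]},
  {"devid": "INTEL_SSDSC2_ABC456",
   "location": [{"host": "node2", "dev": "nvme0n1"}],
   "daemons": ["osd.1", "osd.2"]}
]
""")

def devices_for_daemon(devices, daemon):
    """Return the device IDs consumed by a given daemon."""
    return [d["devid"] for d in devices if daemon in d["daemons"]]

print(devices_for_daemon(sample, "osd.1"))  # ['INTEL_SSDSC2_ABC456']
```

This is the same mapping that ``ceph device ls-by-daemon`` computes on the server side.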
Identifying physical devices
----------------------------
You can blink the drive LEDs on hardware enclosures to make the replacement of
failed disks easy and less error-prone. Use the following command::

  device light on|off <devid> [ident|fault] [--force]
The ``<devid>`` parameter is the device identifier. You can obtain this
information using the following command::

  ceph device ls
The ``[ident|fault]`` parameter is used to set the kind of light to blink.
By default, the `identification` light is used.
This command requires the Cephadm or Rook `orchestrator <https://docs.ceph.com/docs/master/mgr/orchestrator/#orchestrator-cli-module>`_ module to be enabled.
You can check which orchestrator module is enabled with the following command::

  ceph orch status
Enabling monitoring
-------------------

Ceph can also monitor the health metrics associated with your devices. For
example, SATA hard disks implement a standard called SMART that
provides a wide range of internal metrics about the device's usage and
health, such as the number of hours powered on, the number of power cycles,
or unrecoverable read errors. Other device types, such as SAS and NVMe,
implement a similar set of metrics (via slightly different standards).
All of these can be collected by Ceph via the ``smartctl`` tool.
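To make the metrics concrete, the sketch below reads a few values from a report in the style of ``smartctl``'s JSON output. The field names follow smartctl's JSON schema as best understood here, but the sample values are invented for the example:

```python
import json

# Illustrative sample in the style of `smartctl --json --all <dev>`;
# the values below are invented, and the schema is an assumption.
report = json.loads("""
{
  "power_on_time": {"hours": 31842},
  "power_cycle_count": 87,
  "ata_smart_attributes": {
    "table": [
      {"id": 5,   "name": "Reallocated_Sector_Ct", "raw": {"value": 0}},
      {"id": 187, "name": "Reported_Uncorrect",    "raw": {"value": 2}}
    ]
  }
}
""")

def raw_attr(report, name):
    """Look up a raw SMART attribute value by name, or None if absent."""
    for row in report["ata_smart_attributes"]["table"]:
        if row["name"] == name:
            return row["raw"]["value"]
    return None

print(report["power_on_time"]["hours"])        # 31842
print(raw_attr(report, "Reported_Uncorrect"))  # 2
```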
You can enable or disable health monitoring with::

  ceph device monitoring on

or::

  ceph device monitoring off
If monitoring is enabled, metrics will automatically be scraped at regular
intervals. That interval can be configured with::

  ceph config set mgr mgr/devicehealth/scrape_frequency <seconds>

The default is to scrape once every 24 hours.
You can manually trigger a scrape of all devices with::

  ceph device scrape-health-metrics
A single device can be scraped with::

  ceph device scrape-health-metrics <device-id>
Or a single daemon's devices can be scraped with::

  ceph device scrape-daemon-health-metrics <who>
The stored health metrics for a device can be retrieved (optionally
for a specific timestamp) with::

  ceph device get-health-metrics <devid> [sample-timestamp]
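The sketch below picks the newest stored sample from such a result. It assumes (this is an assumption, not documented behavior) that samples are keyed by ``YYYYMMDD-HHMMSS`` timestamp strings, and the metric payloads are invented placeholders:

```python
# Hypothetical sketch: assumes samples keyed by "YYYYMMDD-HHMMSS"
# strings; the payload fields are invented for illustration.
samples = {
    "20240101-000000": {"power_on_hours": 30000},
    "20240102-000000": {"power_on_hours": 30024},
    "20231231-000000": {"power_on_hours": 29976},
}

def latest_sample(samples):
    """Timestamps in this format sort lexicographically, so max() works."""
    ts = max(samples)
    return ts, samples[ts]

print(latest_sample(samples)[0])  # 20240102-000000
```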
Failure prediction
------------------

Ceph can predict life expectancy and device failures based on the
health metrics it collects. There are three modes:

* *none*: disable device failure prediction.
* *local*: use a pre-trained prediction model from the ceph-mgr daemon.
* *cloud*: share device health and performance metrics with an external
  cloud service run by ProphetStor, using either their free service or
  a paid service with more accurate predictions.
The prediction mode can be configured with::

  ceph config set global device_failure_prediction_mode <mode>
Prediction normally runs in the background on a periodic basis, so it
may take some time before life expectancy values are populated. You
can see the life expectancy of all devices in the output of::

  ceph device ls
You can also query the metadata for a specific device with::

  ceph device info <devid>
You can explicitly force prediction of a device's life expectancy with::

  ceph device predict-life-expectancy <devid>
If you are not using Ceph's internal device failure prediction but
have some external source of information about device failures, you
can inform Ceph of a device's life expectancy with::

  ceph device set-life-expectancy <devid> <from> [<to>]
Life expectancies are expressed as a time interval so that
uncertainty can be expressed in the form of a wide interval. The
interval end can also be left unspecified.
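One way to reason about such an interval is sketched below. This is a hedged illustration, not Ceph's implementation: it models an interval whose end may be left unspecified, and a cautious policy that treats a device as "expected to fail soon" once the earliest predicted failure time falls inside a warning horizon:

```python
from datetime import datetime, timedelta

def may_fail_within(life_from, horizon, now, life_to=None):
    """Cautious policy: flag once the earliest predicted failure time
    is inside the horizon. life_to is optional and unused here, to
    mirror the fact that the interval end may be unspecified."""
    return life_from <= now + horizon

now = datetime(2024, 6, 1)
print(may_fail_within(datetime(2024, 6, 20), timedelta(weeks=4), now))  # True
print(may_fail_within(datetime(2025, 1, 1), timedelta(weeks=4), now))   # False
```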
Health alerts
-------------

The ``mgr/devicehealth/warn_threshold`` option controls how soon an expected
device failure must be before we generate a health warning.
The stored life expectancy of all devices can be checked, and any
appropriate health alerts generated, with::

  ceph device check-health
Automatic Mitigation
--------------------

If the ``mgr/devicehealth/self_heal`` option is enabled (it is by
default), then for devices that are expected to fail soon the module
will automatically migrate data away from them by marking the devices
``out``.
The ``mgr/devicehealth/mark_out_threshold`` option controls how soon an
expected device failure must be before we automatically mark an OSD
``out``.
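Taken together, the two thresholds form an escalating response. The sketch below is a simplified illustration of that idea, not the module's actual implementation; the threshold values are invented for the example (in Ceph both are configured in seconds):

```python
from datetime import datetime, timedelta

# Illustrative values only; real deployments configure
# warn_threshold and mark_out_threshold in seconds.
WARN_THRESHOLD = timedelta(weeks=6)
MARK_OUT_THRESHOLD = timedelta(weeks=2)

def response(predicted_failure, now):
    """Escalate from no action, to a health warning, to marking out."""
    if predicted_failure <= now + MARK_OUT_THRESHOLD:
        return "mark osd out"
    if predicted_failure <= now + WARN_THRESHOLD:
        return "health warning"
    return "ok"

now = datetime(2024, 6, 1)
print(response(datetime(2024, 6, 10), now))  # mark osd out
print(response(datetime(2024, 7, 1), now))   # health warning
```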