.. _devices:

Device Management
=================

Ceph tracks which hardware storage devices (e.g., HDDs, SSDs) are consumed by
which daemons, and collects health metrics about those devices in order to
provide tools to predict and/or automatically respond to hardware failure.

Device tracking
---------------

You can query which storage devices are in use with::

  ceph device ls

You can also list devices by daemon or by host::

  ceph device ls-by-daemon <daemon>
  ceph device ls-by-host <host>

For any individual device, you can query information about its
location and how it is being consumed with::

  ceph device info <devid>

Enabling monitoring
-------------------

Ceph can also monitor health metrics associated with your devices.  For
example, SATA hard disks implement a standard called SMART that
provides a wide range of internal metrics about the device's usage and
health, like the number of hours powered on, the number of power cycles,
or the number of unrecoverable read errors.  Other device types like SAS
and NVMe implement a similar set of metrics (via slightly different
standards).  All of these can be collected by Ceph via the ``smartctl``
tool.

You can enable or disable health monitoring with::

  ceph device monitoring on

or::

  ceph device monitoring off

Scraping
--------

If monitoring is enabled, metrics will automatically be scraped at
regular intervals.  That interval can be configured with::

  ceph config set mgr mgr/devicehealth/scrape_frequency <seconds>

The default is to scrape once every 24 hours.
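
As a rough sketch of choosing a value for the interval, the snippet below
converts a hypothetical 12-hour target into seconds and prints the resulting
command instead of running it.  The variable names are illustrative only and
are not part of Ceph.

```shell
# Hypothetical target: scrape every 12 hours instead of the default 24.
hours=12

# scrape_frequency is given in seconds.
scrape_seconds=$((hours * 60 * 60))

# Print the command for review; run it by hand once the value looks right.
echo "ceph config set mgr mgr/devicehealth/scrape_frequency ${scrape_seconds}"
```

With ``hours=12`` this prints a command that sets the frequency to 43200
seconds.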

You can manually trigger a scrape of all devices with::

  ceph device scrape-health-metrics

A single device can be scraped with::

  ceph device scrape-health-metrics <devid>

The devices of a single daemon can be scraped with::

  ceph device scrape-daemon-health-metrics <who>

The stored health metrics for a device can be retrieved (optionally
for a specific timestamp) with::

  ceph device get-health-metrics <devid> [sample-timestamp]

Failure prediction
------------------

Ceph can predict life expectancy and device failures based on the
health metrics it collects.  There are three modes:

* *none*: disable device failure prediction.
* *local*: use a pre-trained prediction model from the ceph-mgr daemon.
* *cloud*: share device health and performance metrics with an external
  cloud service run by ProphetStor, using either their free service or
  a paid service with more accurate predictions.

The prediction mode can be configured with::

  ceph config set global device_failure_prediction_mode <mode>
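
For illustration, the sketch below checks a chosen mode against the three
modes listed above before building the command.  The ``mode`` and ``cmd``
variables are hypothetical helpers, and the command is printed rather than
executed.

```shell
# Hypothetical choice; must be one of the three modes listed above.
mode="local"

case "${mode}" in
  none|local|cloud)
    cmd="ceph config set global device_failure_prediction_mode ${mode}"
    ;;
  *)
    echo "unknown prediction mode: ${mode}" >&2
    cmd=""
    ;;
esac

# Print the command for review rather than executing it.
echo "${cmd}"
```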

Prediction normally runs in the background on a periodic basis, so it
may take some time before life expectancy values are populated.  You
can see the life expectancy of all devices in output from::

  ceph device ls

You can also query the metadata for a specific device with::

  ceph device info <devid>

You can explicitly force prediction of a device's life expectancy with::

  ceph device predict-life-expectancy <devid>

If you are not using Ceph's internal device failure prediction but
have some external source of information about device failures, you
can inform Ceph of a device's life expectancy with::

  ceph device set-life-expectancy <devid> <from> [<to>]

Life expectancies are expressed as a time interval so that
uncertainty can be expressed in the form of a wide interval. The
interval end can also be left unspecified.
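
The following sketch shows one way such an interval might be built.  It
assumes that timestamps in ``YYYY-MM-DD`` form are accepted, which this
document does not state, and it relies on GNU ``date`` for the ``-d``
option.  The device ID is a placeholder, and the commands are printed rather
than executed.

```shell
# Hypothetical device ID; take a real one from `ceph device ls`.
devid="VENDOR_MODEL_SERIAL"

# Assumed timestamp format: YYYY-MM-DD. Requires GNU date for the -d option.
from="$(date -u -d '+2 weeks' +%Y-%m-%d)"
to="$(date -u -d '+6 weeks' +%Y-%m-%d)"

# A wide interval (2 to 6 weeks from now) expresses uncertainty about
# when the device is expected to fail.
echo "ceph device set-life-expectancy ${devid} ${from} ${to}"

# The end of the interval can also be left out.
echo "ceph device set-life-expectancy ${devid} ${from}"
```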

Health alerts
-------------

The ``mgr/devicehealth/warn_threshold`` option controls how soon an
expected device failure must be before Ceph generates a health warning.

The stored life expectancy of all devices can be checked, and any
appropriate health alerts generated, with::

  ceph device check-health

Automatic mitigation
--------------------

If the ``mgr/devicehealth/self_heal`` option is enabled (it is by
default), then for devices that are expected to fail soon the module
will automatically migrate data away from them by marking the devices
"out".

The ``mgr/devicehealth/mark_out_threshold`` option controls how soon an
expected device failure must be before Ceph automatically marks an OSD
"out".
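
As a sketch of how the warning and mark-out thresholds might be chosen
together, the snippet below uses a hypothetical policy of warning 8 weeks
ahead and marking out 2 weeks ahead.  It assumes that both options take a
value in seconds, which this document does not state, and it prints the
commands rather than executing them.

```shell
# Hypothetical policy: warn 8 weeks ahead, mark out 2 weeks ahead.
# Assumption: both options take a value in seconds.
secs_per_week=$((7 * 24 * 60 * 60))
warn_threshold=$((8 * secs_per_week))
mark_out_threshold=$((2 * secs_per_week))

# Sanity check: the warning should come before the automatic mark-out.
if [ "${mark_out_threshold}" -ge "${warn_threshold}" ]; then
  echo "mark_out_threshold should be smaller than warn_threshold" >&2
fi

# Print the commands for review rather than executing them.
echo "ceph config set mgr mgr/devicehealth/warn_threshold ${warn_threshold}"
echo "ceph config set mgr mgr/devicehealth/mark_out_threshold ${mark_out_threshold}"
```

Keeping the mark-out threshold smaller than the warning threshold means an
operator sees the health warning before data starts migrating off the device.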