Device Management
=================

Ceph tracks which hardware storage devices (e.g., HDDs, SSDs) are consumed by
which daemons, and collects health metrics about those devices in order to
provide tools to predict and/or automatically respond to hardware failure.

Device tracking
---------------

You can query which storage devices are in use with::

  ceph device ls

You can also list devices by daemon or by host::

  ceph device ls-by-daemon <daemon>
  ceph device ls-by-host <host>

For any individual device, you can query information about its
location and how it is being consumed with::

  ceph device info <devid>


Enabling monitoring
-------------------

Ceph can also monitor health metrics associated with your devices.  For
example, SATA hard disks implement a standard called SMART that
provides a wide range of internal metrics about the device's usage and
health, such as the number of hours powered on, the number of power
cycles, and the number of unrecoverable read errors.  Other device types
like SAS and NVMe implement a similar set of metrics (via slightly
different standards).  All of these can be collected by Ceph via the
``smartctl`` tool.

You can enable or disable health monitoring with::

  ceph device monitoring on

or::

  ceph device monitoring off


Scraping
--------

If monitoring is enabled, metrics are scraped automatically at regular
intervals.  The interval can be configured with::

  ceph config set mgr mgr/devicehealth/scrape_frequency <seconds>

The default is to scrape once every 24 hours.

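As a minimal sketch of changing the interval, the following converts a
hypothetical 12-hour target into seconds and prints the resulting command
for review (remove ``echo`` to apply it).  The 12-hour value is only an
illustration, not a recommendation::

  # Hypothetical target: scrape every 12 hours instead of every 24.
  scrape_hours=12
  scrape_seconds=$((scrape_hours * 60 * 60))
  # Printed rather than executed so the command can be checked first.
  echo ceph config set mgr mgr/devicehealth/scrape_frequency "$scrape_seconds"
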
You can manually trigger a scrape of all devices with::

  ceph device scrape-health-metrics

A single device can be scraped with::

  ceph device scrape-health-metrics <devid>

The devices of a single daemon can be scraped with::

  ceph device scrape-daemon-health-metrics <who>

The stored health metrics for a device can be retrieved (optionally
for a specific timestamp) with::

  ceph device get-health-metrics <devid> [sample-timestamp]

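The per-device commands above can be combined in a small loop.  This is
only a sketch: the device IDs are hypothetical placeholders for values
taken from ``ceph device ls``, and the commands are printed rather than
executed so they can be reviewed first (remove ``echo`` to run them)::

  # Hypothetical device IDs; substitute real ones from "ceph device ls".
  devids="VENDOR_MODEL_SERIAL1 VENDOR_MODEL_SERIAL2"
  count=0
  for devid in $devids; do
      # Scrape fresh metrics for the device, then read back what was stored.
      echo ceph device scrape-health-metrics "$devid"
      echo ceph device get-health-metrics "$devid"
      count=$((count + 1))
  done
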
Failure prediction
------------------

TBD

Health alerts
-------------

The ``mgr/devicehealth/warn_threshold`` option controls how soon an
expected device failure must be before Ceph generates a health warning.

The stored life expectancy of all devices can be checked, and any
appropriate health alerts generated, with::

  ceph device check-health

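A sketch of adjusting the warning window and then re-checking device
health might look like the following.  It assumes that
``warn_threshold``, like ``scrape_frequency``, is expressed in seconds,
and the 8-week window is a hypothetical policy chosen for illustration.
The commands are printed rather than executed (remove ``echo`` to apply
them)::

  # Hypothetical policy: warn when failure is predicted within 8 weeks.
  # Assumption: the threshold is given in seconds.
  warn_weeks=8
  warn_seconds=$((warn_weeks * 7 * 24 * 60 * 60))
  echo ceph config set mgr mgr/devicehealth/warn_threshold "$warn_seconds"
  # Re-evaluate stored life expectancies against the new threshold.
  echo ceph device check-health
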
Automatic mitigation
--------------------

If the ``mgr/devicehealth/self_heal`` option is enabled (it is by
default), then for devices that are expected to fail soon the module
will automatically migrate data away from them by marking the devices
"out".

The ``mgr/devicehealth/mark_out_threshold`` option controls how soon an
expected device failure must be before Ceph automatically marks an OSD
"out".
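
When choosing a value for ``mark_out_threshold``, it is usually sensible
to keep it shorter than ``warn_threshold`` so that a warning is raised
before an OSD is marked "out".  The sketch below checks that ordering
before printing the command.  Both durations are hypothetical, and the
values are assumed to be in seconds; remove ``echo`` to apply the
setting::

  # Hypothetical windows, assumed to be expressed in seconds.
  warn_seconds=$((8 * 7 * 24 * 60 * 60))       # warn 8 weeks ahead
  mark_out_seconds=$((2 * 7 * 24 * 60 * 60))   # mark out 2 weeks ahead
  if [ "$mark_out_seconds" -lt "$warn_seconds" ]; then
      status=ok
      echo ceph config set mgr mgr/devicehealth/mark_out_threshold "$mark_out_seconds"
  else
      status=rejected
      echo "mark_out_threshold should be shorter than warn_threshold" >&2
  fi
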