Ceph tracks which hardware storage devices (e.g., HDDs, SSDs) are consumed by
which daemons, and collects health metrics about those devices in order to
provide tools to predict and/or automatically respond to hardware failure.

You can query which storage devices are in use with::

You can also list devices by daemon or by host::

  ceph device ls-by-daemon <daemon>
  ceph device ls-by-host <host>

For any individual device, you can query information about its
location and how it is being consumed with::

  ceph device info <devid>

Identifying physical devices
----------------------------

You can blink the drive LEDs on hardware enclosures to make replacing
failed disks easier and less error-prone. Use the following command::

  ceph device light on|off <devid> [ident|fault] [--force]

The ``<devid>`` parameter is the device ID. You can obtain it
with the following command::

The ``[ident|fault]`` parameter selects which light to blink.
By default, the identification light is used.

This command requires the Cephadm or the Rook `orchestrator <https://docs.ceph.com/docs/master/mgr/orchestrator/#orchestrator-cli-module>`_ module to be enabled.
To check which orchestrator module is enabled, run the following command::

Behind the scenes, the drive LEDs are blinked with ``lsmcli``. If you need
to customize this command, you can configure it via a Jinja2 template::

  ceph config-key set mgr/cephadm/blink_device_light_cmd "<template>"
  ceph config-key set mgr/cephadm/<host>/blink_device_light_cmd "lsmcli local-disk-{{ ident_fault }}-led-{{'on' if on else 'off'}} --path '{{ path or dev }}'"

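The Jinja2 template is rendered using the following arguments:

* ``on``
    A value that is true when the light should be switched on and false
    when it should be switched off.
* ``ident_fault``
    A string containing ``ident`` or ``fault``.
* ``dev``
    A string containing the device ID, e.g. ``SanDisk_X400_M.2_2280_512GB_162924424784``.
* ``path``
    A string containing the device path, e.g. ``/dev/sda``.
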
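
To make the effect of the example template concrete, the following sketch
reproduces its output with plain Python string formatting. The function
``render_blink_cmd`` is a hypothetical helper written for this illustration;
Ceph renders the configured Jinja2 template itself and does not use this code.

.. code-block:: python

    def render_blink_cmd(on, ident_fault, dev, path=None):
        """Mimic the example template (illustration only, not Ceph code)."""
        state = "on" if on else "off"   # mirrors {{'on' if on else 'off'}}
        target = path or dev            # mirrors {{ path or dev }}
        return f"lsmcli local-disk-{ident_fault}-led-{state} --path '{target}'"


    dev_id = "SanDisk_X400_M.2_2280_512GB_162924424784"
    # With a device path available, the path is used.
    print(render_blink_cmd(True, "ident", dev_id, "/dev/sda"))
    # Without a path, the template falls back to the device ID.
    print(render_blink_cmd(False, "fault", dev_id))

The second call shows the fallback expressed by ``path or dev``: when no path
is known, the device ID is passed to ``--path`` instead.
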
.. _enabling-monitoring:

Ceph can also monitor health metrics associated with your device. For
example, SATA hard disks implement a standard called SMART that
provides a wide range of internal metrics about the device's usage and
health, like the number of hours powered on, number of power cycles,
or unrecoverable read errors. Other device types like SAS and NVMe
implement a similar set of metrics (via slightly different standards).
All of these can be collected by Ceph via the ``smartctl`` tool.

You can enable or disable health monitoring with::

  ceph device monitoring on
  ceph device monitoring off

If monitoring is enabled, metrics are scraped automatically at regular
intervals. The interval can be configured with::

  ceph config set mgr mgr/devicehealth/scrape_frequency <seconds>

The default is to scrape once every 24 hours.

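
Because the interval is given in seconds, it can help to compute the value
rather than type it by hand. The sketch below is only an illustration: the
helper ``scrape_frequency_cmd`` is hypothetical and merely assembles the
command line shown above; it does not run it.

.. code-block:: python

    from datetime import timedelta


    def scrape_frequency_cmd(interval):
        """Build the argument list for setting the scrape interval."""
        seconds = int(interval.total_seconds())
        if seconds <= 0:
            raise ValueError("scrape interval must be positive")
        return ["ceph", "config", "set", "mgr",
                "mgr/devicehealth/scrape_frequency", str(seconds)]


    # The documented default of 24 hours corresponds to 86400 seconds.
    print(" ".join(scrape_frequency_cmd(timedelta(hours=24))))

The resulting list could be passed to ``subprocess.run`` on a host where the
``ceph`` CLI is available.
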
You can manually trigger a scrape of all devices with::

  ceph device scrape-health-metrics

A single device can be scraped with::

  ceph device scrape-health-metrics <device-id>

Or a single daemon's devices can be scraped with::

  ceph device scrape-daemon-health-metrics <who>

The stored health metrics for a device can be retrieved (optionally
for a specific timestamp) with::

  ceph device get-health-metrics <devid> [sample-timestamp]

Ceph can predict life expectancy and device failures based on the
health metrics it collects. The following modes are available:

* *none*: disable device failure prediction.
* *local*: use a pre-trained prediction model from the ceph-mgr daemon.

The prediction mode can be configured with::

  ceph config set global device_failure_prediction_mode <mode>

Prediction normally runs in the background on a periodic basis, so it
may take some time before life expectancy values are populated. You
can see the life expectancy of all devices in output from::

You can also query the metadata for a specific device with::

  ceph device info <devid>

You can explicitly force prediction of a device's life expectancy with::

  ceph device predict-life-expectancy <devid>

If you are not using Ceph's internal device failure prediction but
have some external source of information about device failures, you
can inform Ceph of a device's life expectancy with::

  ceph device set-life-expectancy <devid> <from> [<to>]

Life expectancies are expressed as a time interval so that
uncertainty can be expressed in the form of a wide interval. The
interval end can also be left unspecified.

The ``mgr/devicehealth/warn_threshold`` option controls how soon an expected
device failure must be before Ceph generates a health warning.

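
The threshold comparison can be pictured with a small sketch. Everything in it
is an assumption made for illustration: the function name, the two-week
threshold value, and the choice of comparing a single expected failure time
(rather than a particular end of the life expectancy interval) are not taken
from the Ceph implementation.

.. code-block:: python

    from datetime import datetime, timedelta, timezone


    def failure_within_threshold(expected_failure, threshold_seconds, now=None):
        """Return True if the expected failure is at most threshold_seconds away."""
        now = now or datetime.now(timezone.utc)
        return expected_failure - now <= timedelta(seconds=threshold_seconds)


    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    two_weeks = 14 * 24 * 3600  # hypothetical threshold, in seconds

    print(failure_within_threshold(now + timedelta(days=10), two_weeks, now))
    print(failure_within_threshold(now + timedelta(days=60), two_weeks, now))

A failure expected in 10 days falls inside the two-week window and would
warrant a warning under this model; one expected in 60 days would not.
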
The stored life expectancy of all devices can be checked, and any
appropriate health alerts generated, with::

  ceph device check-health

If the ``mgr/devicehealth/self_heal`` option is enabled (it is by
default), then for devices that are expected to fail soon the module
will automatically migrate data away from them by marking the devices
"out".

The ``mgr/devicehealth/mark_out_threshold`` option controls how soon an
expected device failure must be before Ceph automatically marks an OSD
"out".