.. _devices:

Device Management
=================

Ceph tracks which hardware storage devices (e.g., HDDs, SSDs) are consumed by
which daemons, and collects health metrics about those devices in order to
provide tools to predict and/or automatically respond to hardware failure.

Device tracking
---------------

You can query which storage devices are in use with::

  ceph device ls

You can also list devices by daemon or by host::

  ceph device ls-by-daemon <daemon>
  ceph device ls-by-host <host>

For any individual device, you can query information about its
location and how it is being consumed with::

  ceph device info <devid>
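
For orientation, the listing below is only an illustrative sketch of
``ceph device ls`` output; the device IDs, hosts, and daemons are made-up
examples, and the exact columns can vary between releases::

  DEVICE                                     HOST:DEV   DAEMONS  LIFE EXPECTANCY
  SanDisk_X400_M.2_2280_512GB_162924424784   node1:sda  osd.0
  SanDisk_X400_M.2_2280_512GB_162924424785   node2:sdb  osd.1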

Identifying physical devices
----------------------------

You can blink the drive LEDs on hardware enclosures to make the replacement of
failed disks easy and less error-prone. Use the following command::

  ceph device light on|off <devid> [ident|fault] [--force]

The ``<devid>`` parameter is the device identification. You can obtain this
information using the following command::

  ceph device ls

The ``[ident|fault]`` parameter is used to set the kind of light to blink.
By default, the `identification` light is used.
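
For example, to blink the identification LED of a particular drive and then
turn it off again, one might run something like the following (the device ID
is just an illustration; substitute one reported by ``ceph device ls``)::

  ceph device light on SanDisk_X400_M.2_2280_512GB_162924424784 ident
  ceph device light off SanDisk_X400_M.2_2280_512GB_162924424784 ident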

.. note::
   This command requires the Cephadm or the Rook `orchestrator <https://docs.ceph.com/docs/master/mgr/orchestrator/#orchestrator-cli-module>`_ module to be enabled.
   You can check which orchestrator module is enabled by running the following command::

     ceph orch status

Behind the scenes, the command used to blink the drive LEDs is `lsmcli`. If you
need to customize this command, you can configure it via a Jinja2 template::

  ceph config-key set mgr/cephadm/blink_device_light_cmd "<template>"
  ceph config-key set mgr/cephadm/<host>/blink_device_light_cmd "lsmcli local-disk-{{ ident_fault }}-led-{{'on' if on else 'off'}} --path '{{ path or dev }}'"

The Jinja2 template is rendered using the following arguments:

* ``on``
  A boolean value.
* ``ident_fault``
  A string containing `ident` or `fault`.
* ``dev``
  A string containing the device ID, e.g. `SanDisk_X400_M.2_2280_512GB_162924424784`.
* ``path``
  A string containing the device path, e.g. `/dev/sda`.
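
For illustration, with the example template above and the arguments
``on=True``, ``ident_fault='ident'`` and ``path='/dev/sda'``, the rendered
command would be::

  lsmcli local-disk-ident-led-on --path '/dev/sda'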

.. _enabling-monitoring:

Enabling monitoring
-------------------

Ceph can also monitor health metrics associated with your devices. For
example, SATA hard disks implement a standard called SMART that
provides a wide range of internal metrics about the device's usage and
health, like the number of hours powered on, number of power cycles,
or unrecoverable read errors. Other device types like SAS and NVMe
implement a similar set of metrics (via slightly different standards).
All of these can be collected by Ceph via the ``smartctl`` tool.
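
If metrics never appear for a device, it can help to first confirm outside of
Ceph that the device reports SMART data at all, for example with a manual
``smartctl`` run on the host (the device path here is only an example)::

  sudo smartctl -a /dev/sda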

You can enable or disable health monitoring with::

  ceph device monitoring on

or::

  ceph device monitoring off


Scraping
--------

If monitoring is enabled, metrics will automatically be scraped at regular
intervals. That interval can be configured with::

  ceph config set mgr mgr/devicehealth/scrape_frequency <seconds>

The default is to scrape once every 24 hours.
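
For example, to scrape twice a day instead, set the interval to 12 hours
(12 * 3600 = 43200 seconds)::

  ceph config set mgr mgr/devicehealth/scrape_frequency 43200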

You can manually trigger a scrape of all devices with::

  ceph device scrape-health-metrics

A single device can be scraped with::

  ceph device scrape-health-metrics <device-id>

Or a single daemon's devices can be scraped with::

  ceph device scrape-daemon-health-metrics <who>

The stored health metrics for a device can be retrieved (optionally
for a specific timestamp) with::

  ceph device get-health-metrics <devid> [sample-timestamp]
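
The metrics are returned as JSON. As a rough sketch (the device ID is only an
example, and the exact structure of the output may vary between releases), the
timestamps of the stored samples for a device could be listed with a tool such
as ``jq``::

  ceph device get-health-metrics SanDisk_X400_M.2_2280_512GB_162924424784 | jq 'keys'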

Failure prediction
------------------

Ceph can predict life expectancy and device failures based on the
health metrics it collects. There are two modes:

* *none*: disable device failure prediction.
* *local*: use a pre-trained prediction model from the ceph-mgr daemon.

The prediction mode can be configured with::

  ceph config set global device_failure_prediction_mode <mode>
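
For example, to enable the local prediction mode described above::

  ceph config set global device_failure_prediction_mode local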

Prediction normally runs in the background on a periodic basis, so it
may take some time before life expectancy values are populated. You
can see the life expectancy of all devices in output from::

  ceph device ls

You can also query the metadata for a specific device with::

  ceph device info <devid>

You can explicitly force prediction of a device's life expectancy with::

  ceph device predict-life-expectancy <devid>

If you are not using Ceph's internal device failure prediction but
have some external source of information about device failures, you
can inform Ceph of a device's life expectancy with::

  ceph device set-life-expectancy <devid> <from> [<to>]

Life expectancies are expressed as a time interval so that
uncertainty can be expressed in the form of a wide interval. The
interval end can also be left unspecified.
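
As a sketch, if an external tool expects a drive to fail some time in the
first half of 2021, that expectation could be recorded as an interval like the
following (the device ID is illustrative, and the exact timestamp format
accepted may vary by release; plain ISO-style dates are assumed here)::

  ceph device set-life-expectancy SanDisk_X400_M.2_2280_512GB_162924424784 2021-01-01 2021-07-01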

Health alerts
-------------

The ``mgr/devicehealth/warn_threshold`` option controls how soon an expected
device failure must be before we generate a health warning: a warning is
raised for any device predicted to fail within this window.
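
For instance, to warn about any device that is predicted to fail within the
next four weeks (assuming, as with ``scrape_frequency``, that the value is
given in seconds; 4 weeks is 2419200 seconds)::

  ceph config set mgr mgr/devicehealth/warn_threshold 2419200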

The stored life expectancy of all devices can be checked, and any
appropriate health alerts generated, with::

  ceph device check-health

Automatic Mitigation
--------------------

If the ``mgr/devicehealth/self_heal`` option is enabled (it is by
default), then for devices that are expected to fail soon the module
will automatically migrate data away from them by marking the devices
"out".

The ``mgr/devicehealth/mark_out_threshold`` option controls how soon an
expected device failure must be before we automatically mark an OSD
"out".