.. _devices:

Device Management
=================

Ceph tracks which hardware storage devices (e.g., HDDs, SSDs) are consumed by
which daemons, and collects health metrics about those devices in order to
provide tools that predict hardware failure, respond to it automatically, or
both.

Device tracking
---------------

You can query which storage devices are in use with::

  ceph device ls

You can also list devices by daemon or by host::

  ceph device ls-by-daemon <daemon>
  ceph device ls-by-host <host>

For any individual device, you can query information about its
location and how it is being consumed with::

  ceph device info <devid>

Identifying physical devices
----------------------------

You can blink the drive LEDs on hardware enclosures to make the replacement of
failed disks easier and less error-prone. Use the following command::

  ceph device light on|off <devid> [ident|fault] [--force]

The ``<devid>`` parameter is the device identifier. You can obtain it with
the following command::

  ceph device ls

The ``[ident|fault]`` parameter selects which light to blink.
By default, the *identification* light is used.

.. note::
   This command requires the Cephadm or the Rook `orchestrator <https://docs.ceph.com/docs/master/mgr/orchestrator/#orchestrator-cli-module>`_ module to be enabled.
   To check which orchestrator module is enabled, run the following command::

     ceph orch status
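
The optional arguments of the light command are easy to get wrong when the
command is assembled by a script. The following is a minimal sketch, not part
of Ceph: ``device_light_command`` is a hypothetical helper that only builds the
argument list, and it assumes the command is invoked through the ``ceph`` CLI
like the other commands on this page.

```python
from typing import List, Optional


def device_light_command(state: str, devid: str,
                         light: Optional[str] = None,
                         force: bool = False) -> List[str]:
    """Build the argument list for the device light command.

    Hypothetical helper: it validates the documented choices and returns
    the argv without running anything.
    """
    if state not in ("on", "off"):
        raise ValueError("state must be 'on' or 'off'")
    if light is not None and light not in ("ident", "fault"):
        raise ValueError("light must be 'ident' or 'fault'")

    # Assumption: the command is run via the ``ceph`` CLI.
    argv = ["ceph", "device", "light", state, devid]
    if light is not None:
        # Omitting this argument leaves the default (identification) light.
        argv.append(light)
    if force:
        argv.append("--force")
    return argv
```

The returned list could be passed to ``subprocess.run`` on a host where the
``ceph`` CLI and an orchestrator module are available.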

Enabling monitoring
-------------------

Ceph can also monitor health metrics associated with your devices. For
example, SATA hard disks implement a standard called SMART that
provides a wide range of internal metrics about the device's usage and
health, such as the number of hours powered on, the number of power cycles,
and the number of unrecoverable read errors. Other device types, such as SAS
and NVMe, implement a similar set of metrics (via slightly different
standards). All of these can be collected by Ceph via the ``smartctl`` tool.

You can enable or disable health monitoring with::

  ceph device monitoring on

or::

  ceph device monitoring off

Scraping
--------

If monitoring is enabled, metrics will automatically be scraped at regular
intervals. That interval can be configured with::

  ceph config set mgr mgr/devicehealth/scrape_frequency <seconds>

The default is to scrape once every 24 hours.
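
Because the option takes seconds while intervals are usually thought of in
hours, a small conversion helps avoid mistakes. The sketch below is
illustrative only: ``scrape_frequency_command`` is a hypothetical helper that
converts hours to seconds and builds the ``ceph config set`` argument list
shown above without executing it.

```python
from typing import List

SECONDS_PER_HOUR = 3600


def scrape_frequency_command(hours: float) -> List[str]:
    """Return the argv that sets the scrape interval to ``hours`` hours."""
    if hours <= 0:
        raise ValueError("the scrape interval must be positive")
    # The option is expressed in whole seconds.
    seconds = int(round(hours * SECONDS_PER_HOUR))
    return ["ceph", "config", "set", "mgr",
            "mgr/devicehealth/scrape_frequency", str(seconds)]
```

For example, ``scrape_frequency_command(24)`` yields a final argument of
``86400``, which corresponds to the 24-hour default.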

You can manually trigger a scrape of all devices with::

  ceph device scrape-health-metrics

A single device can be scraped with::

  ceph device scrape-health-metrics <device-id>

The devices of a single daemon can be scraped with::

  ceph device scrape-daemon-health-metrics <who>

The stored health metrics for a device can be retrieved (optionally
for a specific timestamp) with::

  ceph device get-health-metrics <devid> [sample-timestamp]

Failure prediction
------------------

Ceph can predict life expectancy and device failures based on the
health metrics it collects. There are three modes:

* *none*: disable device failure prediction.
* *local*: use a pre-trained prediction model from the ceph-mgr daemon.
* *cloud*: share device health and performance metrics with an external
  cloud service run by ProphetStor, using either their free service or
  a paid service with more accurate predictions.

The prediction mode can be configured with::

  ceph config set global device_failure_prediction_mode <mode>

Prediction normally runs in the background on a periodic basis, so it
may take some time before life expectancy values are populated. You
can see the life expectancy of all devices in output from::

  ceph device ls

You can also query the metadata for a specific device with::

  ceph device info <devid>

You can explicitly force prediction of a device's life expectancy with::

  ceph device predict-life-expectancy <devid>

If you are not using Ceph's internal device failure prediction but
have some external source of information about device failures, you
can inform Ceph of a device's life expectancy with::

  ceph device set-life-expectancy <devid> <from> [<to>]

Life expectancies are expressed as a time interval so that
uncertainty can be expressed in the form of a wide interval. The
interval end can also be left unspecified.
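
To make the interval idea concrete, the following sketch models a life
expectancy as a start date with an optional end date. It is not Ceph's
internal representation: the ``LifeExpectancy`` class and its field names are
hypothetical, and it says nothing about the date format the CLI expects.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class LifeExpectancy:
    """Hypothetical model of a predicted failure window."""
    earliest: date                 # corresponds to <from>
    latest: Optional[date] = None  # corresponds to the optional <to>

    def __post_init__(self) -> None:
        if self.latest is not None and self.latest < self.earliest:
            raise ValueError("interval end precedes interval start")

    def uncertainty(self) -> Optional[timedelta]:
        """Width of the interval, or None when the end is unspecified."""
        if self.latest is None:
            return None
        return self.latest - self.earliest
```

A wider interval returned by ``uncertainty()`` reflects a less certain
prediction, and ``None`` marks an open-ended one.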

Health alerts
-------------

The ``mgr/devicehealth/warn_threshold`` option controls how soon an expected
device failure must be before Ceph generates a health warning.

The stored life expectancy of all devices can be checked, and any
appropriate health alerts generated, with::

  ceph device check-health

Automatic Mitigation
--------------------

If the ``mgr/devicehealth/self_heal`` option is enabled (it is by
default), then for devices that are expected to fail soon the module
will automatically migrate data away from them by marking the devices
"out".

The ``mgr/devicehealth/mark_out_threshold`` option controls how soon an
expected device failure must be before Ceph automatically marks an OSD
"out".
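
The interaction between the two thresholds and ``self_heal`` can be sketched
as a simple comparison. This is an illustration of the behavior described
above, not the module's actual code: ``planned_actions`` is a hypothetical
function, the strict ``<`` comparison is an assumption, and the threshold
values in the usage example are made up rather than Ceph defaults.

```python
from datetime import timedelta
from typing import List


def planned_actions(time_to_failure: timedelta,
                    warn_threshold: timedelta,
                    mark_out_threshold: timedelta,
                    self_heal: bool = True) -> List[str]:
    """List the responses expected for a device's predicted failure.

    Assumption: a response applies when the predicted failure is sooner
    than the corresponding threshold.
    """
    actions = []
    if time_to_failure < warn_threshold:
        actions.append("health-warning")
    # Marking out only happens when automatic mitigation is enabled.
    if self_heal and time_to_failure < mark_out_threshold:
        actions.append("mark-out")
    return actions


# Illustrative thresholds only (not taken from Ceph's configuration).
warn = timedelta(weeks=12)
mark_out = timedelta(weeks=4)
print(planned_actions(timedelta(weeks=8), warn, mark_out))
```

With these illustrative values, a device expected to fail in eight weeks
triggers only a health warning, while one expected to fail in two weeks would
also be marked "out" if ``self_heal`` is enabled.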