.. _diskprediction:

=====================
Diskprediction Module
=====================

The *diskprediction* module supports two modes: cloud mode and local mode. In cloud mode, disk and Ceph operating status information is collected from the Ceph cluster and sent to a cloud-based DiskPrediction server over the Internet. The DiskPrediction server analyzes the data and provides analytics and predictions of performance and disk health states for Ceph clusters.

Local mode does not require an external server for data analysis or for producing results. In local mode, the *diskprediction* module uses an internal predictor module for the disk prediction service and returns the prediction result to the Ceph system.

| Local predictor: 70% accuracy
| Free cloud predictor: 95% accuracy

Enabling
========

Run the following commands to enable the cloud and local *diskprediction*
modules in the Ceph environment::

    ceph mgr module enable diskprediction_cloud
    ceph mgr module enable diskprediction_local


Select the prediction mode::

    ceph config set global device_failure_prediction_mode local

or::

    ceph config set global device_failure_prediction_mode cloud

To disable prediction::

    ceph config set global device_failure_prediction_mode none


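The mode selection above can be scripted. The following is a minimal sketch, not part of the module: ``prediction_mode_command`` is a hypothetical helper that only builds the ``ceph config set`` command line shown above and rejects any mode other than the three named in this section.

```python
import shlex

# The three values this document lists for device_failure_prediction_mode.
VALID_MODES = ("local", "cloud", "none")


def prediction_mode_command(mode):
    """Build the argv list that selects a prediction mode (hypothetical helper)."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown prediction mode: {mode!r}")
    return ["ceph", "config", "set", "global",
            "device_failure_prediction_mode", mode]


if __name__ == "__main__":
    # Only prints the command; pass the list to subprocess.run() to execute it.
    print(shlex.join(prediction_mode_command("local")))
```

Building the argument list separately from running it keeps the validation testable on a machine without a Ceph cluster.
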
Connection settings
===================

The connection settings define how Ceph connects to the DiskPrediction server.

Local Mode
----------

The *diskprediction* module uses the Ceph device health check to collect disk health metrics, passes them to an internal predictor module to produce a disk failure prediction, and returns the result to Ceph. No connection settings are therefore required in local mode. The local predictor module requires at least six datasets of device health metrics to make a prediction.

Run the following command to have the local predictor predict a device's life expectancy.

::

    ceph device predict-life-expectancy <device id>


Cloud Mode
----------

User registration is required in cloud mode. Users must sign up for an account at https://www.diskprophet.com/#/ to receive the following DiskPrediction server information for the connection settings.

**Certificate file path**: After user registration is confirmed, the system sends a confirmation email that includes a download link for a certificate file. Download the certificate file and save it to the Ceph system. Run the following command to verify the file. Without certificate file verification, the connection settings cannot be completed.

**DiskPrediction server**: The DiskPrediction server name. It can be an IP address if required.

**Connection account**: The account name used to set up the connection between Ceph and the DiskPrediction server.

**Connection password**: The password used to set up the connection between Ceph and the DiskPrediction server.

Run the following command to complete the connection setup.

::

    ceph device set-cloud-prediction-config <diskprediction_server> <connection_account> <connection_password> <certificate file path>


You can use the following command to display the connection settings:

::

    ceph device show-prediction-config


The following additional configuration settings are optional:

:diskprediction_upload_metrics_interval: How often Ceph performance metrics are sent to the DiskPrediction server. The default is 10 minutes.
:diskprediction_upload_smart_interval: How often Ceph physical device information is sent to the DiskPrediction server. The default is 12 hours.
:diskprediction_retrieve_prediction_interval: How often Ceph retrieves physical device prediction data from the DiskPrediction server. The default is 12 hours.



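To get a feel for the traffic these defaults imply, the short sketch below converts each default interval into a number of runs per day. It is illustrative arithmetic only: the dictionary restates the defaults listed above, and ``runs_per_day`` is a hypothetical helper, not something the module provides.

```python
from datetime import timedelta

# Default intervals as stated in the option descriptions above.
DEFAULT_INTERVALS = {
    "diskprediction_upload_metrics_interval": timedelta(minutes=10),
    "diskprediction_upload_smart_interval": timedelta(hours=12),
    "diskprediction_retrieve_prediction_interval": timedelta(hours=12),
}


def runs_per_day(interval):
    """Number of whole intervals that fit into 24 hours."""
    return timedelta(days=1) // interval


for name, interval in DEFAULT_INTERVALS.items():
    print(f"{name}: {runs_per_day(interval)} per day")
```

With the documented defaults this works out to 144 metrics uploads, two SMART uploads, and two prediction retrievals per day.
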
Diskprediction Data
===================

The *diskprediction* module sends the following data to, and retrieves it from, the DiskPrediction server.


Metrics Data
------------
- Ceph cluster status

+----------------------+-----------------------------------------+
|key                   |Description                              |
+======================+=========================================+
|cluster_health        |Ceph health check status                 |
+----------------------+-----------------------------------------+
|num_mon               |Number of monitor nodes                  |
+----------------------+-----------------------------------------+
|num_mon_quorum        |Number of monitors in quorum             |
+----------------------+-----------------------------------------+
|num_osd               |Total number of OSDs                     |
+----------------------+-----------------------------------------+
|num_osd_up            |Number of OSDs that are up               |
+----------------------+-----------------------------------------+
|num_osd_in            |Number of OSDs that are in the cluster   |
+----------------------+-----------------------------------------+
|osd_epoch             |Current epoch of OSD map                 |
+----------------------+-----------------------------------------+
|osd_bytes             |Total capacity of cluster in bytes       |
+----------------------+-----------------------------------------+
|osd_bytes_used        |Number of used bytes on cluster          |
+----------------------+-----------------------------------------+
|osd_bytes_avail       |Number of available bytes on cluster     |
+----------------------+-----------------------------------------+
|num_pool              |Number of pools                          |
+----------------------+-----------------------------------------+
|num_pg                |Total number of placement groups         |
+----------------------+-----------------------------------------+
|num_pg_active_clean   |Number of placement groups in            |
|                      |active+clean state                       |
+----------------------+-----------------------------------------+
|num_pg_active         |Number of placement groups in active     |
|                      |state                                    |
+----------------------+-----------------------------------------+
|num_pg_peering        |Number of placement groups in peering    |
|                      |state                                    |
+----------------------+-----------------------------------------+
|num_object            |Total number of objects on cluster       |
+----------------------+-----------------------------------------+
|num_object_degraded   |Number of degraded (missing replicas)    |
|                      |objects                                  |
+----------------------+-----------------------------------------+
|num_object_misplaced  |Number of misplaced (wrong location in   |
|                      |the cluster) objects                     |
+----------------------+-----------------------------------------+
|num_object_unfound    |Number of unfound objects                |
+----------------------+-----------------------------------------+
|num_bytes             |Total number of bytes of all objects     |
+----------------------+-----------------------------------------+
|num_mds_up            |Number of MDSs that are up               |
+----------------------+-----------------------------------------+
|num_mds_in            |Number of MDSs that are in the cluster   |
+----------------------+-----------------------------------------+
|num_mds_failed        |Number of failed MDSs                    |
+----------------------+-----------------------------------------+
|mds_epoch             |Current epoch of MDS map                 |
+----------------------+-----------------------------------------+


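As an illustration of how the cluster status keys above relate to one another, the sketch below derives a few ratios from a single status record. It assumes the record is available as a plain dictionary keyed by the names in the table; the sample values are invented for the example and ``cluster_summary`` is a hypothetical helper.

```python
def ratio(part, total):
    """Return part/total, or 0.0 when total is zero (for example, no OSDs yet)."""
    return part / total if total else 0.0


def cluster_summary(status):
    """Derive a few ratios from a cluster status record (hypothetical helper)."""
    return {
        "capacity_used": ratio(status["osd_bytes_used"], status["osd_bytes"]),
        "osds_up": ratio(status["num_osd_up"], status["num_osd"]),
        "pgs_active_clean": ratio(status["num_pg_active_clean"], status["num_pg"]),
    }


# Illustrative values only, not real cluster output.
sample = {
    "osd_bytes": 1000, "osd_bytes_used": 250,
    "num_osd": 4, "num_osd_up": 3,
    "num_pg": 128, "num_pg_active_clean": 128,
}
summary = cluster_summary(sample)
print(summary)
```
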
- Ceph mon/osd performance counters

Mon:

+----------------------+-----------------------------------------+
|key                   |Description                              |
+======================+=========================================+
|num_sessions          |Current number of opened monitor sessions|
+----------------------+-----------------------------------------+
|session_add           |Number of created monitor sessions       |
+----------------------+-----------------------------------------+
|session_rm            |Number of remove_session calls in monitor|
+----------------------+-----------------------------------------+
|session_trim          |Number of trimmed monitor sessions       |
+----------------------+-----------------------------------------+
|num_elections         |Number of elections monitor took part in |
+----------------------+-----------------------------------------+
|election_call         |Number of elections started by monitor   |
+----------------------+-----------------------------------------+
|election_win          |Number of elections won by monitor       |
+----------------------+-----------------------------------------+
|election_lose         |Number of elections lost by monitor      |
+----------------------+-----------------------------------------+

Osd:

+----------------------+-----------------------------------------+
|key                   |Description                              |
+======================+=========================================+
|op_wip                |Replication operations currently being   |
|                      |processed (primary)                      |
+----------------------+-----------------------------------------+
|op_in_bytes           |Client operations total write size       |
+----------------------+-----------------------------------------+
|op_r                  |Client read operations                   |
+----------------------+-----------------------------------------+
|op_out_bytes          |Client operations total read size        |
+----------------------+-----------------------------------------+
|op_w                  |Client write operations                  |
+----------------------+-----------------------------------------+
|op_latency            |Latency of client operations (including  |
|                      |queue time)                              |
+----------------------+-----------------------------------------+
|op_process_latency    |Latency of client operations (excluding  |
|                      |queue time)                              |
+----------------------+-----------------------------------------+
|op_r_latency          |Latency of read operation (including     |
|                      |queue time)                              |
+----------------------+-----------------------------------------+
|op_r_process_latency  |Latency of read operation (excluding     |
|                      |queue time)                              |
+----------------------+-----------------------------------------+
|op_w_in_bytes         |Client data written                      |
+----------------------+-----------------------------------------+
|op_w_latency          |Latency of write operation (including    |
|                      |queue time)                              |
+----------------------+-----------------------------------------+
|op_w_process_latency  |Latency of write operation (excluding    |
|                      |queue time)                              |
+----------------------+-----------------------------------------+
|op_rw                 |Client read-modify-write operations      |
+----------------------+-----------------------------------------+
|op_rw_in_bytes        |Client read-modify-write operations write|
|                      |in                                       |
+----------------------+-----------------------------------------+
|op_rw_out_bytes       |Client read-modify-write operations read |
|                      |out                                      |
+----------------------+-----------------------------------------+
|op_rw_latency         |Latency of read-modify-write operation   |
|                      |(including queue time)                   |
+----------------------+-----------------------------------------+
|op_rw_process_latency |Latency of read-modify-write operation   |
|                      |(excluding queue time)                   |
+----------------------+-----------------------------------------+


- Ceph pool statistics

+----------------------+-----------------------------------------+
|key                   |Description                              |
+======================+=========================================+
|bytes_used            |Per pool bytes used                      |
+----------------------+-----------------------------------------+
|max_avail             |Max available number of bytes in the pool|
+----------------------+-----------------------------------------+
|objects               |Number of objects in the pool            |
+----------------------+-----------------------------------------+
|wr_bytes              |Number of bytes written in the pool      |
+----------------------+-----------------------------------------+
|dirty                 |Number of bytes dirty in the pool        |
+----------------------+-----------------------------------------+
|rd_bytes              |Number of bytes read in the pool         |
+----------------------+-----------------------------------------+
|stored_raw            |Bytes used in pool including copies made |
+----------------------+-----------------------------------------+

- Ceph physical device metadata

+----------------------+-----------------------------------------+
|key                   |Description                              |
+======================+=========================================+
|disk_domain_id        |Physical device identifier               |
+----------------------+-----------------------------------------+
|disk_name             |Device attachment name                   |
+----------------------+-----------------------------------------+
|disk_wwn              |Device WWN                               |
+----------------------+-----------------------------------------+
|model                 |Device model name                        |
+----------------------+-----------------------------------------+
|serial_number         |Device serial number                     |
+----------------------+-----------------------------------------+
|size                  |Device size                              |
+----------------------+-----------------------------------------+
|vendor                |Device vendor name                       |
+----------------------+-----------------------------------------+

- Correlation information for each Ceph object
- The module agent information
- The module agent cluster information
- The module agent host information


SMART Data
----------
- Ceph physical device SMART data (provided by Ceph *devicehealth* module)


Prediction Data
---------------
- Ceph physical device prediction data


Receiving predicted health status from a Ceph OSD disk drive
============================================================

You can retrieve the predicted health status of a Ceph OSD disk drive by using
the following command.

::

    ceph device get-predicted-status <device id>


The get-predicted-status command returns:


::

    {
        "near_failure": "Good",
        "disk_wwn": "5000011111111111",
        "serial_number": "111111111",
        "predicted": "2018-05-30 18:33:12",
        "attachment": "sdb"
    }


+--------------------+-----------------------------------------------------+
|Attribute           | Description                                         |
+====================+=====================================================+
|near_failure        | The disk failure prediction state:                  |
|                    | Good/Warning/Bad/Unknown                            |
+--------------------+-----------------------------------------------------+
|disk_wwn            | Disk WWN                                            |
+--------------------+-----------------------------------------------------+
|serial_number       | Disk serial number                                  |
+--------------------+-----------------------------------------------------+
|predicted           | Predicted date                                      |
+--------------------+-----------------------------------------------------+
|attachment          | Device name on the local system                     |
+--------------------+-----------------------------------------------------+

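A script that consumes this output might parse it as follows. This is a minimal sketch that works on the sample output shown above, embedded as a string. The timestamp format is inferred from that sample, and ``parse_predicted_status`` is a hypothetical helper rather than part of Ceph.

```python
import json
from datetime import datetime

# Sample output copied from the get-predicted-status example above.
SAMPLE_OUTPUT = """
{
    "near_failure": "Good",
    "disk_wwn": "5000011111111111",
    "serial_number": "111111111",
    "predicted": "2018-05-30 18:33:12",
    "attachment": "sdb"
}
"""


def parse_predicted_status(text):
    """Parse the JSON and convert 'predicted' to a datetime (hypothetical helper)."""
    status = json.loads(text)
    # Format inferred from the sample; adjust if real output differs.
    status["predicted"] = datetime.strptime(status["predicted"],
                                            "%Y-%m-%d %H:%M:%S")
    return status


status = parse_predicted_status(SAMPLE_OUTPUT)
# Flag the device when the prediction state is anything worse than Good.
needs_attention = status["near_failure"] in ("Warning", "Bad")
print(status["attachment"], status["near_failure"], needs_attention)
```
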
The *near_failure* attribute indicates the disk's life expectancy, as shown in the following table.

+--------------------+-----------------------------------------------------+
|near_failure        | Life expectancy (weeks)                             |
+====================+=====================================================+
|Good                | > 6 weeks                                           |
+--------------------+-----------------------------------------------------+
|Warning             | 2 to 6 weeks                                        |
+--------------------+-----------------------------------------------------+
|Bad                 | < 2 weeks                                           |
+--------------------+-----------------------------------------------------+


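The thresholds in the table can be written as a small function. This is a sketch of the mapping only, not the predictor itself: ``near_failure_state`` is a hypothetical name, and returning ``Unknown`` when no estimate is available is an assumption, since the table does not say when that state is reported.

```python
def near_failure_state(weeks):
    """Map a life expectancy in weeks to a near_failure state.

    Thresholds follow the table above. Treating a missing estimate
    (None) as "Unknown" is an assumption made for this sketch.
    """
    if weeks is None:
        return "Unknown"
    if weeks > 6:
        return "Good"
    if weeks >= 2:
        return "Warning"
    return "Bad"


for weeks in (10, 6, 2, 1, None):
    print(weeks, near_failure_state(weeks))
```
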
Debugging
=========

The DiskPrediction module logs through the Ceph manager. To debug the module,
raise the manager logging level with the following configuration setting.

::

    [mgr]

        debug mgr = 20

With manager logging set to debug, the module prints log messages with the
prefix *mgr[diskprediction]* for easy filtering.

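Filtering on that prefix can be done with any text tool. The sketch below shows the idea in Python; the two sample log lines are invented for illustration and do not reproduce the real manager log format.

```python
# Prefix named in the text above.
PREFIX = "mgr[diskprediction]"


def diskprediction_lines(lines):
    """Keep only the log lines that carry the module's prefix."""
    return [line for line in lines if PREFIX in line]


# Invented sample lines; real manager log lines will look different.
sample_log = [
    "debug mgr[diskprediction] example message from the module",
    "debug mgr[balancer] example message from another module",
]

for line in diskprediction_lines(sample_log):
    print(line)
```
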