.. _devices:

Device Management
=================

Ceph tracks which hardware storage devices (e.g., HDDs, SSDs) are consumed by
which daemons, and collects health metrics about those devices in order to
provide tools to predict and/or automatically respond to hardware failure.

Device tracking
---------------

You can query which storage devices are in use with::

  ceph device ls

You can also list devices by daemon or by host::

  ceph device ls-by-daemon <daemon>
  ceph device ls-by-host <host>

For any individual device, you can query information about its
location and how it is being consumed with::

  ceph device info <devid>

Identifying physical devices
----------------------------

You can blink the drive LEDs on hardware enclosures to make the replacement of
failed disks easier and less error-prone. Use the following command::

  ceph device light on|off <devid> [ident|fault] [--force]

The ``<devid>`` parameter is the device identifier. You can obtain it
with the following command::

  ceph device ls

The ``[ident|fault]`` parameter selects which light to blink.
By default, the identification (``ident``) light is used.
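
As an illustration, the sketch below builds the commands that would turn the
fault light on for one device and off again after the disk has been swapped.
The device ID is a made-up placeholder, and the commands are only printed so
that they can be reviewed first; run the printed lines yourself to act on a
real cluster.

.. code-block:: bash

   # Hypothetical device ID; take a real one from "ceph device ls".
   devid="VENDOR_MODEL_SERIAL123"

   # Build the commands as strings (dry run, nothing is executed).
   light_on="ceph device light on ${devid} fault"
   light_off="ceph device light off ${devid} fault"

   echo "${light_on}"
   echo "${light_off}"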

.. note::
   This command requires the Cephadm or the Rook `orchestrator <https://docs.ceph.com/docs/master/mgr/orchestrator/#orchestrator-cli-module>`_ module to be enabled.
   To see which orchestrator module is enabled, run::

     ceph orch status

Enabling monitoring
-------------------

Ceph can also monitor health metrics associated with your device. For
example, SATA hard disks implement a standard called SMART that
provides a wide range of internal metrics about the device's usage and
health, like the number of hours powered on, number of power cycles,
or unrecoverable read errors. Other device types like SAS and NVMe
implement a similar set of metrics (via slightly different standards).
All of these can be collected by Ceph via the ``smartctl`` tool.

You can enable or disable health monitoring with::

  ceph device monitoring on

or::

  ceph device monitoring off

Scraping
--------

If monitoring is enabled, metrics will automatically be scraped at regular
intervals. That interval can be configured with::

  ceph config set mgr mgr/devicehealth/scrape_frequency <seconds>

The default is to scrape once every 24 hours.
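
Because the option takes seconds, it can help to derive the value from a more
readable unit. The following sketch converts an interval in hours (12 is only
an example value, not a recommendation) and prints the resulting command
instead of running it.

.. code-block:: bash

   # Desired scrape interval in hours (example value).
   hours=12

   # The option is expressed in seconds.
   scrape_seconds=$((hours * 60 * 60))

   # Dry run: print the command; remove "echo" to apply it to a cluster.
   echo ceph config set mgr mgr/devicehealth/scrape_frequency "${scrape_seconds}"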

You can manually trigger a scrape of all devices with::

  ceph device scrape-health-metrics

A single device can be scraped with::

  ceph device scrape-health-metrics <devid>

Or a single daemon's devices can be scraped with::

  ceph device scrape-daemon-health-metrics <who>

The stored health metrics for a device can be retrieved (optionally
for a specific timestamp) with::

  ceph device get-health-metrics <devid> [sample-timestamp]

Failure prediction
------------------

Ceph can predict life expectancy and device failures based on the
health metrics it collects. There are three modes:

* *none*: disable device failure prediction.
* *local*: use a pre-trained prediction model from the ceph-mgr daemon.
* *cloud*: share device health and performance metrics with an external
  cloud service run by ProphetStor, using either their free service or
  a paid service with more accurate predictions.

The prediction mode can be configured with::

  ceph config set global device_failure_prediction_mode <mode>
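
A small wrapper can guard against typos in the mode name. The function below
is a hypothetical helper, not part of Ceph: it accepts only the three modes
listed above and prints the corresponding command rather than executing it.

.. code-block:: bash

   # Hypothetical helper: validate the mode, then print the command (dry run).
   prediction_mode_cmd() {
     case "$1" in
       none|local|cloud)
         echo "ceph config set global device_failure_prediction_mode $1"
         ;;
       *)
         echo "unknown prediction mode: $1" >&2
         return 1
         ;;
     esac
   }

   prediction_mode_cmd local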

Prediction normally runs in the background on a periodic basis, so it
may take some time before life expectancy values are populated. You
can see the life expectancy of all devices in output from::

  ceph device ls

You can also query the metadata for a specific device with::

  ceph device info <devid>

You can explicitly force prediction of a device's life expectancy with::

  ceph device predict-life-expectancy <devid>

If you are not using Ceph's internal device failure prediction but
have some external source of information about device failures, you
can inform Ceph of a device's life expectancy with::

  ceph device set-life-expectancy <devid> <from> [<to>]

Life expectancies are expressed as a time interval so that
uncertainty can be expressed in the form of a wide interval. The
interval end can also be left unspecified.
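
As a rough sketch, an external tool that expects a device to fail between four
and eight weeks from now could report that interval as shown below. Several
things here are assumptions rather than documented behavior: the device ID is
a placeholder, the ``YYYY-MM-DD`` date format is assumed to be accepted for
``<from>`` and ``<to>``, and the ``-d`` option requires GNU ``date``. The
command is printed, not executed.

.. code-block:: bash

   # Hypothetical device ID; take a real one from "ceph device ls".
   devid="VENDOR_MODEL_SERIAL123"

   # Interval bounds as UTC dates (GNU date syntax; date format is an assumption).
   from="$(date -u -d '+4 weeks' +%Y-%m-%d)"
   to="$(date -u -d '+8 weeks' +%Y-%m-%d)"

   # Dry run: print the command; remove "echo" to apply it to a cluster.
   echo ceph device set-life-expectancy "${devid}" "${from}" "${to}"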
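
Health alerts
-------------

The ``mgr/devicehealth/warn_threshold`` option controls how soon an expected
device failure must be before Ceph generates a health warning.

The stored life expectancy of all devices can be checked, and any
appropriate health alerts generated, with::

  ceph device check-health

Automatic mitigation
--------------------

If the ``mgr/devicehealth/self_heal`` option is enabled (it is by
default), then for devices that are expected to fail soon the module
will automatically migrate data away from them by marking the devices
"out".

The ``mgr/devicehealth/mark_out_threshold`` option controls how soon an
expected device failure must be before Ceph automatically marks an OSD
"out".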
142The stored life expectancy of all devices can be checked, and any
143appropriate health alerts generated, with::
144
145 ceph device check-health
146
147Automatic Mitigation
148--------------------
149
150If the ``mgr/devicehealth/self_heal`` option is enabled (it is by
151default), then for devices that are expected to fail soon the module
152will automatically migrate data away from them by marking the devices
153"out".
154
155The ``mgr/devicehealth/mark_out_threshold`` controls how soon an
156expected device failure must be before we automatically mark an osd
157"out".