======================
 Monitoring a Cluster
======================

After you have a running cluster, you can use the ``ceph`` tool to monitor your
cluster. Monitoring a cluster typically involves checking OSD status, monitor
status, placement group status, and metadata server status.

Using the command line
======================

Interactive mode
----------------

To run the ``ceph`` tool in interactive mode, type ``ceph`` at the command line
with no arguments. For example:

.. prompt:: bash $

   ceph

.. prompt:: ceph>
   :prompts: ceph>

   health
   status
   quorum_status
   mon stat

Non-default paths
-----------------

If you specified non-default locations for your configuration or keyring when
you installed the cluster, you may specify their locations to the ``ceph``
tool by running the following command:

.. prompt:: bash $

   ceph -c /path/to/conf -k /path/to/keyring health

Checking a Cluster's Status
===========================

After you start your cluster, and before you start reading or writing data,
you should check your cluster's status.

To check a cluster's status, run the following command:

.. prompt:: bash $

   ceph status

Alternatively, you can run the following command:

.. prompt:: bash $

   ceph -s

In interactive mode, this operation is performed by typing ``status`` and
pressing **Enter**:

.. prompt:: ceph>
   :prompts: ceph>

   status

Ceph will print the cluster status. For example, a tiny Ceph "demonstration
cluster" that is running one instance of each service (monitor, manager, and
OSD) might print the following:

::

  cluster:
    id:     477e46f1-ae41-4e43-9c8f-72c918ab0a20
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c
    mgr: x(active)
    mds: cephfs_a-1/1/1 up {0=a=up:active}, 2 up:standby
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   2 pools, 16 pgs
    objects: 21 objects, 2.19K
    usage:   546 GB used, 384 GB / 931 GB avail
    pgs:     16 active+clean

How Ceph Calculates Data Usage
------------------------------

The ``usage`` value reflects the *actual* amount of raw storage used. The
``xxx GB / xxx GB`` value means the amount available (the lesser number) out
of the overall storage capacity of the cluster. The notional number reflects
the size of the stored data before it is replicated, cloned, or snapshotted.
Therefore, the amount of data actually stored typically exceeds the notional
amount stored, because Ceph creates replicas of the data and may also use
storage capacity for cloning and snapshotting.
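
For example, under the simplifying assumption of three-way replication with no
clones or snapshots, storing 100 GB of notional data consumes roughly 300 GB
of raw capacity::

  100 GB (notional, before replication) x 3 replicas = 300 GB (raw usage)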


Watching a Cluster
==================

Each daemon in the Ceph cluster maintains a log of events, and the Ceph
cluster itself maintains a *cluster log* that records high-level events about
the entire Ceph cluster. These events are logged to disk on monitor servers
(in the default location ``/var/log/ceph/ceph.log``), and they can be
monitored via the command line.

To follow the cluster log, run the following command:

.. prompt:: bash $

   ceph -w

Ceph will print the status of the system, followed by each log message as it
is added. For example:

::

  cluster:
    id:     477e46f1-ae41-4e43-9c8f-72c918ab0a20
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c
    mgr: x(active)
    mds: cephfs_a-1/1/1 up {0=a=up:active}, 2 up:standby
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   2 pools, 16 pgs
    objects: 21 objects, 2.19K
    usage:   546 GB used, 384 GB / 931 GB avail
    pgs:     16 active+clean


  2017-07-24 08:15:11.329298 mon.a mon.0 172.21.9.34:6789/0 23 : cluster [INF] osd.0 172.21.9.34:6806/20527 boot
  2017-07-24 08:15:14.258143 mon.a mon.0 172.21.9.34:6789/0 39 : cluster [INF] Activating manager daemon x
  2017-07-24 08:15:15.446025 mon.a mon.0 172.21.9.34:6789/0 47 : cluster [INF] Manager daemon x is now available

Instead of printing log lines as they are added, you might want to print only
the most recent lines. Run ``ceph log last [n]`` to see the most recent ``n``
lines from the cluster log.
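
For example, to print only the five most recent lines from the cluster log:

.. prompt:: bash $

   ceph log last 5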

Monitoring Health Checks
========================

Ceph continuously runs various *health checks*. When a health check fails,
this failure is reflected in the output of ``ceph status`` and ``ceph
health``. The cluster log receives messages that indicate when a check has
failed and when the cluster has recovered.

For example, when an OSD goes down, the ``health`` section of the status
output is updated as follows:

::

  health: HEALTH_WARN
          1 osds down
          Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded

At the same time, cluster log messages are emitted to record the failure of
the health checks:

::

  2017-07-25 10:08:58.265945 mon.a mon.0 172.21.9.34:6789/0 91 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
  2017-07-25 10:09:01.302624 mon.a mon.0 172.21.9.34:6789/0 94 : cluster [WRN] Health check failed: Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded (PG_DEGRADED)

When the OSD comes back online, the cluster log records the cluster's return
to a healthy state:

::

  2017-07-25 10:11:11.526841 mon.a mon.0 172.21.9.34:6789/0 109 : cluster [WRN] Health check update: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized (PG_DEGRADED)
  2017-07-25 10:11:13.535493 mon.a mon.0 172.21.9.34:6789/0 110 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized)
  2017-07-25 10:11:13.535577 mon.a mon.0 172.21.9.34:6789/0 111 : cluster [INF] Cluster is now healthy
180 | ||
eafe8130 TL |
181 | Network Performance Checks |
182 | -------------------------- | |
183 | ||
1e59de90 TL |
184 | Ceph OSDs send heartbeat ping messages to each other in order to monitor daemon |
185 | availability and network performance. If a single delayed response is detected, | |
186 | this might indicate nothing more than a busy OSD. But if multiple delays | |
187 | between distinct pairs of OSDs are detected, this might indicate a failed | |
188 | network switch, a NIC failure, or a layer 1 failure. | |
eafe8130 | 189 | |
1e59de90 TL |
190 | By default, a heartbeat time that exceeds 1 second (1000 milliseconds) raises a |
191 | health check (a ``HEALTH_WARN``. For example: | |
eafe8130 TL |
192 | |
193 | :: | |
194 | ||
9f95a23c | 195 | HEALTH_WARN Slow OSD heartbeats on back (longest 1118.001ms) |
eafe8130 | 196 | |
1e59de90 TL |
197 | In the output of the ``ceph health detail`` command, you can see which OSDs are |
198 | experiencing delays and how long the delays are. The output of ``ceph health | |
199 | detail`` is limited to ten lines. Here is an example of the output you can | |
200 | expect from the ``ceph health detail`` command:: | |
eafe8130 | 201 | |
9f95a23c TL |
202 | [WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 1118.001ms) |
203 | Slow OSD heartbeats on back from osd.0 [dc1,rack1] to osd.1 [dc1,rack1] 1118.001 msec possibly improving | |
204 | Slow OSD heartbeats on back from osd.0 [dc1,rack1] to osd.2 [dc1,rack2] 1030.123 msec | |
205 | Slow OSD heartbeats on back from osd.2 [dc1,rack2] to osd.1 [dc1,rack1] 1015.321 msec | |
206 | Slow OSD heartbeats on back from osd.1 [dc1,rack1] to osd.0 [dc1,rack1] 1010.456 msec | |
eafe8130 | 207 | |
1e59de90 TL |
208 | To see more detail and to collect a complete dump of network performance |
209 | information, use the ``dump_osd_network`` command. This command is usually sent | |
210 | to a Ceph Manager Daemon, but it can be used to collect information about a | |
211 | specific OSD's interactions by sending it to that OSD. The default threshold | |
212 | for a slow heartbeat is 1 second (1000 milliseconds), but this can be | |
213 | overridden by providing a number of milliseconds as an argument. | |
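
For example, to dump heartbeat data from a single OSD (here the illustrative
``osd.0``) and report only pings slower than 500 milliseconds, you might run
the following command on the host where that OSD runs:

.. prompt:: bash $

   ceph daemon osd.0 dump_osd_network 500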

To show all network performance data with a specified threshold of 0, send the
following command to the mgr:

.. prompt:: bash $

   ceph daemon /var/run/ceph/ceph-mgr.x.asok dump_osd_network 0
221 | ||
eafe8130 TL |
222 | :: |
223 | ||
eafe8130 TL |
224 | { |
225 | "threshold": 0, | |
226 | "entries": [ | |
227 | { | |
228 | "last update": "Wed Sep 4 17:04:49 2019", | |
229 | "stale": false, | |
230 | "from osd": 2, | |
231 | "to osd": 0, | |
232 | "interface": "front", | |
233 | "average": { | |
234 | "1min": 1.023, | |
235 | "5min": 0.860, | |
236 | "15min": 0.883 | |
237 | }, | |
238 | "min": { | |
239 | "1min": 0.818, | |
240 | "5min": 0.607, | |
241 | "15min": 0.607 | |
242 | }, | |
243 | "max": { | |
244 | "1min": 1.164, | |
245 | "5min": 1.173, | |
246 | "15min": 1.544 | |
247 | }, | |
248 | "last": 0.924 | |
249 | }, | |
250 | { | |
251 | "last update": "Wed Sep 4 17:04:49 2019", | |
252 | "stale": false, | |
253 | "from osd": 2, | |
254 | "to osd": 0, | |
255 | "interface": "back", | |
256 | "average": { | |
257 | "1min": 0.968, | |
258 | "5min": 0.897, | |
259 | "15min": 0.830 | |
260 | }, | |
261 | "min": { | |
262 | "1min": 0.860, | |
263 | "5min": 0.563, | |
264 | "15min": 0.502 | |
265 | }, | |
266 | "max": { | |
267 | "1min": 1.171, | |
268 | "5min": 1.216, | |
269 | "15min": 1.456 | |
270 | }, | |
271 | "last": 0.845 | |
272 | }, | |
273 | { | |
274 | "last update": "Wed Sep 4 17:04:48 2019", | |
275 | "stale": false, | |
276 | "from osd": 0, | |
277 | "to osd": 1, | |
278 | "interface": "front", | |
279 | "average": { | |
280 | "1min": 0.965, | |
281 | "5min": 0.811, | |
282 | "15min": 0.850 | |
283 | }, | |
284 | "min": { | |
285 | "1min": 0.650, | |
286 | "5min": 0.488, | |
287 | "15min": 0.466 | |
288 | }, | |
289 | "max": { | |
290 | "1min": 1.252, | |
291 | "5min": 1.252, | |
292 | "15min": 1.362 | |
293 | }, | |
294 | "last": 0.791 | |
295 | }, | |
296 | ... | |
297 | ||


Muting Health Checks
--------------------

Health checks can be muted so that they have no effect on the overall reported
status of the cluster. For example, if the cluster has raised a single health
check and you then mute that health check, the cluster will report a status of
``HEALTH_OK``. To mute a specific health check, use the health check code that
corresponds to that check (see :ref:`health-checks`) and run the following
command:

.. prompt:: bash $

   ceph health mute <code>

For example, to mute an ``OSD_DOWN`` health check, run the following command:

.. prompt:: bash $

   ceph health mute OSD_DOWN

Mutes are reported as part of the short and long forms of the ``ceph health``
command's output. For example, in the above scenario, the cluster would
report:

.. prompt:: bash $

   ceph health

::

  HEALTH_OK (muted: OSD_DOWN)

.. prompt:: bash $

   ceph health detail

::

  HEALTH_OK (muted: OSD_DOWN)
  (MUTED) OSD_DOWN 1 osds down
      osd.1 is down

A mute can be removed by running the following command:

.. prompt:: bash $

   ceph health unmute <code>

For example:

.. prompt:: bash $

   ceph health unmute OSD_DOWN

A "health mute" can have a TTL (**T**\ime **T**\o **L**\ive) associated with
it: this means that the mute will automatically expire after a specified
period of time. The TTL is specified as an optional duration argument, as seen
in the following examples:

.. prompt:: bash $

   ceph health mute OSD_DOWN 4h    # mute for 4 hours
   ceph health mute MON_DOWN 15m   # mute for 15 minutes

Normally, if a muted health check is resolved (for example, if the OSD that
raised the ``OSD_DOWN`` health check in the example above has come back up),
the mute goes away. If the health check comes back later, it will be reported
in the usual way.

It is possible to make a health mute "sticky": this means that the mute will
remain even if the health check clears. For example, to make a health mute
"sticky", you might run the following command:

.. prompt:: bash $

   ceph health mute OSD_DOWN 1h --sticky   # ignore any/all down OSDs for next hour

Most health mutes disappear if the unhealthy condition that triggered the
health check gets worse. For example, suppose that there is one OSD down and
the health check is muted. In that case, if one or more additional OSDs go
down, then the health mute disappears. This behavior occurs in any health
check that has a threshold value.


Checking a Cluster's Usage Stats
================================

To check a cluster's data usage and data distribution among pools, use the
``df`` command. This command is similar to the Linux ``df`` command. Run the
following command:

.. prompt:: bash $

   ceph df

The output of ``ceph df`` resembles the following::

  --- RAW STORAGE ---
  CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
  ssd    202 GiB  200 GiB  2.0 GiB   2.0 GiB       1.00
  TOTAL  202 GiB  200 GiB  2.0 GiB   2.0 GiB       1.00

  --- POOLS ---
  POOL                   ID  PGS   STORED   (DATA)   (OMAP)  OBJECTS     USED   (DATA)   (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
  device_health_metrics   1    1  242 KiB   15 KiB  227 KiB        4  251 KiB   24 KiB  227 KiB      0    297 GiB            N/A          N/A      4         0 B          0 B
  cephfs.a.meta           2   32  6.8 KiB  6.8 KiB      0 B       22   96 KiB   96 KiB      0 B      0    297 GiB            N/A          N/A     22         0 B          0 B
  cephfs.a.data           3   32      0 B      0 B      0 B        0      0 B      0 B      0 B      0     99 GiB            N/A          N/A      0         0 B          0 B
  test                    4   32   22 MiB   22 MiB   50 KiB      248   19 MiB   19 MiB   50 KiB      0    297 GiB            N/A          N/A    248         0 B          0 B

**RAW STORAGE:**

- **CLASS:** For example, "ssd" or "hdd".
- **SIZE:** The amount of storage capacity managed by the cluster.
- **AVAIL:** The amount of free space available in the cluster.
- **USED:** The amount of raw storage consumed by user data (excluding
  BlueStore's database).
- **RAW USED:** The amount of raw storage consumed by user data, internal
  overhead, and reserved capacity.
- **%RAW USED:** The percentage of raw storage used. Watch this number in
  conjunction with ``full ratio`` and ``near full ratio`` to be forewarned
  when your cluster approaches the fullness thresholds. See `Storage Capacity`_.

**POOLS:**

The POOLS section of the output provides a list of pools and the *notional*
usage of each pool. This section of the output **DOES NOT** reflect replicas,
clones, or snapshots. For example, if you store an object with 1MB of data,
then the notional usage will be 1MB, but the actual usage might be 2MB or more
depending on the number of replicas, clones, and snapshots.

- **ID:** The unique identifier of the pool.
- **STORED:** The actual amount of data that the user has stored in a pool.
  This is similar to the USED column in earlier versions of Ceph, but the
  calculations (for BlueStore!) are more precise (in that gaps are properly
  handled).

  - **(DATA):** Usage for RBD (RADOS Block Device), CephFS file data, and RGW
    (RADOS Gateway) object data.
  - **(OMAP):** Key-value pairs. Used primarily by CephFS and RGW (RADOS
    Gateway) for metadata storage.

- **OBJECTS:** The notional number of objects stored per pool (that is, the
  number of objects other than replicas, clones, or snapshots).
- **USED:** The space allocated for a pool over all OSDs. This includes space
  for replication, space for allocation granularity, and space for the
  overhead associated with erasure coding. Compression savings and
  object-content gaps are also taken into account. However, BlueStore's
  database is not included in the amount reported under USED.

  - **(DATA):** Object usage for RBD (RADOS Block Device), CephFS file data,
    and RGW (RADOS Gateway) object data.
  - **(OMAP):** Object key-value pairs. Used primarily by CephFS and RGW
    (RADOS Gateway) for metadata storage.

- **%USED:** The notional percentage of storage used per pool.
- **MAX AVAIL:** An estimate of the notional amount of data that can be
  written to this pool.
- **QUOTA OBJECTS:** The number of quota objects.
- **QUOTA BYTES:** The number of bytes in the quota objects.
- **DIRTY:** The number of objects in the cache pool that have been written to
  the cache pool but have not yet been flushed to the base pool. This field is
  available only when cache tiering is in use.
- **USED COMPR:** The amount of space allocated for compressed data. This
  includes compressed data in addition to all of the space required for
  replication, allocation granularity, and erasure-coding overhead.
- **UNDER COMPR:** The amount of data that has passed through compression
  (summed over all replicas) and that is worth storing in a compressed form.


.. note:: The numbers in the POOLS section are notional. They do not include
   the number of replicas, clones, or snapshots. As a result, the sum of the
   USED and %USED amounts in the POOLS section of the output will not be equal
   to the sum of the USED and %USED amounts in the RAW section of the output.

.. note:: The MAX AVAIL value is a complicated function of the replication
   factor or the kind of erasure coding used, the CRUSH rule that maps storage
   to devices, the utilization of those devices, and the configured
   ``mon_osd_full_ratio`` setting.
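
As a loose approximation only (the real calculation also accounts for the
most-full OSD under the CRUSH rule), MAX AVAIL for a three-way replicated pool
on a balanced cluster behaves roughly as follows::

  MAX AVAIL ~= (raw capacity x mon_osd_full_ratio - raw space used) / 3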


Checking OSD Status
===================

To check if OSDs are ``up`` and ``in``, run the following command:

.. prompt:: bash #

   ceph osd stat

Alternatively, you can run the following command:

.. prompt:: bash #

   ceph osd dump

To view OSDs according to their position in the CRUSH map, run the following
command:

.. prompt:: bash #

   ceph osd tree

This prints a CRUSH tree that displays each host, its OSDs, whether the OSDs
are ``up``, and the weight of each OSD. For example:

.. code-block:: bash

   #ID CLASS WEIGHT  TYPE NAME    STATUS REWEIGHT PRI-AFF
    -1       3.00000 pool default
    -3       3.00000 rack mainrack
    -2       3.00000 host osd-host
     0   ssd 1.00000         osd.0   up  1.00000 1.00000
     1   ssd 1.00000         osd.1   up  1.00000 1.00000
     2   ssd 1.00000         osd.2   up  1.00000 1.00000
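
To limit the tree to OSDs in a particular state, you can append a state filter
to the command. For example, to list only the OSDs that are ``down``:

.. prompt:: bash #

   ceph osd tree down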

See `Monitoring OSDs and Placement Groups`_.

Checking Monitor Status
=======================

If your cluster has multiple monitors, then you need to perform certain
"monitor status" checks. After starting the cluster and before reading or
writing data, you should check quorum status. A quorum must be present when
multiple monitors are running to ensure proper functioning of your Ceph
cluster. Check monitor status regularly in order to ensure that all of the
monitors are running.

To display the monitor map, run the following command:

.. prompt:: bash $

   ceph mon stat

Alternatively, you can run the following command:

.. prompt:: bash $

   ceph mon dump

To check the quorum status for the monitor cluster, run the following command:

.. prompt:: bash $

   ceph quorum_status

Ceph returns the quorum status. For example, a Ceph cluster that consists of
three monitors might return the following:

.. code-block:: javascript

   { "election_epoch": 10,
     "quorum": [
           0,
           1,
           2],
     "quorum_names": [
           "a",
           "b",
           "c"],
     "quorum_leader_name": "a",
     "monmap": { "epoch": 1,
         "fsid": "444b489c-4f16-4b75-83f0-cb8097468898",
         "modified": "2011-12-12 13:28:27.505520",
         "created": "2011-12-12 13:28:27.505520",
         "features": {"persistent": [
                       "kraken",
                       "luminous",
                       "mimic"],
                      "optional": []
         },
         "mons": [
           { "rank": 0,
             "name": "a",
             "addr": "127.0.0.1:6789/0",
             "public_addr": "127.0.0.1:6789/0"},
           { "rank": 1,
             "name": "b",
             "addr": "127.0.0.1:6790/0",
             "public_addr": "127.0.0.1:6790/0"},
           { "rank": 2,
             "name": "c",
             "addr": "127.0.0.1:6791/0",
             "public_addr": "127.0.0.1:6791/0"}
         ]
     }
   }

Checking MDS Status
===================

Metadata servers provide metadata services for CephFS. Metadata servers have
two sets of states: ``up | down`` and ``active | inactive``. To check if your
metadata servers are ``up`` and ``active``, run the following command:

.. prompt:: bash $

   ceph mds stat
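
On the "demonstration cluster" shown earlier, the output might resemble the
following::

  cephfs_a-1/1/1 up {0=a=up:active}, 2 up:standby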

To display details of the metadata servers, run the following command:

.. prompt:: bash $

   ceph fs dump


Checking Placement Group States
===============================

Placement groups (PGs) map objects to OSDs. PGs are monitored in order to
ensure that they are ``active`` and ``clean``. See `Monitoring OSDs and
Placement Groups`_.
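
For example, to print a one-line summary of the current PG states, run the
following command:

.. prompt:: bash $

   ceph pg stat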

.. _Monitoring OSDs and Placement Groups: ../monitoring-osd-pg

.. _rados-monitoring-using-admin-socket:

Using the Admin Socket
======================

The Ceph admin socket allows you to query a daemon via a socket interface. By
default, Ceph sockets reside under ``/var/run/ceph``. To access a daemon via
the admin socket, log in to the host that is running the daemon and run one of
the two following commands:

.. prompt:: bash $

   ceph daemon {daemon-name}
   ceph daemon {path-to-socket-file}

For example, the following commands are equivalent to each other:

.. prompt:: bash $

   ceph daemon osd.0 foo
   ceph daemon /var/run/ceph/ceph-osd.0.asok foo

To view the available admin-socket commands, run the following command:

.. prompt:: bash $

   ceph daemon {daemon-name} help

Admin-socket commands enable you to view and set your configuration at
runtime. For more on viewing your configuration, see `Viewing a Configuration
at Runtime`_. There are two methods of setting a configuration value at
runtime: (1) using the admin socket, which bypasses the monitor and requires a
direct login to the host in question, and (2) using the ``ceph tell
{daemon-type}.{id} config set`` command, which relies on the monitor and does
not require a direct login.
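
For example, both of the following commands set the (illustrative)
``debug_osd`` option on ``osd.0``. The first command uses the admin socket and
must be run on the daemon's host; the second goes through the monitors and can
be run from any host with cluster access:

.. prompt:: bash $

   ceph daemon osd.0 config set debug_osd 20
   ceph tell osd.0 config set debug_osd 20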

.. _Viewing a Configuration at Runtime: ../../configuration/ceph-conf#viewing-a-configuration-at-runtime
.. _Storage Capacity: ../../configuration/mon-config-ref#storage-capacity