======================
 Troubleshooting OSDs
======================

Before troubleshooting your OSDs, check your monitors and network first. If
you execute ``ceph health`` or ``ceph -s`` on the command line and Ceph returns
a health status, it means that the monitors have a quorum.
If you don't have a monitor quorum or if there are errors with the monitor
status, `address the monitor issues first <../troubleshooting-mon>`_.
Check your networks to ensure they
are running properly, because networks may have a significant impact on OSD
operation and performance.



Obtaining Data About OSDs
=========================

A good first step in troubleshooting your OSDs is to obtain information in
addition to the information you collected while `monitoring your OSDs`_
(e.g., ``ceph osd tree``).


Ceph Logs
---------

If you haven't changed the default path, you can find Ceph log files at
``/var/log/ceph``::

    ls /var/log/ceph
If you don't get enough log detail, you can change your logging level. See
`Logging and Debugging`_ for details, and for how to ensure that Ceph performs
adequately under high logging volume.


Admin Socket
------------

Use the admin socket tool to retrieve runtime information. First, list
the sockets for your Ceph processes::

    ls /var/run/ceph

Then, execute the following, replacing ``{daemon-name}`` with an actual
daemon (e.g., ``osd.0``)::

    ceph daemon osd.0 help

Alternatively, you can specify a ``{socket-file}`` (e.g., something in ``/var/run/ceph``)::

    ceph daemon {socket-file} help


The admin socket, among other things, allows you to (see the examples after
this list):

- List your configuration at runtime
- Dump historic operations
- Dump the operation priority queue state
- Dump operations in flight
- Dump perfcounters
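
For example, the following commands exercise some of these capabilities
against a local daemon; ``osd.0`` is a placeholder for one of your own OSDs::

    ceph daemon osd.0 config show          # configuration at runtime
    ceph daemon osd.0 dump_historic_ops    # recently completed operations
    ceph daemon osd.0 dump_ops_in_flight   # operations currently in flight
    ceph daemon osd.0 perf dump            # perfcounters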


Display Freespace
-----------------

Filesystem issues may arise. To display your filesystem's free space, execute
``df``. ::

    df -h

Execute ``df --help`` for additional usage.


I/O Statistics
--------------

Use `iostat`_ to identify I/O-related issues. ::

    iostat -x
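
To watch for saturation as it happens, you can also sample extended
statistics on an interval, for example every five seconds; look for
consistently high ``%util`` or ``await`` values on the drives backing your
OSDs::

    iostat -x 5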


Diagnostic Messages
-------------------

To retrieve diagnostic messages, use ``dmesg`` with ``less``, ``more``, ``grep``
or ``tail``. For example::

    dmesg | grep scsi


Stopping w/out Rebalancing
==========================

Periodically, you may need to perform maintenance on a subset of your cluster,
or resolve a problem that affects a failure domain (e.g., a rack). If you do not
want CRUSH to automatically rebalance the cluster as you stop OSDs for
maintenance, set the cluster to ``noout`` first::

    ceph osd set noout

Once the cluster is set to ``noout``, you can begin stopping the OSDs within the
failure domain that requires maintenance work. ::

    stop ceph-osd id={num}

.. note:: Placement groups within the OSDs you stop will become ``degraded``
   while you are addressing issues within the failure domain.

Once you have completed your maintenance, restart the OSDs. ::

    start ceph-osd id={num}

Finally, you must unset the ``noout`` flag. ::

    ceph osd unset noout
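
The ``stop``/``start`` commands above use upstart syntax. On a systemd-based
distribution, the equivalent commands would be along these lines (a sketch,
assuming the standard ``ceph-osd@`` unit template)::

    sudo systemctl stop ceph-osd@{num}
    sudo systemctl start ceph-osd@{num}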



.. _osd-not-running:

OSD Not Running
===============

Under normal circumstances, simply restarting the ``ceph-osd`` daemon will
allow it to rejoin the cluster and recover.

An OSD Won't Start
------------------

If you start your cluster and an OSD won't start, check the following:

- **Configuration File:** If you were not able to get OSDs running from
  a new installation, check your configuration file to ensure it conforms
  (e.g., ``host`` not ``hostname``, etc.).

- **Check Paths:** Check the paths in your configuration, and the actual
  paths themselves for data and journals. If you separate the OSD data from
  the journal data and there are errors in your configuration file or in the
  actual mounts, you may have trouble starting OSDs. If you want to store the
  journal on a block device, you should partition your journal disk and assign
  one partition per OSD.

- **Check Max Threadcount:** If you have a node with a lot of OSDs, you may be
  hitting the default maximum number of threads (e.g., usually 32k), especially
  during recovery. You can use ``sysctl`` to check whether raising the maximum
  number of threads to the largest allowed value (i.e., 4194303) helps. For
  example::

    sysctl -w kernel.pid_max=4194303

  If increasing the maximum thread count resolves the issue, you can make it
  permanent by including a ``kernel.pid_max`` setting in the
  ``/etc/sysctl.conf`` file. For example::

    kernel.pid_max = 4194303
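
  To check the current limit and how many threads are actually in use before
  changing anything, standard tools suffice (a quick diagnostic, not part of
  the official procedure)::

    sysctl kernel.pid_max
    ps -eLf | wc -l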

- **Kernel Version:** Identify the kernel version and distribution you
  are using. Ceph uses some third party tools by default, which may be
  buggy or may conflict with certain distributions and/or kernel
  versions (e.g., Google perftools). Check the `OS recommendations`_
  to ensure you have addressed any issues related to your kernel.

- **Segmentation Fault:** If there is a segmentation fault, turn your logging
  up (if it isn't already), and try again. If it segfaults again, contact the
  ceph-devel email list and provide your Ceph configuration file, your monitor
  output and the contents of your log file(s).



An OSD Failed
-------------

When a ``ceph-osd`` process dies, the monitor will learn about the failure
from surviving ``ceph-osd`` daemons and report it via the ``ceph health``
command::

    ceph health
    HEALTH_WARN 1/3 in osds are down

Specifically, you will get a warning whenever there are ``ceph-osd``
processes that are marked ``in`` and ``down``. You can identify which
``ceph-osds`` are ``down`` with::

    ceph health detail
    HEALTH_WARN 1/3 in osds are down
    osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080

If there is a disk failure or other fault preventing ``ceph-osd`` from
functioning or restarting, an error message should be present in its log
file in ``/var/log/ceph``.
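
For example, to inspect the most recent log entries for the failed daemon
(assuming the default cluster name ``ceph``, with ``osd.0`` as a placeholder
for the OSD in question)::

    tail -n 200 /var/log/ceph/ceph-osd.0.log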

If the daemon stopped because of a heartbeat failure, the underlying
kernel file system may be unresponsive. Check ``dmesg`` output for disk
or other kernel errors.

If the problem is a software error (failed assertion or other
unexpected error), it should be reported to the `ceph-devel`_ email list.


No Free Drive Space
-------------------

Ceph prevents you from writing to a full OSD so that you don't lose data.
In an operational cluster, you should receive a warning when your cluster
is getting near its full ratio. The ``mon osd full ratio`` defaults to
``0.95``, or 95% of capacity, at which point it stops clients from writing
data. The ``mon osd backfillfull ratio`` defaults to ``0.90``, or 90% of
capacity, at which point it blocks backfills from starting. The
``mon osd nearfull ratio`` defaults to ``0.85``, or 85% of capacity, at
which point it generates a health warning.

Full cluster issues usually arise when testing how Ceph handles an OSD
failure on a small cluster. When one node holds a high percentage of the
cluster's data, the cluster can easily exceed its nearfull and full ratios
immediately. If you are testing how Ceph reacts to OSD failures on a small
cluster, you should leave ample free disk space and consider temporarily
lowering the ``mon osd full ratio``, ``mon osd backfillfull ratio`` and
``mon osd nearfull ratio``.
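
To see per-OSD utilization at a glance, and, on releases that support the
``ceph osd set-full-ratio`` family of commands, to adjust the ratios at
runtime (the values below are only illustrative)::

    ceph osd df                          # per-OSD usage and weight
    ceph osd set-nearfull-ratio 0.85
    ceph osd set-backfillfull-ratio 0.90
    ceph osd set-full-ratio 0.95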

Full ``ceph-osds`` will be reported by ``ceph health``::

    ceph health
    HEALTH_WARN 1 nearfull osd(s)

Or::

    ceph health detail
    HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s)
    osd.3 is full at 97%
    osd.4 is backfill full at 91%
    osd.2 is near full at 87%

The best way to deal with a full cluster is to add new ``ceph-osds``, allowing
the cluster to redistribute data to the newly available storage.

If you cannot start an OSD because it is full, you may delete some data by deleting
some placement group directories in the full OSD.

.. important:: If you choose to delete a placement group directory on a full OSD,
   **DO NOT** delete the same placement group directory on another full OSD, or
   **YOU MAY LOSE DATA**. You **MUST** maintain at least one copy of your data on
   at least one OSD.

See `Monitor Config Reference`_ for additional details.


OSDs are Slow/Unresponsive
==========================

A commonly recurring issue involves slow or unresponsive OSDs. Ensure that you
have eliminated other troubleshooting possibilities before delving into OSD
performance issues. For example, ensure that your networks are working properly
and your OSDs are running. Check to see if OSDs are throttling recovery traffic.

.. tip:: Newer versions of Ceph provide better recovery handling by preventing
   recovering OSDs from using up system resources to the point that ``up`` and
   ``in`` OSDs become unavailable or otherwise slow.


Networking Issues
-----------------

Ceph is a distributed storage system, so it depends upon networks to peer with
OSDs, replicate objects, recover from faults and check heartbeats. Networking
issues can cause OSD latency and flapping OSDs. See `Flapping OSDs`_ for
details.

Ensure that Ceph processes and Ceph-dependent processes are connected and/or
listening. ::

    netstat -a | grep ceph
    netstat -l | grep ceph
    sudo netstat -p | grep ceph

Check network statistics. ::

    netstat -s
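
If you suspect the network itself, basic latency and MTU checks between OSD
hosts can rule out common problems; for example (``{osd-host}`` is a
placeholder, and the 8972-byte payload assumes 9000-byte jumbo frames)::

    ping {osd-host}                      # basic reachability and latency
    ping -M do -s 8972 {osd-host}        # verify jumbo frames pass unfragmented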


Drive Configuration
-------------------

A storage drive should only support one OSD. Sequential read and sequential
write throughput can bottleneck if other processes share the drive, including
journals, operating systems, monitors, other OSDs and non-Ceph processes.

Ceph acknowledges writes *after* journaling, so fast SSDs are an attractive
option to accelerate the response time, particularly when using the ``XFS`` or
``ext4`` filesystems. By contrast, the ``btrfs`` filesystem can write and journal
simultaneously.

.. note:: Partitioning a drive does not change its total throughput or
   sequential read/write limits. Running a journal in a separate partition
   may help, but you should prefer a separate physical drive.


Bad Sectors / Fragmented Disk
-----------------------------

Check your disks for bad sectors and fragmentation. These can cause total
throughput to drop substantially.


Co-resident Monitors/OSDs
-------------------------

Monitors are generally light-weight processes, but they do lots of ``fsync()``,
which can interfere with other workloads, particularly if monitors run on the
same drive as your OSDs. Additionally, if you run monitors on the same host as
the OSDs, you may incur performance issues related to:

- Running an older kernel (pre-3.0)
- Running Argonaut with an old ``glibc``
- Running a kernel with no ``syncfs(2)`` syscall

In these cases, multiple OSDs running on the same host can drag each other down
by doing lots of commits. That often leads to bursty writes.


Co-resident Processes
---------------------

Spinning up co-resident processes such as a cloud-based solution, virtual
machines and other applications that write data to Ceph while operating on the
same hardware as OSDs can introduce significant OSD latency. Generally, we
recommend optimizing a host for use with Ceph and using other hosts for other
processes. The practice of separating Ceph operations from other applications
may help improve performance and may streamline troubleshooting and maintenance.


Logging Levels
--------------

If you turned logging levels up to track an issue and then forgot to turn
logging levels back down, the OSD may be putting a lot of logs onto the disk. If
you intend to keep logging levels high, you may consider mounting a drive to the
default path for logging (i.e., ``/var/log/ceph/$cluster-$name.log``).
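
To turn logging back down without restarting daemons, you can inject the
settings at runtime; the levels shown here are only an example::

    ceph tell osd.* injectargs '--debug-osd 0/5 --debug-ms 0/5'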


Recovery Throttling
-------------------

Depending upon your configuration, Ceph may reduce recovery rates to maintain
performance or it may increase recovery rates to the point that recovery
impacts OSD performance. Check to see if the OSD is recovering.
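
Recovery activity shows up in the cluster status, and recovery traffic can be
throttled at runtime; the values below are illustrative, not recommendations::

    ceph -s                              # look for recovering/backfilling PGs
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'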


Kernel Version
--------------

Check the kernel version you are running. Older kernels may not receive
new backports that Ceph depends upon for better performance.


Kernel Issues with SyncFS
-------------------------

Try running one OSD per host to see if performance improves. Old kernels
might not have a recent enough version of ``glibc`` to support ``syncfs(2)``.


Filesystem Issues
-----------------

Currently, we recommend deploying clusters with XFS. The btrfs
filesystem has many attractive features, but bugs in the filesystem may
lead to performance issues. We do not recommend ext4 because xattr size
limitations break our support for long object names (needed for RGW).

For more information, see `Filesystem Recommendations`_.

.. _Filesystem Recommendations: ../configuration/filesystem-recommendations


Insufficient RAM
----------------

We recommend 1GB of RAM per OSD daemon. You may notice that during normal
operations, the OSD only uses a fraction of that amount (e.g., 100-200MB).
Unused RAM makes it tempting to use the excess RAM for co-resident applications,
VMs and so forth. However, when OSDs go into recovery mode, their memory
utilization spikes. If there is no RAM available, the OSD performance will slow
considerably.
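
To check how much memory each OSD daemon is actually using, standard tools
are sufficient; for example, the resident set size (RSS, in kilobytes) of
every ``ceph-osd`` process::

    ps -o pid,rss,cmd -C ceph-osd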


Old Requests or Slow Requests
-----------------------------

If a ``ceph-osd`` daemon is slow to respond to a request, it will generate log messages
complaining about requests that are taking too long. The warning threshold
defaults to 30 seconds, and is configurable via the ``osd op complaint time``
option. When this happens, the cluster log will receive messages.
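
You can confirm the threshold a running daemon is using via the admin socket
(``osd.0`` is a placeholder; the option's internal name uses underscores)::

    ceph daemon osd.0 config get osd_op_complaint_time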

Legacy versions of Ceph complain about ``old requests``::

    osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops

New versions of Ceph complain about ``slow requests``::

    {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
    {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]


Possible causes include (see below for a way to inspect stuck requests
directly):

- A bad drive (check ``dmesg`` output)
- A bug in the kernel file system (check ``dmesg`` output)
- An overloaded cluster (check system load, iostat, etc.)
- A bug in the ``ceph-osd`` daemon
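
To see exactly which requests are stuck on a given OSD, query its admin
socket (``osd.0`` is a placeholder for the complaining daemon)::

    ceph daemon osd.0 dump_ops_in_flight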

Possible solutions:

- Remove VMs and cloud solutions from Ceph hosts
- Upgrade the kernel
- Upgrade Ceph
- Restart OSDs


Flapping OSDs
=============

We recommend using both a public (front-end) network and a cluster (back-end)
network so that you can better meet the capacity requirements of object
replication. Another advantage is that you can run a cluster network such that
it isn't connected to the internet, thereby preventing some denial of service
attacks. When OSDs peer and check heartbeats, they use the cluster (back-end)
network when it's available. See `Monitor/OSD Interaction`_ for details.

However, if the cluster (back-end) network fails or develops significant latency
while the public (front-end) network operates optimally, OSDs currently do not
handle this situation well. What happens is that OSDs mark each other ``down``
on the monitor, while marking themselves ``up``. We call this scenario
'flapping'.

If something is causing OSDs to 'flap' (repeatedly getting marked ``down`` and
then ``up`` again), you can force the monitors to stop the flapping with::

    ceph osd set noup      # prevent OSDs from getting marked up
    ceph osd set nodown    # prevent OSDs from getting marked down

These flags are recorded in the osdmap structure::

    ceph osd dump | grep flags
    flags no-up,no-down

You can clear the flags with::

    ceph osd unset noup
    ceph osd unset nodown

Two other flags are supported, ``noin`` and ``noout``, which prevent
booting OSDs from being marked ``in`` (allocated data) or protect OSDs
from eventually being marked ``out`` (regardless of what the current value for
``mon osd down out interval`` is).

.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the
   sense that once the flags are cleared, the action they were blocking
   should occur shortly after. The ``noin`` flag, on the other hand,
   prevents OSDs from being marked ``in`` on boot, and any daemons that
   started while the flag was set will remain that way.
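
Because heartbeat failures are the usual trigger, the OSD logs often record
why peers were reported down; a simple search is a reasonable first check
(assuming default log paths, with ``osd.0`` as a placeholder)::

    grep -i heartbeat_check /var/log/ceph/ceph-osd.0.log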




.. _iostat: http://en.wikipedia.org/wiki/Iostat
.. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging
.. _Logging and Debugging: ../log-and-debug
.. _Debugging and Logging: ../debug
.. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction
.. _Monitor Config Reference: ../../configuration/mon-config-ref
.. _monitoring your OSDs: ../../operations/monitoring-osd-pg
.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel
.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel
.. _subscribe to the ceph-users email list: mailto:ceph-users-join@lists.ceph.com
.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@lists.ceph.com
.. _OS recommendations: ../../../start/os-recommendations
.. _ceph-devel: ceph-devel@vger.kernel.org