======================
 Troubleshooting OSDs
======================

Before troubleshooting your OSDs, check your monitors and network first. If
you execute ``ceph health`` or ``ceph -s`` on the command line and Ceph returns
a health status, it means that the monitors have a quorum. If you don't have a
monitor quorum, or if there are errors with the monitor status,
`address the monitor issues first <../troubleshooting-mon>`_. Check your
networks to ensure that they are running properly, because networks may have a
significant impact on OSD operation and performance.


Obtaining Data About OSDs
=========================

A good first step in troubleshooting your OSDs is to obtain information in
addition to the information you collected while `monitoring your OSDs`_
(e.g., ``ceph osd tree``).


Ceph Logs
---------

If you haven't changed the default path, you can find Ceph log files at
``/var/log/ceph``::

    ls /var/log/ceph

If you don't get enough log detail, you can change your logging level. See
`Logging and Debugging`_ for details, and for how to ensure that Ceph performs
adequately under high logging volume.

Admin Socket
------------

Use the admin socket tool to retrieve runtime information. First, list the
sockets for your Ceph processes::

    ls /var/run/ceph

Then, execute the following, replacing ``{daemon-name}`` with an actual
daemon (e.g., ``osd.0``)::

    ceph daemon osd.0 help

Alternatively, you can specify a ``{socket-file}`` (e.g., something in ``/var/run/ceph``)::

    ceph daemon {socket-file} help


The admin socket, among other things, allows you to:

- List your configuration at runtime
- Dump historic operations
- Dump the operation priority queue state
- Dump operations in flight
- Dump perfcounters

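As a minimal sketch, the admin socket commands corresponding to the items above
look roughly like this (assuming ``osd.0``; exact command names can vary between
releases, so consult ``ceph daemon osd.0 help`` for the authoritative list)::

    ceph daemon osd.0 config show          # list configuration at runtime
    ceph daemon osd.0 dump_historic_ops    # dump recently completed operations
    ceph daemon osd.0 dump_ops_in_flight   # dump operations currently in flight
    ceph daemon osd.0 perf dump            # dump perfcounters
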

Display Freespace
-----------------

Filesystem issues may arise. To display your filesystem's free space, execute
``df``. ::

    df -h

Execute ``df --help`` for additional usage.


I/O Statistics
--------------

Use `iostat`_ to identify I/O-related issues. ::

    iostat -x


Diagnostic Messages
-------------------

To retrieve diagnostic messages, use ``dmesg`` with ``less``, ``more``, ``grep``
or ``tail``. For example::

    dmesg | grep scsi


Stopping w/out Rebalancing
==========================

Periodically, you may need to perform maintenance on a subset of your cluster,
or resolve a problem that affects a failure domain (e.g., a rack). If you do not
want CRUSH to automatically rebalance the cluster as you stop OSDs for
maintenance, set the cluster to ``noout`` first::

    ceph osd set noout

Once the cluster is set to ``noout``, you can begin stopping the OSDs within the
failure domain that requires maintenance work. ::

    stop ceph-osd id={num}

.. note:: Placement groups within the OSDs you stop will become ``degraded``
   while you are addressing issues within the failure domain.

Once you have completed your maintenance, restart the OSDs. ::

    start ceph-osd id={num}

Finally, you must unset the cluster from ``noout``. ::

    ceph osd unset noout
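
The ``stop``/``start`` commands above use Upstart syntax. On distributions that
manage Ceph daemons with systemd, the equivalent commands are typically::

    systemctl stop ceph-osd@{num}     # stop a single OSD for maintenance
    systemctl start ceph-osd@{num}    # start it again afterwards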



.. _osd-not-running:

OSD Not Running
===============

Under normal circumstances, simply restarting the ``ceph-osd`` daemon will
allow it to rejoin the cluster and recover.

An OSD Won't Start
------------------

If you start your cluster and an OSD won't start, check the following:

- **Configuration File:** If you were not able to get OSDs running from
  a new installation, check your configuration file to ensure it conforms
  (e.g., ``host`` not ``hostname``, etc.).

- **Check Paths:** Check the paths in your configuration, and the actual
  paths themselves for data and journals. If you separate the OSD data from
  the journal data and there are errors in your configuration file or in the
  actual mounts, you may have trouble starting OSDs. If you want to store the
  journal on a block device, you should partition your journal disk and assign
  one partition per OSD.

- **Check Max Threadcount:** If you have a node with a lot of OSDs, you may be
  hitting the default maximum number of threads (e.g., usually 32k), especially
  during recovery. You can use ``sysctl`` to check whether raising the maximum
  number of threads to the largest allowed value (i.e., 4194303) helps. For
  example::

      sysctl -w kernel.pid_max=4194303

  If increasing the maximum thread count resolves the issue, you can make it
  permanent by including a ``kernel.pid_max`` setting in the
  ``/etc/sysctl.conf`` file. For example::

      kernel.pid_max = 4194303

- **Kernel Version:** Identify the kernel version and distribution you
  are using. Ceph uses some third party tools by default, which may be
  buggy or may conflict with certain distributions and/or kernel
  versions (e.g., Google perftools). Check the `OS recommendations`_
  to ensure you have addressed any issues related to your kernel.

- **Segmentation Fault:** If there is a segmentation fault, turn your logging up
  (if it isn't already), and try again. If it segfaults again,
  contact the ceph-devel email list and provide your Ceph configuration
  file, your monitor output and the contents of your log file(s).
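
If none of the above reveals the cause, it can help to run the failing daemon in
the foreground so that startup errors appear directly on the terminal. A minimal
sketch, assuming the failing daemon is ``osd.0``::

    ceph-osd -d -i 0     # -d: run in the foreground and log to stderr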



An OSD Failed
-------------

When a ``ceph-osd`` process dies, the monitor will learn about the failure
from surviving ``ceph-osd`` daemons and report it via the ``ceph health``
command::

    ceph health
    HEALTH_WARN 1/3 in osds are down

Specifically, you will get a warning whenever there are ``ceph-osd``
processes that are marked ``in`` and ``down``. You can identify which
``ceph-osds`` are ``down`` with::

    ceph health detail
    HEALTH_WARN 1/3 in osds are down
    osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080

If there is a disk failure or other fault preventing ``ceph-osd`` from
functioning or restarting, an error message should be present in its log
file in ``/var/log/ceph``.
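
For example, assuming the default cluster name ``ceph`` and a failed ``osd.0``,
the log can be inspected with::

    tail -n 200 /var/log/ceph/ceph-osd.0.log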

If the daemon stopped because of a heartbeat failure, the underlying
kernel file system may be unresponsive. Check ``dmesg`` output for disk
or other kernel errors.

If the problem is a software error (failed assertion or other
unexpected error), it should be reported to the `ceph-devel`_ email list.


No Free Drive Space
-------------------

Ceph prevents you from writing to a full OSD so that you don't lose data.
In an operational cluster, you should receive a warning when your cluster
is getting near its full ratio. The ``mon osd full ratio`` defaults to
``0.95``, or 95% of capacity, at which point clients are stopped from
writing data. The ``mon osd backfillfull ratio`` defaults to ``0.90``, or
90% of capacity, at which point backfills are blocked from starting. The
``mon osd nearfull ratio`` defaults to ``0.85``, or 85% of capacity, at
which point a health warning is generated.

Full cluster issues usually arise when testing how Ceph handles an OSD
failure on a small cluster. When one node has a high percentage of the
cluster's data, the cluster can easily exceed its nearfull and full ratios
immediately. If you are testing how Ceph reacts to OSD failures on a small
cluster, you should leave ample free disk space and consider temporarily
lowering the ``mon osd full ratio``, ``mon osd backfillfull ratio`` and
``mon osd nearfull ratio``.
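
As a sketch, these ratios could be lowered temporarily in ``ceph.conf`` on a
test cluster. The values below are illustrative only, and must keep
``full`` > ``backfillfull`` > ``nearfull``::

    [global]
    mon osd full ratio = .80
    mon osd backfillfull ratio = .75
    mon osd nearfull ratio = .70

Note that on recent releases these ratios may also be adjustable at runtime
(e.g., with ``ceph osd set-nearfull-ratio``); check the documentation for your
version.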

Full ``ceph-osds`` will be reported by ``ceph health``::

    ceph health
    HEALTH_WARN 1 nearfull osd(s)

Or::

    ceph health detail
    HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s)
    osd.3 is full at 97%
    osd.4 is backfill full at 91%
    osd.2 is near full at 87%
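
To see per-OSD utilization and identify which OSDs are approaching their
ratios, recent releases also provide::

    ceph osd df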

The best way to deal with a full cluster is to add new ``ceph-osds``, allowing
the cluster to redistribute data to the newly available storage.

If you cannot start an OSD because it is full, you may delete some data by deleting
some placement group directories in the full OSD.

.. important:: If you choose to delete a placement group directory on a full OSD,
   **DO NOT** delete the same placement group directory on another full OSD, or
   **YOU MAY LOSE DATA**. You **MUST** maintain at least one copy of your data on
   at least one OSD.

See `Monitor Config Reference`_ for additional details.


OSDs are Slow/Unresponsive
==========================

A commonly recurring issue involves slow or unresponsive OSDs. Ensure that you
have eliminated other troubleshooting possibilities before delving into OSD
performance issues. For example, ensure that your network(s) is working properly
and your OSDs are running. Check to see if OSDs are throttling recovery traffic.

.. tip:: Newer versions of Ceph provide better recovery handling by preventing
   recovering OSDs from using up system resources to the point that ``up`` and
   ``in`` OSDs become unavailable or slow.

Networking Issues
-----------------

Ceph is a distributed storage system, so it depends upon networks for OSD
peering and replication, recovery from faults, and heartbeat checks.
Networking issues can cause OSD latency and flapping OSDs. See
`Flapping OSDs`_ for details.

Ensure that Ceph processes and Ceph-dependent processes are connected and/or
listening. ::

    netstat -a | grep ceph
    netstat -l | grep ceph
    sudo netstat -p | grep ceph

Check network statistics. ::

    netstat -s
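
Per-interface error and drop counters can also reveal a flaky link; one way to
view them (assuming the ``iproute2`` tools are installed)::

    ip -s link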


Drive Configuration
-------------------

A storage drive should only support one OSD. Sequential read and sequential
write throughput can bottleneck if other processes share the drive, including
journals, operating systems, monitors, other OSDs and non-Ceph processes.

Ceph acknowledges writes *after* journaling, so fast SSDs are an attractive
option to accelerate response time, particularly when using the ``XFS`` or
``ext4`` filesystems. By contrast, the ``btrfs`` filesystem can write and journal
simultaneously.

.. note:: Partitioning a drive does not change its total throughput or
   sequential read/write limits. Running a journal in a separate partition
   may help, but you should prefer a separate physical drive.


Bad Sectors / Fragmented Disk
-----------------------------

Check your disks for bad sectors and fragmentation. Both can cause total
throughput to drop substantially.


Co-resident Monitors/OSDs
-------------------------

Monitors are generally light-weight processes, but they do lots of ``fsync()``,
which can interfere with other workloads, particularly if monitors run on the
same drive as your OSDs. Additionally, if you run monitors on the same host as
the OSDs, you may incur performance issues related to:

- Running an older kernel (pre-3.0)
- Running Argonaut with an old ``glibc``
- Running a kernel with no ``syncfs(2)`` syscall

In these cases, multiple OSDs running on the same host can drag each other down
by doing lots of commits. That often leads to bursty writes.


Co-resident Processes
---------------------

Spinning up co-resident processes such as cloud-based solutions, virtual
machines and other applications that write data to Ceph while operating on the
same hardware as OSDs can introduce significant OSD latency. Generally, we
recommend optimizing hosts for use with Ceph and using other hosts for other
processes. The practice of separating Ceph operations from other applications
may help improve performance and may streamline troubleshooting and maintenance.


Logging Levels
--------------

If you turned logging levels up to track an issue and then forgot to turn
logging levels back down, the OSD may be writing a large volume of logs to
disk. If you intend to keep logging levels high, you may consider mounting a
separate drive at the default logging location (i.e., ``/var/log/ceph/$cluster-$name.log``).
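
If you only need to turn the levels back down, this can usually be done at
runtime without restarting the daemons; a sketch, assuming the OSD and messenger
subsystems were the ones raised::

    ceph tell osd.* injectargs '--debug-osd 1/5 --debug-ms 0/5'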


Recovery Throttling
-------------------

Depending upon your configuration, Ceph may reduce recovery rates to maintain
performance or it may increase recovery rates to the point that recovery
impacts OSD performance. Check to see if the OSD is recovering.
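
A minimal sketch of checking for recovery activity and, if necessary, throttling
it at runtime; the values shown are illustrative, and the option names assume a
reasonably recent release::

    ceph -s     # look for recovering/backfilling PGs and recovery I/O in the output
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'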


Kernel Version
--------------

Check the kernel version you are running. Older kernels may not receive
new backports that Ceph depends upon for better performance.


Kernel Issues with SyncFS
-------------------------

Try running one OSD per host to see if performance improves. Old kernels
might not have a recent enough version of ``glibc`` to support ``syncfs(2)``.


Filesystem Issues
-----------------

Currently, we recommend deploying clusters with XFS. The btrfs
filesystem has many attractive features, but bugs in the filesystem may
lead to performance issues. We do not recommend ext4 because xattr size
limitations break our support for long object names (needed for RGW).

For more information, see `Filesystem Recommendations`_.

.. _Filesystem Recommendations: ../configuration/filesystem-recommendations


Insufficient RAM
----------------

We recommend 1GB of RAM per OSD daemon. You may notice that during normal
operations, the OSD only uses a fraction of that amount (e.g., 100-200MB).
Unused RAM makes it tempting to use the excess RAM for co-resident applications,
VMs and so forth. However, when OSDs go into recovery mode, their memory
utilization spikes. If there is no spare RAM available, OSD performance will
slow considerably.


Old Requests or Slow Requests
-----------------------------

If a ``ceph-osd`` daemon is slow to respond to a request, it will generate log messages
complaining about requests that are taking too long. The warning threshold
defaults to 30 seconds, and is configurable via the ``osd op complaint time``
option. When this happens, the cluster log will receive these messages as well.
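
For example, the threshold could be adjusted in ``ceph.conf`` (a sketch; ``30``
is the default, so only change it deliberately)::

    [osd]
    osd op complaint time = 30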

Legacy versions of Ceph complain about ``old requests``::

    osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops

New versions of Ceph complain about ``slow requests``::

    {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
    {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]


Possible causes include:

- A bad drive (check ``dmesg`` output)
- A bug in the kernel file system (check ``dmesg`` output)
- An overloaded cluster (check system load, iostat, etc.)
- A bug in the ``ceph-osd`` daemon

Possible solutions:

- Remove VMs and other cloud solutions from Ceph hosts
- Upgrade the kernel
- Upgrade Ceph
- Restart OSDs



Flapping OSDs
=============

We recommend using both a public (front-end) network and a cluster (back-end)
network so that you can better meet the capacity requirements of object
replication. Another advantage is that you can run a cluster network such that
it isn't connected to the internet, thereby preventing some denial of service
attacks. When OSDs peer and check heartbeats, they use the cluster (back-end)
network when it's available. See `Monitor/OSD Interaction`_ for details.
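
A minimal sketch of how the two networks might be declared in ``ceph.conf``;
the subnets below are placeholders for your own addressing::

    [global]
    public network = 10.0.0.0/24
    cluster network = 10.0.1.0/24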

However, if the cluster (back-end) network fails or develops significant latency
while the public (front-end) network operates optimally, OSDs currently do not
handle this situation well. What happens is that OSDs mark each other ``down``
on the monitor, while marking themselves ``up``. We call this scenario
'flapping'.

If something is causing OSDs to 'flap' (repeatedly getting marked ``down`` and
then ``up`` again), you can force the monitors to stop the flapping with::

    ceph osd set noup      # prevent OSDs from getting marked up
    ceph osd set nodown    # prevent OSDs from getting marked down

These flags are recorded in the osdmap structure::

    ceph osd dump | grep flags
    flags no-up,no-down

You can clear the flags with::

    ceph osd unset noup
    ceph osd unset nodown

Two other flags are supported, ``noin`` and ``noout``, which prevent
booting OSDs from being marked ``in`` (allocated data) or protect OSDs
from eventually being marked ``out`` (regardless of what the current value for
``mon osd down out interval`` is).

.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the
   sense that once the flags are cleared, the action they were blocking
   should occur shortly after. The ``noin`` flag, on the other hand,
   prevents OSDs from being marked ``in`` on boot, and any daemons that
   started while the flag was set will remain that way.


.. _iostat: http://en.wikipedia.org/wiki/Iostat
.. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging
.. _Logging and Debugging: ../log-and-debug
.. _Debugging and Logging: ../debug
.. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction
.. _Monitor Config Reference: ../../configuration/mon-config-ref
.. _monitoring your OSDs: ../../operations/monitoring-osd-pg
.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel
.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel
.. _subscribe to the ceph-users email list: mailto:ceph-users-join@lists.ceph.com
.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@lists.ceph.com
.. _OS recommendations: ../../../start/os-recommendations
.. _ceph-devel: ceph-devel@vger.kernel.org