=================
 Troubleshooting
=================

Slow/stuck operations
=====================

If you are experiencing apparent hung operations, the first task is to identify
where the problem is occurring: in the client, the MDS, or the network connecting
them. Start by looking to see if either side has stuck operations
(:ref:`slow_requests`, below), and narrow it down from there.

RADOS Health
============

If part of the CephFS metadata or data pools is unavailable and CephFS isn't
responding, it is probably because RADOS itself is unhealthy. Resolve those
problems first (:doc:`../../rados/troubleshooting/index`).

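A quick way to check overall RADOS health (standard commands; run from any
node with an admin keyring)::

    ceph status
    ceph health detail
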
The MDS
=======

If an operation is hung inside the MDS, it will eventually show up in ``ceph health``,
identifying "slow requests are blocked". It may also identify clients as
"failing to respond" or misbehaving in other ways. If the MDS identifies
specific clients as misbehaving, you should investigate why they are doing so.
Generally it will be the result of

#. overloading the system (if you have extra RAM, increase the
   "mds cache size" config from its default 100000, as sketched after this
   list; having a larger active file set than your MDS cache is the #1 cause
   of this!),
#. running an older (misbehaving) client, or
#. underlying RADOS issues.

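For example, to raise the cache size on a running daemon via its admin socket
(a sketch; ``<name>`` and the value ``500000`` are placeholders to adapt to
your deployment)::

    ceph daemon mds.<name> config set mds_cache_size 500000

To make the change persist across restarts, you would also set
``mds cache size = 500000`` in the ``[mds]`` section of ``ceph.conf``.
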
Otherwise, you have probably discovered a new bug and should report it to
the developers!

.. _slow_requests:

Slow requests (MDS)
-------------------
You can list current operations via the admin socket by running the following
command from the MDS host::

    ceph daemon mds.<name> dump_ops_in_flight

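The output is JSON. An abridged, illustrative sketch of its shape (the field
names and values here are examples and vary by release)::

    {
        "ops": [
            {
                "description": "client_request(client.4324:1234 getattr ...)",
                "initiated_at": "2016-03-08 22:26:00.019407",
                "age": 30.43,
                "type_data": {
                    "flag_point": "failed to rdlock, waiting",
                    ...
                }
            }
        ],
        "num_ops": 1
    }
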
Identify the stuck commands and examine why they are stuck.
Usually the last "event" will have been an attempt to gather locks, or sending
the operation off to the MDS log. If it is waiting on the OSDs, fix them. If
operations are stuck on a specific inode, you probably have a client holding
caps which prevent others from using it, either because the client is trying
to flush out dirty data or because you've encountered a bug in CephFS'
distributed file lock code (the file "capabilities" ["caps"] system).

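You can also see which clients hold sessions (and how many caps each holds)
via the same admin socket::

    ceph daemon mds.<name> session ls
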
If it's a result of a bug in the capabilities code, restarting the MDS
is likely to resolve the problem.

If there are no slow requests reported on the MDS, and it isn't reporting
that clients are misbehaving, either the client has a problem or its
requests aren't reaching the MDS.

ceph-fuse debugging
===================

ceph-fuse also supports ``dump_ops_in_flight``. Check whether it has any
in-flight operations and where they are stuck.

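Since ceph-fuse is a userspace client, the command goes through its admin
socket on the client host (a sketch; the socket path is an assumption that
depends on your configuration)::

    ceph daemon /var/run/ceph/ceph-client.admin.asok dump_ops_in_flight
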
Debug output
------------

To get more debugging information from ceph-fuse, try running it in the
foreground with logging to the console (``-d``), client debugging enabled
(``--debug-client=20``), and message prints enabled for each message sent
(``--debug-ms=1``).

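Put together, an invocation might look like this (a sketch; the monitor
address and mount point are placeholders)::

    ceph-fuse -d --debug-client=20 --debug-ms=1 -m <mon-host> /mnt/cephfs
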
If you suspect a potential monitor issue, enable monitor debugging as well
(``--debug-monc=20``).


Kernel mount debugging
======================

Slow requests
-------------

Unfortunately the kernel client does not support the admin socket, but it has
similar (if limited) interfaces if your kernel has debugfs enabled. There
will be a folder in ``/sys/kernel/debug/ceph/``, and that folder (whose name will
look something like ``28f7427e-5558-4ffd-ae1a-51ec3042759a.client25386880``)
will contain a variety of files that produce interesting output when you ``cat``
them. These files are described below; the most interesting when debugging
slow requests are probably the ``mdsc`` and ``osdc`` files.

* bdi: BDI info about the Ceph system (blocks dirtied, written, etc.)
* caps: counts of file "caps" structures in-memory and used
* client_options: dumps the options provided to the CephFS mount
* dentry_lru: dumps the CephFS dentries currently in-memory
* mdsc: dumps current requests to the MDS
* mdsmap: dumps the current MDSMap epoch and MDSes
* mds_sessions: dumps the current sessions to MDSes
* monc: dumps the current maps from the monitor, and any "subscriptions" held
* monmap: dumps the current monitor map epoch and monitors
* osdc: dumps the current ops in-flight to OSDs (i.e., file data IO)
* osdmap: dumps the current OSDMap epoch, pools, and OSDs

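For example, to inspect in-flight MDS and OSD requests (if the directory is
empty, debugfs may first need to be mounted)::

    mount -t debugfs none /sys/kernel/debug    # only if not already mounted
    cat /sys/kernel/debug/ceph/*/mdsc
    cat /sys/kernel/debug/ceph/*/osdc
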
If there are no stuck requests but you have file IO which isn't progressing,
you might have a...

Disconnected+Remounted FS
=========================
Because CephFS has a "consistent cache", if your network connection is
disrupted for a long enough time, the client will be forcibly
disconnected from the system. At this point, the kernel client is in
a bind: it can't safely write back dirty data, and many applications
do not handle IO errors correctly on ``close()``.
At the moment, the kernel client will remount the FS, but outstanding filesystem
IO may or may not be satisfied. In these cases, you may need to reboot your
client system.

You can identify that you are in this situation if dmesg/kern.log report something like::

    Jul 20 08:14:38 teuthology kernel: [3677601.123718] ceph: mds0 closed our session
    Jul 20 08:14:38 teuthology kernel: [3677601.128019] ceph: mds0 reconnect start
    Jul 20 08:14:39 teuthology kernel: [3677602.093378] ceph: mds0 reconnect denied
    Jul 20 08:14:39 teuthology kernel: [3677602.098525] ceph: dropping dirty+flushing Fw state for ffff8802dc150518 1099935956631
    Jul 20 08:14:39 teuthology kernel: [3677602.107145] ceph: dropping dirty+flushing Fw state for ffff8801008e8518 1099935946707
    Jul 20 08:14:39 teuthology kernel: [3677602.196747] libceph: mds0 172.21.5.114:6812 socket closed (con state OPEN)
    Jul 20 08:14:40 teuthology kernel: [3677603.126214] libceph: mds0 172.21.5.114:6812 connection reset
    Jul 20 08:14:40 teuthology kernel: [3677603.132176] libceph: reset on mds0

This is an area of ongoing work to improve the behavior. Kernels will soon
be reliably issuing error codes to in-progress IO, although your application(s)
may not deal with them well. In the longer-term, we hope to allow reconnect
and reclaim of data in cases where it won't violate POSIX semantics (generally,
data which hasn't been accessed or modified by other clients).

Mounting
========

Mount 5 Error
-------------

A mount 5 error typically occurs if an MDS server is laggy or if it crashed.
Ensure at least one MDS is up and running, and the cluster is ``active +
healthy``.

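You can verify the MDS state with standard status commands::

    ceph mds stat
    ceph -s
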
Mount 12 Error
--------------

A mount 12 error with ``cannot allocate memory`` usually occurs if you have a
version mismatch between the :term:`Ceph Client` version and the :term:`Ceph
Storage Cluster` version. Check the versions using::

    ceph -v

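Note that ``ceph -v`` reports the locally installed version, so compare the
output on the client with that on a cluster node (``<mon-host>`` here is a
placeholder)::

    ceph -v                   # on the client
    ssh <mon-host> ceph -v    # on a cluster node
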
If the Ceph Client is behind the Ceph cluster, try to upgrade it::

    sudo apt-get update && sudo apt-get install ceph-common

You may need to uninstall, autoclean and autoremove ``ceph-common``
and then reinstall it so that you have the latest version.
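
For example, one such sequence might be (a sketch; adapt to your package
manager)::

    sudo apt-get purge ceph-common
    sudo apt-get autoremove && sudo apt-get autoclean
    sudo apt-get update && sudo apt-get install ceph-common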