=================
 Troubleshooting
=================

Slow/stuck operations
=====================

If you are experiencing apparent hung operations, the first task is to identify
where the problem is occurring: in the client, the MDS, or the network connecting
them. Start by looking to see if either side has stuck operations
(:ref:`slow_requests`, below), and narrow it down from there.

We can get hints about what's going on by dumping the MDS cache ::

    ceph daemon mds.<name> dump cache /tmp/dump.txt

.. note:: The file `dump.txt` is written on the machine running the MDS. For
   systemd-controlled MDS services, this is in a tmpfs inside the MDS container.
   Use `nsenter(1)` to locate `dump.txt`, or specify another system-wide path.
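
For example, a minimal sketch of retrieving the dump from a containerized MDS
via `nsenter(1)`; the PID lookup and destination path are illustrative and will
differ on your system::

    # illustrative: find the PID of the (single) ceph-mds process on this host
    MDS_PID=$(pgrep -x ceph-mds)

    # read the dump from inside the MDS's mount namespace
    nsenter --target "$MDS_PID" --mount cat /tmp/dump.txt > /root/mds-dump.txt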

If high logging levels are set on the MDS, its log file will almost certainly
hold the information we need to diagnose and solve the issue.
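
If they are not already set, debug levels can be raised at runtime. A hedged
example (the level of 20 is illustrative; verbose logging has a real
performance cost, so lower it again when done)::

    ceph config set mds debug_mds 20
    ceph config set mds debug_ms 1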

RADOS Health
============

If part of the CephFS metadata or data pools is unavailable and CephFS is not
responding, it is probably because RADOS itself is unhealthy. Resolve those
problems first (:doc:`../rados/troubleshooting/index`).

The MDS
=======

If an operation is hung inside the MDS, it will eventually show up in ``ceph health``,
identifying "slow requests are blocked". It may also identify clients as
"failing to respond" or misbehaving in other ways. If the MDS identifies
specific clients as misbehaving, you should investigate why they are doing so.

Generally it will be the result of one of the following:

#. Overloading the system (if you have extra RAM, increase the
   "mds cache memory limit" config from its default 1GiB, as shown in the
   example after this list; having a larger active file set than your MDS
   cache is the #1 cause of this!).

#. Running an older (misbehaving) client.

#. Underlying RADOS issues.
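
As an illustration of the first point, a hedged example of raising the MDS
cache limit at runtime; the 8 GiB figure is arbitrary and should be sized to
your RAM and active file set::

    ceph config set mds mds_cache_memory_limit 8589934592   # 8 GiB, in bytes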

Otherwise, you have probably discovered a new bug and should report it to
the developers!

.. _slow_requests:

Slow requests (MDS)
-------------------

You can list current operations via the admin socket by running::

    ceph daemon mds.<name> dump_ops_in_flight

from the MDS host. Identify the stuck commands and examine why they are stuck.
Usually the last "event" will have been an attempt to gather locks, or sending
the operation off to the MDS log. If it is waiting on the OSDs, fix them. If
operations are stuck on a specific inode, you probably have a client holding
caps which prevent others from using it, either because the client is trying
to flush out dirty data or because you have encountered a bug in CephFS'
distributed file lock code (the file "capabilities" ["caps"] system).
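
A hedged sketch of pulling out just the most useful fields with ``jq``; the
exact JSON field names can vary between releases, so treat this as a starting
point::

    ceph daemon mds.<name> dump_ops_in_flight \
        | jq '.ops[] | {description, age, flag_point: .type_data.flag_point}'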

If it's a result of a bug in the capabilities code, restarting the MDS
is likely to resolve the problem.
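
A hedged example of doing so; the daemon name is a placeholder, and which
command is appropriate depends on how your MDS is deployed::

    # restart the daemon in place (package/systemd deployments)
    systemctl restart ceph-mds@<name>

    # or ask the cluster to fail the active MDS over to a standby
    ceph mds fail <name>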

If there are no slow requests reported on the MDS, and it is not reporting
that clients are misbehaving, either the client has a problem or its
requests are not reaching the MDS.

.. _ceph_fuse_debugging:

ceph-fuse debugging
===================

ceph-fuse also supports ``dump_ops_in_flight``. See if it has any and where they are
stuck.

Debug output
------------

To get more debugging information from ceph-fuse, try running in the foreground
with logging to the console (``-d``), enabling client debug output
(``--debug-client=20``), and printing each message sent
(``--debug-ms=1``).

If you suspect a potential monitor issue, enable monitor debugging as well
(``--debug-monc=20``).
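
Putting those flags together, an illustrative invocation (the mount point is a
placeholder)::

    ceph-fuse -d --debug-client=20 --debug-ms=1 --debug-monc=20 /mnt/cephfs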

.. _kernel_mount_debugging:

Kernel mount debugging
======================

If there is an issue with the kernel client, the most important thing is
figuring out whether the problem is with the kernel client or the MDS. Generally,
this is easy to work out. If the kernel client broke directly, there will be
output in ``dmesg``. Collect it and any relevant kernel state.
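
For example, a quick, illustrative way to pull the relevant messages out of
the kernel log::

    dmesg -T | grep -iE 'ceph|libceph'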

Slow requests
-------------

Unfortunately the kernel client does not support the admin socket, but it has
similar (if limited) interfaces if your kernel has debugfs enabled. There
will be a folder in ``/sys/kernel/debug/ceph/``, and that folder (whose name will
look something like ``28f7427e-5558-4ffd-ae1a-51ec3042759a.client25386880``)
will contain a variety of files that output interesting information when you
``cat`` them. These files are described below; the most interesting when
debugging slow requests are probably the ``mdsc`` and ``osdc`` files.

* bdi: BDI info about the Ceph system (blocks dirtied, written, etc.)
* caps: counts of file "caps" structures in-memory and used
* client_options: dumps the options provided to the CephFS mount
* dentry_lru: Dumps the CephFS dentries currently in-memory
* mdsc: Dumps current requests to the MDS
* mdsmap: Dumps the current MDSMap epoch and MDSes
* mds_sessions: Dumps the current sessions to MDSes
* monc: Dumps the current maps from the monitor, and any "subscriptions" held
* monmap: Dumps the current monitor map epoch and monitors
* osdc: Dumps the current ops in-flight to OSDs (i.e., file data IO)
* osdmap: Dumps the current OSDMap epoch, pools, and OSDs
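
A hedged example of inspecting the two most useful files; debugfs is normally
mounted already on modern distributions, but the first command is shown in
case it is not::

    # mount debugfs if it is not already mounted
    mount -t debugfs none /sys/kernel/debug

    # in-flight MDS requests and OSD (file data) requests, for every mount
    cat /sys/kernel/debug/ceph/*/mdsc
    cat /sys/kernel/debug/ceph/*/osdc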

If there are no stuck requests but you have file IO which is not progressing,
you might have a...

Disconnected+Remounted FS
=========================

Because CephFS has a "consistent cache", if your network connection is
disrupted for a long enough time, the client will be forcibly
disconnected from the system. At this point, the kernel client is in
a bind: it cannot safely write back dirty data, and many applications
do not handle IO errors correctly on close().
At the moment, the kernel client will remount the FS, but outstanding file system
IO may or may not be satisfied. In these cases, you may need to reboot your
client system.

You can identify that you are in this situation if dmesg/kern.log reports something like::

    Jul 20 08:14:38 teuthology kernel: [3677601.123718] ceph: mds0 closed our session
    Jul 20 08:14:38 teuthology kernel: [3677601.128019] ceph: mds0 reconnect start
    Jul 20 08:14:39 teuthology kernel: [3677602.093378] ceph: mds0 reconnect denied
    Jul 20 08:14:39 teuthology kernel: [3677602.098525] ceph: dropping dirty+flushing Fw state for ffff8802dc150518 1099935956631
    Jul 20 08:14:39 teuthology kernel: [3677602.107145] ceph: dropping dirty+flushing Fw state for ffff8801008e8518 1099935946707
    Jul 20 08:14:39 teuthology kernel: [3677602.196747] libceph: mds0 172.21.5.114:6812 socket closed (con state OPEN)
    Jul 20 08:14:40 teuthology kernel: [3677603.126214] libceph: mds0 172.21.5.114:6812 connection reset
    Jul 20 08:14:40 teuthology kernel: [3677603.132176] libceph: reset on mds0

This is an area of ongoing work to improve the behavior. Kernels will soon
be reliably issuing error codes to in-progress IO, although your application(s)
may not deal with them well. In the longer term, we hope to allow reconnect
and reclaim of data in cases where doing so won't violate POSIX semantics
(generally, data which hasn't been accessed or modified by other clients).

Mounting
========

Mount 5 Error
-------------

A mount 5 error typically occurs if an MDS server is laggy or if it has crashed.
Ensure that at least one MDS is up and running and that the cluster is
``active + healthy``.
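
A hedged example of commands for checking the MDS and overall cluster state::

    ceph fs status    # filesystems, their MDS daemons, and daemon states
    ceph status       # overall cluster health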

Mount 12 Error
--------------

A mount 12 error with ``cannot allocate memory`` usually occurs if you have a
version mismatch between the :term:`Ceph Client` version and the :term:`Ceph
Storage Cluster` version. Check the versions using::

    ceph -v
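
``ceph -v`` reports the locally installed client version. A hedged example of
also listing the versions of the running cluster daemons for comparison (this
needs a working connection to the cluster and an appropriate keyring)::

    ceph versions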

If the Ceph Client is behind the Ceph cluster, try to upgrade it::

    sudo apt-get update && sudo apt-get install ceph-common

You may need to uninstall, autoclean and autoremove ``ceph-common``
and then reinstall it so that you have the latest version.

Dynamic Debugging
=================

You can enable dynamic debug output for the CephFS kernel module.

Please see: https://github.com/ceph/ceph/blob/master/src/script/kcon_all.sh
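
A minimal, hedged sketch of enabling this by hand through the kernel's dynamic
debug facility (requires root, debugfs, and a kernel built with
``CONFIG_DYNAMIC_DEBUG``)::

    # enable debug messages from the ceph and libceph modules
    echo 'module ceph +p' > /sys/kernel/debug/dynamic_debug/control
    echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control

    # disable them again when finished
    echo 'module ceph -p' > /sys/kernel/debug/dynamic_debug/control
    echo 'module libceph -p' > /sys/kernel/debug/dynamic_debug/control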

Reporting Issues
================

If you have identified a specific issue, please report it with as much
information as possible. Especially important information:

* Ceph versions installed on client and server
* Whether you are using the kernel or fuse client
* If you are using the kernel client, what kernel version?
* How many clients are in play, doing what kind of workload?
* If a system is 'stuck', is that affecting all clients or just one?
* Any ceph health messages
* Any backtraces in the ceph logs from crashes

If you are satisfied that you have found a bug, please file it on `the bug
tracker`_. For more general queries, please write to the `ceph-users mailing
list`_.

.. _the bug tracker: http://tracker.ceph.com
.. _ceph-users mailing list: http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com/