=================
 Troubleshooting
=================

Slow/stuck operations
=====================

If you are experiencing apparent hung operations, the first task is to identify
where the problem is occurring: in the client, the MDS, or the network connecting
them. Start by looking to see if either side has stuck operations
(:ref:`slow_requests`, below), and narrow it down from there.

RADOS Health
============

If part of the CephFS metadata or data pools is unavailable and CephFS is not
responding, it is probably because RADOS itself is unhealthy. Resolve those
problems first (:doc:`../../rados/troubleshooting/index`).
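
For example, you can quickly confirm whether the underlying RADOS cluster is
healthy by checking its overall status and the pools backing CephFS. The pool
names below are the conventional defaults; substitute your own::

    ceph status
    ceph health detail
    # the pool names here are illustrative; list yours with "ceph osd lspools"
    ceph osd pool stats cephfs_data cephfs_metadata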

The MDS
=======

If an operation is hung inside the MDS, it will eventually show up in ``ceph health``,
identifying "slow requests are blocked". It may also identify clients as
"failing to respond" or misbehaving in other ways. If the MDS identifies
specific clients as misbehaving, you should investigate why they are doing so.
Generally it will be the result of:

1) overloading the system (if you have extra RAM, increase the
   "mds cache size" config from its default 100000; having a larger active
   file set than your MDS cache is the #1 cause of this, see the example
   below),
2) running an older (misbehaving) client, or
3) underlying RADOS issues.
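
As a sketch of the first case, the cache size can be raised in ``ceph.conf``
on the MDS host or injected at runtime through the admin socket; the value
below is only an illustration, so size it to your RAM and active file set::

    # ceph.conf on the MDS host
    [mds]
        mds cache size = 250000

    # or, at runtime via the admin socket (takes effect without a restart)
    ceph daemon mds.<name> config set mds_cache_size 250000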

Otherwise, you have probably discovered a new bug and should report it to
the developers!

.. _slow_requests:

Slow requests (MDS)
-------------------
You can list current operations via the admin socket by running::

    ceph daemon mds.<name> dump_ops_in_flight

from the MDS host. Identify the stuck commands and examine why they are stuck.
Usually the last "event" will have been an attempt to gather locks, or sending
the operation off to the MDS log. If it is waiting on the OSDs, fix them. If
operations are stuck on a specific inode, you probably have a client holding
caps which prevent others from using it, either because the client is trying
to flush out dirty data or because you have encountered a bug in CephFS'
distributed file lock code (the file "capabilities" ["caps"] system).
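
If there are many ops in flight, it can help to summarise them. The sketch
below assumes the Luminous-era JSON layout, where each op carries an ``age``,
a ``description``, and a ``type_data.flag_point`` field, and that ``jq`` is
installed; adjust the field names if your release differs::

    ceph daemon mds.<name> dump_ops_in_flight | \
        jq '.ops[] | {age, description, flag_point: .type_data.flag_point}'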

If it's a result of a bug in the capabilities code, restarting the MDS
is likely to resolve the problem.

If there are no slow requests reported on the MDS, and it is not reporting
that clients are misbehaving, either the client has a problem or its
requests are not reaching the MDS.

ceph-fuse debugging
===================

ceph-fuse also supports ``dump_ops_in_flight``. See if it has any and where
they are stuck.
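
For example, assuming the client's admin socket lives in the default
``/var/run/ceph/`` directory (the exact file name includes the client name
and PID, so list the directory first; the name below is only illustrative)::

    ls /var/run/ceph/
    ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok dump_ops_in_flight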

Debug output
------------

To get more debugging information from ceph-fuse, try running it in the
foreground with logging to the console (``-d``), enabling client debug
(``--debug-client=20``) and enabling prints for each message sent
(``--debug-ms=1``).

If you suspect a potential monitor issue, enable monitor debugging as well
(``--debug-monc=20``).
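
Putting that together, a typical invocation might look like the following;
the monitor address and mount point are placeholders for your own::

    # run in the foreground with client and messenger debugging; add
    # --debug-monc=20 if you also suspect a monitor problem
    ceph-fuse -d -m mon-host:6789 /mnt/cephfs --debug-client=20 --debug-ms=1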


Kernel mount debugging
======================

Slow requests
-------------

Unfortunately the kernel client does not support the admin socket, but it has
similar (if limited) interfaces if your kernel has debugfs enabled. There
will be a directory in ``/sys/kernel/debug/ceph/``, and that directory (whose
name will look something like ``28f7427e-5558-4ffd-ae1a-51ec3042759a.client25386880``)
will contain a variety of files that output interesting information when you
``cat`` them. These files are described below; the most interesting when
debugging slow requests are probably the ``mdsc`` and ``osdc`` files (see the
example after the list).

* bdi: BDI info about the Ceph system (blocks dirtied, written, etc.)
* caps: counts of file "caps" structures in-memory and used
* client_options: dumps the options provided to the CephFS mount
* dentry_lru: dumps the CephFS dentries currently in-memory
* mdsc: dumps current requests to the MDS
* mdsmap: dumps the current MDSMap epoch and MDSes
* mds_sessions: dumps the current sessions to MDSes
* monc: dumps the current maps from the monitor, and any "subscriptions" held
* monmap: dumps the current monitor map epoch and monitors
* osdc: dumps the current ops in-flight to OSDs (i.e., file data IO)
* osdmap: dumps the current OSDMap epoch, pools, and OSDs
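
For example, to look at the outstanding MDS and OSD requests for a particular
mount (the directory name is only illustrative; use whatever you find under
``/sys/kernel/debug/ceph/``)::

    ls /sys/kernel/debug/ceph/
    cd /sys/kernel/debug/ceph/<fsid>.client<id>
    cat mdsc    # requests currently outstanding to the MDS
    cat osdc    # file data IO currently outstanding to OSDs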

If there are no stuck requests but you have file IO which is not progressing,
you might have a...

Disconnected+Remounted FS
=========================
Because CephFS has a "consistent cache", if your network connection is
disrupted for a long enough time, the client will be forcibly
disconnected from the system. At this point, the kernel client is in
a bind: it cannot safely write back dirty data, and many applications
do not handle IO errors correctly on close().
At the moment, the kernel client will remount the FS, but outstanding filesystem
IO may or may not be satisfied. In these cases, you may need to reboot your
client system.

You can identify that you are in this situation if dmesg/kern.log reports
something like::

    Jul 20 08:14:38 teuthology kernel: [3677601.123718] ceph: mds0 closed our session
    Jul 20 08:14:38 teuthology kernel: [3677601.128019] ceph: mds0 reconnect start
    Jul 20 08:14:39 teuthology kernel: [3677602.093378] ceph: mds0 reconnect denied
    Jul 20 08:14:39 teuthology kernel: [3677602.098525] ceph: dropping dirty+flushing Fw state for ffff8802dc150518 1099935956631
    Jul 20 08:14:39 teuthology kernel: [3677602.107145] ceph: dropping dirty+flushing Fw state for ffff8801008e8518 1099935946707
    Jul 20 08:14:39 teuthology kernel: [3677602.196747] libceph: mds0 172.21.5.114:6812 socket closed (con state OPEN)
    Jul 20 08:14:40 teuthology kernel: [3677603.126214] libceph: mds0 172.21.5.114:6812 connection reset
    Jul 20 08:14:40 teuthology kernel: [3677603.132176] libceph: reset on mds0

This is an area of ongoing work to improve the behavior. Kernels will soon
be reliably issuing error codes to in-progress IO, although your application(s)
may not deal with them well. In the longer-term, we hope to allow reconnect
and reclaim of data in cases where it won't violate POSIX semantics (generally,
data which hasn't been accessed or modified by other clients).

Mounting
========

Mount 5 Error
-------------

A mount 5 error typically occurs if an MDS server is laggy or if it crashed.
Ensure at least one MDS is up and running, and the cluster is ``active +
healthy``.
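
For example, check that an MDS is active and the cluster is healthy before
retrying the mount::

    ceph mds stat
    ceph -s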

Mount 12 Error
--------------

A mount 12 error with ``cannot allocate memory`` usually occurs if you have a
version mismatch between the :term:`Ceph Client` version and the :term:`Ceph
Storage Cluster` version. Check the versions using::

    ceph -v

If the Ceph Client is behind the Ceph cluster, try to upgrade it::

    sudo apt-get update && sudo apt-get install ceph-common

You may need to uninstall, autoclean and autoremove ``ceph-common``
and then reinstall it so that you have the latest version.
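
On Debian/Ubuntu systems that would look roughly like the following; this is
a sketch of the reinstall path rather than a required procedure::

    sudo apt-get remove ceph-common
    sudo apt-get autoclean
    sudo apt-get autoremove
    sudo apt-get update && sudo apt-get install ceph-common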