=================
 Troubleshooting
=================

Slow/stuck operations
=====================

If you are experiencing apparently hung operations, the first task is to
identify where the problem is occurring: in the client, the MDS, or the network
connecting them. Start by looking to see if either side has stuck operations
(:ref:`slow_requests`, below), and narrow it down from there.

We can get hints about what's going on by dumping the MDS cache::

  ceph daemon mds.<name> dump cache /tmp/dump.txt

.. note:: The file `dump.txt` is written on the machine running the MDS. For
   systemd-controlled MDS services this is a tmpfs inside the MDS container, so
   use `nsenter(1)` to locate `dump.txt`, or specify another system-wide path.

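As a minimal sketch of the `nsenter(1)` approach (the process lookup is an
assumption; adjust it to however your MDS is deployed), you can read the dump
from the host by entering the MDS process's mount namespace:

.. code:: bash

   # Find the MDS PID on the host, then read /tmp/dump.txt from inside
   # the container's mount namespace.
   MDS_PID=$(pgrep -x ceph-mds | head -n1)
   sudo nsenter --target "$MDS_PID" --mount cat /tmp/dump.txt > /root/dump.txt
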
If high logging levels are set on the MDS, that will almost certainly hold the
information we need to diagnose and solve the issue.

Stuck during recovery
=====================

Stuck in up:replay
------------------

If your MDS is stuck in ``up:replay`` then it is likely that the journal is
very long. Did you see ``MDS_HEALTH_TRIM`` cluster warnings saying the MDS is
behind on trimming its journal? If the journal has grown very large, it can
take hours to read. There is no way to work around this, but there are things
you can do to speed it along:

Reduce MDS debugging to 0. Even at the default settings, the MDS logs some
messages to memory for dumping if a fatal error is encountered. You can avoid
this:

.. code:: bash

   ceph config set mds debug_mds 0
   ceph config set mds debug_ms 0
   ceph config set mds debug_monc 0

Note that if the MDS fails there will then be virtually no information to
determine why. If you can calculate when ``up:replay`` will complete, you
should restore these configs just prior to entering the next state:

.. code:: bash

   ceph config rm mds debug_mds
   ceph config rm mds debug_ms
   ceph config rm mds debug_monc

Once you've got replay moving along faster, you can calculate when the MDS will
complete. This is done by examining the journal replay status:

.. code:: bash

   $ ceph tell mds.<fs_name>:0 status | jq .replay_status
   {
     "journal_read_pos": 4195244,
     "journal_write_pos": 4195244,
     "journal_expire_pos": 4194304,
     "num_events": 2,
     "num_segments": 2
   }

Replay completes when the ``journal_read_pos`` reaches the
``journal_write_pos``. The write position will not change during replay. Track
the progression of the read position to compute the expected time to complete.

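For example, here is a small sketch (a hypothetical helper, assuming ``jq`` is
installed and that rank 0 of the file system is the replaying MDS) that samples
the read position twice and estimates the remaining replay time:

.. code:: bash

   #!/usr/bin/env bash
   # Usage: ./replay-eta.sh <fs_name>
   # Samples journal_read_pos twice and estimates when replay will finish.
   FS="$1"
   read_pos() { ceph tell "mds.$FS:0" status | jq .replay_status.journal_read_pos; }
   POS1=$(read_pos); sleep 60; POS2=$(read_pos)
   WRITE_POS=$(ceph tell "mds.$FS:0" status | jq .replay_status.journal_write_pos)
   RATE=$(( (POS2 - POS1) / 60 ))   # journal bytes replayed per second
   if [ "$RATE" -gt 0 ]; then
       echo "Estimated seconds until replay completes: $(( (WRITE_POS - POS2) / RATE ))"
   else
       echo "Read position did not advance; sample over a longer interval."
   fi
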

Avoiding recovery roadblocks
----------------------------

When trying to urgently restore your file system during an outage, here are
some things to do:

* **Deny all reconnect to clients.** This effectively blocklists all existing
  CephFS sessions so all mounts will hang or become unavailable.

.. code:: bash

   ceph config set mds mds_deny_all_reconnect true

  Remember to undo this after the MDS becomes active (see the cleanup sketch
  after this list).

.. note:: This does not prevent new sessions from connecting. For that, see the
   ``refuse_client_session`` file system setting.

* **Extend the MDS heartbeat grace period**. This avoids replacing an MDS that
  appears "stuck" doing some operation. Sometimes recovery of an MDS may
  involve an operation that takes longer than expected (from the programmer's
  perspective). This is more likely when recovery is already taking longer than
  normal to complete (indicated by your reading this document). Avoid
  unnecessary replacement loops by extending the heartbeat grace period:

.. code:: bash

   ceph config set mds mds_heartbeat_grace 3600

  This has the effect of having the MDS continue to send beacons to the
  monitors even when its internal "heartbeat" mechanism has not been reset
  (beat) in one hour. Note that the previous mechanism for achieving this was
  the `mds_beacon_grace` monitor setting.

* **Disable open file table prefetch.** Normally, the MDS will prefetch
  directory contents during recovery to heat up its cache. During a long
  recovery, the cache is probably already hot **and large**, so this behavior
  can be undesirable. Disable it using:

.. code:: bash

   ceph config set mds mds_oft_prefetch_dirfrags false

* **Turn off clients.** Clients reconnecting to the newly ``up:active`` MDS may
  cause new load on the file system when it's just getting back on its feet.
  There will likely be some general maintenance to do before workloads should
  be resumed. For example, expediting journal trim may be advisable if the
  recovery took a long time because replay was reading an overly large journal.

  You can do this manually or use the new file system tunable:

.. code:: bash

   ceph fs set <fs_name> refuse_client_session true

  That prevents any clients from establishing new sessions with the MDS.

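Once the MDS is back in ``up:active`` and the cluster has settled, remember to
revert these temporary changes. A sketch of the corresponding cleanup, using
the same ``config rm`` pattern shown earlier (adjust to whichever settings you
actually changed):

.. code:: bash

   ceph config rm mds mds_deny_all_reconnect
   ceph config rm mds mds_heartbeat_grace
   ceph config rm mds mds_oft_prefetch_dirfrags
   ceph fs set <fs_name> refuse_client_session false
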
Expediting MDS journal trim
===========================

If your MDS journal grew too large (maybe your MDS was stuck in ``up:replay``
for a long time!), you will want to have the MDS trim its journal more
frequently. You will know the journal is too large because of
``MDS_HEALTH_TRIM`` warnings.

The main way to do this is to modify the MDS tick interval. The "tick" interval
drives several upkeep activities in the MDS. It is strongly recommended that no
significant file system load be present when modifying this tick interval. This
setting only affects an MDS in ``up:active``; the MDS does not trim its journal
during recovery.

.. code:: bash

   ceph config set mds mds_tick_interval 2

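Once the ``MDS_HEALTH_TRIM`` warnings clear, you will likely want to drop this
override so the tick interval returns to its default:

.. code:: bash

   ceph config rm mds mds_tick_interval
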
RADOS Health
============

If part of the CephFS metadata or data pools is unavailable and CephFS is not
responding, it is probably because RADOS itself is unhealthy. Resolve those
problems first (:doc:`../rados/troubleshooting/index`).

The MDS
=======

If an operation is hung inside the MDS, it will eventually show up in
``ceph health``, identifying "slow requests are blocked". It may also identify
clients as "failing to respond" or misbehaving in other ways. If the MDS
identifies specific clients as misbehaving, you should investigate why they are
doing so.

Generally it will be the result of:

#. Overloading the system (if you have extra RAM, increase the
   "mds cache memory limit" config from its default 1GiB; having a larger
   active file set than your MDS cache is the #1 cause of this!).

#. Running an older (misbehaving) client.

#. Underlying RADOS issues.

Otherwise, you have probably discovered a new bug and should report it to
the developers!

.. _slow_requests:

Slow requests (MDS)
-------------------

You can list current operations via the admin socket by running::

  ceph daemon mds.<name> dump_ops_in_flight

from the MDS host. Identify the stuck commands and examine why they are stuck.
Usually the last "event" will have been an attempt to gather locks, or sending
the operation off to the MDS log. If it is waiting on the OSDs, fix them. If
operations are stuck on a specific inode, you probably have a client holding
caps which prevent others from using it, either because the client is trying
to flush out dirty data or because you have encountered a bug in CephFS'
distributed file lock code (the file "capabilities" ["caps"] system).

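For a quick overview of what is stuck and why, you can filter the dump with
``jq`` (a sketch only; the exact field layout of the dump may vary between
releases):

.. code:: bash

   ceph daemon mds.<name> dump_ops_in_flight | \
       jq '.ops[] | {description, age, flag_point: .type_data.flag_point}'
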
If it's a result of a bug in the capabilities code, restarting the MDS
is likely to resolve the problem.

If there are no slow requests reported on the MDS, and it is not reporting
that clients are misbehaving, either the client has a problem or its
requests are not reaching the MDS.

.. _ceph_fuse_debugging:

ceph-fuse debugging
===================

ceph-fuse also supports ``dump_ops_in_flight``. See if it has any and where
they are stuck.

Debug output
------------

To get more debugging information from ceph-fuse, try running it in the
foreground with logging to the console (``-d``), with client debugging enabled
(``--debug-client=20``), and with per-message prints enabled (``--debug-ms=1``).

If you suspect a potential monitor issue, enable monitor debugging as well
(``--debug-monc=20``).

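Putting those options together, an invocation might look like the following
(``/mnt/cephfs`` is only an example mount point):

.. code:: bash

   ceph-fuse -d --debug-client=20 --debug-ms=1 --debug-monc=20 /mnt/cephfs
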
.. _kernel_mount_debugging:

Kernel mount debugging
======================

If there is an issue with the kernel client, the most important thing is
figuring out whether the problem is with the kernel client or the MDS.
Generally, this is easy to work out. If the kernel client broke directly, there
will be output in ``dmesg``. Collect it and any inappropriate kernel state.

Slow requests
-------------

Unfortunately the kernel client does not support the admin socket, but it has
similar (if limited) interfaces if your kernel has debugfs enabled. There will
be a folder in ``/sys/kernel/debug/ceph/``, and that folder (whose name will
look something like ``28f7427e-5558-4ffd-ae1a-51ec3042759a.client25386880``)
will contain a variety of files that produce interesting output when you
``cat`` them. These files are described below; the most interesting when
debugging slow requests are probably the ``mdsc`` and ``osdc`` files (see the
example after the list).

* bdi: BDI info about the Ceph system (blocks dirtied, written, etc)
* caps: counts of file "caps" structures in-memory and used
* client_options: dumps the options provided to the CephFS mount
* dentry_lru: Dumps the CephFS dentries currently in-memory
* mdsc: Dumps current requests to the MDS
* mdsmap: Dumps the current MDSMap epoch and MDSes
* mds_sessions: Dumps the current sessions to MDSes
* monc: Dumps the current maps from the monitor, and any "subscriptions" held
* monmap: Dumps the current monitor map epoch and monitors
* osdc: Dumps the current ops in-flight to OSDs (i.e., file data I/O)
* osdmap: Dumps the current OSDMap epoch, pools, and OSDs

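For example, to dump the in-flight MDS and OSD requests for every CephFS mount
on a host (run as root, since ``/sys/kernel/debug`` is normally only accessible
to root):

.. code:: bash

   for f in /sys/kernel/debug/ceph/*/{mdsc,osdc}; do
       echo "== $f"
       cat "$f"
   done
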
If the data pool is in a NEARFULL condition, then the kernel CephFS client
will switch to doing writes synchronously, which is quite slow.

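You can check for this condition by looking at cluster health and pool usage,
for example:

.. code:: bash

   ceph health detail | grep -i nearfull
   ceph df
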
Disconnected+Remounted FS
=========================

Because CephFS has a "consistent cache", if your network connection is
disrupted for a long enough time, the client will be forcibly
disconnected from the system. At this point, the kernel client is in
a bind: it cannot safely write back dirty data, and many applications
do not handle IO errors correctly on close().
At the moment, the kernel client will remount the FS, but outstanding file
system IO may or may not be satisfied. In these cases, you may need to reboot
your client system.

You can identify that you are in this situation if dmesg/kern.log reports
something like::

   Jul 20 08:14:38 teuthology kernel: [3677601.123718] ceph: mds0 closed our session
   Jul 20 08:14:38 teuthology kernel: [3677601.128019] ceph: mds0 reconnect start
   Jul 20 08:14:39 teuthology kernel: [3677602.093378] ceph: mds0 reconnect denied
   Jul 20 08:14:39 teuthology kernel: [3677602.098525] ceph: dropping dirty+flushing Fw state for ffff8802dc150518 1099935956631
   Jul 20 08:14:39 teuthology kernel: [3677602.107145] ceph: dropping dirty+flushing Fw state for ffff8801008e8518 1099935946707
   Jul 20 08:14:39 teuthology kernel: [3677602.196747] libceph: mds0 172.21.5.114:6812 socket closed (con state OPEN)
   Jul 20 08:14:40 teuthology kernel: [3677603.126214] libceph: mds0 172.21.5.114:6812 connection reset
   Jul 20 08:14:40 teuthology kernel: [3677603.132176] libceph: reset on mds0

This is an area of ongoing work to improve the behavior. Kernels will soon
be reliably issuing error codes to in-progress IO, although your application(s)
may not deal with them well. In the longer term, we hope to allow reconnect
and reclaim of data in cases where it won't violate POSIX semantics (generally,
data which hasn't been accessed or modified by other clients).

Mounting
========

Mount 5 Error
-------------

A mount 5 error typically occurs if an MDS server is laggy or if it crashed.
Ensure at least one MDS is up and running, and the cluster is ``active +
healthy``.

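You can check the MDS and overall cluster state with, for example::

   ceph fs status
   ceph -s
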
Mount 12 Error
--------------

A mount 12 error with ``cannot allocate memory`` usually occurs if you have a
version mismatch between the :term:`Ceph Client` version and the :term:`Ceph
Storage Cluster` version. Check the versions using::

  ceph -v

If the Ceph Client is behind the Ceph cluster, try to upgrade it::

  sudo apt-get update && sudo apt-get install ceph-common

You may need to uninstall, autoclean and autoremove ``ceph-common``
and then reinstall it so that you have the latest version.

Dynamic Debugging
=================

You can enable dynamic debug against the CephFS module.

Please see: https://github.com/ceph/ceph/blob/master/src/script/kcon_all.sh

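As a minimal sketch of what that script does (assuming your kernel was built
with ``CONFIG_DYNAMIC_DEBUG`` and debugfs is mounted), you can enable dynamic
debug prints for the CephFS and libceph kernel modules with:

.. code:: bash

   # Enable all dynamic debug messages in the ceph and libceph modules;
   # use '-p' instead of '+p' to disable them again.
   echo 'module ceph +p' | sudo tee /sys/kernel/debug/dynamic_debug/control
   echo 'module libceph +p' | sudo tee /sys/kernel/debug/dynamic_debug/control
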
In-memory Log Dump
==================

In-memory logs can be dumped by setting ``mds_extraordinary_events_dump_interval``
while running at a lower debug level (log level < 10).
``mds_extraordinary_events_dump_interval`` is the interval, in seconds, at which
the recent in-memory logs are dumped when an extraordinary event occurs.

The extraordinary events are classified as:

* Client Eviction
* Missed Beacon ACK from the monitors
* Missed Internal Heartbeats

The in-memory log dump is disabled by default to prevent log file bloat in a
production environment. The following commands, run in order, enable it::

    $ ceph config set mds debug_mds <log_level>/<gather_level>
    $ ceph config set mds mds_extraordinary_events_dump_interval <seconds>

The ``log_level`` should be < 10 and ``gather_level`` should be >= 10 to enable
the in-memory log dump. When it is enabled, the MDS checks for the
extraordinary events every ``mds_extraordinary_events_dump_interval`` seconds,
and if any of them occurs, the MDS dumps the in-memory logs containing the
relevant event details to the ceph-mds log.

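For example, a concrete invocation satisfying these constraints (the values are
only illustrative)::

    $ ceph config set mds debug_mds 1/20
    $ ceph config set mds mds_extraordinary_events_dump_interval 60
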
.. note:: At higher log levels (log_level >= 10) there is no reason to dump the
   in-memory logs, and a lower gather level (gather_level < 10) is insufficient
   to gather them. Thus a log level >= 10 or a gather level < 10 in
   ``debug_mds`` prevents the in-memory log dump from being enabled. If
   enabling fails for this reason, reset ``mds_extraordinary_events_dump_interval``
   to 0 and then re-run the commands above.

The in-memory log dump can be disabled using::

    $ ceph config set mds mds_extraordinary_events_dump_interval 0

Filesystems Become Inaccessible After an Upgrade
================================================

.. note::
   You can avoid ``operation not permitted`` errors by running this procedure
   before an upgrade. As of May 2023, it seems that ``operation not permitted``
   errors of the kind discussed here occur after upgrades after Nautilus
   (inclusive).

IF

you have CephFS file systems that have data and metadata pools that were
created by a ``ceph fs new`` command (meaning that they were not created
with the defaults)

OR

you have an existing CephFS file system and are upgrading to a new
post-Nautilus major version of Ceph

THEN

in order for the documented ``ceph fs authorize...`` commands to function as
documented (and to avoid ``operation not permitted`` errors when doing file I/O
or similar security-related problems for all users except the ``client.admin``
user), you must first run:

.. prompt:: bash $

   ceph osd pool application set <your metadata pool name> cephfs metadata <your ceph fs filesystem name>

and

.. prompt:: bash $

   ceph osd pool application set <your data pool name> cephfs data <your ceph fs filesystem name>

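You can confirm that the application metadata was set by inspecting the pool
afterwards (the pool name here is a placeholder):

.. prompt:: bash $

   ceph osd pool application get <your data pool name>
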
Otherwise, when the OSDs receive a request to read or write data (not the
directory info, but file data) they will not know which Ceph file system name
to look up. This is also true of pool names, because the 'defaults' themselves
changed in the major releases, from::

   data pool=fsname
   metadata pool=fsname_metadata

to::

   data pool=fsname.data and
   metadata pool=fsname.meta

Any setup that used ``client.admin`` for all mounts did not run into this
problem, because the admin key gave blanket permissions.

A temporary fix involves changing mount requests to the 'client.admin' user and
its associated key. A less drastic but only partial fix is to change the OSD
cap for your user to just ``caps osd = "allow rw"`` and delete ``tag cephfs
data=....``

Reporting Issues
================

If you have identified a specific issue, please report it with as much
information as possible. Especially important information:

* Ceph versions installed on client and server
* Whether you are using the kernel or fuse client
* If you are using the kernel client, what kernel version?
* How many clients are in play, doing what kind of workload?
* If a system is 'stuck', is that affecting all clients or just one?
* Any ceph health messages
* Any backtraces in the ceph logs from crashes

If you are satisfied that you have found a bug, please file it on `the bug
tracker`_. For more general queries, please write to the `ceph-users mailing
list`_.

.. _the bug tracker: http://tracker.ceph.com
.. _ceph-users mailing list: http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com/