=================
Troubleshooting
=================

Slow/stuck operations
=====================

If you are experiencing apparent hung operations, the first task is to identify
where the problem is occurring: in the client, the MDS, or the network
connecting them. Start by looking to see if either side has stuck operations
(:ref:`slow_requests`, below), and narrow it down from there.

We can get hints about what's going on by dumping the MDS cache::

    ceph daemon mds.<name> dump cache /tmp/dump.txt

.. note:: The file `dump.txt` is written on the machine running the MDS. For
   systemd-controlled MDS services, this is inside a tmpfs in the MDS
   container. Use `nsenter(1)` to locate `dump.txt`, or specify another
   system-wide path.

If high logging levels are set on the MDS, that will almost certainly hold the
information we need to diagnose and solve the issue.

Stuck during recovery
=====================

Stuck in up:replay
------------------

If your MDS is stuck in ``up:replay`` then it is likely that the journal is
very long. Did you see ``MDS_HEALTH_TRIM`` cluster warnings saying the MDS is
behind on trimming its journal? If the journal has grown very large, it can
take hours to read. There is no way to work around this, but there are things
you can do to speed it along:

Reduce MDS debugging to 0. Even at the default settings, the MDS logs some
messages to memory for dumping if a fatal error is encountered. You can avoid
this:
39 | ||
40 | .. code:: bash | |
41 | ||
42 | ceph config set mds debug_mds 0 | |
43 | ceph config set mds debug_ms 0 | |
44 | ceph config set mds debug_monc 0 | |
45 | ||
46 | Note if the MDS fails then there will be virtually no information to determine | |
47 | why. If you can calculate when ``up:replay`` will complete, you should restore | |
48 | these configs just prior to entering the next state: | |
49 | ||
50 | .. code:: bash | |
51 | ||
52 | ceph config rm mds debug_mds | |
53 | ceph config rm mds debug_ms | |
54 | ceph config rm mds debug_monc | |
55 | ||
Once you've got replay moving along faster, you can calculate when the MDS will
complete. This is done by examining the journal replay status:

.. code:: bash

   $ ceph tell mds.<fs_name>:0 status | jq .replay_status
   {
     "journal_read_pos": 4195244,
     "journal_write_pos": 4195244,
     "journal_expire_pos": 4194304,
     "num_events": 2,
     "num_segments": 2
   }

Replay completes when the ``journal_read_pos`` reaches the
``journal_write_pos``. The write position will not change during replay. Track
the progression of the read position to compute the expected time to complete.
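
The expected completion time can be estimated from two samples of the read
position. A minimal sketch (``estimate_eta`` is our own helper name, not a
Ceph command; the positions would come from the ``replay_status`` output
above, and the numbers in the example are made up):

```shell
# estimate_eta READ1 READ2 INTERVAL WRITE_POS
# Given two journal_read_pos samples taken INTERVAL seconds apart, print the
# estimated seconds until the read position reaches WRITE_POS.
estimate_eta() {
    read1=$1; read2=$2; interval=$3; write_pos=$4
    rate=$(( (read2 - read1) / interval ))   # bytes replayed per second
    if [ "$rate" -le 0 ]; then
        echo "no progress observed" >&2
        return 1
    fi
    echo $(( (write_pos - read2) / rate ))
}

# Example: the read position advanced from 4 MiB to 8 MiB over 60 seconds,
# with the write position at 100 MiB.
estimate_eta 4194304 8388608 60 104857600   # prints 1380 (about 23 minutes)
```

The estimate assumes a roughly constant replay rate, so re-sample a few times
if the rate is fluctuating.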

Avoiding recovery roadblocks
----------------------------

When trying to urgently restore your file system during an outage, here are
some things to do:

* **Deny all reconnect to clients.** This effectively blocklists all existing
  CephFS sessions so all mounts will hang or become unavailable.

  .. code:: bash

     ceph config set mds mds_deny_all_reconnect true

  Remember to undo this after the MDS becomes active.

  .. note:: This does not prevent new sessions from connecting. For that, see
     the ``refuse_client_session`` file system setting.

* **Extend the MDS heartbeat grace period.** This avoids replacing an MDS
  that appears "stuck" doing some operation. Sometimes recovery of an MDS may
  involve an operation that takes longer than expected (from the programmer's
  perspective). This is more likely when recovery is already taking longer
  than normal to complete (indicated by your reading this document). Avoid
  unnecessary replacement loops by extending the heartbeat grace period:

  .. code:: bash

     ceph config set mds mds_heartbeat_grace 3600

  This has the effect of having the MDS continue to send beacons to the
  monitors even when its internal "heartbeat" mechanism has not been reset
  (beat) in one hour. Note that the previous mechanism for achieving this was
  the `mds_beacon_grace` monitor setting.

* **Disable open file table prefetch.** Normally, the MDS will prefetch
  directory contents during recovery to heat up its cache. During a long
  recovery, the cache is probably already hot **and large**, so this behavior
  can be undesirable. Disable it using:

  .. code:: bash

     ceph config set mds mds_oft_prefetch_dirfrags false

* **Turn off clients.** Clients reconnecting to the newly ``up:active`` MDS
  may cause new load on the file system when it's just getting back on its
  feet. There will likely be some general maintenance to do before workloads
  should be resumed. For example, expediting journal trim may be advisable if
  the recovery took a long time because replay was reading an overly large
  journal.

  You can do this manually or use the new file system tunable:

  .. code:: bash

     ceph fs set <fs_name> refuse_client_session true

  That prevents any clients from establishing new sessions with the MDS.
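
  Once the MDS is active again and you are ready to resume workloads,
  remember to undo the settings above. A sketch of the corresponding undo
  commands (``<fs_name>`` is a placeholder for your file system name):

```shell
# Re-allow client reconnects and new sessions once recovery is done.
ceph config rm mds mds_deny_all_reconnect
ceph fs set <fs_name> refuse_client_session false
```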


Expediting MDS journal trim
===========================

If your MDS journal grew too large (maybe your MDS was stuck in ``up:replay``
for a long time!), you will want to have the MDS trim its journal more
frequently. You will know the journal is too large because of
``MDS_HEALTH_TRIM`` warnings.

The main tunable available for this is the MDS tick interval. The "tick"
interval drives several upkeep activities in the MDS. It is strongly
recommended that no significant file system load be present when modifying
this tick interval. This setting only affects an MDS in ``up:active``. The
MDS does not trim its journal during recovery.

.. code:: bash

   ceph config set mds mds_tick_interval 2

RADOS Health
============

If part of the CephFS metadata or data pools is unavailable and CephFS is not
responding, it is probably because RADOS itself is unhealthy. Resolve those
problems first (:doc:`../../rados/troubleshooting/index`).

The MDS
=======

If an operation is hung inside the MDS, it will eventually show up in ``ceph
health``, identifying "slow requests are blocked". It may also identify
clients as "failing to respond" or misbehaving in other ways. If the MDS
identifies specific clients as misbehaving, you should investigate why they
are doing so.

Generally it will be the result of

#. Overloading the system (if you have extra RAM, increase the
   "mds cache memory limit" config from its default 1GiB; having a larger
   active file set than your MDS cache is the #1 cause of this!).

#. Running an older (misbehaving) client.

#. Underlying RADOS issues.

Otherwise, you have probably discovered a new bug and should report it to
the developers!

.. _slow_requests:

Slow requests (MDS)
-------------------

You can list current operations via the admin socket by running::

    ceph daemon mds.<name> dump_ops_in_flight

from the MDS host. Identify the stuck commands and examine why they are stuck.
Usually the last "event" will have been an attempt to gather locks, or sending
the operation off to the MDS log. If it is waiting on the OSDs, fix them. If
operations are stuck on a specific inode, you probably have a client holding
caps which prevent others from using it, either because the client is trying
to flush out dirty data or because you have encountered a bug in CephFS'
distributed file lock code (the file "capabilities" ["caps"] system).

If it's a result of a bug in the capabilities code, restarting the MDS
is likely to resolve the problem.

If there are no slow requests reported on the MDS, and it is not reporting
that clients are misbehaving, either the client has a problem or its
requests are not reaching the MDS.

.. _ceph_fuse_debugging:

ceph-fuse debugging
===================

ceph-fuse also supports ``dump_ops_in_flight``. See if it has any, and where
they are stuck.

Debug output
------------

To get more debugging information from ceph-fuse, try running it in the
foreground with logging to the console (``-d``) and enabling client debug
(``--debug-client=20``), enabling prints for each message sent
(``--debug-ms=1``).

If you suspect a potential monitor issue, enable monitor debugging as well
(``--debug-monc=20``).

.. _kernel_mount_debugging:

Kernel mount debugging
======================

If there is an issue with the kernel client, the most important thing is
figuring out whether the problem is with the kernel client or the MDS.
Generally, this is easy to work out. If the kernel client broke directly,
there will be output in ``dmesg``. Collect it and any inappropriate kernel
state.

Slow requests
-------------

Unfortunately the kernel client does not support the admin socket, but it has
similar (if limited) interfaces if your kernel has debugfs enabled. There
will be a folder in ``/sys/kernel/debug/ceph/``, and that folder (whose name
will look something like ``28f7427e-5558-4ffd-ae1a-51ec3042759a.client25386880``)
will contain a variety of files that output interesting output when you
``cat`` them. These files are described below; the most interesting when
debugging slow requests are probably the ``mdsc`` and ``osdc`` files.

* bdi: BDI info about the Ceph system (blocks dirtied, written, etc.)
* caps: counts of file "caps" structures in-memory and used
* client_options: dumps the options provided to the CephFS mount
* dentry_lru: Dumps the CephFS dentries currently in-memory
* mdsc: Dumps current requests to the MDS
* mdsmap: Dumps the current MDSMap epoch and MDSes
* mds_sessions: Dumps the current sessions to MDSes
* monc: Dumps the current maps from the monitor, and any "subscriptions" held
* monmap: Dumps the current monitor map epoch and monitors
* osdc: Dumps the current ops in-flight to OSDs (i.e., file data IO)
* osdmap: Dumps the current OSDMap epoch, pools, and OSDs
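
When chasing slow requests it is handy to snapshot the ``mdsc`` and ``osdc``
files for every mounted client in one go. A minimal sketch
(``dump_kclient_requests`` is our own helper name; it assumes debugfs is
mounted at the default ``/sys/kernel/debug`` location, which normally
requires root):

```shell
# Print the in-flight MDS and OSD requests for every kernel CephFS client.
# Pass an alternate base directory as $1 (useful for testing); the default
# is the standard debugfs location.
dump_kclient_requests() {
    base=${1:-/sys/kernel/debug/ceph}
    for d in "$base"/*/; do
        echo "=== client: $(basename "$d") ==="
        for f in mdsc osdc; do
            if [ -r "$d$f" ]; then
                echo "--- $f ---"
                cat "$d$f"
            fi
        done
    done
}
```

Run it as root on the client host; empty ``mdsc``/``osdc`` output means no
requests are currently stuck in flight.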

If the data pool is in a NEARFULL condition, then the kernel cephfs client
will switch to doing writes synchronously, which is quite slow.

Disconnected+Remounted FS
=========================

Because CephFS has a "consistent cache", if your network connection is
disrupted for a long enough time, the client will be forcibly
disconnected from the system. At this point, the kernel client is in
a bind: it cannot safely write back dirty data, and many applications
do not handle IO errors correctly on close().
At the moment, the kernel client will remount the FS, but outstanding file
system IO may or may not be satisfied. In these cases, you may need to reboot
your client system.

You can identify that you are in this situation if dmesg/kern.log report
something like::

    Jul 20 08:14:38 teuthology kernel: [3677601.123718] ceph: mds0 closed our session
    Jul 20 08:14:38 teuthology kernel: [3677601.128019] ceph: mds0 reconnect start
    Jul 20 08:14:39 teuthology kernel: [3677602.093378] ceph: mds0 reconnect denied
    Jul 20 08:14:39 teuthology kernel: [3677602.098525] ceph: dropping dirty+flushing Fw state for ffff8802dc150518 1099935956631
    Jul 20 08:14:39 teuthology kernel: [3677602.107145] ceph: dropping dirty+flushing Fw state for ffff8801008e8518 1099935946707
    Jul 20 08:14:39 teuthology kernel: [3677602.196747] libceph: mds0 172.21.5.114:6812 socket closed (con state OPEN)
    Jul 20 08:14:40 teuthology kernel: [3677603.126214] libceph: mds0 172.21.5.114:6812 connection reset
    Jul 20 08:14:40 teuthology kernel: [3677603.132176] libceph: reset on mds0

This is an area of ongoing work to improve the behavior. Kernels will soon
be reliably issuing error codes to in-progress IO, although your
application(s) may not deal with them well. In the longer term, we hope to
allow reconnect and reclaim of data in cases where it won't violate POSIX
semantics (generally, data which hasn't been accessed or modified by other
clients).

Mounting
========

Mount 5 Error
-------------

A mount 5 error typically occurs if an MDS server is laggy or if it crashed.
Ensure at least one MDS is up and running, and the cluster is ``active +
healthy``.

Mount 12 Error
--------------

A mount 12 error with ``cannot allocate memory`` usually occurs if you have a
version mismatch between the :term:`Ceph Client` version and the :term:`Ceph
Storage Cluster` version. Check the versions using::

    ceph -v

If the Ceph Client is behind the Ceph cluster, try to upgrade it::

    sudo apt-get update && sudo apt-get install ceph-common

You may need to uninstall, autoclean and autoremove ``ceph-common``
and then reinstall it so that you have the latest version.

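To decide whether the client really is behind, the two version strings can be
compared mechanically. A small sketch (``version_lt`` is our own helper name;
it relies on GNU ``sort -V`` for version ordering, and the version numbers in
the example are only illustrative):

```shell
# version_lt A B: succeed (exit 0) if version A sorts strictly before B.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# Compare an illustrative client version against an illustrative cluster one.
if version_lt "16.2.9" "17.2.6"; then
    echo "client is older than the cluster; upgrade ceph-common"
fi
```

Feed it the numeric parts of the two ``ceph -v`` outputs (client and any
cluster daemon host).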
Dynamic Debugging
=================

You can enable dynamic debug against the CephFS module.

Please see: https://github.com/ceph/ceph/blob/master/src/script/kcon_all.sh

In-memory Log Dump
==================

In-memory logs can be dumped by setting
``mds_extraordinary_events_dump_interval`` during lower-level debugging (log
level < 10). ``mds_extraordinary_events_dump_interval`` is the interval in
seconds for dumping the recent in-memory logs when there is an extraordinary
event.

The extraordinary events are classified as:

* Client Eviction
* Missed Beacon ACK from the monitors
* Missed Internal Heartbeats

In-memory Log Dump is disabled by default to prevent log file bloat in a
production environment. The following two commands, run in order, enable it::

    $ ceph config set mds debug_mds <log_level>/<gather_level>
    $ ceph config set mds mds_extraordinary_events_dump_interval <seconds>

The ``log_level`` should be < 10 and ``gather_level`` should be >= 10 to
enable in-memory log dump. When it is enabled, the MDS checks for the
extraordinary events every ``mds_extraordinary_events_dump_interval`` seconds
and, if any of them occurs, dumps the in-memory logs containing the relevant
event details to the ceph-mds log.
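
The level constraint above can be checked mechanically before setting the
interval. A small sketch (``can_dump`` is our own helper name) that validates
a ``debug_mds`` value of the form ``<log_level>/<gather_level>``:

```shell
# can_dump LEVEL: succeed if a debug_mds value like "1/20" permits the
# in-memory log dump (log level < 10 and gather level >= 10).
can_dump() {
    log=${1%%/*}
    gather=${1##*/}
    [ "$log" -lt 10 ] && [ "$gather" -ge 10 ]
}

can_dump "1/20" && echo "1/20: dump can be enabled"
can_dump "10/20" || echo "10/20: log level too high"
```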

.. note:: For higher log levels (log_level >= 10) there is no reason to dump
   the in-memory logs, and a lower gather level (gather_level < 10) is
   insufficient to gather them. Thus a log level >= 10 or a gather level < 10
   in ``debug_mds`` prevents enabling the In-memory Log Dump. In such cases,
   when there is a failure, reset ``mds_extraordinary_events_dump_interval``
   to 0 before enabling it again using the above commands.

The In-memory Log Dump can be disabled using::

    $ ceph config set mds mds_extraordinary_events_dump_interval 0

Filesystems Become Inaccessible After an Upgrade
================================================

.. note::
   You can avoid ``operation not permitted`` errors by running this procedure
   before an upgrade. As of May 2023, it seems that ``operation not
   permitted`` errors of the kind discussed here occur after upgrades after
   Nautilus (inclusive).

IF

you have CephFS file systems that have data and metadata pools that were
created by a ``ceph fs new`` command (meaning that they were not created
with the defaults)

OR

you have an existing CephFS file system and are upgrading to a new
post-Nautilus major version of Ceph

THEN

in order for the documented ``ceph fs authorize...`` commands to function as
documented (and to avoid 'operation not permitted' errors when doing file I/O
or similar security-related problems for all users except the ``client.admin``
user), you must first run:

.. prompt:: bash $

   ceph osd pool application set <your metadata pool name> cephfs metadata <your ceph fs filesystem name>

and

.. prompt:: bash $

   ceph osd pool application set <your data pool name> cephfs data <your ceph fs filesystem name>
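
When a file system's pool names are known, the two commands can be generated
programmatically and reviewed before running. A sketch (``emit_app_set`` is a
hypothetical helper of ours; the file system and pool names in the example are
only illustrative):

```shell
# emit_app_set FS METAPOOL DATAPOOL: print the two application-set commands
# for the given file system and its pools. Pipe to `sh` only after review.
emit_app_set() {
    fs=$1; meta=$2; data=$3
    printf 'ceph osd pool application set %s cephfs metadata %s\n' "$meta" "$fs"
    printf 'ceph osd pool application set %s cephfs data %s\n' "$data" "$fs"
}

# Example with illustrative names:
emit_app_set myfs myfs_metadata myfs_data
```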

Otherwise, when the OSDs receive a request to read or write data (not the
directory info, but file data) they will not know which Ceph file system name
to look up. This is true also of pool names, because the 'defaults' themselves
changed in the major releases, from::

    data pool=fsname
    metadata pool=fsname_metadata

to::

    data pool=fsname.data and
    metadata pool=fsname.meta

Any setup that used ``client.admin`` for all mounts did not run into this
problem, because the admin key gave blanket permissions.

A temporary fix involves changing mount requests to the 'client.admin' user
and its associated key. A less drastic but half-fix is to change the osd cap
for your user to just ``caps osd = "allow rw"`` and delete ``tag cephfs
data=....``

Reporting Issues
================

If you have identified a specific issue, please report it with as much
information as possible. Especially important information:

* Ceph versions installed on client and server
* Whether you are using the kernel or fuse client
* If you are using the kernel client, what kernel version?
* How many clients are in play, doing what kind of workload?
* If a system is 'stuck', is that affecting all clients or just one?
* Any ceph health messages
* Any backtraces in the ceph logs from crashes

If you are satisfied that you have found a bug, please file it on `the bug
tracker`_. For more general queries, please write to the `ceph-users mailing
list`_.

.. _the bug tracker: http://tracker.ceph.com
.. _ceph-users mailing list: http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com/