=================================
Troubleshooting Monitors
=================================

.. index:: monitor, high availability

When a cluster encounters monitor-related trouble, there is a tendency to
panic, sometimes with good reason. Keep in mind that losing a monitor, or
even several of them, does not necessarily mean that your cluster is down:
as long as a majority of the monitors is up, running, and forming a quorum,
the cluster keeps working. Regardless of how bad the situation is, the
first thing you should do is calm down, take a breath, and work through the
initial troubleshooting questions below.


Initial Troubleshooting
========================


**Are the monitors running?**

First of all, make sure the monitors are running. You would be amazed by
how often people forget to start the monitors, or to restart them after an
upgrade. There is no shame in that, but let's try not to lose a couple of
hours chasing an issue that is not there.

**Are you able to connect to the monitors' servers?**

It doesn't happen often, but sometimes there are ``iptables`` rules that
block access to the monitor servers or to the monitor ports -- usually
leftovers from monitor stress-testing that were forgotten at some point.
Try ssh'ing into the server and, if that succeeds, try connecting to the
monitor's port using your tool of choice (telnet, nc, ...).

**Does ceph -s run and obtain a reply from the cluster?**

If the answer is yes, then your cluster is up and running. One thing you
can take for granted is that the monitors will only answer a ``status``
request if there is a formed quorum.

If ``ceph -s`` blocks, however, without obtaining a reply from the cluster
or while showing a lot of ``fault`` messages, then it is likely that your
monitors are either completely down or that only a portion of them is up --
a portion that is not enough to form a quorum (keep in mind that a quorum
is formed by a majority of monitors).
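
To make the majority rule concrete, here is a small illustrative shell
snippet (not a Ceph command) that prints the minimum quorum size for a few
monitor counts:

```shell
# Illustrative only: a quorum requires a strict majority,
# i.e. floor(n/2) + 1 monitors out of n.
for n in 1 2 3 4 5; do
  echo "monitors=$n majority=$(( n / 2 + 1 ))"
done
```

In other words, with 3 monitors you can lose one and keep quorum; with 5
you can lose two.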

**What if ceph -s doesn't finish?**

If you haven't gone through all the steps so far, please go back and do so.

You can contact each monitor individually and ask for its status,
regardless of whether a quorum has formed. This can be achieved using
``ceph tell mon.ID mon_status``, ID being the monitor's identifier. You
should perform this for each monitor in the cluster. The section
`Understanding mon_status`_ explains how to interpret the output of this
command.

If ``ceph tell`` does not work -- for instance, because the monitor is not
reachable that way -- you will need to ssh into the server and use the
monitor's admin socket. Please jump to
`Using the monitor's admin socket`_.

For other specific issues, keep on reading.


Using the monitor's admin socket
=================================

The admin socket allows you to interact with a given daemon directly, using
a Unix socket file. This file can be found in your monitor's ``run``
directory. By default, the admin socket will be kept in
``/var/run/ceph/ceph-mon.ID.asok``, but this can vary if you defined it
otherwise. If you don't find it there, please check your ``ceph.conf`` for
an alternative path, or run::

      ceph-conf --name mon.ID --show-config-value admin_socket

Please bear in mind that the admin socket is only available while the
monitor is running. When the monitor is properly shut down, the admin
socket is removed. If, however, the monitor is not running and the admin
socket still persists, it is likely that the monitor was improperly shut
down. Regardless, if the monitor is not running, you will not be able to
use the admin socket, and ``ceph`` will likely return
``Error 111: Connection Refused``.

Accessing the admin socket is as simple as running ``ceph tell`` on the
daemon you are interested in. For example::

      ceph tell mon.<id> mon_status

Under the hood, this passes the command ``mon_status`` to the running
monitor daemon ``<id>`` via its admin socket, which is a file ending in
``.asok`` somewhere under ``/var/run/ceph``. Once you know the full path to
the file, you can even do this yourself::

      ceph --admin-daemon <full_path_to_asok_file> <command>

Using ``help`` as the command to the ``ceph`` tool will show you the
supported commands available through the admin socket. Please take a look
at ``config get``, ``config show``, ``mon stat`` and ``quorum_status``,
as those can be enlightening when troubleshooting a monitor.
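
When you are not sure which monitors run on a host, you can simply iterate
over whatever admin sockets are present. This is a hedged sketch, assuming
the default ``/var/run/ceph`` location (override ``RUN_DIR`` if your
``admin_socket`` option points elsewhere):

```shell
# Sketch: query every monitor admin socket found on this host.
# RUN_DIR and the socket pattern assume the default admin_socket
# setting; adjust if your ceph.conf says otherwise.
RUN_DIR=${RUN_DIR:-/var/run/ceph}
found=0
for s in "$RUN_DIR"/ceph-mon.*.asok; do
  [ -e "$s" ] || continue     # glob did not match anything
  found=1
  echo "== $s =="
  ceph --admin-daemon "$s" mon_status
done
[ "$found" -eq 1 ] || echo "no monitor admin sockets found in $RUN_DIR"
```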


Understanding mon_status
=========================

``mon_status`` can always be obtained via the admin socket. This command
will output a multitude of information about the monitor, including the
same output you would get with ``quorum_status``.

Take the following example output of ``ceph tell mon.c mon_status``::


  { "name": "c",
    "rank": 2,
    "state": "peon",
    "election_epoch": 38,
    "quorum": [
          1,
          2],
    "outside_quorum": [],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": { "epoch": 3,
        "fsid": "5c4e9d53-e2e1-478a-8061-f543f8be4cf8",
        "modified": "2013-10-30 04:12:01.945629",
        "created": "2013-10-29 14:14:41.914786",
        "mons": [
              { "rank": 0,
                "name": "a",
                "addr": "127.0.0.1:6789\/0"},
              { "rank": 1,
                "name": "b",
                "addr": "127.0.0.1:6790\/0"},
              { "rank": 2,
                "name": "c",
                "addr": "127.0.0.1:6795\/0"}]}}

A couple of things are obvious: we have three monitors in the monmap (*a*,
*b* and *c*), the quorum is formed by only two monitors, and *c* is in the
quorum as a *peon*.

Which monitor is out of the quorum?

The answer would be **a**.

Why?

Take a look at the ``quorum`` set. We have two monitors in this set: *1*
and *2*. These are not monitor names. These are monitor ranks, as
established in the current monmap. We are missing the monitor with rank 0,
and according to the monmap that would be ``mon.a``.

By the way, how are ranks established?

Ranks are (re)calculated whenever you add or remove monitors, and they
follow a simple rule: the **lower** the ``IP:PORT`` combination, the
**lower** the rank. In this case, since ``127.0.0.1:6789`` is lower than
all the remaining ``IP:PORT`` combinations, ``mon.a`` has rank 0.
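
The ordering can be sanity-checked outside of Ceph. The following is an
illustrative snippet (not a Ceph tool): sorting the example addresses in
ascending order reproduces the ranks, since a plain lexical sort matches
the numeric order here (identical IPs, same-width ports):

```shell
# Sketch: sort the example monmap addresses ascending; the line
# numbers (counted from 0) then match the monitor ranks. Ceph compares
# addresses numerically; plain sort suffices for this example only.
printf '%s\n' '127.0.0.1:6795 mon.c' '127.0.0.1:6789 mon.a' '127.0.0.1:6790 mon.b' \
  | sort \
  | nl -v 0 -w 1 -s ': '
```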

Most Common Monitor Issues
===========================

Have Quorum but at least one Monitor is down
---------------------------------------------

When this happens, depending on the version of Ceph you are running, you
should be seeing something similar to::

      $ ceph health detail
      [snip]
      mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum)

How to troubleshoot this?

First, make sure ``mon.a`` is running.

Second, make sure you are able to connect to ``mon.a``'s server from the
other monitors' servers. Check the ports as well. Check ``iptables`` on
all your monitor nodes and make sure you are not dropping/rejecting
connections.

If this initial troubleshooting doesn't solve your problems, then it's
time to go deeper.

First, check the problematic monitor's ``mon_status`` via the admin
socket as explained in `Using the monitor's admin socket`_ and
`Understanding mon_status`_.

If the monitor is out of the quorum, its state should be one of
``probing``, ``electing`` or ``synchronizing``. If it happens to be either
``leader`` or ``peon``, then the monitor believes itself to be in the
quorum, while the rest of the cluster is sure it is not; or maybe it
joined the quorum while we were troubleshooting, so check ``ceph -s``
again just to make sure. Proceed only if the monitor is not yet in the
quorum.

What if the state is ``probing``?

This means the monitor is still looking for the other monitors. Every time
you start a monitor, it will stay in this state for some time while trying
to find the rest of the monitors specified in the ``monmap``. The time a
monitor spends in this state can vary. For instance, on a single-monitor
cluster, the monitor will pass through the probing state almost
instantaneously, since there are no other monitors around. On a
multi-monitor cluster, the monitors will stay in this state until they
find enough monitors to form a quorum -- this means that if you have 2 out
of 3 monitors down, the one remaining monitor will stay in this state
indefinitely until you bring one of the other monitors up.

If you have a quorum, however, the monitor should be able to find the
remaining monitors pretty fast, as long as they can be reached. If your
monitor is stuck probing and you have gone through all the communication
troubleshooting, then there is a fair chance that the monitor is trying
to reach the other monitors on a wrong address. ``mon_status`` outputs the
``monmap`` known to the monitor: check whether the other monitors'
locations match reality. If they don't, jump to
`Recovering a Monitor's Broken monmap`_; if they do, the problem may be
related to severe clock skews amongst the monitor nodes, so refer to
`Clock Skews`_ first. If that doesn't solve your problem, then it is
time to prepare some logs and reach out to the community (please refer
to `Preparing your logs`_ on how to best prepare your logs).


What if the state is ``electing``?

This means the monitor is in the middle of an election. Elections should
be fast to complete, but at times the monitors can get stuck electing.
This is usually a sign of a clock skew among the monitor nodes; jump to
`Clock Skews`_ for more information on that. If all your clocks are
properly synchronized, it is best to prepare some logs and reach out to
the community. This is not a state that is likely to persist and, aside
from (*really*) old bugs, there is no obvious reason besides clock skews
why this would happen.

What if the state is ``synchronizing``?

This means the monitor is synchronizing with the rest of the cluster in
order to join the quorum. The smaller your monitor store is, the faster
the synchronization completes, so if you have a big store it may take a
while. Don't worry, it should be finished soon enough.

However, if you notice that the monitor jumps from ``synchronizing`` to
``electing`` and then back to ``synchronizing``, then you do have a
problem: the cluster state is advancing (i.e., generating new maps) too
fast for the synchronization process to keep up. This used to be a thing
in early Cuttlefish, but since then the synchronization process has been
substantially refactored and enhanced to avoid exactly this sort of
behavior. If this happens in later versions, let us know. And bring some
logs (see `Preparing your logs`_).

What if the state is ``leader`` or ``peon``?

This should not happen. There is a chance it might, however, and it has a
lot to do with clock skews -- see `Clock Skews`_. If you are not
suffering from clock skews, then please prepare your logs (see
`Preparing your logs`_) and reach out to us.


Recovering a Monitor's Broken monmap
-------------------------------------

This is what a ``monmap`` usually looks like, depending on the number of
monitors::


      epoch 3
      fsid 5c4e9d53-e2e1-478a-8061-f543f8be4cf8
      last_changed 2013-10-30 04:12:01.945629
      created 2013-10-29 14:14:41.914786
      0: 127.0.0.1:6789/0 mon.a
      1: 127.0.0.1:6790/0 mon.b
      2: 127.0.0.1:6795/0 mon.c

This may not be what you have, however. For instance, some versions of
early Cuttlefish had a bug that could cause your ``monmap`` to be
nullified -- completely filled with zeros. This means that not even
``monmaptool`` would be able to read it, because it cannot make sense of
only zeros. At other times, you may end up with a monitor with a severely
outdated monmap, making it unable to find the remaining monitors (e.g.,
say ``mon.c`` is down; you add a new monitor ``mon.d``, then remove
``mon.a``, then add a new monitor ``mon.e`` and remove ``mon.b``; you
will end up with a totally different monmap from the one ``mon.c``
knows).

In this sort of situation, you have two possible solutions:

Scrap the monitor and create a new one

  You should only take this route if you are positive that you won't
  lose the information kept by that monitor; that is, you have other
  monitors and they are running just fine, so that your new monitor is
  able to synchronize with the remaining monitors. Keep in mind that
  destroying a monitor, if there are no other copies of its contents,
  may lead to loss of data.

Inject a monmap into the monitor

  This is usually the safest path. You should grab the monmap from the
  remaining monitors and inject it into the monitor with the
  corrupted/lost monmap.

  These are the basic steps:

  1. Is there a formed quorum? If so, grab the monmap from the quorum::

      $ ceph mon getmap -o /tmp/monmap

  2. No quorum? Grab the monmap directly from another monitor (this
     assumes the monitor you are grabbing the monmap from has id ID-FOO
     and has been stopped)::

      $ ceph-mon -i ID-FOO --extract-monmap /tmp/monmap

  3. Stop the monitor you are going to inject the monmap into.

  4. Inject the monmap::

      $ ceph-mon -i ID --inject-monmap /tmp/monmap

  5. Start the monitor

Please keep in mind that the ability to inject monmaps is a powerful
feature that can cause havoc with your monitors if misused, as it will
overwrite the latest existing monmap kept by the monitor.


Clock Skews
------------

Monitors can be severely affected by significant clock skews across the
monitor nodes. This usually translates into weird behavior with no obvious
cause. To avoid such issues, you should run a clock synchronization tool
on your monitor nodes.


What's the maximum tolerated clock skew?

By default the monitors will allow clocks to drift up to ``0.05 seconds``.


Can I increase the maximum tolerated clock skew?

This value is configurable via the ``mon_clock_drift_allowed`` option, but
although you *CAN*, it doesn't mean you *SHOULD*. The clock skew mechanism
is in place because clock-skewed monitors may not behave properly. We, as
developers and QA aficionados, are comfortable with the current default
value, as it will alert the user before the monitors get out of hand.
Changing this value without testing it first may cause unforeseen effects
on the stability of the monitors and overall cluster health, although
there is no risk of data loss.


How do I know there's a clock skew?

The monitors will warn you in the form of a ``HEALTH_WARN``. ``ceph health
detail`` should show something in the form of::

      mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s)

That means that ``mon.c`` has been flagged as suffering from a clock skew.


What should I do if there's a clock skew?

Synchronize your clocks. Running an NTP client may help. If you are
already using one and you hit this sort of issue, check whether you are
using an NTP server remote to your network, and consider hosting your own
NTP server on your network. This last option tends to reduce the number
of issues with monitor clock skews.
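
The warning above is simple arithmetic: the measured skew is compared
against the allowed drift. The snippet below is illustrative only (the
skew value is made up, matching the example warning; it is not read from
a real monitor):

```shell
# Illustrative arithmetic only: check a measured skew against the
# default mon_clock_drift_allowed of 0.05 seconds.
allowed=0.05
skew=0.08235

awk -v s="$skew" -v m="$allowed" 'BEGIN {
  if (s > m)
    printf "clock skew %ss > max %ss -- expect HEALTH_WARN\n", s, m
  else
    print "skew within tolerance"
}'
```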


Client Can't Connect or Mount
------------------------------

Check your IP tables. Some OS install utilities add a ``REJECT`` rule to
``iptables``. The rule rejects all clients trying to connect to the host
except for ``ssh``. If your monitor host's IP tables have such a
``REJECT`` rule in place, clients connecting from a separate node will
fail to mount with a timeout error. You need to address ``iptables``
rules that reject clients trying to connect to Ceph daemons. For example,
you would need to address rules that look like this appropriately::

      REJECT all -- anywhere anywhere reject-with icmp-host-prohibited

You may also need to add rules to IP tables on your Ceph hosts to ensure
that clients can access the ports associated with your Ceph monitors
(i.e., port 6789 by default) and Ceph OSDs (i.e., 6800 through 7300 by
default). For example::

      iptables -A INPUT -m multiport -p tcp -s {ip-address}/{netmask} --dports 6789,6800:7300 -j ACCEPT

Monitor Store Failures
======================

Symptoms of store corruption
----------------------------

A Ceph monitor stores the :term:`cluster map` in a key/value store such
as LevelDB. If a monitor fails due to key/value store corruption, the
following error messages might be found in the monitor log::

      Corruption: error in middle of record

or::

      Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.foo/store.db/1234567.ldb

Recovery using healthy monitor(s)
---------------------------------

If there are any survivors, we can always :ref:`replace <adding-and-removing-monitors>`
the corrupted one with a new one. After booting up, the new joiner will
sync up with a healthy peer, and once it is fully synchronized, it will
be able to serve clients.

Recovery using OSDs
-------------------

But what if all monitors fail at the same time? Since users are encouraged
to deploy at least three (and preferably five) monitors in a Ceph cluster,
the chance of simultaneous failure is low. But unplanned power-downs in a
data center with improperly configured disk/fs settings could fail the
underlying file system, and hence kill all the monitors. In this case, we
can recover the monitor store with the information stored in OSDs::

  ms=/root/mon-store
  mkdir $ms

  # collect the cluster map from stopped OSDs
  for host in $hosts; do
    rsync -avz $ms/. user@$host:$ms.remote
    rm -rf $ms
    ssh user@$host <<EOF
      for osd in /var/lib/ceph/osd/ceph-*; do
        ceph-objectstore-tool --data-path \$osd --no-mon-config --op update-mon-db --mon-store-path $ms.remote
      done
  EOF
    rsync -avz user@$host:$ms.remote/. $ms
  done

  # rebuild the monitor store from the collected map. if the cluster does not
  # use cephx authentication, we can skip the following steps to update the
  # keyring with the caps, and there is no need to pass the "--keyring" option,
  # i.e. just use "ceph-monstore-tool $ms rebuild" instead
  ceph-authtool /path/to/admin.keyring -n mon. \
    --cap mon 'allow *'
  ceph-authtool /path/to/admin.keyring -n client.admin \
    --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'
  # add one or more ceph-mgr's key to the keyring. in this case, an encoded key
  # for mgr.x is added; you can find the encoded key in
  # /etc/ceph/${cluster}.${mgr_name}.keyring on the machine where ceph-mgr is
  # deployed
  ceph-authtool /path/to/admin.keyring --add-key 'AQDN8kBe9PLWARAAZwxXMr+n85SBYbSlLcZnMA==' -n mgr.x \
    --cap mon 'allow profile mgr' --cap osd 'allow *' --cap mds 'allow *'
  # if your monitors' ids are not single characters like 'a', 'b', 'c', please
  # specify them on the command line by passing them as arguments of the
  # "--mon-ids" option. if you are not sure, check your ceph.conf for sections
  # named like '[mon.foo]'. do not pass the "--mon-ids" option if you are
  # using DNS SRV records for looking up monitors.
  ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring --mon-ids alpha beta gamma

  # make a backup of the corrupted store.db just in case! repeat for
  # all monitors.
  mv /var/lib/ceph/mon/mon.foo/store.db /var/lib/ceph/mon/mon.foo/store.db.corrupted

  # move the rebuilt store.db into place. repeat for all monitors.
  mv $ms/store.db /var/lib/ceph/mon/mon.foo/store.db
  chown -R ceph:ceph /var/lib/ceph/mon/mon.foo/store.db

The steps above:

#. collect the map from all OSD hosts,
#. rebuild the store,
#. fill the entities in the keyring file with appropriate caps, and
#. replace the corrupted store on ``mon.foo`` with the recovered copy.

Known limitations
~~~~~~~~~~~~~~~~~

The following information is not recoverable using the steps above:

- **some added keyrings**: all the OSD keyrings added using the
  ``ceph auth add`` command are recovered from the OSDs' copies, and the
  ``client.admin`` keyring is imported using ``ceph-monstore-tool``. But
  the MDS keyrings and other keyrings will be missing from the recovered
  monitor store. You might need to re-add them manually.

- **creating pools**: if any RADOS pools were in the process of being
  created, that state is lost. The recovery tool assumes that all pools
  have been created. If there are PGs stuck in the 'unknown' state after
  the recovery for a partially created pool, you can force creation of
  the *empty* PG with the ``ceph osd force-create-pg`` command. Note that
  this will create an *empty* PG, so only do this if you know the pool is
  empty.

- **MDS Maps**: the MDS maps are lost.



Everything Failed! Now What?
=============================

Reaching out for help
----------------------

You can find us on IRC at #ceph and #ceph-devel on OFTC (server
irc.oftc.net) and on the ``ceph-devel@vger.kernel.org`` and
``ceph-users@lists.ceph.com`` mailing lists. Make sure you have grabbed
your logs and have them ready in case someone asks: the faster the
interaction and the lower the response latency, the better the chances
that everyone's time is well spent.


Preparing your logs
---------------------

Monitor logs are, by default, kept in ``/var/log/ceph/ceph-mon.FOO.log*``.
We may want them. However, your logs may not have the necessary
information. If you don't find your monitor logs at their default
location, you can check where they should be by running::

      ceph-conf --name mon.FOO --show-config-value log_file

The amount of information in the logs is subject to the debug levels being
enforced by your configuration files. If you have not enforced a specific
debug level, then Ceph is using the default levels, and your logs may not
contain the information needed to track down your issue.
A first step in getting relevant information into your logs is to raise
the debug levels. In this case we are interested in information from the
monitors.
As with other components, different parts of the monitor output their
debug information to different subsystems.

You will have to raise the debug levels of those subsystems most closely
related to your issue. This may not be an easy task for someone unfamiliar
with troubleshooting Ceph. For most situations, setting the following
options on your monitors will be enough to pinpoint a potential source of
the issue::

      debug mon = 10
      debug ms = 1

If we find that these debug levels are not enough, there's a chance we may
ask you to raise them, or even to define other debug subsystems to obtain
information from -- but at least we started off with some useful
information, instead of a massive log without much to go on.

Do I need to restart a monitor to adjust debug levels?
------------------------------------------------------

No. You may do it in one of two ways:

You have quorum

  Either inject the debug option into the monitor you want to debug::

      ceph tell mon.FOO config set debug_mon 10/10

  or into all monitors at once::

      ceph tell mon.* config set debug_mon 10/10

No quorum

  Use the monitor's admin socket and directly adjust the configuration
  options::

      ceph daemon mon.FOO config set debug_mon 10/10


Going back to default values is as easy as rerunning the above commands
using the debug level ``1/10`` instead. You can check your current values
using the admin socket and the following commands::

      ceph daemon mon.FOO config show

or::

      ceph daemon mon.FOO config get 'OPTION_NAME'

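When you only care about a couple of options, filtering the ``config
show`` output is often quicker than reading all of it. The JSON below is a
trimmed, illustrative sample, not real daemon output; on a live monitor
you would pipe ``ceph daemon mon.FOO config show`` instead of the
here-document:

```shell
# Sketch: pick the debug_mon and debug_ms values out of a (sample)
# "config show" dump. Replace the here-document with
#   ceph daemon mon.FOO config show
# on a real monitor host.
grep -E '"debug_(mon|ms)"' <<'EOF'
{
    "name": "mon.FOO",
    "debug_mon": "10/10",
    "debug_ms": "1/5",
    "debug_osd": "1/5"
}
EOF
```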

Reproduced the problem with appropriate debug levels. Now what?
----------------------------------------------------------------

Ideally, you would send us only the relevant portions of your logs. We
realise that figuring out the corresponding portion may not be the
easiest of tasks. Therefore, we won't hold it against you if you provide
the full log, but common sense should be employed. If your log has
hundreds of thousands of lines, it may get tricky to go through the whole
thing, especially if we do not know at which point whatever your issue is
happened. For instance, when reproducing the problem, make a note of the
current date and time, and extract the relevant portions of your logs
based on that.
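
Extracting a time window can be scripted. The sketch below is
illustrative (the log lines in the here-document are made up): Ceph log
lines start with ``YYYY-MM-DD HH:MM:SS.mmm``, so a plain string comparison
on the first two fields is enough to select a range:

```shell
# Sketch: keep only log lines between two timestamps. On a real host
# you would feed /var/log/ceph/ceph-mon.FOO.log instead of the
# here-document with made-up lines.
start='2013-10-30 04:10:00'
end='2013-10-30 04:15:00'

awk -v s="$start" -v e="$end" '
  { ts = $1 " " $2 }            # leading date + time fields
  ts >= s && ts <= e            # print lines inside the window
' <<'EOF'
2013-10-30 04:05:00.123 7f12 10 mon.a probing
2013-10-30 04:12:01.945 7f12 10 mon.a electing
2013-10-30 04:20:00.000 7f12 10 mon.a peon
EOF
```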

Finally, you should reach out to us on the mailing lists, on IRC, or by
filing a new issue on the `tracker`_.

.. _tracker: http://tracker.ceph.com/projects/ceph/issues/new