=================================
 Troubleshooting Monitors
=================================

.. index:: monitor, high availability

When a cluster encounters monitor-related trouble there is a tendency to
panic, sometimes with good reason. Keep in mind that losing a monitor, or
even several of them, doesn't necessarily mean that your cluster is down, as
long as a majority of the monitors is up, running and has formed a quorum.
Regardless of how bad the situation is, the first thing you should do is to
calm down, take a breath and work through our initial troubleshooting script.


Initial Troubleshooting
========================


**Are the monitors running?**

  First of all, we need to make sure the monitors are running. You would be
  amazed by how often people forget to start the monitors, or to restart them
  after an upgrade. There's no shame in that, but let's try not to lose a
  couple of hours chasing an issue that is not there.
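
  A quick check from a shell on the monitor host might look like the
  following. It is only a sketch: it assumes ``pgrep`` is installed and that
  the daemon's process name is ``ceph-mon``; ``is_running`` is a hypothetical
  helper name::

    # is_running NAME: succeed if a process named exactly NAME exists.
    is_running() {
        pgrep -x "$1" > /dev/null
    }

    if is_running ceph-mon; then
        echo "ceph-mon is running"
    else
        echo "ceph-mon is NOT running"
    fi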

**Are you able to connect to the monitor's servers?**

  It doesn't happen often, but sometimes there are ``iptables`` rules that
  block access to monitor servers or monitor ports, usually leftovers from
  monitor stress-testing that were forgotten at some point. Try ssh'ing into
  the server and, if that succeeds, try connecting to the monitor's port
  using your tool of choice (telnet, nc, ...).
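
  For the port check, a sketch using ``nc`` follows. It assumes a netcat
  build that supports ``-z`` (scan without sending data) and ``-w`` (timeout
  in seconds), and the default monitor port 6789; ``check_port`` and the host
  name in the usage line are hypothetical::

    # check_port HOST PORT: succeed if a TCP connection can be opened.
    check_port() {
        nc -z -w 5 "$1" "$2"
    }

    # usage: check_port mon-host.example.com 6789 && echo reachable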

**Does ceph -s run and obtain a reply from the cluster?**

  If the answer is yes then your cluster is up and running.  One thing you
  can take for granted is that the monitors will only answer a ``status``
  request if there is a formed quorum.

  If ``ceph -s`` blocks, however, without obtaining a reply from the cluster
  or while showing a lot of ``fault`` messages, then it is likely that your
  monitors are either down completely or only a portion of them is up -- a
  portion that is not enough to form a quorum (keep in mind that a quorum is
  formed by a majority of monitors).

**What if ceph -s doesn't finish?**

  If you haven't gone through all the steps so far, please go back and do so.

  If you are running Emperor 0.72-rc1 or later, you can contact each monitor
  individually and ask for its status, regardless of whether a quorum has
  formed. This can be achieved using ``ceph ping mon.ID``, ``ID`` being the
  monitor's identifier. You should perform this for each monitor in the
  cluster. In section `Understanding mon_status`_ we will explain how to
  interpret the output of this command.

  For the rest of you who don't tread on the bleeding edge, you will need to
  ssh into the server and use the monitor's admin socket. Please jump to
  `Using the monitor's admin socket`_.
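
  The per-monitor ``ceph ping`` check can be scripted. The following is only
  a sketch: the monitor IDs in the usage line are placeholders, and
  ``ping_mons`` and the ``CEPH`` override variable are hypothetical helpers,
  not part of Ceph::

    # CEPH lets you point at a different binary; it defaults to "ceph".
    CEPH=${CEPH:-ceph}

    # ping_mons ID...: run "ceph ping mon.ID" for every ID given.
    ping_mons() {
        for id in "$@"; do
            echo "== mon.$id =="
            "$CEPH" ping "mon.$id"
        done
    }

    # usage: ping_mons a b c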

For other specific issues, keep on reading.


Using the monitor's admin socket
=================================

The admin socket allows you to interact with a given daemon directly using a
Unix socket file. This file can be found in your monitor's ``run`` directory.
By default, the admin socket is kept at ``/var/run/ceph/ceph-mon.ID.asok``,
but this can vary if you defined it otherwise. If you don't find it there,
please check your ``ceph.conf`` for an alternative path or run::

  ceph-conf --name mon.ID --show-config-value admin_socket

Please bear in mind that the admin socket is only available while the
monitor is running. When the monitor is properly shut down, the admin socket
is removed. If, however, the monitor is not running and the admin socket
still persists, it is likely that the monitor was improperly shut down.
Regardless, if the monitor is not running, you will not be able to use the
admin socket, with ``ceph`` likely returning ``Error 111: Connection Refused``.

Accessing the admin socket is as simple as telling the ``ceph`` tool to use
the ``asok`` file.  In pre-Dumpling Ceph, this can be achieved by::

  ceph --admin-daemon /var/run/ceph/ceph-mon.<id>.asok <command>

while in Dumpling and beyond you can use the alternate (and recommended)
format::

  ceph daemon mon.<id> <command>

Using ``help`` as the command to the ``ceph`` tool will show you the
supported commands available through the admin socket. Please take a look
at ``config get``, ``config show``, ``mon_status`` and ``quorum_status``,
as those can be enlightening when troubleshooting a monitor.


Understanding mon_status
=========================

``mon_status`` can be obtained through the ``ceph`` tool when you have
a formed quorum, or via the admin socket if you don't. This command will
output a multitude of information about the monitor, including the same
output you would get with ``quorum_status``.

Take the following example of ``mon_status``::

  { "name": "c",
    "rank": 2,
    "state": "peon",
    "election_epoch": 38,
    "quorum": [
          1,
          2],
    "outside_quorum": [],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": { "epoch": 3,
        "fsid": "5c4e9d53-e2e1-478a-8061-f543f8be4cf8",
        "modified": "2013-10-30 04:12:01.945629",
        "created": "2013-10-29 14:14:41.914786",
        "mons": [
              { "rank": 0,
                "name": "a",
                "addr": "127.0.0.1:6789\/0"},
              { "rank": 1,
                "name": "b",
                "addr": "127.0.0.1:6790\/0"},
              { "rank": 2,
                "name": "c",
                "addr": "127.0.0.1:6795\/0"}]}}

A couple of things are obvious: we have three monitors in the monmap (*a*, *b*
and *c*), the quorum is formed by only two monitors, and *c* is in the quorum
as a *peon*.

Which monitor is out of the quorum?

  The answer would be **a**.

Why?

  Take a look at the ``quorum`` set. We have two monitors in this set: *1*
  and *2*. These are not monitor names. These are monitor ranks, as established
  in the current monmap. We are missing the monitor with rank 0, and according
  to the monmap that would be ``mon.a``.
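
  The same lookup can be automated. The sketch below assumes ``jq`` is
  installed and that ``mon_status`` output has the structure shown above;
  ``out_of_quorum`` is a hypothetical helper name::

    # out_of_quorum: read mon_status JSON on stdin and print the name of
    # every monitor in the monmap whose rank is not in the "quorum" list.
    out_of_quorum() {
        jq -r '.quorum as $q
               | .monmap.mons[]
               | select(.rank as $r | ($q | index($r)) == null)
               | .name'
    }

    # usage: ceph daemon mon.c mon_status | out_of_quorum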

By the way, how are ranks established?

  Ranks are (re)calculated whenever you add or remove monitors and follow a
  simple rule: the **lower** the ``IP:PORT`` combination, the **lower** the
  rank is. In this case, considering that ``127.0.0.1:6789`` is lower than all
  the remaining ``IP:PORT`` combinations, ``mon.a`` has rank 0.

Most Common Monitor Issues
===========================

Have Quorum but at least one Monitor is down
---------------------------------------------

When this happens, depending on the version of Ceph you are running,
you should be seeing something similar to::

      $ ceph health detail
      [snip]
      mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum)

How to troubleshoot this?

  First, make sure ``mon.a`` is running.

  Second, make sure you are able to connect to ``mon.a``'s server from the
  other monitors' servers. Check the ports as well. Check ``iptables`` on
  all your monitor nodes and make sure you are not dropping/rejecting
  connections.

  If this initial troubleshooting doesn't solve your problems, then it's
  time to go deeper.

  First, check the problematic monitor's ``mon_status`` via the admin
  socket as explained in `Using the monitor's admin socket`_ and
  `Understanding mon_status`_.

  Considering the monitor is out of the quorum, its state should be one of
  ``probing``, ``electing`` or ``synchronizing``. If it happens to be either
  ``leader`` or ``peon``, then the monitor believes it is in quorum, while
  the rest of the cluster is sure it is not; or maybe it got into the quorum
  while we were troubleshooting the monitor, so check your ``ceph -s`` again
  just to make sure. Proceed if the monitor is not yet in the quorum.
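
  The decision above can be written down as a small shell helper. This is a
  sketch only; ``classify_state`` is a hypothetical name, and its argument is
  assumed to be the ``state`` field copied from ``mon_status``::

    # classify_state STATE: say what a mon_status "state" value implies
    # for a monitor that the cluster reports as out of quorum.
    classify_state() {
        case "$1" in
            probing|electing|synchronizing)
                echo "out of quorum" ;;
            leader|peon)
                echo "believes it is in quorum" ;;
            *)
                echo "unknown state" ;;
        esac
    }

    # usage: classify_state probing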

What if the state is ``probing``?

  This means the monitor is still looking for the other monitors. Every time
  you start a monitor, the monitor will stay in this state for some time
  while trying to find the rest of the monitors specified in the ``monmap``.
  The time a monitor will spend in this state can vary. For instance, when on
  a single-monitor cluster, the monitor will pass through the probing state
  almost instantaneously, since there are no other monitors around. On a
  multi-monitor cluster, the monitors will stay in this state until they
  find enough monitors to form a quorum -- this means that if you have 2 out
  of 3 monitors down, the one remaining monitor will stay in this state
  indefinitely until you bring one of the other monitors up.

  If you have a quorum, however, the monitor should be able to find the
  remaining monitors pretty fast, as long as they can be reached. If your
  monitor is stuck probing and you have gone through all the communication
  troubleshooting, then there is a fair chance that the monitor is trying
  to reach the other monitors on a wrong address. ``mon_status`` outputs the
  ``monmap`` known to the monitor: check if the other monitors' locations
  match reality. If they don't, jump to
  `Recovering a Monitor's Broken monmap`_; if they do, then it may be related
  to severe clock skews amongst the monitor nodes and you should refer to
  `Clock Skews`_ first, but if that doesn't solve your problem then it is
  time to prepare some logs and reach out to the community (please refer
  to `Preparing your logs`_ on how to best prepare your logs).


What if state is ``electing``?

  This means the monitor is in the middle of an election. These should be
  fast to complete, but at times the monitors can get stuck electing. This
  is usually a sign of a clock skew among the monitor nodes; jump to
  `Clock Skews`_ for more information on that. If all your clocks are properly
  synchronized, it is best if you prepare some logs and reach out to the
  community. This is not a state that is likely to persist and, aside from
  (*really*) old bugs, there is no obvious reason besides clock skews
  why this would happen.

What if state is ``synchronizing``?

  This means the monitor is synchronizing with the rest of the cluster in
  order to join the quorum. The smaller your monitor store is, the faster
  the synchronization process completes, so if you have a big store it may
  take a while. Don't worry, it should be finished soon enough.

  However, if you notice that the monitor jumps from ``synchronizing`` to
  ``electing`` and then back to ``synchronizing``, then you do have a
  problem: the cluster state is advancing (i.e., generating new maps) way
  too fast for the synchronization process to keep up. This used to be a
  thing in early Cuttlefish, but since then the synchronization process has
  been substantially refactored and enhanced to avoid just this sort of
  behavior. If this happens in later versions let us know. And bring some logs
  (see `Preparing your logs`_).

What if state is ``leader`` or ``peon``?

  This should not happen. There is a chance it might, however, and
  it has a lot to do with clock skews -- see `Clock Skews`_. If you are not
  suffering from clock skews, then please prepare your logs (see
  `Preparing your logs`_) and reach out to us.


Recovering a Monitor's Broken monmap
-------------------------------------

This is what a ``monmap`` usually looks like, depending on the number of
monitors::

      epoch 3
      fsid 5c4e9d53-e2e1-478a-8061-f543f8be4cf8
      last_changed 2013-10-30 04:12:01.945629
      created 2013-10-29 14:14:41.914786
      0: 127.0.0.1:6789/0 mon.a
      1: 127.0.0.1:6790/0 mon.b
      2: 127.0.0.1:6795/0 mon.c

This may not be what you have, however. For instance, in some versions of
early Cuttlefish there was a bug that could cause your ``monmap``
to be nullified, that is, completely filled with zeros. This means that not
even ``monmaptool`` would be able to read it, because it would find it hard to
make sense of only-zeros. Some other times, you may end up with a monitor
with a severely outdated monmap, thus being unable to find the remaining
monitors (e.g., say ``mon.c`` is down; you add a new monitor ``mon.d``,
then remove ``mon.a``, then add a new monitor ``mon.e`` and remove
``mon.b``; you will end up with a totally different monmap from the one
``mon.c`` knows).

In this sort of situation, you have two possible solutions:

Scrap the monitor and create a new one

  You should only take this route if you are positive that you won't
  lose the information kept by that monitor; that you have other monitors
  and that they are running just fine so that your new monitor is able
  to synchronize from the remaining monitors. Keep in mind that destroying
  a monitor, if there are no other copies of its contents, may lead to
  loss of data.

Inject a monmap into the monitor

  Usually the safest path. You should grab the monmap from the remaining
  monitors and inject it into the monitor with the corrupted/lost monmap.

  These are the basic steps:

  1. Is there a formed quorum? If so, grab the monmap from the quorum::

      $ ceph mon getmap -o /tmp/monmap

  2. No quorum? Grab the monmap directly from another monitor (this
     assumes the monitor you are grabbing the monmap from has id ID-FOO
     and has been stopped)::

      $ ceph-mon -i ID-FOO --extract-monmap /tmp/monmap

  3. Stop the monitor you are going to inject the monmap into.

  4. Inject the monmap::

      $ ceph-mon -i ID --inject-monmap /tmp/monmap

  5. Start the monitor.

  Please keep in mind that the ability to inject monmaps is a powerful
  feature that can cause havoc with your monitors if misused, as it will
  overwrite the latest, existing monmap kept by the monitor.
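
  Before injecting, it can be worth confirming that the map you grabbed
  lists the monitors you expect. A minimal sketch, assuming the map is printed
  with ``monmaptool --print`` in the format shown at the top of this section;
  ``list_mons`` is a hypothetical helper name::

    # list_mons: read a printed monmap on stdin and output one
    # "NAME ADDRESS" pair per monitor line ("RANK: ADDRESS NAME").
    list_mons() {
        awk '/^ *[0-9]+: / { print $3, $2 }'
    }

    # usage: monmaptool --print /tmp/monmap | list_mons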


Clock Skews
------------

Monitors can be severely affected by significant clock skews across the
monitor nodes. This usually translates into weird behavior with no obvious
cause. To avoid such issues, you should run a clock synchronization tool
on your monitor nodes.


What's the maximum tolerated clock skew?

  By default the monitors will allow clocks to drift up to ``0.05 seconds``.


Can I increase the maximum tolerated clock skew?

  This value is configurable via the ``mon-clock-drift-allowed`` option, and
  although you *CAN* it doesn't mean you *SHOULD*. The clock skew mechanism
  is in place because clock-skewed monitors may not behave properly. We, as
  developers and QA aficionados, are comfortable with the current default
  value, as it will alert the user before the monitors get out of hand.
  Changing this value without testing it first may cause unforeseen effects
  on the stability of the monitors and overall cluster health, although there
  is no risk of data loss.


How do I know there's a clock skew?

  The monitors will warn you in the form of a ``HEALTH_WARN``. ``ceph health
  detail`` should show something in the form of::

      mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s)

  That means that ``mon.c`` has been flagged as suffering from a clock skew.
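
  To pull just the affected monitors out of a longer report, a filter along
  these lines may help. It is a sketch that assumes the line format shown
  above; ``skewed_mons`` is a hypothetical helper name::

    # skewed_mons: read "ceph health detail" output on stdin and print
    # the monitor name and measured skew for every per-monitor skew line.
    skewed_mons() {
        awk '/^mon\.[^ ]+ addr .* clock skew / { print $1, $6 }'
    }

    # usage: ceph health detail | skewed_mons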


What should I do if there's a clock skew?

  Synchronize your clocks. Running an NTP client may help. If you are already
  using one and you still hit this sort of issue, check whether you are using
  an NTP server remote to your network and consider hosting your own NTP
  server on your network.  This last option tends to reduce the number of
  issues with monitor clock skews.


Client Can't Connect or Mount
------------------------------

Check your IP tables. Some OS install utilities add a ``REJECT`` rule to
``iptables``. The rule rejects all clients trying to connect to the host except
for ``ssh``. If your monitor host's IP tables have such a ``REJECT`` rule in
place, clients connecting from a separate node will fail to mount with a timeout
error. You need to address ``iptables`` rules that reject clients trying to
connect to Ceph daemons.  For example, you would need to address rules that look
like this appropriately::

        REJECT all -- anywhere anywhere reject-with icmp-host-prohibited

You may also need to add rules to IP tables on your Ceph hosts to ensure
that clients can access the ports associated with your Ceph monitors (i.e., port
6789 by default) and Ceph OSDs (i.e., 6800 through 7300 by default). For
example::

        iptables -A INPUT -m multiport -p tcp -s {ip-address}/{netmask} --dports 6789,6800:7300 -j ACCEPT

Monitor Store Failures
======================

Symptoms of store corruption
----------------------------

The Ceph monitor stores the `cluster map`_ in a key/value store such as LevelDB.
If a monitor fails due to key/value store corruption, the following error
messages might be found in the monitor log::

  Corruption: error in middle of record

or::

  Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.0/store.db/1234567.ldb
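
To look for these messages, something like the following can be used. It is
only a sketch: the log path in the usage line is the default location and may
differ on your system (see `Preparing your logs`_), and ``find_corruption`` is
a hypothetical helper name::

  # find_corruption LOGFILE: print, with line numbers, every log line
  # that carries a key/value store corruption message.
  find_corruption() {
      grep -n 'Corruption:' "$1"
  }

  # usage: find_corruption /var/log/ceph/ceph-mon.FOO.log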

Recovery using healthy monitor(s)
---------------------------------

If there are any survivors, we can always `replace`_ the corrupted one with a
new one. After booting up, the new joiner will sync up with a healthy
peer, and once it is fully synced, it will be able to serve the clients.

Recovery using OSDs
-------------------

But what if all monitors fail at the same time? Since users are encouraged to
deploy at least three monitors in a Ceph cluster, the chance of simultaneous
failure is low. But unplanned power-downs in a data center with improperly
configured disk/fs settings could fail the underlying filesystem, and hence
kill all the monitors. In this case, we can recover the monitor store with the
information stored in OSDs::

  ms=/tmp/mon-store
  mkdir $ms
  # collect the cluster map from OSDs; $hosts is assumed to hold the list
  # of OSD hosts
  for host in $hosts; do
    rsync -avz $ms/ user@$host:$ms/
    rm -rf $ms
    ssh user@$host <<EOF
      for osd in /var/lib/ceph/osd/ceph-*; do
        ceph-objectstore-tool --data-path \$osd --op update-mon-db --mon-store-path $ms
      done
  EOF
    rsync -avz user@$host:$ms/ $ms/
  done
  # rebuild the monitor store from the collected map, if the cluster does not
  # use cephx authentication, we can skip the following steps to update the
  # keyring with the caps, and there is no need to pass the "--keyring" option.
  # i.e. just use "ceph-monstore-tool /tmp/mon-store rebuild" instead
  ceph-authtool /path/to/admin.keyring -n mon. \
    --cap mon 'allow *'
  ceph-authtool /path/to/admin.keyring -n client.admin \
    --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'
  ceph-monstore-tool /tmp/mon-store rebuild -- --keyring /path/to/admin.keyring
  # backup corrupted store.db just in case
  mv /var/lib/ceph/mon/mon.0/store.db /var/lib/ceph/mon/mon.0/store.db.corrupted
  mv /tmp/mon-store/store.db /var/lib/ceph/mon/mon.0/store.db
  chown -R ceph:ceph /var/lib/ceph/mon/mon.0/store.db

The steps above

#. collect the map from all OSD hosts,
#. fill the entities in the keyring file with appropriate caps,
#. rebuild the store,
#. replace the corrupted store on ``mon.0`` with the recovered copy.

Known limitations
~~~~~~~~~~~~~~~~~

The following information is not recoverable using the steps above:

- **some added keyrings**: all the OSD keyrings added using the ``ceph auth add``
  command are recovered from the OSD's copy, and the ``client.admin`` keyring is
  imported using ``ceph-monstore-tool``. But the MDS keyrings and other keyrings
  are missing in the recovered monitor store. You might need to re-add them
  manually.

- **pg settings**: the ``full ratio`` and ``nearfull ratio`` settings configured using
  ``ceph pg set_full_ratio`` and ``ceph pg set_nearfull_ratio`` will be lost.

- **MDS Maps**: the MDS maps are lost.


Everything Failed! Now What?
=============================

Reaching out for help
----------------------

You can find us on IRC at #ceph and #ceph-devel at OFTC (server irc.oftc.net)
and on ``ceph-devel@vger.kernel.org`` and ``ceph-users@lists.ceph.com``. Make
sure you have grabbed your logs and have them ready if someone asks: the faster
the interaction and the lower the latency in response, the better the chances
that everyone's time is used well.


Preparing your logs
---------------------

Monitor logs are, by default, kept in ``/var/log/ceph/ceph-mon.FOO.log*``. We
may want them. However, your logs may not have the necessary information. If
you don't find your monitor logs at their default location, you can check
where they should be by running::

  ceph-conf --name mon.FOO --show-config-value log_file

The amount of information in the logs is subject to the debug levels being
enforced by your configuration files. If you have not enforced a specific
debug level then Ceph is using the default levels and your logs may not
contain the information needed to track down your issue.
A first step in getting relevant information into your logs will be to raise
debug levels. In this case we will be interested in the information from the
monitor.
As with other components, different parts of the monitor output their debug
information on different subsystems.

You will have to raise the debug levels of those subsystems more closely
related to your issue. This may not be an easy task for someone unfamiliar
with troubleshooting Ceph. For most situations, setting the following options
on your monitors will be enough to pinpoint a potential source of the issue::

      debug mon = 10
      debug ms = 1

If we find that these debug levels are not enough, there's a chance we may
ask you to raise them or even define other debug subsystems to obtain
information from -- but at least we started off with some useful information,
instead of a massively empty log without much to go on.

Do I need to restart a monitor to adjust debug levels?
------------------------------------------------------

No. You may do it in one of two ways:

You have quorum

  Either inject the debug option into the monitor you want to debug::

        ceph tell mon.FOO injectargs --debug_mon 10/10

  or into all monitors at once::

        ceph tell mon.* injectargs --debug_mon 10/10

No quorum

  Use the monitor's admin socket and directly adjust the configuration
  options::

      ceph daemon mon.FOO config set debug_mon 10/10


Going back to default values is as easy as rerunning the above commands
using the debug level ``1/10`` instead.  You can check your current
values using the admin socket and the following commands::

      ceph daemon mon.FOO config show

or::

      ceph daemon mon.FOO config get 'OPTION_NAME'


Reproduced the problem with appropriate debug levels. Now what?
----------------------------------------------------------------

Ideally you would send us only the relevant portions of your logs.
We realise that figuring out the corresponding portion may not be the
easiest of tasks. Therefore, we won't hold it against you if you provide the
full log, but common sense should be employed. If your log has hundreds
of thousands of lines, it may get tricky to go through the whole thing,
especially if we do not know at which point your issue, whatever it is,
happened. For instance, when reproducing, remember to write down the
current time and date and to extract the relevant portions of your logs
based on that.
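
Extracting by time can be done with standard tools. The following is a sketch
that assumes every log line starts with a ``YYYY-MM-DD HH:MM:SS`` timestamp
(the format of the ``created`` field shown in the ``monmap`` example); check
your own logs before relying on it. ``log_between`` is a hypothetical helper
name::

  # log_between START END LOGFILE: print the lines whose leading
  # timestamp lies between START and END (inclusive, compared as text).
  log_between() {
      awk -v s="$1" -v e="$2" '
          { ts = substr($0, 1, 19) }
          ts >= s && ts <= e
      ' "$3"
  }

  # usage: log_between "2013-10-30 04:10:00" "2013-10-30 04:15:00" ceph-mon.FOO.log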

Finally, you should reach out to us on the mailing lists, on IRC or file
a new issue on the `tracker`_.

.. _cluster map: ../../architecture#cluster-map
.. _replace: ../operation/add-or-rm-mons
.. _tracker: http://tracker.ceph.com/projects/ceph/issues/new