   1 .. _rados-troubleshooting-mon:
   2
   3 ==========================
   4  Troubleshooting Monitors
   5 ==========================
   6
   7 .. index:: monitor, high availability
   8
   9 If a cluster encounters monitor-related problems, this does not necessarily
  10 mean that the cluster is in danger of going down. Even if multiple monitors are
  11 lost, the cluster can still be up and running, as long as there are enough
  12 surviving monitors to form a quorum.
  13
  14 However serious your cluster's monitor-related problems might be, we recommend
  15 that you take the following troubleshooting steps.
  16
  17
  18 Initial Troubleshooting
  19 =======================
  20
  21 **Are the monitors running?**
  22
  23   First, make sure that the monitor (*mon*) daemon processes (``ceph-mon``) are
  24   running. Sometimes Ceph admins either forget to start the mons or forget to
  25   restart the mons after an upgrade. Checking for this simple oversight can
  26   save hours of painstaking troubleshooting. It is also important to make sure
  27   that the manager daemons (``ceph-mgr``) are running. Remember that typical
  28   cluster configurations provide one ``ceph-mgr`` for each ``ceph-mon``.
  29
  30   .. note:: Rook will not run more than two managers.
  31
  32 **Can you reach the monitor nodes?**
  33
  34   In certain rare cases, there may be ``iptables`` rules that block access to
  35   monitor nodes or TCP ports. These rules might be left over from earlier
  36   stress testing or rule development. To check for the presence of such rules,
  37   SSH into the server and then try to connect to the monitor's ports
  38   (``tcp/3300`` and ``tcp/6789``) using ``telnet``, ``nc``, or a similar tool.
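
  The port check can also be scripted. The following is a minimal sketch rather
  than an official Ceph tool: it assumes Bash (for the ``/dev/tcp``
  pseudo-device) and the coreutils ``timeout`` command, and the function name
  ``check_mon_ports`` and the hostname in the usage comment are placeholders.

  .. code-block:: bash

     # Report whether each default monitor port on a host accepts TCP connections.
     check_mon_ports() {
       local host=$1 port
       for port in 3300 6789; do
         # Opening /dev/tcp/HOST/PORT makes Bash attempt a TCP connection;
         # timeout bounds the attempt so that dropped packets do not hang the check.
         if timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
           echo "$host:$port reachable"
         else
           echo "$host:$port NOT reachable"
         fi
       done
     }
     # Example (placeholder hostname): check_mon_ports mon1.example.com

  Run the check from another monitor node and from a client node, because
  firewall rules may differ by source address.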
  39
  40 **Does the ``ceph status`` command run and receive a reply from the cluster?**
  41
  42   If the ``ceph status`` command does receive a reply from the cluster, then the
  cluster is up and running. The monitors will respond to a ``status`` request
  44   only if there is a formed quorum. Confirm that one or more ``mgr`` daemons
  45   are reported as running. Under ideal conditions, all ``mgr`` daemons will be
  46   reported as running.
  47
  48
  49   If the ``ceph status`` command does not receive a reply from the cluster, then
  50   there are probably not enough monitors ``up`` to form a quorum.  The ``ceph
  51   -s`` command with no further options specified connects to an arbitrarily
  52   selected monitor. In certain cases, however, it might be helpful to connect
  53   to a specific monitor (or to several specific monitors in sequence) by adding
  54   the ``-m`` flag to the command: for example, ``ceph status -m mymon1``.
  55
  56
  57 **None of this worked. What now?**
  58
  59   If the above solutions have not resolved your problems, you might find it
  60   helpful to examine each individual monitor in turn. Whether or not a quorum
  61   has been formed, it is possible to contact each monitor individually and
  62   request its status by using the ``ceph tell mon.ID mon_status`` command (here
  63   ``ID`` is the monitor's identifier).
  64
  65   Run the ``ceph tell mon.ID mon_status`` command for each monitor in the
  66   cluster. For more on this command's output, see :ref:`Understanding
  67   mon_status
  68   <rados_troubleshoting_troubleshooting_mon_understanding_mon_status>`.
  69
  70   There is also an alternative method: SSH into each monitor node and query the
  71   daemon's admin socket. See :ref:`Using the Monitor's Admin
  72   Socket<rados_troubleshoting_troubleshooting_mon_using_admin_socket>`.
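
  Querying each monitor in turn can be wrapped in a small loop. This is a
  sketch under assumptions: the function name ``probe_mons`` is hypothetical,
  the monitor IDs in the usage comment are placeholders, and the 10-second
  value passed to the ``ceph`` CLI's ``--connect-timeout`` option is an
  arbitrary choice.

  .. code-block:: bash

     # Ask each listed monitor for its status and report which ones answer.
     probe_mons() {
       local id
       for id in "$@"; do
         if ceph --connect-timeout 10 tell "mon.$id" mon_status >/dev/null 2>&1; then
           echo "mon.$id: responded"
         else
           echo "mon.$id: no response"
         fi
       done
     }
     # Example (placeholder IDs): probe_mons a b c

  Drop the output redirection to see the full ``mon_status`` reply of the
  monitors that respond.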
  73
  74 .. _rados_troubleshoting_troubleshooting_mon_using_admin_socket:
  75
  76 Using the monitor's admin socket
  77 ================================
  78
  79 A monitor's admin socket allows you to interact directly with a specific daemon
  80 by using a Unix socket file. This file is found in the monitor's ``run``
directory. The admin socket's default path is
  82 ``/var/run/ceph/ceph-mon.ID.asok``, but this can be overridden and the admin
  83 socket might be elsewhere, especially if your cluster's daemons are deployed in
  84 containers. If you cannot find it, either check your ``ceph.conf`` for an
  85 alternative path or run the following command:
  86
  87 .. prompt:: bash $
  88
  89    ceph-conf --name mon.ID --show-config-value admin_socket
  90
  91 The admin socket is available for use only when the monitor daemon is running.
  92 Whenever the monitor has been properly shut down, the admin socket is removed.
  93 However, if the monitor is not running and the admin socket persists, it is
  94 likely that the monitor has been improperly shut down.  In any case, if the
  95 monitor is not running, it will be impossible to use the admin socket, and the
  96 ``ceph`` command is likely to return ``Error 111: Connection Refused``.
  97
  98 To access the admin socket, run a ``ceph tell`` command of the following form
  99 (specifying the daemon that you are interested in):
 100
 101 .. prompt:: bash $
 102
 103    ceph tell mon.<id> mon_status
 104
This command passes a ``mon_status`` command to the specific running monitor
daemon ``<id>`` via its admin socket. If you know the full path to the admin
socket file, this can be done more directly by running the following command:
 108
 109 .. prompt:: bash $
 110
 111    ceph --admin-daemon <full_path_to_asok_file> <command>
 112
 113 Running ``ceph help`` shows all supported commands that are available through
 114 the admin socket. See especially ``config get``, ``config show``, ``mon stat``,
 115 and ``quorum_status``.
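
The lookup of the socket path and the query can be combined. The helper below
is a sketch, not part of Ceph: the name ``mon_asok`` is hypothetical, and it
assumes that it is run on the monitor's own host (or inside its container),
where the socket file is visible.

.. code-block:: bash

   # Run an admin-socket command against a local monitor.
   mon_asok() {
     local id=$1
     shift
     local sock
     # Ask ceph-conf where this monitor's admin socket should be.
     sock=$(ceph-conf --name "mon.$id" --show-config-value admin_socket) || return 1
     if [ ! -S "$sock" ]; then
       echo "no admin socket at $sock; is mon.$id running on this host?" >&2
       return 1
     fi
     ceph --admin-daemon "$sock" "$@"
   }
   # Example: mon_asok ID mon_status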
 116
 117 .. _rados_troubleshoting_troubleshooting_mon_understanding_mon_status:
 118
 119 Understanding mon_status
 120 ========================
 121
 122 The status of the monitor (as reported by the ``ceph tell mon.X mon_status``
 123 command) can always be obtained via the admin socket. This command outputs a
 124 great deal of information about the monitor (including the information found in
 125 the output of the ``quorum_status`` command).
 126
 127 To understand this command's output, let us consider the following example, in
 128 which we see the output of ``ceph tell mon.c mon_status``::
 129
 130   { "name": "c",
 131     "rank": 2,
 132     "state": "peon",
 133     "election_epoch": 38,
 134     "quorum": [
 135           1,
 136           2],
 137     "outside_quorum": [],
 138     "extra_probe_peers": [],
 139     "sync_provider": [],
 140     "monmap": { "epoch": 3,
 141         "fsid": "5c4e9d53-e2e1-478a-8061-f543f8be4cf8",
 142         "modified": "2013-10-30 04:12:01.945629",
 143         "created": "2013-10-29 14:14:41.914786",
 144         "mons": [
 145               { "rank": 0,
 146                 "name": "a",
 147                 "addr": "127.0.0.1:6789\/0"},
 148               { "rank": 1,
 149                 "name": "b",
 150                 "addr": "127.0.0.1:6790\/0"},
 151               { "rank": 2,
 152                 "name": "c",
 153                 "addr": "127.0.0.1:6795\/0"}]}}
 154
 155 It is clear that there are three monitors in the monmap (*a*, *b*, and *c*),
 156 the quorum is formed by only two monitors, and *c* is in the quorum as a
 157 *peon*.
 158
 159 **Which monitor is out of the quorum?**
 160
 161   The answer is **a** (that is, ``mon.a``).
 162
 163 **Why?**
 164
 165   When the ``quorum`` set is examined, there are clearly two monitors in the
 166   set: *1* and *2*. But these are not monitor names. They are monitor ranks, as
 167   established in the current ``monmap``. The ``quorum`` set does not include
 168   the monitor that has rank 0, and according to the ``monmap`` that monitor is
 169   ``mon.a``.
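
  The same comparison of ranks against the ``quorum`` set can be automated.
  This sketch assumes that the ``jq`` utility is installed (it is not part of
  Ceph) and that the output has the ``quorum`` and ``monmap.mons`` fields shown
  in the example above; the function name is hypothetical.

  .. code-block:: bash

     # Print the names of monitors whose rank is absent from the "quorum" array.
     # Reads mon_status JSON on standard input.
     mons_out_of_quorum() {
       jq -r '.quorum as $q
              | .monmap.mons[]
              | select(([.rank] | inside($q)) | not)
              | .name'
     }
     # Example: ceph tell mon.c mon_status | mons_out_of_quorum

  For the example output above, this prints ``a``.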
 170
 171 **How are monitor ranks determined?**
 172
  Monitor ranks are calculated (or recalculated) whenever monitors are added or
  removed. The calculation of ranks follows a simple rule: the **lower** the
  ``IP:PORT`` combination, the **lower** the rank number. In this case, because
  ``127.0.0.1:6789`` is lower than the other two ``IP:PORT`` combinations,
  ``mon.a`` has the lowest rank number, rank 0, which makes it the
  highest-ranked monitor.
 178
 179
 180 Most Common Monitor Issues
 181 ===========================
 182
 183 Have Quorum but at least one Monitor is down
 184 ---------------------------------------------
 185
 186 When this happens, depending on the version of Ceph you are running,
 187 you should be seeing something similar to::
 188
 189       $ ceph health detail
 190       [snip]
 191       mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum)
 192
 193 How to troubleshoot this?
 194
 195   First, make sure ``mon.a`` is running.
 196
 197   Second, make sure you are able to connect to ``mon.a``'s node from the
 198   other mon nodes. Check the TCP ports as well. Check ``iptables`` and
 199   ``nf_conntrack`` on all nodes and ensure that you are not
 200   dropping/rejecting connections.
 201
 202   If this initial troubleshooting doesn't solve your problems, then it's
 203   time to go deeper.
 204
 205   First, check the problematic monitor's ``mon_status`` via the admin
 206   socket as explained in `Using the monitor's admin socket`_ and
 207   `Understanding mon_status`_.
 208
  If the monitor is out of the quorum, its state should be one of ``probing``,
  ``electing`` or ``synchronizing``. If it happens to be either ``leader`` or
  ``peon``, then the monitor believes itself to be in quorum, while the rest of
  the cluster is sure it is not; or maybe it got into the quorum while we were
  troubleshooting the monitor, so check ``ceph status`` again just to make
  sure. Proceed if the monitor is not yet in the quorum.
 215
 216 What if the state is ``probing``?
 217
 218   This means the monitor is still looking for the other monitors. Every time
 219   you start a monitor, the monitor will stay in this state for some time while
  trying to connect to the rest of the monitors specified in the ``monmap``.  The
 221   time a monitor will spend in this state can vary. For instance, when on a
 222   single-monitor cluster (never do this in production), the monitor will pass
 223   through the probing state almost instantaneously.  In a multi-monitor
 224   cluster, the monitors will stay in this state until they find enough monitors
 225   to form a quorum -- this means that if you have 2 out of 3 monitors down, the
 226   one remaining monitor will stay in this state indefinitely until you bring
 227   one of the other monitors up.
 228
  If you have a quorum, the starting daemon should be able to find the
  other monitors quickly, as long as they can be reached. If your
  monitor is stuck probing and you have gone through all the communication
  troubleshooting, then there is a fair chance that the monitor is trying
  to reach the other monitors on a wrong address. ``mon_status`` outputs the
  ``monmap`` known to the monitor: check whether the other monitors' locations
  match reality. If they don't, jump to
  `Recovering a Monitor's Broken monmap`_; if they do, then the problem may be
  related to severe clock skews amongst the monitor nodes and you should refer
  to `Clock Skews`_ first. If that doesn't solve your problem, then it is
  time to prepare some logs and reach out to the community (please refer
  to `Preparing your logs`_ on how to best prepare your logs).
 241
 242
 243 What if state is ``electing``?
 244
 245   This means the monitor is in the middle of an election. With recent Ceph
 246   releases these typically complete quickly, but at times the monitors can
 247   get stuck in what is known as an *election storm*. This can indicate
 248   clock skew among the monitor nodes; jump to
 249   `Clock Skews`_ for more information. If all your clocks are properly
 250   synchronized, you should search the mailing lists and tracker.
  This is not a state that is likely to persist, and aside from
  (*really*) old bugs there is no obvious cause other than clock skew.
  In the worst case, if there are enough surviving mons,
  stop the problematic one while you investigate.
 255
 256 What if state is ``synchronizing``?
 257
 258   This means the monitor is catching up with the rest of the cluster in
 259   order to join the quorum. Time to synchronize is a function of the size
 260   of your monitor store and thus of cluster size and state, so if you have a
 261   large or degraded cluster this may take a while.
 262
 263   If you notice that the monitor jumps from ``synchronizing`` to
 264   ``electing`` and then back to ``synchronizing``, then you do have a
 265   problem: the cluster state may be advancing (i.e., generating new maps)
 266   too fast for the synchronization process to keep up. This was a more common
 267   thing in early days (Cuttlefish), but since then the synchronization process
 268   has been refactored and enhanced to avoid this dynamic. If you experience
  this in later versions, please let us know via the bug tracker, and bring some
  logs (see `Preparing your logs`_).
 271
 272 What if state is ``leader`` or ``peon``?
 273
 274   This should not happen:  famous last words.  If it does, however, it likely
 275   has a lot to do with clock skew -- see `Clock Skews`_. If you are not
 276   suffering from clock skew, then please prepare your logs (see
 277   `Preparing your logs`_) and reach out to the community.
 278
 279
 280 Recovering a Monitor's Broken ``monmap``
 281 ----------------------------------------
 282
 283 This is how a ``monmap`` usually looks, depending on the number of
 284 monitors::
 285
 286
 287       epoch 3
 288       fsid 5c4e9d53-e2e1-478a-8061-f543f8be4cf8
 289       last_changed 2013-10-30 04:12:01.945629
 290       created 2013-10-29 14:14:41.914786
 291       0: 127.0.0.1:6789/0 mon.a
 292       1: 127.0.0.1:6790/0 mon.b
 293       2: 127.0.0.1:6795/0 mon.c
 294
This may not be what you have, however. For instance, in some versions of
early Cuttlefish there was a bug that could cause your ``monmap``
to be nullified, that is, completely filled with zeros. This means that not even
 298 ``monmaptool`` would be able to make sense of cold, hard, inscrutable zeros.
 299 It's also possible to end up with a monitor with a severely outdated monmap,
 300 notably if the node has been down for months while you fight with your vendor's
 301 TAC.  The subject ``ceph-mon`` daemon might be unable to find the surviving
 302 monitors (e.g., say ``mon.c`` is down; you add a new monitor ``mon.d``,
 303 then remove ``mon.a``, then add a new monitor ``mon.e`` and remove
 304 ``mon.b``; you will end up with a totally different monmap from the one
 305 ``mon.c`` knows).
 306
 307 In this situation you have two possible solutions:
 308
 309 Scrap the monitor and redeploy
 310
 311   You should only take this route if you are positive that you won't
 312   lose the information kept by that monitor; that you have other monitors
 313   and that they are running just fine so that your new monitor is able
 314   to synchronize from the remaining monitors. Keep in mind that destroying
 315   a monitor, if there are no other copies of its contents, may lead to
 316   loss of data.
 317
 318 Inject a monmap into the monitor
 319
 320   Usually the safest path. You should grab the monmap from the remaining
 321   monitors and inject it into the monitor with the corrupted/lost monmap.
 322
 323   These are the basic steps:
 324
 325   1. Is there a formed quorum? If so, grab the monmap from the quorum::
 326
 327       $ ceph mon getmap -o /tmp/monmap
 328
 329   2. No quorum? Grab the monmap directly from another monitor (this
 330      assumes the monitor you are grabbing the monmap from has id ID-FOO
 331      and has been stopped)::
 332
 333       $ ceph-mon -i ID-FOO --extract-monmap /tmp/monmap
 334
 335   3. Stop the monitor you are going to inject the monmap into.
 336
 337   4. Inject the monmap::
 338
 339       $ ceph-mon -i ID --inject-monmap /tmp/monmap
 340
 341   5. Start the monitor
 342
 343   Please keep in mind that the ability to inject monmaps is a powerful
 344   feature that can cause havoc with your monitors if misused as it will
 345   overwrite the latest, existing monmap kept by the monitor.
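
  When a quorum exists, the steps above can be chained as follows. This is a
  sketch under assumptions: the function name is hypothetical, and the
  ``ceph-mon@ID`` systemd unit name applies to package-based deployments only
  (containerized deployments manage the daemon differently). The
  ``monmaptool --print`` call is an added sanity check that displays the map
  before anything is overwritten.

  .. code-block:: bash

     # Fetch the monmap from the quorum and inject it into monitor ID.
     # Each step runs only if the previous one succeeded.
     inject_monmap_from_quorum() {
       local id=$1 map=${2:-/tmp/monmap}
       ceph mon getmap -o "$map" &&
         monmaptool --print "$map" &&
         systemctl stop "ceph-mon@$id" &&
         ceph-mon -i "$id" --inject-monmap "$map" &&
         systemctl start "ceph-mon@$id"
     }
     # Example: inject_monmap_from_quorum ID

  Review the printed map before relying on the result: if it does not list the
  monitors you expect, stop and investigate rather than injecting it.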
 346
 347
 348 Clock Skews
 349 ------------
 350
 351 Monitor operation can be severely affected by clock skew among the quorum's
 352 mons, as the PAXOS consensus algorithm requires tight time alignment.
 353 Skew can result in weird behavior with no obvious
 354 cause. To avoid such issues, you must run a clock synchronization tool
 355 on your monitor nodes:  ``Chrony`` or the legacy ``ntpd``.  Be sure to
configure the mon nodes with the ``iburst`` option and multiple peers:
 357
 358 * Each other
 359 * Internal ``NTP`` servers
 360 * Multiple external, public pool servers
 361
 362 For good measure, *all* nodes in your cluster should also sync against
 363 internal and external servers, and perhaps even your mons.  ``NTP`` servers
 364 should run on bare metal; VM virtualized clocks are not suitable for steady
 365 timekeeping.  Visit `https://www.ntp.org <https://www.ntp.org>`_ for more info.  Your
 366 organization may already have quality internal ``NTP`` servers you can use.
 367 Sources for ``NTP`` server appliances include:
 368
 369 * Microsemi (formerly Symmetricom) `https://microsemi.com <https://www.microsemi.com/product-directory/3425-timing-synchronization>`_
 370 * EndRun `https://endruntechnologies.com <https://endruntechnologies.com/products/ntp-time-servers>`_
 371 * Netburner `https://www.netburner.com <https://www.netburner.com/products/network-time-server/pk70-ex-ntp-network-time-server>`_
 372
 373
 374 What's the maximum tolerated clock skew?
 375
 376   By default the monitors will allow clocks to drift up to 0.05 seconds (50 ms).
 377
 378
 379 Can I increase the maximum tolerated clock skew?
 380
 381   The maximum tolerated clock skew is configurable via the
 382   ``mon-clock-drift-allowed`` option, and
 383   although you *CAN* you almost certainly *SHOULDN'T*. The clock skew mechanism
 384   is in place because clock-skewed monitors are likely to misbehave. We, as
 385   developers and QA aficionados, are comfortable with the current default
  value, as it will alert the user before the monitors get out of hand. Changing
 387   this value may cause unforeseen effects on the
 388   stability of the monitors and overall cluster health.
 389
 390 How do I know there's a clock skew?
 391
 392   The monitors will warn you via the cluster status ``HEALTH_WARN``. ``ceph
 393   health detail`` or ``ceph status`` should show something like::
 394
 395       mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s)
 396
 397   That means that ``mon.c`` has been flagged as suffering from a clock skew.
 398
 399   On releases beginning with Luminous you can issue the ``ceph
 400   time-sync-status`` command to check status.  Note that the lead mon is
 401   typically the one with the numerically lowest IP address.  It will always
 402   show ``0``: the reported offsets of other mons are relative to the lead mon,
 403   not to any external reference source.
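
  To pick the affected monitors out of a long health report, the skew lines can
  be filtered. This sketch assumes the line format shown above (the word
  ``skew`` followed by a value such as ``0.08235s``); other releases may word
  the message differently. The function name is hypothetical.

  .. code-block:: bash

     # List monitors flagged for clock skew together with the reported skew.
     # Reads "ceph health detail" output on standard input.
     skewed_mons() {
       awk '/clock skew/ {
              for (i = 1; i < NF; i++)
                # Only accept "skew" when the next field looks like "0.08235s".
                if ($i == "skew" && $(i + 1) ~ /^-?[0-9.]+s$/)
                  print $1, $(i + 1)
            }'
     }
     # Example: ceph health detail | skewed_mons

  For the example line above, this prints ``mon.c 0.08235s``.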
 404
 405
 406 What should I do if there's a clock skew?
 407
  Synchronize your clocks. Running an NTP client may help. If you are already
  using one and you hit this sort of issue, check whether you are using an NTP
  server remote to your network and consider hosting your own NTP server on
  your network.  This last option tends to reduce the number of issues with
  monitor clock skews.
 413
 414
 415 Client Can't Connect or Mount
 416 ------------------------------
 417
 418 Check your IP tables. Some OS install utilities add a ``REJECT`` rule to
 419 ``iptables``. The rule rejects all clients trying to connect to the host except
 420 for ``ssh``. If your monitor host's IP tables have such a ``REJECT`` rule in
 421 place, clients connecting from a separate node will fail to mount with a timeout
 422 error. You need to address ``iptables`` rules that reject clients trying to
 423 connect to Ceph daemons.  For example, you would need to address rules that look
 424 like this appropriately::
 425
 426         REJECT all -- anywhere anywhere reject-with icmp-host-prohibited
 427
You may also need to add rules to IP tables on your Ceph hosts to ensure
that clients can access the ports associated with your Ceph monitors (i.e., ports
3300 and 6789 by default) and Ceph OSDs (i.e., 6800 through 7300 by default). For
example::

        iptables -A INPUT -m multiport -p tcp -s {ip-address}/{netmask} --dports 3300,6789,6800:7300 -j ACCEPT
 434
 435 Monitor Store Failures
 436 ======================
 437
 438 Symptoms of store corruption
 439 ----------------------------
 440
Ceph monitors store the :term:`Cluster Map` in a key/value store such as LevelDB. If
a monitor fails due to corruption of the key/value store, the following error messages
might be found in the monitor log::
 444
 445   Corruption: error in middle of record
 446
 447 or::
 448
 449   Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.foo/store.db/1234567.ldb
 450
 451 Recovery using healthy monitor(s)
 452 ---------------------------------
 453
 454 If there are any survivors, we can always :ref:`replace <adding-and-removing-monitors>` the corrupted one with a
 455 new one. After booting up, the new joiner will sync up with a healthy
 456 peer, and once it is fully sync'ed, it will be able to serve the clients.
 457
 458 .. _mon-store-recovery-using-osds:
 459
 460 Recovery using OSDs
 461 -------------------
 462
But what if all monitors fail at the same time? Since users are encouraged to
deploy at least three (and preferably five) monitors in a Ceph cluster,
simultaneous failure is unlikely. But unplanned power-downs in a data center with
improperly configured disk/fs settings could corrupt the underlying file system,
and hence kill all the monitors. In this case, we can recover the monitor store
with the information stored in OSDs.
 469
 470 .. code-block:: bash
 471
 472   ms=/root/mon-store
 473   mkdir $ms
 474
  # collect the cluster map from stopped OSDs; $hosts must already hold the
  # space-separated list of OSD hostnames
 476   for host in $hosts; do
 477     rsync -avz $ms/. user@$host:$ms.remote
 478     rm -rf $ms
 479     ssh user@$host <<EOF
 480       for osd in /var/lib/ceph/osd/ceph-*; do
 481         ceph-objectstore-tool --data-path \$osd --no-mon-config --op update-mon-db --mon-store-path $ms.remote
 482       done
 483   EOF
 484     rsync -avz user@$host:$ms.remote/. $ms
 485   done
 486
 487   # rebuild the monitor store from the collected map, if the cluster does not
 488   # use cephx authentication, we can skip the following steps to update the
 489   # keyring with the caps, and there is no need to pass the "--keyring" option.
 490   # i.e. just use "ceph-monstore-tool $ms rebuild" instead
 491   ceph-authtool /path/to/admin.keyring -n mon. \
 492     --cap mon 'allow *'
 493   ceph-authtool /path/to/admin.keyring -n client.admin \
 494     --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'
 495   # add one or more ceph-mgr's key to the keyring. in this case, an encoded key
 496   # for mgr.x is added, you can find the encoded key in
 497   # /etc/ceph/${cluster}.${mgr_name}.keyring on the machine where ceph-mgr is
 498   # deployed
 499   ceph-authtool /path/to/admin.keyring --add-key 'AQDN8kBe9PLWARAAZwxXMr+n85SBYbSlLcZnMA==' -n mgr.x \
 500     --cap mon 'allow profile mgr' --cap osd 'allow *' --cap mds 'allow *'
  # If your monitors' ids are not sorted by ip address, please specify them in order.
  # For example, if mon 'a' is 10.0.0.3, mon 'b' is 10.0.0.2, and mon 'c' is 10.0.0.4,
  # please pass "--mon-ids b a c".
  # In addition, if your monitors' ids are not single characters like 'a', 'b', 'c', please
  # specify them in the command line by passing them as arguments of the "--mon-ids"
  # option. If you are not sure, please check your ceph.conf to see if there are any
  # sections named like '[mon.foo]'. Don't pass the "--mon-ids" option if you are
  # using DNS SRV for looking up monitors.
 509   ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring --mon-ids alpha beta gamma
 510
 511   # make a backup of the corrupted store.db just in case!  repeat for
 512   # all monitors.
 513   mv /var/lib/ceph/mon/mon.foo/store.db /var/lib/ceph/mon/mon.foo/store.db.corrupted
 514
  # move the rebuilt store.db into place.  repeat for all monitors.
 516   mv $ms/store.db /var/lib/ceph/mon/mon.foo/store.db
 517   chown -R ceph:ceph /var/lib/ceph/mon/mon.foo/store.db
 518
 519 The steps above
 520
#. collect the map from all OSD hosts,
#. fill the entities in the keyring file with appropriate caps,
#. then rebuild the store,
#. replace the corrupted store on ``mon.foo`` with the recovered copy.
 525
 526 Known limitations
 527 ~~~~~~~~~~~~~~~~~
 528
The following information is not recoverable using the steps above:
 530
 531 - **some added keyrings**: all the OSD keyrings added using ``ceph auth add`` command
 532   are recovered from the OSD's copy. And the ``client.admin`` keyring is imported
 533   using ``ceph-monstore-tool``. But the MDS keyrings and other keyrings are missing
 534   in the recovered monitor store. You might need to re-add them manually.
 535
- **creating pools**: If any RADOS pools were in the process of being created,
  that state is lost.  The recovery tool assumes that all pools have been
  created.  If there are PGs that are stuck in the ``unknown`` state after the
  recovery for a partially created pool, you can force creation of the *empty*
  PG with the ``ceph osd force-create-pg`` command.  Note that this will create
  an *empty* PG, so only do this if you know the pool is empty.
 537
 538 - **MDS Maps**: the MDS maps are lost.
 539
 540
 541
 542 Everything Failed! Now What?
 543 =============================
 544
 545 Reaching out for help
 546 ----------------------
 547
 548 You can find us on IRC at #ceph and #ceph-devel at OFTC (server irc.oftc.net)
 549 and on ``dev@ceph.io`` and ``ceph-users@lists.ceph.com``. Make
sure you have grabbed your logs and have them ready if someone asks: the faster
you can respond, the better everyone's time is used.
 553
 554
 555 Preparing your logs
 556 ---------------------
 557
 558 Monitor logs are, by default, kept in ``/var/log/ceph/ceph-mon.FOO.log*``. We
 559 may want them. However, your logs may not have the necessary information. If
 560 you don't find your monitor logs at their default location, you can check
 561 where they should be by running::
 562
 563   ceph-conf --name mon.FOO --show-config-value log_file
 564
The amount of information in the logs is subject to the debug levels being
enforced by your configuration files. If you have not enforced a specific
debug level then Ceph is using the default levels and your logs may not
contain important information to track down your issue.
 569 A first step in getting relevant information into your logs will be to raise
 570 debug levels. In this case we will be interested in the information from the
 571 monitor.
 572 Similarly to what happens on other components, different parts of the monitor
 573 will output their debug information on different subsystems.
 574
 575 You will have to raise the debug levels of those subsystems more closely
 576 related to your issue. This may not be an easy task for someone unfamiliar
 577 with troubleshooting Ceph. For most situations, setting the following options
 578 on your monitors will be enough to pinpoint a potential source of the issue::
 579
 580       debug_mon = 10
 581       debug_ms = 1
 582
If we find that these debug levels are not enough, there's a chance we may
ask you to raise them or even define other debug subsystems to obtain
information from. At least we will have started off with some useful
information, instead of a massively empty log without much to go on.
 587
 588 Do I need to restart a monitor to adjust debug levels?
 589 ------------------------------------------------------
 590
 591 No. You may do it in one of two ways:
 592
 593 You have quorum
 594
 595   Either inject the debug option into the monitor you want to debug::
 596
 597         ceph tell mon.FOO config set debug_mon 10/10
 598
 599   or into all monitors at once::
 600
 601         ceph tell mon.* config set debug_mon 10/10
 602
 603 No quorum
 604
 605   Use the monitor's admin socket and directly adjust the configuration
 606   options::
 607
 608       ceph daemon mon.FOO config set debug_mon 10/10
 609
 610
 611 Going back to default values is as easy as rerunning the above commands
 612 using the debug level ``1/10`` instead.  You can check your current
 613 values using the admin socket and the following commands::
 614
 615       ceph daemon mon.FOO config show
 616
 617 or::
 618
 619       ceph daemon mon.FOO config get 'OPTION_NAME'
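
Raising and restoring the level on several monitors can be wrapped in one
helper. This is a sketch, not an official tool: the function name
``set_mon_debug`` is hypothetical, and because it uses ``ceph daemon`` it must
be run on the host where the listed monitors' admin sockets live.

.. code-block:: bash

   # Set debug_mon to LEVEL on each listed local monitor via its admin socket.
   set_mon_debug() {
     local level=$1 id
     shift
     for id in "$@"; do
       ceph daemon "mon.$id" config set debug_mon "$level" || return 1
     done
   }
   # Example:
   #   set_mon_debug 10/10 FOO   # raise before reproducing the problem
   #   set_mon_debug 1/10 FOO    # restore afterwards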
 620
 621
 622 Reproduced the problem with appropriate debug levels. Now what?
 623 ----------------------------------------------------------------
 624
Ideally you would send us only the relevant portions of your logs.
We realise that figuring out the corresponding portion may not be the
easiest of tasks. Therefore, we won't hold it against you if you provide the
full log, but common sense should be employed. If your log has hundreds
of thousands of lines, it may get tricky to go through the whole thing,
especially if we do not know at which point your issue
happened. For instance, when reproducing, remember to write down the
current time and date and to extract the relevant portions of your logs
based on that.
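
Extracting a time window from a log can be sketched as follows. The helper
name ``log_window`` is hypothetical. It assumes that every log line begins
with a timestamp that sorts lexicographically (as in
``2013-10-30 04:12:01.945629``) and that the start and end bounds are written
in the same format and to the same precision as each other.

.. code-block:: bash

   # Print the lines of FILE whose leading timestamp lies within [START, END].
   log_window() {
     local start=$1 end=$2 file=$3
     awk -v s="$start" -v e="$end" '
       # Compare only as many leading characters as the bounds provide.
       { t = substr($0, 1, length(s)) }
       t >= s && t <= e
     ' "$file"
   }
   # Example:
   #   log_window "2013-10-30 04:10" "2013-10-30 04:15" /var/log/ceph/ceph-mon.FOO.log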
 634
 635 Finally, you should reach out to us on the mailing lists, on IRC or file
 636 a new issue on the `tracker`_.
 637
 638 .. _tracker: http://tracker.ceph.com/projects/ceph/issues/new