=================================
 Troubleshooting Monitors
=================================

.. index:: monitor, high availability

When a cluster encounters monitor-related troubles there's a tendency to
panic, and sometimes with good reason. Losing one or more monitors doesn't
necessarily mean that your cluster is down, so long as a majority are up,
running, and form a quorum.
Regardless of how bad the situation is, the first thing you should do is to
calm down, take a breath, and step through the troubleshooting steps below.


Initial Troubleshooting
========================


**Are the monitors running?**

  First of all, we need to make sure the monitor (*mon*) daemon processes
  (``ceph-mon``) are running.  You would be amazed by how often Ceph admins
  forget to start the mons, or to restart them after an upgrade. There's no
  shame, but try to not lose a couple of hours looking for a deeper problem.
  When running Kraken or later releases also ensure that the manager
  daemons (``ceph-mgr``) are running, usually alongside each ``ceph-mon``.


**Are you able to reach the mon nodes?**

  This doesn't happen often, but sometimes there are ``iptables`` rules that
  block access to mon nodes or TCP ports. These may be leftovers from
  prior stress-testing or rule development. Try SSHing into
  the server and, if that succeeds, try connecting to the monitor's ports
  (``tcp/3300`` and ``tcp/6789``) using ``telnet``, ``nc``, or a similar tool.
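
  As a rough sketch of that port check, the function below wraps ``nc`` in
  its connect-only mode. It assumes a netcat build that supports the ``-z``
  and ``-w`` options; ``mon_port_open`` and the host name in the usage lines
  are placeholders made up for this example::

    # Succeeds if a TCP connection to HOST:PORT opens within 3 seconds.
    mon_port_open() {
      nc -z -w 3 "$1" "$2"
    }

    # Usage (mon1.example.com is a placeholder):
    #   for port in 3300 6789; do
    #     mon_port_open mon1.example.com "$port" || echo "port $port unreachable"
    #   done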

**Does ceph -s run and obtain a reply from the cluster?**

  If the answer is yes then your cluster is up and running.  One thing you
  can take for granted is that the monitors will only answer to a ``status``
  request if there is a formed quorum.  Also check that at least one ``mgr``
  daemon is reported as running, ideally all of them.

  If ``ceph -s`` hangs without obtaining a reply from the cluster
  or showing ``fault`` messages, then it is likely that your monitors
  are either down completely or just a fraction are up -- a fraction
  insufficient to form a majority quorum.  This check will connect to an
  arbitrary mon; in rare cases it may be illuminating to bind to specific
  mons in sequence by adding e.g. ``-m mymon1`` to the command.
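
  The majority rule is simple arithmetic. A minimal sketch follows;
  ``quorum_needed`` is a name made up for this example, not a Ceph command::

    # Smallest number of monitors that forms a majority out of N.
    quorum_needed() {
      echo $(( $1 / 2 + 1 ))
    }

    quorum_needed 3   # prints 2
    quorum_needed 5   # prints 3

  So a three-monitor cluster tolerates one monitor down, and a five-monitor
  cluster tolerates two.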

**What if ceph -s doesn't come back?**

  If you haven't gone through all the steps so far, please go back and do so.

  You can contact each monitor individually asking them for their status,
  regardless of a quorum being formed. This can be achieved using
  ``ceph tell mon.ID mon_status``, ID being the monitor's identifier. You should
  perform this for each monitor in the cluster. In section `Understanding
  mon_status`_ we will explain how to interpret the output of this command.
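
  One way to do this for every monitor is a small loop. The sketch below
  assumes monitor IDs ``a``, ``b`` and ``c`` in the usage line, and
  ``query_mons`` is a helper name invented here. The ``--connect-timeout``
  option keeps an unreachable monitor from stalling the loop::

    # Ask each listed monitor for its status, one at a time, so that a
    # single unreachable daemon does not hide the others.
    query_mons() {
      local id
      for id in "$@"; do
        echo "== mon.$id =="
        ceph --connect-timeout 5 tell "mon.$id" mon_status \
          || echo "mon.$id did not answer"
      done
    }

    # Usage: query_mons a b c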

  You may instead SSH into each mon node and query the daemon's admin socket.


Using the monitor's admin socket
=================================

The admin socket allows you to interact with a given daemon directly using a
Unix socket file. This file can be found in your monitor's ``run`` directory.
By default, the admin socket will be kept in ``/var/run/ceph/ceph-mon.ID.asok``
but this may be elsewhere if you have overridden the default directory. If you
don't find it there, check your ``ceph.conf`` for an alternative path or
run::

  ceph-conf --name mon.ID --show-config-value admin_socket

Bear in mind that the admin socket will be available only while the monitor
daemon is running. When the monitor is properly shut down, the admin socket
will be removed. If however the monitor is not running and the admin socket
persists, it is likely that the monitor was improperly shut down.
Regardless, if the monitor is not running, you will not be able to use the
admin socket, with ``ceph`` likely returning ``Error 111: Connection Refused``.

Accessing the admin socket is as simple as running ``ceph tell`` on the daemon
you are interested in. For example::

  ceph tell mon.<id> mon_status

Under the hood, this passes the command ``mon_status`` to the running MON daemon
``<id>`` via its "admin socket", which is a file ending in ``.asok``
somewhere under ``/var/run/ceph``. Once you know the full path to the file,
you can even do this yourself::

  ceph --admin-daemon <full_path_to_asok_file> <command>

Using ``help`` as the command to the ``ceph`` tool will show you the
supported commands available through the admin socket. Please take a look
at ``config get``, ``config show``, ``mon stat`` and ``quorum_status``,
as those can be enlightening when troubleshooting a monitor.
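
A small wrapper can check that the socket file exists before calling it,
which separates "monitor not running" from other failures. This is only a
sketch: ``mon_asok`` and the ``MON_RUN_DIR`` variable are hypothetical names
introduced here, and the default directory is the one given above::

  # Run an admin-socket command against a local monitor, checking first
  # that the socket file is present.
  mon_asok() {
    local id="$1"; shift
    local sock="${MON_RUN_DIR:-/var/run/ceph}/ceph-mon.$id.asok"
    [ -S "$sock" ] || {
      echo "no admin socket at $sock (is mon.$id running?)" >&2
      return 1
    }
    ceph --admin-daemon "$sock" "$@"
  }

  # Usage: mon_asok a help
  #        mon_asok a quorum_status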


Understanding mon_status
=========================

``mon_status`` can always be obtained via the admin socket. This command will
output a multitude of information about the monitor, including the same output
you would get with ``quorum_status``.

Take the following example output of ``ceph tell mon.c mon_status``::

  { "name": "c",
    "rank": 2,
    "state": "peon",
    "election_epoch": 38,
    "quorum": [
          1,
          2],
    "outside_quorum": [],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": { "epoch": 3,
        "fsid": "5c4e9d53-e2e1-478a-8061-f543f8be4cf8",
        "modified": "2013-10-30 04:12:01.945629",
        "created": "2013-10-29 14:14:41.914786",
        "mons": [
              { "rank": 0,
                "name": "a",
                "addr": "127.0.0.1:6789\/0"},
              { "rank": 1,
                "name": "b",
                "addr": "127.0.0.1:6790\/0"},
              { "rank": 2,
                "name": "c",
                "addr": "127.0.0.1:6795\/0"}]}}

A couple of things are obvious: we have three monitors in the monmap (*a*, *b*
and *c*), the quorum is formed by only two monitors, and *c* is in the quorum
as a *peon*.

Which monitor is out of the quorum?

  The answer would be **a**.

Why?

  Take a look at the ``quorum`` set. We have two monitors in this set: *1*
  and *2*. These are not monitor names. These are monitor ranks, as established
  in the current monmap. We are missing the monitor with rank 0, and according
  to the monmap that would be ``mon.a``.
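
  The same comparison can be automated. The sketch below assumes the ``jq``
  utility is installed (it is not part of Ceph) and that the output has the
  ``quorum`` and ``monmap.mons`` fields shown in the example above;
  ``mons_outside_quorum`` is a name invented for this example::

    # Reads mon_status JSON on stdin and lists the monmap members whose
    # rank does not appear in the "quorum" array.
    mons_outside_quorum() {
      jq -r '.quorum as $q
             | .monmap.mons[]
             | select(.rank as $r | ($q | index($r)) == null)
             | "mon.\(.name) (rank \(.rank)) is not in quorum"'
    }

    # Usage: ceph tell mon.c mon_status | mons_outside_quorum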

By the way, how are ranks established?

  Ranks are (re)calculated whenever you add or remove monitors and follow a
  simple rule: the **lower** the ``IP:PORT`` combination, the **lower** the
  rank is. In this case, considering that ``127.0.0.1:6789`` is lower than all
  the remaining ``IP:PORT`` combinations, ``mon.a`` has rank 0.

Most Common Monitor Issues
===========================

Have Quorum but at least one Monitor is down
---------------------------------------------

When this happens, depending on the version of Ceph you are running,
you should be seeing something similar to::

      $ ceph health detail
      [snip]
      mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum)

How to troubleshoot this?

  First, make sure ``mon.a`` is running.

  Second, make sure you are able to connect to ``mon.a``'s node from the
  other mon nodes. Check the TCP ports as well. Check ``iptables`` and
  ``nf_conntrack`` on all nodes and ensure that you are not
  dropping/rejecting connections.

  If this initial troubleshooting doesn't solve your problems, then it's
  time to go deeper.

  First, check the problematic monitor's ``mon_status`` via the admin
  socket as explained in `Using the monitor's admin socket`_ and
  `Understanding mon_status`_.

  If the monitor is out of the quorum, its state should be one of
  ``probing``, ``electing`` or ``synchronizing``. If it happens to be either
  ``leader`` or ``peon``, then the monitor believes it is in quorum, while
  the rest of the cluster is sure it is not; or maybe it got into the quorum
  while we were troubleshooting the monitor, so check ``ceph -s`` again
  just to make sure. Proceed if the monitor is not yet in the quorum.

What if the state is ``probing``?

  This means the monitor is still looking for the other monitors. Every time
  you start a monitor, the monitor will stay in this state for some time
  while trying to connect to the rest of the monitors specified in the ``monmap``.
  The time a monitor will spend in this state can vary. For instance, when on
  a single-monitor cluster (never do this in production),
  the monitor will pass through the probing state almost instantaneously.
  In a multi-monitor cluster, the monitors will stay in this state until they
  find enough monitors to form a quorum -- this means that if you have 2 out
  of 3 monitors down, the one remaining monitor will stay in this state
  indefinitely until you bring one of the other monitors up.

  If you have a quorum the starting daemon should be able to find the
  other monitors quickly, as long as they can be reached. If your
  monitor is stuck probing and you have gone through all the communication
  troubleshooting, then there is a fair chance that the monitor is trying
  to reach the other monitors on a wrong address. ``mon_status`` outputs the
  ``monmap`` known to the monitor: check if the other monitors' locations
  match reality. If they don't, jump to
  `Recovering a Monitor's Broken monmap`_; if they do, then it may be related
  to severe clock skews amongst the monitor nodes and you should refer to
  `Clock Skews`_ first, but if that doesn't solve your problem then it is
  time to prepare some logs and reach out to the community (please refer
  to `Preparing your logs`_ on how to best prepare your logs).


What if state is ``electing``?

  This means the monitor is in the middle of an election. With recent Ceph
  releases these typically complete quickly, but at times the monitors can
  get stuck in what is known as an *election storm*. This can indicate
  clock skew among the monitor nodes; jump to
  `Clock Skews`_ for more information. If all your clocks are properly
  synchronized, you should search the mailing lists and tracker.
  This is not a state that is likely to persist, and aside from
  (*really*) old bugs there is no obvious reason other than clock skew
  why this would happen.  Worst case, if there are enough surviving mons,
  take down the problematic one while you investigate.

What if state is ``synchronizing``?

  This means the monitor is catching up with the rest of the cluster in
  order to join the quorum. Time to synchronize is a function of the size
  of your monitor store and thus of cluster size and state, so if you have a
  large or degraded cluster this may take a while.

  If you notice that the monitor jumps from ``synchronizing`` to
  ``electing`` and then back to ``synchronizing``, then you do have a
  problem: the cluster state may be advancing (i.e., generating new maps)
  too fast for the synchronization process to keep up. This was more common
  in the early days (Cuttlefish), but since then the synchronization process
  has been refactored and enhanced to avoid this dynamic. If you experience
  this in later versions please let us know via the bug tracker, and bring some logs
  (see `Preparing your logs`_).

What if state is ``leader`` or ``peon``?

  This should not happen:  famous last words.  If it does, however, it likely
  has a lot to do with clock skew -- see `Clock Skews`_. If you are not
  suffering from clock skew, then please prepare your logs (see
  `Preparing your logs`_) and reach out to the community.


Recovering a Monitor's Broken ``monmap``
----------------------------------------

This is how a ``monmap`` usually looks, depending on the number of
monitors::

      epoch 3
      fsid 5c4e9d53-e2e1-478a-8061-f543f8be4cf8
      last_changed 2013-10-30 04:12:01.945629
      created 2013-10-29 14:14:41.914786
      0: 127.0.0.1:6789/0 mon.a
      1: 127.0.0.1:6790/0 mon.b
      2: 127.0.0.1:6795/0 mon.c

This may not be what you have however. For instance, in some versions of
early Cuttlefish there was a bug that could cause your ``monmap``
to be nullified.  Completely filled with zeros. This means that not even
``monmaptool`` would be able to make sense of cold, hard, inscrutable zeros.
It's also possible to end up with a monitor with a severely outdated monmap,
notably if the node has been down for months while you fight with your vendor's
TAC.  The subject ``ceph-mon`` daemon might be unable to find the surviving
monitors (e.g., say ``mon.c`` is down; you add a new monitor ``mon.d``,
then remove ``mon.a``, then add a new monitor ``mon.e`` and remove
``mon.b``; you will end up with a totally different monmap from the one
``mon.c`` knows).

In this situation you have two possible solutions:

Scrap the monitor and redeploy

  You should only take this route if you are positive that you won't
  lose the information kept by that monitor; that you have other monitors
  and that they are running just fine so that your new monitor is able
  to synchronize from the remaining monitors. Keep in mind that destroying
  a monitor, if there are no other copies of its contents, may lead to
  loss of data.

Inject a monmap into the monitor

  Usually the safest path. You should grab the monmap from the remaining
  monitors and inject it into the monitor with the corrupted/lost monmap.

  These are the basic steps:

  1. Is there a formed quorum? If so, grab the monmap from the quorum::

      $ ceph mon getmap -o /tmp/monmap

  2. No quorum? Grab the monmap directly from another monitor (this
     assumes the monitor you are grabbing the monmap from has id ID-FOO
     and has been stopped)::

      $ ceph-mon -i ID-FOO --extract-monmap /tmp/monmap

  3. Stop the monitor you are going to inject the monmap into.

  4. Inject the monmap::

      $ ceph-mon -i ID --inject-monmap /tmp/monmap

  5. Start the monitor

  Please keep in mind that the ability to inject monmaps is a powerful
  feature that can cause havoc with your monitors if misused as it will
  overwrite the latest, existing monmap kept by the monitor.
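
  Because injection overwrites the existing map, it may be worth looking at
  the extracted file before step 4. The sketch below uses
  ``monmaptool --print``; the ``show_monmap`` wrapper is a name invented for
  this example and only adds a check that the file is present and
  non-empty::

    # Print a monmap file in human-readable form, refusing to continue if
    # the file is missing or empty.
    show_monmap() {
      local map="$1"
      [ -s "$map" ] || {
        echo "monmap $map is missing or empty" >&2
        return 1
      }
      monmaptool --print "$map"
    }

    # Usage: show_monmap /tmp/monmap

  The printed monitor names and addresses should match the monitors you
  expect to be in the cluster.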


Clock Skews
------------

Monitor operation can be severely affected by clock skew among the quorum's
mons, as the PAXOS consensus algorithm requires tight time alignment.
Skew can result in weird behavior with no obvious
cause. To avoid such issues, you must run a clock synchronization tool
on your monitor nodes:  ``Chrony`` or the legacy ``ntpd``.  Be sure to
configure the mon nodes with the ``iburst`` option and multiple peers:

* Each other
* Internal ``NTP`` servers
* Multiple external, public pool servers

For good measure, *all* nodes in your cluster should also sync against
internal and external servers, and perhaps even your mons.  ``NTP`` servers
should run on bare metal; VM virtualized clocks are not suitable for steady
timekeeping.  Visit `https://www.ntp.org <https://www.ntp.org>`_ for more info.  Your
organization may already have quality internal ``NTP`` servers you can use.
Sources for ``NTP`` server appliances include:

* Microsemi (formerly Symmetricom) `https://microsemi.com <https://www.microsemi.com/product-directory/3425-timing-synchronization>`_
* EndRun `https://endruntechnologies.com <https://endruntechnologies.com/products/ntp-time-servers>`_
* Netburner `https://www.netburner.com <https://www.netburner.com/products/network-time-server/pk70-ex-ntp-network-time-server>`_


What's the maximum tolerated clock skew?

  By default the monitors will allow clocks to drift up to 0.05 seconds (50 ms).


Can I increase the maximum tolerated clock skew?

  The maximum tolerated clock skew is configurable via the
  ``mon-clock-drift-allowed`` option, and
  although you *CAN* you almost certainly *SHOULDN'T*. The clock skew mechanism
  is in place because clock-skewed monitors are likely to misbehave. We, as
  developers and QA aficionados, are comfortable with the current default
  value, as it will alert the user before the monitors get out of hand. Changing
  this value may cause unforeseen effects on the
  stability of the monitors and overall cluster health.

How do I know there's a clock skew?

  The monitors will warn you via the cluster status ``HEALTH_WARN``. ``ceph health
  detail`` or ``ceph status`` should show something like::

      mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s)

  That means that ``mon.c`` has been flagged as suffering from a clock skew.

  On releases beginning with Luminous you can issue the
  ``ceph time-sync-status`` command to check status.  Note that the lead mon
  is typically the one with the numerically lowest IP address.  It will always
  show ``0``: the reported offsets of other mons are relative to
  the lead mon, not to any external reference source.


What should I do if there's a clock skew?

  Synchronize your clocks. Running an NTP client may help. If you are already
  using one and you hit this sort of issue, check whether you are using an NTP
  server remote to your network and consider hosting your own NTP server on
  your network.  This last option tends to reduce the number of issues with
  monitor clock skews.
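
  To see how far a node's own clock is from its time source, a sketch like
  the following may help. It assumes the node runs ``chronyd`` and that
  ``chronyc tracking`` prints a ``System time`` line in its usual form;
  ``check_local_offset`` is a name invented for this example. Note that the
  value is the offset from the NTP source, not the mon-to-mon skew that Ceph
  reports, so treat it only as a rough indicator::

    # Report this node's offset from its NTP source and compare it with a
    # limit in seconds (default 0.05, the monitors' default tolerance).
    check_local_offset() {
      chronyc tracking | awk -v max="${1:-0.05}" '
        /^System time/ {
          verdict = ($4 + 0 > max + 0) ? "exceeds limit" : "ok"
          printf "offset %s s (%s): %s\n", $4, $6, verdict
          found = 1
        }
        END { exit !found }'
    }

    # Usage: check_local_offset
    #        check_local_offset 0.01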


Client Can't Connect or Mount
------------------------------

Check your IP tables. Some OS install utilities add a ``REJECT`` rule to
``iptables``. The rule rejects all clients trying to connect to the host except
for ``ssh``. If your monitor host's IP tables have such a ``REJECT`` rule in
place, clients connecting from a separate node will fail to mount with a timeout
error. You need to address ``iptables`` rules that reject clients trying to
connect to Ceph daemons.  For example, you would need to address rules that look
like this appropriately::

        REJECT all -- anywhere anywhere reject-with icmp-host-prohibited

You may also need to add rules to IP tables on your Ceph hosts to ensure
that clients can access the ports associated with your Ceph monitors (i.e., ports
3300 and 6789 by default) and Ceph OSDs (i.e., 6800 through 7300 by default). For
example::

        iptables -A INPUT -m multiport -p tcp -s {ip-address}/{netmask} --dports 3300,6789,6800:7300 -j ACCEPT
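
To find out whether such a blocking rule is present, and at which position,
the ``INPUT`` chain can be listed with rule numbers. This is a sketch that
needs root privileges; ``blocking_rules`` is a name invented for this
example::

  # List INPUT-chain lines that reject or drop traffic, with rule numbers.
  blocking_rules() {
    iptables -L INPUT -n --line-numbers | grep -E 'REJECT|DROP'
  }

  # Usage: blocking_rules

Rules are evaluated in order, so an ``ACCEPT`` rule appended with ``-A``
after a catch-all ``REJECT`` is never reached. In that case, inserting the
rule ahead of the ``REJECT`` with ``iptables -I INPUT <rule-number> ...`` is
one option.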

Monitor Store Failures
======================

Symptoms of store corruption
----------------------------

Ceph monitors store the :term:`Cluster Map` in a key/value store such as LevelDB. If
a monitor fails due to key/value store corruption, error messages like the following
might be found in the monitor log::

  Corruption: error in middle of record

or::

  Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.foo/store.db/1234567.ldb

Recovery using healthy monitor(s)
---------------------------------

If there are any survivors, we can always :ref:`replace <adding-and-removing-monitors>` the corrupted one with a
new one. After booting up, the new joiner will sync up with a healthy
peer, and once it is fully sync'ed, it will be able to serve the clients.

Recovery using OSDs
-------------------

But what if all monitors fail at the same time? Since users are encouraged to
deploy at least three (and preferably five) monitors in a Ceph cluster, simultaneous
failure is unlikely. But unplanned power-downs in a data center with improperly
configured disk/fs settings could fail the underlying file system, and hence
kill all the monitors. In this case, we can recover the monitor store with the
information stored in OSDs::

  ms=/root/mon-store
  mkdir $ms

  # collect the cluster map from stopped OSDs ($hosts is assumed to hold the
  # names of the OSD hosts)
  for host in $hosts; do
    rsync -avz $ms/. user@$host:$ms.remote
    rm -rf $ms
    ssh user@$host <<EOF
      for osd in /var/lib/ceph/osd/ceph-*; do
        ceph-objectstore-tool --data-path \$osd --no-mon-config --op update-mon-db --mon-store-path $ms.remote
      done
  EOF
    rsync -avz user@$host:$ms.remote/. $ms
  done

  # rebuild the monitor store from the collected map. if the cluster does not
  # use cephx authentication, we can skip the following steps to update the
  # keyring with the caps, and there is no need to pass the "--keyring" option.
  # i.e. just use "ceph-monstore-tool $ms rebuild" instead
  ceph-authtool /path/to/admin.keyring -n mon. \
    --cap mon 'allow *'
  ceph-authtool /path/to/admin.keyring -n client.admin \
    --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'
  # add one or more ceph-mgr's key to the keyring. in this case, an encoded key
  # for mgr.x is added, you can find the encoded key in
  # /etc/ceph/${cluster}.${mgr_name}.keyring on the machine where ceph-mgr is
  # deployed
  ceph-authtool /path/to/admin.keyring --add-key 'AQDN8kBe9PLWARAAZwxXMr+n85SBYbSlLcZnMA==' -n mgr.x \
    --cap mon 'allow profile mgr' --cap osd 'allow *' --cap mds 'allow *'
  # if your monitors' ids are not single characters like 'a', 'b', 'c', please
  # specify them in the command line by passing them as arguments of the "--mon-ids"
  # option. if you are not sure, please check your ceph.conf to see if there are any
  # sections named like '[mon.foo]'. don't pass the "--mon-ids" option, if you are
  # using DNS SRV for looking up monitors.
  ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring --mon-ids alpha beta gamma

  # make a backup of the corrupted store.db just in case!  repeat for
  # all monitors.
  mv /var/lib/ceph/mon/mon.foo/store.db /var/lib/ceph/mon/mon.foo/store.db.corrupted

  # move the rebuilt store.db into place.  repeat for all monitors.
  mv $ms/store.db /var/lib/ceph/mon/mon.foo/store.db
  chown -R ceph:ceph /var/lib/ceph/mon/mon.foo/store.db

The steps above

#. collect the map from all OSD hosts,
#. fill the entities in the keyring file with appropriate caps,
#. rebuild the store,
#. replace the corrupted store on ``mon.foo`` with the recovered copy.

Known limitations
~~~~~~~~~~~~~~~~~

The following information is not recoverable using the steps above:

- **some added keyrings**: all the OSD keyrings added using the ``ceph auth add`` command
  are recovered from the OSD's copy. And the ``client.admin`` keyring is imported
  using ``ceph-monstore-tool``. But the MDS keyrings and other keyrings are missing
  in the recovered monitor store. You might need to re-add them manually.

- **creating pools**: If any RADOS pools were in the process of being created, that state is lost.  The recovery tool assumes that all pools have been created.  If there are PGs that are stuck in the ``unknown`` state after the recovery for a partially created pool, you can force creation of the *empty* PG with the ``ceph osd force-create-pg`` command.  Note that this will create an *empty* PG, so only do this if you know the pool is empty.

- **MDS Maps**: the MDS maps are lost.



Everything Failed! Now What?
=============================

Reaching out for help
----------------------

You can find us on IRC at #ceph and #ceph-devel at OFTC (server irc.oftc.net)
and on ``ceph-devel@vger.kernel.org`` and ``ceph-users@lists.ceph.com``. Make
sure you have grabbed your logs and have them ready if someone asks: the faster
you can respond, the better use everyone makes of their time.


Preparing your logs
---------------------

Monitor logs are, by default, kept in ``/var/log/ceph/ceph-mon.FOO.log*``. We
may want them. However, your logs may not have the necessary information. If
you don't find your monitor logs at their default location, you can check
where they should be by running::

  ceph-conf --name mon.FOO --show-config-value log_file

The amount of information in the logs is subject to the debug levels being
enforced by your configuration files. If you have not enforced a specific
debug level then Ceph is using the default levels and your logs may not
contain the information needed to track down your issue.
A first step in getting relevant information into your logs will be to raise
debug levels. In this case we will be interested in the information from the
monitor.
Similarly to what happens on other components, different parts of the monitor
will output their debug information on different subsystems.

You will have to raise the debug levels of those subsystems more closely
related to your issue. This may not be an easy task for someone unfamiliar
with troubleshooting Ceph. For most situations, setting the following options
on your monitors will be enough to pinpoint a potential source of the issue::

      debug mon = 10
      debug ms = 1

If we find that these debug levels are not enough, there's a chance we may
ask you to raise them or even define other debug subsystems to obtain information
from -- but at least we started off with some useful information, instead
of a massively empty log without much to go on.

Do I need to restart a monitor to adjust debug levels?
------------------------------------------------------

No. You may do it in one of two ways:

You have quorum

  Either inject the debug option into the monitor you want to debug::

        ceph tell mon.FOO config set debug_mon 10/10

  or into all monitors at once::

        ceph tell mon.* config set debug_mon 10/10

No quorum

  Use the monitor's admin socket and directly adjust the configuration
  options::

      ceph daemon mon.FOO config set debug_mon 10/10


Going back to default values is as easy as rerunning the above commands
using the debug level ``1/10`` instead.  You can check your current
values using the admin socket and the following commands::

      ceph daemon mon.FOO config show

or::

      ceph daemon mon.FOO config get 'OPTION_NAME'


Reproduced the problem with appropriate debug levels. Now what?
----------------------------------------------------------------

Ideally you would send us only the relevant portions of your logs.
We realise that figuring out the corresponding portion may not be the
easiest of tasks. Therefore, we won't hold it against you if you provide the
full log, but common sense should be employed. If your log has hundreds
of thousands of lines, it may get tricky to go through the whole thing,
especially if we don't know at which point your issue occurred.
For instance, when reproducing, remember to write down the
current time and date and to extract the relevant portions of your logs
based on that.
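
As a sketch of that extraction, the function below keeps only the lines
whose leading timestamp falls inside a window. It assumes each log line
starts with a timestamp of the form ``YYYY-MM-DD HH:MM:SS`` (a ``T``
separator, which some releases write, is also accepted); continuation lines
without a timestamp are dropped. ``log_window`` is a name invented for this
example::

  # Print the lines of FILE whose timestamp lies between FROM and UPTO,
  # inclusive. Timestamps are compared as strings.
  log_window() {
    awk -v from="$1" -v upto="$2" '{
      ts = substr($0, 1, 19)
      sub(/T/, " ", ts)
      if (ts >= from && ts <= upto) print
    }' "$3"
  }

  # Usage:
  #   log_window '2013-10-30 04:10:00' '2013-10-30 04:15:00' \
  #     /var/log/ceph/ceph-mon.FOO.log > /tmp/mon.FOO.excerpt.log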

Finally, you should reach out to us on the mailing lists, on IRC or file
a new issue on the `tracker`_.

.. _tracker: http://tracker.ceph.com/projects/ceph/issues/new