=================================
 Troubleshooting Monitors
=================================

.. index:: monitor, high availability

When a cluster encounters monitor-related trouble, there is a tendency
to panic, sometimes with good reason. Keep in mind that losing a
monitor, or even a few of them, does not necessarily mean your cluster
is down: as long as a majority of the monitors is up, running and
forming a quorum, the cluster keeps working. Regardless of how bad the
situation is, the first thing you should do is calm down, take a breath
and work through our initial troubleshooting script.

Initial Troubleshooting
=======================

**Are the monitors running?**

First of all, make sure the monitors are running. You would be amazed
by how often people forget to start the monitors, or to restart them
after an upgrade. There is no shame in that, but let's try not to lose
a couple of hours chasing an issue that is not there.

**Are you able to connect to the monitors' servers?**

It does not happen often, but sometimes there are ``iptables`` rules
that block access to the monitor servers or monitor ports -- usually
leftovers from monitor stress-testing that were forgotten at some
point. Try sshing into the server and, if that succeeds, try
connecting to the monitor's port using your tool of choice (telnet,
nc, ...).

**Does ceph -s run and obtain a reply from the cluster?**

If the answer is yes, then your cluster is up and running. One thing
you can take for granted is that the monitors will only answer a
``status`` request if there is a formed quorum.

If ``ceph -s`` blocks, however, without obtaining a reply from the
cluster or while showing a lot of ``fault`` messages, then it is likely
that your monitors are either down completely, or that only a portion
of them is up -- a portion that is not enough to form a quorum (keep in
mind that a quorum is formed by a majority of monitors).

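To make the majority requirement concrete, the quorum size for ``N``
monitors is ``floor(N/2) + 1``. A quick shell loop (purely
illustrative) shows the figures:

```shell
# A quorum needs floor(N/2) + 1 monitors out of N. Note that adding a
# fourth monitor to a cluster of three does not lower the quorum size.
for n in 1 2 3 4 5; do
    echo "$n monitors -> quorum needs $(( n / 2 + 1 ))"
done
```

This is why an even number of monitors buys you nothing: 3 and 4
monitors both need the same number of survivors relative to failures
tolerated.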
**What if ceph -s doesn't finish?**

If you haven't gone through all the steps so far, please go back and do.

If you are running Emperor 0.72-rc1 or later, you can contact each
monitor individually and ask for its status, regardless of a quorum
being formed. This can be achieved with ``ceph ping mon.ID``, ID being
the monitor's identifier. You should perform this for each monitor in
the cluster. In `Understanding mon_status`_ we will explain how to
interpret the output of this command.

On earlier versions, you will need to ssh into the server and use the
monitor's admin socket. Please jump to
`Using the monitor's admin socket`_.

For other specific issues, keep on reading.

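The per-monitor check described above can be scripted as a small
helper. This is only a sketch: it assumes Emperor 0.72-rc1 or later,
and the ids passed to it (``a b c`` in the example) are placeholders
for your own monitor ids.

```shell
# check_mons: ask each monitor individually for its status, which works
# even without a formed quorum on Emperor 0.72-rc1 and later.
check_mons() {
    for id in "$@"; do
        echo "--- mon.$id ---"
        ceph ping mon."$id" || echo "mon.$id did not respond"
    done
}

# Example (placeholder ids):
# check_mons a b c
```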

Using the monitor's admin socket
================================

The admin socket allows you to interact with a given daemon directly
using a Unix socket file. This file can be found in your monitor's
``run`` directory. By default, the admin socket will be kept in
``/var/run/ceph/ceph-mon.ID.asok``, but this can vary if you defined it
otherwise. If you don't find it there, please check your ``ceph.conf``
for an alternative path or run::

    ceph-conf --name mon.ID --show-config-value admin_socket

Please bear in mind that the admin socket will only be available while
the monitor is running. When the monitor is properly shut down, the
admin socket is removed. If, however, the monitor is not running and
the admin socket still persists, it is likely that the monitor was
improperly shut down. Regardless, if the monitor is not running you
will not be able to use the admin socket, with ``ceph`` likely
returning ``Error 111: Connection Refused``.

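A quick way of telling these cases apart is to test for the socket file
itself. A sketch, using the default path (substitute your monitor's id
for ``ID``, and adjust the path if yours differs):

```shell
# If the socket file is missing, the monitor is not running (or was shut
# down cleanly); if it exists but connections are refused, the monitor
# most likely died without cleaning up.
asok=/var/run/ceph/ceph-mon.ID.asok
if [ -S "$asok" ]; then
    echo "admin socket present: $asok"
else
    echo "no admin socket at $asok -- is the monitor running?"
fi
```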
Accessing the admin socket is as simple as telling the ``ceph`` tool to
use the ``asok`` file. In pre-Dumpling Ceph, this can be achieved by::

    ceph --admin-daemon /var/run/ceph/ceph-mon.<id>.asok <command>

while in Dumpling and beyond you can use the alternate (and
recommended) format::

    ceph daemon mon.<id> <command>

Using ``help`` as the command to the ``ceph`` tool will show you the
supported commands available through the admin socket. Please take a
look at ``config get``, ``config show``, ``mon_status`` and
``quorum_status``, as those can be enlightening when troubleshooting a
monitor.


Understanding mon_status
========================

``mon_status`` can be obtained through the ``ceph`` tool when you have
a formed quorum, or via the admin socket if you don't. This command
will output a multitude of information about the monitor, including the
same output you would get with ``quorum_status``.

Take the following example of ``mon_status``::

  { "name": "c",
    "rank": 2,
    "state": "peon",
    "election_epoch": 38,
    "quorum": [
          1,
          2],
    "outside_quorum": [],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": { "epoch": 3,
        "fsid": "5c4e9d53-e2e1-478a-8061-f543f8be4cf8",
        "modified": "2013-10-30 04:12:01.945629",
        "created": "2013-10-29 14:14:41.914786",
        "mons": [
              { "rank": 0,
                "name": "a",
                "addr": "127.0.0.1:6789\/0"},
              { "rank": 1,
                "name": "b",
                "addr": "127.0.0.1:6790\/0"},
              { "rank": 2,
                "name": "c",
                "addr": "127.0.0.1:6795\/0"}]}}

A couple of things are obvious: we have three monitors in the monmap
(*a*, *b* and *c*), the quorum is formed by only two monitors, and *c*
is in the quorum as a *peon*.

Which monitor is out of the quorum?

The answer would be **a**.

Why?

Take a look at the ``quorum`` set. We have two monitors in this set:
*1* and *2*. These are not monitor names; they are monitor ranks, as
established in the current monmap. We are missing the monitor with rank
0, and according to the monmap that would be ``mon.a``.

By the way, how are ranks established?

Ranks are (re)calculated whenever you add or remove monitors, and they
follow a simple rule: monitors are ordered by their ``IP:PORT``
combination, and the lowest combination gets the lowest rank. In this
case, considering that ``127.0.0.1:6789`` is lower than all the
remaining ``IP:PORT`` combinations, ``mon.a`` has rank 0.

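You can see this ordering at work by sorting the addresses from the
example monmap (a plain lexicographic ``sort`` happens to suffice for
these particular addresses; it is not a general-purpose IP comparison):

```shell
# Sort the example monitor addresses; each address's position in the
# sorted order is its rank.
printf '%s\n' '127.0.0.1:6790' '127.0.0.1:6795' '127.0.0.1:6789' \
    | sort | awk '{ printf "rank %d: %s\n", NR - 1, $0 }'
# rank 0: 127.0.0.1:6789
# rank 1: 127.0.0.1:6790
# rank 2: 127.0.0.1:6795
```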
Most Common Monitor Issues
==========================

Have Quorum but at least one Monitor is down
--------------------------------------------

When this happens, depending on the version of Ceph you are running,
you should be seeing something similar to::

      $ ceph health detail
      [snip]
      mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum)

How to troubleshoot this?

First, make sure ``mon.a`` is running.

Second, make sure you are able to connect to ``mon.a``'s server from
the other monitors' servers. Check the ports as well. Check
``iptables`` on all your monitor nodes and make sure you are not
dropping or rejecting connections.

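A sketch of such a connectivity check with ``nc`` (the hostname is a
placeholder, and 6789 is only the default monitor port):

```shell
# mon_reachable: succeed if a TCP connection to the given host and port
# can be established within 5 seconds.
mon_reachable() {
    nc -z -w 5 "$1" "$2"
}

# Example (placeholder hostname):
# mon_reachable mon-a.example.com 6789 && echo reachable || echo blocked
# While you are at it, look for firewall rules that drop or reject:
# iptables -L -n | grep -E 'DROP|REJECT'
```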
If this initial troubleshooting doesn't solve your problems, then it's
time to go deeper.

First, check the problematic monitor's ``mon_status`` via the admin
socket as explained in `Using the monitor's admin socket`_ and
`Understanding mon_status`_.

Considering the monitor is out of the quorum, its state should be one
of ``probing``, ``electing`` or ``synchronizing``. If it happens to be
either ``leader`` or ``peon``, then the monitor believes itself to be
in the quorum while the rest of the cluster is sure it is not; or maybe
it joined the quorum while you were troubleshooting, so check
``ceph -s`` again just to make sure. Proceed only if the monitor is not
yet in the quorum.

What if the state is ``probing``?

This means the monitor is still looking for the other monitors. Every
time you start a monitor, it will stay in this state for some time
while trying to find the rest of the monitors specified in the
``monmap``. The time a monitor spends in this state can vary. For
instance, on a single-monitor cluster, the monitor passes through the
probing state almost instantaneously, since there are no other monitors
around. On a multi-monitor cluster, the monitors will stay in this
state until they find enough monitors to form a quorum -- this means
that if you have 2 out of 3 monitors down, the one remaining monitor
will stay in this state indefinitely until you bring one of the others
back up.

If you have a quorum, however, the monitor should be able to find the
remaining monitors fairly quickly, as long as they can be reached. If
your monitor is stuck probing and you have gone through all the
communication troubleshooting, then there is a fair chance that the
monitor is trying to reach the other monitors on a wrong address.
``mon_status`` outputs the ``monmap`` known to the monitor: check
whether the other monitors' locations match reality. If they don't,
jump to `Recovering a Monitor's Broken monmap`_; if they do, the
problem may be related to severe clock skews amongst the monitor nodes,
so refer to `Clock Skews`_ first. If that doesn't solve your problem,
then it is time to prepare some logs and reach out to the community
(please refer to `Preparing your logs`_ on how to best prepare them).


What if the state is ``electing``?

This means the monitor is in the middle of an election. Elections
should be fast to complete, but at times the monitors can get stuck
electing. This is usually a sign of a clock skew among the monitor
nodes; jump to `Clock Skews`_ for more information on that. If all your
clocks are properly synchronized, it is best to prepare some logs and
reach out to the community. This is not a state that is likely to
persist, and, aside from (*really*) old bugs, there is no obvious
reason besides clock skews why this would happen.

What if the state is ``synchronizing``?

This means the monitor is synchronizing with the rest of the cluster in
order to join the quorum. The smaller your monitor store, the faster
the synchronization finishes, so if you have a big store it may take a
while. Don't worry, it should be finished soon enough.

However, if you notice that the monitor jumps from ``synchronizing`` to
``electing`` and then back to ``synchronizing``, then you do have a
problem: the cluster state is advancing (i.e., generating new maps) too
fast for the synchronization process to keep up. This used to happen in
early Cuttlefish, but the synchronization process has since been
considerably refactored and enhanced to avoid exactly this sort of
behavior. If this happens in later versions, let us know, and bring
some logs (see `Preparing your logs`_).

What if the state is ``leader`` or ``peon``?

This should not happen. There is a small chance it might, however, and
it has a lot to do with clock skews -- see `Clock Skews`_. If you are
not suffering from clock skews, then please prepare your logs (see
`Preparing your logs`_) and reach out to us.


Recovering a Monitor's Broken monmap
------------------------------------

This is how a ``monmap`` usually looks, depending on the number of
monitors::

      epoch 3
      fsid 5c4e9d53-e2e1-478a-8061-f543f8be4cf8
      last_changed 2013-10-30 04:12:01.945629
      created 2013-10-29 14:14:41.914786
      0: 127.0.0.1:6789/0 mon.a
      1: 127.0.0.1:6790/0 mon.b
      2: 127.0.0.1:6795/0 mon.c

This may not be what you have, however. For instance, in some versions
of early Cuttlefish there was a bug that could cause your ``monmap`` to
be nullified: completely filled with zeros, so that not even
``monmaptool`` would be able to read it, because it cannot make sense
of only zeros. At other times, you may end up with a monitor whose
monmap is severely outdated, leaving it unable to find the remaining
monitors (e.g., say ``mon.c`` is down; you add a new monitor
``mon.d``, then remove ``mon.a``, then add a new monitor ``mon.e`` and
remove ``mon.b``; you will end up with a totally different monmap from
the one ``mon.c`` knows).

In this sort of situation, you have two possible solutions:

Scrap the monitor and create a new one

  You should only take this route if you are positive that you won't
  lose the information kept by that monitor; that is, you have other
  monitors, and they are running just fine, so that your new monitor
  will be able to synchronize from them. Keep in mind that destroying
  a monitor, if there are no other copies of its contents, may lead to
  loss of data.

Inject a monmap into the monitor

  This is usually the safest path. You should grab the monmap from the
  remaining monitors and inject it into the monitor with the corrupted
  or lost monmap.

These are the basic steps:

1. Is there a formed quorum? If so, grab the monmap from the quorum::

      $ ceph mon getmap -o /tmp/monmap

2. No quorum? Grab the monmap directly from another monitor (this
   assumes the monitor you are grabbing the monmap from has id ID-FOO
   and has been stopped)::

      $ ceph-mon -i ID-FOO --extract-monmap /tmp/monmap

3. Stop the monitor you are going to inject the monmap into.

4. Inject the monmap::

      $ ceph-mon -i ID --inject-monmap /tmp/monmap

5. Start the monitor.

Please keep in mind that the ability to inject monmaps is a powerful
feature that can cause havoc with your monitors if misused, as it will
overwrite the latest, existing monmap kept by the monitor.

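For the quorum case, the steps above can be strung together into one
small function. This is a sketch only: the ``service ceph stop/start``
invocations assume a sysvinit-style deployment (your init system may
differ), and the monitor id is passed as an argument.

```shell
# inject_monmap: perform steps 1, 3, 4 and 5 above on a cluster that
# still has a quorum. $1 is the id of the monitor with the broken monmap.
inject_monmap() {
    id=$1
    ceph mon getmap -o /tmp/monmap &&                  # 1. grab the monmap
    service ceph stop mon."$id" &&                     # 3. stop the target monitor
    ceph-mon -i "$id" --inject-monmap /tmp/monmap &&   # 4. inject the monmap
    service ceph start mon."$id"                       # 5. start it again
}
```

Chaining the steps with ``&&`` makes the function stop at the first
failure instead of injecting into a still-running monitor.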

Clock Skews
-----------

Monitors can be severely affected by significant clock skews across the
monitor nodes. This usually translates into weird behavior with no
obvious cause. To avoid such issues, you should run a clock
synchronization tool on your monitor nodes.


What's the maximum tolerated clock skew?

By default the monitors will allow clocks to drift up to ``0.05
seconds``.

Can I increase the maximum tolerated clock skew?

This value is configurable via the ``mon-clock-drift-allowed`` option,
but although you *CAN*, it doesn't mean you *SHOULD*. The clock skew
mechanism is in place because clock-skewed monitors may not behave
properly. We, as developers and QA aficionados, are comfortable with
the current default value, as it will alert the user before the
monitors get out of hand. Changing this value without testing it first
may cause unforeseen effects on the stability of the monitors and
overall cluster health, although there is no risk of data loss.


How do I know there's a clock skew?

The monitors will warn you in the form of a ``HEALTH_WARN``. ``ceph
health detail`` should show something in the form of::

    mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s)

That means that ``mon.c`` has been flagged as suffering from a clock
skew.

What should I do if there's a clock skew?

Synchronize your clocks. Running an NTP client may help. If you are
already using one and you hit this sort of issue, check whether you are
using an NTP server remote to your network, and consider hosting your
own NTP server on your network. This last option tends to reduce the
amount of issues with monitor clock skews.

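One crude way of eyeballing skew across nodes, independent of any NTP
tooling, is to print each monitor node's clock at (almost) the same
moment. A sketch (hostnames are placeholders, and ssh latency adds some
noise to the comparison):

```shell
# print_clocks: print the Unix timestamp reported by each host, so that
# large offsets between monitor nodes stand out.
print_clocks() {
    for host in "$@"; do
        echo "$host: $(ssh "$host" date +%s)"
    done
}

# Example (placeholder hostnames):
# print_clocks mon-a mon-b mon-c
```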
Client Can't Connect or Mount
-----------------------------

Check your IP tables. Some OS install utilities add a ``REJECT`` rule
to ``iptables``. The rule rejects all clients trying to connect to the
host except for ``ssh``. If your monitor host's IP tables have such a
``REJECT`` rule in place, clients connecting from a separate node will
fail to mount with a timeout error. You need to address ``iptables``
rules that reject clients trying to connect to Ceph daemons. For
example, you would need to address rules that look like this
appropriately::

    REJECT all -- anywhere anywhere reject-with icmp-host-prohibited

You may also need to add rules to IP tables on your Ceph hosts to
ensure that clients can access the ports associated with your Ceph
monitors (i.e., port 6789 by default) and Ceph OSDs (i.e., 6800 through
7300 by default). For example::

    iptables -A INPUT -m multiport -p tcp -s {ip-address}/{netmask} --dports 6789,6800:7300 -j ACCEPT

Monitor Store Failures
======================

Symptoms of store corruption
----------------------------

Ceph monitors store the `cluster map`_ in a key/value store such as
LevelDB. If a monitor fails due to key/value store corruption, the
following error messages might be found in the monitor log::

    Corruption: error in middle of record

or::

    Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.0/store.db/1234567.ldb

Recovery using healthy monitor(s)
---------------------------------

If there are any survivors, we can always `replace`_ the corrupted
monitor with a new one. After booting up, the new joiner will
synchronize with a healthy peer and, once it is fully synchronized,
will be able to serve clients.

Recovery using OSDs
-------------------

But what if all monitors fail at the same time? Since users are
encouraged to deploy at least three monitors in a Ceph cluster, the
chance of simultaneous failure is rare. But unplanned power-downs in a
data center with improperly configured disk/fs settings could fail the
underlying filesystem, and hence kill all the monitors. In this case,
we can recover the monitor store with the information stored in
OSDs::

  ms=/tmp/mon-store
  mkdir $ms
  # collect the cluster map from OSDs
  for host in $hosts; do
    rsync -avz $ms user@$host:$ms
    rm -rf $ms
    ssh user@$host <<EOF
      for osd in /var/lib/osd/osd-*; do
        ceph-objectstore-tool --data-path \$osd --op update-mon-db --mon-store-path $ms
      done
  EOF
    rsync -avz user@$host:$ms $ms
  done
  # rebuild the monitor store from the collected map. If the cluster
  # does not use cephx authentication, we can skip the ceph-authtool
  # steps that update the keyring with the caps, and there is no need
  # to pass the "--keyring" option -- i.e. just use
  # "ceph-monstore-tool /tmp/mon-store rebuild" instead.
  ceph-authtool /path/to/admin.keyring -n mon. \
    --cap mon 'allow *'
  ceph-authtool /path/to/admin.keyring -n client.admin \
    --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'
  ceph-monstore-tool /tmp/mon-store rebuild -- --keyring /path/to/admin.keyring
  # back up the corrupted store.db just in case
  mv /var/lib/ceph/mon/mon.0/store.db /var/lib/ceph/mon/mon.0/store.db.corrupted
  mv /tmp/mon-store/store.db /var/lib/ceph/mon/mon.0/store.db
  chown -R ceph:ceph /var/lib/ceph/mon/mon.0/store.db

The steps above:

#. collect the map from all OSD hosts,
#. then rebuild the store,
#. fill the entities in the keyring file with appropriate caps, and
#. replace the corrupted store on ``mon.0`` with the recovered copy.

Known limitations
~~~~~~~~~~~~~~~~~

The following information is not recoverable using the steps above:

- **some added keyrings**: all the OSD keyrings added using the ``ceph auth add``
  command are recovered from the OSDs' copies, and the ``client.admin`` keyring
  is imported using ``ceph-monstore-tool``. But the MDS keyrings and other
  keyrings will be missing in the recovered monitor store. You might need to
  re-add them manually.

- **pg settings**: the ``full ratio`` and ``nearfull ratio`` settings configured
  using ``ceph pg set_full_ratio`` and ``ceph pg set_nearfull_ratio`` will be
  lost.

- **MDS Maps**: the MDS maps are lost.


Everything Failed! Now What?
============================

Reaching out for help
---------------------

You can find us on IRC at #ceph and #ceph-devel at OFTC (server
irc.oftc.net) and on ``ceph-devel@vger.kernel.org`` and
``ceph-users@lists.ceph.com``. Make sure you have grabbed your logs and
have them ready if someone asks: the faster the interaction and the
lower the latency of your responses, the better use everyone makes of
their time.


Preparing your logs
-------------------

Monitor logs are, by default, kept in
``/var/log/ceph/ceph-mon.FOO.log*``. We may want them. However, your
logs may not have the necessary information. If you don't find your
monitor logs at their default location, you can check where they should
be by running::

    ceph-conf --name mon.FOO --show-config-value log_file

The amount of information in the logs is subject to the debug levels
being enforced by your configuration files. If you have not enforced a
specific debug level, then Ceph is using the default levels and your
logs may not contain the important information needed to track down
your issue. A first step in getting relevant information into your logs
is to raise the debug levels. In this case we are interested in the
information from the monitors. Similarly to what happens on other
components, different parts of the monitor will output their debug
information on different subsystems.

You will have to raise the debug levels of those subsystems most
closely related to your issue. This may not be an easy task for someone
unfamiliar with troubleshooting Ceph. For most situations, setting the
following options on your monitors will be enough to pinpoint a
potential source of the issue::

    debug mon = 10
    debug ms = 1

If we find that these debug levels are not enough, there's a chance we
may ask you to raise them, or even to define other debug subsystems to
obtain information from -- but at least we started off with some useful
information, instead of a massive log without much to go on.


Do I need to restart a monitor to adjust debug levels?
------------------------------------------------------

No. You may do it in one of two ways:

You have quorum

  Either inject the debug option into the monitor you want to debug::

      ceph tell mon.FOO injectargs --debug_mon 10/10

  or into all monitors at once::

      ceph tell mon.* injectargs --debug_mon 10/10

No quorum

  Use the monitor's admin socket and directly adjust the configuration
  options::

      ceph daemon mon.FOO config set debug_mon 10/10

Going back to default values is as easy as rerunning the above commands
using the debug level ``1/10`` instead. You can check your current
values using the admin socket and the following commands::

      ceph daemon mon.FOO config show

or::

      ceph daemon mon.FOO config get 'OPTION_NAME'

Reproduced the problem with appropriate debug levels. Now what?
---------------------------------------------------------------

Ideally you would send us only the relevant portions of your logs. We
realise that figuring out the corresponding portion may not be the
easiest of tasks. Therefore, we won't hold it against you if you
provide the full log, but common sense should be employed. If your log
has hundreds of thousands of lines, it may get tricky to go through the
whole thing, especially if we are not aware of the point at which
whatever your issue is happened. For instance, when reproducing the
issue, make sure to write down the current time and date, and to
extract the relevant portions of your logs based on that.

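A sketch of cutting such a window out of a log with ``sed``, using the
times you wrote down (the timestamps and the ``FOO`` id below are
placeholders):

```shell
# Keep only the log lines between the two timestamps noted while
# reproducing the issue (placeholder timestamps and monitor id).
sed -n '/2013-10-30 04:00/,/2013-10-30 04:30/p' \
    /var/log/ceph/ceph-mon.FOO.log > mon.FOO-snippet.log
```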
Finally, you should reach out to us on the mailing lists or on IRC, or
file a new issue on the `tracker`_.

.. _cluster map: ../../architecture#cluster-map
.. _replace: ../operation/add-or-rm-mons
.. _tracker: http://tracker.ceph.com/projects/ceph/issues/new