=====================
 Troubleshooting PGs
=====================

Placement Groups Never Get Clean
================================

When you create a cluster and your cluster remains in ``active``,
``active+remapped`` or ``active+degraded`` status and never achieves an
``active+clean`` status, you likely have a problem with your configuration.

You may need to review settings in the `Pool, PG and CRUSH Config Reference`_
and make appropriate adjustments.

As a general rule, you should run your cluster with more than one OSD and a
pool size greater than 1 (i.e., more than one object replica).

One Node Cluster
----------------

Ceph no longer provides documentation for operating on a single node, because
you would never deploy a system designed for distributed computing on a single
node. Additionally, mounting client kernel modules on a single node containing a
Ceph daemon may cause a deadlock due to issues with the Linux kernel itself
(unless you use VMs for the clients). You can experiment with Ceph in a 1-node
configuration, in spite of the limitations described herein.

If you are trying to create a cluster on a single node, you must change the
default of the ``osd crush chooseleaf type`` setting from ``1`` (meaning
``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration
file before you create your monitors and OSDs. This tells Ceph that an OSD
can peer with another OSD on the same host. If you are trying to set up a
1-node cluster and ``osd crush chooseleaf type`` is greater than ``0``,
Ceph will try to peer the PGs of one OSD with the PGs of another OSD on
another node, chassis, rack, row, or even datacenter depending on the setting.

.. tip:: DO NOT mount kernel clients directly on the same node as your
   Ceph Storage Cluster, because kernel conflicts can arise. However, you
   can mount kernel clients within virtual machines (VMs) on a single node.

If you are creating OSDs using a single disk, you must first create the data
directories manually. For example::

   mkdir /var/local/osd0 /var/local/osd1
   ceph-deploy osd prepare {localhost-name}:/var/local/osd0 {localhost-name}:/var/local/osd1
   ceph-deploy osd activate {localhost-name}:/var/local/osd0 {localhost-name}:/var/local/osd1

Fewer OSDs than Replicas
------------------------

If you've brought up two OSDs to an ``up`` and ``in`` state, but you still
don't see ``active+clean`` placement groups, you may have an
``osd pool default size`` set to greater than ``2``.

There are a few ways to address this situation. If you want to operate your
cluster in an ``active+degraded`` state with two replicas, you can set the
``osd pool default min size`` to ``2`` so that you can write objects in
an ``active+degraded`` state. You may also set the ``osd pool default size``
setting to ``2`` so that you only have two stored replicas (the original and
one replica), in which case the cluster should achieve an ``active+clean``
state.

.. note:: You can make the changes at runtime. If you make the changes in
   your Ceph configuration file, you may need to restart your cluster.

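For example, a minimal sketch of applying either change at runtime to an
existing pool (the pool name ``rbd`` is only illustrative; substitute your
own pool)::

   ceph osd pool set rbd min_size 2   # keep the larger size; allow writes while degraded
   ceph osd pool set rbd size 2       # or: store only two replicas in total
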

Pool Size = 1
-------------

If you have the ``osd pool default size`` set to ``1``, you will only have
one copy of the object. OSDs rely on other OSDs to tell them which objects
they should have. If a first OSD has a copy of an object and there is no
second copy, then no second OSD can tell the first OSD that it should have
that copy. For each placement group mapped to the first OSD (see
``ceph pg dump``), you can force the first OSD to notice the placement groups
it needs by running::

   ceph pg force_create_pg <pgid>

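One hedged way to list the placement groups whose acting set is a single OSD
(``osd.0`` here is only an example id) is to filter the brief PG dump::

   ceph pg dump pgs_brief | grep '\[0\]'
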
CRUSH Map Errors
----------------

Errors in your CRUSH map are another common reason for placement groups
remaining unclean.

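You can sanity-check the CRUSH map offline with ``crushtool`` (covered in more
detail under `CRUSH gives up too soon`_ below). The rule id ``0`` and replica
count ``3`` in this sketch are only illustrative::

   ceph osd getcrushmap > crush.map
   crushtool -i crush.map --test --show-bad-mappings --rule 0 --num-rep 3 --min-x 1 --max-x 1024
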
Stuck Placement Groups
======================

It is normal for placement groups to enter states like "degraded" or "peering"
following a failure. These states usually indicate the normal progression
through the failure recovery process. However, if a placement group stays in one
of these states for a long time this may be an indication of a larger problem.
For this reason, the monitor will warn when placement groups get "stuck" in a
non-optimal state. Specifically, we check for:

* ``inactive`` - The placement group has not been ``active`` for too long
  (i.e., it hasn't been able to service read/write requests).

* ``unclean`` - The placement group has not been ``clean`` for too long
  (i.e., it hasn't been able to completely recover from a previous failure).

* ``stale`` - The placement group status has not been updated by a ``ceph-osd``,
  indicating that all nodes storing this placement group may be ``down``.

You can explicitly list stuck placement groups with one of::

   ceph pg dump_stuck stale
   ceph pg dump_stuck inactive
   ceph pg dump_stuck unclean

For stuck ``stale`` placement groups, it is normally a matter of getting the
right ``ceph-osd`` daemons running again. For stuck ``inactive`` placement
groups, it is usually a peering problem (see :ref:`failures-osd-peering`). For
stuck ``unclean`` placement groups, there is usually something preventing
recovery from completing, like unfound objects (see
:ref:`failures-osd-unfound`).

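The cutoff for "too long" is governed by the ``mon pg stuck threshold`` setting
(in seconds). If you want the same listing in a machine-readable form for
scripting, a hedged example using the standard formatter flag::

   ceph pg dump_stuck unclean --format=json-pretty
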
.. _failures-osd-peering:

Placement Group Down - Peering Failure
======================================

In certain cases, the ``ceph-osd`` `Peering` process can run into
problems, preventing a PG from becoming active and usable. For
example, ``ceph health`` might report::

   ceph health detail
   HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
   ...
   pg 0.5 is down+peering
   pg 1.4 is down+peering
   ...
   osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651

We can query the cluster to determine exactly why the PG is marked ``down`` with::

   ceph pg 0.5 query

.. code-block:: javascript

   { "state": "down+peering",
     ...
     "recovery_state": [
          { "name": "Started\/Primary\/Peering\/GetInfo",
            "enter_time": "2012-03-06 14:40:16.169679",
            "requested_info_from": []},
          { "name": "Started\/Primary\/Peering",
            "enter_time": "2012-03-06 14:40:16.169659",
            "probing_osds": [
                  0,
                  1],
            "blocked": "peering is blocked due to down osds",
            "down_osds_we_would_probe": [
                  1],
            "peering_blocked_by": [
                  { "osd": 1,
                    "current_lost_at": 0,
                    "comment": "starting or marking this osd lost may let us proceed"}]},
          { "name": "Started",
            "enter_time": "2012-03-06 14:40:16.169513"}
     ]
   }

The ``recovery_state`` section tells us that peering is blocked due to
down ``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start
that ``ceph-osd`` and things will recover.

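On a systemd-managed host (an assumption about your deployment), starting the
daemon might look like this, run on the node that hosts ``osd.1``::

   systemctl start ceph-osd@1
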
Alternatively, if there is a catastrophic failure of ``osd.1`` (e.g., disk
failure), we can tell the cluster that it is ``lost`` and to cope as
best it can.

.. important:: This is dangerous in that the cluster cannot
   guarantee that the other copies of the data are consistent
   and up to date.

To instruct Ceph to continue anyway::

   ceph osd lost 1

Recovery will proceed.


.. _failures-osd-unfound:

Unfound Objects
===============

Under certain combinations of failures Ceph may complain about
``unfound`` objects::

   ceph health detail
   HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
   pg 2.4 is active+degraded, 78 unfound

This means that the storage cluster knows that some objects (or newer
copies of existing objects) exist, but it hasn't found copies of them.
One example of how this might come about for a PG whose data is on ceph-osds
1 and 2:

* 1 goes down
* 2 handles some writes, alone
* 1 comes up
* 1 and 2 repeer, and the objects missing on 1 are queued for recovery.
* Before the new objects are copied, 2 goes down.

Now 1 knows that these objects exist, but there is no live ``ceph-osd`` that
has a copy. In this case, IO to those objects will block, and the
cluster will hope that the failed node comes back soon; this is
assumed to be preferable to returning an IO error to the user.

First, you can identify which objects are unfound with::

   ceph pg 2.4 list_missing [starting offset, in json]

.. code-block:: javascript

   { "offset": { "oid": "",
         "key": "",
         "snapid": 0,
         "hash": 0,
         "max": 0},
     "num_missing": 0,
     "num_unfound": 0,
     "objects": [
        { "oid": "object 1",
          "key": "",
          "hash": 0,
          "max": 0 },
        ...
     ],
     "more": 0}

If there are too many objects to list in a single result, the ``more``
field will be true and you can query for more. (Eventually the
command line tool will hide this from you, but not yet.)

Second, you can identify which OSDs have been probed or might contain
data::

   ceph pg 2.4 query

.. code-block:: javascript

   "recovery_state": [
        { "name": "Started\/Primary\/Active",
          "enter_time": "2012-03-06 15:15:46.713212",
          "might_have_unfound": [
                { "osd": 1,
                  "status": "osd is down"}]},

In this case, for example, the cluster knows that ``osd.1`` might have
data, but it is ``down``. The full range of possible states includes:

* already probed
* querying
* OSD is down
* not queried (yet)

Sometimes it simply takes some time for the cluster to query possible
locations.

It is possible that there are other locations where the object can
exist that are not listed. For example, if a ceph-osd is stopped and
taken out of the cluster, the cluster fully recovers, and due to some
future set of failures ends up with an unfound object, it won't
consider the long-departed ceph-osd as a potential location.
(This scenario, however, is unlikely.)

If all possible locations have been queried and objects are still
lost, you may have to give up on the lost objects. This, again, is
possible given unusual combinations of failures that allow the cluster
to learn about writes that were performed before the writes themselves
are recovered. To mark the "unfound" objects as "lost"::

   ceph pg 2.5 mark_unfound_lost revert|delete

The final argument specifies how the cluster should deal with
lost objects.

The "delete" option will forget about them entirely.

The "revert" option (not available for erasure coded pools) will
either roll back to a previous version of the object or (if it was a
new object) forget about it entirely. Use this with caution, as it
may confuse applications that expected the object to exist.


Homeless Placement Groups
=========================

It is possible for all OSDs that had copies of a given placement group to fail.
If that's the case, that subset of the object store is unavailable, and the
monitor will receive no status updates for those placement groups. To detect
this situation, the monitor marks any placement group whose primary OSD has
failed as ``stale``. For example::

   ceph health
   HEALTH_WARN 24 pgs stale; 3/300 in osds are down

You can identify which placement groups are ``stale``, and what the last OSDs to
store them were, with::

   ceph health detail
   HEALTH_WARN 24 pgs stale; 3/300 in osds are down
   ...
   pg 2.5 is stuck stale+active+remapped, last acting [2,0]
   ...
   osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
   osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
   osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861

If we want to get placement group 2.5 back online, for example, this tells us that
it was last managed by ``osd.0`` and ``osd.2``. Restarting those ``ceph-osd``
daemons will allow the cluster to recover that placement group (and, presumably,
many others).

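If you are not sure which host a given OSD lives on, ``ceph osd find`` reports
its CRUSH location (the id ``2`` is taken from the example above)::

   ceph osd find 2
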

Only a Few OSDs Receive Data
============================

If you have many nodes in your cluster and only a few of them receive data,
`check`_ the number of placement groups in your pool. Since placement groups get
mapped to OSDs, a small number of placement groups will not distribute across
your cluster. Try creating a pool with a placement group count that is a
multiple of the number of OSDs. See `Placement Groups`_ for details. The default
placement group count for pools isn't useful, but you can change it `here`_.

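For example, a hedged sketch of inspecting and raising the placement group
count of an existing pool (the pool name ``data`` and the count ``128`` are
only illustrative)::

   ceph osd pool get data pg_num
   ceph osd pool set data pg_num 128
   ceph osd pool set data pgp_num 128   # also raise pgp_num so data actually rebalances
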

Can't Write Data
================

If your cluster is up, but some OSDs are down and you cannot write data,
check to ensure that you have the minimum number of OSDs running for the
placement group. If you don't have the minimum number of OSDs running,
Ceph will not allow you to write data because there is no guarantee
that Ceph can replicate your data. See ``osd pool default min size``
in the `Pool, PG and CRUSH Config Reference`_ for details.

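A quick way to check the effective values on an existing pool (the pool name
``data`` is only illustrative)::

   ceph osd pool get data size
   ceph osd pool get data min_size
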

PGs Inconsistent
================

If you receive an ``active+clean+inconsistent`` state, this may happen
due to an error during scrubbing. As always, we can identify the inconsistent
placement group(s) with::

   $ ceph health detail
   HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
   pg 0.6 is active+clean+inconsistent, acting [0,1,2]
   2 scrub errors

Or if you prefer inspecting the output in a programmatic way::

   $ rados list-inconsistent-pg rbd
   ["0.6"]

There is only one consistent state, but in the worst case we could find
several different kinds of inconsistency spread across more than one object.
If an object named ``foo`` in PG ``0.6`` is truncated, we will have::

   $ rados list-inconsistent-obj 0.6 --format=json-pretty

.. code-block:: javascript

   {
       "epoch": 14,
       "inconsistents": [
           {
               "object": {
                   "name": "foo",
                   "nspace": "",
                   "locator": "",
                   "snap": "head",
                   "version": 1
               },
               "errors": [
                   "data_digest_mismatch",
                   "size_mismatch"
               ],
               "union_shard_errors": [
                   "data_digest_mismatch_oi",
                   "size_mismatch_oi"
               ],
               "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])",
               "shards": [
                   {
                       "osd": 0,
                       "errors": [],
                       "size": 968,
                       "omap_digest": "0xffffffff",
                       "data_digest": "0xe978e67f"
                   },
                   {
                       "osd": 1,
                       "errors": [],
                       "size": 968,
                       "omap_digest": "0xffffffff",
                       "data_digest": "0xe978e67f"
                   },
                   {
                       "osd": 2,
                       "errors": [
                           "data_digest_mismatch_oi",
                           "size_mismatch_oi"
                       ],
                       "size": 0,
                       "omap_digest": "0xffffffff",
                       "data_digest": "0xffffffff"
                   }
               ]
           }
       ]
   }

In this case, we can learn from the output:

* The only inconsistent object is named ``foo``, and it is its head that has
  inconsistencies.
* The inconsistencies fall into two categories:

  * ``errors``: these errors indicate inconsistencies between shards without a
    determination of which shard(s) are bad. Check for the ``errors`` in the
    ``shards`` array, if available, to pinpoint the problem.

    * ``data_digest_mismatch``: the digest of the replica read from OSD.2 is
      different from the digests of OSD.0 and OSD.1.
    * ``size_mismatch``: the size of the replica read from OSD.2 is 0, while
      the size reported by OSD.0 and OSD.1 is 968.

  * ``union_shard_errors``: the union of all shard-specific ``errors`` in the
    ``shards`` array. The ``errors`` are set for the given shard that has the
    problem. They include errors like ``read_error``. The ``errors`` ending in
    ``oi`` indicate a comparison with ``selected_object_info``. Look at the
    ``shards`` array to determine which shard has which error(s).

    * ``data_digest_mismatch_oi``: the digest ``0xffffffff`` calculated from
      the shard read from OSD.2 does not match the digest stored in the
      object info.
    * ``size_mismatch_oi``: the size stored in the object info is different
      from the one read from OSD.2. The latter is 0.

You can repair the inconsistent placement group by executing::

   ceph pg repair {placement-group-ID}

This command overwrites the `bad` copies with the `authoritative` ones. In most
cases, Ceph is able to choose authoritative copies from all available replicas
using some predefined criteria. But this does not always work. For example, the
stored data digest could be missing, and the calculated digest will be ignored
when choosing the authoritative copies. So, please use the above command with
caution.

If ``read_error`` is listed in the ``errors`` attribute of a shard, the
inconsistency is likely due to disk errors. You might want to check the disk
used by that OSD.

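For example, on the host carrying that OSD you might inspect the kernel log
and the drive's SMART data (the device name ``/dev/sdb`` is only an example)::

   dmesg | grep -i error
   smartctl -a /dev/sdb
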
If you receive ``active+clean+inconsistent`` states periodically due to
clock skew, you may consider configuring your `NTP`_ daemons on your
monitor hosts to act as peers. See `The Network Time Protocol`_ and Ceph
`Clock Settings`_ for additional details.


Erasure Coded PGs are not active+clean
======================================

When CRUSH fails to find enough OSDs to map to a PG, it will show as
``2147483647``, which is ``ITEM_NONE`` or "no OSD found". For instance::

   [2,1,6,0,5,8,2147483647,7,4]

Not enough OSDs
---------------

If the Ceph cluster only has 8 OSDs and the erasure coded pool needs
9, that is what it will show. You can either create another erasure
coded pool that requires fewer OSDs::

   ceph osd erasure-code-profile set myprofile k=5 m=3
   ceph osd pool create erasurepool 16 16 erasure myprofile

or add new OSDs and the PG will automatically use them.

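To confirm how many OSDs a given erasure code profile needs per PG (``k`` plus
``m``), you can inspect it; ``myprofile`` is the profile created above::

   ceph osd erasure-code-profile get myprofile
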
CRUSH constraints cannot be satisfied
-------------------------------------

If the cluster has enough OSDs, it is possible that the CRUSH ruleset
imposes constraints that cannot be satisfied. If there are 10 OSDs on
two hosts and the CRUSH ruleset requires that no two OSDs from the
same host are used in the same PG, the mapping may fail because only
two OSDs will be found. You can check the constraint by displaying the
ruleset::

   $ ceph osd crush rule ls
   [
       "replicated_ruleset",
       "erasurepool"]
   $ ceph osd crush rule dump erasurepool
   { "rule_id": 1,
     "rule_name": "erasurepool",
     "ruleset": 1,
     "type": 3,
     "min_size": 3,
     "max_size": 20,
     "steps": [
           { "op": "take",
             "item": -1,
             "item_name": "default"},
           { "op": "chooseleaf_indep",
             "num": 0,
             "type": "host"},
           { "op": "emit"}]}

You can resolve the problem by creating a new pool in which PGs are allowed
to have OSDs residing on the same host with::

   ceph osd erasure-code-profile set myprofile crush-failure-domain=osd
   ceph osd pool create erasurepool 16 16 erasure myprofile

CRUSH gives up too soon
-----------------------

If the Ceph cluster has just enough OSDs to map the PG (for instance a
cluster with a total of 9 OSDs and an erasure coded pool that requires
9 OSDs per PG), it is possible that CRUSH gives up before finding a
mapping. It can be resolved by:

* lowering the erasure coded pool requirements to use fewer OSDs per PG
  (that requires the creation of another pool, as erasure code profiles
  cannot be dynamically modified).

* adding more OSDs to the cluster (that does not require the erasure
  coded pool to be modified, it will become clean automatically).

* using a hand-made CRUSH ruleset that tries more times to find a good
  mapping. This can be done by setting ``set_choose_tries`` to a value
  greater than the default.

You should first verify the problem with ``crushtool`` after
extracting the crushmap from the cluster so your experiments do not
modify the Ceph cluster and only work on local files::


   $ ceph osd crush rule dump erasurepool
   { "rule_name": "erasurepool",
     "ruleset": 1,
     "type": 3,
     "min_size": 3,
     "max_size": 20,
     "steps": [
           { "op": "take",
             "item": -1,
             "item_name": "default"},
           { "op": "chooseleaf_indep",
             "num": 0,
             "type": "host"},
           { "op": "emit"}]}
   $ ceph osd getcrushmap > crush.map
   got crush map from osdmap epoch 13
   $ crushtool -i crush.map --test --show-bad-mappings \
      --rule 1 \
      --num-rep 9 \
      --min-x 1 --max-x $((1024 * 1024))
   bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
   bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
   bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]

Where ``--num-rep`` is the number of OSDs the erasure code crush
ruleset needs, ``--rule`` is the value of the ``ruleset`` field
displayed by ``ceph osd crush rule dump``. The test will try mapping
one million values (i.e. the range defined by ``[--min-x,--max-x]``)
and must display at least one bad mapping. If it outputs nothing it
means all mappings are successful and you can stop right there: the
problem is elsewhere.

The crush ruleset can be edited by decompiling the crush map::

   $ crushtool --decompile crush.map > crush.txt

and adding the following line to the ruleset::

   step set_choose_tries 100

The relevant part of the ``crush.txt`` file should look something
like::

   rule erasurepool {
           ruleset 1
           type erasure
           min_size 3
           max_size 20
           step set_chooseleaf_tries 5
           step set_choose_tries 100
           step take default
           step chooseleaf indep 0 type host
           step emit
   }

It can then be compiled and tested again::

   $ crushtool --compile crush.txt -o better-crush.map

When all mappings succeed, a histogram of the number of tries that
were necessary to find all of them can be displayed with the
``--show-choose-tries`` option of ``crushtool``::

   $ crushtool -i better-crush.map --test --show-bad-mappings \
      --show-choose-tries \
      --rule 1 \
      --num-rep 9 \
      --min-x 1 --max-x $((1024 * 1024))
   ...
   11:        42
   12:        44
   13:        54
   14:        45
   15:        35
   16:        34
   17:        30
   18:        25
   19:        19
   20:        22
   21:        20
   22:        17
   23:        13
   24:        16
   25:        13
   26:        11
   27:        11
   28:        13
   29:        11
   30:        10
   31:         6
   32:         5
   33:        10
   34:         3
   35:         7
   36:         5
   37:         2
   38:         5
   39:         5
   40:         2
   41:         5
   42:         4
   43:         1
   44:         2
   45:         2
   46:         3
   47:         1
   48:         0
   ...
   102:        0
   103:        1
   104:        0
   ...

It took 11 tries to map 42 PGs, 12 tries to map 44 PGs, etc. The highest number
of tries is the minimum value of ``set_choose_tries`` that prevents bad mappings
(i.e. 103 in the above output, because it did not take more than 103 tries for
any PG to be mapped).

.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups
.. _here: ../../configuration/pool-pg-config-ref
.. _Placement Groups: ../../operations/placement-groups
.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref
.. _NTP: http://en.wikipedia.org/wiki/Network_Time_Protocol
.. _The Network Time Protocol: http://www.ntp.org/
.. _Clock Settings: ../../configuration/mon-config-ref/#clock