====================
Troubleshooting PGs
====================

Placement Groups Never Get Clean
================================

If, after you have created your cluster, any Placement Groups (PGs) remain in
the ``active`` status, the ``active+remapped`` status, or the
``active+degraded`` status and never achieve an ``active+clean`` status, you
likely have a problem with your configuration.

In such a situation, it may be necessary to review the settings in the `Pool,
PG and CRUSH Config Reference`_ and make appropriate adjustments.

As a general rule, run your cluster with more than one OSD and a pool size
greater than two object replicas.

.. _one-node-cluster:

One Node Cluster
----------------

Ceph no longer provides documentation for operating on a single node. Systems
designed for distributed computing by definition do not run on a single node.
The mounting of client kernel modules on a single node that contains a Ceph
daemon may cause a deadlock due to issues with the Linux kernel itself (unless
VMs are used as clients). You can experiment with Ceph in a one-node
configuration, despite the limitations described here.

To create a cluster on a single node, you must change the
``osd_crush_chooseleaf_type`` setting from the default of ``1`` (meaning
``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration
file before you create your monitors and OSDs. This tells Ceph that the
replicas of a PG may be placed on OSDs that reside on the same host. If you are
trying to set up a single-node cluster and ``osd_crush_chooseleaf_type`` is
greater than ``0``, Ceph will attempt to place the replicas of each PG on OSDs
in separate nodes, chassis, racks, rows, or datacenters, depending on the
setting.

.. tip:: DO NOT mount kernel clients directly on the same node as your Ceph
   Storage Cluster. Kernel conflicts can arise. However, you can mount kernel
   clients within virtual machines (VMs) on a single node.

If you are creating OSDs using a single disk, you must manually create
directories for the data first.


Fewer OSDs than Replicas
------------------------

If two OSDs are in an ``up`` and ``in`` state, but the placement groups are not
in an ``active + clean`` state, you may have an ``osd_pool_default_size`` set
to greater than ``2``.

There are a few ways to address this situation. If you want to operate your
cluster in an ``active + degraded`` state with two replicas, you can set the
``osd_pool_default_min_size`` to ``2`` so that you can write objects in an
``active + degraded`` state. You may also set the ``osd_pool_default_size``
setting to ``2`` so that you have only two stored replicas (the original and
one replica). In such a case, the cluster should achieve an ``active + clean``
state.

.. note:: You can make the changes while the cluster is running. If you make
   the changes in your Ceph configuration file, you might need to restart your
   cluster.
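
The interplay between the number of available OSDs, the pool ``size``, and the
pool ``min_size`` can be sketched as follows. This is a deliberately simplified
model: it ignores CRUSH failure domains and assumes that every OSD is ``up``,
``in``, and able to hold exactly one replica. The function name
``expected_pg_state`` is hypothetical and is not part of Ceph.

```python
def expected_pg_state(num_osds: int, size: int, min_size: int) -> str:
    """Simplified model of the PG state to expect for a replicated pool.

    Assumes every OSD is up, in, and eligible to hold one replica.
    This is an illustration, not Ceph's actual peering logic.
    """
    if min_size > size:
        raise ValueError("min_size cannot exceed size")
    replicas = min(num_osds, size)  # at most one replica per OSD
    if replicas >= size:
        return "active+clean"
    if replicas >= min_size:
        return "active+degraded"  # writable, but replicas are missing
    return "inactive"  # too few replicas to accept I/O


# Two OSDs with size=3: never clean, and writable only if min_size <= 2.
print(expected_pg_state(num_osds=2, size=3, min_size=2))  # active+degraded
# Two OSDs with size=2: the pool can become clean.
print(expected_pg_state(num_osds=2, size=2, min_size=1))  # active+clean
```

Under these assumptions, lowering ``size`` to ``2`` is the only one of the two
adjustments that lets a two-OSD cluster become clean. Lowering ``min_size``
only makes the degraded PGs writable.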


Pool Size = 1
-------------

If you have ``osd_pool_default_size`` set to ``1``, you will have only one copy
of the object. OSDs rely on other OSDs to tell them which objects they should
have. If one OSD has a copy of an object and there is no second copy, then
there is no second OSD to tell the first OSD that it should have that copy. For
each placement group mapped to the first OSD (see ``ceph pg dump``), you can
force the first OSD to notice the placement groups it needs by running a
command of the following form:

.. prompt:: bash

   ceph osd force-create-pg <pgid>


CRUSH Map Errors
----------------

If any placement groups in your cluster are unclean, then there might be errors
in your CRUSH map.


Stuck Placement Groups
======================

It is normal for placement groups to enter "degraded" or "peering" states after
a component failure. Normally, these states reflect the expected progression
through the failure recovery process. However, a placement group that stays in
one of these states for a long time might be an indication of a larger problem.
For this reason, the Ceph Monitors will warn when placement groups get "stuck"
in a non-optimal state. Specifically, we check for:

* ``inactive`` - The placement group has not been ``active`` for too long (that
  is, it hasn't been able to service read/write requests).

* ``unclean`` - The placement group has not been ``clean`` for too long (that
  is, it hasn't been able to completely recover from a previous failure).

* ``stale`` - The placement group status has not been updated by a
  ``ceph-osd``. This indicates that all nodes storing this placement group may
  be ``down``.

List stuck placement groups by running one of the following commands:

.. prompt:: bash

   ceph pg dump_stuck stale
   ceph pg dump_stuck inactive
   ceph pg dump_stuck unclean

- Stuck ``stale`` placement groups usually indicate that key ``ceph-osd``
  daemons are not running.
- Stuck ``inactive`` placement groups usually indicate a peering problem (see
  :ref:`failures-osd-peering`).
- Stuck ``unclean`` placement groups usually indicate that something is
  preventing recovery from completing, possibly unfound objects (see
  :ref:`failures-osd-unfound`).



.. _failures-osd-peering:

Placement Group Down - Peering Failure
======================================

In certain cases, the ``ceph-osd`` `peering` process can run into problems,
which can prevent a PG from becoming active and usable. In such a case, running
the command ``ceph health detail`` will report something similar to the
following:

.. prompt:: bash

   ceph health detail

::

    HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
    ...
    pg 0.5 is down+peering
    pg 1.4 is down+peering
    ...
    osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651

Query the cluster to determine exactly why the PG is marked ``down`` by running
a command of the following form:

.. prompt:: bash

   ceph pg 0.5 query

.. code-block:: javascript

  { "state": "down+peering",
    ...
    "recovery_state": [
         { "name": "Started\/Primary\/Peering\/GetInfo",
           "enter_time": "2012-03-06 14:40:16.169679",
           "requested_info_from": []},
         { "name": "Started\/Primary\/Peering",
           "enter_time": "2012-03-06 14:40:16.169659",
           "probing_osds": [
                 0,
                 1],
           "blocked": "peering is blocked due to down osds",
           "down_osds_we_would_probe": [
                 1],
           "peering_blocked_by": [
                 { "osd": 1,
                   "current_lost_at": 0,
                   "comment": "starting or marking this osd lost may let us proceed"}]},
         { "name": "Started",
           "enter_time": "2012-03-06 14:40:16.169513"}
     ]
  }

The ``recovery_state`` section tells us that peering is blocked due to down
``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that
particular ``ceph-osd`` and recovery will proceed.
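
When many PGs are down, it can help to extract the blocking OSDs
programmatically. The following is a minimal sketch that assumes the output of
``ceph pg <pgid> query`` has already been captured and parsed as JSON. The
helper name ``peering_blockers`` is hypothetical, and the sample is an abridged
copy of the output shown above.

```python
import json


def peering_blockers(pg_query: dict) -> list:
    """Return the IDs of the OSDs listed under ``peering_blocked_by``."""
    blockers = []
    for state in pg_query.get("recovery_state", []):
        for entry in state.get("peering_blocked_by", []):
            blockers.append(entry["osd"])
    return blockers


# Abridged sample shaped like the ``ceph pg 0.5 query`` output shown above.
sample = json.loads(r"""
{ "state": "down+peering",
  "recovery_state": [
    { "name": "Started\/Primary\/Peering\/GetInfo",
      "requested_info_from": []},
    { "name": "Started\/Primary\/Peering",
      "blocked": "peering is blocked due to down osds",
      "down_osds_we_would_probe": [1],
      "peering_blocked_by": [
        { "osd": 1,
          "current_lost_at": 0,
          "comment": "starting or marking this osd lost may let us proceed"}]},
    { "name": "Started"}
  ]
}
""")

print(peering_blockers(sample))  # [1]
```

An empty list means that no ``peering_blocked_by`` entries were present in the
parsed output. It does not by itself prove that peering succeeded.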

Alternatively, if there is a catastrophic failure of ``osd.1`` (for example, if
there has been a disk failure), the cluster can be informed that the OSD is
``lost`` and the cluster can be instructed that it must cope as best it can.

.. important:: Informing the cluster that an OSD has been lost is dangerous
   because the cluster cannot guarantee that the other copies of the data are
   consistent and up to date.

To report an OSD ``lost`` and to instruct Ceph to continue to attempt recovery
anyway, run a command of the following form:

.. prompt:: bash

   ceph osd lost 1

Recovery will proceed.


.. _failures-osd-unfound:

Unfound Objects
===============

Under certain combinations of failures, Ceph may complain about ``unfound``
objects, as in this example:

.. prompt:: bash

   ceph health detail

::

    HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
    pg 2.4 is active+degraded, 78 unfound

This means that the storage cluster knows that some objects (or newer copies of
existing objects) exist, but it hasn't found copies of them. Here is an
example of how this might come about for a PG whose data is on two OSDs, which
we will call "1" and "2":

* 1 goes down.
* 2 handles some writes, alone.
* 1 comes up.
* 1 and 2 re-peer, and the objects missing on 1 are queued for recovery.
* Before the new objects are copied, 2 goes down.

At this point, 1 knows that these objects exist, but there is no live
``ceph-osd`` that has a copy of the objects. In this case, IO to those objects
will block, and the cluster will hope that the failed node comes back soon.
This is assumed to be preferable to returning an IO error to the user.

.. note:: The situation described immediately above is one reason that setting
   ``size=2`` on a replicated pool and ``m=1`` on an erasure coded pool risks
   data loss.

Identify which objects are unfound by running a command of the following form:

.. prompt:: bash

   ceph pg 2.4 list_unfound [starting offset, in json]

.. code-block:: javascript

  {
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
        {
            "oid": {
                "oid": "object",
                "key": "",
                "snapid": -2,
                "hash": 2249616407,
                "max": 0,
                "pool": 2,
                "namespace": ""
            },
            "need": "43'251",
            "have": "0'0",
            "flags": "none",
            "clean_regions": "clean_offsets: [], clean_omap: 0, new_object: 1",
            "locations": [
                "0(3)",
                "4(2)"
            ]
        }
    ],
    "state": "NotRecovering",
    "available_might_have_unfound": true,
    "might_have_unfound": [
        {
            "osd": "2(4)",
            "status": "osd is down"
        }
    ],
    "more": false
  }

If there are too many objects to list in a single result, the ``more`` field
will be true and you can query for more. (Eventually the command line tool
will hide this from you, but not yet.)

Now you can identify which OSDs have been probed or might contain data.

At the end of the listing (before ``more: false``), ``might_have_unfound`` is
provided when ``available_might_have_unfound`` is true. This is equivalent to
the output of ``ceph pg #.# query`` and eliminates the need to use ``query``
directly. The ``might_have_unfound`` information behaves the same way as the
corresponding information from ``query``, which is described below. The only
difference is that OSDs that have the status of ``already probed`` are ignored.

Use of ``query``:

.. prompt:: bash

   ceph pg 2.4 query

.. code-block:: javascript

   "recovery_state": [
       { "name": "Started\/Primary\/Active",
         "enter_time": "2012-03-06 15:15:46.713212",
         "might_have_unfound": [
               { "osd": 1,
                 "status": "osd is down"}]},

In this case, the cluster knows that ``osd.1`` might have data, but it is
``down``. Here is the full range of possible states:

* already probed
* querying
* osd is down
* not queried (yet)

Sometimes it simply takes some time for the cluster to query possible
locations.
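
To see at a glance which candidate OSDs are still outstanding, the
``might_have_unfound`` entries can be grouped by status. The following is a
minimal sketch that assumes the ``list_unfound`` output has already been parsed
as JSON. The helper name ``unfound_sources_by_status`` is hypothetical, and the
sample is an abridged copy of the listing shown above.

```python
def unfound_sources_by_status(listing: dict) -> dict:
    """Group the ``might_have_unfound`` entries by their status string."""
    by_status = {}
    for entry in listing.get("might_have_unfound", []):
        by_status.setdefault(entry["status"], []).append(entry["osd"])
    return by_status


# Abridged sample shaped like the ``list_unfound`` output shown above.
listing = {
    "num_missing": 1,
    "num_unfound": 1,
    "available_might_have_unfound": True,
    "might_have_unfound": [{"osd": "2(4)", "status": "osd is down"}],
    "more": False,
}

print(unfound_sources_by_status(listing))  # {'osd is down': ['2(4)']}
```

Entries reported as ``osd is down`` are candidates for restarting the OSD.
Entries reported as ``querying`` or ``not queried (yet)`` may resolve on their
own once the cluster has finished probing.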

It is possible that there are other locations where the object might exist that
are not listed. For example: if an OSD is stopped and taken out of the cluster
and then the cluster fully recovers, and then through a subsequent set of
failures the cluster ends up with an unfound object, the cluster will ignore
the removed OSD. (This scenario, however, is unlikely.)

If all possible locations have been queried and objects are still lost, you may
have to give up on the lost objects. This, again, is possible only when unusual
combinations of failures have occurred that allow the cluster to learn about
writes that were performed before the writes themselves have been recovered. To
mark the "unfound" objects as "lost", run a command of the following form:

.. prompt:: bash

   ceph pg 2.5 mark_unfound_lost revert|delete

Here the final argument (``revert|delete``) specifies how the cluster should
deal with lost objects.

The ``delete`` option will cause the cluster to forget about them entirely.

The ``revert`` option (which is not available for erasure coded pools) will
either roll back to a previous version of the object or (if it was a new
object) forget about the object entirely. Use ``revert`` with caution, as it
may confuse applications that expect the object to exist.

Homeless Placement Groups
=========================

It is possible that every OSD that has copies of a given placement group fails.
If this happens, then the subset of the object store that contains those
placement groups becomes unavailable and the monitor will receive no status
updates for those placement groups. The monitor marks as ``stale`` any
placement group whose primary OSD has failed. For example:

.. prompt:: bash

   ceph health

::

    HEALTH_WARN 24 pgs stale; 3/300 in osds are down

Identify which placement groups are ``stale`` and which were the last OSDs to
store the ``stale`` placement groups by running the following command:

.. prompt:: bash

   ceph health detail

::

    HEALTH_WARN 24 pgs stale; 3/300 in osds are down
    ...
    pg 2.5 is stuck stale+active+remapped, last acting [2,0]
    ...
    osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
    osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
    osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861

This output indicates that placement group 2.5 (``pg 2.5``) was last managed by
``osd.0`` and ``osd.2``. Restart those OSDs to allow the cluster to recover
that placement group.
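
With many stale PGs, the ``last acting`` sets can be collected with a short
script. The following is a minimal sketch that assumes the plain-text line
format shown above, which may differ between Ceph releases. The helper name
``stale_pgs`` is hypothetical.

```python
import re

# Matches lines such as:
#   pg 2.5 is stuck stale+active+remapped, last acting [2,0]
STALE_PG = re.compile(
    r"^pg (?P<pgid>\S+) is stuck (?P<state>\S*stale\S*), "
    r"last acting \[(?P<acting>[\d,]*)\]"
)


def stale_pgs(health_detail: str) -> dict:
    """Map each stale PG ID to the list of OSD IDs that last served it."""
    result = {}
    for line in health_detail.splitlines():
        match = STALE_PG.match(line.strip())
        if match:
            osds = [int(o) for o in match["acting"].split(",") if o]
            result[match["pgid"]] = osds
    return result


# Abridged sample shaped like the ``ceph health detail`` output shown above.
sample = """\
HEALTH_WARN 24 pgs stale; 3/300 in osds are down
pg 2.5 is stuck stale+active+remapped, last acting [2,0]
osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
"""

print(stale_pgs(sample))  # {'2.5': [2, 0]}
```

The OSD IDs in the result are the daemons to restart first, because they were
the last ones known to hold the PG.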


Only a Few OSDs Receive Data
============================

If only a few of the nodes in the cluster are receiving data, check the number
of placement groups in the pool as instructed in the :ref:`Placement Groups
<rados_ops_pgs_get_pg_num>` documentation. Since placement groups get mapped to
OSDs in an operation involving dividing the number of placement groups in the
cluster by the number of OSDs in the cluster, a small number of placement
groups (the remainder, in this operation) are sometimes not distributed across
the cluster. In situations like this, create a pool with a placement group
count that is a multiple of the number of OSDs. See `Placement Groups`_ for
details. See the :ref:`Pool, PG, and CRUSH Config Reference
<rados_config_pool_pg_crush_ref>` for instructions on changing the default
values used to determine how many placement groups are assigned to each pool.


Can't Write Data
================

If the cluster is up, but some OSDs are down and you cannot write data, make
sure that you have the minimum number of OSDs running in the pool. If you don't
have the minimum number of OSDs running in the pool, Ceph will not allow you to
write data to it because there is no guarantee that Ceph can replicate your
data. See ``osd_pool_default_min_size`` in the :ref:`Pool, PG, and CRUSH
Config Reference <rados_config_pool_pg_crush_ref>` for details.


PGs Inconsistent
================

If the command ``ceph health detail`` returns an ``active + clean +
inconsistent`` state, this might indicate an error during scrubbing. Identify
the inconsistent placement group or placement groups by running the following
command:

.. prompt:: bash

   ceph health detail

::

    HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
    pg 0.6 is active+clean+inconsistent, acting [0,1,2]
    2 scrub errors

Alternatively, run this command if you prefer to inspect the output in a
programmatic way:

.. prompt:: bash

   rados list-inconsistent-pg rbd

::

    ["0.6"]

There is only one consistent state, but in the worst case, different
inconsistencies, seen from multiple perspectives, can be found in more than one
object. If an object named ``foo`` in PG ``0.6`` is truncated, the output of
``rados list-inconsistent-obj 0.6`` will look something like this:

.. prompt:: bash

   rados list-inconsistent-obj 0.6 --format=json-pretty

.. code-block:: javascript

    {
        "epoch": 14,
        "inconsistents": [
            {
                "object": {
                    "name": "foo",
                    "nspace": "",
                    "locator": "",
                    "snap": "head",
                    "version": 1
                },
                "errors": [
                    "data_digest_mismatch",
                    "size_mismatch"
                ],
                "union_shard_errors": [
                    "data_digest_mismatch_info",
                    "size_mismatch_info"
                ],
                "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])",
                "shards": [
                    {
                        "osd": 0,
                        "errors": [],
                        "size": 968,
                        "omap_digest": "0xffffffff",
                        "data_digest": "0xe978e67f"
                    },
                    {
                        "osd": 1,
                        "errors": [],
                        "size": 968,
                        "omap_digest": "0xffffffff",
                        "data_digest": "0xe978e67f"
                    },
                    {
                        "osd": 2,
                        "errors": [
                            "data_digest_mismatch_info",
                            "size_mismatch_info"
                        ],
                        "size": 0,
                        "omap_digest": "0xffffffff",
                        "data_digest": "0xffffffff"
                    }
                ]
            }
        ]
    }

In this case, the output indicates the following:

* The only inconsistent object is named ``foo``, and its head has
  inconsistencies.
* The inconsistencies fall into two categories:

  #. ``errors``: these errors indicate inconsistencies between shards, without
     an indication of which shard(s) are bad. Check for the ``errors`` in the
     ``shards`` array, if available, to pinpoint the problem.

     * ``data_digest_mismatch``: the digest of the replica read from ``OSD.2``
       is different from the digests of the replica reads of ``OSD.0`` and
       ``OSD.1``.
     * ``size_mismatch``: the size of the replica read from ``OSD.2`` is ``0``,
       but the size reported by ``OSD.0`` and ``OSD.1`` is ``968``.

  #. ``union_shard_errors``: the union of all shard-specific ``errors`` in the
     ``shards`` array. The ``errors`` are set for the shard with the problem.
     These errors include ``read_error`` and other similar errors. The
     ``errors`` ending in ``info`` indicate a comparison with
     ``selected_object_info``. Examine the ``shards`` array to determine
     which shard has which error or errors.

     * ``data_digest_mismatch_info``: the digest stored in the ``object-info``
       is not ``0xffffffff``, which is calculated from the shard read from
       ``OSD.2``.
     * ``size_mismatch_info``: the size stored in the ``object-info`` is
       different from the size read from ``OSD.2``. The latter is ``0``.

.. warning:: If ``read_error`` is listed in a shard's ``errors`` attribute, the
   inconsistency is likely due to physical storage errors. In cases like this,
   check the storage used by that OSD.

   Examine the output of ``dmesg`` and ``smartctl`` before attempting a drive
   repair.
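
The shard-level ``errors`` arrays described above can be summarized with a
short script. The following is a minimal sketch that assumes the
``rados list-inconsistent-obj`` output has already been parsed as JSON. The
helper name ``shards_with_errors`` is hypothetical, and the sample is an
abridged copy of the output shown above.

```python
def shards_with_errors(report: dict) -> dict:
    """Map each inconsistent object name to {osd_id: [errors]} for bad shards."""
    result = {}
    for item in report.get("inconsistents", []):
        bad = {
            shard["osd"]: shard["errors"]
            for shard in item.get("shards", [])
            if shard.get("errors")  # keep only shards that report errors
        }
        result[item["object"]["name"]] = bad
    return result


# Abridged sample shaped like the ``list-inconsistent-obj`` output shown above.
report = {
    "epoch": 14,
    "inconsistents": [
        {
            "object": {"name": "foo", "snap": "head", "version": 1},
            "errors": ["data_digest_mismatch", "size_mismatch"],
            "union_shard_errors": [
                "data_digest_mismatch_info",
                "size_mismatch_info",
            ],
            "shards": [
                {"osd": 0, "errors": [], "size": 968},
                {"osd": 1, "errors": [], "size": 968},
                {
                    "osd": 2,
                    "errors": ["data_digest_mismatch_info", "size_mismatch_info"],
                    "size": 0,
                },
            ],
        }
    ],
}

print(shards_with_errors(report))
# {'foo': {2: ['data_digest_mismatch_info', 'size_mismatch_info']}}
```

An object that maps to an empty dictionary has errors only at the object
level, so the bad shard cannot be pinpointed from the ``shards`` array alone.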

To repair the inconsistent placement group, run a command of the following
form:

.. prompt:: bash

   ceph pg repair {placement-group-ID}

.. warning:: This command overwrites the "bad" copies with "authoritative"
   copies. In most cases, Ceph is able to choose authoritative copies from all
   the available replicas by using some predefined criteria. This, however,
   does not work in every case. For example, it might be the case that the
   stored data digest is missing, which means that the calculated digest is
   ignored when Ceph chooses the authoritative copies. Be aware of this, and
   use the above command with caution.


If you receive ``active + clean + inconsistent`` states periodically due to
clock skew, consider configuring the `NTP
<https://en.wikipedia.org/wiki/Network_Time_Protocol>`_ daemons on your monitor
hosts to act as peers. See `The Network Time Protocol <http://www.ntp.org>`_
and Ceph :ref:`Clock Settings <mon-config-ref-clock>` for more information.


Erasure Coded PGs are not active+clean
======================================

If CRUSH fails to find enough OSDs to map to a PG, the missing OSD will be
shown as ``2147483647``, which means ``ITEM_NONE`` or ``no OSD found``. For
example::

    [2,1,6,0,5,8,2147483647,7,4]

Not enough OSDs
---------------

If the Ceph cluster has only eight OSDs and an erasure coded pool needs nine
OSDs, the cluster will show "Not enough OSDs". In this case, you can either
create another erasure coded pool that requires fewer OSDs, by running commands
of the following form:

.. prompt:: bash

   ceph osd erasure-code-profile set myprofile k=5 m=3
   ceph osd pool create erasurepool erasure myprofile

or add new OSDs, and the PG will automatically use them.
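
The arithmetic behind this check is simple: each PG of an erasure coded pool
needs one OSD per chunk, that is, ``k`` data chunks plus ``m`` coding chunks.
The following is a minimal sketch. The helper names are hypothetical, and the
``k=6, m=3`` profile is only an assumed example of a pool that needs nine OSDs
(the text above does not say which profile was used).

```python
def osds_required(k: int, m: int) -> int:
    """An erasure coded PG needs one OSD per chunk: k data plus m coding."""
    return k + m


def profile_fits(k: int, m: int, num_osds: int) -> bool:
    """Check whether a cluster has at least k + m OSDs to choose from."""
    return osds_required(k, m) <= num_osds


# Assumed nine-OSD profile on an eight-OSD cluster: does not fit.
print(profile_fits(k=6, m=3, num_osds=8))  # False
# The k=5 m=3 profile from the commands above needs eight OSDs: fits.
print(profile_fits(k=5, m=3, num_osds=8))  # True
```

This check covers only the raw OSD count. A pool can still fail to map if the
CRUSH rule adds placement constraints.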

CRUSH constraints cannot be satisfied
-------------------------------------

If the cluster has enough OSDs, it is possible that the CRUSH rule is imposing
constraints that cannot be satisfied. If there are ten OSDs on two hosts and
the CRUSH rule requires that no two OSDs from the same host are used in the
same PG, the mapping may fail because only two OSDs will be found. Check the
constraint by displaying ("dumping") the rule, as shown here:

.. prompt:: bash

   ceph osd crush rule ls

::

    [
        "replicated_rule",
        "erasurepool"]
    $ ceph osd crush rule dump erasurepool
    { "rule_id": 1,
      "rule_name": "erasurepool",
      "type": 3,
      "steps": [
            { "op": "take",
              "item": -1,
              "item_name": "default"},
            { "op": "chooseleaf_indep",
              "num": 0,
              "type": "host"},
            { "op": "emit"}]}


Resolve this problem by creating a new pool in which PGs are allowed to have
OSDs residing on the same host by running the following commands:

.. prompt:: bash

   ceph osd erasure-code-profile set myprofile crush-failure-domain=osd
   ceph osd pool create erasurepool erasure myprofile

CRUSH gives up too soon
-----------------------

If the Ceph cluster has just enough OSDs to map the PG (for instance a cluster
with a total of nine OSDs and an erasure coded pool that requires nine OSDs per
PG), it is possible that CRUSH gives up before finding a mapping. This problem
can be resolved by:

* lowering the erasure coded pool requirements to use fewer OSDs per PG (this
  requires the creation of another pool, because erasure code profiles cannot
  be modified dynamically).

* adding more OSDs to the cluster (this does not require the erasure coded pool
  to be modified, because it will become clean automatically).

* using a handmade CRUSH rule that tries more times to find a good mapping.
  This can be done for an existing CRUSH rule by setting
  ``set_choose_tries`` to a value greater than the default.

First, verify the problem by using ``crushtool`` after extracting the crushmap
from the cluster. This ensures that your experiments do not modify the Ceph
cluster and that they operate only on local files:

.. prompt:: bash

   ceph osd crush rule dump erasurepool

::

    { "rule_id": 1,
      "rule_name": "erasurepool",
      "type": 3,
      "steps": [
            { "op": "take",
              "item": -1,
              "item_name": "default"},
            { "op": "chooseleaf_indep",
              "num": 0,
              "type": "host"},
            { "op": "emit"}]}
    $ ceph osd getcrushmap > crush.map
    got crush map from osdmap epoch 13
    $ crushtool -i crush.map --test --show-bad-mappings \
       --rule 1 \
       --num-rep 9 \
       --min-x 1 --max-x $((1024 * 1024))
    bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
    bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
    bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]

Here, ``--num-rep`` is the number of OSDs that the erasure code CRUSH rule
needs, and ``--rule`` is the value of the ``rule_id`` field that was displayed
by ``ceph osd crush rule dump``. This test will attempt to map one million
values (in this example, the range defined by ``[--min-x,--max-x]``) and must
display at least one bad mapping. If this test outputs nothing, all mappings
have been successful and you can be assured that the problem with your cluster
is not caused by bad mappings.
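
When the test reports many bad mappings, a short script can show which slot of
each mapping CRUSH failed to fill. The following is a minimal sketch that
assumes the ``bad mapping`` line format shown above. The helper name
``parse_bad_mappings`` is hypothetical.

```python
import re

ITEM_NONE = 2147483647  # CRUSH placeholder meaning "no OSD found"

# Matches lines such as:
#   bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
BAD_MAPPING = re.compile(
    r"^bad mapping rule (\d+) x (\d+) num_rep (\d+) result \[([\d,]*)\]$"
)


def parse_bad_mappings(output: str) -> list:
    """Return (x, missing_slot_indexes) for every ``bad mapping`` line."""
    found = []
    for line in output.splitlines():
        match = BAD_MAPPING.match(line.strip())
        if not match:
            continue  # ignore anything that is not a bad-mapping line
        x = int(match.group(2))
        result = [int(v) for v in match.group(4).split(",") if v]
        missing = [i for i, osd in enumerate(result) if osd == ITEM_NONE]
        found.append((x, missing))
    return found


# The three sample lines from the crushtool output shown above.
sample = """\
bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]
"""

print(parse_bad_mappings(sample))  # [(43, [4]), (79, [6]), (173, [8])]
```

An empty result corresponds to the "outputs nothing" case described above, in
which every tested value was mapped successfully.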

Changing the value of set_choose_tries
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. Decompile the CRUSH map to edit the CRUSH rule by running the following
   command:

   .. prompt:: bash

      crushtool --decompile crush.map > crush.txt

#. Add the following line to the rule::

      step set_choose_tries 100

   The relevant part of the ``crush.txt`` file will resemble this::

      rule erasurepool {
              id 1
              type erasure
              step set_chooseleaf_tries 5
              step set_choose_tries 100
              step take default
              step chooseleaf indep 0 type host
              step emit
      }

#. Recompile and retest the CRUSH rule:

   .. prompt:: bash

      crushtool --compile crush.txt -o better-crush.map

#. When all mappings succeed, display a histogram of the number of tries that
   were necessary to find all of the mappings by using the
   ``--show-choose-tries`` option of the ``crushtool`` command, as in the
   following example:

   .. prompt:: bash

      crushtool -i better-crush.map --test --show-bad-mappings \
       --show-choose-tries \
       --rule 1 \
       --num-rep 9 \
       --min-x 1 --max-x $((1024 * 1024))
      ...
      11: 42
      12: 44
      13: 54
      14: 45
      15: 35
      16: 34
      17: 30
      18: 25
      19: 19
      20: 22
      21: 20
      22: 17
      23: 13
      24: 16
      25: 13
      26: 11
      27: 11
      28: 13
      29: 11
      30: 10
      31: 6
      32: 5
      33: 10
      34: 3
      35: 7
      36: 5
      37: 2
      38: 5
      39: 5
      40: 2
      41: 5
      42: 4
      43: 1
      44: 2
      45: 2
      46: 3
      47: 1
      48: 0
      ...
      102: 0
      103: 1
      104: 0
      ...

   This output indicates that it took eleven tries to map forty-two PGs, twelve
   tries to map forty-four PGs, and so on. The highest number of tries is the
   minimum value of ``set_choose_tries`` that prevents bad mappings (for
   example, ``103`` in the above output, because it did not take more than 103
   tries for any PG to be mapped).
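
Reading the highest non-zero bucket off a long histogram by eye is error-prone,
so it can be done with a short script. The following is a minimal sketch that
assumes the ``tries: count`` line format shown above. The helper name
``min_choose_tries`` is hypothetical, and the sample is an abridged copy of the
histogram.

```python
def min_choose_tries(histogram: str) -> int:
    """Return the highest number of tries that has a non-zero count."""
    highest = 0
    for line in histogram.splitlines():
        tries, sep, count = line.partition(":")
        if not sep or not tries.strip().isdigit() or not count.strip().isdigit():
            continue  # skip "..." and any other non-histogram lines
        if int(count) > 0:
            highest = max(highest, int(tries))
    return highest


# Abridged copy of the histogram shown above.
sample = """\
...
46: 3
47: 1
48: 0
...
102: 0
103: 1
104: 0
...
"""

print(min_choose_tries(sample))  # 103
```

The returned value is the smallest ``set_choose_tries`` setting that would have
avoided bad mappings for the tested range of values.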

.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups
.. _Placement Groups: ../../operations/placement-groups
.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref