ceph/doc/rados/troubleshooting/troubleshooting-pg.rst

====================
 Troubleshooting PGs
====================

Placement Groups Never Get Clean
================================

If, after you have created your cluster, any Placement Groups (PGs) remain in
the ``active`` status, the ``active+remapped`` status or the
``active+degraded`` status and never achieve an ``active+clean`` status, you
likely have a problem with your configuration.

In such a situation, it may be necessary to review the settings in the `Pool,
PG and CRUSH Config Reference`_ and make appropriate adjustments.

As a general rule, run your cluster with more than one OSD and a pool size
greater than two object replicas.

.. _one-node-cluster:

One Node Cluster
----------------

Ceph no longer provides documentation for operating on a single node.  Systems
designed for distributed computing by definition do not run on a single node.
The mounting of client kernel modules on a single node that contains a Ceph
daemon may cause a deadlock due to issues with the Linux kernel itself (unless
VMs are used as clients). You can experiment with Ceph in a one-node
configuration, in spite of the limitations described here.

To create a cluster on a single node, you must change the
``osd_crush_chooseleaf_type`` setting from the default of ``1`` (meaning
``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration
file before you create your monitors and OSDs. This tells Ceph that the
replicas of a PG are permitted to reside on different OSDs on the same host.
If you are trying to set up a single-node cluster and
``osd_crush_chooseleaf_type`` is greater than ``0``, Ceph will attempt to place
each replica of a PG on an OSD in a different node, chassis, rack, row, or
datacenter, depending on the setting.

.. tip:: DO NOT mount kernel clients directly on the same node as your Ceph
   Storage Cluster. Kernel conflicts can arise. However, you can mount kernel
   clients within virtual machines (VMs) on a single node.

If you are creating OSDs using a single disk, you must manually create
directories for the data first.


Fewer OSDs than Replicas
------------------------

If two OSDs are in an ``up`` and ``in`` state, but the placement groups are not
in an ``active + clean`` state, ``osd_pool_default_size`` might be set to a
value greater than ``2``.

There are a few ways to address this situation. If you want to operate your
cluster in an ``active + degraded`` state with two replicas, you can set the
``osd_pool_default_min_size`` to ``2`` so that you can write objects in an
``active + degraded`` state. You may also set the ``osd_pool_default_size``
setting to ``2`` so that you have only two stored replicas (the original and
one replica). In such a case, the cluster should achieve an ``active + clean``
state.

.. note:: You can make the changes while the cluster is running. If you make
   the changes in your Ceph configuration file, you might need to restart your
   cluster.
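
As a rough illustration of the rule described above, the following Python
sketch models how the number of available replicas relates to ``size`` and
``min_size``. The function name ``pg_health`` and the strings it returns are
hypothetical simplifications chosen for this example; real PG states also
depend on peering, CRUSH placement, and recovery.

```python
def pg_health(available_replicas, size, min_size):
    """Simplified model of how replica counts relate to PG state.

    Hypothetical helper for illustration only: it ignores peering,
    CRUSH placement, and recovery.
    """
    if available_replicas >= size:
        return "active+clean"
    if available_replicas >= min_size:
        # Fewer copies than requested, but enough to accept writes.
        return "active+degraded"
    # Simplification: too few copies to serve I/O.
    return "inactive"


# Two OSDs and a pool size of 3: the PG cannot become clean.
print(pg_health(available_replicas=2, size=3, min_size=2))  # active+degraded
# Two OSDs and a pool size of 2: the PG can become clean.
print(pg_health(available_replicas=2, size=2, min_size=1))  # active+clean
```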


Pool Size = 1
-------------

If you have ``osd_pool_default_size`` set to ``1``, you will have only one copy
of the object. OSDs rely on other OSDs to tell them which objects they should
have. If one OSD has a copy of an object and there is no second copy, then
there is no second OSD to tell the first OSD that it should have that copy. For
each placement group mapped to the first OSD (see ``ceph pg dump``), you can
force the first OSD to notice the placement groups it needs by running a
command of the following form:

.. prompt:: bash

   ceph osd force-create-pg <pgid>


CRUSH Map Errors
----------------

If any placement groups in your cluster are unclean, then there might be errors
in your CRUSH map.


Stuck Placement Groups
======================

It is normal for placement groups to enter "degraded" or "peering" states after
a component failure. Normally, these states reflect the expected progression
through the failure recovery process. However, a placement group that stays in
one of these states for a long time might be an indication of a larger problem.
For this reason, the Ceph Monitors will warn when placement groups get "stuck"
in a non-optimal state. Specifically, we check for:

* ``inactive`` - The placement group has not been ``active`` for too long (that
  is, it hasn't been able to service read/write requests).

* ``unclean`` - The placement group has not been ``clean`` for too long (that
  is, it hasn't been able to completely recover from a previous failure).

* ``stale`` - The placement group status has not been updated by a
  ``ceph-osd``.  This indicates that all nodes storing this placement group may
  be ``down``.

List stuck placement groups by running one of the following commands:

.. prompt:: bash

   ceph pg dump_stuck stale
   ceph pg dump_stuck inactive
   ceph pg dump_stuck unclean

- Stuck ``stale`` placement groups usually indicate that key ``ceph-osd``
  daemons are not running.
- Stuck ``inactive`` placement groups usually indicate a peering problem (see
  :ref:`failures-osd-peering`).
- Stuck ``unclean`` placement groups usually indicate that something is
  preventing recovery from completing, possibly unfound objects (see
  :ref:`failures-osd-unfound`).



.. _failures-osd-peering:

Placement Group Down - Peering Failure
======================================

In certain cases, the ``ceph-osd`` `peering` process can run into problems,
which can prevent a PG from becoming active and usable. In such a case, running
the command ``ceph health detail`` will report something similar to the following:

.. prompt:: bash

   ceph health detail

::

    HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
    ...
    pg 0.5 is down+peering
    pg 1.4 is down+peering
    ...
    osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651

Query the cluster to determine exactly why the PG is marked ``down`` by running a command of the following form:

.. prompt:: bash

   ceph pg 0.5 query

.. code-block:: javascript

 { "state": "down+peering",
   ...
   "recovery_state": [
        { "name": "Started\/Primary\/Peering\/GetInfo",
          "enter_time": "2012-03-06 14:40:16.169679",
          "requested_info_from": []},
        { "name": "Started\/Primary\/Peering",
          "enter_time": "2012-03-06 14:40:16.169659",
          "probing_osds": [
                0,
                1],
          "blocked": "peering is blocked due to down osds",
          "down_osds_we_would_probe": [
                1],
          "peering_blocked_by": [
                { "osd": 1,
                  "current_lost_at": 0,
                  "comment": "starting or marking this osd lost may let us proceed"}]},
        { "name": "Started",
          "enter_time": "2012-03-06 14:40:16.169513"}
    ]
 }

The ``recovery_state`` section tells us that peering is blocked due to down
``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that
particular ``ceph-osd`` and recovery will proceed.
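
The same information can be extracted programmatically. The following Python
sketch reads a trimmed copy of the ``ceph pg 0.5 query`` output shown above and
collects the OSDs listed under ``peering_blocked_by``. The helper name
``blocking_osds`` is hypothetical, and the sample is an abbreviated version of
the example output, so real output may contain additional fields.

```python
import json

# Trimmed sample modeled on the ``ceph pg 0.5 query`` output shown above.
SAMPLE = """
{"state": "down+peering",
 "recovery_state": [
   {"name": "Started/Primary/Peering/GetInfo", "requested_info_from": []},
   {"name": "Started/Primary/Peering",
    "probing_osds": [0, 1],
    "blocked": "peering is blocked due to down osds",
    "down_osds_we_would_probe": [1],
    "peering_blocked_by": [
      {"osd": 1, "current_lost_at": 0,
       "comment": "starting or marking this osd lost may let us proceed"}]},
   {"name": "Started"}]}
"""


def blocking_osds(query_json):
    """Return the sorted IDs of OSDs listed under ``peering_blocked_by``."""
    report = json.loads(query_json)
    blockers = set()
    for entry in report.get("recovery_state", []):
        for item in entry.get("peering_blocked_by", []):
            blockers.add(item["osd"])
    return sorted(blockers)


print(blocking_osds(SAMPLE))  # [1]
```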

Alternatively, if there is a catastrophic failure of ``osd.1`` (for example, if
there has been a disk failure), the cluster can be informed that the OSD is
``lost`` and the cluster can be instructed that it must cope as best it can.

.. important:: Informing the cluster that an OSD has been lost is dangerous
   because the cluster cannot guarantee that the other copies of the data are
   consistent and up to date.

To report an OSD ``lost`` and to instruct Ceph to continue to attempt recovery
anyway, run a command of the following form:

.. prompt:: bash

   ceph osd lost 1

Recovery will proceed.


.. _failures-osd-unfound:

Unfound Objects
===============

Under certain combinations of failures, Ceph may complain about ``unfound``
objects, as in this example:

.. prompt:: bash

   ceph health detail

::

   HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
   pg 2.4 is active+degraded, 78 unfound

This means that the storage cluster knows that some objects (or newer copies of
existing objects) exist, but it hasn't found copies of them.  Here is an
example of how this might come about for a PG whose data is on two OSDs, which
we will call "1" and "2":

* 1 goes down
* 2 handles some writes, alone
* 1 comes up
* 1 and 2 re-peer, and the objects missing on 1 are queued for recovery.
* Before the new objects are copied, 2 goes down.

At this point, 1 knows that these objects exist, but there is no live
``ceph-osd`` that has a copy of the objects. In this case, IO to those objects
will block, and the cluster will hope that the failed node comes back soon.
This is assumed to be preferable to returning an IO error to the user.

.. note:: The situation described immediately above is one reason that setting
   ``size=2`` on a replicated pool and ``m=1`` on an erasure coded pool risks
   data loss.
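
The failure sequence described above can be replayed with a toy model. The
following Python sketch is only an illustration under simplifying assumptions:
the names ``unfound_objects``, ``stores``, and ``known`` are hypothetical, each
object carries a single integer version, and peering is reduced to "the PG
remembers the newest version it has heard about".

```python
def unfound_objects(known_versions, stores, up_osds):
    """List objects whose newest known version is on no running OSD.

    Toy model: ``known_versions`` maps object name to the newest version
    the PG has learned about, ``stores`` maps OSD id to the versions that
    OSD holds, and ``up_osds`` is the set of OSDs that are running.
    """
    return sorted(
        name
        for name, version in known_versions.items()
        if not any(stores[osd].get(name) == version for osd in up_osds)
    )


stores = {1: {"obj": 1}, 2: {"obj": 1}}
known = {"obj": 1}

# OSD 1 goes down; OSD 2 handles a write alone.
stores[2]["obj"] = 2
known["obj"] = 2

# OSD 1 comes back and re-peers: the new version is still on OSD 2.
print(unfound_objects(known, stores, up_osds={1, 2}))  # []

# OSD 2 goes down before recovery copies the object to OSD 1.
print(unfound_objects(known, stores, up_osds={1}))  # ['obj']
```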

Identify which objects are unfound by running a command of the following form:

.. prompt:: bash

   ceph pg 2.4 list_unfound [starting offset, in json]

.. code-block:: javascript

  {
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
        {
            "oid": {
                "oid": "object",
                "key": "",
                "snapid": -2,
                "hash": 2249616407,
                "max": 0,
                "pool": 2,
                "namespace": ""
            },
            "need": "43'251",
            "have": "0'0",
            "flags": "none",
            "clean_regions": "clean_offsets: [], clean_omap: 0, new_object: 1",
            "locations": [
                "0(3)",
                "4(2)"
            ]
        }
    ],
    "state": "NotRecovering",
    "available_might_have_unfound": true,
    "might_have_unfound": [
        {
            "osd": "2(4)",
            "status": "osd is down"
        }
    ],
    "more": false
  }

If there are too many objects to list in a single result, the ``more`` field
will be true and you can query for more.  (Eventually the command line tool
will hide this from you, but not yet.)

Now you can identify which OSDs have been probed or might contain data.

At the end of the listing (before ``more: false``), ``might_have_unfound`` is
provided when ``available_might_have_unfound`` is true.  This is equivalent to
the output of ``ceph pg #.# query`` and eliminates the need to use ``query``
directly.  The ``might_have_unfound`` information behaves the same way as the
corresponding ``query`` output, which is described below.  The only difference
is that OSDs that have the status of ``already probed`` are ignored.
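
For scripted checks, the fields discussed above can be read directly from the
JSON. The following Python sketch works on a trimmed copy of the
``list_unfound`` output shown above; the helper name ``summarize_unfound`` is
hypothetical and only the fields that appear in that example are assumed to
exist.

```python
import json

# Trimmed sample modeled on the ``list_unfound`` output shown above.
SAMPLE = """
{"num_missing": 1, "num_unfound": 1,
 "objects": [{"oid": {"oid": "object", "pool": 2},
              "need": "43'251", "have": "0'0",
              "locations": ["0(3)", "4(2)"]}],
 "available_might_have_unfound": true,
 "might_have_unfound": [{"osd": "2(4)", "status": "osd is down"}],
 "more": false}
"""


def summarize_unfound(listing_json):
    """Return (object names, OSDs that might hold them, more-results flag)."""
    listing = json.loads(listing_json)
    names = [obj["oid"]["oid"] for obj in listing.get("objects", [])]
    candidates = {}
    # ``might_have_unfound`` is only meaningful when the flag is true.
    if listing.get("available_might_have_unfound"):
        for entry in listing.get("might_have_unfound", []):
            candidates[entry["osd"]] = entry["status"]
    return names, candidates, listing.get("more", False)


print(summarize_unfound(SAMPLE))
# (['object'], {'2(4)': 'osd is down'}, False)
```

If the third element of the result is true, the listing was truncated and the
command has to be repeated to fetch the remaining objects.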

Use of ``query``:

.. prompt:: bash

   ceph pg 2.4 query

.. code-block:: javascript

   "recovery_state": [
        { "name": "Started\/Primary\/Active",
          "enter_time": "2012-03-06 15:15:46.713212",
          "might_have_unfound": [
                { "osd": 1,
                  "status": "osd is down"}]},

In this case, the cluster knows that ``osd.1`` might have data, but it is
``down``. Here is the full range of possible states:

* already probed
* querying
* osd is down
* not queried (yet)

Sometimes it simply takes some time for the cluster to query possible
locations.

It is possible that there are other locations where the object might exist that
are not listed. For example: if an OSD is stopped and taken out of the cluster
and then the cluster fully recovers, and then through a subsequent set of
failures the cluster ends up with an unfound object, the cluster will ignore
the removed OSD. (This scenario, however, is unlikely.)

If all possible locations have been queried and objects are still lost, you may
have to give up on the lost objects. This, again, is possible only when unusual
combinations of failures have occurred that allow the cluster to learn about
writes that were performed before the writes themselves have been recovered. To
mark the "unfound" objects as "lost", run a command of the following form:

.. prompt:: bash

   ceph pg 2.4 mark_unfound_lost revert|delete

Here the final argument (``revert|delete``) specifies how the cluster should
deal with lost objects.

The ``delete`` option will cause the cluster to forget about them entirely.

The ``revert`` option (which is not available for erasure coded pools) will
either roll back to a previous version of the object or (if it was a new
object) forget about the object entirely. Use ``revert`` with caution, as it
may confuse applications that expect the object to exist.

Homeless Placement Groups
=========================

It is possible that every OSD that has copies of a given placement group fails.
If this happens, then the subset of the object store that contains those
placement groups becomes unavailable and the monitor will receive no status
updates for those placement groups. The monitor marks as ``stale`` any
placement group whose primary OSD has failed. For example:

.. prompt:: bash

   ceph health

::

    HEALTH_WARN 24 pgs stale; 3/300 in osds are down

Identify which placement groups are ``stale`` and which were the last OSDs to
store the ``stale`` placement groups by running the following command:

.. prompt:: bash

   ceph health detail

::

   HEALTH_WARN 24 pgs stale; 3/300 in osds are down
   ...
   pg 2.5 is stuck stale+active+remapped, last acting [2,0]
   ...
   osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
   osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
   osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861

This output indicates that placement group 2.5 (``pg 2.5``) was last managed by
``osd.0`` and ``osd.2``. Restart those OSDs to allow the cluster to recover
that placement group.
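
When many placement groups are stale, it can help to collect the last acting
OSDs with a script. The following Python sketch parses text shaped like the
``ceph health detail`` sample above. The helper name ``stale_pgs`` is
hypothetical, and the regular expression is an assumption modeled on that
sample only; the exact wording of health messages can differ between Ceph
releases.

```python
import re

# Abbreviated sample modeled on the ``ceph health detail`` output above.
SAMPLE = """\
HEALTH_WARN 24 pgs stale; 3/300 in osds are down
pg 2.5 is stuck stale+active+remapped, last acting [2,0]
osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
"""

# Assumed line format: "pg <pgid> is stuck <state>, last acting [<osds>]".
STALE_PG = re.compile(r"^pg (\S+) is stuck (\S*stale\S*), last acting \[([\d,]*)\]")


def stale_pgs(health_detail):
    """Map each stale PG ID to the OSD IDs in its last acting set."""
    result = {}
    for line in health_detail.splitlines():
        match = STALE_PG.match(line.strip())
        if match:
            pgid, _state, acting = match.groups()
            result[pgid] = [int(osd) for osd in acting.split(",") if osd]
    return result


print(stale_pgs(SAMPLE))  # {'2.5': [2, 0]}
```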


Only a Few OSDs Receive Data
============================

If only a few of the nodes in the cluster are receiving data, check the number
of placement groups in the pool as instructed in the :ref:`Placement Groups
<rados_ops_pgs_get_pg_num>` documentation. Since placement groups get mapped to
OSDs in an operation involving dividing the number of placement groups in the
cluster by the number of OSDs in the cluster, a small number of placement
groups (the remainder, in this operation) are sometimes not distributed across
the cluster. In situations like this, create a pool with a placement group
count that is a multiple of the number of OSDs. See `Placement Groups`_ for
details. See the :ref:`Pool, PG, and CRUSH Config Reference
<rados_config_pool_pg_crush_ref>` for instructions on changing the default
values used to determine how many placement groups are assigned to each pool.


Can't Write Data
================

If the cluster is up, but some OSDs are down and you cannot write data, make
sure that you have the minimum number of OSDs running in the pool. If you don't
have the minimum number of OSDs running in the pool, Ceph will not allow you to
write data to it because there is no guarantee that Ceph can replicate your
data. See ``osd_pool_default_min_size`` in the :ref:`Pool, PG, and CRUSH
Config Reference <rados_config_pool_pg_crush_ref>` for details.


PGs Inconsistent
================

If the command ``ceph health detail`` returns an ``active + clean +
inconsistent`` state, this might indicate an error during scrubbing. Identify
the inconsistent placement group or placement groups by running the following
command:

.. prompt:: bash

   ceph health detail

::

    HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
    pg 0.6 is active+clean+inconsistent, acting [0,1,2]
    2 scrub errors

Alternatively, run this command if you prefer to inspect the output in a
programmatic way:

.. prompt:: bash

   rados list-inconsistent-pg rbd

::

    ["0.6"]

There is only one consistent state, but in the worst case there could be
several different inconsistencies, seen from multiple perspectives, in more
than one object. If an object named ``foo`` in PG ``0.6`` is truncated, the
output of ``rados list-inconsistent-obj`` will look something like this:

.. prompt:: bash

   rados list-inconsistent-obj 0.6 --format=json-pretty

.. code-block:: javascript

    {
        "epoch": 14,
        "inconsistents": [
            {
                "object": {
                    "name": "foo",
                    "nspace": "",
                    "locator": "",
                    "snap": "head",
                    "version": 1
                },
                "errors": [
                    "data_digest_mismatch",
                    "size_mismatch"
                ],
                "union_shard_errors": [
                    "data_digest_mismatch_info",
                    "size_mismatch_info"
                ],
                "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])",
                "shards": [
                    {
                        "osd": 0,
                        "errors": [],
                        "size": 968,
                        "omap_digest": "0xffffffff",
                        "data_digest": "0xe978e67f"
                    },
                    {
                        "osd": 1,
                        "errors": [],
                        "size": 968,
                        "omap_digest": "0xffffffff",
                        "data_digest": "0xe978e67f"
                    },
                    {
                        "osd": 2,
                        "errors": [
                            "data_digest_mismatch_info",
                            "size_mismatch_info"
                        ],
                        "size": 0,
                        "omap_digest": "0xffffffff",
                        "data_digest": "0xffffffff"
                    }
                ]
            }
        ]
    }

In this case, the output indicates the following:

* The only inconsistent object is named ``foo``, and its head has
  inconsistencies.
* The inconsistencies fall into two categories:

  #. ``errors``: these errors indicate inconsistencies between shards, without
     an indication of which shard(s) are bad. Check for the ``errors`` in the
     ``shards`` array, if available, to pinpoint the problem.

     * ``data_digest_mismatch``: the digest of the replica read from ``OSD.2``
       is different from the digests of the replica reads of ``OSD.0`` and
       ``OSD.1``.
     * ``size_mismatch``: the size of the replica read from ``OSD.2`` is ``0``,
       but the size reported by ``OSD.0`` and ``OSD.1`` is ``968``.

  #. ``union_shard_errors``: the union of all shard-specific ``errors`` in the
     ``shards`` array. The ``errors`` are set for the shard with the problem.
     These errors include ``read_error`` and other similar errors. The
     ``errors`` ending in ``_info`` indicate a comparison with
     ``selected_object_info``. Examine the ``shards`` array to determine
     which shard has which error or errors.

     * ``data_digest_mismatch_info``: the digest stored in the ``object-info``
       is not ``0xffffffff``, which is calculated from the shard read from
       ``OSD.2``.
     * ``size_mismatch_info``: the size stored in the ``object-info`` is
       different from the size read from ``OSD.2``. The latter is ``0``.

.. warning:: If ``read_error`` is listed in a shard's ``errors`` attribute, the
   inconsistency is likely due to physical storage errors. In cases like this,
   check the storage used by that OSD.

   Examine the output of ``dmesg`` and ``smartctl`` before attempting a drive
   repair.
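
The per-shard inspection described above can also be scripted. The following
Python sketch reads a trimmed copy of the ``list-inconsistent-obj`` output
shown earlier and reports which OSDs carry shard-level errors. The helper name
``shards_with_errors`` is hypothetical, and only fields present in that
example are assumed.

```python
import json

# Trimmed sample modeled on the ``list-inconsistent-obj`` output shown above.
SAMPLE = """
{"epoch": 14,
 "inconsistents": [
   {"object": {"name": "foo", "snap": "head"},
    "errors": ["data_digest_mismatch", "size_mismatch"],
    "union_shard_errors": ["data_digest_mismatch_info", "size_mismatch_info"],
    "shards": [
      {"osd": 0, "errors": [], "size": 968},
      {"osd": 1, "errors": [], "size": 968},
      {"osd": 2, "errors": ["data_digest_mismatch_info", "size_mismatch_info"],
       "size": 0}]}]}
"""


def shards_with_errors(report_json):
    """Return {object name: {osd id: [shard errors]}} for flagged shards."""
    report = json.loads(report_json)
    flagged = {}
    for item in report.get("inconsistents", []):
        bad = {s["osd"]: s["errors"] for s in item.get("shards", []) if s["errors"]}
        if bad:
            flagged[item["object"]["name"]] = bad
    return flagged


print(shards_with_errors(SAMPLE))
# {'foo': {2: ['data_digest_mismatch_info', 'size_mismatch_info']}}
```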

To repair the inconsistent placement group, run a command of the following
form:

.. prompt:: bash

   ceph pg repair {placement-group-ID}

.. warning:: This command overwrites the "bad" copies with "authoritative"
   copies. In most cases, Ceph is able to choose authoritative copies from all
   the available replicas by using some predefined criteria. This, however,
   does not work in every case. For example, it might be the case that the
   stored data digest is missing, which means that the calculated digest is
   ignored when Ceph chooses the authoritative copies. Be aware of this, and
   use the above command with caution.


If you receive ``active + clean + inconsistent`` states periodically due to
clock skew, consider configuring the `NTP
<https://en.wikipedia.org/wiki/Network_Time_Protocol>`_ daemons on your monitor
hosts to act as peers. See `The Network Time Protocol <http://www.ntp.org>`_
and Ceph :ref:`Clock Settings <mon-config-ref-clock>` for more information.


Erasure Coded PGs are not active+clean
======================================

If CRUSH fails to find enough OSDs to map to a PG, it will show as a
``2147483647`` which is ``ITEM_NONE`` or ``no OSD found``. For example::

     [2,1,6,0,5,8,2147483647,7,4]
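
A short Python sketch shows how such a list can be checked for unmapped slots.
The constant and the helper name ``unmapped_slots`` are hypothetical names for
this example; the only fact taken from the text above is that the value
``2147483647`` marks a slot for which no OSD was found.

```python
ITEM_NONE = 2147483647  # 0x7fffffff, the "no OSD found" placeholder


def unmapped_slots(osds):
    """Return the positions in a PG's OSD list that CRUSH could not fill."""
    return [index for index, osd in enumerate(osds) if osd == ITEM_NONE]


print(unmapped_slots([2, 1, 6, 0, 5, 8, 2147483647, 7, 4]))  # [6]
```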

Not enough OSDs
---------------

If the Ceph cluster has only eight OSDs and an erasure coded pool needs nine
OSDs, the cluster will show "Not enough OSDs". In this case, you can either
create another erasure coded pool that requires fewer OSDs, by running commands
of the following form:

.. prompt:: bash

     ceph osd erasure-code-profile set myprofile k=5 m=3
     ceph osd pool create erasurepool erasure myprofile

or add new OSDs, and the PG will automatically use them.
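
The arithmetic behind this choice is simple: an erasure coded PG needs one OSD
for each of its ``k`` data chunks and one for each of its ``m`` coding chunks.
The following Python sketch, with hypothetical helper names, checks whether a
profile fits a given number of OSDs. It deliberately ignores CRUSH failure
domains, which can impose stricter limits.

```python
def osds_required(k, m):
    """One OSD per data chunk (k) plus one per coding chunk (m)."""
    return k + m


def profile_fits(k, m, available_osds):
    """True if the cluster has at least k + m OSDs (failure domains ignored)."""
    return osds_required(k, m) <= available_osds


print(osds_required(5, 3))                    # 8
print(profile_fits(5, 3, available_osds=8))   # True
print(profile_fits(6, 3, available_osds=8))   # False
```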

CRUSH constraints cannot be satisfied
-------------------------------------

If the cluster has enough OSDs, it is possible that the CRUSH rule is imposing
constraints that cannot be satisfied. If there are ten OSDs on two hosts and
the CRUSH rule requires that no two OSDs from the same host are used in the
same PG, the mapping may fail because only two OSDs will be found. Check the
constraint by displaying ("dumping") the rule, as shown here:

.. prompt:: bash

   ceph osd crush rule ls

::

    [
        "replicated_rule",
        "erasurepool"]
    $ ceph osd crush rule dump erasurepool
    { "rule_id": 1,
      "rule_name": "erasurepool",
      "type": 3,
      "steps": [
            { "op": "take",
              "item": -1,
              "item_name": "default"},
            { "op": "chooseleaf_indep",
              "num": 0,
              "type": "host"},
            { "op": "emit"}]}


Resolve this problem by creating a new pool in which PGs are allowed to have
OSDs residing on the same host by running the following commands:

.. prompt:: bash

   ceph osd erasure-code-profile set myprofile crush-failure-domain=osd
   ceph osd pool create erasurepool erasure myprofile

CRUSH gives up too soon
-----------------------

If the Ceph cluster has just enough OSDs to map the PG (for instance a cluster
with a total of nine OSDs and an erasure coded pool that requires nine OSDs per
PG), it is possible that CRUSH gives up before finding a mapping. This problem
can be resolved by:

* lowering the erasure coded pool requirements to use fewer OSDs per PG (this
  requires the creation of another pool, because erasure code profiles cannot
  be modified dynamically).

* adding more OSDs to the cluster (this does not require the erasure coded pool
  to be modified, because it will become clean automatically)

* using a handmade CRUSH rule that tries more times to find a good mapping.
  This can be modified for an existing CRUSH rule by setting
  ``set_choose_tries`` to a value greater than the default.

First, verify the problem by using ``crushtool`` after extracting the crushmap
from the cluster. This ensures that your experiments do not modify the Ceph
cluster and that they operate only on local files:

.. prompt:: bash

   ceph osd crush rule dump erasurepool

::

    { "rule_id": 1,
      "rule_name": "erasurepool",
      "type": 3,
      "steps": [
            { "op": "take",
              "item": -1,
              "item_name": "default"},
            { "op": "chooseleaf_indep",
              "num": 0,
              "type": "host"},
            { "op": "emit"}]}
    $ ceph osd getcrushmap > crush.map
    got crush map from osdmap epoch 13
    $ crushtool -i crush.map --test --show-bad-mappings \
       --rule 1 \
       --num-rep 9 \
       --min-x 1 --max-x $((1024 * 1024))
    bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
    bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
    bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]

Here, ``--num-rep`` is the number of OSDs that the erasure code CRUSH rule
needs, and ``--rule`` is the value of the ``rule_id`` field that was displayed
by ``ceph osd crush rule dump``. This test will attempt to map one million
values (in this example, the range defined by ``[--min-x,--max-x]``) and must
display at least one bad mapping. If this test outputs nothing, all mappings
have been successful and you can be assured that the problem with your cluster
is not caused by bad mappings.
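
When the test reports many bad mappings, a small script can summarize them.
The following Python sketch parses lines shaped like the ``bad mapping``
output above and counts the unfilled slots in each result. The helper name
``parse_bad_mappings`` is hypothetical, and the regular expression is an
assumption modeled on the sample lines only.

```python
import re

ITEM_NONE = 2147483647  # placeholder for "no OSD found"

# Sample lines copied from the crushtool output shown above.
SAMPLE = """\
bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]
"""

BAD_MAPPING = re.compile(
    r"bad mapping rule (\d+) x (\d+) num_rep (\d+) result \[([\d,]*)\]"
)


def parse_bad_mappings(output):
    """Return one dict per ``bad mapping`` line found in ``output``."""
    mappings = []
    for match in BAD_MAPPING.finditer(output):
        rule, x, num_rep, result = match.groups()
        osds = [int(item) for item in result.split(",") if item]
        mappings.append({
            "rule": int(rule),
            "x": int(x),
            "num_rep": int(num_rep),
            "missing": osds.count(ITEM_NONE),  # slots CRUSH could not fill
        })
    return mappings


for mapping in parse_bad_mappings(SAMPLE):
    print(mapping)
```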

Changing the value of set_choose_tries
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. Decompile the CRUSH map to edit the CRUSH rule by running the following
   command:

   .. prompt:: bash

      crushtool --decompile crush.map > crush.txt

#. Add the following line to the rule::

      step set_choose_tries 100

   The relevant part of the ``crush.txt`` file will resemble this::

      rule erasurepool {
              id 1
              type erasure
              step set_chooseleaf_tries 5
              step set_choose_tries 100
              step take default
              step chooseleaf indep 0 type host
              step emit
      }

#. Recompile and retest the CRUSH rule:

   .. prompt:: bash

      crushtool --compile crush.txt -o better-crush.map

#. When all mappings succeed, display a histogram of the number of tries that
   were necessary to find all of the mappings by using the
   ``--show-choose-tries`` option of the ``crushtool`` command, as in the
   following example:

   .. prompt:: bash

      crushtool -i better-crush.map --test --show-bad-mappings \
       --show-choose-tries \
       --rule 1 \
       --num-rep 9 \
       --min-x 1 --max-x $((1024 * 1024))

   ::

      ...
      11:        42
      12:        44
      13:        54
      14:        45
      15:        35
      16:        34
      17:        30
      18:        25
      19:        19
      20:        22
      21:        20
      22:        17
      23:        13
      24:        16
      25:        13
      26:        11
      27:        11
      28:        13
      29:        11
      30:        10
      31:         6
      32:         5
      33:        10
      34:         3
      35:         7
      36:         5
      37:         2
      38:         5
      39:         5
      40:         2
      41:         5
      42:         4
      43:         1
      44:         2
      45:         2
      46:         3
      47:         1
      48:         0
      ...
      102:         0
      103:         1
      104:         0
      ...

   This output indicates that it took eleven tries to map forty-two PGs, twelve
   tries to map forty-four PGs etc. The highest number of tries is the minimum
   value of ``set_choose_tries`` that prevents bad mappings (for example,
   ``103`` in the above output, because it did not take more than 103 tries for
   any PG to be mapped).
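
Reading the highest non-zero row of a long histogram by eye is error-prone.
The following Python sketch finds it for text shaped like the output above.
The helper name ``min_choose_tries`` is hypothetical, the sample is an
abbreviated copy of that output, and the ``tries: count`` line format is an
assumption based on it.

```python
# Abbreviated sample modeled on the ``--show-choose-tries`` output above.
SAMPLE = """\
...
11:        42
12:        44
47:         1
48:         0
...
102:         0
103:         1
104:         0
...
"""


def min_choose_tries(histogram):
    """Return the largest number of tries that mapped at least one PG."""
    highest = 0
    for line in histogram.splitlines():
        tries, sep, count = line.partition(":")
        if not sep or not tries.strip().isdigit() or not count.strip().isdigit():
            continue  # skip "..." and any other non-histogram lines
        if int(count) > 0:
            highest = max(highest, int(tries))
    return highest


print(min_choose_tries(SAMPLE))  # 103
```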

.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups
.. _Placement Groups: ../../operations/placement-groups
.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref