Manually editing the CRUSH Map
==============================
.. note:: Manually editing the CRUSH map is an advanced administrator
   operation. For the majority of installations, CRUSH changes can be
   implemented via the Ceph CLI and do not require manual CRUSH map edits. If
   you have identified a use case where manual edits *are* necessary with a
   recent Ceph release, consider contacting the Ceph developers at dev@ceph.io
   so that future versions of Ceph do not have this problem.
To edit an existing CRUSH map, carry out the following procedure:

#. `Get the CRUSH map`_.
#. `Decompile`_ the CRUSH map.
#. Edit at least one of the following sections: `Devices`_, `Buckets`_, and
   `Rules`_. Use a text editor for this task.
#. `Recompile`_ the CRUSH map.
#. `Set the CRUSH map`_.

For details on setting the CRUSH map rule for a specific pool, see `Set Pool
Values`_.
.. _Get the CRUSH map: #getcrushmap
.. _Decompile: #decompilecrushmap
.. _Devices: #crushmapdevices
.. _Buckets: #crushmapbuckets
.. _Rules: #crushmaprules
.. _Recompile: #compilecrushmap
.. _Set the CRUSH map: #setcrushmap
.. _Set Pool Values: ../pools#setpoolvalues
.. _getcrushmap:

Get the CRUSH Map
-----------------

To get the CRUSH map for your cluster, run a command of the following form:

.. prompt:: bash $

   ceph osd getcrushmap -o {compiled-crushmap-filename}

Ceph outputs (``-o``) a compiled CRUSH map to the filename that you have
specified. Because the CRUSH map is in a compiled form, you must first
decompile it before you can edit it.
.. _decompilecrushmap:

Decompile the CRUSH Map
-----------------------

To decompile the CRUSH map, run a command of the following form:

.. prompt:: bash $

   crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename}
.. _compilecrushmap:

Recompile the CRUSH Map
-----------------------

To compile the CRUSH map, run a command of the following form:

.. prompt:: bash $

   crushtool -c {decompiled-crushmap-filename} -o {compiled-crushmap-filename}
.. _setcrushmap:

Set the CRUSH Map
-----------------

To set the CRUSH map for your cluster, run a command of the following form:

.. prompt:: bash $

   ceph osd setcrushmap -i {compiled-crushmap-filename}

Ceph loads (``-i``) a compiled CRUSH map from the filename that you have
specified.
Sections
--------

A CRUSH map has six main sections:
#. **tunables:** The preamble at the top of the map describes any *tunables*
   that are not a part of legacy CRUSH behavior. These tunables correct for old
   bugs, optimizations, or other changes that have been made over the years to
   improve CRUSH's behavior.

#. **devices:** Devices are individual OSDs that store data.

#. **types:** Bucket ``types`` define the types of buckets that are used in
   the CRUSH hierarchy.

#. **buckets:** Buckets consist of a hierarchical aggregation of storage
   locations (for example, rows, racks, chassis, hosts) and their assigned
   weights. After the bucket ``types`` have been defined, the CRUSH map defines
   each node in the hierarchy, its type, and which devices or other nodes it
   contains.

#. **rules:** Rules define policy about how data is distributed across
   devices in the hierarchy.

#. **choose_args:** ``choose_args`` are alternative weights associated with
   the hierarchy that have been adjusted in order to optimize data placement. A
   single ``choose_args`` map can be used for the entire cluster, or a number
   of ``choose_args`` maps can be created such that each map is crafted for a
   particular pool.
.. _crushmapdevices:

CRUSH-Map Devices
-----------------

Devices are individual OSDs that store data. In this section, there is usually
one device defined for each OSD daemon in your cluster. Devices are identified
by an ``id`` (a non-negative integer) and a ``name`` (usually ``osd.N``, where
``N`` is the device's ``id``).
.. _crush-map-device-class:

A device can also have a *device class* associated with it: for example,
``hdd`` or ``ssd``. Device classes make it possible for devices to be targeted
by CRUSH rules. This means that device classes allow CRUSH rules to select only
OSDs that match certain characteristics. For example, you might want an RBD
pool associated only with SSDs and a different RBD pool associated only with
HDDs.
To see a list of devices, run the following command:

.. prompt:: bash #

   ceph device ls
The output of this command takes the following form::

   device {num} {osd.name} [class {class}]

For example::

   device 0 osd.0 class ssd
   device 1 osd.1 class hdd
In most cases, each device maps to a corresponding ``ceph-osd`` daemon. This
daemon might map to a single storage device, a pair of devices (for example,
one for data and one for a journal or metadata), or in some cases a small RAID
device or a partition of a larger storage device.
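Because the ``#devices`` entries have a fixed shape, they are easy to process
with scripts. The following Python sketch (an illustration for this document,
not part of Ceph) parses one device line into its ``id``, ``name``, and
optional device class:

```python
import re

# Matches "device {num} {osd.name} [class {class}]" from a decompiled CRUSH map.
DEVICE_RE = re.compile(r"^device\s+(\d+)\s+(\S+)(?:\s+class\s+(\S+))?\s*$")

def parse_device_line(line):
    """Return (id, name, device_class-or-None) for one '#devices' entry."""
    m = DEVICE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a device line: {line!r}")
    dev_id, name, dev_class = m.groups()
    return int(dev_id), name, dev_class
```

For example, ``parse_device_line("device 0 osd.0 class ssd")`` returns
``(0, 'osd.0', 'ssd')``, and a line without a ``class`` field yields ``None``
for the class.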
.. _crushmapbuckets:

CRUSH-Map Bucket Types
----------------------

The second list in the CRUSH map defines 'bucket' types. Buckets facilitate a
hierarchy of nodes and leaves. Node buckets (also known as non-leaf buckets)
typically represent physical locations in a hierarchy. Nodes aggregate other
nodes or leaves. Leaf buckets represent ``ceph-osd`` daemons and their
corresponding storage media.

.. tip:: In the context of CRUSH, the term "bucket" is used to refer to
   a node in the hierarchy (that is, to a location or a piece of physical
   hardware). In the context of RADOS Gateway APIs, however, the term
   "bucket" has a different meaning.
To add a bucket type to the CRUSH map, create a new line under the list of
bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name.
By convention, there is exactly one leaf bucket type and it is ``type 0``;
however, you may give the leaf bucket any name you like (for example: ``osd``,
``disk``, ``drive``, ``storage``)::

   type {num} {bucket-name}
For example, a default CRUSH map defines the following bucket types::

   type 0 osd
   type 1 host
   type 2 chassis
   type 3 rack
   type 4 row
   type 5 pdu
   type 6 pod
   type 7 room
   type 8 datacenter
   type 9 zone
   type 10 region
   type 11 root
CRUSH-Map Bucket Hierarchy
--------------------------

The CRUSH algorithm distributes data objects among storage devices according to
a per-device weight value, approximating a uniform probability distribution.
CRUSH distributes objects and their replicas according to the hierarchical
cluster map you define. The CRUSH map represents the available storage devices
and the logical elements that contain them.

To map placement groups (PGs) to OSDs across failure domains, a CRUSH map
defines a hierarchical list of bucket types under ``#types`` in the generated
CRUSH map. The purpose of creating a bucket hierarchy is to segregate the leaf
nodes according to their failure domains (for example: hosts, chassis, racks,
power distribution units, pods, rows, rooms, and data centers). With the
exception of the leaf nodes that represent OSDs, the hierarchy is arbitrary and
you may define it according to your own needs.
We recommend adapting your CRUSH map to your preferred hardware-naming
conventions and using bucket names that clearly reflect the physical
hardware. Clear naming practice can make it easier to administer the cluster
and easier to troubleshoot problems when OSDs malfunction (or other hardware
malfunctions) and the administrator needs access to physical hardware.

In the following example, the bucket hierarchy has a leaf bucket named ``osd``
and two node buckets named ``host`` and ``rack``:
.. ditaa::

                              +-----------+
                              | {o}rack   |
                              |   Bucket  |
                              +-----+-----+
                                    |
                    +---------------+---------------+
                    |                               |
              +-----+-----+                   +-----+-----+
              | {o}host   |                   | {o}host   |
              |   Bucket  |                   |   Bucket  |
              +-----+-----+                   +-----+-----+
                    |                               |
            +-------+-------+               +-------+-------+
            |               |               |               |
      +-----+-----+   +-----+-----+   +-----+-----+   +-----+-----+
      |    osd    |   |    osd    |   |    osd    |   |    osd    |
      |   Bucket  |   |   Bucket  |   |   Bucket  |   |   Bucket  |
      +-----------+   +-----------+   +-----------+   +-----------+
.. note:: The higher-numbered ``rack`` bucket type aggregates the
   lower-numbered ``host`` bucket type.

Because leaf nodes reflect storage devices that have already been declared
under the ``#devices`` list at the beginning of the CRUSH map, there is no need
to declare them as bucket instances. The second-lowest bucket type in your
hierarchy is typically used to aggregate the devices (that is, the
second-lowest bucket type is usually the computer that contains the storage
media and has a name such as ``node``, ``computer``, ``server``, ``host``, or
``machine``). In high-density environments, it is common to have multiple hosts
or nodes in a single chassis (for example, in the cases of blades or twins). It
is important to anticipate the potential consequences of chassis failure -- for
example, during the replacement of a chassis in case of a node failure, the
chassis's hosts or nodes (and their associated OSDs) will be in a ``down``
state.
To declare a bucket instance, do the following: specify its type, give it a
unique name (an alphanumeric string), assign it a unique ID expressed as a
negative integer (this is optional), assign it a weight relative to the total
capacity and capability of the item(s) in the bucket, assign it a bucket
algorithm (usually ``straw2``), and specify the bucket algorithm's hash
(usually ``0``, a setting that reflects the hash algorithm ``rjenkins1``). A
bucket may have one or more items. The items may consist of node buckets or
leaves. Items may have a weight that reflects the relative weight of the item.
To declare a node bucket, use the following syntax::

   [bucket-type] [bucket-name] {
           id [a unique negative numeric ID]
           weight [the relative capacity/capability of the item(s)]
           alg [the bucket type: uniform | list | tree | straw | straw2 ]
           hash [the hash type: 0 by default]
           item [item-name] weight [weight]
   }
For example, in the above diagram, two host buckets (referred to in the
declaration below as ``node1`` and ``node2``) and one rack bucket (referred to
in the declaration below as ``rack1``) are defined. The OSDs are declared as
items within the host buckets::

   host node1 {
           id -1
           alg straw2
           hash 0
           item osd.0 weight 1.00
           item osd.1 weight 1.00
   }

   host node2 {
           id -2
           alg straw2
           hash 0
           item osd.2 weight 1.00
           item osd.3 weight 1.00
   }

   rack rack1 {
           id -3
           alg straw2
           hash 0
           item node1 weight 2.00
           item node2 weight 2.00
   }
.. note:: In this example, the rack bucket does not contain any OSDs. Instead,
   it contains lower-level host buckets and includes the sum of their weight in
   the item entry.
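When hand-editing bucket declarations, a useful sanity check is that the
weight a parent assigns to an item matches the sum of that item's own ``item``
weights. A small illustrative Python helper (not a Ceph tool; the sample
declaration is hypothetical) that sums the ``item`` weights inside one
declaration:

```python
import re

# A hypothetical bucket declaration in decompiled CRUSH syntax.
NODE1 = """
host node1 {
        id -1
        alg straw2
        hash 0
        item osd.0 weight 1.00
        item osd.1 weight 1.00
}
"""

def item_weight_sum(bucket_text):
    """Sum the weights of all 'item <name> weight <W>' lines in a declaration."""
    return sum(float(w) for w in
               re.findall(r"item\s+\S+\s+weight\s+([0-9.]+)", bucket_text))
```

Here ``item_weight_sum(NODE1)`` gives ``2.0``, which is the weight a rack
bucket should assign to ``node1``.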
.. topic:: Bucket Types

   Ceph supports five bucket types. Each bucket type provides a balance between
   performance and reorganization efficiency, and each is different from the
   others. If you are unsure of which bucket type to use, use the ``straw2``
   bucket. For a more technical discussion of bucket types than is offered
   here, see **Section 3.4** of `CRUSH - Controlled, Scalable, Decentralized
   Placement of Replicated Data`_.

   The bucket types are as follows:
   #. **uniform**: Uniform buckets aggregate devices that have **exactly**
      the same weight. For example, when hardware is commissioned or
      decommissioned, it is often done in sets of machines that have exactly
      the same physical configuration (this can be the case, for example,
      after bulk purchases). When storage devices have exactly the same
      weight, you may use the ``uniform`` bucket type, which allows CRUSH to
      map replicas into uniform buckets in constant time. If your devices have
      non-uniform weights, you should not use the uniform bucket algorithm.

   #. **list**: List buckets aggregate their content as linked lists. The
      behavior of list buckets is governed by the :abbr:`RUSH (Replication
      Under Scalable Hashing)`:sub:`P` algorithm. In the behavior of this
      bucket type, an object is either relocated to the newest device in
      accordance with an appropriate probability, or it remains on the older
      devices as before. This results in optimal data migration when items are
      added to the bucket. The removal of items from the middle or the tail of
      the list, however, can result in a significant amount of unnecessary
      data movement. This means that list buckets are most suitable for
      circumstances in which they **never shrink or very rarely shrink**.

   #. **tree**: Tree buckets use a binary search tree. They are more efficient
      at dealing with buckets that contain many items than are list buckets.
      The behavior of tree buckets is governed by the :abbr:`RUSH (Replication
      Under Scalable Hashing)`:sub:`R` algorithm. Tree buckets reduce the
      placement time to O(log\ :sub:`n`). This means that tree buckets are
      suitable for managing large sets of devices or nested buckets.

   #. **straw**: Straw buckets allow all items in the bucket to "compete"
      against each other for replica placement through a process analogous to
      drawing straws. This is different from the behavior of list buckets and
      tree buckets, which use a divide-and-conquer strategy that either gives
      certain items precedence (for example, those at the beginning of a list)
      or obviates the need to consider entire subtrees of items. Such an
      approach improves the performance of the replica placement process, but
      can also introduce suboptimal reorganization behavior when the contents
      of a bucket change due to an addition, a removal, or the re-weighting of
      an item.
   #. **straw2**: Straw2 buckets improve on straw buckets by correctly
      avoiding any data movement between items when neighbor weights change.
      For example, if the weight of a given item changes (including during the
      operations of adding it to the cluster or removing it from the cluster),
      there will be data movement to or from only that item. Neighbor weights
      are not taken into account.
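The straw2 "drawing straws" competition can be sketched in a few lines of
Python. This is a simplified illustration only, not Ceph's implementation
(which uses ``rjenkins1`` hashing and fixed-point lookup tables): each item
draws a pseudo-random value from the input key, scales it by its weight, and
the longest straw wins, so over many inputs each item wins in proportion to
its weight:

```python
import hashlib
import math

def straw2_choose(items, key):
    """Pick one item name; each item wins with probability proportional to its weight."""
    best_name, best_draw = None, float("-inf")
    for name, weight in items.items():
        # Deterministic pseudo-random uniform value in (0, 1] from (key, item).
        digest = hashlib.sha256(f"{key}:{name}".encode()).digest()
        u = (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 1)
        # log(u) is negative; dividing by a larger weight yields a longer straw.
        draw = math.log(u) / weight
        if draw > best_draw:
            best_name, best_draw = name, draw
    return best_name
```

Because each item's draw depends only on the input key and that item's own
weight, changing one item's weight can move data only to or from that item,
which is the property described above.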
.. topic:: Hash

   Each bucket uses a hash algorithm. As of Reef, Ceph supports the
   ``rjenkins1`` algorithm. To select ``rjenkins1`` as the hash algorithm,
   enter ``0`` as your hash setting.
.. _weightingbucketitems:

.. topic:: Weighting Bucket Items

   Ceph expresses bucket weights as doubles, which allows for fine-grained
   weighting. A weight is the relative difference between device capacities. We
   recommend using ``1.00`` as the relative weight for a 1 TB storage device.
   In such a scenario, a weight of ``0.50`` would represent approximately 500
   GB, and a weight of ``3.00`` would represent approximately 3 TB. Buckets
   higher in the CRUSH hierarchy have a weight that is the sum of the weight of
   the leaf items aggregated by the bucket.
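The arithmetic behind this convention is straightforward; a minimal Python
sketch (the function names are illustrative, not a Ceph API):

```python
def weight_for_capacity(capacity_gb):
    """Relative CRUSH weight under the 1.00-per-1-TB convention."""
    return capacity_gb / 1000.0

def bucket_weight(child_weights):
    """A bucket's weight is the sum of the weights of the items beneath it."""
    return sum(child_weights)
```

A 500 GB device gets weight ``0.5``, a 3 TB device gets ``3.0``, and a host
holding two 1 TB devices gets ``2.0``, matching the examples in the text.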
.. _crushmaprules:

CRUSH-Map Rules
---------------

CRUSH maps have rules that include data placement for a pool: these are
called "CRUSH rules". The default CRUSH map has one rule for each pool. If you
are running a large cluster, you might create many pools and each of those
pools might have its own non-default CRUSH rule.

.. note:: In most cases, there is no need to modify the default rule. When a
   new pool is created, by default the rule will be set to the value ``0``
   (which indicates the default CRUSH rule, which has the numeric ID ``0``).
CRUSH rules define policy that governs how data is distributed across the
devices in the hierarchy. The rules define placement as well as replication
strategies or distribution policies that allow you to specify exactly how
CRUSH places data replicas. For example, you might create one rule selecting a
pair of targets for two-way mirroring, another rule for selecting three
targets in two different data centers for three-way replication, and yet
another rule for erasure coding across six storage devices. For a detailed
discussion of CRUSH rules, see **Section 3.2** of `CRUSH - Controlled,
Scalable, Decentralized Placement of Replicated Data`_.
A rule takes the following form::

   rule <rulename> {
           id [a unique integer ID]
           type [replicated|erasure]
           step take <bucket-name> [class <device-class>]
           step [choose|chooseleaf] [firstn|indep] <N> type <bucket-type>
           step emit
   }
``id``
    :Description: A unique integer that identifies the rule.
    :Purpose: A component of the rule mask.
``type``
    :Description: Denotes the type of replication strategy to be enforced by
       the rule.
    :Purpose: A component of the rule mask.
    :Default: ``replicated``
    :Valid Values: ``replicated`` or ``erasure``
``step take <bucket-name> [class <device-class>]``
    :Description: Takes a bucket name and iterates down the tree. If
       the ``device-class`` argument is specified, the argument must
       match a class assigned to OSDs within the cluster. Only
       devices belonging to the class are included.
    :Purpose: A component of the rule.
    :Example: ``step take data``
``step choose firstn {num} type {bucket-type}``
    :Description: Selects ``num`` buckets of the given type from within the
       current bucket. ``{num}`` is usually the number of replicas in
       the pool (in other words, the pool size).

       - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (as many buckets as are available).
       - If ``pool-num-replicas > {num} > 0``, choose that many buckets.
       - If ``{num} < 0``, choose ``pool-num-replicas - |{num}|`` buckets.

    :Purpose: A component of the rule.
    :Prerequisite: Follows ``step take`` or ``step choose``.
    :Example: ``step choose firstn 1 type row``
``step chooseleaf firstn {num} type {bucket-type}``
    :Description: Selects a set of buckets of the given type and chooses a
       leaf node (that is, an OSD) from the subtree of each bucket in that set
       of buckets. The number of buckets in the set is usually the number of
       replicas in the pool (in other words, the pool size).

       - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (as many buckets as are available).
       - If ``pool-num-replicas > {num} > 0``, choose that many buckets.
       - If ``{num} < 0``, choose ``pool-num-replicas - |{num}|`` buckets.

    :Purpose: A component of the rule. Using ``chooseleaf`` obviates the need
       to select a device in a separate step.
    :Prerequisite: Follows ``step take`` or ``step choose``.
    :Example: ``step chooseleaf firstn 0 type row``
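The three ``{num}`` cases above reduce to a small piece of arithmetic. Here is
a hedged Python sketch of how many buckets a ``choose``/``chooseleaf`` step
requests (illustrative only; the real logic lives inside Ceph's CRUSH mapper):

```python
def buckets_requested(num, pool_num_replicas):
    """Number of buckets a 'step choose/chooseleaf ... {num} ...' step asks for."""
    if num == 0:
        return pool_num_replicas          # as many as the pool needs
    if num > 0:
        return num                        # an explicit count
    return pool_num_replicas - abs(num)   # pool-num-replicas minus |num|
```

With a pool size of 3, ``firstn 0`` requests 3 buckets, ``firstn 2`` requests
2, and ``firstn -1`` requests 2.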
``step emit``
    :Description: Outputs the current value on the top of the stack and
       empties the stack. Typically used at the end of a rule, but may also be
       used to choose from different trees in the same rule.
    :Purpose: A component of the rule.
    :Prerequisite: Follows ``step choose``.
    :Example: ``step emit``
.. important:: A single CRUSH rule can be assigned to multiple pools, but
   a single pool cannot have multiple CRUSH rules.
``firstn`` or ``indep``

    :Description: Determines which replacement strategy CRUSH uses when items
       (OSDs) are marked ``down`` in the CRUSH map. When this rule is used
       with replicated pools, ``firstn`` is used. When this rule is used with
       erasure-coded pools, ``indep`` is used.

       Suppose that a PG is stored on OSDs 1, 2, 3, 4, and 5 and then OSD 3
       goes down.

       When in ``firstn`` mode, CRUSH simply adjusts its calculation
       to select OSDs 1 and 2, then selects 3 and discovers that 3 is
       down, retries and selects 4 and 5, and finally goes on to
       select a new OSD: OSD 6. The final CRUSH mapping
       transformation is therefore 1, 2, 3, 4, 5 → 1, 2, 4, 5, 6.

       However, if you were storing an erasure-coded pool, the above
       sequence would have changed the data that is mapped to OSDs 4,
       5, and 6. The ``indep`` mode attempts to avoid this unwanted
       consequence. When in ``indep`` mode, CRUSH can be expected to
       select 3, discover that 3 is down, retry, and select 6. The
       final CRUSH mapping transformation is therefore 1, 2, 3, 4, 5
       → 1, 2, 6, 4, 5.
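The two replacement strategies can be modeled with a few lines of Python. This
is a simplified illustration of the bookkeeping only (real CRUSH recomputes
the mapping pseudo-randomly rather than consuming a spare list):

```python
def remap_firstn(mapping, down, spares):
    """Shift the survivors left and append replacements at the end."""
    spares = iter(spares)
    result = [osd for osd in mapping if osd not in down]
    while len(result) < len(mapping):
        result.append(next(spares))
    return result

def remap_indep(mapping, down, spares):
    """Replace each down OSD in place; all other positions stay untouched."""
    spares = iter(spares)
    return [next(spares) if osd in down else osd for osd in mapping]
```

For the example in the text, ``remap_firstn([1, 2, 3, 4, 5], {3}, [6])``
yields ``[1, 2, 4, 5, 6]``, while ``remap_indep`` yields ``[1, 2, 6, 4, 5]``
and so leaves positions 4 and 5 alone, which is what erasure-coded pools need.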
.. _crush-reclassify:

Migrating from a legacy SSD rule to device classes
--------------------------------------------------

Prior to the Luminous release's introduction of the *device class* feature, in
order to write rules that applied to a specialized device type (for example,
SSD), it was necessary to manually edit the CRUSH map and maintain a parallel
hierarchy for each device type. The device class feature provides a more
transparent way to achieve this end.
However, if your cluster is migrated from an existing manually-customized
per-device map to new device class-based rules, all data in the system will be
reshuffled.

The ``crushtool`` utility has several commands that can transform a legacy rule
and hierarchy and allow you to start using the new device class rules. There
are three possible types of transformation:
#. ``--reclassify-root <root-name> <device-class>``

   This command examines everything under ``root-name`` in the hierarchy and
   rewrites any rules that reference the specified root and that have the
   form ``take <root-name>`` so that they instead have the
   form ``take <root-name> class <device-class>``. The command also renumbers
   the buckets in such a way that the old IDs are used for the specified
   class's "shadow tree" and as a result no data movement takes place.

   For example, suppose you have the following as an existing rule::

      rule replicated_rule {
         id 0
         type replicated
         step take default
         step chooseleaf firstn 0 type rack
         step emit
      }

   If the root ``default`` is reclassified as class ``hdd``, the new rule will
   resemble the following::

      rule replicated_rule {
         id 0
         type replicated
         step take default class hdd
         step chooseleaf firstn 0 type rack
         step emit
      }
#. ``--set-subtree-class <bucket-name> <device-class>``

   This command marks every device in the subtree that is rooted at
   *bucket-name* with the specified device class.

   This command is typically used in conjunction with the
   ``--reclassify-root`` option in order to ensure that all devices in that
   root are labeled with the correct class. In certain circumstances, however,
   some of those devices are correctly labeled with a different class and must
   not be relabeled. To manage this difficulty, one can exclude the
   ``--set-subtree-class`` option. The remapping process will not be perfect,
   because the previous rule had an effect on devices of multiple classes but
   the adjusted rules will map only to devices of the specified device class.
   However, when there are not many outlier devices, the resulting level of
   data movement is often within tolerable limits.
#. ``--reclassify-bucket <match-pattern> <device-class> <default-parent>``

   This command allows you to merge a parallel type-specific hierarchy with
   the normal hierarchy. For example, many users have maps that resemble the
   following::

      host node1 {
         id -2           # do not change unnecessarily
         alg straw2
         hash 0  # rjenkins1
         item osd.0 weight 9.096
         item osd.1 weight 9.096
         item osd.2 weight 9.096
         item osd.3 weight 9.096
         item osd.4 weight 9.096
         item osd.5 weight 9.096
         ...
      }

      host node1-ssd {
         id -10          # do not change unnecessarily
         item osd.80 weight 2.000
         ...
      }

      root default {
         id -1           # do not change unnecessarily
         item node1 weight 110.967
         ...
      }

      root ssd {
         id -18          # do not change unnecessarily
         item node1-ssd weight 2.000
         ...
      }
   This command reclassifies each bucket that matches a certain
   pattern. The pattern can be of the form ``%suffix`` or ``prefix%``. For
   example, in the above example, we would use the pattern
   ``%-ssd``. For each matched bucket, the remaining portion of the
   name (corresponding to the ``%`` wildcard) specifies the *base bucket*. All
   devices in the matched bucket are labeled with the specified
   device class and then moved to the base bucket. If the base bucket
   does not exist (for example, ``node12-ssd`` exists but ``node12`` does
   not), then it is created and linked under the specified
   *default parent* bucket. In each case, care is taken to preserve
   the old bucket IDs for the new shadow buckets in order to prevent data
   movement. Any rules with ``take`` steps that reference the old
   buckets are adjusted accordingly.
#. ``--reclassify-bucket <bucket-name> <device-class> <base-bucket>``

   The same command can also be used without a wildcard in order to map a
   single bucket. For example, in the previous example, we want the
   ``ssd`` bucket to be mapped to the ``default`` bucket.
The final command to convert the map that consists of the above fragments
resembles the following:

.. prompt:: bash $

   ceph osd getcrushmap -o original
   crushtool -i original --reclassify \
     --set-subtree-class default hdd \
     --reclassify-root default hdd \
     --reclassify-bucket %-ssd ssd default \
     --reclassify-bucket ssd ssd default \
     -o adjusted
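The bucket-name pattern matching that ``--reclassify-bucket`` performs can be
sketched in Python (an illustration of the naming rule only, not crushtool's
implementation):

```python
def base_bucket(name, pattern):
    """Return the base-bucket name for a match of '%suffix' or 'prefix%',
    or None if the bucket name does not match the pattern."""
    if pattern.startswith("%"):           # '%suffix' form, e.g. '%-ssd'
        suffix = pattern[1:]
        return name[: -len(suffix)] if name.endswith(suffix) else None
    if pattern.endswith("%"):             # 'prefix%' form
        prefix = pattern[:-1]
        return name[len(prefix):] if name.startswith(prefix) else None
    return name if name == pattern else None
```

With the pattern ``%-ssd``, the bucket ``node12-ssd`` maps to the base bucket
``node12``, while a name without the suffix does not match at all.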
A ``--compare`` flag is available to make sure that the conversion performed
in :ref:`Migrating from a legacy SSD rule to device classes
<crush-reclassify>` is correct. This flag tests a large sample of inputs
against the CRUSH map and checks that the expected result is output. The
options that control these inputs are the same as the options that apply to
the ``--test`` command. For an illustration of how this ``--compare`` command
applies to the above example, run the following command:

.. prompt:: bash $

   crushtool -i original --compare adjusted

The output of this command resembles the following::

   rule 0 had 0/10240 mismatched mappings (0)
   rule 1 had 0/10240 mismatched mappings (0)
   maps appear equivalent
If the command finds any differences, the ratio of remapped inputs is
reported.

When you are satisfied with the adjusted map, apply it to the cluster by
running the following command:

.. prompt:: bash $

   ceph osd setcrushmap -i adjusted
Manually Tuning CRUSH
---------------------

If you have verified that all clients are running recent code, you can adjust
the CRUSH tunables by extracting the CRUSH map, modifying the values, and
reinjecting the map into the cluster. The procedure is carried out as follows:

#. Extract the latest CRUSH map:

   .. prompt:: bash $

      ceph osd getcrushmap -o /tmp/crush

#. Adjust tunables. In our tests, the following values appear to result in the
   best behavior for both large and small clusters. The procedure requires
   that you specify the ``--enable-unsafe-tunables`` flag in the ``crushtool``
   command. Use this option with **extreme care**:

   .. prompt:: bash $

      crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new

#. Reinject the modified map:

   .. prompt:: bash $

      ceph osd setcrushmap -i /tmp/crush.new
To set the legacy values of the CRUSH tunables, run the following command:

.. prompt:: bash $

   crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy

The special ``--enable-unsafe-tunables`` flag is required. Be careful when
running old versions of the ``ceph-osd`` daemon after reverting to legacy
values, because the feature bit is not perfectly enforced.
.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf