1 ============
2 CRUSH Maps
3 ============
4
5 The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm
6 computes storage locations in order to determine how to store and retrieve
7 data. CRUSH allows Ceph clients to communicate with OSDs directly rather than
8 through a centralized server or broker. By using an algorithmically-determined
9 method of storing and retrieving data, Ceph avoids a single point of failure, a
10 performance bottleneck, and a physical limit to its scalability.
11
12 CRUSH uses a map of the cluster (the CRUSH map) to map data to OSDs,
13 distributing the data across the cluster in accordance with configured
14 replication policy and failure domains. For a detailed discussion of CRUSH, see
15 `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_.
16
17 CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)` and a
18 hierarchy of "buckets" (``host``\s, ``rack``\s) and rules that govern how CRUSH
19 replicates data within the cluster's pools. By reflecting the underlying
20 physical organization of the installation, CRUSH can model (and thereby
21 address) the potential for correlated device failures. Some factors relevant
22 to the CRUSH hierarchy include chassis, racks, physical proximity, a shared
23 power source, shared networking, and failure domains. By encoding this
24 information into the CRUSH map, CRUSH placement policies distribute object
25 replicas across failure domains while maintaining the desired distribution. For
26 example, to address the possibility of concurrent failures, it might be
27 desirable to ensure that data replicas are on devices that reside in or rely
28 upon different shelves, racks, power supplies, controllers, or physical
29 locations.
30
31 When OSDs are deployed, they are automatically added to the CRUSH map under a
32 ``host`` bucket that is named for the node on which the OSDs run. This
33 behavior, combined with the configured CRUSH failure domain, ensures that
34 replicas or erasure-code shards are distributed across hosts and that the
35 failure of a single host or other kinds of failures will not affect
36 availability. For larger clusters, administrators must carefully consider their
37 choice of failure domain. For example, distributing replicas across racks is
38 typical for mid- to large-sized clusters.
39
40
41 CRUSH Location
42 ==============
43
44 The location of an OSD within the CRUSH map's hierarchy is referred to as its
45 ``CRUSH location``. The specification of a CRUSH location takes the form of a
46 list of key-value pairs. For example, if an OSD is in a particular row, rack,
47 chassis, and host, and is also part of the 'default' CRUSH root (which is the
48 case for most clusters), its CRUSH location can be specified as follows::
49
50 root=default row=a rack=a2 chassis=a2a host=a2a1
51
52 .. note::
53
54 #. The order of the keys does not matter.
55 #. The key name (left of ``=``) must be a valid CRUSH ``type``. By default,
56 valid CRUSH types include ``root``, ``datacenter``, ``room``, ``row``,
57 ``pod``, ``pdu``, ``rack``, ``chassis``, and ``host``. These defined
58 types suffice for nearly all clusters, but can be customized by
59 modifying the CRUSH map.
60 #. Not all keys need to be specified. For example, by default, Ceph
61 automatically sets an ``OSD``'s location as ``root=default
62 host=HOSTNAME`` (as determined by the output of ``hostname -s``).
63
64 The CRUSH location for an OSD can be modified by adding the ``crush location``
65 option in ``ceph.conf``. When this option has been added, every time the OSD
66 starts it verifies that it is in the correct location in the CRUSH map and
67 moves itself if it is not. To disable this automatic CRUSH map management, add
68 the following to the ``ceph.conf`` configuration file in the ``[osd]``
69 section::
70
71 osd crush update on start = false
72
73 Note that this action is unnecessary in most cases.
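
For example, a minimal sketch of a ``ceph.conf`` entry that sets an OSD's
location explicitly (the row, rack, and host names here are hypothetical)::

   [osd]
   crush location = root=default row=a rack=a2 host=ceph-node-1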
74
75
76 Custom location hooks
77 ---------------------
78
79 A custom location hook can be used to generate a more complete CRUSH location
80 on startup. The CRUSH location is determined by, in order of preference:
81
82 #. A ``crush location`` option in ``ceph.conf``
83 #. A default of ``root=default host=HOSTNAME`` where the hostname is determined
84 by the output of the ``hostname -s`` command
85
86 A script can be written to provide additional location fields (for example,
87 ``rack`` or ``datacenter``) and the hook can be enabled via the following
88 config option::
89
90 crush location hook = /path/to/customized-ceph-crush-location
91
92 This hook is passed several arguments and must output a single line to
93 ``stdout`` that contains the CRUSH location description. The arguments
94 resemble the following::
95
96 --cluster CLUSTER --id ID --type TYPE
97
98 Here the cluster name is typically ``ceph``, the ``id`` is the daemon
99 identifier or (in the case of OSDs) the OSD number, and the daemon type is
100 ``osd``, ``mds``, ``mgr``, or ``mon``.
101
102 For example, a simple hook that specifies a rack location via a value in the
103 file ``/etc/rack`` might be as follows::
104
105 #!/bin/sh
106 echo "host=$(hostname -s) rack=$(cat /etc/rack) root=default"
107
108
109 CRUSH structure
110 ===============
111
112 The CRUSH map consists of (1) a hierarchy that describes the physical topology
113 of the cluster and (2) a set of rules that defines data placement policy. The
114 hierarchy has devices (OSDs) at the leaves and internal nodes corresponding to
115 other physical features or groupings: hosts, racks, rows, data centers, and so
116 on. The rules determine how replicas are placed in terms of that hierarchy (for
117 example, 'three replicas in different racks').
118
119 Devices
120 -------
121
122 Devices are individual OSDs that store data (usually one device for each
123 storage drive). Devices are identified by an ``id`` (a non-negative integer)
124 and a ``name`` (usually ``osd.N``, where ``N`` is the device's ``id``).
125
126 In Luminous and later releases, OSDs can have a *device class* assigned (for
127 example, ``hdd`` or ``ssd`` or ``nvme``), allowing them to be targeted by CRUSH
128 rules. Device classes are especially useful when mixing device types within
129 hosts.
130
131 .. _crush_map_default_types:
132
133 Types and Buckets
134 -----------------
135
136 "Bucket", in the context of CRUSH, is a term for any of the internal nodes in
137 the hierarchy: hosts, racks, rows, and so on. The CRUSH map defines a series of
138 *types* that are used to identify these nodes. Default types include:
139
140 - ``osd`` (or ``device``)
141 - ``host``
142 - ``chassis``
143 - ``rack``
144 - ``row``
145 - ``pdu``
146 - ``pod``
147 - ``room``
148 - ``datacenter``
149 - ``zone``
150 - ``region``
151 - ``root``
152
153 Most clusters use only a handful of these types, and other types can be defined
154 as needed.
155
156 The hierarchy is built with devices (normally of type ``osd``) at the leaves
157 and non-device types as the internal nodes. The root node is of type ``root``.
158 For example:
159
160
161 .. ditaa::
162
163 +-----------------+
164 |{o}root default |
165 +--------+--------+
166 |
167 +---------------+---------------+
168 | |
169 +------+------+ +------+------+
170 |{o}host foo | |{o}host bar |
171 +------+------+ +------+------+
172 | |
173 +-------+-------+ +-------+-------+
174 | | | |
175 +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+
176 | osd.0 | | osd.1 | | osd.2 | | osd.3 |
177 +-----------+ +-----------+ +-----------+ +-----------+
178
179
180 Each node (device or bucket) in the hierarchy has a *weight* that indicates the
181 relative proportion of the total data that should be stored by that device or
182 hierarchy subtree. Weights are set at the leaves, indicating the size of the
183 device. These weights automatically sum in an 'up the tree' direction: that is,
184 the weight of the ``root`` node will be the sum of the weights of all devices
185 contained under it. Weights are typically measured in tebibytes (TiB).
186
187 To get a simple view of the cluster's CRUSH hierarchy, including weights, run
188 the following command:
189
190 .. prompt:: bash $
191
192 ceph osd tree
193
194 Rules
195 -----
196
197 CRUSH rules define policy governing how data is distributed across the devices
198 in the hierarchy. The rules define placement as well as replication strategies
199 or distribution policies that allow you to specify exactly how CRUSH places
200 data replicas. For example, you might create one rule selecting a pair of
201 targets for two-way mirroring, another rule for selecting three targets in two
202 different data centers for three-way replication, and yet another rule for
203 erasure coding across six storage devices. For a detailed discussion of CRUSH
204 rules, see **Section 3.2** of `CRUSH - Controlled, Scalable, Decentralized
205 Placement of Replicated Data`_.
206
207 CRUSH rules can be created via the command-line by specifying the *pool type*
208 that they will govern (replicated or erasure coded), the *failure domain*, and
209 optionally a *device class*. In rare cases, CRUSH rules must be created by
210 manually editing the CRUSH map.
211
212 To see the rules that are defined for the cluster, run the following command:
213
214 .. prompt:: bash $
215
216 ceph osd crush rule ls
217
218 To view the contents of the rules, run the following command:
219
220 .. prompt:: bash $
221
222 ceph osd crush rule dump
223
224 .. _device_classes:
225
226 Device classes
227 --------------
228
229 Each device can optionally have a *class* assigned. By default, OSDs
230 automatically set their class at startup to ``hdd``, ``ssd``, or ``nvme`` in
231 accordance with the type of device they are backed by.
232
233 To explicitly set the device class of one or more OSDs, run a command of the
234 following form:
235
236 .. prompt:: bash $
237
238 ceph osd crush set-device-class <class> <osd-name> [...]
239
240 Once a device class has been set, it cannot be changed to another class until
241 the old class is unset. To remove the old class of one or more OSDs, run a
242 command of the following form:
243
244 .. prompt:: bash $
245
246 ceph osd crush rm-device-class <osd-name> [...]
247
248 This restriction allows administrators to set device classes that won't be
249 changed on OSD restart or by a script.
250
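As an illustrative sketch, reclassifying two OSDs from their automatically
detected class to ``nvme`` might look like this (the OSD names are
hypothetical):

.. prompt:: bash $

   ceph osd crush rm-device-class osd.2 osd.3
   ceph osd crush set-device-class nvme osd.2 osd.3
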
251 To create a placement rule that targets a specific device class, run a command
252 of the following form:
253
254 .. prompt:: bash $
255
256 ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
257
258 To apply the new placement rule to a specific pool, run a command of the
259 following form:
260
261 .. prompt:: bash $
262
263 ceph osd pool set <pool-name> crush_rule <rule-name>
264
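Putting these two commands together, a hedged end-to-end example might look
like the following (the rule name ``fast`` and pool name ``mypool`` are
hypothetical):

.. prompt:: bash $

   ceph osd crush rule create-replicated fast default host ssd
   ceph osd pool set mypool crush_rule fast
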
265 Device classes are implemented by creating one or more "shadow" CRUSH
266 hierarchies. For each device class in use, there will be a shadow hierarchy
267 that contains only devices of that class. CRUSH rules can then distribute data
268 across the relevant shadow hierarchy. This approach is fully backward
269 compatible with older Ceph clients. To view the CRUSH hierarchy with shadow
270 items displayed, run the following command:
271
272 .. prompt:: bash #
273
274 ceph osd crush tree --show-shadow
275
276 Some older clusters that were created before the Luminous release rely on
277 manually crafted CRUSH maps to maintain per-device-type hierarchies. For these
278 clusters, there is a *reclassify* tool available that can help them transition
279 to device classes without triggering unwanted data movement (see
280 :ref:`crush-reclassify`).
281
282 Weight sets
283 -----------
284
285 A *weight set* is an alternative set of weights to use when calculating data
286 placement. The normal weights associated with each device in the CRUSH map are
287 set in accordance with the device size and indicate how much data should be
288 stored where. However, because CRUSH is a probabilistic pseudorandom placement
289 process, there is always some variation from this ideal distribution (in the
290 same way that rolling a die sixty times will likely not result in exactly ten
291 ones and ten sixes). Weight sets allow the cluster to perform numerical
292 optimization based on the specifics of your cluster (for example: hierarchy,
293 pools) to achieve a balanced distribution.
294
295 Ceph supports two types of weight sets:
296
297 #. A **compat** weight set is a single alternative set of weights for each
298 device and each node in the cluster. Compat weight sets cannot be expected
299 to correct all anomalies (for example, PGs for different pools might be of
300 different sizes and have different load levels, but are mostly treated alike
301 by the balancer). However, they have the major advantage of being *backward
302 compatible* with previous versions of Ceph. This means that even though
303 weight sets were first introduced in Luminous v12.2.z, older clients (for
304 example, Firefly) can still connect to the cluster when a compat weight set
305 is being used to balance data.
306
307 #. A **per-pool** weight set is more flexible in that it allows placement to
308 be optimized for each data pool. Additionally, weights can be adjusted
309 for each position of placement, allowing the optimizer to correct for a
310 subtle skew of data toward devices with small weights relative to their
311 peers (an effect that is usually apparent only in very large clusters
312 but that can cause balancing problems).
313
314 When weight sets are in use, the weights associated with each node in the
315 hierarchy are visible in a separate column (labeled either as ``(compat)`` or
316 as the pool name) in the output of the following command:
317
318 .. prompt:: bash #
319
320 ceph osd tree
321
322 If both *compat* and *per-pool* weight sets are in use, data placement for a
323 particular pool will use its own per-pool weight set if present. If only
324 *compat* weight sets are in use, data placement will use the compat weight set.
325 If neither are in use, data placement will use the normal CRUSH weights.
326
327 Although weight sets can be set up and adjusted manually, we recommend enabling
328 the ``ceph-mgr`` *balancer* module to perform these tasks automatically if the
329 cluster is running Luminous or a later release.
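
For example, a minimal sketch of enabling the balancer in ``crush-compat``
mode, which creates and maintains the compat weight set automatically (module
defaults may vary by release):

.. prompt:: bash $

   ceph balancer mode crush-compat
   ceph balancer on
   ceph balancer status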
330
331 Modifying the CRUSH map
332 =======================
333
334 .. _addosd:
335
336 Adding/Moving an OSD
337 --------------------
338
339 .. note:: Under normal conditions, OSDs automatically add themselves to the
340 CRUSH map when they are created. The command in this section is rarely
341 needed.
342
343
344 To add or move an OSD in the CRUSH map of a running cluster, run a command of
345 the following form:
346
347 .. prompt:: bash $
348
349 ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...]
350
351 For details on this command's parameters, see the following:
352
353 ``name``
354 :Description: The full name of the OSD.
355 :Type: String
356 :Required: Yes
357 :Example: ``osd.0``
358
359
360 ``weight``
361 :Description: The CRUSH weight of the OSD. Normally, this is its size, as measured in tebibytes (TiB).
362 :Type: Double
363 :Required: Yes
364 :Example: ``2.0``
365
366
367 ``root``
368 :Description: The root node of the CRUSH hierarchy in which the OSD resides (normally ``default``).
369 :Type: Key-value pair.
370 :Required: Yes
371 :Example: ``root=default``
372
373
374 ``bucket-type``
375 :Description: The OSD's location in the CRUSH hierarchy.
376 :Type: Key-value pairs.
377 :Required: No
378 :Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``
379
380 In the following example, the command adds ``osd.0`` to the hierarchy, or moves
381 ``osd.0`` from a previous location:
382
383 .. prompt:: bash $
384
385 ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1
386
387
388 Adjusting OSD weight
389 --------------------
390
391 .. note:: Under normal conditions, OSDs automatically add themselves to the
392 CRUSH map with the correct weight when they are created. The command in this
393 section is rarely needed.
394
395 To adjust an OSD's CRUSH weight in a running cluster, run a command of the
396 following form:
397
398 .. prompt:: bash $
399
400 ceph osd crush reweight {name} {weight}
401
402 For details on this command's parameters, see the following:
403
404 ``name``
405 :Description: The full name of the OSD.
406 :Type: String
407 :Required: Yes
408 :Example: ``osd.0``
409
410
411 ``weight``
412 :Description: The CRUSH weight of the OSD.
413 :Type: Double
414 :Required: Yes
415 :Example: ``2.0``
416
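For example, a hypothetical adjustment that sets the CRUSH weight of ``osd.0``
to ``2.0``:

.. prompt:: bash $

   ceph osd crush reweight osd.0 2.0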
417
418 .. _removeosd:
419
420 Removing an OSD
421 ---------------
422
423 .. note:: OSDs are normally removed from the CRUSH map as a result of the
424    ``ceph osd purge`` command. This command is rarely needed.
425
426 To remove an OSD from the CRUSH map of a running cluster, run a command of the
427 following form:
428
429 .. prompt:: bash $
430
431 ceph osd crush remove {name}
432
433 For details on the ``name`` parameter, see the following:
434
435 ``name``
436 :Description: The full name of the OSD.
437 :Type: String
438 :Required: Yes
439 :Example: ``osd.0``
440
441
442 Adding a CRUSH Bucket
443 ---------------------
444
445 .. note:: Buckets are implicitly created when an OSD is added and the command
446    that creates the OSD specifies ``{bucket-type}={bucket-name}`` pairs as part
447    of the OSD's location (provided that a bucket with that name does not
448    already exist). The command in this section is typically used when manually
449    adjusting the structure of the hierarchy after OSDs have already been
450    created. One use of this command is to move a series of hosts to a new
451    rack-level bucket. Another is to add new ``host`` buckets (OSD nodes) to a
452    dummy ``root`` so that the buckets don't receive any data until they are
453    ready. When they are ready, move the buckets to the ``default`` root or to
454    any other root as described below.
455
456 To add a bucket in the CRUSH map of a running cluster, run a command of the
457 following form:
458
459 .. prompt:: bash $
460
461 ceph osd crush add-bucket {bucket-name} {bucket-type}
462
463 For details on this command's parameters, see the following:
464
465 ``bucket-name``
466 :Description: The full name of the bucket.
467 :Type: String
468 :Required: Yes
469 :Example: ``rack12``
470
471
472 ``bucket-type``
473 :Description: The type of the bucket. This type must already exist in the CRUSH hierarchy.
474 :Type: String
475 :Required: Yes
476 :Example: ``rack``
477
478 In the following example, the command adds the ``rack12`` bucket to the hierarchy:
479
480 .. prompt:: bash $
481
482 ceph osd crush add-bucket rack12 rack
483
484 Moving a Bucket
485 ---------------
486
487 To move a bucket to a different location or position in the CRUSH map
488 hierarchy, run a command of the following form:
489
490 .. prompt:: bash $
491
492    ceph osd crush move {bucket-name} {bucket-type}={bucket-name} [...]
493
494 For details on this command's parameters, see the following:
495
496 ``bucket-name``
497 :Description: The name of the bucket that you are moving.
498 :Type: String
499 :Required: Yes
500 :Example: ``foo-bar-1``
501
502 ``bucket-type``
503 :Description: The bucket's new location in the CRUSH hierarchy.
504 :Type: Key-value pairs.
505 :Required: No
506 :Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``
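
In the following hypothetical example, the ``foo-bar-1`` host bucket is moved
under a different rack (all bucket names here are illustrative):

.. prompt:: bash $

   ceph osd crush move foo-bar-1 datacenter=dc1 room=room1 row=foo rack=bar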
507
508 Removing a Bucket
509 -----------------
510
511 To remove a bucket from the CRUSH hierarchy, run a command of the following
512 form:
513
514 .. prompt:: bash $
515
516 ceph osd crush remove {bucket-name}
517
518 .. note:: A bucket must already be empty before it is removed from the CRUSH
519 hierarchy. In other words, there must not be OSDs or any other CRUSH buckets
520 within it.
521
522 For details on the ``bucket-name`` parameter, see the following:
523
524 ``bucket-name``
525 :Description: The name of the bucket that is being removed.
526 :Type: String
527 :Required: Yes
528 :Example: ``rack12``
529
530 In the following example, the command removes the ``rack12`` bucket from the
531 hierarchy:
532
533 .. prompt:: bash $
534
535 ceph osd crush remove rack12
536
537 Creating a compat weight set
538 ----------------------------
539
540 .. note:: Normally this action is done automatically if needed by the
541 ``balancer`` module (provided that the module is enabled).
542
543 To create a *compat* weight set, run the following command:
544
545 .. prompt:: bash $
546
547 ceph osd crush weight-set create-compat
548
549 To adjust the weights of the compat weight set, run a command of the following
550 form:
551
552 .. prompt:: bash $
553
554 ceph osd crush weight-set reweight-compat {name} {weight}
555
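For example, a hypothetical adjustment that slightly reduces the compat weight
of ``osd.1``:

.. prompt:: bash $

   ceph osd crush weight-set reweight-compat osd.1 0.9
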
556 To destroy the compat weight set, run the following command:
557
558 .. prompt:: bash $
559
560 ceph osd crush weight-set rm-compat
561
562 Creating per-pool weight sets
563 -----------------------------
564
565 To create a weight set for a specific pool, run a command of the following
566 form:
567
568 .. prompt:: bash $
569
570 ceph osd crush weight-set create {pool-name} {mode}
571
572 .. note:: Per-pool weight sets can be used only if all servers and daemons are
573 running Luminous v12.2.z or a later release.
574
575 For details on this command's parameters, see the following:
576
577 ``pool-name``
578 :Description: The name of a RADOS pool.
579 :Type: String
580 :Required: Yes
581 :Example: ``rbd``
582
583 ``mode``
584 :Description: Either ``flat`` or ``positional``. A *flat* weight set
585 assigns a single weight to all devices or buckets. A
586 *positional* weight set has a potentially different
587 weight for each position in the resulting placement
588 mapping. For example: if a pool has a replica count of
589 ``3``, then a positional weight set will have three
590 weights for each device and bucket.
591 :Type: String
592 :Required: Yes
593 :Example: ``flat``
594
595 To adjust the weight of an item in a weight set, run a command of the following
596 form:
597
598 .. prompt:: bash $
599
600 ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]}
601
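For example, a hedged end-to-end sketch for a three-way replicated pool named
``rbd`` (the pool name and weights are hypothetical): create a ``positional``
weight set and then supply one weight per replica position for ``osd.0``:

.. prompt:: bash $

   ceph osd crush weight-set create rbd positional
   ceph osd crush weight-set reweight rbd osd.0 1.0 0.9 0.8
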
602 To list existing weight sets, run the following command:
603
604 .. prompt:: bash $
605
606 ceph osd crush weight-set ls
607
608 To remove a weight set, run a command of the following form:
609
610 .. prompt:: bash $
611
612 ceph osd crush weight-set rm {pool-name}
613
614
615 Creating a rule for a replicated pool
616 -------------------------------------
617
618 When you create a CRUSH rule for a replicated pool, there is an important
619 decision to make: selecting a failure domain. For example, if you select a
620 failure domain of ``host``, then CRUSH will ensure that each replica of the
621 data is stored on a unique host. Alternatively, if you select a failure domain
622 of ``rack``, then each replica of the data will be stored in a different rack.
623 Your selection of failure domain should be guided by the size of your cluster
624 and its CRUSH topology.
625
626 The entire cluster hierarchy is typically nested beneath a root node that is
627 named ``default``. If you have customized your hierarchy, you might want to
628 create a rule nested beneath some other node in the hierarchy. In creating
629 this rule for the customized hierarchy, the node type doesn't matter, and in
630 particular the rule does not have to be nested beneath a ``root`` node.
631
632 It is possible to create a rule that restricts data placement to a specific
633 *class* of device. By default, Ceph OSDs automatically classify themselves as
634 either ``hdd`` or ``ssd`` in accordance with the underlying type of device
635 being used. These device classes can be customized. One might set the device
636 class of OSDs to ``nvme`` to distinguish them from SATA SSDs, or one might set
637 them to something arbitrary like ``ssd-testing`` or ``ssd-ethel`` so that rules
638 and pools may be flexibly constrained to use (or avoid using) specific subsets
639 of OSDs based on specific requirements.
640
641 To create a rule for a replicated pool, run a command of the following form:
642
643 .. prompt:: bash $
644
645 ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}]
646
647 For details on this command's parameters, see the following:
648
649 ``name``
650 :Description: The name of the rule.
651 :Type: String
652 :Required: Yes
653 :Example: ``rbd-rule``
654
655 ``root``
656 :Description: The name of the CRUSH hierarchy node under which data is to be placed.
657 :Type: String
658 :Required: Yes
659 :Example: ``default``
660
661 ``failure-domain-type``
662 :Description: The type of CRUSH nodes used for the replicas of the failure domain.
663 :Type: String
664 :Required: Yes
665 :Example: ``rack``
666
667 ``class``
668 :Description: The device class on which data is to be placed.
669 :Type: String
670 :Required: No
671 :Example: ``ssd``
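
For example, a hypothetical rule that places replicas in different racks under
the ``default`` root without restricting the device class (the rule name is
illustrative):

.. prompt:: bash $

   ceph osd crush rule create-replicated rep-by-rack default rack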
672
673 Creating a rule for an erasure-coded pool
674 -----------------------------------------
675
676 For an erasure-coded pool, similar decisions need to be made: what the failure
677 domain is, which node in the hierarchy data will be placed under (usually
678 ``default``), and whether placement is restricted to a specific device class.
679 However, erasure-code pools are created in a different way: there is a need to
680 construct them carefully with reference to the erasure code plugin in use. For
681 this reason, these decisions must be incorporated into the **erasure-code
682 profile**. A CRUSH rule will then be created from the erasure-code profile,
683 either explicitly or automatically when the profile is used to create a pool.
684
685 To list the erasure-code profiles, run the following command:
686
687 .. prompt:: bash $
688
689 ceph osd erasure-code-profile ls
690
691 To view a specific existing profile, run a command of the following form:
692
693 .. prompt:: bash $
694
695 ceph osd erasure-code-profile get {profile-name}
696
697 Under normal conditions, profiles should never be modified; instead, a new
698 profile should be created and used when creating either a new pool or a new
699 rule for an existing pool.
700
701 An erasure-code profile consists of a set of key-value pairs. Most of these
702 key-value pairs govern the behavior of the erasure code that encodes data in
703 the pool. However, key-value pairs that begin with ``crush-`` govern the CRUSH
704 rule that is created.
705
706 The relevant erasure-code profile properties are as follows:
707
708 * **crush-root**: the name of the CRUSH node under which to place data
709 [default: ``default``].
710 * **crush-failure-domain**: the CRUSH bucket type used in the distribution of
711 erasure-coded shards [default: ``host``].
712 * **crush-device-class**: the device class on which to place data [default:
713 none, which means that all devices are used].
714 * **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the
715 number of erasure-code shards, affecting the resulting CRUSH rule.
716
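For example, a hedged sketch of a profile that sets these ``crush-``
properties (the profile name and the ``k``/``m`` values are hypothetical):

.. prompt:: bash $

   ceph osd erasure-code-profile set myprofile k=4 m=2 crush-failure-domain=rack crush-device-class=hdd
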
717 After a profile is defined, you can create a CRUSH rule by running a command
718 of the following form:
719
720 .. prompt:: bash $
721
722 ceph osd crush rule create-erasure {name} {profile-name}
723
724 .. note:: When creating a new pool, it is not necessary to create the rule
725 explicitly. If only the erasure-code profile is specified and the rule
726 argument is omitted, then Ceph will create the CRUSH rule automatically.
727
728
729 Deleting rules
730 --------------
731
732 To delete rules that are not in use by pools, run a command of the following
733 form:
734
735 .. prompt:: bash $
736
737 ceph osd crush rule rm {rule-name}
738
739 .. _crush-map-tunables:
740
741 Tunables
742 ========
743
744 The CRUSH algorithm that is used to calculate the placement of data has been
745 improved over time. In order to support changes in behavior, we have provided
746 users with sets of tunables that determine which legacy or optimal version of
747 CRUSH is to be used.
748
749 In order to use newer tunables, all Ceph clients and daemons must support the
750 new major release of CRUSH. Because of this requirement, we have created
751 ``profiles`` that are named after the Ceph version in which they were
752 introduced. For example, the ``firefly`` tunables were first supported by the
753 Firefly release and do not work with older clients (for example, clients
754 running Dumpling). After a cluster's tunables profile is changed from a legacy
755 set to a newer or ``optimal`` set, the ``ceph-mon`` and ``ceph-osd`` options
756 will prevent older clients that do not support the new CRUSH features from
757 connecting to the cluster.
758
759 argonaut (legacy)
760 -----------------
761
762 The legacy CRUSH behavior used by Argonaut and older releases works fine for
763 most clusters, provided that not many OSDs have been marked ``out``.
764
765 bobtail (CRUSH_TUNABLES2)
766 -------------------------
767
768 The ``bobtail`` tunable profile provides the following improvements:
769
770 * For hierarchies with a small number of devices in leaf buckets, some PGs
771 might map to fewer than the desired number of replicas, resulting in
772 ``undersized`` PGs. This is known to happen in the case of hierarchies with
773 ``host`` nodes that have a small number of OSDs (1 to 3) nested beneath each
774 host.
775
776 * For large clusters, a small percentage of PGs might map to fewer than the
777 desired number of OSDs. This is known to happen when there are multiple
778 hierarchy layers in use (for example, ``row``, ``rack``, ``host``,
779 ``osd``).
780
781 * When one or more OSDs are marked ``out``, data tends to be redistributed
782 to nearby OSDs instead of across the entire hierarchy.
783
784 The tunables introduced in the Bobtail release are as follows:
785
786 * ``choose_local_tries``: Number of local retries. The legacy value is ``2``,
787 and the optimal value is ``0``.
788
789 * ``choose_local_fallback_tries``: The legacy value is ``5``, and the optimal
790   value is ``0``.
791
792 * ``choose_total_tries``: Total number of attempts to choose an item. The
793 legacy value is ``19``, but subsequent testing indicates that a value of
794 ``50`` is more appropriate for typical clusters. For extremely large
795 clusters, an even larger value might be necessary.
796
797 * ``chooseleaf_descend_once``: Whether a recursive ``chooseleaf`` attempt will
798 retry, or try only once and allow the original placement to retry. The
799 legacy default is ``0``, and the optimal value is ``1``.
800
801 Migration impact:
802
803 * Moving from the ``argonaut`` tunables to the ``bobtail`` tunables triggers a
804 moderate amount of data movement. Use caution on a cluster that is already
805 populated with data.
806
807 firefly (CRUSH_TUNABLES3)
808 -------------------------
809
810 chooseleaf_vary_r
811 ~~~~~~~~~~~~~~~~~
812
813 The ``firefly`` tunable profile fixes a problem with the behavior of the
814 ``chooseleaf`` CRUSH step. This problem arose when a large fraction of OSDs were marked ``out``, which resulted in PG mappings with too few OSDs.
815
816 This profile was introduced in the Firefly release, and adds a new tunable as follows:
817
818 * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will start
819 with a non-zero value of ``r``, as determined by the number of attempts the
820 parent has already made. The legacy default value is ``0``, but with this
821 value CRUSH is sometimes unable to find a mapping. The optimal value (in
822 terms of computational cost and correctness) is ``1``.
823
824 Migration impact:
825
826 * For existing clusters that store a great deal of data, changing this tunable
827 from ``0`` to ``1`` will trigger a large amount of data migration; a value
828 of ``4`` or ``5`` will allow CRUSH to still find a valid mapping and will
829 cause less data to move.
830
831 straw_calc_version tunable
832 ~~~~~~~~~~~~~~~~~~~~~~~~~~
833
834 There were problems with the internal weights calculated and stored in the
835 CRUSH map for ``straw`` algorithm buckets. When there were buckets with a CRUSH
836 weight of ``0`` or with a mix of different and unique weights, CRUSH would
837 distribute data incorrectly (that is, not in proportion to the weights).
838
839 This tunable, introduced in the Firefly release, is as follows:
840
841 * ``straw_calc_version``: A value of ``0`` preserves the old, broken
842 internal-weight calculation; a value of ``1`` fixes the problem.
843
844 Migration impact:
845
846 * Changing this tunable to a value of ``1`` and then adjusting a straw bucket
847 (either by adding, removing, or reweighting an item or by using the
848 reweight-all command) can trigger a small to moderate amount of data
849 movement provided that the cluster has hit one of the problematic
850 conditions.
851
852 This tunable option is notable in that it has absolutely no impact on the
853 kernel version required on the client side.
854
855 hammer (CRUSH_V4)
856 -----------------
857
858 Merely switching an existing cluster to the ``hammer`` tunable profile does
859 not by itself change any CRUSH mappings. However:
860
861 * There is a new bucket algorithm supported: ``straw2``. This new algorithm
862 fixes several limitations in the original ``straw``. More specifically, the
863 old ``straw`` buckets would change some mappings that should not have
864 changed when a weight was adjusted, while ``straw2`` achieves the original
865 goal of changing mappings only to or from the bucket item whose weight has
866 changed.
867
868 * The ``straw2`` type is the default type for any newly created buckets.
869
870 Migration impact:
871
872 * Changing a bucket type from ``straw`` to ``straw2`` will trigger a small
873 amount of data movement, depending on how much the bucket items' weights
874 vary from each other. When the weights are all the same no data will move,
875 and the more variance there is in the weights the more movement there will
876 be.
877
878 jewel (CRUSH_TUNABLES5)
879 -----------------------
880
881 The ``jewel`` tunable profile improves the overall behavior of CRUSH. As a
882 result, significantly fewer mappings change when an OSD is marked ``out`` of
883 the cluster. This improvement results in significantly less data movement.
884
885 The new tunable introduced in the Jewel release is as follows:
886
887 * ``chooseleaf_stable``: Determines whether a recursive chooseleaf attempt
888 will use a better value for an inner loop that greatly reduces the number of
889 mapping changes when an OSD is marked ``out``. The legacy value is ``0``,
890 and the new value of ``1`` uses the new approach.
891
892 Migration impact:
893
894 * Changing this value on an existing cluster will result in a very large
895 amount of data movement because nearly every PG mapping is likely to change.
896
897 Client versions that support CRUSH_TUNABLES2
898 --------------------------------------------
899
900 * v0.55 and later, including Bobtail (v0.56.x)
901 * Linux kernel version v3.9 and later (for the CephFS and RBD kernel clients)
902
903 Client versions that support CRUSH_TUNABLES3
904 --------------------------------------------
905
906 * v0.78 (Firefly) and later
907 * Linux kernel version v3.15 and later (for the CephFS and RBD kernel clients)
908
909 Client versions that support CRUSH_V4
910 -------------------------------------
911
912 * v0.94 (Hammer) and later
913 * Linux kernel version v4.1 and later (for the CephFS and RBD kernel clients)
914
915 Client versions that support CRUSH_TUNABLES5
916 --------------------------------------------
917
918 * v10.0.2 (Jewel) and later
919 * Linux kernel version v4.5 and later (for the CephFS and RBD kernel clients)
920
921 "Non-optimal tunables" warning
922 ------------------------------
923
924 In v0.74 and later versions, Ceph will raise a health check ("HEALTH_WARN crush
925 map has non-optimal tunables") if any of the current CRUSH tunables have
926 non-optimal values: that is, if any fail to have the optimal values from the
927 :ref:`default profile
928 <rados_operations_crush_map_default_profile_definition>`. There are two
929 different ways to silence the alert:
930
931 1. Adjust the CRUSH tunables on the existing cluster so as to render them
932 optimal. Making this adjustment will trigger some data movement
933 (possibly as much as 10%). This approach is generally preferred to the
934 other approach, but special care must be taken in situations where
935 data movement might affect performance: for example, in production clusters.
936 To enable optimal tunables, run the following command:
937
938 .. prompt:: bash $
939
940 ceph osd crush tunables optimal
941
942 There are several potential problems that might make it preferable to revert
943 to the previous values of the tunables. The new values might generate too
944 much load for the cluster to handle, the new values might unacceptably slow
945 the operation of the cluster, or there might be a client-compatibility
946 problem. Such client-compatibility problems can arise when using old-kernel
947 CephFS or RBD clients, or pre-Bobtail ``librados`` clients. To revert to
948 the previous values of the tunables, run the following command:
949
950 .. prompt:: bash $
951
952 ceph osd crush tunables legacy
953
954 2. To silence the alert without making any changes to CRUSH,
955 add the following option to the ``[mon]`` section of your ceph.conf file::
956
957 mon_warn_on_legacy_crush_tunables = false
958
959 In order for this change to take effect, you will need to either restart
960 the monitors or run the following command to apply the option to the
961 monitors while they are still running:
962
963 .. prompt:: bash $
964
965 ceph tell mon.\* config set mon_warn_on_legacy_crush_tunables false
966
967
968 Tuning CRUSH
969 ------------
970
971 When making adjustments to CRUSH tunables, keep the following considerations in
972 mind:
973
974 * Adjusting the values of CRUSH tunables will result in the shift of one or
975 more PGs from one storage node to another. If the Ceph cluster is already
976 storing a great deal of data, be prepared for significant data movement.
977 * When the ``ceph-osd`` and ``ceph-mon`` daemons get the updated map, they
978 immediately begin rejecting new connections from clients that do not support
979 the new feature. However, already-connected clients are effectively
980 grandfathered in, and any of these clients that do not support the new
981 feature will malfunction.
982 * If the CRUSH tunables are set to newer (non-legacy) values and subsequently
983 reverted to the legacy values, ``ceph-osd`` daemons will not be required to
984 support any of the newer CRUSH features associated with the newer
985 (non-legacy) values. However, the OSD peering process requires the
986 examination and understanding of old maps. For this reason, **if the cluster
987 has previously used non-legacy CRUSH values, do not run old versions of
988 the** ``ceph-osd`` **daemon** -- even if the latest version of the map has
989 been reverted so as to use the legacy defaults.
990
991 The simplest way to adjust CRUSH tunables is to apply them in matched sets
992 known as *profiles*. As of the Octopus release, Ceph supports the following
993 profiles:
994
995 * ``legacy``: The legacy behavior from argonaut and earlier.
996 * ``argonaut``: The legacy values supported by the argonaut release.
997 * ``bobtail``: The values supported by the bobtail release.
998 * ``firefly``: The values supported by the firefly release.
999 * ``hammer``: The values supported by the hammer release.
1000 * ``jewel``: The values supported by the jewel release.
1001 * ``optimal``: The best values for the current version of Ceph.

1002 .. _rados_operations_crush_map_default_profile_definition:

1003 * ``default``: The default values of a new cluster that has been installed
1004 from scratch. These values, which depend on the current version of Ceph, are
1005 hardcoded and are typically a mix of optimal and legacy values. These
1006 values often correspond to the ``optimal`` profile of either the previous
1007 LTS (long-term service) release or the most recent release for which most
1008 users are expected to have up-to-date clients.
1009
1010 To apply a profile to a running cluster, run a command of the following form:
1011
1012 .. prompt:: bash $
1013
1014 ceph osd crush tunables {PROFILE}
1015
1016 This action might trigger a great deal of data movement. Consult release notes
1017 and documentation before changing the profile on a running cluster. Consider
1018 throttling recovery and backfill parameters in order to limit the backfill
1019 resulting from a specific change.
1020
1021 .. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf
1022
1023
1024 Tuning Primary OSD Selection
1025 ============================
1026
1027 When a Ceph client reads or writes data, it first contacts the primary OSD in
1028 each affected PG's acting set. By default, the first OSD in the acting set is
1029 the primary OSD (also known as the "lead OSD"). For example, in the acting set
1030 ``[2, 3, 4]``, ``osd.2`` is listed first and is therefore the primary OSD.
1031 However, sometimes it is clear that an OSD is not well suited to act as the
1032 lead as compared with other OSDs (for example, if the OSD has a slow drive or a
1033 slow controller). To prevent performance bottlenecks (especially on read
1034 operations) and at the same time maximize the utilization of your hardware, you
1035 can influence the selection of the primary OSD either by adjusting "primary
1036 affinity" values, or by crafting a CRUSH rule that selects OSDs that are better
1037 suited to act as the lead rather than other OSDs.
1038
1039 To determine whether tuning Ceph's selection of primary OSDs will improve
1040 cluster performance, pool redundancy strategy must be taken into account. For
1041 replicated pools, this tuning can be especially useful, because by default read
1042 operations are served from the primary OSD of each PG. For erasure-coded pools,
1043 however, the speed of read operations can be increased by enabling **fast
1044 read** (see :ref:`pool-settings`).
1045
1046 Primary Affinity
1047 ----------------
1048
1049 **Primary affinity** is a characteristic of an OSD that governs the likelihood
1050 that a given OSD will be selected as the primary OSD (or "lead OSD") in a given
1051 acting set. A primary affinity value can be any real number in the range ``0``
1052 to ``1``, inclusive.
1053
1054 As an example of a common scenario in which it can be useful to adjust primary
1055 affinity values, let us suppose that a cluster contains a mix of drive sizes:
1056 for example, suppose it contains some older racks with 1.9 TB SATA SSDs and
1057 some newer racks with 3.84 TB SATA SSDs. The latter will on average be assigned
1058 twice the number of PGs and will thus serve twice the number of write and read
1059 operations -- they will be busier than the former. In such a scenario, you
1060 might make a rough assignment of primary affinity as inversely proportional to
1061 OSD size. Such an assignment will not be 100% optimal, but it can readily
1062 achieve a 15% improvement in overall read throughput by means of a more even
1063 utilization of SATA interface bandwidth and CPU cycles. This example is not
1064 merely a thought experiment meant to illustrate the theoretical benefits of
1065 adjusting primary affinity values; this fifteen percent improvement was
1066 achieved on an actual Ceph cluster.
1067
1068 By default, every Ceph OSD has a primary affinity value of ``1``. In a cluster
1069 in which every OSD has this default value, all OSDs are equally likely to act
1070 as a primary OSD.
1071
1072 By reducing the value of a Ceph OSD's primary affinity, you make CRUSH less
1073 likely to select the OSD as primary in a PG's acting set. To change the weight
1074 value associated with a specific OSD's primary affinity, run a command of the
1075 following form:
1076
1077 .. prompt:: bash $
1078
1079 ceph osd primary-affinity <osd-id> <weight>
1080
1081 The primary affinity of an OSD can be set to any real number in the range
1082 ``[0-1]`` inclusive, where ``0`` indicates that the OSD may not be used as
1083 primary and ``1`` indicates that the OSD is maximally likely to be used as a
1084 primary. When the weight is between these extremes, its value indicates roughly
1085 how likely it is that CRUSH will select the OSD associated with it as a
1086 primary.
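
For example, to halve the likelihood that a hypothetical ``osd.4`` is selected
as the primary:

.. prompt:: bash $

   ceph osd primary-affinity osd.4 0.5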
1087
1088 The process by which CRUSH selects the lead OSD is not a mere function of a
1089 simple probability determined by relative affinity values. Nevertheless,
1090 measurable results can be achieved even with first-order approximations of
1091 desirable primary affinity values.
1092
1093
1094 Custom CRUSH Rules
1095 ------------------
1096
1097 Some clusters balance cost and performance by mixing SSDs and HDDs in the same
1098 replicated pool. If the primary affinity of HDD OSDs is set to ``0``, client
1099 operations will be directed to an SSD OSD in each acting set. Alternatively,
1100 you can define a CRUSH rule that always selects an SSD OSD as the primary OSD
1101 and then selects HDDs for the remaining OSDs. Given this rule, each PG's acting
1102 set will contain an SSD OSD as the primary and have the remaining OSDs on HDDs.
1103
1104 For example, see the following CRUSH rule::
1105
1106 rule mixed_replicated_rule {
1107 id 11
1108 type replicated
1109 step take default class ssd
1110 step chooseleaf firstn 1 type host
1111 step emit
1112 step take default class hdd
1113 step chooseleaf firstn 0 type host
1114 step emit
1115 }
1116
1117 This rule chooses an SSD as the first OSD. For an ``N``-times replicated pool,
1118 this rule selects ``N+1`` OSDs in order to guarantee that ``N`` copies are on
1119 different hosts, because the first SSD OSD might be colocated with any of the
1120 ``N`` HDD OSDs.
1121
1122 To avoid this extra storage requirement, you might place SSDs and HDDs in
1123 different hosts. However, taking this approach means that all client requests
1124 will be received by hosts with SSDs. For this reason, it might be advisable to
1125 have faster CPUs for SSD OSDs and more modest CPUs for HDD OSDs, since the
1126 latter will under normal circumstances perform only recovery operations. Here
1127 the CRUSH roots ``ssd_hosts`` and ``hdd_hosts`` are under a strict requirement
1128 not to contain any of the same servers, as seen in the following CRUSH rule::
1129
1130 rule mixed_replicated_rule_two {
1131 id 1
1132 type replicated
1133 step take ssd_hosts class ssd
1134 step chooseleaf firstn 1 type host
1135 step emit
1136 step take hdd_hosts class hdd
1137 step chooseleaf firstn -1 type host
1138 step emit
1139 }
1140
1141 .. note:: If a primary SSD OSD fails, then requests to the associated PG will
1142 be temporarily served from a slower HDD OSD until the PG's data has been
1143 replicated onto the replacement primary SSD OSD.
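
Custom rules like the two shown above cannot be expressed with ``ceph osd
crush rule create-replicated``; they are added by editing the CRUSH map
directly. A hedged sketch of the usual round trip (the file names are
arbitrary):

.. prompt:: bash $

   ceph osd getcrushmap -o crushmap.bin
   crushtool -d crushmap.bin -o crushmap.txt
   # edit crushmap.txt to add the new rule, then recompile and inject it
   crushtool -c crushmap.txt -o crushmap-new.bin
   ceph osd setcrushmap -i crushmap-new.bin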
1144
1145