ceph/doc/rados/operations/crush-map.rst

   1 ============
   2  CRUSH Maps
   3 ============
   4
   5 The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm
   6 determines how to store and retrieve data by computing storage locations.
   7 CRUSH empowers Ceph clients to communicate with OSDs directly rather than
   8 through a centralized server or broker. With an algorithmically determined
   9 method of storing and retrieving data, Ceph avoids a single point of failure, a
  10 performance bottleneck, and a physical limit to its scalability.
  11
  12 CRUSH uses a map of your cluster (the CRUSH map) to pseudo-randomly
  13 map data to OSDs, distributing it across the cluster according to configured
  14 replication policy and failure domain.  For a detailed discussion of CRUSH, see
  15 `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_
  16
  17 CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)`, a hierarchy
  18 of 'buckets' for aggregating devices and buckets, and
  19 rules that govern how CRUSH replicates data within the cluster's pools. By
  20 reflecting the underlying physical organization of the installation, CRUSH can
  21 model (and thereby address) the potential for correlated device failures.
  22 Typical factors include chassis, racks, physical proximity, a shared power
  23 source, and shared networking. By encoding this information into the cluster
  24 map, CRUSH placement
  25 policies distribute object replicas across failure domains while
  26 maintaining the desired distribution. For example, to address the
  27 possibility of concurrent failures, it may be desirable to ensure that data
  28 replicas are on devices using different shelves, racks, power supplies,
  29 controllers, and/or physical locations.
  30
  31 When you deploy OSDs they are automatically added to the CRUSH map under a
  32 ``host`` bucket named for the node on which they run.  This,
  33 combined with the configured CRUSH failure domain, ensures that replicas or
  34 erasure code shards are distributed across hosts and that a single host or other
  35 failure will not affect availability.  For larger clusters, administrators must
  36 carefully consider their choice of failure domain.  Separating replicas across racks,
  37 for example, is typical for mid- to large-sized clusters.
  38
  39
  40 CRUSH Location
  41 ==============
  42
  43 The location of an OSD within the CRUSH map's hierarchy is
  44 referred to as a ``CRUSH location``.  This location specifier takes the
  45 form of a list of key and value pairs.  For
  46 example, if an OSD is in a particular row, rack, chassis and host, and
  47 is part of the 'default' CRUSH root (which is the case for most
  48 clusters), its CRUSH location could be described as::
  49
  50   root=default row=a rack=a2 chassis=a2a host=a2a1
  51
  52 Note:
  53
  54 #. Note that the order of the keys does not matter.
  55 #. The key name (left of ``=``) must be a valid CRUSH ``type``.  By default
  56    these include ``root``, ``datacenter``, ``room``, ``row``, ``pod``, ``pdu``,
  57    ``rack``, ``chassis`` and ``host``.
  58    These defined types suffice for almost all clusters, but can be customized
  59    by modifying the CRUSH map.
  60 #. Not all keys need to be specified.  For example, by default, Ceph
  61    automatically sets an ``OSD``'s location to be
  62    ``root=default host=HOSTNAME`` (based on the output from ``hostname -s``).
  63
  64 The CRUSH location for an OSD can be defined by adding the ``crush location``
  65 option in ``ceph.conf``.  Each time the OSD starts,
  66 it verifies it is in the correct location in the CRUSH map and, if it is not,
  67 it moves itself.  To disable this automatic CRUSH map management, add the
  68 following to your configuration file in the ``[osd]`` section::
  69
  70   osd crush update on start = false
  71
  72 Note that in most cases you will not need to manually configure this.
  73
  74
  75 Custom location hooks
  76 ---------------------
  77
  78 A customized location hook can be used to generate a more complete
  79 CRUSH location on startup.  The CRUSH location is based on, in order
  80 of preference:
  81
  82 #. A ``crush location`` option in ``ceph.conf``
  83 #. A default of ``root=default host=HOSTNAME`` where the hostname is
  84    derived from the ``hostname -s`` command
  85
  86 A script can be written to provide additional
  87 location fields (for example, ``rack`` or ``datacenter``) and the
  88 hook enabled via the config option::
  89
  90  crush location hook = /path/to/customized-ceph-crush-location
  91
  92 This hook is passed several arguments (below) and should output a single line
  93 to ``stdout`` with the CRUSH location description.::
  94
  95   --cluster CLUSTER --id ID --type TYPE
  96
  97 where the cluster name is typically ``ceph``, the ``id`` is the daemon
  98 identifier (e.g., the OSD number or daemon identifier), and the daemon
  99 type is ``osd``, ``mds``, etc.
 100
 101 For example, a simple hook that additionally specifies a rack location
 102 based on a value in the file ``/etc/rack`` might be::
 103
 104   #!/bin/sh
 105   echo "host=$(hostname -s) rack=$(cat /etc/rack) root=default"
 106
 107
 108 CRUSH structure
 109 ===============
 110
 111 The CRUSH map consists of a hierarchy that describes
 112 the physical topology of the cluster and a set of rules defining
 113 data placement policy.  The hierarchy has
 114 devices (OSDs) at the leaves, and internal nodes
 115 corresponding to other physical features or groupings: hosts, racks,
 116 rows, datacenters, and so on.  The rules describe how replicas are
 117 placed in terms of that hierarchy (e.g., 'three replicas in different
 118 racks').
 119
 120 Devices
 121 -------
 122
 123 Devices are individual OSDs that store data, usually one for each storage drive.
 124 Devices are identified by an ``id``
 125 (a non-negative integer) and a ``name``, normally ``osd.N`` where ``N`` is the device id.
 126
 127 Since the Luminous release, devices may also have a *device class* assigned (e.g.,
 128 ``hdd`` or ``ssd`` or ``nvme``), allowing them to be conveniently targeted by
 129 CRUSH rules.  This is especially useful when mixing device types within hosts.
 130
 131 .. _crush_map_default_types:
 132
 133 Types and Buckets
 134 -----------------
 135
 136 A bucket is the CRUSH term for internal nodes in the hierarchy: hosts,
 137 racks, rows, etc.  The CRUSH map defines a series of *types* that are
 138 used to describe these nodes.  Default types include:
 139
 140 - ``osd`` (or ``device``)
 141 - ``host``
 142 - ``chassis``
 143 - ``rack``
 144 - ``row``
 145 - ``pdu``
 146 - ``pod``
 147 - ``room``
 148 - ``datacenter``
 149 - ``zone``
 150 - ``region``
 151 - ``root``
 152
 153 Most clusters use only a handful of these types, and others
 154 can be defined as needed.
 155
 156 The hierarchy is built with devices (normally type ``osd``) at the
 157 leaves, interior nodes with non-device types, and a root node of type
 158 ``root``.  For example,
 159
 160 .. ditaa::
 161
 162                         +-----------------+
 163                         |{o}root default  |
 164                         +--------+--------+
 165                                  |
 166                  +---------------+---------------+
 167                  |                               |
 168           +------+------+                 +------+------+
 169           |{o}host foo  |                 |{o}host bar  |
 170           +------+------+                 +------+------+
 171                  |                               |
 172          +-------+-------+               +-------+-------+
 173          |               |               |               |
 174    +-----+-----+   +-----+-----+   +-----+-----+   +-----+-----+
 175    |   osd.0   |   |   osd.1   |   |   osd.2   |   |   osd.3   |
 176    +-----------+   +-----------+   +-----------+   +-----------+
 177
 178 Each node (device or bucket) in the hierarchy has a *weight*
 179 that indicates the relative proportion of the total
 180 data that device or hierarchy subtree should store.  Weights are set
 181 at the leaves, indicating the size of the device, and automatically
 182 sum up the tree, such that the weight of the ``root`` node
 183 will be the total of all devices contained beneath it.  Normally
 184 weights are in units of terabytes (TB).
 185
 186 You can get a simple view the of CRUSH hierarchy for your cluster,
 187 including weights, with::
 188
 189   ceph osd tree
 190
 191 Rules
 192 -----
 193
 194 CRUSH Rules define policy about how data is distributed across the devices
 195 in the hierarchy. They define placement and replication strategies or
 196 distribution policies that allow you to specify exactly how CRUSH
 197 places data replicas. For example, you might create a rule selecting
 198 a pair of targets for two-way mirroring, another rule for selecting
 199 three targets in two different data centers for three-way mirroring, and
 200 yet another rule for erasure coding (EC) across six storage devices. For a
 201 detailed discussion of CRUSH rules, refer to `CRUSH - Controlled,
 202 Scalable, Decentralized Placement of Replicated Data`_, and more
 203 specifically to **Section 3.2**.
 204
 205 CRUSH rules can be created via the CLI by
 206 specifying the *pool type* they will be used for (replicated or
 207 erasure coded), the *failure domain*, and optionally a *device class*.
 208 In rare cases rules must be written by hand by manually editing the
 209 CRUSH map.
 210
 211 You can see what rules are defined for your cluster with::
 212
 213   ceph osd crush rule ls
 214
 215 You can view the contents of the rules with::
 216
 217   ceph osd crush rule dump
 218
 219 Device classes
 220 --------------
 221
 222 Each device can optionally have a *class* assigned.  By
 223 default, OSDs automatically set their class at startup to
 224 `hdd`, `ssd`, or `nvme` based on the type of device they are backed
 225 by.
 226
 227 The device class for one or more OSDs can be explicitly set with::
 228
 229   ceph osd crush set-device-class <class> <osd-name> [...]
 230
 231 Once a device class is set, it cannot be changed to another class
 232 until the old class is unset with::
 233
 234   ceph osd crush rm-device-class <osd-name> [...]
 235
 236 This allows administrators to set device classes without the class
 237 being changed on OSD restart or by some other script.
 238
 239 A placement rule that targets a specific device class can be created with::
 240
 241   ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
 242
 243 A pool can then be changed to use the new rule with::
 244
 245   ceph osd pool set <pool-name> crush_rule <rule-name>
 246
 247 Device classes are implemented by creating a "shadow" CRUSH hierarchy
 248 for each device class in use that contains only devices of that class.
 249 CRUSH rules can then distribute data over the shadow hierarchy.
 250 This approach is fully backward compatible with
 251 old Ceph clients.  You can view the CRUSH hierarchy with shadow items
 252 with::
 253
 254   ceph osd crush tree --show-shadow
 255
 256 For older clusters created before Luminous that relied on manually
 257 crafted CRUSH maps to maintain per-device-type hierarchies, there is a
 258 *reclassify* tool available to help transition to device classes
 259 without triggering data movement (see :ref:`crush-reclassify`).
 260
 261
 262 Weights sets
 263 ------------
 264
 265 A *weight set* is an alternative set of weights to use when
 266 calculating data placement.  The normal weights associated with each
 267 device in the CRUSH map are set based on the device size and indicate
 268 how much data we *should* be storing where.  However, because CRUSH is
 269 a "probabilistic" pseudorandom placement process, there is always some
 270 variation from this ideal distribution, in the same way that rolling a
 271 die sixty times will not result in rolling exactly 10 ones and 10
 272 sixes.  Weight sets allow the cluster to perform numerical optimization
 273 based on the specifics of your cluster (hierarchy, pools, etc.) to achieve
 274 a balanced distribution.
 275
 276 There are two types of weight sets supported:
 277
 278  #. A **compat** weight set is a single alternative set of weights for
 279     each device and node in the cluster.  This is not well-suited for
 280     correcting for all anomalies (for example, placement groups for
 281     different pools may be different sizes and have different load
 282     levels, but will be mostly treated the same by the balancer).
 283     However, compat weight sets have the huge advantage that they are
 284     *backward compatible* with previous versions of Ceph, which means
 285     that even though weight sets were first introduced in Luminous
 286     v12.2.z, older clients (e.g., firefly) can still connect to the
 287     cluster when a compat weight set is being used to balance data.
 288  #. A **per-pool** weight set is more flexible in that it allows
 289     placement to be optimized for each data pool.  Additionally,
 290     weights can be adjusted for each position of placement, allowing
 291     the optimizer to correct for a subtle skew of data toward devices
 292     with small weights relative to their peers (and effect that is
 293     usually only apparently in very large clusters but which can cause
 294     balancing problems).
 295
 296 When weight sets are in use, the weights associated with each node in
 297 the hierarchy is visible as a separate column (labeled either
 298 ``(compat)`` or the pool name) from the command::
 299
 300   ceph osd tree
 301
 302 When both *compat* and *per-pool* weight sets are in use, data
 303 placement for a particular pool will use its own per-pool weight set
 304 if present.  If not, it will use the compat weight set if present.  If
 305 neither are present, it will use the normal CRUSH weights.
 306
 307 Although weight sets can be set up and manipulated by hand, it is
 308 recommended that the ``ceph-mgr`` *balancer* module be enabled to do so
 309 automatically when running Luminous or later releases.
 310
 311
 312 Modifying the CRUSH map
 313 =======================
 314
 315 .. _addosd:
 316
 317 Add/Move an OSD
 318 ---------------
 319
 320 .. note: OSDs are normally automatically added to the CRUSH map when
 321          the OSD is created.  This command is rarely needed.
 322
 323 To add or move an OSD in the CRUSH map of a running cluster::
 324
 325   ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...]
 326
 327 Where:
 328
 329 ``name``
 330
 331 :Description: The full name of the OSD.
 332 :Type: String
 333 :Required: Yes
 334 :Example: ``osd.0``
 335
 336
 337 ``weight``
 338
 339 :Description: The CRUSH weight for the OSD, normally its size measure in terabytes (TB).
 340 :Type: Double
 341 :Required: Yes
 342 :Example: ``2.0``
 343
 344
 345 ``root``
 346
 347 :Description: The root node of the tree in which the OSD resides (normally ``default``)
 348 :Type: Key/value pair.
 349 :Required: Yes
 350 :Example: ``root=default``
 351
 352
 353 ``bucket-type``
 354
 355 :Description: You may specify the OSD's location in the CRUSH hierarchy.
 356 :Type: Key/value pairs.
 357 :Required: No
 358 :Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``
 359
 360
 361 The following example adds ``osd.0`` to the hierarchy, or moves the
 362 OSD from a previous location. ::
 363
 364   ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1
 365
 366
 367 Adjust OSD weight
 368 -----------------
 369
 370 .. note: Normally OSDs automatically add themselves to the CRUSH map
 371          with the correct weight when they are created. This command
 372          is rarely needed.
 373
 374 To adjust an OSD's CRUSH weight in the CRUSH map of a running cluster, execute
 375 the following::
 376
 377   ceph osd crush reweight {name} {weight}
 378
 379 Where:
 380
 381 ``name``
 382
 383 :Description: The full name of the OSD.
 384 :Type: String
 385 :Required: Yes
 386 :Example: ``osd.0``
 387
 388
 389 ``weight``
 390
 391 :Description: The CRUSH weight for the OSD.
 392 :Type: Double
 393 :Required: Yes
 394 :Example: ``2.0``
 395
 396
 397 .. _removeosd:
 398
 399 Remove an OSD
 400 -------------
 401
 402 .. note: OSDs are normally removed from the CRUSH as part of the
 403    ``ceph osd purge`` command.  This command is rarely needed.
 404
 405 To remove an OSD from the CRUSH map of a running cluster, execute the
 406 following::
 407
 408   ceph osd crush remove {name}
 409
 410 Where:
 411
 412 ``name``
 413
 414 :Description: The full name of the OSD.
 415 :Type: String
 416 :Required: Yes
 417 :Example: ``osd.0``
 418
 419
 420 Add a Bucket
 421 ------------
 422
 423 .. note: Buckets are implicitly created when an OSD is added
 424    that specifies a ``{bucket-type}={bucket-name}`` as part of its
 425    location,  if a bucket with that name does not already exist.  This
 426    command is typically used when manually adjusting the structure of the
 427    hierarchy after OSDs have been created.  One use is to move a
 428    series of hosts underneath a new rack-level bucket; another is to
 429    add new ``host`` buckets (OSD nodes) to a dummy ``root`` so that they don't
 430    receive data until you're ready, at which time you would move them to the
 431    ``default`` or other root as described below.
 432
 433 To add a bucket in the CRUSH map of a running cluster, execute the
 434 ``ceph osd crush add-bucket`` command::
 435
 436   ceph osd crush add-bucket {bucket-name} {bucket-type}
 437
 438 Where:
 439
 440 ``bucket-name``
 441
 442 :Description: The full name of the bucket.
 443 :Type: String
 444 :Required: Yes
 445 :Example: ``rack12``
 446
 447
 448 ``bucket-type``
 449
 450 :Description: The type of the bucket. The type must already exist in the hierarchy.
 451 :Type: String
 452 :Required: Yes
 453 :Example: ``rack``
 454
 455
 456 The following example adds the ``rack12`` bucket to the hierarchy::
 457
 458   ceph osd crush add-bucket rack12 rack
 459
 460 Move a Bucket
 461 -------------
 462
 463 To move a bucket to a different location or position in the CRUSH map
 464 hierarchy, execute the following::
 465
 466   ceph osd crush move {bucket-name} {bucket-type}={bucket-name}, [...]
 467
 468 Where:
 469
 470 ``bucket-name``
 471
 472 :Description: The name of the bucket to move/reposition.
 473 :Type: String
 474 :Required: Yes
 475 :Example: ``foo-bar-1``
 476
 477 ``bucket-type``
 478
 479 :Description: You may specify the bucket's location in the CRUSH hierarchy.
 480 :Type: Key/value pairs.
 481 :Required: No
 482 :Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``
 483
 484 Remove a Bucket
 485 ---------------
 486
 487 To remove a bucket from the CRUSH hierarchy, execute the following::
 488
 489   ceph osd crush remove {bucket-name}
 490
 491 .. note:: A bucket must be empty before removing it from the CRUSH hierarchy.
 492
 493 Where:
 494
 495 ``bucket-name``
 496
 497 :Description: The name of the bucket that you'd like to remove.
 498 :Type: String
 499 :Required: Yes
 500 :Example: ``rack12``
 501
 502 The following example removes the ``rack12`` bucket from the hierarchy::
 503
 504   ceph osd crush remove rack12
 505
 506 Creating a compat weight set
 507 ----------------------------
 508
 509 .. note: This step is normally done automatically by the ``balancer``
 510    module when enabled.
 511
 512 To create a *compat* weight set::
 513
 514   ceph osd crush weight-set create-compat
 515
 516 Weights for the compat weight set can be adjusted with::
 517
 518   ceph osd crush weight-set reweight-compat {name} {weight}
 519
 520 The compat weight set can be destroyed with::
 521
 522   ceph osd crush weight-set rm-compat
 523
 524 Creating per-pool weight sets
 525 -----------------------------
 526
 527 To create a weight set for a specific pool,::
 528
 529   ceph osd crush weight-set create {pool-name} {mode}
 530
 531 .. note:: Per-pool weight sets require that all servers and daemons
 532           run Luminous v12.2.z or later.
 533
 534 Where:
 535
 536 ``pool-name``
 537
 538 :Description: The name of a RADOS pool
 539 :Type: String
 540 :Required: Yes
 541 :Example: ``rbd``
 542
 543 ``mode``
 544
 545 :Description: Either ``flat`` or ``positional``.  A *flat* weight set
 546               has a single weight for each device or bucket.  A
 547               *positional* weight set has a potentially different
 548               weight for each position in the resulting placement
 549               mapping.  For example, if a pool has a replica count of
 550               3, then a positional weight set will have three weights
 551               for each device and bucket.
 552 :Type: String
 553 :Required: Yes
 554 :Example: ``flat``
 555
 556 To adjust the weight of an item in a weight set::
 557
 558   ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]}
 559
 560 To list existing weight sets,::
 561
 562   ceph osd crush weight-set ls
 563
 564 To remove a weight set,::
 565
 566   ceph osd crush weight-set rm {pool-name}
 567
 568 Creating a rule for a replicated pool
 569 -------------------------------------
 570
 571 For a replicated pool, the primary decision when creating the CRUSH
 572 rule is what the failure domain is going to be.  For example, if a
 573 failure domain of ``host`` is selected, then CRUSH will ensure that
 574 each replica of the data is stored on a unique host.  If ``rack``
 575 is selected, then each replica will be stored in a different rack.
 576 What failure domain you choose primarily depends on the size and
 577 topology of your cluster.
 578
 579 In most cases the entire cluster hierarchy is nested beneath a root node
 580 named ``default``.  If you have customized your hierarchy, you may
 581 want to create a rule nested at some other node in the hierarchy.  It
 582 doesn't matter what type is associated with that node (it doesn't have
 583 to be a ``root`` node).
 584
 585 It is also possible to create a rule that restricts data placement to
 586 a specific *class* of device.  By default, Ceph OSDs automatically
 587 classify themselves as either ``hdd`` or ``ssd``, depending on the
 588 underlying type of device being used.  These classes can also be
 589 customized.
 590
 591 To create a replicated rule,::
 592
 593   ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}]
 594
 595 Where:
 596
 597 ``name``
 598
 599 :Description: The name of the rule
 600 :Type: String
 601 :Required: Yes
 602 :Example: ``rbd-rule``
 603
 604 ``root``
 605
 606 :Description: The name of the node under which data should be placed.
 607 :Type: String
 608 :Required: Yes
 609 :Example: ``default``
 610
 611 ``failure-domain-type``
 612
 613 :Description: The type of CRUSH nodes across which we should separate replicas.
 614 :Type: String
 615 :Required: Yes
 616 :Example: ``rack``
 617
 618 ``class``
 619
 620 :Description: The device class on which data should be placed.
 621 :Type: String
 622 :Required: No
 623 :Example: ``ssd``
 624
 625 Creating a rule for an erasure coded pool
 626 -----------------------------------------
 627
 628 For an erasure-coded (EC) pool, the same basic decisions need to be made:
 629 what is the failure domain, which node in the
 630 hierarchy will data be placed under (usually ``default``), and will
 631 placement be restricted to a specific device class.  Erasure code
 632 pools are created a bit differently, however, because they need to be
 633 constructed carefully based on the erasure code being used.  For this reason,
 634 you must include this information in the *erasure code profile*.  A CRUSH
 635 rule will then be created from that either explicitly or automatically when
 636 the profile is used to create a pool.
 637
 638 The erasure code profiles can be listed with::
 639
 640   ceph osd erasure-code-profile ls
 641
 642 An existing profile can be viewed with::
 643
 644   ceph osd erasure-code-profile get {profile-name}
 645
 646 Normally profiles should never be modified; instead, a new profile
 647 should be created and used when creating a new pool or creating a new
 648 rule for an existing pool.
 649
 650 An erasure code profile consists of a set of key=value pairs.  Most of
 651 these control the behavior of the erasure code that is encoding data
 652 in the pool.  Those that begin with ``crush-``, however, affect the
 653 CRUSH rule that is created.
 654
 655 The erasure code profile properties of interest are:
 656
 657  * **crush-root**: the name of the CRUSH node under which to place data [default: ``default``].
 658  * **crush-failure-domain**: the CRUSH bucket type across which to distribute erasure-coded shards [default: ``host``].
 659  * **crush-device-class**: the device class on which to place data [default: none, meaning all devices are used].
 660  * **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the number of erasure code shards, affecting the resulting CRUSH rule.
 661
 662 Once a profile is defined, you can create a CRUSH rule with::
 663
 664   ceph osd crush rule create-erasure {name} {profile-name}
 665
 666 .. note: When creating a new pool, it is not actually necessary to
 667    explicitly create the rule.  If the erasure code profile alone is
 668    specified and the rule argument is left off then Ceph will create
 669    the CRUSH rule automatically.
 670
 671 Deleting rules
 672 --------------
 673
 674 Rules that are not in use by pools can be deleted with::
 675
 676   ceph osd crush rule rm {rule-name}
 677
 678
 679 .. _crush-map-tunables:
 680
 681 Tunables
 682 ========
 683
 684 Over time, we have made (and continue to make) improvements to the
 685 CRUSH algorithm used to calculate the placement of data.  In order to
 686 support the change in behavior, we have introduced a series of tunable
 687 options that control whether the legacy or improved variation of the
 688 algorithm is used.
 689
 690 In order to use newer tunables, both clients and servers must support
 691 the new version of CRUSH.  For this reason, we have created
 692 ``profiles`` that are named after the Ceph version in which they were
 693 introduced.  For example, the ``firefly`` tunables are first supported
 694 by the Firefly release, and will not work with older (e.g., Dumpling)
 695 clients.  Once a given set of tunables are changed from the legacy
 696 default behavior, the ``ceph-mon`` and ``ceph-osd`` will prevent older
 697 clients who do not support the new CRUSH features from connecting to
 698 the cluster.
 699
 700 argonaut (legacy)
 701 -----------------
 702
 703 The legacy CRUSH behavior used by Argonaut and older releases works
 704 fine for most clusters, provided there are not many OSDs that have
 705 been marked out.
 706
 707 bobtail (CRUSH_TUNABLES2)
 708 -------------------------
 709
 710 The ``bobtail`` tunable profile fixes a few key misbehaviors:
 711
 712  * For hierarchies with a small number of devices in the leaf buckets,
 713    some PGs map to fewer than the desired number of replicas.  This
 714    commonly happens for hierarchies with "host" nodes with a small
 715    number (1-3) of OSDs nested beneath each one.
 716
 717  * For large clusters, some small percentages of PGs map to fewer than
 718    the desired number of OSDs.  This is more prevalent when there are
 719    multiple hierarchy layers in use (e.g., ``row``, ``rack``, ``host``, ``osd``).
 720
 721  * When some OSDs are marked out, the data tends to get redistributed
 722    to nearby OSDs instead of across the entire hierarchy.
 723
 724 The new tunables are:
 725
 726  * ``choose_local_tries``: Number of local retries.  Legacy value is
 727    2, optimal value is 0.
 728
 729  * ``choose_local_fallback_tries``: Legacy value is 5, optimal value
 730    is 0.
 731
 732  * ``choose_total_tries``: Total number of attempts to choose an item.
 733    Legacy value was 19, subsequent testing indicates that a value of
 734    50 is more appropriate for typical clusters.  For extremely large
 735    clusters, a larger value might be necessary.
 736
 737  * ``chooseleaf_descend_once``: Whether a recursive chooseleaf attempt
 738    will retry, or only try once and allow the original placement to
 739    retry.  Legacy default is 0, optimal value is 1.
 740
 741 Migration impact:
 742
 743  * Moving from ``argonaut`` to ``bobtail`` tunables triggers a moderate amount
 744    of data movement.  Use caution on a cluster that is already
 745    populated with data.
 746
 747 firefly (CRUSH_TUNABLES3)
 748 -------------------------
 749
 750 The ``firefly`` tunable profile fixes a problem
 751 with ``chooseleaf`` CRUSH rule behavior that tends to result in PG
 752 mappings with too few results when too many OSDs have been marked out.
 753
 754 The new tunable is:
 755
 756  * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will
 757    start with a non-zero value of ``r``, based on how many attempts the
 758    parent has already made.  Legacy default is ``0``, but with this value
 759    CRUSH is sometimes unable to find a mapping.  The optimal value (in
 760    terms of computational cost and correctness) is ``1``.
 761
 762 Migration impact:
 763
 764  * For existing clusters that house lots of data, changing
 765    from ``0`` to ``1`` will cause a lot of data to move; a value of ``4`` or ``5``
 766    will allow CRUSH to still find a valid mapping but will cause less data
 767    to move.
 768
 769 straw_calc_version tunable (introduced with Firefly too)
 770 --------------------------------------------------------
 771
 772 There were some problems with the internal weights calculated and
 773 stored in the CRUSH map for ``straw`` algorithm buckets.  Specifically, when
 774 there were items with a CRUSH weight of ``0``, or both a mix of different and
 775 unique weights, CRUSH would distribute data incorrectly (i.e.,
 776 not in proportion to the weights).
 777
 778 The new tunable is:
 779
 780  * ``straw_calc_version``: A value of ``0`` preserves the old, broken
 781    internal weight calculation; a value of ``1`` fixes the behavior.
 782
 783 Migration impact:
 784
 785  * Moving to straw_calc_version ``1`` and then adjusting a straw bucket
 786    (by adding, removing, or reweighting an item, or by using the
 787    reweight-all command) can trigger a small to moderate amount of
 788    data movement *if* the cluster has hit one of the problematic
 789    conditions.
 790
 791 This tunable option is special because it has absolutely no impact
 792 concerning the required kernel version in the client side.
 793
 794 hammer (CRUSH_V4)
 795 -----------------
 796
 797 The ``hammer`` tunable profile does not affect the
 798 mapping of existing CRUSH maps simply by changing the profile.  However:
 799
 800  * There is a new bucket algorithm (``straw2``) supported.  The new
 801    ``straw2`` bucket algorithm fixes several limitations in the original
 802    ``straw``.  Specifically, the old ``straw`` buckets would
 803    change some mappings that should have changed when a weight was
 804    adjusted, while ``straw2`` achieves the original goal of only
 805    changing mappings to or from the bucket item whose weight has
 806    changed.
 807
 808  * ``straw2`` is the default for any newly created buckets.
 809
 810 Migration impact:
 811
 812  * Changing a bucket type from ``straw`` to ``straw2`` will result in
 813    a reasonably small amount of data movement, depending on how much
 814    the bucket item weights vary from each other.  When the weights are
 815    all the same no data will move, and when item weights vary
 816    significantly there will be more movement.
 817
 818 jewel (CRUSH_TUNABLES5)
 819 -----------------------
 820
 821 The ``jewel`` tunable profile improves the
 822 overall behavior of CRUSH such that significantly fewer mappings
 823 change when an OSD is marked out of the cluster.  This results in
 824 significantly less data movement.
 825
 826 The new tunable is:
 827
 828  * ``chooseleaf_stable``: Whether a recursive chooseleaf attempt will
 829    use a better value for an inner loop that greatly reduces the number
 830    of mapping changes when an OSD is marked out.  The legacy value is ``0``,
 831    while the new value of ``1`` uses the new approach.
 832
 833 Migration impact:
 834
 835  * Changing this value on an existing cluster will result in a very
 836    large amount of data movement as almost every PG mapping is likely
 837    to change.
 838
 839
 840
 841
 842 Which client versions support CRUSH_TUNABLES
 843 --------------------------------------------
 844
 845  * argonaut series, v0.48.1 or later
 846  * v0.49 or later
 847  * Linux kernel version v3.6 or later (for the file system and RBD kernel clients)
 848
 849 Which client versions support CRUSH_TUNABLES2
 850 ---------------------------------------------
 851
 852  * v0.55 or later, including bobtail series (v0.56.x)
 853  * Linux kernel version v3.9 or later (for the file system and RBD kernel clients)
 854
 855 Which client versions support CRUSH_TUNABLES3
 856 ---------------------------------------------
 857
 858  * v0.78 (firefly) or later
 859  * Linux kernel version v3.15 or later (for the file system and RBD kernel clients)
 860
 861 Which client versions support CRUSH_V4
 862 --------------------------------------
 863
 864  * v0.94 (hammer) or later
 865  * Linux kernel version v4.1 or later (for the file system and RBD kernel clients)
 866
 867 Which client versions support CRUSH_TUNABLES5
 868 ---------------------------------------------
 869
 870  * v10.0.2 (jewel) or later
 871  * Linux kernel version v4.5 or later (for the file system and RBD kernel clients)
 872
 873 Warning when tunables are non-optimal
 874 -------------------------------------
 875
 876 Starting with version v0.74, Ceph will issue a health warning if the
 877 current CRUSH tunables don't include all the optimal values from the
 878 ``default`` profile (see below for the meaning of the ``default`` profile).
 879 To make this warning go away, you have two options:
 880
 881 1. Adjust the tunables on the existing cluster.  Note that this will
 882    result in some data movement (possibly as much as 10%).  This is the
 883    preferred route, but should be taken with care on a production cluster
 884    where the data movement may affect performance.  You can enable optimal
 885    tunables with::
 886
 887       ceph osd crush tunables optimal
 888
 889    If things go poorly (e.g., too much load) and not very much
 890    progress has been made, or there is a client compatibility problem
 891    (old kernel CephFS or RBD clients, or pre-Bobtail ``librados``
 892    clients), you can switch back with::
 893
 894       ceph osd crush tunables legacy
 895
 896 2. You can make the warning go away without making any changes to CRUSH by
 897    adding the following option to your ceph.conf ``[mon]`` section::
 898
 899       mon warn on legacy crush tunables = false
 900
 901    For the change to take effect, you will need to restart the monitors, or
 902    apply the option to running monitors with::
 903
 904       ceph tell mon.\* config set mon_warn_on_legacy_crush_tunables false
 905
 906
 907 A few important points
 908 ----------------------
 909
 910  * Adjusting these values will result in the shift of some PGs between
 911    storage nodes.  If the Ceph cluster is already storing a lot of
 912    data, be prepared for some fraction of the data to move.
 913  * The ``ceph-osd`` and ``ceph-mon`` daemons will start requiring the
 914    feature bits of new connections as soon as they get
 915    the updated map.  However, already-connected clients are
 916    effectively grandfathered in, and will misbehave if they do not
 917    support the new feature.
 918  * If the CRUSH tunables are set to non-legacy values and then later
 919    changed back to the default values, ``ceph-osd`` daemons will not be
 920    required to support the feature.  However, the OSD peering process
 921    requires examining and understanding old maps.  Therefore, you
 922    should not run old versions of the ``ceph-osd`` daemon
 923    if the cluster has previously used non-legacy CRUSH values, even if
 924    the latest version of the map has been switched back to using the
 925    legacy defaults.
 926
 927 Tuning CRUSH
 928 ------------
 929
 930 The simplest way to adjust CRUSH tunables is by applying them in matched
 931 sets known as *profiles*.  As of the Octopus release these are:
 932
 933  * ``legacy``: the legacy behavior from argonaut and earlier.
 934  * ``argonaut``: the legacy values supported by the original argonaut release
 935  * ``bobtail``: the values supported by the bobtail release
 936  * ``firefly``: the values supported by the firefly release
 937  * ``hammer``: the values supported by the hammer release
 938  * ``jewel``: the values supported by the jewel release
 939  * ``optimal``: the best (ie optimal) values of the current version of Ceph
 940  * ``default``: the default values of a new cluster installed from
 941    scratch. These values, which depend on the current version of Ceph,
 942    are hardcoded and are generally a mix of optimal and legacy values.
 943    These values generally match the ``optimal`` profile of the previous
 944    LTS release, or the most recent release for which we generally expect
 945    most users to have up-to-date clients for.
 946
 947 You can apply a profile to a running cluster with the command::
 948
 949  ceph osd crush tunables {PROFILE}
 950
 951 Note that this may result in data movement, potentially quite a bit.  Study
 952 release notes and documentation carefully before changing the profile on a
 953 running cluster, and consider throttling recovery/backfill parameters to
 954 limit the impact of a bolus of backfill.
 955
 956
 957 .. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
 958
 959
 960 Primary Affinity
 961 ================
 962
 963 When a Ceph Client reads or writes data, it first contacts the primary OSD in
 964 each affected PG's acting set. By default, the first OSD in the acting set is
 965 the primary.  For example, in the acting set ``[2, 3, 4]``, ``osd.2`` is
 966 listed first and thus is the primary (aka lead) OSD. Sometimes we know that an
 967 OSD is less well suited to act as the lead than are other OSDs (e.g., it has
 968 a slow drive or a slow controller). To prevent performance bottlenecks
 969 (especially on read operations) while maximizing utilization of your hardware,
 970 you can influence the selection of primary OSDs by adjusting primary affinity
 971 values, or by crafting a CRUSH rule that selects preferred OSDs first.
 972
 973 Tuning primary OSD selection is mainly useful for replicated pools, because
 974 by default read operations are served from the primary OSD for each PG.
 975 For erasure coded (EC) pools, a way to speed up read operations is to enable
 976 **fast read** as described in :ref:`pool-settings`.
 977
 978 A common scenario for primary affinity is when a cluster contains
 979 a mix of drive sizes, for example older racks with 1.9 TB SATA SSDS and newer racks with
 980 3.84TB SATA SSDs.  On average the latter will be assigned double the number of
 981 PGs and thus will serve double the number of write and read operations, thus
 982 they'll be busier than the former.  A rough assignment of primary affinity
 983 inversely proportional to OSD size won't be 100% optimal, but it can readily
 984 achieve a 15% improvement in overall read throughput by utilizing SATA
 985 interface bandwidth and CPU cycles more evenly.
 986
 987 By default, all ceph OSDs have primary affinity of ``1``, which indicates that
 988 any OSD may act as a primary with equal probability.
 989
 990 You can reduce a Ceph OSD's primary affinity so that CRUSH is less likely to choose
 991 the OSD as primary in a PG's acting set.::
 992
 993         ceph osd primary-affinity <osd-id> <weight>
 994
 995 You may set an OSD's primary affinity to a real number in the range
 996 ``[0-1]``, where ``0`` indicates that the OSD may **NOT** be used as a primary
 997 and ``1`` indicates that an OSD may be used as a primary.  When the weight is
 998 between these extremes, it is less likely that
 999 CRUSH will select that OSD as a primary.  The process for
1000 selecting the lead OSD is more nuanced than a simple probability based on
1001 relative affinity values, but measurable results can be achieved even with
1002 first-order approximations of desirable values.
1003
1004 Custom CRUSH Rules
1005 ------------------
1006
1007 There are occasional clusters that balance cost and performance by mixing SSDs
1008 and HDDs in the same replicated pool. By setting the primary affinity of HDD
1009 OSDs to ``0`` one can direct operations to the SSD in each acting set. An
1010 alternative is to define a CRUSH rule that always selects an SSD OSD as the
1011 first OSD, then selects HDDs for the remaining OSDs. Thus, each PG's acting
1012 set will contain exactly one SSD OSD as the primary with the balance on HDDs.
1013
1014 For example, the CRUSH rule below::
1015
1016         rule mixed_replicated_rule {
1017                 id 11
1018                 type replicated
1019                 min_size 1
1020                 max_size 10
1021                 step take default class ssd
1022                 step chooseleaf firstn 1 type host
1023                 step emit
1024                 step take default class hdd
1025                 step chooseleaf firstn 0 type host
1026                 step emit
1027         }
1028
1029 chooses an SSD as the first OSD.  Note that for an ``N``-times replicated pool
1030 this rule selects ``N+1`` OSDs to guarantee that ``N`` copies are on different
1031 hosts, because the first SSD OSD might be co-located with any of the ``N`` HDD
1032 OSDs.
1033
1034 This extra storage requirement can be avoided by placing SSDs and HDDs in
1035 different hosts with the tradeoff that hosts with SSDs will receive all client
1036 requests.  You may thus consider faster CPU(s) for SSD hosts and more modest
1037 ones for HDD nodes, since the latter will normally only service recovery
1038 operations.  Here the CRUSH roots ``ssd_hosts`` and ``hdd_hosts`` strictly
1039 must not contain the same servers::
1040
1041         rule mixed_replicated_rule_two {
1042                id 1
1043                type replicated
1044                min_size 1
1045                max_size 10
1046                step take ssd_hosts class ssd
1047                step chooseleaf firstn 1 type host
1048                step emit
1049                step take hdd_hosts class hdd
1050                step chooseleaf firstn -1 type host
1051                step emit
1052         }
1053
1054
1055
1056 Note also that on failure of an SSD, requests to a PG will be served temporarily
1057 from a (slower) HDD OSD until the PG's data has been replicated onto the replacement
1058 primary SSD OSD.
1059