1 ============
2 CRUSH Maps
3 ============
4
5 The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm
6 determines how to store and retrieve data by computing data storage locations.
7 CRUSH empowers Ceph clients to communicate with OSDs directly rather than
8 through a centralized server or broker. With an algorithmically determined
9 method of storing and retrieving data, Ceph avoids a single point of failure, a
10 performance bottleneck, and a physical limit to its scalability.
11
12 CRUSH requires a map of your cluster, and uses the CRUSH map to pseudo-randomly
13 store and retrieve data in OSDs with a uniform distribution of data across the
14 cluster. For a detailed discussion of CRUSH, see
15 `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_
16
17 CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)`, a list of
18 'buckets' for aggregating the devices into physical locations, and a list of
19 rules that tell CRUSH how it should replicate data in a Ceph cluster's pools. By
20 reflecting the underlying physical organization of the installation, CRUSH can
21 model—and thereby address—potential sources of correlated device failures.
22 Typical sources include physical proximity, a shared power source, and a shared
23 network. By encoding this information into the cluster map, CRUSH placement
24 policies can separate object replicas across different failure domains while
25 still maintaining the desired distribution. For example, to address the
26 possibility of concurrent failures, it may be desirable to ensure that data
27 replicas are on devices using different shelves, racks, power supplies,
28 controllers, and/or physical locations.
29
30 When you deploy OSDs they are automatically placed within the CRUSH map under a
31 ``host`` node named with the hostname for the host they are running on. This,
32 combined with the default CRUSH failure domain, ensures that replicas or erasure
33 code shards are separated across hosts and a single host failure will not
affect availability. For larger clusters, however, administrators should
carefully consider their choice of failure domain. Separating replicas
across racks, for example, is common for mid- to large-sized clusters.
36
37
38 CRUSH Location
39 ==============
40
41 The location of an OSD in terms of the CRUSH map's hierarchy is
42 referred to as a ``crush location``. This location specifier takes the
43 form of a list of key and value pairs describing a position. For
44 example, if an OSD is in a particular row, rack, chassis and host, and
45 is part of the 'default' CRUSH tree (this is the case for the vast
46 majority of clusters), its crush location could be described as::
47
48 root=default row=a rack=a2 chassis=a2a host=a2a1
49
50 Note:
51
#. The order of the keys does not matter.
53 #. The key name (left of ``=``) must be a valid CRUSH ``type``. By default
54 these include root, datacenter, room, row, pod, pdu, rack, chassis and host,
55 but those types can be customized to be anything appropriate by modifying
56 the CRUSH map.
57 #. Not all keys need to be specified. For example, by default, Ceph
58 automatically sets a ``ceph-osd`` daemon's location to be
59 ``root=default host=HOSTNAME`` (based on the output from ``hostname -s``).
60
61 The crush location for an OSD is normally expressed via the ``crush location``
62 config option being set in the ``ceph.conf`` file. Each time the OSD starts,
63 it verifies it is in the correct location in the CRUSH map and, if it is not,
it moves itself. To disable this automatic CRUSH map management, add the
65 following to your configuration file in the ``[osd]`` section::
66
67 osd crush update on start = false
68
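For example, a host whose OSDs should be placed in a particular row and
rack could carry something like the following in the ``[osd]`` section of
its ``ceph.conf`` (the bucket names reuse the illustrative example above)::

    crush location = root=default row=a rack=a2 chassis=a2a host=a2a1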
69
70 Custom location hooks
71 ---------------------
72
73 A customized location hook can be used to generate a more complete
74 crush location on startup. The sample ``ceph-crush-location`` utility
75 will generate a CRUSH location string for a given daemon. The
76 location is based on, in order of preference:
77
78 #. A ``crush location`` option in ceph.conf.
79 #. A default of ``root=default host=HOSTNAME`` where the hostname is
80 generated with the ``hostname -s`` command.
81
82 This is not useful by itself, as the OSD itself has the exact same
83 behavior. However, the script can be modified to provide additional
84 location fields (for example, the rack or datacenter), and then the
85 hook enabled via the config option::
86
87 crush location hook = /path/to/customized-ceph-crush-location
88
89 This hook is passed several arguments (below) and should output a single line
to stdout with the CRUSH location description::
91
92 $ ceph-crush-location --cluster CLUSTER --id ID --type TYPE
93
94 where the cluster name is typically 'ceph', the id is the daemon
95 identifier (the OSD number), and the daemon type is typically ``osd``.
96
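A customized hook is typically just a small script. The following is only
a minimal sketch: it assumes the rack name is stored in a local file
(``/etc/rack`` is purely illustrative) and ignores the arguments it is
passed::

    #!/bin/sh
    # Illustrative custom CRUSH location hook.
    # Emits a location string that adds a rack field; the --cluster,
    # --id and --type arguments passed by the daemon are ignored here.
    RACK=$(cat /etc/rack 2>/dev/null || echo unknown)
    echo "root=default rack=${RACK} host=$(hostname -s)"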
97
98 CRUSH structure
99 ===============
100
101 The CRUSH map consists of, loosely speaking, a hierarchy describing
102 the physical topology of the cluster, and a set of rules defining
103 policy about how we place data on those devices. The hierarchy has
104 devices (``ceph-osd`` daemons) at the leaves, and internal nodes
105 corresponding to other physical features or groupings: hosts, racks,
106 rows, datacenters, and so on. The rules describe how replicas are
107 placed in terms of that hierarchy (e.g., 'three replicas in different
108 racks').
109
110 Devices
111 -------
112
113 Devices are individual ``ceph-osd`` daemons that can store data. You
114 will normally have one defined here for each OSD daemon in your
115 cluster. Devices are identified by an id (a non-negative integer) and
116 a name, normally ``osd.N`` where ``N`` is the device id.
117
118 Devices may also have a *device class* associated with them (e.g.,
``hdd`` or ``ssd``), allowing them to be conveniently targeted by a
120 crush rule.
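
Device classes are normally detected automatically. The classes in use can
be listed, and the class of an OSD overridden if the detection is wrong,
with commands along the following lines (``osd.0`` and ``nvme`` are only
illustrative; an existing class may need to be removed before a new one
can be set)::

    ceph osd crush class ls
    ceph osd crush rm-device-class osd.0
    ceph osd crush set-device-class nvme osd.0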
121
122 Types and Buckets
123 -----------------
124
125 A bucket is the CRUSH term for internal nodes in the hierarchy: hosts,
126 racks, rows, etc. The CRUSH map defines a series of *types* that are
127 used to describe these nodes. By default, these types include:
128
129 - osd (or device)
130 - host
131 - chassis
132 - rack
133 - row
134 - pdu
135 - pod
136 - room
137 - datacenter
138 - region
139 - root
140
141 Most clusters make use of only a handful of these types, and others
142 can be defined as needed.
143
144 The hierarchy is built with devices (normally type ``osd``) at the
145 leaves, interior nodes with non-device types, and a root node of type
146 ``root``. For example,
147
148 .. ditaa::
149
150 +-----------------+
151 | {o}root default |
152 +--------+--------+
153 |
154 +---------------+---------------+
155 | |
156 +-------+-------+ +-----+-------+
157 | {o}host foo | | {o}host bar |
158 +-------+-------+ +-----+-------+
159 | |
160 +-------+-------+ +-------+-------+
161 | | | |
162 +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+
163 | osd.0 | | osd.1 | | osd.2 | | osd.3 |
164 +-----------+ +-----------+ +-----------+ +-----------+
165
166 Each node (device or bucket) in the hierarchy has a *weight*
167 associated with it, indicating the relative proportion of the total
168 data that device or hierarchy subtree should store. Weights are set
169 at the leaves, indicating the size of the device, and automatically
170 sum up the tree from there, such that the weight of the default node
171 will be the total of all devices contained beneath it. Normally
172 weights are in units of terabytes (TB).
173
You can get a simple view of the CRUSH hierarchy for your cluster,
175 including the weights, with::
176
177 ceph osd crush tree
178
179 Rules
180 -----
181
182 Rules define policy about how data is distributed across the devices
183 in the hierarchy.
184
185 CRUSH rules define placement and replication strategies or
186 distribution policies that allow you to specify exactly how CRUSH
187 places object replicas. For example, you might create a rule selecting
188 a pair of targets for 2-way mirroring, another rule for selecting
189 three targets in two different data centers for 3-way mirroring, and
190 yet another rule for erasure coding over six storage devices. For a
191 detailed discussion of CRUSH rules, refer to `CRUSH - Controlled,
192 Scalable, Decentralized Placement of Replicated Data`_, and more
193 specifically to **Section 3.2**.
194
195 In almost all cases, CRUSH rules can be created via the CLI by
196 specifying the *pool type* they will be used for (replicated or
197 erasure coded), the *failure domain*, and optionally a *device class*.
198 In rare cases rules must be written by hand by manually editing the
199 CRUSH map.
200
201 You can see what rules are defined for your cluster with::
202
203 ceph osd crush rule ls
204
205 You can view the contents of the rules with::
206
207 ceph osd crush rule dump
208
209
Weight sets
-----------
212
213 A *weight set* is an alternative set of weights to use when
214 calculating data placement. The normal weights associated with each
215 device in the CRUSH map are set based on the device size and indicate
216 how much data we *should* be storing where. However, because CRUSH is
217 based on a pseudorandom placement process, there is always some
variation from this ideal distribution, in the same way that rolling a
die sixty times will not result in exactly 10 ones and 10
220 sixes. Weight sets allow the cluster to do a numerical optimization
221 based on the specifics of your cluster (hierarchy, pools, etc.) to achieve
222 a balanced distribution.
223
224 There are two types of weight sets supported:
225
226 #. A **compat** weight set is a single alternative set of weights for
227 each device and node in the cluster. This is not well-suited for
228 correcting for all anomalies (for example, placement groups for
229 different pools may be different sizes and have different load
230 levels, but will be mostly treated the same by the balancer).
231 However, compat weight sets have the huge advantage that they are
232 *backward compatible* with previous versions of Ceph, which means
233 that even though weight sets were first introduced in Luminous
234 v12.2.z, older clients (e.g., firefly) can still connect to the
235 cluster when a compat weight set is being used to balance data.
236 #. A **per-pool** weight set is more flexible in that it allows
237 placement to be optimized for each data pool. Additionally,
238 weights can be adjusted for each position of placement, allowing
the optimizer to correct for a subtle skew of data toward devices
with small weights relative to their peers (an effect that is
usually only apparent in very large clusters but which can cause
242 balancing problems).
243
244 When weight sets are in use, the weights associated with each node in
the hierarchy are visible as a separate column (labeled either
246 ``(compat)`` or the pool name) from the command::
247
248 ceph osd crush tree
249
250 When both *compat* and *per-pool* weight sets are in use, data
251 placement for a particular pool will use its own per-pool weight set
252 if present. If not, it will use the compat weight set if present. If
neither is present, it will use the normal CRUSH weights.
254
255 Although weight sets can be set up and manipulated by hand, it is
256 recommended that the *balancer* module be enabled to do so
257 automatically.
258
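For example, enabling the balancer and asking it to maintain a compat
weight set might look like this (module and mode names as found in
Luminous)::

    ceph mgr module enable balancer
    ceph balancer mode crush-compat
    ceph balancer on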
259
260 Modifying the CRUSH map
261 =======================
262
263 .. _addosd:
264
265 Add/Move an OSD
266 ---------------
267
.. note:: OSDs are normally automatically added to the CRUSH map when
   the OSD is created. This command is rarely needed.
270
271 To add or move an OSD in the CRUSH map of a running cluster::
272
273 ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...]
274
275 Where:
276
277 ``name``
278
279 :Description: The full name of the OSD.
280 :Type: String
281 :Required: Yes
282 :Example: ``osd.0``
283
284
285 ``weight``
286
:Description: The CRUSH weight for the OSD, normally its size measured in terabytes (TB).
288 :Type: Double
289 :Required: Yes
290 :Example: ``2.0``
291
292
293 ``root``
294
295 :Description: The root node of the tree in which the OSD resides (normally ``default``)
296 :Type: Key/value pair.
297 :Required: Yes
298 :Example: ``root=default``
299
300
301 ``bucket-type``
302
303 :Description: You may specify the OSD's location in the CRUSH hierarchy.
304 :Type: Key/value pairs.
305 :Required: No
306 :Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``
307
308
309 The following example adds ``osd.0`` to the hierarchy, or moves the
310 OSD from a previous location. ::
311
312 ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1
313
314
315 Adjust OSD weight
316 -----------------
317
.. note:: Normally OSDs automatically add themselves to the CRUSH map
   with the correct weight when they are created. This command
   is rarely needed.
321
322 To adjust an OSD's crush weight in the CRUSH map of a running cluster, execute
323 the following::
324
325 ceph osd crush reweight {name} {weight}
326
327 Where:
328
329 ``name``
330
331 :Description: The full name of the OSD.
332 :Type: String
333 :Required: Yes
334 :Example: ``osd.0``
335
336
337 ``weight``
338
339 :Description: The CRUSH weight for the OSD.
340 :Type: Double
341 :Required: Yes
342 :Example: ``2.0``
343
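For example, using the example values above, the following sets the
weight of ``osd.0`` to ``2.0``::

    ceph osd crush reweight osd.0 2.0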
344
345 .. _removeosd:
346
347 Remove an OSD
348 -------------
349
.. note:: OSDs are normally removed from the CRUSH map as part of the
   ``ceph osd purge`` command. This command is rarely needed.
352
353 To remove an OSD from the CRUSH map of a running cluster, execute the
354 following::
355
356 ceph osd crush remove {name}
357
358 Where:
359
360 ``name``
361
362 :Description: The full name of the OSD.
363 :Type: String
364 :Required: Yes
365 :Example: ``osd.0``
366
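For example, the following removes ``osd.0`` from the CRUSH map::

    ceph osd crush remove osd.0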
367
368 Add a Bucket
369 ------------
370
.. note:: Buckets are normally implicitly created when an OSD is added
   that specifies a ``{bucket-type}={bucket-name}`` as part of its
   location and a bucket with that name does not already exist. This
   command is typically used when manually adjusting the structure of the
   hierarchy after OSDs have been created (for example, to move a
   series of hosts underneath a new rack-level bucket).
377
378 To add a bucket in the CRUSH map of a running cluster, execute the
379 ``ceph osd crush add-bucket`` command::
380
381 ceph osd crush add-bucket {bucket-name} {bucket-type}
382
383 Where:
384
385 ``bucket-name``
386
387 :Description: The full name of the bucket.
388 :Type: String
389 :Required: Yes
390 :Example: ``rack12``
391
392
393 ``bucket-type``
394
395 :Description: The type of the bucket. The type must already exist in the hierarchy.
396 :Type: String
397 :Required: Yes
398 :Example: ``rack``
399
400
401 The following example adds the ``rack12`` bucket to the hierarchy::
402
403 ceph osd crush add-bucket rack12 rack
404
405 Move a Bucket
406 -------------
407
408 To move a bucket to a different location or position in the CRUSH map
409 hierarchy, execute the following::
410
411 ceph osd crush move {bucket-name} {bucket-type}={bucket-name}, [...]
412
413 Where:
414
415 ``bucket-name``
416
417 :Description: The name of the bucket to move/reposition.
418 :Type: String
419 :Required: Yes
420 :Example: ``foo-bar-1``
421
422 ``bucket-type``
423
424 :Description: You may specify the bucket's location in the CRUSH hierarchy.
425 :Type: Key/value pairs.
426 :Required: No
427 :Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``
428
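For example, the following moves the ``foo-bar-1`` host bucket to the
location given in the example above (all names are illustrative)::

    ceph osd crush move foo-bar-1 datacenter=dc1 room=room1 row=foo rack=bar
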
429 Remove a Bucket
430 ---------------
431
432 To remove a bucket from the CRUSH map hierarchy, execute the following::
433
434 ceph osd crush remove {bucket-name}
435
436 .. note:: A bucket must be empty before removing it from the CRUSH hierarchy.
437
438 Where:
439
440 ``bucket-name``
441
442 :Description: The name of the bucket that you'd like to remove.
443 :Type: String
444 :Required: Yes
445 :Example: ``rack12``
446
447 The following example removes the ``rack12`` bucket from the hierarchy::
448
449 ceph osd crush remove rack12
450
451 Creating a compat weight set
452 ----------------------------
453
.. note:: This step is normally done automatically by the ``balancer``
   module when enabled.
456
457 To create a *compat* weight set::
458
459 ceph osd crush weight-set create-compat
460
461 Weights for the compat weight set can be adjusted with::
462
463 ceph osd crush weight-set reweight-compat {name} {weight}
464
465 The compat weight set can be destroyed with::
466
467 ceph osd crush weight-set rm-compat
468
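As a worked example (the OSD name and weight are illustrative; the
balancer module normally manages this for you)::

    ceph osd crush weight-set create-compat
    ceph osd crush weight-set reweight-compat osd.0 1.8
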
469 Creating per-pool weight sets
470 -----------------------------
471
To create a weight set for a specific pool::
473
474 ceph osd crush weight-set create {pool-name} {mode}
475
476 .. note:: Per-pool weight sets require that all servers and daemons
477 run Luminous v12.2.z or later.
478
479 Where:
480
481 ``pool-name``
482
483 :Description: The name of a RADOS pool
484 :Type: String
485 :Required: Yes
486 :Example: ``rbd``
487
488 ``mode``
489
490 :Description: Either ``flat`` or ``positional``. A *flat* weight set
491 has a single weight for each device or bucket. A
492 *positional* weight set has a potentially different
493 weight for each position in the resulting placement
494 mapping. For example, if a pool has a replica count of
495 3, then a positional weight set will have three weights
496 for each device and bucket.
497 :Type: String
498 :Required: Yes
499 :Example: ``flat``
500
501 To adjust the weight of an item in a weight set::
502
503 ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]}
504
To list existing weight sets::
506
507 ceph osd crush weight-set ls
508
To remove a weight set::
510
511 ceph osd crush weight-set rm {pool-name}
512
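As a worked example, a positional weight set for a 3-replica pool named
``rbd`` (the pool name matches the example above; the weights are
illustrative) could be created and adjusted with::

    ceph osd crush weight-set create rbd positional
    ceph osd crush weight-set reweight rbd osd.0 0.9 1.0 1.0
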
513 Creating a rule for a replicated pool
514 -------------------------------------
515
516 For a replicated pool, the primary decision when creating the CRUSH
517 rule is what the failure domain is going to be. For example, if a
518 failure domain of ``host`` is selected, then CRUSH will ensure that
519 each replica of the data is stored on a different host. If ``rack``
520 is selected, then each replica will be stored in a different rack.
521 What failure domain you choose primarily depends on the size of your
522 cluster and how your hierarchy is structured.
523
524 Normally, the entire cluster hierarchy is nested beneath a root node
525 named ``default``. If you have customized your hierarchy, you may
526 want to create a rule nested at some other node in the hierarchy. It
527 doesn't matter what type is associated with that node (it doesn't have
528 to be a ``root`` node).
529
530 It is also possible to create a rule that restricts data placement to
531 a specific *class* of device. By default, Ceph OSDs automatically
532 classify themselves as either ``hdd`` or ``ssd``, depending on the
533 underlying type of device being used. These classes can also be
534 customized.
535
To create a replicated rule::
537
538 ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}]
539
540 Where:
541
542 ``name``
543
544 :Description: The name of the rule
545 :Type: String
546 :Required: Yes
547 :Example: ``rbd-rule``
548
549 ``root``
550
551 :Description: The name of the node under which data should be placed.
552 :Type: String
553 :Required: Yes
554 :Example: ``default``
555
556 ``failure-domain-type``
557
558 :Description: The type of CRUSH nodes across which we should separate replicas.
559 :Type: String
560 :Required: Yes
561 :Example: ``rack``
562
563 ``class``
564
565 :Description: The device class data should be placed on.
566 :Type: String
567 :Required: No
568 :Example: ``ssd``
569
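For example, using the example values above, the following creates a rule
that separates replicas across racks and restricts placement to SSDs, and
then (hypothetically) assigns it to an existing pool named ``rbd`` via the
``crush_rule`` pool setting::

    ceph osd crush rule create-replicated rbd-rule default rack ssd
    ceph osd pool set rbd crush_rule rbd-rule
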
570 Creating a rule for an erasure coded pool
571 -----------------------------------------
572
573 For an erasure-coded pool, the same basic decisions need to be made as
574 with a replicated pool: what is the failure domain, what node in the
575 hierarchy will data be placed under (usually ``default``), and will
576 placement be restricted to a specific device class. Erasure code
577 pools are created a bit differently, however, because they need to be
578 constructed carefully based on the erasure code being used. For this reason,
579 you must include this information in the *erasure code profile*. A CRUSH
580 rule will then be created from that either explicitly or automatically when
581 the profile is used to create a pool.
582
583 The erasure code profiles can be listed with::
584
585 ceph osd erasure-code-profile ls
586
587 An existing profile can be viewed with::
588
589 ceph osd erasure-code-profile get {profile-name}
590
591 Normally profiles should never be modified; instead, a new profile
592 should be created and used when creating a new pool or creating a new
593 rule for an existing pool.
594
595 An erasure code profile consists of a set of key=value pairs. Most of
596 these control the behavior of the erasure code that is encoding data
597 in the pool. Those that begin with ``crush-``, however, affect the
598 CRUSH rule that is created.
599
600 The erasure code profile properties of interest are:
601
602 * **crush-root**: the name of the CRUSH node to place data under [default: ``default``].
603 * **crush-failure-domain**: the CRUSH type to separate erasure-coded shards across [default: ``host``].
604 * **crush-device-class**: the device class to place data on [default: none, meaning all devices are used].
605 * **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the number of erasure code shards, affecting the resulting CRUSH rule.
606
607 Once a profile is defined, you can create a CRUSH rule with::
608
609 ceph osd crush rule create-erasure {name} {profile-name}
610
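For example, the following sketch defines a profile (its name, ``k``/``m``
values, and device class are illustrative) and then creates a matching
rule from it::

    ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=rack crush-device-class=hdd
    ceph osd crush rule create-erasure ec42-rule ec42
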
.. note:: When creating a new pool, it is not actually necessary to
   explicitly create the rule. If the erasure code profile alone is
   specified and the rule argument is left off then Ceph will create
   the CRUSH rule automatically.
615
616 Deleting rules
617 --------------
618
619 Rules that are not in use by pools can be deleted with::
620
621 ceph osd crush rule rm {rule-name}
622
623
624 Tunables
625 ========
626
627 Over time, we have made (and continue to make) improvements to the
628 CRUSH algorithm used to calculate the placement of data. In order to
629 support the change in behavior, we have introduced a series of tunable
630 options that control whether the legacy or improved variation of the
631 algorithm is used.
632
633 In order to use newer tunables, both clients and servers must support
634 the new version of CRUSH. For this reason, we have created
635 ``profiles`` that are named after the Ceph version in which they were
636 introduced. For example, the ``firefly`` tunables are first supported
637 in the firefly release, and will not work with older (e.g., dumpling)
clients. Once a given set of tunables is changed from the legacy
default behavior, the ``ceph-mon`` and ``ceph-osd`` daemons will prevent older
640 clients who do not support the new CRUSH features from connecting to
641 the cluster.
642
643 argonaut (legacy)
644 -----------------
645
646 The legacy CRUSH behavior used by argonaut and older releases works
647 fine for most clusters, provided there are not too many OSDs that have
648 been marked out.
649
650 bobtail (CRUSH_TUNABLES2)
651 -------------------------
652
653 The bobtail tunable profile fixes a few key misbehaviors:
654
655 * For hierarchies with a small number of devices in the leaf buckets,
656 some PGs map to fewer than the desired number of replicas. This
657 commonly happens for hierarchies with "host" nodes with a small
658 number (1-3) of OSDs nested beneath each one.
659
* For large clusters, a small percentage of PGs map to fewer than
  the desired number of OSDs. This is more prevalent when there are
662 several layers of the hierarchy (e.g., row, rack, host, osd).
663
664 * When some OSDs are marked out, the data tends to get redistributed
665 to nearby OSDs instead of across the entire hierarchy.
666
667 The new tunables are:
668
669 * ``choose_local_tries``: Number of local retries. Legacy value is
670 2, optimal value is 0.
671
672 * ``choose_local_fallback_tries``: Legacy value is 5, optimal value
673 is 0.
674
675 * ``choose_total_tries``: Total number of attempts to choose an item.
  Legacy value was 19, but subsequent testing indicates that a value of
  50 is more appropriate for typical clusters. For extremely large
678 clusters, a larger value might be necessary.
679
680 * ``chooseleaf_descend_once``: Whether a recursive chooseleaf attempt
681 will retry, or only try once and allow the original placement to
682 retry. Legacy default is 0, optimal value is 1.
683
684 Migration impact:
685
686 * Moving from argonaut to bobtail tunables triggers a moderate amount
687 of data movement. Use caution on a cluster that is already
688 populated with data.
689
690 firefly (CRUSH_TUNABLES3)
691 -------------------------
692
693 The firefly tunable profile fixes a problem
694 with the ``chooseleaf`` CRUSH rule behavior that tends to result in PG
695 mappings with too few results when too many OSDs have been marked out.
696
697 The new tunable is:
698
699 * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will
700 start with a non-zero value of r, based on how many attempts the
701 parent has already made. Legacy default is 0, but with this value
702 CRUSH is sometimes unable to find a mapping. The optimal value (in
703 terms of computational cost and correctness) is 1.
704
705 Migration impact:
706
707 * For existing clusters that have lots of existing data, changing
708 from 0 to 1 will cause a lot of data to move; a value of 4 or 5
709 will allow CRUSH to find a valid mapping but will make less data
710 move.
711
712 straw_calc_version tunable (introduced with Firefly too)
713 --------------------------------------------------------
714
715 There were some problems with the internal weights calculated and
stored in the CRUSH map for ``straw`` buckets. Specifically, when
there were items with a CRUSH weight of 0, or a mix of different and
duplicated weights, CRUSH would distribute data incorrectly (i.e.,
not in proportion to the weights).
720
721 The new tunable is:
722
723 * ``straw_calc_version``: A value of 0 preserves the old, broken
724 internal weight calculation; a value of 1 fixes the behavior.
725
726 Migration impact:
727
728 * Moving to straw_calc_version 1 and then adjusting a straw bucket
729 (by adding, removing, or reweighting an item, or by using the
730 reweight-all command) can trigger a small to moderate amount of
731 data movement *if* the cluster has hit one of the problematic
732 conditions.
733
This tunable option is special because it has absolutely no impact
on the kernel version required on the client side.
736
737 hammer (CRUSH_V4)
738 -----------------
739
Simply changing to the hammer tunable profile will not affect the
mapping of an existing CRUSH map. However:
742
743 * There is a new bucket type (``straw2``) supported. The new
744 ``straw2`` bucket type fixes several limitations in the original
745 ``straw`` bucket. Specifically, the old ``straw`` buckets would
746 change some mappings that should have changed when a weight was
747 adjusted, while ``straw2`` achieves the original goal of only
748 changing mappings to or from the bucket item whose weight has
749 changed.
750
751 * ``straw2`` is the default for any newly created buckets.
752
753 Migration impact:
754
755 * Changing a bucket type from ``straw`` to ``straw2`` will result in
756 a reasonably small amount of data movement, depending on how much
757 the bucket item weights vary from each other. When the weights are
758 all the same no data will move, and when item weights vary
759 significantly there will be more movement.
760
761 jewel (CRUSH_TUNABLES5)
762 -----------------------
763
764 The jewel tunable profile improves the
765 overall behavior of CRUSH such that significantly fewer mappings
766 change when an OSD is marked out of the cluster.
767
768 The new tunable is:
769
770 * ``chooseleaf_stable``: Whether a recursive chooseleaf attempt will
771 use a better value for an inner loop that greatly reduces the number
772 of mapping changes when an OSD is marked out. The legacy value is 0,
773 while the new value of 1 uses the new approach.
774
775 Migration impact:
776
777 * Changing this value on an existing cluster will result in a very
778 large amount of data movement as almost every PG mapping is likely
779 to change.
780
781
782
783
784 Which client versions support CRUSH_TUNABLES
785 --------------------------------------------
786
787 * argonaut series, v0.48.1 or later
788 * v0.49 or later
789 * Linux kernel version v3.6 or later (for the file system and RBD kernel clients)
790
791 Which client versions support CRUSH_TUNABLES2
792 ---------------------------------------------
793
794 * v0.55 or later, including bobtail series (v0.56.x)
795 * Linux kernel version v3.9 or later (for the file system and RBD kernel clients)
796
797 Which client versions support CRUSH_TUNABLES3
798 ---------------------------------------------
799
800 * v0.78 (firefly) or later
801 * Linux kernel version v3.15 or later (for the file system and RBD kernel clients)
802
803 Which client versions support CRUSH_V4
804 --------------------------------------
805
806 * v0.94 (hammer) or later
807 * Linux kernel version v4.1 or later (for the file system and RBD kernel clients)
808
809 Which client versions support CRUSH_TUNABLES5
810 ---------------------------------------------
811
812 * v10.0.2 (jewel) or later
813 * Linux kernel version v4.5 or later (for the file system and RBD kernel clients)
814
815 Warning when tunables are non-optimal
816 -------------------------------------
817
818 Starting with version v0.74, Ceph will issue a health warning if the
819 current CRUSH tunables don't include all the optimal values from the
820 ``default`` profile (see below for the meaning of the ``default`` profile).
821 To make this warning go away, you have two options:
822
823 1. Adjust the tunables on the existing cluster. Note that this will
824 result in some data movement (possibly as much as 10%). This is the
825 preferred route, but should be taken with care on a production cluster
826 where the data movement may affect performance. You can enable optimal
827 tunables with::
828
829 ceph osd crush tunables optimal
830
831 If things go poorly (e.g., too much load) and not very much
832 progress has been made, or there is a client compatibility problem
833 (old kernel cephfs or rbd clients, or pre-bobtail librados
834 clients), you can switch back with::
835
836 ceph osd crush tunables legacy
837
838 2. You can make the warning go away without making any changes to CRUSH by
839 adding the following option to your ceph.conf ``[mon]`` section::
840
841 mon warn on legacy crush tunables = false
842
843 For the change to take effect, you will need to restart the monitors, or
844 apply the option to running monitors with::
845
846 ceph tell mon.\* injectargs --no-mon-warn-on-legacy-crush-tunables
847
848
849 A few important points
850 ----------------------
851
852 * Adjusting these values will result in the shift of some PGs between
853 storage nodes. If the Ceph cluster is already storing a lot of
854 data, be prepared for some fraction of the data to move.
855 * The ``ceph-osd`` and ``ceph-mon`` daemons will start requiring the
856 feature bits of new connections as soon as they get
857 the updated map. However, already-connected clients are
858 effectively grandfathered in, and will misbehave if they do not
859 support the new feature.
860 * If the CRUSH tunables are set to non-legacy values and then later
changed back to the default values, ``ceph-osd`` daemons will not be
862 required to support the feature. However, the OSD peering process
863 requires examining and understanding old maps. Therefore, you
864 should not run old versions of the ``ceph-osd`` daemon
865 if the cluster has previously used non-legacy CRUSH values, even if
866 the latest version of the map has been switched back to using the
867 legacy defaults.
868
869 Tuning CRUSH
870 ------------
871
The simplest way to adjust the CRUSH tunables is by changing to a known
873 profile. Those are:
874
875 * ``legacy``: the legacy behavior from argonaut and earlier.
876 * ``argonaut``: the legacy values supported by the original argonaut release
877 * ``bobtail``: the values supported by the bobtail release
878 * ``firefly``: the values supported by the firefly release
879 * ``hammer``: the values supported by the hammer release
880 * ``jewel``: the values supported by the jewel release
* ``optimal``: the best (i.e., optimal) values of the current version of Ceph
882 * ``default``: the default values of a new cluster installed from
883 scratch. These values, which depend on the current version of Ceph,
884 are hard coded and are generally a mix of optimal and legacy values.
885 These values generally match the ``optimal`` profile of the previous
  LTS release, or the most recent release for which we expect most
  users to have up-to-date clients.
888
889 You can select a profile on a running cluster with the command::
890
891 ceph osd crush tunables {PROFILE}
892
893 Note that this may result in some data movement.
894
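The tunables currently in effect can be inspected with::

    ceph osd crush show-tunables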
895
896 .. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
897
898
899 Primary Affinity
900 ================
901
902 When a Ceph Client reads or writes data, it always contacts the primary OSD in
903 the acting set. For set ``[2, 3, 4]``, ``osd.2`` is the primary. Sometimes an
904 OSD is not well suited to act as a primary compared to other OSDs (e.g., it has
905 a slow disk or a slow controller). To prevent performance bottlenecks
906 (especially on read operations) while maximizing utilization of your hardware,
907 you can set a Ceph OSD's primary affinity so that CRUSH is less likely to use
908 the OSD as a primary in an acting set. ::
909
910 ceph osd primary-affinity <osd-id> <weight>
911
912 Primary affinity is ``1`` by default (*i.e.,* an OSD may act as a primary). You
may set the OSD primary affinity to any value in the range ``0``-``1``, where ``0`` means that the OSD may
914 **NOT** be used as a primary and ``1`` means that an OSD may be used as a
915 primary. When the weight is ``< 1``, it is less likely that CRUSH will select
916 the Ceph OSD Daemon to act as a primary.
917
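For example, to make CRUSH less likely to select ``osd.2`` as the primary::

    ceph osd primary-affinity osd.2 0.5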
918
919