Manually editing a CRUSH Map
============================

.. note:: Manually editing the CRUSH map is considered an advanced
          administrator operation. All CRUSH changes that are
          necessary for the overwhelming majority of installations are
          possible via the standard ceph CLI and do not require manual
          CRUSH map edits. If you have identified a use case where
          manual edits *are* necessary, consider contacting the Ceph
          developers so that future versions of Ceph can make this
          unnecessary.

To edit an existing CRUSH map:

#. `Get the CRUSH map`_.
#. `Decompile`_ the CRUSH map.
#. Edit at least one of `Devices`_, `Buckets`_ and `Rules`_.
#. `Recompile`_ the CRUSH map.
#. `Set the CRUSH map`_.

For details on setting the CRUSH map rule for a specific pool, see `Set
Pool Values`_.

.. _Get the CRUSH map: #getcrushmap
.. _Decompile: #decompilecrushmap
.. _Devices: #crushmapdevices
.. _Buckets: #crushmapbuckets
.. _Rules: #crushmaprules
.. _Recompile: #compilecrushmap
.. _Set the CRUSH map: #setcrushmap
.. _Set Pool Values: ../pools#setpoolvalues

.. _getcrushmap:

Get a CRUSH Map
---------------

To get the CRUSH map for your cluster, execute the following::

    ceph osd getcrushmap -o {compiled-crushmap-filename}

Ceph will output (-o) a compiled CRUSH map to the filename you specified. Since
the CRUSH map is in a compiled form, you must decompile it first before you can
edit it.

.. _decompilecrushmap:

Decompile a CRUSH Map
---------------------

To decompile a CRUSH map, execute the following::

    crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename}
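
Once you have made your edits, the decompiled map must be recompiled and
injected back into the cluster (see `Recompile`_ and `Set the CRUSH map`_).
As a rough sketch of a full edit session (the file names under ``/tmp`` are
only examples), the workflow looks like this::

    ceph osd getcrushmap -o /tmp/crush.bin
    crushtool -d /tmp/crush.bin -o /tmp/crush.txt
    # edit /tmp/crush.txt with your preferred editor
    crushtool -c /tmp/crush.txt -o /tmp/crush.new
    ceph osd setcrushmap -i /tmp/crush.new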

Sections
--------

There are six main sections to a CRUSH Map; a skeletal outline of how they
appear in a decompiled map follows the list below.

#. **tunables:** The preamble at the top of the map describes any *tunables*
   for CRUSH behavior that vary from the historical/legacy CRUSH behavior. These
   correct for old bugs, optimizations, or other changes in behavior that have
   been made over the years to improve CRUSH's behavior.

#. **devices:** Devices are individual ``ceph-osd`` daemons that can
   store data.

#. **types**: Bucket ``types`` define the types of buckets used in
   your CRUSH hierarchy. Buckets consist of a hierarchical aggregation
   of storage locations (e.g., rows, racks, chassis, hosts, etc.) and
   their assigned weights.

#. **buckets:** Once you define bucket types, you must define each node
   in the hierarchy, its type, and which devices or other nodes it
   contains.

#. **rules:** Rules define policy about how data is distributed across
   devices in the hierarchy.

#. **choose_args:** Choose_args are alternative weights associated with
   the hierarchy that have been adjusted to optimize data placement. A single
   choose_args map can be used for the entire cluster, or one can be
   created for each individual pool.
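
The skeleton below is only an illustrative sketch of how these sections are
laid out in a decompiled map; the names, numbers and tunables will differ on
a real cluster, and the ``choose_args`` section only appears if one has been
created::

    # begin crush map
    tunable chooseleaf_descend_once 1

    # devices
    device 0 osd.0

    # types
    type 0 osd
    type 1 host

    # buckets
    host node1 {
        ...
    }

    # rules
    rule replicated_rule {
        ...
    }

    # end crush map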

.. _crushmapdevices:

CRUSH Map Devices
-----------------

Devices are individual ``ceph-osd`` daemons that can store data. You
will normally have one defined here for each OSD daemon in your
cluster. Devices are identified by an id (a non-negative integer) and
a name, normally ``osd.N`` where ``N`` is the device id.

Devices may also have a *device class* associated with them (e.g.,
``hdd`` or ``ssd``), allowing them to be conveniently targeted by a
crush rule.

::

    device {num} {osd.name} [class {class}]

For example::

    device 0 osd.0 class ssd
    device 1 osd.1 class hdd
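
In practice, device classes are usually managed with the CLI rather than by
editing this section by hand. For example, to change the class of an OSD
(``osd.2`` here is only an example), you would remove the old class before
setting the new one::

    ceph osd crush rm-device-class osd.2
    ceph osd crush set-device-class ssd osd.2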

In most cases, each device maps to a single ``ceph-osd`` daemon. This
is normally a single storage device, a pair of devices (for example,
one for data and one for a journal or metadata), or in some cases a
small RAID device.

.. _crushmapbuckets:

CRUSH Map Bucket Types
----------------------

The second list in the CRUSH map defines 'bucket' types. Buckets facilitate
a hierarchy of nodes and leaves. Node (or non-leaf) buckets typically represent
physical locations in a hierarchy. Nodes aggregate other nodes or leaves.
Leaf buckets represent ``ceph-osd`` daemons and their corresponding storage
media.

.. tip:: The term "bucket" used in the context of CRUSH means a node in
   the hierarchy, i.e. a location or a piece of physical hardware. It
   is a different concept from the term "bucket" when used in the
   context of RADOS Gateway APIs.

To add a bucket type to the CRUSH map, create a new line under your list of
bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name.
By convention, there is one leaf bucket and it is ``type 0``; however, you may
give it any name you like (e.g., osd, disk, drive, storage, etc.)::

    type {num} {bucket-name}
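
For reference, the ``#types`` section of a freshly generated CRUSH map
typically looks something like the following; the exact list of types may
vary between Ceph releases::

    # types
    type 0 osd
    type 1 host
    type 2 chassis
    type 3 rack
    type 4 row
    type 5 pdu
    type 6 pod
    type 7 room
    type 8 datacenter
    type 9 region
    type 10 root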

CRUSH Map Bucket Hierarchy
--------------------------

The CRUSH algorithm distributes data objects among storage devices according
to a per-device weight value, approximating a uniform probability distribution.
CRUSH distributes objects and their replicas according to the hierarchical
cluster map you define. Your CRUSH map represents the available storage
devices and the logical elements that contain them.

To map placement groups to OSDs across failure domains, a CRUSH map defines a
hierarchical list of bucket types (i.e., under ``#types`` in the generated CRUSH
map). The purpose of creating a bucket hierarchy is to segregate the
leaf nodes by their failure domains, such as hosts, chassis, racks, power
distribution units, pods, rows, rooms, and data centers. With the exception of
the leaf nodes representing OSDs, the rest of the hierarchy is arbitrary, and
you may define it according to your own needs.

We recommend adapting your CRUSH map to your firm's hardware naming conventions
and using instance names that reflect the physical hardware. Your naming
practice can make it easier to administer the cluster and troubleshoot
problems when an OSD and/or other hardware malfunctions and the administrator
needs access to physical hardware.

In the following example, the bucket hierarchy has a leaf bucket named ``osd``,
and two node buckets named ``host`` and ``rack`` respectively.

.. ditaa::

                           +-----------+
                           | {o}rack   |
                           |   Bucket  |
                           +-----+-----+
                                 |
                 +---------------+---------------+
                 |                               |
           +-----+-----+                   +-----+-----+
           | {o}host   |                   | {o}host   |
           |   Bucket  |                   |   Bucket  |
           +-----+-----+                   +-----+-----+
                 |                               |
         +-------+-------+               +-------+-------+
         |               |               |               |
   +-----+-----+   +-----+-----+   +-----+-----+   +-----+-----+
   |    osd    |   |    osd    |   |    osd    |   |    osd    |
   |   Bucket  |   |   Bucket  |   |   Bucket  |   |   Bucket  |
   +-----------+   +-----------+   +-----------+   +-----------+

.. note:: The higher numbered ``rack`` bucket type aggregates the lower
   numbered ``host`` bucket type.

Since leaf nodes reflect storage devices declared under the ``#devices`` list
at the beginning of the CRUSH map, you do not need to declare them as bucket
instances. The second lowest bucket type in your hierarchy usually aggregates
the devices (i.e., it's usually the computer containing the storage media, and
uses whatever term you prefer to describe it, such as "node", "computer",
"server", "host", "machine", etc.). In high density environments, it is
increasingly common to see multiple hosts/nodes per chassis. You should account
for chassis failure too--e.g., the need to pull a chassis if a node fails may
result in bringing down numerous hosts/nodes and their OSDs.

When declaring a bucket instance, you must specify its type, give it a unique
name (string), assign it a unique ID expressed as a negative integer (optional),
specify a weight relative to the total capacity/capability of its item(s),
specify the bucket algorithm (usually ``straw``), and the hash (usually ``0``,
reflecting hash algorithm ``rjenkins1``). A bucket may have one or more items.
The items may consist of node buckets or leaves. Items may have a weight that
reflects the relative weight of the item.

You may declare a node bucket with the following syntax::

    [bucket-type] [bucket-name] {
        id [a unique negative numeric ID]
        weight [the relative capacity/capability of the item(s)]
        alg [the bucket type: uniform | list | tree | straw ]
        hash [the hash type: 0 by default]
        item [item-name] weight [weight]
    }

For example, using the diagram above, we would define two host buckets
and one rack bucket. The OSDs are declared as items within the host buckets::

    host node1 {
        id -1
        alg straw
        hash 0
        item osd.0 weight 1.00
        item osd.1 weight 1.00
    }

    host node2 {
        id -2
        alg straw
        hash 0
        item osd.2 weight 1.00
        item osd.3 weight 1.00
    }

    rack rack1 {
        id -3
        alg straw
        hash 0
        item node1 weight 2.00
        item node2 weight 2.00
    }

.. note:: In the foregoing example, note that the rack bucket does not contain
   any OSDs. Rather it contains lower level host buckets, and includes the
   sum total of their weight in the item entry.

.. topic:: Bucket Types

   Ceph supports four bucket types, each representing a tradeoff between
   performance and reorganization efficiency. If you are unsure of which bucket
   type to use, we recommend using a ``straw`` bucket. For a detailed
   discussion of bucket types, refer to
   `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_,
   and more specifically to **Section 3.4**. The bucket types are:

   #. **Uniform:** Uniform buckets aggregate devices with **exactly** the same
      weight. For example, when firms commission or decommission hardware, they
      typically do so with many machines that have exactly the same physical
      configuration (e.g., bulk purchases). When storage devices have exactly
      the same weight, you may use the ``uniform`` bucket type, which allows
      CRUSH to map replicas into uniform buckets in constant time. With
      non-uniform weights, you should use another bucket algorithm.

   #. **List**: List buckets aggregate their content as linked lists. Based on
      the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`P` algorithm,
      a list is a natural and intuitive choice for an **expanding cluster**:
      either an object is relocated to the newest device with some appropriate
      probability, or it remains on the older devices as before. The result is
      optimal data migration when items are added to the bucket. Items removed
      from the middle or tail of the list, however, can result in a significant
      amount of unnecessary movement, making list buckets most suitable for
      circumstances in which they **never (or very rarely) shrink**.

   #. **Tree**: Tree buckets use a binary search tree. They are more efficient
      than list buckets when a bucket contains a larger set of items. Based on
      the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`R` algorithm,
      tree buckets reduce the placement time to O(log :sub:`n`), making them
      suitable for managing much larger sets of devices or nested buckets.

   #. **Straw:** List and Tree buckets use a divide and conquer strategy
      in a way that either gives certain items precedence (e.g., those
      at the beginning of a list) or obviates the need to consider entire
      subtrees of items at all. That improves the performance of the replica
      placement process, but can also introduce suboptimal reorganization
      behavior when the contents of a bucket change due to an addition, removal,
      or re-weighting of an item. The straw bucket type allows all items to
      fairly “compete” against each other for replica placement through a
      process analogous to a draw of straws.

.. topic:: Hash

   Each bucket uses a hash algorithm. Currently, Ceph supports ``rjenkins1``.
   Enter ``0`` as your hash setting to select ``rjenkins1``.

.. _weightingbucketitems:

.. topic:: Weighting Bucket Items

   Ceph expresses bucket weights as doubles, which allows for fine
   weighting. A weight is the relative difference between device capacities. We
   recommend using ``1.00`` as the relative weight for a 1TB storage device.
   In such a scenario, a weight of ``0.5`` would represent approximately 500GB,
   and a weight of ``3.00`` would represent approximately 3TB. Higher level
   buckets have a weight that is the sum total of the leaf items aggregated by
   the bucket.

   A bucket item weight is one dimensional, but you may also calculate your
   item weights to reflect the performance of the storage drive. For example,
   if you have many 1TB drives where some have relatively low data transfer
   rate and the others have a relatively high data transfer rate, you may
   weight them differently, even though they have the same capacity (e.g.,
   a weight of 0.80 for the first set of drives with lower total throughput,
   and 1.20 for the second set of drives with higher total throughput).
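
   As an illustrative sketch (the OSD numbers are hypothetical), a host
   holding one drive of each kind might declare its items like this::

      item osd.10 weight 0.80   # 1TB drive with lower throughput
      item osd.11 weight 1.20   # 1TB drive with higher throughput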

.. _crushmaprules:

CRUSH Map Rules
---------------

CRUSH maps support the notion of 'CRUSH rules', which are the rules that
determine data placement for a pool. The default CRUSH map has a rule for each
pool. For large clusters, you will likely create many pools where each pool may
have its own non-default CRUSH rule.

.. note:: In most cases, you will not need to modify the default rule. When
   you create a new pool, by default the rule will be set to ``0``.
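
You can list the rules currently defined in the cluster, and inspect the
contents of a particular rule, without decompiling the map. For example::

    ceph osd crush rule ls
    ceph osd crush rule dump {rule-name}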

CRUSH rules define placement and replication strategies or distribution policies
that allow you to specify exactly how CRUSH places object replicas. For
example, you might create a rule selecting a pair of targets for 2-way
mirroring, another rule for selecting three targets in two different data
centers for 3-way mirroring, and yet another rule for erasure coding over six
storage devices. For a detailed discussion of CRUSH rules, refer to
`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_,
and more specifically to **Section 3.2**.

A rule takes the following form::

    rule <rulename> {

        ruleset <ruleset>
        type [ replicated | erasure ]
        min_size <min-size>
        max_size <max-size>
        step take <bucket-name> [class <device-class>]
        step [choose|chooseleaf] [firstn|indep] <N> <bucket-type>
        step emit
    }

``ruleset``

:Description: A unique whole number for identifying the rule. The name ``ruleset``
              is a carry-over from the past, when it was possible to have multiple
              CRUSH rules per pool.

:Purpose: A component of the rule mask.
:Type: Integer
:Required: Yes
:Default: 0

``type``

:Description: Describes a rule for either a storage drive (replicated)
              or a RAID.
:Purpose: A component of the rule mask.
:Type: String
:Required: Yes
:Default: ``replicated``
:Valid Values: Currently only ``replicated`` and ``erasure``

``min_size``

:Description: If a pool makes fewer replicas than this number, CRUSH will
              **NOT** select this rule.
:Purpose: A component of the rule mask.

``max_size``

:Description: If a pool makes more replicas than this number, CRUSH will
              **NOT** select this rule.
:Purpose: A component of the rule mask.

``step take <bucket-name> [class <device-class>]``

:Description: Takes a bucket name, and begins iterating down the tree.
              If the ``device-class`` is specified, it must match
              a class previously used when defining a device. All
              devices that do not belong to the class are excluded.
:Purpose: A component of the rule.
:Example: ``step take data``

``step choose firstn {num} type {bucket-type}``

:Description: Selects the number of buckets of the given type. The number is
              usually the number of replicas in the pool (i.e., pool size).

              - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available).
              - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets.
              - If ``{num} < 0``, it means ``pool-num-replicas - {num}``.

:Purpose: A component of the rule.
:Prerequisite: Follows ``step take`` or ``step choose``.
:Example: ``step choose firstn 1 type row``

``step chooseleaf firstn {num} type {bucket-type}``

:Description: Selects a set of buckets of ``{bucket-type}`` and chooses a leaf
              node from the subtree of each bucket in the set of buckets. The
              number of buckets in the set is usually the number of replicas in
              the pool (i.e., pool size).

              - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available).
              - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets.
              - If ``{num} < 0``, it means ``pool-num-replicas - {num}``.

:Purpose: A component of the rule. Usage removes the need to select a device using two steps.
:Prerequisite: Follows ``step take`` or ``step choose``.
:Example: ``step chooseleaf firstn 0 type row``

``step emit``

:Description: Outputs the current value and empties the stack. Typically used
              at the end of a rule, but may also be used to pick from different
              trees in the same rule.

:Purpose: A component of the rule.
:Prerequisite: Follows ``step choose``.
:Example: ``step emit``

.. important:: A given CRUSH rule may be assigned to multiple pools, but it
   is not possible for a single pool to have multiple CRUSH rules.
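
Putting the steps together, a simple replicated rule that spreads replicas
across racks under a root bucket named ``default`` might look like the
following sketch; the rule name and the numbers are illustrative only::

    rule replicated_racks {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
    }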

Placing Different Pools on Different OSDs
=========================================

Suppose you want to have most pools default to OSDs backed by large hard drives,
but have some pools mapped to OSDs backed by fast solid-state drives (SSDs).
It's possible to have multiple independent CRUSH hierarchies within the same
CRUSH map. Define two hierarchies with two different root nodes--one for hard
disks (e.g., "root platter") and one for SSDs (e.g., "root ssd") as shown
below::

    device 0 osd.0
    device 1 osd.1
    device 2 osd.2
    device 3 osd.3
    device 4 osd.4
    device 5 osd.5
    device 6 osd.6
    device 7 osd.7

    host ceph-osd-ssd-server-1 {
        id -1
        alg straw
        hash 0
        item osd.0 weight 1.00
        item osd.1 weight 1.00
    }

    host ceph-osd-ssd-server-2 {
        id -2
        alg straw
        hash 0
        item osd.2 weight 1.00
        item osd.3 weight 1.00
    }

    host ceph-osd-platter-server-1 {
        id -3
        alg straw
        hash 0
        item osd.4 weight 1.00
        item osd.5 weight 1.00
    }

    host ceph-osd-platter-server-2 {
        id -4
        alg straw
        hash 0
        item osd.6 weight 1.00
        item osd.7 weight 1.00
    }

    root platter {
        id -5
        alg straw
        hash 0
        item ceph-osd-platter-server-1 weight 2.00
        item ceph-osd-platter-server-2 weight 2.00
    }

    root ssd {
        id -6
        alg straw
        hash 0
        item ceph-osd-ssd-server-1 weight 2.00
        item ceph-osd-ssd-server-2 weight 2.00
    }

    rule data {
        ruleset 0
        type replicated
        min_size 2
        max_size 2
        step take platter
        step chooseleaf firstn 0 type host
        step emit
    }

    rule metadata {
        ruleset 1
        type replicated
        min_size 0
        max_size 10
        step take platter
        step chooseleaf firstn 0 type host
        step emit
    }

    rule rbd {
        ruleset 2
        type replicated
        min_size 0
        max_size 10
        step take platter
        step chooseleaf firstn 0 type host
        step emit
    }

    rule platter {
        ruleset 3
        type replicated
        min_size 0
        max_size 10
        step take platter
        step chooseleaf firstn 0 type host
        step emit
    }

    rule ssd {
        ruleset 4
        type replicated
        min_size 0
        max_size 4
        step take ssd
        step chooseleaf firstn 0 type host
        step emit
    }

    rule ssd-primary {
        ruleset 5
        type replicated
        min_size 5
        max_size 10
        step take ssd
        step chooseleaf firstn 1 type host
        step emit
        step take platter
        step chooseleaf firstn -1 type host
        step emit
    }

You can then set a pool to use the SSD rule by::

    ceph osd pool set <poolname> crush_ruleset 4

Similarly, using the ``ssd-primary`` rule will cause each placement group in the
pool to be placed with an SSD as the primary and platters as the replicas.
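
Note that on recent Ceph releases the pool property is named ``crush_rule``
and takes the rule name rather than a numeric ruleset, so the equivalent
command would look something like::

    ceph osd pool set <poolname> crush_rule ssd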

Tuning CRUSH, the hard way
--------------------------

If you can ensure that all clients are running recent code, you can
adjust the tunables by extracting the CRUSH map, modifying the values,
and reinjecting it into the cluster.

* Extract the latest CRUSH map::

    ceph osd getcrushmap -o /tmp/crush

* Adjust tunables. These values appear to offer the best behavior
  for both large and small clusters we tested with. You will need to
  additionally specify the ``--enable-unsafe-tunables`` argument to
  ``crushtool`` for this to work. Please use this option with
  **extreme care**::

    crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new

* Reinject modified map::

    ceph osd setcrushmap -i /tmp/crush.new
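
* You can confirm that the new values have taken effect by dumping the
  tunables from the cluster, for example::

    ceph osd crush show-tunables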

Legacy values
-------------

For reference, the legacy values for the CRUSH tunables can be set
with::

    crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy

Again, the special ``--enable-unsafe-tunables`` option is required.
Further, as noted above, be careful running old versions of the
``ceph-osd`` daemon after reverting to legacy values as the feature
bit is not perfectly enforced.