============
 CRUSH Maps
============

The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm
determines how to store and retrieve data by computing storage locations.
CRUSH empowers Ceph clients to communicate with OSDs directly rather than
through a centralized server or broker. With an algorithmically determined
method of storing and retrieving data, Ceph avoids a single point of failure, a
performance bottleneck, and a physical limit to its scalability.

CRUSH uses a map of your cluster (the CRUSH map) to pseudo-randomly
map data to OSDs, distributing it across the cluster according to configured
replication policy and failure domain. For a detailed discussion of CRUSH, see
`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_.

CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)`, a hierarchy
of 'buckets' for aggregating devices and buckets, and
rules that govern how CRUSH replicates data within the cluster's pools. By
reflecting the underlying physical organization of the installation, CRUSH can
model (and thereby address) the potential for correlated device failures.
Typical factors include chassis, racks, physical proximity, a shared power
source, and shared networking. By encoding this information into the cluster
map, CRUSH placement
policies distribute object replicas across failure domains while
maintaining the desired distribution. For example, to address the
possibility of concurrent failures, it may be desirable to ensure that data
replicas are on devices using different shelves, racks, power supplies,
controllers, and/or physical locations.

When you deploy OSDs they are automatically added to the CRUSH map under a
``host`` bucket named for the node on which they run. This,
combined with the configured CRUSH failure domain, ensures that replicas or
erasure code shards are distributed across hosts and that the failure of a
single host or other failure domain will not affect availability. For larger
clusters, administrators must carefully consider their choice of failure
domain. Separating replicas across racks, for example, is typical for mid- to
large-sized clusters.


CRUSH Location
==============

The location of an OSD within the CRUSH map's hierarchy is
referred to as a ``CRUSH location``. This location specifier takes the
form of a list of key and value pairs. For
example, if an OSD is in a particular row, rack, chassis and host, and
is part of the 'default' CRUSH root (which is the case for most
clusters), its CRUSH location could be described as::

    root=default row=a rack=a2 chassis=a2a host=a2a1

Note:

#. The order of the keys does not matter.
#. The key name (left of ``=``) must be a valid CRUSH ``type``. By default
   these include ``root``, ``datacenter``, ``room``, ``row``, ``pod``, ``pdu``,
   ``rack``, ``chassis`` and ``host``.
   These defined types suffice for almost all clusters, but can be customized
   by modifying the CRUSH map.
#. Not all keys need to be specified. For example, by default, Ceph
   automatically sets an OSD's location to be
   ``root=default host=HOSTNAME`` (based on the output from ``hostname -s``).

The CRUSH location for an OSD can be defined by adding the ``crush location``
option in ``ceph.conf``. Each time the OSD starts,
it verifies it is in the correct location in the CRUSH map and, if it is not,
it moves itself. To disable this automatic CRUSH map management, add the
following to your configuration file in the ``[osd]`` section::

    osd crush update on start = false

Note that in most cases you will not need to manually configure this.

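For reference, a minimal ``ceph.conf`` sketch of the ``crush location``
option is shown below; the rack and host values are illustrative and should
match your own hierarchy::

    [osd]
    # keep this OSD pinned to the given position in the hierarchy
    crush location = root=default rack=a2 host=a2a1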

Custom location hooks
---------------------

A customized location hook can be used to generate a more complete
CRUSH location on startup. The CRUSH location is based on, in order
of preference:

#. A ``crush location`` option in ``ceph.conf``
#. A default of ``root=default host=HOSTNAME`` where the hostname is
   derived from the ``hostname -s`` command

A script can be written to provide additional
location fields (for example, ``rack`` or ``datacenter``) and the
hook enabled via the config option::

    crush location hook = /path/to/customized-ceph-crush-location

This hook is passed several arguments (below) and should output a single line
to ``stdout`` with the CRUSH location description::

    --cluster CLUSTER --id ID --type TYPE

where the cluster name is typically ``ceph``, the ``id`` is the daemon
identifier (e.g., the OSD number), and the daemon
type is ``osd``, ``mds``, etc.

For example, a simple hook that additionally specifies a rack location
based on a value in the file ``/etc/rack`` might be::

    #!/bin/sh
    echo "host=$(hostname -s) rack=$(cat /etc/rack) root=default"
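
The hook must be executable by the user the OSD runs as. As a rough
illustration (the path is the one configured above and the printed values are
hypothetical), invoking the hook by hand should emit a single location line::

    chmod +x /path/to/customized-ceph-crush-location
    /path/to/customized-ceph-crush-location --cluster ceph --id 0 --type osd
    # prints something like: host=a2a1 rack=a2 root=default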


CRUSH structure
===============

The CRUSH map consists of a hierarchy that describes
the physical topology of the cluster and a set of rules defining
data placement policy. The hierarchy has
devices (OSDs) at the leaves, and internal nodes
corresponding to other physical features or groupings: hosts, racks,
rows, datacenters, and so on. The rules describe how replicas are
placed in terms of that hierarchy (e.g., 'three replicas in different
racks').

Devices
-------

Devices are individual OSDs that store data, usually one for each storage drive.
Devices are identified by an ``id``
(a non-negative integer) and a ``name``, normally ``osd.N`` where ``N`` is the device id.

Since the Luminous release, devices may also have a *device class* assigned (e.g.,
``hdd``, ``ssd``, or ``nvme``), allowing them to be conveniently targeted by
CRUSH rules. This is especially useful when mixing device types within hosts.

Types and Buckets
-----------------

A bucket is the CRUSH term for internal nodes in the hierarchy: hosts,
racks, rows, etc. The CRUSH map defines a series of *types* that are
used to describe these nodes. Default types include:

- ``osd`` (or ``device``)
- ``host``
- ``chassis``
- ``rack``
- ``row``
- ``pdu``
- ``pod``
- ``room``
- ``datacenter``
- ``zone``
- ``region``
- ``root``

Most clusters use only a handful of these types, and others
can be defined as needed.

The hierarchy is built with devices (normally type ``osd``) at the
leaves, interior nodes with non-device types, and a root node of type
``root``. For example,

.. ditaa::

                           +-----------------+
                           |{o}root default  |
                           +--------+--------+
                                    |
                    +---------------+---------------+
                    |                               |
             +------+------+                 +------+------+
             |{o}host foo  |                 |{o}host bar  |
             +------+------+                 +------+------+
                    |                               |
            +-------+-------+               +-------+-------+
            |               |               |               |
      +-----+-----+   +-----+-----+   +-----+-----+   +-----+-----+
      | osd.0     |   | osd.1     |   | osd.2     |   | osd.3     |
      +-----------+   +-----------+   +-----------+   +-----------+

Each node (device or bucket) in the hierarchy has a *weight*
that indicates the relative proportion of the total
data that device or hierarchy subtree should store. Weights are set
at the leaves, indicating the size of the device, and automatically
sum up the tree, such that the weight of the ``root`` node
will be the total of all devices contained beneath it. Normally
weights are in units of terabytes (TB).

You can get a simple view of the CRUSH hierarchy for your cluster,
including weights, with::

    ceph osd tree

Rules
-----

CRUSH Rules define policy about how data is distributed across the devices
in the hierarchy. They define placement and replication strategies or
distribution policies that allow you to specify exactly how CRUSH
places data replicas. For example, you might create a rule selecting
a pair of targets for two-way mirroring, another rule for selecting
three targets in two different data centers for three-way mirroring, and
yet another rule for erasure coding (EC) across six storage devices. For a
detailed discussion of CRUSH rules, refer to `CRUSH - Controlled,
Scalable, Decentralized Placement of Replicated Data`_, and more
specifically to **Section 3.2**.

CRUSH rules can be created via the CLI by
specifying the *pool type* they will be used for (replicated or
erasure coded), the *failure domain*, and optionally a *device class*.
In rare cases rules must be written by manually editing the
CRUSH map.

You can see what rules are defined for your cluster with::

    ceph osd crush rule ls

You can view the contents of the rules with::

    ceph osd crush rule dump

Device classes
--------------

Each device can optionally have a *class* assigned. By
default, OSDs automatically set their class at startup to
``hdd``, ``ssd``, or ``nvme`` based on the type of device they are backed
by.

The device class for one or more OSDs can be explicitly set with::

    ceph osd crush set-device-class <class> <osd-name> [...]

Once a device class is set, it cannot be changed to another class
until the old class is unset with::

    ceph osd crush rm-device-class <osd-name> [...]

This allows administrators to set device classes without the class
being changed on OSD restart or by some other script.

A placement rule that targets a specific device class can be created with::

    ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>

A pool can then be changed to use the new rule with::

    ceph osd pool set <pool-name> crush_rule <rule-name>

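As a worked example (the rule and pool names below are illustrative), you
might tag two OSDs as SSDs, build a replicated rule restricted to the ``ssd``
class with a ``host`` failure domain under the ``default`` root, and point an
existing pool at it::

    # unset any automatically assigned class first with rm-device-class if needed
    ceph osd crush set-device-class ssd osd.2 osd.3
    ceph osd crush rule create-replicated fast-ssd default host ssd
    ceph osd pool set mypool crush_rule fast-ssd
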
Device classes are implemented by creating a "shadow" CRUSH hierarchy
for each device class in use that contains only devices of that class.
CRUSH rules can then distribute data over the shadow hierarchy.
This approach is fully backward compatible with
old Ceph clients. You can view the CRUSH hierarchy with shadow items
with::

    ceph osd crush tree --show-shadow

For older clusters created before Luminous that relied on manually
crafted CRUSH maps to maintain per-device-type hierarchies, there is a
*reclassify* tool available to help transition to device classes
without triggering data movement (see :ref:`crush-reclassify`).


Weight sets
-----------

A *weight set* is an alternative set of weights to use when
calculating data placement. The normal weights associated with each
device in the CRUSH map are set based on the device size and indicate
how much data we *should* be storing where. However, because CRUSH is
a "probabilistic" pseudorandom placement process, there is always some
variation from this ideal distribution, in the same way that rolling a
die sixty times will not result in rolling exactly 10 ones and 10
sixes. Weight sets allow the cluster to perform numerical optimization
based on the specifics of your cluster (hierarchy, pools, etc.) to achieve
a balanced distribution.

There are two types of weight sets supported:

 #. A **compat** weight set is a single alternative set of weights for
    each device and node in the cluster. This is not well-suited for
    correcting all anomalies (for example, placement groups for
    different pools may be different sizes and have different load
    levels, but will be mostly treated the same by the balancer).
    However, compat weight sets have the huge advantage that they are
    *backward compatible* with previous versions of Ceph, which means
    that even though weight sets were first introduced in Luminous
    v12.2.z, older clients (e.g., Firefly) can still connect to the
    cluster when a compat weight set is being used to balance data.
 #. A **per-pool** weight set is more flexible in that it allows
    placement to be optimized for each data pool. Additionally,
    weights can be adjusted for each position of placement, allowing
    the optimizer to correct for a subtle skew of data toward devices
    with small weights relative to their peers (an effect that is
    usually only apparent in very large clusters but which can cause
    balancing problems).

When weight sets are in use, the weights associated with each node in
the hierarchy are visible as a separate column (labeled either
``(compat)`` or the pool name) in the output of the command::

    ceph osd tree

When both *compat* and *per-pool* weight sets are in use, data
placement for a particular pool will use its own per-pool weight set
if present. If not, it will use the compat weight set if present. If
neither is present, it will use the normal CRUSH weights.

Although weight sets can be set up and manipulated by hand, it is
recommended that the ``ceph-mgr`` *balancer* module be enabled to do so
automatically when running Luminous or later releases.


Modifying the CRUSH map
=======================

.. _addosd:

Add/Move an OSD
---------------

.. note:: OSDs are normally automatically added to the CRUSH map when
   the OSD is created. This command is rarely needed.

To add or move an OSD in the CRUSH map of a running cluster::

    ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...]

Where:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``


``weight``

:Description: The CRUSH weight for the OSD, normally its size measured in terabytes (TB).
:Type: Double
:Required: Yes
:Example: ``2.0``


``root``

:Description: The root node of the tree in which the OSD resides (normally ``default``)
:Type: Key/value pair.
:Required: Yes
:Example: ``root=default``


``bucket-type``

:Description: You may specify the OSD's location in the CRUSH hierarchy.
:Type: Key/value pairs.
:Required: No
:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``


The following example adds ``osd.0`` to the hierarchy, or moves the
OSD from a previous location. ::

    ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1


Adjust OSD weight
-----------------

.. note:: Normally OSDs automatically add themselves to the CRUSH map
   with the correct weight when they are created. This command
   is rarely needed.

To adjust an OSD's CRUSH weight in the CRUSH map of a running cluster, execute
the following::

    ceph osd crush reweight {name} {weight}

Where:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``


``weight``

:Description: The CRUSH weight for the OSD.
:Type: Double
:Required: Yes
:Example: ``2.0``

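For example, after replacing a 1 TB device with a 2 TB device you might raise
the OSD's CRUSH weight to match its new capacity (the values shown are
illustrative)::

    ceph osd crush reweight osd.0 2.0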

.. _removeosd:

Remove an OSD
-------------

.. note:: OSDs are normally removed from the CRUSH map as part of the
   ``ceph osd purge`` command. This command is rarely needed.

To remove an OSD from the CRUSH map of a running cluster, execute the
following::

    ceph osd crush remove {name}

Where:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``

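The following example removes ``osd.0`` from the hierarchy::

    ceph osd crush remove osd.0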

Add a Bucket
------------

.. note:: Buckets are implicitly created when an OSD is added
   that specifies a ``{bucket-type}={bucket-name}`` as part of its
   location, if a bucket with that name does not already exist. This
   command is typically used when manually adjusting the structure of the
   hierarchy after OSDs have been created. One use is to move a
   series of hosts underneath a new rack-level bucket; another is to
   add new ``host`` buckets (OSD nodes) to a dummy ``root`` so that they don't
   receive data until you're ready, at which time you would move them to the
   ``default`` or other root as described below.

To add a bucket in the CRUSH map of a running cluster, execute the
``ceph osd crush add-bucket`` command::

    ceph osd crush add-bucket {bucket-name} {bucket-type}

Where:

``bucket-name``

:Description: The full name of the bucket.
:Type: String
:Required: Yes
:Example: ``rack12``


``bucket-type``

:Description: The type of the bucket. The type must already exist in the hierarchy.
:Type: String
:Required: Yes
:Example: ``rack``


The following example adds the ``rack12`` bucket to the hierarchy::

    ceph osd crush add-bucket rack12 rack

Move a Bucket
-------------

To move a bucket to a different location or position in the CRUSH map
hierarchy, execute the following::

    ceph osd crush move {bucket-name} {bucket-type}={bucket-name} [...]

Where:

``bucket-name``

:Description: The name of the bucket to move/reposition.
:Type: String
:Required: Yes
:Example: ``foo-bar-1``

``bucket-type``

:Description: You may specify the bucket's location in the CRUSH hierarchy.
:Type: Key/value pairs.
:Required: No
:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``

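For example, to move the ``rack12`` bucket created above so that it sits
underneath the ``default`` root::

    ceph osd crush move rack12 root=default
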
Remove a Bucket
---------------

To remove a bucket from the CRUSH hierarchy, execute the following::

    ceph osd crush remove {bucket-name}

.. note:: A bucket must be empty before removing it from the CRUSH hierarchy.

Where:

``bucket-name``

:Description: The name of the bucket that you'd like to remove.
:Type: String
:Required: Yes
:Example: ``rack12``

The following example removes the ``rack12`` bucket from the hierarchy::

    ceph osd crush remove rack12

Creating a compat weight set
----------------------------

.. note:: This step is normally done automatically by the ``balancer``
   module when enabled.

To create a *compat* weight set::

    ceph osd crush weight-set create-compat

Weights for the compat weight set can be adjusted with::

    ceph osd crush weight-set reweight-compat {name} {weight}

The compat weight set can be destroyed with::

    ceph osd crush weight-set rm-compat

Creating per-pool weight sets
-----------------------------

To create a weight set for a specific pool::

    ceph osd crush weight-set create {pool-name} {mode}

.. note:: Per-pool weight sets require that all servers and daemons
   run Luminous v12.2.z or later.

Where:

``pool-name``

:Description: The name of a RADOS pool
:Type: String
:Required: Yes
:Example: ``rbd``

``mode``

:Description: Either ``flat`` or ``positional``. A *flat* weight set
              has a single weight for each device or bucket. A
              *positional* weight set has a potentially different
              weight for each position in the resulting placement
              mapping. For example, if a pool has a replica count of
              3, then a positional weight set will have three weights
              for each device and bucket.
:Type: String
:Required: Yes
:Example: ``flat``

To adjust the weight of an item in a weight set::

    ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]}

To list existing weight sets::

    ceph osd crush weight-set ls

To remove a weight set::

    ceph osd crush weight-set rm {pool-name}

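For example, to create a positional weight set for the ``rbd`` pool and then
adjust the three positional weights of ``osd.0`` within it (the weight values
are illustrative)::

    ceph osd crush weight-set create rbd positional
    ceph osd crush weight-set reweight rbd osd.0 0.9 0.9 0.9
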
Creating a rule for a replicated pool
-------------------------------------

For a replicated pool, the primary decision when creating the CRUSH
rule is what the failure domain is going to be. For example, if a
failure domain of ``host`` is selected, then CRUSH will ensure that
each replica of the data is stored on a unique host. If ``rack``
is selected, then each replica will be stored in a different rack.
What failure domain you choose primarily depends on the size and
topology of your cluster.

In most cases the entire cluster hierarchy is nested beneath a root node
named ``default``. If you have customized your hierarchy, you may
want to create a rule nested at some other node in the hierarchy. It
doesn't matter what type is associated with that node (it doesn't have
to be a ``root`` node).

It is also possible to create a rule that restricts data placement to
a specific *class* of device. By default, Ceph OSDs automatically
classify themselves as either ``hdd`` or ``ssd``, depending on the
underlying type of device being used. These classes can also be
customized.

To create a replicated rule::

    ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}]

Where:

``name``

:Description: The name of the rule.
:Type: String
:Required: Yes
:Example: ``rbd-rule``

``root``

:Description: The name of the node under which data should be placed.
:Type: String
:Required: Yes
:Example: ``default``

``failure-domain-type``

:Description: The type of CRUSH nodes across which we should separate replicas.
:Type: String
:Required: Yes
:Example: ``rack``

``class``

:Description: The device class on which data should be placed.
:Type: String
:Required: No
:Example: ``ssd``

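Putting the example values above together, the following creates a rule that
separates replicas across racks under the ``default`` root using only ``ssd``
devices, and assigns it to a pool (substitute your own pool and rule names)::

    ceph osd crush rule create-replicated rbd-rule default rack ssd
    ceph osd pool set rbd crush_rule rbd-rule
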
Creating a rule for an erasure coded pool
-----------------------------------------

For an erasure-coded (EC) pool, the same basic decisions need to be made:
what is the failure domain, which node in the
hierarchy will data be placed under (usually ``default``), and will
placement be restricted to a specific device class. Erasure code
pools are created a bit differently, however, because they need to be
constructed carefully based on the erasure code being used. For this reason,
you must include this information in the *erasure code profile*. A CRUSH
rule will then be created from that either explicitly or automatically when
the profile is used to create a pool.

The erasure code profiles can be listed with::

    ceph osd erasure-code-profile ls

An existing profile can be viewed with::

    ceph osd erasure-code-profile get {profile-name}

Normally profiles should never be modified; instead, a new profile
should be created and used when creating a new pool or creating a new
rule for an existing pool.

An erasure code profile consists of a set of key=value pairs. Most of
these control the behavior of the erasure code that is encoding data
in the pool. Those that begin with ``crush-``, however, affect the
CRUSH rule that is created.

The erasure code profile properties of interest are:

 * **crush-root**: the name of the CRUSH node under which to place data [default: ``default``].
 * **crush-failure-domain**: the CRUSH bucket type across which to distribute erasure-coded shards [default: ``host``].
 * **crush-device-class**: the device class on which to place data [default: none, meaning all devices are used].
 * **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the number of erasure code shards, affecting the resulting CRUSH rule.

Once a profile is defined, you can create a CRUSH rule with::

    ceph osd crush rule create-erasure {name} {profile-name}

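For example (the profile and rule names are illustrative), a 4+2 profile whose
shards are spread across racks on ``hdd`` devices, and a CRUSH rule built from
it, could be created with::

    ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=rack crush-device-class=hdd
    ceph osd crush rule create-erasure ec42-rule ec42
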
.. note:: When creating a new pool, it is not actually necessary to
   explicitly create the rule. If the erasure code profile alone is
   specified and the rule argument is left off then Ceph will create
   the CRUSH rule automatically.

Deleting rules
--------------

Rules that are not in use by pools can be deleted with::

    ceph osd crush rule rm {rule-name}


.. _crush-map-tunables:

Tunables
========

Over time, we have made (and continue to make) improvements to the
CRUSH algorithm used to calculate the placement of data. In order to
support the change in behavior, we have introduced a series of tunable
options that control whether the legacy or improved variation of the
algorithm is used.

In order to use newer tunables, both clients and servers must support
the new version of CRUSH. For this reason, we have created
``profiles`` that are named after the Ceph version in which they were
introduced. For example, the ``firefly`` tunables are first supported
by the Firefly release, and will not work with older (e.g., Dumpling)
clients. Once a given set of tunables are changed from the legacy
default behavior, the ``ceph-mon`` and ``ceph-osd`` daemons will prevent older
clients that do not support the new CRUSH features from connecting to
the cluster.

argonaut (legacy)
-----------------

The legacy CRUSH behavior used by Argonaut and older releases works
fine for most clusters, provided there are not many OSDs that have
been marked out.

bobtail (CRUSH_TUNABLES2)
-------------------------

The ``bobtail`` tunable profile fixes a few key misbehaviors:

 * For hierarchies with a small number of devices in the leaf buckets,
   some PGs map to fewer than the desired number of replicas. This
   commonly happens for hierarchies with "host" nodes with a small
   number (1-3) of OSDs nested beneath each one.

 * For large clusters, a small percentage of PGs map to fewer than
   the desired number of OSDs. This is more prevalent when there are
   multiple hierarchy layers in use (e.g., ``row``, ``rack``, ``host``, ``osd``).

 * When some OSDs are marked out, the data tends to get redistributed
   to nearby OSDs instead of across the entire hierarchy.

The new tunables are:

 * ``choose_local_tries``: Number of local retries. Legacy value is
   2, optimal value is 0.

 * ``choose_local_fallback_tries``: Legacy value is 5, optimal value
   is 0.

 * ``choose_total_tries``: Total number of attempts to choose an item.
   Legacy value was 19; subsequent testing indicates that a value of
   50 is more appropriate for typical clusters. For extremely large
   clusters, a larger value might be necessary.

 * ``chooseleaf_descend_once``: Whether a recursive chooseleaf attempt
   will retry, or only try once and allow the original placement to
   retry. Legacy default is 0, optimal value is 1.

Migration impact:

 * Moving from ``argonaut`` to ``bobtail`` tunables triggers a moderate amount
   of data movement. Use caution on a cluster that is already
   populated with data.

firefly (CRUSH_TUNABLES3)
-------------------------

The ``firefly`` tunable profile fixes a problem
with ``chooseleaf`` CRUSH rule behavior that tends to result in PG
mappings with too few results when too many OSDs have been marked out.

The new tunable is:

 * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will
   start with a non-zero value of ``r``, based on how many attempts the
   parent has already made. Legacy default is ``0``, but with this value
   CRUSH is sometimes unable to find a mapping. The optimal value (in
   terms of computational cost and correctness) is ``1``.

Migration impact:

 * For existing clusters that house lots of data, changing
   from ``0`` to ``1`` will cause a lot of data to move; a value of ``4`` or ``5``
   will allow CRUSH to still find a valid mapping but will cause less data
   to move.

straw_calc_version tunable (introduced with Firefly too)
--------------------------------------------------------

There were some problems with the internal weights calculated and
stored in the CRUSH map for ``straw`` algorithm buckets. Specifically, when
there were items with a CRUSH weight of ``0``, or a mix of different and
unique weights, CRUSH would distribute data incorrectly (i.e.,
not in proportion to the weights).

The new tunable is:

 * ``straw_calc_version``: A value of ``0`` preserves the old, broken
   internal weight calculation; a value of ``1`` fixes the behavior.

Migration impact:

 * Moving to ``straw_calc_version`` ``1`` and then adjusting a straw bucket
   (by adding, removing, or reweighting an item, or by using the
   reweight-all command) can trigger a small to moderate amount of
   data movement *if* the cluster has hit one of the problematic
   conditions.

This tunable option is special because it has no impact on
the required kernel version on the client side.

hammer (CRUSH_V4)
-----------------

The ``hammer`` tunable profile does not affect the
mapping of existing CRUSH maps simply by changing the profile. However:

 * There is a new bucket algorithm (``straw2``) supported. The new
   ``straw2`` bucket algorithm fixes several limitations in the original
   ``straw``. Specifically, the old ``straw`` buckets would
   change some mappings that should not have changed when a weight was
   adjusted, while ``straw2`` achieves the original goal of only
   changing mappings to or from the bucket item whose weight has
   changed.

 * ``straw2`` is the default for any newly created buckets.

Migration impact:

 * Changing a bucket type from ``straw`` to ``straw2`` will result in
   a reasonably small amount of data movement, depending on how much
   the bucket item weights vary from each other. When the weights are
   all the same no data will move, and when item weights vary
   significantly there will be more movement.

jewel (CRUSH_TUNABLES5)
-----------------------

The ``jewel`` tunable profile improves the
overall behavior of CRUSH such that significantly fewer mappings
change when an OSD is marked out of the cluster. This results in
significantly less data movement.

The new tunable is:

 * ``chooseleaf_stable``: Whether a recursive chooseleaf attempt will
   use a better value for an inner loop that greatly reduces the number
   of mapping changes when an OSD is marked out. The legacy value is ``0``,
   while the new value of ``1`` uses the new approach.

Migration impact:

 * Changing this value on an existing cluster will result in a very
   large amount of data movement as almost every PG mapping is likely
   to change.


Which client versions support CRUSH_TUNABLES
--------------------------------------------

 * argonaut series, v0.48.1 or later
 * v0.49 or later
 * Linux kernel version v3.6 or later (for the file system and RBD kernel clients)

Which client versions support CRUSH_TUNABLES2
---------------------------------------------

 * v0.55 or later, including bobtail series (v0.56.x)
 * Linux kernel version v3.9 or later (for the file system and RBD kernel clients)

Which client versions support CRUSH_TUNABLES3
---------------------------------------------

 * v0.78 (firefly) or later
 * Linux kernel version v3.15 or later (for the file system and RBD kernel clients)

Which client versions support CRUSH_V4
--------------------------------------

 * v0.94 (hammer) or later
 * Linux kernel version v4.1 or later (for the file system and RBD kernel clients)

Which client versions support CRUSH_TUNABLES5
---------------------------------------------

 * v10.0.2 (jewel) or later
 * Linux kernel version v4.5 or later (for the file system and RBD kernel clients)

Warning when tunables are non-optimal
-------------------------------------

Starting with version v0.74, Ceph will issue a health warning if the
current CRUSH tunables don't include all the optimal values from the
``default`` profile (see below for the meaning of the ``default`` profile).
To make this warning go away, you have two options:

1. Adjust the tunables on the existing cluster. Note that this will
   result in some data movement (possibly as much as 10%). This is the
   preferred route, but should be taken with care on a production cluster
   where the data movement may affect performance. You can enable optimal
   tunables with::

      ceph osd crush tunables optimal

   If things go poorly (e.g., too much load) and not very much
   progress has been made, or there is a client compatibility problem
   (old kernel CephFS or RBD clients, or pre-Bobtail ``librados``
   clients), you can switch back with::

      ceph osd crush tunables legacy

2. You can make the warning go away without making any changes to CRUSH by
   adding the following option to your ceph.conf ``[mon]`` section::

      mon warn on legacy crush tunables = false

   For the change to take effect, you will need to restart the monitors, or
   apply the option to running monitors with::

      ceph tell mon.\* config set mon_warn_on_legacy_crush_tunables false


A few important points
----------------------

 * Adjusting these values will result in the shift of some PGs between
   storage nodes. If the Ceph cluster is already storing a lot of
   data, be prepared for some fraction of the data to move.
 * The ``ceph-osd`` and ``ceph-mon`` daemons will start requiring the
   feature bits of new connections as soon as they get
   the updated map. However, already-connected clients are
   effectively grandfathered in, and will misbehave if they do not
   support the new feature.
 * If the CRUSH tunables are set to non-legacy values and then later
   changed back to the default values, ``ceph-osd`` daemons will not be
   required to support the feature. However, the OSD peering process
   requires examining and understanding old maps. Therefore, you
   should not run old versions of the ``ceph-osd`` daemon
   if the cluster has previously used non-legacy CRUSH values, even if
   the latest version of the map has been switched back to using the
   legacy defaults.

Tuning CRUSH
------------

The simplest way to adjust CRUSH tunables is by applying them in matched
sets known as *profiles*. As of the Octopus release these are:

 * ``legacy``: the legacy behavior from argonaut and earlier.
 * ``argonaut``: the legacy values supported by the original argonaut release
 * ``bobtail``: the values supported by the bobtail release
 * ``firefly``: the values supported by the firefly release
 * ``hammer``: the values supported by the hammer release
 * ``jewel``: the values supported by the jewel release
 * ``optimal``: the best (i.e., optimal) values of the current version of Ceph
 * ``default``: the default values of a new cluster installed from
   scratch. These values, which depend on the current version of Ceph,
   are hardcoded and are generally a mix of optimal and legacy values.
   These values generally match the ``optimal`` profile of the previous
   LTS release, or the most recent release for which we expect
   most users to have up-to-date clients.

You can apply a profile to a running cluster with the command::

    ceph osd crush tunables {PROFILE}

Note that this may result in data movement, potentially quite a bit. Study
release notes and documentation carefully before changing the profile on a
running cluster, and consider throttling recovery/backfill parameters to
limit the impact of a bolus of backfill.


.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf


Primary Affinity
================

When a Ceph Client reads or writes data, it first contacts the primary OSD in
each affected PG's acting set. By default, the first OSD in the acting set is
the primary. For example, in the acting set ``[2, 3, 4]``, ``osd.2`` is
listed first and thus is the primary (aka lead) OSD. Sometimes we know that an
OSD is less well suited to act as the lead than are other OSDs (e.g., it has
a slow drive or a slow controller). To prevent performance bottlenecks
(especially on read operations) while maximizing utilization of your hardware,
you can influence the selection of primary OSDs by adjusting primary affinity
values, or by crafting a CRUSH rule that selects preferred OSDs first.

Tuning primary OSD selection is mainly useful for replicated pools, because
by default read operations are served from the primary OSD for each PG.
For erasure coded (EC) pools, a way to speed up read operations is to enable
**fast read** as described in :ref:`pool-settings`.

A common scenario for primary affinity is when a cluster contains
a mix of drive sizes, for example older racks with 1.9 TB SATA SSDs and newer racks with
3.84 TB SATA SSDs. On average the latter will be assigned double the number of
PGs and thus will serve double the number of write and read operations, so
they will be busier than the former. A rough assignment of primary affinity
inversely proportional to OSD size won't be 100% optimal, but it can readily
achieve a 15% improvement in overall read throughput by utilizing SATA
interface bandwidth and CPU cycles more evenly.

By default, all Ceph OSDs have primary affinity of ``1``, which indicates that
any OSD may act as a primary with equal probability.

You can reduce a Ceph OSD's primary affinity so that CRUSH is less likely to choose
the OSD as primary in a PG's acting set::

    ceph osd primary-affinity <osd-id> <weight>

You may set an OSD's primary affinity to a real number in the range
``[0-1]``, where ``0`` indicates that the OSD may **NOT** be used as a primary
and ``1`` indicates that an OSD may be used as a primary. When the weight is
between these extremes, it is less likely that
CRUSH will select that OSD as a primary. The process for
selecting the lead OSD is more nuanced than a simple probability based on
relative affinity values, but measurable results can be achieved even with
first-order approximations of desirable values.
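
For example, to reduce the likelihood that ``osd.4`` (an illustrative ID) is
selected as the primary, you might set its primary affinity to ``0.5``::

    ceph osd primary-affinity osd.4 0.5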

Custom CRUSH Rules
------------------

There are occasional clusters that balance cost and performance by mixing SSDs
and HDDs in the same replicated pool. By setting the primary affinity of HDD
OSDs to ``0`` one can direct operations to the SSD in each acting set. An
alternative is to define a CRUSH rule that always selects an SSD OSD as the
first OSD, then selects HDDs for the remaining OSDs. Thus, each PG's acting
set will contain exactly one SSD OSD as the primary with the balance on HDDs.

For example, the CRUSH rule below::

    rule mixed_replicated_rule {
            id 11
            type replicated
            min_size 1
            max_size 10
            step take default class ssd
            step chooseleaf firstn 1 type host
            step emit
            step take default class hdd
            step chooseleaf firstn 0 type host
            step emit
    }

chooses an SSD as the first OSD. Note that for an ``N``-times replicated pool
this rule selects ``N+1`` OSDs to guarantee that ``N`` copies are on different
hosts, because the first SSD OSD might be co-located with any of the ``N`` HDD
OSDs.

This extra storage requirement can be avoided by placing SSDs and HDDs in
different hosts with the tradeoff that hosts with SSDs will receive all client
requests. You may thus consider faster CPU(s) for SSD hosts and more modest
ones for HDD nodes, since the latter will normally only service recovery
operations. Here the CRUSH roots ``ssd_hosts`` and ``hdd_hosts`` strictly
must not contain the same servers::

    rule mixed_replicated_rule_two {
            id 1
            type replicated
            min_size 1
            max_size 10
            step take ssd_hosts class ssd
            step chooseleaf firstn 1 type host
            step emit
            step take hdd_hosts class hdd
            step chooseleaf firstn -1 type host
            step emit
    }

Note also that on failure of an SSD, requests to a PG will be served temporarily
from a (slower) HDD OSD until the PG's data has been replicated onto the replacement
primary SSD OSD.

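Hand-written rules like the two above are added by editing a decompiled copy
of the CRUSH map and injecting it back, after which the rule is assigned to a
pool. A rough sketch of that workflow (file names are arbitrary and the pool
name is illustrative)::

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # edit crushmap.txt to add the rule, then recompile and inject it
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new
    ceph osd pool set mypool crush_rule mixed_replicated_rule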