5 The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm
6 determines how to store and retrieve data by computing storage locations.
7 CRUSH empowers Ceph clients to communicate with OSDs directly rather than
8 through a centralized server or broker. With an algorithmically determined
9 method of storing and retrieving data, Ceph avoids a single point of failure, a
10 performance bottleneck, and a physical limit to its scalability.
12 CRUSH uses a map of your cluster (the CRUSH map) to pseudo-randomly
13 map data to OSDs, distributing it across the cluster according to configured
replication policy and failure domain. For a detailed discussion of CRUSH, see
`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_.
17 CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)`, a hierarchy
18 of 'buckets' for aggregating devices and buckets, and
19 rules that govern how CRUSH replicates data within the cluster's pools. By
20 reflecting the underlying physical organization of the installation, CRUSH can
21 model (and thereby address) the potential for correlated device failures.
22 Typical factors include chassis, racks, physical proximity, a shared power
source, and shared networking. By encoding this information into the cluster
map, CRUSH placement policies distribute object replicas across failure domains while
26 maintaining the desired distribution. For example, to address the
27 possibility of concurrent failures, it may be desirable to ensure that data
28 replicas are on devices using different shelves, racks, power supplies,
29 controllers, and/or physical locations.
31 When you deploy OSDs they are automatically added to the CRUSH map under a
32 ``host`` bucket named for the node on which they run. This,
33 combined with the configured CRUSH failure domain, ensures that replicas or
34 erasure code shards are distributed across hosts and that a single host or other
35 failure will not affect availability. For larger clusters, administrators must
36 carefully consider their choice of failure domain. Separating replicas across racks,
37 for example, is typical for mid- to large-sized clusters.
43 The location of an OSD within the CRUSH map's hierarchy is
44 referred to as a ``CRUSH location``. This location specifier takes the
45 form of a list of key and value pairs. For
46 example, if an OSD is in a particular row, rack, chassis and host, and
47 is part of the 'default' CRUSH root (which is the case for most
48 clusters), its CRUSH location could be described as::
50 root=default row=a rack=a2 chassis=a2a host=a2a1
54 #. Note that the order of the keys does not matter.
55 #. The key name (left of ``=``) must be a valid CRUSH ``type``. By default
56 these include ``root``, ``datacenter``, ``room``, ``row``, ``pod``, ``pdu``,
57 ``rack``, ``chassis`` and ``host``.
58 These defined types suffice for almost all clusters, but can be customized
59 by modifying the CRUSH map.
60 #. Not all keys need to be specified. For example, by default, Ceph
61 automatically sets an ``OSD``'s location to be
62 ``root=default host=HOSTNAME`` (based on the output from ``hostname -s``).
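
The key/value form described above is easy to check mechanically. The following is a minimal sketch, not part of Ceph: ``parse_crush_location`` and ``DEFAULT_TYPES`` are hypothetical names, and the set of valid types is assumed to be the default list given above (a customized CRUSH map may define others).

```python
# Default CRUSH types listed in the text above; a customized map may define more.
DEFAULT_TYPES = frozenset(
    ["root", "datacenter", "room", "row", "pod", "pdu", "rack", "chassis", "host"]
)


def parse_crush_location(spec, valid_types=DEFAULT_TYPES):
    """Parse a 'key=value key=value ...' string into a dict (hypothetical helper)."""
    location = {}
    for pair in spec.split():
        key, sep, value = pair.partition("=")
        if not (key and sep and value):
            raise ValueError(f"malformed pair: {pair!r}")
        if key not in valid_types:
            raise ValueError(f"{key!r} is not a known CRUSH type")
        if key in location:
            raise ValueError(f"duplicate key: {key!r}")
        location[key] = value
    return location


full = parse_crush_location("root=default row=a rack=a2 chassis=a2a host=a2a1")
# Key order does not matter: a reordered specifier parses to the same mapping.
reordered = parse_crush_location("host=a2a1 chassis=a2a rack=a2 row=a root=default")
print(full == reordered)
```

Because the result is a mapping rather than a sequence, the sketch reflects the rule that key order is irrelevant, and a partial specifier such as ``root=default host=HOSTNAME`` parses the same way.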
64 The CRUSH location for an OSD can be defined by adding the ``crush location``
65 option in ``ceph.conf``. Each time the OSD starts,
66 it verifies it is in the correct location in the CRUSH map and, if it is not,
67 it moves itself. To disable this automatic CRUSH map management, add the
68 following to your configuration file in the ``[osd]`` section::
70 osd crush update on start = false
72 Note that in most cases you will not need to manually configure this.
78 A customized location hook can be used to generate a more complete
CRUSH location on startup. The CRUSH location is based on, in order
of preference:

#. A ``crush location`` option in ``ceph.conf``
83 #. A default of ``root=default host=HOSTNAME`` where the hostname is
84 derived from the ``hostname -s`` command
86 A script can be written to provide additional
87 location fields (for example, ``rack`` or ``datacenter``) and the
88 hook enabled via the config option::
90 crush location hook = /path/to/customized-ceph-crush-location
This hook is passed several arguments (below) and should output a single line
to ``stdout`` with the CRUSH location description::
95 --cluster CLUSTER --id ID --type TYPE
where the cluster name is typically ``ceph``, the ``id`` is the daemon
identifier (e.g., the OSD number), and the daemon
type is ``osd``, ``mds``, etc.
101 For example, a simple hook that additionally specifies a rack location
102 based on a value in the file ``/etc/rack`` might be::
105 echo "host=$(hostname -s) rack=$(cat /etc/rack) root=default"
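
The same hook could also be written in Python. The sketch below is an illustration under stated assumptions, not a script that ships with Ceph: the function names are hypothetical, ``/etc/rack`` is the example file from the text, and the ``rack`` field is simply omitted when that file is missing.

```python
import argparse
import socket
from pathlib import Path


def crush_location(hostname, rack_file="/etc/rack"):
    """Build the single-line CRUSH location; omit rack if the file is absent."""
    fields = [f"host={hostname}"]
    path = Path(rack_file)
    if path.is_file():
        rack = path.read_text().strip()
        if rack:
            fields.append(f"rack={rack}")
    fields.append("root=default")
    return " ".join(fields)


def main(argv):
    parser = argparse.ArgumentParser(description="CRUSH location hook sketch")
    # Arguments the hook receives according to the text; accepted but unused here.
    parser.add_argument("--cluster", default="ceph")
    parser.add_argument("--id")
    parser.add_argument("--type")
    parser.parse_args(argv)
    short_name = socket.gethostname().split(".")[0]  # same idea as `hostname -s`
    return crush_location(short_name)
```

A real hook script would end by printing the result, for example ``print(main(sys.argv[1:]))`` after importing ``sys``, so that exactly one line reaches ``stdout``.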
111 The CRUSH map consists of a hierarchy that describes
112 the physical topology of the cluster and a set of rules defining
113 data placement policy. The hierarchy has
114 devices (OSDs) at the leaves, and internal nodes
115 corresponding to other physical features or groupings: hosts, racks,
116 rows, datacenters, and so on. The rules describe how replicas are
placed in terms of that hierarchy (e.g., 'three replicas in different
racks').
123 Devices are individual OSDs that store data, usually one for each storage drive.
124 Devices are identified by an ``id``
125 (a non-negative integer) and a ``name``, normally ``osd.N`` where ``N`` is the device id.
127 Since the Luminous release, devices may also have a *device class* assigned (e.g.,
128 ``hdd`` or ``ssd`` or ``nvme``), allowing them to be conveniently targeted by
129 CRUSH rules. This is especially useful when mixing device types within hosts.
131 .. _crush_map_default_types:
136 A bucket is the CRUSH term for internal nodes in the hierarchy: hosts,
137 racks, rows, etc. The CRUSH map defines a series of *types* that are
138 used to describe these nodes. Default types include:
140 - ``osd`` (or ``device``)
153 Most clusters use only a handful of these types, and others
154 can be defined as needed.
156 The hierarchy is built with devices (normally type ``osd``) at the
157 leaves, interior nodes with non-device types, and a root node of type
``root``. For example, a root with two hosts that each hold two OSDs::

    root
    |-- host foo
    |   |-- osd.0
    |   `-- osd.1
    `-- host bar
        |-- osd.2
        `-- osd.3

178 Each node (device or bucket) in the hierarchy has a *weight*
179 that indicates the relative proportion of the total
180 data that device or hierarchy subtree should store. Weights are set
181 at the leaves, indicating the size of the device, and automatically
182 sum up the tree, such that the weight of the ``root`` node
183 will be the total of all devices contained beneath it. Normally
184 weights are in units of terabytes (TB).
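
The way weights sum up the tree can be sketched in a few lines. This is an illustrative model only: the nested-dict layout, the helper name ``subtree_weight``, and the example weights are assumptions made for the sketch, not a Ceph data format.

```python
def subtree_weight(node):
    """Weight of a node: its own weight for a device, else the sum of its children."""
    if "children" not in node:
        return node["weight"]
    return sum(subtree_weight(child) for child in node["children"])


# Hypothetical cluster: two hosts with two OSDs each, weights in TB.
tree = {"name": "default", "children": [
    {"name": "foo", "children": [
        {"name": "osd.0", "weight": 4.0},
        {"name": "osd.1", "weight": 4.0},
    ]},
    {"name": "bar", "children": [
        {"name": "osd.2", "weight": 8.0},
        {"name": "osd.3", "weight": 2.0},
    ]},
]}

root_weight = subtree_weight(tree)
print(root_weight)  # 18.0
```

In this example host ``bar`` has weight 10.0 out of a root weight of 18.0, so it should hold roughly 10/18 of the data placed under the root.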
You can get a simple view of the CRUSH hierarchy for your cluster,
including weights, with::

    ceph osd tree

194 CRUSH Rules define policy about how data is distributed across the devices
195 in the hierarchy. They define placement and replication strategies or
196 distribution policies that allow you to specify exactly how CRUSH
197 places data replicas. For example, you might create a rule selecting
198 a pair of targets for two-way mirroring, another rule for selecting
199 three targets in two different data centers for three-way mirroring, and
200 yet another rule for erasure coding (EC) across six storage devices. For a
201 detailed discussion of CRUSH rules, refer to `CRUSH - Controlled,
202 Scalable, Decentralized Placement of Replicated Data`_, and more
203 specifically to **Section 3.2**.
205 CRUSH rules can be created via the CLI by
206 specifying the *pool type* they will be used for (replicated or
207 erasure coded), the *failure domain*, and optionally a *device class*.
In rare cases rules must be written by hand by manually editing the
CRUSH map.
211 You can see what rules are defined for your cluster with::
213 ceph osd crush rule ls
215 You can view the contents of the rules with::
217 ceph osd crush rule dump
Each device can optionally have a *class* assigned. By
default, OSDs automatically set their class at startup to
``hdd``, ``ssd``, or ``nvme`` based on the type of device they are backed
by.
227 The device class for one or more OSDs can be explicitly set with::
229 ceph osd crush set-device-class <class> <osd-name> [...]
231 Once a device class is set, it cannot be changed to another class
232 until the old class is unset with::
234 ceph osd crush rm-device-class <osd-name> [...]
236 This allows administrators to set device classes without the class
237 being changed on OSD restart or by some other script.
239 A placement rule that targets a specific device class can be created with::
241 ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
243 A pool can then be changed to use the new rule with::
245 ceph osd pool set <pool-name> crush_rule <rule-name>
247 Device classes are implemented by creating a "shadow" CRUSH hierarchy
248 for each device class in use that contains only devices of that class.
249 CRUSH rules can then distribute data over the shadow hierarchy.
This approach is fully backward compatible with
old Ceph clients. You can view the CRUSH hierarchy with shadow items
with::

254 ceph osd crush tree --show-shadow
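
The idea of a per-class shadow hierarchy can be illustrated with a small filter over a tree. This is a conceptual sketch, not Ceph's implementation: the nested-dict layout and the helper ``shadow_tree`` are assumptions, and the ``name~class`` labels are modeled on how shadow items appear in the command output above.

```python
def shadow_tree(node, device_class):
    """Copy of the subtree holding only devices of device_class, or None if empty."""
    if "children" not in node:  # leaf device
        return dict(node) if node.get("class") == device_class else None
    kept = [
        sub
        for sub in (shadow_tree(child, device_class) for child in node["children"])
        if sub is not None
    ]
    if not kept:
        return None  # a bucket with no matching devices drops out of the shadow tree
    return {"name": f"{node['name']}~{device_class}", "children": kept}


# Hypothetical mixed cluster: one SSD on host foo, HDDs everywhere else.
hierarchy = {"name": "default", "children": [
    {"name": "foo", "children": [
        {"name": "osd.0", "class": "hdd"},
        {"name": "osd.1", "class": "ssd"},
    ]},
    {"name": "bar", "children": [
        {"name": "osd.2", "class": "hdd"},
        {"name": "osd.3", "class": "hdd"},
    ]},
]}

ssd_shadow = shadow_tree(hierarchy, "ssd")
print(ssd_shadow["name"], [h["name"] for h in ssd_shadow["children"]])
```

A rule restricted to ``ssd`` would then select only from the filtered tree, which here contains host ``foo`` with ``osd.1`` and nothing from host ``bar``.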
256 For older clusters created before Luminous that relied on manually
257 crafted CRUSH maps to maintain per-device-type hierarchies, there is a
258 *reclassify* tool available to help transition to device classes
259 without triggering data movement (see :ref:`crush-reclassify`).
265 A *weight set* is an alternative set of weights to use when
266 calculating data placement. The normal weights associated with each
267 device in the CRUSH map are set based on the device size and indicate
268 how much data we *should* be storing where. However, because CRUSH is
269 a "probabilistic" pseudorandom placement process, there is always some
variation from this ideal distribution, in the same way that rolling a
die sixty times will rarely result in rolling exactly 10 ones and 10
sixes. Weight sets allow the cluster to perform numerical optimization
273 based on the specifics of your cluster (hierarchy, pools, etc.) to achieve
274 a balanced distribution.
276 There are two types of weight sets supported:
278 #. A **compat** weight set is a single alternative set of weights for
279 each device and node in the cluster. This is not well-suited for
280 correcting for all anomalies (for example, placement groups for
281 different pools may be different sizes and have different load
282 levels, but will be mostly treated the same by the balancer).
283 However, compat weight sets have the huge advantage that they are
284 *backward compatible* with previous versions of Ceph, which means
285 that even though weight sets were first introduced in Luminous
286 v12.2.z, older clients (e.g., firefly) can still connect to the
287 cluster when a compat weight set is being used to balance data.
288 #. A **per-pool** weight set is more flexible in that it allows
289 placement to be optimized for each data pool. Additionally,
290 weights can be adjusted for each position of placement, allowing
291 the optimizer to correct for a subtle skew of data toward devices
   with small weights relative to their peers (an effect that is
   usually apparent only in very large clusters but which can cause
   balancing problems).
When weight sets are in use, the weights associated with each node in
the hierarchy are visible as a separate column (labeled either
``(compat)`` or the pool name) in the output of the command::

    ceph osd tree

302 When both *compat* and *per-pool* weight sets are in use, data
303 placement for a particular pool will use its own per-pool weight set
if present. If not, it will use the compat weight set if present. If
neither is present, it will use the normal CRUSH weights.
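
The precedence just described amounts to a three-step lookup. The sketch below models it with plain dictionaries; the function name ``effective_weight`` and the data layout are assumptions made for illustration and do not correspond to a Ceph API.

```python
def effective_weight(item, pool, crush_weights, compat=None, per_pool=None):
    """Weight used for placement: per-pool set, else compat set, else CRUSH weight."""
    per_pool = per_pool or {}
    if pool in per_pool:
        return per_pool[pool][item]
    if compat is not None:
        return compat[item]
    return crush_weights[item]


crush_weights = {"osd.0": 4.0, "osd.1": 4.0}
compat = {"osd.0": 3.9, "osd.1": 4.1}          # hypothetical balancer output
per_pool = {"rbd": {"osd.0": 3.8, "osd.1": 4.2}}  # hypothetical per-pool set

print(effective_weight("osd.0", "rbd", crush_weights, compat, per_pool))     # 3.8
print(effective_weight("osd.0", "cephfs", crush_weights, compat, per_pool))  # 3.9
print(effective_weight("osd.0", "cephfs", crush_weights))                    # 4.0
```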
307 Although weight sets can be set up and manipulated by hand, it is
308 recommended that the ``ceph-mgr`` *balancer* module be enabled to do so
309 automatically when running Luminous or later releases.
312 Modifying the CRUSH map
313 =======================
.. note:: OSDs are normally automatically added to the CRUSH map when
   the OSD is created. This command is rarely needed.
323 To add or move an OSD in the CRUSH map of a running cluster::
325 ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...]
331 :Description: The full name of the OSD.
:Description: The CRUSH weight for the OSD, normally its size measured in terabytes (TB).
347 :Description: The root node of the tree in which the OSD resides (normally ``default``)
348 :Type: Key/value pair.
350 :Example: ``root=default``
355 :Description: You may specify the OSD's location in the CRUSH hierarchy.
356 :Type: Key/value pairs.
358 :Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``
361 The following example adds ``osd.0`` to the hierarchy, or moves the
362 OSD from a previous location. ::
364 ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1
.. note:: Normally OSDs automatically add themselves to the CRUSH map
   with the correct weight when they are created. This command
   is rarely needed.
To adjust an OSD's CRUSH weight in the CRUSH map of a running cluster, execute
the following::

377 ceph osd crush reweight {name} {weight}
383 :Description: The full name of the OSD.
391 :Description: The CRUSH weight for the OSD.
.. note:: OSDs are normally removed from the CRUSH map as part of the
   ``ceph osd purge`` command. This command is rarely needed.
To remove an OSD from the CRUSH map of a running cluster, execute the
following::

408 ceph osd crush remove {name}
414 :Description: The full name of the OSD.
.. note:: Buckets are implicitly created when an OSD is added
   that specifies a ``{bucket-type}={bucket-name}`` as part of its
   location, if a bucket with that name does not already exist. This
   command is typically used when manually adjusting the structure of the
   hierarchy after OSDs have been created. One use is to move a
   series of hosts underneath a new rack-level bucket; another is to
   add new ``host`` buckets (OSD nodes) to a dummy ``root`` so that they don't
   receive data until you're ready, at which time you would move them to the
   ``default`` or other root as described below.
433 To add a bucket in the CRUSH map of a running cluster, execute the
434 ``ceph osd crush add-bucket`` command::
436 ceph osd crush add-bucket {bucket-name} {bucket-type}
442 :Description: The full name of the bucket.
450 :Description: The type of the bucket. The type must already exist in the hierarchy.
456 The following example adds the ``rack12`` bucket to the hierarchy::
458 ceph osd crush add-bucket rack12 rack
463 To move a bucket to a different location or position in the CRUSH map
464 hierarchy, execute the following::
    ceph osd crush move {bucket-name} {bucket-type}={bucket-name} [...]
472 :Description: The name of the bucket to move/reposition.
475 :Example: ``foo-bar-1``
479 :Description: You may specify the bucket's location in the CRUSH hierarchy.
480 :Type: Key/value pairs.
482 :Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``
487 To remove a bucket from the CRUSH hierarchy, execute the following::
489 ceph osd crush remove {bucket-name}
491 .. note:: A bucket must be empty before removing it from the CRUSH hierarchy.
497 :Description: The name of the bucket that you'd like to remove.
502 The following example removes the ``rack12`` bucket from the hierarchy::
504 ceph osd crush remove rack12
506 Creating a compat weight set
507 ----------------------------
.. note:: This step is normally done automatically by the ``balancer``
   module when enabled.
512 To create a *compat* weight set::
514 ceph osd crush weight-set create-compat
516 Weights for the compat weight set can be adjusted with::
518 ceph osd crush weight-set reweight-compat {name} {weight}
520 The compat weight set can be destroyed with::
522 ceph osd crush weight-set rm-compat
524 Creating per-pool weight sets
525 -----------------------------
To create a weight set for a specific pool::
529 ceph osd crush weight-set create {pool-name} {mode}
531 .. note:: Per-pool weight sets require that all servers and daemons
532 run Luminous v12.2.z or later.
538 :Description: The name of a RADOS pool
545 :Description: Either ``flat`` or ``positional``. A *flat* weight set
546 has a single weight for each device or bucket. A
547 *positional* weight set has a potentially different
548 weight for each position in the resulting placement
549 mapping. For example, if a pool has a replica count of
550 3, then a positional weight set will have three weights
551 for each device and bucket.
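
The difference between the two modes can be pictured as the number of weights stored per item. The following sketch is only a model of that shape: ``new_weight_set`` is a hypothetical helper, and seeding every position with the item's CRUSH weight is an assumption made for illustration.

```python
def new_weight_set(crush_weights, mode, replica_count=1):
    """Model a weight set: one weight per item (flat) or one per position (positional)."""
    if mode == "flat":
        positions = 1
    elif mode == "positional":
        positions = replica_count
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # Assumption: every position starts from the item's normal CRUSH weight.
    return {item: [weight] * positions for item, weight in crush_weights.items()}


crush_weights = {"osd.0": 4.0, "osd.1": 8.0, "host-foo": 12.0}
flat = new_weight_set(crush_weights, "flat")
positional = new_weight_set(crush_weights, "positional", replica_count=3)
print(flat["osd.1"])        # [8.0]
print(positional["osd.1"])  # [8.0, 8.0, 8.0]
```

With a replica count of 3, the positional set holds three independently adjustable weights for each device and bucket, matching the description above.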
556 To adjust the weight of an item in a weight set::
558 ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]}
To list existing weight sets::
562 ceph osd crush weight-set ls
To remove a weight set::
566 ceph osd crush weight-set rm {pool-name}
568 Creating a rule for a replicated pool
569 -------------------------------------
571 For a replicated pool, the primary decision when creating the CRUSH
572 rule is what the failure domain is going to be. For example, if a
573 failure domain of ``host`` is selected, then CRUSH will ensure that
574 each replica of the data is stored on a unique host. If ``rack``
575 is selected, then each replica will be stored in a different rack.
576 What failure domain you choose primarily depends on the size and
577 topology of your cluster.
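
What a failure domain guarantees can be stated as a simple check over a set of chosen OSDs. The sketch below is illustrative: the ``locations`` mapping and the helper ``separated`` are hypothetical, and CRUSH itself enforces this property during placement rather than verifying it afterwards.

```python
def separated(osds, locations, failure_domain):
    """True if every OSD in osds sits under a distinct bucket of type failure_domain."""
    buckets = [locations[osd][failure_domain] for osd in osds]
    return len(set(buckets)) == len(buckets)


# Hypothetical CRUSH locations for four OSDs.
locations = {
    "osd.0": {"host": "foo", "rack": "r1"},
    "osd.1": {"host": "foo", "rack": "r1"},
    "osd.2": {"host": "bar", "rack": "r1"},
    "osd.3": {"host": "baz", "rack": "r2"},
}

replicas = ["osd.0", "osd.2", "osd.3"]
print(separated(replicas, locations, "host"))  # True: three different hosts
print(separated(replicas, locations, "rack"))  # False: two replicas share rack r1
```

The example shows why the choice depends on topology: the same three OSDs satisfy a ``host`` failure domain but not a ``rack`` one, because this small cluster has only two racks.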
579 In most cases the entire cluster hierarchy is nested beneath a root node
580 named ``default``. If you have customized your hierarchy, you may
581 want to create a rule nested at some other node in the hierarchy. It
582 doesn't matter what type is associated with that node (it doesn't have
583 to be a ``root`` node).
It is also possible to create a rule that restricts data placement to
a specific *class* of device. By default, Ceph OSDs automatically
classify themselves as ``hdd``, ``ssd``, or ``nvme``, depending on the
underlying type of device being used. These classes can also be
customized.
To create a replicated rule::
593 ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}]
599 :Description: The name of the rule
602 :Example: ``rbd-rule``
606 :Description: The name of the node under which data should be placed.
609 :Example: ``default``
611 ``failure-domain-type``
613 :Description: The type of CRUSH nodes across which we should separate replicas.
620 :Description: The device class on which data should be placed.
625 Creating a rule for an erasure coded pool
626 -----------------------------------------
628 For an erasure-coded (EC) pool, the same basic decisions need to be made:
629 what is the failure domain, which node in the
630 hierarchy will data be placed under (usually ``default``), and will
631 placement be restricted to a specific device class. Erasure code
632 pools are created a bit differently, however, because they need to be
633 constructed carefully based on the erasure code being used. For this reason,
634 you must include this information in the *erasure code profile*. A CRUSH
635 rule will then be created from that either explicitly or automatically when
636 the profile is used to create a pool.
638 The erasure code profiles can be listed with::
640 ceph osd erasure-code-profile ls
642 An existing profile can be viewed with::
644 ceph osd erasure-code-profile get {profile-name}
646 Normally profiles should never be modified; instead, a new profile
647 should be created and used when creating a new pool or creating a new
648 rule for an existing pool.
650 An erasure code profile consists of a set of key=value pairs. Most of
651 these control the behavior of the erasure code that is encoding data
652 in the pool. Those that begin with ``crush-``, however, affect the
653 CRUSH rule that is created.
655 The erasure code profile properties of interest are:
657 * **crush-root**: the name of the CRUSH node under which to place data [default: ``default``].
658 * **crush-failure-domain**: the CRUSH bucket type across which to distribute erasure-coded shards [default: ``host``].
659 * **crush-device-class**: the device class on which to place data [default: none, meaning all devices are used].
660 * **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the number of erasure code shards, affecting the resulting CRUSH rule.
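
Since each object in an erasure-coded pool is split into ``k`` data shards plus ``m`` coding shards, a quick sanity check is whether the cluster has at least ``k + m`` buckets of the chosen failure domain. The sketch below is a simplified illustration: the helper names and the ``bucket_counts`` mapping are hypothetical, and plugins such as ``lrc`` that add further shards are deliberately ignored.

```python
def shards_needed(profile):
    """Number of shards per object for a plain k+m profile (lrc is not handled)."""
    return int(profile["k"]) + int(profile["m"])


def enough_failure_domains(profile, bucket_counts):
    """True if there is at least one failure-domain bucket per shard."""
    domain = profile.get("crush-failure-domain", "host")  # default from the list above
    return bucket_counts.get(domain, 0) >= shards_needed(profile)


# Profile values are strings, as in the key=value pairs of a profile.
profile = {"k": "4", "m": "2", "crush-failure-domain": "host"}

print(shards_needed(profile))                         # 6
print(enough_failure_domains(profile, {"host": 5}))   # False
print(enough_failure_domains(profile, {"host": 6}))   # True
```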
662 Once a profile is defined, you can create a CRUSH rule with::
664 ceph osd crush rule create-erasure {name} {profile-name}
.. note:: When creating a new pool, it is not actually necessary to
   explicitly create the rule. If the erasure code profile alone is
   specified and the rule argument is left off then Ceph will create
   the CRUSH rule automatically.
674 Rules that are not in use by pools can be deleted with::
676 ceph osd crush rule rm {rule-name}
679 .. _crush-map-tunables:
684 Over time, we have made (and continue to make) improvements to the
685 CRUSH algorithm used to calculate the placement of data. In order to
686 support the change in behavior, we have introduced a series of tunable
options that control whether the legacy or improved variation of the
algorithm is used.
690 In order to use newer tunables, both clients and servers must support
691 the new version of CRUSH. For this reason, we have created
692 ``profiles`` that are named after the Ceph version in which they were
693 introduced. For example, the ``firefly`` tunables are first supported
694 by the Firefly release, and will not work with older (e.g., Dumpling)
clients. Once a given set of tunables is changed from the legacy
default behavior, the ``ceph-mon`` and ``ceph-osd`` daemons will prevent older
clients that do not support the new CRUSH features from connecting to
the cluster.
703 The legacy CRUSH behavior used by Argonaut and older releases works
fine for most clusters, provided there are not many OSDs that have
been marked out.
707 bobtail (CRUSH_TUNABLES2)
708 -------------------------
710 The ``bobtail`` tunable profile fixes a few key misbehaviors:
712 * For hierarchies with a small number of devices in the leaf buckets,
713 some PGs map to fewer than the desired number of replicas. This
714 commonly happens for hierarchies with "host" nodes with a small
715 number (1-3) of OSDs nested beneath each one.
717 * For large clusters, some small percentages of PGs map to fewer than
718 the desired number of OSDs. This is more prevalent when there are
719 multiple hierarchy layers in use (e.g., ``row``, ``rack``, ``host``, ``osd``).
721 * When some OSDs are marked out, the data tends to get redistributed
722 to nearby OSDs instead of across the entire hierarchy.
724 The new tunables are:
726 * ``choose_local_tries``: Number of local retries. Legacy value is
727 2, optimal value is 0.
* ``choose_local_fallback_tries``: Legacy value is 5, optimal value
  is 0.
732 * ``choose_total_tries``: Total number of attempts to choose an item.
733 Legacy value was 19, subsequent testing indicates that a value of
734 50 is more appropriate for typical clusters. For extremely large
735 clusters, a larger value might be necessary.
737 * ``chooseleaf_descend_once``: Whether a recursive chooseleaf attempt
738 will retry, or only try once and allow the original placement to
739 retry. Legacy default is 0, optimal value is 1.
* Moving from ``argonaut`` to ``bobtail`` tunables triggers a moderate amount
  of data movement. Use caution on a cluster that is already
  populated with data.
747 firefly (CRUSH_TUNABLES3)
748 -------------------------
750 The ``firefly`` tunable profile fixes a problem
751 with ``chooseleaf`` CRUSH rule behavior that tends to result in PG
752 mappings with too few results when too many OSDs have been marked out.
756 * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will
757 start with a non-zero value of ``r``, based on how many attempts the
758 parent has already made. Legacy default is ``0``, but with this value
759 CRUSH is sometimes unable to find a mapping. The optimal value (in
760 terms of computational cost and correctness) is ``1``.
* For existing clusters that house lots of data, changing
  from ``0`` to ``1`` will cause a lot of data to move; a value of ``4`` or ``5``
  will allow CRUSH to still find a valid mapping but will cause less data
  to move.
769 straw_calc_version tunable (introduced with Firefly too)
770 --------------------------------------------------------
772 There were some problems with the internal weights calculated and
773 stored in the CRUSH map for ``straw`` algorithm buckets. Specifically, when
there were items with a CRUSH weight of ``0``, or a mix of different and
775 unique weights, CRUSH would distribute data incorrectly (i.e.,
776 not in proportion to the weights).
780 * ``straw_calc_version``: A value of ``0`` preserves the old, broken
781 internal weight calculation; a value of ``1`` fixes the behavior.
* Moving to ``straw_calc_version`` ``1`` and then adjusting a straw bucket
  (by adding, removing, or reweighting an item, or by using the
  reweight-all command) can trigger a small to moderate amount of
  data movement *if* the cluster has hit one of the problematic
  conditions.
This tunable option is special because it has no impact
on the kernel version required on the client side.
797 The ``hammer`` tunable profile does not affect the
798 mapping of existing CRUSH maps simply by changing the profile. However:
800 * There is a new bucket algorithm (``straw2``) supported. The new
801 ``straw2`` bucket algorithm fixes several limitations in the original
  ``straw``. Specifically, the old ``straw`` buckets would
  change some mappings that should not have changed when a weight was
  adjusted, while ``straw2`` achieves the original goal of only
  changing mappings to or from the bucket item whose weight has
  changed.
808 * ``straw2`` is the default for any newly created buckets.
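
The property claimed for ``straw2`` can be demonstrated with a simplified model of its selection step: each item draws ``ln(u) / weight`` from a per-item hash and the largest draw wins. The sketch below is an approximation under stated assumptions, not Ceph's code: it uses SHA-256 and floating-point math in place of CRUSH's own hash and fixed-point arithmetic, and ``straw2_choose`` is a hypothetical name.

```python
import hashlib
import math


def straw2_choose(key, weights):
    """Pick the item with the largest draw ln(u) / weight (simplified straw2)."""
    best_item, best_draw = None, -math.inf
    for item, weight in sorted(weights.items()):
        if weight <= 0:
            continue  # zero-weight items never win
        digest = hashlib.sha256(f"{key}:{item}".encode()).digest()
        # Map the hash to u in (0, 1]; ln(u) <= 0, so heavier items draw closer to 0.
        u = (int.from_bytes(digest[:8], "big") + 1) / 2**64
        draw = math.log(u) / weight
        if draw > best_draw:
            best_item, best_draw = item, draw
    return best_item


before = {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 1.0}
after = dict(before, **{"osd.2": 2.0})  # only osd.2 is reweighted

moved = [
    (straw2_choose(k, before), straw2_choose(k, after))
    for k in range(2000)
    if straw2_choose(k, before) != straw2_choose(k, after)
]
# Every mapping that changed now points at the reweighted item.
print(all(new == "osd.2" for _, new in moved))
```

Because each item's draw depends only on its own hash and weight, raising one weight can only pull mappings toward that item; mappings between the untouched items stay put.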
812 * Changing a bucket type from ``straw`` to ``straw2`` will result in
813 a reasonably small amount of data movement, depending on how much
814 the bucket item weights vary from each other. When the weights are
815 all the same no data will move, and when item weights vary
816 significantly there will be more movement.
818 jewel (CRUSH_TUNABLES5)
819 -----------------------
821 The ``jewel`` tunable profile improves the
822 overall behavior of CRUSH such that significantly fewer mappings
823 change when an OSD is marked out of the cluster. This results in
824 significantly less data movement.
828 * ``chooseleaf_stable``: Whether a recursive chooseleaf attempt will
829 use a better value for an inner loop that greatly reduces the number
830 of mapping changes when an OSD is marked out. The legacy value is ``0``,
831 while the new value of ``1`` uses the new approach.
* Changing this value on an existing cluster will result in a very
  large amount of data movement as almost every PG mapping is likely
  to change.
842 Which client versions support CRUSH_TUNABLES
843 --------------------------------------------
845 * argonaut series, v0.48.1 or later
847 * Linux kernel version v3.6 or later (for the file system and RBD kernel clients)
849 Which client versions support CRUSH_TUNABLES2
850 ---------------------------------------------
852 * v0.55 or later, including bobtail series (v0.56.x)
853 * Linux kernel version v3.9 or later (for the file system and RBD kernel clients)
855 Which client versions support CRUSH_TUNABLES3
856 ---------------------------------------------
858 * v0.78 (firefly) or later
859 * Linux kernel version v3.15 or later (for the file system and RBD kernel clients)
861 Which client versions support CRUSH_V4
862 --------------------------------------
864 * v0.94 (hammer) or later
865 * Linux kernel version v4.1 or later (for the file system and RBD kernel clients)
867 Which client versions support CRUSH_TUNABLES5
868 ---------------------------------------------
870 * v10.0.2 (jewel) or later
871 * Linux kernel version v4.5 or later (for the file system and RBD kernel clients)
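
The kernel requirements listed above can be collected into a small lookup to answer which CRUSH features a given kernel client supports. This is a convenience sketch only: the table restates the versions from the lists above, and ``supported_features`` is a hypothetical helper, not a Ceph tool.

```python
# Minimum Linux kernel version per CRUSH feature, as listed above.
KERNEL_SUPPORT = [
    ("CRUSH_TUNABLES", (3, 6)),
    ("CRUSH_TUNABLES2", (3, 9)),
    ("CRUSH_TUNABLES3", (3, 15)),
    ("CRUSH_V4", (4, 1)),
    ("CRUSH_TUNABLES5", (4, 5)),
]


def supported_features(kernel_version):
    """Return the features supported by a kernel given as 'MAJOR.MINOR[.PATCH]'."""
    major, minor = (int(part) for part in kernel_version.split(".")[:2])
    return [name for name, minimum in KERNEL_SUPPORT if (major, minor) >= minimum]


print(supported_features("3.10"))
# ['CRUSH_TUNABLES', 'CRUSH_TUNABLES2']
```

Versions are compared as integer tuples rather than strings, so that ``3.10`` correctly sorts after ``3.9`` and before ``3.15``.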
873 Warning when tunables are non-optimal
874 -------------------------------------
876 Starting with version v0.74, Ceph will issue a health warning if the
877 current CRUSH tunables don't include all the optimal values from the
878 ``default`` profile (see below for the meaning of the ``default`` profile).
879 To make this warning go away, you have two options:
881 1. Adjust the tunables on the existing cluster. Note that this will
882 result in some data movement (possibly as much as 10%). This is the
883 preferred route, but should be taken with care on a production cluster
   where the data movement may affect performance. You can enable optimal
   tunables with::

887 ceph osd crush tunables optimal
889 If things go poorly (e.g., too much load) and not very much
890 progress has been made, or there is a client compatibility problem
891 (old kernel CephFS or RBD clients, or pre-Bobtail ``librados``
892 clients), you can switch back with::
894 ceph osd crush tunables legacy
896 2. You can make the warning go away without making any changes to CRUSH by
897 adding the following option to your ceph.conf ``[mon]`` section::
899 mon warn on legacy crush tunables = false
901 For the change to take effect, you will need to restart the monitors, or
902 apply the option to running monitors with::
904 ceph tell mon.\* config set mon_warn_on_legacy_crush_tunables false
907 A few important points
908 ----------------------
910 * Adjusting these values will result in the shift of some PGs between
911 storage nodes. If the Ceph cluster is already storing a lot of
912 data, be prepared for some fraction of the data to move.
913 * The ``ceph-osd`` and ``ceph-mon`` daemons will start requiring the
914 feature bits of new connections as soon as they get
915 the updated map. However, already-connected clients are
916 effectively grandfathered in, and will misbehave if they do not
917 support the new feature.
918 * If the CRUSH tunables are set to non-legacy values and then later
919 changed back to the default values, ``ceph-osd`` daemons will not be
920 required to support the feature. However, the OSD peering process
921 requires examining and understanding old maps. Therefore, you
922 should not run old versions of the ``ceph-osd`` daemon
923 if the cluster has previously used non-legacy CRUSH values, even if
  the latest version of the map has been switched back to using the
  legacy defaults.
930 The simplest way to adjust CRUSH tunables is by applying them in matched
931 sets known as *profiles*. As of the Octopus release these are:
933 * ``legacy``: the legacy behavior from argonaut and earlier.
934 * ``argonaut``: the legacy values supported by the original argonaut release
935 * ``bobtail``: the values supported by the bobtail release
936 * ``firefly``: the values supported by the firefly release
937 * ``hammer``: the values supported by the hammer release
938 * ``jewel``: the values supported by the jewel release
* ``optimal``: the best (i.e., optimal) values of the current version of Ceph
940 * ``default``: the default values of a new cluster installed from
941 scratch. These values, which depend on the current version of Ceph,
942 are hardcoded and are generally a mix of optimal and legacy values.
943 These values generally match the ``optimal`` profile of the previous
  LTS release, or the most recent release for which we generally expect
  most users to have up-to-date clients.
You can apply a profile to a running cluster with the command::

    ceph osd crush tunables {PROFILE}

Note that changing the profile may trigger data movement, potentially a
substantial amount. Study the release notes and documentation carefully before
changing the profile on a running cluster, and consider throttling
recovery and backfill parameters to limit the impact of a sudden surge of
backfill.
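The precautions above can be sketched as a small shell helper. This is only an illustration: the function names ``run`` and ``apply_tunables_profile`` and the ``DRY_RUN`` variable are hypothetical, and the throttle values of ``1`` are assumptions rather than recommendations. On releases that use the mClock scheduler, these recovery options may not take effect unless overrides are enabled, so check the documentation for your release.

```shell
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to run them against a live cluster.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: apply_tunables_profile <profile>
apply_tunables_profile() {
    profile="$1"
    # Record the tunables currently in effect before changing anything.
    run ceph osd crush show-tunables
    # Throttle backfill and recovery (illustrative values only).
    run ceph config set osd osd_max_backfills 1
    run ceph config set osd osd_recovery_max_active 1
    # Apply the requested profile; this is the step that may move data.
    run ceph osd crush tunables "$profile"
}
```

Calling ``apply_tunables_profile hammer`` with the default ``DRY_RUN=1`` only prints the four commands, which makes it easy to review them before running anything against a production cluster.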
957 .. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
963 When a Ceph Client reads or writes data, it first contacts the primary OSD in
964 each affected PG's acting set. By default, the first OSD in the acting set is
965 the primary. For example, in the acting set ``[2, 3, 4]``, ``osd.2`` is
966 listed first and thus is the primary (aka lead) OSD. Sometimes we know that an
967 OSD is less well suited to act as the lead than are other OSDs (e.g., it has
968 a slow drive or a slow controller). To prevent performance bottlenecks
969 (especially on read operations) while maximizing utilization of your hardware,
970 you can influence the selection of primary OSDs by adjusting primary affinity
971 values, or by crafting a CRUSH rule that selects preferred OSDs first.
973 Tuning primary OSD selection is mainly useful for replicated pools, because
974 by default read operations are served from the primary OSD for each PG.
975 For erasure coded (EC) pools, a way to speed up read operations is to enable
976 **fast read** as described in :ref:`pool-settings`.
A common scenario for primary affinity is a cluster that contains
a mix of drive sizes, for example older racks with 1.9 TB SATA SSDs and newer
racks with 3.84 TB SATA SSDs. On average the latter will be assigned double the
number of PGs and will therefore serve double the number of write and read
operations, making them busier than the former. A rough assignment of primary
affinity inversely proportional to OSD size won't be 100% optimal, but it can
readily achieve a 15% improvement in overall read throughput by utilizing SATA
interface bandwidth and CPU cycles more evenly.
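As a rough sketch of the inverse-proportional assignment described above, the helper below gives the smallest OSD an affinity of ``1.00`` and scales larger OSDs down by the ratio of sizes. The function name ``suggest_affinity`` is hypothetical, and the formula is a first-order approximation to be adjusted after measuring, not a method prescribed by Ceph.

```shell
# Suggest a primary affinity inversely proportional to OSD size.
# The smallest OSD gets 1.00; larger OSDs get proportionally less.
# Usage: suggest_affinity <osd_size_tb> <smallest_osd_size_tb>
suggest_affinity() {
    awk -v size="$1" -v smallest="$2" 'BEGIN {
        if (size <= 0) exit 1          # reject nonsensical sizes
        a = smallest / size
        if (a > 1) a = 1               # affinity is capped at 1
        printf "%.2f\n", a
    }'
}

suggest_affinity 1.9 1.9     # prints 1.00
suggest_affinity 3.84 1.9    # prints 0.49
```

With the drive sizes from the example, the 3.84 TB OSDs would receive an affinity of about ``0.49``, roughly offsetting the doubled PG count they carry.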
By default, all Ceph OSDs have a primary affinity of ``1``, which indicates that
any OSD may act as a primary with equal probability.
You can reduce a Ceph OSD's primary affinity so that CRUSH is less likely to
choose the OSD as primary in a PG's acting set::

    ceph osd primary-affinity <osd-id> <weight>

You may set an OSD's primary affinity to a real number in the range
``[0, 1]``, where ``0`` indicates that the OSD may **NOT** be used as a primary
and ``1`` indicates that the OSD may be used as a primary. When the weight is
between these extremes, CRUSH is less likely to select that OSD as a primary
than it would be with a weight of ``1``. The process for
selecting the lead OSD is more nuanced than a simple probability based on
relative affinity values, but measurable results can be achieved even with
first-order approximations of desirable values.
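A minimal sketch of a wrapper that rejects out-of-range weights before invoking the command is shown below. The names ``valid_affinity``, ``set_primary_affinity``, and ``DRY_RUN`` are hypothetical; only ``ceph osd primary-affinity`` itself comes from the text above.

```shell
# Dry-run by default: the command is printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# Succeed only if $1 is a plain decimal number in the range [0, 1].
valid_affinity() {
    awk -v w="$1" 'BEGIN {
        ok = (w ~ /^[0-9]*\.?[0-9]+$/ && w + 0 >= 0 && w + 0 <= 1)
        exit !ok
    }'
}

# Usage: set_primary_affinity <osd-id> <weight>
set_primary_affinity() {
    osd_id="$1"
    weight="$2"
    if ! valid_affinity "$weight"; then
        echo "primary affinity must be in [0, 1]: $weight" >&2
        return 1
    fi
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ ceph osd primary-affinity $osd_id $weight"
    else
        ceph osd primary-affinity "$osd_id" "$weight"
    fi
}
```

For example, ``set_primary_affinity osd.4 0.5`` prints the command it would run, while a weight of ``1.5`` is refused. After a real change, the ``PRI-AFF`` column of ``ceph osd tree`` should show the new value.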
Some clusters balance cost and performance by mixing SSDs
and HDDs in the same replicated pool. By setting the primary affinity of HDD
OSDs to ``0``, one can direct operations to the SSD in each acting set. An
alternative is to define a CRUSH rule that always selects an SSD OSD as the
first OSD, then selects HDDs for the remaining OSDs. Each PG's acting
set will then contain exactly one SSD OSD as the primary, with the balance on
HDDs.
For example, the CRUSH rule below::

    rule mixed_replicated_rule {
            id 11    # example ID; use one not already present in your CRUSH map
            type replicated
            step take default class ssd
            step chooseleaf firstn 1 type host
            step emit
            step take default class hdd
            step chooseleaf firstn 0 type host
            step emit
    }

chooses an SSD as the first OSD. Note that for an ``N``-times replicated pool
this rule selects ``N+1`` OSDs to guarantee that ``N`` copies are on different
hosts, because the first SSD OSD might be co-located with any of the ``N`` HDD
OSDs.
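The ``N+1`` figure follows from how a ``firstn`` step interprets its count. The sketch below, using the hypothetical helper name ``firstn_count``, encodes the usual CRUSH convention: ``0`` means as many items as the pool has replicas, a positive number means that many, and a negative number means the replica count minus its absolute value.

```shell
# Number of items a "firstn <num>" step asks for in a pool with <size>
# replicas (assumes num <= size when num is positive).
# Usage: firstn_count <num> <size>
firstn_count() {
    n="$1"
    size="$2"
    if [ "$n" -eq 0 ]; then
        echo "$size"
    elif [ "$n" -gt 0 ]; then
        echo "$n"
    else
        echo $((size + n))
    fi
}

# A 3-replica pool with "firstn 1" (SSD) followed by "firstn 0" (HDD):
echo $(( $(firstn_count 1 3) + $(firstn_count 0 3) ))    # prints 4
```

For a pool with three replicas, the two steps together ask for ``1 + 3 = 4`` OSDs, which is ``N+1``. A second step that used ``firstn -1`` instead would ask for ``1 + 2 = 3``.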
This extra storage requirement can be avoided by placing SSDs and HDDs in
different hosts, with the tradeoff that hosts with SSDs will receive all client
requests. You may thus consider faster CPU(s) for SSD hosts and more modest
ones for HDD nodes, since the latter do not receive client requests directly
and will normally handle only replication and recovery traffic. Here the CRUSH
roots ``ssd_hosts`` and ``hdd_hosts`` must not contain any of the same
servers::

    rule mixed_replicated_rule_two {
            id 12    # example ID; use one not already present in your CRUSH map
            type replicated
            step take ssd_hosts class ssd
            step chooseleaf firstn 1 type host
            step emit
            step take hdd_hosts class hdd
            step chooseleaf firstn -1 type host
            step emit
    }
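Rules such as the ones above can be checked offline with ``crushtool`` before they are injected into a live cluster. The following is a sketch of that workflow under stated assumptions: the file names, the helper names ``run``, ``fetch_crush_map``, and ``check_crush_rule``, and the ``DRY_RUN`` variable are all hypothetical, and the rule ID passed in must match whatever ID the rule has in your own map.

```shell
# Dry-run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Export the cluster's CRUSH map and decompile it to editable text.
fetch_crush_map() {
    run ceph osd getcrushmap -o crushmap.bin
    run crushtool -d crushmap.bin -o crushmap.txt
}

# Recompile the edited text map and print the mappings a rule produces.
# Usage: check_crush_rule <rule_id> <replica_count>
check_crush_rule() {
    run crushtool -c crushmap.txt -o crushmap.new
    run crushtool -i crushmap.new --test --rule "$1" --num-rep "$2" --show-mappings
}
```

A typical sequence would be ``fetch_crush_map``, then editing ``crushmap.txt`` to add the rule, then ``check_crush_rule 11 3`` to confirm that the first OSD in each mapping is an SSD. Only after that check would the new map be injected with ``ceph osd setcrushmap -i crushmap.new``.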
1056 Note also that on failure of an SSD, requests to a PG will be served temporarily
1057 from a (slower) HDD OSD until the PG's data has been replicated onto the replacement