============
 CRUSH Maps
============

The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm
computes storage locations in order to determine how to store and retrieve
data. CRUSH allows Ceph clients to communicate with OSDs directly rather than
through a centralized server or broker. By using an algorithmically-determined
method of storing and retrieving data, Ceph avoids a single point of failure, a
performance bottleneck, and a physical limit to its scalability.

CRUSH uses a map of the cluster (the CRUSH map) to map data to OSDs,
distributing the data across the cluster in accordance with configured
replication policy and failure domains. For a detailed discussion of CRUSH, see
`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_.

CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)` and a
hierarchy of "buckets" (``host``\s, ``rack``\s) and rules that govern how CRUSH
replicates data within the cluster's pools. By reflecting the underlying
physical organization of the installation, CRUSH can model (and thereby
address) the potential for correlated device failures. Some factors relevant
to the CRUSH hierarchy include chassis, racks, physical proximity, a shared
power source, shared networking, and failure domains. By encoding this
information into the CRUSH map, CRUSH placement policies distribute object
replicas across failure domains while maintaining the desired distribution. For
example, to address the possibility of concurrent failures, it might be
desirable to ensure that data replicas are on devices that reside in or rely
upon different shelves, racks, power supplies, controllers, or physical
locations.

When OSDs are deployed, they are automatically added to the CRUSH map under a
``host`` bucket that is named for the node on which the OSDs run. This
behavior, combined with the configured CRUSH failure domain, ensures that
replicas or erasure-code shards are distributed across hosts and that the
failure of a single host or other kinds of failures will not affect
availability. For larger clusters, administrators must carefully consider their
choice of failure domain. For example, distributing replicas across racks is
typical for mid- to large-sized clusters.


CRUSH Location
==============

The location of an OSD within the CRUSH map's hierarchy is referred to as its
``CRUSH location``. The specification of a CRUSH location takes the form of a
list of key-value pairs. For example, if an OSD is in a particular row, rack,
chassis, and host, and is also part of the 'default' CRUSH root (which is the
case for most clusters), its CRUSH location can be specified as follows::

   root=default row=a rack=a2 chassis=a2a host=a2a1

.. note::

   #. The order of the keys does not matter.
   #. The key name (left of ``=``) must be a valid CRUSH ``type``. By default,
      valid CRUSH types include ``root``, ``datacenter``, ``room``, ``row``,
      ``pod``, ``pdu``, ``rack``, ``chassis``, and ``host``. These defined
      types suffice for nearly all clusters, but can be customized by
      modifying the CRUSH map.
   #. Not all keys need to be specified. For example, by default, Ceph
      automatically sets an ``OSD``'s location as ``root=default
      host=HOSTNAME`` (as determined by the output of ``hostname -s``).

The CRUSH location for an OSD can be modified by adding the ``crush location``
option in ``ceph.conf``. When this option has been added, every time the OSD
starts it verifies that it is in the correct location in the CRUSH map and
moves itself if it is not. To disable this automatic CRUSH map management, add
the following to the ``ceph.conf`` configuration file in the ``[osd]``
section::

   osd crush update on start = false

Note that this action is unnecessary in most cases.
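
For reference, an explicit ``crush location`` entry in the ``[osd]`` section of
``ceph.conf`` might look like the following (a sketch only; the rack and host
names are taken from the example above and should be replaced with your own)::

   crush location = root=default rack=a2 host=a2a1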

Custom location hooks
---------------------

A custom location hook can be used to generate a more complete CRUSH location
on startup. The CRUSH location is determined by, in order of preference:

#. A ``crush location`` option in ``ceph.conf``
#. A default of ``root=default host=HOSTNAME`` where the hostname is determined
   by the output of the ``hostname -s`` command

A script can be written to provide additional location fields (for example,
``rack`` or ``datacenter``) and the hook can be enabled via the following
config option::

   crush location hook = /path/to/customized-ceph-crush-location

This hook is passed several arguments (see below). The hook outputs a single
line to ``stdout`` that contains the CRUSH location description. The output
resembles the following::

   --cluster CLUSTER --id ID --type TYPE

Here the cluster name is typically ``ceph``, the ``id`` is the daemon
identifier or (in the case of OSDs) the OSD number, and the daemon type is
``osd``, ``mds``, ``mgr``, or ``mon``.

For example, a simple hook that specifies a rack location via a value in the
file ``/etc/rack`` might be as follows::

   #!/bin/sh
   echo "host=$(hostname -s) rack=$(cat /etc/rack) root=default"


CRUSH structure
===============

The CRUSH map consists of (1) a hierarchy that describes the physical topology
of the cluster and (2) a set of rules that defines data placement policy. The
hierarchy has devices (OSDs) at the leaves and internal nodes corresponding to
other physical features or groupings: hosts, racks, rows, data centers, and so
on. The rules determine how replicas are placed in terms of that hierarchy (for
example, 'three replicas in different racks').

Devices
-------

Devices are individual OSDs that store data (usually one device for each
storage drive). Devices are identified by an ``id`` (a non-negative integer)
and a ``name`` (usually ``osd.N``, where ``N`` is the device's ``id``).

In Luminous and later releases, OSDs can have a *device class* assigned (for
example, ``hdd``, ``ssd``, or ``nvme``), allowing them to be targeted by CRUSH
rules. Device classes are especially useful when mixing device types within
hosts.

.. _crush_map_default_types:

Types and Buckets
-----------------

"Bucket", in the context of CRUSH, is a term for any of the internal nodes in
the hierarchy: hosts, racks, rows, and so on. The CRUSH map defines a series of
*types* that are used to identify these nodes. Default types include:

- ``osd`` (or ``device``)
- ``host``
- ``chassis``
- ``rack``
- ``row``
- ``pdu``
- ``pod``
- ``room``
- ``datacenter``
- ``zone``
- ``region``
- ``root``

Most clusters use only a handful of these types, and other types can be defined
as needed.

The hierarchy is built with devices (normally of type ``osd``) at the leaves
and non-device types as the internal nodes. The root node is of type ``root``.
For example:

.. ditaa::

                       +-----------------+
                       |{o}root default  |
                       +--------+--------+
                                |
                +---------------+---------------+
                |                               |
         +------+------+                 +------+------+
         |{o}host foo  |                 |{o}host bar  |
         +------+------+                 +------+------+
                |                               |
        +-------+-------+               +-------+-------+
        |               |               |               |
  +-----+-----+   +-----+-----+   +-----+-----+   +-----+-----+
  |   osd.0   |   |   osd.1   |   |   osd.2   |   |   osd.3   |
  +-----------+   +-----------+   +-----------+   +-----------+

Each node (device or bucket) in the hierarchy has a *weight* that indicates the
relative proportion of the total data that should be stored by that device or
hierarchy subtree. Weights are set at the leaves, indicating the size of the
device. These weights automatically sum in an 'up the tree' direction: that is,
the weight of the ``root`` node will be the sum of the weights of all devices
contained under it. Weights are typically measured in tebibytes (TiB).

To get a simple view of the cluster's CRUSH hierarchy, including weights, run
the following command:

.. prompt:: bash $

   ceph osd tree
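
On the small hierarchy shown above, the output might resemble the following
(all IDs, weights, and column values here are illustrative)::

   ID  CLASS  WEIGHT   TYPE NAME      STATUS  REWEIGHT  PRI-AFF
   -1         3.63836  root default
   -3         1.81918      host foo
    0    hdd  0.90959          osd.0      up   1.00000  1.00000
    1    hdd  0.90959          osd.1      up   1.00000  1.00000
   -5         1.81918      host bar
    2    hdd  0.90959          osd.2      up   1.00000  1.00000
    3    hdd  0.90959          osd.3      up   1.00000  1.00000

Note how each bucket's weight is the sum of the weights beneath it.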

Rules
-----

CRUSH rules define policy governing how data is distributed across the devices
in the hierarchy. The rules define placement as well as replication strategies
or distribution policies that allow you to specify exactly how CRUSH places
data replicas. For example, you might create one rule selecting a pair of
targets for two-way mirroring, another rule for selecting three targets in two
different data centers for three-way replication, and yet another rule for
erasure coding across six storage devices. For a detailed discussion of CRUSH
rules, see **Section 3.2** of `CRUSH - Controlled, Scalable, Decentralized
Placement of Replicated Data`_.

CRUSH rules can be created via the command-line by specifying the *pool type*
that they will govern (replicated or erasure coded), the *failure domain*, and
optionally a *device class*. In rare cases, CRUSH rules must be created by
manually editing the CRUSH map.

To see the rules that are defined for the cluster, run the following command:

.. prompt:: bash $

   ceph osd crush rule ls

To view the contents of the rules, run the following command:

.. prompt:: bash $

   ceph osd crush rule dump

.. _device_classes:

Device classes
--------------

Each device can optionally have a *class* assigned. By default, OSDs
automatically set their class at startup to ``hdd``, ``ssd``, or ``nvme`` in
accordance with the type of device they are backed by.

To explicitly set the device class of one or more OSDs, run a command of the
following form:

.. prompt:: bash $

   ceph osd crush set-device-class <class> <osd-name> [...]
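
For example, to mark two OSDs (the IDs here are hypothetical) as ``nvme``:

.. prompt:: bash $

   ceph osd crush set-device-class nvme osd.10 osd.11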

Once a device class has been set, it cannot be changed to another class until
the old class is unset. To remove the old class of one or more OSDs, run a
command of the following form:

.. prompt:: bash $

   ceph osd crush rm-device-class <osd-name> [...]

This restriction allows administrators to set device classes that won't be
changed on OSD restart or by a script.

To create a placement rule that targets a specific device class, run a command
of the following form:

.. prompt:: bash $

   ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>

To apply the new placement rule to a specific pool, run a command of the
following form:

.. prompt:: bash $

   ceph osd pool set <pool-name> crush_rule <rule-name>
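
For example, the following sequence (the rule name ``fast`` and the pool name
``rbd`` are hypothetical) creates a rule that places replicas only on ``ssd``
devices under the ``default`` root with ``host`` as the failure domain, and
then applies that rule to a pool:

.. prompt:: bash $

   ceph osd crush rule create-replicated fast default host ssd
   ceph osd pool set rbd crush_rule fast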

Device classes are implemented by creating one or more "shadow" CRUSH
hierarchies. For each device class in use, there will be a shadow hierarchy
that contains only devices of that class. CRUSH rules can then distribute data
across the relevant shadow hierarchy. This approach is fully backward
compatible with older Ceph clients. To view the CRUSH hierarchy with shadow
items displayed, run the following command:

.. prompt:: bash #

   ceph osd crush tree --show-shadow

Some older clusters that were created before the Luminous release rely on
manually crafted CRUSH maps to maintain per-device-type hierarchies. For these
clusters, there is a *reclassify* tool available that can help them transition
to device classes without triggering unwanted data movement (see
:ref:`crush-reclassify`).

Weight sets
-----------

A *weight set* is an alternative set of weights to use when calculating data
placement. The normal weights associated with each device in the CRUSH map are
set in accordance with the device size and indicate how much data should be
stored where. However, because CRUSH is a probabilistic pseudorandom placement
process, there is always some variation from this ideal distribution (in the
same way that rolling a die sixty times will likely not result in exactly ten
ones and ten sixes). Weight sets allow the cluster to perform numerical
optimization based on the specifics of your cluster (for example: hierarchy,
pools) to achieve a balanced distribution.

Ceph supports two types of weight sets:

#. A **compat** weight set is a single alternative set of weights for each
   device and each node in the cluster. Compat weight sets cannot be expected
   to correct all anomalies (for example, PGs for different pools might be of
   different sizes and have different load levels, but are mostly treated alike
   by the balancer). However, they have the major advantage of being *backward
   compatible* with previous versions of Ceph. This means that even though
   weight sets were first introduced in Luminous v12.2.z, older clients (for
   example, Firefly) can still connect to the cluster when a compat weight set
   is being used to balance data.

#. A **per-pool** weight set is more flexible in that it allows placement to
   be optimized for each data pool. Additionally, weights can be adjusted
   for each position of placement, allowing the optimizer to correct for a
   subtle skew of data toward devices with small weights relative to their
   peers (an effect that is usually apparent only in very large clusters
   but that can cause balancing problems).

When weight sets are in use, the weights associated with each node in the
hierarchy are visible in a separate column (labeled either as ``(compat)`` or
as the pool name) in the output of the following command:

.. prompt:: bash #

   ceph osd tree

If both *compat* and *per-pool* weight sets are in use, data placement for a
particular pool will use its own per-pool weight set if present. If only
*compat* weight sets are in use, data placement will use the compat weight set.
If neither are in use, data placement will use the normal CRUSH weights.

Although weight sets can be set up and adjusted manually, we recommend enabling
the ``ceph-mgr`` *balancer* module to perform these tasks automatically if the
cluster is running Luminous or a later release.

Modifying the CRUSH map
=======================

.. _addosd:

Adding/Moving an OSD
--------------------

.. note:: Under normal conditions, OSDs automatically add themselves to the
   CRUSH map when they are created. The command in this section is rarely
   needed.

To add or move an OSD in the CRUSH map of a running cluster, run a command of
the following form:

.. prompt:: bash $

   ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...]

For details on this command's parameters, see the following:

``name``
  :Description: The full name of the OSD.
  :Type: String
  :Required: Yes
  :Example: ``osd.0``

``weight``
  :Description: The CRUSH weight of the OSD. Normally, this is its size, as measured in terabytes (TB).
  :Type: Double
  :Required: Yes
  :Example: ``2.0``

``root``
  :Description: The root node of the CRUSH hierarchy in which the OSD resides (normally ``default``).
  :Type: Key-value pair.
  :Required: Yes
  :Example: ``root=default``

``bucket-type``
  :Description: The OSD's location in the CRUSH hierarchy.
  :Type: Key-value pairs.
  :Required: No
  :Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``

In the following example, the command adds ``osd.0`` to the hierarchy, or moves
``osd.0`` from a previous location:

.. prompt:: bash $

   ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1


Adjusting OSD weight
--------------------

.. note:: Under normal conditions, OSDs automatically add themselves to the
   CRUSH map with the correct weight when they are created. The command in this
   section is rarely needed.

To adjust an OSD's CRUSH weight in a running cluster, run a command of the
following form:

.. prompt:: bash $

   ceph osd crush reweight {name} {weight}

For details on this command's parameters, see the following:

``name``
  :Description: The full name of the OSD.
  :Type: String
  :Required: Yes
  :Example: ``osd.0``

``weight``
  :Description: The CRUSH weight of the OSD.
  :Type: Double
  :Required: Yes
  :Example: ``2.0``

.. _removeosd:

Removing an OSD
---------------

.. note:: OSDs are normally removed from the CRUSH map as a result of the
   ``ceph osd purge`` command. This command is rarely needed.

To remove an OSD from the CRUSH map of a running cluster, run a command of the
following form:

.. prompt:: bash $

   ceph osd crush remove {name}

For details on the ``name`` parameter, see the following:

``name``
  :Description: The full name of the OSD.
  :Type: String
  :Required: Yes
  :Example: ``osd.0``

Adding a CRUSH Bucket
---------------------

.. note:: Buckets are implicitly created when an OSD is added and the command
   that creates it specifies a ``{bucket-type}={bucket-name}`` as part of the
   OSD's location (provided that a bucket with that name does not already
   exist). The command in this section is typically used when manually
   adjusting the structure of the hierarchy after OSDs have already been
   created. One use of this command is to move a series of hosts to a new
   rack-level bucket. Another use of this command is to add new ``host``
   buckets (OSD nodes) to a dummy ``root`` so that the buckets don't receive
   any data until they are ready. When they are ready, move the buckets to the
   ``default`` root or to any other root as described below.

To add a bucket in the CRUSH map of a running cluster, run a command of the
following form:

.. prompt:: bash $

   ceph osd crush add-bucket {bucket-name} {bucket-type}

For details on this command's parameters, see the following:

``bucket-name``
  :Description: The full name of the bucket.
  :Type: String
  :Required: Yes
  :Example: ``rack12``

``bucket-type``
  :Description: The type of the bucket. This type must already exist in the CRUSH hierarchy.
  :Type: String
  :Required: Yes
  :Example: ``rack``

In the following example, the command adds the ``rack12`` bucket to the hierarchy:

.. prompt:: bash $

   ceph osd crush add-bucket rack12 rack

Moving a Bucket
---------------

To move a bucket to a different location or position in the CRUSH map
hierarchy, run a command of the following form:

.. prompt:: bash $

   ceph osd crush move {bucket-name} {bucket-type}={bucket-name}, [...]

For details on this command's parameters, see the following:

``bucket-name``
  :Description: The name of the bucket that you are moving.
  :Type: String
  :Required: Yes
  :Example: ``foo-bar-1``

``bucket-type``
  :Description: The bucket's new location in the CRUSH hierarchy.
  :Type: Key-value pairs.
  :Required: No
  :Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``
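
For example, the following command moves the ``rack12`` bucket created above so
that it sits directly under the ``default`` root (a sketch; adjust the target
location to match your own hierarchy):

.. prompt:: bash $

   ceph osd crush move rack12 root=default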

Removing a Bucket
-----------------

To remove a bucket from the CRUSH hierarchy, run a command of the following
form:

.. prompt:: bash $

   ceph osd crush remove {bucket-name}

.. note:: A bucket must already be empty before it is removed from the CRUSH
   hierarchy. In other words, there must not be OSDs or any other CRUSH buckets
   within it.

For details on the ``bucket-name`` parameter, see the following:

``bucket-name``
  :Description: The name of the bucket that is being removed.
  :Type: String
  :Required: Yes
  :Example: ``rack12``

In the following example, the command removes the ``rack12`` bucket from the
hierarchy:

.. prompt:: bash $

   ceph osd crush remove rack12

Creating a compat weight set
----------------------------

.. note:: Normally this action is done automatically if needed by the
   ``balancer`` module (provided that the module is enabled).

To create a *compat* weight set, run the following command:

.. prompt:: bash $

   ceph osd crush weight-set create-compat

To adjust the weights of the compat weight set, run a command of the following
form:

.. prompt:: bash $

   ceph osd crush weight-set reweight-compat {name} {weight}

To destroy the compat weight set, run the following command:

.. prompt:: bash $

   ceph osd crush weight-set rm-compat

Creating per-pool weight sets
-----------------------------

To create a weight set for a specific pool, run a command of the following
form:

.. prompt:: bash $

   ceph osd crush weight-set create {pool-name} {mode}

.. note:: Per-pool weight sets can be used only if all servers and daemons are
   running Luminous v12.2.z or a later release.

For details on this command's parameters, see the following:

``pool-name``
  :Description: The name of a RADOS pool.
  :Type: String
  :Required: Yes
  :Example: ``rbd``

``mode``
  :Description: Either ``flat`` or ``positional``. A *flat* weight set
                assigns a single weight to all devices or buckets. A
                *positional* weight set has a potentially different
                weight for each position in the resulting placement
                mapping. For example: if a pool has a replica count of
                ``3``, then a positional weight set will have three
                weights for each device and bucket.
  :Type: String
  :Required: Yes
  :Example: ``flat``

To adjust the weight of an item in a weight set, run a command of the following
form:

.. prompt:: bash $

   ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]}
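
For example, with a ``positional`` weight set on a three-replica pool, each
item takes one weight per position (the pool name and weight values here are
hypothetical):

.. prompt:: bash $

   ceph osd crush weight-set reweight rbd osd.0 0.9 1.0 1.1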

To list existing weight sets, run the following command:

.. prompt:: bash $

   ceph osd crush weight-set ls

To remove a weight set, run a command of the following form:

.. prompt:: bash $

   ceph osd crush weight-set rm {pool-name}


Creating a rule for a replicated pool
-------------------------------------

When you create a CRUSH rule for a replicated pool, there is an important
decision to make: selecting a failure domain. For example, if you select a
failure domain of ``host``, then CRUSH will ensure that each replica of the
data is stored on a unique host. Alternatively, if you select a failure domain
of ``rack``, then each replica of the data will be stored in a different rack.
Your selection of failure domain should be guided by the size of your cluster
and its CRUSH topology.

The entire cluster hierarchy is typically nested beneath a root node that is
named ``default``. If you have customized your hierarchy, you might want to
create a rule nested beneath some other node in the hierarchy. In creating
this rule for the customized hierarchy, the node type doesn't matter, and in
particular the rule does not have to be nested beneath a ``root`` node.

It is possible to create a rule that restricts data placement to a specific
*class* of device. By default, Ceph OSDs automatically classify themselves as
either ``hdd`` or ``ssd`` in accordance with the underlying type of device
being used. These device classes can be customized. One might set the ``device
class`` of OSDs to ``nvme`` to distinguish them from SATA SSDs, or one might
set them to something arbitrary like ``ssd-testing`` or ``ssd-ethel`` so that
rules and pools may be flexibly constrained to use (or avoid using) specific
subsets of OSDs based on specific requirements.

To create a rule for a replicated pool, run a command of the following form:

.. prompt:: bash $

   ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}]

For details on this command's parameters, see the following:

``name``
  :Description: The name of the rule.
  :Type: String
  :Required: Yes
  :Example: ``rbd-rule``

``root``
  :Description: The name of the CRUSH hierarchy node under which data is to be placed.
  :Type: String
  :Required: Yes
  :Example: ``default``

``failure-domain-type``
  :Description: The type of CRUSH nodes used for the replicas of the failure domain.
  :Type: String
  :Required: Yes
  :Example: ``rack``

``class``
  :Description: The device class on which data is to be placed.
  :Type: String
  :Required: No
  :Example: ``ssd``
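
For example, using the example values above, the following command creates a
rule named ``rbd-rule`` that places replicas on ``ssd`` devices under the
``default`` root, with ``rack`` as the failure domain:

.. prompt:: bash $

   ceph osd crush rule create-replicated rbd-rule default rack ssd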

Creating a rule for an erasure-coded pool
-----------------------------------------

For an erasure-coded pool, similar decisions need to be made: what the failure
domain is, which node in the hierarchy data will be placed under (usually
``default``), and whether placement is restricted to a specific device class.
However, erasure-code pools are created in a different way: there is a need to
construct them carefully with reference to the erasure code plugin in use. For
this reason, these decisions must be incorporated into the **erasure-code
profile**. A CRUSH rule will then be created from the erasure-code profile,
either explicitly or automatically when the profile is used to create a pool.

To list the erasure-code profiles, run the following command:

.. prompt:: bash $

   ceph osd erasure-code-profile ls

To view a specific existing profile, run a command of the following form:

.. prompt:: bash $

   ceph osd erasure-code-profile get {profile-name}

Under normal conditions, profiles should never be modified; instead, a new
profile should be created and used when creating either a new pool or a new
rule for an existing pool.

An erasure-code profile consists of a set of key-value pairs. Most of these
key-value pairs govern the behavior of the erasure code that encodes data in
the pool. However, key-value pairs that begin with ``crush-`` govern the CRUSH
rule that is created.

The relevant erasure-code profile properties are as follows:

 * **crush-root**: the name of the CRUSH node under which to place data
   [default: ``default``].
 * **crush-failure-domain**: the CRUSH bucket type used in the distribution of
   erasure-coded shards [default: ``host``].
 * **crush-device-class**: the device class on which to place data [default:
   none, which means that all devices are used].
 * **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the
   number of erasure-code shards, affecting the resulting CRUSH rule.
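
For example, a profile that encodes each object into four data chunks and two
coding chunks, distributes the shards across racks, and restricts placement to
``ssd`` devices might be defined as follows (the profile name and the
``k``/``m`` values here are only illustrative):

.. prompt:: bash $

   ceph osd erasure-code-profile set myprofile k=4 m=2 crush-failure-domain=rack crush-device-class=ssd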

After a profile is defined, you can create a CRUSH rule by running a command
of the following form:

.. prompt:: bash $

   ceph osd crush rule create-erasure {name} {profile-name}
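
For example, to create a rule from the hypothetical profile defined above (the
rule name is also hypothetical):

.. prompt:: bash $

   ceph osd crush rule create-erasure ecpool-rule myprofile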

.. note:: When creating a new pool, it is not necessary to create the rule
   explicitly. If only the erasure-code profile is specified and the rule
   argument is omitted, then Ceph will create the CRUSH rule automatically.


Deleting rules
--------------

To delete rules that are not in use by pools, run a command of the following
form:

.. prompt:: bash $

   ceph osd crush rule rm {rule-name}

.. _crush-map-tunables:

Tunables
========

The CRUSH algorithm that is used to calculate the placement of data has been
improved over time. In order to support changes in behavior, we have provided
users with sets of tunables that determine which legacy or optimal version of
CRUSH is to be used.

In order to use newer tunables, all Ceph clients and daemons must support the
new major release of CRUSH. Because of this requirement, we have created
``profiles`` that are named after the Ceph version in which they were
introduced. For example, the ``firefly`` tunables were first supported by the
Firefly release and do not work with older clients (for example, clients
running Dumpling). After a cluster's tunables profile is changed from a legacy
set to a newer or ``optimal`` set, the ``ceph-mon`` and ``ceph-osd`` daemons
will prevent older clients that do not support the new CRUSH features from
connecting to the cluster.

argonaut (legacy)
-----------------

The legacy CRUSH behavior used by Argonaut and older releases works fine for
most clusters, provided that not many OSDs have been marked ``out``.

bobtail (CRUSH_TUNABLES2)
-------------------------

The ``bobtail`` tunable profile addresses the following issues:

 * For hierarchies with a small number of devices in leaf buckets, some PGs
   might map to fewer than the desired number of replicas, resulting in
   ``undersized`` PGs. This is known to happen in the case of hierarchies with
   ``host`` nodes that have a small number of OSDs (1 to 3) nested beneath each
   host.

 * For large clusters, a small percentage of PGs might map to fewer than the
   desired number of OSDs. This is known to happen when there are multiple
   hierarchy layers in use (for example, ``row``, ``rack``, ``host``, ``osd``).

 * When one or more OSDs are marked ``out``, data tends to be redistributed
   to nearby OSDs instead of across the entire hierarchy.

The tunables introduced in the Bobtail release are as follows:

 * ``choose_local_tries``: Number of local retries. The legacy value is ``2``,
   and the optimal value is ``0``.

 * ``choose_local_fallback_tries``: The legacy value is ``5``, and the optimal
   value is ``0``.

 * ``choose_total_tries``: Total number of attempts to choose an item. The
   legacy value is ``19``, but subsequent testing indicates that a value of
   ``50`` is more appropriate for typical clusters. For extremely large
   clusters, an even larger value might be necessary.

 * ``chooseleaf_descend_once``: Whether a recursive ``chooseleaf`` attempt will
   retry, or try only once and allow the original placement to retry. The
   legacy default is ``0``, and the optimal value is ``1``.

Migration impact:

 * Moving from the ``argonaut`` tunables to the ``bobtail`` tunables triggers a
   moderate amount of data movement. Use caution on a cluster that is already
   populated with data.

firefly (CRUSH_TUNABLES3)
-------------------------

chooseleaf_vary_r
~~~~~~~~~~~~~~~~~

The ``firefly`` tunable profile fixes a problem with the ``chooseleaf`` CRUSH
step behavior. This problem arose when a large fraction of OSDs were marked
``out``, which resulted in PG mappings with too few OSDs.

This profile was introduced in the Firefly release, and adds a new tunable as
follows:

 * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will start
   with a non-zero value of ``r``, as determined by the number of attempts the
   parent has already made. The legacy default value is ``0``, but with this
   value CRUSH is sometimes unable to find a mapping. The optimal value (in
   terms of computational cost and correctness) is ``1``.

Migration impact:

 * For existing clusters that store a great deal of data, changing this tunable
   from ``0`` to ``1`` will trigger a large amount of data migration; a value
   of ``4`` or ``5`` will allow CRUSH to still find a valid mapping and will
   cause less data to move.

straw_calc_version tunable
~~~~~~~~~~~~~~~~~~~~~~~~~~

There were problems with the internal weights calculated and stored in the
CRUSH map for ``straw`` algorithm buckets. When there were buckets with a CRUSH
weight of ``0`` or with a mix of different and unique weights, CRUSH would
distribute data incorrectly (that is, not in proportion to the weights).

This tunable, introduced in the Firefly release, is as follows:

 * ``straw_calc_version``: A value of ``0`` preserves the old, broken
   internal-weight calculation; a value of ``1`` fixes the problem.

Migration impact:

 * Changing this tunable to a value of ``1`` and then adjusting a straw bucket
   (either by adding, removing, or reweighting an item or by using the
   reweight-all command) can trigger a small to moderate amount of data
   movement provided that the cluster has hit one of the problematic
   conditions.

This tunable option is notable in that it has absolutely no impact on the
kernel version required on the client side.

hammer (CRUSH_V4)
-----------------

The ``hammer`` tunable profile does not affect the mapping of existing CRUSH
maps simply by changing the profile. However:

 * There is a new bucket algorithm supported: ``straw2``. This new algorithm
   fixes several limitations in the original ``straw``. More specifically, the
   old ``straw`` buckets would change some mappings that should not have
   changed when a weight was adjusted, while ``straw2`` achieves the original
   goal of changing mappings only to or from the bucket item whose weight has
   changed.

 * The ``straw2`` type is the default type for any newly created buckets.

Migration impact:

 * Changing a bucket type from ``straw`` to ``straw2`` will trigger a small
   amount of data movement, depending on how much the bucket items' weights
   vary from each other. When the weights are all the same no data will move,
   and the more variance there is in the weights the more movement there will
   be.

jewel (CRUSH_TUNABLES5)
-----------------------

The ``jewel`` tunable profile improves the overall behavior of CRUSH. As a
result, significantly fewer mappings change when an OSD is marked ``out`` of
the cluster. This improvement results in significantly less data movement.

The new tunable introduced in the Jewel release is as follows:

 * ``chooseleaf_stable``: Determines whether a recursive chooseleaf attempt
   will use a better value for an inner loop that greatly reduces the number of
   mapping changes when an OSD is marked ``out``. The legacy value is ``0``,
   and the new value of ``1`` uses the new approach.

Migration impact:

 * Changing this value on an existing cluster will result in a very large
   amount of data movement because nearly every PG mapping is likely to change.

Client versions that support CRUSH_TUNABLES2
--------------------------------------------

 * v0.55 and later, including Bobtail (v0.56.x)
 * Linux kernel version v3.9 and later (for the CephFS and RBD kernel clients)

Client versions that support CRUSH_TUNABLES3
--------------------------------------------

 * v0.78 (Firefly) and later
 * Linux kernel version v3.15 and later (for the CephFS and RBD kernel clients)

Client versions that support CRUSH_V4
-------------------------------------

 * v0.94 (Hammer) and later
 * Linux kernel version v4.1 and later (for the CephFS and RBD kernel clients)

Client versions that support CRUSH_TUNABLES5
--------------------------------------------

 * v10.0.2 (Jewel) and later
 * Linux kernel version v4.5 and later (for the CephFS and RBD kernel clients)

"Non-optimal tunables" warning
------------------------------

In v0.74 and later versions, Ceph will raise a health check ("HEALTH_WARN crush
map has non-optimal tunables") if any of the current CRUSH tunables have
non-optimal values: that is, if any fail to have the optimal values from the
:ref:`default profile
<rados_operations_crush_map_default_profile_definition>`. There are two
different ways to silence the alert:

1. Adjust the CRUSH tunables on the existing cluster so as to render them
   optimal. Making this adjustment will trigger some data movement
   (possibly as much as 10%). This approach is generally preferred to the
   other approach, but special care must be taken in situations where
   data movement might affect performance: for example, in production clusters.
   To enable optimal tunables, run the following command:

   .. prompt:: bash $

      ceph osd crush tunables optimal

   There are several potential problems that might make it preferable to revert
   to the previous values of the tunables. The new values might generate too
   much load for the cluster to handle, the new values might unacceptably slow
   the operation of the cluster, or there might be a client-compatibility
   problem. Such client-compatibility problems can arise when using old-kernel
   CephFS or RBD clients, or pre-Bobtail ``librados`` clients. To revert to
   the previous values of the tunables, run the following command:

   .. prompt:: bash $

      ceph osd crush tunables legacy

2. To silence the alert without making any changes to CRUSH,
   add the following option to the ``[mon]`` section of your ceph.conf file::

      mon_warn_on_legacy_crush_tunables = false

   In order for this change to take effect, you will need to either restart
   the monitors or run the following command to apply the option to the
   monitors while they are still running:

   .. prompt:: bash $

      ceph tell mon.\* config set mon_warn_on_legacy_crush_tunables false


Tuning CRUSH
------------

When making adjustments to CRUSH tunables, keep the following considerations in
mind:

 * Adjusting the values of CRUSH tunables will result in the shift of one or
   more PGs from one storage node to another. If the Ceph cluster is already
   storing a great deal of data, be prepared for significant data movement.
 * When the ``ceph-osd`` and ``ceph-mon`` daemons get the updated map, they
   immediately begin rejecting new connections from clients that do not support
   the new feature. However, already-connected clients are effectively
   grandfathered in, and any of these clients that do not support the new
   feature will malfunction.
 * If the CRUSH tunables are set to newer (non-legacy) values and subsequently
   reverted to the legacy values, ``ceph-osd`` daemons will not be required to
   support any of the newer CRUSH features associated with the newer
   (non-legacy) values. However, the OSD peering process requires the
   examination and understanding of old maps. For this reason, **if the cluster
   has previously used non-legacy CRUSH values, do not run old versions of
   the** ``ceph-osd`` **daemon** -- even if the latest version of the map has
   been reverted so as to use the legacy defaults.

The simplest way to adjust CRUSH tunables is to apply them in matched sets
known as *profiles*. As of the Octopus release, Ceph supports the following
profiles:

 * ``legacy``: The legacy behavior from argonaut and earlier.
 * ``argonaut``: The legacy values supported by the argonaut release.
 * ``bobtail``: The values supported by the bobtail release.
 * ``firefly``: The values supported by the firefly release.
 * ``hammer``: The values supported by the hammer release.
 * ``jewel``: The values supported by the jewel release.
 * ``optimal``: The best values for the current version of Ceph.

.. _rados_operations_crush_map_default_profile_definition:

 * ``default``: The default values of a new cluster that has been installed
   from scratch. These values, which depend on the current version of Ceph, are
   hardcoded and are typically a mix of optimal and legacy values. These
   values often correspond to the ``optimal`` profile of either the previous
   LTS (long-term service) release or the most recent release for which most
   users are expected to have up-to-date clients.

To apply a profile to a running cluster, run a command of the following form:

.. prompt:: bash $

   ceph osd crush tunables {PROFILE}

This action might trigger a great deal of data movement. Consult release notes
and documentation before changing the profile on a running cluster. Consider
throttling recovery and backfill parameters in order to limit the backfill
resulting from a specific change.

.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf


Tuning Primary OSD Selection
============================

When a Ceph client reads or writes data, it first contacts the primary OSD in
each affected PG's acting set. By default, the first OSD in the acting set is
the primary OSD (also known as the "lead OSD"). For example, in the acting set
``[2, 3, 4]``, ``osd.2`` is listed first and is therefore the primary OSD.
However, sometimes it is clear that an OSD is not well suited to act as the
lead as compared with other OSDs (for example, if the OSD has a slow drive or a
slow controller). To prevent performance bottlenecks (especially on read
operations) and at the same time maximize the utilization of your hardware, you
can influence the selection of the primary OSD either by adjusting "primary
affinity" values, or by crafting a CRUSH rule that selects OSDs that are better
suited to act as the lead rather than other OSDs.

To determine whether tuning Ceph's selection of primary OSDs will improve
cluster performance, pool redundancy strategy must be taken into account. For
replicated pools, this tuning can be especially useful, because by default read
operations are served from the primary OSD of each PG. For erasure-coded pools,
however, the speed of read operations can be increased by enabling **fast
read** (see :ref:`pool-settings`).

Primary Affinity
----------------

**Primary affinity** is a characteristic of an OSD that governs the likelihood
that a given OSD will be selected as the primary OSD (or "lead OSD") in a given
acting set. A primary affinity value can be any real number in the range ``0``
to ``1``, inclusive.

As an example of a common scenario in which it can be useful to adjust primary
affinity values, let us suppose that a cluster contains a mix of drive sizes:
for example, suppose it contains some older racks with 1.9 TB SATA SSDs and
some newer racks with 3.84 TB SATA SSDs. The latter will on average be assigned
twice the number of PGs and will thus serve twice the number of write and read
operations -- they will be busier than the former. In such a scenario, you
might make a rough assignment of primary affinity as inversely proportional to
OSD size. Such an assignment will not be 100% optimal, but it can readily
achieve a 15% improvement in overall read throughput by means of a more even
utilization of SATA interface bandwidth and CPU cycles. This example is not
merely a thought experiment meant to illustrate the theoretical benefits of
adjusting primary affinity values; this fifteen percent improvement was
achieved on an actual Ceph cluster.

By default, every Ceph OSD has a primary affinity value of ``1``. In a cluster
in which every OSD has this default value, all OSDs are equally likely to act
as a primary OSD.

By reducing the value of a Ceph OSD's primary affinity, you make CRUSH less
likely to select the OSD as primary in a PG's acting set. To change the weight
value associated with a specific OSD's primary affinity, run a command of the
following form:

.. prompt:: bash $

   ceph osd primary-affinity <osd-id> <weight>
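
For example, in a mixed-capacity scenario like the one described above, you
might reduce the likelihood that one of the larger OSDs is selected as the
primary (the OSD ID and the value here are hypothetical):

.. prompt:: bash $

   ceph osd primary-affinity osd.12 0.5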

The primary affinity of an OSD can be set to any real number in the range
``[0-1]`` inclusive, where ``0`` indicates that the OSD may not be used as
primary and ``1`` indicates that the OSD is maximally likely to be used as a
primary. When the weight is between these extremes, its value indicates roughly
how likely it is that CRUSH will select the OSD associated with it as a
primary.

The process by which CRUSH selects the lead OSD is not a mere function of a
simple probability determined by relative affinity values. Nevertheless,
measurable results can be achieved even with first-order approximations of
desirable primary affinity values.


Custom CRUSH Rules
------------------

Some clusters balance cost and performance by mixing SSDs and HDDs in the same
replicated pool. By setting the primary affinity of HDD OSDs to ``0``,
operations will be directed to an SSD OSD in each acting set. Alternatively,
you can define a CRUSH rule that always selects an SSD OSD as the primary OSD
and then selects HDDs for the remaining OSDs. Given this rule, each PG's acting
set will contain an SSD OSD as the primary and have the remaining OSDs on HDDs.

For example, see the following CRUSH rule::

   rule mixed_replicated_rule {
       id 11
       type replicated
       step take default class ssd
       step chooseleaf firstn 1 type host
       step emit
       step take default class hdd
       step chooseleaf firstn 0 type host
       step emit
   }

This rule chooses an SSD as the first OSD. For an ``N``-times replicated pool,
this rule selects ``N+1`` OSDs in order to guarantee that ``N`` copies are on
different hosts, because the first SSD OSD might be colocated with any of the
``N`` HDD OSDs.

To avoid this extra storage requirement, you might place SSDs and HDDs in
different hosts. However, taking this approach means that all client requests
will be received by hosts with SSDs. For this reason, it might be advisable to
have faster CPUs for SSD OSDs and more modest CPUs for HDD OSDs, since the
latter will under normal circumstances perform only recovery operations. Here
the CRUSH roots ``ssd_hosts`` and ``hdd_hosts`` are under a strict requirement
not to contain any of the same servers, as seen in the following CRUSH rule::

   rule mixed_replicated_rule_two {
       id 1
       type replicated
       step take ssd_hosts class ssd
       step chooseleaf firstn 1 type host
       step emit
       step take hdd_hosts class hdd
       step chooseleaf firstn -1 type host
       step emit
   }

.. note:: If a primary SSD OSD fails, then requests to the associated PG will
   be temporarily served from a slower HDD OSD until the PG's data has been
   replicated onto the replacement primary SSD OSD.