============
 CRUSH Maps
============

The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm
computes storage locations in order to determine how to store and retrieve
data. CRUSH allows Ceph clients to communicate with OSDs directly rather than
through a centralized server or broker. By using an algorithmically-determined
method of storing and retrieving data, Ceph avoids a single point of failure, a
performance bottleneck, and a physical limit to its scalability.

CRUSH uses a map of the cluster (the CRUSH map) to map data to OSDs,
distributing the data across the cluster in accordance with configured
replication policy and failure domains. For a detailed discussion of CRUSH, see
`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_.

CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)` and a
hierarchy of "buckets" (``host``\s, ``rack``\s) and rules that govern how CRUSH
replicates data within the cluster's pools. By reflecting the underlying
physical organization of the installation, CRUSH can model (and thereby
address) the potential for correlated device failures. Some factors relevant
to the CRUSH hierarchy include chassis, racks, physical proximity, a shared
power source, shared networking, and failure domains. By encoding this
information into the CRUSH map, CRUSH placement policies distribute object
replicas across failure domains while maintaining the desired distribution. For
example, to address the possibility of concurrent failures, it might be
desirable to ensure that data replicas are on devices that reside in or rely
upon different shelves, racks, power supplies, controllers, or physical
locations.

When OSDs are deployed, they are automatically added to the CRUSH map under a
``host`` bucket that is named for the node on which the OSDs run. This
behavior, combined with the configured CRUSH failure domain, ensures that
replicas or erasure-code shards are distributed across hosts and that the
failure of a single host or other kinds of failures will not affect
availability. For larger clusters, administrators must carefully consider their
choice of failure domain. For example, distributing replicas across racks is
typical for mid- to large-sized clusters.


CRUSH Location
==============

The location of an OSD within the CRUSH map's hierarchy is referred to as its
``CRUSH location``. The specification of a CRUSH location takes the form of a
list of key-value pairs. For example, if an OSD is in a particular row, rack,
chassis, and host, and is also part of the 'default' CRUSH root (which is the
case for most clusters), its CRUSH location can be specified as follows::

   root=default row=a rack=a2 chassis=a2a host=a2a1

.. note::

   #. The order of the keys does not matter.
   #. The key name (left of ``=``) must be a valid CRUSH ``type``. By default,
      valid CRUSH types include ``root``, ``datacenter``, ``room``, ``row``,
      ``pod``, ``pdu``, ``rack``, ``chassis``, and ``host``. These defined
      types suffice for nearly all clusters, but can be customized by
      modifying the CRUSH map.

The CRUSH location for an OSD can be set by adding the ``crush_location``
option to ``ceph.conf``. For example::

   crush_location = root=default row=a rack=a2 chassis=a2a host=a2a1

When this option has been added, every time the OSD starts it verifies that it
is in the correct location in the CRUSH map and moves itself if it is not. To
disable this automatic CRUSH map management, add the following to the
``ceph.conf`` configuration file in the ``[osd]`` section::

   osd_crush_update_on_start = false

Note that this action is unnecessary in most cases.
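
If your cluster manages options through the centralized configuration database
rather than a local ``ceph.conf``, the same option can likely be set with
``ceph config``; this is a sketch using the option named above:

.. prompt:: bash $

   ceph config set osd osd_crush_update_on_start false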

If the ``crush_location`` is not set explicitly, a default of ``root=default
host=HOSTNAME`` is used for OSDs, where the hostname is determined by the
output of the ``hostname -s`` command.

.. note:: If you switch from this default to an explicitly set
   ``crush_location``, do not forget to include ``root=default``, because
   existing CRUSH rules refer to it.
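
As an illustration, a minimal ``[osd]`` section that pins the OSDs on a host to
an explicit location might look like the following; the datacenter and host
names here are hypothetical::

   [osd]
   crush_location = root=default datacenter=dc1 host=ceph-node-1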

Custom location hooks
---------------------

A custom location hook can be used to generate a more complete CRUSH location
on startup.

This is useful when some location fields are not known at the time
``ceph.conf`` is written (for example, fields ``rack`` or ``datacenter``
when deploying a single configuration across multiple datacenters).

If configured, executed, and parsed successfully, the hook's output replaces
any previously set CRUSH location.

The hook can be enabled in ``ceph.conf`` by providing a path to an executable
file (often a script), for example::

   crush_location_hook = /path/to/customized-ceph-crush-location

This hook is passed several arguments (see below). The hook outputs a single
line to ``stdout`` that contains the CRUSH location description. The arguments
resemble the following::

   --cluster CLUSTER --id ID --type TYPE

Here the cluster name is typically ``ceph``, the ``id`` is the daemon
identifier or (in the case of OSDs) the OSD number, and the daemon type is
``osd``, ``mds``, ``mgr``, or ``mon``.

For example, a simple hook that specifies a rack location via a value in the
file ``/etc/rack`` (assuming it contains no spaces) might be as follows::

   #!/bin/sh
   echo "root=default rack=$(cat /etc/rack) host=$(hostname -s)"
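
To sanity-check such a hook before relying on it, you can run it by hand with
the same arguments that Ceph would pass; the OSD id, rack value, and hostname
shown here are hypothetical::

   $ /path/to/customized-ceph-crush-location --cluster ceph --id 12 --type osd
   root=default rack=b2 host=ceph-node-3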


CRUSH structure
===============

The CRUSH map consists of (1) a hierarchy that describes the physical topology
of the cluster and (2) a set of rules that defines data placement policy. The
hierarchy has devices (OSDs) at the leaves and internal nodes corresponding to
other physical features or groupings: hosts, racks, rows, data centers, and so
on. The rules determine how replicas are placed in terms of that hierarchy (for
example, 'three replicas in different racks').

Devices
-------

Devices are individual OSDs that store data (usually one device for each
storage drive). Devices are identified by an ``id`` (a non-negative integer)
and a ``name`` (usually ``osd.N``, where ``N`` is the device's ``id``).

In Luminous and later releases, OSDs can have a *device class* assigned (for
example, ``hdd``, ``ssd``, or ``nvme``), allowing them to be targeted by CRUSH
rules. Device classes are especially useful when mixing device types within
hosts.

.. _crush_map_default_types:

Types and Buckets
-----------------

"Bucket", in the context of CRUSH, is a term for any of the internal nodes in
the hierarchy: hosts, racks, rows, and so on. The CRUSH map defines a series of
*types* that are used to identify these nodes. Default types include:

- ``osd`` (or ``device``)
- ``host``
- ``chassis``
- ``rack``
- ``row``
- ``pdu``
- ``pod``
- ``room``
- ``datacenter``
- ``zone``
- ``region``
- ``root``

Most clusters use only a handful of these types, and other types can be defined
as needed.

The hierarchy is built with devices (normally of type ``osd``) at the leaves
and non-device types as the internal nodes. The root node is of type ``root``.
For example:


.. ditaa::

                        +-----------------+
                        |{o}root default  |
                        +--------+--------+
                                 |
                 +---------------+---------------+
                 |                               |
          +------+------+                 +------+------+
          |{o}host foo  |                 |{o}host bar  |
          +------+------+                 +------+------+
                 |                               |
         +-------+-------+               +-------+-------+
         |               |               |               |
   +-----+-----+   +-----+-----+   +-----+-----+   +-----+-----+
   |   osd.0   |   |   osd.1   |   |   osd.2   |   |   osd.3   |
   +-----------+   +-----------+   +-----------+   +-----------+


Each node (device or bucket) in the hierarchy has a *weight* that indicates the
relative proportion of the total data that should be stored by that device or
hierarchy subtree. Weights are set at the leaves, indicating the size of the
device. These weights automatically sum in an 'up the tree' direction: that is,
the weight of the ``root`` node will be the sum of the weights of all devices
contained under it. Weights are typically measured in tebibytes (TiB).

To get a simple view of the cluster's CRUSH hierarchy, including weights, run
the following command:

.. prompt:: bash $

   ceph osd tree

Rules
-----

CRUSH rules define policy governing how data is distributed across the devices
in the hierarchy. The rules define placement as well as replication strategies
or distribution policies that allow you to specify exactly how CRUSH places
data replicas. For example, you might create one rule selecting a pair of
targets for two-way mirroring, another rule for selecting three targets in two
different data centers for three-way replication, and yet another rule for
erasure coding across six storage devices. For a detailed discussion of CRUSH
rules, see **Section 3.2** of `CRUSH - Controlled, Scalable, Decentralized
Placement of Replicated Data`_.

CRUSH rules can be created via the command line by specifying the *pool type*
that they will govern (replicated or erasure coded), the *failure domain*, and
optionally a *device class*. In rare cases, CRUSH rules must be created by
manually editing the CRUSH map.

To see the rules that are defined for the cluster, run the following command:

.. prompt:: bash $

   ceph osd crush rule ls

To view the contents of the rules, run the following command:

.. prompt:: bash $

   ceph osd crush rule dump

.. _device_classes:

Device classes
--------------

Each device can optionally have a *class* assigned. By default, OSDs
automatically set their class at startup to ``hdd``, ``ssd``, or ``nvme`` in
accordance with the type of device they are backed by.

To explicitly set the device class of one or more OSDs, run a command of the
following form:

.. prompt:: bash $

   ceph osd crush set-device-class <class> <osd-name> [...]

Once a device class has been set, it cannot be changed to another class until
the old class is unset. To remove the old class of one or more OSDs, run a
command of the following form:

.. prompt:: bash $

   ceph osd crush rm-device-class <osd-name> [...]

This restriction allows administrators to set device classes that won't be
changed on OSD restart or by a script.

To create a placement rule that targets a specific device class, run a command
of the following form:

.. prompt:: bash $

   ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>

To apply the new placement rule to a specific pool, run a command of the
following form:

.. prompt:: bash $

   ceph osd pool set <pool-name> crush_rule <rule-name>
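
For example, the following commands would create a rule named ``fast`` that
places replicas on ``ssd``-class devices with ``host`` as the failure domain,
and would then switch a (hypothetical) pool named ``rbd`` to that rule:

.. prompt:: bash $

   ceph osd crush rule create-replicated fast default host ssd
   ceph osd pool set rbd crush_rule fast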

Device classes are implemented by creating one or more "shadow" CRUSH
hierarchies. For each device class in use, there will be a shadow hierarchy
that contains only devices of that class. CRUSH rules can then distribute data
across the relevant shadow hierarchy. This approach is fully backward
compatible with older Ceph clients. To view the CRUSH hierarchy with shadow
items displayed, run the following command:

.. prompt:: bash #

   ceph osd crush tree --show-shadow

Some older clusters that were created before the Luminous release rely on
manually crafted CRUSH maps to maintain per-device-type hierarchies. For these
clusters, there is a *reclassify* tool available that can help them transition
to device classes without triggering unwanted data movement (see
:ref:`crush-reclassify`).

Weight sets
-----------

A *weight set* is an alternative set of weights to use when calculating data
placement. The normal weights associated with each device in the CRUSH map are
set in accordance with the device size and indicate how much data should be
stored where. However, because CRUSH is a probabilistic pseudorandom placement
process, there is always some variation from this ideal distribution (in the
same way that rolling a die sixty times will likely not result in exactly ten
ones and ten sixes). Weight sets allow the cluster to perform numerical
optimization based on the specifics of your cluster (for example: hierarchy,
pools) to achieve a balanced distribution.

Ceph supports two types of weight sets:

#. A **compat** weight set is a single alternative set of weights for each
   device and each node in the cluster. Compat weight sets cannot be expected
   to correct all anomalies (for example, PGs for different pools might be of
   different sizes and have different load levels, but are mostly treated alike
   by the balancer). However, they have the major advantage of being *backward
   compatible* with previous versions of Ceph. This means that even though
   weight sets were first introduced in Luminous v12.2.z, older clients (for
   example, Firefly) can still connect to the cluster when a compat weight set
   is being used to balance data.

#. A **per-pool** weight set is more flexible in that it allows placement to
   be optimized for each data pool. Additionally, weights can be adjusted
   for each position of placement, allowing the optimizer to correct for a
   subtle skew of data toward devices with small weights relative to their
   peers (an effect that is usually apparent only in very large clusters
   but that can cause balancing problems).

When weight sets are in use, the weights associated with each node in the
hierarchy are visible in a separate column (labeled either as ``(compat)`` or
as the pool name) in the output of the following command:

.. prompt:: bash #

   ceph osd tree

If both *compat* and *per-pool* weight sets are in use, data placement for a
particular pool will use its own per-pool weight set if present. If only
*compat* weight sets are in use, data placement will use the compat weight set.
If neither is in use, data placement will use the normal CRUSH weights.

Although weight sets can be set up and adjusted manually, we recommend enabling
the ``ceph-mgr`` *balancer* module to perform these tasks automatically if the
cluster is running Luminous or a later release.
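
As a minimal sketch of handing this work to the balancer module (the
``crush-compat`` mode is the one that manipulates the compat weight set;
adjust the mode to suit your cluster):

.. prompt:: bash #

   ceph mgr module enable balancer
   ceph balancer mode crush-compat
   ceph balancer on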
c07f9fc5
FG
339
340Modifying the CRUSH map
341=======================
7c673cae
FG
342
343.. _addosd:
344
1e59de90
TL
345Adding/Moving an OSD
346--------------------
347
348.. note:: Under normal conditions, OSDs automatically add themselves to the
349 CRUSH map when they are created. The command in this section is rarely
350 needed.
7c673cae 351
7c673cae 352
1e59de90
TL
353To add or move an OSD in the CRUSH map of a running cluster, run a command of
354the following form:
39ae355f
TL
355
356.. prompt:: bash $
7c673cae 357
39ae355f 358 ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...]
7c673cae 359
1e59de90 360For details on this command's parameters, see the following:
7c673cae 361
7c673cae 362``name``
1e59de90
TL
363 :Description: The full name of the OSD.
364 :Type: String
365 :Required: Yes
366 :Example: ``osd.0``
7c673cae
FG
367
368
369``weight``
1e59de90
TL
370 :Description: The CRUSH weight of the OSD. Normally, this is its size, as measured in terabytes (TB).
371 :Type: Double
372 :Required: Yes
373 :Example: ``2.0``
7c673cae
FG
374
375
376``root``
1e59de90
TL
377 :Description: The root node of the CRUSH hierarchy in which the OSD resides (normally ``default``).
378 :Type: Key-value pair.
379 :Required: Yes
380 :Example: ``root=default``
7c673cae
FG
381
382
383``bucket-type``
1e59de90
TL
384 :Description: The OSD's location in the CRUSH hierarchy.
385 :Type: Key-value pairs.
386 :Required: No
387 :Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``
7c673cae 388
1e59de90
TL
389In the following example, the command adds ``osd.0`` to the hierarchy, or moves
390``osd.0`` from a previous location:
7c673cae 391
39ae355f
TL
392.. prompt:: bash $
393
394 ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1


Adjusting OSD weight
--------------------

.. note:: Under normal conditions, OSDs automatically add themselves to the
   CRUSH map with the correct weight when they are created. The command in this
   section is rarely needed.

To adjust an OSD's CRUSH weight in a running cluster, run a command of the
following form:

.. prompt:: bash $

   ceph osd crush reweight {name} {weight}

For details on this command's parameters, see the following:

``name``
   :Description: The full name of the OSD.
   :Type: String
   :Required: Yes
   :Example: ``osd.0``

``weight``
   :Description: The CRUSH weight of the OSD.
   :Type: Double
   :Required: Yes
   :Example: ``2.0``

.. _removeosd:

Removing an OSD
---------------

.. note:: OSDs are normally removed from the CRUSH map as a result of the
   ``ceph osd purge`` command. This command is rarely needed.

To remove an OSD from the CRUSH map of a running cluster, run a command of the
following form:

.. prompt:: bash $

   ceph osd crush remove {name}

For details on the ``name`` parameter, see the following:

``name``
   :Description: The full name of the OSD.
   :Type: String
   :Required: Yes
   :Example: ``osd.0``


Adding a CRUSH Bucket
---------------------

.. note:: Buckets are implicitly created when an OSD is added and the command
   that creates it specifies a ``{bucket-type}={bucket-name}`` as part of the
   OSD's location (provided that a bucket with that name does not already
   exist). The command in this section is typically used when manually
   adjusting the structure of the hierarchy after OSDs have already been
   created. One use of this command is to move a series of hosts to a new
   rack-level bucket. Another use of this command is to add new ``host``
   buckets (OSD nodes) to a dummy ``root`` so that the buckets don't receive
   any data until they are ready to receive data. When they are ready, move the
   buckets to the ``default`` root or to any other root as described below.

To add a bucket in the CRUSH map of a running cluster, run a command of the
following form:

.. prompt:: bash $

   ceph osd crush add-bucket {bucket-name} {bucket-type}

For details on this command's parameters, see the following:

``bucket-name``
   :Description: The full name of the bucket.
   :Type: String
   :Required: Yes
   :Example: ``rack12``

``bucket-type``
   :Description: The type of the bucket. This type must already exist in the CRUSH hierarchy.
   :Type: String
   :Required: Yes
   :Example: ``rack``

In the following example, the command adds the ``rack12`` bucket to the hierarchy:

.. prompt:: bash $

   ceph osd crush add-bucket rack12 rack

Moving a Bucket
---------------

To move a bucket to a different location or position in the CRUSH map
hierarchy, run a command of the following form:

.. prompt:: bash $

   ceph osd crush move {bucket-name} {bucket-type}={bucket-name} [...]

For details on this command's parameters, see the following:

``bucket-name``
   :Description: The name of the bucket that you are moving.
   :Type: String
   :Required: Yes
   :Example: ``foo-bar-1``

``bucket-type``
   :Description: The bucket's new location in the CRUSH hierarchy.
   :Type: Key-value pairs.
   :Required: No
   :Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``
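
As an illustration, the ``rack12`` bucket added in the previous example could
be placed under the ``default`` root in a (hypothetical) row ``a`` as follows:

.. prompt:: bash $

   ceph osd crush move rack12 root=default row=a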

Removing a Bucket
-----------------

To remove a bucket from the CRUSH hierarchy, run a command of the following
form:

.. prompt:: bash $

   ceph osd crush remove {bucket-name}

.. note:: A bucket must already be empty before it is removed from the CRUSH
   hierarchy. In other words, there must not be OSDs or any other CRUSH buckets
   within it.

For details on the ``bucket-name`` parameter, see the following:

``bucket-name``
   :Description: The name of the bucket that is being removed.
   :Type: String
   :Required: Yes
   :Example: ``rack12``

In the following example, the command removes the ``rack12`` bucket from the
hierarchy:

.. prompt:: bash $

   ceph osd crush remove rack12

Creating a compat weight set
----------------------------

.. note:: Normally this action is done automatically if needed by the
   ``balancer`` module (provided that the module is enabled).

To create a *compat* weight set, run the following command:

.. prompt:: bash $

   ceph osd crush weight-set create-compat

To adjust the weights of the compat weight set, run a command of the following
form:

.. prompt:: bash $

   ceph osd crush weight-set reweight-compat {name} {weight}

To destroy the compat weight set, run the following command:

.. prompt:: bash $

   ceph osd crush weight-set rm-compat

Creating per-pool weight sets
-----------------------------

To create a weight set for a specific pool, run a command of the following
form:

.. prompt:: bash $

   ceph osd crush weight-set create {pool-name} {mode}

.. note:: Per-pool weight sets can be used only if all servers and daemons are
   running Luminous v12.2.z or a later release.

For details on this command's parameters, see the following:

``pool-name``
   :Description: The name of a RADOS pool.
   :Type: String
   :Required: Yes
   :Example: ``rbd``

``mode``
   :Description: Either ``flat`` or ``positional``. A *flat* weight set
                 assigns a single weight to all devices or buckets. A
                 *positional* weight set has a potentially different
                 weight for each position in the resulting placement
                 mapping. For example: if a pool has a replica count of
                 ``3``, then a positional weight set will have three
                 weights for each device and bucket.
   :Type: String
   :Required: Yes
   :Example: ``flat``

To adjust the weight of an item in a weight set, run a command of the following
form:

.. prompt:: bash $

   ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]}
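
For example, with a *flat* weight set created for a (hypothetical) pool named
``rbd``, the weight of ``osd.0`` in that weight set could be adjusted like so
(a *positional* weight set would instead take one weight per replica position):

.. prompt:: bash $

   ceph osd crush weight-set reweight rbd osd.0 0.75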

To list existing weight sets, run the following command:

.. prompt:: bash $

   ceph osd crush weight-set ls

To remove a weight set, run a command of the following form:

.. prompt:: bash $

   ceph osd crush weight-set rm {pool-name}


Creating a rule for a replicated pool
-------------------------------------

When you create a CRUSH rule for a replicated pool, there is an important
decision to make: selecting a failure domain. For example, if you select a
failure domain of ``host``, then CRUSH will ensure that each replica of the
data is stored on a unique host. Alternatively, if you select a failure domain
of ``rack``, then each replica of the data will be stored in a different rack.
Your selection of failure domain should be guided by the size of the cluster
and by its CRUSH topology.

The entire cluster hierarchy is typically nested beneath a root node that is
named ``default``. If you have customized your hierarchy, you might want to
create a rule nested beneath some other node in the hierarchy. In creating
this rule for the customized hierarchy, the node type doesn't matter, and in
particular the rule does not have to be nested beneath a ``root`` node.

It is possible to create a rule that restricts data placement to a specific
*class* of device. By default, Ceph OSDs automatically classify themselves as
either ``hdd`` or ``ssd`` in accordance with the underlying type of device
being used. These device classes can be customized. One might set the device
class of OSDs to ``nvme`` to distinguish them from SATA SSDs, or one might set
them to something arbitrary like ``ssd-testing`` or ``ssd-ethel`` so that rules
and pools may be flexibly constrained to use (or avoid using) specific subsets
of OSDs based on specific requirements.

To create a rule for a replicated pool, run a command of the following form:

.. prompt:: bash $

   ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}]

For details on this command's parameters, see the following:

``name``
   :Description: The name of the rule.
   :Type: String
   :Required: Yes
   :Example: ``rbd-rule``

``root``
   :Description: The name of the CRUSH hierarchy node under which data is to be placed.
   :Type: String
   :Required: Yes
   :Example: ``default``

``failure-domain-type``
   :Description: The type of CRUSH nodes used for the replicas of the failure domain.
   :Type: String
   :Required: Yes
   :Example: ``rack``

``class``
   :Description: The device class on which data is to be placed.
   :Type: String
   :Required: No
   :Example: ``ssd``
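
Putting the example values above together, the following commands would create
a rule named ``rbd-rule`` that places replicas on ``ssd``-class devices under
the ``default`` root with ``rack`` as the failure domain, and would then assign
it to a (hypothetical) pool named ``rbd``:

.. prompt:: bash $

   ceph osd crush rule create-replicated rbd-rule default rack ssd
   ceph osd pool set rbd crush_rule rbd-rule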

Creating a rule for an erasure-coded pool
-----------------------------------------

For an erasure-coded pool, similar decisions need to be made: what the failure
domain is, which node in the hierarchy data will be placed under (usually
``default``), and whether placement is restricted to a specific device class.
However, erasure-code pools are created in a different way: there is a need to
construct them carefully with reference to the erasure code plugin in use. For
this reason, these decisions must be incorporated into the **erasure-code
profile**. A CRUSH rule will then be created from the erasure-code profile,
either explicitly or automatically when the profile is used to create a pool.

To list the erasure-code profiles, run the following command:

.. prompt:: bash $

   ceph osd erasure-code-profile ls

To view a specific existing profile, run a command of the following form:

.. prompt:: bash $

   ceph osd erasure-code-profile get {profile-name}

Under normal conditions, profiles should never be modified; instead, a new
profile should be created and used when creating either a new pool or a new
rule for an existing pool.

An erasure-code profile consists of a set of key-value pairs. Most of these
key-value pairs govern the behavior of the erasure code that encodes data in
the pool. However, key-value pairs that begin with ``crush-`` govern the CRUSH
rule that is created.

The relevant erasure-code profile properties are as follows:

 * **crush-root**: the name of the CRUSH node under which to place data
   [default: ``default``].
 * **crush-failure-domain**: the CRUSH bucket type used in the distribution of
   erasure-coded shards [default: ``host``].
 * **crush-device-class**: the device class on which to place data [default:
   none, which means that all devices are used].
 * **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the
   number of erasure-code shards, affecting the resulting CRUSH rule.

After a profile is defined, you can create a CRUSH rule by running a command
of the following form:

.. prompt:: bash $

   ceph osd crush rule create-erasure {name} {profile-name}

.. note:: When creating a new pool, it is not necessary to create the rule
   explicitly. If only the erasure-code profile is specified and the rule
   argument is omitted, then Ceph will create the CRUSH rule automatically.
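
As a sketch of the complete workflow, the following defines a (hypothetical)
profile named ``ec42`` and then derives a CRUSH rule from it explicitly; the
``crush-*`` keys are the properties described above:

.. prompt:: bash $

   ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=rack crush-device-class=hdd
   ceph osd crush rule create-erasure ec42-rule ec42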


Deleting rules
--------------

To delete rules that are not in use by pools, run a command of the following
form:

.. prompt:: bash $

   ceph osd crush rule rm {rule-name}

.. _crush-map-tunables:

Tunables
========

The CRUSH algorithm that is used to calculate the placement of data has been
improved over time. In order to support changes in behavior, we have provided
users with sets of tunables that determine which legacy or optimal version of
CRUSH is to be used.

In order to use newer tunables, all Ceph clients and daemons must support the
new major release of CRUSH. Because of this requirement, we have created
``profiles`` that are named after the Ceph version in which they were
introduced. For example, the ``firefly`` tunables were first supported by the
Firefly release and do not work with older clients (for example, clients
running Dumpling). After a cluster's tunables profile is changed from a legacy
set to a newer or ``optimal`` set, the ``ceph-mon`` and ``ceph-osd`` options
will prevent older clients that do not support the new CRUSH features from
connecting to the cluster.

argonaut (legacy)
-----------------

The legacy CRUSH behavior used by Argonaut and older releases works fine for
most clusters, provided that not many OSDs have been marked ``out``.

bobtail (CRUSH_TUNABLES2)
-------------------------

The ``bobtail`` tunable profile addresses the following problems with the
legacy behavior:

 * For hierarchies with a small number of devices in leaf buckets, some PGs
   might map to fewer than the desired number of replicas, resulting in
   ``undersized`` PGs. This is known to happen in the case of hierarchies with
   ``host`` nodes that have a small number of OSDs (1 to 3) nested beneath each
   host.

 * For large clusters, a small percentage of PGs might map to fewer than the
   desired number of OSDs. This is known to happen when there are multiple
   hierarchy layers in use (for example, ``row``, ``rack``, ``host``,
   ``osd``).

 * When one or more OSDs are marked ``out``, data tends to be redistributed
   to nearby OSDs instead of across the entire hierarchy.

The tunables introduced in the Bobtail release are as follows:

 * ``choose_local_tries``: Number of local retries. The legacy value is ``2``,
   and the optimal value is ``0``.

 * ``choose_local_fallback_tries``: The legacy value is ``5``, and the optimal
   value is ``0``.

 * ``choose_total_tries``: Total number of attempts to choose an item. The
   legacy value is ``19``, but subsequent testing indicates that a value of
   ``50`` is more appropriate for typical clusters. For extremely large
   clusters, an even larger value might be necessary.

 * ``chooseleaf_descend_once``: Whether a recursive ``chooseleaf`` attempt will
   retry, or try only once and allow the original placement to retry. The
   legacy default is ``0``, and the optimal value is ``1``.

Migration impact:

 * Moving from the ``argonaut`` tunables to the ``bobtail`` tunables triggers a
   moderate amount of data movement. Use caution on a cluster that is already
   populated with data.

firefly (CRUSH_TUNABLES3)
-------------------------

chooseleaf_vary_r
~~~~~~~~~~~~~~~~~

The ``firefly`` tunable profile fixes a problem with the ``chooseleaf`` CRUSH
step behavior. This problem arose when a large fraction of OSDs were marked
``out``, which resulted in PG mappings with too few OSDs.

This profile was introduced in the Firefly release, and adds a new tunable as
follows:

 * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will start
   with a non-zero value of ``r``, as determined by the number of attempts the
   parent has already made. The legacy default value is ``0``, but with this
   value CRUSH is sometimes unable to find a mapping. The optimal value (in
   terms of computational cost and correctness) is ``1``.

Migration impact:

 * For existing clusters that store a great deal of data, changing this tunable
   from ``0`` to ``1`` will trigger a large amount of data migration; a value
   of ``4`` or ``5`` will allow CRUSH to still find a valid mapping and will
   cause less data to move.

straw_calc_version tunable
~~~~~~~~~~~~~~~~~~~~~~~~~~

There were problems with the internal weights calculated and stored in the
CRUSH map for ``straw`` algorithm buckets. When there were buckets with a CRUSH
weight of ``0`` or with a mix of different and unique weights, CRUSH would
distribute data incorrectly (that is, not in proportion to the weights).

This tunable, introduced in the Firefly release, is as follows:

 * ``straw_calc_version``: A value of ``0`` preserves the old, broken
   internal-weight calculation; a value of ``1`` fixes the problem.

Migration impact:

 * Changing this tunable to a value of ``1`` and then adjusting a straw bucket
   (either by adding, removing, or reweighting an item or by using the
   reweight-all command) can trigger a small to moderate amount of data
   movement provided that the cluster has hit one of the problematic
   conditions.

This tunable option is notable in that it has absolutely no impact on the
required kernel version on the client side.

hammer (CRUSH_V4)
-----------------

The ``hammer`` tunable profile does not affect the mapping of existing CRUSH
maps simply by changing the profile. However:

 * There is a new bucket algorithm supported: ``straw2``. This new algorithm
   fixes several limitations in the original ``straw``. More specifically, the
   old ``straw`` buckets would change some mappings that should not have
   changed when a weight was adjusted, while ``straw2`` achieves the original
   goal of changing mappings only to or from the bucket item whose weight has
   changed.

 * The ``straw2`` type is the default type for any newly created buckets.

Migration impact:

 * Changing a bucket type from ``straw`` to ``straw2`` will trigger a small
   amount of data movement, depending on how much the bucket items' weights
   vary from each other. When the weights are all the same no data will move,
   and the more variance there is in the weights the more movement there will
   be.

jewel (CRUSH_TUNABLES5)
-----------------------

The ``jewel`` tunable profile improves the overall behavior of CRUSH. As a
result, significantly fewer mappings change when an OSD is marked ``out`` of
the cluster. This improvement results in significantly less data movement.

The new tunable introduced in the Jewel release is as follows:

 * ``chooseleaf_stable``: Determines whether a recursive chooseleaf attempt
   will use a better value for an inner loop that greatly reduces the number of
   mapping changes when an OSD is marked ``out``. The legacy value is ``0``,
   and the new value of ``1`` uses the new approach.

Migration impact:

 * Changing this value on an existing cluster will result in a very large
   amount of data movement because nearly every PG mapping is likely to change.

Client versions that support CRUSH_TUNABLES2
--------------------------------------------

 * v0.55 and later, including Bobtail (v0.56.x)
 * Linux kernel version v3.9 and later (for the CephFS and RBD kernel clients)

Client versions that support CRUSH_TUNABLES3
--------------------------------------------

 * v0.78 (Firefly) and later
 * Linux kernel version v3.15 and later (for the CephFS and RBD kernel clients)

Client versions that support CRUSH_V4
-------------------------------------

 * v0.94 (Hammer) and later
 * Linux kernel version v4.1 and later (for the CephFS and RBD kernel clients)

Client versions that support CRUSH_TUNABLES5
--------------------------------------------

 * v10.0.2 (Jewel) and later
 * Linux kernel version v4.5 and later (for the CephFS and RBD kernel clients)

"Non-optimal tunables" warning
------------------------------

In v0.74 and later versions, Ceph will raise a health check ("HEALTH_WARN crush
map has non-optimal tunables") if any of the current CRUSH tunables have
non-optimal values: that is, if any fail to have the optimal values from the
:ref:`default profile <rados_operations_crush_map_default_profile_definition>`.
There are two different ways to silence the alert:

1. Adjust the CRUSH tunables on the existing cluster so as to render them
   optimal. Making this adjustment will trigger some data movement
   (possibly as much as 10%). This approach is generally preferred to the
   other approach, but special care must be taken in situations where
   data movement might affect performance: for example, in production clusters.
   To enable optimal tunables, run the following command:

   .. prompt:: bash $

      ceph osd crush tunables optimal

   There are several potential problems that might make it preferable to revert
   to the previous values of the tunables. The new values might generate too
   much load for the cluster to handle, the new values might unacceptably slow
   the operation of the cluster, or there might be a client-compatibility
   problem. Such client-compatibility problems can arise when using old-kernel
   CephFS or RBD clients, or pre-Bobtail ``librados`` clients. To revert to
   the previous values of the tunables, run the following command:

   .. prompt:: bash $

      ceph osd crush tunables legacy

2. To silence the alert without making any changes to CRUSH, add the following
   option to the ``[mon]`` section of your ``ceph.conf`` file::

      mon_warn_on_legacy_crush_tunables = false

   In order for this change to take effect, you will need to either restart
   the monitors or run the following command to apply the option to the
   monitors while they are still running:

   .. prompt:: bash $

      ceph tell mon.\* config set mon_warn_on_legacy_crush_tunables false


Tuning CRUSH
------------

When making adjustments to CRUSH tunables, keep the following considerations in
mind:

 * Adjusting the values of CRUSH tunables will result in the shift of one or
   more PGs from one storage node to another. If the Ceph cluster is already
   storing a great deal of data, be prepared for significant data movement.
 * When the ``ceph-osd`` and ``ceph-mon`` daemons get the updated map, they
   immediately begin rejecting new connections from clients that do not support
   the new feature. However, already-connected clients are effectively
   grandfathered in, and any of these clients that do not support the new
   feature will malfunction.
 * If the CRUSH tunables are set to newer (non-legacy) values and subsequently
   reverted to the legacy values, ``ceph-osd`` daemons will not be required to
   support any of the newer CRUSH features associated with the newer
   (non-legacy) values. However, the OSD peering process requires the
   examination and understanding of old maps. For this reason, **if the cluster
   has previously used non-legacy CRUSH values, do not run old versions of
   the** ``ceph-osd`` **daemon** -- even if the latest version of the map has
   been reverted so as to use the legacy defaults.

The simplest way to adjust CRUSH tunables is to apply them in matched sets
known as *profiles*. As of the Octopus release, Ceph supports the following
profiles:

 * ``legacy``: The legacy behavior from argonaut and earlier.
 * ``argonaut``: The legacy values supported by the argonaut release.
 * ``bobtail``: The values supported by the bobtail release.
 * ``firefly``: The values supported by the firefly release.
 * ``hammer``: The values supported by the hammer release.
 * ``jewel``: The values supported by the jewel release.
 * ``optimal``: The best values for the current version of Ceph.

.. _rados_operations_crush_map_default_profile_definition:

 * ``default``: The default values of a new cluster that has been installed
   from scratch. These values, which depend on the current version of Ceph, are
   hardcoded and are typically a mix of optimal and legacy values. These
   values often correspond to the ``optimal`` profile of either the previous
   LTS (long-term service) release or the most recent release for which most
   users are expected to have up-to-date clients.

To apply a profile to a running cluster, run a command of the following form:

.. prompt:: bash $

   ceph osd crush tunables {PROFILE}
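
To check which tunables (and thus, effectively, which profile) a cluster is
currently using, you can dump them with:

.. prompt:: bash $

   ceph osd crush show-tunables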

This action might trigger a great deal of data movement. Consult release notes
and documentation before changing the profile on a running cluster. Consider
throttling recovery and backfill parameters in order to limit the backfill
resulting from a specific change.

.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf


Tuning Primary OSD Selection
============================

When a Ceph client reads or writes data, it first contacts the primary OSD in
each affected PG's acting set. By default, the first OSD in the acting set is
the primary OSD (also known as the "lead OSD"). For example, in the acting set
``[2, 3, 4]``, ``osd.2`` is listed first and is therefore the primary OSD.
However, sometimes it is clear that an OSD is not well suited to act as the
lead as compared with other OSDs (for example, if the OSD has a slow drive or a
slow controller). To prevent performance bottlenecks (especially on read
operations) and at the same time maximize the utilization of your hardware, you
can influence the selection of the primary OSD either by adjusting "primary
affinity" values, or by crafting a CRUSH rule that selects OSDs that are better
suited to act as the lead rather than other OSDs.

To determine whether tuning Ceph's selection of primary OSDs will improve
cluster performance, pool redundancy strategy must be taken into account. For
replicated pools, this tuning can be especially useful, because by default read
operations are served from the primary OSD of each PG. For erasure-coded pools,
however, the speed of read operations can be increased by enabling **fast
read** (see :ref:`pool-settings`).

.. _rados_ops_primary_affinity:

Primary Affinity
----------------

**Primary affinity** is a characteristic of an OSD that governs the likelihood
that a given OSD will be selected as the primary OSD (or "lead OSD") in a given
acting set. A primary affinity value can be any real number in the range ``0``
to ``1``, inclusive.

As an example of a common scenario in which it can be useful to adjust primary
affinity values, let us suppose that a cluster contains a mix of drive sizes:
for example, suppose it contains some older racks with 1.9 TB SATA SSDs and
some newer racks with 3.84 TB SATA SSDs. The latter will on average be assigned
twice the number of PGs and will thus serve twice the number of write and read
operations -- they will be busier than the former. In such a scenario, you
might make a rough assignment of primary affinity as inversely proportional to
OSD size. Such an assignment will not be 100% optimal, but it can readily
achieve a 15% improvement in overall read throughput by means of a more even
utilization of SATA interface bandwidth and CPU cycles. This example is not
merely a thought experiment meant to illustrate the theoretical benefits of
adjusting primary affinity values; this 15% improvement was achieved on an
actual Ceph cluster.

By default, every Ceph OSD has a primary affinity value of ``1``. In a cluster
in which every OSD has this default value, all OSDs are equally likely to act
as a primary OSD.

By reducing the value of a Ceph OSD's primary affinity, you make CRUSH less
likely to select the OSD as primary in a PG's acting set. To change the weight
value associated with a specific OSD's primary affinity, run a command of the
following form:

.. prompt:: bash $

   ceph osd primary-affinity <osd-id> <weight>
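
For example, to reduce the likelihood that a (hypothetical) ``osd.4`` is
selected as the primary OSD, you might run:

.. prompt:: bash $

   ceph osd primary-affinity osd.4 0.5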

The primary affinity of an OSD can be set to any real number in the range
``[0-1]`` inclusive, where ``0`` indicates that the OSD may not be used as
primary and ``1`` indicates that the OSD is maximally likely to be used as a
primary. When the weight is between these extremes, its value indicates roughly
how likely it is that CRUSH will select the OSD associated with it as a
primary.

The process by which CRUSH selects the lead OSD is not a mere function of a
simple probability determined by relative affinity values. Nevertheless,
measurable results can be achieved even with first-order approximations of
desirable primary affinity values.


Custom CRUSH Rules
------------------

Some clusters balance cost and performance by mixing SSDs and HDDs in the same
replicated pool. By setting the primary affinity of HDD OSDs to ``0``,
operations will be directed to an SSD OSD in each acting set. Alternatively,
you can define a CRUSH rule that always selects an SSD OSD as the primary OSD
and then selects HDDs for the remaining OSDs. Given this rule, each PG's acting
set will contain an SSD OSD as the primary and have the remaining OSDs on HDDs.

For example, see the following CRUSH rule::

    rule mixed_replicated_rule {
            id 11
            type replicated
            step take default class ssd
            step chooseleaf firstn 1 type host
            step emit
            step take default class hdd
            step chooseleaf firstn 0 type host
            step emit
    }

This rule chooses an SSD as the first OSD. For an ``N``-times replicated pool,
this rule selects ``N+1`` OSDs in order to guarantee that ``N`` copies are on
different hosts, because the first SSD OSD might be colocated with any of the
``N`` HDD OSDs.
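
A rule like the one above takes effect only once it has been assigned to a
pool; for a (hypothetical) replicated pool named ``mixedpool``, that assignment
would be:

.. prompt:: bash $

   ceph osd pool set mixedpool crush_rule mixed_replicated_rule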

To avoid this extra storage requirement, you might place SSDs and HDDs in
different hosts. However, taking this approach means that all client requests
will be received by hosts with SSDs. For this reason, it might be advisable to
have faster CPUs for SSD OSDs and more modest CPUs for HDD OSDs, since the
latter will under normal circumstances perform only recovery operations. Here
the CRUSH roots ``ssd_hosts`` and ``hdd_hosts`` are under a strict requirement
not to contain any of the same servers, as seen in the following CRUSH rule::

    rule mixed_replicated_rule_two {
            id 1
            type replicated
            step take ssd_hosts class ssd
            step chooseleaf firstn 1 type host
            step emit
            step take hdd_hosts class hdd
            step chooseleaf firstn -1 type host
            step emit
    }

.. note:: If a primary SSD OSD fails, then requests to the associated PG will
   be temporarily served from a slower HDD OSD until the PG's data has been
   replicated onto the replacement primary SSD OSD.