============
 CRUSH Maps
============

The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm
determines how to store and retrieve data by computing storage locations.
CRUSH empowers Ceph clients to communicate with OSDs directly rather than
through a centralized server or broker. With an algorithmically determined
method of storing and retrieving data, Ceph avoids a single point of failure, a
performance bottleneck, and a physical limit to its scalability.

CRUSH uses a map of your cluster (the CRUSH map) to pseudo-randomly
map data to OSDs, distributing it across the cluster according to configured
replication policy and failure domain. For a detailed discussion of CRUSH, see
`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_.

CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)`, a hierarchy
of 'buckets' for aggregating devices and buckets, and
rules that govern how CRUSH replicates data within the cluster's pools. By
reflecting the underlying physical organization of the installation, CRUSH can
model (and thereby address) the potential for correlated device failures.
Typical factors include chassis, racks, physical proximity, a shared power
source, and shared networking. By encoding this information into the cluster
map, CRUSH placement policies distribute object replicas across failure
domains while maintaining the desired distribution. For example, to address
the possibility of concurrent failures, it may be desirable to ensure that
data replicas are on devices using different shelves, racks, power supplies,
controllers, and/or physical locations.

When you deploy OSDs they are automatically added to the CRUSH map under a
``host`` bucket named for the node on which they run. This,
combined with the configured CRUSH failure domain, ensures that replicas or
erasure code shards are distributed across hosts and that a single host or
other failure will not affect availability. For larger clusters,
administrators must carefully consider their choice of failure domain.
Separating replicas across racks, for example, is typical for mid- to
large-sized clusters.


CRUSH Location
==============

The location of an OSD within the CRUSH map's hierarchy is
referred to as a ``CRUSH location``. This location specifier takes the
form of a list of key and value pairs. For
example, if an OSD is in a particular row, rack, chassis and host, and
is part of the 'default' CRUSH root (which is the case for most
clusters), its CRUSH location could be described as::

    root=default row=a rack=a2 chassis=a2a host=a2a1

Note that:

#. The order of the keys does not matter.
#. The key name (left of ``=``) must be a valid CRUSH ``type``. By default
   these include ``root``, ``datacenter``, ``room``, ``row``, ``pod``, ``pdu``,
   ``rack``, ``chassis`` and ``host``.
   These defined types suffice for almost all clusters, but can be customized
   by modifying the CRUSH map.
#. Not all keys need to be specified. For example, by default, Ceph
   automatically sets an OSD's location to be
   ``root=default host=HOSTNAME`` (based on the output from ``hostname -s``).

The CRUSH location for an OSD can be defined by adding the ``crush location``
option in ``ceph.conf``. Each time the OSD starts,
it verifies it is in the correct location in the CRUSH map and, if it is not,
it moves itself. To disable this automatic CRUSH map management, add the
following to your configuration file in the ``[osd]`` section::

    osd crush update on start = false

Note that in most cases you will not need to manually configure this.
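If you do need to pin an OSD's location explicitly, a minimal sketch of the
relevant ``ceph.conf`` fragment on that OSD's node might look like the
following, reusing the example location above (the bucket names are
illustrative and must correspond to types defined in your CRUSH map)::

    [osd]
    crush location = root=default row=a rack=a2 chassis=a2a host=a2a1
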

Custom location hooks
---------------------

A customized location hook can be used to generate a more complete
CRUSH location on startup. The CRUSH location is based on, in order
of preference:

#. A ``crush location`` option in ``ceph.conf``
#. A default of ``root=default host=HOSTNAME`` where the hostname is
   derived from the ``hostname -s`` command

A script can be written to provide additional
location fields (for example, ``rack`` or ``datacenter``) and the
hook enabled via the config option::

    crush location hook = /path/to/customized-ceph-crush-location

This hook is passed several arguments (below) and should output a single line
to ``stdout`` with the CRUSH location description::

    --cluster CLUSTER --id ID --type TYPE

where the cluster name is typically ``ceph``, the ``id`` is the daemon
identifier (for an OSD, the OSD number), and the daemon
type is ``osd``, ``mds``, etc.

For example, a simple hook that additionally specifies a rack location
based on a value in the file ``/etc/rack`` might be::

    #!/bin/sh
    echo "host=$(hostname -s) rack=$(cat /etc/rack) root=default"


CRUSH structure
===============

The CRUSH map consists of a hierarchy that describes
the physical topology of the cluster and a set of rules defining
data placement policy. The hierarchy has
devices (OSDs) at the leaves, and internal nodes
corresponding to other physical features or groupings: hosts, racks,
rows, datacenters, and so on. The rules describe how replicas are
placed in terms of that hierarchy (e.g., 'three replicas in different
racks').

Devices
-------

Devices are individual OSDs that store data, usually one for each storage drive.
Devices are identified by an ``id``
(a non-negative integer) and a ``name``, normally ``osd.N`` where ``N`` is the device id.

Since the Luminous release, devices may also have a *device class* assigned (e.g.,
``hdd`` or ``ssd`` or ``nvme``), allowing them to be conveniently targeted by
CRUSH rules. This is especially useful when mixing device types within hosts.

Types and Buckets
-----------------

A bucket is the CRUSH term for internal nodes in the hierarchy: hosts,
racks, rows, etc. The CRUSH map defines a series of *types* that are
used to describe these nodes. Default types include:

- ``osd`` (or ``device``)
- ``host``
- ``chassis``
- ``rack``
- ``row``
- ``pdu``
- ``pod``
- ``room``
- ``datacenter``
- ``zone``
- ``region``
- ``root``

Most clusters use only a handful of these types, and others
can be defined as needed.

The hierarchy is built with devices (normally type ``osd``) at the
leaves, interior nodes with non-device types, and a root node of type
``root``. For example,

.. ditaa::

                              +-----------------+
                              |{o}root default  |
                              +--------+--------+
                                       |
                       +---------------+---------------+
                       |                               |
                +------+------+                 +------+------+
                |{o}host foo  |                 |{o}host bar  |
                +------+------+                 +------+------+
                       |                               |
               +-------+-------+               +-------+-------+
               |               |               |               |
         +-----+-----+   +-----+-----+   +-----+-----+   +-----+-----+
         |   osd.0   |   |   osd.1   |   |   osd.2   |   |   osd.3   |
         +-----------+   +-----------+   +-----------+   +-----------+

Each node (device or bucket) in the hierarchy has a *weight*
that indicates the relative proportion of the total
data that device or hierarchy subtree should store. Weights are set
at the leaves, indicating the size of the device, and automatically
sum up the tree, such that the weight of the ``root`` node
will be the total of all devices contained beneath it. Normally
weights are in units of terabytes (TB).

You can get a simple view of the CRUSH hierarchy for your cluster,
including weights, with::

    ceph osd tree

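The output resembles the following sketch (purely illustrative: the names,
weights, and exact columns will vary by cluster and by Ceph release). Note how
each bucket's weight is the sum of the weights beneath it::

    ID  CLASS  WEIGHT   TYPE NAME      STATUS  REWEIGHT  PRI-AFF
    -1         8.00000  root default
    -3         4.00000      host foo
     0    hdd  2.00000          osd.0      up   1.00000  1.00000
     1    hdd  2.00000          osd.1      up   1.00000  1.00000
    -5         4.00000      host bar
     2    hdd  2.00000          osd.2      up   1.00000  1.00000
     3    hdd  2.00000          osd.3      up   1.00000  1.00000
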
Rules
-----

CRUSH Rules define policy about how data is distributed across the devices
in the hierarchy. They define placement and replication strategies or
distribution policies that allow you to specify exactly how CRUSH
places data replicas. For example, you might create a rule selecting
a pair of targets for two-way mirroring, another rule for selecting
three targets in two different data centers for three-way mirroring, and
yet another rule for erasure coding (EC) across six storage devices. For a
detailed discussion of CRUSH rules, refer to `CRUSH - Controlled,
Scalable, Decentralized Placement of Replicated Data`_, and more
specifically to **Section 3.2**.

CRUSH rules can be created via the CLI by
specifying the *pool type* they will be used for (replicated or
erasure coded), the *failure domain*, and optionally a *device class*.
In rare cases rules must be written by hand by manually editing the
CRUSH map.

You can see what rules are defined for your cluster with::

    ceph osd crush rule ls

You can view the contents of the rules with::

    ceph osd crush rule dump

Device classes
--------------

Each device can optionally have a *class* assigned. By
default, OSDs automatically set their class at startup to
``hdd``, ``ssd``, or ``nvme`` based on the type of device they are backed
by.

The device class for one or more OSDs can be explicitly set with::

    ceph osd crush set-device-class <class> <osd-name> [...]

Once a device class is set, it cannot be changed to another class
until the old class is unset with::

    ceph osd crush rm-device-class <osd-name> [...]

This allows administrators to set device classes without the class
being changed on OSD restart or by some other script.

A placement rule that targets a specific device class can be created with::

    ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>

A pool can then be changed to use the new rule with::

    ceph osd pool set <pool-name> crush_rule <rule-name>

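For example (all names here are illustrative), the following sequence
reclassifies two OSDs as ``nvme``, creates a replicated rule that targets only
that class, and directs a hypothetical pool named ``rbd`` to it::

    ceph osd crush rm-device-class osd.2 osd.3
    ceph osd crush set-device-class nvme osd.2 osd.3
    ceph osd crush rule create-replicated fast default host nvme
    ceph osd pool set rbd crush_rule fast
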
Device classes are implemented by creating a "shadow" CRUSH hierarchy
for each device class in use that contains only devices of that class.
CRUSH rules can then distribute data over the shadow hierarchy.
This approach is fully backward compatible with
old Ceph clients. You can view the CRUSH hierarchy with shadow items
with::

    ceph osd crush tree --show-shadow

For older clusters created before Luminous that relied on manually
crafted CRUSH maps to maintain per-device-type hierarchies, there is a
*reclassify* tool available to help transition to device classes
without triggering data movement (see :ref:`crush-reclassify`).


Weight sets
-----------

A *weight set* is an alternative set of weights to use when
calculating data placement. The normal weights associated with each
device in the CRUSH map are set based on the device size and indicate
how much data we *should* be storing where. However, because CRUSH is
a "probabilistic" pseudorandom placement process, there is always some
variation from this ideal distribution, in the same way that rolling a
die sixty times will not result in rolling exactly 10 ones and 10
sixes. Weight sets allow the cluster to perform numerical optimization
based on the specifics of your cluster (hierarchy, pools, etc.) to achieve
a balanced distribution.

There are two types of weight sets supported:

#. A **compat** weight set is a single alternative set of weights for
   each device and node in the cluster. This is not well-suited for
   correcting for all anomalies (for example, placement groups for
   different pools may be different sizes and have different load
   levels, but will be mostly treated the same by the balancer).
   However, compat weight sets have the huge advantage that they are
   *backward compatible* with previous versions of Ceph, which means
   that even though weight sets were first introduced in Luminous
   v12.2.z, older clients (e.g., Firefly) can still connect to the
   cluster when a compat weight set is being used to balance data.
#. A **per-pool** weight set is more flexible in that it allows
   placement to be optimized for each data pool. Additionally,
   weights can be adjusted for each position of placement, allowing
   the optimizer to correct for a subtle skew of data toward devices
   with small weights relative to their peers (an effect that is
   usually only apparent in very large clusters but which can cause
   balancing problems).

When weight sets are in use, the weights associated with each node in
the hierarchy are visible as a separate column (labeled either
``(compat)`` or the pool name) in the output of::

    ceph osd tree

When both *compat* and *per-pool* weight sets are in use, data
placement for a particular pool will use its own per-pool weight set
if present. If not, it will use the compat weight set if present. If
neither is present, it will use the normal CRUSH weights.

Although weight sets can be set up and manipulated by hand, it is
recommended that the ``ceph-mgr`` *balancer* module be enabled to do so
automatically when running Luminous or later releases.
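
As a sketch (the balancer module and its ``crush-compat`` mode are available
in Luminous and later; verify the exact commands against your release), the
balancer can be told to maintain a compat weight set automatically with::

    ceph mgr module enable balancer
    ceph balancer mode crush-compat
    ceph balancer on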


Modifying the CRUSH map
=======================

.. _addosd:

Add/Move an OSD
---------------

.. note:: OSDs are normally automatically added to the CRUSH map when
   the OSD is created. This command is rarely needed.

To add or move an OSD in the CRUSH map of a running cluster::

    ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...]

Where:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``


``weight``

:Description: The CRUSH weight for the OSD, normally its size measured in terabytes (TB).
:Type: Double
:Required: Yes
:Example: ``2.0``


``root``

:Description: The root node of the tree in which the OSD resides (normally ``default``)
:Type: Key/value pair.
:Required: Yes
:Example: ``root=default``


``bucket-type``

:Description: You may specify the OSD's location in the CRUSH hierarchy.
:Type: Key/value pairs.
:Required: No
:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``


The following example adds ``osd.0`` to the hierarchy, or moves the
OSD from a previous location::

    ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1


Adjust OSD weight
-----------------

.. note:: Normally OSDs automatically add themselves to the CRUSH map
   with the correct weight when they are created. This command
   is rarely needed.

To adjust an OSD's CRUSH weight in the CRUSH map of a running cluster, execute
the following::

    ceph osd crush reweight {name} {weight}

Where:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``


``weight``

:Description: The CRUSH weight for the OSD.
:Type: Double
:Required: Yes
:Example: ``2.0``

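For example, after replacing the drive behind ``osd.0`` with a 2 TB device
(values shown are illustrative), its CRUSH weight could be set to match::

    ceph osd crush reweight osd.0 2.0
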

.. _removeosd:

Remove an OSD
-------------

.. note:: OSDs are normally removed from the CRUSH map as part of the
   ``ceph osd purge`` command. This command is rarely needed.

To remove an OSD from the CRUSH map of a running cluster, execute the
following::

    ceph osd crush remove {name}

Where:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``

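For example, to remove ``osd.0`` (shown purely as an illustration) from the
CRUSH map::

    ceph osd crush remove osd.0
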

Add a Bucket
------------

.. note:: Buckets are implicitly created when an OSD is added
   that specifies a ``{bucket-type}={bucket-name}`` as part of its
   location, if a bucket with that name does not already exist. This
   command is typically used when manually adjusting the structure of the
   hierarchy after OSDs have been created. One use is to move a
   series of hosts underneath a new rack-level bucket; another is to
   add new ``host`` buckets (OSD nodes) to a dummy ``root`` so that they don't
   receive data until you're ready, at which time you would move them to the
   ``default`` or other root as described below.

To add a bucket in the CRUSH map of a running cluster, execute the
``ceph osd crush add-bucket`` command::

    ceph osd crush add-bucket {bucket-name} {bucket-type}

Where:

``bucket-name``

:Description: The full name of the bucket.
:Type: String
:Required: Yes
:Example: ``rack12``


``bucket-type``

:Description: The type of the bucket. The type must already exist in the hierarchy.
:Type: String
:Required: Yes
:Example: ``rack``


The following example adds the ``rack12`` bucket to the hierarchy::

    ceph osd crush add-bucket rack12 rack

Move a Bucket
-------------

To move a bucket to a different location or position in the CRUSH map
hierarchy, execute the following::

    ceph osd crush move {bucket-name} {bucket-type}={bucket-name} [...]

Where:

``bucket-name``

:Description: The name of the bucket to move/reposition.
:Type: String
:Required: Yes
:Example: ``foo-bar-1``

``bucket-type``

:Description: You may specify the bucket's location in the CRUSH hierarchy.
:Type: Key/value pairs.
:Required: No
:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``

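For example, assuming that ``dc1``, ``room1``, and row ``foo`` buckets already
exist (the names are illustrative), the ``rack12`` bucket added above could be
placed beneath them with::

    ceph osd crush move rack12 datacenter=dc1 room=room1 row=foo
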
Remove a Bucket
---------------

To remove a bucket from the CRUSH hierarchy, execute the following::

    ceph osd crush remove {bucket-name}

.. note:: A bucket must be empty before removing it from the CRUSH hierarchy.

Where:

``bucket-name``

:Description: The name of the bucket that you'd like to remove.
:Type: String
:Required: Yes
:Example: ``rack12``

The following example removes the ``rack12`` bucket from the hierarchy::

    ceph osd crush remove rack12

Creating a compat weight set
----------------------------

.. note:: This step is normally done automatically by the ``balancer``
   module when enabled.

To create a *compat* weight set::

    ceph osd crush weight-set create-compat

Weights for the compat weight set can be adjusted with::

    ceph osd crush weight-set reweight-compat {name} {weight}

The compat weight set can be destroyed with::

    ceph osd crush weight-set rm-compat

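A minimal manual sequence might look like the following (the OSD name and
weight are illustrative; in practice the balancer module performs these
adjustments for you)::

    ceph osd crush weight-set create-compat
    ceph osd crush weight-set reweight-compat osd.0 0.85
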
Creating per-pool weight sets
-----------------------------

To create a weight set for a specific pool::

    ceph osd crush weight-set create {pool-name} {mode}

.. note:: Per-pool weight sets require that all servers and daemons
   run Luminous v12.2.z or later.

Where:

``pool-name``

:Description: The name of a RADOS pool
:Type: String
:Required: Yes
:Example: ``rbd``

``mode``

:Description: Either ``flat`` or ``positional``. A *flat* weight set
              has a single weight for each device or bucket. A
              *positional* weight set has a potentially different
              weight for each position in the resulting placement
              mapping. For example, if a pool has a replica count of
              3, then a positional weight set will have three weights
              for each device and bucket.
:Type: String
:Required: Yes
:Example: ``flat``

To adjust the weight of an item in a weight set::

    ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]}

To list existing weight sets::

    ceph osd crush weight-set ls

To remove a weight set::

    ceph osd crush weight-set rm {pool-name}

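For example, assuming a three-way replicated pool named ``rbd`` (hypothetical),
a positional weight set could be created and one OSD's weights adjusted as
follows; with ``positional`` mode one weight is supplied per replica position::

    ceph osd crush weight-set create rbd positional
    ceph osd crush weight-set reweight rbd osd.0 0.9 1.0 1.0
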
Creating a rule for a replicated pool
-------------------------------------

For a replicated pool, the primary decision when creating the CRUSH
rule is what the failure domain is going to be. For example, if a
failure domain of ``host`` is selected, then CRUSH will ensure that
each replica of the data is stored on a unique host. If ``rack``
is selected, then each replica will be stored in a different rack.
What failure domain you choose primarily depends on the size and
topology of your cluster.

In most cases the entire cluster hierarchy is nested beneath a root node
named ``default``. If you have customized your hierarchy, you may
want to create a rule nested at some other node in the hierarchy. It
doesn't matter what type is associated with that node (it doesn't have
to be a ``root`` node).

It is also possible to create a rule that restricts data placement to
a specific *class* of device. By default, Ceph OSDs automatically
classify themselves as either ``hdd`` or ``ssd``, depending on the
underlying type of device being used. These classes can also be
customized.

To create a replicated rule::

    ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}]

Where:

``name``

:Description: The name of the rule
:Type: String
:Required: Yes
:Example: ``rbd-rule``

``root``

:Description: The name of the node under which data should be placed.
:Type: String
:Required: Yes
:Example: ``default``

``failure-domain-type``

:Description: The type of CRUSH nodes across which we should separate replicas.
:Type: String
:Required: Yes
:Example: ``rack``

``class``

:Description: The device class on which data should be placed.
:Type: String
:Required: No
:Example: ``ssd``

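Putting the example values above together (all of them illustrative), the
following creates a rule named ``rbd-rule`` that separates replicas across
racks under the ``default`` root using only ``ssd`` devices, and then points a
hypothetical pool at it::

    ceph osd crush rule create-replicated rbd-rule default rack ssd
    ceph osd pool set rbd crush_rule rbd-rule
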
Creating a rule for an erasure coded pool
-----------------------------------------

For an erasure-coded (EC) pool, the same basic decisions need to be made:
what is the failure domain, which node in the
hierarchy will data be placed under (usually ``default``), and will
placement be restricted to a specific device class. Erasure code
pools are created a bit differently, however, because they need to be
constructed carefully based on the erasure code being used. For this reason,
you must include this information in the *erasure code profile*. A CRUSH
rule will then be created from that either explicitly or automatically when
the profile is used to create a pool.

The erasure code profiles can be listed with::

    ceph osd erasure-code-profile ls

An existing profile can be viewed with::

    ceph osd erasure-code-profile get {profile-name}

Normally profiles should never be modified; instead, a new profile
should be created and used when creating a new pool or creating a new
rule for an existing pool.

An erasure code profile consists of a set of key=value pairs. Most of
these control the behavior of the erasure code that is encoding data
in the pool. Those that begin with ``crush-``, however, affect the
CRUSH rule that is created.

The erasure code profile properties of interest are:

* **crush-root**: the name of the CRUSH node under which to place data [default: ``default``].
* **crush-failure-domain**: the CRUSH bucket type across which to distribute erasure-coded shards [default: ``host``].
* **crush-device-class**: the device class on which to place data [default: none, meaning all devices are used].
* **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the number of erasure code shards, affecting the resulting CRUSH rule.

Once a profile is defined, you can create a CRUSH rule with::

    ceph osd crush rule create-erasure {name} {profile-name}

.. note:: When creating a new pool, it is not actually necessary to
   explicitly create the rule. If the erasure code profile alone is
   specified and the rule argument is left off then Ceph will create
   the CRUSH rule automatically.

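For example, the following sketch (the profile, rule, and pool names are
hypothetical, and the placement group counts are placeholders) defines a 4+2
profile that spreads shards across racks on ``hdd`` devices, creates a rule
from it, and creates a pool using both::

    ceph osd erasure-code-profile set myprofile k=4 m=2 crush-failure-domain=rack crush-device-class=hdd
    ceph osd crush rule create-erasure ec-rack-rule myprofile
    ceph osd pool create ecpool 64 64 erasure myprofile ec-rack-rule
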
Deleting rules
--------------

Rules that are not in use by pools can be deleted with::

    ceph osd crush rule rm {rule-name}


.. _crush-map-tunables:

Tunables
========

Over time, we have made (and continue to make) improvements to the
CRUSH algorithm used to calculate the placement of data. In order to
support the change in behavior, we have introduced a series of tunable
options that control whether the legacy or improved variation of the
algorithm is used.

In order to use newer tunables, both clients and servers must support
the new version of CRUSH. For this reason, we have created
``profiles`` that are named after the Ceph version in which they were
introduced. For example, the ``firefly`` tunables are first supported
by the Firefly release, and will not work with older (e.g., Dumpling)
clients. Once a given set of tunables is changed from the legacy
default behavior, the ``ceph-mon`` and ``ceph-osd`` daemons will prevent
older clients that do not support the new CRUSH features from connecting to
the cluster.

argonaut (legacy)
-----------------

The legacy CRUSH behavior used by Argonaut and older releases works
fine for most clusters, provided there are not many OSDs that have
been marked out.

bobtail (CRUSH_TUNABLES2)
-------------------------

The ``bobtail`` tunable profile fixes a few key misbehaviors:

* For hierarchies with a small number of devices in the leaf buckets,
  some PGs map to fewer than the desired number of replicas. This
  commonly happens for hierarchies with "host" nodes with a small
  number (1-3) of OSDs nested beneath each one.

* For large clusters, some small percentages of PGs map to fewer than
  the desired number of OSDs. This is more prevalent when there are
  multiple hierarchy layers in use (e.g., ``row``, ``rack``, ``host``, ``osd``).

* When some OSDs are marked out, the data tends to get redistributed
  to nearby OSDs instead of across the entire hierarchy.

The new tunables are:

* ``choose_local_tries``: Number of local retries. Legacy value is
  2, optimal value is 0.

* ``choose_local_fallback_tries``: Legacy value is 5, optimal value
  is 0.

* ``choose_total_tries``: Total number of attempts to choose an item.
  Legacy value was 19, subsequent testing indicates that a value of
  50 is more appropriate for typical clusters. For extremely large
  clusters, a larger value might be necessary.

* ``chooseleaf_descend_once``: Whether a recursive chooseleaf attempt
  will retry, or only try once and allow the original placement to
  retry. Legacy default is 0, optimal value is 1.

Migration impact:

* Moving from ``argonaut`` to ``bobtail`` tunables triggers a moderate amount
  of data movement. Use caution on a cluster that is already
  populated with data.

firefly (CRUSH_TUNABLES3)
-------------------------

The ``firefly`` tunable profile fixes a problem
with ``chooseleaf`` CRUSH rule behavior that tends to result in PG
mappings with too few results when too many OSDs have been marked out.

The new tunable is:

* ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will
  start with a non-zero value of ``r``, based on how many attempts the
  parent has already made. Legacy default is ``0``, but with this value
  CRUSH is sometimes unable to find a mapping. The optimal value (in
  terms of computational cost and correctness) is ``1``.

Migration impact:

* For existing clusters that house lots of data, changing
  from ``0`` to ``1`` will cause a lot of data to move; a value of ``4`` or ``5``
  will allow CRUSH to still find a valid mapping but will cause less data
  to move.

straw_calc_version tunable (introduced with Firefly too)
--------------------------------------------------------

There were some problems with the internal weights calculated and
stored in the CRUSH map for ``straw`` algorithm buckets. Specifically, when
there were items with a CRUSH weight of ``0``, or a mix of different and
unique weights, CRUSH would distribute data incorrectly (i.e.,
not in proportion to the weights).

The new tunable is:

* ``straw_calc_version``: A value of ``0`` preserves the old, broken
  internal weight calculation; a value of ``1`` fixes the behavior.

Migration impact:

* Moving to ``straw_calc_version`` ``1`` and then adjusting a straw bucket
  (by adding, removing, or reweighting an item, or by using the
  reweight-all command) can trigger a small to moderate amount of
  data movement *if* the cluster has hit one of the problematic
  conditions.

This tunable option is special because it has no impact at all on the
kernel version that clients must run.

hammer (CRUSH_V4)
-----------------

The ``hammer`` tunable profile does not affect the
mapping of existing CRUSH maps simply by changing the profile. However:

* There is a new bucket algorithm (``straw2``) supported. The new
  ``straw2`` bucket algorithm fixes several limitations in the original
  ``straw``. Specifically, the old ``straw`` buckets would
  change some mappings that should not have changed when a weight was
  adjusted, while ``straw2`` achieves the original goal of only
  changing mappings to or from the bucket item whose weight has
  changed.

* ``straw2`` is the default for any newly created buckets.

Migration impact:

* Changing a bucket type from ``straw`` to ``straw2`` will result in
  a reasonably small amount of data movement, depending on how much
  the bucket item weights vary from each other. When the weights are
  all the same no data will move, and when item weights vary
  significantly there will be more movement.

jewel (CRUSH_TUNABLES5)
-----------------------

The ``jewel`` tunable profile improves the
overall behavior of CRUSH such that significantly fewer mappings
change when an OSD is marked out of the cluster. This results in
significantly less data movement.

The new tunable is:

* ``chooseleaf_stable``: Whether a recursive chooseleaf attempt will
  use a better value for an inner loop that greatly reduces the number
  of mapping changes when an OSD is marked out. The legacy value is ``0``,
  while the new value of ``1`` uses the new approach.

Migration impact:

* Changing this value on an existing cluster will result in a very
  large amount of data movement as almost every PG mapping is likely
  to change.


Which client versions support CRUSH_TUNABLES
--------------------------------------------

* argonaut series, v0.48.1 or later
* v0.49 or later
* Linux kernel version v3.6 or later (for the file system and RBD kernel clients)

Which client versions support CRUSH_TUNABLES2
---------------------------------------------

* v0.55 or later, including bobtail series (v0.56.x)
* Linux kernel version v3.9 or later (for the file system and RBD kernel clients)

Which client versions support CRUSH_TUNABLES3
---------------------------------------------

* v0.78 (firefly) or later
* Linux kernel version v3.15 or later (for the file system and RBD kernel clients)

Which client versions support CRUSH_V4
--------------------------------------

* v0.94 (hammer) or later
* Linux kernel version v4.1 or later (for the file system and RBD kernel clients)

Which client versions support CRUSH_TUNABLES5
---------------------------------------------

* v10.0.2 (jewel) or later
* Linux kernel version v4.5 or later (for the file system and RBD kernel clients)

Warning when tunables are non-optimal
-------------------------------------

Starting with version v0.74, Ceph will issue a health warning if the
current CRUSH tunables don't include all the optimal values from the
``default`` profile (see below for the meaning of the ``default`` profile).
To make this warning go away, you have two options:

1. Adjust the tunables on the existing cluster. Note that this will
   result in some data movement (possibly as much as 10%). This is the
   preferred route, but should be taken with care on a production cluster
   where the data movement may affect performance. You can enable optimal
   tunables with::

       ceph osd crush tunables optimal

   If things go poorly (e.g., too much load) and not very much
   progress has been made, or there is a client compatibility problem
   (old kernel CephFS or RBD clients, or pre-Bobtail ``librados``
   clients), you can switch back with::

       ceph osd crush tunables legacy

2. You can make the warning go away without making any changes to CRUSH by
   adding the following option to your ceph.conf ``[mon]`` section::

       mon warn on legacy crush tunables = false

   For the change to take effect, you will need to restart the monitors, or
   apply the option to running monitors with::

       ceph tell mon.\* config set mon_warn_on_legacy_crush_tunables false


A few important points
----------------------

* Adjusting these values will result in the shift of some PGs between
  storage nodes. If the Ceph cluster is already storing a lot of
  data, be prepared for some fraction of the data to move.
* The ``ceph-osd`` and ``ceph-mon`` daemons will start requiring the
  feature bits of new connections as soon as they get
  the updated map. However, already-connected clients are
  effectively grandfathered in, and will misbehave if they do not
  support the new feature.
* If the CRUSH tunables are set to non-legacy values and then later
  changed back to the default values, ``ceph-osd`` daemons will not be
  required to support the feature. However, the OSD peering process
  requires examining and understanding old maps. Therefore, you
  should not run old versions of the ``ceph-osd`` daemon
  if the cluster has previously used non-legacy CRUSH values, even if
  the latest version of the map has been switched back to using the
  legacy defaults.

Tuning CRUSH
------------

The simplest way to adjust CRUSH tunables is by applying them in matched
sets known as *profiles*. As of the Octopus release these are:

* ``legacy``: the legacy behavior from argonaut and earlier.
* ``argonaut``: the legacy values supported by the original argonaut release
* ``bobtail``: the values supported by the bobtail release
* ``firefly``: the values supported by the firefly release
* ``hammer``: the values supported by the hammer release
* ``jewel``: the values supported by the jewel release
* ``optimal``: the best (i.e., optimal) values of the current version of Ceph
* ``default``: the default values of a new cluster installed from
  scratch. These values, which depend on the current version of Ceph,
  are hardcoded and are generally a mix of optimal and legacy values.
  These values generally match the ``optimal`` profile of the previous
  LTS release, or the most recent release for which we expect
  most users to have up-to-date clients.

You can apply a profile to a running cluster with the command::

    ceph osd crush tunables {PROFILE}

Note that this may result in data movement, potentially quite a bit. Study
release notes and documentation carefully before changing the profile on a
running cluster, and consider throttling recovery/backfill parameters to
limit the impact of a bolus of backfill.


.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf


Primary Affinity
================

When a Ceph Client reads or writes data, it first contacts the primary OSD in
each affected PG's acting set. By default, the first OSD in the acting set is
the primary. For example, in the acting set ``[2, 3, 4]``, ``osd.2`` is
listed first and thus is the primary (aka lead) OSD. Sometimes we know that an
OSD is less well suited to act as the lead than are other OSDs (e.g., it has
a slow drive or a slow controller). To prevent performance bottlenecks
(especially on read operations) while maximizing utilization of your hardware,
you can influence the selection of primary OSDs by adjusting primary affinity
values, or by crafting a CRUSH rule that selects preferred OSDs first.

Tuning primary OSD selection is mainly useful for replicated pools, because
by default read operations are served from the primary OSD for each PG.
For erasure coded (EC) pools, a way to speed up read operations is to enable
**fast read** as described in :ref:`pool-settings`.

A common scenario for primary affinity is when a cluster contains
a mix of drive sizes, for example older racks with 1.9 TB SATA SSDs and newer
racks with 3.84 TB SATA SSDs. On average the latter will be assigned double
the number of PGs and thus will serve double the number of write and read
operations, making them busier than the former. A rough assignment of primary
affinity inversely proportional to OSD size won't be 100% optimal, but it can
readily achieve a 15% improvement in overall read throughput by utilizing SATA
interface bandwidth and CPU cycles more evenly.

By default, all Ceph OSDs have a primary affinity of ``1``, which indicates that
any OSD may act as a primary with equal probability.

You can reduce a Ceph OSD's primary affinity so that CRUSH is less likely to choose
the OSD as primary in a PG's acting set::

    ceph osd primary-affinity <osd-id> <weight>

You may set an OSD's primary affinity to a real number in the range
``[0-1]``, where ``0`` indicates that the OSD may **NOT** be used as a primary
and ``1`` indicates that an OSD may be used as a primary. When the weight is
between these extremes, it is less likely that
CRUSH will select that OSD as a primary. The process for
selecting the lead OSD is more nuanced than a simple probability based on
relative affinity values, but measurable results can be achieved even with
first-order approximations of desirable values.
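
For example, to halve the likelihood that ``osd.12`` (an arbitrary example
OSD) is selected as a primary, and to prevent ``osd.13`` from ever acting as
primary::

    ceph osd primary-affinity osd.12 0.5
    ceph osd primary-affinity osd.13 0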

Custom CRUSH Rules
------------------

There are occasional clusters that balance cost and performance by mixing SSDs
and HDDs in the same replicated pool. By setting the primary affinity of HDD
OSDs to ``0`` one can direct operations to the SSD in each acting set. An
alternative is to define a CRUSH rule that always selects an SSD OSD as the
first OSD, then selects HDDs for the remaining OSDs. Thus, each PG's acting
set will contain exactly one SSD OSD as the primary with the balance on HDDs.

For example, the CRUSH rule below::

    rule mixed_replicated_rule {
            id 11
            type replicated
            min_size 1
            max_size 10
            step take default class ssd
            step chooseleaf firstn 1 type host
            step emit
            step take default class hdd
            step chooseleaf firstn 0 type host
            step emit
    }

chooses an SSD as the first OSD. Note that for an ``N``-times replicated pool
this rule selects ``N+1`` OSDs to guarantee that ``N`` copies are on different
hosts, because the first SSD OSD might be co-located with any of the ``N`` HDD
OSDs.

This extra storage requirement can be avoided by placing SSDs and HDDs in
different hosts with the tradeoff that hosts with SSDs will receive all client
requests. You may thus consider faster CPU(s) for SSD hosts and more modest
ones for HDD nodes, since the latter will normally only service recovery
operations. Here the CRUSH roots ``ssd_hosts`` and ``hdd_hosts`` strictly
must not contain the same servers::

    rule mixed_replicated_rule_two {
            id 1
            type replicated
            min_size 1
            max_size 10
            step take ssd_hosts class ssd
            step chooseleaf firstn 1 type host
            step emit
            step take hdd_hosts class hdd
            step chooseleaf firstn -1 type host
            step emit
    }

Note also that on failure of an SSD, requests to a PG will be served temporarily
from a (slower) HDD OSD until the PG's data has been replicated onto the replacement
primary SSD OSD.
7c673cae | 1057 |