============
CRUSH Maps
============

The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm
computes storage locations in order to determine how to store and retrieve
data. CRUSH allows Ceph clients to communicate with OSDs directly rather than
through a centralized server or broker. By using an algorithmically-determined
method of storing and retrieving data, Ceph avoids a single point of failure, a
performance bottleneck, and a physical limit to its scalability.

CRUSH uses a map of the cluster (the CRUSH map) to map data to OSDs,
distributing the data across the cluster in accordance with configured
replication policy and failure domains. For a detailed discussion of CRUSH, see
`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_

CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)` and a
hierarchy of "buckets" (``host``\s, ``rack``\s) and rules that govern how CRUSH
replicates data within the cluster's pools. By reflecting the underlying
physical organization of the installation, CRUSH can model (and thereby
address) the potential for correlated device failures. Some factors relevant
to the CRUSH hierarchy include chassis, racks, physical proximity, a shared
power source, shared networking, and failure domains. By encoding this
information into the CRUSH map, CRUSH placement policies distribute object
replicas across failure domains while maintaining the desired distribution. For
example, to address the possibility of concurrent failures, it might be
desirable to ensure that data replicas are on devices that reside in or rely
upon different shelves, racks, power supplies, controllers, or physical
locations.

When OSDs are deployed, they are automatically added to the CRUSH map under a
``host`` bucket that is named for the node on which the OSDs run. This
behavior, combined with the configured CRUSH failure domain, ensures that
replicas or erasure-code shards are distributed across hosts and that the
failure of a single host or other kinds of failures will not affect
availability. For larger clusters, administrators must carefully consider their
choice of failure domain. For example, distributing replicas across racks is
typical for mid- to large-sized clusters.

CRUSH Location
==============

The location of an OSD within the CRUSH map's hierarchy is referred to as its
``CRUSH location``. The specification of a CRUSH location takes the form of a
list of key-value pairs. For example, if an OSD is in a particular row, rack,
chassis, and host, and is also part of the 'default' CRUSH root (which is the
case for most clusters), its CRUSH location can be specified as follows::

   root=default row=a rack=a2 chassis=a2a host=a2a1

.. note::

   #. The order of the keys does not matter.
   #. The key name (left of ``=``) must be a valid CRUSH ``type``. By default,
      valid CRUSH types include ``root``, ``datacenter``, ``room``, ``row``,
      ``pod``, ``pdu``, ``rack``, ``chassis``, and ``host``. These defined
      types suffice for nearly all clusters, but can be customized by
      modifying the CRUSH map.
   #. Not all keys need to be specified. For example, by default, Ceph
      automatically sets an ``OSD``'s location as ``root=default
      host=HOSTNAME`` (as determined by the output of ``hostname -s``).

The CRUSH location for an OSD can be modified by adding the ``crush location``
option in ``ceph.conf``. When this option has been added, every time the OSD
starts it verifies that it is in the correct location in the CRUSH map and
moves itself if it is not. To disable this automatic CRUSH map management, add
the following to the ``ceph.conf`` configuration file in the ``[osd]``
section::

   osd crush update on start = false

Note that this action is unnecessary in most cases.
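
If you do set the ``crush location`` option explicitly, a minimal entry in the
``[osd]`` section of a given node's configuration might look like the
following sketch (the rack name ``b2`` and host name ``ceph-node-1`` are
placeholders for values from your own hierarchy)::

   [osd]
   crush location = root=default rack=b2 host=ceph-node-1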

Custom location hooks
---------------------

A custom location hook can be used to generate a more complete CRUSH location
on startup. The CRUSH location is determined by, in order of preference:

#. A ``crush location`` option in ``ceph.conf``
#. A default of ``root=default host=HOSTNAME`` where the hostname is determined
   by the output of the ``hostname -s`` command

A script can be written to provide additional location fields (for example,
``rack`` or ``datacenter``) and the hook can be enabled via the following
config option::

   crush location hook = /path/to/customized-ceph-crush-location

This hook is passed several arguments (see below) and outputs a single line to
``stdout`` that contains the CRUSH location description. The arguments passed
to the hook resemble the following::

   --cluster CLUSTER --id ID --type TYPE

Here the cluster name is typically ``ceph``, the ``id`` is the daemon
identifier or (in the case of OSDs) the OSD number, and the daemon type is
``osd``, ``mds``, ``mgr``, or ``mon``.

For example, a simple hook that specifies a rack location via a value in the
file ``/etc/rack`` might be as follows::

   #!/bin/sh
   echo "host=$(hostname -s) rack=$(cat /etc/rack) root=default"
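
Assuming a hook like this has been installed at the path configured above
(``/path/to/customized-ceph-crush-location`` is only a placeholder), it must be
executable by the OSD. A quick manual sanity check might invoke it with the
same arguments the daemon would pass:

.. prompt:: bash #

   chmod +x /path/to/customized-ceph-crush-location
   /path/to/customized-ceph-crush-location --cluster ceph --id 0 --type osd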


CRUSH structure
===============

The CRUSH map consists of (1) a hierarchy that describes the physical topology
of the cluster and (2) a set of rules that defines data placement policy. The
hierarchy has devices (OSDs) at the leaves and internal nodes corresponding to
other physical features or groupings: hosts, racks, rows, data centers, and so
on. The rules determine how replicas are placed in terms of that hierarchy (for
example, 'three replicas in different racks').

Devices
-------

Devices are individual OSDs that store data (usually one device for each
storage drive). Devices are identified by an ``id`` (a non-negative integer)
and a ``name`` (usually ``osd.N``, where ``N`` is the device's ``id``).

In Luminous and later releases, OSDs can have a *device class* assigned (for
example, ``hdd`` or ``ssd`` or ``nvme``), allowing them to be targeted by CRUSH
rules. Device classes are especially useful when mixing device types within
hosts.

.. _crush_map_default_types:

Types and Buckets
-----------------

"Bucket", in the context of CRUSH, is a term for any of the internal nodes in
the hierarchy: hosts, racks, rows, and so on. The CRUSH map defines a series of
*types* that are used to identify these nodes. Default types include:

- ``osd`` (or ``device``)
- ``host``
- ``chassis``
- ``rack``
- ``row``
- ``pdu``
- ``pod``
- ``room``
- ``datacenter``
- ``zone``
- ``region``
- ``root``

Most clusters use only a handful of these types, and other types can be defined
as needed.

The hierarchy is built with devices (normally of type ``osd``) at the leaves
and non-device types as the internal nodes. The root node is of type ``root``.
For example:

.. ditaa::

                           +-----------------+
                           |{o}root default  |
                           +--------+--------+
                                    |
                    +---------------+---------------+
                    |                               |
             +------+------+                 +------+------+
             |{o}host foo  |                 |{o}host bar  |
             +------+------+                 +------+------+
                    |                               |
            +-------+-------+               +-------+-------+
            |               |               |               |
      +-----+-----+   +-----+-----+   +-----+-----+   +-----+-----+
      |   osd.0   |   |   osd.1   |   |   osd.2   |   |   osd.3   |
      +-----------+   +-----------+   +-----------+   +-----------+


Each node (device or bucket) in the hierarchy has a *weight* that indicates the
relative proportion of the total data that should be stored by that device or
hierarchy subtree. Weights are set at the leaves, indicating the size of the
device. These weights automatically sum in an 'up the tree' direction: that is,
the weight of the ``root`` node will be the sum of the weights of all devices
contained under it. Weights are typically measured in tebibytes (TiB).

To get a simple view of the cluster's CRUSH hierarchy, including weights, run
the following command:

.. prompt:: bash $

   ceph osd tree

Rules
-----

CRUSH rules define policy governing how data is distributed across the devices
in the hierarchy. The rules define placement as well as replication strategies
or distribution policies that allow you to specify exactly how CRUSH places
data replicas. For example, you might create one rule selecting a pair of
targets for two-way mirroring, another rule for selecting three targets in two
different data centers for three-way replication, and yet another rule for
erasure coding across six storage devices. For a detailed discussion of CRUSH
rules, see **Section 3.2** of `CRUSH - Controlled, Scalable, Decentralized
Placement of Replicated Data`_.

CRUSH rules can be created via the command-line by specifying the *pool type*
that they will govern (replicated or erasure coded), the *failure domain*, and
optionally a *device class*. In rare cases, CRUSH rules must be created by
manually editing the CRUSH map.
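
For example, a replicated rule that places each replica on a different host and
restricts placement to devices of class ``ssd`` might be created as follows
(the rule name ``fast-ssd`` is an arbitrary example; the full command syntax is
covered later in this document):

.. prompt:: bash $

   ceph osd crush rule create-replicated fast-ssd default host ssd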

To see the rules that are defined for the cluster, run the following command:

.. prompt:: bash $

   ceph osd crush rule ls

To view the contents of the rules, run the following command:

.. prompt:: bash $

   ceph osd crush rule dump

.. _device_classes:

Device classes
--------------

Each device can optionally have a *class* assigned. By default, OSDs
automatically set their class at startup to ``hdd``, ``ssd``, or ``nvme`` in
accordance with the type of device they are backed by.

To explicitly set the device class of one or more OSDs, run a command of the
following form:

.. prompt:: bash $

   ceph osd crush set-device-class <class> <osd-name> [...]

Once a device class has been set, it cannot be changed to another class until
the old class is unset. To remove the old class of one or more OSDs, run a
command of the following form:

.. prompt:: bash $

   ceph osd crush rm-device-class <osd-name> [...]

This restriction allows administrators to set device classes that won't be
changed on OSD restart or by a script.

To create a placement rule that targets a specific device class, run a command
of the following form:

.. prompt:: bash $

   ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>

To apply the new placement rule to a specific pool, run a command of the
following form:

.. prompt:: bash $

   ceph osd pool set <pool-name> crush_rule <rule-name>
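
As a concrete sketch (all names here are hypothetical), the following sequence
marks two OSDs as ``ssd``, creates a rule that keeps replicas on separate hosts
while using only ``ssd`` devices, and assigns that rule to an existing pool
named ``fastpool``:

.. prompt:: bash $

   ceph osd crush set-device-class ssd osd.4 osd.5
   ceph osd crush rule create-replicated ssd-only default host ssd
   ceph osd pool set fastpool crush_rule ssd-only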

Device classes are implemented by creating one or more "shadow" CRUSH
hierarchies. For each device class in use, there will be a shadow hierarchy
that contains only devices of that class. CRUSH rules can then distribute data
across the relevant shadow hierarchy. This approach is fully backward
compatible with older Ceph clients. To view the CRUSH hierarchy with shadow
items displayed, run the following command:

.. prompt:: bash #

   ceph osd crush tree --show-shadow

Some older clusters that were created before the Luminous release rely on
manually crafted CRUSH maps to maintain per-device-type hierarchies. For these
clusters, there is a *reclassify* tool available that can help them transition
to device classes without triggering unwanted data movement (see
:ref:`crush-reclassify`).

Weight sets
-----------

A *weight set* is an alternative set of weights to use when calculating data
placement. The normal weights associated with each device in the CRUSH map are
set in accordance with the device size and indicate how much data should be
stored where. However, because CRUSH is a probabilistic pseudorandom placement
process, there is always some variation from this ideal distribution (in the
same way that rolling a die sixty times will likely not result in exactly ten
ones and ten sixes). Weight sets allow the cluster to perform numerical
optimization based on the specifics of your cluster (for example: hierarchy,
pools) to achieve a balanced distribution.

Ceph supports two types of weight sets:

#. A **compat** weight set is a single alternative set of weights for each
   device and each node in the cluster. Compat weight sets cannot be expected
   to correct all anomalies (for example, PGs for different pools might be of
   different sizes and have different load levels, but are mostly treated alike
   by the balancer). However, they have the major advantage of being *backward
   compatible* with previous versions of Ceph. This means that even though
   weight sets were first introduced in Luminous v12.2.z, older clients (for
   example, Firefly) can still connect to the cluster when a compat weight set
   is being used to balance data.

#. A **per-pool** weight set is more flexible in that it allows placement to
   be optimized for each data pool. Additionally, weights can be adjusted
   for each position of placement, allowing the optimizer to correct for a
   subtle skew of data toward devices with small weights relative to their
   peers (an effect that is usually apparent only in very large clusters
   but that can cause balancing problems).

When weight sets are in use, the weights associated with each node in the
hierarchy are visible in a separate column (labeled either as ``(compat)`` or
as the pool name) in the output of the following command:

.. prompt:: bash #

   ceph osd tree

If both *compat* and *per-pool* weight sets are in use, data placement for a
particular pool will use its own per-pool weight set if present. If only
*compat* weight sets are in use, data placement will use the compat weight set.
If neither are in use, data placement will use the normal CRUSH weights.

Although weight sets can be set up and adjusted manually, we recommend enabling
the ``ceph-mgr`` *balancer* module to perform these tasks automatically if the
cluster is running Luminous or a later release.

Modifying the CRUSH map
=======================

.. _addosd:

Adding/Moving an OSD
--------------------

.. note:: Under normal conditions, OSDs automatically add themselves to the
   CRUSH map when they are created. The command in this section is rarely
   needed.

To add or move an OSD in the CRUSH map of a running cluster, run a command of
the following form:

.. prompt:: bash $

   ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...]

For details on this command's parameters, see the following:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``

``weight``

:Description: The CRUSH weight of the OSD. Normally, this is its size, as measured in tebibytes (TiB).
:Type: Double
:Required: Yes
:Example: ``2.0``

``root``

:Description: The root node of the CRUSH hierarchy in which the OSD resides (normally ``default``).
:Type: Key-value pair.
:Required: Yes
:Example: ``root=default``

``bucket-type``

:Description: The OSD's location in the CRUSH hierarchy.
:Type: Key-value pairs.
:Required: No
:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``

In the following example, the command adds ``osd.0`` to the hierarchy, or moves
``osd.0`` from a previous location:

.. prompt:: bash $

   ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1

Adjusting OSD weight
--------------------

.. note:: Under normal conditions, OSDs automatically add themselves to the
   CRUSH map with the correct weight when they are created. The command in this
   section is rarely needed.

To adjust an OSD's CRUSH weight in a running cluster, run a command of the
following form:

.. prompt:: bash $

   ceph osd crush reweight {name} {weight}

For details on this command's parameters, see the following:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``

``weight``

:Description: The CRUSH weight of the OSD.
:Type: Double
:Required: Yes
:Example: ``2.0``

.. _removeosd:

Removing an OSD
---------------

.. note:: OSDs are normally removed from the CRUSH map as a result of the
   ``ceph osd purge`` command. This command is rarely needed.

To remove an OSD from the CRUSH map of a running cluster, run a command of the
following form:

.. prompt:: bash $

   ceph osd crush remove {name}

For details on the ``name`` parameter, see the following:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``

Adding a CRUSH Bucket
---------------------

.. note:: Buckets are implicitly created when an OSD is added and the command
   that creates it specifies a ``{bucket-type}={bucket-name}`` as part of the
   OSD's location (provided that a bucket with that name does not already
   exist). The command in this section is typically used when manually
   adjusting the structure of the hierarchy after OSDs have already been
   created. One use of this command is to move a series of hosts to a new
   rack-level bucket. Another use of this command is to add new ``host``
   buckets (OSD nodes) to a dummy ``root`` so that the buckets don't receive
   any data until they are ready to receive data. When they are ready, move the
   buckets to the ``default`` root or to any other root as described below.

To add a bucket in the CRUSH map of a running cluster, run a command of the
following form:

.. prompt:: bash $

   ceph osd crush add-bucket {bucket-name} {bucket-type}

For details on this command's parameters, see the following:

``bucket-name``

:Description: The full name of the bucket.
:Type: String
:Required: Yes
:Example: ``rack12``

``bucket-type``

:Description: The type of the bucket. This type must already exist in the CRUSH hierarchy.
:Type: String
:Required: Yes
:Example: ``rack``

In the following example, the command adds the ``rack12`` bucket to the hierarchy:

.. prompt:: bash $

   ceph osd crush add-bucket rack12 rack

Moving a Bucket
---------------

To move a bucket to a different location or position in the CRUSH map
hierarchy, run a command of the following form:

.. prompt:: bash $

   ceph osd crush move {bucket-name} {bucket-type}={bucket-name}, [...]

For details on this command's parameters, see the following:

``bucket-name``

:Description: The name of the bucket that you are moving.
:Type: String
:Required: Yes
:Example: ``foo-bar-1``

``bucket-type``

:Description: The bucket's new location in the CRUSH hierarchy.
:Type: Key-value pairs.
:Required: No
:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``
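
For example, to place the ``rack12`` bucket created in the previous section
directly under the ``default`` root, you might run the following command (a
sketch; adjust the names to match your own hierarchy):

.. prompt:: bash $

   ceph osd crush move rack12 root=default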

Removing a Bucket
-----------------

To remove a bucket from the CRUSH hierarchy, run a command of the following
form:

.. prompt:: bash $

   ceph osd crush remove {bucket-name}

.. note:: A bucket must already be empty before it is removed from the CRUSH
   hierarchy. In other words, there must not be OSDs or any other CRUSH buckets
   within it.

For details on the ``bucket-name`` parameter, see the following:

``bucket-name``

:Description: The name of the bucket that is being removed.
:Type: String
:Required: Yes
:Example: ``rack12``

In the following example, the command removes the ``rack12`` bucket from the
hierarchy:

.. prompt:: bash $

   ceph osd crush remove rack12

Creating a compat weight set
----------------------------

.. note:: Normally this action is done automatically if needed by the
   ``balancer`` module (provided that the module is enabled).

To create a *compat* weight set, run the following command:

.. prompt:: bash $

   ceph osd crush weight-set create-compat

To adjust the weights of the compat weight set, run a command of the following
form:

.. prompt:: bash $

   ceph osd crush weight-set reweight-compat {name} {weight}

To destroy the compat weight set, run the following command:

.. prompt:: bash $

   ceph osd crush weight-set rm-compat

Creating per-pool weight sets
-----------------------------

To create a weight set for a specific pool, run a command of the following
form:

.. prompt:: bash $

   ceph osd crush weight-set create {pool-name} {mode}

.. note:: Per-pool weight sets can be used only if all servers and daemons are
   running Luminous v12.2.z or a later release.

For details on this command's parameters, see the following:

``pool-name``

:Description: The name of a RADOS pool.
:Type: String
:Required: Yes
:Example: ``rbd``

``mode``

:Description: Either ``flat`` or ``positional``. A *flat* weight set
              assigns a single weight to all devices or buckets. A
              *positional* weight set has a potentially different
              weight for each position in the resulting placement
              mapping. For example: if a pool has a replica count of
              ``3``, then a positional weight set will have three
              weights for each device and bucket.
:Type: String
:Required: Yes
:Example: ``flat``

To adjust the weight of an item in a weight set, run a command of the following
form:

.. prompt:: bash $

   ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]}
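
For example, for a hypothetical pool named ``rbd`` that was created with the
``positional`` mode and has a replica count of three, each item takes one
weight per position (the values below are arbitrary):

.. prompt:: bash $

   ceph osd crush weight-set reweight rbd osd.0 0.9 1.0 1.0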

To list existing weight sets, run the following command:

.. prompt:: bash $

   ceph osd crush weight-set ls

To remove a weight set, run a command of the following form:

.. prompt:: bash $

   ceph osd crush weight-set rm {pool-name}

Creating a rule for a replicated pool
-------------------------------------

When you create a CRUSH rule for a replicated pool, there is an important
decision to make: selecting a failure domain. For example, if you select a
failure domain of ``host``, then CRUSH will ensure that each replica of the
data is stored on a unique host. Alternatively, if you select a failure domain
of ``rack``, then each replica of the data will be stored in a different rack.
Your selection of failure domain should be guided by the size of the cluster
and its CRUSH topology.

The entire cluster hierarchy is typically nested beneath a root node that is
named ``default``. If you have customized your hierarchy, you might want to
create a rule nested beneath some other node in the hierarchy. In creating
this rule for the customized hierarchy, the node type doesn't matter, and in
particular the rule does not have to be nested beneath a ``root`` node.

It is possible to create a rule that restricts data placement to a specific
*class* of device. By default, Ceph OSDs automatically classify themselves as
either ``hdd`` or ``ssd`` in accordance with the underlying type of device
being used. These device classes can be customized. One might set the device
class of OSDs to ``nvme`` to distinguish them from SATA SSDs, or one might set
them to something arbitrary like ``ssd-testing`` or ``ssd-ethel`` so that rules
and pools may be flexibly constrained to use (or avoid using) specific subsets
of OSDs based on specific requirements.

To create a rule for a replicated pool, run a command of the following form:

.. prompt:: bash $

   ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}]

For details on this command's parameters, see the following:

``name``

:Description: The name of the rule.
:Type: String
:Required: Yes
:Example: ``rbd-rule``

``root``

:Description: The name of the CRUSH hierarchy node under which data is to be placed.
:Type: String
:Required: Yes
:Example: ``default``

``failure-domain-type``

:Description: The type of CRUSH nodes used for the replicas of the failure domain.
:Type: String
:Required: Yes
:Example: ``rack``

``class``

:Description: The device class on which data is to be placed.
:Type: String
:Required: No
:Example: ``ssd``
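
Putting the example values above together, a rule named ``rbd-rule`` that
spreads replicas across racks under the ``default`` root and restricts
placement to ``ssd`` devices would be created as follows (after which it can be
assigned to a pool with ``ceph osd pool set``, as shown earlier):

.. prompt:: bash $

   ceph osd crush rule create-replicated rbd-rule default rack ssd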

Creating a rule for an erasure-coded pool
-----------------------------------------

For an erasure-coded pool, similar decisions need to be made: what the failure
domain is, which node in the hierarchy data will be placed under (usually
``default``), and whether placement is restricted to a specific device class.
However, erasure-code pools are created in a different way: there is a need to
construct them carefully with reference to the erasure code plugin in use. For
this reason, these decisions must be incorporated into the **erasure-code
profile**. A CRUSH rule will then be created from the erasure-code profile,
either explicitly or automatically when the profile is used to create a pool.

To list the erasure-code profiles, run the following command:

.. prompt:: bash $

   ceph osd erasure-code-profile ls

To view a specific existing profile, run a command of the following form:

.. prompt:: bash $

   ceph osd erasure-code-profile get {profile-name}

Under normal conditions, profiles should never be modified; instead, a new
profile should be created and used when creating either a new pool or a new
rule for an existing pool.

An erasure-code profile consists of a set of key-value pairs. Most of these
key-value pairs govern the behavior of the erasure code that encodes data in
the pool. However, key-value pairs that begin with ``crush-`` govern the CRUSH
rule that is created.

The relevant erasure-code profile properties are as follows:

* **crush-root**: the name of the CRUSH node under which to place data
  [default: ``default``].
* **crush-failure-domain**: the CRUSH bucket type used in the distribution of
  erasure-coded shards [default: ``host``].
* **crush-device-class**: the device class on which to place data [default:
  none, which means that all devices are used].
* **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the
  number of erasure-code shards, affecting the resulting CRUSH rule.

After a profile is defined, you can create a CRUSH rule by running a command
of the following form:

.. prompt:: bash $

   ceph osd crush rule create-erasure {name} {profile-name}

.. note:: When creating a new pool, it is not necessary to create the rule
   explicitly. If only the erasure-code profile is specified and the rule
   argument is omitted, then Ceph will create the CRUSH rule automatically.
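
For example, a profile that uses four data shards and two coding shards,
spreads shards across racks, and restricts placement to ``hdd`` devices might
be defined and then used to create a pool as follows (the names ``myprofile``
and ``ecpool``, the ``k``/``m`` values, and the PG counts are only
illustrations):

.. prompt:: bash $

   ceph osd erasure-code-profile set myprofile k=4 m=2 crush-failure-domain=rack crush-device-class=hdd
   ceph osd pool create ecpool 32 32 erasure myprofile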

Deleting rules
--------------

To delete rules that are not in use by pools, run a command of the following
form:

.. prompt:: bash $

   ceph osd crush rule rm {rule-name}

.. _crush-map-tunables:

Tunables
========

The CRUSH algorithm that is used to calculate the placement of data has been
improved over time. In order to support changes in behavior, we have provided
users with sets of tunables that determine which legacy or optimal version of
CRUSH is to be used.

In order to use newer tunables, all Ceph clients and daemons must support the
new major release of CRUSH. Because of this requirement, we have created
``profiles`` that are named after the Ceph version in which they were
introduced. For example, the ``firefly`` tunables were first supported by the
Firefly release and do not work with older clients (for example, clients
running Dumpling). After a cluster's tunables profile is changed from a legacy
set to a newer or ``optimal`` set, the ``ceph-mon`` and ``ceph-osd`` daemons
will prevent older clients that do not support the new CRUSH features from
connecting to the cluster.

argonaut (legacy)
-----------------

The legacy CRUSH behavior used by Argonaut and older releases works fine for
most clusters, provided that not many OSDs have been marked ``out``.

bobtail (CRUSH_TUNABLES2)
-------------------------

The ``bobtail`` tunable profile provides the following improvements:

* For hierarchies with a small number of devices in leaf buckets, some PGs
  might map to fewer than the desired number of replicas, resulting in
  ``undersized`` PGs. This is known to happen in the case of hierarchies with
  ``host`` nodes that have a small number of OSDs (1 to 3) nested beneath each
  host.

* For large clusters, a small percentage of PGs might map to fewer than the
  desired number of OSDs. This is known to happen when there are multiple
  hierarchy layers in use (for example, ``row``, ``rack``, ``host``, ``osd``).

* When one or more OSDs are marked ``out``, data tends to be redistributed
  to nearby OSDs instead of across the entire hierarchy.

The tunables introduced in the Bobtail release are as follows:

* ``choose_local_tries``: Number of local retries. The legacy value is ``2``,
  and the optimal value is ``0``.

* ``choose_local_fallback_tries``: The legacy value is ``5``, and the optimal
  value is ``0``.

* ``choose_total_tries``: Total number of attempts to choose an item. The
  legacy value is ``19``, but subsequent testing indicates that a value of
  ``50`` is more appropriate for typical clusters. For extremely large
  clusters, an even larger value might be necessary.

* ``chooseleaf_descend_once``: Whether a recursive ``chooseleaf`` attempt will
  retry, or try only once and allow the original placement to retry. The
  legacy default is ``0``, and the optimal value is ``1``.

Migration impact:

* Moving from the ``argonaut`` tunables to the ``bobtail`` tunables triggers a
  moderate amount of data movement. Use caution on a cluster that is already
  populated with data.
1e59de90 TL |
810 | chooseleaf_vary_r |
811 | ~~~~~~~~~~~~~~~~~ | |
7c673cae | 812 | |
1e59de90 TL |
813 | This ``firefly`` tunable profile fixes a problem with ``chooseleaf`` CRUSH step |
814 | behavior. This problem arose when a large fraction of OSDs were marked ``out``, which resulted in PG mappings with too few OSDs. | |
7c673cae | 815 | |
1e59de90 TL |
816 | This profile was introduced in the Firefly release, and adds a new tunable as follows: |
817 | ||
818 | * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will start | |
819 | with a non-zero value of ``r``, as determined by the number of attempts the | |
820 | parent has already made. The legacy default value is ``0``, but with this | |
821 | value CRUSH is sometimes unable to find a mapping. The optimal value (in | |
f67539c2 | 822 | terms of computational cost and correctness) is ``1``. |
7c673cae | 823 | |
11fdf7f2 | 824 | Migration impact: |
7c673cae | 825 | |
1e59de90 TL |
826 | * For existing clusters that store a great deal of data, changing this tunable |
827 | from ``0`` to ``1`` will trigger a large amount of data migration; a value | |
828 | of ``4`` or ``5`` will allow CRUSH to still find a valid mapping and will | |
829 | cause less data to move. | |
7c673cae | 830 | |
1e59de90 TL |
831 | straw_calc_version tunable |
832 | ~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
7c673cae | 833 | |
1e59de90 TL |
834 | There were problems with the internal weights calculated and stored in the |
835 | CRUSH map for ``straw`` algorithm buckets. When there were buckets with a CRUSH | |
836 | weight of ``0`` or with a mix of different and unique weights, CRUSH would | |
837 | distribute data incorrectly (that is, not in proportion to the weights). | |
7c673cae | 838 | |
1e59de90 | 839 | This tunable, introduced in the Firefly release, is as follows: |
7c673cae | 840 | |
f67539c2 | 841 | * ``straw_calc_version``: A value of ``0`` preserves the old, broken |
1e59de90 | 842 | internal-weight calculation; a value of ``1`` fixes the problem. |
7c673cae FG |
843 | |
844 | Migration impact: | |
845 | ||
1e59de90 TL |
846 | * Changing this tunable to a value of ``1`` and then adjusting a straw bucket |
847 | (either by adding, removing, or reweighting an item or by using the | |
848 | reweight-all command) can trigger a small to moderate amount of data | |
849 | movement provided that the cluster has hit one of the problematic | |
7c673cae FG |
850 | conditions. |
851 | ||
1e59de90 TL |
852 | This tunable option is notable in that it has absolutely no impact on the |
853 | required kernel version in the client side. | |
7c673cae FG |

hammer (CRUSH_V4)
-----------------

The ``hammer`` tunable profile does not affect the mapping of existing CRUSH
maps simply by changing the profile. However:

* There is a new bucket algorithm supported: ``straw2``. This new algorithm
  fixes several limitations in the original ``straw``. More specifically, the
  old ``straw`` buckets would change some mappings that should not have
  changed when a weight was adjusted, while ``straw2`` achieves the original
  goal of changing mappings only to or from the bucket item whose weight has
  changed.

* The ``straw2`` type is the default type for any newly created buckets.

Migration impact:

* Changing a bucket type from ``straw`` to ``straw2`` will trigger a small
  amount of data movement, depending on how much the bucket items' weights
  vary from each other. When the weights are all the same no data will move,
  and the more variance there is in the weights the more movement there will
  be.

jewel (CRUSH_TUNABLES5)
-----------------------

The ``jewel`` tunable profile improves the overall behavior of CRUSH. As a
result, significantly fewer mappings change when an OSD is marked ``out`` of
the cluster. This improvement results in significantly less data movement.

The new tunable introduced in the Jewel release is as follows:

* ``chooseleaf_stable``: Determines whether a recursive chooseleaf attempt
  will use a better value for an inner loop that greatly reduces the number of
  mapping changes when an OSD is marked ``out``. The legacy value is ``0``,
  and the new value of ``1`` uses the new approach.

Migration impact:

* Changing this value on an existing cluster will result in a very large
  amount of data movement because nearly every PG mapping is likely to change.

Client versions that support CRUSH_TUNABLES2
--------------------------------------------

* v0.55 and later, including Bobtail (v0.56.x)
* Linux kernel version v3.9 and later (for the CephFS and RBD kernel clients)

Client versions that support CRUSH_TUNABLES3
--------------------------------------------

* v0.78 (Firefly) and later
* Linux kernel version v3.15 and later (for the CephFS and RBD kernel clients)

Client versions that support CRUSH_V4
-------------------------------------

* v0.94 (Hammer) and later
* Linux kernel version v4.1 and later (for the CephFS and RBD kernel clients)

Client versions that support CRUSH_TUNABLES5
--------------------------------------------

* v10.0.2 (Jewel) and later
* Linux kernel version v4.5 and later (for the CephFS and RBD kernel clients)

"Non-optimal tunables" warning
------------------------------

In v0.74 and later versions, Ceph will raise a health check ("HEALTH_WARN crush
map has non-optimal tunables") if any of the current CRUSH tunables have
non-optimal values: that is, if any fail to have the optimal values from the
:ref:`default profile <rados_operations_crush_map_default_profile_definition>`.
There are two different ways to silence the alert:

1. Adjust the CRUSH tunables on the existing cluster so as to render them
   optimal. Making this adjustment will trigger some data movement
   (possibly as much as 10%). This approach is generally preferred to the
   other approach, but special care must be taken in situations where
   data movement might affect performance: for example, in production clusters.
   To enable optimal tunables, run the following command:

   .. prompt:: bash $

      ceph osd crush tunables optimal

   There are several potential problems that might make it preferable to revert
   to the previous values of the tunables. The new values might generate too
   much load for the cluster to handle, the new values might unacceptably slow
   the operation of the cluster, or there might be a client-compatibility
   problem. Such client-compatibility problems can arise when using old-kernel
   CephFS or RBD clients, or pre-Bobtail ``librados`` clients. To revert to
   the previous values of the tunables, run the following command:

   .. prompt:: bash $

      ceph osd crush tunables legacy

2. To silence the alert without making any changes to CRUSH,
   add the following option to the ``[mon]`` section of your ``ceph.conf``
   file::

      mon_warn_on_legacy_crush_tunables = false

   In order for this change to take effect, you will need to either restart
   the monitors or run the following command to apply the option to the
   monitors while they are still running:

   .. prompt:: bash $

      ceph tell mon.\* config set mon_warn_on_legacy_crush_tunables false


Tuning CRUSH
------------

When making adjustments to CRUSH tunables, keep the following considerations in
mind:

* Adjusting the values of CRUSH tunables will result in the shift of one or
  more PGs from one storage node to another. If the Ceph cluster is already
  storing a great deal of data, be prepared for significant data movement.
* When the ``ceph-osd`` and ``ceph-mon`` daemons get the updated map, they
  immediately begin rejecting new connections from clients that do not support
  the new feature. However, already-connected clients are effectively
  grandfathered in, and any of these clients that do not support the new
  feature will malfunction.
* If the CRUSH tunables are set to newer (non-legacy) values and subsequently
  reverted to the legacy values, ``ceph-osd`` daemons will not be required to
  support any of the newer CRUSH features associated with the newer
  (non-legacy) values. However, the OSD peering process requires the
  examination and understanding of old maps. For this reason, **if the cluster
  has previously used non-legacy CRUSH values, do not run old versions of
  the** ``ceph-osd`` **daemon** -- even if the latest version of the map has
  been reverted so as to use the legacy defaults.

The simplest way to adjust CRUSH tunables is to apply them in matched sets
known as *profiles*. As of the Octopus release, Ceph supports the following
profiles:

* ``legacy``: The legacy behavior from argonaut and earlier.
* ``argonaut``: The legacy values supported by the argonaut release.
* ``bobtail``: The values supported by the bobtail release.
* ``firefly``: The values supported by the firefly release.
* ``hammer``: The values supported by the hammer release.
* ``jewel``: The values supported by the jewel release.
* ``optimal``: The best values for the current version of Ceph.

.. _rados_operations_crush_map_default_profile_definition:

* ``default``: The default values of a new cluster that has been installed
  from scratch. These values, which depend on the current version of Ceph, are
  hardcoded and are typically a mix of optimal and legacy values. These
  values often correspond to the ``optimal`` profile of either the previous
  LTS (long-term service) release or the most recent release for which most
  users are expected to have up-to-date clients.

To apply a profile to a running cluster, run a command of the following form:

.. prompt:: bash $

   ceph osd crush tunables {PROFILE}

This action might trigger a great deal of data movement. Consult release notes
and documentation before changing the profile on a running cluster. Consider
throttling recovery and backfill parameters in order to limit the backfill
resulting from a specific change.
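
To review the tunables that are currently in effect (for example, before and
after applying a profile), you can dump them with the following command; the
exact set of fields in the output varies by release:

.. prompt:: bash $

   ceph osd crush show-tunables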

.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf


Tuning Primary OSD Selection
============================

When a Ceph client reads or writes data, it first contacts the primary OSD in
each affected PG's acting set. By default, the first OSD in the acting set is
the primary OSD (also known as the "lead OSD"). For example, in the acting set
``[2, 3, 4]``, ``osd.2`` is listed first and is therefore the primary OSD.
However, sometimes it is clear that an OSD is not well suited to act as the
lead as compared with other OSDs (for example, if the OSD has a slow drive or a
slow controller). To prevent performance bottlenecks (especially on read
operations) and at the same time maximize the utilization of your hardware, you
can influence the selection of the primary OSD either by adjusting "primary
affinity" values, or by crafting a CRUSH rule that selects OSDs that are better
suited to act as the lead rather than other OSDs.

To determine whether tuning Ceph's selection of primary OSDs will improve
cluster performance, pool redundancy strategy must be taken into account. For
replicated pools, this tuning can be especially useful, because by default read
operations are served from the primary OSD of each PG. For erasure-coded pools,
however, the speed of read operations can be increased by enabling **fast
read** (see :ref:`pool-settings`).

Primary Affinity
----------------

**Primary affinity** is a characteristic of an OSD that governs the likelihood
that a given OSD will be selected as the primary OSD (or "lead OSD") in a given
acting set. A primary affinity value can be any real number in the range ``0``
to ``1``, inclusive.

As an example of a common scenario in which it can be useful to adjust primary
affinity values, let us suppose that a cluster contains a mix of drive sizes:
for example, suppose it contains some older racks with 1.9 TB SATA SSDs and
some newer racks with 3.84 TB SATA SSDs. The latter will on average be assigned
twice the number of PGs and will thus serve twice the number of write and read
operations -- they will be busier than the former. In such a scenario, you
might make a rough assignment of primary affinity as inversely proportional to
OSD size. Such an assignment will not be 100% optimal, but it can readily
achieve a 15% improvement in overall read throughput by means of a more even
utilization of SATA interface bandwidth and CPU cycles. This example is not
merely a thought experiment meant to illustrate the theoretical benefits of
adjusting primary affinity values; this fifteen percent improvement was
achieved on an actual Ceph cluster.

By default, every Ceph OSD has a primary affinity value of ``1``. In a cluster
in which every OSD has this default value, all OSDs are equally likely to act
as a primary OSD.

By reducing the value of a Ceph OSD's primary affinity, you make CRUSH less
likely to select the OSD as primary in a PG's acting set. To change the weight
value associated with a specific OSD's primary affinity, run a command of the
following form:

.. prompt:: bash $

   ceph osd primary-affinity <osd-id> <weight>

The primary affinity of an OSD can be set to any real number in the range
``[0-1]`` inclusive, where ``0`` indicates that the OSD may not be used as
primary and ``1`` indicates that the OSD is maximally likely to be used as a
primary. When the weight is between these extremes, its value indicates roughly
how likely it is that CRUSH will select the OSD associated with it as a
primary.
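
For example, to reduce the likelihood that a hypothetical ``osd.4`` is chosen
as the lead OSD, you might lower its primary affinity as follows (the value
``0.5`` is only an illustration):

.. prompt:: bash $

   ceph osd primary-affinity osd.4 0.5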

The process by which CRUSH selects the lead OSD is not a mere function of a
simple probability determined by relative affinity values. Nevertheless,
measurable results can be achieved even with first-order approximations of
desirable primary affinity values.


Custom CRUSH Rules
------------------

Some clusters balance cost and performance by mixing SSDs and HDDs in the same
replicated pool. By setting the primary affinity of HDD OSDs to ``0``,
operations will be directed to an SSD OSD in each acting set. Alternatively,
you can define a CRUSH rule that always selects an SSD OSD as the primary OSD
and then selects HDDs for the remaining OSDs. Given this rule, each PG's acting
set will contain an SSD OSD as the primary and have the remaining OSDs on HDDs.

For example, see the following CRUSH rule::

   rule mixed_replicated_rule {
           id 11
           type replicated
           step take default class ssd
           step chooseleaf firstn 1 type host
           step emit
           step take default class hdd
           step chooseleaf firstn 0 type host
           step emit
   }

This rule chooses an SSD as the first OSD. For an ``N``-times replicated pool,
this rule selects ``N+1`` OSDs in order to guarantee that ``N`` copies are on
different hosts, because the first SSD OSD might be colocated with any of the
``N`` HDD OSDs.
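
Rules like this one generally cannot be created with the ``ceph osd crush rule
create-replicated`` command shown earlier; they are typically added by manually
editing the CRUSH map, as mentioned in the Rules section above. A rough sketch
of that workflow (the file names are arbitrary) is:

.. prompt:: bash $

   ceph osd getcrushmap -o crushmap.bin
   crushtool -d crushmap.bin -o crushmap.txt
   # edit crushmap.txt to add the rule, then recompile and inject it
   crushtool -c crushmap.txt -o crushmap.new
   ceph osd setcrushmap -i crushmap.new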

To avoid this extra storage requirement, you might place SSDs and HDDs in
different hosts. However, taking this approach means that all client requests
will be received by hosts with SSDs. For this reason, it might be advisable to
have faster CPUs for SSD OSDs and more modest CPUs for HDD OSDs, since the
latter will under normal circumstances perform only recovery operations. Here
the CRUSH roots ``ssd_hosts`` and ``hdd_hosts`` are under a strict requirement
not to contain any of the same servers, as seen in the following CRUSH rule::

   rule mixed_replicated_rule_two {
           id 1
           type replicated
           step take ssd_hosts class ssd
           step chooseleaf firstn 1 type host
           step emit
           step take hdd_hosts class hdd
           step chooseleaf firstn -1 type host
           step emit
   }

.. note:: If a primary SSD OSD fails, then requests to the associated PG will
   be temporarily served from a slower HDD OSD until the PG's data has been
   replicated onto the replacement primary SSD OSD.