============
 CRUSH Maps
============

The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm
computes storage locations in order to determine how to store and retrieve
data. CRUSH allows Ceph clients to communicate with OSDs directly rather than
through a centralized server or broker. By using an algorithmically-determined
method of storing and retrieving data, Ceph avoids a single point of failure, a
performance bottleneck, and a physical limit to its scalability.

CRUSH uses a map of the cluster (the CRUSH map) to map data to OSDs,
distributing the data across the cluster in accordance with configured
replication policy and failure domains. For a detailed discussion of CRUSH, see
`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_

CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)` and a
hierarchy of "buckets" (``host``\s, ``rack``\s) and rules that govern how CRUSH
replicates data within the cluster's pools. By reflecting the underlying
physical organization of the installation, CRUSH can model (and thereby
address) the potential for correlated device failures. Some factors relevant
to the CRUSH hierarchy include chassis, racks, physical proximity, a shared
power source, shared networking, and failure domains. By encoding this
information into the CRUSH map, CRUSH placement policies distribute object
replicas across failure domains while maintaining the desired distribution. For
example, to address the possibility of concurrent failures, it might be
desirable to ensure that data replicas are on devices that reside in or rely
upon different shelves, racks, power supplies, controllers, or physical
locations.

When OSDs are deployed, they are automatically added to the CRUSH map under a
``host`` bucket that is named for the node on which the OSDs run. This
behavior, combined with the configured CRUSH failure domain, ensures that
replicas or erasure-code shards are distributed across hosts and that the
failure of a single host or other kinds of failures will not affect
availability. For larger clusters, administrators must carefully consider their
choice of failure domain. For example, distributing replicas across racks is
typical for mid- to large-sized clusters.

CRUSH Location
==============

The location of an OSD within the CRUSH map's hierarchy is referred to as its
``CRUSH location``. The specification of a CRUSH location takes the form of a
list of key-value pairs. For example, if an OSD is in a particular row, rack,
chassis, and host, and is also part of the 'default' CRUSH root (which is the
case for most clusters), its CRUSH location can be specified as follows::

   root=default row=a rack=a2 chassis=a2a host=a2a1

.. note::

   #. The order of the keys does not matter.
   #. The key name (left of ``=``) must be a valid CRUSH ``type``. By default,
      valid CRUSH types include ``root``, ``datacenter``, ``room``, ``row``,
      ``pod``, ``pdu``, ``rack``, ``chassis``, and ``host``. These defined
      types suffice for nearly all clusters, but can be customized by
      modifying the CRUSH map.

The CRUSH location for an OSD can be set by adding the ``crush_location``
option to ``ceph.conf``, for example::

   crush_location = root=default row=a rack=a2 chassis=a2a host=a2a1

Once this option has been added, the OSD verifies on every startup that it is
in the correct location in the CRUSH map and moves itself if it is not. To
disable this automatic CRUSH map management, add the following to the
``ceph.conf`` configuration file in the ``[osd]`` section::

   osd_crush_update_on_start = false

Note that this action is unnecessary in most cases.

If ``crush_location`` is not set explicitly, a default of ``root=default
host=HOSTNAME`` is used for OSDs, where the hostname is determined by the
output of the ``hostname -s`` command.

.. note:: If you switch from this default to an explicitly set
   ``crush_location``, do not forget to include ``root=default``, because
   existing CRUSH rules refer to it.

Custom location hooks
---------------------

A custom location hook can be used to generate a more complete CRUSH location
on startup.

This is useful when some location fields are not known at the time
``ceph.conf`` is written (for example, fields ``rack`` or ``datacenter``
when deploying a single configuration across multiple datacenters).

If configured, executed, and parsed successfully, the hook's output replaces
any previously set CRUSH location.

The hook can be enabled in ``ceph.conf`` by providing a path to an executable
file (often a script), for example::

   crush_location_hook = /path/to/customized-ceph-crush-location

This hook is passed several arguments (see below). The hook outputs a single
line to ``stdout`` that contains the CRUSH location description. The arguments
resemble the following::

   --cluster CLUSTER --id ID --type TYPE

Here the cluster name is typically ``ceph``, the ``id`` is the daemon
identifier or (in the case of OSDs) the OSD number, and the daemon type is
``osd``, ``mds``, ``mgr``, or ``mon``.

For example, a simple hook that specifies a rack location via a value in the
file ``/etc/rack`` (assuming it contains no spaces) might be as follows::

   #!/bin/sh
   echo "root=default rack=$(cat /etc/rack) host=$(hostname -s)"
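
Before relying on a hook in production, it can be useful to run it by hand with
the same arguments that the daemon passes (described above) and confirm that it
prints exactly one valid location line. The daemon ID below is only an
illustration:

.. prompt:: bash $

   /path/to/customized-ceph-crush-location --cluster ceph --id 12 --type osd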

CRUSH structure
===============

The CRUSH map consists of (1) a hierarchy that describes the physical topology
of the cluster and (2) a set of rules that defines data placement policy. The
hierarchy has devices (OSDs) at the leaves and internal nodes corresponding to
other physical features or groupings: hosts, racks, rows, data centers, and so
on. The rules determine how replicas are placed in terms of that hierarchy (for
example, 'three replicas in different racks').

Devices
-------

Devices are individual OSDs that store data (usually one device for each
storage drive). Devices are identified by an ``id`` (a non-negative integer)
and a ``name`` (usually ``osd.N``, where ``N`` is the device's ``id``).

In Luminous and later releases, OSDs can have a *device class* assigned (for
example, ``hdd`` or ``ssd`` or ``nvme``), allowing them to be targeted by CRUSH
rules. Device classes are especially useful when mixing device types within
hosts.

.. _crush_map_default_types:

Types and Buckets
-----------------

"Bucket", in the context of CRUSH, is a term for any of the internal nodes in
the hierarchy: hosts, racks, rows, and so on. The CRUSH map defines a series of
*types* that are used to identify these nodes. Default types include:

- ``osd`` (or ``device``)
- ``host``
- ``chassis``
- ``rack``
- ``row``
- ``pdu``
- ``pod``
- ``room``
- ``datacenter``
- ``zone``
- ``region``
- ``root``

Most clusters use only a handful of these types, and other types can be defined
as needed.

The hierarchy is built with devices (normally of type ``osd``) at the leaves
and non-device types as the internal nodes. The root node is of type ``root``.
For example:

.. ditaa::

                           +-----------------+
                           |{o}root default  |
                           +--------+--------+
                                    |
                    +---------------+---------------+
                    |                               |
             +------+------+                 +------+------+
             |{o}host foo  |                 |{o}host bar  |
             +------+------+                 +------+------+
                    |                               |
            +-------+-------+               +-------+-------+
            |               |               |               |
      +-----+-----+   +-----+-----+   +-----+-----+   +-----+-----+
      | osd.0     |   | osd.1     |   | osd.2     |   | osd.3     |
      +-----------+   +-----------+   +-----------+   +-----------+

Each node (device or bucket) in the hierarchy has a *weight* that indicates the
relative proportion of the total data that should be stored by that device or
hierarchy subtree. Weights are set at the leaves, indicating the size of the
device. These weights automatically sum in an 'up the tree' direction: that is,
the weight of the ``root`` node will be the sum of the weights of all devices
contained under it. Weights are typically measured in tebibytes (TiB).

To get a simple view of the cluster's CRUSH hierarchy, including weights, run
the following command:

.. prompt:: bash $

   ceph osd tree

Rules
-----

CRUSH rules define policy governing how data is distributed across the devices
in the hierarchy. The rules define placement as well as replication strategies
or distribution policies that allow you to specify exactly how CRUSH places
data replicas. For example, you might create one rule selecting a pair of
targets for two-way mirroring, another rule for selecting three targets in two
different data centers for three-way replication, and yet another rule for
erasure coding across six storage devices. For a detailed discussion of CRUSH
rules, see **Section 3.2** of `CRUSH - Controlled, Scalable, Decentralized
Placement of Replicated Data`_.

CRUSH rules can be created via the command-line by specifying the *pool type*
that they will govern (replicated or erasure coded), the *failure domain*, and
optionally a *device class*. In rare cases, CRUSH rules must be created by
manually editing the CRUSH map.

To see the rules that are defined for the cluster, run the following command:

.. prompt:: bash $

   ceph osd crush rule ls

To view the contents of the rules, run the following command:

.. prompt:: bash $

   ceph osd crush rule dump
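
The dump command also accepts a rule name if you want to inspect a single rule;
for example, ``replicated_rule`` is the rule created by default on new
clusters:

.. prompt:: bash $

   ceph osd crush rule dump replicated_rule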

.. _device_classes:

Device classes
--------------

Each device can optionally have a *class* assigned. By default, OSDs
automatically set their class at startup to ``hdd``, ``ssd``, or ``nvme`` in
accordance with the type of device they are backed by.

To explicitly set the device class of one or more OSDs, run a command of the
following form:

.. prompt:: bash $

   ceph osd crush set-device-class <class> <osd-name> [...]

Once a device class has been set, it cannot be changed to another class until
the old class is unset. To remove the old class of one or more OSDs, run a
command of the following form:

.. prompt:: bash $

   ceph osd crush rm-device-class <osd-name> [...]

This restriction allows administrators to set device classes that won't be
changed on OSD restart or by a script.
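
For example, to re-tag two OSDs (the IDs here are illustrative) as ``nvme``,
first clear the class that was assigned automatically and then set the new one:

.. prompt:: bash $

   ceph osd crush rm-device-class osd.10 osd.11
   ceph osd crush set-device-class nvme osd.10 osd.11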

To create a placement rule that targets a specific device class, run a command
of the following form:

.. prompt:: bash $

   ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>

To apply the new placement rule to a specific pool, run a command of the
following form:

.. prompt:: bash $

   ceph osd pool set <pool-name> crush_rule <rule-name>

Device classes are implemented by creating one or more "shadow" CRUSH
hierarchies. For each device class in use, there will be a shadow hierarchy
that contains only devices of that class. CRUSH rules can then distribute data
across the relevant shadow hierarchy. This approach is fully backward
compatible with older Ceph clients. To view the CRUSH hierarchy with shadow
items displayed, run the following command:

.. prompt:: bash #

   ceph osd crush tree --show-shadow

Some older clusters that were created before the Luminous release rely on
manually crafted CRUSH maps to maintain per-device-type hierarchies. For these
clusters, there is a *reclassify* tool available that can help them transition
to device classes without triggering unwanted data movement (see
:ref:`crush-reclassify`).

Weight sets
-----------

A *weight set* is an alternative set of weights to use when calculating data
placement. The normal weights associated with each device in the CRUSH map are
set in accordance with the device size and indicate how much data should be
stored where. However, because CRUSH is a probabilistic pseudorandom placement
process, there is always some variation from this ideal distribution (in the
same way that rolling a die sixty times will likely not result in exactly ten
ones and ten sixes). Weight sets allow the cluster to perform numerical
optimization based on the specifics of your cluster (for example: hierarchy,
pools) to achieve a balanced distribution.

Ceph supports two types of weight sets:

#. A **compat** weight set is a single alternative set of weights for each
   device and each node in the cluster. Compat weight sets cannot be expected
   to correct all anomalies (for example, PGs for different pools might be of
   different sizes and have different load levels, but are mostly treated alike
   by the balancer). However, they have the major advantage of being *backward
   compatible* with previous versions of Ceph. This means that even though
   weight sets were first introduced in Luminous v12.2.z, older clients (for
   example, Firefly) can still connect to the cluster when a compat weight set
   is being used to balance data.

#. A **per-pool** weight set is more flexible in that it allows placement to
   be optimized for each data pool. Additionally, weights can be adjusted
   for each position of placement, allowing the optimizer to correct for a
   subtle skew of data toward devices with small weights relative to their
   peers (an effect that is usually apparent only in very large clusters
   but that can cause balancing problems).

When weight sets are in use, the weights associated with each node in the
hierarchy are visible in a separate column (labeled either as ``(compat)`` or
as the pool name) in the output of the following command:

.. prompt:: bash #

   ceph osd tree

If both *compat* and *per-pool* weight sets are in use, data placement for a
particular pool will use its own per-pool weight set if present. If only
*compat* weight sets are in use, data placement will use the compat weight set.
If neither is in use, data placement will use the normal CRUSH weights.

Although weight sets can be set up and adjusted manually, we recommend enabling
the ``ceph-mgr`` *balancer* module to perform these tasks automatically if the
cluster is running Luminous or a later release.

Modifying the CRUSH map
=======================

.. _addosd:

Adding/Moving an OSD
--------------------

.. note:: Under normal conditions, OSDs automatically add themselves to the
   CRUSH map when they are created. The command in this section is rarely
   needed.

To add or move an OSD in the CRUSH map of a running cluster, run a command of
the following form:

.. prompt:: bash $

   ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...]

For details on this command's parameters, see the following:

``name``
:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``

``weight``
:Description: The CRUSH weight of the OSD. Normally, this is its size, as measured in terabytes (TB).
:Type: Double
:Required: Yes
:Example: ``2.0``

``root``
:Description: The root node of the CRUSH hierarchy in which the OSD resides (normally ``default``).
:Type: Key-value pair.
:Required: Yes
:Example: ``root=default``

``bucket-type``
:Description: The OSD's location in the CRUSH hierarchy.
:Type: Key-value pairs.
:Required: No
:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``

In the following example, the command adds ``osd.0`` to the hierarchy, or moves
``osd.0`` from a previous location:

.. prompt:: bash $

   ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1
395 | |
396 | ||
1e59de90 TL |
397 | Adjusting OSD weight |
398 | -------------------- | |
c07f9fc5 | 399 | |
1e59de90 TL |
400 | .. note:: Under normal conditions, OSDs automatically add themselves to the |
401 | CRUSH map with the correct weight when they are created. The command in this | |
402 | section is rarely needed. | |
7c673cae | 403 | |
1e59de90 TL |
404 | To adjust an OSD's CRUSH weight in a running cluster, run a command of the |
405 | following form: | |
39ae355f TL |
406 | |
407 | .. prompt:: bash $ | |
7c673cae | 408 | |
39ae355f | 409 | ceph osd crush reweight {name} {weight} |
7c673cae | 410 | |
1e59de90 | 411 | For details on this command's parameters, see the following: |
7c673cae FG |
412 | |
413 | ``name`` | |
1e59de90 TL |
414 | :Description: The full name of the OSD. |
415 | :Type: String | |
416 | :Required: Yes | |
417 | :Example: ``osd.0`` | |
7c673cae FG |
418 | |
419 | ||
420 | ``weight`` | |
1e59de90 TL |
421 | :Description: The CRUSH weight of the OSD. |
422 | :Type: Double | |
423 | :Required: Yes | |
424 | :Example: ``2.0`` | |
7c673cae FG |
425 | |
426 | ||
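
For example, to change the CRUSH weight of ``osd.0`` to ``1.8`` (the value is
illustrative, for instance after moving the OSD to a larger drive):

.. prompt:: bash $

   ceph osd crush reweight osd.0 1.8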

.. _removeosd:

Removing an OSD
---------------

.. note:: OSDs are normally removed from the CRUSH map as a result of the
   ``ceph osd purge`` command. This command is rarely needed.

To remove an OSD from the CRUSH map of a running cluster, run a command of the
following form:

.. prompt:: bash $

   ceph osd crush remove {name}

For details on the ``name`` parameter, see the following:

``name``
:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``
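
Using the example name from the table above, the following removes ``osd.0``
from the CRUSH hierarchy:

.. prompt:: bash $

   ceph osd crush remove osd.0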

Adding a CRUSH Bucket
---------------------

.. note:: Buckets are implicitly created when an OSD is added and the command
   that creates the OSD specifies a ``{bucket-type}={bucket-name}`` as part of
   the OSD's location (provided that a bucket with that name does not already
   exist). The command in this section is typically used when manually
   adjusting the structure of the hierarchy after OSDs have already been
   created. One use of this command is to move a series of hosts to a new
   rack-level bucket. Another use of this command is to add new ``host``
   buckets (OSD nodes) to a dummy ``root`` so that the buckets don't receive
   any data until they are ready. When they are ready, move the buckets to the
   ``default`` root or to any other root as described below.

To add a bucket in the CRUSH map of a running cluster, run a command of the
following form:

.. prompt:: bash $

   ceph osd crush add-bucket {bucket-name} {bucket-type}

For details on this command's parameters, see the following:

``bucket-name``
:Description: The full name of the bucket.
:Type: String
:Required: Yes
:Example: ``rack12``

``bucket-type``
:Description: The type of the bucket. This type must already exist in the CRUSH hierarchy.
:Type: String
:Required: Yes
:Example: ``rack``

In the following example, the command adds the ``rack12`` bucket to the hierarchy:

.. prompt:: bash $

   ceph osd crush add-bucket rack12 rack

Moving a Bucket
---------------

To move a bucket to a different location or position in the CRUSH map
hierarchy, run a command of the following form:

.. prompt:: bash $

   ceph osd crush move {bucket-name} {bucket-type}={bucket-name} [...]

For details on this command's parameters, see the following:

``bucket-name``
:Description: The name of the bucket that you are moving.
:Type: String
:Required: Yes
:Example: ``foo-bar-1``

``bucket-type``
:Description: The bucket's new location in the CRUSH hierarchy.
:Type: Key-value pairs.
:Required: No
:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``
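
For example, to place a bucket named ``rack12`` directly under the ``default``
root:

.. prompt:: bash $

   ceph osd crush move rack12 root=default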

Removing a Bucket
-----------------

To remove a bucket from the CRUSH hierarchy, run a command of the following
form:

.. prompt:: bash $

   ceph osd crush remove {bucket-name}

.. note:: A bucket must already be empty before it is removed from the CRUSH
   hierarchy. In other words, there must not be OSDs or any other CRUSH buckets
   within it.

For details on the ``bucket-name`` parameter, see the following:

``bucket-name``
:Description: The name of the bucket that is being removed.
:Type: String
:Required: Yes
:Example: ``rack12``

In the following example, the command removes the ``rack12`` bucket from the
hierarchy:

.. prompt:: bash $

   ceph osd crush remove rack12

Creating a compat weight set
----------------------------

.. note:: Normally this action is done automatically if needed by the
   ``balancer`` module (provided that the module is enabled).

To create a *compat* weight set, run the following command:

.. prompt:: bash $

   ceph osd crush weight-set create-compat

To adjust the weights of the compat weight set, run a command of the following
form:

.. prompt:: bash $

   ceph osd crush weight-set reweight-compat {name} {weight}

To destroy the compat weight set, run the following command:

.. prompt:: bash $

   ceph osd crush weight-set rm-compat

Creating per-pool weight sets
-----------------------------

To create a weight set for a specific pool, run a command of the following
form:

.. prompt:: bash $

   ceph osd crush weight-set create {pool-name} {mode}

.. note:: Per-pool weight sets can be used only if all servers and daemons are
   running Luminous v12.2.z or a later release.

For details on this command's parameters, see the following:

``pool-name``
:Description: The name of a RADOS pool.
:Type: String
:Required: Yes
:Example: ``rbd``

``mode``
:Description: Either ``flat`` or ``positional``. A *flat* weight set
              assigns a single weight to all devices or buckets. A
              *positional* weight set has a potentially different
              weight for each position in the resulting placement
              mapping. For example: if a pool has a replica count of
              ``3``, then a positional weight set will have three
              weights for each device and bucket.
:Type: String
:Required: Yes
:Example: ``flat``

To adjust the weight of an item in a weight set, run a command of the following
form:

.. prompt:: bash $

   ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]}

To list existing weight sets, run the following command:

.. prompt:: bash $

   ceph osd crush weight-set ls

To remove a weight set, run a command of the following form:

.. prompt:: bash $

   ceph osd crush weight-set rm {pool-name}
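
As a sketch of the per-pool workflow (the pool name ``rbd`` comes from the
example above, and the weights are purely illustrative), you might create a
positional weight set for a three-replica pool and then adjust one OSD's
weights in it:

.. prompt:: bash $

   ceph osd crush weight-set create rbd positional
   ceph osd crush weight-set reweight rbd osd.0 0.9 0.95 1.0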

Creating a rule for a replicated pool
-------------------------------------

When you create a CRUSH rule for a replicated pool, there is an important
decision to make: selecting a failure domain. For example, if you select a
failure domain of ``host``, then CRUSH will ensure that each replica of the
data is stored on a unique host. Alternatively, if you select a failure domain
of ``rack``, then each replica of the data will be stored in a different rack.
Your selection of failure domain should be guided by the size of your cluster
and its CRUSH topology.

The entire cluster hierarchy is typically nested beneath a root node that is
named ``default``. If you have customized your hierarchy, you might want to
create a rule nested beneath some other node in the hierarchy. In creating
this rule for the customized hierarchy, the node type doesn't matter, and in
particular the rule does not have to be nested beneath a ``root`` node.

It is possible to create a rule that restricts data placement to a specific
*class* of device. By default, Ceph OSDs automatically classify themselves as
either ``hdd`` or ``ssd`` in accordance with the underlying type of device
being used. These device classes can be customized. One might set the device
class of OSDs to ``nvme`` to distinguish them from SATA SSDs, or one might set
them to something arbitrary like ``ssd-testing`` or ``ssd-ethel`` so that rules
and pools may be flexibly constrained to use (or avoid using) specific subsets
of OSDs based on specific requirements.

To create a rule for a replicated pool, run a command of the following form:

.. prompt:: bash $

   ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}]

For details on this command's parameters, see the following:

``name``
:Description: The name of the rule.
:Type: String
:Required: Yes
:Example: ``rbd-rule``

``root``
:Description: The name of the CRUSH hierarchy node under which data is to be placed.
:Type: String
:Required: Yes
:Example: ``default``

``failure-domain-type``
:Description: The type of CRUSH nodes used for the replicas of the failure domain.
:Type: String
:Required: Yes
:Example: ``rack``

``class``
:Description: The device class on which data is to be placed.
:Type: String
:Required: No
:Example: ``ssd``
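
Putting the example values from the table together (all of the names are
illustrative), the following creates a rack-level rule restricted to SSDs and
then assigns it to an existing pool named ``rbd``:

.. prompt:: bash $

   ceph osd crush rule create-replicated rbd-rule default rack ssd
   ceph osd pool set rbd crush_rule rbd-rule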

Creating a rule for an erasure-coded pool
-----------------------------------------

For an erasure-coded pool, similar decisions need to be made: what the failure
domain is, which node in the hierarchy data will be placed under (usually
``default``), and whether placement is restricted to a specific device class.
However, erasure-code pools are created in a different way: there is a need to
construct them carefully with reference to the erasure code plugin in use. For
this reason, these decisions must be incorporated into the **erasure-code
profile**. A CRUSH rule will then be created from the erasure-code profile,
either explicitly or automatically when the profile is used to create a pool.

To list the erasure-code profiles, run the following command:

.. prompt:: bash $

   ceph osd erasure-code-profile ls

To view a specific existing profile, run a command of the following form:

.. prompt:: bash $

   ceph osd erasure-code-profile get {profile-name}

Under normal conditions, profiles should never be modified; instead, a new
profile should be created and used when creating either a new pool or a new
rule for an existing pool.

An erasure-code profile consists of a set of key-value pairs. Most of these
key-value pairs govern the behavior of the erasure code that encodes data in
the pool. However, key-value pairs that begin with ``crush-`` govern the CRUSH
rule that is created.

The relevant erasure-code profile properties are as follows:

* **crush-root**: the name of the CRUSH node under which to place data
  [default: ``default``].
* **crush-failure-domain**: the CRUSH bucket type used in the distribution of
  erasure-coded shards [default: ``host``].
* **crush-device-class**: the device class on which to place data [default:
  none, which means that all devices are used].
* **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the
  number of erasure-code shards, affecting the resulting CRUSH rule.

After a profile is defined, you can create a CRUSH rule by running a command
of the following form:

.. prompt:: bash $

   ceph osd crush rule create-erasure {name} {profile-name}

.. note:: When creating a new pool, it is not necessary to create the rule
   explicitly. If only the erasure-code profile is specified and the rule
   argument is omitted, then Ceph will create the CRUSH rule automatically.
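
As an illustrative sketch (the profile and pool names are examples only), the
following defines a ``k=4, m=2`` profile that spreads shards across racks on
``hdd`` devices and then creates a pool from it, which also creates the
corresponding CRUSH rule automatically:

.. prompt:: bash $

   ceph osd erasure-code-profile set ec42-rack k=4 m=2 crush-failure-domain=rack crush-device-class=hdd
   ceph osd pool create ecpool 128 128 erasure ec42-rack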

Deleting rules
--------------

To delete rules that are not in use by pools, run a command of the following
form:

.. prompt:: bash $

   ceph osd crush rule rm {rule-name}

.. _crush-map-tunables:

Tunables
========

The CRUSH algorithm that is used to calculate the placement of data has been
improved over time. In order to support changes in behavior, we have provided
users with sets of tunables that determine which legacy or optimal version of
CRUSH is to be used.

In order to use newer tunables, all Ceph clients and daemons must support the
new major release of CRUSH. Because of this requirement, we have created
``profiles`` that are named after the Ceph version in which they were
introduced. For example, the ``firefly`` tunables were first supported by the
Firefly release and do not work with older clients (for example, clients
running Dumpling). After a cluster's tunables profile is changed from a legacy
set to a newer or ``optimal`` set, the ``ceph-mon`` and ``ceph-osd`` options
will prevent older clients that do not support the new CRUSH features from
connecting to the cluster.

argonaut (legacy)
-----------------

The legacy CRUSH behavior used by Argonaut and older releases works fine for
most clusters, provided that not many OSDs have been marked ``out``.

bobtail (CRUSH_TUNABLES2)
-------------------------

The ``bobtail`` tunable profile addresses the following misbehaviors of the
legacy CRUSH implementation:

* For hierarchies with a small number of devices in leaf buckets, some PGs
  might map to fewer than the desired number of replicas, resulting in
  ``undersized`` PGs. This is known to happen in the case of hierarchies with
  ``host`` nodes that have a small number of OSDs (1 to 3) nested beneath each
  host.

* For large clusters, a small percentage of PGs might map to fewer than the
  desired number of OSDs. This is known to happen when there are multiple
  hierarchy layers in use (for example, ``row``, ``rack``, ``host``, ``osd``).

* When one or more OSDs are marked ``out``, data tends to be redistributed
  to nearby OSDs instead of across the entire hierarchy.

The tunables introduced in the Bobtail release are as follows:

* ``choose_local_tries``: Number of local retries. The legacy value is ``2``,
  and the optimal value is ``0``.

* ``choose_local_fallback_tries``: The legacy value is ``5``, and the optimal
  value is ``0``.

* ``choose_total_tries``: Total number of attempts to choose an item. The
  legacy value is ``19``, but subsequent testing indicates that a value of
  ``50`` is more appropriate for typical clusters. For extremely large
  clusters, an even larger value might be necessary.

* ``chooseleaf_descend_once``: Whether a recursive ``chooseleaf`` attempt will
  retry, or try only once and allow the original placement to retry. The
  legacy default is ``0``, and the optimal value is ``1``.

Migration impact:

* Moving from the ``argonaut`` tunables to the ``bobtail`` tunables triggers a
  moderate amount of data movement. Use caution on a cluster that is already
  populated with data.

firefly (CRUSH_TUNABLES3)
-------------------------

chooseleaf_vary_r
~~~~~~~~~~~~~~~~~

The ``firefly`` tunable profile fixes a problem with ``chooseleaf`` CRUSH step
behavior. This problem arose when a large fraction of OSDs were marked ``out``,
which resulted in PG mappings with too few OSDs.

This profile was introduced in the Firefly release, and adds a new tunable as
follows:

* ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will start
  with a non-zero value of ``r``, as determined by the number of attempts the
  parent has already made. The legacy default value is ``0``, but with this
  value CRUSH is sometimes unable to find a mapping. The optimal value (in
  terms of computational cost and correctness) is ``1``.

Migration impact:

* For existing clusters that store a great deal of data, changing this tunable
  from ``0`` to ``1`` will trigger a large amount of data migration; a value
  of ``4`` or ``5`` will allow CRUSH to still find a valid mapping and will
  cause less data to move.

straw_calc_version tunable
~~~~~~~~~~~~~~~~~~~~~~~~~~

There were problems with the internal weights calculated and stored in the
CRUSH map for ``straw`` algorithm buckets. When there were buckets with a CRUSH
weight of ``0`` or with a mix of different and unique weights, CRUSH would
distribute data incorrectly (that is, not in proportion to the weights).

This tunable, introduced in the Firefly release, is as follows:

* ``straw_calc_version``: A value of ``0`` preserves the old, broken
  internal-weight calculation; a value of ``1`` fixes the problem.

Migration impact:

* Changing this tunable to a value of ``1`` and then adjusting a straw bucket
  (either by adding, removing, or reweighting an item or by using the
  reweight-all command) can trigger a small to moderate amount of data
  movement provided that the cluster has hit one of the problematic
  conditions.

This tunable option is notable in that it has absolutely no impact on the
required kernel version on the client side.
864 | hammer (CRUSH_V4) | |
865 | ----------------- | |
866 | ||
1e59de90 TL |
867 | The ``hammer`` tunable profile does not affect the mapping of existing CRUSH |
868 | maps simply by changing the profile. However: | |
7c673cae | 869 | |
1e59de90 TL |
870 | * There is a new bucket algorithm supported: ``straw2``. This new algorithm |
871 | fixes several limitations in the original ``straw``. More specifically, the | |
872 | old ``straw`` buckets would change some mappings that should not have | |
873 | changed when a weight was adjusted, while ``straw2`` achieves the original | |
874 | goal of changing mappings only to or from the bucket item whose weight has | |
7c673cae FG |
875 | changed. |
876 | ||
1e59de90 | 877 | * The ``straw2`` type is the default type for any newly created buckets. |
7c673cae FG |
878 | |
879 | Migration impact: | |
880 | ||
1e59de90 TL |
881 | * Changing a bucket type from ``straw`` to ``straw2`` will trigger a small |
882 | amount of data movement, depending on how much the bucket items' weights | |
883 | vary from each other. When the weights are all the same no data will move, | |
884 | and the more variance there is in the weights the more movement there will | |
885 | be. | |
7c673cae FG |
886 | |
887 | jewel (CRUSH_TUNABLES5) | |
888 | ----------------------- | |
889 | ||
1e59de90 TL |
890 | The ``jewel`` tunable profile improves the overall behavior of CRUSH. As a |
891 | result, significantly fewer mappings change when an OSD is marked ``out`` of | |
892 | the cluster. This improvement results in significantly less data movement. | |
7c673cae | 893 | |
1e59de90 | 894 | The new tunable introduced in the Jewel release is as follows: |
7c673cae | 895 | |
1e59de90 TL |
896 | * ``chooseleaf_stable``: Determines whether a recursive chooseleaf attempt |
897 | will use a better value for an inner loop that greatly reduces the number of | |
898 | mapping changes when an OSD is marked ``out``. The legacy value is ``0``, | |
899 | and the new value of ``1`` uses the new approach. | |
7c673cae FG |
900 | |
901 | Migration impact: | |
902 | ||
1e59de90 TL |
903 | * Changing this value on an existing cluster will result in a very large |
904 | amount of data movement because nearly every PG mapping is likely to change. | |
7c673cae | 905 | |

Client versions that support CRUSH_TUNABLES2
--------------------------------------------

* v0.55 and later, including Bobtail (v0.56.x)
* Linux kernel version v3.9 and later (for the CephFS and RBD kernel clients)

Client versions that support CRUSH_TUNABLES3
--------------------------------------------

* v0.78 (Firefly) and later
* Linux kernel version v3.15 and later (for the CephFS and RBD kernel clients)

Client versions that support CRUSH_V4
-------------------------------------

* v0.94 (Hammer) and later
* Linux kernel version v4.1 and later (for the CephFS and RBD kernel clients)

Client versions that support CRUSH_TUNABLES5
--------------------------------------------

* v10.0.2 (Jewel) and later
* Linux kernel version v4.5 and later (for the CephFS and RBD kernel clients)

"Non-optimal tunables" warning
------------------------------

In v0.74 and later versions, Ceph will raise a health check ("HEALTH_WARN crush
map has non-optimal tunables") if any of the current CRUSH tunables have
non-optimal values: that is, if any fail to have the optimal values from the
:ref:`default profile <rados_operations_crush_map_default_profile_definition>`.
There are two different ways to silence the alert:

1. Adjust the CRUSH tunables on the existing cluster so as to render them
   optimal. Making this adjustment will trigger some data movement
   (possibly as much as 10%). This approach is generally preferred to the
   other approach, but special care must be taken in situations where
   data movement might affect performance: for example, in production clusters.
   To enable optimal tunables, run the following command:

   .. prompt:: bash $

      ceph osd crush tunables optimal

   There are several potential problems that might make it preferable to revert
   to the previous values of the tunables. The new values might generate too
   much load for the cluster to handle, the new values might unacceptably slow
   the operation of the cluster, or there might be a client-compatibility
   problem. Such client-compatibility problems can arise when using old-kernel
   CephFS or RBD clients, or pre-Bobtail ``librados`` clients. To revert to
   the previous values of the tunables, run the following command:

   .. prompt:: bash $

      ceph osd crush tunables legacy

2. To silence the alert without making any changes to CRUSH, add the following
   option to the ``[mon]`` section of your ``ceph.conf`` file::

       mon_warn_on_legacy_crush_tunables = false

   In order for this change to take effect, you will need to either restart
   the monitors or run the following command to apply the option to the
   monitors while they are still running:

   .. prompt:: bash $

      ceph tell mon.\* config set mon_warn_on_legacy_crush_tunables false

Tuning CRUSH
------------

When making adjustments to CRUSH tunables, keep the following considerations in
mind:

* Adjusting the values of CRUSH tunables will result in the shift of one or
  more PGs from one storage node to another. If the Ceph cluster is already
  storing a great deal of data, be prepared for significant data movement.
* When the ``ceph-osd`` and ``ceph-mon`` daemons get the updated map, they
  immediately begin rejecting new connections from clients that do not support
  the new feature. However, already-connected clients are effectively
  grandfathered in, and any of these clients that do not support the new
  feature will malfunction.
* If the CRUSH tunables are set to newer (non-legacy) values and subsequently
  reverted to the legacy values, ``ceph-osd`` daemons will not be required to
  support any of the newer CRUSH features associated with the newer
  (non-legacy) values. However, the OSD peering process requires the
  examination and understanding of old maps. For this reason, **if the cluster
  has previously used non-legacy CRUSH values, do not run old versions of
  the** ``ceph-osd`` **daemon** -- even if the latest version of the map has
  been reverted so as to use the legacy defaults.

The simplest way to adjust CRUSH tunables is to apply them in matched sets
known as *profiles*. As of the Octopus release, Ceph supports the following
profiles:

* ``legacy``: The legacy behavior from argonaut and earlier.
* ``argonaut``: The legacy values supported by the argonaut release.
* ``bobtail``: The values supported by the bobtail release.
* ``firefly``: The values supported by the firefly release.
* ``hammer``: The values supported by the hammer release.
* ``jewel``: The values supported by the jewel release.
* ``optimal``: The best values for the current version of Ceph.

.. _rados_operations_crush_map_default_profile_definition:

* ``default``: The default values of a new cluster that has been installed
  from scratch. These values, which depend on the current version of Ceph, are
  hardcoded and are typically a mix of optimal and legacy values. These
  values often correspond to the ``optimal`` profile of either the previous
  LTS (long-term service) release or the most recent release for which most
  users are expected to have up-to-date clients.

To apply a profile to a running cluster, run a command of the following form:

.. prompt:: bash $

   ceph osd crush tunables {PROFILE}
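
To see exactly which tunable values a cluster is currently using, before or
after applying a profile, you can dump them with the following read-only
command:

.. prompt:: bash $

   ceph osd crush show-tunables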

Changing the profile might trigger a great deal of data movement. Consult
release notes and documentation before changing the profile on a running
cluster. Consider throttling recovery and backfill parameters in order to limit
the backfill resulting from a specific change.

.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf

Tuning Primary OSD Selection
============================

When a Ceph client reads or writes data, it first contacts the primary OSD in
each affected PG's acting set. By default, the first OSD in the acting set is
the primary OSD (also known as the "lead OSD"). For example, in the acting set
``[2, 3, 4]``, ``osd.2`` is listed first and is therefore the primary OSD.
However, sometimes it is clear that an OSD is not well suited to act as the
lead as compared with other OSDs (for example, if the OSD has a slow drive or a
slow controller). To prevent performance bottlenecks (especially on read
operations) and at the same time maximize the utilization of your hardware, you
can influence the selection of the primary OSD either by adjusting "primary
affinity" values, or by crafting a CRUSH rule that selects OSDs that are better
suited to act as the lead rather than other OSDs.

To determine whether tuning Ceph's selection of primary OSDs will improve
cluster performance, pool redundancy strategy must be taken into account. For
replicated pools, this tuning can be especially useful, because by default read
operations are served from the primary OSD of each PG. For erasure-coded pools,
however, the speed of read operations can be increased by enabling **fast
read** (see :ref:`pool-settings`).

.. _rados_ops_primary_affinity:

Primary Affinity
----------------

**Primary affinity** is a characteristic of an OSD that governs the likelihood
that a given OSD will be selected as the primary OSD (or "lead OSD") in a given
acting set. A primary affinity value can be any real number in the range ``0``
to ``1``, inclusive.

As an example of a common scenario in which it can be useful to adjust primary
affinity values, let us suppose that a cluster contains a mix of drive sizes:
for example, suppose it contains some older racks with 1.9 TB SATA SSDs and
some newer racks with 3.84 TB SATA SSDs. The latter will on average be assigned
twice the number of PGs and will thus serve twice the number of write and read
operations -- they will be busier than the former. In such a scenario, you
might make a rough assignment of primary affinity as inversely proportional to
OSD size. Such an assignment will not be 100% optimal, but it can readily
achieve a 15% improvement in overall read throughput by means of a more even
utilization of SATA interface bandwidth and CPU cycles. This example is not
merely a thought experiment meant to illustrate the theoretical benefits of
adjusting primary affinity values; this fifteen percent improvement was
achieved on an actual Ceph cluster.

By default, every Ceph OSD has a primary affinity value of ``1``. In a cluster
in which every OSD has this default value, all OSDs are equally likely to act
as a primary OSD.

By reducing the value of a Ceph OSD's primary affinity, you make CRUSH less
likely to select the OSD as primary in a PG's acting set. To change the weight
value associated with a specific OSD's primary affinity, run a command of the
following form:

.. prompt:: bash $

   ceph osd primary-affinity <osd-id> <weight>

The primary affinity of an OSD can be set to any real number in the range
``[0-1]`` inclusive, where ``0`` indicates that the OSD may not be used as
primary and ``1`` indicates that the OSD is maximally likely to be used as a
primary. When the weight is between these extremes, its value indicates roughly
how likely it is that CRUSH will select the OSD associated with it as a
primary.

The process by which CRUSH selects the lead OSD is not a mere function of a
simple probability determined by relative affinity values. Nevertheless,
measurable results can be achieved even with first-order approximations of
desirable primary affinity values.
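
For example, to make ``osd.7`` (an illustrative ID) roughly half as likely as
its peers to be selected as the lead OSD in the PGs it participates in:

.. prompt:: bash $

   ceph osd primary-affinity osd.7 0.5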

Custom CRUSH Rules
------------------

Some clusters balance cost and performance by mixing SSDs and HDDs in the same
replicated pool. By setting the primary affinity of HDD OSDs to ``0``,
operations will be directed to an SSD OSD in each acting set. Alternatively,
you can define a CRUSH rule that always selects an SSD OSD as the primary OSD
and then selects HDDs for the remaining OSDs. Given this rule, each PG's acting
set will contain an SSD OSD as the primary and have the remaining OSDs on HDDs.

For example, see the following CRUSH rule::

    rule mixed_replicated_rule {
            id 11
            type replicated
            step take default class ssd
            step chooseleaf firstn 1 type host
            step emit
            step take default class hdd
            step chooseleaf firstn 0 type host
            step emit
    }

This rule chooses an SSD as the first OSD. For an ``N``-times replicated pool,
this rule selects ``N+1`` OSDs in order to guarantee that ``N`` copies are on
different hosts, because the first SSD OSD might be colocated with any of the
``N`` HDD OSDs.
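
A rule like this takes effect only once a pool refers to it; the pool name
below is illustrative:

.. prompt:: bash $

   ceph osd pool set mypool crush_rule mixed_replicated_rule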

To avoid this extra storage requirement, you might place SSDs and HDDs in
different hosts. However, taking this approach means that all client requests
will be received by hosts with SSDs. For this reason, it might be advisable to
have faster CPUs for SSD OSDs and more modest CPUs for HDD OSDs, since the
latter will under normal circumstances perform only recovery operations. Here
the CRUSH roots ``ssd_hosts`` and ``hdd_hosts`` are under a strict requirement
not to contain any of the same servers, as seen in the following CRUSH rule::

    rule mixed_replicated_rule_two {
            id 1
            type replicated
            step take ssd_hosts class ssd
            step chooseleaf firstn 1 type host
            step emit
            step take hdd_hosts class hdd
            step chooseleaf firstn -1 type host
            step emit
    }

.. note:: If a primary SSD OSD fails, then requests to the associated PG will
   be temporarily served from a slower HDD OSD until the PG's data has been
   replicated onto the replacement primary SSD OSD.