Manually editing the CRUSH Map
==============================

.. note:: Manually editing the CRUSH map is an advanced administrator
   operation. For the majority of installations, CRUSH changes can be
   implemented via the Ceph CLI and do not require manual CRUSH map edits. If
   you have identified a use case where manual edits *are* necessary with a
   recent Ceph release, consider contacting the Ceph developers at dev@ceph.io
   so that future versions of Ceph do not have this problem.

To edit an existing CRUSH map, carry out the following procedure:

#. `Get the CRUSH map`_.
#. `Decompile`_ the CRUSH map.
#. Edit at least one of the following sections: `Devices`_, `Buckets`_, and
   `Rules`_. Use a text editor for this task.
#. `Recompile`_ the CRUSH map.
#. `Set the CRUSH map`_.

For details on setting the CRUSH map rule for a specific pool, see `Set Pool
Values`_.

.. _Get the CRUSH map: #getcrushmap
.. _Decompile: #decompilecrushmap
.. _Devices: #crushmapdevices
.. _Buckets: #crushmapbuckets
.. _Rules: #crushmaprules
.. _Recompile: #compilecrushmap
.. _Set the CRUSH map: #setcrushmap
.. _Set Pool Values: ../pools#setpoolvalues

.. _getcrushmap:

Get the CRUSH Map
-----------------

To get the CRUSH map for your cluster, run a command of the following form:

.. prompt:: bash $

   ceph osd getcrushmap -o {compiled-crushmap-filename}

Ceph outputs (``-o``) a compiled CRUSH map to the filename that you have
specified. Because the CRUSH map is in a compiled form, you must first
decompile it before you can edit it.

.. _decompilecrushmap:

Decompile the CRUSH Map
-----------------------

To decompile the CRUSH map, run a command of the following form:

.. prompt:: bash $

   crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename}

.. _compilecrushmap:

Recompile the CRUSH Map
-----------------------

To recompile the CRUSH map after editing it, run a command of the following form:

.. prompt:: bash $

   crushtool -c {decompiled-crushmap-filename} -o {compiled-crushmap-filename}

.. _setcrushmap:

Set the CRUSH Map
-----------------

To set the CRUSH map for your cluster, run a command of the following form:

.. prompt:: bash $

   ceph osd setcrushmap -i {compiled-crushmap-filename}

Ceph loads (``-i``) a compiled CRUSH map from the filename that you have
specified.
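
Taken together, a typical edit cycle looks like the following. The filenames
used here are arbitrary examples:

.. prompt:: bash $

   ceph osd getcrushmap -o crush.bin
   crushtool -d crush.bin -o crush.txt
   vi crush.txt                            # edit the decompiled map
   crushtool -c crush.txt -o crush-new.bin
   ceph osd setcrushmap -i crush-new.bin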

Sections
--------

A CRUSH map has six main sections:

#. **tunables:** The preamble at the top of the map describes any *tunables*
   that differ from the legacy CRUSH behavior. These tunables correct for old
   bugs, optimizations, and other changes that have been made over the years
   to improve CRUSH's behavior.

#. **devices:** Devices are individual OSDs that store data.

#. **types:** Bucket ``types`` define the types of buckets that are used in
   your CRUSH hierarchy.

#. **buckets:** Buckets consist of a hierarchical aggregation of storage
   locations (for example, rows, racks, chassis, hosts) and their assigned
   weights. After the bucket ``types`` have been defined, the CRUSH map defines
   each node in the hierarchy, its type, and which devices or other nodes it
   contains.

#. **rules:** Rules define policy about how data is distributed across
   devices in the hierarchy.

#. **choose_args:** ``choose_args`` are alternative weights associated with
   the hierarchy that have been adjusted in order to optimize data placement.
   A single ``choose_args`` map can be used for the entire cluster, or a
   number of ``choose_args`` maps can be created such that each map is crafted
   for a particular pool, as illustrated below.
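
The following is a rough, illustrative sketch of how a ``choose_args`` section
can appear in a decompiled CRUSH map; the values are hypothetical and the
exact layout may vary between Ceph releases::

   # choose_args
   choose_args 0 {
     {
       bucket_id -1
       weight_set [
         [ 1.000 2.000 ]
       ]
     }
   }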


.. _crushmapdevices:

CRUSH-Map Devices
-----------------

Devices are individual OSDs that store data. In this section, there is usually
one device defined for each OSD daemon in your cluster. Devices are identified
by an ``id`` (a non-negative integer) and a ``name`` (usually ``osd.N``, where
``N`` is the device's ``id``).

.. _crush-map-device-class:

A device can also have a *device class* associated with it: for example,
``hdd`` or ``ssd``. Device classes make it possible for devices to be targeted
by CRUSH rules. This means that device classes allow CRUSH rules to select only
OSDs that match certain characteristics. For example, you might want an RBD
pool associated only with SSDs and a different RBD pool associated only with
HDDs.
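
Device classes can usually be put to use without any manual map editing. As a
minimal sketch, the following CLI commands create a replicated CRUSH rule that
is restricted to the ``ssd`` device class and assign it to a pool; the rule
name ``fast`` and the pool name ``rbd-ssd`` are arbitrary examples:

.. prompt:: bash $

   ceph osd crush rule create-replicated fast default host ssd
   ceph osd pool set rbd-ssd crush_rule fast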

To see a list of the devices in your cluster, run the following command:

.. prompt:: bash #

   ceph device ls

In the decompiled CRUSH map itself, devices are listed in lines of the
following form::

   device {num} {osd.name} [class {class}]

For example::

   device 0 osd.0 class ssd
   device 1 osd.1 class hdd
   device 2 osd.2
   device 3 osd.3

In most cases, each device maps to a corresponding ``ceph-osd`` daemon. This
daemon might map to a single storage device, a pair of devices (for example,
one for data and one for a journal or metadata), or in some cases a small RAID
device or a partition of a larger storage device.

CRUSH-Map Bucket Types
----------------------

The second list in the CRUSH map defines 'bucket' types. Buckets facilitate a
hierarchy of nodes and leaves. Node buckets (also known as non-leaf buckets)
typically represent physical locations in a hierarchy. Nodes aggregate other
nodes or leaves. Leaf buckets represent ``ceph-osd`` daemons and their
corresponding storage media.

.. tip:: In the context of CRUSH, the term "bucket" is used to refer to
   a node in the hierarchy (that is, to a location or a piece of physical
   hardware). In the context of RADOS Gateway APIs, however, the term
   "bucket" has a different meaning.

To add a bucket type to the CRUSH map, create a new line under the list of
bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name.
By convention, there is exactly one leaf bucket type and it is ``type 0``;
however, you may give the leaf bucket any name you like (for example: ``osd``,
``disk``, ``drive``, ``storage``)::

   # types
   type {num} {bucket-name}

For example::

   # types
   type 0 osd
   type 1 host
   type 2 chassis
   type 3 rack
   type 4 row
   type 5 pdu
   type 6 pod
   type 7 room
   type 8 datacenter
   type 9 zone
   type 10 region
   type 11 root

.. _crushmapbuckets:

CRUSH-Map Bucket Hierarchy
--------------------------

The CRUSH algorithm distributes data objects among storage devices according to
a per-device weight value, approximating a uniform probability distribution.
CRUSH distributes objects and their replicas according to the hierarchical
cluster map you define. The CRUSH map represents the available storage devices
and the logical elements that contain them.

To map placement groups (PGs) to OSDs across failure domains, a CRUSH map
defines a hierarchical list of bucket types under ``#types`` in the generated
CRUSH map. The purpose of creating a bucket hierarchy is to segregate the leaf
nodes according to their failure domains (for example: hosts, chassis, racks,
power distribution units, pods, rows, rooms, and data centers). With the
exception of the leaf nodes that represent OSDs, the hierarchy is arbitrary and
you may define it according to your own needs.

We recommend adapting your CRUSH map to your preferred hardware-naming
conventions and using bucket names that clearly reflect the physical
hardware. Clear naming practice can make it easier to administer the cluster
and easier to troubleshoot problems when OSDs malfunction (or other hardware
malfunctions) and the administrator needs access to physical hardware.

In the following example, the bucket hierarchy has a leaf bucket named ``osd``
and two node buckets named ``host`` and ``rack``:

.. ditaa::

                           +-----------+
                           | {o}rack   |
                           |   Bucket  |
                           +-----+-----+
                                 |
                 +---------------+---------------+
                 |                               |
           +-----+-----+                   +-----+-----+
           | {o}host   |                   | {o}host   |
           |   Bucket  |                   |   Bucket  |
           +-----+-----+                   +-----+-----+
                 |                               |
         +-------+-------+               +-------+-------+
         |               |               |               |
   +-----+-----+   +-----+-----+   +-----+-----+   +-----+-----+
   |    osd    |   |    osd    |   |    osd    |   |    osd    |
   |   Bucket  |   |   Bucket  |   |   Bucket  |   |   Bucket  |
   +-----------+   +-----------+   +-----------+   +-----------+

.. note:: The higher-numbered ``rack`` bucket type aggregates the
   lower-numbered ``host`` bucket type.

Because leaf nodes reflect storage devices that have already been declared
under the ``#devices`` list at the beginning of the CRUSH map, there is no need
to declare them as bucket instances. The second-lowest bucket type in your
hierarchy is typically used to aggregate the devices (that is, the
second-lowest bucket type is usually the computer that contains the storage
media) and is given whatever name you prefer, such as ``node``, ``computer``,
``server``, ``host``, or ``machine``. In high-density environments, it is
common to have multiple hosts or nodes in a single chassis (for example, in the
case of blades or twins). It is important to anticipate the potential
consequences of chassis failure -- for example, if a chassis has to be replaced
after a node fails, the chassis's hosts or nodes (and their associated OSDs)
will be in a ``down`` state during the replacement.

To declare a bucket instance, do the following: specify its type, give it a
unique name (an alphanumeric string), assign it a unique ID expressed as a
negative integer (this is optional), assign it a weight relative to the total
capacity and capability of the item(s) in the bucket, assign it a bucket
algorithm (usually ``straw2``), and specify the bucket algorithm's hash
(usually ``0``, a setting that reflects the hash algorithm ``rjenkins1``). A
bucket may have one or more items. The items may consist of node buckets or
leaves. Items may have a weight that reflects the relative weight of the item.

To declare a node bucket, use the following syntax::

   [bucket-type] [bucket-name] {
           id [a unique negative numeric ID]
           weight [the relative capacity/capability of the item(s)]
           alg [the bucket algorithm: uniform | list | tree | straw | straw2 ]
           hash [the hash type: 0 by default]
           item [item-name] weight [weight]
   }

For example, in the above diagram, two host buckets (referred to in the
declaration below as ``node1`` and ``node2``) and one rack bucket (referred to
in the declaration below as ``rack1``) are defined. The OSDs are declared as
items within the host buckets::

   host node1 {
           id -1
           alg straw2
           hash 0
           item osd.0 weight 1.00
           item osd.1 weight 1.00
   }

   host node2 {
           id -2
           alg straw2
           hash 0
           item osd.2 weight 1.00
           item osd.3 weight 1.00
   }

   rack rack1 {
           id -3
           alg straw2
           hash 0
           item node1 weight 2.00
           item node2 weight 2.00
   }

.. note:: In this example, the rack bucket does not contain any OSDs. Instead,
   it contains lower-level host buckets and includes the sum of their weight in
   the item entry.
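
Equivalent buckets can also be created and arranged without a manual map edit.
A minimal sketch using the CLI, assuming the bucket names from the example
above:

.. prompt:: bash $

   ceph osd crush add-bucket rack1 rack
   ceph osd crush move node1 rack=rack1
   ceph osd crush move node2 rack=rack1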

.. topic:: Bucket Types

   Ceph supports five bucket types. Each bucket type provides a balance between
   performance and reorganization efficiency, and each is different from the
   others. If you are unsure of which bucket type to use, use the ``straw2``
   bucket. For a more technical discussion of bucket types than is offered
   here, see **Section 3.4** of `CRUSH - Controlled, Scalable, Decentralized
   Placement of Replicated Data`_.

   The bucket types are as follows:

   #. **uniform**: Uniform buckets aggregate devices that have **exactly**
      the same weight. For example, when hardware is commissioned or
      decommissioned, it is often done in sets of machines that have exactly
      the same physical configuration (this can be the case, for example,
      after bulk purchases). When storage devices have exactly the same
      weight, you may use the ``uniform`` bucket type, which allows CRUSH to
      map replicas into uniform buckets in constant time. If your devices have
      non-uniform weights, you should not use the uniform bucket algorithm.

   #. **list**: List buckets aggregate their content as linked lists. The
      behavior of list buckets is governed by the :abbr:`RUSH (Replication
      Under Scalable Hashing)`:sub:`P` algorithm. In the behavior of this
      bucket type, an object is either relocated to the newest device in
      accordance with an appropriate probability, or it remains on the older
      devices as before. This results in optimal data migration when items are
      added to the bucket. The removal of items from the middle or the tail of
      the list, however, can result in a significant amount of unnecessary
      data movement. This means that list buckets are most suitable for
      circumstances in which they **never shrink or very rarely shrink**.

   #. **tree**: Tree buckets use a binary search tree. They are more efficient
      at dealing with buckets that contain many items than are list buckets.
      The behavior of tree buckets is governed by the :abbr:`RUSH (Replication
      Under Scalable Hashing)`:sub:`R` algorithm. Tree buckets reduce the
      placement time to O(log n). This means that tree buckets are
      suitable for managing large sets of devices or nested buckets.

   #. **straw**: Straw buckets allow all items in the bucket to "compete"
      against each other for replica placement through a process analogous to
      drawing straws. This is different from the behavior of list buckets and
      tree buckets, which use a divide-and-conquer strategy that either gives
      certain items precedence (for example, those at the beginning of a list)
      or obviates the need to consider entire subtrees of items. Such an
      approach improves the performance of the replica placement process, but
      can also introduce suboptimal reorganization behavior when the contents
      of a bucket change due to an addition, a removal, or the re-weighting of
      an item.

   #. **straw2**: Straw2 buckets improve on straw buckets by correctly avoiding
      any data movement between items when neighbor weights change. For
      example, if the weight of a given item changes (including during the
      operations of adding it to the cluster or removing it from the
      cluster), there will be data movement to or from only that item.
      Neighbor weights are not taken into account.

.. topic:: Hash

   Each bucket uses a hash algorithm. As of Reef, Ceph supports the
   ``rjenkins1`` algorithm. To select ``rjenkins1`` as the hash algorithm,
   enter ``0`` as your hash setting.

.. _weightingbucketitems:

.. topic:: Weighting Bucket Items

   Ceph expresses bucket weights as doubles, which allows for fine-grained
   weighting. A weight is the relative difference between device capacities. We
   recommend using ``1.00`` as the relative weight for a 1 TB storage device.
   In such a scenario, a weight of ``0.5`` would represent approximately 500 GB,
   and a weight of ``3.00`` would represent approximately 3 TB. Higher-level
   buckets have a weight that is the sum total of the weights of the leaf items
   aggregated by the bucket.

   A bucket item weight is one-dimensional, but you may also calculate your
   item weights to reflect the performance of the storage drive. For example,
   if you have many 1 TB drives where some have a relatively low data transfer
   rate and others have a relatively high data transfer rate, you may weight
   them differently even though they have the same capacity (for example,
   a weight of ``0.80`` for the first set of drives with lower total throughput
   and ``1.20`` for the second set of drives with higher total throughput).
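
   Item weights can also be adjusted at runtime without a manual map edit. For
   example, to set the CRUSH weight of ``osd.3`` (here assumed to be backed by
   a 3 TB device) to ``3.00``, you can run:

   .. prompt:: bash $

      ceph osd crush reweight osd.3 3.00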

.. _crushmaprules:

CRUSH Map Rules
---------------

CRUSH maps contain 'CRUSH rules', which determine data placement for a pool.
The default CRUSH map has a rule for each pool. For large clusters, you will
likely create many pools, and each pool may have its own non-default CRUSH
rule.

.. note:: In most cases, you will not need to modify the default rule. When
   you create a new pool, by default the rule will be set to ``0``.

CRUSH rules define placement and replication strategies or distribution
policies that allow you to specify exactly how CRUSH places object replicas.
For example, you might create a rule selecting a pair of targets for 2-way
mirroring, another rule for selecting three targets in two different data
centers for 3-way mirroring, and yet another rule for erasure coding over six
storage devices. For a detailed discussion of CRUSH rules, refer to
`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_,
and more specifically to **Section 3.2**.

A rule takes the following form::

   rule <rulename> {
           id [a unique whole numeric ID]
           type [ replicated | erasure ]
           step take <bucket-name> [class <device-class>]
           step [choose|chooseleaf] [firstn|indep] <N> type <bucket-type>
           step emit
   }

``id``

:Description: A unique whole number for identifying the rule.
:Purpose: A component of the rule mask.
:Type: Integer
:Required: Yes
:Default: 0


``type``

:Description: Describes whether the rule is for a replicated pool
              (``replicated``) or an erasure-coded pool (``erasure``).

:Purpose: A component of the rule mask.
:Type: String
:Required: Yes
:Default: ``replicated``
:Valid Values: Currently only ``replicated`` and ``erasure``


``step take <bucket-name> [class <device-class>]``

:Description: Takes a bucket name and begins iterating down the tree.
              If ``device-class`` is specified, it must match
              a class previously used when defining a device. All
              devices that do not belong to the class are excluded.
:Purpose: A component of the rule.
:Required: Yes
:Example: ``step take data``


``step choose firstn {num} type {bucket-type}``

:Description: Selects ``{num}`` buckets of the given type from within the
              current bucket. ``{num}`` is usually the number of replicas in
              the pool (that is, the pool size).

              - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available).
              - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets.
              - If ``{num} < 0``, choose ``pool-num-replicas`` minus the absolute value of ``{num}``.

:Purpose: A component of the rule.
:Prerequisite: Follows ``step take`` or ``step choose``.
:Example: ``step choose firstn 1 type row``


``step chooseleaf firstn {num} type {bucket-type}``

:Description: Selects a set of buckets of ``{bucket-type}`` and chooses a leaf
              node (that is, an OSD) from the subtree of each bucket in the set
              of buckets. The number of buckets in the set is usually the
              number of replicas in the pool (that is, the pool size).

              - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available).
              - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets.
              - If ``{num} < 0``, choose ``pool-num-replicas`` minus the absolute value of ``{num}``.

:Purpose: A component of the rule. Using it removes the need to select a device in two separate steps.
:Prerequisite: Follows ``step take`` or ``step choose``.
:Example: ``step chooseleaf firstn 0 type row``


``step emit``

:Description: Outputs the current value and empties the stack. Typically used
              at the end of a rule, but may also be used to pick from different
              trees in the same rule.

:Purpose: A component of the rule.
:Prerequisite: Follows ``step choose``.
:Example: ``step emit``
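
Putting these steps together: the following is a minimal sketch of a complete
replicated rule that places each replica on an OSD under a distinct host
beneath the ``default`` root (the rule name and ``id`` are arbitrary
examples)::

   rule replicated_hosts {
           id 1
           type replicated
           step take default
           step chooseleaf firstn 0 type host
           step emit
   }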

.. important:: A given CRUSH rule may be assigned to multiple pools, but it
   is not possible for a single pool to have multiple CRUSH rules.

``firstn`` versus ``indep``

:Description: Controls the replacement strategy that CRUSH uses when items
              (OSDs) are marked ``down`` in the CRUSH map. If a rule is to be
              used with replicated pools, it should use ``firstn``; if it is
              for erasure-coded pools, it should use ``indep``.

              The reason has to do with how the two modes behave when a
              previously selected device fails. Suppose a PG is stored on
              OSDs 1, 2, 3, 4, and 5, and then OSD 3 goes down.

              In ``firstn`` mode, CRUSH simply adjusts its calculation to
              select OSDs 1 and 2, then selects 3 and discovers that it is
              down, retries and selects 4 and 5, and finally goes on to select
              a new OSD 6. The final CRUSH mapping change is therefore
              1, 2, 3, 4, 5 -> 1, 2, 4, 5, 6.

              For an erasure-coded pool, however, that change would remap the
              data stored on OSDs 4, 5, and 6. The ``indep`` mode avoids this:
              when it selects the failed OSD 3, it retries and picks OSD 6 in
              that position, for a final transformation of
              1, 2, 3, 4, 5 -> 1, 2, 6, 4, 5.
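
For example, a minimal sketch of an erasure-code rule that uses ``indep`` to
select one OSD under each of the chosen hosts (the rule name and ``id`` are
arbitrary examples)::

   rule ec_hosts {
           id 2
           type erasure
           step take default
           step chooseleaf indep 0 type host
           step emit
   }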

.. _crush-reclassify:

Migrating from a legacy SSD rule to device classes
--------------------------------------------------

It used to be necessary to manually edit your CRUSH map and maintain a
parallel hierarchy for each specialized device type (for example, SSDs) in
order to write rules that apply to those devices. Since the Luminous release,
the *device class* feature has made this possible transparently.

However, migrating from an existing, manually customized per-device map to
the new device class rules in the naive way will cause all data in the
system to be reshuffled.

``crushtool`` has a few commands that can transform a legacy rule and
hierarchy so that you can start using the new class-based rules. Three types
of transformation are possible:

#. ``--reclassify-root <root-name> <device-class>``

   This takes everything in the hierarchy beneath the specified root and
   adjusts any rules that reference that root via a ``take <root-name>`` step
   so that they instead ``take <root-name> class <device-class>``. It
   renumbers the buckets in such a way that the old IDs are used for the
   specified class's "shadow tree", so that no data movement takes place.

   For example, imagine you have an existing rule like::

        rule replicated_rule {
                id 0
                type replicated
                step take default
                step chooseleaf firstn 0 type rack
                step emit
        }

   If you reclassify the root ``default`` as class ``hdd``, the rule will
   become::

        rule replicated_rule {
                id 0
                type replicated
                step take default class hdd
                step chooseleaf firstn 0 type rack
                step emit
        }

#. ``--set-subtree-class <bucket-name> <device-class>``

   This marks every device in the subtree rooted at *bucket-name* with the
   specified device class.

   This is normally used in conjunction with the ``--reclassify-root`` option
   to ensure that all devices in that root are labeled with the correct
   class. In some situations, however, some of those devices are correctly
   labeled with a different class and should not be relabeled. In such cases,
   you can exclude the ``--set-subtree-class`` option. The remapping process
   will then not be perfect, because the previous rule distributed data
   across devices of multiple classes whereas the adjusted rules will map
   only to devices of the specified *device-class*. However, that is often an
   acceptable amount of data movement when the number of outlier devices is
   small.

#. ``--reclassify-bucket <match-pattern> <device-class> <default-parent>``

   This allows you to merge a parallel type-specific hierarchy with the
   normal hierarchy. For example, many users have maps like::

        host node1 {
                id -2           # do not change unnecessarily
                # weight 109.152
                alg straw2
                hash 0  # rjenkins1
                item osd.0 weight 9.096
                item osd.1 weight 9.096
                item osd.2 weight 9.096
                item osd.3 weight 9.096
                item osd.4 weight 9.096
                item osd.5 weight 9.096
                ...
        }

        host node1-ssd {
                id -10          # do not change unnecessarily
                # weight 2.000
                alg straw2
                hash 0  # rjenkins1
                item osd.80 weight 2.000
                ...
        }

        root default {
                id -1           # do not change unnecessarily
                alg straw2
                hash 0  # rjenkins1
                item node1 weight 110.967
                ...
        }

        root ssd {
                id -18          # do not change unnecessarily
                # weight 16.000
                alg straw2
                hash 0  # rjenkins1
                item node1-ssd weight 2.000
                ...
        }

   This function reclassifies each bucket that matches a given pattern. The
   pattern can be of the form ``%suffix`` or ``prefix%``. For example, in the
   above example, we would use the pattern ``%-ssd``. For each matched
   bucket, the remaining portion of the name (the part that matches the
   ``%`` wildcard) specifies the *base bucket*. All devices in the matched
   bucket are labeled with the specified device class and then moved to the
   base bucket. If the base bucket does not exist (for example, ``node12-ssd``
   exists but ``node12`` does not), then it is created and linked under the
   specified *default parent* bucket. In each case, the old bucket IDs are
   preserved for the new shadow buckets in order to prevent data movement.
   Any rules with ``take`` steps that reference the old buckets are adjusted
   accordingly.

#. ``--reclassify-bucket <bucket-name> <device-class> <base-bucket>``

   The same command can also be used without a wildcard in order to map a
   single bucket. For example, in the previous example, we want the ``ssd``
   bucket to be mapped to the ``default`` bucket.

The final command to convert the map consisting of the above fragments would
be something like the following:

.. prompt:: bash $

   ceph osd getcrushmap -o original
   crushtool -i original --reclassify \
     --set-subtree-class default hdd \
     --reclassify-root default hdd \
     --reclassify-bucket %-ssd ssd default \
     --reclassify-bucket ssd ssd default \
     -o adjusted

To ensure that the conversion is correct, there is a ``--compare`` command
that tests a large sample of inputs against the CRUSH map and checks that the
same result is output. These inputs are controlled by the same options that
apply to the ``--test`` command. For the above example, the command and its
output look like this:

.. prompt:: bash $

   crushtool -i original --compare adjusted

::

   rule 0 had 0/10240 mismatched mappings (0)
   rule 1 had 0/10240 mismatched mappings (0)
   maps appear equivalent

If there were differences, the ratio of remapped inputs would be reported in
the parentheses.

When you are satisfied with the adjusted map, apply it to the cluster with a
command of the following form:

.. prompt:: bash $

   ceph osd setcrushmap -i adjusted

Tuning CRUSH, the hard way
--------------------------

If you can ensure that all clients are running recent code, you can adjust the
tunables by extracting the CRUSH map, modifying the values, and reinjecting the
map into the cluster.

* Extract the latest CRUSH map:

  .. prompt:: bash $

     ceph osd getcrushmap -o /tmp/crush

* Adjust tunables. These values appear to offer the best behavior for both the
  large and small clusters that we tested. You will need to additionally
  specify the ``--enable-unsafe-tunables`` argument to ``crushtool`` for this
  to work. Please use this option with extreme care:

  .. prompt:: bash $

     crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new

* Reinject the modified map:

  .. prompt:: bash $

     ceph osd setcrushmap -i /tmp/crush.new
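
After reinjecting the map, you can confirm which tunable values are in effect
by inspecting the output of the following command:

.. prompt:: bash $

   ceph osd crush show-tunables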

Legacy values
-------------

For reference, the legacy values for the CRUSH tunables can be set with the
following command:

.. prompt:: bash $

   crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy

Again, the special ``--enable-unsafe-tunables`` option is required. Further, as
noted above, be careful when running old versions of the ``ceph-osd`` daemon
after reverting to legacy values, because the feature bit is not perfectly
enforced.

.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf