Manually editing the CRUSH Map
==============================

.. note:: Manually editing the CRUSH map is an advanced administrator
   operation. For the majority of installations, CRUSH changes can be
   implemented via the Ceph CLI and do not require manual CRUSH map edits. If
   you have identified a use case where manual edits *are* necessary with a
   recent Ceph release, consider contacting the Ceph developers at dev@ceph.io
   so that future versions of Ceph do not have this problem.

To edit an existing CRUSH map, carry out the following procedure:

#. `Get the CRUSH map`_.
#. `Decompile`_ the CRUSH map.
#. Edit at least one of the following sections: `Devices`_, `Buckets`_, and
   `Rules`_. Use a text editor for this task.
#. `Recompile`_ the CRUSH map.
#. `Set the CRUSH map`_.

For details on setting the CRUSH map rule for a specific pool, see `Set Pool
Values`_.

.. _Get the CRUSH map: #getcrushmap
.. _Decompile: #decompilecrushmap
.. _Devices: #crushmapdevices
.. _Buckets: #crushmapbuckets
.. _Rules: #crushmaprules
.. _Recompile: #compilecrushmap
.. _Set the CRUSH map: #setcrushmap
.. _Set Pool Values: ../pools#setpoolvalues

.. _getcrushmap:

Get the CRUSH Map
-----------------

To get the CRUSH map for your cluster, run a command of the following form:

.. prompt:: bash $

   ceph osd getcrushmap -o {compiled-crushmap-filename}

Ceph outputs (``-o``) a compiled CRUSH map to the filename that you have
specified. Because the CRUSH map is in a compiled form, you must first
decompile it before you can edit it.

.. _decompilecrushmap:

Decompile the CRUSH Map
-----------------------

To decompile the CRUSH map, run a command of the following form:

.. prompt:: bash $

   crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename}

.. _compilecrushmap:

Recompile the CRUSH Map
-----------------------

To compile the CRUSH map, run a command of the following form:

.. prompt:: bash $

   crushtool -c {decompiled-crushmap-filename} -o {compiled-crushmap-filename}

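Before injecting a recompiled map into the cluster, you can sanity-check it
offline with ``crushtool``'s ``--test`` option. The following is a minimal
sketch; the rule ID and replica count shown are illustrative and should match
the pool you care about:

.. prompt:: bash $

   crushtool -i {compiled-crushmap-filename} --test --show-statistics --rule 0 --num-rep 3
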
.. _setcrushmap:

Set the CRUSH Map
-----------------

To set the CRUSH map for your cluster, run a command of the following form:

.. prompt:: bash $

   ceph osd setcrushmap -i {compiled-crushmap-filename}

Ceph loads (``-i``) a compiled CRUSH map from the filename that you have
specified.
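
Putting the preceding steps together, a complete manual edit cycle looks
roughly like the following sketch. The filenames are hypothetical placeholders,
and the edit itself happens between the decompile and recompile steps:

.. prompt:: bash $

   ceph osd getcrushmap -o crushmap.bin
   crushtool -d crushmap.bin -o crushmap.txt
   crushtool -c crushmap.txt -o crushmap-new.bin
   ceph osd setcrushmap -i crushmap-new.bin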

Sections
--------

A CRUSH map has six main sections:

#. **tunables:** The preamble at the top of the map describes any *tunables*
   that are not a part of legacy CRUSH behavior. These tunables correct for old
   bugs, optimizations, or other changes that have been made over the years to
   improve CRUSH's behavior.

#. **devices:** Devices are individual OSDs that store data.

#. **types**: Bucket ``types`` define the types of buckets that are used in
   your CRUSH hierarchy.

#. **buckets:** Buckets consist of a hierarchical aggregation of storage
   locations (for example, rows, racks, chassis, hosts) and their assigned
   weights. After the bucket ``types`` have been defined, the CRUSH map defines
   each node in the hierarchy, its type, and which devices or other nodes it
   contains.

#. **rules:** Rules define policy about how data is distributed across
   devices in the hierarchy.

#. **choose_args:** ``choose_args`` are alternative weights associated with
   the hierarchy that have been adjusted in order to optimize data placement. A
   single ``choose_args`` map can be used for the entire cluster, or a number
   of ``choose_args`` maps can be created such that each map is crafted for a
   particular pool.
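
The following abridged sketch shows how these sections are laid out in a
decompiled CRUSH map. The tunables, devices, buckets, and rule shown here are
illustrative placeholders rather than output from a real cluster, and a
``choose_args`` section appears only if one has been defined::

   # begin crush map
   tunable choose_local_tries 0
   tunable choose_total_tries 50
   tunable chooseleaf_descend_once 1

   # devices
   device 0 osd.0 class hdd
   device 1 osd.1 class ssd

   # types
   type 0 osd
   type 1 host
   type 11 root

   # buckets
   host node1 {
           id -2
           alg straw2
           hash 0        # rjenkins1
           item osd.0 weight 1.00
           item osd.1 weight 1.00
   }
   root default {
           id -1
           alg straw2
           hash 0        # rjenkins1
           item node1 weight 2.00
   }

   # rules
   rule replicated_rule {
           id 0
           type replicated
           step take default
           step chooseleaf firstn 0 type host
           step emit
   }

   # end crush map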


.. _crushmapdevices:

CRUSH-Map Devices
-----------------

Devices are individual OSDs that store data. In this section, there is usually
one device defined for each OSD daemon in your cluster. Devices are identified
by an ``id`` (a non-negative integer) and a ``name`` (usually ``osd.N``, where
``N`` is the device's ``id``).


.. _crush-map-device-class:

A device can also have a *device class* associated with it: for example,
``hdd`` or ``ssd``. Device classes make it possible for devices to be targeted
by CRUSH rules. This means that device classes allow CRUSH rules to select only
OSDs that match certain characteristics. For example, you might want an RBD
pool associated only with SSDs and a different RBD pool associated only with
HDDs.

To see a list of devices, run the following command:

.. prompt:: bash #

   ceph device ls

In the CRUSH map itself, each device is declared with a line of the following
form::

   device {num} {osd.name} [class {class}]

For example::

   device 0 osd.0 class ssd
   device 1 osd.1 class hdd
   device 2 osd.2
   device 3 osd.3

In most cases, each device maps to a corresponding ``ceph-osd`` daemon. This
daemon might map to a single storage device, a pair of devices (for example,
one for data and one for a journal or metadata), or in some cases a small RAID
device or a partition of a larger storage device.


CRUSH-Map Bucket Types
----------------------

The second list in the CRUSH map defines 'bucket' types. Buckets facilitate a
hierarchy of nodes and leaves. Node buckets (also known as non-leaf buckets)
typically represent physical locations in a hierarchy. Nodes aggregate other
nodes or leaves. Leaf buckets represent ``ceph-osd`` daemons and their
corresponding storage media.

.. tip:: In the context of CRUSH, the term "bucket" is used to refer to
   a node in the hierarchy (that is, to a location or a piece of physical
   hardware). In the context of RADOS Gateway APIs, however, the term
   "bucket" has a different meaning.

To add a bucket type to the CRUSH map, create a new line under the list of
bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name.
By convention, there is exactly one leaf bucket type and it is ``type 0``;
however, you may give the leaf bucket any name you like (for example: ``osd``,
``disk``, ``drive``, ``storage``)::

   # types
   type {num} {bucket-name}

For example::

   # types
   type 0 osd
   type 1 host
   type 2 chassis
   type 3 rack
   type 4 row
   type 5 pdu
   type 6 pod
   type 7 room
   type 8 datacenter
   type 9 zone
   type 10 region
   type 11 root

.. _crushmapbuckets:

CRUSH-Map Bucket Hierarchy
--------------------------

The CRUSH algorithm distributes data objects among storage devices according to
a per-device weight value, approximating a uniform probability distribution.
CRUSH distributes objects and their replicas according to the hierarchical
cluster map you define. The CRUSH map represents the available storage devices
and the logical elements that contain them.

To map placement groups (PGs) to OSDs across failure domains, a CRUSH map
defines a hierarchical list of bucket types under ``#types`` in the generated
CRUSH map. The purpose of creating a bucket hierarchy is to segregate the leaf
nodes according to their failure domains (for example: hosts, chassis, racks,
power distribution units, pods, rows, rooms, and data centers). With the
exception of the leaf nodes that represent OSDs, the hierarchy is arbitrary and
you may define it according to your own needs.

We recommend adapting your CRUSH map to your preferred hardware-naming
conventions and using bucket names that clearly reflect the physical
hardware. Clear naming practice can make it easier to administer the cluster
and easier to troubleshoot problems when OSDs malfunction (or other hardware
malfunctions) and the administrator needs access to physical hardware.

In the following example, the bucket hierarchy has a leaf bucket named ``osd``
and two node buckets named ``host`` and ``rack``:

.. ditaa::

                              +-----------+
                              | {o}rack   |
                              |   Bucket  |
                              +-----+-----+
                                    |
                    +---------------+---------------+
                    |                               |
              +-----+-----+                   +-----+-----+
              | {o}host   |                   | {o}host   |
              |   Bucket  |                   |   Bucket  |
              +-----+-----+                   +-----+-----+
                    |                               |
            +-------+-------+               +-------+-------+
            |               |               |               |
      +-----+-----+   +-----+-----+   +-----+-----+   +-----+-----+
      |    osd    |   |    osd    |   |    osd    |   |    osd    |
      |   Bucket  |   |   Bucket  |   |   Bucket  |   |   Bucket  |
      +-----------+   +-----------+   +-----------+   +-----------+

.. note:: The higher-numbered ``rack`` bucket type aggregates the
   lower-numbered ``host`` bucket type.

Because leaf nodes reflect storage devices that have already been declared
under the ``#devices`` list at the beginning of the CRUSH map, there is no need
to declare them as bucket instances. The second-lowest bucket type in your
hierarchy is typically used to aggregate the devices (that is, the
second-lowest bucket type is usually the computer that contains the storage
media) and is given a name such as ``node``, ``computer``, ``server``,
``host``, or ``machine``. In high-density environments, it is common to have
multiple hosts or nodes in a single chassis (for example, in the cases of
blades or twins). It is important to anticipate the potential consequences of
chassis failure: for example, if a chassis must be replaced after a node
failure, the chassis's hosts or nodes (and their associated OSDs) will be in a
``down`` state during the replacement.
268 | ||
269 | To declare a bucket instance, do the following: specify its type, give it a | |
270 | unique name (an alphanumeric string), assign it a unique ID expressed as a | |
271 | negative integer (this is optional), assign it a weight relative to the total | |
272 | capacity and capability of the item(s) in the bucket, assign it a bucket | |
273 | algorithm (usually ``straw2``), and specify the bucket algorithm's hash | |
274 | (usually ``0``, a setting that reflects the hash algorithm ``rjenkins1``). A | |
275 | bucket may have one or more items. The items may consist of node buckets or | |
276 | leaves. Items may have a weight that reflects the relative weight of the item. | |
277 | ||
278 | To declare a node bucket, use the following syntax:: | |
279 | ||
280 | [bucket-type] [bucket-name] { | |
281 | id [a unique negative numeric ID] | |
282 | weight [the relative capacity/capability of the item(s)] | |
283 | alg [the bucket type: uniform | list | tree | straw | straw2 ] | |
284 | hash [the hash type: 0 by default] | |
285 | item [item-name] weight [weight] | |
286 | } | |
287 | ||
288 | For example, in the above diagram, two host buckets (referred to in the | |
289 | declaration below as ``node1`` and ``node2``) and one rack bucket (referred to | |
290 | in the declaration below as ``rack1``) are defined. The OSDs are declared as | |
291 | items within the host buckets:: | |
292 | ||
293 | host node1 { | |
294 | id -1 | |
295 | alg straw2 | |
296 | hash 0 | |
297 | item osd.0 weight 1.00 | |
298 | item osd.1 weight 1.00 | |
299 | } | |
300 | ||
301 | host node2 { | |
302 | id -2 | |
303 | alg straw2 | |
304 | hash 0 | |
305 | item osd.2 weight 1.00 | |
306 | item osd.3 weight 1.00 | |
307 | } | |
308 | ||
309 | rack rack1 { | |
310 | id -3 | |
311 | alg straw2 | |
312 | hash 0 | |
313 | item node1 weight 2.00 | |
314 | item node2 weight 2.00 | |
315 | } | |
316 | ||
317 | .. note:: In this example, the rack bucket does not contain any OSDs. Instead, | |
318 | it contains lower-level host buckets and includes the sum of their weight in | |
319 | the item entry. | |
c07f9fc5 | 320 | |
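The same hierarchy can also be built or rearranged with the CLI instead of by
editing the map directly. A brief sketch, reusing the hypothetical bucket names
from the example above:

.. prompt:: bash $

   ceph osd crush add-bucket rack1 rack
   ceph osd crush move node1 rack=rack1
   ceph osd crush move node2 rack=rack1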

.. topic:: Bucket Types

   Ceph supports five bucket types. Each bucket type provides a balance between
   performance and reorganization efficiency, and each is different from the
   others. If you are unsure of which bucket type to use, use the ``straw2``
   bucket. For a more technical discussion of bucket types than is offered
   here, see **Section 3.4** of `CRUSH - Controlled, Scalable, Decentralized
   Placement of Replicated Data`_.

   The bucket types are as follows:

   #. **uniform**: Uniform buckets aggregate devices that have **exactly**
      the same weight. For example, when hardware is commissioned or
      decommissioned, it is often done in sets of machines that have exactly
      the same physical configuration (this can be the case, for example,
      after bulk purchases). When storage devices have exactly the same
      weight, you may use the ``uniform`` bucket type, which allows CRUSH to
      map replicas into uniform buckets in constant time. If your devices have
      non-uniform weights, you should not use the uniform bucket algorithm.

   #. **list**: List buckets aggregate their content as linked lists. The
      behavior of list buckets is governed by the :abbr:`RUSH (Replication
      Under Scalable Hashing)`:sub:`P` algorithm. In the behavior of this
      bucket type, an object is either relocated to the newest device in
      accordance with an appropriate probability, or it remains on the older
      devices as before. This results in optimal data migration when items are
      added to the bucket. The removal of items from the middle or the tail of
      the list, however, can result in a significant amount of unnecessary
      data movement. This means that list buckets are most suitable for
      circumstances in which they **never shrink or very rarely shrink**.

   #. **tree**: Tree buckets use a binary search tree. They are more efficient
      at dealing with buckets that contain many items than are list buckets.
      The behavior of tree buckets is governed by the :abbr:`RUSH (Replication
      Under Scalable Hashing)`:sub:`R` algorithm. Tree buckets reduce the
      placement time to O(log\ :sub:`n`). This means that tree buckets are
      suitable for managing large sets of devices or nested buckets.

   #. **straw**: Straw buckets allow all items in the bucket to "compete"
      against each other for replica placement through a process analogous to
      drawing straws. This is different from the behavior of list buckets and
      tree buckets, which use a divide-and-conquer strategy that either gives
      certain items precedence (for example, those at the beginning of a list)
      or obviates the need to consider entire subtrees of items. Such an
      approach improves the performance of the replica placement process, but
      can also introduce suboptimal reorganization behavior when the contents
      of a bucket change due to an addition, a removal, or the re-weighting of
      an item.

   #. **straw2**: Straw2 buckets improve on Straw by correctly avoiding
      any data movement between items when neighbor weights change. For
      example, if the weight of a given item changes (including during the
      operations of adding it to the cluster or removing it from the
      cluster), there will be data movement to or from only that item.
      Neighbor weights are not taken into account.

.. topic:: Hash

   Each bucket uses a hash algorithm. As of Reef, Ceph supports the
   ``rjenkins1`` algorithm. To select ``rjenkins1`` as the hash algorithm,
   enter ``0`` as your hash setting.

.. _weightingbucketitems:

.. topic:: Weighting Bucket Items

   Ceph expresses bucket weights as doubles, which allows for fine-grained
   weighting. A weight is the relative difference between device capacities. We
   recommend using ``1.00`` as the relative weight for a 1 TB storage device.
   In such a scenario, a weight of ``0.50`` would represent approximately 500
   GB, and a weight of ``3.00`` would represent approximately 3 TB. Buckets
   higher in the CRUSH hierarchy have a weight that is the sum of the weight of
   the leaf items aggregated by the bucket.

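Weights can also be adjusted at runtime without a manual map edit. For example,
the following commands (the OSD name and weight are placeholders) change one
OSD's CRUSH weight and then display the resulting hierarchy:

.. prompt:: bash $

   ceph osd crush reweight osd.0 3.00
   ceph osd tree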
397 | ||
398 | .. _crushmaprules: | |
399 | ||
400 | CRUSH Map Rules | |
401 | --------------- | |
402 | ||
05a536ef TL |
403 | CRUSH maps have rules that include data placement for a pool: these are |
404 | called "CRUSH rules". The default CRUSH map has one rule for each pool. If you | |
405 | are running a large cluster, you might create many pools and each of those | |
406 | pools might have its own non-default CRUSH rule. | |
c07f9fc5 | 407 | |

.. note:: In most cases, there is no need to modify the default rule. When a
   new pool is created, by default the rule will be set to ``0``, which
   indicates the default CRUSH rule.

CRUSH rules define policy that governs how data is distributed across the
devices in the hierarchy. The rules define placement as well as replication
strategies or distribution policies that allow you to specify exactly how CRUSH
places data replicas. For example, you might create one rule selecting a pair
of targets for two-way mirroring, another rule for selecting three targets in
two different data centers for three-way replication, and yet another rule for
erasure coding across six storage devices. For a detailed discussion of CRUSH
rules, see **Section 3.2** of `CRUSH - Controlled, Scalable, Decentralized
Placement of Replicated Data`_.

A rule takes the following form::

   rule <rulename> {

           id [a unique integer ID]
           type [replicated|erasure]
           step take <bucket-name> [class <device-class>]
           step [choose|chooseleaf] [firstn|indep] <N> type <bucket-type>
           step emit
   }

``id``
:Description: A unique integer that identifies the rule.
:Purpose: A component of the rule mask.
:Type: Integer
:Required: Yes
:Default: 0

``type``
:Description: Denotes the type of replication strategy to be enforced by the
              rule.
:Purpose: A component of the rule mask.
:Type: String
:Required: Yes
:Default: ``replicated``
:Valid Values: ``replicated`` or ``erasure``

``step take <bucket-name> [class <device-class>]``
:Description: Takes a bucket name and iterates down the tree. If
              the ``device-class`` argument is specified, the argument must
              match a class assigned to OSDs within the cluster. Only
              devices belonging to the class are included.
:Purpose: A component of the rule.
:Required: Yes
:Example: ``step take data``

``step choose firstn {num} type {bucket-type}``
:Description: Selects ``num`` buckets of the given type from within the
              current bucket. ``{num}`` is usually the number of replicas in
              the pool (in other words, the pool size).

              - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (as many buckets as are available).
              - If ``pool-num-replicas > {num} > 0``, choose that many buckets.
              - If ``{num} < 0``, choose ``pool-num-replicas - {num}`` buckets.

:Purpose: A component of the rule.
:Prerequisite: Follows ``step take`` or ``step choose``.
:Example: ``step choose firstn 1 type row``

476 | ||
477 | ``step chooseleaf firstn {num} type {bucket-type}`` | |
05a536ef TL |
478 | :Description: Selects a set of buckets of the given type and chooses a leaf |
479 | node (that is, an OSD) from the subtree of each bucket in that set of buckets. The | |
480 | number of buckets in the set is usually the number of replicas in | |
481 | the pool (in other words, the pool size). | |
c07f9fc5 | 482 | |
05a536ef TL |
483 | - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (as many buckets as are available). |
484 | - If ``pool-num-replicas > {num} > 0``, choose that many buckets. | |
485 | - If ``{num} < 0``, choose ``pool-num-replicas - {num}`` buckets. | |
486 | :Purpose: A component of the rule. Using ``chooseleaf`` obviates the need to select a device in a separate step. | |
487 | :Prerequisite: Follows ``step take`` or ``step choose``. | |
488 | :Example: ``step chooseleaf firstn 0 type row`` | |
c07f9fc5 FG |
489 | |
490 | ||
``step emit``
:Description: Outputs the current value on the top of the stack and empties
              the stack. Typically used at the end of a rule, but may also be
              used to choose from different trees in the same rule.

:Purpose: A component of the rule.
:Prerequisite: Follows ``step choose``.
:Example: ``step emit``

.. important:: A single CRUSH rule can be assigned to multiple pools, but
   a single pool cannot have multiple CRUSH rules.

``firstn`` or ``indep``

:Description: Determines which replacement strategy CRUSH uses when items (OSDs)
              are marked ``down`` in the CRUSH map. When this rule is used
              with replicated pools, ``firstn`` is used. When this rule is
              used with erasure-coded pools, ``indep`` is used.

              Suppose that a PG is stored on OSDs 1, 2, 3, 4, and 5 and then
              OSD 3 goes down.

              When in ``firstn`` mode, CRUSH simply adjusts its calculation
              to select OSDs 1 and 2, then selects 3 and discovers that 3 is
              down, retries and selects 4 and 5, and finally goes on to
              select a new OSD: OSD 6. The final CRUSH mapping
              transformation is therefore 1, 2, 3, 4, 5 → 1, 2, 4, 5, 6.

              However, if you were storing an erasure-coded pool, the above
              sequence would have changed the data that is mapped to OSDs 4,
              5, and 6. The ``indep`` mode attempts to avoid this unwanted
              consequence. When in ``indep`` mode, CRUSH can be expected to
              select 3, discover that 3 is down, retry, and select 6. The
              final CRUSH mapping transformation is therefore 1, 2, 3, 4, 5
              → 1, 2, 6, 4, 5.
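
As an illustration of ``indep``, a rule for an erasure-coded pool selects OSDs
independently. A minimal sketch, in which the rule name, ID, and failure domain
are hypothetical::

   rule ecpool_rule {
           id 2
           type erasure
           step take default
           step chooseleaf indep 0 type host
           step emit
   }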

.. _crush-reclassify:

Migrating from a legacy SSD rule to device classes
--------------------------------------------------

Prior to the Luminous release's introduction of the *device class* feature, in
order to write rules that applied to a specialized device type (for example,
SSD), it was necessary to manually edit the CRUSH map and maintain a parallel
hierarchy for each device type. The device class feature provides a more
transparent way to achieve this end.

However, if your cluster is migrated from an existing manually-customized
per-device map to new device class-based rules, all data in the system will be
reshuffled.

The ``crushtool`` utility has several commands that can transform a legacy rule
and hierarchy and allow you to start using the new device class rules. There
are three possible types of transformation:

#. ``--reclassify-root <root-name> <device-class>``

   This command examines everything under ``root-name`` in the hierarchy and
   rewrites any rules that reference the specified root and that have the
   form ``take <root-name>`` so that they instead have the
   form ``take <root-name> class <device-class>``. The command also renumbers
   the buckets in such a way that the old IDs are used for the specified
   class's "shadow tree" and as a result no data movement takes place.

   For example, suppose you have the following as an existing rule::

      rule replicated_rule {
              id 0
              type replicated
              step take default
              step chooseleaf firstn 0 type rack
              step emit
      }

   If the root ``default`` is reclassified as class ``hdd``, the new rule will
   be as follows::

      rule replicated_rule {
              id 0
              type replicated
              step take default class hdd
              step chooseleaf firstn 0 type rack
              step emit
      }

577 | #. ``--set-subtree-class <bucket-name> <device-class>`` | |
578 | ||
05a536ef | 579 | This command marks every device in the subtree that is rooted at *bucket-name* |
f64942e4 AA |
580 | with the specified device class. |
581 | ||
05a536ef TL |
582 | This command is typically used in conjunction with the ``--reclassify-root`` option |
583 | in order to ensure that all devices in that root are labeled with the | |
584 | correct class. In certain circumstances, however, some of those devices | |
585 | are correctly labeled with a different class and must not be relabeled. To | |
586 | manage this difficulty, one can exclude the ``--set-subtree-class`` | |
587 | option. The remapping process will not be perfect, because the previous rule | |
588 | had an effect on devices of multiple classes but the adjusted rules will map | |
589 | only to devices of the specified device class. However, when there are not many | |
590 | outlier devices, the resulting level of data movement is often within tolerable | |
591 | limits. | |
592 | ||
#. ``--reclassify-bucket <match-pattern> <device-class> <default-parent>``

   This command allows you to merge a parallel type-specific hierarchy with the
   normal hierarchy. For example, many users have maps that resemble the
   following::

      host node1 {
         id -2           # do not change unnecessarily
         # weight 109.152
         alg straw2
         hash 0  # rjenkins1
         item osd.0 weight 9.096
         item osd.1 weight 9.096
         item osd.2 weight 9.096
         item osd.3 weight 9.096
         item osd.4 weight 9.096
         item osd.5 weight 9.096
         ...
      }

      host node1-ssd {
         id -10          # do not change unnecessarily
         # weight 2.000
         alg straw2
         hash 0  # rjenkins1
         item osd.80 weight 2.000
         ...
      }

      root default {
         id -1           # do not change unnecessarily
         alg straw2
         hash 0  # rjenkins1
         item node1 weight 110.967
         ...
      }

      root ssd {
         id -18          # do not change unnecessarily
         # weight 16.000
         alg straw2
         hash 0  # rjenkins1
         item node1-ssd weight 2.000
         ...
      }

   This command reclassifies each bucket that matches a certain
   pattern. The pattern can be of the form ``%suffix`` or ``prefix%``. In the
   example above, the pattern would be ``%-ssd``. For each matched bucket, the
   remaining portion of the name (corresponding to the ``%`` wildcard)
   specifies the *base bucket*. All devices in the matched bucket are labeled
   with the specified device class and then moved to the base bucket. If the
   base bucket does not exist (for example, ``node12-ssd`` exists but
   ``node12`` does not), then it is created and linked under the specified
   *default parent* bucket. In each case, care is taken to preserve the old
   bucket IDs for the new shadow buckets in order to prevent data movement.
   Any rules with ``take`` steps that reference the old buckets are adjusted
   accordingly.

#. ``--reclassify-bucket <bucket-name> <device-class> <base-bucket>``

   The same command can also be used without a wildcard in order to map a
   single bucket. In the previous example, we want the ``ssd`` bucket to be
   mapped to the ``default`` bucket.

#. The final command to convert the map that consists of the above fragments
   resembles the following:

   .. prompt:: bash $

      ceph osd getcrushmap -o original
      crushtool -i original --reclassify \
        --set-subtree-class default hdd \
        --reclassify-root default hdd \
        --reclassify-bucket %-ssd ssd default \
        --reclassify-bucket ssd ssd default \
        -o adjusted

``--compare`` flag
------------------

A ``--compare`` flag is available to make sure that the conversion performed in
:ref:`Migrating from a legacy SSD rule to device classes <crush-reclassify>` is
correct. This flag tests a large sample of inputs against the CRUSH map and
checks that the expected result is output. The options that control these
inputs are the same as the options that apply to the ``--test`` command. For an
illustration of how this ``--compare`` command applies to the above example,
see the following:

.. prompt:: bash $

   crushtool -i original --compare adjusted

::

   rule 0 had 0/10240 mismatched mappings (0)
   rule 1 had 0/10240 mismatched mappings (0)
   maps appear equivalent

If the command finds any differences, the ratio of remapped inputs is reported
in the parentheses.

When you are satisfied with the adjusted map, apply it to the cluster by
running the following command:

.. prompt:: bash $

   ceph osd setcrushmap -i adjusted

Manually Tuning CRUSH
---------------------

If you have verified that all clients are running recent code, you can adjust
the CRUSH tunables by extracting the CRUSH map, modifying the values, and
reinjecting the map into the cluster. The procedure is carried out as follows:

#. Extract the latest CRUSH map:

   .. prompt:: bash $

      ceph osd getcrushmap -o /tmp/crush

#. Adjust tunables. In our tests, the following values appear to result in the
   best behavior for both large and small clusters. The procedure requires that
   you specify the ``--enable-unsafe-tunables`` flag in the ``crushtool``
   command. Use this option with **extreme care**:

   .. prompt:: bash $

      crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new

#. Reinject the modified map:

   .. prompt:: bash $

      ceph osd setcrushmap -i /tmp/crush.new
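
To inspect the tunable values that are currently in effect, before or after
such a change, you can run the following command:

.. prompt:: bash $

   ceph osd crush show-tunables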

Legacy values
-------------

To set the legacy values of the CRUSH tunables, run the following command:

.. prompt:: bash $

   crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy

The special ``--enable-unsafe-tunables`` flag is required. Be careful when
running old versions of the ``ceph-osd`` daemon after reverting to legacy
values, because the feature bit is not perfectly enforced.

.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf