Manually editing a CRUSH Map
============================

.. note:: Manually editing the CRUSH map is an advanced
   administrator operation. All CRUSH changes that are
   necessary for the overwhelming majority of installations are
   possible via the standard ceph CLI and do not require manual
   CRUSH map edits. If you have identified a use case where
   manual edits *are* necessary with recent Ceph releases, consider
   contacting the Ceph developers so that future versions of Ceph
   can obviate your corner case.

To edit an existing CRUSH map:

#. `Get the CRUSH map`_.
#. `Decompile`_ the CRUSH map.
#. Edit at least one of `Devices`_, `Buckets`_, and `Rules`_.
#. `Recompile`_ the CRUSH map.
#. `Set the CRUSH map`_.

For details on setting the CRUSH map rule for a specific pool, see `Set
Pool Values`_.

.. _Get the CRUSH map: #getcrushmap
.. _Decompile: #decompilecrushmap
.. _Devices: #crushmapdevices
.. _Buckets: #crushmapbuckets
.. _Rules: #crushmaprules
.. _Recompile: #compilecrushmap
.. _Set the CRUSH map: #setcrushmap
.. _Set Pool Values: ../pools#setpoolvalues

.. _getcrushmap:

Get a CRUSH Map
---------------

To get the CRUSH map for your cluster, execute the following:

.. prompt:: bash $

   ceph osd getcrushmap -o {compiled-crushmap-filename}

Ceph will output (-o) a compiled CRUSH map to the filename you specified.
Because the CRUSH map is in a compiled form, you must decompile it before you
can edit it.

.. _decompilecrushmap:

Decompile a CRUSH Map
---------------------

To decompile a CRUSH map, execute the following:

.. prompt:: bash $

   crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename}

.. _compilecrushmap:

Recompile a CRUSH Map
---------------------

To compile a CRUSH map, execute the following:

.. prompt:: bash $

   crushtool -c {decompiled-crushmap-filename} -o {compiled-crushmap-filename}

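Before injecting a recompiled map into a running cluster, it can be useful to
sanity-check it offline. The following is a sketch only: the rule id ``0`` and
replica count ``3`` are placeholders for your own values. ``crushtool`` can
simulate placements against the compiled map:

.. prompt:: bash $

   crushtool -i {compiled-crushmap-filename} --test --show-statistics --rule 0 --num-rep 3
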
.. _setcrushmap:

Set the CRUSH Map
-----------------

To set the CRUSH map for your cluster, execute the following:

.. prompt:: bash $

   ceph osd setcrushmap -i {compiled-crushmap-filename}

Ceph will load (-i) a compiled CRUSH map from the filename you specified.

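Putting the preceding steps together, a typical manual edit cycle looks
something like the following sketch. The filenames ``crush.bin``,
``crush.txt``, and ``crush.new.bin`` are arbitrary examples, and you may use
whatever editor you prefer:

.. prompt:: bash $

   ceph osd getcrushmap -o crush.bin
   crushtool -d crush.bin -o crush.txt
   vi crush.txt
   crushtool -c crush.txt -o crush.new.bin
   ceph osd setcrushmap -i crush.new.bin
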
Sections
--------

There are six main sections to a CRUSH Map.

#. **tunables:** The preamble at the top of the map describes any *tunables*
   that differ from the historical / legacy CRUSH behavior. These
   correct for old bugs, optimizations, or other changes that have
   been made over the years to improve CRUSH's behavior.

#. **devices:** Devices are individual OSDs that store data.

#. **types**: Bucket ``types`` define the types of buckets used in
   your CRUSH hierarchy. Buckets consist of a hierarchical aggregation
   of storage locations (e.g., rows, racks, chassis, hosts, etc.) and
   their assigned weights.

#. **buckets:** Once you define bucket types, you must define each node
   in the hierarchy, its type, and which devices or other nodes it
   contains.

#. **rules:** Rules define policy about how data is distributed across
   devices in the hierarchy.

#. **choose_args:** Choose_args are alternative weights associated with
   the hierarchy that have been adjusted to optimize data placement. A single
   choose_args map can be used for the entire cluster, or one can be
   created for each individual pool.

.. _crushmapdevices:

CRUSH Map Devices
-----------------

Devices are individual OSDs that store data. Usually one is defined here for
each OSD daemon in your cluster. Devices are identified by an ``id`` (a
non-negative integer) and a ``name``, normally ``osd.N``, where ``N`` is the
device id.

.. _crush-map-device-class:

Devices may also have a *device class* associated with them (e.g.,
``hdd`` or ``ssd``), allowing them to be conveniently targeted by a
CRUSH rule.

::

   # devices
   device {num} {osd.name} [class {class}]

For example::

   # devices
   device 0 osd.0 class ssd
   device 1 osd.1 class hdd
   device 2 osd.2
   device 3 osd.3

In most cases, each device maps to a single ``ceph-osd`` daemon. This
is normally a single storage device, a pair of devices (for example,
one for data and one for a journal or metadata), or in some cases a
small RAID device.

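Note that device classes normally do not need to be maintained by editing the
map: they can be managed from the CLI on a running cluster. As a sketch
(``osd.2`` and the ``ssd`` class are example values; an existing class must be
removed before a different one can be set):

.. prompt:: bash $

   ceph osd crush rm-device-class osd.2
   ceph osd crush set-device-class ssd osd.2
   ceph osd crush class ls
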
CRUSH Map Bucket Types
----------------------

The second list in the CRUSH map defines 'bucket' types. Buckets facilitate
a hierarchy of nodes and leaves. Node (or non-leaf) buckets typically represent
physical locations in a hierarchy. Nodes aggregate other nodes or leaves.
Leaf buckets represent ``ceph-osd`` daemons and their corresponding storage
media.

.. tip:: The term "bucket" used in the context of CRUSH means a node in
   the hierarchy, i.e. a location or a piece of physical hardware. It
   is a different concept from the term "bucket" when used in the
   context of RADOS Gateway APIs.

To add a bucket type to the CRUSH map, create a new line under your list of
bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name.
By convention, there is one leaf bucket and it is ``type 0``; however, you may
give it any name you like (e.g., osd, disk, drive, storage)::

   # types
   type {num} {bucket-name}

For example::

   # types
   type 0 osd
   type 1 host
   type 2 chassis
   type 3 rack
   type 4 row
   type 5 pdu
   type 6 pod
   type 7 room
   type 8 datacenter
   type 9 zone
   type 10 region
   type 11 root

.. _crushmapbuckets:

CRUSH Map Bucket Hierarchy
--------------------------

The CRUSH algorithm distributes data objects among storage devices according
to a per-device weight value, approximating a uniform probability distribution.
CRUSH distributes objects and their replicas according to the hierarchical
cluster map you define. Your CRUSH map represents the available storage
devices and the logical elements that contain them.

To map placement groups to OSDs across failure domains, a CRUSH map defines a
hierarchical list of bucket types (i.e., under ``#types`` in the generated CRUSH
map). The purpose of creating a bucket hierarchy is to segregate the
leaf nodes by their failure domains, such as hosts, chassis, racks, power
distribution units, pods, rows, rooms, and data centers. With the exception of
the leaf nodes representing OSDs, the rest of the hierarchy is arbitrary, and
you may define it according to your own needs.

We recommend adapting your CRUSH map to your firm's hardware naming conventions
and using instance names that reflect the physical hardware. Good naming
practice can make it easier to administer the cluster and to troubleshoot
problems when an OSD or other hardware malfunctions and the administrator
needs access to the physical hardware.

In the following example, the bucket hierarchy has a leaf bucket named ``osd``
and two node buckets named ``host`` and ``rack`` respectively.

.. ditaa::

                              +-----------+
                              | {o}rack   |
                              |   Bucket  |
                              +-----+-----+
                                    |
                    +---------------+---------------+
                    |                               |
              +-----+-----+                   +-----+-----+
              | {o}host   |                   | {o}host   |
              |   Bucket  |                   |   Bucket  |
              +-----+-----+                   +-----+-----+
                    |                               |
            +-------+-------+               +-------+-------+
            |               |               |               |
      +-----+-----+   +-----+-----+   +-----+-----+   +-----+-----+
      |    osd    |   |    osd    |   |    osd    |   |    osd    |
      |   Bucket  |   |   Bucket  |   |   Bucket  |   |   Bucket  |
      +-----------+   +-----------+   +-----------+   +-----------+

.. note:: The higher numbered ``rack`` bucket type aggregates the lower
   numbered ``host`` bucket type.

Since leaf nodes reflect storage devices declared under the ``#devices`` list
at the beginning of the CRUSH map, you do not need to declare them as bucket
instances. The second lowest bucket type in your hierarchy usually aggregates
the devices (i.e., it is usually the computer containing the storage media, and
uses whatever term you prefer to describe it, such as "node", "computer",
"server", "host", "machine", etc.). In high density environments, it is
increasingly common to see multiple hosts/nodes per chassis. You should account
for chassis failure too: for example, pulling a chassis to service a failed
node may bring down numerous hosts/nodes and their OSDs.

When declaring a bucket instance, you must specify its type, give it a unique
name (string), assign it a unique ID expressed as a negative integer (optional),
specify a weight relative to the total capacity/capability of its item(s),
specify the bucket algorithm (usually ``straw2``), and the hash (usually ``0``,
reflecting hash algorithm ``rjenkins1``). A bucket may have one or more items.
The items may consist of node buckets or leaves. Items may have a weight that
reflects the relative weight of the item.

You may declare a node bucket with the following syntax::

   [bucket-type] [bucket-name] {
           id [a unique negative numeric ID]
           weight [the relative capacity/capability of the item(s)]
           alg [the bucket type: uniform | list | tree | straw | straw2 ]
           hash [the hash type: 0 by default]
           item [item-name] weight [weight]
   }

For example, using the diagram above, we would define two host buckets
and one rack bucket. The OSDs are declared as items within the host buckets::

   host node1 {
           id -1
           alg straw2
           hash 0
           item osd.0 weight 1.00
           item osd.1 weight 1.00
   }

   host node2 {
           id -2
           alg straw2
           hash 0
           item osd.2 weight 1.00
           item osd.3 weight 1.00
   }

   rack rack1 {
           id -3
           alg straw2
           hash 0
           item node1 weight 2.00
           item node2 weight 2.00
   }

.. note:: In the foregoing example, note that the rack bucket does not contain
   any OSDs. Rather it contains lower level host buckets, and includes the
   sum total of their weight in the item entry.

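After recompiling an edited hierarchy, one way to confirm that the buckets
nest as intended is to print the compiled map as a tree. This is a sketch;
``crush.new.bin`` is just an example filename:

.. prompt:: bash $

   crushtool -i crush.new.bin --tree

The running cluster's view of the hierarchy can be displayed in a similar way
with ``ceph osd crush tree``.
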
.. topic:: Bucket Types

   Ceph supports five bucket types, each representing a tradeoff between
   performance and reorganization efficiency. If you are unsure of which bucket
   type to use, we recommend using a ``straw2`` bucket. For a detailed
   discussion of bucket types, refer to
   `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_,
   and more specifically to **Section 3.4**. The bucket types are:

   #. **uniform**: Uniform buckets aggregate devices with **exactly** the same
      weight. For example, when firms commission or decommission hardware, they
      typically do so with many machines that have exactly the same physical
      configuration (e.g., bulk purchases). When storage devices have exactly
      the same weight, you may use the ``uniform`` bucket type, which allows
      CRUSH to map replicas into uniform buckets in constant time. With
      non-uniform weights, you should use another bucket algorithm.

   #. **list**: List buckets aggregate their content as linked lists. Based on
      the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`P` algorithm,
      a list is a natural and intuitive choice for an **expanding cluster**:
      either an object is relocated to the newest device with some appropriate
      probability, or it remains on the older devices as before. The result is
      optimal data migration when items are added to the bucket. Items removed
      from the middle or tail of the list, however, can result in a significant
      amount of unnecessary movement, making list buckets most suitable for
      circumstances in which they **never (or very rarely) shrink**.

   #. **tree**: Tree buckets use a binary search tree. They are more efficient
      than list buckets when a bucket contains a larger set of items. Based on
      the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`R` algorithm,
      tree buckets reduce the placement time to O(log :sub:`n`), making them
      suitable for managing much larger sets of devices or nested buckets.

   #. **straw**: List and tree buckets use a divide-and-conquer strategy
      in a way that either gives certain items precedence (e.g., those
      at the beginning of a list) or obviates the need to consider entire
      subtrees of items at all. That improves the performance of the replica
      placement process, but can also introduce suboptimal reorganization
      behavior when the contents of a bucket change due to an addition, removal,
      or re-weighting of an item. The straw bucket type allows all items to
      fairly “compete” against each other for replica placement through a
      process analogous to a draw of straws.

   #. **straw2**: Straw2 buckets improve on straw by correctly avoiding any
      data movement between items when neighbor weights change. For example,
      if the weight of item A changes (including being added anew or removed
      completely), data will move only to or from item A.

.. topic:: Hash

   Each bucket uses a hash algorithm. Currently, Ceph supports ``rjenkins1``.
   Enter ``0`` as your hash setting to select ``rjenkins1``.

.. _weightingbucketitems:

.. topic:: Weighting Bucket Items

   Ceph expresses bucket weights as doubles, which allows for fine-grained
   weighting. A weight is the relative difference between device capacities. We
   recommend using ``1.00`` as the relative weight for a 1TB storage device.
   In such a scenario, a weight of ``0.5`` would represent approximately 500GB,
   and a weight of ``3.00`` would represent approximately 3TB. Higher level
   buckets have a weight that is the sum total of the leaf items aggregated by
   the bucket.

   A bucket item weight is one-dimensional, but you may also calculate your
   item weights to reflect the performance of the storage drive. For example,
   if you have many 1TB drives where some have a relatively low data transfer
   rate and others have a relatively high data transfer rate, you may
   weight them differently, even though they have the same capacity (e.g.,
   a weight of 0.80 for the first set of drives with lower total throughput,
   and 1.20 for the second set of drives with higher total throughput).

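Item weights can also be adjusted on a running cluster without editing the
map. As a sketch (the OSD name and weight below are examples only):

.. prompt:: bash $

   ceph osd crush reweight osd.0 3.00

This changes the CRUSH weight of ``osd.0``, and the aggregate weights of the
buckets above it, just as editing the corresponding ``item`` line in the
decompiled map would.
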
.. _crushmaprules:

CRUSH Map Rules
---------------

CRUSH maps support the notion of 'CRUSH rules', which are the rules that
determine data placement for a pool. The default CRUSH map has a rule for each
pool. For large clusters, you will likely create many pools, and each pool may
have its own non-default CRUSH rule.

.. note:: In most cases, you will not need to modify the default rule. When
   you create a new pool, by default the rule will be set to ``0``.


CRUSH rules define placement and replication strategies or distribution policies
that allow you to specify exactly how CRUSH places object replicas. For
example, you might create a rule selecting a pair of targets for 2-way
mirroring, another rule for selecting three targets in two different data
centers for 3-way mirroring, and yet another rule for erasure coding over six
storage devices. For a detailed discussion of CRUSH rules, refer to
`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_,
and more specifically to **Section 3.2**.

A rule takes the following form::

   rule <rulename> {

           id [a unique whole numeric ID]
           type [ replicated | erasure ]
           min_size <min-size>
           max_size <max-size>
           step take <bucket-name> [class <device-class>]
           step [choose|chooseleaf] [firstn|indep] <N> type <bucket-type>
           step emit
   }

``id``

:Description: A unique whole number for identifying the rule.

:Purpose: A component of the rule mask.
:Type: Integer
:Required: Yes
:Default: 0


``type``

:Description: Describes a rule for either a storage drive (replicated)
              or a RAID.

:Purpose: A component of the rule mask.
:Type: String
:Required: Yes
:Default: ``replicated``
:Valid Values: Currently only ``replicated`` and ``erasure``

``min_size``

:Description: If a pool makes fewer replicas than this number, CRUSH will
              **NOT** select this rule.

:Type: Integer
:Purpose: A component of the rule mask.
:Required: Yes
:Default: ``1``

``max_size``

:Description: If a pool makes more replicas than this number, CRUSH will
              **NOT** select this rule.

:Type: Integer
:Purpose: A component of the rule mask.
:Required: Yes
:Default: 10

``step take <bucket-name> [class <device-class>]``

:Description: Takes a bucket name, and begins iterating down the tree.
              If the ``device-class`` is specified, it must match
              a class previously used when defining a device. All
              devices that do not belong to the class are excluded.
:Purpose: A component of the rule.
:Required: Yes
:Example: ``step take data``


``step choose firstn {num} type {bucket-type}``

:Description: Selects the number of buckets of the given type from within the
              current bucket. The number is usually the number of replicas in
              the pool (i.e., pool size).

              - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available).
              - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets.
              - If ``{num} < 0``, it means ``pool-num-replicas - {num}``.

:Purpose: A component of the rule.
:Prerequisite: Follows ``step take`` or ``step choose``.
:Example: ``step choose firstn 1 type row``


``step chooseleaf firstn {num} type {bucket-type}``

:Description: Selects a set of buckets of ``{bucket-type}`` and chooses a leaf
              node (that is, an OSD) from the subtree of each bucket in the set
              of buckets. The number of buckets in the set is usually the
              number of replicas in the pool (i.e., pool size).

              - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available).
              - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets.
              - If ``{num} < 0``, it means ``pool-num-replicas - {num}``.

:Purpose: A component of the rule. Usage removes the need to select a device using two steps.
:Prerequisite: Follows ``step take`` or ``step choose``.
:Example: ``step chooseleaf firstn 0 type row``


``step emit``

:Description: Outputs the current value and empties the stack. Typically used
              at the end of a rule, but may also be used to pick from different
              trees in the same rule.

:Purpose: A component of the rule.
:Prerequisite: Follows ``step choose``.
:Example: ``step emit``

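As a concrete sketch of how these steps combine, the following hypothetical
rule (the name ``ssd_racks`` and id ``5`` are example values, and the root
bucket is assumed to be named ``default``) places replicas on SSD-classed
OSDs and spreads them across separate racks::

   rule ssd_racks {
           id 5
           type replicated
           step take default class ssd
           step chooseleaf firstn 0 type rack
           step emit
   }
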
.. important:: A given CRUSH rule may be assigned to multiple pools, but it
   is not possible for a single pool to have multiple CRUSH rules.

``firstn`` versus ``indep``

:Description: Controls the replacement strategy CRUSH uses when items (OSDs)
              are marked down in the CRUSH map. If this rule is to be used with
              replicated pools it should be ``firstn``, and if it is for
              erasure-coded pools it should be ``indep``.

              The reason has to do with how the two modes behave when a
              previously selected device fails. Say a PG is stored on
              OSDs 1, 2, 3, 4, and 5, and then OSD 3 goes down.

              In ``firstn`` mode, CRUSH simply adjusts its calculation to
              select OSDs 1 and 2, then selects 3 but discovers that it is
              down, so it retries and selects 4 and 5, and then goes on to
              select a new OSD 6. The final CRUSH mapping change is
              1, 2, 3, 4, 5 -> 1, 2, 4, 5, 6.

              For an erasure-coded pool, however, this changes the data
              mapped to OSDs 4, 5, and 6. The ``indep`` mode avoids this:
              when it selects the failed OSD 3, it instead retries and picks
              OSD 6 for that position, for a final transformation of
              1, 2, 3, 4, 5 -> 1, 2, 6, 4, 5.

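As a sketch of what an ``indep`` rule might look like (the rule name, id, and
failure domain are example values; rules for erasure-coded pools are normally
generated for you from an erasure-code profile rather than written by hand)::

   rule ec_by_host {
           id 6
           type erasure
           step set_chooseleaf_tries 5
           step set_choose_tries 100
           step take default class hdd
           step chooseleaf indep 0 type host
           step emit
   }
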
.. _crush-reclassify:

Migrating from a legacy SSD rule to device classes
--------------------------------------------------

It used to be necessary to manually edit your CRUSH map and maintain a
parallel hierarchy for each specialized device type (e.g., SSD) in order to
write rules that apply to those devices. Since the Luminous release,
the *device class* feature has enabled this transparently.

However, migrating from an existing, manually customized per-device map to
the new device class rules in the trivial way will cause all data in the
system to be reshuffled.

The ``crushtool`` has a few commands that can transform a legacy rule
and hierarchy so that you can start using the new class-based rules.
There are three types of transformations possible:

#. ``--reclassify-root <root-name> <device-class>``

   This will take everything in the hierarchy beneath root-name and
   adjust any rules that reference that root via a ``take
   <root-name>`` to instead ``take <root-name> class <device-class>``.
   It renumbers the buckets in such a way that the old IDs are instead
   used for the specified class's "shadow tree" so that no data
   movement takes place.

   For example, imagine you have an existing rule like::

     rule replicated_rule {
        id 0
        type replicated
        step take default
        step chooseleaf firstn 0 type rack
        step emit
     }

   If you reclassify the root `default` as class `hdd`, the rule will
   become::

     rule replicated_rule {
        id 0
        type replicated
        step take default class hdd
        step chooseleaf firstn 0 type rack
        step emit
     }

#. ``--set-subtree-class <bucket-name> <device-class>``

   This will mark every device in the subtree rooted at *bucket-name*
   with the specified device class.

   This is normally used in conjunction with the ``--reclassify-root``
   option to ensure that all devices in that root are labeled with the
   correct class. In some situations, however, some of those devices
   (correctly) have a different class and we do not want to relabel
   them. In such cases, the ``--set-subtree-class`` option can be
   omitted. The remapping will then not be perfect, since the previous
   rule distributed across devices of multiple classes while the
   adjusted rule maps only to devices of the specified *device-class*;
   however, that is often an acceptable level of data movement when
   the number of outlier devices is small.

#. ``--reclassify-bucket <match-pattern> <device-class> <default-parent>``

   This will allow you to merge a parallel type-specific hierarchy with the
   normal hierarchy. For example, many users have maps like::

     host node1 {
        id -2           # do not change unnecessarily
        # weight 109.152
        alg straw2
        hash 0  # rjenkins1
        item osd.0 weight 9.096
        item osd.1 weight 9.096
        item osd.2 weight 9.096
        item osd.3 weight 9.096
        item osd.4 weight 9.096
        item osd.5 weight 9.096
        ...
     }

     host node1-ssd {
        id -10          # do not change unnecessarily
        # weight 2.000
        alg straw2
        hash 0  # rjenkins1
        item osd.80 weight 2.000
        ...
     }

     root default {
        id -1           # do not change unnecessarily
        alg straw2
        hash 0  # rjenkins1
        item node1 weight 110.967
        ...
     }

     root ssd {
        id -18          # do not change unnecessarily
        # weight 16.000
        alg straw2
        hash 0  # rjenkins1
        item node1-ssd weight 2.000
        ...
     }

   This function will reclassify each bucket that matches a
   pattern. The pattern can look like ``%suffix`` or ``prefix%``.
   In the example above, we would use the pattern ``%-ssd``. For each
   matched bucket, the remaining portion of the name (that matches the
   ``%`` wildcard) specifies the *base bucket*. All devices in the
   matched bucket are labeled with the specified device class and then
   moved to the base bucket. If the base bucket does not exist (e.g.,
   ``node12-ssd`` exists but ``node12`` does not), then it is created
   and linked underneath the specified *default parent* bucket. In each
   case, we are careful to preserve the old bucket IDs for the new
   shadow buckets to prevent data movement. Any rules with ``take``
   steps referencing the old buckets are adjusted.

#. ``--reclassify-bucket <bucket-name> <device-class> <base-bucket>``

   The same command can also be used without a wildcard to map a
   single bucket. For example, in the previous example, we want the
   ``ssd`` bucket to be mapped to the ``default`` bucket.

The final command to convert the map comprising the above fragments would be
something like:

.. prompt:: bash $

   ceph osd getcrushmap -o original
   crushtool -i original --reclassify \
     --set-subtree-class default hdd \
     --reclassify-root default hdd \
     --reclassify-bucket %-ssd ssd default \
     --reclassify-bucket ssd ssd default \
     -o adjusted

To verify that the conversion is correct, there is a ``--compare`` command
that tests a large sample of inputs against the CRUSH map and checks that the
same result is produced. These inputs are controlled by the same options that
apply to the ``--test`` command. For the above example:

.. prompt:: bash $

   crushtool -i original --compare adjusted

::

   rule 0 had 0/10240 mismatched mappings (0)
   rule 1 had 0/10240 mismatched mappings (0)
   maps appear equivalent

If there were differences, the ratio of remapped inputs would be reported in
parentheses.

When you are satisfied with the adjusted map, apply it to the cluster with a
command of the form:

.. prompt:: bash $

   ceph osd setcrushmap -i adjusted

Tuning CRUSH, the hard way
--------------------------

If you can ensure that all clients are running recent code, you can
adjust the tunables by extracting the CRUSH map, modifying the values,
and reinjecting it into the cluster.

* Extract the latest CRUSH map:

  .. prompt:: bash $

     ceph osd getcrushmap -o /tmp/crush

* Adjust tunables. These values appear to offer the best behavior
  for both large and small clusters we tested with. You will need to
  additionally specify the ``--enable-unsafe-tunables`` argument to
  ``crushtool`` for this to work. Please use this option with
  extreme care:

  .. prompt:: bash $

     crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new

* Reinject the modified map:

  .. prompt:: bash $

     ceph osd setcrushmap -i /tmp/crush.new

Legacy values
-------------

For reference, the legacy values for the CRUSH tunables can be set
with:

.. prompt:: bash $

   crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy

Again, the special ``--enable-unsafe-tunables`` option is required.
Further, as noted above, be careful running old versions of the
``ceph-osd`` daemon after reverting to legacy values, as the feature
bit is not perfectly enforced.

.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf