Manually editing a CRUSH Map
============================

.. note:: Manually editing the CRUSH map is considered an advanced
          administrator operation. All CRUSH changes that are
          necessary for the overwhelming majority of installations are
          possible via the standard ceph CLI and do not require manual
          CRUSH map edits. If you have identified a use case where
          manual edits *are* necessary, consider contacting the Ceph
          developers so that future versions of Ceph can make this
          unnecessary.

To edit an existing CRUSH map:

#. `Get the CRUSH map`_.
#. `Decompile`_ the CRUSH map.
#. Edit at least one of `Devices`_, `Buckets`_ and `Rules`_.
#. `Recompile`_ the CRUSH map.
#. `Set the CRUSH map`_.
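
Using placeholder filenames, the steps above correspond to the following
command sequence; the filenames are examples, so substitute your own::

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # ... edit crushmap.txt with your preferred editor ...
    crushtool -c crushmap.txt -o crushmap-new.bin
    ceph osd setcrushmap -i crushmap-new.bin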

For details on setting the CRUSH map rule for a specific pool, see `Set
Pool Values`_.

.. _Get the CRUSH map: #getcrushmap
.. _Decompile: #decompilecrushmap
.. _Devices: #crushmapdevices
.. _Buckets: #crushmapbuckets
.. _Rules: #crushmaprules
.. _Recompile: #compilecrushmap
.. _Set the CRUSH map: #setcrushmap
.. _Set Pool Values: ../pools#setpoolvalues

.. _getcrushmap:

Get a CRUSH Map
---------------

To get the CRUSH map for your cluster, execute the following::

    ceph osd getcrushmap -o {compiled-crushmap-filename}

Ceph will output (-o) a compiled CRUSH map to the filename you specified. Since
the CRUSH map is in a compiled form, you must decompile it before you can
edit it.

.. _decompilecrushmap:

Decompile a CRUSH Map
---------------------

To decompile a CRUSH map, execute the following::

    crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename}

Sections
--------

There are six main sections to a CRUSH Map.

#. **tunables:** The preamble at the top of the map describes any *tunables*
   for CRUSH behavior that vary from the historical/legacy CRUSH behavior. These
   correct for old bugs, optimizations, or other changes in behavior that have
   been made over the years to improve CRUSH's behavior.

#. **devices:** Devices are individual ``ceph-osd`` daemons that can
   store data.

#. **types**: Bucket ``types`` define the types of buckets used in
   your CRUSH hierarchy. Buckets consist of a hierarchical aggregation
   of storage locations (e.g., rows, racks, chassis, hosts, etc.) and
   their assigned weights.

#. **buckets:** Once you define bucket types, you must define each node
   in the hierarchy, its type, and which devices or other nodes it
   contains.

#. **rules:** Rules define policy about how data is distributed across
   devices in the hierarchy.

#. **choose_args:** Choose_args are alternative weights associated with
   the hierarchy that have been adjusted to optimize data placement. A single
   choose_args map can be used for the entire cluster, or one can be
   created for each individual pool.
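
To orient yourself, a decompiled map lays these sections out in roughly the
order listed above. The sketch below is illustrative only: the tunable lines,
devices, and bucket contents will differ on your cluster, and a ``choose_args``
section appears only if one has been defined::

    # begin crush map
    tunable choose_total_tries 50
    tunable chooseleaf_descend_once 1

    # devices
    device 0 osd.0 class hdd

    # types
    type 0 osd
    type 1 host
    type 10 root

    # buckets
    host node1 {
            ...
    }

    # rules
    rule replicated_rule {
            ...
    }

    # end crush map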

.. _crushmapdevices:

CRUSH Map Devices
-----------------

Devices are individual ``ceph-osd`` daemons that can store data. You
will normally have one defined here for each OSD daemon in your
cluster. Devices are identified by an id (a non-negative integer) and
a name, normally ``osd.N`` where ``N`` is the device id.

Devices may also have a *device class* associated with them (e.g.,
``hdd`` or ``ssd``), allowing them to be conveniently targeted by a
CRUSH rule.

::

    # devices
    device {num} {osd.name} [class {class}]

For example::

    # devices
    device 0 osd.0 class ssd
    device 1 osd.1 class hdd
    device 2 osd.2
    device 3 osd.3

In most cases, each device maps to a single ``ceph-osd`` daemon. This
is normally a single storage device, a pair of devices (for example,
one for data and one for a journal or metadata), or in some cases a
small RAID device.
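
Device classes can also be inspected and changed on a running cluster, without
editing the map, if your Ceph release supports device classes. The OSD name and
class below are examples::

    ceph osd crush class ls
    ceph osd crush set-device-class hdd osd.2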

CRUSH Map Bucket Types
----------------------

The second list in the CRUSH map defines 'bucket' types. Buckets facilitate
a hierarchy of nodes and leaves. Node (or non-leaf) buckets typically represent
physical locations in a hierarchy. Nodes aggregate other nodes or leaves.
Leaf buckets represent ``ceph-osd`` daemons and their corresponding storage
media.

.. tip:: The term "bucket" used in the context of CRUSH means a node in
         the hierarchy, i.e. a location or a piece of physical hardware. It
         is a different concept from the term "bucket" when used in the
         context of RADOS Gateway APIs.

To add a bucket type to the CRUSH map, create a new line under your list of
bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name.
By convention, there is one leaf bucket and it is ``type 0``; however, you may
give it any name you like (e.g., osd, disk, drive, storage, etc.)::

    # types
    type {num} {bucket-name}

For example::

    # types
    type 0 osd
    type 1 host
    type 2 chassis
    type 3 rack
    type 4 row
    type 5 pdu
    type 6 pod
    type 7 room
    type 8 datacenter
    type 9 region
    type 10 root

.. _crushmapbuckets:

CRUSH Map Bucket Hierarchy
--------------------------

The CRUSH algorithm distributes data objects among storage devices according
to a per-device weight value, approximating a uniform probability distribution.
CRUSH distributes objects and their replicas according to the hierarchical
cluster map you define. Your CRUSH map represents the available storage
devices and the logical elements that contain them.

To map placement groups to OSDs across failure domains, a CRUSH map defines a
hierarchical list of bucket types (i.e., under ``#types`` in the generated CRUSH
map). The purpose of creating a bucket hierarchy is to segregate the
leaf nodes by their failure domains, such as hosts, chassis, racks, power
distribution units, pods, rows, rooms, and data centers. With the exception of
the leaf nodes representing OSDs, the rest of the hierarchy is arbitrary, and
you may define it according to your own needs.

We recommend adapting your CRUSH map to your firm's hardware naming conventions
and using instance names that reflect the physical hardware. Your naming
practice can make it easier to administer the cluster and troubleshoot
problems when an OSD and/or other hardware malfunctions and the administrator
needs access to physical hardware.

In the following example, the bucket hierarchy has a leaf bucket named ``osd``,
and two node buckets named ``host`` and ``rack`` respectively.

.. ditaa::
                           +-----------+
                           | {o}rack   |
                           |   Bucket  |
                           +-----+-----+
                                 |
                 +---------------+---------------+
                 |                               |
           +-----+-----+                   +-----+-----+
           | {o}host   |                   | {o}host   |
           |   Bucket  |                   |   Bucket  |
           +-----+-----+                   +-----+-----+
                 |                               |
         +-------+-------+               +-------+-------+
         |               |               |               |
   +-----+-----+   +-----+-----+   +-----+-----+   +-----+-----+
   |    osd    |   |    osd    |   |    osd    |   |    osd    |
   |   Bucket  |   |   Bucket  |   |   Bucket  |   |   Bucket  |
   +-----------+   +-----------+   +-----------+   +-----------+

.. note:: The higher numbered ``rack`` bucket type aggregates the lower
          numbered ``host`` bucket type.

Since leaf nodes reflect storage devices declared under the ``#devices`` list
at the beginning of the CRUSH map, you do not need to declare them as bucket
instances. The second lowest bucket type in your hierarchy usually aggregates
the devices (i.e., it's usually the computer containing the storage media, and
uses whatever term you prefer to describe it, such as "node", "computer",
"server", "host", "machine", etc.). In high density environments, it is
increasingly common to see multiple hosts/nodes per chassis. You should account
for chassis failure too--e.g., the need to pull a chassis if a node fails may
result in bringing down numerous hosts/nodes and their OSDs.

When declaring a bucket instance, you must specify its type, give it a unique
name (string), assign it a unique ID expressed as a negative integer (optional),
specify a weight relative to the total capacity/capability of its item(s),
specify the bucket algorithm (usually ``straw``), and the hash (usually ``0``,
reflecting hash algorithm ``rjenkins1``). A bucket may have one or more items.
The items may consist of node buckets or leaves. Items may have a weight that
reflects the relative weight of the item.

You may declare a node bucket with the following syntax::

    [bucket-type] [bucket-name] {
            id [a unique negative numeric ID]
            weight [the relative capacity/capability of the item(s)]
            alg [the bucket type: uniform | list | tree | straw ]
            hash [the hash type: 0 by default]
            item [item-name] weight [weight]
    }

For example, using the diagram above, we would define two host buckets
and one rack bucket. The OSDs are declared as items within the host buckets::

    host node1 {
            id -1
            alg straw
            hash 0
            item osd.0 weight 1.00
            item osd.1 weight 1.00
    }

    host node2 {
            id -2
            alg straw
            hash 0
            item osd.2 weight 1.00
            item osd.3 weight 1.00
    }

    rack rack1 {
            id -3
            alg straw
            hash 0
            item node1 weight 2.00
            item node2 weight 2.00
    }

.. note:: In the foregoing example, note that the rack bucket does not contain
          any OSDs. Rather it contains lower level host buckets, and includes the
          sum total of their weight in the item entry.
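
Once a map containing buckets like these has been recompiled and injected, you
can confirm that the resulting hierarchy looks as intended without decompiling
the map again; for example::

    ceph osd tree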

.. topic:: Bucket Types

   Ceph supports four bucket types, each representing a tradeoff between
   performance and reorganization efficiency. If you are unsure of which bucket
   type to use, we recommend using a ``straw`` bucket. For a detailed
   discussion of bucket types, refer to
   `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_,
   and more specifically to **Section 3.4**. The bucket types are:

   #. **Uniform:** Uniform buckets aggregate devices with **exactly** the same
      weight. For example, when firms commission or decommission hardware, they
      typically do so with many machines that have exactly the same physical
      configuration (e.g., bulk purchases). When storage devices have exactly
      the same weight, you may use the ``uniform`` bucket type, which allows
      CRUSH to map replicas into uniform buckets in constant time. With
      non-uniform weights, you should use another bucket algorithm.

   #. **List**: List buckets aggregate their content as linked lists. Based on
      the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`P` algorithm,
      a list is a natural and intuitive choice for an **expanding cluster**:
      either an object is relocated to the newest device with some appropriate
      probability, or it remains on the older devices as before. The result is
      optimal data migration when items are added to the bucket. Items removed
      from the middle or tail of the list, however, can result in a significant
      amount of unnecessary movement, making list buckets most suitable for
      circumstances in which they **never (or very rarely) shrink**.

   #. **Tree**: Tree buckets use a binary search tree. They are more efficient
      than list buckets when a bucket contains a larger set of items. Based on
      the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`R` algorithm,
      tree buckets reduce the placement time to O(log :sub:`n`), making them
      suitable for managing much larger sets of devices or nested buckets.

   #. **Straw:** List and Tree buckets use a divide and conquer strategy
      in a way that either gives certain items precedence (e.g., those
      at the beginning of a list) or obviates the need to consider entire
      subtrees of items at all. That improves the performance of the replica
      placement process, but can also introduce suboptimal reorganization
      behavior when the contents of a bucket change due to an addition, removal,
      or re-weighting of an item. The straw bucket type allows all items to
      fairly “compete” against each other for replica placement through a
      process analogous to a draw of straws.

.. topic:: Hash

   Each bucket uses a hash algorithm. Currently, Ceph supports ``rjenkins1``.
   Enter ``0`` as your hash setting to select ``rjenkins1``.


.. _weightingbucketitems:

.. topic:: Weighting Bucket Items

   Ceph expresses bucket weights as doubles, which allows for fine
   weighting. A weight is the relative difference between device capacities. We
   recommend using ``1.00`` as the relative weight for a 1TB storage device.
   In such a scenario, a weight of ``0.5`` would represent approximately 500GB,
   and a weight of ``3.00`` would represent approximately 3TB. Higher level
   buckets have a weight that is the sum total of the weights of the leaf items
   they aggregate.

   A bucket item weight is one dimensional, but you may also calculate your
   item weights to reflect the performance of the storage drive. For example,
   if you have many 1TB drives where some have relatively low data transfer
   rate and the others have a relatively high data transfer rate, you may
   weight them differently, even though they have the same capacity (e.g.,
   a weight of 0.80 for the first set of drives with lower total throughput,
   and 1.20 for the second set of drives with higher total throughput).
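
Bucket item weights can also be adjusted on a running cluster without
decompiling and editing the map. For example, assuming an OSD named ``osd.0``
(the weight shown is illustrative)::

    ceph osd crush reweight osd.0 1.00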

.. _crushmaprules:

CRUSH Map Rules
---------------

CRUSH maps support the notion of 'CRUSH rules', which determine data placement
for a pool. The default CRUSH map has a rule for each pool. For large clusters,
you will likely create many pools, each of which may have its own non-default
CRUSH rule.

.. note:: In most cases, you will not need to modify the default rule. When
          you create a new pool, by default the rule will be set to ``0``.

CRUSH rules define placement and replication strategies or distribution policies
that allow you to specify exactly how CRUSH places object replicas. For
example, you might create a rule selecting a pair of targets for 2-way
mirroring, another rule for selecting three targets in two different data
centers for 3-way mirroring, and yet another rule for erasure coding over six
storage devices. For a detailed discussion of CRUSH rules, refer to
`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_,
and more specifically to **Section 3.2**.

A rule takes the following form::

    rule <rulename> {
            ruleset <ruleset>
            type [ replicated | erasure ]
            min_size <min-size>
            max_size <max-size>
            step take <bucket-name> [class <device-class>]
            step [choose|chooseleaf] [firstn|indep] <N> <bucket-type>
            step emit
    }

``ruleset``

:Description: A unique whole number for identifying the rule. The name ``ruleset``
              is a carry-over from the past, when it was possible to have multiple
              CRUSH rules per pool.

:Purpose: A component of the rule mask.
:Type: Integer
:Required: Yes
:Default: 0


``type``

:Description: Describes a rule for either a storage drive (replicated)
              or a RAID.

:Purpose: A component of the rule mask.
:Type: String
:Required: Yes
:Default: ``replicated``
:Valid Values: Currently only ``replicated`` and ``erasure``

``min_size``

:Description: If a pool makes fewer replicas than this number, CRUSH will
              **NOT** select this rule.

:Type: Integer
:Purpose: A component of the rule mask.
:Required: Yes
:Default: ``1``

``max_size``

:Description: If a pool makes more replicas than this number, CRUSH will
              **NOT** select this rule.

:Type: Integer
:Purpose: A component of the rule mask.
:Required: Yes
:Default: ``10``

``step take <bucket-name> [class <device-class>]``

:Description: Takes a bucket name, and begins iterating down the tree.
              If ``device-class`` is specified, it must match
              a class previously used when defining a device. All
              devices that do not belong to the class are excluded.
:Purpose: A component of the rule.
:Required: Yes
:Example: ``step take data``


``step choose firstn {num} type {bucket-type}``

:Description: Selects the number of buckets of the given type. The number is
              usually the number of replicas in the pool (i.e., pool size).

              - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available).
              - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets.
              - If ``{num} < 0``, choose ``pool-num-replicas`` minus the absolute
                value of ``{num}``.

:Purpose: A component of the rule.
:Prerequisite: Follows ``step take`` or ``step choose``.
:Example: ``step choose firstn 1 type row``


``step chooseleaf firstn {num} type {bucket-type}``

:Description: Selects a set of buckets of ``{bucket-type}`` and chooses a leaf
              node (an OSD) from the subtree of each bucket in the set of
              buckets. The number of buckets in the set is usually the number
              of replicas in the pool (i.e., pool size).

              - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available).
              - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets.
              - If ``{num} < 0``, choose ``pool-num-replicas`` minus the absolute
                value of ``{num}``.

:Purpose: A component of the rule. Usage removes the need to select a device using two steps.
:Prerequisite: Follows ``step take`` or ``step choose``.
:Example: ``step chooseleaf firstn 0 type row``

``step emit``

:Description: Outputs the current value and empties the stack. Typically used
              at the end of a rule, but may also be used to pick from different
              trees in the same rule.

:Purpose: A component of the rule.
:Prerequisite: Follows ``step choose``.
:Example: ``step emit``
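
Putting these steps together, the following sketch shows a complete replicated
rule that confines placement to OSDs of a single device class. The root bucket
name ``default``, the ``ssd`` class, and the ruleset number are illustrative
and must match what is defined elsewhere in your map::

    rule ssd-replicated {
            ruleset 6
            type replicated
            min_size 1
            max_size 10
            step take default class ssd
            step chooseleaf firstn 0 type host
            step emit
    }

Because ``chooseleaf firstn 0 type host`` selects as many distinct hosts as the
pool has replicas and then picks one OSD under each, no two replicas of an
object end up on the same host.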

.. important:: A given CRUSH rule may be assigned to multiple pools, but it
               is not possible for a single pool to have multiple CRUSH rules.


Placing Different Pools on Different OSDs
=========================================

Suppose you want to have most pools default to OSDs backed by large hard drives,
but have some pools mapped to OSDs backed by fast solid-state drives (SSDs).
It's possible to have multiple independent CRUSH hierarchies within the same
CRUSH map. Define two hierarchies with two different root nodes--one for hard
disks (e.g., "root platter") and one for SSDs (e.g., "root ssd") as shown
below::

    device 0 osd.0
    device 1 osd.1
    device 2 osd.2
    device 3 osd.3
    device 4 osd.4
    device 5 osd.5
    device 6 osd.6
    device 7 osd.7

    host ceph-osd-ssd-server-1 {
            id -1
            alg straw
            hash 0
            item osd.0 weight 1.00
            item osd.1 weight 1.00
    }

    host ceph-osd-ssd-server-2 {
            id -2
            alg straw
            hash 0
            item osd.2 weight 1.00
            item osd.3 weight 1.00
    }

    host ceph-osd-platter-server-1 {
            id -3
            alg straw
            hash 0
            item osd.4 weight 1.00
            item osd.5 weight 1.00
    }

    host ceph-osd-platter-server-2 {
            id -4
            alg straw
            hash 0
            item osd.6 weight 1.00
            item osd.7 weight 1.00
    }

    root platter {
            id -5
            alg straw
            hash 0
            item ceph-osd-platter-server-1 weight 2.00
            item ceph-osd-platter-server-2 weight 2.00
    }

    root ssd {
            id -6
            alg straw
            hash 0
            item ceph-osd-ssd-server-1 weight 2.00
            item ceph-osd-ssd-server-2 weight 2.00
    }

    rule data {
            ruleset 0
            type replicated
            min_size 2
            max_size 2
            step take platter
            step chooseleaf firstn 0 type host
            step emit
    }

    rule metadata {
            ruleset 1
            type replicated
            min_size 0
            max_size 10
            step take platter
            step chooseleaf firstn 0 type host
            step emit
    }

    rule rbd {
            ruleset 2
            type replicated
            min_size 0
            max_size 10
            step take platter
            step chooseleaf firstn 0 type host
            step emit
    }

    rule platter {
            ruleset 3
            type replicated
            min_size 0
            max_size 10
            step take platter
            step chooseleaf firstn 0 type host
            step emit
    }

    rule ssd {
            ruleset 4
            type replicated
            min_size 0
            max_size 4
            step take ssd
            step chooseleaf firstn 0 type host
            step emit
    }

    rule ssd-primary {
            ruleset 5
            type replicated
            min_size 5
            max_size 10
            step take ssd
            step chooseleaf firstn 1 type host
            step emit
            step take platter
            step chooseleaf firstn -1 type host
            step emit
    }

You can then set a pool to use the SSD rule by running::

    ceph osd pool set <poolname> crush_ruleset 4

Similarly, using the ``ssd-primary`` rule will cause each placement group in the
pool to be placed with an SSD as the primary and platters as the replicas.
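
To check which CRUSH rule a given pool is currently using, query the pool. The
field name here mirrors the ``crush_ruleset`` property used above; on newer
Ceph releases the property is named ``crush_rule`` instead::

    ceph osd pool get <poolname> crush_ruleset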

Tuning CRUSH, the hard way
--------------------------

If you can ensure that all clients are running recent code, you can
adjust the tunables by extracting the CRUSH map, modifying the values,
and reinjecting it into the cluster.

* Extract the latest CRUSH map::

      ceph osd getcrushmap -o /tmp/crush

* Adjust tunables. These values appear to offer the best behavior
  for both large and small clusters we tested with. You will need to
  additionally specify the ``--enable-unsafe-tunables`` argument to
  ``crushtool`` for this to work. Please use this option with
  extreme care::

      crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new

* Reinject the modified map::

      ceph osd setcrushmap -i /tmp/crush.new
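
Once the new map has been injected, you can confirm which tunables are now in
effect with::

    ceph osd crush show-tunables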

Legacy values
-------------

For reference, the legacy values for the CRUSH tunables can be set
with::

    crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy

Again, the special ``--enable-unsafe-tunables`` option is required.
Further, as noted above, be careful running old versions of the
``ceph-osd`` daemon after reverting to legacy values, as the feature
bit is not perfectly enforced.