Manually editing a CRUSH Map
============================

.. note:: Manually editing the CRUSH map is an advanced administrator
          operation. All CRUSH changes that are necessary for the
          overwhelming majority of installations are possible via the
          standard ceph CLI and do not require manual CRUSH map edits. If
          you have identified a use case where manual edits *are* necessary
          with recent Ceph releases, consider contacting the Ceph developers
          so that future versions of Ceph can obviate your corner case.

To edit an existing CRUSH map:

#. `Get the CRUSH map`_.
#. `Decompile`_ the CRUSH map.
#. Edit at least one of `Devices`_, `Buckets`_ and `Rules`_.
#. `Recompile`_ the CRUSH map.
#. `Set the CRUSH map`_.

For details on setting the CRUSH map rule for a specific pool, see `Set
Pool Values`_.

.. _Get the CRUSH map: #getcrushmap
.. _Decompile: #decompilecrushmap
.. _Devices: #crushmapdevices
.. _Buckets: #crushmapbuckets
.. _Rules: #crushmaprules
.. _Recompile: #compilecrushmap
.. _Set the CRUSH map: #setcrushmap
.. _Set Pool Values: ../pools#setpoolvalues

.. _getcrushmap:

Get a CRUSH Map
---------------

To get the CRUSH map for your cluster, execute the following:

.. prompt:: bash $

   ceph osd getcrushmap -o {compiled-crushmap-filename}

Ceph will output (``-o``) a compiled CRUSH map to the filename you
specified. Because the CRUSH map is in a compiled form, you must decompile
it before you can edit it.

.. _decompilecrushmap:

Decompile a CRUSH Map
---------------------

To decompile a CRUSH map, execute the following:

.. prompt:: bash $

   crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename}

.. _compilecrushmap:

Recompile a CRUSH Map
---------------------

To recompile a CRUSH map, execute the following:

.. prompt:: bash $

   crushtool -c {decompiled-crushmap-filename} -o {compiled-crushmap-filename}

.. _setcrushmap:

Set the CRUSH Map
-----------------

To set the CRUSH map for your cluster, execute the following:

.. prompt:: bash $

   ceph osd setcrushmap -i {compiled-crushmap-filename}

Ceph will load (``-i``) a compiled CRUSH map from the filename you specified.

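Taken together, a typical edit cycle looks something like the following,
where ``crushmap.bin``, ``crushmap.txt``, and ``crushmap-new.bin`` are
arbitrary example filenames and the decompiled map is edited with a text
editor between the two ``crushtool`` invocations:

.. prompt:: bash $

   ceph osd getcrushmap -o crushmap.bin
   crushtool -d crushmap.bin -o crushmap.txt
   crushtool -c crushmap.txt -o crushmap-new.bin
   ceph osd setcrushmap -i crushmap-new.bin
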
Sections
--------

There are six main sections to a CRUSH map:

#. **tunables:** The preamble at the top of the map describes any *tunables*
   that differ from the historical/legacy CRUSH behavior. These correct for
   old bugs, optimizations, or other changes made over the years to improve
   CRUSH's behavior.

#. **devices:** Devices are individual OSDs that store data.

#. **types:** Bucket ``types`` define the types of buckets used in your
   CRUSH hierarchy. Buckets consist of a hierarchical aggregation of
   storage locations (e.g., rows, racks, chassis, hosts) and their assigned
   weights.

#. **buckets:** Once you define bucket types, you must define each node in
   the hierarchy, its type, and which devices or other nodes it contains.

#. **rules:** Rules define policy about how data is distributed across
   devices in the hierarchy.

#. **choose_args:** ``choose_args`` are alternative weights associated with
   the hierarchy that have been adjusted to optimize data placement. A
   single ``choose_args`` map can be used for the entire cluster, or one
   can be created for each individual pool.

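To give a sense of how these sections fit together, the following is a
heavily abridged sketch of a decompiled map. The tunable, device, bucket,
and rule entries shown are illustrative placeholders only, and the optional
``choose_args`` section is omitted::

    # begin crush map
    tunable choose_total_tries 50

    # devices
    device 0 osd.0 class hdd

    # types
    type 0 osd
    type 1 host
    type 11 root

    # buckets
    host node1 {
        id -2
        alg straw2
        hash 0
        item osd.0 weight 1.000
    }
    root default {
        id -1
        alg straw2
        hash 0
        item node1 weight 1.000
    }

    # rules
    rule replicated_rule {
        id 0
        type replicated
        step take default
        step chooseleaf firstn 0 type host
        step emit
    }

    # end crush map
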
.. _crushmapdevices:

CRUSH Map Devices
-----------------

Devices are individual OSDs that store data. Usually one device is defined
here for each OSD daemon in your cluster. Devices are identified by an
``id`` (a non-negative integer) and a ``name``, normally ``osd.N``, where
``N`` is the device ID.

.. _crush-map-device-class:

Devices may also have a *device class* associated with them (e.g., ``hdd``
or ``ssd``), allowing them to be conveniently targeted by a CRUSH rule.

Devices are declared in the CRUSH map with the following form::

    # devices
    device {num} {osd.name} [class {class}]

For example::

    # devices
    device 0 osd.0 class ssd
    device 1 osd.1 class hdd
    device 2 osd.2
    device 3 osd.3

In most cases, each device maps to a single ``ceph-osd`` daemon. This is
normally a single storage device, a pair of devices (for example, one for
data and one for a journal or metadata), or in some cases a small RAID
device.

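Note that a device's class does not normally need to be edited into the map
by hand: it can be cleared and set at runtime with the standard CLI, as in
the following example, in which ``osd.2`` is only a placeholder:

.. prompt:: bash $

   ceph osd crush rm-device-class osd.2
   ceph osd crush set-device-class ssd osd.2
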
CRUSH Map Bucket Types
----------------------

The second list in the CRUSH map defines 'bucket' types. Buckets facilitate
a hierarchy of nodes and leaves. Node (or non-leaf) buckets typically
represent physical locations in a hierarchy. Nodes aggregate other nodes or
leaves. Leaf buckets represent ``ceph-osd`` daemons and their corresponding
storage media.

.. tip:: The term "bucket" used in the context of CRUSH means a node in the
   hierarchy, i.e. a location or a piece of physical hardware. It is a
   different concept from the term "bucket" when used in the context of
   RADOS Gateway APIs.

To add a bucket type to the CRUSH map, create a new line under your list of
bucket types. Enter ``type`` followed by a unique numeric ID and a bucket
name. By convention, there is one leaf bucket and it is ``type 0``;
however, you may give it any name you like (e.g., ``osd``, ``disk``,
``drive``, ``storage``)::

    # types
    type {num} {bucket-name}

For example::

    # types
    type 0 osd
    type 1 host
    type 2 chassis
    type 3 rack
    type 4 row
    type 5 pdu
    type 6 pod
    type 7 room
    type 8 datacenter
    type 9 zone
    type 10 region
    type 11 root

.. _crushmapbuckets:

CRUSH Map Bucket Hierarchy
--------------------------

The CRUSH algorithm distributes data objects among storage devices according
to a per-device weight value, approximating a uniform probability
distribution. CRUSH distributes objects and their replicas according to the
hierarchical cluster map you define. Your CRUSH map represents the available
storage devices and the logical elements that contain them.

To map placement groups to OSDs across failure domains, a CRUSH map defines
a hierarchical list of bucket types (i.e., under ``#types`` in the generated
CRUSH map). The purpose of creating a bucket hierarchy is to segregate the
leaf nodes by their failure domains, such as hosts, chassis, racks, power
distribution units, pods, rows, rooms, and data centers. With the exception
of the leaf nodes representing OSDs, the rest of the hierarchy is arbitrary,
and you may define it according to your own needs.

We recommend adapting your CRUSH map to your firm's hardware naming
conventions and using instance names that reflect the physical hardware.
Your naming practice can make it easier to administer the cluster and to
troubleshoot problems when an OSD or other hardware malfunctions and the
administrator needs access to the physical hardware.

In the following example, the bucket hierarchy has a leaf bucket named
``osd`` and two node buckets named ``host`` and ``rack``, respectively.

.. ditaa::

                                +-----------+
                                | {o}rack   |
                                |   Bucket  |
                                +-----+-----+
                                      |
                      +---------------+---------------+
                      |                               |
                +-----+-----+                   +-----+-----+
                | {o}host   |                   | {o}host   |
                |   Bucket  |                   |   Bucket  |
                +-----+-----+                   +-----+-----+
                      |                               |
              +-------+-------+               +-------+-------+
              |               |               |               |
        +-----+-----+   +-----+-----+   +-----+-----+   +-----+-----+
        |    osd    |   |    osd    |   |    osd    |   |    osd    |
        |   Bucket  |   |   Bucket  |   |   Bucket  |   |   Bucket  |
        +-----------+   +-----------+   +-----------+   +-----------+

.. note:: The higher-numbered ``rack`` bucket type aggregates the
   lower-numbered ``host`` bucket type.

Since leaf nodes reflect storage devices declared under the ``#devices``
list at the beginning of the CRUSH map, you do not need to declare them as
bucket instances. The second-lowest bucket type in your hierarchy usually
aggregates the devices (i.e., it is usually the computer containing the
storage media, described with whatever term you prefer, such as "node",
"computer", "server", "host", or "machine"). In high-density environments,
it is increasingly common to see multiple hosts/nodes per chassis. You
should account for chassis failure too; for example, the need to pull a
chassis if a node fails may result in bringing down numerous hosts/nodes
and their OSDs.

When declaring a bucket instance, you must specify its type, give it a
unique name (string), assign it a unique ID expressed as a negative integer
(optional), specify a weight relative to the total capacity/capability of
its item(s), specify the bucket algorithm (usually ``straw2``), and specify
the hash (usually ``0``, reflecting the hash algorithm ``rjenkins1``). A
bucket may have one or more items. The items may consist of node buckets or
leaves. Items may have a weight that reflects the relative weight of the
item.

You may declare a node bucket with the following syntax::

    [bucket-type] [bucket-name] {
        id [a unique negative numeric ID]
        weight [the relative capacity/capability of the item(s)]
        alg [the bucket type: uniform | list | tree | straw | straw2 ]
        hash [the hash type: 0 by default]
        item [item-name] weight [weight]
    }

For example, using the diagram above, we would define two host buckets and
one rack bucket. The OSDs are declared as items within the host buckets::

    host node1 {
        id -1
        alg straw2
        hash 0
        item osd.0 weight 1.00
        item osd.1 weight 1.00
    }

    host node2 {
        id -2
        alg straw2
        hash 0
        item osd.2 weight 1.00
        item osd.3 weight 1.00
    }

    rack rack1 {
        id -3
        alg straw2
        hash 0
        item node1 weight 2.00
        item node2 weight 2.00
    }

.. note:: In the foregoing example, the rack bucket does not contain any
   OSDs. Rather, it contains lower-level host buckets and includes the sum
   total of their weight in the item entries.

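Note that an equivalent hierarchy can usually be built at runtime instead of
by editing the map: buckets can be created and moved with the standard CLI,
as sketched below (the bucket names mirror the example above), and OSDs can
then be placed within a host bucket by means of ``ceph osd crush set``:

.. prompt:: bash $

   ceph osd crush add-bucket node1 host
   ceph osd crush add-bucket rack1 rack
   ceph osd crush move node1 rack=rack1
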
.. topic:: Bucket Types

   Ceph supports five bucket types, each representing a tradeoff between
   performance and reorganization efficiency. If you are unsure of which
   bucket type to use, we recommend using a ``straw2`` bucket. For a
   detailed discussion of bucket types, refer to
   `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_,
   and more specifically to **Section 3.4**. The bucket types are:

   #. **uniform**: Uniform buckets aggregate devices with **exactly** the
      same weight. For example, when firms commission or decommission
      hardware, they typically do so with many machines that have exactly
      the same physical configuration (e.g., bulk purchases). When storage
      devices have exactly the same weight, you may use the ``uniform``
      bucket type, which allows CRUSH to map replicas into uniform buckets
      in constant time. With non-uniform weights, you should use another
      bucket algorithm.

   #. **list**: List buckets aggregate their content as linked lists. Based
      on the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`P`
      algorithm, a list is a natural and intuitive choice for an
      **expanding cluster**: either an object is relocated to the newest
      device with some appropriate probability, or it remains on the older
      devices as before. The result is optimal data migration when items
      are added to the bucket. Items removed from the middle or tail of the
      list, however, can result in a significant amount of unnecessary
      movement, making list buckets most suitable for circumstances in
      which they **never (or very rarely) shrink**.

   #. **tree**: Tree buckets use a binary search tree. They are more
      efficient than list buckets when a bucket contains a larger set of
      items. Based on the :abbr:`RUSH (Replication Under Scalable Hashing)`
      :sub:`R` algorithm, tree buckets reduce the placement time to
      O(log :sub:`n`), making them suitable for managing much larger sets
      of devices or nested buckets.

   #. **straw**: List and tree buckets use a divide-and-conquer strategy in
      a way that either gives certain items precedence (e.g., those at the
      beginning of a list) or obviates the need to consider entire subtrees
      of items at all. That improves the performance of the replica
      placement process, but can also introduce suboptimal reorganization
      behavior when the contents of a bucket change due to an addition,
      removal, or re-weighting of an item. The straw bucket type allows all
      items to fairly “compete” against each other for replica placement
      through a process analogous to a draw of straws.

   #. **straw2**: Straw2 buckets improve on straw buckets by correctly
      avoiding any data movement between items when neighbor weights
      change. For example, if the weight of item A changes (including
      adding it anew or removing it completely), there will be data
      movement only to or from item A.

.. topic:: Hash

   Each bucket uses a hash algorithm. Currently, Ceph supports
   ``rjenkins1``. Enter ``0`` as your hash setting to select ``rjenkins1``.

.. _weightingbucketitems:

.. topic:: Weighting Bucket Items

   Ceph expresses bucket weights as doubles, which allows for fine
   weighting. A weight is the relative difference between device
   capacities. We recommend using ``1.00`` as the relative weight for a
   1 TB storage device. In such a scenario, a weight of ``0.5`` would
   represent approximately 500 GB, and a weight of ``3.00`` would represent
   approximately 3 TB. Higher-level buckets have a weight that is the sum
   total of the weights of the leaf items they aggregate.

   A bucket item weight is one-dimensional, but you may also calculate your
   item weights to reflect the performance of the storage drive. For
   example, if you have many 1 TB drives where some have a relatively low
   data transfer rate and others have a relatively high data transfer rate,
   you may weight them differently, even though they have the same capacity
   (e.g., a weight of ``0.80`` for the first set of drives with lower total
   throughput, and ``1.20`` for the second set with higher total
   throughput).

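Note that item weights can also be adjusted at runtime, without recompiling
the map, with a command of the following form (``osd.2`` and the weight are
placeholders):

.. prompt:: bash $

   ceph osd crush reweight osd.2 1.20
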
.. _crushmaprules:

CRUSH Map Rules
---------------

CRUSH maps support the notion of 'CRUSH rules', which are the rules that
determine data placement for a pool. The default CRUSH map has a rule for
each pool. For large clusters, you will likely create many pools, and each
pool may have its own non-default CRUSH rule.

.. note:: In most cases, you will not need to modify the default rule. When
   you create a new pool, by default the rule will be set to ``0``.

CRUSH rules define placement and replication strategies or distribution
policies that allow you to specify exactly how CRUSH places object
replicas. For example, you might create a rule selecting a pair of targets
for 2-way mirroring, another rule for selecting three targets in two
different data centers for 3-way mirroring, and yet another rule for
erasure coding over six storage devices. For a detailed discussion of CRUSH
rules, refer to
`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_,
and more specifically to **Section 3.2**.

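Note that simple replicated rules can usually be created without editing the
map at all, using the standard CLI. For example, the following command (in
which the rule name, root, failure domain, and device class are
placeholders) creates a rule that places replicas on separate hosts using
only SSD devices:

.. prompt:: bash $

   ceph osd crush rule create-replicated fast_rule default host ssd
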
A rule takes the following form::

    rule <rulename> {

        id [a unique whole numeric ID]
        type [ replicated | erasure ]
        min_size <min-size>
        max_size <max-size>
        step take <bucket-name> [class <device-class>]
        step [choose|chooseleaf] [firstn|indep] <N> type <bucket-type>
        step emit
    }

``id``

:Description: A unique whole number for identifying the rule.
:Purpose: A component of the rule mask.
:Type: Integer
:Required: Yes
:Default: 0

``type``

:Description: Describes a rule for either a replicated pool
              (``replicated``) or an erasure-coded pool (``erasure``).
:Purpose: A component of the rule mask.
:Type: String
:Required: Yes
:Default: ``replicated``
:Valid Values: Currently only ``replicated`` and ``erasure``

``min_size``

:Description: If a pool makes fewer replicas than this number, CRUSH will
              **NOT** select this rule.
:Type: Integer
:Purpose: A component of the rule mask.
:Required: Yes
:Default: ``1``

``max_size``

:Description: If a pool makes more replicas than this number, CRUSH will
              **NOT** select this rule.
:Type: Integer
:Purpose: A component of the rule mask.
:Required: Yes
:Default: 10

``step take <bucket-name> [class <device-class>]``

:Description: Takes a bucket name and begins iterating down the tree. If
              ``device-class`` is specified, it must match a class
              previously used when defining a device. All devices that do
              not belong to that class are excluded.
:Purpose: A component of the rule.
:Required: Yes
:Example: ``step take data``

``step choose firstn {num} type {bucket-type}``

:Description: Selects the number of buckets of the given type from within
              the current bucket. The number is usually the number of
              replicas in the pool (i.e., pool size).

              - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available).
              - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets.
              - If ``{num} < 0``, it means ``pool-num-replicas - {num}``.

:Purpose: A component of the rule.
:Prerequisite: Follows ``step take`` or ``step choose``.
:Example: ``step choose firstn 1 type row``

``step chooseleaf firstn {num} type {bucket-type}``

:Description: Selects a set of buckets of ``{bucket-type}`` and chooses a
              leaf node (that is, an OSD) from the subtree of each bucket
              in the set. The number of buckets in the set is usually the
              number of replicas in the pool (i.e., pool size).

              - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available).
              - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets.
              - If ``{num} < 0``, it means ``pool-num-replicas - {num}``.

:Purpose: A component of the rule. Using it removes the need to select a
          device in two separate steps.
:Prerequisite: Follows ``step take`` or ``step choose``.
:Example: ``step chooseleaf firstn 0 type row``

``step emit``

:Description: Outputs the current value and empties the stack. Typically
              used at the end of a rule, but may also be used to pick from
              different trees in the same rule.
:Purpose: A component of the rule.
:Prerequisite: Follows ``step choose``.
:Example: ``step emit``

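Putting these steps together, a minimal sketch of a replicated rule that
confines data to SSD devices and spreads replicas across racks might look
like the following (the root ``default``, the ``ssd`` device class, and the
``rack`` bucket type are assumptions about your hierarchy)::

    rule ssd_replicated_racks {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take default class ssd
        step chooseleaf firstn 0 type rack
        step emit
    }
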
.. important:: A given CRUSH rule may be assigned to multiple pools, but it
   is not possible for a single pool to have multiple CRUSH rules.

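For reference, a rule is assigned to a pool with a command of the following
form, in which ``mypool`` and ``ssd_replicated_racks`` are placeholders (see
`Set Pool Values`_ for details):

.. prompt:: bash $

   ceph osd pool set mypool crush_rule ssd_replicated_racks
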
``firstn`` versus ``indep``

:Description: Controls the replacement strategy CRUSH uses when items
              (OSDs) are marked down in the CRUSH map. If this rule is to
              be used with replicated pools, it should be ``firstn``; if it
              is for erasure-coded pools, it should be ``indep``.

              The reason has to do with how the two modes behave when a
              previously selected device fails. Suppose a PG is stored on
              OSDs 1, 2, 3, 4, 5 and then OSD 3 goes down.

              In ``firstn`` mode, CRUSH simply adjusts its calculation: it
              selects 1 and 2, then selects 3 but discovers that it is
              down, so it retries and selects 4 and 5, and finally goes on
              to select a new OSD 6. The final CRUSH mapping change is
              therefore 1, 2, 3, 4, 5 -> 1, 2, 4, 5, 6.

              For an erasure-coded pool, however, that change means the
              data mapped to OSDs 4, 5, and 6 has changed as well. The
              ``indep`` mode avoids this: when it encounters the failed
              OSD 3, it retries and picks 6 in that position, for a final
              transformation of 1, 2, 3, 4, 5 -> 1, 2, 6, 4, 5.

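For illustration, an erasure-coded pool therefore typically uses a rule of
the following shape. This is only a sketch (in practice the rule is
generated from the pool's erasure-code profile), and the root ``default``
and the ``host`` failure domain are assumptions::

    rule ec_rule_sketch {
        id 2
        type erasure
        step take default
        step chooseleaf indep 0 type host
        step emit
    }
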
.. _crush-reclassify:

Migrating from a legacy SSD rule to device classes
--------------------------------------------------

It used to be necessary to manually edit your CRUSH map and maintain a
parallel hierarchy for each specialized device type (e.g., SSD) in order to
write rules that apply to those devices. Since the Luminous release, the
*device class* feature has enabled this transparently.

However, migrating from an existing, manually customized per-device map to
the new device class rules in the trivial way will cause all data in the
system to be reshuffled.

``crushtool`` has a few commands that can transform a legacy rule and
hierarchy so that you can start using the new class-based rules. There are
three types of transformation possible:

#. ``--reclassify-root <root-name> <device-class>``

   This will take everything in the hierarchy beneath ``<root-name>`` and
   adjust any rules that reference that root via a ``take <root-name>``
   step to instead use ``take <root-name> class <device-class>``. It
   renumbers the buckets in such a way that the old IDs are instead used
   for the specified class's "shadow tree", so that no data movement takes
   place.

   For example, imagine you have an existing rule like::

      rule replicated_rule {
          id 0
          type replicated
          step take default
          step chooseleaf firstn 0 type rack
          step emit
      }

   If you reclassify the root ``default`` as class ``hdd``, the rule will
   become::

      rule replicated_rule {
          id 0
          type replicated
          step take default class hdd
          step chooseleaf firstn 0 type rack
          step emit
      }

#. ``--set-subtree-class <bucket-name> <device-class>``

   This will mark every device in the subtree rooted at *bucket-name* with
   the specified device class.

   This is normally used in conjunction with the ``--reclassify-root``
   option to ensure that all devices in that root are labeled with the
   correct class. In some situations, however, some of those devices
   (correctly) have a different class and we do not want to relabel them.
   In such cases, the ``--set-subtree-class`` option can be omitted. This
   means that the remapping process will not be perfect: the previous rule
   distributed across devices of multiple classes, but the adjusted rules
   will map only to devices of the specified *device-class*. That is often
   an acceptable amount of data movement when the number of outlier devices
   is small.

#. ``--reclassify-bucket <match-pattern> <device-class> <default-parent>``

   This will allow you to merge a parallel type-specific hierarchy with the
   normal hierarchy. For example, many users have maps like::

      host node1 {
          id -2           # do not change unnecessarily
          # weight 109.152
          alg straw2
          hash 0  # rjenkins1
          item osd.0 weight 9.096
          item osd.1 weight 9.096
          item osd.2 weight 9.096
          item osd.3 weight 9.096
          item osd.4 weight 9.096
          item osd.5 weight 9.096
          ...
      }

      host node1-ssd {
          id -10          # do not change unnecessarily
          # weight 2.000
          alg straw2
          hash 0  # rjenkins1
          item osd.80 weight 2.000
          ...
      }

      root default {
          id -1           # do not change unnecessarily
          alg straw2
          hash 0  # rjenkins1
          item node1 weight 110.967
          ...
      }

      root ssd {
          id -18          # do not change unnecessarily
          # weight 16.000
          alg straw2
          hash 0  # rjenkins1
          item node1-ssd weight 2.000
          ...
      }

   This function will reclassify each bucket that matches a pattern. The
   pattern can be of the form ``%suffix`` or ``prefix%``. In the example
   above, we would use the pattern ``%-ssd``. For each matched bucket, the
   remaining portion of the name (the part that matches the ``%`` wildcard)
   specifies the *base bucket*. All devices in the matched bucket are
   labeled with the specified device class and then moved to the base
   bucket. If the base bucket does not exist (e.g., ``node12-ssd`` exists
   but ``node12`` does not), then it is created and linked underneath the
   specified *default parent* bucket. In each case, we are careful to
   preserve the old bucket IDs for the new shadow buckets to prevent data
   movement. Any rules with ``take`` steps referencing the old buckets are
   adjusted.

#. ``--reclassify-bucket <bucket-name> <device-class> <base-bucket>``

   The same command can also be used without a wildcard to map a single
   bucket. For example, in the previous example, we want the ``ssd`` bucket
   to be mapped to the ``default`` bucket.

The final command to convert the map comprising the above fragments would be
something like:

.. prompt:: bash $

   ceph osd getcrushmap -o original
   crushtool -i original --reclassify \
     --set-subtree-class default hdd \
     --reclassify-root default hdd \
     --reclassify-bucket %-ssd ssd default \
     --reclassify-bucket ssd ssd default \
     -o adjusted

39ae355f
TL
678In order to ensure that the conversion is correct, there is a ``--compare`` command that will test a large sample of inputs against the CRUSH map and check that the same result is output. These inputs are controlled by the same options that apply to the ``--test`` command. For the above example,:
679
680.. prompt:: bash $
681
682 crushtool -i original --compare adjusted
683
684::
f64942e4 685
f64942e4
AA
686 rule 0 had 0/10240 mismatched mappings (0)
687 rule 1 had 0/10240 mismatched mappings (0)
688 maps appear equivalent
689
39ae355f
TL
690If there were differences, the ratio of remapped inputs would be reported in
691the parentheses.
692
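The same kind of sample mapping can also be inspected directly with
``crushtool --test``; for example (the rule number, replica count, and
input range here are illustrative):

.. prompt:: bash $

   crushtool -i adjusted --test --show-mappings --rule 0 --num-rep 3 --max-x 9
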
When you are satisfied with the adjusted map, apply it to the cluster with a
command of the form:

.. prompt:: bash $

   ceph osd setcrushmap -i adjusted

Tuning CRUSH, the hard way
--------------------------

If you can ensure that all clients are running recent code, you can adjust
the tunables by extracting the CRUSH map, modifying the values, and
reinjecting it into the cluster.

* Extract the latest CRUSH map:

  .. prompt:: bash $

     ceph osd getcrushmap -o /tmp/crush

* Adjust tunables. These values appear to offer the best behavior for both
  large and small clusters we tested with. You will need to additionally
  specify the ``--enable-unsafe-tunables`` argument to ``crushtool`` for
  this to work. Please use this option with extreme care:

  .. prompt:: bash $

     crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new

* Reinject the modified map:

  .. prompt:: bash $

     ceph osd setcrushmap -i /tmp/crush.new

Legacy values
-------------

For reference, the legacy values for the CRUSH tunables can be set with:

.. prompt:: bash $

   crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy

Again, the special ``--enable-unsafe-tunables`` option is required. Further,
as noted above, be careful running old versions of the ``ceph-osd`` daemon
after reverting to legacy values, as the feature bit is not perfectly
enforced.

.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf