1 ============
2 CRUSH Maps
3 ============
4
5 The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm
6 determines how to store and retrieve data by computing data storage locations.
7 CRUSH empowers Ceph clients to communicate with OSDs directly rather than
8 through a centralized server or broker. With an algorithmically determined
9 method of storing and retrieving data, Ceph avoids a single point of failure, a
10 performance bottleneck, and a physical limit to its scalability.
11
12 CRUSH requires a map of your cluster, and uses the CRUSH map to pseudo-randomly
13 store and retrieve data in OSDs with a uniform distribution of data across the
14 cluster. For a detailed discussion of CRUSH, see
15 `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_
16
17 CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)`, a list of
18 'buckets' for aggregating the devices into physical locations, and a list of
19 rules that tell CRUSH how it should replicate data in a Ceph cluster's pools. By
20 reflecting the underlying physical organization of the installation, CRUSH can
21 model—and thereby address—potential sources of correlated device failures.
22 Typical sources include physical proximity, a shared power source, and a shared
23 network. By encoding this information into the cluster map, CRUSH placement
24 policies can separate object replicas across different failure domains while
25 still maintaining the desired distribution. For example, to address the
26 possibility of concurrent failures, it may be desirable to ensure that data
27 replicas are on devices using different shelves, racks, power supplies,
28 controllers, and/or physical locations.
29
30 When you deploy OSDs they are automatically placed within the CRUSH map under a
31 ``host`` node named with the hostname for the host they are running on. This,
32 combined with the default CRUSH failure domain, ensures that replicas or erasure
33 code shards are separated across hosts and a single host failure will not
affect availability. For larger clusters, however, administrators should
carefully consider their choice of failure domain. Separating replicas
across racks, for example, is common for mid- to large-sized clusters.
36
37
38 CRUSH Location
39 ==============
40
41 The location of an OSD in terms of the CRUSH map's hierarchy is
42 referred to as a ``crush location``. This location specifier takes the
43 form of a list of key and value pairs describing a position. For
44 example, if an OSD is in a particular row, rack, chassis and host, and
45 is part of the 'default' CRUSH tree (this is the case for the vast
46 majority of clusters), its crush location could be described as::
47
48 root=default row=a rack=a2 chassis=a2a host=a2a1
49
50 Note:
51
#. The order of the keys does not matter.
53 #. The key name (left of ``=``) must be a valid CRUSH ``type``. By default
54 these include root, datacenter, room, row, pod, pdu, rack, chassis and host,
55 but those types can be customized to be anything appropriate by modifying
56 the CRUSH map.
57 #. Not all keys need to be specified. For example, by default, Ceph
58 automatically sets a ``ceph-osd`` daemon's location to be
59 ``root=default host=HOSTNAME`` (based on the output from ``hostname -s``).
60
61 The crush location for an OSD is normally expressed via the ``crush location``
62 config option being set in the ``ceph.conf`` file. Each time the OSD starts,
63 it verifies it is in the correct location in the CRUSH map and, if it is not,
it moves itself. To disable this automatic CRUSH map management, add the
65 following to your configuration file in the ``[osd]`` section::
66
67 osd crush update on start = false
68
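For example, a host whose OSDs should be placed in a particular row and
rack could carry something like the following in the ``[osd]`` section of
its ``ceph.conf`` (the bucket names reuse the illustrative example above)::

    crush location = root=default row=a rack=a2 chassis=a2a host=a2a1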
69
70 Custom location hooks
71 ---------------------
72
73 A customized location hook can be used to generate a more complete
74 crush location on startup. The sample ``ceph-crush-location`` utility
75 will generate a CRUSH location string for a given daemon. The
76 location is based on, in order of preference:
77
78 #. A ``crush location`` option in ceph.conf.
79 #. A default of ``root=default host=HOSTNAME`` where the hostname is
80 generated with the ``hostname -s`` command.
81
82 This is not useful by itself, as the OSD itself has the exact same
83 behavior. However, the script can be modified to provide additional
84 location fields (for example, the rack or datacenter), and then the
85 hook enabled via the config option::
86
87 crush location hook = /path/to/customized-ceph-crush-location
88
89 This hook is passed several arguments (below) and should output a single line
to stdout with the CRUSH location description::
91
92 $ ceph-crush-location --cluster CLUSTER --id ID --type TYPE
93
94 where the cluster name is typically 'ceph', the id is the daemon
95 identifier (the OSD number), and the daemon type is typically ``osd``.
96
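A customized hook is typically just a small script. The following is only
a minimal sketch: it assumes the rack name is stored in a local file
(``/etc/rack`` is purely illustrative) and ignores the arguments it is
passed::

    #!/bin/sh
    # Illustrative custom CRUSH location hook.
    # Emits a location string that adds a rack field; the --cluster,
    # --id and --type arguments passed by the daemon are ignored here.
    RACK=$(cat /etc/rack 2>/dev/null || echo unknown)
    echo "root=default rack=${RACK} host=$(hostname -s)"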
97
98 CRUSH structure
99 ===============
100
101 The CRUSH map consists of, loosely speaking, a hierarchy describing
102 the physical topology of the cluster, and a set of rules defining
103 policy about how we place data on those devices. The hierarchy has
104 devices (``ceph-osd`` daemons) at the leaves, and internal nodes
105 corresponding to other physical features or groupings: hosts, racks,
106 rows, datacenters, and so on. The rules describe how replicas are
107 placed in terms of that hierarchy (e.g., 'three replicas in different
108 racks').
109
110 Devices
111 -------
112
113 Devices are individual ``ceph-osd`` daemons that can store data. You
114 will normally have one defined here for each OSD daemon in your
115 cluster. Devices are identified by an id (a non-negative integer) and
116 a name, normally ``osd.N`` where ``N`` is the device id.
117
118 Devices may also have a *device class* associated with them (e.g.,
``hdd`` or ``ssd``), allowing them to be conveniently targeted by a
120 crush rule.
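
Device classes are normally detected automatically. The classes in use can
be listed, and the class of an OSD overridden if the detection is wrong,
with commands along the following lines (``osd.0`` and ``nvme`` are only
illustrative; an existing class may need to be removed before a new one
can be set)::

    ceph osd crush class ls
    ceph osd crush rm-device-class osd.0
    ceph osd crush set-device-class nvme osd.0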
121
122 Types and Buckets
123 -----------------
124
125 A bucket is the CRUSH term for internal nodes in the hierarchy: hosts,
126 racks, rows, etc. The CRUSH map defines a series of *types* that are
127 used to describe these nodes. By default, these types include:
128
129 - osd (or device)
130 - host
131 - chassis
132 - rack
133 - row
134 - pdu
135 - pod
136 - room
137 - datacenter
138 - region
139 - root
140
141 Most clusters make use of only a handful of these types, and others
142 can be defined as needed.
143
144 The hierarchy is built with devices (normally type ``osd``) at the
145 leaves, interior nodes with non-device types, and a root node of type
146 ``root``. For example,
147
148 .. ditaa::
149
150 +-----------------+
151 | {o}root default |
152 +--------+--------+
153 |
154 +---------------+---------------+
155 | |
156 +-------+-------+ +-----+-------+
157 | {o}host foo | | {o}host bar |
158 +-------+-------+ +-----+-------+
159 | |
160 +-------+-------+ +-------+-------+
161 | | | |
162 +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+
163 | osd.0 | | osd.1 | | osd.2 | | osd.3 |
164 +-----------+ +-----------+ +-----------+ +-----------+
165
166 Each node (device or bucket) in the hierarchy has a *weight*
167 associated with it, indicating the relative proportion of the total
168 data that device or hierarchy subtree should store. Weights are set
169 at the leaves, indicating the size of the device, and automatically
170 sum up the tree from there, such that the weight of the default node
171 will be the total of all devices contained beneath it. Normally
172 weights are in units of terabytes (TB).
173
You can get a simple view of the CRUSH hierarchy for your cluster,
175 including the weights, with::
176
177 ceph osd crush tree
178
179 Rules
180 -----
181
182 Rules define policy about how data is distributed across the devices
183 in the hierarchy.
184
185 CRUSH rules define placement and replication strategies or
186 distribution policies that allow you to specify exactly how CRUSH
187 places object replicas. For example, you might create a rule selecting
188 a pair of targets for 2-way mirroring, another rule for selecting
189 three targets in two different data centers for 3-way mirroring, and
190 yet another rule for erasure coding over six storage devices. For a
191 detailed discussion of CRUSH rules, refer to `CRUSH - Controlled,
192 Scalable, Decentralized Placement of Replicated Data`_, and more
193 specifically to **Section 3.2**.
194
195 In almost all cases, CRUSH rules can be created via the CLI by
196 specifying the *pool type* they will be used for (replicated or
197 erasure coded), the *failure domain*, and optionally a *device class*.
198 In rare cases rules must be written by hand by manually editing the
199 CRUSH map.
200
201 You can see what rules are defined for your cluster with::
202
203 ceph osd crush rule ls
204
205 You can view the contents of the rules with::
206
207 ceph osd crush rule dump
208
209
Weight sets
-----------
212
213 A *weight set* is an alternative set of weights to use when
214 calculating data placement. The normal weights associated with each
215 device in the CRUSH map are set based on the device size and indicate
216 how much data we *should* be storing where. However, because CRUSH is
217 based on a pseudorandom placement process, there is always some
variation from this ideal distribution, in the same way that rolling a
die sixty times will not result in exactly 10 ones and 10
220 sixes. Weight sets allow the cluster to do a numerical optimization
221 based on the specifics of your cluster (hierarchy, pools, etc.) to achieve
222 a balanced distribution.
223
224 There are two types of weight sets supported:
225
226 #. A **compat** weight set is a single alternative set of weights for
227 each device and node in the cluster. This is not well-suited for
228 correcting for all anomalies (for example, placement groups for
229 different pools may be different sizes and have different load
230 levels, but will be mostly treated the same by the balancer).
231 However, compat weight sets have the huge advantage that they are
232 *backward compatible* with previous versions of Ceph, which means
233 that even though weight sets were first introduced in Luminous
234 v12.2.z, older clients (e.g., firefly) can still connect to the
235 cluster when a compat weight set is being used to balance data.
236 #. A **per-pool** weight set is more flexible in that it allows
237 placement to be optimized for each data pool. Additionally,
238 weights can be adjusted for each position of placement, allowing
the optimizer to correct for a subtle skew of data toward devices
with small weights relative to their peers (an effect that is
usually only apparent in very large clusters but which can cause
242 balancing problems).
243
244 When weight sets are in use, the weights associated with each node in
the hierarchy are visible as a separate column (labeled either
246 ``(compat)`` or the pool name) from the command::
247
248 ceph osd crush tree
249
250 When both *compat* and *per-pool* weight sets are in use, data
251 placement for a particular pool will use its own per-pool weight set
252 if present. If not, it will use the compat weight set if present. If
neither is present, it will use the normal CRUSH weights.
254
255 Although weight sets can be set up and manipulated by hand, it is
256 recommended that the *balancer* module be enabled to do so
257 automatically.
258
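For example, enabling the balancer and asking it to maintain a compat
weight set might look like this (module and mode names as found in
Luminous)::

    ceph mgr module enable balancer
    ceph balancer mode crush-compat
    ceph balancer on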
259
260 Modifying the CRUSH map
261 =======================
262
263 .. _addosd:
264
265 Add/Move an OSD
266 ---------------
267
.. note:: OSDs are normally automatically added to the CRUSH map when
   the OSD is created. This command is rarely needed.
270
271 To add or move an OSD in the CRUSH map of a running cluster::
272
273 ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...]
274
275 Where:
276
277 ``name``
278
279 :Description: The full name of the OSD.
280 :Type: String
281 :Required: Yes
282 :Example: ``osd.0``
283
284
285 ``weight``
286
:Description: The CRUSH weight for the OSD, normally its size measured in terabytes (TB).
288 :Type: Double
289 :Required: Yes
290 :Example: ``2.0``
291
292
293 ``root``
294
295 :Description: The root node of the tree in which the OSD resides (normally ``default``)
296 :Type: Key/value pair.
297 :Required: Yes
298 :Example: ``root=default``
299
300
301 ``bucket-type``
302
303 :Description: You may specify the OSD's location in the CRUSH hierarchy.
304 :Type: Key/value pairs.
305 :Required: No
306 :Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``
307
308
309 The following example adds ``osd.0`` to the hierarchy, or moves the
310 OSD from a previous location. ::
311
312 ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1
313
314
315 Adjust OSD weight
316 -----------------
317
.. note:: Normally OSDs automatically add themselves to the CRUSH map
   with the correct weight when they are created. This command
   is rarely needed.
321
322 To adjust an OSD's crush weight in the CRUSH map of a running cluster, execute
323 the following::
324
325 ceph osd crush reweight {name} {weight}
326
327 Where:
328
329 ``name``
330
331 :Description: The full name of the OSD.
332 :Type: String
333 :Required: Yes
334 :Example: ``osd.0``
335
336
337 ``weight``
338
339 :Description: The CRUSH weight for the OSD.
340 :Type: Double
341 :Required: Yes
342 :Example: ``2.0``
343
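For example, using the example values above, the following sets the
weight of ``osd.0`` to ``2.0``::

    ceph osd crush reweight osd.0 2.0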
344
345 .. _removeosd:
346
347 Remove an OSD
348 -------------
349
.. note:: OSDs are normally removed from the CRUSH map as part of the
   ``ceph osd purge`` command. This command is rarely needed.
352
353 To remove an OSD from the CRUSH map of a running cluster, execute the
354 following::
355
356 ceph osd crush remove {name}
357
358 Where:
359
360 ``name``
361
362 :Description: The full name of the OSD.
363 :Type: String
364 :Required: Yes
365 :Example: ``osd.0``
366
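For example, the following removes ``osd.0`` from the CRUSH map::

    ceph osd crush remove osd.0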
367
368 Add a Bucket
369 ------------
370
.. note:: Buckets are normally implicitly created when an OSD is added
   that specifies a ``{bucket-type}={bucket-name}`` as part of its
   location and a bucket with that name does not already exist. This
   command is typically used when manually adjusting the structure of the
   hierarchy after OSDs have been created (for example, to move a
   series of hosts underneath a new rack-level bucket).
377
378 To add a bucket in the CRUSH map of a running cluster, execute the
379 ``ceph osd crush add-bucket`` command::
380
381 ceph osd crush add-bucket {bucket-name} {bucket-type}
382
383 Where:
384
385 ``bucket-name``
386
387 :Description: The full name of the bucket.
388 :Type: String
389 :Required: Yes
390 :Example: ``rack12``
391
392
393 ``bucket-type``
394
395 :Description: The type of the bucket. The type must already exist in the hierarchy.
396 :Type: String
397 :Required: Yes
398 :Example: ``rack``
399
400
401 The following example adds the ``rack12`` bucket to the hierarchy::
402
403 ceph osd crush add-bucket rack12 rack
404
405 Move a Bucket
406 -------------
407
408 To move a bucket to a different location or position in the CRUSH map
409 hierarchy, execute the following::
410
411 ceph osd crush move {bucket-name} {bucket-type}={bucket-name}, [...]
412
413 Where:
414
415 ``bucket-name``
416
417 :Description: The name of the bucket to move/reposition.
418 :Type: String
419 :Required: Yes
420 :Example: ``foo-bar-1``
421
422 ``bucket-type``
423
424 :Description: You may specify the bucket's location in the CRUSH hierarchy.
425 :Type: Key/value pairs.
426 :Required: No
427 :Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``
428
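For example, the following moves the ``foo-bar-1`` host bucket to the
location given in the example above (all names are illustrative)::

    ceph osd crush move foo-bar-1 datacenter=dc1 room=room1 row=foo rack=bar
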
429 Remove a Bucket
430 ---------------
431
432 To remove a bucket from the CRUSH map hierarchy, execute the following::
433
434 ceph osd crush remove {bucket-name}
435
436 .. note:: A bucket must be empty before removing it from the CRUSH hierarchy.
437
438 Where:
439
440 ``bucket-name``
441
442 :Description: The name of the bucket that you'd like to remove.
443 :Type: String
444 :Required: Yes
445 :Example: ``rack12``
446
447 The following example removes the ``rack12`` bucket from the hierarchy::
448
449 ceph osd crush remove rack12
450
451 Creating a compat weight set
452 ----------------------------
453
.. note:: This step is normally done automatically by the ``balancer``
   module when enabled.
456
457 To create a *compat* weight set::
458
459 ceph osd crush weight-set create-compat
460
461 Weights for the compat weight set can be adjusted with::
462
463 ceph osd crush weight-set reweight-compat {name} {weight}
464
465 The compat weight set can be destroyed with::
466
467 ceph osd crush weight-set rm-compat
468
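As a worked example (the OSD name and weight are illustrative; the
balancer module normally manages this for you)::

    ceph osd crush weight-set create-compat
    ceph osd crush weight-set reweight-compat osd.0 1.8
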
469 Creating per-pool weight sets
470 -----------------------------
471
To create a weight set for a specific pool::
473
474 ceph osd crush weight-set create {pool-name} {mode}
475
476 .. note:: Per-pool weight sets require that all servers and daemons
477 run Luminous v12.2.z or later.
478
479 Where:
480
481 ``pool-name``
482
483 :Description: The name of a RADOS pool
484 :Type: String
485 :Required: Yes
486 :Example: ``rbd``
487
488 ``mode``
489
490 :Description: Either ``flat`` or ``positional``. A *flat* weight set
491 has a single weight for each device or bucket. A
492 *positional* weight set has a potentially different
493 weight for each position in the resulting placement
494 mapping. For example, if a pool has a replica count of
495 3, then a positional weight set will have three weights
496 for each device and bucket.
497 :Type: String
498 :Required: Yes
499 :Example: ``flat``
500
501 To adjust the weight of an item in a weight set::
502
503 ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]}
504
To list existing weight sets::
506
507 ceph osd crush weight-set ls
508
To remove a weight set::
510
511 ceph osd crush weight-set rm {pool-name}
512
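As a worked example, a positional weight set for a 3-replica pool named
``rbd`` (the pool name matches the example above; the weights are
illustrative) could be created and adjusted with::

    ceph osd crush weight-set create rbd positional
    ceph osd crush weight-set reweight rbd osd.0 0.9 1.0 1.0
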
513 Creating a rule for a replicated pool
514 -------------------------------------
515
516 For a replicated pool, the primary decision when creating the CRUSH
517 rule is what the failure domain is going to be. For example, if a
518 failure domain of ``host`` is selected, then CRUSH will ensure that
519 each replica of the data is stored on a different host. If ``rack``
520 is selected, then each replica will be stored in a different rack.
521 What failure domain you choose primarily depends on the size of your
522 cluster and how your hierarchy is structured.
523
524 Normally, the entire cluster hierarchy is nested beneath a root node
525 named ``default``. If you have customized your hierarchy, you may
526 want to create a rule nested at some other node in the hierarchy. It
527 doesn't matter what type is associated with that node (it doesn't have
528 to be a ``root`` node).
529
530 It is also possible to create a rule that restricts data placement to
531 a specific *class* of device. By default, Ceph OSDs automatically
532 classify themselves as either ``hdd`` or ``ssd``, depending on the
533 underlying type of device being used. These classes can also be
534 customized.
535
To create a replicated rule::
537
538 ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}]
539
540 Where:
541
542 ``name``
543
544 :Description: The name of the rule
545 :Type: String
546 :Required: Yes
547 :Example: ``rbd-rule``
548
549 ``root``
550
551 :Description: The name of the node under which data should be placed.
552 :Type: String
553 :Required: Yes
554 :Example: ``default``
555
556 ``failure-domain-type``
557
558 :Description: The type of CRUSH nodes across which we should separate replicas.
559 :Type: String
560 :Required: Yes
561 :Example: ``rack``
562
563 ``class``
564
565 :Description: The device class data should be placed on.
566 :Type: String
567 :Required: No
568 :Example: ``ssd``
569
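For example, using the example values above, the following creates a rule
that separates replicas across racks and restricts placement to SSDs, and
then (hypothetically) assigns it to an existing pool named ``rbd`` via the
``crush_rule`` pool setting::

    ceph osd crush rule create-replicated rbd-rule default rack ssd
    ceph osd pool set rbd crush_rule rbd-rule
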
570 Creating a rule for an erasure coded pool
571 -----------------------------------------
572
573 For an erasure-coded pool, the same basic decisions need to be made as
574 with a replicated pool: what is the failure domain, what node in the
575 hierarchy will data be placed under (usually ``default``), and will
576 placement be restricted to a specific device class. Erasure code
577 pools are created a bit differently, however, because they need to be
578 constructed carefully based on the erasure code being used. For this reason,
579 you must include this information in the *erasure code profile*. A CRUSH
580 rule will then be created from that either explicitly or automatically when
581 the profile is used to create a pool.
582
583 The erasure code profiles can be listed with::
584
585 ceph osd erasure-code-profile ls
586
587 An existing profile can be viewed with::
588
589 ceph osd erasure-code-profile get {profile-name}
590
591 Normally profiles should never be modified; instead, a new profile
592 should be created and used when creating a new pool or creating a new
593 rule for an existing pool.
594
595 An erasure code profile consists of a set of key=value pairs. Most of
596 these control the behavior of the erasure code that is encoding data
597 in the pool. Those that begin with ``crush-``, however, affect the
598 CRUSH rule that is created.
599
600 The erasure code profile properties of interest are:
601
602 * **crush-root**: the name of the CRUSH node to place data under [default: ``default``].
603 * **crush-failure-domain**: the CRUSH type to separate erasure-coded shards across [default: ``host``].
604 * **crush-device-class**: the device class to place data on [default: none, meaning all devices are used].
605 * **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the number of erasure code shards, affecting the resulting CRUSH rule.
606
607 Once a profile is defined, you can create a CRUSH rule with::
608
609 ceph osd crush rule create-erasure {name} {profile-name}
610
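For example, the following sketch defines a profile (its name, ``k``/``m``
values, and device class are illustrative) and then creates a matching
rule from it::

    ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=rack crush-device-class=hdd
    ceph osd crush rule create-erasure ec42-rule ec42
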
.. note:: When creating a new pool, it is not actually necessary to
   explicitly create the rule. If the erasure code profile alone is
   specified and the rule argument is left off then Ceph will create
   the CRUSH rule automatically.
615
616 Deleting rules
617 --------------
618
619 Rules that are not in use by pools can be deleted with::
620
621 ceph osd crush rule rm {rule-name}
622
623
624 Tunables
625 ========
626
627 Over time, we have made (and continue to make) improvements to the
628 CRUSH algorithm used to calculate the placement of data. In order to
629 support the change in behavior, we have introduced a series of tunable
630 options that control whether the legacy or improved variation of the
631 algorithm is used.
632
633 In order to use newer tunables, both clients and servers must support
634 the new version of CRUSH. For this reason, we have created
635 ``profiles`` that are named after the Ceph version in which they were
636 introduced. For example, the ``firefly`` tunables are first supported
637 in the firefly release, and will not work with older (e.g., dumpling)
clients. Once a given set of tunables is changed from the legacy
default behavior, the ``ceph-mon`` and ``ceph-osd`` daemons will prevent older
640 clients who do not support the new CRUSH features from connecting to
641 the cluster.
642
643 argonaut (legacy)
644 -----------------
645
646 The legacy CRUSH behavior used by argonaut and older releases works
647 fine for most clusters, provided there are not too many OSDs that have
648 been marked out.
649
650 bobtail (CRUSH_TUNABLES2)
651 -------------------------
652
653 The bobtail tunable profile fixes a few key misbehaviors:
654
655 * For hierarchies with a small number of devices in the leaf buckets,
656 some PGs map to fewer than the desired number of replicas. This
657 commonly happens for hierarchies with "host" nodes with a small
658 number (1-3) of OSDs nested beneath each one.
659
* For large clusters, a small percentage of PGs map to fewer than
  the desired number of OSDs. This is more prevalent when there are
662 several layers of the hierarchy (e.g., row, rack, host, osd).
663
664 * When some OSDs are marked out, the data tends to get redistributed
665 to nearby OSDs instead of across the entire hierarchy.
666
667 The new tunables are:
668
669 * ``choose_local_tries``: Number of local retries. Legacy value is
670 2, optimal value is 0.
671
672 * ``choose_local_fallback_tries``: Legacy value is 5, optimal value
673 is 0.
674
675 * ``choose_total_tries``: Total number of attempts to choose an item.
  Legacy value was 19, but subsequent testing indicates that a value of
  50 is more appropriate for typical clusters. For extremely large
678 clusters, a larger value might be necessary.
679
680 * ``chooseleaf_descend_once``: Whether a recursive chooseleaf attempt
681 will retry, or only try once and allow the original placement to
682 retry. Legacy default is 0, optimal value is 1.
683
684 Migration impact:
685
686 * Moving from argonaut to bobtail tunables triggers a moderate amount
687 of data movement. Use caution on a cluster that is already
688 populated with data.
689
690 firefly (CRUSH_TUNABLES3)
691 -------------------------
692
693 The firefly tunable profile fixes a problem
694 with the ``chooseleaf`` CRUSH rule behavior that tends to result in PG
695 mappings with too few results when too many OSDs have been marked out.
696
697 The new tunable is:
698
699 * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will
700 start with a non-zero value of r, based on how many attempts the
701 parent has already made. Legacy default is 0, but with this value
702 CRUSH is sometimes unable to find a mapping. The optimal value (in
703 terms of computational cost and correctness) is 1.
704
705 Migration impact:
706
707 * For existing clusters that have lots of existing data, changing
708 from 0 to 1 will cause a lot of data to move; a value of 4 or 5
709 will allow CRUSH to find a valid mapping but will make less data
710 move.
711
712 straw_calc_version tunable (introduced with Firefly too)
713 --------------------------------------------------------
714
715 There were some problems with the internal weights calculated and
stored in the CRUSH map for ``straw`` buckets. Specifically, when
there were items with a CRUSH weight of 0, or a mix of different and
duplicated weights, CRUSH would distribute data incorrectly (i.e.,
not in proportion to the weights).
720
721 The new tunable is:
722
723 * ``straw_calc_version``: A value of 0 preserves the old, broken
724 internal weight calculation; a value of 1 fixes the behavior.
725
726 Migration impact:
727
728 * Moving to straw_calc_version 1 and then adjusting a straw bucket
729 (by adding, removing, or reweighting an item, or by using the
730 reweight-all command) can trigger a small to moderate amount of
731 data movement *if* the cluster has hit one of the problematic
732 conditions.
733
This tunable option is special because it has absolutely no impact
on the kernel version required on the client side.
736
737 hammer (CRUSH_V4)
738 -----------------
739
Simply changing to the hammer tunable profile will not affect the
mapping of an existing CRUSH map. However:
742
743 * There is a new bucket type (``straw2``) supported. The new
744 ``straw2`` bucket type fixes several limitations in the original
745 ``straw`` bucket. Specifically, the old ``straw`` buckets would
746 change some mappings that should have changed when a weight was
747 adjusted, while ``straw2`` achieves the original goal of only
748 changing mappings to or from the bucket item whose weight has
749 changed.
750
751 * ``straw2`` is the default for any newly created buckets.
752
753 Migration impact:
754
755 * Changing a bucket type from ``straw`` to ``straw2`` will result in
756 a reasonably small amount of data movement, depending on how much
757 the bucket item weights vary from each other. When the weights are
758 all the same no data will move, and when item weights vary
759 significantly there will be more movement.
760
761 jewel (CRUSH_TUNABLES5)
762 -----------------------
763
764 The jewel tunable profile improves the
765 overall behavior of CRUSH such that significantly fewer mappings
766 change when an OSD is marked out of the cluster.
767
768 The new tunable is:
769
770 * ``chooseleaf_stable``: Whether a recursive chooseleaf attempt will
771 use a better value for an inner loop that greatly reduces the number
772 of mapping changes when an OSD is marked out. The legacy value is 0,
773 while the new value of 1 uses the new approach.
774
775 Migration impact:
776
777 * Changing this value on an existing cluster will result in a very
778 large amount of data movement as almost every PG mapping is likely
779 to change.
780
781
782
783
784 Which client versions support CRUSH_TUNABLES
785 --------------------------------------------
786
787 * argonaut series, v0.48.1 or later
788 * v0.49 or later
789 * Linux kernel version v3.6 or later (for the file system and RBD kernel clients)
790
791 Which client versions support CRUSH_TUNABLES2
792 ---------------------------------------------
793
794 * v0.55 or later, including bobtail series (v0.56.x)
795 * Linux kernel version v3.9 or later (for the file system and RBD kernel clients)
796
797 Which client versions support CRUSH_TUNABLES3
798 ---------------------------------------------
799
800 * v0.78 (firefly) or later
801 * Linux kernel version v3.15 or later (for the file system and RBD kernel clients)
802
803 Which client versions support CRUSH_V4
804 --------------------------------------
805
806 * v0.94 (hammer) or later
807 * Linux kernel version v4.1 or later (for the file system and RBD kernel clients)
808
809 Which client versions support CRUSH_TUNABLES5
810 ---------------------------------------------
811
812 * v10.0.2 (jewel) or later
813 * Linux kernel version v4.5 or later (for the file system and RBD kernel clients)
814
815 Warning when tunables are non-optimal
816 -------------------------------------
817
818 Starting with version v0.74, Ceph will issue a health warning if the
819 current CRUSH tunables don't include all the optimal values from the
820 ``default`` profile (see below for the meaning of the ``default`` profile).
821 To make this warning go away, you have two options:
822
823 1. Adjust the tunables on the existing cluster. Note that this will
824 result in some data movement (possibly as much as 10%). This is the
825 preferred route, but should be taken with care on a production cluster
826 where the data movement may affect performance. You can enable optimal
827 tunables with::
828
829 ceph osd crush tunables optimal
830
831 If things go poorly (e.g., too much load) and not very much
832 progress has been made, or there is a client compatibility problem
833 (old kernel cephfs or rbd clients, or pre-bobtail librados
834 clients), you can switch back with::
835
836 ceph osd crush tunables legacy
837
838 2. You can make the warning go away without making any changes to CRUSH by
839 adding the following option to your ceph.conf ``[mon]`` section::
840
841 mon warn on legacy crush tunables = false
842
843 For the change to take effect, you will need to restart the monitors, or
844 apply the option to running monitors with::
845
846 ceph tell mon.\* injectargs --no-mon-warn-on-legacy-crush-tunables
847
848
849 A few important points
850 ----------------------
851
852 * Adjusting these values will result in the shift of some PGs between
853 storage nodes. If the Ceph cluster is already storing a lot of
854 data, be prepared for some fraction of the data to move.
855 * The ``ceph-osd`` and ``ceph-mon`` daemons will start requiring the
856 feature bits of new connections as soon as they get
857 the updated map. However, already-connected clients are
858 effectively grandfathered in, and will misbehave if they do not
859 support the new feature.
860 * If the CRUSH tunables are set to non-legacy values and then later
changed back to the default values, ``ceph-osd`` daemons will not be
862 required to support the feature. However, the OSD peering process
863 requires examining and understanding old maps. Therefore, you
864 should not run old versions of the ``ceph-osd`` daemon
865 if the cluster has previously used non-legacy CRUSH values, even if
866 the latest version of the map has been switched back to using the
867 legacy defaults.
868
869 Tuning CRUSH
870 ------------
871
The simplest way to adjust the CRUSH tunables is by changing to a known
873 profile. Those are:
874
875 * ``legacy``: the legacy behavior from argonaut and earlier.
876 * ``argonaut``: the legacy values supported by the original argonaut release
877 * ``bobtail``: the values supported by the bobtail release
878 * ``firefly``: the values supported by the firefly release
879 * ``hammer``: the values supported by the hammer release
880 * ``jewel``: the values supported by the jewel release
* ``optimal``: the best (i.e., optimal) values of the current version of Ceph
882 * ``default``: the default values of a new cluster installed from
883 scratch. These values, which depend on the current version of Ceph,
884 are hard coded and are generally a mix of optimal and legacy values.
885 These values generally match the ``optimal`` profile of the previous
  LTS release, or the most recent release for which we expect most
  users to have up-to-date clients.
888
889 You can select a profile on a running cluster with the command::
890
891 ceph osd crush tunables {PROFILE}
892
893 Note that this may result in some data movement.
894
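The tunables currently in effect can be inspected with::

    ceph osd crush show-tunables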
895
896 .. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
897
898
899 Primary Affinity
900 ================
901
902 When a Ceph Client reads or writes data, it always contacts the primary OSD in
903 the acting set. For set ``[2, 3, 4]``, ``osd.2`` is the primary. Sometimes an
904 OSD is not well suited to act as a primary compared to other OSDs (e.g., it has
905 a slow disk or a slow controller). To prevent performance bottlenecks
906 (especially on read operations) while maximizing utilization of your hardware,
907 you can set a Ceph OSD's primary affinity so that CRUSH is less likely to use
908 the OSD as a primary in an acting set. ::
909
910 ceph osd primary-affinity <osd-id> <weight>
911
912 Primary affinity is ``1`` by default (*i.e.,* an OSD may act as a primary). You
may set the OSD primary affinity to any value in the range ``0``-``1``, where ``0`` means that the OSD may
914 **NOT** be used as a primary and ``1`` means that an OSD may be used as a
915 primary. When the weight is ``< 1``, it is less likely that CRUSH will select
916 the Ceph OSD Daemon to act as a primary.
917
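For example, to make CRUSH less likely to select ``osd.2`` as the primary::

    ceph osd primary-affinity osd.2 0.5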
918
919