============
 CRUSH Maps
============

The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm
determines how to store and retrieve data by computing data storage locations.
CRUSH empowers Ceph clients to communicate with OSDs directly rather than
through a centralized server or broker. With an algorithmically determined
method of storing and retrieving data, Ceph avoids a single point of failure, a
performance bottleneck, and a physical limit to its scalability.

CRUSH requires a map of your cluster, and uses the CRUSH map to pseudo-randomly
store and retrieve data in OSDs with a uniform distribution of data across the
cluster. For a detailed discussion of CRUSH, see
`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_

CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)`, a list of
'buckets' for aggregating the devices into physical locations, and a list of
rules that tell CRUSH how it should replicate data in a Ceph cluster's pools. By
reflecting the underlying physical organization of the installation, CRUSH can
model—and thereby address—potential sources of correlated device failures.
Typical sources include physical proximity, a shared power source, and a shared
network. By encoding this information into the cluster map, CRUSH placement
policies can separate object replicas across different failure domains while
still maintaining the desired distribution. For example, to address the
possibility of concurrent failures, it may be desirable to ensure that data
replicas are on devices using different shelves, racks, power supplies,
controllers, and/or physical locations.

When you deploy OSDs they are automatically placed within the CRUSH map under a
``host`` node named with the hostname for the host they are running on. This,
combined with the default CRUSH failure domain, ensures that replicas or erasure
code shards are separated across hosts and a single host failure will not
affect availability. For larger clusters, however, administrators should
carefully consider their choice of failure domain. Separating replicas across
racks, for example, is common for mid- to large-sized clusters.


CRUSH Location
==============

The location of an OSD in terms of the CRUSH map's hierarchy is
referred to as a ``crush location``. This location specifier takes the
form of a list of key and value pairs describing a position. For
example, if an OSD is in a particular row, rack, chassis and host, and
is part of the 'default' CRUSH tree (this is the case for the vast
majority of clusters), its crush location could be described as::

  root=default row=a rack=a2 chassis=a2a host=a2a1

Note:

#. The order of the keys does not matter.
#. The key name (left of ``=``) must be a valid CRUSH ``type``. By default
   these include root, datacenter, room, row, pod, pdu, rack, chassis and host,
   but those types can be customized to be anything appropriate by modifying
   the CRUSH map.
#. Not all keys need to be specified. For example, by default, Ceph
   automatically sets a ``ceph-osd`` daemon's location to be
   ``root=default host=HOSTNAME`` (based on the output from ``hostname -s``).

The crush location for an OSD is normally expressed via the ``crush location``
config option being set in the ``ceph.conf`` file. Each time the OSD starts,
it verifies it is in the correct location in the CRUSH map and, if it is not,
it moves itself. To disable this automatic CRUSH map management, add the
following to your configuration file in the ``[osd]`` section::

  osd crush update on start = false

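For example, a minimal ``ceph.conf`` snippet that sets a custom crush location
might look like the following (such settings are typically maintained per host,
since the values differ from host to host; the row, rack, and host names below
are placeholders for this sketch)::

  [osd]
    crush location = root=default row=a rack=a2 host=a2a1
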
Custom location hooks
---------------------

A customized location hook can be used to generate a more complete
crush location on startup. The sample ``ceph-crush-location`` utility
will generate a CRUSH location string for a given daemon. The
location is based on, in order of preference:

#. A ``crush location`` option in ceph.conf.
#. A default of ``root=default host=HOSTNAME`` where the hostname is
   generated with the ``hostname -s`` command.

This is not useful by itself, as the OSD has the exact same default
behavior. However, the script can be modified to provide additional
location fields (for example, the rack or datacenter), and the hook
can then be enabled via the config option::

  crush location hook = /path/to/customized-ceph-crush-location

This hook is passed several arguments (below) and should output a single line
to stdout with the CRUSH location description::

  $ ceph-crush-location --cluster CLUSTER --id ID --type TYPE

where the cluster name is typically 'ceph', the id is the daemon
identifier (the OSD number), and the daemon type is typically ``osd``.

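As a sketch, a customized hook might look something like the following. The
rack lookup here is purely illustrative (it assumes a locally maintained
``/etc/rack-map`` file mapping hostnames to rack names); the only contract is
that the script prints a single CRUSH location line on stdout::

  #!/bin/sh
  # Hypothetical customized CRUSH location hook (illustrative only).
  # The --cluster/--id/--type arguments are accepted but not needed here.
  # Looks up this host's rack in a local inventory file and prints a
  # single CRUSH location line on stdout.
  HOST=$(hostname -s)
  RACK=$(awk -v h="$HOST" '$1 == h {print $2}' /etc/rack-map 2>/dev/null)
  echo "root=default ${RACK:+rack=$RACK }host=$HOST"
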
CRUSH structure
===============

The CRUSH map consists of, loosely speaking, a hierarchy describing
the physical topology of the cluster, and a set of rules defining
policy about how we place data on those devices. The hierarchy has
devices (``ceph-osd`` daemons) at the leaves, and internal nodes
corresponding to other physical features or groupings: hosts, racks,
rows, datacenters, and so on. The rules describe how replicas are
placed in terms of that hierarchy (e.g., 'three replicas in different
racks').

Devices
-------

Devices are individual ``ceph-osd`` daemons that can store data. You
will normally have one defined here for each OSD daemon in your
cluster. Devices are identified by an id (a non-negative integer) and
a name, normally ``osd.N`` where ``N`` is the device id.

Devices may also have a *device class* associated with them (e.g.,
``hdd`` or ``ssd``), allowing them to be conveniently targeted by a
crush rule.

Types and Buckets
-----------------

A bucket is the CRUSH term for internal nodes in the hierarchy: hosts,
racks, rows, etc. The CRUSH map defines a series of *types* that are
used to describe these nodes. By default, these types include:

- osd (or device)
- host
- chassis
- rack
- row
- pdu
- pod
- room
- datacenter
- region
- root

Most clusters make use of only a handful of these types, and others
can be defined as needed.

The hierarchy is built with devices (normally type ``osd``) at the
leaves, interior nodes with non-device types, and a root node of type
``root``. For example,

.. ditaa::

                        +-----------------+
                        | {o}root default |
                        +--------+--------+
                                 |
                 +---------------+---------------+
                 |                               |
         +-------+-------+               +-------+-------+
         | {o}host foo   |               | {o}host bar   |
         +-------+-------+               +-------+-------+
                 |                               |
         +-------+-------+               +-------+-------+
         |               |               |               |
   +-----+-----+   +-----+-----+   +-----+-----+   +-----+-----+
   | osd.0     |   | osd.1     |   | osd.2     |   | osd.3     |
   +-----------+   +-----------+   +-----------+   +-----------+

Each node (device or bucket) in the hierarchy has a *weight*
associated with it, indicating the relative proportion of the total
data that device or hierarchy subtree should store. Weights are set
at the leaves, indicating the size of the device, and automatically
sum up the tree from there, such that the weight of the default node
will be the total of all devices contained beneath it. Normally
weights are in units of terabytes (TB).

You can get a simple view of the CRUSH hierarchy for your cluster,
including the weights, with::

  ceph osd crush tree

Rules
-----

Rules define policy about how data is distributed across the devices
in the hierarchy.

CRUSH rules define placement and replication strategies or
distribution policies that allow you to specify exactly how CRUSH
places object replicas. For example, you might create a rule selecting
a pair of targets for 2-way mirroring, another rule for selecting
three targets in two different data centers for 3-way mirroring, and
yet another rule for erasure coding over six storage devices. For a
detailed discussion of CRUSH rules, refer to `CRUSH - Controlled,
Scalable, Decentralized Placement of Replicated Data`_, and more
specifically to **Section 3.2**.

In almost all cases, CRUSH rules can be created via the CLI by
specifying the *pool type* they will be used for (replicated or
erasure coded), the *failure domain*, and optionally a *device class*.
In rare cases rules must be written by hand, by manually editing the
CRUSH map.

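If you do need to edit the CRUSH map by hand, the usual workflow is to extract
the compiled map, decompile it to text, edit it, recompile it, and inject it
back into the cluster. A sketch (the file names here are arbitrary)::

  ceph osd getcrushmap -o crushmap.bin        # extract the compiled map
  crushtool -d crushmap.bin -o crushmap.txt   # decompile to editable text
  # ... edit crushmap.txt ...
  crushtool -c crushmap.txt -o crushmap.new   # recompile
  ceph osd setcrushmap -i crushmap.new        # inject the updated map
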
You can see what rules are defined for your cluster with::

  ceph osd crush rule ls

You can view the contents of the rules with::

  ceph osd crush rule dump

Device classes
--------------

Each device can optionally have a *class* associated with it. By
default, OSDs automatically set their class on startup to either
``hdd``, ``ssd``, or ``nvme`` based on the type of device they are
backed by.

The device class for one or more OSDs can be explicitly set with::

  ceph osd crush set-device-class <class> <osd-name> [...]

Once a device class is set, it cannot be changed to another class
until the old class is unset with::

  ceph osd crush rm-device-class <osd-name> [...]

This allows administrators to set device classes without the class
being changed on OSD restart or by some other script.

A placement rule that targets a specific device class can be created with::

  ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>

A pool can then be changed to use the new rule with::

  ceph osd pool set <pool-name> crush_rule <rule-name>

Device classes are implemented by creating a "shadow" CRUSH hierarchy
for each device class in use that contains only devices of that class.
Rules can then distribute data over the shadow hierarchy. One nice
thing about this approach is that it is fully backward compatible with
old Ceph clients. You can view the CRUSH hierarchy with shadow items
with::

  ceph osd crush tree --show-shadow

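Putting these commands together, a hypothetical end-to-end example that pins a
pool to SSDs (the OSD ids, rule name ``fast-ssd``, and pool name ``fastpool``
are placeholders, and the OSDs are assumed not to have a class set already)::

  ceph osd crush set-device-class ssd osd.0 osd.1 osd.2
  ceph osd crush rule create-replicated fast-ssd default host ssd
  ceph osd pool set fastpool crush_rule fast-ssd
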

Weight sets
-----------

A *weight set* is an alternative set of weights to use when
calculating data placement. The normal weights associated with each
device in the CRUSH map are set based on the device size and indicate
how much data we *should* be storing where. However, because CRUSH is
based on a pseudorandom placement process, there is always some
variation from this ideal distribution, the same way that rolling a
die sixty times will not result in rolling exactly 10 ones and 10
sixes. Weight sets allow the cluster to do a numerical optimization
based on the specifics of your cluster (hierarchy, pools, etc.) to achieve
a balanced distribution.

There are two types of weight sets supported:

 #. A **compat** weight set is a single alternative set of weights for
    each device and node in the cluster. This is not well-suited for
    correcting for all anomalies (for example, placement groups for
    different pools may be different sizes and have different load
    levels, but will be mostly treated the same by the balancer).
    However, compat weight sets have the huge advantage that they are
    *backward compatible* with previous versions of Ceph, which means
    that even though weight sets were first introduced in Luminous
    v12.2.z, older clients (e.g., firefly) can still connect to the
    cluster when a compat weight set is being used to balance data.
 #. A **per-pool** weight set is more flexible in that it allows
    placement to be optimized for each data pool. Additionally,
    weights can be adjusted for each position of placement, allowing
    the optimizer to correct for a subtle skew of data toward devices
    with small weights relative to their peers (an effect that is
    usually only apparent in very large clusters but which can cause
    balancing problems).

When weight sets are in use, the weights associated with each node in
the hierarchy are visible as a separate column (labeled either
``(compat)`` or the pool name) in the output of the command::

  ceph osd crush tree

When both *compat* and *per-pool* weight sets are in use, data
placement for a particular pool will use its own per-pool weight set
if present. If not, it will use the compat weight set if present. If
neither is present, it will use the normal CRUSH weights.

Although weight sets can be set up and manipulated by hand, it is
recommended that the *balancer* module be enabled to do so
automatically.


Modifying the CRUSH map
=======================

.. _addosd:

Add/Move an OSD
---------------

.. note:: OSDs are normally automatically added to the CRUSH map when
   the OSD is created. This command is rarely needed.

To add or move an OSD in the CRUSH map of a running cluster::

  ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...]

Where:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``


``weight``

:Description: The CRUSH weight for the OSD, normally its size measured in terabytes (TB).
:Type: Double
:Required: Yes
:Example: ``2.0``


``root``

:Description: The root node of the tree in which the OSD resides (normally ``default``).
:Type: Key/value pair.
:Required: Yes
:Example: ``root=default``


``bucket-type``

:Description: You may specify the OSD's location in the CRUSH hierarchy.
:Type: Key/value pairs.
:Required: No
:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``


The following example adds ``osd.0`` to the hierarchy, or moves the
OSD from a previous location::

  ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1


Adjust OSD weight
-----------------

.. note:: Normally OSDs automatically add themselves to the CRUSH map
   with the correct weight when they are created. This command
   is rarely needed.

To adjust an OSD's crush weight in the CRUSH map of a running cluster, execute
the following::

  ceph osd crush reweight {name} {weight}

Where:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``


``weight``

:Description: The CRUSH weight for the OSD.
:Type: Double
:Required: Yes
:Example: ``2.0``

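For example, to set the CRUSH weight of ``osd.0`` to 2.0 (a hypothetical value
corresponding to a 2 TB device)::

  ceph osd crush reweight osd.0 2.0
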
.. _removeosd:

Remove an OSD
-------------

.. note:: OSDs are normally removed from the CRUSH map as part of the
   ``ceph osd purge`` command. This command is rarely needed.

To remove an OSD from the CRUSH map of a running cluster, execute the
following::

  ceph osd crush remove {name}

Where:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``

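For example, to remove ``osd.0`` from the CRUSH map::

  ceph osd crush remove osd.0
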
Add a Bucket
------------

.. note:: Buckets are normally implicitly created when an OSD is added
   that specifies a ``{bucket-type}={bucket-name}`` as part of its
   location and a bucket with that name does not already exist. This
   command is typically used when manually adjusting the structure of the
   hierarchy after OSDs have been created (for example, to move a
   series of hosts underneath a new rack-level bucket).

To add a bucket in the CRUSH map of a running cluster, execute the
``ceph osd crush add-bucket`` command::

  ceph osd crush add-bucket {bucket-name} {bucket-type}

Where:

``bucket-name``

:Description: The full name of the bucket.
:Type: String
:Required: Yes
:Example: ``rack12``


``bucket-type``

:Description: The type of the bucket. The type must already exist in the hierarchy.
:Type: String
:Required: Yes
:Example: ``rack``


The following example adds the ``rack12`` bucket to the hierarchy::

  ceph osd crush add-bucket rack12 rack

Move a Bucket
-------------

To move a bucket to a different location or position in the CRUSH map
hierarchy, execute the following::

  ceph osd crush move {bucket-name} {bucket-type}={bucket-name} [...]

Where:

``bucket-name``

:Description: The name of the bucket to move/reposition.
:Type: String
:Required: Yes
:Example: ``foo-bar-1``

``bucket-type``

:Description: You may specify the bucket's location in the CRUSH hierarchy.
:Type: Key/value pairs.
:Required: No
:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``

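For example, to move the ``rack12`` bucket from the example above directly
under the ``default`` root::

  ceph osd crush move rack12 root=default
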
Remove a Bucket
---------------

To remove a bucket from the CRUSH map hierarchy, execute the following::

  ceph osd crush remove {bucket-name}

.. note:: A bucket must be empty before removing it from the CRUSH hierarchy.

Where:

``bucket-name``

:Description: The name of the bucket that you'd like to remove.
:Type: String
:Required: Yes
:Example: ``rack12``

The following example removes the ``rack12`` bucket from the hierarchy::

  ceph osd crush remove rack12

Creating a compat weight set
----------------------------

.. note:: This step is normally done automatically by the ``balancer``
   module when enabled.

To create a *compat* weight set::

  ceph osd crush weight-set create-compat

Weights for the compat weight set can be adjusted with::

  ceph osd crush weight-set reweight-compat {name} {weight}

The compat weight set can be destroyed with::

  ceph osd crush weight-set rm-compat

Creating per-pool weight sets
-----------------------------

To create a weight set for a specific pool::

  ceph osd crush weight-set create {pool-name} {mode}

.. note:: Per-pool weight sets require that all servers and daemons
   run Luminous v12.2.z or later.

Where:

``pool-name``

:Description: The name of a RADOS pool
:Type: String
:Required: Yes
:Example: ``rbd``

``mode``

:Description: Either ``flat`` or ``positional``. A *flat* weight set
              has a single weight for each device or bucket. A
              *positional* weight set has a potentially different
              weight for each position in the resulting placement
              mapping. For example, if a pool has a replica count of
              3, then a positional weight set will have three weights
              for each device and bucket.
:Type: String
:Required: Yes
:Example: ``flat``

To adjust the weight of an item in a weight set::

  ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]}

To list existing weight sets::

  ceph osd crush weight-set ls

To remove a weight set::

  ceph osd crush weight-set rm {pool-name}

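For example, the following hypothetical sequence creates a flat weight set for
a pool named ``rbd`` and slightly reduces the weight used for ``osd.0`` in that
pool's placement calculations (the weight value is only illustrative)::

  ceph osd crush weight-set create rbd flat
  ceph osd crush weight-set reweight rbd osd.0 0.9
  ceph osd crush weight-set ls
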
Creating a rule for a replicated pool
-------------------------------------

For a replicated pool, the primary decision when creating the CRUSH
rule is what the failure domain is going to be. For example, if a
failure domain of ``host`` is selected, then CRUSH will ensure that
each replica of the data is stored on a different host. If ``rack``
is selected, then each replica will be stored in a different rack.
What failure domain you choose primarily depends on the size of your
cluster and how your hierarchy is structured.

Normally, the entire cluster hierarchy is nested beneath a root node
named ``default``. If you have customized your hierarchy, you may
want to create a rule nested at some other node in the hierarchy. It
doesn't matter what type is associated with that node (it doesn't have
to be a ``root`` node).

It is also possible to create a rule that restricts data placement to
a specific *class* of device. By default, Ceph OSDs automatically
classify themselves as either ``hdd`` or ``ssd``, depending on the
underlying type of device being used. These classes can also be
customized.

To create a replicated rule::

  ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}]

Where:

``name``

:Description: The name of the rule
:Type: String
:Required: Yes
:Example: ``rbd-rule``

``root``

:Description: The name of the node under which data should be placed.
:Type: String
:Required: Yes
:Example: ``default``

``failure-domain-type``

:Description: The type of CRUSH nodes across which we should separate replicas.
:Type: String
:Required: Yes
:Example: ``rack``

``class``

:Description: The device class data should be placed on.
:Type: String
:Required: No
:Example: ``ssd``

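For example, the following hypothetical commands create a rule named
``rbd-rule`` that places replicas under the ``default`` root, separated across
racks, and then switch a pool to it (the pool name is a placeholder)::

  ceph osd crush rule create-replicated rbd-rule default rack
  ceph osd pool set rbd crush_rule rbd-rule
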
Creating a rule for an erasure coded pool
-----------------------------------------

For an erasure-coded pool, the same basic decisions need to be made as
with a replicated pool: what is the failure domain, what node in the
hierarchy will data be placed under (usually ``default``), and whether
placement will be restricted to a specific device class. Erasure code
pools are created a bit differently, however, because the rule needs to be
constructed carefully based on the erasure code being used. For this reason,
you must include this information in the *erasure code profile*. A CRUSH
rule will then be created from that either explicitly or automatically when
the profile is used to create a pool.

The erasure code profiles can be listed with::

  ceph osd erasure-code-profile ls

An existing profile can be viewed with::

  ceph osd erasure-code-profile get {profile-name}

Normally profiles should never be modified; instead, a new profile
should be created and used when creating a new pool or creating a new
rule for an existing pool.

An erasure code profile consists of a set of key=value pairs. Most of
these control the behavior of the erasure code that is encoding data
in the pool. Those that begin with ``crush-``, however, affect the
CRUSH rule that is created.

The erasure code profile properties of interest are:

 * **crush-root**: the name of the CRUSH node to place data under [default: ``default``].
 * **crush-failure-domain**: the CRUSH type to separate erasure-coded shards across [default: ``host``].
 * **crush-device-class**: the device class to place data on [default: none, meaning all devices are used].
 * **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the number of erasure code shards, affecting the resulting CRUSH rule.

Once a profile is defined, you can create a CRUSH rule with::

  ceph osd crush rule create-erasure {name} {profile-name}

.. note:: When creating a new pool, it is not actually necessary to
   explicitly create the rule. If the erasure code profile alone is
   specified and the rule argument is left off then Ceph will create
   the CRUSH rule automatically.

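For example, a hypothetical profile for a 4+2 erasure code with a rack-level
failure domain, and a rule created from it (the ``ec42`` names are
placeholders)::

  ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=rack
  ceph osd crush rule create-erasure ec42-rule ec42
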
Deleting rules
--------------

Rules that are not in use by pools can be deleted with::

  ceph osd crush rule rm {rule-name}


Tunables
========

Over time, we have made (and continue to make) improvements to the
CRUSH algorithm used to calculate the placement of data. In order to
support the change in behavior, we have introduced a series of tunable
options that control whether the legacy or improved variation of the
algorithm is used.

In order to use newer tunables, both clients and servers must support
the new version of CRUSH. For this reason, we have created
``profiles`` that are named after the Ceph version in which they were
introduced. For example, the ``firefly`` tunables are first supported
in the firefly release, and will not work with older (e.g., dumpling)
clients. Once a given set of tunables are changed from the legacy
default behavior, the ``ceph-mon`` and ``ceph-osd`` daemons will prevent
older clients that do not support the new CRUSH features from connecting
to the cluster.

argonaut (legacy)
-----------------

The legacy CRUSH behavior used by argonaut and older releases works
fine for most clusters, provided there are not too many OSDs that have
been marked out.

bobtail (CRUSH_TUNABLES2)
-------------------------

The bobtail tunable profile fixes a few key misbehaviors:

 * For hierarchies with a small number of devices in the leaf buckets,
   some PGs map to fewer than the desired number of replicas. This
   commonly happens for hierarchies with "host" nodes with a small
   number (1-3) of OSDs nested beneath each one.

 * For large clusters, a small percentage of PGs map to fewer than
   the desired number of OSDs. This is more prevalent when there are
   several layers of the hierarchy (e.g., row, rack, host, osd).

 * When some OSDs are marked out, the data tends to get redistributed
   to nearby OSDs instead of across the entire hierarchy.

The new tunables are:

 * ``choose_local_tries``: Number of local retries. Legacy value is
   2, optimal value is 0.

 * ``choose_local_fallback_tries``: Legacy value is 5, optimal value
   is 0.

 * ``choose_total_tries``: Total number of attempts to choose an item.
   Legacy value was 19; subsequent testing indicates that a value of
   50 is more appropriate for typical clusters. For extremely large
   clusters, a larger value might be necessary.

 * ``chooseleaf_descend_once``: Whether a recursive chooseleaf attempt
   will retry, or only try once and allow the original placement to
   retry. Legacy default is 0, optimal value is 1.

Migration impact:

 * Moving from argonaut to bobtail tunables triggers a moderate amount
   of data movement. Use caution on a cluster that is already
   populated with data.

firefly (CRUSH_TUNABLES3)
-------------------------

The firefly tunable profile fixes a problem
with the ``chooseleaf`` CRUSH rule behavior that tends to result in PG
mappings with too few results when too many OSDs have been marked out.

The new tunable is:

 * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will
   start with a non-zero value of r, based on how many attempts the
   parent has already made. Legacy default is 0, but with this value
   CRUSH is sometimes unable to find a mapping. The optimal value (in
   terms of computational cost and correctness) is 1.

Migration impact:

 * For existing clusters that have lots of existing data, changing
   from 0 to 1 will cause a lot of data to move; a value of 4 or 5
   will allow CRUSH to find a valid mapping but will make less data
   move.

straw_calc_version tunable (introduced with Firefly too)
--------------------------------------------------------

There were some problems with the internal weights calculated and
stored in the CRUSH map for ``straw`` buckets. Specifically, when
there were items with a CRUSH weight of 0, or a mix of different and
duplicated weights, CRUSH would distribute data incorrectly (i.e.,
not in proportion to the weights).

The new tunable is:

 * ``straw_calc_version``: A value of 0 preserves the old, broken
   internal weight calculation; a value of 1 fixes the behavior.

Migration impact:

 * Moving to straw_calc_version 1 and then adjusting a straw bucket
   (by adding, removing, or reweighting an item, or by using the
   reweight-all command) can trigger a small to moderate amount of
   data movement *if* the cluster has hit one of the problematic
   conditions.

This tunable option is special because it has absolutely no impact on
the kernel version required on the client side.

hammer (CRUSH_V4)
-----------------

By itself, switching to the hammer tunable profile does not affect the
mapping of existing CRUSH maps. However:

 * There is a new bucket type (``straw2``) supported. The new
   ``straw2`` bucket type fixes several limitations in the original
   ``straw`` bucket. Specifically, the old ``straw`` buckets would
   change some mappings that should not have changed when a weight was
   adjusted, while ``straw2`` achieves the original goal of only
   changing mappings to or from the bucket item whose weight has
   changed.

 * ``straw2`` is the default for any newly created buckets.

Migration impact:

 * Changing a bucket type from ``straw`` to ``straw2`` will result in
   a reasonably small amount of data movement, depending on how much
   the bucket item weights vary from each other. When the weights are
   all the same no data will move, and when item weights vary
   significantly there will be more movement.

jewel (CRUSH_TUNABLES5)
-----------------------

The jewel tunable profile improves the
overall behavior of CRUSH such that significantly fewer mappings
change when an OSD is marked out of the cluster.

The new tunable is:

 * ``chooseleaf_stable``: Whether a recursive chooseleaf attempt will
   use a better value for an inner loop that greatly reduces the number
   of mapping changes when an OSD is marked out. The legacy value is 0,
   while the new value of 1 uses the new approach.

Migration impact:

 * Changing this value on an existing cluster will result in a very
   large amount of data movement as almost every PG mapping is likely
   to change.


Which client versions support CRUSH_TUNABLES
--------------------------------------------

 * argonaut series, v0.48.1 or later
 * v0.49 or later
 * Linux kernel version v3.6 or later (for the file system and RBD kernel clients)

Which client versions support CRUSH_TUNABLES2
---------------------------------------------

 * v0.55 or later, including bobtail series (v0.56.x)
 * Linux kernel version v3.9 or later (for the file system and RBD kernel clients)

Which client versions support CRUSH_TUNABLES3
---------------------------------------------

 * v0.78 (firefly) or later
 * Linux kernel version v3.15 or later (for the file system and RBD kernel clients)

Which client versions support CRUSH_V4
--------------------------------------

 * v0.94 (hammer) or later
 * Linux kernel version v4.1 or later (for the file system and RBD kernel clients)

Which client versions support CRUSH_TUNABLES5
---------------------------------------------

 * v10.0.2 (jewel) or later
 * Linux kernel version v4.5 or later (for the file system and RBD kernel clients)

Warning when tunables are non-optimal
-------------------------------------

Starting with version v0.74, Ceph will issue a health warning if the
current CRUSH tunables don't include all the optimal values from the
``default`` profile (see below for the meaning of the ``default`` profile).
To make this warning go away, you have two options:

1. Adjust the tunables on the existing cluster. Note that this will
   result in some data movement (possibly as much as 10%). This is the
   preferred route, but should be taken with care on a production cluster
   where the data movement may affect performance. You can enable optimal
   tunables with::

     ceph osd crush tunables optimal

   If things go poorly (e.g., too much load) and not very much
   progress has been made, or there is a client compatibility problem
   (old kernel cephfs or rbd clients, or pre-bobtail librados
   clients), you can switch back with::

     ceph osd crush tunables legacy

2. You can make the warning go away without making any changes to CRUSH by
   adding the following option to your ceph.conf ``[mon]`` section::

     mon warn on legacy crush tunables = false

   For the change to take effect, you will need to restart the monitors, or
   apply the option to running monitors with::

     ceph tell mon.\* injectargs --no-mon-warn-on-legacy-crush-tunables


A few important points
----------------------

 * Adjusting these values will result in the shift of some PGs between
   storage nodes. If the Ceph cluster is already storing a lot of
   data, be prepared for some fraction of the data to move.
 * The ``ceph-osd`` and ``ceph-mon`` daemons will start requiring the
   feature bits of new connections as soon as they get
   the updated map. However, already-connected clients are
   effectively grandfathered in, and will misbehave if they do not
   support the new feature.
 * If the CRUSH tunables are set to non-legacy values and then later
   changed back to the default values, ``ceph-osd`` daemons will not be
   required to support the feature. However, the OSD peering process
   requires examining and understanding old maps. Therefore, you
   should not run old versions of the ``ceph-osd`` daemon
   if the cluster has previously used non-legacy CRUSH values, even if
   the latest version of the map has been switched back to using the
   legacy defaults.

Tuning CRUSH
------------

The simplest way to adjust the CRUSH tunables is by changing to a known
profile. Those are:

 * ``legacy``: the legacy behavior from argonaut and earlier.
 * ``argonaut``: the legacy values supported by the original argonaut release
 * ``bobtail``: the values supported by the bobtail release
 * ``firefly``: the values supported by the firefly release
 * ``hammer``: the values supported by the hammer release
 * ``jewel``: the values supported by the jewel release
 * ``optimal``: the best (i.e., optimal) values of the current version of Ceph
 * ``default``: the default values of a new cluster installed from
   scratch. These values, which depend on the current version of Ceph,
   are hard coded and are generally a mix of optimal and legacy values.
   These values generally match the ``optimal`` profile of the previous
   LTS release, or the most recent release for which we generally expect
   most users to have up-to-date clients.

You can select a profile on a running cluster with the command::

  ceph osd crush tunables {PROFILE}

Note that this may result in some data movement.


.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf


Primary Affinity
================

When a Ceph Client reads or writes data, it always contacts the primary OSD in
the acting set. For set ``[2, 3, 4]``, ``osd.2`` is the primary. Sometimes an
OSD is not well suited to act as a primary compared to other OSDs (e.g., it has
a slow disk or a slow controller). To prevent performance bottlenecks
(especially on read operations) while maximizing utilization of your hardware,
you can set a Ceph OSD's primary affinity so that CRUSH is less likely to use
the OSD as a primary in an acting set. ::

  ceph osd primary-affinity <osd-id> <weight>

Primary affinity is ``1`` by default (*i.e.,* an OSD may act as a primary). You
may set the primary affinity anywhere in the range ``0`` to ``1``, where ``0``
means that the OSD may **NOT** be used as a primary and ``1`` means that an OSD
may be used as a primary. When the weight is ``< 1``, it is less likely that
CRUSH will select the Ceph OSD Daemon to act as a primary.

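For example, to make ``osd.2`` half as likely to be chosen as a primary (the
value ``0.5`` here is only illustrative)::

  ceph osd primary-affinity osd.2 0.5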