============
 CRUSH Maps
============

The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm
determines how to store and retrieve data by computing data storage locations.
CRUSH empowers Ceph clients to communicate with OSDs directly rather than
through a centralized server or broker. With an algorithmically determined
method of storing and retrieving data, Ceph avoids a single point of failure, a
performance bottleneck, and a physical limit to its scalability.

CRUSH requires a map of your cluster, and uses the CRUSH map to pseudo-randomly
store and retrieve data in OSDs with a uniform distribution of data across the
cluster. For a detailed discussion of CRUSH, see
`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_

CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)`, a list of
'buckets' for aggregating the devices into physical locations, and a list of
rules that tell CRUSH how it should replicate data in a Ceph cluster's pools. By
reflecting the underlying physical organization of the installation, CRUSH can
model—and thereby address—potential sources of correlated device failures.
Typical sources include physical proximity, a shared power source, and a shared
network. By encoding this information into the cluster map, CRUSH placement
policies can separate object replicas across different failure domains while
still maintaining the desired distribution. For example, to address the
possibility of concurrent failures, it may be desirable to ensure that data
replicas are on devices using different shelves, racks, power supplies,
controllers, and/or physical locations.

When you deploy OSDs they are automatically placed within the CRUSH map under a
``host`` node named with the hostname for the host they are running on. This,
combined with the default CRUSH failure domain, ensures that replicas or erasure
code shards are separated across hosts and a single host failure will not
affect availability. For larger clusters, however, administrators should
carefully consider their choice of failure domain. Separating replicas across
racks, for example, is common for mid- to large-sized clusters.


CRUSH Location
==============

The location of an OSD in terms of the CRUSH map's hierarchy is
referred to as a ``crush location``. This location specifier takes the
form of a list of key and value pairs describing a position. For
example, if an OSD is in a particular row, rack, chassis and host, and
is part of the 'default' CRUSH tree (this is the case for the vast
majority of clusters), its crush location could be described as::

  root=default row=a rack=a2 chassis=a2a host=a2a1

Note:

#. The order of the keys does not matter.
#. The key name (left of ``=``) must be a valid CRUSH ``type``. By default
   these include root, datacenter, room, row, pod, pdu, rack, chassis and host,
   but those types can be customized to be anything appropriate by modifying
   the CRUSH map.
#. Not all keys need to be specified. For example, by default, Ceph
   automatically sets a ``ceph-osd`` daemon's location to be
   ``root=default host=HOSTNAME`` (based on the output from ``hostname -s``).

The crush location for an OSD is normally expressed via the ``crush location``
config option being set in the ``ceph.conf`` file. Each time the OSD starts,
it verifies it is in the correct location in the CRUSH map and, if it is not,
it moves itself. To disable this automatic CRUSH map management, add the
following to your configuration file in the ``[osd]`` section::

  osd crush update on start = false

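For example, to pin a particular OSD to a specific position in the
hierarchy, a hypothetical entry (the per-daemon section and the
row/rack names here are purely illustrative) might look like::

  [osd.0]
  crush location = root=default row=a rack=a2 chassis=a2a host=a2a1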

Custom location hooks
---------------------

A customized location hook can be used to generate a more complete
crush location on startup. The sample ``ceph-crush-location`` utility
will generate a CRUSH location string for a given daemon. The
location is based on, in order of preference:

#. A ``crush location`` option in ceph.conf.
#. A default of ``root=default host=HOSTNAME`` where the hostname is
   generated with the ``hostname -s`` command.

This is not useful by itself, as the OSD itself has the exact same
behavior. However, the script can be modified to provide additional
location fields (for example, the rack or datacenter), and then the
hook enabled via the config option::

  crush location hook = /path/to/customized-ceph-crush-location

This hook is passed several arguments (below) and should output a single line
to stdout with the CRUSH location description::

  $ ceph-crush-location --cluster CLUSTER --id ID --type TYPE

where the cluster name is typically 'ceph', the id is the daemon
identifier (the OSD number), and the daemon type is typically ``osd``.


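As a sketch, a customized hook might look something like the
following. The ``/etc/rack`` file used here is an assumption; in
practice the rack (or datacenter) value would come from whatever
inventory source your site maintains::

  #!/bin/sh
  # Hypothetical custom CRUSH location hook.  Reads the rack name from a
  # local file (an assumed convention) and falls back to "unknown".
  RACK=$(cat /etc/rack 2>/dev/null || echo unknown)
  HOST=$(hostname -s)
  # Output a single line describing the CRUSH location, as expected by the OSD.
  echo "root=default rack=${RACK} host=${HOST}"
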
CRUSH structure
===============

The CRUSH map consists of, loosely speaking, a hierarchy describing
the physical topology of the cluster, and a set of rules defining
policy about how we place data on those devices. The hierarchy has
devices (``ceph-osd`` daemons) at the leaves, and internal nodes
corresponding to other physical features or groupings: hosts, racks,
rows, datacenters, and so on. The rules describe how replicas are
placed in terms of that hierarchy (e.g., 'three replicas in different
racks').

Devices
-------

Devices are individual ``ceph-osd`` daemons that can store data. You
will normally have one defined here for each OSD daemon in your
cluster. Devices are identified by an id (a non-negative integer) and
a name, normally ``osd.N`` where ``N`` is the device id.

Devices may also have a *device class* associated with them (e.g.,
``hdd`` or ``ssd``), allowing them to be conveniently targeted by a
crush rule.

Types and Buckets
-----------------

A bucket is the CRUSH term for internal nodes in the hierarchy: hosts,
racks, rows, etc. The CRUSH map defines a series of *types* that are
used to describe these nodes. By default, these types include:

- osd (or device)
- host
- chassis
- rack
- row
- pdu
- pod
- room
- datacenter
- region
- root

Most clusters make use of only a handful of these types, and others
can be defined as needed.

The hierarchy is built with devices (normally type ``osd``) at the
leaves, interior nodes with non-device types, and a root node of type
``root``. For example,

.. ditaa::

                            +-----------------+
                            | {o}root default |
                            +--------+--------+
                                     |
                    +----------------+----------------+
                    |                                 |
            +-------+-------+                 +-------+-------+
            |  {o}host foo  |                 |  {o}host bar  |
            +-------+-------+                 +-------+-------+
                    |                                 |
            +-------+-------+                 +-------+-------+
            |               |                 |               |
      +-----+-----+   +-----+-----+     +-----+-----+   +-----+-----+
      |   osd.0   |   |   osd.1   |     |   osd.2   |   |   osd.3   |
      +-----------+   +-----------+     +-----------+   +-----------+

Each node (device or bucket) in the hierarchy has a *weight*
associated with it, indicating the relative proportion of the total
data that device or hierarchy subtree should store. Weights are set
at the leaves, indicating the size of the device, and automatically
sum up the tree from there, such that the weight of the default node
will be the total of all devices contained beneath it. Normally
weights are in units of terabytes (TB).

You can get a simple view of the CRUSH hierarchy for your cluster,
including the weights, with::

  ceph osd crush tree

Rules
-----

Rules define policy about how data is distributed across the devices
in the hierarchy.

CRUSH rules define placement and replication strategies or
distribution policies that allow you to specify exactly how CRUSH
places object replicas. For example, you might create a rule selecting
a pair of targets for 2-way mirroring, another rule for selecting
three targets in two different data centers for 3-way mirroring, and
yet another rule for erasure coding over six storage devices. For a
detailed discussion of CRUSH rules, refer to `CRUSH - Controlled,
Scalable, Decentralized Placement of Replicated Data`_, and more
specifically to **Section 3.2**.

In almost all cases, CRUSH rules can be created via the CLI by
specifying the *pool type* they will be used for (replicated or
erasure coded), the *failure domain*, and optionally a *device class*.
In rare cases rules must be written by hand by manually editing the
CRUSH map.

You can see what rules are defined for your cluster with::

  ceph osd crush rule ls

You can view the contents of the rules with::

  ceph osd crush rule dump

Device classes
--------------

Each device can optionally have a *class* associated with it. By
default, OSDs automatically set their class on startup to either
`hdd`, `ssd`, or `nvme` based on the type of device they are backed
by.

The device class for one or more OSDs can be explicitly set with::

  ceph osd crush set-device-class <class> <osd-name> [...]

Once a device class is set, it cannot be changed to another class
until the old class is unset with::

  ceph osd crush rm-device-class <osd-name> [...]

This allows administrators to set device classes without the class
being changed on OSD restart or by some other script.

A placement rule that targets a specific device class can be created with::

  ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>

A pool can then be changed to use the new rule with::

  ceph osd pool set <pool-name> crush_rule <rule-name>

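For example, a hypothetical sequence that marks three OSDs as SSDs and
directs a pool named ``fast-pool`` (an illustrative name) at them
might look like::

  ceph osd crush set-device-class ssd osd.0 osd.1 osd.2
  ceph osd crush rule create-replicated fast-rule default host ssd
  ceph osd pool set fast-pool crush_rule fast-rule
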
Device classes are implemented by creating a "shadow" CRUSH hierarchy
for each device class in use that contains only devices of that class.
Rules can then distribute data over the shadow hierarchy. One nice
thing about this approach is that it is fully backward compatible with
old Ceph clients. You can view the CRUSH hierarchy with shadow items
with::

  ceph osd crush tree --show-shadow

For older clusters created before Luminous that relied on manually
crafted CRUSH maps to maintain per-device-type hierarchies, there is a
*reclassify* tool available to help transition to device classes
without triggering data movement (see :ref:`crush-reclassify`).


Weight sets
-----------

A *weight set* is an alternative set of weights to use when
calculating data placement. The normal weights associated with each
device in the CRUSH map are set based on the device size and indicate
how much data we *should* be storing where. However, because CRUSH is
based on a pseudorandom placement process, there is always some
variation from this ideal distribution, the same way that rolling a
die sixty times will not result in rolling exactly 10 ones and 10
sixes. Weight sets allow the cluster to do a numerical optimization
based on the specifics of your cluster (hierarchy, pools, etc.) to achieve
a balanced distribution.

There are two types of weight sets supported:

 #. A **compat** weight set is a single alternative set of weights for
    each device and node in the cluster. This is not well-suited for
    correcting for all anomalies (for example, placement groups for
    different pools may be different sizes and have different load
    levels, but will be mostly treated the same by the balancer).
    However, compat weight sets have the huge advantage that they are
    *backward compatible* with previous versions of Ceph, which means
    that even though weight sets were first introduced in Luminous
    v12.2.z, older clients (e.g., firefly) can still connect to the
    cluster when a compat weight set is being used to balance data.
 #. A **per-pool** weight set is more flexible in that it allows
    placement to be optimized for each data pool. Additionally,
    weights can be adjusted for each position of placement, allowing
    the optimizer to correct for a subtle skew of data toward devices
    with small weights relative to their peers (an effect that is
    usually only apparent in very large clusters but which can cause
    balancing problems).

When weight sets are in use, the weights associated with each node in
the hierarchy are visible as a separate column (labeled either
``(compat)`` or the pool name) in the output of the command::

  ceph osd crush tree

When both *compat* and *per-pool* weight sets are in use, data
placement for a particular pool will use its own per-pool weight set
if present. If not, it will use the compat weight set if present. If
neither is present, it will use the normal CRUSH weights.

Although weight sets can be set up and manipulated by hand, it is
recommended that the *balancer* module be enabled to do so
automatically.


Modifying the CRUSH map
=======================

.. _addosd:

Add/Move an OSD
---------------

.. note:: OSDs are normally automatically added to the CRUSH map when
   the OSD is created. This command is rarely needed.

To add or move an OSD in the CRUSH map of a running cluster::

  ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...]

Where:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``


``weight``

:Description: The CRUSH weight for the OSD, normally its size measured in terabytes (TB).
:Type: Double
:Required: Yes
:Example: ``2.0``


``root``

:Description: The root node of the tree in which the OSD resides (normally ``default``)
:Type: Key/value pair.
:Required: Yes
:Example: ``root=default``


``bucket-type``

:Description: You may specify the OSD's location in the CRUSH hierarchy.
:Type: Key/value pairs.
:Required: No
:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``


The following example adds ``osd.0`` to the hierarchy, or moves the
OSD from a previous location::

  ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1


Adjust OSD weight
-----------------

.. note:: Normally OSDs automatically add themselves to the CRUSH map
   with the correct weight when they are created. This command
   is rarely needed.

To adjust an OSD's crush weight in the CRUSH map of a running cluster, execute
the following::

  ceph osd crush reweight {name} {weight}

Where:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``


``weight``

:Description: The CRUSH weight for the OSD.
:Type: Double
:Required: Yes
:Example: ``2.0``

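For example, to set the CRUSH weight of ``osd.0`` to the ``2.0`` shown
above::

  ceph osd crush reweight osd.0 2.0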

.. _removeosd:

Remove an OSD
-------------

.. note:: OSDs are normally removed from the CRUSH map as part of the
   ``ceph osd purge`` command. This command is rarely needed.

To remove an OSD from the CRUSH map of a running cluster, execute the
following::

  ceph osd crush remove {name}

Where:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``

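For example, removing ``osd.0`` (the example OSD above) from the CRUSH
map would look like::

  ceph osd crush remove osd.0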

Add a Bucket
------------

.. note:: Buckets are normally implicitly created when an OSD is added
   that specifies a ``{bucket-type}={bucket-name}`` as part of its
   location and a bucket with that name does not already exist. This
   command is typically used when manually adjusting the structure of the
   hierarchy after OSDs have been created (for example, to move a
   series of hosts underneath a new rack-level bucket).

To add a bucket in the CRUSH map of a running cluster, execute the
``ceph osd crush add-bucket`` command::

  ceph osd crush add-bucket {bucket-name} {bucket-type}

Where:

``bucket-name``

:Description: The full name of the bucket.
:Type: String
:Required: Yes
:Example: ``rack12``


``bucket-type``

:Description: The type of the bucket. The type must already exist in the hierarchy.
:Type: String
:Required: Yes
:Example: ``rack``


The following example adds the ``rack12`` bucket to the hierarchy::

  ceph osd crush add-bucket rack12 rack

Move a Bucket
-------------

To move a bucket to a different location or position in the CRUSH map
hierarchy, execute the following::

  ceph osd crush move {bucket-name} {bucket-type}={bucket-name} [...]

Where:

``bucket-name``

:Description: The name of the bucket to move/reposition.
:Type: String
:Required: Yes
:Example: ``foo-bar-1``

``bucket-type``

:Description: You may specify the bucket's location in the CRUSH hierarchy.
:Type: Key/value pairs.
:Required: No
:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``

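For example, using the illustrative names above, the host bucket
``foo-bar-1`` could be moved underneath a particular rack with::

  ceph osd crush move foo-bar-1 datacenter=dc1 room=room1 row=foo rack=bar
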
Remove a Bucket
---------------

To remove a bucket from the CRUSH map hierarchy, execute the following::

  ceph osd crush remove {bucket-name}

.. note:: A bucket must be empty before removing it from the CRUSH hierarchy.

Where:

``bucket-name``

:Description: The name of the bucket that you'd like to remove.
:Type: String
:Required: Yes
:Example: ``rack12``

The following example removes the ``rack12`` bucket from the hierarchy::

  ceph osd crush remove rack12

Creating a compat weight set
----------------------------

.. note:: This step is normally done automatically by the ``balancer``
   module when enabled.

To create a *compat* weight set::

  ceph osd crush weight-set create-compat

Weights for the compat weight set can be adjusted with::

  ceph osd crush weight-set reweight-compat {name} {weight}

The compat weight set can be destroyed with::

  ceph osd crush weight-set rm-compat

Creating per-pool weight sets
-----------------------------

To create a weight set for a specific pool::

  ceph osd crush weight-set create {pool-name} {mode}

.. note:: Per-pool weight sets require that all servers and daemons
   run Luminous v12.2.z or later.

Where:

``pool-name``

:Description: The name of a RADOS pool
:Type: String
:Required: Yes
:Example: ``rbd``

``mode``

:Description: Either ``flat`` or ``positional``. A *flat* weight set
              has a single weight for each device or bucket. A
              *positional* weight set has a potentially different
              weight for each position in the resulting placement
              mapping. For example, if a pool has a replica count of
              3, then a positional weight set will have three weights
              for each device and bucket.
:Type: String
:Required: Yes
:Example: ``flat``

To adjust the weight of an item in a weight set::

  ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]}

To list existing weight sets::

  ceph osd crush weight-set ls

To remove a weight set::

  ceph osd crush weight-set rm {pool-name}

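Putting these commands together, a hypothetical sequence that creates
a positional weight set for the ``rbd`` pool and then adjusts the
three positional weights of one device (the weights shown are
illustrative) might look like::

  ceph osd crush weight-set create rbd positional
  ceph osd crush weight-set reweight rbd osd.0 0.9 0.9 0.9
  ceph osd crush weight-set ls
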
Creating a rule for a replicated pool
-------------------------------------

For a replicated pool, the primary decision when creating the CRUSH
rule is what the failure domain is going to be. For example, if a
failure domain of ``host`` is selected, then CRUSH will ensure that
each replica of the data is stored on a different host. If ``rack``
is selected, then each replica will be stored in a different rack.
What failure domain you choose primarily depends on the size of your
cluster and how your hierarchy is structured.

Normally, the entire cluster hierarchy is nested beneath a root node
named ``default``. If you have customized your hierarchy, you may
want to create a rule nested at some other node in the hierarchy. It
doesn't matter what type is associated with that node (it doesn't have
to be a ``root`` node).

It is also possible to create a rule that restricts data placement to
a specific *class* of device. By default, Ceph OSDs automatically
classify themselves as either ``hdd`` or ``ssd``, depending on the
underlying type of device being used. These classes can also be
customized.

To create a replicated rule::

  ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}]

Where:

``name``

:Description: The name of the rule
:Type: String
:Required: Yes
:Example: ``rbd-rule``

``root``

:Description: The name of the node under which data should be placed.
:Type: String
:Required: Yes
:Example: ``default``

``failure-domain-type``

:Description: The type of CRUSH nodes across which we should separate replicas.
:Type: String
:Required: Yes
:Example: ``rack``

``class``

:Description: The device class data should be placed on.
:Type: String
:Required: No
:Example: ``ssd``

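For example, using the example values above, a rule that separates
replicas across racks and restricts placement to SSDs could be created
and then assigned to a pool (the pool name is illustrative) with::

  ceph osd crush rule create-replicated rbd-rule default rack ssd
  ceph osd pool set rbd crush_rule rbd-rule
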
Creating a rule for an erasure coded pool
-----------------------------------------

For an erasure-coded pool, the same basic decisions need to be made as
with a replicated pool: what is the failure domain, what node in the
hierarchy will data be placed under (usually ``default``), and will
placement be restricted to a specific device class. Erasure code
pools are created a bit differently, however, because they need to be
constructed carefully based on the erasure code being used. For this reason,
you must include this information in the *erasure code profile*. A CRUSH
rule will then be created from that either explicitly or automatically when
the profile is used to create a pool.

The erasure code profiles can be listed with::

  ceph osd erasure-code-profile ls

An existing profile can be viewed with::

  ceph osd erasure-code-profile get {profile-name}

Normally profiles should never be modified; instead, a new profile
should be created and used when creating a new pool or creating a new
rule for an existing pool.

An erasure code profile consists of a set of key=value pairs. Most of
these control the behavior of the erasure code that is encoding data
in the pool. Those that begin with ``crush-``, however, affect the
CRUSH rule that is created.

The erasure code profile properties of interest are:

 * **crush-root**: the name of the CRUSH node to place data under [default: ``default``].
 * **crush-failure-domain**: the CRUSH type to separate erasure-coded shards across [default: ``host``].
 * **crush-device-class**: the device class to place data on [default: none, meaning all devices are used].
 * **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the number of erasure code shards, affecting the resulting CRUSH rule.

Once a profile is defined, you can create a CRUSH rule with::

  ceph osd crush rule create-erasure {name} {profile-name}

.. note:: When creating a new pool, it is not actually necessary to
   explicitly create the rule. If the erasure code profile alone is
   specified and the rule argument is left off then Ceph will create
   the CRUSH rule automatically.

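As a sketch, a hypothetical profile using ``k=4`` and ``m=2`` that
separates shards across racks and restricts placement to HDDs, along
with a rule created from it (the profile and rule names are
illustrative), might look like::

  ceph osd erasure-code-profile set myprofile k=4 m=2 \
      crush-failure-domain=rack crush-device-class=hdd
  ceph osd crush rule create-erasure ec-rule myprofile
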
Deleting rules
--------------

Rules that are not in use by pools can be deleted with::

  ceph osd crush rule rm {rule-name}


Tunables
========

Over time, we have made (and continue to make) improvements to the
CRUSH algorithm used to calculate the placement of data. In order to
support the change in behavior, we have introduced a series of tunable
options that control whether the legacy or improved variation of the
algorithm is used.

In order to use newer tunables, both clients and servers must support
the new version of CRUSH. For this reason, we have created
``profiles`` that are named after the Ceph version in which they were
introduced. For example, the ``firefly`` tunables are first supported
in the firefly release, and will not work with older (e.g., dumpling)
clients. Once a given set of tunables are changed from the legacy
default behavior, the ``ceph-mon`` and ``ceph-osd`` will prevent older
clients who do not support the new CRUSH features from connecting to
the cluster.

argonaut (legacy)
-----------------

The legacy CRUSH behavior used by argonaut and older releases works
fine for most clusters, provided there are not too many OSDs that have
been marked out.

bobtail (CRUSH_TUNABLES2)
-------------------------

The bobtail tunable profile fixes a few key misbehaviors:

 * For hierarchies with a small number of devices in the leaf buckets,
   some PGs map to fewer than the desired number of replicas. This
   commonly happens for hierarchies with "host" nodes with a small
   number (1-3) of OSDs nested beneath each one.

 * For large clusters, some small percentages of PGs map to less than
   the desired number of OSDs. This is more prevalent when there are
   several layers of the hierarchy (e.g., row, rack, host, osd).

 * When some OSDs are marked out, the data tends to get redistributed
   to nearby OSDs instead of across the entire hierarchy.

The new tunables are:

 * ``choose_local_tries``: Number of local retries. Legacy value is
   2, optimal value is 0.

 * ``choose_local_fallback_tries``: Legacy value is 5, optimal value
   is 0.

 * ``choose_total_tries``: Total number of attempts to choose an item.
   Legacy value was 19, subsequent testing indicates that a value of
   50 is more appropriate for typical clusters. For extremely large
   clusters, a larger value might be necessary.

 * ``chooseleaf_descend_once``: Whether a recursive chooseleaf attempt
   will retry, or only try once and allow the original placement to
   retry. Legacy default is 0, optimal value is 1.

Migration impact:

 * Moving from argonaut to bobtail tunables triggers a moderate amount
   of data movement. Use caution on a cluster that is already
   populated with data.

firefly (CRUSH_TUNABLES3)
-------------------------

The firefly tunable profile fixes a problem
with the ``chooseleaf`` CRUSH rule behavior that tends to result in PG
mappings with too few results when too many OSDs have been marked out.

The new tunable is:

 * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will
   start with a non-zero value of r, based on how many attempts the
   parent has already made. Legacy default is 0, but with this value
   CRUSH is sometimes unable to find a mapping. The optimal value (in
   terms of computational cost and correctness) is 1.

Migration impact:

 * For existing clusters that have lots of existing data, changing
   from 0 to 1 will cause a lot of data to move; a value of 4 or 5
   will allow CRUSH to find a valid mapping but will make less data
   move.

straw_calc_version tunable (introduced with Firefly too)
---------------------------------------------------------

There were some problems with the internal weights calculated and
stored in the CRUSH map for ``straw`` buckets. Specifically, when
there were items with a CRUSH weight of 0, or a mix of different and
duplicated weights, CRUSH would distribute data incorrectly (i.e.,
not in proportion to the weights).

The new tunable is:

 * ``straw_calc_version``: A value of 0 preserves the old, broken
   internal weight calculation; a value of 1 fixes the behavior.

Migration impact:

 * Moving to straw_calc_version 1 and then adjusting a straw bucket
   (by adding, removing, or reweighting an item, or by using the
   reweight-all command) can trigger a small to moderate amount of
   data movement *if* the cluster has hit one of the problematic
   conditions.

This tunable option is special because it has absolutely no impact on
the kernel version required on the client side.

hammer (CRUSH_V4)
-----------------

The hammer tunable profile does not affect the
mapping of existing CRUSH maps simply by changing the profile. However:

 * There is a new bucket type (``straw2``) supported. The new
   ``straw2`` bucket type fixes several limitations in the original
   ``straw`` bucket. Specifically, the old ``straw`` buckets would
   change some mappings that should not have changed when a weight was
   adjusted, while ``straw2`` achieves the original goal of only
   changing mappings to or from the bucket item whose weight has
   changed.

 * ``straw2`` is the default for any newly created buckets.

Migration impact:

 * Changing a bucket type from ``straw`` to ``straw2`` will result in
   a reasonably small amount of data movement, depending on how much
   the bucket item weights vary from each other. When the weights are
   all the same no data will move, and when item weights vary
   significantly there will be more movement.

jewel (CRUSH_TUNABLES5)
-----------------------

The jewel tunable profile improves the
overall behavior of CRUSH such that significantly fewer mappings
change when an OSD is marked out of the cluster.

The new tunable is:

 * ``chooseleaf_stable``: Whether a recursive chooseleaf attempt will
   use a better value for an inner loop that greatly reduces the number
   of mapping changes when an OSD is marked out. The legacy value is 0,
   while the new value of 1 uses the new approach.

Migration impact:

 * Changing this value on an existing cluster will result in a very
   large amount of data movement as almost every PG mapping is likely
   to change.


Which client versions support CRUSH_TUNABLES
--------------------------------------------

 * argonaut series, v0.48.1 or later
 * v0.49 or later
 * Linux kernel version v3.6 or later (for the file system and RBD kernel clients)

Which client versions support CRUSH_TUNABLES2
---------------------------------------------

 * v0.55 or later, including bobtail series (v0.56.x)
 * Linux kernel version v3.9 or later (for the file system and RBD kernel clients)

Which client versions support CRUSH_TUNABLES3
---------------------------------------------

 * v0.78 (firefly) or later
 * Linux kernel version v3.15 or later (for the file system and RBD kernel clients)

Which client versions support CRUSH_V4
--------------------------------------

 * v0.94 (hammer) or later
 * Linux kernel version v4.1 or later (for the file system and RBD kernel clients)

Which client versions support CRUSH_TUNABLES5
---------------------------------------------

 * v10.0.2 (jewel) or later
 * Linux kernel version v4.5 or later (for the file system and RBD kernel clients)

Warning when tunables are non-optimal
-------------------------------------

Starting with version v0.74, Ceph will issue a health warning if the
current CRUSH tunables don't include all the optimal values from the
``default`` profile (see below for the meaning of the ``default`` profile).
To make this warning go away, you have two options:

1. Adjust the tunables on the existing cluster. Note that this will
   result in some data movement (possibly as much as 10%). This is the
   preferred route, but should be taken with care on a production cluster
   where the data movement may affect performance. You can enable optimal
   tunables with::

     ceph osd crush tunables optimal

   If things go poorly (e.g., too much load) and not very much
   progress has been made, or there is a client compatibility problem
   (old kernel cephfs or rbd clients, or pre-bobtail librados
   clients), you can switch back with::

     ceph osd crush tunables legacy

2. You can make the warning go away without making any changes to CRUSH by
   adding the following option to your ceph.conf ``[mon]`` section::

     mon warn on legacy crush tunables = false

   For the change to take effect, you will need to restart the monitors, or
   apply the option to running monitors with::

     ceph tell mon.\* injectargs --no-mon-warn-on-legacy-crush-tunables


A few important points
----------------------

 * Adjusting these values will result in the shift of some PGs between
   storage nodes. If the Ceph cluster is already storing a lot of
   data, be prepared for some fraction of the data to move.
 * The ``ceph-osd`` and ``ceph-mon`` daemons will start requiring the
   feature bits of new connections as soon as they get
   the updated map. However, already-connected clients are
   effectively grandfathered in, and will misbehave if they do not
   support the new feature.
 * If the CRUSH tunables are set to non-legacy values and then later
   changed back to the default values, ``ceph-osd`` daemons will not be
   required to support the feature. However, the OSD peering process
   requires examining and understanding old maps. Therefore, you
   should not run old versions of the ``ceph-osd`` daemon
   if the cluster has previously used non-legacy CRUSH values, even if
   the latest version of the map has been switched back to using the
   legacy defaults.

Tuning CRUSH
------------

The simplest way to adjust the CRUSH tunables is by changing to a known
profile. Those are:

 * ``legacy``: the legacy behavior from argonaut and earlier.
 * ``argonaut``: the legacy values supported by the original argonaut release
 * ``bobtail``: the values supported by the bobtail release
 * ``firefly``: the values supported by the firefly release
 * ``hammer``: the values supported by the hammer release
 * ``jewel``: the values supported by the jewel release
 * ``optimal``: the best (i.e., optimal) values of the current version of Ceph
 * ``default``: the default values of a new cluster installed from
   scratch. These values, which depend on the current version of Ceph,
   are hard coded and are generally a mix of optimal and legacy values.
   These values generally match the ``optimal`` profile of the previous
   LTS release, or the most recent release for which we expect most
   users to have up-to-date clients.

You can select a profile on a running cluster with the command::

  ceph osd crush tunables {PROFILE}

Note that this may result in some data movement.
937
c07f9fc5 938.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
7c673cae 939

Primary Affinity
================

When a Ceph Client reads or writes data, it always contacts the primary OSD in
the acting set. For set ``[2, 3, 4]``, ``osd.2`` is the primary. Sometimes an
OSD is not well suited to act as a primary compared to other OSDs (e.g., it has
a slow disk or a slow controller). To prevent performance bottlenecks
(especially on read operations) while maximizing utilization of your hardware,
you can set a Ceph OSD's primary affinity so that CRUSH is less likely to use
the OSD as a primary in an acting set. ::

  ceph osd primary-affinity <osd-id> <weight>

Primary affinity is ``1`` by default (*i.e.,* an OSD may act as a primary). You
may set the OSD's primary affinity to a value in the range ``0-1``, where ``0``
means that the OSD may **NOT** be used as a primary and ``1`` means that an OSD
may be used as a primary. When the weight is ``< 1``, it is less likely that
CRUSH will select the Ceph OSD Daemon to act as a primary.
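
For example, to make CRUSH less likely to choose ``osd.2`` (from the
acting set above) as a primary, you might lower its primary affinity
to an illustrative value of ``0.5``::

  ceph osd primary-affinity 2 0.5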