============
 CRUSH Maps
============

The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm
determines how to store and retrieve data by computing data storage locations.
CRUSH empowers Ceph clients to communicate with OSDs directly rather than
through a centralized server or broker. With an algorithmically determined
method of storing and retrieving data, Ceph avoids a single point of failure, a
performance bottleneck, and a physical limit to its scalability.

CRUSH requires a map of your cluster, and uses the CRUSH map to pseudo-randomly
store and retrieve data in OSDs with a uniform distribution of data across the
cluster. For a detailed discussion of CRUSH, see
`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_.

CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)`, a list of
'buckets' for aggregating the devices into physical locations, and a list of
rules that tell CRUSH how it should replicate data in a Ceph cluster's pools. By
reflecting the underlying physical organization of the installation, CRUSH can
model, and thereby address, potential sources of correlated device failures.
Typical sources include physical proximity, a shared power source, and a shared
network. By encoding this information into the cluster map, CRUSH placement
policies can separate object replicas across different failure domains while
still maintaining the desired distribution. For example, to address the
possibility of concurrent failures, it may be desirable to ensure that data
replicas are on devices using different shelves, racks, power supplies,
controllers, and/or physical locations.

When you deploy OSDs they are automatically placed within the CRUSH map under a
``host`` node named with the hostname for the host they are running on. This,
combined with the default CRUSH failure domain, ensures that replicas or erasure
code shards are separated across hosts and a single host failure will not
affect availability. For larger clusters, however, administrators should
carefully consider their choice of failure domain. Separating replicas across
racks, for example, is common for mid- to large-sized clusters.


CRUSH Location
==============

The location of an OSD in terms of the CRUSH map's hierarchy is
referred to as a ``crush location``. This location specifier takes the
form of a list of key and value pairs describing a position. For
example, if an OSD is in a particular row, rack, chassis and host, and
is part of the 'default' CRUSH tree (this is the case for the vast
majority of clusters), its crush location could be described as::

  root=default row=a rack=a2 chassis=a2a host=a2a1

Note:

#. The order of the keys does not matter.
#. The key name (left of ``=``) must be a valid CRUSH ``type``. By default
   these include root, datacenter, room, row, pod, pdu, rack, chassis and host,
   but those types can be customized to be anything appropriate by modifying
   the CRUSH map.
#. Not all keys need to be specified. For example, by default, Ceph
   automatically sets a ``ceph-osd`` daemon's location to be
   ``root=default host=HOSTNAME`` (based on the output from ``hostname -s``).
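The three rules in the note can be sketched as a small parser. This is an illustration only, not Ceph's own code: the function name ``parse_crush_location`` and the ``DEFAULT_TYPES`` set (copied from the default key names listed in the note) are assumptions made for this example.

```python
# Illustrative only: a minimal parser for a crush location string.
# DEFAULT_TYPES mirrors the default key names listed in the note above;
# a customized CRUSH map may define a different set.
DEFAULT_TYPES = {
    "root", "datacenter", "room", "row", "pod",
    "pdu", "rack", "chassis", "host",
}


def parse_crush_location(spec, valid_types=DEFAULT_TYPES):
    """Turn 'root=default rack=a2 host=a2a1' into a dict keyed by CRUSH type."""
    location = {}
    for pair in spec.split():
        key, sep, value = pair.partition("=")
        if not sep or not value:
            raise ValueError(f"expected key=value, got {pair!r}")
        if key not in valid_types:
            raise ValueError(f"{key!r} is not a known CRUSH type")
        location[key] = value
    return location


# Key order does not matter: both spellings describe the same position.
full = parse_crush_location("root=default row=a rack=a2 chassis=a2a host=a2a1")
shuffled = parse_crush_location("host=a2a1 chassis=a2a rack=a2 row=a root=default")
print(full == shuffled)

# Not all keys are required; this is the shape of the default location.
print(parse_crush_location("root=default host=a2a1"))
```

A key that is not a defined type (for example ``shelf=3`` with the default types) raises ``ValueError`` in this sketch.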

The crush location for an OSD is normally expressed via the ``crush location``
config option being set in the ``ceph.conf`` file. Each time the OSD starts,
it verifies it is in the correct location in the CRUSH map and, if it is not,
it moves itself. To disable this automatic CRUSH map management, add the
following to your configuration file in the ``[osd]`` section::

  osd crush update on start = false


Custom location hooks
---------------------

A customized location hook can be used to generate a more complete
crush location on startup. The sample ``ceph-crush-location`` utility
will generate a CRUSH location string for a given daemon. The
location is based on, in order of preference:

#. A ``crush location`` option in ceph.conf.
#. A default of ``root=default host=HOSTNAME`` where the hostname is
   generated with the ``hostname -s`` command.

This is not useful by itself, as the OSD itself has the exact same
behavior. However, the script can be modified to provide additional
location fields (for example, the rack or datacenter), and then the
hook enabled via the config option::

  crush location hook = /path/to/customized-ceph-crush-location

This hook is passed several arguments (below) and should output a single line
to stdout with the CRUSH location description::

  $ ceph-crush-location --cluster CLUSTER --id ID --type TYPE

where the cluster name is typically 'ceph', the id is the daemon
identifier (the OSD number), and the daemon type is typically ``osd``.
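A customized hook might look like the following sketch. It accepts the ``--cluster``, ``--id`` and ``--type`` arguments described above and prints one line. Everything else is an assumption for illustration: the ``RACK_BY_HOST`` table and the ``build_location`` helper are hypothetical, and a real hook would more likely read rack information from a file or an inventory system.

```python
#!/usr/bin/env python3
# Sketch of a custom location hook. Only the three command-line arguments
# come from the documentation; the rack lookup table is a made-up example.
import argparse
import socket

# Hypothetical host-to-rack mapping.
RACK_BY_HOST = {"a2a1": "a2", "b1c3": "b1"}


def build_location(hostname, rack_by_host=RACK_BY_HOST):
    """Return a single-line CRUSH location for the given short hostname."""
    parts = ["root=default"]
    rack = rack_by_host.get(hostname)
    if rack is not None:
        parts.append(f"rack={rack}")
    parts.append(f"host={hostname}")
    return " ".join(parts)


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--cluster", default="ceph")
    parser.add_argument("--id")
    parser.add_argument("--type", default="osd")
    # The arguments are accepted but not used in this sketch.
    parser.parse_known_args(argv)
    # Approximates `hostname -s` by dropping any domain suffix.
    short_name = socket.gethostname().split(".")[0]
    print(build_location(short_name))


if __name__ == "__main__":
    main()
```

Hosts missing from the table fall back to the same ``root=default host=HOSTNAME`` form that the OSD would use on its own.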


CRUSH structure
===============

The CRUSH map consists of, loosely speaking, a hierarchy describing
the physical topology of the cluster, and a set of rules defining
policy about how we place data on those devices. The hierarchy has
devices (``ceph-osd`` daemons) at the leaves, and internal nodes
corresponding to other physical features or groupings: hosts, racks,
rows, datacenters, and so on. The rules describe how replicas are
placed in terms of that hierarchy (e.g., 'three replicas in different
racks').

Devices
-------

Devices are individual ``ceph-osd`` daemons that can store data. You
will normally have one defined here for each OSD daemon in your
cluster. Devices are identified by an id (a non-negative integer) and
a name, normally ``osd.N`` where ``N`` is the device id.

Devices may also have a *device class* associated with them (e.g.,
``hdd`` or ``ssd``), allowing them to be conveniently targeted by a
crush rule.

Types and Buckets
-----------------

A bucket is the CRUSH term for internal nodes in the hierarchy: hosts,
racks, rows, etc. The CRUSH map defines a series of *types* that are
used to describe these nodes. By default, these types include:

- osd (or device)
- host
- chassis
- rack
- row
- pdu
- pod
- room
- datacenter
- region
- root

Most clusters make use of only a handful of these types, and others
can be defined as needed.

The hierarchy is built with devices (normally type ``osd``) at the
leaves, interior nodes with non-device types, and a root node of type
``root``. For example,

.. ditaa::

                        +-----------------+
                        | {o}root default |
                        +--------+--------+
                                 |
                 +---------------+---------------+
                 |                               |
         +-------+-------+                 +-----+-------+
         | {o}host foo   |                 | {o}host bar |
         +-------+-------+                 +-----+-------+
                 |                               |
         +-------+-------+               +-------+-------+
         |               |               |               |
   +-----+-----+   +-----+-----+   +-----+-----+   +-----+-----+
   |   osd.0   |   |   osd.1   |   |   osd.2   |   |   osd.3   |
   +-----------+   +-----------+   +-----------+   +-----------+

Each node (device or bucket) in the hierarchy has a *weight*
associated with it, indicating the relative proportion of the total
data that device or hierarchy subtree should store. Weights are set
at the leaves, indicating the size of the device, and automatically
sum up the tree from there, such that the weight of the default node
will be the total of all devices contained beneath it. Normally
weights are in units of terabytes (TB).
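The way leaf weights sum up the tree can be shown with a toy model of the example hierarchy. This is not a Ceph API; the ``tree`` and ``leaf_weights`` structures and the device sizes in them are hypothetical.

```python
# Illustrative model of the example hierarchy; not a Ceph API.
# Leaf weights are hypothetical device sizes in TB.
tree = {
    "default": ["foo", "bar"],
    "foo": ["osd.0", "osd.1"],
    "bar": ["osd.2", "osd.3"],
}
leaf_weights = {"osd.0": 1.0, "osd.1": 2.0, "osd.2": 4.0, "osd.3": 4.0}


def subtree_weight(node):
    """Weight of a device, or the summed weight of everything under a bucket."""
    if node in leaf_weights:
        return leaf_weights[node]
    return sum(subtree_weight(child) for child in tree[node])


for name in ("foo", "bar", "default"):
    print(name, subtree_weight(name))
```

With these sizes, host ``foo`` weighs 3.0, host ``bar`` weighs 8.0, and the ``default`` root weighs 11.0, so ``bar`` should receive roughly 8/11 of the data.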

You can get a simple view of the CRUSH hierarchy for your cluster,
including the weights, with::

  ceph osd crush tree

Rules
-----

Rules define policy about how data is distributed across the devices
in the hierarchy.

CRUSH rules define placement and replication strategies or
distribution policies that allow you to specify exactly how CRUSH
places object replicas. For example, you might create a rule selecting
a pair of targets for 2-way mirroring, another rule for selecting
three targets in two different data centers for 3-way mirroring, and
yet another rule for erasure coding over six storage devices. For a
detailed discussion of CRUSH rules, refer to `CRUSH - Controlled,
Scalable, Decentralized Placement of Replicated Data`_, and more
specifically to **Section 3.2**.

In almost all cases, CRUSH rules can be created via the CLI by
specifying the *pool type* they will be used for (replicated or
erasure coded), the *failure domain*, and optionally a *device class*.
In rare cases rules must be written by hand by manually editing the
CRUSH map.

You can see what rules are defined for your cluster with::

  ceph osd crush rule ls

You can view the contents of the rules with::

  ceph osd crush rule dump


Weight sets
-----------

A *weight set* is an alternative set of weights to use when
calculating data placement. The normal weights associated with each
device in the CRUSH map are set based on the device size and indicate
how much data we *should* be storing where. However, because CRUSH is
based on a pseudorandom placement process, there is always some
variation from this ideal distribution, the same way that rolling a
die sixty times will not result in rolling exactly 10 ones and 10
sixes. Weight sets allow the cluster to do a numerical optimization
based on the specifics of your cluster (hierarchy, pools, etc.) to achieve
a balanced distribution.

There are two types of weight sets supported:

#. A **compat** weight set is a single alternative set of weights for
   each device and node in the cluster. This is not well-suited for
   correcting for all anomalies (for example, placement groups for
   different pools may be different sizes and have different load
   levels, but will be mostly treated the same by the balancer).
   However, compat weight sets have the huge advantage that they are
   *backward compatible* with previous versions of Ceph, which means
   that even though weight sets were first introduced in Luminous
   v12.2.z, older clients (e.g., firefly) can still connect to the
   cluster when a compat weight set is being used to balance data.
#. A **per-pool** weight set is more flexible in that it allows
   placement to be optimized for each data pool. Additionally,
   weights can be adjusted for each position of placement, allowing
   the optimizer to correct for a subtle skew of data toward devices
   with small weights relative to their peers (an effect that is
   usually only apparent in very large clusters but which can cause
   balancing problems).

When weight sets are in use, the weights associated with each node in
the hierarchy are visible as a separate column (labeled either
``(compat)`` or the pool name) from the command::

  ceph osd crush tree

When both *compat* and *per-pool* weight sets are in use, data
placement for a particular pool will use its own per-pool weight set
if present. If not, it will use the compat weight set if present. If
neither is present, it will use the normal CRUSH weights.
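The fallback order described above can be written as a short selection function. This is a conceptual sketch only: the ``effective_weights`` name, the dictionary layout, and the sample weights are assumptions for illustration and do not reflect how Ceph stores weight sets internally.

```python
def effective_weights(pool, crush_weights, compat=None, per_pool=None):
    """Pick the weights used for a pool: per-pool, then compat, then CRUSH."""
    per_pool = per_pool or {}
    if pool in per_pool:
        return per_pool[pool]
    if compat is not None:
        return compat
    return crush_weights


# Hypothetical weights for a two-OSD cluster.
crush_weights = {"osd.0": 1.0, "osd.1": 1.0}
compat = {"osd.0": 0.95, "osd.1": 1.05}
per_pool = {"rbd": {"osd.0": 0.9, "osd.1": 1.1}}

print(effective_weights("rbd", crush_weights, compat, per_pool))
print(effective_weights("cephfs_data", crush_weights, compat, per_pool))
print(effective_weights("cephfs_data", crush_weights))
```

Here the ``rbd`` pool uses its own weight set, ``cephfs_data`` falls back to the compat set, and with no weight sets at all the normal CRUSH weights apply.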

Although weight sets can be set up and manipulated by hand, it is
recommended that the *balancer* module be enabled to do so
automatically.


Modifying the CRUSH map
=======================

.. _addosd:

Add/Move an OSD
---------------

.. note:: OSDs are normally automatically added to the CRUSH map when
          the OSD is created. This command is rarely needed.

To add or move an OSD in the CRUSH map of a running cluster::

  ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...]

Where:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``


``weight``

:Description: The CRUSH weight for the OSD, normally its size measured in terabytes (TB).
:Type: Double
:Required: Yes
:Example: ``2.0``


``root``

:Description: The root node of the tree in which the OSD resides (normally ``default``).
:Type: Key/value pair.
:Required: Yes
:Example: ``root=default``


``bucket-type``

:Description: You may specify the OSD's location in the CRUSH hierarchy.
:Type: Key/value pairs.
:Required: No
:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``


The following example adds ``osd.0`` to the hierarchy, or moves the
OSD from a previous location. ::

  ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1
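When such commands are generated by a script, the argument list can be assembled from a location mapping. The sketch below only builds the command line shown above and does not run it; the ``crush_set_command`` helper is a hypothetical name introduced for this example.

```python
import shlex


def crush_set_command(name, weight, location):
    """Build the argv for `ceph osd crush set` from a location mapping."""
    if "root" not in location:
        raise ValueError("a root=... entry is required")
    argv = ["ceph", "osd", "crush", "set", name, str(weight)]
    argv += [f"{key}={value}" for key, value in location.items()]
    return argv


cmd = crush_set_command(
    "osd.0",
    1.0,
    {
        "root": "default",
        "datacenter": "dc1",
        "room": "room1",
        "row": "foo",
        "rack": "bar",
        "host": "foo-bar-1",
    },
)
# Printed for inspection; pass `cmd` to subprocess.run() to execute it.
print(shlex.join(cmd))
```

Keeping the command as a list avoids shell quoting problems if it is later handed to ``subprocess.run``.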


Adjust OSD weight
-----------------

.. note:: Normally OSDs automatically add themselves to the CRUSH map
          with the correct weight when they are created. This command
          is rarely needed.

To adjust an OSD's crush weight in the CRUSH map of a running cluster, execute
the following::

  ceph osd crush reweight {name} {weight}

Where:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``


``weight``

:Description: The CRUSH weight for the OSD.
:Type: Double
:Required: Yes
:Example: ``2.0``


.. _removeosd:

Remove an OSD
-------------

.. note:: OSDs are normally removed from the CRUSH map as part of the
          ``ceph osd purge`` command. This command is rarely needed.

To remove an OSD from the CRUSH map of a running cluster, execute the
following::

  ceph osd crush remove {name}

Where:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``


Add a Bucket
------------

.. note:: Buckets are normally implicitly created when an OSD is added
          that specifies a ``{bucket-type}={bucket-name}`` as part of its
          location and a bucket with that name does not already exist. This
          command is typically used when manually adjusting the structure of the
          hierarchy after OSDs have been created (for example, to move a
          series of hosts underneath a new rack-level bucket).

To add a bucket in the CRUSH map of a running cluster, execute the
``ceph osd crush add-bucket`` command::

  ceph osd crush add-bucket {bucket-name} {bucket-type}

Where:

``bucket-name``

:Description: The full name of the bucket.
:Type: String
:Required: Yes
:Example: ``rack12``


``bucket-type``

:Description: The type of the bucket. The type must already exist in the hierarchy.
:Type: String
:Required: Yes
:Example: ``rack``


The following example adds the ``rack12`` bucket to the hierarchy::

  ceph osd crush add-bucket rack12 rack

Move a Bucket
-------------

To move a bucket to a different location or position in the CRUSH map
hierarchy, execute the following::

  ceph osd crush move {bucket-name} {bucket-type}={bucket-name} [...]

Where:

``bucket-name``

:Description: The name of the bucket to move/reposition.
:Type: String
:Required: Yes
:Example: ``foo-bar-1``

``bucket-type``

:Description: You may specify the bucket's location in the CRUSH hierarchy.
:Type: Key/value pairs.
:Required: No
:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``

Remove a Bucket
---------------

To remove a bucket from the CRUSH map hierarchy, execute the following::

  ceph osd crush remove {bucket-name}

.. note:: A bucket must be empty before removing it from the CRUSH hierarchy.

Where:

``bucket-name``

:Description: The name of the bucket that you'd like to remove.
:Type: String
:Required: Yes
:Example: ``rack12``

The following example removes the ``rack12`` bucket from the hierarchy::

  ceph osd crush remove rack12

Creating a compat weight set
----------------------------

.. note:: This step is normally done automatically by the ``balancer``
          module when enabled.

To create a *compat* weight set::

  ceph osd crush weight-set create-compat

Weights for the compat weight set can be adjusted with::

  ceph osd crush weight-set reweight-compat {name} {weight}

The compat weight set can be destroyed with::

  ceph osd crush weight-set rm-compat

Creating per-pool weight sets
-----------------------------

To create a weight set for a specific pool::

  ceph osd crush weight-set create {pool-name} {mode}

.. note:: Per-pool weight sets require that all servers and daemons
          run Luminous v12.2.z or later.

Where:

``pool-name``

:Description: The name of a RADOS pool.
:Type: String
:Required: Yes
:Example: ``rbd``

``mode``

:Description: Either ``flat`` or ``positional``. A *flat* weight set
              has a single weight for each device or bucket. A
              *positional* weight set has a potentially different
              weight for each position in the resulting placement
              mapping. For example, if a pool has a replica count of
              3, then a positional weight set will have three weights
              for each device and bucket.
:Type: String
:Required: Yes
:Example: ``flat``
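The difference between the two modes is easiest to see as a data shape. The following sketch is purely conceptual: ``make_weight_set`` is a hypothetical helper, and the list-per-item layout is an assumption chosen for illustration, not Ceph's internal representation.

```python
def make_weight_set(items, mode, replica_count):
    """Build an initial weight set from CRUSH weights (illustrative only)."""
    if mode == "flat":
        positions = 1
    elif mode == "positional":
        positions = replica_count
    else:
        raise ValueError("mode must be 'flat' or 'positional'")
    # One weight per placement position for every device or bucket.
    return {name: [weight] * positions for name, weight in items.items()}


# Hypothetical CRUSH weights for one host bucket and its two devices.
items = {"host-a": 3.0, "osd.0": 1.0, "osd.1": 2.0}

flat = make_weight_set(items, "flat", replica_count=3)
positional = make_weight_set(items, "positional", replica_count=3)
print(flat["osd.0"])
print(positional["osd.0"])
```

For a pool with three replicas, the flat set carries one weight per item while the positional set carries three, each of which an optimizer could then adjust separately.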

To adjust the weight of an item in a weight set::

  ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]}

To list existing weight sets::

  ceph osd crush weight-set ls

To remove a weight set::

  ceph osd crush weight-set rm {pool-name}

Creating a rule for a replicated pool
-------------------------------------

For a replicated pool, the primary decision when creating the CRUSH
rule is what the failure domain is going to be. For example, if a
failure domain of ``host`` is selected, then CRUSH will ensure that
each replica of the data is stored on a different host. If ``rack``
is selected, then each replica will be stored in a different rack.
What failure domain you choose primarily depends on the size of your
cluster and how your hierarchy is structured.
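What a failure domain guarantees can be expressed as a simple check on a placement. The sketch below is not part of Ceph: the ``separated_across`` function and the ``locations`` table, with its OSD, host and rack names, are hypothetical.

```python
def separated_across(placement, locations, failure_domain):
    """True if every OSD in `placement` sits in a different failure-domain bucket."""
    buckets = [locations[osd][failure_domain] for osd in placement]
    return len(set(buckets)) == len(buckets)


# Hypothetical cluster: three hosts, two of which share a rack.
locations = {
    "osd.0": {"rack": "r1", "host": "h1"},
    "osd.1": {"rack": "r1", "host": "h2"},
    "osd.2": {"rack": "r2", "host": "h3"},
}
placement = ["osd.0", "osd.1", "osd.2"]

print(separated_across(placement, locations, "host"))
print(separated_across(placement, locations, "rack"))
```

This placement satisfies a ``host`` failure domain but not a ``rack`` one, because ``osd.0`` and ``osd.1`` share rack ``r1``. A rule with a ``rack`` failure domain would need at least three racks to place three replicas.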

Normally, the entire cluster hierarchy is nested beneath a root node
named ``default``. If you have customized your hierarchy, you may
want to create a rule nested at some other node in the hierarchy. It
doesn't matter what type is associated with that node (it doesn't have
to be a ``root`` node).

It is also possible to create a rule that restricts data placement to
a specific *class* of device. By default, Ceph OSDs automatically
classify themselves as either ``hdd`` or ``ssd``, depending on the
underlying type of device being used. These classes can also be
customized.

To create a replicated rule::

  ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}]

Where:

``name``

:Description: The name of the rule.
:Type: String
:Required: Yes
:Example: ``rbd-rule``

``root``

:Description: The name of the node under which data should be placed.
:Type: String
:Required: Yes
:Example: ``default``

``failure-domain-type``

:Description: The type of CRUSH nodes across which we should separate replicas.
:Type: String
:Required: Yes
:Example: ``rack``

``class``

:Description: The device class data should be placed on.
:Type: String
:Required: No
:Example: ``ssd``

Creating a rule for an erasure coded pool
-----------------------------------------

For an erasure-coded pool, the same basic decisions need to be made as
with a replicated pool: what is the failure domain, what node in the
hierarchy will data be placed under (usually ``default``), and will
placement be restricted to a specific device class. Erasure code
pools are created a bit differently, however, because they need to be
constructed carefully based on the erasure code being used. For this reason,
you must include this information in the *erasure code profile*. A CRUSH
rule will then be created from that either explicitly or automatically when
the profile is used to create a pool.

The erasure code profiles can be listed with::

  ceph osd erasure-code-profile ls

An existing profile can be viewed with::

  ceph osd erasure-code-profile get {profile-name}

Normally profiles should never be modified; instead, a new profile
should be created and used when creating a new pool or creating a new
rule for an existing pool.

An erasure code profile consists of a set of key=value pairs. Most of
these control the behavior of the erasure code that is encoding data
in the pool. Those that begin with ``crush-``, however, affect the
CRUSH rule that is created.

The erasure code profile properties of interest are:

* **crush-root**: the name of the CRUSH node to place data under [default: ``default``].
* **crush-failure-domain**: the CRUSH type to separate erasure-coded shards across [default: ``host``].
* **crush-device-class**: the device class to place data on [default: none, meaning all devices are used].
* **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the number of erasure code shards, affecting the resulting CRUSH rule.
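The way these properties feed into the rule can be sketched as follows. The defaults are taken from the list above; the ``crush_settings`` and ``shard_count`` helpers are hypothetical, and the shard count assumes a plain ``k`` + ``m`` code, ignoring the ``lrc`` plugin's ``l`` parameter.

```python
# Defaults copied from the property list above.
CRUSH_DEFAULTS = {
    "crush-root": "default",
    "crush-failure-domain": "host",
    "crush-device-class": None,  # None: all devices are used
}


def crush_settings(profile):
    """Return the crush-* settings of a profile, with defaults filled in."""
    settings = dict(CRUSH_DEFAULTS)
    settings.update(
        {key: value for key, value in profile.items() if key.startswith("crush-")}
    )
    return settings


def shard_count(profile):
    """Number of shards for a plain k+m code (the lrc plugin is not modelled)."""
    return int(profile["k"]) + int(profile["m"])


# A profile written as key=value pairs, as described above.
text = "k=4 m=2 crush-failure-domain=rack"
profile = dict(pair.split("=", 1) for pair in text.split())

print(crush_settings(profile))
print(shard_count(profile))
```

With ``k=4 m=2`` and a ``rack`` failure domain, the resulting rule has to place six shards in six different racks under the ``default`` root.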

Once a profile is defined, you can create a CRUSH rule with::

  ceph osd crush rule create-erasure {name} {profile-name}

.. note:: When creating a new pool, it is not actually necessary to
          explicitly create the rule. If the erasure code profile alone is
          specified and the rule argument is left off then Ceph will create
          the CRUSH rule automatically.

Deleting rules
--------------

Rules that are not in use by pools can be deleted with::

  ceph osd crush rule rm {rule-name}


Tunables
========

Over time, we have made (and continue to make) improvements to the
CRUSH algorithm used to calculate the placement of data. In order to
support the change in behavior, we have introduced a series of tunable
options that control whether the legacy or improved variation of the
algorithm is used.

In order to use newer tunables, both clients and servers must support
the new version of CRUSH. For this reason, we have created
``profiles`` that are named after the Ceph version in which they were
introduced. For example, the ``firefly`` tunables are first supported
in the firefly release, and will not work with older (e.g., dumpling)
clients. Once a given set of tunables is changed from the legacy
default behavior, the ``ceph-mon`` and ``ceph-osd`` daemons will prevent
older clients that do not support the new CRUSH features from connecting
to the cluster.

argonaut (legacy)
-----------------

The legacy CRUSH behavior used by argonaut and older releases works
fine for most clusters, provided there are not too many OSDs that have
been marked out.

bobtail (CRUSH_TUNABLES2)
-------------------------

The bobtail tunable profile fixes a few key misbehaviors:

* For hierarchies with a small number of devices in the leaf buckets,
  some PGs map to fewer than the desired number of replicas. This
  commonly happens for hierarchies with "host" nodes with a small
  number (1-3) of OSDs nested beneath each one.

* For large clusters, a small percentage of PGs map to fewer than
  the desired number of OSDs. This is more prevalent when there are
  several layers of the hierarchy (e.g., row, rack, host, osd).

* When some OSDs are marked out, the data tends to get redistributed
  to nearby OSDs instead of across the entire hierarchy.

The new tunables are:

* ``choose_local_tries``: Number of local retries. Legacy value is
  2, optimal value is 0.

* ``choose_local_fallback_tries``: Legacy value is 5, optimal value
  is 0.

* ``choose_total_tries``: Total number of attempts to choose an item.
  Legacy value was 19; subsequent testing indicates that a value of
  50 is more appropriate for typical clusters. For extremely large
  clusters, a larger value might be necessary.

* ``chooseleaf_descend_once``: Whether a recursive chooseleaf attempt
  will retry, or only try once and allow the original placement to
  retry. Legacy default is 0, optimal value is 1.
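The legacy and recommended values listed above can be kept in a small table, for example to check a set of tunables read from a cluster. The numbers come from the list; the ``BOBTAIL_TUNABLES`` table and the ``not_recommended`` helper are hypothetical names for this sketch, and the input dictionary is assumed to have been obtained elsewhere.

```python
# Values copied from the list above.
BOBTAIL_TUNABLES = {
    "choose_local_tries": {"legacy": 2, "recommended": 0},
    "choose_local_fallback_tries": {"legacy": 5, "recommended": 0},
    "choose_total_tries": {"legacy": 19, "recommended": 50},
    "chooseleaf_descend_once": {"legacy": 0, "recommended": 1},
}


def not_recommended(current):
    """Return the bobtail tunables whose current value differs from the recommended one."""
    return {
        name: value
        for name, value in current.items()
        if name in BOBTAIL_TUNABLES
        and value != BOBTAIL_TUNABLES[name]["recommended"]
    }


# A cluster still running with the legacy (argonaut) values.
legacy = {name: values["legacy"] for name, values in BOBTAIL_TUNABLES.items()}
print(not_recommended(legacy))
```

For extremely large clusters a ``choose_total_tries`` value above 50 may be appropriate, so a difference reported for that tunable is not necessarily a problem.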
683 | ||
684 | Migration impact: | |
685 | ||
686 | * Moving from argonaut to bobtail tunables triggers a moderate amount | |
687 | of data movement. Use caution on a cluster that is already | |
688 | populated with data. | |
689 | ||
690 | firefly (CRUSH_TUNABLES3) | |
691 | ------------------------- | |
692 | ||
693 | The firefly tunable profile fixes a problem | |
694 | with the ``chooseleaf`` CRUSH rule behavior that tends to result in PG | |
695 | mappings with too few results when too many OSDs have been marked out. | |
696 | ||
697 | The new tunable is: | |
698 | ||
699 | * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will | |
700 | start with a non-zero value of r, based on how many attempts the | |
701 | parent has already made. Legacy default is 0, but with this value | |
702 | CRUSH is sometimes unable to find a mapping. The optimal value (in | |
703 | terms of computational cost and correctness) is 1. | |
704 | ||
705 | Migration impact: | |
706 | ||
707 | * For existing clusters that have lots of existing data, changing | |
708 | from 0 to 1 will cause a lot of data to move; a value of 4 or 5 | |
709 | will allow CRUSH to find a valid mapping but will make less data | |
710 | move. | |
711 | ||
712 | straw_calc_version tunable (introduced with Firefly too) | |
713 | -------------------------------------------------------- | |
714 | ||
715 | There were some problems with the internal weights calculated and | |
716 | stored in the CRUSH map for ``straw`` buckets. Specifically, when | |
717 | there were items with a CRUSH weight of 0 or both a mix of weights and | |
718 | some duplicated weights CRUSH would distribute data incorrectly (i.e., | |
719 | not in proportion to the weights). | |
720 | ||
721 | The new tunable is: | |
722 | ||
723 | * ``straw_calc_version``: A value of 0 preserves the old, broken | |
724 | internal weight calculation; a value of 1 fixes the behavior. | |
725 | ||
726 | Migration impact: | |
727 | ||
728 | * Moving to straw_calc_version 1 and then adjusting a straw bucket | |
729 | (by adding, removing, or reweighting an item, or by using the | |
730 | reweight-all command) can trigger a small to moderate amount of | |
731 | data movement *if* the cluster has hit one of the problematic | |
732 | conditions. | |
733 | ||
734 | This tunable is special because it has no impact on the kernel | |
735 | version required on the client side. | |
736 | ||
737 | hammer (CRUSH_V4) | |
738 | ----------------- | |
739 | ||
740 | The hammer tunable profile does not affect the | |
741 | mapping of existing CRUSH maps simply by changing the profile. However: | |
742 | ||
743 | * There is a new bucket type (``straw2``) supported. The new | |
744 | ``straw2`` bucket type fixes several limitations in the original | |
745 | ``straw`` bucket. Specifically, the old ``straw`` buckets would | |
746 | change some mappings that should not have changed when a weight was | |
747 | adjusted, while ``straw2`` achieves the original goal of only | |
748 | changing mappings to or from the bucket item whose weight has | |
749 | changed. | |
750 | ||
751 | * ``straw2`` is the default for any newly created buckets. | |
752 | ||
753 | Migration impact: | |
754 | ||
755 | * Changing a bucket type from ``straw`` to ``straw2`` will result in | |
756 | a reasonably small amount of data movement, depending on how much | |
757 | the bucket item weights vary from each other. When the weights are | |
758 | all the same no data will move, and when item weights vary | |
759 | significantly there will be more movement. | |
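
The ``straw2`` property described above can be illustrated with a minimal sketch. It assumes the commonly described ``straw2`` selection rule (each item draws ``ln(u) / weight`` for a hash-derived ``u`` in (0, 1], and the largest draw wins) and uses SHA-256 as a stand-in for CRUSH's own hash function. The name ``straw2_choose`` and the OSD names are hypothetical; this is not Ceph's implementation.

```python
import hashlib
import math


def straw2_choose(key, weights):
    """Return the item with the largest draw ln(u) / weight for this key."""
    best_item, best_draw = None, -math.inf
    for item, weight in sorted(weights.items()):
        if weight <= 0:
            continue  # zero-weight items never win
        digest = hashlib.sha256(f"{key}:{item}".encode()).digest()
        # Map the hash to u in (0, 1]; SHA-256 is an assumption, not CRUSH's hash.
        u = (int.from_bytes(digest[:8], "big") + 1) / 2**64
        draw = math.log(u) / weight
        if draw > best_draw:
            best_item, best_draw = item, draw
    return best_item


before = {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 1.0}
after = {**before, "osd.2": 2.0}  # only osd.2 is reweighted

# Collect every key whose mapping differs between the two weight sets.
changed = [
    (straw2_choose(k, before), straw2_choose(k, after))
    for k in range(10000)
]
changed = [(old, new) for old, new in changed if old != new]
```

Because only the draw of ``osd.2`` changes when its weight is raised, every key that moves in this sketch moves to ``osd.2``; mappings between the other two items are untouched.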
760 | ||
761 | jewel (CRUSH_TUNABLES5) | |
762 | ----------------------- | |
763 | ||
764 | The jewel tunable profile improves the | |
765 | overall behavior of CRUSH such that significantly fewer mappings | |
766 | change when an OSD is marked out of the cluster. | |
767 | ||
768 | The new tunable is: | |
769 | ||
770 | * ``chooseleaf_stable``: Whether a recursive chooseleaf attempt will | |
771 | use a better value for an inner loop that greatly reduces the number | |
772 | of mapping changes when an OSD is marked out. The legacy value is 0, | |
773 | while the new value of 1 uses the new approach. | |
774 | ||
775 | Migration impact: | |
776 | ||
777 | * Changing this value on an existing cluster will result in a very | |
778 | large amount of data movement as almost every PG mapping is likely | |
779 | to change. | |
780 | ||
781 | ||
782 | ||
783 | ||
784 | Which client versions support CRUSH_TUNABLES | |
785 | -------------------------------------------- | |
786 | ||
787 | * argonaut series, v0.48.1 or later | |
788 | * v0.49 or later | |
789 | * Linux kernel version v3.6 or later (for the file system and RBD kernel clients) | |
790 | ||
791 | Which client versions support CRUSH_TUNABLES2 | |
792 | --------------------------------------------- | |
793 | ||
794 | * v0.55 or later, including bobtail series (v0.56.x) | |
795 | * Linux kernel version v3.9 or later (for the file system and RBD kernel clients) | |
796 | ||
797 | Which client versions support CRUSH_TUNABLES3 | |
798 | --------------------------------------------- | |
799 | ||
800 | * v0.78 (firefly) or later | |
801 | * Linux kernel version v3.15 or later (for the file system and RBD kernel clients) | |
802 | ||
803 | Which client versions support CRUSH_V4 | |
804 | -------------------------------------- | |
805 | ||
806 | * v0.94 (hammer) or later | |
807 | * Linux kernel version v4.1 or later (for the file system and RBD kernel clients) | |
808 | ||
809 | Which client versions support CRUSH_TUNABLES5 | |
810 | --------------------------------------------- | |
811 | ||
812 | * v10.0.2 (jewel) or later | |
813 | * Linux kernel version v4.5 or later (for the file system and RBD kernel clients) | |
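
The kernel requirements listed above can be collected into a small lookup. The following is a minimal sketch that only compares upstream ``major.minor`` version numbers copied from these lists; the helper names ``parse_kernel`` and ``supported_features`` are hypothetical, and the sketch assumes the kernel carries no backported CRUSH support.

```python
# Minimum Linux kernel versions for the kernel clients, copied from the lists above.
MIN_KERNEL = {
    "CRUSH_TUNABLES": (3, 6),
    "CRUSH_TUNABLES2": (3, 9),
    "CRUSH_TUNABLES3": (3, 15),
    "CRUSH_V4": (4, 1),
    "CRUSH_TUNABLES5": (4, 5),
}


def parse_kernel(version):
    """Turn a string such as '4.4.0-210-generic' into (major, minor)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor.split("-")[0])


def supported_features(kernel_version):
    """List the CRUSH feature sets a kernel client of this version supports."""
    kernel = parse_kernel(kernel_version)
    return [name for name, minimum in MIN_KERNEL.items() if kernel >= minimum]
```

For example, ``supported_features("3.10.0")`` yields only ``CRUSH_TUNABLES`` and ``CRUSH_TUNABLES2``, so such a client would not work with the firefly profile or later.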
814 | ||
815 | Warning when tunables are non-optimal | |
816 | ------------------------------------- | |
817 | ||
818 | Starting with version v0.74, Ceph will issue a health warning if the | |
819 | current CRUSH tunables don't include all the optimal values from the | |
820 | ``default`` profile (see below for the meaning of the ``default`` profile). | |
821 | To make this warning go away, you have two options: | |
822 | ||
823 | 1. Adjust the tunables on the existing cluster. Note that this will | |
824 | result in some data movement (possibly as much as 10%). This is the | |
825 | preferred route, but should be taken with care on a production cluster | |
826 | where the data movement may affect performance. You can enable optimal | |
827 | tunables with:: | |
828 | ||
829 | ceph osd crush tunables optimal | |
830 | ||
831 | If things go poorly (e.g., too much load) and not very much | |
832 | progress has been made, or there is a client compatibility problem | |
833 | (old kernel cephfs or rbd clients, or pre-bobtail librados | |
834 | clients), you can switch back with:: | |
835 | ||
836 | ceph osd crush tunables legacy | |
837 | ||
838 | 2. You can make the warning go away without making any changes to CRUSH by | |
839 | adding the following option to your ceph.conf ``[mon]`` section:: | |
840 | ||
841 | mon warn on legacy crush tunables = false | |
842 | ||
843 | For the change to take effect, you will need to restart the monitors, or | |
844 | apply the option to running monitors with:: | |
845 | ||
846 | ceph tell mon.\* injectargs --no-mon-warn-on-legacy-crush-tunables | |
847 | ||
848 | ||
849 | A few important points | |
850 | ---------------------- | |
851 | ||
852 | * Adjusting these values will result in the shift of some PGs between | |
853 | storage nodes. If the Ceph cluster is already storing a lot of | |
854 | data, be prepared for some fraction of the data to move. | |
855 | * The ``ceph-osd`` and ``ceph-mon`` daemons will start requiring the | |
856 | feature bits of new connections as soon as they get | |
857 | the updated map. However, already-connected clients are | |
858 | effectively grandfathered in, and will misbehave if they do not | |
859 | support the new feature. | |
860 | * If the CRUSH tunables are set to non-legacy values and then later | |
861 | changed back to the default values, ``ceph-osd`` daemons will not be | |
862 | required to support the feature. However, the OSD peering process | |
863 | requires examining and understanding old maps. Therefore, you | |
864 | should not run old versions of the ``ceph-osd`` daemon | |
865 | if the cluster has previously used non-legacy CRUSH values, even if | |
866 | the latest version of the map has been switched back to using the | |
867 | legacy defaults. | |
868 | ||
869 | Tuning CRUSH | |
870 | ------------ | |
871 | ||
872 | The simplest way to adjust the CRUSH tunables is by changing to a known | |
873 | profile. Those are: | |
874 | ||
875 | * ``legacy``: the legacy behavior from argonaut and earlier. | |
876 | * ``argonaut``: the legacy values supported by the original argonaut release | |
877 | * ``bobtail``: the values supported by the bobtail release | |
878 | * ``firefly``: the values supported by the firefly release | |
c07f9fc5 FG |
879 | * ``hammer``: the values supported by the hammer release |
880 | * ``jewel``: the values supported by the jewel release | |
7c673cae FG |
881 | * ``optimal``: the best (i.e., optimal) values for the current version of Ceph |
882 | * ``default``: the default values of a new cluster installed from | |
883 | scratch. These values, which depend on the current version of Ceph, | |
884 | are hard coded and are generally a mix of optimal and legacy values. | |
885 | These values generally match the ``optimal`` profile of the previous | |
886 | LTS release, or the most recent release for which we generally expect | |
887 | most users to have up-to-date clients. | |
888 | ||
889 | You can select a profile on a running cluster with the command:: | |
890 | ||
891 | ceph osd crush tunables {PROFILE} | |
892 | ||
893 | Note that this may result in some data movement. | |
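
When the profile change is scripted, it can help to validate the profile name before calling the CLI. The sketch below builds the ``ceph osd crush tunables {PROFILE}`` command shown above; the function names are hypothetical, and the profile list is copied from this section and may differ in other Ceph versions.

```python
import subprocess

# Profile names as listed in this section.
KNOWN_PROFILES = (
    "legacy", "argonaut", "bobtail", "firefly",
    "hammer", "jewel", "optimal", "default",
)


def tunables_command(profile):
    """Build the argument list for 'ceph osd crush tunables {PROFILE}'."""
    if profile not in KNOWN_PROFILES:
        raise ValueError(f"unknown CRUSH tunables profile: {profile!r}")
    return ["ceph", "osd", "crush", "tunables", profile]


def apply_profile(profile):
    """Run the command; requires a reachable cluster and the ceph CLI."""
    subprocess.run(tunables_command(profile), check=True)
```

Only ``apply_profile`` touches the cluster; ``tunables_command`` can be used on its own to check a name, and a typo such as ``"hamer"`` is rejected before anything is executed.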
894 | ||
895 | ||
c07f9fc5 | 896 | .. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf |
7c673cae | 897 | |
7c673cae | 898 | |
c07f9fc5 FG |
899 | Primary Affinity |
900 | ================ | |
7c673cae | 901 | |
c07f9fc5 FG |
902 | When a Ceph Client reads or writes data, it always contacts the primary OSD in |
903 | the acting set. For set ``[2, 3, 4]``, ``osd.2`` is the primary. Sometimes an | |
904 | OSD is not well suited to act as a primary compared to other OSDs (e.g., it has | |
905 | a slow disk or a slow controller). To prevent performance bottlenecks | |
906 | (especially on read operations) while maximizing utilization of your hardware, | |
907 | you can set a Ceph OSD's primary affinity so that CRUSH is less likely to use | |
908 | the OSD as a primary in an acting set. :: | |
7c673cae | 909 | |
c07f9fc5 | 910 | ceph osd primary-affinity <osd-id> <weight> |
7c673cae | 911 | |
c07f9fc5 FG |
912 | Primary affinity is ``1`` by default (*i.e.,* an OSD may act as a primary). You |
913 | may set an OSD's primary affinity to any value from ``0`` to ``1``, where ``0`` means that the OSD may | |
914 | **NOT** be used as a primary and ``1`` means that an OSD may be used as a | |
915 | primary. When the weight is ``< 1``, it is less likely that CRUSH will select | |
916 | the Ceph OSD Daemon to act as a primary. | |
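
The effect of a fractional primary affinity can be pictured with a toy model. This is not Ceph's implementation: it assumes, for illustration only, that each OSD in the acting set is accepted as primary with probability equal to its affinity, in order, and that the first OSD is used if none accepts. The name ``pick_primary`` is hypothetical.

```python
import random


def pick_primary(acting_set, affinity, rng):
    """Toy model: accept each OSD as primary with probability = its affinity."""
    for osd in acting_set:
        # OSDs without an explicit entry keep the default affinity of 1.
        if rng.random() < affinity.get(osd, 1.0):
            return osd
    return acting_set[0]  # fallback when every OSD declined (an assumption)


rng = random.Random(42)
acting_set = [2, 3, 4]

# With osd.2 at affinity 0.5, count how often it still ends up primary.
trials = 10000
as_primary = sum(
    pick_primary(acting_set, {2: 0.5}, rng) == 2 for _ in range(trials)
)
share = as_primary / trials
```

In this model ``osd.2`` is primary for roughly half of the placement groups at affinity ``0.5``, always at ``1``, and never at ``0`` (the role then falls to ``osd.3``).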
7c673cae | 917 | |
7c673cae | 918 | |
7c673cae | 919 |