============
CRUSH Maps
============

The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm
determines how to store and retrieve data by computing data storage locations.
CRUSH empowers Ceph clients to communicate with OSDs directly rather than
through a centralized server or broker. With an algorithmically determined
method of storing and retrieving data, Ceph avoids a single point of failure, a
performance bottleneck, and a physical limit to its scalability.

CRUSH requires a map of your cluster, and uses the CRUSH map to pseudo-randomly
store and retrieve data in OSDs with a uniform distribution of data across the
cluster. For a detailed discussion of CRUSH, see
`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_.

CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)`, a list of
'buckets' for aggregating the devices into physical locations, and a list of
rules that tell CRUSH how it should replicate data in a Ceph cluster's pools. By
reflecting the underlying physical organization of the installation, CRUSH can
model—and thereby address—potential sources of correlated device failures.
Typical sources include physical proximity, a shared power source, and a shared
network. By encoding this information into the cluster map, CRUSH placement
policies can separate object replicas across different failure domains while
still maintaining the desired distribution. For example, to address the
possibility of concurrent failures, it may be desirable to ensure that data
replicas are on devices using different shelves, racks, power supplies,
controllers, and/or physical locations.

When you deploy OSDs they are automatically placed within the CRUSH map under a
``host`` node named with the hostname for the host they are running on. This,
combined with the default CRUSH failure domain, ensures that replicas or
erasure code shards are separated across hosts and a single host failure will
not affect availability. For larger clusters, however, administrators should
carefully consider their choice of failure domain. Separating replicas across
racks, for example, is common for mid- to large-sized clusters.


CRUSH Location
==============

The location of an OSD in terms of the CRUSH map's hierarchy is
referred to as a ``crush location``. This location specifier takes the
form of a list of key and value pairs describing a position. For
example, if an OSD is in a particular row, rack, chassis and host, and
is part of the 'default' CRUSH tree (this is the case for the vast
majority of clusters), its crush location could be described as::

   root=default row=a rack=a2 chassis=a2a host=a2a1

Note:

#. The order of the keys does not matter.
#. The key name (left of ``=``) must be a valid CRUSH ``type``. By default
   these include root, datacenter, room, row, pod, pdu, rack, chassis and host,
   but those types can be customized to be anything appropriate by modifying
   the CRUSH map.
#. Not all keys need to be specified. For example, by default, Ceph
   automatically sets a ``ceph-osd`` daemon's location to be
   ``root=default host=HOSTNAME`` (based on the output from ``hostname -s``).

The crush location for an OSD is normally expressed via the ``crush location``
config option being set in the ``ceph.conf`` file. Each time the OSD starts,
it verifies that it is in the correct location in the CRUSH map and, if it is
not, it moves itself. To disable this automatic CRUSH map management, add the
following to your configuration file in the ``[osd]`` section::

   osd crush update on start = false
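
For illustration, an explicit location could be pinned for a single daemon in
``ceph.conf`` with an entry like the following sketch (the row, rack, and
chassis values here are hypothetical)::

   [osd.0]
   crush location = root=default row=a rack=a2 chassis=a2a host=a2a1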

Custom location hooks
---------------------

A customized location hook can be used to generate a more complete
crush location on startup. The sample ``ceph-crush-location`` utility
will generate a CRUSH location string for a given daemon. The
location is based on, in order of preference:

#. A ``crush location`` option in ceph.conf.
#. A default of ``root=default host=HOSTNAME`` where the hostname is
   generated with the ``hostname -s`` command.

This is not useful by itself, as the OSD itself has the exact same
behavior. However, the script can be modified to provide additional
location fields (for example, the rack or datacenter), and then the
hook enabled via the config option::

   crush location hook = /path/to/customized-ceph-crush-location

This hook is passed several arguments (below) and should output a single line
to stdout with the CRUSH location description::

   $ ceph-crush-location --cluster CLUSTER --id ID --type TYPE

where the cluster name is typically 'ceph', the id is the daemon
identifier (the OSD number), and the daemon type is typically ``osd``.
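
A modified hook might look like the following minimal sketch (reading the
rack name from ``/etc/rack`` is a hypothetical local convention, not
something Ceph provides)::

   #!/bin/sh
   # Hypothetical hook: ignore the --cluster/--id/--type arguments,
   # read this host's rack name from a local file, and print a single
   # CRUSH location line on stdout.
   RACK=$(cat /etc/rack 2>/dev/null || echo unknown)
   echo "root=default rack=$RACK host=$(hostname -s)"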

CRUSH structure
===============

The CRUSH map consists of, loosely speaking, a hierarchy describing
the physical topology of the cluster, and a set of rules defining
policy about how we place data on those devices. The hierarchy has
devices (``ceph-osd`` daemons) at the leaves, and internal nodes
corresponding to other physical features or groupings: hosts, racks,
rows, datacenters, and so on. The rules describe how replicas are
placed in terms of that hierarchy (e.g., 'three replicas in different
racks').

Devices
-------

Devices are individual ``ceph-osd`` daemons that can store data. You
will normally have one defined here for each OSD daemon in your
cluster. Devices are identified by an id (a non-negative integer) and
a name, normally ``osd.N`` where ``N`` is the device id.

Devices may also have a *device class* associated with them (e.g.,
``hdd`` or ``ssd``), allowing them to be conveniently targeted by a
crush rule.

Types and Buckets
-----------------

A bucket is the CRUSH term for internal nodes in the hierarchy: hosts,
racks, rows, etc. The CRUSH map defines a series of *types* that are
used to describe these nodes. By default, these types include:

- osd (or device)
- host
- chassis
- rack
- row
- pdu
- pod
- room
- datacenter
- region
- root

Most clusters make use of only a handful of these types, and others
can be defined as needed.

The hierarchy is built with devices (normally type ``osd``) at the
leaves, interior nodes with non-device types, and a root node of type
``root``. For example,

.. ditaa::

                           +-----------------+
                           | {o}root default |
                           +--------+--------+
                                    |
                    +---------------+---------------+
                    |                               |
             +------+------+                 +------+------+
             | {o}host foo |                 | {o}host bar |
             +------+------+                 +------+------+
                    |                               |
            +-------+-------+               +-------+-------+
            |               |               |               |
      +-----+-----+   +-----+-----+   +-----+-----+   +-----+-----+
      |   osd.0   |   |   osd.1   |   |   osd.2   |   |   osd.3   |
      +-----------+   +-----------+   +-----------+   +-----------+

Each node (device or bucket) in the hierarchy has a *weight*
associated with it, indicating the relative proportion of the total
data that device or hierarchy subtree should store. Weights are set
at the leaves, indicating the size of the device, and automatically
sum up the tree from there, such that the weight of the default node
will be the total of all devices contained beneath it. Normally
weights are in units of terabytes (TB).

You can get a simple view of the CRUSH hierarchy for your cluster,
including the weights, with::

   ceph osd crush tree

Rules
-----

Rules define policy about how data is distributed across the devices
in the hierarchy.

CRUSH rules define placement and replication strategies or
distribution policies that allow you to specify exactly how CRUSH
places object replicas. For example, you might create a rule selecting
a pair of targets for 2-way mirroring, another rule for selecting
three targets in two different data centers for 3-way mirroring, and
yet another rule for erasure coding over six storage devices. For a
detailed discussion of CRUSH rules, refer to `CRUSH - Controlled,
Scalable, Decentralized Placement of Replicated Data`_, and more
specifically to **Section 3.2**.

In almost all cases, CRUSH rules can be created via the CLI by
specifying the *pool type* they will be used for (replicated or
erasure coded), the *failure domain*, and optionally a *device class*.
In rare cases rules must be written by hand by manually editing the
CRUSH map.

You can see what rules are defined for your cluster with::

   ceph osd crush rule ls

You can view the contents of the rules with::

   ceph osd crush rule dump

Device classes
--------------

Each device can optionally have a *class* associated with it. By
default, OSDs automatically set their class on startup to either
``hdd``, ``ssd``, or ``nvme`` based on the type of device they are
backed by.

The device class for one or more OSDs can be explicitly set with::

   ceph osd crush set-device-class <class> <osd-name> [...]

Once a device class is set, it cannot be changed to another class
until the old class is unset with::

   ceph osd crush rm-device-class <osd-name> [...]

This allows administrators to set device classes without the class
being changed on OSD restart or by some other script.
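
For example, two hypothetical OSDs could be moved into the ``ssd`` class by
first clearing their automatically assigned class (the OSD names here are
illustrative)::

   ceph osd crush rm-device-class osd.0 osd.1
   ceph osd crush set-device-class ssd osd.0 osd.1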

A placement rule that targets a specific device class can be created with::

   ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>

A pool can then be changed to use the new rule with::

   ceph osd pool set <pool-name> crush_rule <rule-name>
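
As a worked sketch (the rule and pool names here are hypothetical), a pool's
data could be constrained to SSDs with::

   ceph osd crush rule create-replicated fast default host ssd
   ceph osd pool set rbd crush_rule fast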

Device classes are implemented by creating a "shadow" CRUSH hierarchy
for each device class in use that contains only devices of that class.
Rules can then distribute data over the shadow hierarchy. One nice
thing about this approach is that it is fully backward compatible with
old Ceph clients. You can view the CRUSH hierarchy with shadow items
with::

   ceph osd crush tree --show-shadow

Weight sets
-----------

A *weight set* is an alternative set of weights to use when
calculating data placement. The normal weights associated with each
device in the CRUSH map are set based on the device size and indicate
how much data we *should* be storing where. However, because CRUSH is
based on a pseudorandom placement process, there is always some
variation from this ideal distribution, the same way that rolling a
die sixty times will not result in exactly ten ones and ten sixes.
Weight sets allow the cluster to do a numerical optimization based on
the specifics of your cluster (hierarchy, pools, etc.) to achieve a
balanced distribution.

There are two types of weight sets supported:

#. A **compat** weight set is a single alternative set of weights for
   each device and node in the cluster. This is not well-suited for
   correcting all anomalies (for example, placement groups for
   different pools may be different sizes and have different load
   levels, but will be mostly treated the same by the balancer).
   However, compat weight sets have the huge advantage that they are
   *backward compatible* with previous versions of Ceph, which means
   that even though weight sets were first introduced in Luminous
   v12.2.z, older clients (e.g., firefly) can still connect to the
   cluster when a compat weight set is being used to balance data.
#. A **per-pool** weight set is more flexible in that it allows
   placement to be optimized for each data pool. Additionally,
   weights can be adjusted for each position of placement, allowing
   the optimizer to correct for a subtle skew of data toward devices
   with small weights relative to their peers (an effect that is
   usually only apparent in very large clusters but which can cause
   balancing problems).

When weight sets are in use, the weights associated with each node in
the hierarchy are visible as a separate column (labeled either
``(compat)`` or the pool name) in the output of::

   ceph osd crush tree

When both *compat* and *per-pool* weight sets are in use, data
placement for a particular pool will use its own per-pool weight set
if present. If not, it will use the compat weight set if present. If
neither is present, it will use the normal CRUSH weights.

Although weight sets can be set up and manipulated by hand, it is
recommended that the *balancer* module be enabled to do so
automatically.
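
For example, the module can be switched on with commands along these lines (a
sketch; ``crush-compat`` mode optimizes placement via a compat weight set)::

   ceph mgr module enable balancer
   ceph balancer mode crush-compat
   ceph balancer on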


Modifying the CRUSH map
=======================

.. _addosd:

Add/Move an OSD
---------------

.. note:: OSDs are normally automatically added to the CRUSH map when
   the OSD is created. This command is rarely needed.

To add or move an OSD in the CRUSH map of a running cluster::

   ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...]

Where:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``


``weight``

:Description: The CRUSH weight for the OSD, normally its size measured in terabytes (TB).
:Type: Double
:Required: Yes
:Example: ``2.0``


``root``

:Description: The root node of the tree in which the OSD resides (normally ``default``).
:Type: Key/value pair.
:Required: Yes
:Example: ``root=default``


``bucket-type``

:Description: You may specify the OSD's location in the CRUSH hierarchy.
:Type: Key/value pairs.
:Required: No
:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``


The following example adds ``osd.0`` to the hierarchy, or moves the
OSD from a previous location::

   ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1

Adjust OSD weight
-----------------

.. note:: Normally OSDs automatically add themselves to the CRUSH map
   with the correct weight when they are created. This command
   is rarely needed.

To adjust an OSD's crush weight in the CRUSH map of a running cluster, execute
the following::

   ceph osd crush reweight {name} {weight}

Where:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``


``weight``

:Description: The CRUSH weight for the OSD.
:Type: Double
:Required: Yes
:Example: ``2.0``
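
For example, using the values above, to set the weight of ``osd.0`` to
``2.0``::

   ceph osd crush reweight osd.0 2.0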

.. _removeosd:

Remove an OSD
-------------

.. note:: OSDs are normally removed from the CRUSH map as part of the
   ``ceph osd purge`` command. This command is rarely needed.

To remove an OSD from the CRUSH map of a running cluster, execute the
following::

   ceph osd crush remove {name}

Where:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``

Add a Bucket
------------

.. note:: Buckets are normally implicitly created when an OSD is added
   that specifies a ``{bucket-type}={bucket-name}`` as part of its
   location and a bucket with that name does not already exist. This
   command is typically used when manually adjusting the structure of the
   hierarchy after OSDs have been created (for example, to move a
   series of hosts underneath a new rack-level bucket).

To add a bucket in the CRUSH map of a running cluster, execute the
``ceph osd crush add-bucket`` command::

   ceph osd crush add-bucket {bucket-name} {bucket-type}

Where:

``bucket-name``

:Description: The full name of the bucket.
:Type: String
:Required: Yes
:Example: ``rack12``


``bucket-type``

:Description: The type of the bucket. The type must already exist in the hierarchy.
:Type: String
:Required: Yes
:Example: ``rack``


The following example adds the ``rack12`` bucket to the hierarchy::

   ceph osd crush add-bucket rack12 rack

Move a Bucket
-------------

To move a bucket to a different location or position in the CRUSH map
hierarchy, execute the following::

   ceph osd crush move {bucket-name} {bucket-type}={bucket-name} [...]

Where:

``bucket-name``

:Description: The name of the bucket to move/reposition.
:Type: String
:Required: Yes
:Example: ``foo-bar-1``

``bucket-type``

:Description: You may specify the bucket's location in the CRUSH hierarchy.
:Type: Key/value pairs.
:Required: No
:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``
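
For example, using the values above, the ``foo-bar-1`` host bucket could be
repositioned with::

   ceph osd crush move foo-bar-1 datacenter=dc1 room=room1 row=foo rack=bar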

Remove a Bucket
---------------

To remove a bucket from the CRUSH map hierarchy, execute the following::

   ceph osd crush remove {bucket-name}

.. note:: A bucket must be empty before removing it from the CRUSH hierarchy.

Where:

``bucket-name``

:Description: The name of the bucket that you'd like to remove.
:Type: String
:Required: Yes
:Example: ``rack12``

The following example removes the ``rack12`` bucket from the hierarchy::

   ceph osd crush remove rack12

Creating a compat weight set
----------------------------

.. note:: This step is normally done automatically by the ``balancer``
   module when enabled.

To create a *compat* weight set::

   ceph osd crush weight-set create-compat

Weights for the compat weight set can be adjusted with::

   ceph osd crush weight-set reweight-compat {name} {weight}

The compat weight set can be destroyed with::

   ceph osd crush weight-set rm-compat

Creating per-pool weight sets
-----------------------------

To create a weight set for a specific pool::

   ceph osd crush weight-set create {pool-name} {mode}

.. note:: Per-pool weight sets require that all servers and daemons
   run Luminous v12.2.z or later.

Where:

``pool-name``

:Description: The name of a RADOS pool
:Type: String
:Required: Yes
:Example: ``rbd``

``mode``

:Description: Either ``flat`` or ``positional``. A *flat* weight set
              has a single weight for each device or bucket. A
              *positional* weight set has a potentially different
              weight for each position in the resulting placement
              mapping. For example, if a pool has a replica count of
              3, then a positional weight set will have three weights
              for each device and bucket.
:Type: String
:Required: Yes
:Example: ``flat``

To adjust the weight of an item in a weight set::

   ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]}
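
For a *positional* weight set on a pool with three replicas, three weights
are given per item; for example (the pool name and weights here are
illustrative)::

   ceph osd crush weight-set reweight rbd osd.0 0.9 1.0 1.0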

To list existing weight sets::

   ceph osd crush weight-set ls

To remove a weight set::

   ceph osd crush weight-set rm {pool-name}

Creating a rule for a replicated pool
-------------------------------------

For a replicated pool, the primary decision when creating the CRUSH
rule is what the failure domain is going to be. For example, if a
failure domain of ``host`` is selected, then CRUSH will ensure that
each replica of the data is stored on a different host. If ``rack``
is selected, then each replica will be stored in a different rack.
What failure domain you choose primarily depends on the size of your
cluster and how your hierarchy is structured.

Normally, the entire cluster hierarchy is nested beneath a root node
named ``default``. If you have customized your hierarchy, you may
want to create a rule nested at some other node in the hierarchy. It
doesn't matter what type is associated with that node (it doesn't have
to be a ``root`` node).

It is also possible to create a rule that restricts data placement to
a specific *class* of device. By default, Ceph OSDs automatically
classify themselves as either ``hdd`` or ``ssd``, depending on the
underlying type of device being used. These classes can also be
customized.

To create a replicated rule::

   ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}]

Where:

``name``

:Description: The name of the rule
:Type: String
:Required: Yes
:Example: ``rbd-rule``

``root``

:Description: The name of the node under which data should be placed.
:Type: String
:Required: Yes
:Example: ``default``

``failure-domain-type``

:Description: The type of CRUSH nodes across which we should separate replicas.
:Type: String
:Required: Yes
:Example: ``rack``

``class``

:Description: The device class data should be placed on.
:Type: String
:Required: No
:Example: ``ssd``
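
Using the example values above, a rule that separates replicas across racks
and restricts placement to SSDs would be created with::

   ceph osd crush rule create-replicated rbd-rule default rack ssd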

Creating a rule for an erasure coded pool
-----------------------------------------

For an erasure-coded pool, the same basic decisions need to be made as
with a replicated pool: what is the failure domain, what node in the
hierarchy will data be placed under (usually ``default``), and will
placement be restricted to a specific device class. Erasure code
pools are created a bit differently, however, because they need to be
constructed carefully based on the erasure code being used. For this reason,
you must include this information in the *erasure code profile*. A CRUSH
rule will then be created from that either explicitly or automatically when
the profile is used to create a pool.

The erasure code profiles can be listed with::

   ceph osd erasure-code-profile ls

An existing profile can be viewed with::

   ceph osd erasure-code-profile get {profile-name}

Normally profiles should never be modified; instead, a new profile
should be created and used when creating a new pool or creating a new
rule for an existing pool.

An erasure code profile consists of a set of key=value pairs. Most of
these control the behavior of the erasure code that is encoding data
in the pool. Those that begin with ``crush-``, however, affect the
CRUSH rule that is created.

The erasure code profile properties of interest are:

* **crush-root**: the name of the CRUSH node to place data under [default: ``default``].
* **crush-failure-domain**: the CRUSH type to separate erasure-coded shards across [default: ``host``].
* **crush-device-class**: the device class to place data on [default: none, meaning all devices are used].
* **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the number of erasure code shards, affecting the resulting CRUSH rule.
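
For example, a hypothetical profile that spreads six (k=4, m=2) shards
across racks of HDDs might be created with (the profile name is
illustrative)::

   ceph osd erasure-code-profile set myprofile k=4 m=2 crush-failure-domain=rack crush-device-class=hdd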

Once a profile is defined, you can create a CRUSH rule with::

   ceph osd crush rule create-erasure {name} {profile-name}

.. note:: When creating a new pool, it is not actually necessary to
   explicitly create the rule. If the erasure code profile alone is
   specified and the rule argument is left off then Ceph will create
   the CRUSH rule automatically.

Deleting rules
--------------

Rules that are not in use by pools can be deleted with::

   ceph osd crush rule rm {rule-name}


Tunables
========

Over time, we have made (and continue to make) improvements to the
CRUSH algorithm used to calculate the placement of data. In order to
support the change in behavior, we have introduced a series of tunable
options that control whether the legacy or improved variation of the
algorithm is used.

In order to use newer tunables, both clients and servers must support
the new version of CRUSH. For this reason, we have created
``profiles`` that are named after the Ceph version in which they were
introduced. For example, the ``firefly`` tunables are first supported
in the firefly release, and will not work with older (e.g., dumpling)
clients. Once a given set of tunables is changed from the legacy
default behavior, the ``ceph-mon`` and ``ceph-osd`` daemons will
prevent older clients that do not support the new CRUSH features from
connecting to the cluster.

argonaut (legacy)
-----------------

The legacy CRUSH behavior used by argonaut and older releases works
fine for most clusters, provided there are not too many OSDs that have
been marked out.

bobtail (CRUSH_TUNABLES2)
-------------------------

The bobtail tunable profile fixes a few key misbehaviors:

* For hierarchies with a small number of devices in the leaf buckets,
  some PGs map to fewer than the desired number of replicas. This
  commonly happens for hierarchies with "host" nodes with a small
  number (1-3) of OSDs nested beneath each one.

* For large clusters, a small percentage of PGs map to fewer than the
  desired number of OSDs. This is more prevalent when there are
  several layers of the hierarchy (e.g., row, rack, host, osd).

* When some OSDs are marked out, the data tends to get redistributed
  to nearby OSDs instead of across the entire hierarchy.

The new tunables are:

* ``choose_local_tries``: Number of local retries. Legacy value is
  2, optimal value is 0.

* ``choose_local_fallback_tries``: Legacy value is 5, optimal value
  is 0.

* ``choose_total_tries``: Total number of attempts to choose an item.
  Legacy value was 19, subsequent testing indicates that a value of
  50 is more appropriate for typical clusters. For extremely large
  clusters, a larger value might be necessary.

* ``chooseleaf_descend_once``: Whether a recursive chooseleaf attempt
  will retry, or only try once and allow the original placement to
  retry. Legacy default is 0, optimal value is 1.

Migration impact:

* Moving from argonaut to bobtail tunables triggers a moderate amount
  of data movement. Use caution on a cluster that is already
  populated with data.

firefly (CRUSH_TUNABLES3)
-------------------------

The firefly tunable profile fixes a problem
with the ``chooseleaf`` CRUSH rule behavior that tends to result in PG
mappings with too few results when too many OSDs have been marked out.

The new tunable is:

* ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will
  start with a non-zero value of r, based on how many attempts the
  parent has already made. Legacy default is 0, but with this value
  CRUSH is sometimes unable to find a mapping. The optimal value (in
  terms of computational cost and correctness) is 1.

Migration impact:

* For existing clusters that have lots of existing data, changing
  from 0 to 1 will cause a lot of data to move; a value of 4 or 5
  will allow CRUSH to find a valid mapping but will make less data
  move.

straw_calc_version tunable (introduced with Firefly too)
--------------------------------------------------------

There were some problems with the internal weights calculated and
stored in the CRUSH map for ``straw`` buckets. Specifically, when
there were items with a CRUSH weight of 0, or a mix of different and
duplicated weights, CRUSH would distribute data incorrectly (i.e.,
not in proportion to the weights).

The new tunable is:

* ``straw_calc_version``: A value of 0 preserves the old, broken
  internal weight calculation; a value of 1 fixes the behavior.

Migration impact:

* Moving to straw_calc_version 1 and then adjusting a straw bucket
  (by adding, removing, or reweighting an item, or by using the
  reweight-all command) can trigger a small to moderate amount of
  data movement *if* the cluster has hit one of the problematic
  conditions.

This tunable option is special because it has no impact on the kernel
version required on the client side.

hammer (CRUSH_V4)
-----------------

The hammer tunable profile does not affect the
mapping of existing CRUSH maps simply by changing the profile. However:

* There is a new bucket type (``straw2``) supported. The new
  ``straw2`` bucket type fixes several limitations in the original
  ``straw`` bucket. Specifically, the old ``straw`` buckets would
  change some mappings that should not have changed when a weight was
  adjusted, while ``straw2`` achieves the original goal of only
  changing mappings to or from the bucket item whose weight has
  changed.

* ``straw2`` is the default for any newly created buckets.

Migration impact:

* Changing a bucket type from ``straw`` to ``straw2`` will result in
  a reasonably small amount of data movement, depending on how much
  the bucket item weights vary from each other. When the weights are
  all the same no data will move, and when item weights vary
  significantly there will be more movement.

jewel (CRUSH_TUNABLES5)
-----------------------

The jewel tunable profile improves the
overall behavior of CRUSH such that significantly fewer mappings
change when an OSD is marked out of the cluster.

The new tunable is:

* ``chooseleaf_stable``: Whether a recursive chooseleaf attempt will
  use a better value for an inner loop that greatly reduces the number
  of mapping changes when an OSD is marked out. The legacy value is 0,
  while the new value of 1 uses the new approach.

Migration impact:

* Changing this value on an existing cluster will result in a very
  large amount of data movement as almost every PG mapping is likely
  to change.

Which client versions support CRUSH_TUNABLES
--------------------------------------------

* argonaut series, v0.48.1 or later
* v0.49 or later
* Linux kernel version v3.6 or later (for the file system and RBD kernel clients)

Which client versions support CRUSH_TUNABLES2
---------------------------------------------

* v0.55 or later, including bobtail series (v0.56.x)
* Linux kernel version v3.9 or later (for the file system and RBD kernel clients)

Which client versions support CRUSH_TUNABLES3
---------------------------------------------

* v0.78 (firefly) or later
* Linux kernel version v3.15 or later (for the file system and RBD kernel clients)

Which client versions support CRUSH_V4
--------------------------------------

* v0.94 (hammer) or later
* Linux kernel version v4.1 or later (for the file system and RBD kernel clients)

Which client versions support CRUSH_TUNABLES5
---------------------------------------------

* v10.0.2 (jewel) or later
* Linux kernel version v4.5 or later (for the file system and RBD kernel clients)

Warning when tunables are non-optimal
-------------------------------------

Starting with version v0.74, Ceph will issue a health warning if the
current CRUSH tunables don't include all the optimal values from the
``default`` profile (see below for the meaning of the ``default`` profile).
To make this warning go away, you have two options:

1. Adjust the tunables on the existing cluster. Note that this will
   result in some data movement (possibly as much as 10%). This is the
   preferred route, but should be taken with care on a production cluster
   where the data movement may affect performance. You can enable optimal
   tunables with::

      ceph osd crush tunables optimal

   If things go poorly (e.g., too much load) and not very much
   progress has been made, or there is a client compatibility problem
   (old kernel cephfs or rbd clients, or pre-bobtail librados
   clients), you can switch back with::

      ceph osd crush tunables legacy

2. You can make the warning go away without making any changes to CRUSH by
   adding the following option to your ceph.conf ``[mon]`` section::

      mon warn on legacy crush tunables = false

   For the change to take effect, you will need to restart the monitors, or
   apply the option to running monitors with::

      ceph tell mon.\* injectargs --no-mon-warn-on-legacy-crush-tunables

A few important points
----------------------

* Adjusting these values will result in the shift of some PGs between
  storage nodes. If the Ceph cluster is already storing a lot of
  data, be prepared for some fraction of the data to move.
* The ``ceph-osd`` and ``ceph-mon`` daemons will start requiring the
  feature bits of new connections as soon as they get
  the updated map. However, already-connected clients are
  effectively grandfathered in, and will misbehave if they do not
  support the new feature.
* If the CRUSH tunables are set to non-legacy values and then later
  changed back to the default values, ``ceph-osd`` daemons will not be
  required to support the feature. However, the OSD peering process
  requires examining and understanding old maps. Therefore, you
  should not run old versions of the ``ceph-osd`` daemon
  if the cluster has previously used non-legacy CRUSH values, even if
  the latest version of the map has been switched back to using the
  legacy defaults.

Tuning CRUSH
------------

The simplest way to adjust the CRUSH tunables is by changing to a known
profile. Those are:

* ``legacy``: the legacy behavior from argonaut and earlier.
* ``argonaut``: the legacy values supported by the original argonaut release
* ``bobtail``: the values supported by the bobtail release
* ``firefly``: the values supported by the firefly release
* ``hammer``: the values supported by the hammer release
* ``jewel``: the values supported by the jewel release
* ``optimal``: the best (i.e., optimal) values of the current version of Ceph
* ``default``: the default values of a new cluster installed from
  scratch. These values, which depend on the current version of Ceph,
  are hard coded and are generally a mix of optimal and legacy values.
  These values generally match the ``optimal`` profile of the previous
  LTS release, or the most recent release for which we generally expect
  most users to have up-to-date clients.

You can select a profile on a running cluster with the command::

   ceph osd crush tunables {PROFILE}

Note that this may result in some data movement.
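
The tunable values currently in effect can be inspected with::

   ceph osd crush show-tunables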


.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf


Primary Affinity
================

When a Ceph Client reads or writes data, it always contacts the primary OSD in
the acting set. For set ``[2, 3, 4]``, ``osd.2`` is the primary. Sometimes an
OSD is not well suited to act as a primary compared to other OSDs (e.g., it has
a slow disk or a slow controller). To prevent performance bottlenecks
(especially on read operations) while maximizing utilization of your hardware,
you can set a Ceph OSD's primary affinity so that CRUSH is less likely to use
the OSD as a primary in an acting set. ::

   ceph osd primary-affinity <osd-id> <weight>

Primary affinity is ``1`` by default (*i.e.,* an OSD may act as a primary). You
may set an OSD's primary affinity to any value in the range ``0``-``1``, where
``0`` means that the OSD may **NOT** be used as a primary and ``1`` means that
an OSD may be used as a primary. When the weight is ``< 1``, it is less likely
that CRUSH will select the Ceph OSD Daemon to act as a primary.
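
For example, to halve the likelihood that ``osd.2`` is selected as a
primary::

   ceph osd primary-affinity osd.2 0.5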