.. _placement groups:

==================
 Placement Groups
==================

.. _pg-autoscaler:

Autoscaling placement groups
============================

Placement groups (PGs) are an internal implementation detail of how
Ceph distributes data. You may enable *pg-autoscaling* to allow the
cluster to make recommendations or automatically adjust the number of
PGs (``pg_num``) for each pool based on expected cluster and pool
utilization.

Each pool has a ``pg_autoscale_mode`` property that can be set to
``off``, ``on``, or ``warn``:

* ``off``: Disable autoscaling for this pool. It is up to the
  administrator to choose an appropriate ``pg_num`` for each pool.
  Please refer to :ref:`choosing-number-of-placement-groups` for more
  information.
* ``on``: Enable automated adjustments of the PG count for the given pool.
* ``warn``: Raise health alerts when the PG count should be adjusted.

To set the autoscaling mode for an existing pool::

  ceph osd pool set <pool-name> pg_autoscale_mode <mode>

For example, to enable autoscaling on pool ``foo``::

  ceph osd pool set foo pg_autoscale_mode on

You can also configure the default ``pg_autoscale_mode`` that is
set on any pools that are subsequently created::

  ceph config set global osd_pool_default_pg_autoscale_mode <mode>

You can disable or enable the autoscaler for all pools with
the ``noautoscale`` flag. By default this flag is ``off``, but you can
turn it ``on`` with::

  ceph osd pool set noautoscale

You can turn it ``off`` with::

  ceph osd pool unset noautoscale

To ``get`` the value of the flag, use::

  ceph osd pool get noautoscale

Viewing PG scaling recommendations
----------------------------------

You can view each pool, its relative utilization, and any suggested
changes to the PG count with this command::

  ceph osd pool autoscale-status

Output will be something like::

  POOL  SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
  a     12900M               3.0   82431M        0.4695                                 8     128                 warn       True
  c     0                    3.0   82431M        0.0000  0.2000        0.9884           1.0   1       64          warn       True
  b     0       953.6M       3.0   82431M        0.0347                                 8                         warn       False

**SIZE** is the amount of data stored in the pool. **TARGET SIZE**, if
present, is the amount of data the administrator has specified that
they expect to eventually be stored in this pool. The system uses
the larger of the two values for its calculation.

**RATE** is the multiplier for the pool that determines how much raw
storage capacity is consumed. For example, a 3-replica pool will
have a rate of 3.0, while a k=4,m=2 erasure coded pool will have a
rate of 1.5.

**RAW CAPACITY** is the total amount of raw storage capacity on the
OSDs that are responsible for storing this pool's (and perhaps other
pools') data. **RATIO** is the fraction of that total capacity that
this pool is consuming (i.e., ratio = size * rate / raw capacity).
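
For illustration, the RATIO for pool ``a`` in the sample output above can be reproduced with a couple of lines of Python (the figures are taken from that sample, not from a live cluster):

```python
# RATIO = size * rate / raw capacity, as defined above.
size_mb = 12900          # SIZE of pool "a" from the sample output, in MB
rate = 3.0               # RATE: a 3-replica pool
raw_capacity_mb = 82431  # RAW CAPACITY from the sample output

ratio = size_mb * rate / raw_capacity_mb
print(round(ratio, 4))   # 0.4695, matching the RATIO column for pool "a"
```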

**TARGET RATIO**, if present, is the ratio of storage that the
administrator has specified that they expect this pool to consume
relative to other pools with target ratios set.
If both target size bytes and ratio are specified, the
ratio takes precedence.

**EFFECTIVE RATIO** is the target ratio after adjusting in two ways:

1. Subtracting any capacity expected to be used by pools with target
   size set.
2. Normalizing the target ratios among pools with target ratio set so
   they collectively target the rest of the space. For example, 4
   pools with target_ratio 1.0 would have an effective ratio of 0.25.

The system uses the larger of the actual ratio and the effective ratio
for its calculation.
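
The normalization in step 2 can be sketched as follows. This is a simplified model for illustration only, not the autoscaler's actual code, and it ignores the capacity subtracted in step 1:

```python
def effective_ratios(target_ratios):
    """Normalize target ratios so that pools with a target ratio set
    collectively target all of the remaining space (simplified)."""
    total = sum(target_ratios.values())
    return {pool: ratio / total for pool, ratio in target_ratios.items()}

# Four pools with target_ratio 1.0 each get an effective ratio of 0.25:
print(effective_ratios({"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}))
```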

**BIAS** is used as a multiplier to manually adjust a pool's PG count
based on prior information about how many PGs a specific pool is
expected to have.

**PG_NUM** is the current number of PGs for the pool (or the current
number of PGs that the pool is working towards, if a ``pg_num``
change is in progress). **NEW PG_NUM**, if present, is what the
system believes the pool's ``pg_num`` should be changed to. It is
always a power of 2, and will only be present if the "ideal" value
varies from the current value by more than a factor of 3 by default.
This factor can be adjusted with::

  ceph osd pool set threshold 2.0
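
The decision sketched above (suggest a change only when the ideal value is off by more than the threshold factor, then round to a power of two) might look like this hypothetical helper; it is an illustration, not the autoscaler's implementation:

```python
import math

def suggest_pg_num(current, ideal, threshold=3.0):
    """Return a suggested power-of-two pg_num, or None when the ideal
    value is within `threshold` (3x by default) of the current value."""
    if current > 0 and max(ideal / current, current / ideal) <= threshold:
        return None               # close enough: no change is suggested
    return 1 << max(0, round(math.log2(ideal)))  # nearest power of two

print(suggest_pg_num(32, 64))   # within 3x of each other -> None
print(suggest_pg_num(1, 50))    # far off -> 64
```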

**AUTOSCALE** is the pool's ``pg_autoscale_mode``
and will be either ``on``, ``off``, or ``warn``.

The final column, **BULK**, indicates whether the pool is ``bulk``
and will be either ``True`` or ``False``. A ``bulk`` pool
is expected to be large and should start out with a large number of
PGs for performance purposes. Pools without the ``bulk`` flag, on the
other hand, are expected to be smaller, e.g., the ``.mgr`` pool or
meta pools.

Automated scaling
-----------------

Allowing the cluster to automatically scale ``pg_num`` based on usage
is the simplest approach. Ceph will look at the total available
storage and target number of PGs for the whole system, look at how
much data is stored in each pool, and try to apportion PGs
accordingly. The system is relatively conservative with its approach,
only making changes to a pool when the current number of PGs
(``pg_num``) is more than a factor of 3 off from what it thinks it
should be.

The target number of PGs per OSD is based on the
``mon_target_pg_per_osd`` configurable (default: 100), which can be
adjusted with::

  ceph config set global mon_target_pg_per_osd 100
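
The apportioning described above can be modeled roughly as follows. This is a simplified single-root sketch, not Ceph's implementation, and the helper name is made up for illustration:

```python
def target_pgs_per_pool(pool_bytes, n_osds, replica_size,
                        target_pg_per_osd=100):
    """Split the cluster-wide PG budget among pools in proportion to
    how much data each pool stores (before power-of-two rounding)."""
    total_bytes = sum(pool_bytes.values()) or 1
    # Replication multiplies PG placements, so divide the budget by it.
    budget = n_osds * target_pg_per_osd / replica_size
    return {pool: budget * b / total_bytes for pool, b in pool_bytes.items()}

# 10 OSDs at 100 PGs/OSD with 3x replication: ~333 PGs split 90/10.
print(target_pgs_per_pool({"big": 900, "small": 100}, 10, 3))
```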

The autoscaler analyzes pools and adjusts on a per-subtree basis.
Because each pool may map to a different CRUSH rule, and each rule may
distribute data across different devices, Ceph will consider the
utilization of each subtree of the hierarchy independently. For
example, a pool that maps to OSDs of class `ssd` and a pool that maps
to OSDs of class `hdd` will each have optimal PG counts that depend on
the number of those respective device types.

In the case where a pool uses OSDs under two or more CRUSH roots
(e.g., shadow trees with both `ssd` and `hdd` devices), the autoscaler
will issue a warning to the user in the manager log stating the name
of the pool and the set of roots that overlap each other. The
autoscaler will not scale any pools with overlapping roots because
this can cause problems with the scaling process. We recommend making
each pool belong to only one root (one OSD class) to get rid of the
warning and ensure a successful scaling process.

The autoscaler uses the `bulk` flag to determine which pools
should start out with a full complement of PGs; such pools are
scaled down only when the usage ratio across the pool is not even.
Pools without the `bulk` flag, however, start out with a minimal
number of PGs, which is increased only when there is more usage in
the pool.

To create a pool with the `bulk` flag::

  ceph osd pool create <pool-name> --bulk

To set or unset the `bulk` flag of an existing pool::

  ceph osd pool set <pool-name> bulk <true/false/1/0>

To get the `bulk` flag of an existing pool::

  ceph osd pool get <pool-name> bulk

.. _specifying_pool_target_size:

Specifying expected pool size
-----------------------------

When a cluster or pool is first created, it will consume a small
fraction of the total cluster capacity and will appear to the system
as if it should only need a small number of placement groups.
However, in most cases cluster administrators have a good idea which
pools are expected to consume most of the system capacity over time.
By providing this information to Ceph, a more appropriate number of
PGs can be used from the beginning, preventing subsequent changes in
``pg_num`` and the overhead associated with moving data around when
those adjustments are made.

The *target size* of a pool can be specified in two ways: either in
terms of the absolute size of the pool (i.e., bytes), or as a weight
relative to other pools with a ``target_size_ratio`` set.

For example::

  ceph osd pool set mypool target_size_bytes 100T

will tell the system that `mypool` is expected to consume 100 TiB of
space. Alternatively::

  ceph osd pool set mypool target_size_ratio 1.0

will tell the system that `mypool` is expected to consume 1.0 relative
to the other pools with ``target_size_ratio`` set. If `mypool` is the
only pool in the cluster, this means an expected use of 100% of the
total capacity. If there is a second pool with ``target_size_ratio``
1.0, both pools would expect to use 50% of the cluster capacity.

You can also set the target size of a pool at creation time with the
optional ``--target-size-bytes <bytes>`` or ``--target-size-ratio
<ratio>`` arguments to the ``ceph osd pool create`` command.

Note that if impossible target size values are specified (for example,
a capacity larger than the total cluster) then a health warning
(``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised.

If both ``target_size_ratio`` and ``target_size_bytes`` are specified
for a pool, only the ratio will be considered, and a health warning
(``POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO``) will be issued.

Specifying bounds on a pool's PGs
---------------------------------

It is also possible to specify a minimum number of PGs for a pool.
This is useful for establishing a lower bound on the amount of
parallelism clients will see when doing IO, even when a pool is mostly
empty. Setting the lower bound prevents Ceph from reducing (or
recommending you reduce) the PG number below the configured number.

You can set the minimum or maximum number of PGs for a pool with::

  ceph osd pool set <pool-name> pg_num_min <num>
  ceph osd pool set <pool-name> pg_num_max <num>

You can also specify the minimum or maximum PG count at pool creation
time with the optional ``--pg-num-min <num>`` or ``--pg-num-max
<num>`` arguments to the ``ceph osd pool create`` command.

.. _preselection:

A preselection of pg_num
========================

When creating a new pool with::

  ceph osd pool create {pool-name} [pg_num]

choosing the value of ``pg_num`` is optional. If you do not
specify ``pg_num``, the cluster can (by default) automatically tune it
for you based on how much data is stored in the pool (see above,
:ref:`pg-autoscaler`).

Alternatively, ``pg_num`` can be explicitly provided. However,
whether you specify a ``pg_num`` value or not does not affect whether
the value is automatically tuned by the cluster after the fact. To
enable or disable auto-tuning::

  ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn)

The "rule of thumb" for PGs per OSD has traditionally been 100. With
the addition of the balancer (which is also enabled by default), a
value closer to 50 PGs per OSD is probably reasonable. The
challenge (which the autoscaler normally handles for you) is to:

- have the PGs per pool proportional to the data in the pool, and
- end up with 50-100 PGs per OSD, after the replication or
  erasure-coding fan-out of each PG across OSDs is taken into
  consideration

How are Placement Groups used?
==============================

A placement group (PG) aggregates objects within a pool because
tracking object placement and object metadata on a per-object basis is
computationally expensive--i.e., a system with millions of objects
cannot realistically track placement on a per-object basis.

.. ditaa::

   /-----\  /-----\  /-----\  /-----\  /-----\
   | obj |  | obj |  | obj |  | obj |  | obj |
   \-----/  \-----/  \-----/  \-----/  \-----/
      |        |        |        |        |
      +--------+--------+        +--------+
               |                          |
               v                          v
   +-----------------------+  +-----------------------+
   |  Placement Group #1   |  |  Placement Group #2   |
   |                       |  |                       |
   +-----------------------+  +-----------------------+
               |                          |
               +------------+-------------+
                            |
                            v
                +-----------------------+
                |         Pool          |
                |                       |
                +-----------------------+

The Ceph client will calculate which placement group an object should
be in. It does this by hashing the object ID and applying an operation
based on the number of PGs in the defined pool and the ID of the pool.
See `Mapping PGs to OSDs`_ for details.
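
A drastically simplified sketch of that mapping is shown below. Ceph actually uses its own hash function and a "stable modulo" scheme so that PG boundaries move minimally as ``pg_num`` changes, so real PG ids will differ; this only illustrates the idea of hashing an object ID into one of a pool's PGs:

```python
import zlib

def object_to_pg(pool_id, object_name, pg_num):
    """Hash the object name and reduce it into one of the pool's PGs
    (illustrative only; Ceph's real placement differs in detail)."""
    h = zlib.crc32(object_name.encode())
    return f"{pool_id}.{h % pg_num:x}"   # PG ids are written pool.seed

print(object_to_pg(1, "my-object", 128))
```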

The object's contents within a placement group are stored in a set of
OSDs. For instance, in a replicated pool of size two, each placement
group will store objects on two OSDs, as shown below.

.. ditaa::

   +-----------------------+  +-----------------------+
   |  Placement Group #1   |  |  Placement Group #2   |
   |                       |  |                       |
   +-----------------------+  +-----------------------+
        |             |             |             |
        v             v             v             v
   /----------\  /----------\  /----------\  /----------\
   |          |  |          |  |          |  |          |
   |  OSD #1  |  |  OSD #2  |  |  OSD #2  |  |  OSD #3  |
   |          |  |          |  |          |  |          |
   \----------/  \----------/  \----------/  \----------/

Should OSD #2 fail, another will be assigned to Placement Group #1 and
will be filled with copies of all objects in OSD #1. If the pool size
is changed from two to three, an additional OSD will be assigned to
the placement group and will receive copies of all objects in the
placement group.

Placement groups do not own the OSD; they share it with other
placement groups from the same pool or even other pools. If OSD #2
fails, Placement Group #2 will also have to restore copies of its
objects, using OSD #3.

When the number of placement groups increases, the new placement
groups will be assigned OSDs. The result of the CRUSH function will
also change, and some objects from the former placement groups will be
copied over to the new placement groups and removed from the old ones.

Placement Groups Tradeoffs
==========================

Data durability and even distribution among all OSDs call for more
placement groups, but their number should be reduced to the minimum
needed to save CPU and memory.

.. _data durability:

Data durability
---------------

After an OSD fails, the risk of data loss increases until the data it
contained is fully recovered. Let's imagine a scenario that causes
permanent data loss in a single placement group:

- The OSD fails and all copies of the objects it contains are lost.
  For all objects within the placement group the number of replicas
  suddenly drops from three to two.

- Ceph starts recovery for this placement group by choosing a new OSD
  to re-create the third copy of all objects.

- Another OSD, within the same placement group, fails before the new
  OSD is fully populated with the third copy. Some objects will then
  only have one surviving copy.

- Ceph picks yet another OSD and keeps copying objects to restore the
  desired number of copies.

- A third OSD, within the same placement group, fails before recovery
  is complete. If this OSD contained the only remaining copy of an
  object, it is permanently lost.

In a cluster containing 10 OSDs with 512 placement groups in a
three-replica pool, CRUSH will give each placement group three OSDs.
In the end, each OSD will end up hosting (512 * 3) / 10 = ~150
placement groups. When the first OSD fails, the above scenario will
therefore start recovery for all 150 placement groups at the same
time.

The 150 placement groups being recovered are likely to be
homogeneously spread over the 9 remaining OSDs. Each remaining OSD is
therefore likely to send copies of objects to all others and also
receive some new objects to be stored because it became part of a
new placement group.

The amount of time it takes for this recovery to complete entirely
depends on the architecture of the Ceph cluster. Let's say each OSD is
hosted by a 1 TB SSD on a single machine, all of them are connected
to a 10 Gb/s switch, and the recovery for a single OSD completes
within M minutes. If there are two OSDs per machine using spinners
with no SSD journal and a 1 Gb/s switch, it will be at least an order
of magnitude slower.

In a cluster of this size, the number of placement groups has almost
no influence on data durability. It could be 128 or 8192 and the
recovery would not be slower or faster.

However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs
is likely to speed up recovery and therefore improve data durability
significantly. Each OSD now participates in only ~75 placement groups
instead of ~150 when there were only 10 OSDs, and it will still
require all 19 remaining OSDs to perform the same amount of object
copies in order to recover. But where 10 OSDs had to copy
approximately 100 GB each, they now have to copy 50 GB each instead.
If the network was the bottleneck, recovery will happen twice as
fast. In other words, recovery goes faster when the number of OSDs
increases.

If this cluster grows to 40 OSDs, each of them will only host ~35
placement groups. If an OSD dies, recovery will keep going faster
unless it is blocked by another bottleneck. However, if this cluster
grows to 200 OSDs, each of them will only host ~7 placement groups. If
an OSD dies, recovery will involve at most ~21 (7 * 3) OSDs
in these placement groups: recovery will take longer than when there
were 40 OSDs, meaning the number of placement groups should be
increased.

No matter how short the recovery time is, there is a chance for a
second OSD to fail while it is in progress. In the 10-OSD cluster
described above, if any of them fails, then ~17 placement groups
(i.e. ~150 / 9 placement groups being recovered) will only have one
surviving copy. And if any of the 8 remaining OSDs fails, the last
objects of two placement groups are likely to be lost (i.e. ~17 / 8
placement groups with only one remaining copy being recovered).

When the size of the cluster grows to 20 OSDs, the number of placement
groups damaged by the loss of three OSDs drops. The second OSD lost
will degrade ~4 (i.e. ~75 / 19 placement groups being recovered)
instead of ~17, and the third OSD lost will only lose data if it is
one of the four OSDs containing the surviving copy. In other words, if
the probability of losing one OSD during the recovery time frame is
0.0001%, it goes from 17 * 10 * 0.0001% in the cluster with 10 OSDs to
4 * 20 * 0.0001% in the cluster with 20 OSDs.
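
The closing arithmetic can be checked directly, using the example's own simplifying assumptions about independent failures:

```python
p_osd_loss = 0.0001 / 100       # 0.0001% chance of losing one OSD
                                # during the recovery window

# 10-OSD cluster: any of the 10 possible first failures degrades
# ~17 PGs to a single surviving copy.
risk_10 = 17 * 10 * p_osd_loss

# 20-OSD cluster: only ~4 PGs are degraded per failure.
risk_20 = 4 * 20 * p_osd_loss

print(risk_10, risk_20)         # the 20-OSD cluster's exposure is
                                # roughly half that of the 10-OSD one
```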

In a nutshell, more OSDs mean faster recovery and a lower risk of
cascading failures leading to the permanent loss of a placement
group. Having 512 or 4096 placement groups is roughly equivalent in a
cluster with fewer than 50 OSDs as far as data durability is
concerned.

Note: It may take a long time for a new OSD added to the cluster to be
populated with the placement groups that were assigned to it. However
there is no degradation of any object, and it has no impact on the
durability of the data contained in the cluster.

.. _object distribution:

Object distribution within a pool
---------------------------------

Ideally objects are evenly distributed in each placement group. Since
CRUSH computes the placement group for each object, but does not
actually know how much data is stored in each OSD within this
placement group, the ratio between the number of placement groups and
the number of OSDs may influence the distribution of the data
significantly.

For instance, if there was a single placement group for ten OSDs in a
three-replica pool, only three OSDs would be used because CRUSH would
have no other choice. When more placement groups are available,
objects are more likely to be evenly spread among them. CRUSH also
makes every effort to evenly spread OSDs among all existing placement
groups.

As long as there are one or two orders of magnitude more placement
groups than OSDs, the distribution should be even. For instance, 256
placement groups for 3 OSDs, 512 or 1024 placement groups for 10
OSDs, etc.

Uneven data distribution can be caused by factors other than the ratio
between OSDs and placement groups. Since CRUSH does not take into
account the size of the objects, a few very large objects may create
an imbalance. Let's say one million 4K objects totaling 4 GB are
evenly spread among 1024 placement groups on 10 OSDs. They will use
4 GB / 10 = 400 MB on each OSD. If one 400 MB object is added to the
pool, the three OSDs supporting the placement group in which the
object has been placed will be filled with 400 MB + 400 MB = 800 MB
while the seven others will remain occupied with only 400 MB.
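
The arithmetic in this example is easy to verify (the figures come from the example itself):

```python
n_osds = 10
small_objects_mb = 4000                  # one million 4K objects ~= 4 GB
per_osd_mb = small_objects_mb / n_osds   # 400 MB on every OSD

big_object_mb = 400
# The three OSDs holding the big object's PG each gain the full 400 MB:
loaded_osd_mb = per_osd_mb + big_object_mb   # 800 MB on 3 OSDs
other_osd_mb = per_osd_mb                    # 400 MB on the other 7

print(loaded_osd_mb, other_osd_mb)
```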

.. _resource usage:

Memory, CPU and network usage
-----------------------------

For each placement group, OSDs and MONs need memory, network and CPU
at all times, and even more during recovery. Sharing this overhead by
clustering objects within a placement group is one of the main reasons
they exist.

Minimizing the number of placement groups saves significant amounts of
resources.
484 | .. _choosing-number-of-placement-groups: |
485 | ||
7c673cae FG |
486 | Choosing the number of Placement Groups |
487 | ======================================= | |
488 | ||
11fdf7f2 TL |
489 | .. note: It is rarely necessary to do this math by hand. Instead, use the ``ceph osd pool autoscale-status`` command in combination with the ``target_size_bytes`` or ``target_size_ratio`` pool properties. See :ref:`pg-autoscaler` for more information. |
490 | ||
7c673cae FG |
491 | If you have more than 50 OSDs, we recommend approximately 50-100 |
492 | placement groups per OSD to balance out resource usage, data | |
11fdf7f2 | 493 | durability and distribution. If you have less than 50 OSDs, choosing |
7c673cae | 494 | among the `preselection`_ above is best. For a single pool of objects, |
f67539c2 | 495 | you can use the following formula to get a baseline |
7c673cae | 496 | |
f67539c2 | 497 | Total PGs = :math:`\frac{OSDs \times 100}{pool \: size}` |
7c673cae FG |
498 | |
499 | Where **pool size** is either the number of replicas for replicated | |
500 | pools or the K+M sum for erasure coded pools (as returned by **ceph | |
501 | osd erasure-code-profile get**). | |
502 | ||
503 | You should then check if the result makes sense with the way you | |
504 | designed your Ceph cluster to maximize `data durability`_, | |
505 | `object distribution`_ and minimize `resource usage`_. | |
506 | ||

The result should always be **rounded up to the nearest power of two**.

Only a power of two will evenly balance the number of objects among
placement groups. Other values will result in an uneven distribution
of data across your OSDs. Their use should be limited to incrementally
stepping from one power of two to another.
513 | |
514 | As an example, for a cluster with 200 OSDs and a pool size of 3 | |
f67539c2 | 515 | replicas, you would estimate your number of PGs as follows |
7c673cae | 516 | |
f67539c2 | 517 | :math:`\frac{200 \times 100}{3} = 6667`. Nearest power of 2: 8192 |
7c673cae FG |
518 | |
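
The baseline formula plus the power-of-two rounding can be wrapped in a small helper; this is a sketch, and the function name is made up for illustration:

```python
import math

def baseline_pg_count(n_osds, pool_size, pgs_per_osd=100):
    """Baseline total PGs = (OSDs * 100) / pool size, rounded up to
    the nearest power of two."""
    raw = n_osds * pgs_per_osd / pool_size
    return 1 << math.ceil(math.log2(raw))

print(baseline_pg_count(200, 3))   # 200 * 100 / 3 = ~6667 -> 8192
```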

When using multiple data pools for storing objects, you need to ensure
that you balance the number of placement groups per pool with the
number of placement groups per OSD so that you arrive at a reasonable
total number of placement groups that provides reasonably low variance
per OSD without taxing system resources or making the peering process
too slow.

For instance a cluster of 10 pools each with 512 placement groups on
ten OSDs is a total of 5,120 placement groups spread over ten OSDs,
that is 512 placement groups per OSD. That does not use too many
resources. However, if 1,000 pools were created with 512 placement
groups each, the OSDs will handle ~50,000 placement groups each and it
would require significantly more resources and time for peering.

You may find the `PGCalc`_ tool helpful.

.. _setting the number of placement groups:

Set the Number of Placement Groups
==================================

To set the number of placement groups in a pool, you must specify the
number of placement groups at the time you create the pool.
See `Create a Pool`_ for details. You can also change the number of
placement groups after a pool has been created with::

  ceph osd pool set {pool-name} pg_num {pg_num}

After you increase the number of placement groups, you must also
increase the number of placement groups for placement (``pgp_num``)
before your cluster will rebalance. The ``pgp_num`` is the number of
placement groups that will be considered for placement by the CRUSH
algorithm. Increasing ``pg_num`` splits the placement groups, but data
will not be migrated to the newer placement groups until the number of
placement groups for placement, i.e. ``pgp_num``, is increased. The
``pgp_num`` should be equal to the ``pg_num``. To increase the number
of placement groups for placement, execute the following::

  ceph osd pool set {pool-name} pgp_num {pgp_num}

When decreasing the number of PGs, ``pgp_num`` is adjusted
automatically for you.

Get the Number of Placement Groups
==================================

To get the number of placement groups in a pool, execute the
following::

  ceph osd pool get {pool-name} pg_num


Get a Cluster's PG Statistics
=============================

To get the statistics for the placement groups in your cluster,
execute the following::

  ceph pg dump [--format {format}]

Valid formats are ``plain`` (default) and ``json``.
580 | Get Statistics for Stuck PGs | |
581 | ============================ | |
582 | ||
583 | To get the statistics for all placement groups stuck in a specified state, | |
584 | execute the following:: | |
585 | ||
586 | ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>] | |
587 | ||
588 | **Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD | |
589 | with the most up-to-date data to come up and in. | |
590 | ||
591 | **Unclean** Placement groups contain objects that are not replicated the desired number | |
592 | of times. They should be recovering. | |
593 | ||
594 | **Stale** Placement groups are in an unknown state - the OSDs that host them have not | |
595 | reported to the monitor cluster in a while (configured by ``mon_osd_report_timeout``). | |
596 | ||
597 | Valid formats are ``plain`` (default) and ``json``. The threshold defines the minimum number | |
598 | of seconds the placement group is stuck before including it in the returned statistics | |
599 | (default 300 seconds). | |
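For example, to list the PGs that have been stuck ``inactive`` for at least
ten minutes, as JSON (a sketch built from the synopsis above; it requires a
running cluster to produce output):

```shell
# List PGs stuck in the "inactive" state for 600 seconds or more,
# emitting JSON for further processing.
ceph pg dump_stuck inactive --format json --threshold 600
```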

Get a PG Map
============

To get the placement group map for a particular placement group, execute the following::

    ceph pg map {pg-id}

For example::

    ceph pg map 1.6c

Ceph will return the placement group map, the placement group, and the OSD status::

    osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]


Get a PG's Statistics
=====================

To retrieve statistics for a particular placement group, execute the following::

    ceph pg {pg-id} query


Scrub a Placement Group
=======================

To scrub a placement group, execute the following::

    ceph pg scrub {pg-id}

Ceph checks the primary and any replica nodes, generates a catalog of all objects
in the placement group, and compares them to ensure that no objects are missing
or mismatched and that their contents are consistent. Assuming the replicas all
match, a final semantic sweep ensures that all snapshot-related object
metadata is consistent. Errors are reported via logs.

To scrub all placement groups from a specific pool, execute the following::

    ceph osd pool scrub {pool-name}

Prioritize backfill/recovery of Placement Group(s)
==================================================

You may run into a situation where several placement groups require
recovery and/or backfill, and some of them hold data more important
than others (for example, some PGs may hold data for images used by running
machines while others are used by inactive machines or hold less relevant data).
In that case, you may want to prioritize recovery of those groups so that
the performance and/or availability of the data stored on them is restored
earlier. To mark particular placement group(s) as prioritized during
backfill or recovery, execute the following::

    ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
    ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

This will cause Ceph to perform recovery or backfill on the specified placement
groups before other placement groups. This does not interrupt currently
ongoing backfills or recovery, but causes the specified PGs to be processed
as soon as possible. If you change your mind or prioritized the wrong groups,
use::

    ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
    ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

This will remove the "force" flag from those PGs, and they will be processed
in the default order. Again, this doesn't affect currently processed placement
groups, only those that are still queued.

The "force" flag is cleared automatically after recovery or backfill of the
group is done.
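As a concrete sketch (the PG IDs below are hypothetical; substitute the IDs
reported for your cluster, e.g. by ``ceph health detail``):

```shell
# Move two specific PGs to the front of the recovery queue.
ceph pg force-recovery 2.4 2.7

# Later, return them to the default queue order if needed.
ceph pg cancel-force-recovery 2.4 2.7
```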

Similarly, you may use the following commands to force Ceph to perform recovery
or backfill on all placement groups from a specified pool first::

    ceph osd pool force-recovery {pool-name}
    ceph osd pool force-backfill {pool-name}

or::

    ceph osd pool cancel-force-recovery {pool-name}
    ceph osd pool cancel-force-backfill {pool-name}

to restore the default recovery or backfill priority if you change your mind.

Note that these commands could possibly break the ordering of Ceph's internal
priority computations, so use them with caution!
In particular, if you have multiple pools that currently share the same
underlying OSDs, and some pools hold data more important than others,
we recommend you use the following command to re-arrange the
recovery/backfill priority of all pools in a better order::

    ceph osd pool set {pool-name} recovery_priority {value}

For example, if you have 10 pools you could make the most important one priority 10,
the next 9, and so on. Or you could leave most pools alone and have, say, 3 important
pools all at priority 1, or at priorities 3, 2, and 1 respectively.
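The second scheme might look like this (the pool names are hypothetical;
pools left untouched keep the default ``recovery_priority`` of 0):

```shell
# Give three important pools descending recovery priority; all other
# pools keep the default recovery_priority and are recovered last.
ceph osd pool set vm-images recovery_priority 3
ceph osd pool set databases recovery_priority 2
ceph osd pool set scratch   recovery_priority 1
```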

Revert Lost
===========

If the cluster has lost one or more objects, and you have decided to
abandon the search for the lost data, you must mark the unfound objects
as ``lost``.

If all possible locations have been queried and the objects are still
lost, you may have to give up on them. This is
possible given unusual combinations of failures that allow the cluster
to learn about writes that were performed before the writes themselves
were recovered.

The supported options are "revert" and "delete": "revert" will either roll
back to a previous version of the object or (if it was a new object) forget
about it entirely, while "delete" will remove the object outright. To mark
the "unfound" objects as "lost", execute the following::

    ceph pg {pg-id} mark_unfound_lost revert|delete

.. important:: Use this feature with caution, because it may confuse
   applications that expect the object(s) to exist.


.. toctree::
   :hidden:

   pg-states
   pg-concepts


.. _Create a Pool: ../pools#createpool
.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
.. _pgcalc: https://old.ceph.com/pgcalc/