1 .. _placement groups:
2
3 ==================
4 Placement Groups
5 ==================
6
7 .. _pg-autoscaler:
8
9 Autoscaling placement groups
10 ============================
11
12 Placement groups (PGs) are an internal implementation detail of how Ceph
13 distributes data. Autoscaling provides a way to manage PGs, and especially to
14 manage the number of PGs present in different pools. When *pg-autoscaling* is
15 enabled, the cluster is allowed to make recommendations or automatic
16 adjustments with respect to the number of PGs for each pool (``pgp_num``) in
17 accordance with expected cluster utilization and expected pool utilization.
18
19 Each pool has a ``pg_autoscale_mode`` property that can be set to ``off``,
20 ``on``, or ``warn``:
21
22 * ``off``: Disable autoscaling for this pool. It is up to the administrator to
23 choose an appropriate ``pgp_num`` for each pool. For more information, see
24 :ref:`choosing-number-of-placement-groups`.
25 * ``on``: Enable automated adjustments of the PG count for the given pool.
26 * ``warn``: Raise health checks when the PG count is in need of adjustment.
27
28 To set the autoscaling mode for an existing pool, run a command of the
29 following form:
30
31 .. prompt:: bash #
32
33 ceph osd pool set <pool-name> pg_autoscale_mode <mode>
34
35 For example, to enable autoscaling on pool ``foo``, run the following command:
36
37 .. prompt:: bash #
38
39 ceph osd pool set foo pg_autoscale_mode on
40
There is also a default ``pg_autoscale_mode`` setting that applies to any pools
created after the initial setup of the cluster. To change this setting, run a
command of the following form:
44
45 .. prompt:: bash #
46
47 ceph config set global osd_pool_default_pg_autoscale_mode <mode>
48
49 You can disable or enable the autoscaler for all pools with the ``noautoscale``
50 flag. By default, this flag is set to ``off``, but you can set it to ``on`` by
51 running the following command:
52
53 .. prompt:: bash #
54
55 ceph osd pool set noautoscale
56
57 To set the ``noautoscale`` flag to ``off``, run the following command:
58
59 .. prompt:: bash #
60
61 ceph osd pool unset noautoscale
62
63 To get the value of the flag, run the following command:
64
65 .. prompt:: bash #
66
67 ceph osd pool get noautoscale
68
69 Viewing PG scaling recommendations
70 ----------------------------------
71
72 To view each pool, its relative utilization, and any recommended changes to the
73 PG count, run the following command:
74
75 .. prompt:: bash #
76
77 ceph osd pool autoscale-status
78
79 The output will resemble the following::
80
   POOL  SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
   a     12900M               3.0   82431M        0.4695                                       8       128         warn       True
   c     0                    3.0   82431M        0.0000  0.2000        0.9884           1.0   1       64          warn       True
   b     0       953.6M       3.0   82431M        0.0347                                       8                   warn       False
85
86 - **POOL** is the name of the pool.
87
88 - **SIZE** is the amount of data stored in the pool.
89
90 - **TARGET SIZE** (if present) is the amount of data that is expected to be
91 stored in the pool, as specified by the administrator. The system uses the
92 greater of the two values for its calculation.
93
94 - **RATE** is the multiplier for the pool that determines how much raw storage
95 capacity is consumed. For example, a three-replica pool will have a ratio of
96 3.0, and a ``k=4 m=2`` erasure-coded pool will have a ratio of 1.5.
97
98 - **RAW CAPACITY** is the total amount of raw storage capacity on the specific
99 OSDs that are responsible for storing the data of the pool (and perhaps the
100 data of other pools).
101
- **RATIO** is the ratio of (1) the storage consumed by the pool to (2) the
  total raw storage capacity. In other words, RATIO is defined as
  (SIZE * RATE) / RAW CAPACITY. For example, pool ``a`` in the output above has
  a ratio of (12900M * 3.0) / 82431M = 0.4695.
105
106 - **TARGET RATIO** (if present) is the ratio of the expected storage of this
107 pool (that is, the amount of storage that this pool is expected to consume,
108 as specified by the administrator) to the expected storage of all other pools
109 that have target ratios set. If both ``target_size_bytes`` and
110 ``target_size_ratio`` are specified, then ``target_size_ratio`` takes
111 precedence.
112
113 - **EFFECTIVE RATIO** is the result of making two adjustments to the target
114 ratio:
115
116 #. Subtracting any capacity expected to be used by pools that have target
117 size set.
118
119 #. Normalizing the target ratios among pools that have target ratio set so
120 that collectively they target cluster capacity. For example, four pools
121 with target_ratio 1.0 would have an effective ratio of 0.25.
122
123 The system's calculations use whichever of these two ratios (that is, the
124 target ratio and the effective ratio) is greater.
125
- **BIAS** is used as a multiplier to manually adjust a pool's PG count in
  accordance with prior information about how many PGs a specific pool is
  expected to have.
129
130 - **PG_NUM** is either the current number of PGs associated with the pool or,
131 if a ``pg_num`` change is in progress, the current number of PGs that the
132 pool is working towards.
133
134 - **NEW PG_NUM** (if present) is the value that the system is recommending the
135 ``pg_num`` of the pool to be changed to. It is always a power of 2, and it is
136 present only if the recommended value varies from the current value by more
137 than the default factor of ``3``. To adjust this factor (in the following
138 example, it is changed to ``2``), run the following command:
139
140 .. prompt:: bash #
141
     ceph config set mgr mgr/pg_autoscaler/threshold 2.0
143
144 - **AUTOSCALE** is the pool's ``pg_autoscale_mode`` and is set to ``on``,
145 ``off``, or ``warn``.
146
- **BULK** determines whether the pool is ``bulk``. It has a value of ``True``
  or ``False``. A ``bulk`` pool is expected to be large and should initially
  have a large number of PGs so that performance does not suffer. On the other
  hand, a pool that is not ``bulk`` is expected to be small (for example, a
  ``.mgr`` pool or a metadata pool).
152
153 .. note::
154
155 If the ``ceph osd pool autoscale-status`` command returns no output at all,
156 there is probably at least one pool that spans multiple CRUSH roots. This
157 'spanning pool' issue can happen in scenarios like the following:
158 when a new deployment auto-creates the ``.mgr`` pool on the ``default``
159 CRUSH root, subsequent pools are created with rules that constrain them to a
160 specific shadow CRUSH tree. For example, if you create an RBD metadata pool
161 that is constrained to ``deviceclass = ssd`` and an RBD data pool that is
162 constrained to ``deviceclass = hdd``, you will encounter this issue. To
163 remedy this issue, constrain the spanning pool to only one device class. In
164 the above scenario, there is likely to be a ``replicated-ssd`` CRUSH rule in
165 effect, and the ``.mgr`` pool can be constrained to ``ssd`` devices by
   running the following command:
167
168 .. prompt:: bash #
169
      ceph osd pool set .mgr crush_rule replicated-ssd
172
173 This intervention will result in a small amount of backfill, but
174 typically this traffic completes quickly.
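
   One way to identify a spanning pool is to check which CRUSH rule each pool
   currently uses; for example:

   .. prompt:: bash #

      ceph osd pool ls detail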
175
176
177 Automated scaling
178 -----------------
179
180 In the simplest approach to automated scaling, the cluster is allowed to
181 automatically scale ``pgp_num`` in accordance with usage. Ceph considers the
182 total available storage and the target number of PGs for the whole system,
183 considers how much data is stored in each pool, and apportions PGs accordingly.
184 The system is conservative with its approach, making changes to a pool only
185 when the current number of PGs (``pg_num``) varies by more than a factor of 3
186 from the recommended number.
187
188 The target number of PGs per OSD is determined by the ``mon_target_pg_per_osd``
189 parameter (default: 100), which can be adjusted by running the following
190 command:
191
192 .. prompt:: bash #
193
194 ceph config set global mon_target_pg_per_osd 100
195
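To confirm the value that is currently in effect for the autoscaler, you can
query the configuration database; for example:

.. prompt:: bash #

   ceph config get mgr mon_target_pg_per_osd
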
196 The autoscaler analyzes pools and adjusts on a per-subtree basis. Because each
197 pool might map to a different CRUSH rule, and each rule might distribute data
198 across different devices, Ceph will consider the utilization of each subtree of
199 the hierarchy independently. For example, a pool that maps to OSDs of class
200 ``ssd`` and a pool that maps to OSDs of class ``hdd`` will each have optimal PG
201 counts that are determined by how many of these two different device types
202 there are.
203
204 If a pool uses OSDs under two or more CRUSH roots (for example, shadow trees
205 with both ``ssd`` and ``hdd`` devices), the autoscaler issues a warning to the
206 user in the manager log. The warning states the name of the pool and the set of
207 roots that overlap each other. The autoscaler does not scale any pools with
208 overlapping roots because this condition can cause problems with the scaling
209 process. We recommend constraining each pool so that it belongs to only one
210 root (that is, one OSD class) to silence the warning and ensure a successful
211 scaling process.
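
For example, to constrain a pool to a single device class, you can create a
replicated CRUSH rule for that class and assign it to the pool. The rule name
``replicated-hdd`` and the pool name ``mypool`` below are only illustrative:

.. prompt:: bash #

   ceph osd crush rule create-replicated replicated-hdd default host hdd
   ceph osd pool set mypool crush_rule replicated-hdd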
212
213 .. _managing_bulk_flagged_pools:
214
215 Managing pools that are flagged with ``bulk``
216 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
217
218 If a pool is flagged ``bulk``, then the autoscaler starts the pool with a full
219 complement of PGs and then scales down the number of PGs only if the usage
220 ratio across the pool is uneven. However, if a pool is not flagged ``bulk``,
221 then the autoscaler starts the pool with minimal PGs and creates additional PGs
222 only if there is more usage in the pool.
223
224 To create a pool that will be flagged ``bulk``, run the following command:
225
226 .. prompt:: bash #
227
228 ceph osd pool create <pool-name> --bulk
229
230 To set or unset the ``bulk`` flag of an existing pool, run the following
231 command:
232
233 .. prompt:: bash #
234
235 ceph osd pool set <pool-name> bulk <true/false/1/0>
236
237 To get the ``bulk`` flag of an existing pool, run the following command:
238
239 .. prompt:: bash #
240
241 ceph osd pool get <pool-name> bulk
242
243 .. _specifying_pool_target_size:
244
245 Specifying expected pool size
246 -----------------------------
247
248 When a cluster or pool is first created, it consumes only a small fraction of
249 the total cluster capacity and appears to the system as if it should need only
250 a small number of PGs. However, in some cases, cluster administrators know
251 which pools are likely to consume most of the system capacity in the long run.
252 When Ceph is provided with this information, a more appropriate number of PGs
253 can be used from the beginning, obviating subsequent changes in ``pg_num`` and
254 the associated overhead cost of relocating data.
255
256 The *target size* of a pool can be specified in two ways: either in relation to
257 the absolute size (in bytes) of the pool, or as a weight relative to all other
258 pools that have ``target_size_ratio`` set.
259
260 For example, to tell the system that ``mypool`` is expected to consume 100 TB,
261 run the following command:
262
263 .. prompt:: bash #
264
265 ceph osd pool set mypool target_size_bytes 100T
266
267 Alternatively, to tell the system that ``mypool`` is expected to consume a
268 ratio of 1.0 relative to other pools that have ``target_size_ratio`` set,
adjust the ``target_size_ratio`` setting of ``mypool`` by running the
270 following command:
271
272 .. prompt:: bash #
273
274 ceph osd pool set mypool target_size_ratio 1.0
275
If ``mypool`` is the only pool in the cluster, then it is expected to use 100% of
277 the total cluster capacity. However, if the cluster contains a second pool that
278 has ``target_size_ratio`` set to 1.0, then both pools are expected to use 50%
279 of the total cluster capacity.
280
281 The ``ceph osd pool create`` command has two command-line options that can be
282 used to set the target size of a pool at creation time: ``--target-size-bytes
283 <bytes>`` and ``--target-size-ratio <ratio>``.
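
For example, the target size ratio can also be set at pool-creation time (the
pool name ``mypool`` here is only illustrative):

.. prompt:: bash #

   ceph osd pool create mypool --target-size-ratio 1.0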
284
Note that if the target-size values that have been specified are impossible
(for example, a capacity larger than the total cluster capacity), then a health
check (``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised.
288
289 If both ``target_size_ratio`` and ``target_size_bytes`` are specified for a
290 pool, then the latter will be ignored, the former will be used in system
291 calculations, and a health check (``POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO``)
292 will be raised.
293
294 Specifying bounds on a pool's PGs
295 ---------------------------------
296
297 It is possible to specify both the minimum number and the maximum number of PGs
298 for a pool.
299
300 Setting a Minimum Number of PGs and a Maximum Number of PGs
301 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
302
303 If a minimum is set, then Ceph will not itself reduce (nor recommend that you
304 reduce) the number of PGs to a value below the configured value. Setting a
305 minimum serves to establish a lower bound on the amount of parallelism enjoyed
306 by a client during I/O, even if a pool is mostly empty.
307
308 If a maximum is set, then Ceph will not itself increase (or recommend that you
309 increase) the number of PGs to a value above the configured value.
310
311 To set the minimum number of PGs for a pool, run a command of the following
312 form:
313
314 .. prompt:: bash #
315
316 ceph osd pool set <pool-name> pg_num_min <num>
317
318 To set the maximum number of PGs for a pool, run a command of the following
319 form:
320
321 .. prompt:: bash #
322
323 ceph osd pool set <pool-name> pg_num_max <num>
324
325 In addition, the ``ceph osd pool create`` command has two command-line options
326 that can be used to specify the minimum or maximum PG count of a pool at
327 creation time: ``--pg-num-min <num>`` and ``--pg-num-max <num>``.
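
For example, a pool can be created with a lower bound of 32 PGs (the pool name
``mypool`` and the value ``32`` are only illustrative):

.. prompt:: bash #

   ceph osd pool create mypool --pg-num-min 32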
328
329 .. _preselection:
330
331 Preselecting pg_num
332 ===================
333
334 When creating a pool with the following command, you have the option to
335 preselect the value of the ``pg_num`` parameter:
336
337 .. prompt:: bash #
338
339 ceph osd pool create {pool-name} [pg_num]
340
341 If you opt not to specify ``pg_num`` in this command, the cluster uses the PG
342 autoscaler to automatically configure the parameter in accordance with the
343 amount of data that is stored in the pool (see :ref:`pg-autoscaler` above).
344
345 However, your decision of whether or not to specify ``pg_num`` at creation time
346 has no effect on whether the parameter will be automatically tuned by the
347 cluster afterwards. As seen above, autoscaling of PGs is enabled or disabled by
348 running a command of the following form:
349
350 .. prompt:: bash #
351
352 ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn)
353
354 Without the balancer, the suggested target is approximately 100 PG replicas on
355 each OSD. With the balancer, an initial target of 50 PG replicas on each OSD is
356 reasonable.
357
358 The autoscaler attempts to satisfy the following conditions:
359
360 - the number of PGs per OSD should be proportional to the amount of data in the
361 pool
362 - there should be 50-100 PGs per pool, taking into account the replication
363 overhead or erasure-coding fan-out of each PG's replicas across OSDs
364
365 Use of Placement Groups
366 =======================
367
368 A placement group aggregates objects within a pool. The tracking of RADOS
369 object placement and object metadata on a per-object basis is computationally
370 expensive. It would be infeasible for a system with millions of RADOS
371 objects to efficiently track placement on a per-object basis.
372
373 .. ditaa::
374 /-----\ /-----\ /-----\ /-----\ /-----\
375 | obj | | obj | | obj | | obj | | obj |
376 \-----/ \-----/ \-----/ \-----/ \-----/
377 | | | | |
378 +--------+--------+ +---+----+
379 | |
380 v v
381 +-----------------------+ +-----------------------+
382 | Placement Group #1 | | Placement Group #2 |
383 | | | |
384 +-----------------------+ +-----------------------+
385 | |
386 +------------------------------+
387 |
388 v
389 +-----------------------+
390 | Pool |
391 | |
392 +-----------------------+
393
394 The Ceph client calculates which PG a RADOS object should be in. As part of
395 this calculation, the client hashes the object ID and performs an operation
396 involving both the number of PGs in the specified pool and the pool ID. For
397 details, see `Mapping PGs to OSDs`_.
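
To see the result of this calculation for a particular object, you can ask the
cluster directly. For example (``mypool`` and ``myobject`` are illustrative
names, and the object does not need to exist):

.. prompt:: bash #

   ceph osd map mypool myobject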
398
399 The contents of a RADOS object belonging to a PG are stored in a set of OSDs.
400 For example, in a replicated pool of size two, each PG will store objects on
401 two OSDs, as shown below:
402
403 .. ditaa::
404 +-----------------------+ +-----------------------+
405 | Placement Group #1 | | Placement Group #2 |
406 | | | |
407 +-----------------------+ +-----------------------+
408 | | | |
409 v v v v
410 /----------\ /----------\ /----------\ /----------\
411 | | | | | | | |
412 | OSD #1 | | OSD #2 | | OSD #2 | | OSD #3 |
413 | | | | | | | |
414 \----------/ \----------/ \----------/ \----------/
415
416
417 If OSD #2 fails, another OSD will be assigned to Placement Group #1 and then
418 filled with copies of all objects in OSD #1. If the pool size is changed from
419 two to three, an additional OSD will be assigned to the PG and will receive
420 copies of all objects in the PG.
421
422 An OSD assigned to a PG is not owned exclusively by that PG; rather, the OSD is
423 shared with other PGs either from the same pool or from other pools. In our
424 example, OSD #2 is shared by Placement Group #1 and Placement Group #2. If OSD
425 #2 fails, then Placement Group #2 must restore copies of objects (by making use
426 of OSD #3).
427
428 When the number of PGs increases, several consequences ensue. The new PGs are
429 assigned OSDs. The result of the CRUSH function changes, which means that some
430 objects from the already-existing PGs are copied to the new PGs and removed
431 from the old ones.
432
433 Factors Relevant To Specifying pg_num
434 =====================================
435
436 On the one hand, the criteria of data durability and even distribution across
437 OSDs weigh in favor of a high number of PGs. On the other hand, the criteria of
438 saving CPU resources and minimizing memory usage weigh in favor of a low number
439 of PGs.
440
441 .. _data durability:
442
443 Data durability
444 ---------------
445
446 When an OSD fails, the risk of data loss is increased until replication of the
447 data it hosted is restored to the configured level. To illustrate this point,
448 let's imagine a scenario that results in permanent data loss in a single PG:
449
450 #. The OSD fails and all copies of the object that it contains are lost. For
451 each object within the PG, the number of its replicas suddenly drops from
452 three to two.
453
454 #. Ceph starts recovery for this PG by choosing a new OSD on which to re-create
455 the third copy of each object.
456
457 #. Another OSD within the same PG fails before the new OSD is fully populated
458 with the third copy. Some objects will then only have one surviving copy.
459
460 #. Ceph selects yet another OSD and continues copying objects in order to
461 restore the desired number of copies.
462
463 #. A third OSD within the same PG fails before recovery is complete. If this
464 OSD happened to contain the only remaining copy of an object, the object is
465 permanently lost.
466
In a cluster containing 10 OSDs with 512 PGs in a three-replica pool, CRUSH
will give each PG three OSDs. Ultimately, each OSD hosts
:math:`\frac{512 \times 3}{10} \approx 150` PGs. So when the first OSD fails in
the above scenario, recovery will begin for all 150 PGs at the same time.
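
To check how many PGs each OSD actually hosts, consult the ``PGS`` column in
the output of the following command:

.. prompt:: bash #

   ceph osd df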
471
472 The 150 PGs that are being recovered are likely to be homogeneously distributed
473 across the 9 remaining OSDs. Each remaining OSD is therefore likely to send
474 copies of objects to all other OSDs and also likely to receive some new objects
475 to be stored because it has become part of a new PG.
476
477 The amount of time it takes for this recovery to complete depends on the
478 architecture of the Ceph cluster. Compare two setups: (1) Each OSD is hosted by
479 a 1 TB SSD on a single machine, all of the OSDs are connected to a 10 Gb/s
480 switch, and the recovery of a single OSD completes within a certain number of
481 minutes. (2) There are two OSDs per machine using HDDs with no SSD WAL+DB and
482 a 1 Gb/s switch. In the second setup, recovery will be at least one order of
483 magnitude slower.
484
485 In such a cluster, the number of PGs has almost no effect on data durability.
486 Whether there are 128 PGs per OSD or 8192 PGs per OSD, the recovery will be no
487 slower or faster.
488
489 However, an increase in the number of OSDs can increase the speed of recovery.
490 Suppose our Ceph cluster is expanded from 10 OSDs to 20 OSDs. Each OSD now
491 participates in only ~75 PGs rather than ~150 PGs. All 19 remaining OSDs will
492 still be required to replicate the same number of objects in order to recover.
493 But instead of there being only 10 OSDs that have to copy ~100 GB each, there
494 are now 20 OSDs that have to copy only 50 GB each. If the network had
495 previously been a bottleneck, recovery now happens twice as fast.
496
Similarly, suppose that our cluster grows to 40 OSDs. Each OSD will host only
~38 PGs. And if an OSD dies, recovery will take place faster than before unless
it is blocked by another bottleneck. Now, however, suppose that our cluster
grows to 200 OSDs. Each OSD will host only ~7 PGs. And if an OSD dies, recovery
will happen across at most :math:`7 \times 3 = 21` OSDs associated with these
PGs. This means that recovery will take longer than when there were only 40
OSDs. For this reason, the number of PGs should be increased.
505
No matter how brief the recovery time is, there is always a chance that an
additional OSD will fail while recovery is in progress. Consider the cluster
with 10 OSDs described above: if any of the 9 remaining OSDs fails while the
first failure is still being recovered, then :math:`\approx 17` (approximately
150 divided by 9) PGs will have only one remaining copy. And if any of the 8
remaining OSDs then fails, then 2 (approximately 17 divided by 8) PGs are
likely to lose their remaining objects. This is one reason why setting
``size=2`` is risky.
513
When the number of OSDs in the cluster increases to 20, the number of PGs that
would be damaged by the loss of three OSDs significantly decreases. The loss of
a second OSD degrades only approximately :math:`4` (that is,
:math:`\frac{75}{19}`) PGs rather than :math:`\approx 17` PGs, and the loss of
a third OSD results in data loss only if it is one of the 4 OSDs that contains
the remaining copy. This means -- assuming that the probability of losing one
OSD during recovery is 0.0001% -- that the probability of data loss when three
OSDs are lost is :math:`\approx 17 \times 10 \times 0.0001\%` in the cluster
with 10 OSDs, and only :math:`\approx 4 \times 20 \times 0.0001\%` in the
cluster with 20 OSDs.
523
524 In summary, the greater the number of OSDs, the faster the recovery and the
525 lower the risk of permanently losing a PG due to cascading failures. As far as
526 data durability is concerned, in a cluster with fewer than 50 OSDs, it doesn't
527 much matter whether there are 512 or 4096 PGs.
528
529 .. note:: It can take a long time for an OSD that has been recently added to
530 the cluster to be populated with the PGs assigned to it. However, no object
531 degradation or impact on data durability will result from the slowness of
532 this process since Ceph populates data into the new PGs before removing it
533 from the old PGs.
534
535 .. _object distribution:
536
537 Object distribution within a pool
538 ---------------------------------
539
540 Under ideal conditions, objects are evenly distributed across PGs. Because
541 CRUSH computes the PG for each object but does not know how much data is stored
542 in each OSD associated with the PG, the ratio between the number of PGs and the
543 number of OSDs can have a significant influence on data distribution.
544
545 For example, suppose that there is only a single PG for ten OSDs in a
546 three-replica pool. In that case, only three OSDs would be used because CRUSH
547 would have no other option. However, if more PGs are available, RADOS objects are
548 more likely to be evenly distributed across OSDs. CRUSH makes every effort to
549 distribute OSDs evenly across all existing PGs.
550
551 As long as there are one or two orders of magnitude more PGs than OSDs, the
552 distribution is likely to be even. For example: 256 PGs for 3 OSDs, 512 PGs for
553 10 OSDs, or 1024 PGs for 10 OSDs.
554
555 However, uneven data distribution can emerge due to factors other than the
556 ratio of PGs to OSDs. For example, since CRUSH does not take into account the
557 size of the RADOS objects, the presence of a few very large RADOS objects can
558 create an imbalance. Suppose that one million 4 KB RADOS objects totaling 4 GB
559 are evenly distributed among 1024 PGs on 10 OSDs. These RADOS objects will
560 consume 4 GB / 10 = 400 MB on each OSD. If a single 400 MB RADOS object is then
561 added to the pool, the three OSDs supporting the PG in which the RADOS object
562 has been placed will each be filled with 400 MB + 400 MB = 800 MB but the seven
563 other OSDs will still contain only 400 MB.
564
565 .. _resource usage:
566
567 Memory, CPU and network usage
568 -----------------------------
569
570 Every PG in the cluster imposes memory, network, and CPU demands upon OSDs and
571 MONs. These needs must be met at all times and are increased during recovery.
572 Indeed, one of the main reasons PGs were developed was to share this overhead
573 by clustering objects together.
574
575 For this reason, minimizing the number of PGs saves significant resources.
576
577 .. _choosing-number-of-placement-groups:
578
579 Choosing the Number of PGs
580 ==========================
581
.. note:: It is rarely necessary to do the math in this section by hand.
583 Instead, use the ``ceph osd pool autoscale-status`` command in combination
584 with the ``target_size_bytes`` or ``target_size_ratio`` pool properties. For
585 more information, see :ref:`pg-autoscaler`.
586
587 If you have more than 50 OSDs, we recommend approximately 50-100 PGs per OSD in
588 order to balance resource usage, data durability, and data distribution. If you
589 have fewer than 50 OSDs, follow the guidance in the `preselection`_ section.
590 For a single pool, use the following formula to get a baseline value:
591
592 Total PGs = :math:`\frac{OSDs \times 100}{pool \: size}`
593
Here **pool size** is either the number of replicas for replicated pools or the
K+M sum for erasure-coded pools. To retrieve this sum for a given profile, run
a command of the form ``ceph osd erasure-code-profile get <profile-name>``.
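
For example, to list the available erasure-code profiles and then inspect the K
and M values of the ``default`` profile:

.. prompt:: bash #

   ceph osd erasure-code-profile ls
   ceph osd erasure-code-profile get default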
597
598 Next, check whether the resulting baseline value is consistent with the way you
599 designed your Ceph cluster to maximize `data durability`_ and `object
600 distribution`_ and to minimize `resource usage`_.
601
602 This value should be **rounded up to the nearest power of two**.
603
604 Each pool's ``pg_num`` should be a power of two. Other values are likely to
605 result in uneven distribution of data across OSDs. It is best to increase
606 ``pg_num`` for a pool only when it is feasible and desirable to set the next
607 highest power of two. Note that this power of two rule is per-pool; it is
608 neither necessary nor easy to align the sum of all pools' ``pg_num`` to a power
609 of two.
610
611 For example, if you have a cluster with 200 OSDs and a single pool with a size
612 of 3 replicas, estimate the number of PGs as follows:
613
614 :math:`\frac{200 \times 100}{3} = 6667`. Rounded up to the nearest power of 2: 8192.
615
616 When using multiple data pools to store objects, make sure that you balance the
617 number of PGs per pool against the number of PGs per OSD so that you arrive at
618 a reasonable total number of PGs. It is important to find a number that
619 provides reasonably low variance per OSD without taxing system resources or
620 making the peering process too slow.
621
622 For example, suppose you have a cluster of 10 pools, each with 512 PGs on 10
623 OSDs. That amounts to 5,120 PGs distributed across 10 OSDs, or 512 PGs per OSD.
624 This cluster will not use too many resources. However, in a cluster of 1,000
625 pools, each with 512 PGs on 10 OSDs, the OSDs will have to handle ~50,000 PGs
626 each. This cluster will require significantly more resources and significantly
627 more time for peering.
628
629 For determining the optimal number of PGs per OSD, we recommend the `PGCalc`_
630 tool.
631
632
633 .. _setting the number of placement groups:
634
635 Setting the Number of PGs
636 =========================
637
638 Setting the initial number of PGs in a pool must be done at the time you create
639 the pool. See `Create a Pool`_ for details.
640
641 However, even after a pool is created, if the ``pg_autoscaler`` is not being
642 used to manage ``pg_num`` values, you can change the number of PGs by running a
643 command of the following form:
644
645 .. prompt:: bash #
646
647 ceph osd pool set {pool-name} pg_num {pg_num}
648
649 If you increase the number of PGs, your cluster will not rebalance until you
650 increase the number of PGs for placement (``pgp_num``). The ``pgp_num``
651 parameter specifies the number of PGs that are to be considered for placement
652 by the CRUSH algorithm. Increasing ``pg_num`` splits the PGs in your cluster,
653 but data will not be migrated to the newer PGs until ``pgp_num`` is increased.
654 The ``pgp_num`` parameter should be equal to the ``pg_num`` parameter. To
655 increase the number of PGs for placement, run a command of the following form:
656
657 .. prompt:: bash #
658
659 ceph osd pool set {pool-name} pgp_num {pgp_num}
660
If you decrease the number of PGs, then ``pgp_num`` is adjusted automatically.
In Nautilus and later releases, when the ``pg_autoscaler`` is not used,
``pgp_num`` is automatically stepped to match ``pg_num``. This process
manifests as periods of PG remapping and backfill, which is expected and
normal.
666
667 .. _rados_ops_pgs_get_pg_num:
668
669 Get the Number of PGs
670 =====================
671
672 To get the number of PGs in a pool, run a command of the following form:
673
674 .. prompt:: bash #
675
676 ceph osd pool get {pool-name} pg_num
677
678
679 Get a Cluster's PG Statistics
680 =============================
681
682 To see the details of the PGs in your cluster, run a command of the following
683 form:
684
685 .. prompt:: bash #
686
687 ceph pg dump [--format {format}]
688
689 Valid formats are ``plain`` (default) and ``json``.
690
691
692 Get Statistics for Stuck PGs
693 ============================
694
695 To see the statistics for all PGs that are stuck in a specified state, run a
696 command of the following form:
697
698 .. prompt:: bash #
699
700 ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>]
701
702 - **Inactive** PGs cannot process reads or writes because they are waiting for
703 enough OSDs with the most up-to-date data to come ``up`` and ``in``.
704
705 - **Undersized** PGs contain objects that have not been replicated the desired
706 number of times. Under normal conditions, it can be assumed that these PGs
707 are recovering.
708
709 - **Stale** PGs are in an unknown state -- the OSDs that host them have not
710 reported to the monitor cluster for a certain period of time (determined by
711 ``mon_osd_report_timeout``).
712
713 Valid formats are ``plain`` (default) and ``json``. The threshold defines the
714 minimum number of seconds the PG is stuck before it is included in the returned
715 statistics (default: 300).
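
For example, to list PGs that have been stuck in the ``inactive`` state for at
least ten minutes, in JSON format:

.. prompt:: bash #

   ceph pg dump_stuck inactive --format json --threshold 600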
716
717
718 Get a PG Map
719 ============
720
721 To get the PG map for a particular PG, run a command of the following form:
722
723 .. prompt:: bash #
724
725 ceph pg map {pg-id}
726
727 For example:
728
729 .. prompt:: bash #
730
731 ceph pg map 1.6c
732
Ceph will return the PG map, the PG, and the OSD status. The output resembles
the following::

   osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]
739
740
741 Get a PG's Statistics
742 =====================
743
744 To see statistics for a particular PG, run a command of the following form:
745
746 .. prompt:: bash #
747
748 ceph pg {pg-id} query
749
750
751 Scrub a PG
752 ==========
753
754 To scrub a PG, run a command of the following form:
755
756 .. prompt:: bash #
757
758 ceph pg scrub {pg-id}
759
760 Ceph checks the primary and replica OSDs, generates a catalog of all objects in
761 the PG, and compares the objects against each other in order to ensure that no
762 objects are missing or mismatched and that their contents are consistent. If
763 the replicas all match, then a final semantic sweep takes place to ensure that
764 all snapshot-related object metadata is consistent. Errors are reported in
765 logs.
766
767 To scrub all PGs from a specific pool, run a command of the following form:
768
769 .. prompt:: bash #
770
771 ceph osd pool scrub {pool-name}
772
773
774 Prioritize backfill/recovery of PG(s)
775 =====================================
776
777 You might encounter a situation in which multiple PGs require recovery or
778 backfill, but the data in some PGs is more important than the data in others
779 (for example, some PGs hold data for images that are used by running machines
780 and other PGs are used by inactive machines and hold data that is less
781 relevant). In that case, you might want to prioritize recovery or backfill of
782 the PGs with especially important data so that the performance of the cluster
783 and the availability of their data are restored sooner. To designate specific
784 PG(s) as prioritized during recovery, run a command of the following form:
785
786 .. prompt:: bash #
787
788 ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
789
790 To mark specific PG(s) as prioritized during backfill, run a command of the
791 following form:
792
793 .. prompt:: bash #
794
795 ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]
796
797 These commands instruct Ceph to perform recovery or backfill on the specified
798 PGs before processing the other PGs. Prioritization does not interrupt current
799 backfills or recovery, but places the specified PGs at the top of the queue so
800 that they will be acted upon next. If you change your mind or realize that you
801 have prioritized the wrong PGs, run one or both of the following commands:
802
803 .. prompt:: bash #
804
805 ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
806 ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]
807
808 These commands remove the ``force`` flag from the specified PGs, so that the
809 PGs will be processed in their usual order. As in the case of adding the
810 ``force`` flag, this affects only those PGs that are still queued but does not
811 affect PGs currently undergoing recovery.
812
813 The ``force`` flag is cleared automatically after recovery or backfill of the
814 PGs is complete.
815
816 Similarly, to instruct Ceph to prioritize all PGs from a specified pool (that
817 is, to perform recovery or backfill on those PGs first), run one or both of the
818 following commands:
819
820 .. prompt:: bash #
821
822 ceph osd pool force-recovery {pool-name}
823 ceph osd pool force-backfill {pool-name}
824
825 These commands can also be cancelled. To revert to the default order, run one
826 or both of the following commands:
827
828 .. prompt:: bash #
829
830 ceph osd pool cancel-force-recovery {pool-name}
831 ceph osd pool cancel-force-backfill {pool-name}
832
833 .. warning:: These commands can break the order of Ceph's internal priority
834 computations, so use them with caution! If you have multiple pools that are
835 currently sharing the same underlying OSDs, and if the data held by certain
836 pools is more important than the data held by other pools, then we recommend
837 that you run a command of the following form to arrange a custom
838 recovery/backfill priority for all pools:
839
840 .. prompt:: bash #
841
842 ceph osd pool set {pool-name} recovery_priority {value}
843
844 For example, if you have twenty pools, you could make the most important pool
845 priority ``20``, and the next most important pool priority ``19``, and so on.
846
847 Another option is to set the recovery/backfill priority for only a proper
848 subset of pools. In such a scenario, three important pools might (all) be
849 assigned priority ``1`` and all other pools would be left without an assigned
850 recovery/backfill priority. Another possibility is to select three important
851 pools and set their recovery/backfill priorities to ``3``, ``2``, and ``1``
852 respectively.
853
854 .. important:: Numbers of greater value have higher priority than numbers of
855 lesser value when using ``ceph osd pool set {pool-name} recovery_priority
856 {value}`` to set their recovery/backfill priority. For example, a pool with
857 the recovery/backfill priority ``30`` has a higher priority than a pool with
858 the recovery/backfill priority ``15``.
859
860 Reverting Lost RADOS Objects
861 ============================
862
863 If the cluster has lost one or more RADOS objects and you have decided to
864 abandon the search for the lost data, you must mark the unfound objects
865 ``lost``.
866
867 If every possible location has been queried and all OSDs are ``up`` and ``in``,
868 but certain RADOS objects are still lost, you might have to give up on those
869 objects. This situation can arise when rare and unusual combinations of
870 failures allow the cluster to learn about writes that were performed before the
871 writes themselves were recovered.
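
To review which objects in a PG are still unfound before giving up on them, you
can run a command of the following form (``2.5`` is only an example PG ID):

.. prompt:: bash #

   ceph pg 2.5 list_unfound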
872
873 The command to mark a RADOS object ``lost`` has only one supported option:
874 ``revert``. The ``revert`` option will either roll back to a previous version
875 of the RADOS object (if it is old enough to have a previous version) or forget
876 about it entirely (if it is too new to have a previous version). To mark the
877 "unfound" objects ``lost``, run a command of the following form:
878
879
880 .. prompt:: bash #
881
882 ceph pg {pg-id} mark_unfound_lost revert|delete
883
884 .. important:: Use this feature with caution. It might confuse applications
885 that expect the object(s) to exist.
886
887
888 .. toctree::
889 :hidden:
890
891 pg-states
892 pg-concepts
893
894
895 .. _Create a Pool: ../pools#createpool
896 .. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
897 .. _pgcalc: https://old.ceph.com/pgcalc/