Autoscaling placement groups
============================

Placement groups (PGs) are an internal implementation detail of how Ceph
distributes data. Autoscaling provides a way to manage PGs, and especially to
manage the number of PGs present in different pools. When *pg-autoscaling* is
enabled, the cluster is allowed to make recommendations or automatic
adjustments with respect to the number of PGs for each pool (``pgp_num``) in
accordance with expected cluster utilization and expected pool utilization.

Each pool has a ``pg_autoscale_mode`` property that can be set to ``off``,
``on``, or ``warn``:

* ``off``: Disable autoscaling for this pool. It is up to the administrator to
  choose an appropriate ``pg_num`` for each pool. For more information, see
  :ref:`choosing-number-of-placement-groups`.
* ``on``: Enable automated adjustments of the PG count for the given pool.
* ``warn``: Raise health checks when the PG count is in need of adjustment.
To set the autoscaling mode for an existing pool, run a command of the
following form:

.. prompt:: bash #

   ceph osd pool set <pool-name> pg_autoscale_mode <mode>

For example, to enable autoscaling on pool ``foo``, run the following command:

.. prompt:: bash #

   ceph osd pool set foo pg_autoscale_mode on
There is also a default ``pg_autoscale_mode`` setting that applies to any
pools created after the initial setup of the cluster. To change this setting,
run a command of the following form:

.. prompt:: bash #

   ceph config set global osd_pool_default_pg_autoscale_mode <mode>
You can disable or enable the autoscaler for all pools with the ``noautoscale``
flag. By default, this flag is set to ``off``, but you can set it to ``on`` by
running the following command:

.. prompt:: bash #

   ceph osd pool set noautoscale

To set the ``noautoscale`` flag to ``off``, run the following command:

.. prompt:: bash #

   ceph osd pool unset noautoscale

To get the value of the flag, run the following command:

.. prompt:: bash #

   ceph osd pool get noautoscale
Viewing PG scaling recommendations
----------------------------------

To view each pool, its relative utilization, and any recommended changes to the
PG count, run the following command:

.. prompt:: bash #

   ceph osd pool autoscale-status

The output will resemble the following::

   POOL  SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
   a     12900M               3.0   82431M        0.4695                                       8       128         warn       True
   c     0                    3.0   82431M        0.0000  0.2000        0.9884           1.0   1       64          warn       True
   b     0       953.6M       3.0   82431M        0.0347                                       8                   warn       False
- **POOL** is the name of the pool.

- **SIZE** is the amount of data stored in the pool.

- **TARGET SIZE** (if present) is the amount of data that is expected to be
  stored in the pool, as specified by the administrator. The system uses the
  greater of the two values for its calculation.

- **RATE** is the multiplier for the pool that determines how much raw storage
  capacity is consumed. For example, a three-replica pool will have a ratio of
  3.0, and a ``k=4 m=2`` erasure-coded pool will have a ratio of 1.5.

- **RAW CAPACITY** is the total amount of raw storage capacity on the specific
  OSDs that are responsible for storing the data of the pool (and perhaps the
  data of other pools).

- **RATIO** is the ratio of (1) the storage consumed by the pool to (2) the
  total raw storage capacity. In other words, RATIO is defined as
  (SIZE * RATE) / RAW CAPACITY.

- **TARGET RATIO** (if present) is the ratio of the expected storage of this
  pool (that is, the amount of storage that this pool is expected to consume,
  as specified by the administrator) to the expected storage of all other pools
  that have target ratios set. If both ``target_size_bytes`` and
  ``target_size_ratio`` are specified, then ``target_size_ratio`` takes
  precedence.
- **EFFECTIVE RATIO** is the result of making two adjustments to the target
  ratio:

  #. Subtracting any capacity expected to be used by pools that have target
     size set.

  #. Normalizing the target ratios among pools that have target ratio set so
     that collectively they target cluster capacity. For example, four pools
     with target_ratio 1.0 would have an effective ratio of 0.25.

  The system's calculations use whichever of these two ratios (that is, the
  target ratio and the effective ratio) is greater.

- **BIAS** is used as a multiplier to manually adjust a pool's PG count in
  accordance with prior information about how many PGs a specific pool is
  expected to have.
- **PG_NUM** is either the current number of PGs associated with the pool or,
  if a ``pg_num`` change is in progress, the current number of PGs that the
  pool is working towards.

- **NEW PG_NUM** (if present) is the value that the system is recommending the
  ``pg_num`` of the pool to be changed to. It is always a power of 2, and it is
  present only if the recommended value varies from the current value by more
  than the default factor of ``3``. To adjust this factor (in the following
  example, it is changed to ``2``), run the following command:

  .. prompt:: bash #

     ceph osd pool set threshold 2.0

- **AUTOSCALE** is the pool's ``pg_autoscale_mode`` and is set to ``on``,
  ``off``, or ``warn``.

- **BULK** determines whether the pool is ``bulk``. It has a value of ``True``
  or ``False``. A ``bulk`` pool is expected to be large and should initially
  have a large number of PGs so that performance does not suffer. On the other
  hand, a pool that is not ``bulk`` is expected to be small (for example, a
  ``.mgr`` pool or a meta pool).
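The RATIO arithmetic described above can be checked directly. Here is a quick
sanity check in Python, using the figures from the sample output; the
``ratio`` helper is ours for illustration and is not a Ceph API:

```python
def ratio(size_mib: float, rate: float, raw_capacity_mib: float) -> float:
    """RATIO as defined above: (SIZE * RATE) / RAW CAPACITY."""
    return size_mib * rate / raw_capacity_mib

# Pool "a": 12900M stored, three replicas (RATE 3.0), 82431M raw capacity.
print(round(ratio(12900, 3.0, 82431), 4))   # 0.4695

# Pool "b" stores no data yet, but has a TARGET SIZE of 953.6M; the system
# uses the greater of SIZE and TARGET SIZE, hence RATIO 0.0347.
print(round(ratio(953.6, 3.0, 82431), 4))   # 0.0347
```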
If the ``ceph osd pool autoscale-status`` command returns no output at all,
there is probably at least one pool that spans multiple CRUSH roots. This
'spanning pool' issue can happen in scenarios like the following:
when a new deployment auto-creates the ``.mgr`` pool on the ``default``
CRUSH root, subsequent pools are created with rules that constrain them to a
specific shadow CRUSH tree. For example, if you create an RBD metadata pool
that is constrained to ``deviceclass = ssd`` and an RBD data pool that is
constrained to ``deviceclass = hdd``, you will encounter this issue. To
remedy this issue, constrain the spanning pool to only one device class. In
the above scenario, there is likely to be a ``replicated-ssd`` CRUSH rule in
effect, and the ``.mgr`` pool can be constrained to ``ssd`` devices by
running the following command:

.. prompt:: bash #

   ceph osd pool set .mgr crush_rule replicated-ssd

This intervention will result in a small amount of backfill, but
typically this traffic completes quickly.
Automated scaling
-----------------

In the simplest approach to automated scaling, the cluster is allowed to
automatically scale ``pgp_num`` in accordance with usage. Ceph considers the
total available storage and the target number of PGs for the whole system,
considers how much data is stored in each pool, and apportions PGs accordingly.
The system is conservative with its approach, making changes to a pool only
when the current number of PGs (``pg_num``) varies by more than a factor of 3
from the recommended number.

The target number of PGs per OSD is determined by the ``mon_target_pg_per_osd``
parameter (default: 100), which can be adjusted by running the following
command:

.. prompt:: bash #

   ceph config set global mon_target_pg_per_osd 100
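The apportioning logic described above can be sketched as follows. This is a
simplified model of the behavior, not the actual ``pg_autoscaler`` manager
module code; all function names are illustrative:

```python
def next_power_of_two(n: float) -> int:
    """Smallest power of two >= n (pg_num values are kept at powers of two)."""
    p = 1
    while p < n:
        p *= 2
    return p

def suggested_pg_num(capacity_share: float, num_osds: int, pool_size: int,
                     target_pg_per_osd: int = 100) -> int:
    """Apportion the cluster-wide PG-replica budget by the pool's share of
    the data, then divide by the replication factor (pool size)."""
    raw = capacity_share * num_osds * target_pg_per_osd / pool_size
    return next_power_of_two(raw)

def should_adjust(current: int, suggested: int, threshold: float = 3.0) -> bool:
    """The autoscaler only acts when pg_num is off by more than the factor."""
    return suggested > current * threshold or suggested < current / threshold

# A pool holding half the data of a 10-OSD cluster, with three replicas:
print(suggested_pg_num(0.5, 10, 3))   # 256
print(should_adjust(8, 256))          # True  (8 -> 256 is worth changing)
print(should_adjust(128, 256))        # False (within a factor of 3)
```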
The autoscaler analyzes pools and adjusts on a per-subtree basis. Because each
pool might map to a different CRUSH rule, and each rule might distribute data
across different devices, Ceph will consider the utilization of each subtree of
the hierarchy independently. For example, a pool that maps to OSDs of class
``ssd`` and a pool that maps to OSDs of class ``hdd`` will each have optimal PG
counts that are determined by how many of these two different device types
there are.

If a pool uses OSDs under two or more CRUSH roots (for example, shadow trees
with both ``ssd`` and ``hdd`` devices), the autoscaler issues a warning to the
user in the manager log. The warning states the name of the pool and the set of
roots that overlap each other. The autoscaler does not scale any pools with
overlapping roots because this condition can cause problems with the scaling
process. We recommend constraining each pool so that it belongs to only one
root (that is, one OSD class) to silence the warning and ensure a successful
scaling process.
.. _managing_bulk_flagged_pools:

Managing pools that are flagged with ``bulk``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If a pool is flagged ``bulk``, then the autoscaler starts the pool with a full
complement of PGs and then scales down the number of PGs only if the usage
ratio across the pool is uneven. However, if a pool is not flagged ``bulk``,
then the autoscaler starts the pool with minimal PGs and creates additional PGs
only if there is more usage in the pool.
To create a pool that will be flagged ``bulk``, run the following command:

.. prompt:: bash #

   ceph osd pool create <pool-name> --bulk

To set or unset the ``bulk`` flag of an existing pool, run a command of the
following form:

.. prompt:: bash #

   ceph osd pool set <pool-name> bulk <true/false/1/0>

To get the ``bulk`` flag of an existing pool, run the following command:

.. prompt:: bash #

   ceph osd pool get <pool-name> bulk
.. _specifying_pool_target_size:

Specifying expected pool size
-----------------------------

When a cluster or pool is first created, it consumes only a small fraction of
the total cluster capacity and appears to the system as if it should need only
a small number of PGs. However, in some cases, cluster administrators know
which pools are likely to consume most of the system capacity in the long run.
When Ceph is provided with this information, a more appropriate number of PGs
can be used from the beginning, obviating subsequent changes in ``pg_num`` and
the associated overhead cost of relocating data.

The *target size* of a pool can be specified in two ways: either in relation to
the absolute size (in bytes) of the pool, or as a weight relative to all other
pools that have ``target_size_ratio`` set.
For example, to tell the system that ``mypool`` is expected to consume 100 TB,
run the following command:

.. prompt:: bash #

   ceph osd pool set mypool target_size_bytes 100T

Alternatively, to tell the system that ``mypool`` is expected to consume a
ratio of 1.0 relative to other pools that have ``target_size_ratio`` set,
adjust the ``target_size_ratio`` setting of ``mypool`` by running the
following command:

.. prompt:: bash #

   ceph osd pool set mypool target_size_ratio 1.0

If ``mypool`` is the only pool in the cluster, then it is expected to use 100%
of the total cluster capacity. However, if the cluster contains a second pool
that has ``target_size_ratio`` set to 1.0, then both pools are expected to use
50% of the total cluster capacity.
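The ratio weighting can be checked with a few lines of Python.
``expected_shares`` is a hypothetical helper that mirrors the normalization
described above; it is not part of Ceph:

```python
def expected_shares(target_ratios: dict) -> dict:
    """Normalize target_size_ratio values into expected capacity shares."""
    total = sum(target_ratios.values())
    return {pool: r / total for pool, r in target_ratios.items()}

# A single pool with ratio 1.0 is expected to use the whole cluster:
print(expected_shares({"mypool": 1.0}))                # {'mypool': 1.0}
# Two pools with ratio 1.0 each are expected to use 50% apiece:
print(expected_shares({"mypool": 1.0, "other": 1.0}))  # 0.5 each
```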
The ``ceph osd pool create`` command has two command-line options that can be
used to set the target size of a pool at creation time: ``--target-size-bytes
<bytes>`` and ``--target-size-ratio <ratio>``.

Note that if the target-size values that have been specified are impossible
(for example, a capacity larger than the total cluster capacity), then a health
check (``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised.

If both ``target_size_ratio`` and ``target_size_bytes`` are specified for a
pool, then the latter will be ignored, the former will be used in system
calculations, and a health check (``POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO``)
will be raised.
Specifying bounds on a pool's PGs
---------------------------------

It is possible to specify both the minimum number and the maximum number of PGs
for a pool.

Setting a Minimum Number of PGs and a Maximum Number of PGs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If a minimum is set, then Ceph will not itself reduce (nor recommend that you
reduce) the number of PGs to a value below the configured value. Setting a
minimum serves to establish a lower bound on the amount of parallelism enjoyed
by a client during I/O, even if a pool is mostly empty.

If a maximum is set, then Ceph will not itself increase (or recommend that you
increase) the number of PGs to a value above the configured value.

To set the minimum number of PGs for a pool, run a command of the following
form:

.. prompt:: bash #

   ceph osd pool set <pool-name> pg_num_min <num>

To set the maximum number of PGs for a pool, run a command of the following
form:

.. prompt:: bash #

   ceph osd pool set <pool-name> pg_num_max <num>

In addition, the ``ceph osd pool create`` command has two command-line options
that can be used to specify the minimum or maximum PG count of a pool at
creation time: ``--pg-num-min <num>`` and ``--pg-num-max <num>``.
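The effect of these bounds can be sketched in a few lines of Python.
``bounded_pg_num`` is an illustrative helper, not part of Ceph; it simply
shows that a suggested PG count is clamped to the configured limits:

```python
def bounded_pg_num(suggested, pg_num_min=None, pg_num_max=None):
    """Clamp an autoscaler suggestion to the configured bounds, if any."""
    if pg_num_min is not None:
        suggested = max(suggested, pg_num_min)   # never go below the minimum
    if pg_num_max is not None:
        suggested = min(suggested, pg_num_max)   # never go above the maximum
    return suggested

print(bounded_pg_num(4, pg_num_min=32))       # 32
print(bounded_pg_num(2048, pg_num_max=512))   # 512
```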
.. _preselection:

Preselecting pg_num
===================

When creating a pool with the following command, you have the option to
preselect the value of the ``pg_num`` parameter:

.. prompt:: bash #

   ceph osd pool create {pool-name} [pg_num]

If you opt not to specify ``pg_num`` in this command, the cluster uses the PG
autoscaler to automatically configure the parameter in accordance with the
amount of data that is stored in the pool (see :ref:`pg-autoscaler` above).

However, your decision of whether or not to specify ``pg_num`` at creation time
has no effect on whether the parameter will be automatically tuned by the
cluster afterwards. As seen above, autoscaling of PGs is enabled or disabled by
running a command of the following form:

.. prompt:: bash #

   ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn)
Without the balancer, the suggested target is approximately 100 PG replicas on
each OSD. With the balancer, an initial target of 50 PG replicas on each OSD is
reasonable.

The autoscaler attempts to satisfy the following conditions:

- the number of PGs per pool should be proportional to the amount of data in
  the pool
- there should be 50-100 PGs per OSD, taking into account the replication
  overhead or erasure-coding fan-out of each PG's replicas across OSDs
Use of Placement Groups
=======================

A placement group aggregates objects within a pool. The tracking of RADOS
object placement and object metadata on a per-object basis is computationally
expensive. It would be infeasible for a system with millions of RADOS
objects to efficiently track placement on a per-object basis.
::

   /-----\  /-----\  /-----\  /-----\  /-----\
   | obj |  | obj |  | obj |  | obj |  | obj |
   \-----/  \-----/  \-----/  \-----/  \-----/
      |        |        |        |        |
      +--------+--------+        +---+----+
               |                     |
               v                     v
   +-----------------------+  +-----------------------+
   |  Placement Group #1   |  |  Placement Group #2   |
   |                       |  |                       |
   +-----------------------+  +-----------------------+
               |                     |
               +----------+----------+
                          |
                          v
               +-----------------------+
               |         Pool          |
               |                       |
               +-----------------------+
The Ceph client calculates which PG a RADOS object should be in. As part of
this calculation, the client hashes the object ID and performs an operation
involving both the number of PGs in the specified pool and the pool ID. For
details, see `Mapping PGs to OSDs`_.
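The shape of that client-side calculation can be sketched as follows. This is
a simplified model only: real Ceph uses its own rjenkins hash and a "stable
mod" folding rather than CRC32 and a plain modulo, and ``object_to_pg`` is an
illustrative name, not a Ceph API:

```python
import zlib

def object_to_pg(pool_id: int, object_name: str, pg_num: int) -> str:
    """Hash the object name and fold it into one of the pool's PGs."""
    ps = zlib.crc32(object_name.encode()) % pg_num   # placement seed
    return f"{pool_id}.{ps:x}"                       # PG IDs print as pool.hex

# The mapping is deterministic: the same object always lands in the same PG.
print(object_to_pg(1, "rbd_data.1234", 128))
```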
The contents of a RADOS object belonging to a PG are stored in a set of OSDs.
For example, in a replicated pool of size two, each PG will store objects on
two OSDs, as shown below::

   +-----------------------+  +-----------------------+
   |  Placement Group #1   |  |  Placement Group #2   |
   |                       |  |                       |
   +-----------------------+  +-----------------------+
        |             |             |            |
        v             v             v            v
   /----------\  /----------\  /----------\  /----------\
   |          |  |          |  |          |  |          |
   |  OSD #1  |  |  OSD #2  |  |  OSD #2  |  |  OSD #3  |
   |          |  |          |  |          |  |          |
   \----------/  \----------/  \----------/  \----------/
If OSD #2 fails, another OSD will be assigned to Placement Group #1 and then
filled with copies of all objects in OSD #1. If the pool size is changed from
two to three, an additional OSD will be assigned to the PG and will receive
copies of all objects in the PG.

An OSD assigned to a PG is not owned exclusively by that PG; rather, the OSD is
shared with other PGs either from the same pool or from other pools. In our
example, OSD #2 is shared by Placement Group #1 and Placement Group #2. If OSD
#2 fails, then Placement Group #2 must restore copies of objects (by making use
of OSD #3).

When the number of PGs increases, several consequences ensue. The new PGs are
assigned OSDs. The result of the CRUSH function changes, which means that some
objects from the already-existing PGs are copied to the new PGs and removed
from the old PGs.
Factors Relevant To Specifying pg_num
=====================================

On the one hand, the criteria of data durability and even distribution across
OSDs weigh in favor of a high number of PGs. On the other hand, the criteria of
saving CPU resources and minimizing memory usage weigh in favor of a low number
of PGs.

.. _data durability:

Data durability
---------------

When an OSD fails, the risk of data loss is increased until replication of the
data it hosted is restored to the configured level. To illustrate this point,
let's imagine a scenario that results in permanent data loss in a single PG:
#. The OSD fails and all copies of the object that it contains are lost. For
   each object within the PG, the number of its replicas suddenly drops from
   three to two.

#. Ceph starts recovery for this PG by choosing a new OSD on which to re-create
   the third copy of each object.

#. Another OSD within the same PG fails before the new OSD is fully populated
   with the third copy. Some objects will then only have one surviving copy.

#. Ceph selects yet another OSD and continues copying objects in order to
   restore the desired number of copies.

#. A third OSD within the same PG fails before recovery is complete. If this
   OSD happened to contain the only remaining copy of an object, the object is
   permanently lost.
In a cluster containing 10 OSDs with 512 PGs in a three-replica pool, CRUSH
will give each PG three OSDs. Ultimately, each OSD hosts
:math:`\frac{512 \times 3}{10} \approx 150` PGs. So when the first OSD fails in
the above scenario, recovery will begin for all 150 PGs at the same time.
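The per-OSD arithmetic above is easy to verify:

```python
pg_num, replicas, osds = 512, 3, 10

# Each OSD hosts about (512 * 3) / 10 PG replicas:
per_osd = pg_num * replicas / osds
print(per_osd)                   # 153.6, i.e. ~150 PGs per OSD

# Doubling the cluster to 20 OSDs halves the per-OSD count:
print(pg_num * replicas / 20)    # 76.8, i.e. ~75 PGs per OSD
```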
The 150 PGs that are being recovered are likely to be homogeneously distributed
across the 9 remaining OSDs. Each remaining OSD is therefore likely to send
copies of objects to all other OSDs and also likely to receive some new objects
to be stored because it has become part of a new PG.

The amount of time it takes for this recovery to complete depends on the
architecture of the Ceph cluster. Compare two setups: (1) Each OSD is hosted by
a 1 TB SSD on a single machine, all of the OSDs are connected to a 10 Gb/s
switch, and the recovery of a single OSD completes within a certain number of
minutes. (2) There are two OSDs per machine using HDDs with no SSD WAL+DB and
a 1 Gb/s switch. In the second setup, recovery will be at least one order of
magnitude slower.

In such a cluster, the number of PGs has almost no effect on data durability.
Whether there are 128 PGs per OSD or 8192 PGs per OSD, the recovery will be no
faster or slower.
However, an increase in the number of OSDs can increase the speed of recovery.
Suppose our Ceph cluster is expanded from 10 OSDs to 20 OSDs. Each OSD now
participates in only ~75 PGs rather than ~150 PGs. All 19 remaining OSDs will
still be required to replicate the same number of objects in order to recover.
But instead of there being only 10 OSDs that have to copy ~100 GB each, there
are now 20 OSDs that have to copy only 50 GB each. If the network had
previously been a bottleneck, recovery now happens twice as fast.

Similarly, suppose that our cluster grows to 40 OSDs. Each OSD will host only
~38 PGs. And if an OSD dies, recovery will take place faster than before unless
it is blocked by another bottleneck. Now, however, suppose that our cluster
grows to 200 OSDs. Each OSD will host only ~7 PGs. And if an OSD dies, recovery
will happen across at most :math:`\approx 21 = (7 \times 3)` OSDs
associated with these PGs. This means that recovery will take longer than when
there were only 40 OSDs. For this reason, the number of PGs should be
increased as the number of OSDs in the cluster increases.
No matter how brief the recovery time is, there is always a chance that an
additional OSD will fail while recovery is in progress. Consider the cluster
with 10 OSDs described above: if any of the OSDs fail, then :math:`\approx 17`
(approximately 150 divided by 9) PGs will have only one remaining copy. And if
any of the 8 remaining OSDs fail, then 2 (approximately 17 divided by 8) PGs
are likely to lose their remaining objects. This is one reason why setting
``size=2`` is risky.
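A quick check of the arithmetic in this paragraph:

```python
pgs_per_osd = 150          # from the 10-OSD example above

# After the first failure, the lost OSD's PGs are spread over the 9
# survivors, so roughly 150 / 9 PGs are down to a single copy:
single_copy_pgs = pgs_per_osd / 9
print(round(single_copy_pgs))        # 17

# If one of the 8 remaining OSDs then fails, it holds about 1/8 of those:
print(round(single_copy_pgs / 8))    # 2
```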
When the number of OSDs in the cluster increases to 20, the number of PGs that
would be damaged by the loss of three OSDs significantly decreases. The loss of
a second OSD degrades only approximately :math:`4` (that is,
:math:`\frac{75}{19}`) PGs rather than :math:`\approx 17` PGs, and the loss of
a third OSD results in data loss only if it is one of the 4 OSDs that contains
the remaining copy. This means -- assuming that the probability of losing one
OSD during recovery is 0.0001% -- that the probability of data loss when three
OSDs are lost is :math:`\approx 17 \times 10 \times 0.0001\%` in the cluster
with 10 OSDs, and only :math:`\approx 4 \times 20 \times 0.0001\%` in the
cluster with 20 OSDs.

In summary, the greater the number of OSDs, the faster the recovery and the
lower the risk of permanently losing a PG due to cascading failures. As far as
data durability is concerned, in a cluster with fewer than 50 OSDs, it doesn't
much matter whether there are 512 or 4096 PGs.
.. note:: It can take a long time for an OSD that has been recently added to
   the cluster to be populated with the PGs assigned to it. However, no object
   degradation or impact on data durability will result from the slowness of
   this process since Ceph populates data into the new PGs before removing it
   from the old PGs.
.. _object distribution:

Object distribution within a pool
---------------------------------

Under ideal conditions, objects are evenly distributed across PGs. Because
CRUSH computes the PG for each object but does not know how much data is stored
in each OSD associated with the PG, the ratio between the number of PGs and the
number of OSDs can have a significant influence on data distribution.
For example, suppose that there is only a single PG for ten OSDs in a
three-replica pool. In that case, only three OSDs would be used because CRUSH
would have no other option. However, if more PGs are available, RADOS objects are
more likely to be evenly distributed across OSDs. CRUSH makes every effort to
distribute OSDs evenly across all existing PGs.

As long as there are one or two orders of magnitude more PGs than OSDs, the
distribution is likely to be even. For example: 256 PGs for 3 OSDs, 512 PGs for
10 OSDs, or 1024 PGs for 10 OSDs.
However, uneven data distribution can emerge due to factors other than the
ratio of PGs to OSDs. For example, since CRUSH does not take into account the
size of the RADOS objects, the presence of a few very large RADOS objects can
create an imbalance. Suppose that one million 4 KB RADOS objects totaling 4 GB
are evenly distributed among 1024 PGs on 10 OSDs. These RADOS objects will
consume 4 GB / 10 = 400 MB on each OSD. If a single 400 MB RADOS object is then
added to the pool, the three OSDs supporting the PG in which the RADOS object
has been placed will each be filled with 400 MB + 400 MB = 800 MB but the seven
other OSDs will still contain only 400 MB.
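The arithmetic of this example can be checked directly (decimal units are used
here for simplicity, as in the text):

```python
# One million 4 KB objects spread evenly over 10 OSDs:
total_mb = 1_000_000 * 4 / 1000      # 4000 MB, i.e. ~4 GB of data
per_osd_mb = total_mb / 10
print(per_osd_mb)                    # 400.0 MB on every OSD

# A single 400 MB object lands in one PG; in a three-replica pool, only the
# three OSDs backing that PG grow, while the other seven stay at 400 MB:
print(per_osd_mb + 400)              # 800.0 MB on those three OSDs
```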
.. _resource usage:

Memory, CPU and network usage
-----------------------------

Every PG in the cluster imposes memory, network, and CPU demands upon OSDs and
MONs. These needs must be met at all times and are increased during recovery.
Indeed, one of the main reasons PGs were developed was to share this overhead
by clustering objects together.

For this reason, minimizing the number of PGs saves significant resources.
.. _choosing-number-of-placement-groups:

Choosing the Number of PGs
==========================

.. note:: It is rarely necessary to do the math in this section by hand.
   Instead, use the ``ceph osd pool autoscale-status`` command in combination
   with the ``target_size_bytes`` or ``target_size_ratio`` pool properties. For
   more information, see :ref:`pg-autoscaler`.
If you have more than 50 OSDs, we recommend approximately 50-100 PGs per OSD in
order to balance resource usage, data durability, and data distribution. If you
have fewer than 50 OSDs, follow the guidance in the `preselection`_ section.
For a single pool, use the following formula to get a baseline value:

Total PGs = :math:`\frac{OSDs \times 100}{pool \: size}`

Here **pool size** is either the number of replicas for replicated pools or the
K+M sum for erasure-coded pools. To retrieve this sum, run the command ``ceph
osd erasure-code-profile get``.
Next, check whether the resulting baseline value is consistent with the way you
designed your Ceph cluster to maximize `data durability`_ and `object
distribution`_ and to minimize `resource usage`_.

This value should be **rounded up to the nearest power of two**.

Each pool's ``pg_num`` should be a power of two. Other values are likely to
result in uneven distribution of data across OSDs. It is best to increase
``pg_num`` for a pool only when it is feasible and desirable to set the next
highest power of two. Note that this power-of-two rule is per-pool; it is
neither necessary nor easy to align the sum of all pools' ``pg_num`` to a power
of two.
For example, if you have a cluster with 200 OSDs and a single pool with a size
of 3 replicas, estimate the number of PGs as follows:

:math:`\frac{200 \times 100}{3} = 6667`. Rounded up to the nearest power of 2: 8192.
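The baseline formula and the power-of-two rounding can be expressed directly.
``baseline_pg_count`` is an illustrative helper implementing the formula
above, not a Ceph API:

```python
from math import ceil, log2

def baseline_pg_count(num_osds: int, pool_size: int,
                      target_per_osd: int = 100) -> int:
    """Total PGs = (OSDs x 100) / pool size, rounded up to a power of two."""
    baseline = num_osds * target_per_osd / pool_size
    return 2 ** ceil(log2(baseline))

print(baseline_pg_count(200, 3))   # 6667 -> 8192 (the worked example above)
print(baseline_pg_count(200, 6))   # a k=4 m=2 EC pool: K+M = 6 -> 4096
```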
When using multiple data pools to store objects, make sure that you balance the
number of PGs per pool against the number of PGs per OSD so that you arrive at
a reasonable total number of PGs. It is important to find a number that
provides reasonably low variance per OSD without taxing system resources or
making the peering process too slow.

For example, suppose you have a cluster of 10 pools, each with 512 PGs on 10
OSDs. That amounts to 5,120 PGs distributed across 10 OSDs, or 512 PGs per OSD.
This cluster will not use too many resources. However, in a cluster of 1,000
pools, each with 512 PGs on 10 OSDs, the OSDs will have to handle ~50,000 PGs
each. This cluster will require significantly more resources and significantly
more time for peering.
For determining the optimal number of PGs per OSD, we recommend the `PGCalc`_
tool.
.. _setting the number of placement groups:

Setting the Number of PGs
=========================

Setting the initial number of PGs in a pool must be done at the time you create
the pool. See `Create a Pool`_ for details.

However, even after a pool is created, if the ``pg_autoscaler`` is not being
used to manage ``pg_num`` values, you can change the number of PGs by running a
command of the following form:

.. prompt:: bash #

   ceph osd pool set {pool-name} pg_num {pg_num}
If you increase the number of PGs, your cluster will not rebalance until you
increase the number of PGs for placement (``pgp_num``). The ``pgp_num``
parameter specifies the number of PGs that are to be considered for placement
by the CRUSH algorithm. Increasing ``pg_num`` splits the PGs in your cluster,
but data will not be migrated to the newer PGs until ``pgp_num`` is increased.
The ``pgp_num`` parameter should be equal to the ``pg_num`` parameter. To
increase the number of PGs for placement, run a command of the following form:

.. prompt:: bash #

   ceph osd pool set {pool-name} pgp_num {pgp_num}

If you decrease the number of PGs, then ``pgp_num`` is adjusted automatically.
In Nautilus and later releases, when the ``pg_autoscaler`` is not used,
``pgp_num`` is automatically stepped to match ``pg_num``. This process
manifests as periods of PG remapping and backfill, which is expected and
normal behavior.
.. _rados_ops_pgs_get_pg_num:

Get the Number of PGs
=====================

To get the number of PGs in a pool, run a command of the following form:

.. prompt:: bash #

   ceph osd pool get {pool-name} pg_num


Get a Cluster's PG Statistics
=============================

To see the details of the PGs in your cluster, run a command of the following
form:

.. prompt:: bash #

   ceph pg dump [--format {format}]

Valid formats are ``plain`` (default) and ``json``.
Get Statistics for Stuck PGs
============================

To see the statistics for all PGs that are stuck in a specified state, run a
command of the following form:

.. prompt:: bash #

   ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>]

- **Inactive** PGs cannot process reads or writes because they are waiting for
  enough OSDs with the most up-to-date data to come ``up`` and ``in``.

- **Undersized** PGs contain objects that have not been replicated the desired
  number of times. Under normal conditions, it can be assumed that these PGs
  are recovering.

- **Stale** PGs are in an unknown state -- the OSDs that host them have not
  reported to the monitor cluster for a certain period of time (determined by
  ``mon_osd_report_timeout``).

Valid formats are ``plain`` (default) and ``json``. The threshold defines the
minimum number of seconds the PG is stuck before it is included in the returned
statistics (default: 300).
Get a PG Map
============

To get the PG map for a particular PG, run a command of the following form:

.. prompt:: bash #

   ceph pg map {pg-id}

For example:

.. prompt:: bash #

   ceph pg map 1.6c

Ceph will return the PG map, the PG, and the OSD status. The output resembles
the following::

   osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]
Get a PG's Statistics
=====================

To see statistics for a particular PG, run a command of the following form:

.. prompt:: bash #

   ceph pg {pg-id} query


Scrub a PG
==========

To scrub a PG, run a command of the following form:

.. prompt:: bash #

   ceph pg scrub {pg-id}
Ceph checks the primary and replica OSDs, generates a catalog of all objects in
the PG, and compares the objects against each other in order to ensure that no
objects are missing or mismatched and that their contents are consistent. If
the replicas all match, then a final semantic sweep takes place to ensure that
all snapshot-related object metadata is consistent. Errors are reported in the
logs.

To scrub all PGs from a specific pool, run a command of the following form:

.. prompt:: bash #

   ceph osd pool scrub {pool-name}
Prioritize backfill/recovery of PG(s)
=====================================

You might encounter a situation in which multiple PGs require recovery or
backfill, but the data in some PGs is more important than the data in others
(for example, some PGs hold data for images that are used by running machines
and other PGs are used by inactive machines and hold data that is less
relevant). In that case, you might want to prioritize recovery or backfill of
the PGs with especially important data so that the performance of the cluster
and the availability of that data are restored sooner. To designate specific
PG(s) as prioritized during recovery, run a command of the following form:

.. prompt:: bash #

   ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]

To mark specific PG(s) as prioritized during backfill, run a command of the
following form:

.. prompt:: bash #

   ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]
These commands instruct Ceph to perform recovery or backfill on the specified
PGs before processing the other PGs. Prioritization does not interrupt current
backfills or recovery, but places the specified PGs at the top of the queue so
that they will be acted upon next. If you change your mind or realize that you
have prioritized the wrong PGs, run one or both of the following commands:

.. prompt:: bash #

   ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
   ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

These commands remove the ``force`` flag from the specified PGs, so that the
PGs will be processed in their usual order. As in the case of adding the
``force`` flag, this affects only those PGs that are still queued but does not
affect PGs currently undergoing recovery.

The ``force`` flag is cleared automatically after recovery or backfill of the
designated PGs is complete.
Similarly, to instruct Ceph to prioritize all PGs from a specified pool (that
is, to perform recovery or backfill on those PGs first), run one or both of the
following commands:

.. prompt:: bash #

   ceph osd pool force-recovery {pool-name}
   ceph osd pool force-backfill {pool-name}

These commands can also be cancelled. To revert to the default order, run one
or both of the following commands:

.. prompt:: bash #

   ceph osd pool cancel-force-recovery {pool-name}
   ceph osd pool cancel-force-backfill {pool-name}
.. warning:: These commands can break the order of Ceph's internal priority
   computations, so use them with caution! If you have multiple pools that are
   currently sharing the same underlying OSDs, and if the data held by certain
   pools is more important than the data held by other pools, then we recommend
   that you run a command of the following form to arrange a custom
   recovery/backfill priority for all pools:

   .. prompt:: bash #

      ceph osd pool set {pool-name} recovery_priority {value}

For example, if you have twenty pools, you could make the most important pool
priority ``20``, and the next most important pool priority ``19``, and so on.
Another option is to set the recovery/backfill priority for only a proper
subset of pools. In such a scenario, three important pools might (all) be
assigned priority ``1`` and all other pools would be left without an assigned
recovery/backfill priority. Another possibility is to select three important
pools and set their recovery/backfill priorities to ``3``, ``2``, and ``1``
respectively.

.. important:: Numbers of greater value have higher priority than numbers of
   lesser value when using ``ceph osd pool set {pool-name} recovery_priority
   {value}`` to set the recovery/backfill priority. For example, a pool with
   the recovery/backfill priority ``30`` has a higher priority than a pool with
   the recovery/backfill priority ``15``.
Reverting Lost RADOS Objects
============================

If the cluster has lost one or more RADOS objects and you have decided to
abandon the search for the lost data, you must mark the unfound objects
``lost``.

If every possible location has been queried and all OSDs are ``up`` and ``in``,
but certain RADOS objects are still lost, you might have to give up on those
objects. This situation can arise when rare and unusual combinations of
failures allow the cluster to learn about writes that were performed before the
writes themselves were recovered.
The command to mark a RADOS object ``lost`` has only one supported option:
``revert``. The ``revert`` option will either roll back to a previous version
of the RADOS object (if it is old enough to have a previous version) or forget
about it entirely (if it is too new to have a previous version). To mark the
"unfound" objects ``lost``, run a command of the following form:

.. prompt:: bash #

   ceph pg {pg-id} mark_unfound_lost revert|delete

.. important:: Use this feature with caution. It might confuse applications
   that expect the object(s) to exist.
.. _Create a Pool: ../pools#createpool
.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
.. _pgcalc: https://old.ceph.com/pgcalc/