.. _placement groups:

==================
 Placement Groups
==================

.. _pg-autoscaler:

Autoscaling placement groups
============================

Placement groups (PGs) are an internal implementation detail of how
Ceph distributes data. You may enable *pg-autoscaling* to allow the cluster to
make recommendations or automatically adjust the number of PGs (``pg_num``)
for each pool based on expected cluster and pool utilization.

Each pool has a ``pg_autoscale_mode`` property that can be set to ``off``, ``on``, or ``warn``.

* ``off``: Disable autoscaling for this pool. It is up to the administrator to choose an appropriate ``pg_num`` for each pool. Please refer to :ref:`choosing-number-of-placement-groups` for more information.
* ``on``: Enable automated adjustments of the PG count for the given pool.
* ``warn``: Raise health alerts when the PG count should be adjusted.

To set the autoscaling mode for an existing pool::

    ceph osd pool set <pool-name> pg_autoscale_mode <mode>

For example, to enable autoscaling on pool ``foo``::

    ceph osd pool set foo pg_autoscale_mode on

You can also configure the default ``pg_autoscale_mode`` that is
set on any pools that are subsequently created::

    ceph config set global osd_pool_default_pg_autoscale_mode <mode>

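For example, to have every newly created pool start out with autoscaling
enabled::

    ceph config set global osd_pool_default_pg_autoscale_mode on
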
You can disable or enable the autoscaler for all pools with
the ``noautoscale`` flag. By default this flag is ``off``,
but you can turn it ``on`` by using the command::

    ceph osd pool set noautoscale

You can turn it ``off`` using the command::

    ceph osd pool unset noautoscale

To get the value of the flag, use the command::

    ceph osd pool get noautoscale

Viewing PG scaling recommendations
----------------------------------

You can view each pool, its relative utilization, and any suggested changes to
the PG count with this command::

    ceph osd pool autoscale-status

Output will be something like::

    POOL  SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
    a     12900M               3.0   82431M        0.4695                                       8       128         warn       True
    c     0                    3.0   82431M        0.0000  0.2000        0.9884           1.0   1       64          warn       True
    b     0       953.6M       3.0   82431M        0.0347                                       8                   warn       False

**SIZE** is the amount of data stored in the pool. **TARGET SIZE**, if
present, is the amount of data the administrator has specified that
they expect to eventually be stored in this pool. The system uses
the larger of the two values for its calculation.

**RATE** is the multiplier for the pool that determines how much raw
storage capacity is consumed. For example, a 3-replica pool will
have a rate of 3.0, while a k=4,m=2 erasure-coded pool will have a
rate of 1.5.

**RAW CAPACITY** is the total amount of raw storage capacity on the
OSDs that are responsible for storing this pool's (and perhaps other
pools') data. **RATIO** is the ratio of that total capacity that
this pool is consuming (i.e., ratio = size * rate / raw capacity).

**TARGET RATIO**, if present, is the ratio of storage that the
administrator has specified that they expect this pool to consume
relative to other pools with target ratios set.
If both target size bytes and ratio are specified, the
ratio takes precedence.

**EFFECTIVE RATIO** is the target ratio after adjusting in two ways:

1. Subtracting any capacity expected to be used by pools with target size set.
2. Normalizing the target ratios among pools with target ratio set so
   they collectively target the rest of the space. For example, 4
   pools with target_ratio 1.0 would have an effective ratio of 0.25.

The system uses the larger of the actual ratio and the effective ratio
for its calculation.

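As a worked example (with assumed numbers): suppose pools with a target
*size* set are expected to consume 20% of the raw capacity, and two other
pools have target ratios of 1.0 and 3.0. The ratios are first normalized
to 0.25 and 0.75, then applied to the remaining 80% of capacity, giving
effective ratios of 0.20 and 0.60 respectively.
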
**BIAS** is used as a multiplier to manually adjust a pool's PG count
based on prior information about how many PGs a specific pool is
expected to have.

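The bias is controlled by the ``pg_autoscale_bias`` pool property. For
example, a pool that stores many small objects relative to its size, such
as a metadata pool, can be given a bias greater than 1.0 (the pool name
below is illustrative)::

    ceph osd pool set cephfs_metadata pg_autoscale_bias 4.0
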
**PG_NUM** is the current number of PGs for the pool (or the current
number of PGs that the pool is working towards, if a ``pg_num``
change is in progress). **NEW PG_NUM**, if present, is what the
system believes the pool's ``pg_num`` should be changed to. It is
always a power of 2, and will only be present if the "ideal" value
varies from the current value by more than a factor of 3 by default.
This factor can be adjusted with::

    ceph osd pool set threshold 2.0

**AUTOSCALE** is the pool's ``pg_autoscale_mode``
and will be either ``on``, ``off``, or ``warn``.

The final column, **BULK**, indicates whether the pool is ``bulk``
and will be either ``True`` or ``False``. A ``bulk`` pool
is expected to be large and should start out
with a large number of PGs for performance purposes. On the other hand,
pools without the ``bulk`` flag are expected to be smaller, e.g.,
the ``.mgr`` pool or meta pools.


Automated scaling
-----------------

Allowing the cluster to automatically scale ``pg_num`` based on usage is the
simplest approach. Ceph will look at the total available storage and
target number of PGs for the whole system, look at how much data is
stored in each pool, and try to apportion PGs accordingly. The
system is relatively conservative with its approach, only making
changes to a pool when the current number of PGs (``pg_num``) is more
than a factor of 3 off from what it thinks it should be.

The target number of PGs per OSD is based on the
``mon_target_pg_per_osd`` configurable (default: 100), which can be
adjusted with::

    ceph config set global mon_target_pg_per_osd 100

The autoscaler analyzes pools and adjusts on a per-subtree basis.
Because each pool may map to a different CRUSH rule, and each rule may
distribute data across different devices, Ceph will consider
utilization of each subtree of the hierarchy independently. For
example, a pool that maps to OSDs of class ``ssd`` and a pool that maps
to OSDs of class ``hdd`` will each have optimal PG counts that depend on
the number of those respective device types.

The autoscaler uses the ``bulk`` flag to determine which pools
should start out with a full complement of PGs; such pools are
scaled down only when the usage ratio across the pool is not even.
However, if a pool doesn't have the ``bulk`` flag, it will
start out with minimal PGs and be given more PGs only when there is
more usage in the pool.

The autoscaler identifies any overlapping roots and prevents the pools
with such roots from scaling, because overlapping roots can cause problems
with the scaling process.

To create a pool with the ``bulk`` flag::

    ceph osd pool create <pool-name> --bulk

To set or unset the ``bulk`` flag of an existing pool::

    ceph osd pool set <pool-name> bulk <true/false/1/0>

To get the ``bulk`` flag of an existing pool::

    ceph osd pool get <pool-name> bulk

.. _specifying_pool_target_size:

Specifying expected pool size
-----------------------------

When a cluster or pool is first created, it will consume a small
fraction of the total cluster capacity and will appear to the system
as if it should only need a small number of placement groups.
However, in most cases cluster administrators have a good idea which
pools are expected to consume most of the system capacity over time.
By providing this information to Ceph, a more appropriate number of
PGs can be used from the beginning, preventing subsequent changes in
``pg_num`` and the overhead associated with moving data around when
those adjustments are made.

The *target size* of a pool can be specified in two ways: either in
terms of the absolute size of the pool (i.e., bytes), or as a weight
relative to other pools with a ``target_size_ratio`` set.

For example::

    ceph osd pool set mypool target_size_bytes 100T

will tell the system that `mypool` is expected to consume 100 TiB of
space. Alternatively::

    ceph osd pool set mypool target_size_ratio 1.0

will tell the system that `mypool` is expected to consume 1.0 relative
to the other pools with ``target_size_ratio`` set. If `mypool` is the
only pool in the cluster, this means an expected use of 100% of the
total capacity. If there is a second pool with ``target_size_ratio``
1.0, both pools would expect to use 50% of the cluster capacity.

You can also set the target size of a pool at creation time with the optional ``--target-size-bytes <bytes>`` or ``--target-size-ratio <ratio>`` arguments to the ``ceph osd pool create`` command.

Note that if impossible target size values are specified (for example,
a capacity larger than the total cluster capacity) then a health warning
(``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised.

If both ``target_size_ratio`` and ``target_size_bytes`` are specified
for a pool, only the ratio will be considered, and a health warning
(``POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO``) will be issued.
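
Either warning can be inspected, along with any other current health
messages, with::

    ceph health detail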

Specifying bounds on a pool's PGs
---------------------------------

It is also possible to specify a minimum number of PGs for a pool.
This is useful for establishing a lower bound on the amount of
parallelism clients will see when doing IO, even when a pool is mostly
empty. Setting the lower bound prevents Ceph from reducing (or
recommending you reduce) the PG number below the configured number.

You can set the minimum or maximum number of PGs for a pool with::

    ceph osd pool set <pool-name> pg_num_min <num>
    ceph osd pool set <pool-name> pg_num_max <num>

You can also specify the minimum or maximum PG count at pool creation
time with the optional ``--pg-num-min <num>`` or ``--pg-num-max
<num>`` arguments to the ``ceph osd pool create`` command.

.. _preselection:

A preselection of pg_num
========================

When creating a new pool with::

    ceph osd pool create {pool-name} [pg_num]

it is optional to choose the value of ``pg_num``. If you do not
specify ``pg_num``, the cluster can (by default) automatically tune it
for you based on how much data is stored in the pool (see above, :ref:`pg-autoscaler`).

Alternatively, ``pg_num`` can be explicitly provided. However,
whether you specify a ``pg_num`` value or not does not affect whether
the value is automatically tuned by the cluster after the fact. To
enable or disable auto-tuning::

    ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn)

The "rule of thumb" for PGs per OSD has traditionally been 100. With
the addition of the balancer (which is also enabled by default), a
value of more like 50 PGs per OSD is probably reasonable. The
challenge (which the autoscaler normally does for you) is to:

- have the PGs per pool proportional to the data in the pool, and
- end up with 50-100 PGs per OSD, after the replication or
  erasure-coding fan-out of each PG across OSDs is taken into
  consideration.

How are Placement Groups used?
==============================

A placement group (PG) aggregates objects within a pool because
tracking object placement and object metadata on a per-object basis is
computationally expensive--i.e., a system with millions of objects
cannot realistically track placement on a per-object basis.

.. ditaa::

           /-----\  /-----\  /-----\  /-----\  /-----\
           | obj |  | obj |  | obj |  | obj |  | obj |
           \-----/  \-----/  \-----/  \-----/  \-----/
              |        |        |        |        |
              +--------+--------+        +---+----+
                       |                     |
                       v                     v
   +-----------------------+     +-----------------------+
   |  Placement Group #1   |     |  Placement Group #2   |
   |                       |     |                       |
   +-----------------------+     +-----------------------+
               |                             |
               +--------------+--------------+
                              |
                              v
                  +-----------------------+
                  |         Pool          |
                  |                       |
                  +-----------------------+

The Ceph client will calculate which placement group an object should
be in. It does this by hashing the object ID and applying an operation
based on the number of PGs in the defined pool and the ID of the pool.
See `Mapping PGs to OSDs`_ for details.

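To make the mapping concrete, here is a rough sketch in Python. It is not
Ceph's actual code: the real client uses Ceph's rjenkins hash (CRC32 stands
in for it here), and ``stable_mod`` below mirrors Ceph's ``ceph_stable_mod``,
which keeps most objects in place while ``pg_num`` is growing::

    import zlib

    def stable_mod(x: int, b: int, bmask: int) -> int:
        # Like x % b, but stable as b grows toward the next power of two.
        return x & bmask if (x & bmask) < b else x & (bmask >> 1)

    def object_to_pg(object_name: str, pool_id: int, pg_num: int) -> str:
        h = zlib.crc32(object_name.encode())          # stand-in for rjenkins
        bmask = (1 << (pg_num - 1).bit_length()) - 1  # pg_num mask
        return f"{pool_id}.{stable_mod(h, pg_num, bmask):x}"

    print(object_to_pg("myobject", 1, 128))  # prints a PG id such as "1.6c"
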
The object's contents within a placement group are stored in a set of
OSDs. For instance, in a replicated pool of size two, each placement
group will store objects on two OSDs, as shown below.

.. ditaa::

   +-----------------------+   +-----------------------+
   |  Placement Group #1   |   |  Placement Group #2   |
   |                       |   |                       |
   +-----------------------+   +-----------------------+
        |             |             |             |
        v             v             v             v
   /----------\  /----------\  /----------\  /----------\
   |          |  |          |  |          |  |          |
   |  OSD #1  |  |  OSD #2  |  |  OSD #2  |  |  OSD #3  |
   |          |  |          |  |          |  |          |
   \----------/  \----------/  \----------/  \----------/


Should OSD #2 fail, another will be assigned to Placement Group #1 and
will be filled with copies of all objects in OSD #1. If the pool size
is changed from two to three, an additional OSD will be assigned to
the placement group and will receive copies of all objects in the
placement group.

Placement groups do not own the OSD; they share it with other
placement groups from the same pool or even other pools. If OSD #2
fails, Placement Group #2 will also have to restore copies of
objects, using OSD #3.

When the number of placement groups increases, the new placement
groups will be assigned OSDs. The result of the CRUSH function will
also change and some objects from the former placement groups will be
copied over to the new placement groups and removed from the old ones.

Placement Groups Tradeoffs
==========================

Data durability and even distribution among all OSDs call for more
placement groups, but their number should be reduced to the minimum
needed to save CPU and memory.

.. _data durability:

Data durability
---------------

After an OSD fails, the risk of data loss increases until the data it
contained is fully recovered. Let's imagine a scenario that causes
permanent data loss in a single placement group:

- The OSD fails and all copies of the object it contains are lost.
  For all objects within the placement group the number of replicas
  suddenly drops from three to two.

- Ceph starts recovery for this placement group by choosing a new OSD
  to re-create the third copy of all objects.

- Another OSD, within the same placement group, fails before the new
  OSD is fully populated with the third copy. Some objects will then
  only have one surviving copy.

- Ceph picks yet another OSD and keeps copying objects to restore the
  desired number of copies.

- A third OSD, within the same placement group, fails before recovery
  is complete. If this OSD contained the only remaining copy of an
  object, it is permanently lost.

In a cluster containing 10 OSDs with 512 placement groups in a three
replica pool, CRUSH will give each placement group three OSDs. In the
end, each OSD will end up hosting (512 * 3) / 10 = ~150 placement
groups. When the first OSD fails, the above scenario will therefore
start recovery for all 150 placement groups at the same time.

The 150 placement groups being recovered are likely to be
homogeneously spread over the 9 remaining OSDs. Each remaining OSD is
therefore likely to send copies of objects to all others and also
receive some new objects to be stored because they became part of a
new placement group.

The amount of time it takes for this recovery to complete entirely
depends on the architecture of the Ceph cluster. Let's say each OSD is
hosted by a 1TB SSD on a single machine, all of them are connected
to a 10Gb/s switch, and the recovery for a single OSD completes within
M minutes. If there are two OSDs per machine using spinners with no
SSD journal and a 1Gb/s switch, it will be at least an order of
magnitude slower.

In a cluster of this size, the number of placement groups has almost
no influence on data durability. It could be 128 or 8192 and the
recovery would not be slower or faster.

However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs
is likely to speed up recovery and therefore improve data durability
significantly. Each OSD now participates in only ~75 placement groups
instead of ~150 when there were only 10 OSDs, and it will still require
all 19 remaining OSDs to perform the same amount of object copies in
order to recover. But where 10 OSDs had to copy approximately 100GB
each, they now have to copy 50GB each instead. If the network was the
bottleneck, recovery will happen twice as fast. In other words,
recovery goes faster when the number of OSDs increases.

If this cluster grows to 40 OSDs, each of them will only host ~35
placement groups. If an OSD dies, recovery will keep going faster
unless it is blocked by another bottleneck. However, if this cluster
grows to 200 OSDs, each of them will only host ~7 placement groups. If
an OSD dies, recovery will involve at most ~21 (7 * 3) OSDs
in these placement groups: recovery will take longer than when there
were 40 OSDs, meaning the number of placement groups should be
increased.

No matter how short the recovery time is, there is a chance for a
second OSD to fail while it is in progress. In the 10 OSD cluster
described above, if any of them fails, then ~17 placement groups
(i.e. ~150 / 9 placement groups being recovered) will only have one
surviving copy. And if any of the 8 remaining OSDs fails, the last
objects of two placement groups are likely to be lost (i.e. ~17 / 8
placement groups with only one remaining copy being recovered).

When the size of the cluster grows to 20 OSDs, the number of placement
groups damaged by the loss of three OSDs drops. The second OSD lost
will degrade ~4 (i.e. ~75 / 19 placement groups being recovered)
instead of ~17, and the third OSD lost will only lose data if it is one
of the four OSDs containing the surviving copy. In other words, if the
probability of losing one OSD is 0.0001% during the recovery time
frame, it goes from 17 * 10 * 0.0001% in the cluster with 10 OSDs to 4 * 20 *
0.0001% in the cluster with 20 OSDs.

In a nutshell, more OSDs mean faster recovery and a lower risk of
cascading failures leading to the permanent loss of a placement
group. Having 512 or 4096 placement groups is roughly equivalent in a
cluster with fewer than 50 OSDs as far as data durability is concerned.

Note: It may take a long time for a new OSD added to the cluster to be
populated with the placement groups that were assigned to it. However
there is no degradation of any object and it has no impact on the
durability of the data contained in the cluster.

.. _object distribution:

Object distribution within a pool
---------------------------------

Ideally objects are evenly distributed in each placement group. Since
CRUSH computes the placement group for each object, but does not
actually know how much data is stored in each OSD within this
placement group, the ratio between the number of placement groups and
the number of OSDs may influence the distribution of the data
significantly.

For instance, if there was a single placement group for ten OSDs in a
three replica pool, only three OSDs would be used because CRUSH would
have no other choice. When more placement groups are available,
objects are more likely to be evenly spread among them. CRUSH also
makes every effort to evenly spread OSDs among all existing placement
groups.

As long as there are one or two orders of magnitude more placement
groups than OSDs, the distribution should be even. For instance, 256
placement groups for 3 OSDs, 512 or 1024 placement groups for 10 OSDs,
etc.

Uneven data distribution can be caused by factors other than the ratio
between OSDs and placement groups. Since CRUSH does not take into
account the size of the objects, a few very large objects may create
an imbalance. Let's say one million 4K objects totaling 4GB are evenly
spread among 1024 placement groups on 10 OSDs. They will use 4GB / 10
= 400MB on each OSD. If one 400MB object is added to the pool, the
three OSDs supporting the placement group in which the object has been
placed will be filled with 400MB + 400MB = 800MB while the seven
others will remain occupied with only 400MB.

.. _resource usage:

Memory, CPU and network usage
-----------------------------

For each placement group, OSDs and MONs need memory, network and CPU
at all times, and even more during recovery. Sharing this overhead by
clustering objects within a placement group is one of the main reasons
they exist.

Minimizing the number of placement groups saves significant amounts of
resources.

.. _choosing-number-of-placement-groups:

Choosing the number of Placement Groups
=======================================

.. note:: It is rarely necessary to do this math by hand. Instead, use the ``ceph osd pool autoscale-status`` command in combination with the ``target_size_bytes`` or ``target_size_ratio`` pool properties. See :ref:`pg-autoscaler` for more information.

If you have more than 50 OSDs, we recommend approximately 50-100
placement groups per OSD to balance out resource usage, data
durability and distribution. If you have fewer than 50 OSDs, choosing
among the `preselection`_ above is best. For a single pool of objects,
you can use the following formula to get a baseline:

    Total PGs = :math:`\frac{OSDs \times 100}{pool \: size}`

Where **pool size** is either the number of replicas for replicated
pools or the K+M sum for erasure-coded pools (as returned by **ceph
osd erasure-code-profile get**).

You should then check if the result makes sense with the way you
designed your Ceph cluster to maximize `data durability`_,
`object distribution`_ and minimize `resource usage`_.

The result should always be **rounded up to the nearest power of two**.

Only a power of two will evenly balance the number of objects among
placement groups. Other values will result in an uneven distribution of
data across your OSDs. Their use should be limited to incrementally
stepping from one power of two to another.

As an example, for a cluster with 200 OSDs and a pool size of 3
replicas, you would estimate your number of PGs as follows:

    :math:`\frac{200 \times 100}{3} = 6667`. Nearest power of 2: 8192

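The same estimate can be scripted; here is a minimal sketch in Python (the
helper name is illustrative, not part of Ceph)::

    def baseline_pg_count(osds: int, pool_size: int, pgs_per_osd: int = 100) -> int:
        """Apply the formula above and round up to a power of two."""
        raw = osds * pgs_per_osd / pool_size
        power = 1
        while power < raw:
            power *= 2
        return power

    print(baseline_pg_count(200, 3))   # -> 8192
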
When using multiple data pools for storing objects, you need to ensure
that you balance the number of placement groups per pool with the
number of placement groups per OSD so that you arrive at a reasonable
total number of placement groups that provides reasonably low variance
per OSD without taxing system resources or making the peering process
too slow.

For instance, a cluster of 10 pools each with 512 placement groups on
ten OSDs is a total of 5,120 placement groups spread over ten OSDs,
that is 512 placement groups per OSD. That does not use too many
resources. However, if 1,000 pools were created with 512 placement
groups each, the OSDs will handle ~50,000 placement groups each and it
would require significantly more resources and time for peering.

You may find the `PGCalc`_ tool helpful.


.. _setting the number of placement groups:

Set the Number of Placement Groups
==================================

To set the number of placement groups in a pool, you must specify the
number of placement groups at the time you create the pool.
See `Create a Pool`_ for details. Even after a pool is created, you can also change the number of placement groups with::

    ceph osd pool set {pool-name} pg_num {pg_num}

After you increase the number of placement groups, you must also
increase the number of placement groups for placement (``pgp_num``)
before your cluster will rebalance. The ``pgp_num`` will be the number of
placement groups that will be considered for placement by the CRUSH
algorithm. Increasing ``pg_num`` splits the placement groups, but data
will not be migrated to the newer placement groups until the number of
placement groups for placement, i.e. ``pgp_num``, is increased. The ``pgp_num``
should be equal to the ``pg_num``. To increase the number of
placement groups for placement, execute the following::

    ceph osd pool set {pool-name} pgp_num {pgp_num}

When decreasing the number of PGs, ``pgp_num`` is adjusted
automatically for you.

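For example, assuming a pool named ``foo`` (the name is illustrative) whose
PG count is being doubled from 128, you would run::

    ceph osd pool set foo pg_num 256
    ceph osd pool set foo pgp_num 256
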
Get the Number of Placement Groups
==================================

To get the number of placement groups in a pool, execute the following::

    ceph osd pool get {pool-name} pg_num


Get a Cluster's PG Statistics
=============================

To get the statistics for the placement groups in your cluster, execute the following::

    ceph pg dump [--format {format}]

Valid formats are ``plain`` (default) and ``json``.


Get Statistics for Stuck PGs
============================

To get the statistics for all placement groups stuck in a specified state,
execute the following::

    ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>]

**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD
with the most up-to-date data to come up and in.

**Unclean** Placement groups contain objects that are not replicated the desired number
of times. They should be recovering.

**Stale** Placement groups are in an unknown state - the OSDs that host them have not
reported to the monitor cluster in a while (configured by ``mon_osd_report_timeout``).

Valid formats are ``plain`` (default) and ``json``. The threshold defines the minimum number
of seconds the placement group is stuck before including it in the returned statistics
(default 300 seconds).

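For example, to list placement groups that have been stale for at least ten
minutes, in JSON format::

    ceph pg dump_stuck stale --format json --threshold 600
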

Get a PG Map
============

To get the placement group map for a particular placement group, execute the following::

    ceph pg map {pg-id}

For example::

    ceph pg map 1.6c

Ceph will return the placement group map, the placement group, and the OSD status::

    osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]


Get a PG's Statistics
=====================

To retrieve statistics for a particular placement group, execute the following::

    ceph pg {pg-id} query


Scrub a Placement Group
=======================

To scrub a placement group, execute the following::

    ceph pg scrub {pg-id}

Ceph checks the primary and any replica nodes, generates a catalog of all objects
in the placement group and compares them to ensure that no objects are missing
or mismatched, and their contents are consistent. Assuming the replicas all
match, a final semantic sweep ensures that all of the snapshot-related object
metadata is consistent. Errors are reported via logs.

To scrub all placement groups from a specific pool, execute the following::

    ceph osd pool scrub {pool-name}

Prioritize backfill/recovery of Placement Group(s)
==================================================

You may run into a situation where a number of placement groups require
recovery and/or backfill, and some particular groups hold data more important
than others (for example, those PGs may hold data for images used by running
machines and other PGs may be used by inactive machines/less relevant data).
In that case, you may want to prioritize recovery of those groups so
performance and/or availability of data stored on those groups is restored
earlier. To do this (mark particular placement group(s) as prioritized during
backfill or recovery), execute the following::

    ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
    ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

This will cause Ceph to perform recovery or backfill on the specified placement
groups first, before other placement groups. This does not interrupt currently
ongoing backfills or recovery, but causes the specified PGs to be processed
as soon as possible. If you change your mind or prioritized the wrong groups,
use::

    ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
    ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

This will remove the "force" flag from those PGs and they will be processed
in default order. Again, this doesn't affect currently processed placement
groups, only those that are still queued.

The "force" flag is cleared automatically after recovery or backfill of the
group is done.


Similarly, you may use the following commands to force Ceph to perform recovery
or backfill on all placement groups from a specified pool first::

    ceph osd pool force-recovery {pool-name}
    ceph osd pool force-backfill {pool-name}

or::

    ceph osd pool cancel-force-recovery {pool-name}
    ceph osd pool cancel-force-backfill {pool-name}

to restore the default recovery or backfill priority if you change your mind.

Note that these commands could possibly break the ordering of Ceph's internal
priority computations, so use them with caution!
In particular, if you have multiple pools that are currently sharing the same
underlying OSDs, and some particular pools hold data more important than others,
we recommend you use the following command to re-arrange all pools'
recovery/backfill priority in a better order::

    ceph osd pool set {pool-name} recovery_priority {value}

For example, if you have 10 pools you could make the most important one priority 10,
the next 9, etc. Or you could leave most pools alone and have, say, 3 important pools
all priority 1 or priorities 3, 2, 1 respectively.

Revert Lost
===========

If the cluster has lost one or more objects, and you have decided to
abandon the search for the lost data, you must mark the unfound objects
as ``lost``.

If all possible locations have been queried and objects are still
lost, you may have to give up on the lost objects. This is
possible given unusual combinations of failures that allow the cluster
to learn about writes that were performed before the writes themselves
are recovered.

Currently the only supported option is "revert", which will either roll back to
a previous version of the object or (if it was a new object) forget about it
entirely. To mark the "unfound" objects as "lost", execute the following::

    ceph pg {pg-id} mark_unfound_lost revert|delete

.. important:: Use this feature with caution, because it may confuse
   applications that expect the object(s) to exist.


.. toctree::
   :hidden:

   pg-states
   pg-concepts


.. _Create a Pool: ../pools#createpool
.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
.. _PGCalc: https://old.ceph.com/pgcalc/