.. _placement groups:

==================
 Placement Groups
==================

.. _pg-autoscaler:

Autoscaling placement groups
============================

Placement groups (PGs) are an internal implementation detail of how
Ceph distributes data. You may enable *pg-autoscaling* to allow the cluster to
make recommendations or automatically adjust the number of PGs (``pgp_num``)
for each pool based on expected cluster and pool utilization.

Each pool has a ``pg_autoscale_mode`` property that can be set to ``off``, ``on``, or ``warn``.

* ``off``: Disable autoscaling for this pool. It is up to the administrator to choose an appropriate ``pgp_num`` for each pool. Please refer to :ref:`choosing-number-of-placement-groups` for more information.
* ``on``: Enable automated adjustments of the PG count for the given pool.
* ``warn``: Raise health alerts when the PG count should be adjusted.

To set the autoscaling mode for an existing pool::

  ceph osd pool set <pool-name> pg_autoscale_mode <mode>

For example, to enable autoscaling on pool ``foo``::

  ceph osd pool set foo pg_autoscale_mode on

You can also configure the default ``pg_autoscale_mode`` that is
set on any pools that are subsequently created::

  ceph config set global osd_pool_default_pg_autoscale_mode <mode>

You can disable or enable the autoscaler for all pools with
the ``noautoscale`` flag. By default this flag is set to ``off``,
but you can turn it ``on`` by using the command::

  ceph osd pool set noautoscale

You can turn it ``off`` using the command::

  ceph osd pool unset noautoscale

To ``get`` the value of the flag, use the command::

  ceph osd pool get noautoscale

Viewing PG scaling recommendations
----------------------------------

You can view each pool, its relative utilization, and any suggested changes to
the PG count with this command::

  ceph osd pool autoscale-status

Output will be something like::

 POOL  SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
 a     12900M               3.0   82431M        0.4695                                       8       128         warn       True
 c     0                    3.0   82431M        0.0000  0.2000        0.9884           1.0   1       64          warn       True
 b     0       953.6M       3.0   82431M        0.0347                                       8                   warn       False

**SIZE** is the amount of data stored in the pool. **TARGET SIZE**, if
present, is the amount of data the administrator has specified that
they expect to eventually be stored in this pool. The system uses
the larger of the two values for its calculation.

**RATE** is the multiplier for the pool that determines how much raw
storage capacity is consumed. For example, a 3 replica pool will
have a ratio of 3.0, while a k=4,m=2 erasure coded pool will have a
ratio of 1.5.

**RAW CAPACITY** is the total amount of raw storage capacity on the
OSDs that are responsible for storing this pool's (and perhaps other
pools') data. **RATIO** is the ratio of that total capacity that
this pool is consuming (i.e., ratio = size * rate / raw capacity).

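For example, using the sample output above, pool ``a`` stores 12900M at a rate
of 3.0 against 82431M of raw capacity, so its ratio is
:math:`\frac{12900 \times 3.0}{82431} \approx 0.4695`, matching the **RATIO**
column.
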
**TARGET RATIO**, if present, is the ratio of storage that the
administrator has specified that they expect this pool to consume
relative to other pools with target ratios set.
If both target size bytes and ratio are specified, the
ratio takes precedence.

**EFFECTIVE RATIO** is the target ratio after adjusting in two ways:

1. Subtracting any capacity expected to be used by pools with target size set.
2. Normalizing the target ratios among pools with target ratio set so
   they collectively target the rest of the space. For example, 4
   pools with target_ratio 1.0 would have an effective ratio of 0.25.

The system uses the larger of the actual ratio and the effective ratio
for its calculation.

**BIAS** is used as a multiplier to manually adjust a pool's PG count based
on prior information about how many PGs a specific pool is expected
to have.

**PG_NUM** is the current number of PGs for the pool (or the current
number of PGs that the pool is working towards, if a ``pg_num``
change is in progress). **NEW PG_NUM**, if present, is what the
system believes the pool's ``pg_num`` should be changed to. It is
always a power of 2, and will only be present if the "ideal" value
varies from the current value by more than a factor of 3 (by default).
This factor can be adjusted with::

  ceph osd pool set threshold 2.0

**AUTOSCALE** is the pool's ``pg_autoscale_mode``
and will be either ``on``, ``off``, or ``warn``.

The final column, **BULK**, indicates whether the pool is ``bulk``
and will be either ``True`` or ``False``. A ``bulk`` pool
means that the pool is expected to be large and should start out
with a large number of PGs for performance purposes. On the other hand,
pools without the ``bulk`` flag are expected to be smaller, e.g.,
the ``.mgr`` pool or meta pools.


Automated scaling
-----------------

Allowing the cluster to automatically scale ``pgp_num`` based on usage is the
simplest approach. Ceph will look at the total available storage and
target number of PGs for the whole system, look at how much data is
stored in each pool, and try to apportion PGs accordingly. The
system is relatively conservative with its approach, only making
changes to a pool when the current number of PGs (``pg_num``) is more
than a factor of 3 off from what it thinks it should be.

The target number of PGs per OSD is based on the
``mon_target_pg_per_osd`` configurable (default: 100), which can be
adjusted with::

  ceph config set global mon_target_pg_per_osd 100

The autoscaler analyzes pools and adjusts on a per-subtree basis.
Because each pool may map to a different CRUSH rule, and each rule may
distribute data across different devices, Ceph will consider
utilization of each subtree of the hierarchy independently. For
example, a pool that maps to OSDs of class `ssd` and a pool that maps
to OSDs of class `hdd` will each have optimal PG counts that depend on
the number of those respective device types.

In the case where a pool uses OSDs under two or more CRUSH roots (e.g., shadow
trees with both `ssd` and `hdd` devices), the autoscaler will
issue a warning to the user in the manager log stating the name of the pool
and the set of roots that overlap each other. The autoscaler will not
scale any pools with overlapping roots because this can cause problems
with the scaling process. We recommend making each pool belong to only
one root (one OSD class) to get rid of the warning and ensure a successful
scaling process.

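As a minimal sketch, one way to keep a pool on a single device class is to give
it its own CRUSH rule (the rule name ``ssd-only`` and pool name ``mypool``
below are illustrative)::

  # Create a replicated rule that selects only OSDs of class ssd,
  # with host as the failure domain.
  ceph osd crush rule create-replicated ssd-only default host ssd

  # Point the pool at that rule so all of its PGs map under one root/class.
  ceph osd pool set mypool crush_rule ssd-only
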
The autoscaler uses the `bulk` flag to determine which pools
should start out with a full complement of PGs and only
scales down when the usage ratio across the pool is not even.
However, if the pool doesn't have the `bulk` flag, the pool will
start out with minimal PGs and gain more only when there is more usage in the pool.

To create a pool with the `bulk` flag::

  ceph osd pool create <pool-name> --bulk

To set or unset the `bulk` flag of an existing pool::

  ceph osd pool set <pool-name> bulk <true/false/1/0>

To get the `bulk` flag of an existing pool::

  ceph osd pool get <pool-name> bulk

.. _specifying_pool_target_size:

Specifying expected pool size
-----------------------------

When a cluster or pool is first created, it will consume a small
fraction of the total cluster capacity and will appear to the system
as if it should only need a small number of placement groups.
However, in most cases cluster administrators have a good idea which
pools are expected to consume most of the system capacity over time.
By providing this information to Ceph, a more appropriate number of
PGs can be used from the beginning, preventing subsequent changes in
``pg_num`` and the overhead associated with moving data around when
those adjustments are made.

The *target size* of a pool can be specified in two ways: either in
terms of the absolute size of the pool (i.e., bytes), or as a weight
relative to other pools with a ``target_size_ratio`` set.

For example::

  ceph osd pool set mypool target_size_bytes 100T

will tell the system that `mypool` is expected to consume 100 TiB of
space. Alternatively::

  ceph osd pool set mypool target_size_ratio 1.0

will tell the system that `mypool` is expected to consume 1.0 relative
to the other pools with ``target_size_ratio`` set. If `mypool` is the
only pool in the cluster, this means an expected use of 100% of the
total capacity. If there is a second pool with ``target_size_ratio``
1.0, both pools would expect to use 50% of the cluster capacity.

You can also set the target size of a pool at creation time with the optional ``--target-size-bytes <bytes>`` or ``--target-size-ratio <ratio>`` arguments to the ``ceph osd pool create`` command.

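For example, a pool expected to consume roughly half of the cluster (relative
to other pools with a ratio set) could be created with its target ratio up
front (the pool name here is illustrative)::

  ceph osd pool create mypool --target-size-ratio 0.5
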
Note that if impossible target size values are specified (for example,
a capacity larger than the total cluster) then a health warning
(``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised.

If both ``target_size_ratio`` and ``target_size_bytes`` are specified
for a pool, only the ratio will be considered, and a health warning
(``POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO``) will be issued.

Specifying bounds on a pool's PGs
---------------------------------

It is also possible to specify a minimum number of PGs for a pool.
This is useful for establishing a lower bound on the amount of
parallelism clients will see when doing IO, even when a pool is mostly
empty. Setting the lower bound prevents Ceph from reducing (or
recommending you reduce) the PG number below the configured number.

You can set the minimum or maximum number of PGs for a pool with::

  ceph osd pool set <pool-name> pg_num_min <num>
  ceph osd pool set <pool-name> pg_num_max <num>

You can also specify the minimum or maximum PG count at pool creation
time with the optional ``--pg-num-min <num>`` or ``--pg-num-max
<num>`` arguments to the ``ceph osd pool create`` command.

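For example, a new pool that should never be scaled (or recommended) below 32
PGs could be created with (the pool name is illustrative)::

  ceph osd pool create mypool --pg-num-min 32
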

.. _preselection:

A preselection of pg_num
========================

When creating a new pool with::

  ceph osd pool create {pool-name} [pg_num]

it is optional to choose the value of ``pg_num``. If you do not
specify ``pg_num``, the cluster can (by default) automatically tune it
for you based on how much data is stored in the pool (see above, :ref:`pg-autoscaler`).

Alternatively, ``pg_num`` can be explicitly provided. However,
whether you specify a ``pg_num`` value or not does not affect whether
the value is automatically tuned by the cluster after the fact. To
enable or disable auto-tuning::

  ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn)

The "rule of thumb" for PGs per OSD has traditionally been 100. With
the addition of the balancer (which is also enabled by default), a
value of more like 50 PGs per OSD is probably reasonable. The
challenge (which the autoscaler normally handles for you) is to:

- have the PGs per pool proportional to the data in the pool, and
- end up with 50-100 PGs per OSD, after the replication or
  erasure-coding fan-out of each PG across OSDs is taken into
  consideration.

How are Placement Groups used?
==============================

A placement group (PG) aggregates objects within a pool because
tracking object placement and object metadata on a per-object basis is
computationally expensive--i.e., a system with millions of objects
cannot realistically track placement on a per-object basis.

.. ditaa::
   /-----\  /-----\  /-----\  /-----\  /-----\
   | obj |  | obj |  | obj |  | obj |  | obj |
   \-----/  \-----/  \-----/  \-----/  \-----/
      |        |        |        |        |
      +--------+--------+        +---+----+
               |                     |
               v                     v
   +-----------------------+  +-----------------------+
   |  Placement Group #1   |  |  Placement Group #2   |
   |                       |  |                       |
   +-----------------------+  +-----------------------+
               |                          |
               +------------+-------------+
                            |
                            v
                +-----------------------+
                |         Pool          |
                |                       |
                +-----------------------+

The Ceph client will calculate which placement group an object should
be in. It does this by hashing the object ID and applying an operation
based on the number of PGs in the defined pool and the ID of the pool.
See `Mapping PGs to OSDs`_ for details.

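You can ask the cluster to show this computed mapping for any object name with
the ``ceph osd map`` command; for example, for a hypothetical object
``myobject`` in pool ``mypool``::

  ceph osd map mypool myobject
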
The object's contents within a placement group are stored in a set of
OSDs. For instance, in a replicated pool of size two, each placement
group will store objects on two OSDs, as shown below.

.. ditaa::
   +-----------------------+  +-----------------------+
   |  Placement Group #1   |  |  Placement Group #2   |
   |                       |  |                       |
   +-----------------------+  +-----------------------+
        |             |             |             |
        v             v             v             v
   /----------\  /----------\  /----------\  /----------\
   |          |  |          |  |          |  |          |
   |  OSD #1  |  |  OSD #2  |  |  OSD #2  |  |  OSD #3  |
   |          |  |          |  |          |  |          |
   \----------/  \----------/  \----------/  \----------/

Should OSD #2 fail, another will be assigned to Placement Group #1 and
will be filled with copies of all objects in OSD #1. If the pool size
is changed from two to three, an additional OSD will be assigned to
the placement group and will receive copies of all objects in the
placement group.

Placement groups do not own the OSD; they share it with other
placement groups from the same pool or even other pools. If OSD #2
fails, Placement Group #2 will also have to restore copies of
objects, using OSD #3.

When the number of placement groups increases, the new placement
groups will be assigned OSDs. The result of the CRUSH function will
also change and some objects from the former placement groups will be
copied over to the new Placement Groups and removed from the old ones.

Placement Groups Tradeoffs
==========================

Data durability and even distribution among all OSDs call for more
placement groups, but their number should be reduced to the minimum
required, to save CPU and memory.

.. _data durability:

Data durability
---------------

After an OSD fails, the risk of data loss increases until the data it
contained is fully recovered. Let's imagine a scenario that causes
permanent data loss in a single placement group:

- The OSD fails and all copies of the objects it contains are lost.
  For all objects within the placement group the number of replicas
  suddenly drops from three to two.

- Ceph starts recovery for this placement group by choosing a new OSD
  to re-create the third copy of all objects.

- Another OSD, within the same placement group, fails before the new
  OSD is fully populated with the third copy. Some objects will then
  only have one surviving copy.

- Ceph picks yet another OSD and keeps copying objects to restore the
  desired number of copies.

- A third OSD, within the same placement group, fails before recovery
  is complete. If this OSD contained the only remaining copy of an
  object, it is permanently lost.

In a cluster containing 10 OSDs with 512 placement groups in a three
replica pool, CRUSH will give each placement group three OSDs. In the
end, each OSD will end up hosting (512 * 3) / 10 = ~150 Placement
Groups. When the first OSD fails, the above scenario will therefore
start recovery for all 150 placement groups at the same time.

The 150 placement groups being recovered are likely to be
homogeneously spread over the 9 remaining OSDs. Each remaining OSD is
therefore likely to send copies of objects to all others and also
receive some new objects to be stored because they became part of a
new placement group.

The amount of time it takes for this recovery to complete entirely
depends on the architecture of the Ceph cluster. Let's say each OSD is
hosted by a 1TB SSD on a single machine and all of them are connected
to a 10Gb/s switch and the recovery for a single OSD completes within
M minutes. If there are two OSDs per machine using spinners with no
SSD journal and a 1Gb/s switch, it will at least be an order of
magnitude slower.

In a cluster of this size, the number of placement groups has almost
no influence on data durability. It could be 128 or 8192 and the
recovery would not be slower or faster.

However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs
is likely to speed up recovery and therefore improve data durability
significantly. Each OSD now participates in only ~75 placement groups
instead of ~150 when there were only 10 OSDs, and it will still require
all 19 remaining OSDs to perform the same amount of object copies in
order to recover. But where 10 OSDs had to copy approximately 100GB
each, they now have to copy 50GB each instead. If the network was the
bottleneck, recovery will happen twice as fast. In other words,
recovery goes faster when the number of OSDs increases.

If this cluster grows to 40 OSDs, each of them will only host ~35
placement groups. If an OSD dies, recovery will keep going faster
unless it is blocked by another bottleneck. However, if this cluster
grows to 200 OSDs, each of them will only host ~7 placement groups. If
an OSD dies, recovery will happen among at most ~21 (7 * 3) OSDs
in these placement groups: recovery will take longer than when there
were 40 OSDs, meaning the number of placement groups should be
increased.

No matter how short the recovery time is, there is a chance for a
second OSD to fail while it is in progress. In the 10 OSD cluster
described above, if any of them fail, then ~17 placement groups
(i.e. ~150 / 9 placement groups being recovered) will only have one
surviving copy. And if any of the 8 remaining OSDs fail, the last
objects of two placement groups are likely to be lost (i.e. ~17 / 8
placement groups with only one remaining copy being recovered).

When the size of the cluster grows to 20 OSDs, the number of Placement
Groups damaged by the loss of three OSDs drops. The second OSD lost
will degrade ~4 (i.e. ~75 / 19 placement groups being recovered)
instead of ~17 and the third OSD lost will only lose data if it is one
of the four OSDs containing the surviving copy. In other words, if the
probability of losing one OSD is 0.0001% during the recovery time
frame, it goes from 17 * 10 * 0.0001% in the cluster with 10 OSDs to 4 * 20 *
0.0001% in the cluster with 20 OSDs.

In a nutshell, more OSDs mean faster recovery and a lower risk of
cascading failures leading to the permanent loss of a Placement
Group. Having 512 or 4096 Placement Groups is roughly equivalent in a
cluster with fewer than 50 OSDs as far as data durability is concerned.

Note: It may take a long time for a new OSD added to the cluster to be
populated with placement groups that were assigned to it. However,
there is no degradation of any object and it has no impact on the
durability of the data contained in the cluster.

.. _object distribution:

Object distribution within a pool
---------------------------------

Ideally objects are evenly distributed in each placement group. Since
CRUSH computes the placement group for each object, but does not
actually know how much data is stored in each OSD within this
placement group, the ratio between the number of placement groups and
the number of OSDs may influence the distribution of the data
significantly.

For instance, if there was a single placement group for ten OSDs in a
three replica pool, only three OSDs would be used because CRUSH would
have no other choice. When more placement groups are available,
objects are more likely to be evenly spread among them. CRUSH also
makes every effort to evenly spread OSDs among all existing Placement
Groups.

As long as there are one or two orders of magnitude more Placement
Groups than OSDs, the distribution should be even. For instance, 256
placement groups for 3 OSDs, 512 or 1024 placement groups for 10 OSDs,
etc.

Uneven data distribution can be caused by factors other than the ratio
between OSDs and placement groups. Since CRUSH does not take into
account the size of the objects, a few very large objects may create
an imbalance. Let's say one million 4K objects totaling 4GB are evenly
spread among 1024 placement groups on 10 OSDs. They will use 4GB / 10
= 400MB on each OSD. If one 400MB object is added to the pool, the
three OSDs supporting the placement group in which the object has been
placed will be filled with 400MB + 400MB = 800MB while the seven
others will remain occupied with only 400MB.

.. _resource usage:

Memory, CPU and network usage
-----------------------------

For each placement group, OSDs and MONs need memory, network and CPU
at all times and even more during recovery. Sharing this overhead by
clustering objects within a placement group is one of the main reasons
they exist.

Minimizing the number of placement groups saves significant amounts of
resources.

.. _choosing-number-of-placement-groups:

Choosing the number of Placement Groups
=======================================

.. note:: It is rarely necessary to do this math by hand. Instead, use the ``ceph osd pool autoscale-status`` command in combination with the ``target_size_bytes`` or ``target_size_ratio`` pool properties. See :ref:`pg-autoscaler` for more information.

If you have more than 50 OSDs, we recommend approximately 50-100
placement groups per OSD to balance out resource usage, data
durability and distribution. If you have fewer than 50 OSDs, choosing
among the `preselection`_ above is best. For a single pool of objects,
you can use the following formula to get a baseline:

 Total PGs = :math:`\frac{OSDs \times 100}{pool \: size}`

Where **pool size** is either the number of replicas for replicated
pools or the K+M sum for erasure coded pools (as returned by **ceph
osd erasure-code-profile get**).

You should then check if the result makes sense with the way you
designed your Ceph cluster to maximize `data durability`_,
`object distribution`_ and minimize `resource usage`_.

The result should always be **rounded up to the nearest power of two**.

Only a power of two will evenly balance the number of objects among
placement groups. Other values will result in an uneven distribution of
data across your OSDs. Their use should be limited to incrementally
stepping from one power of two to another.

As an example, for a cluster with 200 OSDs and a pool size of 3
replicas, you would estimate your number of PGs as follows:

 :math:`\frac{200 \times 100}{3} = 6667`. Nearest power of 2: 8192

When using multiple data pools for storing objects, you need to ensure
that you balance the number of placement groups per pool with the
number of placement groups per OSD so that you arrive at a reasonable
total number of placement groups that provides reasonably low variance
per OSD without taxing system resources or making the peering process
too slow.

For instance, a cluster of 10 pools each with 512 placement groups on
ten OSDs is a total of 5,120 placement groups spread over ten OSDs,
that is 512 placement groups per OSD. That does not use too many
resources. However, if 1,000 pools were created with 512 placement
groups each, the OSDs will handle ~50,000 placement groups each and it
would require significantly more resources and time for peering.

You may find the `PGCalc`_ tool helpful.


.. _setting the number of placement groups:

Set the Number of Placement Groups
==================================

To set the number of placement groups in a pool, you must specify the
number of placement groups at the time you create the pool.
See `Create a Pool`_ for details. Even after a pool is created, you can also change the number of placement groups with::

  ceph osd pool set {pool-name} pg_num {pg_num}

After you increase the number of placement groups, you must also
increase the number of placement groups for placement (``pgp_num``)
before your cluster will rebalance. The ``pgp_num`` will be the number of
placement groups that will be considered for placement by the CRUSH
algorithm. Increasing ``pg_num`` splits the placement groups but data
will not be migrated to the newer placement groups until placement
groups for placement, i.e. ``pgp_num``, is increased. The ``pgp_num``
should be equal to the ``pg_num``. To increase the number of
placement groups for placement, execute the following::

  ceph osd pool set {pool-name} pgp_num {pgp_num}

When decreasing the number of PGs, ``pgp_num`` is adjusted
automatically for you.

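For example, a minimal sketch of growing a hypothetical pool ``rbd`` to 128
PGs and then matching ``pgp_num`` so the data actually rebalances::

  # split the placement groups
  ceph osd pool set rbd pg_num 128

  # allow CRUSH to place data on the new placement groups
  ceph osd pool set rbd pgp_num 128
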
Get the Number of Placement Groups
==================================

To get the number of placement groups in a pool, execute the following::

  ceph osd pool get {pool-name} pg_num


Get a Cluster's PG Statistics
=============================

To get the statistics for the placement groups in your cluster, execute the following::

  ceph pg dump [--format {format}]

Valid formats are ``plain`` (default) and ``json``.


Get Statistics for Stuck PGs
============================

To get the statistics for all placement groups stuck in a specified state,
execute the following::

  ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>]

**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD
with the most up-to-date data to come up and in.

**Unclean** Placement groups contain objects that are not replicated the desired number
of times. They should be recovering.

**Stale** Placement groups are in an unknown state - the OSDs that host them have not
reported to the monitor cluster in a while (configured by ``mon_osd_report_timeout``).

Valid formats are ``plain`` (default) and ``json``. The threshold defines the minimum number
of seconds the placement group is stuck before including it in the returned statistics
(default 300 seconds).

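For example, to list placement groups that have been stuck in the ``stale``
state for at least 600 seconds, formatted as JSON::

  ceph pg dump_stuck stale --format json --threshold 600
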

Get a PG Map
============

To get the placement group map for a particular placement group, execute the following::

  ceph pg map {pg-id}

For example::

  ceph pg map 1.6c

Ceph will return the placement group map, the placement group, and the OSD status::

  osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]


Get a PG's Statistics
=====================

To retrieve statistics for a particular placement group, execute the following::

  ceph pg {pg-id} query

Scrub a Placement Group
=======================

To scrub a placement group, execute the following::

  ceph pg scrub {pg-id}

Ceph checks the primary and any replica nodes, generates a catalog of all objects
in the placement group and compares them to ensure that no objects are missing
or mismatched, and their contents are consistent. Assuming the replicas all
match, a final semantic sweep ensures that all of the snapshot-related object
metadata is consistent. Errors are reported via logs.

To scrub all placement groups from a specific pool, execute the following::

  ceph osd pool scrub {pool-name}

Prioritize backfill/recovery of Placement Group(s)
====================================================

You may run into a situation where a bunch of placement groups will require
recovery and/or backfill, and some particular groups hold data more important
than others (for example, those PGs may hold data for images used by running
machines and other PGs may be used by inactive machines/less relevant data).
In that case, you may want to prioritize recovery of those groups so
performance and/or availability of data stored on those groups is restored
earlier. To do this (mark particular placement group(s) as prioritized during
backfill or recovery), execute the following::

  ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
  ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

This will cause Ceph to perform recovery or backfill on specified placement
groups first, before other placement groups. This does not interrupt currently
ongoing backfills or recovery, but causes specified PGs to be processed
as soon as possible. If you change your mind or prioritize the wrong groups,
use::

  ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
  ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

This will remove the "force" flag from those PGs and they will be processed
in default order. Again, this doesn't affect currently processed placement
groups, only those that are still queued.

The "force" flag is cleared automatically after recovery or backfill of the
group is done.

Similarly, you may use the following commands to force Ceph to perform recovery
or backfill on all placement groups from a specified pool first::

  ceph osd pool force-recovery {pool-name}
  ceph osd pool force-backfill {pool-name}

or::

  ceph osd pool cancel-force-recovery {pool-name}
  ceph osd pool cancel-force-backfill {pool-name}

to restore the default recovery or backfill priority if you change your mind.

Note that these commands could possibly break the ordering of Ceph's internal
priority computations, so use them with caution!
In particular, if you have multiple pools that are currently sharing the same
underlying OSDs, and some particular pools hold data more important than others,
we recommend you use the following command to re-arrange all pools'
recovery/backfill priority in a better order::

  ceph osd pool set {pool-name} recovery_priority {value}

For example, if you have 10 pools you could make the most important one priority 10,
next 9, etc. Or you could leave most pools alone and have, say, 3 important pools
all priority 1 or priorities 3, 2, 1 respectively.

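For example, a sketch of the three-pool case above, giving the most important
pool the highest priority (the pool names are illustrative)::

  ceph osd pool set critical-pool recovery_priority 3
  ceph osd pool set important-pool recovery_priority 2
  ceph osd pool set normal-pool recovery_priority 1
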
Revert Lost
===========

If the cluster has lost one or more objects, and you have decided to
abandon the search for the lost data, you must mark the unfound objects
as ``lost``.

If all possible locations have been queried and objects are still
lost, you may have to give up on the lost objects. This is
possible given unusual combinations of failures that allow the cluster
to learn about writes that were performed before the writes themselves
are recovered.

Currently the only supported option is "revert", which will either roll back to
a previous version of the object or (if it was a new object) forget about it
entirely. To mark the "unfound" objects as "lost", execute the following::

  ceph pg {pg-id} mark_unfound_lost revert|delete

.. important:: Use this feature with caution, because it may confuse
   applications that expect the object(s) to exist.


.. toctree::
   :hidden:

   pg-states
   pg-concepts


.. _Create a Pool: ../pools#createpool
.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
.. _pgcalc: https://old.ceph.com/pgcalc/