==================
 Placement Groups
==================

.. _pg-autoscaler:

Autoscaling placement groups
============================

Placement groups (PGs) are an internal implementation detail of how
Ceph distributes data. You can allow the cluster to either make
recommendations or automatically tune PGs based on how the cluster is
used by enabling *pg-autoscaling*.

Each pool in the system has a ``pg_autoscale_mode`` property that can be set to ``off``, ``on``, or ``warn``.

* ``off``: Disable autoscaling for this pool. It is up to the administrator to choose an appropriate PG number for each pool. Please refer to :ref:`choosing-number-of-placement-groups` for more information.
* ``on``: Enable automated adjustments of the PG count for the given pool.
* ``warn``: Raise health alerts when the PG count should be adjusted.

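
In ``warn`` mode, the autoscaler makes no changes itself; it raises
health warnings (``POOL_TOO_FEW_PGS`` or ``POOL_TOO_MANY_PGS``) that
you can review with::

  ceph health detail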

To set the autoscaling mode for existing pools::

  ceph osd pool set <pool-name> pg_autoscale_mode <mode>

For example, to enable autoscaling on pool ``foo``::

  ceph osd pool set foo pg_autoscale_mode on

You can also configure the default ``pg_autoscale_mode`` that is
applied to any pools that are created in the future with::

  ceph config set global osd_pool_default_pg_autoscale_mode <mode>
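
For example, a sketch that makes ``warn`` the default for future pools
and then reads the stored value back (``config get`` simply prints the
mode)::

  ceph config set global osd_pool_default_pg_autoscale_mode warn
  ceph config get mon osd_pool_default_pg_autoscale_mode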

Viewing PG scaling recommendations
----------------------------------

You can view each pool, its relative utilization, and any suggested changes to
the PG count with this command::

  ceph osd pool autoscale-status

Output will be something like::

  POOL  SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  PROFILE
  a     12900M               3.0   82431M        0.4695                                       8       128         warn       scale-up
  c     0                    3.0   82431M        0.0000  0.2000        0.9884           1.0   1       64          warn       scale-down
  b     0       953.6M       3.0   82431M        0.0347                                       8                   warn       scale-down

**SIZE** is the amount of data stored in the pool. **TARGET SIZE**, if
present, is the amount of data the administrator has specified that
they expect to eventually be stored in this pool. The system uses
the larger of the two values for its calculation.

**RATE** is the multiplier for the pool that determines how much raw
storage capacity is consumed. For example, a 3-replica pool will
have a rate of 3.0, while a k=4, m=2 erasure-coded pool will have a
rate of 1.5.

**RAW CAPACITY** is the total amount of raw storage capacity on the
OSDs that are responsible for storing this pool's (and perhaps other
pools') data. **RATIO** is the ratio of that total capacity that
this pool is consuming (i.e., ratio = size * rate / raw capacity).

**TARGET RATIO**, if present, is the ratio of storage that the
administrator has specified that they expect this pool to consume
relative to other pools with target ratios set.
If both target size bytes and ratio are specified, the
ratio takes precedence.

**EFFECTIVE RATIO** is the target ratio after adjusting in two ways:

1. subtracting any capacity expected to be used by pools with target size set
2. normalizing the target ratios among pools with target ratio set so
   they collectively target the rest of the space. For example, 4
   pools with target_ratio 1.0 would have an effective ratio of 0.25.

The system uses the larger of the actual ratio and the effective ratio
for its calculation.

**BIAS** is used as a multiplier to manually adjust a pool's PG count
based on prior information about how many PGs a specific pool is
expected to have.

**PG_NUM** is the current number of PGs for the pool (or the current
number of PGs that the pool is working towards, if a ``pg_num``
change is in progress). **NEW PG_NUM**, if present, is what the
system believes the pool's ``pg_num`` should be changed to. It is
always a power of 2, and will only be present if the "ideal" value
varies from the current value by more than a factor of 3.

**AUTOSCALE** is the pool's ``pg_autoscale_mode``
and will be either ``on``, ``off``, or ``warn``.

The final column, **PROFILE**, shows the autoscale profile
used by each pool. ``scale-up`` and ``scale-down`` are the
currently available profiles.

Automated scaling
-----------------

Allowing the cluster to automatically scale PGs based on usage is the
simplest approach. Ceph will look at the total available storage and
target number of PGs for the whole system, look at how much data is
stored in each pool, and try to apportion the PGs accordingly. The
system is relatively conservative with its approach, only making
changes to a pool when the current number of PGs (``pg_num``) is more
than 3 times off from what it thinks it should be.

The target number of PGs per OSD is based on the
``mon_target_pg_per_osd`` configurable (default: 100), which can be
adjusted with::

  ceph config set global mon_target_pg_per_osd 100

The autoscaler analyzes pools and adjusts on a per-subtree basis.
Because each pool may map to a different CRUSH rule, and each rule may
distribute data across different devices, Ceph will consider
utilization of each subtree of the hierarchy independently. For
example, a pool that maps to OSDs of class `ssd` and a pool that maps
to OSDs of class `hdd` will each have optimal PG counts that depend on
the number of those respective device types.

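For example, a sketch that steers two hypothetical pools to different
device classes via their own CRUSH rules (rule and pool names here are
placeholders)::

  ceph osd crush rule create-replicated on-ssd default host ssd
  ceph osd crush rule create-replicated on-hdd default host hdd
  ceph osd pool set fastpool crush_rule on-ssd
  ceph osd pool set slowpool crush_rule on-hdd

The autoscaler will then size each pool's PG count against the
capacity of its own device class.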

The autoscaler uses the `scale-up` profile by default,
where it starts out each pool with minimal PGs and scales
up PGs when there is more usage in each pool. However, it also has
a `scale-down` profile, where each pool starts out with a full complement
of PGs and only scales down when the usage ratio across the pools is not even.

With the `scale-down` profile, the autoscaler identifies
any overlapping roots and prevents the pools with such roots
from scaling, because overlapping roots can cause problems
with the scaling process.

To use the `scale-down` profile::

  ceph osd pool set autoscale-profile scale-down

To switch back to the default `scale-up` profile::

  ceph osd pool set autoscale-profile scale-up

Existing clusters will continue to use the `scale-up` profile.
To use the `scale-down` profile, users will need to set the
autoscale profile to `scale-down` after upgrading to a version of
Ceph that provides the `scale-down` feature.

.. _specifying_pool_target_size:

Specifying expected pool size
-----------------------------

When a cluster or pool is first created, it will consume a small
fraction of the total cluster capacity and will appear to the system
as if it should only need a small number of placement groups.
However, in most cases cluster administrators have a good idea which
pools are expected to consume most of the system capacity over time.
By providing this information to Ceph, a more appropriate number of
PGs can be used from the beginning, preventing subsequent changes in
``pg_num`` and the overhead associated with moving data around when
those adjustments are made.

The *target size* of a pool can be specified in two ways: either in
terms of the absolute size of the pool (i.e., bytes), or as a weight
relative to other pools with a ``target_size_ratio`` set.

For example::

  ceph osd pool set mypool target_size_bytes 100T

will tell the system that `mypool` is expected to consume 100 TiB of
space. Alternatively::

  ceph osd pool set mypool target_size_ratio 1.0

will tell the system that `mypool` is expected to consume 1.0 relative
to the other pools with ``target_size_ratio`` set. If `mypool` is the
only pool in the cluster, this means an expected use of 100% of the
total capacity. If there is a second pool with ``target_size_ratio``
1.0, both pools would expect to use 50% of the cluster capacity.

You can also set the target size of a pool at creation time with the optional ``--target-size-bytes <bytes>`` or ``--target-size-ratio <ratio>`` arguments to the ``ceph osd pool create`` command.

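
For example, a sketch that declares an expected share at creation
time (``mypool`` is a placeholder name)::

  ceph osd pool create mypool --target-size-ratio 0.25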

Note that if impossible target size values are specified (for example,
a capacity larger than the total cluster) then a health warning
(``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised.

If both ``target_size_ratio`` and ``target_size_bytes`` are specified
for a pool, only the ratio will be considered, and a health warning
(``POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO``) will be issued.

Specifying bounds on a pool's PGs
---------------------------------

It is also possible to specify a minimum number of PGs for a pool.
This is useful for establishing a lower bound on the amount of
parallelism clients will see when doing IO, even when a pool is mostly
empty. Setting the lower bound prevents Ceph from reducing (or
recommending you reduce) the PG number below the configured number.

You can set the minimum number of PGs for a pool with::

  ceph osd pool set <pool-name> pg_num_min <num>

You can also specify the minimum PG count at pool creation time with
the optional ``--pg-num-min <num>`` argument to the ``ceph osd pool
create`` command.

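
For example, a sketch that creates a pool with a floor of 64 PGs and
then reads the setting back (``mypool`` is a placeholder name)::

  ceph osd pool create mypool --pg-num-min 64
  ceph osd pool get mypool pg_num_min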

.. _preselection:

A preselection of pg_num
========================

When creating a new pool with::

  ceph osd pool create {pool-name} [pg_num]

it is optional to choose the value of ``pg_num``. If you do not
specify ``pg_num``, the cluster can (by default) automatically tune it
for you based on how much data is stored in the pool (see above, :ref:`pg-autoscaler`).

Alternatively, ``pg_num`` can be explicitly provided. However,
whether you specify a ``pg_num`` value or not does not affect whether
the value is automatically tuned by the cluster after the fact. To
enable or disable auto-tuning::

  ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn)

The "rule of thumb" for PGs per OSD has traditionally been 100. With
the addition of the balancer (which is also enabled by default), a
value of more like 50 PGs per OSD is probably reasonable. The
challenge (which the autoscaler normally does for you, and which is
sketched after this list) is to:

- have the PGs per pool proportional to the data in the pool, and
- end up with 50-100 PGs per OSD, after the replication or
  erasure-coding fan-out of each PG across OSDs is taken into
  consideration
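
As a rough sketch of that arithmetic, assuming a hypothetical 10-OSD
cluster whose data lives in a single replicated pool of size 3::

  (10 OSDs * 100 PGs per OSD) / 3 replicas = ~333 PGs
  rounded to a power of two -> pg_num = 256 (or 512)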

How are Placement Groups used?
===============================

A placement group (PG) aggregates objects within a pool because
tracking object placement and object metadata on a per-object basis is
computationally expensive--i.e., a system with millions of objects
cannot realistically track placement on a per-object basis.

.. ditaa::

           /-----\  /-----\  /-----\  /-----\  /-----\
           | obj |  | obj |  | obj |  | obj |  | obj |
           \-----/  \-----/  \-----/  \-----/  \-----/
              |        |        |        |        |
              +--------+--------+        +---+----+
              |                          |
              v                          v
   +-----------------------+  +-----------------------+
   |  Placement Group #1   |  |  Placement Group #2   |
   |                       |  |                       |
   +-----------------------+  +-----------------------+
               |                          |
               +-------------+------------+
                             |
                             v
                 +-----------------------+
                 |         Pool          |
                 |                       |
                 +-----------------------+

The Ceph client will calculate which placement group an object should
be in. It does this by hashing the object ID and applying an operation
based on the number of PGs in the defined pool and the ID of the pool.
See `Mapping PGs to OSDs`_ for details.

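
You can ask the cluster to compute this mapping for any object name
(the object does not need to exist; ``mypool`` and ``myobject`` are
placeholder names)::

  ceph osd map mypool myobject

The output shows the PG that the name hashes to and the OSDs currently
serving it.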

The object's contents within a placement group are stored in a set of
OSDs. For instance, in a replicated pool of size two, each placement
group will store objects on two OSDs, as shown below.

.. ditaa::

   +-----------------------+  +-----------------------+
   |  Placement Group #1   |  |  Placement Group #2   |
   |                       |  |                       |
   +-----------------------+  +-----------------------+
        |             |             |             |
        v             v             v             v
   /----------\  /----------\  /----------\  /----------\
   |          |  |          |  |          |  |          |
   |  OSD #1  |  |  OSD #2  |  |  OSD #2  |  |  OSD #3  |
   |          |  |          |  |          |  |          |
   \----------/  \----------/  \----------/  \----------/

Should OSD #2 fail, another will be assigned to Placement Group #1 and
will be filled with copies of all objects in OSD #1. If the pool size
is changed from two to three, an additional OSD will be assigned to
the placement group and will receive copies of all objects in the
placement group.

Placement groups do not own the OSD; they share it with other
placement groups from the same pool or even other pools. If OSD #2
fails, Placement Group #2 will also have to restore copies of
objects, using OSD #3.

When the number of placement groups increases, the new placement
groups will be assigned OSDs. The result of the CRUSH function will
also change and some objects from the former placement groups will be
copied over to the new Placement Groups and removed from the old ones.

Placement Groups Tradeoffs
==========================

Data durability and even distribution among all OSDs call for more
placement groups but their number should be reduced to the minimum to
save CPU and memory.

.. _data durability:

Data durability
---------------

After an OSD fails, the risk of data loss increases until the data it
contained is fully recovered. Let's imagine a scenario that causes
permanent data loss in a single placement group:

- The OSD fails and all copies of the object it contains are lost.
  For all objects within the placement group the number of replicas
  suddenly drops from three to two.

- Ceph starts recovery for this placement group by choosing a new OSD
  to re-create the third copy of all objects.

- Another OSD, within the same placement group, fails before the new
  OSD is fully populated with the third copy. Some objects will then
  only have one surviving copy.

- Ceph picks yet another OSD and keeps copying objects to restore the
  desired number of copies.

- A third OSD, within the same placement group, fails before recovery
  is complete. If this OSD contained the only remaining copy of an
  object, it is permanently lost.

In a cluster containing 10 OSDs with 512 placement groups in a three
replica pool, CRUSH will give each placement group three OSDs. In the
end, each OSD will end up hosting (512 * 3) / 10 = ~150 placement
groups. When the first OSD fails, the above scenario will therefore
start recovery for all 150 placement groups at the same time.

The 150 placement groups being recovered are likely to be
homogeneously spread over the 9 remaining OSDs. Each remaining OSD is
therefore likely to send copies of objects to all others and also
receive some new objects to be stored because they became part of a
new placement group.

The amount of time it takes for this recovery to complete entirely
depends on the architecture of the Ceph cluster. Let's say each OSD is
hosted by a 1TB SSD on a single machine and all of them are connected
to a 10Gb/s switch and the recovery for a single OSD completes within
M minutes. If there are two OSDs per machine using spinners with no
SSD journal and a 1Gb/s switch, it will at least be an order of
magnitude slower.

In a cluster of this size, the number of placement groups has almost
no influence on data durability. It could be 128 or 8192 and the
recovery would not be slower or faster.

However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs
is likely to speed up recovery and therefore improve data durability
significantly. Each OSD now participates in only ~75 placement groups
instead of ~150 when there were only 10 OSDs and it will still require
all 19 remaining OSDs to perform the same amount of object copies in
order to recover. But where 10 OSDs had to copy approximately 100GB
each, they now have to copy 50GB each instead. If the network was the
bottleneck, recovery will happen twice as fast. In other words,
recovery goes faster when the number of OSDs increases.

If this cluster grows to 40 OSDs, each of them will only host ~35
placement groups. If an OSD dies, recovery will keep going faster
unless it is blocked by another bottleneck. However, if this cluster
grows to 200 OSDs, each of them will only host ~7 placement groups. If
an OSD dies, recovery will happen among at most ~21 (7 * 3) OSDs
in these placement groups: recovery will take longer than when there
were 40 OSDs, meaning the number of placement groups should be
increased.

No matter how short the recovery time is, there is a chance for a
second OSD to fail while it is in progress. In the 10-OSD cluster
described above, if any of them fail, then ~17 placement groups
(i.e. ~150 / 9 placement groups being recovered) will only have one
surviving copy. And if any of the 8 remaining OSDs fails, the last
objects of two placement groups are likely to be lost (i.e. ~17 / 8
placement groups with only one remaining copy being recovered).

When the size of the cluster grows to 20 OSDs, the number of Placement
Groups damaged by the loss of three OSDs drops. The second OSD lost
will degrade ~4 (i.e. ~75 / 19 placement groups being recovered)
instead of ~17 and the third OSD lost will only lose data if it is one
of the four OSDs containing the surviving copy. In other words, if the
probability of losing one OSD is 0.0001% during the recovery time
frame, it goes from 17 * 10 * 0.0001% in the cluster with 10 OSDs to
4 * 20 * 0.0001% in the cluster with 20 OSDs.

In a nutshell, more OSDs mean faster recovery and a lower risk of
cascading failures leading to the permanent loss of a Placement
Group. Having 512 or 4096 Placement Groups is roughly equivalent in a
cluster with fewer than 50 OSDs as far as data durability is concerned.

Note: It may take a long time for a new OSD added to the cluster to be
populated with placement groups that were assigned to it. However,
there is no degradation of any object and it has no impact on the
durability of the data contained in the cluster.

.. _object distribution:

Object distribution within a pool
---------------------------------

Ideally objects are evenly distributed in each placement group. Since
CRUSH computes the placement group for each object, but does not
actually know how much data is stored in each OSD within this
placement group, the ratio between the number of placement groups and
the number of OSDs may influence the distribution of the data
significantly.

For instance, if there was a single placement group for ten OSDs in a
three replica pool, only three OSDs would be used because CRUSH would
have no other choice. When more placement groups are available,
objects are more likely to be evenly spread among them. CRUSH also
makes every effort to evenly spread OSDs among all existing Placement
Groups.

As long as there are one or two orders of magnitude more Placement
Groups than OSDs, the distribution should be even. For instance, 256
placement groups for 3 OSDs, 512 or 1024 placement groups for 10 OSDs,
etc.

Uneven data distribution can be caused by factors other than the ratio
between OSDs and placement groups. Since CRUSH does not take into
account the size of the objects, a few very large objects may create
an imbalance. Let's say one million 4K objects totaling 4GB are evenly
spread among 1024 placement groups on 10 OSDs. They will use 4GB / 10
= 400MB on each OSD. If one 400MB object is added to the pool, the
three OSDs supporting the placement group in which the object has been
placed will be filled with 400MB + 400MB = 800MB while the seven
others will remain occupied with only 400MB.

.. _resource usage:

Memory, CPU and network usage
-----------------------------

For each placement group, OSDs and MONs need memory, network and CPU
at all times and even more during recovery. Sharing this overhead by
clustering objects within a placement group is one of the main reasons
they exist.

Minimizing the number of placement groups saves significant amounts of
resources.

.. _choosing-number-of-placement-groups:

Choosing the number of Placement Groups
=======================================

.. note:: It is rarely necessary to do this math by hand. Instead, use the ``ceph osd pool autoscale-status`` command in combination with the ``target_size_bytes`` or ``target_size_ratio`` pool properties. See :ref:`pg-autoscaler` for more information.

If you have more than 50 OSDs, we recommend approximately 50-100
placement groups per OSD to balance out resource usage, data
durability and distribution. If you have fewer than 50 OSDs, choosing
among the `preselection`_ above is best. For a single pool of objects,
you can use the following formula to get a baseline:

  Total PGs = :math:`\frac{OSDs \times 100}{pool \: size}`

Where **pool size** is either the number of replicas for replicated
pools or the K+M sum for erasure coded pools (as returned by **ceph
osd erasure-code-profile get**).

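For example, to read K and M from a profile (the ``default`` profile
is assumed here)::

  ceph osd erasure-code-profile get default

The output includes ``k=...`` and ``m=...`` lines; their sum is the
**pool size** to use in the formula above.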

You should then check if the result makes sense with the way you
designed your Ceph cluster to maximize `data durability`_,
`object distribution`_ and minimize `resource usage`_.

The result should always be **rounded up to the nearest power of two**.

Only a power of two will evenly balance the number of objects among
placement groups. Other values will result in an uneven distribution of
data across your OSDs. Their use should be limited to incrementally
stepping from one power of two to another.

As an example, for a cluster with 200 OSDs and a pool size of 3
replicas, you would estimate your number of PGs as follows:

  :math:`\frac{200 \times 100}{3} = 6667`. Nearest power of 2: 8192

When using multiple data pools for storing objects, you need to ensure
that you balance the number of placement groups per pool with the
number of placement groups per OSD so that you arrive at a reasonable
total number of placement groups that provides reasonably low variance
per OSD without taxing system resources or making the peering process
too slow.

For instance, a cluster of 10 pools each with 512 placement groups on
ten OSDs is a total of 5,120 placement groups spread over ten OSDs,
that is 512 placement groups per OSD. That does not use too many
resources. However, if 1,000 pools were created with 512 placement
groups each, the OSDs would handle ~50,000 placement groups each and it
would require significantly more resources and time for peering.

You may find the `PGCalc`_ tool helpful.


.. _setting the number of placement groups:

Set the Number of Placement Groups
==================================

To set the number of placement groups in a pool, you must specify the
number of placement groups at the time you create the pool.
See `Create a Pool`_ for details. Even after a pool is created, you can
also change the number of placement groups with::

  ceph osd pool set {pool-name} pg_num {pg_num}

After you increase the number of placement groups, you must also
increase the number of placement groups for placement (``pgp_num``)
before your cluster will rebalance. The ``pgp_num`` will be the number of
placement groups that will be considered for placement by the CRUSH
algorithm. Increasing ``pg_num`` splits the placement groups but data
will not be migrated to the newer placement groups until placement
groups for placement, i.e. ``pgp_num``, is increased. The ``pgp_num``
should be equal to the ``pg_num``. To increase the number of
placement groups for placement, execute the following::

  ceph osd pool set {pool-name} pgp_num {pgp_num}

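For example, a sketch that doubles a hypothetical pool's PG count from
64 to 128 and then lets the cluster rebalance::

  ceph osd pool set mypool pg_num 128
  ceph osd pool set mypool pgp_num 128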

When decreasing the number of PGs, ``pgp_num`` is adjusted
automatically for you.

Get the Number of Placement Groups
==================================

To get the number of placement groups in a pool, execute the following::

  ceph osd pool get {pool-name} pg_num


Get a Cluster's PG Statistics
=============================

To get the statistics for the placement groups in your cluster, execute the following::

  ceph pg dump [--format {format}]

Valid formats are ``plain`` (default) and ``json``.


Get Statistics for Stuck PGs
============================

To get the statistics for all placement groups stuck in a specified state,
execute the following::

  ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>]

**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD
with the most up-to-date data to come up and in.

**Unclean** Placement groups contain objects that are not replicated the desired number
of times. They should be recovering.

**Stale** Placement groups are in an unknown state - the OSDs that host them have not
reported to the monitor cluster in a while (configured by ``mon_osd_report_timeout``).

Valid formats are ``plain`` (default) and ``json``. The threshold defines the minimum number
of seconds the placement group is stuck before including it in the returned statistics
(default 300 seconds).

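For example, to list PGs that have been stuck ``stale`` for at least
five minutes, in JSON::

  ceph pg dump_stuck stale --format json --threshold 300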

Get a PG Map
============

To get the placement group map for a particular placement group, execute the following::

  ceph pg map {pg-id}

For example::

  ceph pg map 1.6c

Ceph will return the placement group map, the placement group, and the OSD status::

  osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]


Get a PG's Statistics
=====================

To retrieve statistics for a particular placement group, execute the following::

  ceph pg {pg-id} query


Scrub a Placement Group
=======================

To scrub a placement group, execute the following::

  ceph pg scrub {pg-id}

Ceph checks the primary and any replica nodes, generates a catalog of all objects
in the placement group and compares them to ensure that no objects are missing
or mismatched, and their contents are consistent. Assuming the replicas all
match, a final semantic sweep ensures that all of the snapshot-related object
metadata is consistent. Errors are reported via logs.

To scrub all placement groups from a specific pool, execute the following::

  ceph osd pool scrub {pool-name}

Prioritize backfill/recovery of a Placement Group(s)
====================================================

You may run into a situation where a number of placement groups
require recovery and/or backfill, and some particular groups hold data
more important than others (for example, those PGs may hold data for
images used by running machines while other PGs may be used by
inactive machines/less relevant data). In that case, you may want to
prioritize recovery of those groups so performance and/or availability
of data stored on those groups is restored earlier. To do this (mark
particular placement group(s) as prioritized during backfill or
recovery), execute the following::

  ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
  ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

This will cause Ceph to perform recovery or backfill on specified placement
groups first, before other placement groups. This does not interrupt currently
ongoing backfills or recovery, but causes specified PGs to be processed
as soon as possible. If you change your mind or prioritize the wrong groups,
use::

  ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
  ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

This will remove the "force" flag from those PGs and they will be processed
in default order. Again, this doesn't affect currently processed placement
groups, only those that are still queued.

The "force" flag is cleared automatically after recovery or backfill of the
group is done.

Similarly, you may use the following commands to force Ceph to perform recovery
or backfill on all placement groups from a specified pool first::

  ceph osd pool force-recovery {pool-name}
  ceph osd pool force-backfill {pool-name}

or::

  ceph osd pool cancel-force-recovery {pool-name}
  ceph osd pool cancel-force-backfill {pool-name}

to restore the default recovery or backfill priority if you change your mind.

Note that these commands could possibly break the ordering of Ceph's internal
priority computations, so use them with caution!
In particular, if you have multiple pools that are currently sharing the same
underlying OSDs, and some particular pools hold data more important than others,
we recommend you use the following command to re-arrange all pools'
recovery/backfill priority in a better order::

  ceph osd pool set {pool-name} recovery_priority {value}

For example, if you have 10 pools you could make the most important one priority 10,
next 9, etc. Or you could leave most pools alone and have say 3 important pools
all priority 1 or priorities 3, 2, 1 respectively.

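As a sketch of that last arrangement (pool names are hypothetical)::

  ceph osd pool set critical-pool recovery_priority 3
  ceph osd pool set important-pool recovery_priority 2
  ceph osd pool set standard-pool recovery_priority 1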

Revert Lost
===========

If the cluster has lost one or more objects, and you have decided to
abandon the search for the lost data, you must mark the unfound objects
as ``lost``.

If all possible locations have been queried and objects are still
lost, you may have to give up on the lost objects. This is
possible given unusual combinations of failures that allow the cluster
to learn about writes that were performed before the writes themselves
are recovered.

Currently the only supported option is "revert", which will either roll back to
a previous version of the object or (if it was a new object) forget about it
entirely. To mark the "unfound" objects as "lost", execute the following::

  ceph pg {pg-id} mark_unfound_lost revert|delete

.. important:: Use this feature with caution, because it may confuse
   applications that expect the object(s) to exist.


.. toctree::
   :hidden:

   pg-states
   pg-concepts


.. _Create a Pool: ../pools#createpool
.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
.. _pgcalc: http://ceph.com/pgcalc/