1 .. _placement groups:
2
3 ==================
4 Placement Groups
5 ==================
6
7 .. _pg-autoscaler:
8
9 Autoscaling placement groups
10 ============================
11
12 Placement groups (PGs) are an internal implementation detail of how
13 Ceph distributes data. You may enable *pg-autoscaling* to allow the cluster to
14 make recommendations or automatically adjust the numbers of PGs (``pgp_num``)
15 for each pool based on expected cluster and pool utilization.
16
17 Each pool has a ``pg_autoscale_mode`` property that can be set to ``off``, ``on``, or ``warn``.
18
19 * ``off``: Disable autoscaling for this pool. It is up to the administrator to choose an appropriate ``pgp_num`` for each pool. Please refer to :ref:`choosing-number-of-placement-groups` for more information.
20 * ``on``: Enable automated adjustments of the PG count for the given pool.
* ``warn``: Raise health alerts when the PG count should be adjusted.
22
23 To set the autoscaling mode for an existing pool:
24
25 .. prompt:: bash #
26
27 ceph osd pool set <pool-name> pg_autoscale_mode <mode>
28
For example, to enable autoscaling on pool ``foo``:
30
31 .. prompt:: bash #
32
33 ceph osd pool set foo pg_autoscale_mode on
34
35 You can also configure the default ``pg_autoscale_mode`` that is
36 set on any pools that are subsequently created:
37
38 .. prompt:: bash #
39
40 ceph config set global osd_pool_default_pg_autoscale_mode <mode>
41
42 You can disable or enable the autoscaler for all pools with
the ``noautoscale`` flag. By default this flag is ``off``,
but you can turn it ``on`` with the command:
45
.. prompt:: bash #
47
48 ceph osd pool set noautoscale
49
50 You can turn it ``off`` using the command:
51
52 .. prompt:: bash #
53
54 ceph osd pool unset noautoscale
55
56 To ``get`` the value of the flag use the command:
57
58 .. prompt:: bash #
59
60 ceph osd pool get noautoscale
61
62 Viewing PG scaling recommendations
63 ----------------------------------
64
65 You can view each pool, its relative utilization, and any suggested changes to
66 the PG count with this command:
67
68 .. prompt:: bash #
69
70 ceph osd pool autoscale-status
71
72 Output will be something like::
73
74 POOL SIZE TARGET SIZE RATE RAW CAPACITY RATIO TARGET RATIO EFFECTIVE RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE BULK
75 a 12900M 3.0 82431M 0.4695 8 128 warn True
76 c 0 3.0 82431M 0.0000 0.2000 0.9884 1.0 1 64 warn True
77 b 0 953.6M 3.0 82431M 0.0347 8 warn False
78
79 **SIZE** is the amount of data stored in the pool. **TARGET SIZE**, if
80 present, is the amount of data the administrator has specified that
81 they expect to eventually be stored in this pool. The system uses
82 the larger of the two values for its calculation.
83
84 **RATE** is the multiplier for the pool that determines how much raw
85 storage capacity is consumed. For example, a 3 replica pool will
86 have a ratio of 3.0, while a k=4,m=2 erasure coded pool will have a
87 ratio of 1.5.
88
89 **RAW CAPACITY** is the total amount of raw storage capacity on the
90 OSDs that are responsible for storing this pool's (and perhaps other
91 pools') data. **RATIO** is the ratio of that total capacity that
92 this pool is consuming (i.e., ratio = size * rate / raw capacity).
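
For example, using the first row of the sample output above for pool ``a``:
12900M * 3.0 / 82431M = ~0.4695, which matches the reported **RATIO**.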
93
94 **TARGET RATIO**, if present, is the ratio of storage that the
95 administrator has specified that they expect this pool to consume
96 relative to other pools with target ratios set.
97 If both target size bytes and ratio are specified, the
98 ratio takes precedence.
99
100 **EFFECTIVE RATIO** is the target ratio after adjusting in two ways:
101
102 1. Subtracting any capacity expected to be used by pools with target size set
103 2. Normalizing the target ratios among pools with target ratio set so
104 they collectively target the rest of the space. For example, 4
105 pools with target_ratio 1.0 would have an effective ratio of 0.25.
106
107 The system uses the larger of the actual ratio and the effective ratio
108 for its calculation.
109
**BIAS** is used as a multiplier to manually adjust a pool's PG count based
on prior information about how many PGs a specific pool is expected
to have.
113
114 **PG_NUM** is the current number of PGs for the pool (or the current
115 number of PGs that the pool is working towards, if a ``pg_num``
116 change is in progress). **NEW PG_NUM**, if present, is what the
117 system believes the pool's ``pg_num`` should be changed to. It is
always a power of 2, and will be present only if the "ideal" value
differs from the current value by more than a factor of 3 (the default).
This factor can be adjusted with:
121
122 .. prompt:: bash #
123
   ceph config set mgr mgr/pg_autoscaler/threshold 2.0
125
**AUTOSCALE** is the pool's ``pg_autoscale_mode``
and will be either ``on``, ``off``, or ``warn``.
128
The final column, **BULK**, indicates whether the pool is ``bulk``
and will be either ``True`` or ``False``. A ``bulk`` pool
is expected to be large and should start out
with a large number of PGs for performance purposes. On the other hand,
pools without the ``bulk`` flag are expected to be smaller, e.g.,
the ``.mgr`` pool or meta pools.
135
136
137 Automated scaling
138 -----------------
139
140 Allowing the cluster to automatically scale ``pgp_num`` based on usage is the
141 simplest approach. Ceph will look at the total available storage and
142 target number of PGs for the whole system, look at how much data is
143 stored in each pool, and try to apportion PGs accordingly. The
144 system is relatively conservative with its approach, only making
145 changes to a pool when the current number of PGs (``pg_num``) is more
146 than a factor of 3 off from what it thinks it should be.
147
148 The target number of PGs per OSD is based on the
149 ``mon_target_pg_per_osd`` configurable (default: 100), which can be
150 adjusted with:
151
152 .. prompt:: bash #
153
154 ceph config set global mon_target_pg_per_osd 100
155
156 The autoscaler analyzes pools and adjusts on a per-subtree basis.
157 Because each pool may map to a different CRUSH rule, and each rule may
158 distribute data across different devices, Ceph will consider
159 utilization of each subtree of the hierarchy independently. For
160 example, a pool that maps to OSDs of class `ssd` and a pool that maps
161 to OSDs of class `hdd` will each have optimal PG counts that depend on
162 the number of those respective device types.
163
In the case where a pool uses OSDs under two or more CRUSH roots (e.g., shadow
trees with both `ssd` and `hdd` devices), the autoscaler will
issue a warning to the user in the manager log, stating the name of the pool
and the set of roots that overlap each other. The autoscaler will not
scale any pools with overlapping roots, because this can cause problems
with the scaling process. We recommend making each pool belong to only
one root (one OSD class) to get rid of the warning and ensure successful
scaling, as illustrated below.
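
As an illustration (the rule name ``ssd_rule`` and the pool name ``foo`` are
examples), a pool can be restricted to a single device class by assigning it a
CRUSH rule that selects only OSDs of that class:

.. prompt:: bash #

   ceph osd crush rule create-replicated ssd_rule default host ssd
   ceph osd pool set foo crush_rule ssd_rule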
172
The autoscaler uses the `bulk` flag to determine which pool
should start out with a full complement of PGs; it scales the PG count down
only if the usage ratio across the pool is not even. However,
if the pool doesn't have the `bulk` flag, the pool will
start out with minimal PGs and gain more PGs only when there is more usage in the pool.
178
To create a pool with the `bulk` flag:
180
181 .. prompt:: bash #
182
183 ceph osd pool create <pool-name> --bulk
184
To set or unset the `bulk` flag of an existing pool:
186
187 .. prompt:: bash #
188
189 ceph osd pool set <pool-name> bulk <true/false/1/0>
190
To get the `bulk` flag of an existing pool:
192
193 .. prompt:: bash #
194
195 ceph osd pool get <pool-name> bulk
196
197 .. _specifying_pool_target_size:
198
199 Specifying expected pool size
200 -----------------------------
201
202 When a cluster or pool is first created, it will consume a small
203 fraction of the total cluster capacity and will appear to the system
204 as if it should only need a small number of placement groups.
205 However, in most cases cluster administrators have a good idea which
206 pools are expected to consume most of the system capacity over time.
207 By providing this information to Ceph, a more appropriate number of
208 PGs can be used from the beginning, preventing subsequent changes in
209 ``pg_num`` and the overhead associated with moving data around when
210 those adjustments are made.
211
212 The *target size* of a pool can be specified in two ways: either in
213 terms of the absolute size of the pool (i.e., bytes), or as a weight
214 relative to other pools with a ``target_size_ratio`` set.
215
216 For example:
217
218 .. prompt:: bash #
219
220 ceph osd pool set mypool target_size_bytes 100T
221
222 will tell the system that `mypool` is expected to consume 100 TiB of
223 space. Alternatively:
224
225 .. prompt:: bash #
226
227 ceph osd pool set mypool target_size_ratio 1.0
228
229 will tell the system that `mypool` is expected to consume 1.0 relative
230 to the other pools with ``target_size_ratio`` set. If `mypool` is the
231 only pool in the cluster, this means an expected use of 100% of the
232 total capacity. If there is a second pool with ``target_size_ratio``
233 1.0, both pools would expect to use 50% of the cluster capacity.
234
235 You can also set the target size of a pool at creation time with the optional ``--target-size-bytes <bytes>`` or ``--target-size-ratio <ratio>`` arguments to the ``ceph osd pool create`` command.
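
For example, to declare the expected size up front at creation time (``mypool``
is again just an example name):

.. prompt:: bash #

   ceph osd pool create mypool --target-size-bytes 100T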
236
Note that if impossible target size values are specified (for example,
a capacity larger than the total cluster capacity), then a health warning
(``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised.
240
241 If both ``target_size_ratio`` and ``target_size_bytes`` are specified
242 for a pool, only the ratio will be considered, and a health warning
243 (``POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO``) will be issued.
244
245 Specifying bounds on a pool's PGs
246 ---------------------------------
247
248 It is also possible to specify a minimum number of PGs for a pool.
249 This is useful for establishing a lower bound on the amount of
parallelism clients will see when doing IO, even when a pool is mostly
251 empty. Setting the lower bound prevents Ceph from reducing (or
252 recommending you reduce) the PG number below the configured number.
253
254 You can set the minimum or maximum number of PGs for a pool with:
255
256 .. prompt:: bash #
257
258 ceph osd pool set <pool-name> pg_num_min <num>
259 ceph osd pool set <pool-name> pg_num_max <num>
260
261 You can also specify the minimum or maximum PG count at pool creation
262 time with the optional ``--pg-num-min <num>`` or ``--pg-num-max
263 <num>`` arguments to the ``ceph osd pool create`` command.
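
For example (the pool name and value below are only illustrative):

.. prompt:: bash #

   ceph osd pool create mypool --pg-num-min 32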
264
265 .. _preselection:
266
267 A preselection of pg_num
268 ========================
269
270 When creating a new pool with:
271
272 .. prompt:: bash #
273
274 ceph osd pool create {pool-name} [pg_num]
275
276 it is optional to choose the value of ``pg_num``. If you do not
277 specify ``pg_num``, the cluster can (by default) automatically tune it
278 for you based on how much data is stored in the pool (see above, :ref:`pg-autoscaler`).
279
280 Alternatively, ``pg_num`` can be explicitly provided. However,
281 whether you specify a ``pg_num`` value or not does not affect whether
282 the value is automatically tuned by the cluster after the fact. To
283 enable or disable auto-tuning:
284
285 ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn)
286
287 The "rule of thumb" for PGs per OSD has traditionally be 100. With
288 the additional of the balancer (which is also enabled by default), a
289 value of more like 50 PGs per OSD is probably reasonable. The
290 challenge (which the autoscaler normally does for you), is to:
291
292 - have the PGs per pool proportional to the data in the pool, and
- end up with 50-100 PGs per OSD, after the replication or
  erasure-coding fan-out of each PG across OSDs is taken into
  consideration
296
How are Placement Groups used?
298 ===============================
299
300 A placement group (PG) aggregates objects within a pool because
301 tracking object placement and object metadata on a per-object basis is
302 computationally expensive--i.e., a system with millions of objects
303 cannot realistically track placement on a per-object basis.
304
305 .. ditaa::
306 /-----\ /-----\ /-----\ /-----\ /-----\
307 | obj | | obj | | obj | | obj | | obj |
308 \-----/ \-----/ \-----/ \-----/ \-----/
309 | | | | |
310 +--------+--------+ +---+----+
311 | |
312 v v
313 +-----------------------+ +-----------------------+
314 | Placement Group #1 | | Placement Group #2 |
315 | | | |
316 +-----------------------+ +-----------------------+
317 | |
318 +------------------------------+
319 |
320 v
321 +-----------------------+
322 | Pool |
323 | |
324 +-----------------------+
325
326 The Ceph client will calculate which placement group an object should
327 be in. It does this by hashing the object ID and applying an operation
328 based on the number of PGs in the defined pool and the ID of the pool.
329 See `Mapping PGs to OSDs`_ for details.
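
You can inspect this mapping for any given object with the ``ceph osd map``
command; the pool and object names below are only examples:

.. prompt:: bash #

   ceph osd map mypool myobject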
330
331 The object's contents within a placement group are stored in a set of
332 OSDs. For instance, in a replicated pool of size two, each placement
333 group will store objects on two OSDs, as shown below.
334
335 .. ditaa::
336 +-----------------------+ +-----------------------+
337 | Placement Group #1 | | Placement Group #2 |
338 | | | |
339 +-----------------------+ +-----------------------+
340 | | | |
341 v v v v
342 /----------\ /----------\ /----------\ /----------\
343 | | | | | | | |
344 | OSD #1 | | OSD #2 | | OSD #2 | | OSD #3 |
345 | | | | | | | |
346 \----------/ \----------/ \----------/ \----------/
347
348
349 Should OSD #2 fail, another will be assigned to Placement Group #1 and
350 will be filled with copies of all objects in OSD #1. If the pool size
351 is changed from two to three, an additional OSD will be assigned to
352 the placement group and will receive copies of all objects in the
353 placement group.
354
355 Placement groups do not own the OSD; they share it with other
356 placement groups from the same pool or even other pools. If OSD #2
fails, Placement Group #2 will also have to restore copies of
358 objects, using OSD #3.
359
360 When the number of placement groups increases, the new placement
361 groups will be assigned OSDs. The result of the CRUSH function will
362 also change and some objects from the former placement groups will be
363 copied over to the new Placement Groups and removed from the old ones.
364
365 Placement Groups Tradeoffs
366 ==========================
367
368 Data durability and even distribution among all OSDs call for more
369 placement groups but their number should be reduced to the minimum to
370 save CPU and memory.
371
372 .. _data durability:
373
374 Data durability
375 ---------------
376
377 After an OSD fails, the risk of data loss increases until the data it
378 contained is fully recovered. Let's imagine a scenario that causes
379 permanent data loss in a single placement group:
380
- The OSD fails and all copies of the objects it contains are lost.
  For all objects within the placement group, the number of replicas
  suddenly drops from three to two.
384
385 - Ceph starts recovery for this placement group by choosing a new OSD
386 to re-create the third copy of all objects.
387
388 - Another OSD, within the same placement group, fails before the new
  OSD is fully populated with the third copy. Some objects will then
  have only one surviving copy.
391
392 - Ceph picks yet another OSD and keeps copying objects to restore the
393 desired number of copies.
394
395 - A third OSD, within the same placement group, fails before recovery
396 is complete. If this OSD contained the only remaining copy of an
397 object, it is permanently lost.
398
399 In a cluster containing 10 OSDs with 512 placement groups in a three
replica pool, CRUSH will give each placement group three OSDs. In the
end, each OSD will end up hosting (512 * 3) / 10 = ~150 Placement
402 Groups. When the first OSD fails, the above scenario will therefore
403 start recovery for all 150 placement groups at the same time.
404
405 The 150 placement groups being recovered are likely to be
406 homogeneously spread over the 9 remaining OSDs. Each remaining OSD is
407 therefore likely to send copies of objects to all others and also
408 receive some new objects to be stored because they became part of a
409 new placement group.
410
411 The amount of time it takes for this recovery to complete entirely
depends on the architecture of the Ceph cluster. Let's say each OSD is
hosted by a 1TB SSD on a single machine, all of them are connected
to a 10Gb/s switch, and the recovery for a single OSD completes within
M minutes. If there are two OSDs per machine using spinners with no
SSD journal and a 1Gb/s switch, recovery will be at least an order of
magnitude slower.
418
419 In a cluster of this size, the number of placement groups has almost
420 no influence on data durability. It could be 128 or 8192 and the
421 recovery would not be slower or faster.
422
423 However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs
424 is likely to speed up recovery and therefore improve data durability
425 significantly. Each OSD now participates in only ~75 placement groups
426 instead of ~150 when there were only 10 OSDs and it will still require
427 all 19 remaining OSDs to perform the same amount of object copies in
428 order to recover. But where 10 OSDs had to copy approximately 100GB
429 each, they now have to copy 50GB each instead. If the network was the
430 bottleneck, recovery will happen twice as fast. In other words,
431 recovery goes faster when the number of OSDs increases.
432
433 If this cluster grows to 40 OSDs, each of them will only host ~35
434 placement groups. If an OSD dies, recovery will keep going faster
435 unless it is blocked by another bottleneck. However, if this cluster
436 grows to 200 OSDs, each of them will only host ~7 placement groups. If
an OSD dies, recovery will involve at most ~21 (7 * 3) OSDs
438 in these placement groups: recovery will take longer than when there
439 were 40 OSDs, meaning the number of placement groups should be
440 increased.
441
442 No matter how short the recovery time is, there is a chance for a
second OSD to fail while it is in progress. In the 10-OSD cluster
described above, if any of them fails, then ~17 placement groups
(i.e. ~150 / 9 placement groups being recovered) will have only one
surviving copy. And if any of the 8 remaining OSDs fails, the last
447 objects of two placement groups are likely to be lost (i.e. ~17 / 8
448 placement groups with only one remaining copy being recovered).
449
450 When the size of the cluster grows to 20 OSDs, the number of Placement
451 Groups damaged by the loss of three OSDs drops. The second OSD lost
452 will degrade ~4 (i.e. ~75 / 19 placement groups being recovered)
453 instead of ~17 and the third OSD lost will only lose data if it is one
454 of the four OSDs containing the surviving copy. In other words, if the
455 probability of losing one OSD is 0.0001% during the recovery time
456 frame, it goes from 17 * 10 * 0.0001% in the cluster with 10 OSDs to 4 * 20 *
457 0.0001% in the cluster with 20 OSDs.
458
459 In a nutshell, more OSDs mean faster recovery and a lower risk of
460 cascading failures leading to the permanent loss of a Placement
461 Group. Having 512 or 4096 Placement Groups is roughly equivalent in a
462 cluster with less than 50 OSDs as far as data durability is concerned.
463
464 Note: It may take a long time for a new OSD added to the cluster to be
populated with the placement groups that were assigned to it. However,
there is no degradation of any object, and this has no impact on the
durability of the data contained in the cluster.
468
469 .. _object distribution:
470
471 Object distribution within a pool
472 ---------------------------------
473
474 Ideally objects are evenly distributed in each placement group. Since
475 CRUSH computes the placement group for each object, but does not
476 actually know how much data is stored in each OSD within this
477 placement group, the ratio between the number of placement groups and
478 the number of OSDs may influence the distribution of the data
479 significantly.
480
481 For instance, if there was a single placement group for ten OSDs in a
three replica pool, only three OSDs would be used because CRUSH would
483 have no other choice. When more placement groups are available,
484 objects are more likely to be evenly spread among them. CRUSH also
485 makes every effort to evenly spread OSDs among all existing Placement
486 Groups.
487
488 As long as there are one or two orders of magnitude more Placement
489 Groups than OSDs, the distribution should be even. For instance, 256
490 placement groups for 3 OSDs, 512 or 1024 placement groups for 10 OSDs
491 etc.
492
493 Uneven data distribution can be caused by factors other than the ratio
494 between OSDs and placement groups. Since CRUSH does not take into
495 account the size of the objects, a few very large objects may create
an imbalance. Let's say one million 4K objects totaling 4GB are evenly
497 spread among 1024 placement groups on 10 OSDs. They will use 4GB / 10
498 = 400MB on each OSD. If one 400MB object is added to the pool, the
499 three OSDs supporting the placement group in which the object has been
500 placed will be filled with 400MB + 400MB = 800MB while the seven
501 others will remain occupied with only 400MB.
502
503 .. _resource usage:
504
505 Memory, CPU and network usage
506 -----------------------------
507
508 For each placement group, OSDs and MONs need memory, network and CPU
509 at all times and even more during recovery. Sharing this overhead by
510 clustering objects within a placement group is one of the main reasons
511 they exist.
512
513 Minimizing the number of placement groups saves significant amounts of
514 resources.
515
516 .. _choosing-number-of-placement-groups:
517
518 Choosing the number of Placement Groups
519 =======================================
520
.. note:: It is rarely necessary to do this math by hand. Instead, use the ``ceph osd pool autoscale-status`` command in combination with the ``target_size_bytes`` or ``target_size_ratio`` pool properties. See :ref:`pg-autoscaler` for more information.
522
523 If you have more than 50 OSDs, we recommend approximately 50-100
524 placement groups per OSD to balance out resource usage, data
durability and distribution. If you have fewer than 50 OSDs, choosing
among the `preselection`_ above is best. For a single pool of objects,
you can use the following formula to get a baseline:
528
529 Total PGs = :math:`\frac{OSDs \times 100}{pool \: size}`
530
531 Where **pool size** is either the number of replicas for replicated
532 pools or the K+M sum for erasure coded pools (as returned by **ceph
533 osd erasure-code-profile get**).
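
For example, to see the K and M values of the ``default`` erasure code profile:

.. prompt:: bash #

   ceph osd erasure-code-profile get default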
534
535 You should then check if the result makes sense with the way you
536 designed your Ceph cluster to maximize `data durability`_,
537 `object distribution`_ and minimize `resource usage`_.
538
539 The result should always be **rounded up to the nearest power of two**.
540
541 Only a power of two will evenly balance the number of objects among
542 placement groups. Other values will result in an uneven distribution of
543 data across your OSDs. Their use should be limited to incrementally
544 stepping from one power of two to another.
545
546 As an example, for a cluster with 200 OSDs and a pool size of 3
replicas, you would estimate your number of PGs as follows:
548
549 :math:`\frac{200 \times 100}{3} = 6667`. Nearest power of 2: 8192
550
551 When using multiple data pools for storing objects, you need to ensure
552 that you balance the number of placement groups per pool with the
553 number of placement groups per OSD so that you arrive at a reasonable
554 total number of placement groups that provides reasonably low variance
555 per OSD without taxing system resources or making the peering process
556 too slow.
557
For instance, a cluster of 10 pools, each with 512 placement groups on
ten OSDs, has a total of 5,120 placement groups spread over ten OSDs,
that is, 512 placement groups per OSD. That does not use too many
561 resources. However, if 1,000 pools were created with 512 placement
562 groups each, the OSDs will handle ~50,000 placement groups each and it
563 would require significantly more resources and time for peering.
564
565 You may find the `PGCalc`_ tool helpful.
566
567
568 .. _setting the number of placement groups:
569
570 Set the Number of Placement Groups
571 ==================================
572
573 To set the number of placement groups in a pool, you must specify the
574 number of placement groups at the time you create the pool.
See `Create a Pool`_ for details. Even after a pool has been created, you can also change the number of placement groups with:
576
577 .. prompt:: bash #
578
579 ceph osd pool set {pool-name} pg_num {pg_num}
580
581 After you increase the number of placement groups, you must also
582 increase the number of placement groups for placement (``pgp_num``)
583 before your cluster will rebalance. The ``pgp_num`` will be the number of
584 placement groups that will be considered for placement by the CRUSH
algorithm. Increasing ``pg_num`` splits the placement groups, but data
will not be migrated to the newer placement groups until the number of placement
groups for placement, i.e. ``pgp_num``, is increased. The ``pgp_num``
588 should be equal to the ``pg_num``. To increase the number of
589 placement groups for placement, execute the following:
590
591 .. prompt:: bash #
592
593 ceph osd pool set {pool-name} pgp_num {pgp_num}
594
595 When decreasing the number of PGs, ``pgp_num`` is adjusted
596 automatically for you.
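
For example, to increase both values to 128 for a hypothetical pool ``foo``:

.. prompt:: bash #

   ceph osd pool set foo pg_num 128
   ceph osd pool set foo pgp_num 128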
597
598 Get the Number of Placement Groups
599 ==================================
600
601 To get the number of placement groups in a pool, execute the following:
602
603 .. prompt:: bash #
604
605 ceph osd pool get {pool-name} pg_num
606
607
608 Get a Cluster's PG Statistics
609 =============================
610
611 To get the statistics for the placement groups in your cluster, execute the following:
612
613 .. prompt:: bash #
614
615 ceph pg dump [--format {format}]
616
617 Valid formats are ``plain`` (default) and ``json``.
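
For example, to dump the statistics in JSON format:

.. prompt:: bash #

   ceph pg dump --format json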
618
619
620 Get Statistics for Stuck PGs
621 ============================
622
623 To get the statistics for all placement groups stuck in a specified state,
624 execute the following:
625
626 .. prompt:: bash #
627
628 ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>]
629
630 **Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD
631 with the most up-to-date data to come up and in.
632
633 **Unclean** Placement groups contain objects that are not replicated the desired number
634 of times. They should be recovering.
635
636 **Stale** Placement groups are in an unknown state - the OSDs that host them have not
637 reported to the monitor cluster in a while (configured by ``mon_osd_report_timeout``).
638
639 Valid formats are ``plain`` (default) and ``json``. The threshold defines the minimum number
640 of seconds the placement group is stuck before including it in the returned statistics
641 (default 300 seconds).
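
For example, to list placement groups that have been stuck ``inactive`` for at
least 600 seconds:

.. prompt:: bash #

   ceph pg dump_stuck inactive -t 600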
642
643
644 Get a PG Map
645 ============
646
647 To get the placement group map for a particular placement group, execute the following:
648
649 .. prompt:: bash #
650
651 ceph pg map {pg-id}
652
653 For example:
654
655 .. prompt:: bash #
656
657 ceph pg map 1.6c
658
Ceph will return the placement group map, the placement group, and the OSD status::

   osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]
664
665
Get a PG's Statistics
=====================
668
669 To retrieve statistics for a particular placement group, execute the following:
670
671 .. prompt:: bash #
672
673 ceph pg {pg-id} query
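
For example, for the placement group ``1.6c`` shown in the previous section:

.. prompt:: bash #

   ceph pg 1.6c query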
674
675
676 Scrub a Placement Group
677 =======================
678
679 To scrub a placement group, execute the following:
680
681 .. prompt:: bash #
682
683 ceph pg scrub {pg-id}
684
685 Ceph checks the primary and any replica nodes, generates a catalog of all objects
686 in the placement group and compares them to ensure that no objects are missing
687 or mismatched, and their contents are consistent. Assuming the replicas all
688 match, a final semantic sweep ensures that all of the snapshot-related object
689 metadata is consistent. Errors are reported via logs.
690
691 To scrub all placement groups from a specific pool, execute the following:
692
693 .. prompt:: bash #
694
695 ceph osd pool scrub {pool-name}
696
697 Prioritize backfill/recovery of a Placement Group(s)
698 ====================================================
699
You may run into a situation where a number of placement groups require
recovery and/or backfill, and some particular groups hold data more important
than others (for example, those PGs may hold data for images used by running
machines, while other PGs may be used by inactive machines or hold less relevant data).
704 In that case, you may want to prioritize recovery of those groups so
705 performance and/or availability of data stored on those groups is restored
706 earlier. To do this (mark particular placement group(s) as prioritized during
707 backfill or recovery), execute the following:
708
709 .. prompt:: bash #
710
711 ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
712 ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]
713
714 This will cause Ceph to perform recovery or backfill on specified placement
715 groups first, before other placement groups. This does not interrupt currently
716 ongoing backfills or recovery, but causes specified PGs to be processed
as soon as possible. If you change your mind or prioritize the wrong groups,
718 use:
719
720 .. prompt:: bash #
721
722 ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
723 ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]
724
This will remove the "force" flag from those PGs and they will be processed
in default order. Again, this doesn't affect currently processed placement
groups, only those that are still queued.

The "force" flag is cleared automatically after recovery or backfill of the group
is done.
731
732 Similarly, you may use the following commands to force Ceph to perform recovery
733 or backfill on all placement groups from a specified pool first:
734
735 .. prompt:: bash #
736
737 ceph osd pool force-recovery {pool-name}
738 ceph osd pool force-backfill {pool-name}
739
740 or:
741
742 .. prompt:: bash #
743
744 ceph osd pool cancel-force-recovery {pool-name}
745 ceph osd pool cancel-force-backfill {pool-name}
746
747 to restore to the default recovery or backfill priority if you change your mind.
748
749 Note that these commands could possibly break the ordering of Ceph's internal
750 priority computations, so use them with caution!
In particular, if you have multiple pools that are currently sharing the same
underlying OSDs, and some particular pools hold data more important than others,
we recommend you use the following command to re-arrange all pools'
recovery/backfill priorities in a better order:
755
756 .. prompt:: bash #
757
758 ceph osd pool set {pool-name} recovery_priority {value}
759
For example, if you have 10 pools you could make the most important one priority 10,
the next 9, and so on. Or you could leave most pools alone and give, say, 3 important
pools priority 1 each, or priorities 3, 2, and 1 respectively.
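
For example, to give a hypothetical pool named ``important-pool`` the highest
priority:

.. prompt:: bash #

   ceph osd pool set important-pool recovery_priority 10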
763
764 Revert Lost
765 ===========
766
767 If the cluster has lost one or more objects, and you have decided to
768 abandon the search for the lost data, you must mark the unfound objects
769 as ``lost``.
770
771 If all possible locations have been queried and objects are still
lost, you may have to give up on the lost objects. This situation is
possible given unusual combinations of failures that allow the cluster
to learn about writes that were performed before the writes themselves
can be recovered.
776
777 Currently the only supported option is "revert", which will either roll back to
778 a previous version of the object or (if it was a new object) forget about it
779 entirely. To mark the "unfound" objects as "lost", execute the following:
780
781 .. prompt:: bash #
782
783 ceph pg {pg-id} mark_unfound_lost revert|delete
784
785 .. important:: Use this feature with caution, because it may confuse
786 applications that expect the object(s) to exist.
787
788
789 .. toctree::
790 :hidden:
791
792 pg-states
793 pg-concepts
794
795
796 .. _Create a Pool: ../pools#createpool
797 .. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
798 .. _pgcalc: https://old.ceph.com/pgcalc/