==================
 Placement Groups
==================

.. _pg-autoscaler:

Autoscaling placement groups
============================

Placement groups (PGs) are an internal implementation detail of how
Ceph distributes data. You can allow the cluster to either make
recommendations or automatically tune PGs based on how the cluster is
used by enabling *pg-autoscaling*.

Each pool in the system has a ``pg_autoscale_mode`` property that can be set to ``off``, ``on``, or ``warn``.

* ``off``: Disable autoscaling for this pool. It is up to the administrator to choose an appropriate PG number for each pool. Please refer to :ref:`choosing-number-of-placement-groups` for more information.
* ``on``: Enable automated adjustments of the PG count for the given pool.
* ``warn``: Raise health alerts when the PG count should be adjusted.

To set the autoscaling mode for an existing pool::

  ceph osd pool set <pool-name> pg_autoscale_mode <mode>

For example, to enable autoscaling on pool ``foo``::

  ceph osd pool set foo pg_autoscale_mode on

You can also configure the default ``pg_autoscale_mode`` that is
applied to any pools that are created in the future with::

  ceph config set global osd_pool_default_pg_autoscale_mode <mode>

Viewing PG scaling recommendations
----------------------------------

You can view each pool, its relative utilization, and any suggested changes to
the PG count with this command::

  ceph osd pool autoscale-status

Output will be something like::

   POOL    SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
   a     12900M                3.0        82431M  0.4695                                      8         128  warn
   c          0                3.0        82431M  0.0000        0.2000           0.9884       1          64  warn
   b          0       953.6M   3.0        82431M  0.0347                                      8              warn

**SIZE** is the amount of data stored in the pool. **TARGET SIZE**, if
present, is the amount of data the administrator has specified that
they expect to eventually be stored in this pool. The system uses
the larger of the two values for its calculation.

**RATE** is the multiplier for the pool that determines how much raw
storage capacity is consumed. For example, a 3 replica pool will
have a rate of 3.0, while a k=4,m=2 erasure coded pool will have a
rate of 1.5.

**RAW CAPACITY** is the total amount of raw storage capacity on the
OSDs that are responsible for storing this pool's (and perhaps other
pools') data. **RATIO** is the fraction of that total capacity that
this pool is consuming (i.e., ratio = size * rate / raw capacity).

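As a quick check of the **RATIO** formula, the following Python sketch
recomputes the values shown for pools ``a`` and ``b`` in the sample output
above. The function name and the use of MiB as the common unit are
assumptions made for illustration; this is not the autoscaler's own code.

```python
def capacity_ratio(size, target_size, rate, raw_capacity):
    """Return the RATIO column: max(SIZE, TARGET SIZE) * RATE / RAW CAPACITY.

    All sizes must use the same unit (MiB in the examples below).
    """
    effective_size = max(size, target_size)  # the larger of the two values is used
    return effective_size * rate / raw_capacity


# Pool "a": 12900M stored, no target size, 3 replicas, 82431M raw capacity.
print(round(capacity_ratio(12900, 0, 3.0, 82431), 4))   # 0.4695
# Pool "b": nothing stored yet, but a 953.6M target size.
print(round(capacity_ratio(0, 953.6, 3.0, 82431), 4))   # 0.0347
```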
**TARGET RATIO**, if present, is the ratio of storage that the
administrator has specified that they expect this pool to consume
relative to other pools with target ratios set.
If both target size bytes and ratio are specified, the
ratio takes precedence.

**EFFECTIVE RATIO** is the target ratio after adjusting in two ways:

1. subtracting any capacity expected to be used by pools with target size set
2. normalizing the target ratios among pools with target ratio set so
   they collectively target the rest of the space. For example, 4
   pools with target_ratio 1.0 would have an effective ratio of 0.25.

The system uses the larger of the actual ratio and the effective ratio
for its calculation.

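The two adjustments behind **EFFECTIVE RATIO** can be sketched in Python as
follows. This is a simplified illustration, not the autoscaler's
implementation: the function name and the ``reserved_fraction`` parameter
(the share of raw capacity already claimed by pools with a target size) are
hypothetical, and how Ceph derives that share is not modeled here.

```python
def effective_ratios(target_ratios, reserved_fraction=0.0):
    """Normalize target ratios over the capacity not reserved by target sizes.

    target_ratios: dict mapping pool name -> target_size_ratio
    reserved_fraction: hypothetical input, the fraction of raw capacity
        expected to be used by pools that have a target size set
    """
    total = sum(target_ratios.values())
    if total <= 0:
        return {name: 0.0 for name in target_ratios}
    remaining = max(0.0, 1.0 - reserved_fraction)  # step 1: subtract reserved space
    # step 2: normalize so the pools collectively target the remaining space
    return {name: ratio / total * remaining
            for name, ratio in target_ratios.items()}


# Four pools with target_size_ratio 1.0 each end up at 0.25.
print(effective_ratios({"p1": 1.0, "p2": 1.0, "p3": 1.0, "p4": 1.0}))
```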
**PG_NUM** is the current number of PGs for the pool (or the current
number of PGs that the pool is working towards, if a ``pg_num``
change is in progress). **NEW PG_NUM**, if present, is what the
system believes the pool's ``pg_num`` should be changed to. It is
always a power of 2, and will only be present if the "ideal" value
varies from the current value by more than a factor of 3.

The final column, **AUTOSCALE**, is the pool ``pg_autoscale_mode``,
and will be either ``on``, ``off``, or ``warn``.


Automated scaling
-----------------

Allowing the cluster to automatically scale PGs based on usage is the
simplest approach. Ceph will look at the total available storage and
target number of PGs for the whole system, look at how much data is
stored in each pool, and try to apportion the PGs accordingly. The
system is relatively conservative with its approach, only making
changes to a pool when the current number of PGs (``pg_num``) differs
by more than a factor of 3 from what it thinks it should be.

The target number of PGs per OSD is based on the
``mon_target_pg_per_osd`` configurable (default: 100), which can be
adjusted with::

  ceph config set global mon_target_pg_per_osd 100

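The decision rule described above (apportion a PG budget by capacity ratio,
pick a power of two, and only act when the result is off by more than a
factor of 3) can be sketched in Python. This is a minimal illustration under
stated assumptions: the function names are hypothetical, the cluster is
treated as a single CRUSH subtree, and details such as ``pg_num_min`` are
ignored, so the real autoscaler may produce different numbers.

```python
def nearest_power_of_two(n):
    """Return the power of two closest to n (never less than 1)."""
    if n <= 1:
        return 1
    upper = 1
    while upper < n:
        upper *= 2
    lower = upper // 2
    return lower if (n - lower) < (upper - n) else upper


def suggest_pg_num(ratio, osd_count, rate, current_pg_num,
                   target_pg_per_osd=100, threshold=3.0):
    """Return a suggested pg_num, or None if the current value is close enough.

    ratio: the pool's share of raw capacity (larger of actual and effective)
    rate: replication or erasure-coding multiplier of the pool
    """
    ideal = ratio * osd_count * target_pg_per_osd / rate
    candidate = nearest_power_of_two(ideal)
    too_small = candidate > current_pg_num * threshold
    too_large = candidate < current_pg_num / threshold
    return candidate if (too_small or too_large) else None


# A 3-replica pool expected to use half of a 10-OSD cluster:
print(suggest_pg_num(0.5, 10, 3.0, current_pg_num=8))    # 128
print(suggest_pg_num(0.5, 10, 3.0, current_pg_num=64))   # None
```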
The autoscaler analyzes pools and adjusts on a per-subtree basis.
Because each pool may map to a different CRUSH rule, and each rule may
distribute data across different devices, Ceph will consider
utilization of each subtree of the hierarchy independently. For
example, a pool that maps to OSDs of class `ssd` and a pool that maps
to OSDs of class `hdd` will each have optimal PG counts that depend on
the number of those respective device types.


.. _specifying_pool_target_size:

Specifying expected pool size
-----------------------------

When a cluster or pool is first created, it will consume a small
fraction of the total cluster capacity and will appear to the system
as if it should only need a small number of placement groups.
However, in most cases cluster administrators have a good idea which
pools are expected to consume most of the system capacity over time.
By providing this information to Ceph, a more appropriate number of
PGs can be used from the beginning, preventing subsequent changes in
``pg_num`` and the overhead associated with moving data around when
those adjustments are made.

The *target size* of a pool can be specified in two ways: either in
terms of the absolute size of the pool (i.e., bytes), or as a weight
relative to other pools with a ``target_size_ratio`` set.

For example::

  ceph osd pool set mypool target_size_bytes 100T

will tell the system that `mypool` is expected to consume 100 TiB of
space. Alternatively::

  ceph osd pool set mypool target_size_ratio 1.0

will tell the system that `mypool` is expected to consume 1.0 relative
to the other pools with ``target_size_ratio`` set. If `mypool` is the
only pool in the cluster, this means an expected use of 100% of the
total capacity. If there is a second pool with ``target_size_ratio``
1.0, each pool would be expected to use 50% of the cluster capacity.

You can also set the target size of a pool at creation time with the optional ``--target-size-bytes <bytes>`` or ``--target-size-ratio <ratio>`` arguments to the ``ceph osd pool create`` command.

Note that if impossible target size values are specified (for example,
a capacity larger than the total cluster) then a health warning
(``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised.

If both ``target_size_ratio`` and ``target_size_bytes`` are specified
for a pool, only the ratio will be considered, and a health warning
(``POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO``) will be issued.

Specifying bounds on a pool's PGs
---------------------------------

It is also possible to specify a minimum number of PGs for a pool.
This is useful for establishing a lower bound on the amount of
parallelism clients will see when doing IO, even when a pool is mostly
empty. Setting the lower bound prevents Ceph from reducing (or
recommending you reduce) the PG number below the configured number.

You can set the minimum number of PGs for a pool with::

  ceph osd pool set <pool-name> pg_num_min <num>

You can also specify the minimum PG count at pool creation time with
the optional ``--pg-num-min <num>`` argument to the ``ceph osd pool
create`` command.

.. _preselection:

A preselection of pg_num
========================

When creating a new pool with::

  ceph osd pool create {pool-name} [pg_num]

it is optional to choose the value of ``pg_num``. If you do not
specify ``pg_num``, the cluster can (by default) automatically tune it
for you based on how much data is stored in the pool (see above, :ref:`pg-autoscaler`).

Alternatively, ``pg_num`` can be explicitly provided. However,
whether you specify a ``pg_num`` value or not does not affect whether
the value is automatically tuned by the cluster after the fact. To
enable or disable auto-tuning::

  ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn)

The "rule of thumb" for PGs per OSD has traditionally been 100. With
the addition of the balancer (which is also enabled by default), a
value of more like 50 PGs per OSD is probably reasonable. The
challenge (which the autoscaler normally handles for you) is to:

- have the PGs per pool proportional to the data in the pool, and
- end up with 50-100 PGs per OSD, after the replication or
  erasure-coding fan-out of each PG across OSDs is taken into
  consideration

How are Placement Groups used?
==============================

A placement group (PG) aggregates objects within a pool because
tracking object placement and object metadata on a per-object basis is
computationally expensive--i.e., a system with millions of objects
cannot realistically track placement on a per-object basis.

.. ditaa::

            /-----\  /-----\  /-----\  /-----\  /-----\
            | obj |  | obj |  | obj |  | obj |  | obj |
            \-----/  \-----/  \-----/  \-----/  \-----/
               |        |        |        |        |
               +--------+--------+        +---+----+
               |                              |
               v                              v
   +-----------------------+      +-----------------------+
   |  Placement Group #1   |      |  Placement Group #2   |
   |                       |      |                       |
   +-----------------------+      +-----------------------+
               |                              |
               +------------------------------+
                              |
                              v
                  +-----------------------+
                  |         Pool          |
                  |                       |
                  +-----------------------+

The Ceph client will calculate which placement group an object should
be in. It does this by hashing the object ID and applying an operation
based on the number of PGs in the defined pool and the ID of the pool.
See `Mapping PGs to OSDs`_ for details.

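The client-side calculation described above can be sketched in Python. This
is only an illustration of the idea (hash the object name, fold the hash into
the range of ``pg_num``, prefix the pool ID): ``zlib.crc32`` is a stand-in
for the hash function Ceph actually uses, the function names are
hypothetical, and the resulting PG IDs will not match those of a real
cluster.

```python
import zlib


def stable_mod(x, b, bmask):
    """Fold x into the range [0, b) so that most values keep their PG when b grows.

    bmask is the smallest (2^n - 1) that is >= b - 1.
    """
    if (x & bmask) < b:
        return x & bmask
    return x & (bmask >> 1)


def pg_for_object(object_name, pool_id, pg_num):
    """Return a PG ID string of the form '<pool-id>.<hex seed>'."""
    bmask = (1 << (pg_num - 1).bit_length()) - 1
    # Stand-in hash; a real Ceph client uses its own hash function.
    object_hash = zlib.crc32(object_name.encode("utf-8"))
    seed = stable_mod(object_hash, pg_num, bmask)
    return f"{pool_id}.{seed:x}"


print(pg_for_object("my-object", pool_id=1, pg_num=128))
```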
The object's contents within a placement group are stored in a set of
OSDs. For instance, in a replicated pool of size two, each placement
group will store objects on two OSDs, as shown below.

.. ditaa::

   +-----------------------+      +-----------------------+
   |  Placement Group #1   |      |  Placement Group #2   |
   |                       |      |                       |
   +-----------------------+      +-----------------------+
        |             |               |             |
        v             v               v             v
   /----------\  /----------\    /----------\  /----------\
   |          |  |          |    |          |  |          |
   |  OSD #1  |  |  OSD #2  |    |  OSD #2  |  |  OSD #3  |
   |          |  |          |    |          |  |          |
   \----------/  \----------/    \----------/  \----------/


Should OSD #2 fail, another will be assigned to Placement Group #1 and
will be filled with copies of all objects in OSD #1. If the pool size
is changed from two to three, an additional OSD will be assigned to
the placement group and will receive copies of all objects in the
placement group.

Placement groups do not own the OSD; they share it with other
placement groups from the same pool or even other pools. If OSD #2
fails, Placement Group #2 will also have to restore copies of
objects, using OSD #3.

When the number of placement groups increases, the new placement
groups will be assigned OSDs. The result of the CRUSH function will
also change and some objects from the former placement groups will be
copied over to the new Placement Groups and removed from the old ones.

Placement Groups Tradeoffs
==========================

Data durability and even distribution among all OSDs call for more
placement groups, but their number should be kept to the minimum
needed in order to save CPU and memory.

.. _data durability:

Data durability
---------------

After an OSD fails, the risk of data loss increases until the data it
contained is fully recovered. Let's imagine a scenario that causes
permanent data loss in a single placement group:

- The OSD fails and all copies of the objects it contains are lost.
  For all objects within the placement group, the number of replicas
  suddenly drops from three to two.

- Ceph starts recovery for this placement group by choosing a new OSD
  to re-create the third copy of all objects.

- Another OSD, within the same placement group, fails before the new
  OSD is fully populated with the third copy. Some objects will then
  have only one surviving copy.

- Ceph picks yet another OSD and keeps copying objects to restore the
  desired number of copies.

- A third OSD, within the same placement group, fails before recovery
  is complete. If this OSD contained the only remaining copy of an
  object, it is permanently lost.

In a cluster containing 10 OSDs with 512 placement groups in a three
replica pool, CRUSH will give each placement group three OSDs. In the
end, each OSD will end up hosting (512 * 3) / 10 = ~150 Placement
Groups. When the first OSD fails, the above scenario will therefore
start recovery for all 150 placement groups at the same time.

The 150 placement groups being recovered are likely to be
homogeneously spread over the 9 remaining OSDs. Each remaining OSD is
therefore likely to send copies of objects to all others and also
receive some new objects to be stored because they became part of a
new placement group.

The amount of time it takes for this recovery to complete depends
entirely on the architecture of the Ceph cluster. Let's say each OSD is
hosted by a 1TB SSD on a single machine, all of them are connected
to a 10Gb/s switch, and the recovery for a single OSD completes within
M minutes. If there are two OSDs per machine using spinners with no
SSD journal and a 1Gb/s switch, it will be at least an order of
magnitude slower.

In a cluster of this size, the number of placement groups has almost
no influence on data durability. It could be 128 or 8192 and the
recovery would not be slower or faster.

However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs
is likely to speed up recovery and therefore improve data durability
significantly. Each OSD now participates in only ~75 placement groups
instead of ~150 when there were only 10 OSDs and it will still require
all 19 remaining OSDs to perform the same amount of object copies in
order to recover. But where 10 OSDs had to copy approximately 100GB
each, they now have to copy 50GB each instead. If the network was the
bottleneck, recovery will happen twice as fast. In other words,
recovery goes faster when the number of OSDs increases.

If this cluster grows to 40 OSDs, each of them will only host ~35
placement groups. If an OSD dies, recovery will keep going faster
unless it is blocked by another bottleneck. However, if this cluster
grows to 200 OSDs, each of them will only host ~7 placement groups. If
an OSD dies, recovery will happen between at most ~21 (7 * 3) OSDs
in these placement groups: recovery will take longer than when there
were 40 OSDs, meaning the number of placement groups should be
increased.

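The per-OSD figures used in this walkthrough all come from the same
expression, ``pg_num * pool size / number of OSDs``. The short Python sketch
below recomputes them for the 512-PG, three-replica pool; the function name
is chosen for illustration, and the prose above rounds the results loosely.

```python
def pg_replicas_per_osd(pg_num, pool_size, osd_count):
    """Average number of PG replicas hosted by each OSD."""
    return pg_num * pool_size / osd_count


for osd_count in (10, 20, 40, 200):
    print(osd_count, round(pg_replicas_per_osd(512, 3, osd_count), 1))
# 10 153.6
# 20 76.8
# 40 38.4
# 200 7.7
```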
No matter how short the recovery time is, there is a chance for a
second OSD to fail while it is in progress. In the 10-OSD cluster
described above, if a second OSD fails, then ~17 placement groups
(i.e. ~150 / 9 placement groups being recovered) will only have one
surviving copy. And if any of the 8 remaining OSDs fails, the last
objects of two placement groups are likely to be lost (i.e. ~17 / 8
placement groups with only one remaining copy being recovered).

When the size of the cluster grows to 20 OSDs, the number of Placement
Groups damaged by the loss of three OSDs drops. The second OSD lost
will degrade ~4 (i.e. ~75 / 19 placement groups being recovered)
instead of ~17 and the third OSD lost will only lose data if it is one
of the four OSDs containing the surviving copy. In other words, if the
probability of losing one OSD is 0.0001% during the recovery time
frame, it goes from 17 * 10 * 0.0001% in the cluster with 10 OSDs to 4 * 20 *
0.0001% in the cluster with 20 OSDs.

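The back-of-the-envelope figures above can be reproduced with a few lines of
Python. This sketch only mirrors the arithmetic in the text; it is not a
rigorous durability model, and the function name is hypothetical.

```python
def exposure(pgs_per_osd, osd_count, p_fail_percent):
    """Mirror the rough estimate used in the text.

    Returns (degraded, risk_percent) where degraded is the number of PGs
    left with a single copy after a second OSD failure, and risk_percent
    is degraded * osd_count * p_fail_percent.
    """
    degraded = round(pgs_per_osd / (osd_count - 1))
    return degraded, degraded * osd_count * p_fail_percent


print(exposure(150, 10, 0.0001))  # 17 degraded PGs, roughly 0.017 %
print(exposure(75, 20, 0.0001))   # 4 degraded PGs, roughly 0.008 %
```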
In a nutshell, more OSDs mean faster recovery and a lower risk of
cascading failures leading to the permanent loss of a Placement
Group. Having 512 or 4096 Placement Groups is roughly equivalent in a
cluster with fewer than 50 OSDs as far as data durability is concerned.

Note: It may take a long time for a new OSD added to the cluster to be
populated with placement groups that were assigned to it. However
there is no degradation of any object and it has no impact on the
durability of the data contained in the Cluster.

.. _object distribution:

Object distribution within a pool
---------------------------------

Ideally objects are evenly distributed in each placement group. Since
CRUSH computes the placement group for each object, but does not
actually know how much data is stored in each OSD within this
placement group, the ratio between the number of placement groups and
the number of OSDs may influence the distribution of the data
significantly.

For instance, if there was a single placement group for ten OSDs in a
three replica pool, only three OSDs would be used because CRUSH would
have no other choice. When more placement groups are available,
objects are more likely to be evenly spread among them. CRUSH also
makes every effort to evenly spread OSDs among all existing Placement
Groups.

As long as there are one or two orders of magnitude more Placement
Groups than OSDs, the distribution should be even. For instance, 256
placement groups for 3 OSDs, 512 or 1024 placement groups for 10 OSDs
etc.

Uneven data distribution can be caused by factors other than the ratio
between OSDs and placement groups. Since CRUSH does not take into
account the size of the objects, a few very large objects may create
an imbalance. Let's say one million 4K objects totaling 4GB are evenly
spread among 1024 placement groups on 10 OSDs. They will use 4GB / 10
= 400MB on each OSD. If one 400MB object is added to the pool, the
three OSDs supporting the placement group in which the object has been
placed will be filled with 400MB + 400MB = 800MB while the seven
others will remain occupied with only 400MB.

.. _resource usage:

Memory, CPU and network usage
-----------------------------

For each placement group, OSDs and MONs need memory, network and CPU
at all times and even more during recovery. Sharing this overhead by
clustering objects within a placement group is one of the main reasons
they exist.

Minimizing the number of placement groups saves significant amounts of
resources.

.. _choosing-number-of-placement-groups:

Choosing the number of Placement Groups
=======================================

.. note:: It is rarely necessary to do this math by hand. Instead, use the ``ceph osd pool autoscale-status`` command in combination with the ``target_size_bytes`` or ``target_size_ratio`` pool properties. See :ref:`pg-autoscaler` for more information.

If you have more than 50 OSDs, we recommend approximately 50-100
placement groups per OSD to balance out resource usage, data
durability and distribution. If you have fewer than 50 OSDs, choosing
among the `preselection`_ above is best. For a single pool of objects,
you can use the following formula to get a baseline::

                (OSDs * 100)
   Total PGs =  ------------
                 pool size

Where **pool size** is either the number of replicas for replicated
pools or the K+M sum for erasure coded pools (as returned by **ceph
osd erasure-code-profile get**).

You should then check if the result makes sense with the way you
designed your Ceph cluster to maximize `data durability`_,
`object distribution`_ and minimize `resource usage`_.

The result should always be **rounded up to the nearest power of two**.

Only a power of two will evenly balance the number of objects among
placement groups. Other values will result in an uneven distribution of
data across your OSDs. Their use should be limited to incrementally
stepping from one power of two to another.

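One way to see why a ``pg_num`` that is not a power of two leads to uneven
placement groups is to count how a uniform range of hash values is folded
into the available PGs. The Python sketch below assumes the stable-modulo
folding commonly described for Ceph's object-to-PG mapping; the function
names and the size of the sample hash space are chosen for illustration.

```python
from collections import Counter


def stable_mod(x, b, bmask):
    """Fold x into [0, b); bmask is the smallest (2^n - 1) >= b - 1."""
    if (x & bmask) < b:
        return x & bmask
    return x & (bmask >> 1)


def pg_load(pg_num, hash_space=1024):
    """Count how many of hash_space consecutive hash values land in each PG."""
    bmask = (1 << (pg_num - 1).bit_length()) - 1
    return Counter(stable_mod(h, pg_num, bmask) for h in range(hash_space))


uneven = pg_load(12)   # not a power of two
even = pg_load(16)     # power of two
print(min(uneven.values()), max(uneven.values()))  # 64 128
print(min(even.values()), max(even.values()))      # 64 64
```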
As an example, for a cluster with 200 OSDs and a pool size of 3
replicas, you would estimate your number of PGs as follows::

   (200 * 100)
   ----------- = 6667. Nearest power of 2: 8192
        3

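The same baseline can be computed with a small Python helper. This is a
sketch of the formula above, with a hypothetical function name; it rounds the
quotient up to the next power of two using integer arithmetic.

```python
import math


def baseline_total_pgs(osd_count, pool_size, pgs_per_osd=100):
    """Return (OSDs * pgs_per_osd) / pool size, rounded up to a power of two."""
    raw = math.ceil(osd_count * pgs_per_osd / pool_size)
    if raw <= 1:
        return 1
    return 1 << (raw - 1).bit_length()  # smallest power of two >= raw


print(baseline_total_pgs(200, 3))  # 8192
```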
When using multiple data pools for storing objects, you need to ensure
that you balance the number of placement groups per pool with the
number of placement groups per OSD so that you arrive at a reasonable
total number of placement groups that provides reasonably low variance
per OSD without taxing system resources or making the peering process
too slow.

For instance a cluster of 10 pools each with 512 placement groups on
ten OSDs is a total of 5,120 placement groups spread over ten OSDs,
that is 512 placement groups per OSD. That does not use too many
resources. However, if 1,000 pools were created with 512 placement
groups each, the OSDs will handle ~50,000 placement groups each and it
would require significantly more resources and time for peering.

You may find the `PGCalc`_ tool helpful.


.. _setting the number of placement groups:

Set the Number of Placement Groups
==================================

To set the number of placement groups in a pool, specify it at the
time you create the pool. See `Create a Pool`_ for details. You can
also change the number of placement groups after a pool has been
created with::

  ceph osd pool set {pool-name} pg_num {pg_num}

After you increase the number of placement groups, you must also
increase the number of placement groups for placement (``pgp_num``)
before your cluster will rebalance. The ``pgp_num`` will be the number of
placement groups that will be considered for placement by the CRUSH
algorithm. Increasing ``pg_num`` splits the placement groups but data
will not be migrated to the newer placement groups until the number of
placement groups for placement, i.e. ``pgp_num``, is increased. The
``pgp_num`` should be equal to the ``pg_num``. To increase the number of
placement groups for placement, execute the following::

  ceph osd pool set {pool-name} pgp_num {pgp_num}

When decreasing the number of PGs, ``pgp_num`` is adjusted
automatically for you.

Get the Number of Placement Groups
==================================

To get the number of placement groups in a pool, execute the following::

  ceph osd pool get {pool-name} pg_num


Get a Cluster's PG Statistics
=============================

To get the statistics for the placement groups in your cluster, execute the following::

  ceph pg dump [--format {format}]

Valid formats are ``plain`` (default) and ``json``.


Get Statistics for Stuck PGs
============================

To get the statistics for all placement groups stuck in a specified state,
execute the following::

  ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>]

**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD
with the most up-to-date data to come up and in.

**Unclean** Placement groups contain objects that are not replicated the desired number
of times. They should be recovering.

**Stale** Placement groups are in an unknown state - the OSDs that host them have not
reported to the monitor cluster in a while (configured by ``mon_osd_report_timeout``).

Valid formats are ``plain`` (default) and ``json``. The threshold defines the minimum number
of seconds the placement group is stuck before including it in the returned statistics
(default 300 seconds).


Get a PG Map
============

To get the placement group map for a particular placement group, execute the following::

  ceph pg map {pg-id}

For example::

  ceph pg map 1.6c

Ceph will return the placement group map, the placement group, and the OSD status::

  osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]


Get a PG's Statistics
=====================

To retrieve statistics for a particular placement group, execute the following::

  ceph pg {pg-id} query


Scrub a Placement Group
=======================

To scrub a placement group, execute the following::

  ceph pg scrub {pg-id}

Ceph checks the primary and any replica nodes, generates a catalog of all objects
in the placement group and compares them to ensure that no objects are missing
or mismatched, and their contents are consistent. Assuming the replicas all
match, a final semantic sweep ensures that all of the snapshot-related object
metadata is consistent. Errors are reported via logs.

To scrub all placement groups from a specific pool, execute the following::

  ceph osd pool scrub {pool-name}

Prioritize backfill/recovery of a Placement Group(s)
====================================================

You may run into a situation where a bunch of placement groups will require
recovery and/or backfill, and some particular groups hold data more important
than others (for example, those PGs may hold data for images used by running
machines and other PGs may be used by inactive machines/less relevant data).
In that case, you may want to prioritize recovery of those groups so
performance and/or availability of data stored on those groups is restored
earlier. To do this (mark particular placement group(s) as prioritized during
backfill or recovery), execute the following::

  ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
  ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

This will cause Ceph to perform recovery or backfill on specified placement
groups first, before other placement groups. This does not interrupt currently
ongoing backfills or recovery, but causes specified PGs to be processed
as soon as possible. If you change your mind or prioritize the wrong groups,
use::

  ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
  ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

This will remove the "force" flag from those PGs and they will be processed
in default order. Again, this doesn't affect placement groups that are
currently being processed, only those that are still queued.

The "force" flag is cleared automatically after recovery or backfill of the
group is done.

Similarly, you may use the following commands to force Ceph to perform recovery
or backfill on all placement groups from a specified pool first::

  ceph osd pool force-recovery {pool-name}
  ceph osd pool force-backfill {pool-name}

or::

  ceph osd pool cancel-force-recovery {pool-name}
  ceph osd pool cancel-force-backfill {pool-name}

to restore the default recovery or backfill priority if you change your mind.

Note that these commands could possibly break the ordering of Ceph's internal
priority computations, so use them with caution!
In particular, if you have multiple pools that are currently sharing the same
underlying OSDs, and some particular pools hold data more important than others,
we recommend you use the following command to re-arrange all pools'
recovery/backfill priority in a better order::

  ceph osd pool set {pool-name} recovery_priority {value}

For example, if you have 10 pools you could make the most important one priority 10,
next 9, etc. Or you could leave most pools alone and have say 3 important pools
all priority 1 or priorities 3, 2, 1 respectively.

Revert Lost
===========

If the cluster has lost one or more objects, and you have decided to
abandon the search for the lost data, you must mark the unfound objects
as ``lost``.

If all possible locations have been queried and objects are still
lost, you may have to give up on the lost objects. This is
possible given unusual combinations of failures that allow the cluster
to learn about writes that were performed before the writes themselves
are recovered.

Currently the only supported option is "revert", which will either roll back to
a previous version of the object or (if it was a new object) forget about it
entirely. To mark the "unfound" objects as "lost", execute the following::

  ceph pg {pg-id} mark_unfound_lost revert|delete

.. important:: Use this feature with caution, because it may confuse
   applications that expect the object(s) to exist.


.. toctree::
   :hidden:

   pg-states
   pg-concepts


.. _Create a Pool: ../pools#createpool
.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
.. _pgcalc: http://ceph.com/pgcalc/