==================
 Placement Groups
==================

.. _pg-autoscaler:

Autoscaling placement groups
============================

Placement groups (PGs) are an internal implementation detail of how
Ceph distributes data. You can allow the cluster to either make
recommendations or automatically tune PGs based on how the cluster is
used by enabling *pg-autoscaling*.

Each pool in the system has a ``pg_autoscale_mode`` property that can be set to ``off``, ``on``, or ``warn``.

* ``off``: Disable autoscaling for this pool. It is up to the administrator to choose an appropriate PG number for each pool. Please refer to :ref:`choosing-number-of-placement-groups` for more information.
* ``on``: Enable automated adjustments of the PG count for the given pool.
* ``warn``: Raise health alerts when the PG count should be adjusted.

To set the autoscaling mode for an existing pool::

  ceph osd pool set <pool-name> pg_autoscale_mode <mode>

For example, to enable autoscaling on pool ``foo``::

  ceph osd pool set foo pg_autoscale_mode on

You can also configure the default ``pg_autoscale_mode`` that is
applied to any pools that are created in the future with::

  ceph config set global osd_pool_default_pg_autoscale_mode <mode>

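For example, to make autoscaling the default for new pools::

  ceph config set global osd_pool_default_pg_autoscale_mode on
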
Viewing PG scaling recommendations
----------------------------------

You can view each pool, its relative utilization, and any suggested changes to
the PG count with this command::

  ceph osd pool autoscale-status

Output will be something like::

 POOL    SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
 a     12900M               3.0         82431M  0.4695                                      8         128  warn
 c          0               3.0         82431M  0.0000        0.2000           0.9884       1          64  warn
 b          0       953.6M  3.0         82431M  0.0347                                      8              warn

**SIZE** is the amount of data stored in the pool. **TARGET SIZE**, if
present, is the amount of data the administrator has specified that
they expect to eventually be stored in this pool. The system uses
the larger of the two values for its calculation.

**RATE** is the multiplier for the pool that determines how much raw
storage capacity is consumed. For example, a 3 replica pool will
have a ratio of 3.0, while a k=4,m=2 erasure coded pool will have a
ratio of 1.5.

**RAW CAPACITY** is the total amount of raw storage capacity on the
OSDs that are responsible for storing this pool's (and perhaps other
pools') data. **RATIO** is the ratio of that total capacity that
this pool is consuming (i.e., ratio = size * rate / raw capacity).

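As a worked example using the sample output above, pool ``a`` stores
12900M of data with a rate of 3.0 against 82431M of raw capacity, so
its ratio is 12900M * 3.0 / 82431M = 0.4695, matching the **RATIO**
column.
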
64**TARGET RATIO**, if present, is the ratio of storage that the
9f95a23c
TL
65administrator has specified that they expect this pool to consume
66relative to other pools with target ratios set.
67If both target size bytes and ratio are specified, the
11fdf7f2
TL
68ratio takes precedence.
69
9f95a23c
TL
70**EFFECTIVE RATIO** is the target ratio after adjusting in two ways:
71
721. subtracting any capacity expected to be used by pools with target size set
732. normalizing the target ratios among pools with target ratio set so
74 they collectively target the rest of the space. For example, 4
75 pools with target_ratio 1.0 would have an effective ratio of 0.25.
76
77The system uses the larger of the actual ratio and the effective ratio
78for its calculation.
79
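In the sample output above, pool ``b`` has a target size of 953.6M and
pool ``c`` is the only pool with a target ratio set, so ``c``'s ratio
is normalized to cover the whole remaining space: (82431M - 953.6M) /
82431M = 0.9884, which matches its **EFFECTIVE RATIO** column.
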
**PG_NUM** is the current number of PGs for the pool (or the current
number of PGs that the pool is working towards, if a ``pg_num``
change is in progress). **NEW PG_NUM**, if present, is what the
system believes the pool's ``pg_num`` should be changed to. It is
always a power of 2, and will only be present if the "ideal" value
varies from the current value by more than a factor of 3.

The final column, **AUTOSCALE**, is the pool ``pg_autoscale_mode``,
and will be either ``on``, ``off``, or ``warn``.


Automated scaling
-----------------

Allowing the cluster to automatically scale PGs based on usage is the
simplest approach. Ceph will look at the total available storage and
target number of PGs for the whole system, look at how much data is
stored in each pool, and try to apportion the PGs accordingly. The
system is relatively conservative with its approach, only making
changes to a pool when the current number of PGs (``pg_num``) is more
than 3 times off from what it thinks it should be.

The target number of PGs per OSD is based on the
``mon_target_pg_per_osd`` configurable (default: 100), which can be
adjusted with::

  ceph config set global mon_target_pg_per_osd 100

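As a rough illustration of the arithmetic involved (the autoscaler's
exact heuristics are more involved): with 10 OSDs and the default
target of 100 PGs per OSD, the cluster has a budget of about 1000 PG
placements. Each PG in a 3-replica pool occupies three of those
placements, so a single such pool would be steered towards roughly
1000 / 3 = ~333 PGs, rounded to a nearby power of two, with the budget
split between pools in proportion to the data they store.
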
The autoscaler analyzes pools and adjusts on a per-subtree basis.
Because each pool may map to a different CRUSH rule, and each rule may
distribute data across different devices, Ceph will consider
utilization of each subtree of the hierarchy independently. For
example, a pool that maps to OSDs of class `ssd` and a pool that maps
to OSDs of class `hdd` will each have optimal PG counts that depend on
the number of those respective device types.


.. _specifying_pool_target_size:

Specifying expected pool size
-----------------------------

When a cluster or pool is first created, it will consume a small
fraction of the total cluster capacity and will appear to the system
as if it should only need a small number of placement groups.
However, in most cases cluster administrators have a good idea which
pools are expected to consume most of the system capacity over time.
By providing this information to Ceph, a more appropriate number of
PGs can be used from the beginning, preventing subsequent changes in
``pg_num`` and the overhead associated with moving data around when
those adjustments are made.

The *target size* of a pool can be specified in two ways: either in
terms of the absolute size of the pool (i.e., bytes), or as a weight
relative to other pools with a ``target_size_ratio`` set.

For example::

  ceph osd pool set mypool target_size_bytes 100T

will tell the system that `mypool` is expected to consume 100 TiB of
space. Alternatively::

  ceph osd pool set mypool target_size_ratio 1.0

will tell the system that `mypool` is expected to consume 1.0 relative
to the other pools with ``target_size_ratio`` set. If `mypool` is the
only pool in the cluster, this means an expected use of 100% of the
total capacity. If there is a second pool with ``target_size_ratio``
1.0, both pools would expect to use 50% of the cluster capacity.

You can also set the target size of a pool at creation time with the optional ``--target-size-bytes <bytes>`` or ``--target-size-ratio <ratio>`` arguments to the ``ceph osd pool create`` command.

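For example, to create a pool (named ``newpool`` here purely for
illustration) that is expected to consume a quarter of the total
capacity::

  ceph osd pool create newpool --target-size-ratio 0.25
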
Note that if impossible target size values are specified (for example,
a capacity larger than the total cluster) then a health warning
(``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised.

If both ``target_size_ratio`` and ``target_size_bytes`` are specified
for a pool, only the ratio will be considered, and a health warning
(``POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO``) will be issued.

Specifying bounds on a pool's PGs
---------------------------------

It is also possible to specify a minimum number of PGs for a pool.
This is useful for establishing a lower bound on the amount of
parallelism clients will see when doing IO, even when a pool is mostly
empty. Setting the lower bound prevents Ceph from reducing (or
recommending you reduce) the PG number below the configured number.

You can set the minimum number of PGs for a pool with::

  ceph osd pool set <pool-name> pg_num_min <num>

You can also specify the minimum PG count at pool creation time with
the optional ``--pg-num-min <num>`` argument to the ``ceph osd pool
create`` command.

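For example, to keep pool ``foo`` from shrinking below 64 PGs (a value
chosen here purely for illustration)::

  ceph osd pool set foo pg_num_min 64
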
.. _preselection:

A preselection of pg_num
========================

When creating a new pool with::

  ceph osd pool create {pool-name} [pg_num]

it is optional to choose the value of ``pg_num``. If you do not
specify ``pg_num``, the cluster can (by default) automatically tune it
for you based on how much data is stored in the pool (see above, :ref:`pg-autoscaler`).

Alternatively, ``pg_num`` can be explicitly provided. However,
whether you specify a ``pg_num`` value or not does not affect whether
the value is automatically tuned by the cluster after the fact. To
enable or disable auto-tuning::

  ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn)

The "rule of thumb" for PGs per OSD has traditionally been 100. With
the addition of the balancer (which is also enabled by default), a
value of more like 50 PGs per OSD is probably reasonable. The
challenge (which the autoscaler normally does for you) is to:

- have the PGs per pool proportional to the data in the pool, and
- end up with 50-100 PGs per OSD, after the replication or
  erasure-coding fan-out of each PG across OSDs is taken into
  consideration.

How are Placement Groups used?
==============================

A placement group (PG) aggregates objects within a pool because
tracking object placement and object metadata on a per-object basis is
computationally expensive--i.e., a system with millions of objects
cannot realistically track placement on a per-object basis.

.. ditaa::

    /-----\  /-----\  /-----\  /-----\  /-----\
    | obj |  | obj |  | obj |  | obj |  | obj |
    \-----/  \-----/  \-----/  \-----/  \-----/
       |        |        |        |        |
       +--------+--------+        +---+----+
                |                     |
                v                     v
    +-----------------------+  +-----------------------+
    |  Placement Group #1   |  |  Placement Group #2   |
    |                       |  |                       |
    +-----------------------+  +-----------------------+
                |                          |
                +-------------+------------+
                              |
                              v
                  +-----------------------+
                  |         Pool          |
                  |                       |
                  +-----------------------+

The Ceph client will calculate which placement group an object should
be in. It does this by hashing the object ID and applying an operation
based on the number of PGs in the defined pool and the ID of the pool.
See `Mapping PGs to OSDs`_ for details.

The object's contents within a placement group are stored in a set of
OSDs. For instance, in a replicated pool of size two, each placement
group will store objects on two OSDs, as shown below.

.. ditaa::

    +-----------------------+  +-----------------------+
    |  Placement Group #1   |  |  Placement Group #2   |
    |                       |  |                       |
    +-----------------------+  +-----------------------+
         |             |             |             |
         v             v             v             v
    /----------\  /----------\  /----------\  /----------\
    |          |  |          |  |          |  |          |
    |  OSD #1  |  |  OSD #2  |  |  OSD #2  |  |  OSD #3  |
    |          |  |          |  |          |  |          |
    \----------/  \----------/  \----------/  \----------/


Should OSD #2 fail, another will be assigned to Placement Group #1 and
will be filled with copies of all objects in OSD #1. If the pool size
is changed from two to three, an additional OSD will be assigned to
the placement group and will receive copies of all objects in the
placement group.

Placement groups do not own the OSD; they share it with other
placement groups from the same pool or even other pools. If OSD #2
fails, Placement Group #2 will also have to restore copies of
objects, using OSD #3.

When the number of placement groups increases, the new placement
groups will be assigned OSDs. The result of the CRUSH function will
also change and some objects from the former placement groups will be
copied over to the new Placement Groups and removed from the old ones.

Placement Groups Tradeoffs
==========================

Data durability and even distribution among all OSDs call for more
placement groups, but their number should be kept to the minimum
necessary to save CPU and memory.

.. _data durability:

Data durability
---------------

After an OSD fails, the risk of data loss increases until the data it
contained is fully recovered. Let's imagine a scenario that causes
permanent data loss in a single placement group:

- The OSD fails and all copies of the object it contains are lost.
  For all objects within the placement group the number of replicas
  suddenly drops from three to two.

- Ceph starts recovery for this placement group by choosing a new OSD
  to re-create the third copy of all objects.

- Another OSD, within the same placement group, fails before the new
  OSD is fully populated with the third copy. Some objects will then
  only have one surviving copy.

- Ceph picks yet another OSD and keeps copying objects to restore the
  desired number of copies.

- A third OSD, within the same placement group, fails before recovery
  is complete. If this OSD contained the only remaining copy of an
  object, it is permanently lost.

In a cluster containing 10 OSDs with 512 placement groups in a three
replica pool, CRUSH will give each placement group three OSDs. In the
end, each OSD will end up hosting (512 * 3) / 10 = ~150 placement
groups. When the first OSD fails, the above scenario will therefore
start recovery for all 150 placement groups at the same time.

The 150 placement groups being recovered are likely to be
homogeneously spread over the 9 remaining OSDs. Each remaining OSD is
therefore likely to send copies of objects to all others and also
receive some new objects to be stored because they became part of a
new placement group.

The amount of time it takes for this recovery to complete entirely
depends on the architecture of the Ceph cluster. Let's say each OSD is
hosted by a 1TB SSD on a single machine and all of them are connected
to a 10Gb/s switch and the recovery for a single OSD completes within
M minutes. If there are two OSDs per machine using spinners with no
SSD journal and a 1Gb/s switch, it will at least be an order of
magnitude slower.

In a cluster of this size, the number of placement groups has almost
no influence on data durability. It could be 128 or 8192 and the
recovery would not be slower or faster.

However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs
is likely to speed up recovery and therefore improve data durability
significantly. Each OSD now participates in only ~75 placement groups
instead of ~150 when there were only 10 OSDs and it will still require
all 19 remaining OSDs to perform the same amount of object copies in
order to recover. But where 10 OSDs had to copy approximately 100GB
each, they now have to copy 50GB each instead. If the network was the
bottleneck, recovery will happen twice as fast. In other words,
recovery goes faster when the number of OSDs increases.

If this cluster grows to 40 OSDs, each of them will only host ~35
placement groups. If an OSD dies, recovery will keep going faster
unless it is blocked by another bottleneck. However, if this cluster
grows to 200 OSDs, each of them will only host ~7 placement groups. If
an OSD dies, recovery will involve at most ~21 (7 * 3) OSDs in these
placement groups: recovery will take longer than when there were 40
OSDs, meaning the number of placement groups should be increased.

No matter how short the recovery time is, there is a chance for a
second OSD to fail while it is in progress. In the 10 OSD cluster
described above, if any of them fails, then ~17 placement groups
(i.e. ~150 / 9 placement groups being recovered) will only have one
surviving copy. And if any of the 8 remaining OSDs fails, the last
objects of two placement groups are likely to be lost (i.e. ~17 / 8
placement groups with only one remaining copy being recovered).

When the size of the cluster grows to 20 OSDs, the number of Placement
Groups damaged by the loss of three OSDs drops. The second OSD lost
will degrade ~4 (i.e. ~75 / 19 placement groups being recovered)
instead of ~17 and the third OSD lost will only lose data if it is one
of the four OSDs containing the surviving copy. In other words, if the
probability of losing one OSD is 0.0001% during the recovery time
frame, it goes from 17 * 10 * 0.0001% in the cluster with 10 OSDs to
4 * 20 * 0.0001% in the cluster with 20 OSDs.

In a nutshell, more OSDs mean faster recovery and a lower risk of
cascading failures leading to the permanent loss of a Placement
Group. Having 512 or 4096 Placement Groups is roughly equivalent in a
cluster with fewer than 50 OSDs as far as data durability is concerned.

Note: It may take a long time for a new OSD added to the cluster to be
populated with placement groups that were assigned to it. However,
there is no degradation of any object and it has no impact on the
durability of the data contained in the cluster.

.. _object distribution:

Object distribution within a pool
---------------------------------

Ideally objects are evenly distributed in each placement group. Since
CRUSH computes the placement group for each object, but does not
actually know how much data is stored in each OSD within this
placement group, the ratio between the number of placement groups and
the number of OSDs may influence the distribution of the data
significantly.

For instance, if there was a single placement group for ten OSDs in a
three replica pool, only three OSDs would be used because CRUSH would
have no other choice. When more placement groups are available,
objects are more likely to be evenly spread among them. CRUSH also
makes every effort to evenly spread OSDs among all existing Placement
Groups.

As long as there are one or two orders of magnitude more Placement
Groups than OSDs, the distribution should be even. For instance, 256
placement groups for 3 OSDs, 512 or 1024 placement groups for 10 OSDs,
etc.

Uneven data distribution can be caused by factors other than the ratio
between OSDs and placement groups. Since CRUSH does not take into
account the size of the objects, a few very large objects may create
an imbalance. Let's say one million 4K objects totaling 4GB are evenly
spread among 1024 placement groups on 10 OSDs. They will use 4GB / 10
= 400MB on each OSD. If one 400MB object is added to the pool, the
three OSDs supporting the placement group in which the object has been
placed will be filled with 400MB + 400MB = 800MB while the seven
others will remain occupied with only 400MB.

.. _resource usage:

Memory, CPU and network usage
-----------------------------

For each placement group, OSDs and MONs need memory, network and CPU
at all times and even more during recovery. Sharing this overhead by
clustering objects within a placement group is one of the main reasons
they exist.

Minimizing the number of placement groups saves significant amounts of
resources.

.. _choosing-number-of-placement-groups:

Choosing the number of Placement Groups
=======================================

.. note:: It is rarely necessary to do this math by hand. Instead, use the ``ceph osd pool autoscale-status`` command in combination with the ``target_size_bytes`` or ``target_size_ratio`` pool properties. See :ref:`pg-autoscaler` for more information.

If you have more than 50 OSDs, we recommend approximately 50-100
placement groups per OSD to balance out resource usage, data
durability and distribution. If you have fewer than 50 OSDs, choosing
among the `preselection`_ above is best. For a single pool of objects,
you can use the following formula to get a baseline

  Total PGs = :math:`\frac{OSDs \times 100}{pool \: size}`

Where **pool size** is either the number of replicas for replicated
pools or the K+M sum for erasure coded pools (as returned by **ceph
osd erasure-code-profile get**).

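For example, to check K and M for the ``default`` erasure code profile
(the exact fields printed vary by release)::

  ceph osd erasure-code-profile get default
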
You should then check if the result makes sense with the way you
designed your Ceph cluster to maximize `data durability`_,
`object distribution`_ and minimize `resource usage`_.

The result should always be **rounded up to the nearest power of two**.

Only a power of two will evenly balance the number of objects among
placement groups. Other values will result in an uneven distribution of
data across your OSDs. Their use should be limited to incrementally
stepping from one power of two to another.

As an example, for a cluster with 200 OSDs and a pool size of 3
replicas, you would estimate your number of PGs as follows

  :math:`\frac{200 \times 100}{3} = 6667`. Nearest power of 2: 8192

When using multiple data pools for storing objects, you need to ensure
that you balance the number of placement groups per pool with the
number of placement groups per OSD so that you arrive at a reasonable
total number of placement groups that provides reasonably low variance
per OSD without taxing system resources or making the peering process
too slow.

For instance, a cluster of 10 pools each with 512 placement groups on
ten OSDs is a total of 5,120 placement groups spread over ten OSDs,
that is 512 placement groups per OSD. That does not use too many
resources. However, if 1,000 pools were created with 512 placement
groups each, the OSDs will handle ~50,000 placement groups each and it
would require significantly more resources and time for peering.

You may find the `PGCalc`_ tool helpful.


.. _setting the number of placement groups:

Set the Number of Placement Groups
==================================

To set the number of placement groups in a pool, you must specify the
number of placement groups at the time you create the pool.
See `Create a Pool`_ for details. You can also change the number of
placement groups after the pool has been created with::

  ceph osd pool set {pool-name} pg_num {pg_num}

After you increase the number of placement groups, you must also
increase the number of placement groups for placement (``pgp_num``)
before your cluster will rebalance. The ``pgp_num`` will be the number of
placement groups that will be considered for placement by the CRUSH
algorithm. Increasing ``pg_num`` splits the placement groups but data
will not be migrated to the newer placement groups until the number of
placement groups for placement, i.e. ``pgp_num``, is increased. The
``pgp_num`` should be equal to the ``pg_num``. To increase the number of
placement groups for placement, execute the following::

  ceph osd pool set {pool-name} pgp_num {pgp_num}

When decreasing the number of PGs, ``pgp_num`` is adjusted
automatically for you.

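For example, to split pool ``foo`` into 128 placement groups and then
let CRUSH actually place data into them::

  ceph osd pool set foo pg_num 128
  ceph osd pool set foo pgp_num 128
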
Get the Number of Placement Groups
==================================

To get the number of placement groups in a pool, execute the following::

  ceph osd pool get {pool-name} pg_num


Get a Cluster's PG Statistics
=============================

To get the statistics for the placement groups in your cluster, execute the following::

  ceph pg dump [--format {format}]

Valid formats are ``plain`` (default) and ``json``.


Get Statistics for Stuck PGs
============================

To get the statistics for all placement groups stuck in a specified state,
execute the following::

  ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>]

**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD
with the most up-to-date data to come up and in.

**Unclean** Placement groups contain objects that are not replicated the desired number
of times. They should be recovering.

**Stale** Placement groups are in an unknown state - the OSDs that host them have not
reported to the monitor cluster in a while (configured by ``mon_osd_report_timeout``).

Valid formats are ``plain`` (default) and ``json``. The threshold defines the minimum number
of seconds the placement group is stuck before including it in the returned statistics
(default 300 seconds).

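For example, to list placement groups that have been ``stale`` for at
least ten minutes, formatted as JSON::

  ceph pg dump_stuck stale --format json --threshold 600
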
Get a PG Map
============

To get the placement group map for a particular placement group, execute the following::

  ceph pg map {pg-id}

For example::

  ceph pg map 1.6c

Ceph will return the placement group map, the placement group, and the OSD status::

  osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]


Get a PG's Statistics
=====================

To retrieve statistics for a particular placement group, execute the following::

  ceph pg {pg-id} query

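For example, using the placement group from the PG map example above::

  ceph pg 1.6c query
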
Scrub a Placement Group
=======================

To scrub a placement group, execute the following::

  ceph pg scrub {pg-id}

Ceph checks the primary and any replica nodes, generates a catalog of all objects
in the placement group and compares them to ensure that no objects are missing
or mismatched, and their contents are consistent. Assuming the replicas all
match, a final semantic sweep ensures that all of the snapshot-related object
metadata is consistent. Errors are reported via logs.

To scrub all placement groups from a specific pool, execute the following::

  ceph osd pool scrub {pool-name}

Prioritize backfill/recovery of a Placement Group(s)
====================================================

You may run into a situation where a bunch of placement groups will require
recovery and/or backfill, and some particular groups hold data more important
than others (for example, those PGs may hold data for images used by running
machines and other PGs may be used by inactive machines/less relevant data).
In that case, you may want to prioritize recovery of those groups so
performance and/or availability of data stored on those groups is restored
earlier. To do this (mark particular placement group(s) as prioritized during
backfill or recovery), execute the following::

  ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
  ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

This will cause Ceph to perform recovery or backfill on the specified placement
groups first, before other placement groups. This does not interrupt currently
ongoing backfills or recovery, but causes the specified PGs to be processed
as soon as possible. If you change your mind or prioritize the wrong groups,
use::

  ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
  ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

This will remove the "force" flag from those PGs and they will be processed
in default order. Again, this doesn't affect currently processed placement
groups, only those that are still queued.

The "force" flag is cleared automatically after recovery or backfill of the
group is done.

Similarly, you may use the following commands to force Ceph to perform recovery
or backfill on all placement groups from a specified pool first::

  ceph osd pool force-recovery {pool-name}
  ceph osd pool force-backfill {pool-name}

or::

  ceph osd pool cancel-force-recovery {pool-name}
  ceph osd pool cancel-force-backfill {pool-name}

to restore the default recovery or backfill priority if you change your mind.

Note that these commands could possibly break the ordering of Ceph's internal
priority computations, so use them with caution!
In particular, if you have multiple pools that are currently sharing the same
underlying OSDs, and some particular pools hold data more important than others,
we recommend you use the following command to re-arrange all pools'
recovery/backfill priority in a better order::

  ceph osd pool set {pool-name} recovery_priority {value}

For example, if you have 10 pools you could make the most important one priority 10,
next 9, etc. Or you could leave most pools alone and have say 3 important pools
all priority 1 or priorities 3, 2, 1 respectively.

Revert Lost
===========

If the cluster has lost one or more objects, and you have decided to
abandon the search for the lost data, you must mark the unfound objects
as ``lost``.

If all possible locations have been queried and objects are still
lost, you may have to give up on the lost objects. This is
possible given unusual combinations of failures that allow the cluster
to learn about writes that were performed before the writes themselves
are recovered.

Currently the only supported option is "revert", which will either roll back to
a previous version of the object or (if it was a new object) forget about it
entirely. To mark the "unfound" objects as "lost", execute the following::

  ceph pg {pg-id} mark_unfound_lost revert|delete

.. important:: Use this feature with caution, because it may confuse
   applications that expect the object(s) to exist.


.. toctree::
   :hidden:

   pg-states
   pg-concepts


.. _Create a Pool: ../pools#createpool
.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
.. _PGCalc: http://ceph.com/pgcalc/