==================
 Placement Groups
==================

.. _pg-autoscaler:

Autoscaling placement groups
============================

Placement groups (PGs) are an internal implementation detail of how
Ceph distributes data.  You can allow the cluster to either make
recommendations or automatically tune PGs based on how the cluster is
used by enabling *pg-autoscaling*.

Each pool in the system has a ``pg_autoscale_mode`` property that can be set to ``off``, ``on``, or ``warn``.

* ``off``: Disable autoscaling for this pool.  It is up to the administrator to choose an appropriate PG number for each pool.  Please refer to :ref:`choosing-number-of-placement-groups` for more information.
* ``on``: Enable automated adjustments of the PG count for the given pool.
* ``warn``: Raise health alerts when the PG count should be adjusted.

To set the autoscaling mode for an existing pool::

  ceph osd pool set <pool-name> pg_autoscale_mode <mode>

For example, to enable autoscaling on pool ``foo``::

  ceph osd pool set foo pg_autoscale_mode on

You can also configure the default ``pg_autoscale_mode`` that is
applied to any pools that are created in the future with::

  ceph config set global osd_pool_default_pg_autoscale_mode <mode>

Viewing PG scaling recommendations
----------------------------------

You can view each pool, its relative utilization, and any suggested changes to
the PG count with this command::

  ceph osd pool autoscale-status

Output will be something like::

  POOL    SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
  a     12900M                3.0        82431M  0.4695                     8         128  warn
  c          0                3.0        82431M  0.0000        0.2000       1          64  warn
  b          0       953.6M   3.0        82431M  0.0347                     8              warn

**SIZE** is the amount of data stored in the pool. **TARGET SIZE**, if
present, is the amount of data the administrator has specified that
they expect to eventually be stored in this pool.  The system uses
the larger of the two values for its calculation.

**RATE** is the multiplier for the pool that determines how much raw
storage capacity is consumed.  For example, a 3 replica pool will
have a rate of 3.0, while a k=4,m=2 erasure coded pool will have a
rate of 1.5.

**RAW CAPACITY** is the total amount of raw storage capacity on the
OSDs that are responsible for storing this pool's (and perhaps other
pools') data.  **RATIO** is the ratio of that total capacity that
this pool is consuming (i.e., ratio = size * rate / raw capacity).

**TARGET RATIO**, if present, is the ratio of storage that the
administrator has specified that they expect this pool to consume.
The system uses the larger of the actual ratio and the target ratio
for its calculation.  If both target size bytes and ratio are specified, the
ratio takes precedence.
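
The column arithmetic described above can be checked against the sample
output. The following is a minimal sketch, not the autoscaler's own code:
``effective_ratio`` is a hypothetical helper, and it assumes that sizes and
raw capacity are expressed in the same unit.

.. code-block:: python

   def effective_ratio(size, rate, raw_capacity, target_size=0.0, target_ratio=None):
       """Capacity ratio a pool is treated as consuming (hypothetical helper)."""
       actual_ratio = size * rate / raw_capacity
       if target_ratio is not None:
           # A target ratio takes precedence over a target size.
           return max(actual_ratio, target_ratio)
       # Otherwise use the larger of SIZE and TARGET SIZE.
       return max(size, target_size) * rate / raw_capacity


   # Rows from the sample output, with sizes in the table's "M" units.
   pool_a = effective_ratio(size=12900, rate=3.0, raw_capacity=82431)
   pool_b = effective_ratio(size=0, rate=3.0, raw_capacity=82431, target_size=953.6)
   pool_c = effective_ratio(size=0, rate=3.0, raw_capacity=82431, target_ratio=0.2)
   print(f"{pool_a:.4f} {pool_b:.4f} {pool_c:.4f}")  # 0.4695 0.0347 0.2000

For pools ``a`` and ``b`` the result matches the **RATIO** column; for pool
``c`` the target ratio of 0.2 is used because it is larger than the actual
ratio of 0.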

**PG_NUM** is the current number of PGs for the pool (or the current
number of PGs that the pool is working towards, if a ``pg_num``
change is in progress).  **NEW PG_NUM**, if present, is what the
system believes the pool's ``pg_num`` should be changed to.  It is
always a power of 2, and will only be present if the "ideal" value
varies from the current value by more than a factor of 3.

The final column, **AUTOSCALE**, is the pool ``pg_autoscale_mode``,
and will be either ``on``, ``off``, or ``warn``.


Automated scaling
-----------------

Allowing the cluster to automatically scale PGs based on usage is the
simplest approach.  Ceph will look at the total available storage and
target number of PGs for the whole system, look at how much data is
stored in each pool, and try to apportion the PGs accordingly.  The
system is relatively conservative with its approach, only making
changes to a pool when the current number of PGs (``pg_num``) is more
than 3 times off from what it thinks it should be.
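
The "factor of 3" rule can be sketched as follows. This only illustrates the
rule as stated in the text; ``suggested_pg_num`` is a hypothetical name, and
the caller is assumed to supply an ideal value that is already a power of two.

.. code-block:: python

   def suggested_pg_num(current_pg_num, ideal_pg_num, threshold=3.0):
       """Return ideal_pg_num when it is more than `threshold` times off, else None."""
       too_few = ideal_pg_num > current_pg_num * threshold
       too_many = current_pg_num > ideal_pg_num * threshold
       return ideal_pg_num if (too_few or too_many) else None


   print(suggested_pg_num(8, 128))  # 128: sixteen times too few PGs
   print(suggested_pg_num(8, 16))   # None: within a factor of 3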

The target number of PGs per OSD is based on the
``mon_target_pg_per_osd`` configurable (default: 100), which can be
adjusted with::

  ceph config set global mon_target_pg_per_osd 100

The autoscaler analyzes pools and adjusts on a per-subtree basis.
Because each pool may map to a different CRUSH rule, and each rule may
distribute data across different devices, Ceph will consider
utilization of each subtree of the hierarchy independently.  For
example, a pool that maps to OSDs of class ``ssd`` and a pool that maps
to OSDs of class ``hdd`` will each have optimal PG counts that depend on
the number of those respective device types.


.. _specifying_pool_target_size:

Specifying expected pool size
-----------------------------

When a cluster or pool is first created, it will consume a small
fraction of the total cluster capacity and will appear to the system
as if it should only need a small number of placement groups.
However, in most cases cluster administrators have a good idea which
pools are expected to consume most of the system capacity over time.
By providing this information to Ceph, a more appropriate number of
PGs can be used from the beginning, preventing subsequent changes in
``pg_num`` and the overhead associated with moving data around when
those adjustments are made.

The *target size* of a pool can be specified in two ways: either in
terms of the absolute size of the pool (i.e., bytes), or as a ratio of
the total cluster capacity.

For example::

  ceph osd pool set mypool target_size_bytes 100T

will tell the system that ``mypool`` is expected to consume 100 TiB of
space.  Alternatively::

  ceph osd pool set mypool target_size_ratio .9

will tell the system that ``mypool`` is expected to consume 90% of the
total cluster capacity.

You can also set the target size of a pool at creation time with the optional ``--target-size-bytes <bytes>`` or ``--target-size-ratio <ratio>`` arguments to the ``ceph osd pool create`` command.

Note that if impossible target size values are specified (for example,
a capacity larger than the total cluster, or ratio(s) that sum to more
than 1.0) then a health warning
(``POOL_TARGET_SIZE_RATIO_OVERCOMMITTED`` or
``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised.
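
The two "impossible" cases can be illustrated with a small check. This is a
simplified sketch under stated assumptions: ``overcommit_warnings`` and the
dictionary layout are hypothetical, and Ceph's actual check may account for
additional factors (for example replication overhead or CRUSH subtrees).

.. code-block:: python

   def overcommit_warnings(pools, raw_capacity_bytes):
       """Return the health warnings implied by the pools' target sizes.

       `pools` is a list of dicts with optional 'target_size_ratio' and
       'target_size_bytes' keys (a layout assumed for this example).
       """
       warnings = []
       if sum(p.get("target_size_ratio", 0.0) for p in pools) > 1.0:
           warnings.append("POOL_TARGET_SIZE_RATIO_OVERCOMMITTED")
       if sum(p.get("target_size_bytes", 0) for p in pools) > raw_capacity_bytes:
           warnings.append("POOL_TARGET_SIZE_BYTES_OVERCOMMITTED")
       return warnings


   TIB = 2 ** 40
   pools = [{"target_size_ratio": 0.9}, {"target_size_ratio": 0.2}]
   print(overcommit_warnings(pools, raw_capacity_bytes=50 * TIB))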

Specifying bounds on a pool's PGs
---------------------------------

It is also possible to specify a minimum number of PGs for a pool.
This is useful for establishing a lower bound on the amount of
parallelism clients will see when doing IO, even when a pool is mostly
empty.  Setting the lower bound prevents Ceph from reducing (or
recommending you reduce) the PG number below the configured number.

You can set the minimum number of PGs for a pool with::

  ceph osd pool set <pool-name> pg_num_min <num>

You can also specify the minimum PG count at pool creation time with
the optional ``--pg-num-min <num>`` argument to the
``ceph osd pool create`` command.

.. _preselection:

A preselection of pg_num
========================

When creating a new pool with::

  ceph osd pool create {pool-name} pg_num

it is mandatory to choose the value of ``pg_num`` because it cannot (currently) be
calculated automatically. Here are a few values commonly used:

- Less than 5 OSDs: set ``pg_num`` to 128

- Between 5 and 10 OSDs: set ``pg_num`` to 512

- Between 10 and 50 OSDs: set ``pg_num`` to 1024

- If you have more than 50 OSDs, you need to understand the tradeoffs
  and how to calculate the ``pg_num`` value by yourself

- To calculate the ``pg_num`` value yourself, use the `pgcalc`_ tool
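
The list above can be expressed as a small lookup. This is only a sketch:
``preselected_pg_num`` is a hypothetical helper, and because the ranges in the
list overlap at 5, 10 and 50 OSDs, the choice of which bucket those boundary
values fall into is an assumption.

.. code-block:: python

   def preselected_pg_num(osd_count):
       """Return the commonly used pg_num for a small cluster, or None."""
       if osd_count < 5:
           return 128
       if osd_count <= 10:   # assumption: 5 and 10 fall in the middle bucket
           return 512
       if osd_count <= 50:   # assumption: 50 falls in the last fixed bucket
           return 1024
       return None           # more than 50 OSDs: calculate the value yourself


   for osds in (3, 8, 30, 120):
       print(osds, preselected_pg_num(osds))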

As the number of OSDs increases, choosing the right value for ``pg_num``
becomes more important because it has a significant influence on the
behavior of the cluster as well as the durability of the data when
something goes wrong (i.e. the probability that a catastrophic event
leads to data loss).

How are Placement Groups used?
==============================

A placement group (PG) aggregates objects within a pool because
tracking object placement and object metadata on a per-object basis is
computationally expensive: a system with millions of objects
cannot realistically track placement on a per-object basis.

.. ditaa::
           /-----\  /-----\  /-----\  /-----\  /-----\
           | obj |  | obj |  | obj |  | obj |  | obj |
           \-----/  \-----/  \-----/  \-----/  \-----/
              |        |        |        |        |
              +--------+--------+        +---+----+
              |                              |
              v                              v
   +-----------------------+      +-----------------------+
   |  Placement Group #1   |      |  Placement Group #2   |
   |                       |      |                       |
   +-----------------------+      +-----------------------+
               |                              |
               +------------------------------+
                              |
                              v
                  +-----------------------+
                  |        Pool           |
                  |                       |
                  +-----------------------+

The Ceph client will calculate which placement group an object should
be in. It does this by hashing the object ID and applying an operation
based on the number of PGs in the defined pool and the ID of the pool.
See `Mapping PGs to OSDs`_ for details.
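
The idea of "hash the object ID, then reduce it by the number of PGs" can be
sketched as below. This is an illustration only and will not reproduce Ceph's
real PG IDs: ``zlib.crc32`` and a plain modulo are stand-ins chosen for this
example, not the hash or the operation that Ceph uses, and ``object_to_pg``
is a hypothetical name. The ``<pool-id>.<hex>`` output form follows the
``1.6c`` PG ID shown later in this document.

.. code-block:: python

   import zlib


   def object_to_pg(pool_id, object_id, pg_num):
       """Map an object name to a PG ID string (illustrative stand-in only)."""
       # Stand-in hash: NOT the hash function Ceph actually uses.
       object_hash = zlib.crc32(object_id.encode("utf-8"))
       # Stand-in reduction: a plain modulo over the pool's PG count.
       pg_index = object_hash % pg_num
       return f"{pool_id}.{pg_index:x}"


   print(object_to_pg(1, "my-object", 128))

The same object name always maps to the same PG for a given ``pg_num``, which
is why a client can locate an object without a central lookup table.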

The object's contents within a placement group are stored in a set of
OSDs. For instance, in a replicated pool of size two, each placement
group will store objects on two OSDs, as shown below.

.. ditaa::

   +-----------------------+      +-----------------------+
   |  Placement Group #1   |      |  Placement Group #2   |
   |                       |      |                       |
   +-----------------------+      +-----------------------+
        |             |               |             |
        v             v               v             v
   /----------\  /----------\    /----------\  /----------\
   |          |  |          |    |          |  |          |
   |  OSD #1  |  |  OSD #2  |    |  OSD #2  |  |  OSD #3  |
   |          |  |          |    |          |  |          |
   \----------/  \----------/    \----------/  \----------/


Should OSD #2 fail, another will be assigned to Placement Group #1 and
will be filled with copies of all objects in OSD #1. If the pool size
is changed from two to three, an additional OSD will be assigned to
the placement group and will receive copies of all objects in the
placement group.

Placement groups do not own the OSD; they share it with other
placement groups from the same pool or even other pools. If OSD #2
fails, Placement Group #2 will also have to restore copies of
objects, using OSD #3.

When the number of placement groups increases, the new placement
groups will be assigned OSDs. The result of the CRUSH function will
also change and some objects from the former placement groups will be
copied over to the new Placement Groups and removed from the old ones.

Placement Groups Tradeoffs
==========================

Data durability and even distribution among all OSDs call for more
placement groups but their number should be reduced to the minimum to
save CPU and memory.

.. _data durability:

Data durability
---------------

After an OSD fails, the risk of data loss increases until the data it
contained is fully recovered. Let's imagine a scenario that causes
permanent data loss in a single placement group:

- The OSD fails and all copies of the objects it contains are lost.
  For all objects within the placement group the number of replicas
  suddenly drops from three to two.

- Ceph starts recovery for this placement group by choosing a new OSD
  to re-create the third copy of all objects.

- Another OSD, within the same placement group, fails before the new
  OSD is fully populated with the third copy. Some objects will then
  only have one surviving copy.

- Ceph picks yet another OSD and keeps copying objects to restore the
  desired number of copies.

- A third OSD, within the same placement group, fails before recovery
  is complete. If this OSD contained the only remaining copy of an
  object, it is permanently lost.

In a cluster containing 10 OSDs with 512 placement groups in a three
replica pool, CRUSH will give each placement group three OSDs. In the
end, each OSD will end up hosting (512 * 3) / 10 = ~150 Placement
Groups. When the first OSD fails, the above scenario will therefore
start recovery for all 150 placement groups at the same time.

The 150 placement groups being recovered are likely to be
homogeneously spread over the 9 remaining OSDs. Each remaining OSD is
therefore likely to send copies of objects to all others and also
receive some new objects to be stored because they became part of a
new placement group.

The amount of time it takes for this recovery to complete entirely
depends on the architecture of the Ceph cluster. Let's say each OSD is
hosted by a 1TB SSD on a single machine, all of them are connected
to a 10Gb/s switch, and the recovery for a single OSD completes within
M minutes. If there are two OSDs per machine using spinners with no
SSD journal and a 1Gb/s switch, it will be at least an order of
magnitude slower.

In a cluster of this size, the number of placement groups has almost
no influence on data durability. It could be 128 or 8192 and the
recovery would not be slower or faster.

However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs
is likely to speed up recovery and therefore improve data durability
significantly. Each OSD now participates in only ~75 placement groups
instead of ~150 when there were only 10 OSDs and it will still require
all 19 remaining OSDs to perform the same amount of object copies in
order to recover. But where 10 OSDs had to copy approximately 100GB
each, they now have to copy 50GB each instead. If the network was the
bottleneck, recovery will happen twice as fast. In other words,
recovery goes faster when the number of OSDs increases.

If this cluster grows to 40 OSDs, each of them will only host ~35
placement groups. If an OSD dies, recovery will keep going faster
unless it is blocked by another bottleneck. However, if this cluster
grows to 200 OSDs, each of them will only host ~7 placement groups. If
an OSD dies, recovery will happen among at most ~21 (7 * 3) OSDs
hosting these placement groups: recovery will take longer than when there
were 40 OSDs, meaning the number of placement groups should be
increased.
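
The per-OSD figures used in this walkthrough can be reproduced with a few
lines of arithmetic. This is a sketch with hypothetical helper names; the
exact quotients differ slightly from the rounded values quoted in the text
(for example 153.6 rather than ~150), and the upper bound on recovery peers
simply follows the text's "7 * 3" reasoning.

.. code-block:: python

   def pgs_per_osd(pg_num, replicas, osd_count):
       """Average number of placement groups hosted by each OSD."""
       return pg_num * replicas / osd_count


   def recovery_peers_upper_bound(pg_num, replicas, osd_count):
       """At most this many OSDs take part in recovering one failed OSD."""
       pgs_on_failed_osd = int(pgs_per_osd(pg_num, replicas, osd_count))
       return min(osd_count - 1, pgs_on_failed_osd * replicas)


   for osd_count in (10, 20, 40, 200):
       print(osd_count,
             round(pgs_per_osd(512, 3, osd_count), 1),
             recovery_peers_upper_bound(512, 3, osd_count))

With 200 OSDs the bound drops to 21 even though 199 OSDs remain, which is the
point at which more placement groups would help recovery.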

No matter how short the recovery time is, there is a chance for a
second OSD to fail while it is in progress. In the 10-OSD cluster
described above, if any of them fail, then ~17 placement groups
(i.e. ~150 / 9 placement groups being recovered) will only have one
surviving copy. And if any of the 8 remaining OSDs fail, the last
objects of two placement groups are likely to be lost (i.e. ~17 / 8
placement groups with only one remaining copy being recovered).

When the size of the cluster grows to 20 OSDs, the number of Placement
Groups damaged by the loss of three OSDs drops. The second OSD lost
will degrade ~4 (i.e. ~75 / 19 placement groups being recovered)
instead of ~17 and the third OSD lost will only lose data if it is one
of the four OSDs containing the surviving copy. In other words, if the
probability of losing one OSD is 0.0001% during the recovery time
frame, the risk of losing data goes from 17 * 10 * 0.0001% in the
cluster with 10 OSDs to 4 * 20 * 0.0001% in the cluster with 20 OSDs.
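
The figures in the two paragraphs above follow from simple division and
multiplication. The sketch below uses hypothetical helper names and the
rounded inputs from the text (~150 and ~75 PGs per OSD); it reproduces the
text's estimate rather than a rigorous probability model.

.. code-block:: python

   def degraded_after_second_failure(pgs_per_osd, osd_count):
       """PGs left with a single copy when a second OSD fails during recovery."""
       return pgs_per_osd / (osd_count - 1)


   def risk_estimate_percent(pgs_at_risk, osd_count, p_osd_loss_percent):
       """The text's rough estimate: PGs at risk * OSDs * per-OSD loss probability."""
       return pgs_at_risk * osd_count * p_osd_loss_percent


   print(round(degraded_after_second_failure(150, 10)))  # 17
   print(round(degraded_after_second_failure(75, 20)))   # 4
   print(f"{risk_estimate_percent(17, 10, 0.0001):.3f}%")  # 0.017%
   print(f"{risk_estimate_percent(4, 20, 0.0001):.3f}%")   # 0.008%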

In a nutshell, more OSDs mean faster recovery and a lower risk of
cascading failures leading to the permanent loss of a Placement
Group. Having 512 or 4096 Placement Groups is roughly equivalent in a
cluster with fewer than 50 OSDs as far as data durability is concerned.

Note: It may take a long time for a new OSD added to the cluster to be
populated with placement groups that were assigned to it. However
there is no degradation of any object and it has no impact on the
durability of the data contained in the Cluster.

.. _object distribution:

Object distribution within a pool
---------------------------------

Ideally objects are evenly distributed in each placement group. Since
CRUSH computes the placement group for each object, but does not
actually know how much data is stored in each OSD within this
placement group, the ratio between the number of placement groups and
the number of OSDs may influence the distribution of the data
significantly.

For instance, if there was a single placement group for ten OSDs in a
three replica pool, only three OSDs would be used because CRUSH would
have no other choice. When more placement groups are available,
objects are more likely to be evenly spread among them. CRUSH also
makes every effort to evenly spread OSDs among all existing Placement
Groups.

As long as there are one or two orders of magnitude more Placement
Groups than OSDs, the distribution should be even. For instance, 256
placement groups for 3 OSDs, 512 or 1024 placement groups for 10 OSDs
etc.

Uneven data distribution can be caused by factors other than the ratio
between OSDs and placement groups. Since CRUSH does not take into
account the size of the objects, a few very large objects may create
an imbalance. Let's say one million 4K objects totaling 4GB are evenly
spread among 1024 placement groups on 10 OSDs. They will use 4GB / 10
= 400MB on each OSD. If one 400MB object is added to the pool, the
three OSDs supporting the placement group in which the object has been
placed will be filled with 400MB + 400MB = 800MB while the seven
others will remain occupied with only 400MB.

.. _resource usage:

Memory, CPU and network usage
-----------------------------

For each placement group, OSDs and MONs need memory, network and CPU
at all times and even more during recovery. Sharing this overhead by
clustering objects within a placement group is one of the main reasons
they exist.

Minimizing the number of placement groups saves significant amounts of
resources.

.. _choosing-number-of-placement-groups:

Choosing the number of Placement Groups
=======================================

.. note:: It is rarely necessary to do this math by hand.  Instead, use the ``ceph osd pool autoscale-status`` command in combination with the ``target_size_bytes`` or ``target_size_ratio`` pool properties.  See :ref:`pg-autoscaler` for more information.

If you have more than 50 OSDs, we recommend approximately 50-100
placement groups per OSD to balance out resource usage, data
durability and distribution. If you have fewer than 50 OSDs, choosing
among the `preselection`_ above is best. For a single pool of objects,
you can use the following formula to get a baseline::

               (OSDs * 100)
  Total PGs =  ------------
                pool size

Where **pool size** is either the number of replicas for replicated
pools or the K+M sum for erasure coded pools (as returned by **ceph
osd erasure-code-profile get**).

You should then check if the result makes sense with the way you
designed your Ceph cluster to maximize `data durability`_,
`object distribution`_ and minimize `resource usage`_.

The result should always be **rounded up to the nearest power of two**.

Only a power of two will evenly balance the number of objects among
placement groups. Other values will result in an uneven distribution of
data across your OSDs. Their use should be limited to incrementally
stepping from one power of two to another.

As an example, for a cluster with 200 OSDs and a pool size of 3
replicas, you would estimate your number of PGs as follows::

  (200 * 100)
  ----------- = 6667. Nearest power of 2: 8192
       3
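
The baseline formula and the rounding step can be written as a short
function. This is a minimal sketch with hypothetical function names; the
default of 100 PGs per OSD is the figure used in the formula above.

.. code-block:: python

   def round_up_to_power_of_two(value):
       """Smallest power of two that is greater than or equal to `value`."""
       power = 1
       while power < value:
           power *= 2
       return power


   def baseline_total_pgs(osd_count, pool_size, pgs_per_osd=100):
       """(OSDs * 100) / pool size, rounded up to a power of two."""
       return round_up_to_power_of_two(osd_count * pgs_per_osd / pool_size)


   print(baseline_total_pgs(200, 3))  # 8192

For an erasure coded pool, pass the K+M sum as ``pool_size``.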

When using multiple data pools for storing objects, you need to ensure
that you balance the number of placement groups per pool with the
number of placement groups per OSD so that you arrive at a reasonable
total number of placement groups that provides reasonably low variance
per OSD without taxing system resources or making the peering process
too slow.

For instance, a cluster of 10 pools each with 512 placement groups on
ten OSDs is a total of 5,120 placement groups spread over ten OSDs,
that is 512 placement groups per OSD. That does not use too many
resources. However, if 1,000 pools were created with 512 placement
groups each, the OSDs will handle ~50,000 placement groups each and it
would require significantly more resources and time for peering.
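
The same per-OSD count can be computed for any number of pools. Like the
paragraph above, this sketch counts each placement group once and ignores
replicas; ``pgs_per_osd_across_pools`` is a hypothetical helper.

.. code-block:: python

   def pgs_per_osd_across_pools(pool_count, pgs_per_pool, osd_count):
       """Placement groups per OSD when every pool has the same pg_num."""
       return pool_count * pgs_per_pool / osd_count


   print(pgs_per_osd_across_pools(10, 512, 10))    # 512.0
   print(pgs_per_osd_across_pools(1000, 512, 10))  # 51200.0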

You may find the `PGCalc`_ tool helpful.


.. _setting the number of placement groups:

Set the Number of Placement Groups
==================================

To set the number of placement groups in a pool, you must specify the
number of placement groups at the time you create the pool.
See `Create a Pool`_ for details.  After a pool is created, you can still
change the number of placement groups with::

  ceph osd pool set {pool-name} pg_num {pg_num}

After you increase the number of placement groups, you must also
increase the number of placement groups for placement (``pgp_num``)
before your cluster will rebalance. The ``pgp_num`` will be the number of
placement groups that will be considered for placement by the CRUSH
algorithm. Increasing ``pg_num`` splits the placement groups but data
will not be migrated to the newer placement groups until the number of
placement groups for placement, i.e. ``pgp_num``, is increased. The ``pgp_num``
should be equal to the ``pg_num``.  To increase the number of
placement groups for placement, execute the following::

  ceph osd pool set {pool-name} pgp_num {pgp_num}

When decreasing the number of PGs, ``pgp_num`` is adjusted
automatically for you.

Get the Number of Placement Groups
==================================

To get the number of placement groups in a pool, execute the following::

  ceph osd pool get {pool-name} pg_num


Get a Cluster's PG Statistics
=============================

To get the statistics for the placement groups in your cluster, execute the following::

  ceph pg dump [--format {format}]

Valid formats are ``plain`` (default) and ``json``.


Get Statistics for Stuck PGs
============================

To get the statistics for all placement groups stuck in a specified state,
execute the following::

  ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>]

**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD
with the most up-to-date data to come up and in.

**Unclean** Placement groups contain objects that are not replicated the desired number
of times. They should be recovering.

**Stale** Placement groups are in an unknown state - the OSDs that host them have not
reported to the monitor cluster in a while (configured by ``mon_osd_report_timeout``).

Valid formats are ``plain`` (default) and ``json``. The threshold defines the minimum number
of seconds the placement group is stuck before including it in the returned statistics
(default 300 seconds).


Get a PG Map
============

To get the placement group map for a particular placement group, execute the following::

  ceph pg map {pg-id}

For example::

  ceph pg map 1.6c

Ceph will return the placement group map, the placement group, and the OSD status::

  osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]
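
If you need the epoch and the OSD sets in a script, the plain-text line can
be parsed with a regular expression. This is a sketch only: the pattern is
derived from the single sample line above, so other Ceph versions or output
formats may not match it, and ``parse_pg_map`` is a hypothetical helper.

.. code-block:: python

   import re

   # Pattern derived from the sample output line shown above.
   PG_MAP_RE = re.compile(
       r"osdmap e(?P<epoch>\d+) pg (?P<pgid>\S+) \((?P<raw_pgid>[^)]+)\) "
       r"-> up \[(?P<up>[\d,]*)\] acting \[(?P<acting>[\d,]*)\]"
   )


   def parse_pg_map(line):
       """Parse one line of `ceph pg map` plain output into a dict."""
       match = PG_MAP_RE.search(line)
       if match is None:
           raise ValueError(f"unrecognized pg map line: {line!r}")

       def to_ids(text):
           return [int(item) for item in text.split(",") if item]

       return {
           "epoch": int(match["epoch"]),
           "pgid": match["pgid"],
           "up": to_ids(match["up"]),
           "acting": to_ids(match["acting"]),
       }


   print(parse_pg_map("osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]"))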


Get a PG's Statistics
=====================

To retrieve statistics for a particular placement group, execute the following::

  ceph pg {pg-id} query


Scrub a Placement Group
=======================

To scrub a placement group, execute the following::

  ceph pg scrub {pg-id}

Ceph checks the primary and any replica nodes, generates a catalog of all objects
in the placement group and compares them to ensure that no objects are missing
or mismatched, and their contents are consistent.  Assuming the replicas all
match, a final semantic sweep ensures that all of the snapshot-related object
metadata is consistent. Errors are reported via logs.

To scrub all placement groups from a specific pool, execute the following::

  ceph osd pool scrub {pool-name}

Prioritize backfill/recovery of a Placement Group(s)
====================================================

You may run into a situation where a number of placement groups require
recovery and/or backfill, and some particular groups hold data more important
than others (for example, those PGs may hold data for images used by running
machines and other PGs may be used by inactive machines/less relevant data).
In that case, you may want to prioritize recovery of those groups so
performance and/or availability of data stored on those groups is restored
earlier. To do this (mark particular placement group(s) as prioritized during
backfill or recovery), execute the following::

  ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
  ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

This will cause Ceph to perform recovery or backfill on specified placement
groups first, before other placement groups. This does not interrupt currently
ongoing backfills or recovery, but causes specified PGs to be processed
as soon as possible. If you change your mind or prioritize the wrong groups,
use::

  ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
  ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

This will remove the "force" flag from those PGs and they will be processed
in default order. Again, this doesn't affect placement groups that are
currently being processed, only those that are still queued.

The "force" flag is cleared automatically after recovery or backfill of the
group is done.

Similarly, you may use the following commands to force Ceph to perform recovery
or backfill on all placement groups from a specified pool first::

  ceph osd pool force-recovery {pool-name}
  ceph osd pool force-backfill {pool-name}

or::

  ceph osd pool cancel-force-recovery {pool-name}
  ceph osd pool cancel-force-backfill {pool-name}

to restore the default recovery or backfill priority if you change your mind.

Note that these commands could possibly break the ordering of Ceph's internal
priority computations, so use them with caution!
In particular, if you have multiple pools that are currently sharing the same
underlying OSDs, and some particular pools hold data more important than others,
we recommend you use the following command to re-arrange all pools'
recovery/backfill priority in a better order::

  ceph osd pool set {pool-name} recovery_priority {value}

For example, if you have 10 pools you could make the most important one priority 10,
next 9, etc. Or you could leave most pools alone and have say 3 important pools
all priority 1 or priorities 3, 2, 1 respectively.

Revert Lost
===========

If the cluster has lost one or more objects, and you have decided to
abandon the search for the lost data, you must mark the unfound objects
as ``lost``.

If all possible locations have been queried and objects are still
lost, you may have to give up on the lost objects. This is
possible given unusual combinations of failures that allow the cluster
to learn about writes that were performed before the writes themselves
are recovered.

The ``revert`` option will either roll back to
a previous version of the object or (if it was a new object) forget about it
entirely; the ``delete`` option forgets about the unfound objects entirely.
To mark the "unfound" objects as "lost", execute the following::

  ceph pg {pg-id} mark_unfound_lost revert|delete

.. important:: Use this feature with caution, because it may confuse
   applications that expect the object(s) to exist.


.. toctree::
   :hidden:

   pg-states
   pg-concepts


.. _Create a Pool: ../pools#createpool
.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
.. _pgcalc: http://ceph.com/pgcalc/