1 ==================
2 Placement Groups
3 ==================
4
5 .. _pg-autoscaler:
6
7 Autoscaling placement groups
8 ============================
9
10 Placement groups (PGs) are an internal implementation detail of how
11 Ceph distributes data. You can allow the cluster to either make
12 recommendations or automatically tune PGs based on how the cluster is
13 used by enabling *pg-autoscaling*.
14
15 Each pool in the system has a ``pg_autoscale_mode`` property that can be set to ``off``, ``on``, or ``warn``.
16
* ``off``: Disable autoscaling for this pool. It is up to the administrator to choose an appropriate PG number for each pool. Please refer to :ref:`choosing-number-of-placement-groups` for more information.
* ``on``: Enable automated adjustments of the PG count for the given pool.
* ``warn``: Raise health alerts when the PG count should be adjusted.
20
To set the autoscaling mode for existing pools::
22
23 ceph osd pool set <pool-name> pg_autoscale_mode <mode>
24
For example, to enable autoscaling on pool ``foo``::
26
27 ceph osd pool set foo pg_autoscale_mode on
28
29 You can also configure the default ``pg_autoscale_mode`` that is
30 applied to any pools that are created in the future with::
31
32 ceph config set global osd_pool_default_autoscale_mode <mode>
33
34 Viewing PG scaling recommendations
35 ----------------------------------
36
37 You can view each pool, its relative utilization, and any suggested changes to
38 the PG count with this command::
39
40 ceph osd pool autoscale-status
41
42 Output will be something like::
43
 POOL    SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
 a     12900M                3.0        82431M  0.4695                     8         128  warn
 c          0                3.0        82431M  0.0000        0.2000       1          64  warn
 b          0       953.6M   3.0        82431M  0.0347                     8              warn
48
49 **SIZE** is the amount of data stored in the pool. **TARGET SIZE**, if
50 present, is the amount of data the administrator has specified that
51 they expect to eventually be stored in this pool. The system uses
52 the larger of the two values for its calculation.
53
**RATE** is the multiplier for the pool that determines how much raw
storage capacity is consumed. For example, a 3-replica pool will
have a rate of 3.0, while a k=4,m=2 erasure coded pool will have a
rate of 1.5.
58
59 **RAW CAPACITY** is the total amount of raw storage capacity on the
60 OSDs that are responsible for storing this pool's (and perhaps other
61 pools') data. **RATIO** is the ratio of that total capacity that
62 this pool is consuming (i.e., ratio = size * rate / raw capacity).
63
64 **TARGET RATIO**, if present, is the ratio of storage that the
65 administrator has specified that they expect this pool to consume.
66 The system uses the larger of the actual ratio and the target ratio
67 for its calculation. If both target size bytes and ratio are specified, the
68 ratio takes precedence.
69
70 **PG_NUM** is the current number of PGs for the pool (or the current
71 number of PGs that the pool is working towards, if a ``pg_num``
72 change is in progress). **NEW PG_NUM**, if present, is what the
73 system believes the pool's ``pg_num`` should be changed to. It is
74 always a power of 2, and will only be present if the "ideal" value
75 varies from the current value by more than a factor of 3.
76
77 The final column, **AUTOSCALE**, is the pool ``pg_autoscale_mode``,
78 and will be either ``on``, ``off``, or ``warn``.
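
To make the arithmetic behind these columns concrete, the following
sketch recomputes **RATIO** and the effective ratio used for scaling
from the sample output above. It is illustrative arithmetic only, not
the manager module's actual code::

   # Illustrative only: recompute RATIO for the pools in the sample output.
   # ratio = size * rate / raw capacity; the autoscaler then works from the
   # larger of the actual ratio and any target ratio.

   MB = 1024 * 1024
   raw_capacity = 82431 * MB

   pools = {
       'a': dict(size=12900 * MB, target_size=0,          rate=3.0, target_ratio=None),
       'c': dict(size=0,          target_size=0,          rate=3.0, target_ratio=0.2),
       'b': dict(size=0,          target_size=953.6 * MB, rate=3.0, target_ratio=None),
   }

   for name, p in pools.items():
       size = max(p['size'], p['target_size'])           # larger of stored and expected size
       ratio = size * p['rate'] / raw_capacity           # the RATIO column
       effective = max(ratio, p['target_ratio'] or 0.0)  # larger of RATIO and TARGET RATIO
       print(f"{name}: ratio={ratio:.4f} effective={effective:.4f}")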
79
80
81 Automated scaling
82 -----------------
83
84 Allowing the cluster to automatically scale PGs based on usage is the
85 simplest approach. Ceph will look at the total available storage and
86 target number of PGs for the whole system, look at how much data is
87 stored in each pool, and try to apportion the PGs accordingly. The
88 system is relatively conservative with its approach, only making
89 changes to a pool when the current number of PGs (``pg_num``) is more
90 than 3 times off from what it thinks it should be.
91
92 The target number of PGs per OSD is based on the
93 ``mon_target_pg_per_osd`` configurable (default: 100), which can be
94 adjusted with::
95
96 ceph config set global mon_target_pg_per_osd 100
97
98 The autoscaler analyzes pools and adjusts on a per-subtree basis.
99 Because each pool may map to a different CRUSH rule, and each rule may
100 distribute data across different devices, Ceph will consider
101 utilization of each subtree of the hierarchy independently. For
102 example, a pool that maps to OSDs of class `ssd` and a pool that maps
103 to OSDs of class `hdd` will each have optimal PG counts that depend on
104 the number of those respective device types.
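
Under simplifying assumptions, the sketch below captures the shape of
the heuristic described above: a pool's share of its CRUSH root's
capacity, times the root's PG budget, divided by the pool size,
rounded to a power of two, and only acted on when it is more than a
factor of 3 away from the current value. This is an illustration, not
the ``pg_autoscaler`` module's implementation::

   # Simplified sketch of the autoscaling heuristic described above.
   # Not the pg_autoscaler module's code; for illustration only.

   def nearest_power_of_two(n):
       """Round a positive integer to the nearest power of two."""
       if n <= 1:
           return 1
       lower = 1 << (n.bit_length() - 1)
       upper = lower << 1
       return lower if n - lower < upper - n else upper

   def suggested_pg_num(capacity_ratio, osds_in_root, pool_size,
                        current_pg_num, target_pg_per_osd=100):
       """Return a new pg_num, or None if the current value is close enough."""
       root_pg_budget = osds_in_root * target_pg_per_osd      # PG replicas the root can carry
       ideal = capacity_ratio * root_pg_budget / pool_size    # this pool's share, in PGs
       ideal = nearest_power_of_two(max(1, int(round(ideal))))
       if ideal > current_pg_num * 3 or ideal * 3 < current_pg_num:
           return ideal                                       # off by more than a factor of 3
       return None

   # Hypothetical example: a pool expected to use ~47% of a 10-OSD root,
   # 3x replication, currently 8 PGs -> a suggestion of 128 PGs.
   print(suggested_pg_num(0.47, 10, 3.0, 8))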
105
106
107 .. _specifying_pool_target_size:
108
109 Specifying expected pool size
110 -----------------------------
111
112 When a cluster or pool is first created, it will consume a small
113 fraction of the total cluster capacity and will appear to the system
114 as if it should only need a small number of placement groups.
115 However, in most cases cluster administrators have a good idea which
116 pools are expected to consume most of the system capacity over time.
117 By providing this information to Ceph, a more appropriate number of
118 PGs can be used from the beginning, preventing subsequent changes in
119 ``pg_num`` and the overhead associated with moving data around when
120 those adjustments are made.
121
The *target size* of a pool can be specified in two ways: either in
terms of the absolute size of the pool (i.e., bytes), or as a ratio of
the total cluster capacity.
125
For example::
127
128 ceph osd pool set mypool target_size_bytes 100T
129
130 will tell the system that `mypool` is expected to consume 100 TiB of
space. Alternatively::
132
133 ceph osd pool set mypool target_size_ratio .9
134
135 will tell the system that `mypool` is expected to consume 90% of the
136 total cluster capacity.
137
138 You can also set the target size of a pool at creation time with the optional ``--target-size-bytes <bytes>`` or ``--target-size-ratio <ratio>`` arguments to the ``ceph osd pool create`` command.
139
Note that if impossible target size values are specified (for example,
a capacity larger than the total cluster capacity, or ratio(s) that sum
to more than 1.0), then a health warning
(``POOL_TARGET_SIZE_RATIO_OVERCOMMITTED`` or
``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised.
145
146 Specifying bounds on a pool's PGs
147 ---------------------------------
148
149 It is also possible to specify a minimum number of PGs for a pool.
150 This is useful for establishing a lower bound on the amount of
parallelism clients will see when doing IO, even when a pool is mostly
152 empty. Setting the lower bound prevents Ceph from reducing (or
153 recommending you reduce) the PG number below the configured number.
154
155 You can set the minimum number of PGs for a pool with::
156
157 ceph osd pool set <pool-name> pg_num_min <num>
158
159 You can also specify the minimum PG count at pool creation time with
160 the optional ``--pg-num-min <num>`` argument to the ``ceph osd pool
161 create`` command.
162
163 .. _preselection:
164
165 A preselection of pg_num
166 ========================
167
168 When creating a new pool with::
169
170 ceph osd pool create {pool-name} pg_num
171
172 it is mandatory to choose the value of ``pg_num`` because it cannot (currently) be
173 calculated automatically. Here are a few values commonly used:
174
- Fewer than 5 OSDs: set ``pg_num`` to 128

- Between 5 and 10 OSDs: set ``pg_num`` to 512

- Between 10 and 50 OSDs: set ``pg_num`` to 1024

- If you have more than 50 OSDs, you need to understand the tradeoffs
  and how to calculate the ``pg_num`` value by yourself

- To calculate the ``pg_num`` value by yourself, the `pgcalc`_ tool can help
185
As the number of OSDs increases, choosing the right value for ``pg_num``
becomes more important because it has a significant influence on the
behavior of the cluster as well as the durability of the data when
something goes wrong (i.e. the probability that a catastrophic event
leads to data loss).
191
How are Placement Groups used?
193 ===============================
194
195 A placement group (PG) aggregates objects within a pool because
196 tracking object placement and object metadata on a per-object basis is
197 computationally expensive--i.e., a system with millions of objects
198 cannot realistically track placement on a per-object basis.
199

.. ditaa::

   /-----\  /-----\  /-----\  /-----\  /-----\
   | obj |  | obj |  | obj |  | obj |  | obj |
   \-----/  \-----/  \-----/  \-----/  \-----/
      |        |        |        |        |
      +--------+--------+        +---+----+
               |                     |
               v                     v
   +-----------------------+  +-----------------------+
   |  Placement Group #1   |  |  Placement Group #2   |
   |                       |  |                       |
   +-----------------------+  +-----------------------+
               |                         |
               +------------+------------+
                            |
                            v
                +-----------------------+
                |        Pool           |
                |                       |
                +-----------------------+
220
221 The Ceph client will calculate which placement group an object should
222 be in. It does this by hashing the object ID and applying an operation
223 based on the number of PGs in the defined pool and the ID of the pool.
224 See `Mapping PGs to OSDs`_ for details.
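
As a toy illustration of that idea (and only an illustration: the real
client uses Ceph's own object hash and stable-mod logic rather than the
generic hash and plain modulo shown here), the mapping can be pictured
like this::

   # Toy sketch of object -> PG mapping: hash the object name, fold the
   # hash onto the pool's pg_num, and combine the result with the pool id.
   import zlib

   def toy_pg_for_object(object_name, pool_id, pg_num):
       h = zlib.crc32(object_name.encode())   # stand-in for Ceph's object hash
       ps = h % pg_num                        # placement seed folded onto pg_num
       return "{}.{:x}".format(pool_id, ps)   # PG ids are written <pool>.<seed in hex>

   print(toy_pg_for_object("my-object", pool_id=1, pg_num=128))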
225
226 The object's contents within a placement group are stored in a set of
227 OSDs. For instance, in a replicated pool of size two, each placement
228 group will store objects on two OSDs, as shown below.
229

.. ditaa::

   +-----------------------+   +-----------------------+
   |  Placement Group #1   |   |  Placement Group #2   |
   |                       |   |                       |
   +-----------------------+   +-----------------------+
        |             |             |             |
        v             v             v             v
   /----------\  /----------\  /----------\  /----------\
   |          |  |          |  |          |  |          |
   |  OSD #1  |  |  OSD #2  |  |  OSD #2  |  |  OSD #3  |
   |          |  |          |  |          |  |          |
   \----------/  \----------/  \----------/  \----------/
243
244
245 Should OSD #2 fail, another will be assigned to Placement Group #1 and
246 will be filled with copies of all objects in OSD #1. If the pool size
247 is changed from two to three, an additional OSD will be assigned to
248 the placement group and will receive copies of all objects in the
249 placement group.
250
251 Placement groups do not own the OSD; they share it with other
252 placement groups from the same pool or even other pools. If OSD #2
fails, Placement Group #2 will also have to restore copies of
254 objects, using OSD #3.
255
256 When the number of placement groups increases, the new placement
257 groups will be assigned OSDs. The result of the CRUSH function will
258 also change and some objects from the former placement groups will be
259 copied over to the new Placement Groups and removed from the old ones.
260
261 Placement Groups Tradeoffs
262 ==========================
263
Data durability and even distribution among all OSDs call for more
placement groups, but their number should be kept to the minimum
needed to save CPU and memory.
267
268 .. _data durability:
269
270 Data durability
271 ---------------
272
273 After an OSD fails, the risk of data loss increases until the data it
274 contained is fully recovered. Let's imagine a scenario that causes
275 permanent data loss in a single placement group:
276
- The OSD fails and all copies of the objects it contains are lost.
  For all objects within the placement group the number of replicas
  suddenly drops from three to two.
280
281 - Ceph starts recovery for this placement group by choosing a new OSD
282 to re-create the third copy of all objects.
283
- Another OSD, within the same placement group, fails before the new
  OSD is fully populated with the third copy. Some objects will then
  only have one surviving copy.
287
288 - Ceph picks yet another OSD and keeps copying objects to restore the
289 desired number of copies.
290
291 - A third OSD, within the same placement group, fails before recovery
292 is complete. If this OSD contained the only remaining copy of an
293 object, it is permanently lost.
294
In a cluster containing 10 OSDs with 512 placement groups in a three
replica pool, CRUSH will give each placement group three OSDs. In the
end, each OSD will end up hosting (512 * 3) / 10 = ~150 placement
groups. When the first OSD fails, the above scenario will therefore
start recovery for all ~150 placement groups at the same time.
300
301 The 150 placement groups being recovered are likely to be
302 homogeneously spread over the 9 remaining OSDs. Each remaining OSD is
303 therefore likely to send copies of objects to all others and also
304 receive some new objects to be stored because they became part of a
305 new placement group.
306
307 The amount of time it takes for this recovery to complete entirely
depends on the architecture of the Ceph cluster. Say each OSD is
309 hosted by a 1TB SSD on a single machine and all of them are connected
310 to a 10Gb/s switch and the recovery for a single OSD completes within
311 M minutes. If there are two OSDs per machine using spinners with no
312 SSD journal and a 1Gb/s switch, it will at least be an order of
313 magnitude slower.
314
315 In a cluster of this size, the number of placement groups has almost
316 no influence on data durability. It could be 128 or 8192 and the
317 recovery would not be slower or faster.
318
319 However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs
320 is likely to speed up recovery and therefore improve data durability
321 significantly. Each OSD now participates in only ~75 placement groups
322 instead of ~150 when there were only 10 OSDs and it will still require
323 all 19 remaining OSDs to perform the same amount of object copies in
324 order to recover. But where 10 OSDs had to copy approximately 100GB
325 each, they now have to copy 50GB each instead. If the network was the
326 bottleneck, recovery will happen twice as fast. In other words,
327 recovery goes faster when the number of OSDs increases.
328
329 If this cluster grows to 40 OSDs, each of them will only host ~35
330 placement groups. If an OSD dies, recovery will keep going faster
331 unless it is blocked by another bottleneck. However, if this cluster
332 grows to 200 OSDs, each of them will only host ~7 placement groups. If
an OSD dies, recovery will happen between at most ~21 (7 * 3) OSDs
334 in these placement groups: recovery will take longer than when there
335 were 40 OSDs, meaning the number of placement groups should be
336 increased.
337
338 No matter how short the recovery time is, there is a chance for a
second OSD to fail while it is in progress. In the 10-OSD cluster
described above, if any of them fails, then ~17 placement groups
(i.e. ~150 / 9 placement groups being recovered) will only have one
surviving copy. And if any of the 8 remaining OSDs fails, the last
343 objects of two placement groups are likely to be lost (i.e. ~17 / 8
344 placement groups with only one remaining copy being recovered).
345
346 When the size of the cluster grows to 20 OSDs, the number of Placement
347 Groups damaged by the loss of three OSDs drops. The second OSD lost
348 will degrade ~4 (i.e. ~75 / 19 placement groups being recovered)
349 instead of ~17 and the third OSD lost will only lose data if it is one
350 of the four OSDs containing the surviving copy. In other words, if the
351 probability of losing one OSD is 0.0001% during the recovery time
352 frame, it goes from 17 * 10 * 0.0001% in the cluster with 10 OSDs to 4 * 20 *
353 0.0001% in the cluster with 20 OSDs.
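
The back-of-the-envelope numbers above are easy to reproduce. The
short script below simply restates that arithmetic; it is not a real
durability model, and the OSD-loss probability is the illustrative
0.0001% used above::

   # Restate the rough arithmetic above for a 3-replica pool with 512 PGs.
   # Not a durability model; just the back-of-the-envelope numbers.

   def rough_numbers(osds, pgs=512, replicas=3, p_osd_loss=0.000001):  # 0.0001%
       pgs_per_osd = pgs * replicas / osds                # roughly 150 for 10 OSDs, 75 for 20
       one_copy_left = pgs_per_osd / (osds - 1)           # PGs reduced to one copy (~17, ~4)
       relative_risk = one_copy_left * osds * p_osd_loss  # as in "17 * 10 * 0.0001%"
       return pgs_per_osd, one_copy_left, relative_risk

   for osds in (10, 20):
       per_osd, one_copy, risk = rough_numbers(osds)
       print("%d OSDs: ~%d PGs per OSD, ~%d PGs with one copy, risk ~%.1e"
             % (osds, per_osd, one_copy, risk))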
354
355 In a nutshell, more OSDs mean faster recovery and a lower risk of
356 cascading failures leading to the permanent loss of a Placement
357 Group. Having 512 or 4096 Placement Groups is roughly equivalent in a
cluster with fewer than 50 OSDs as far as data durability is concerned.
359
360 Note: It may take a long time for a new OSD added to the cluster to be
populated with placement groups that were assigned to it. However,
there is no degradation of any object, and it has no impact on the
durability of the data contained in the cluster.
364
365 .. _object distribution:
366
367 Object distribution within a pool
368 ---------------------------------
369
370 Ideally objects are evenly distributed in each placement group. Since
371 CRUSH computes the placement group for each object, but does not
372 actually know how much data is stored in each OSD within this
373 placement group, the ratio between the number of placement groups and
374 the number of OSDs may influence the distribution of the data
375 significantly.
376
377 For instance, if there was a single placement group for ten OSDs in a
three replica pool, only three OSDs would be used because CRUSH would
379 have no other choice. When more placement groups are available,
380 objects are more likely to be evenly spread among them. CRUSH also
381 makes every effort to evenly spread OSDs among all existing Placement
382 Groups.
383
384 As long as there are one or two orders of magnitude more Placement
385 Groups than OSDs, the distribution should be even. For instance, 256
386 placement groups for 3 OSDs, 512 or 1024 placement groups for 10 OSDs
387 etc.
388
389 Uneven data distribution can be caused by factors other than the ratio
390 between OSDs and placement groups. Since CRUSH does not take into
391 account the size of the objects, a few very large objects may create
an imbalance. Say one million 4K objects totaling 4GB are evenly
393 spread among 1024 placement groups on 10 OSDs. They will use 4GB / 10
394 = 400MB on each OSD. If one 400MB object is added to the pool, the
395 three OSDs supporting the placement group in which the object has been
396 placed will be filled with 400MB + 400MB = 800MB while the seven
397 others will remain occupied with only 400MB.
398
399 .. _resource usage:
400
401 Memory, CPU and network usage
402 -----------------------------
403
404 For each placement group, OSDs and MONs need memory, network and CPU
405 at all times and even more during recovery. Sharing this overhead by
406 clustering objects within a placement group is one of the main reasons
407 they exist.
408
409 Minimizing the number of placement groups saves significant amounts of
410 resources.
411
412 .. _choosing-number-of-placement-groups:
413
414 Choosing the number of Placement Groups
415 =======================================
416
.. note:: It is rarely necessary to do this math by hand. Instead, use the ``ceph osd pool autoscale-status`` command in combination with the ``target_size_bytes`` or ``target_size_ratio`` pool properties. See :ref:`pg-autoscaler` for more information.
418
419 If you have more than 50 OSDs, we recommend approximately 50-100
420 placement groups per OSD to balance out resource usage, data
durability and distribution. If you have fewer than 50 OSDs, choosing
422 among the `preselection`_ above is best. For a single pool of objects,
423 you can use the following formula to get a baseline::
424
               (OSDs * 100)
   Total PGs = ------------
                 pool size
428
429 Where **pool size** is either the number of replicas for replicated
430 pools or the K+M sum for erasure coded pools (as returned by **ceph
431 osd erasure-code-profile get**).
432
You should then check whether the result makes sense given the way you
designed your Ceph cluster to maximize `data durability`_ and
`object distribution`_ while minimizing `resource usage`_.
436
437 The result should always be **rounded up to the nearest power of two**.
438
439 Only a power of two will evenly balance the number of objects among
440 placement groups. Other values will result in an uneven distribution of
441 data across your OSDs. Their use should be limited to incrementally
442 stepping from one power of two to another.
443
444 As an example, for a cluster with 200 OSDs and a pool size of 3
445 replicas, you would estimate your number of PGs as follows::
446

   (200 * 100)
   ----------- = 6667. Nearest power of 2: 8192
        3
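
The same baseline-plus-rounding calculation is easy to script. The
helper below is illustrative only (it is not part of any Ceph tooling)
and reproduces the example above::

   # Baseline PG count from the formula above, rounded up to the nearest
   # power of two. Illustrative helper, not part of any Ceph tooling.

   def baseline_pg_count(osds, pool_size, pgs_per_osd=100):
       raw = osds * pgs_per_osd / pool_size   # (OSDs * 100) / pool size
       power = 1
       while power < raw:                     # round up to the next power of two
           power *= 2
       return power

   print(baseline_pg_count(200, 3))           # ~6667 -> 8192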
450
451 When using multiple data pools for storing objects, you need to ensure
452 that you balance the number of placement groups per pool with the
453 number of placement groups per OSD so that you arrive at a reasonable
454 total number of placement groups that provides reasonably low variance
455 per OSD without taxing system resources or making the peering process
456 too slow.
457
For instance, a cluster of 10 pools, each with 512 placement groups,
on ten OSDs has a total of 5,120 placement groups spread over ten
OSDs, that is, 512 placement groups per OSD. That does not use too
many resources. However, if 1,000 pools were created with 512
placement groups each, the OSDs would handle ~50,000 placement groups
each, and it would require significantly more resources and time for
peering.
464
465 You may find the `PGCalc`_ tool helpful.
466
467
468 .. _setting the number of placement groups:
469
470 Set the Number of Placement Groups
471 ==================================
472
473 To set the number of placement groups in a pool, you must specify the
474 number of placement groups at the time you create the pool.
See `Create a Pool`_ for details. Even after a pool is created, you can also change the number of placement groups with::
476
477 ceph osd pool set {pool-name} pg_num {pg_num}
478
After you increase the number of placement groups, you must also
increase the number of placement groups for placement (``pgp_num``)
before your cluster will rebalance. The ``pgp_num`` is the number of
placement groups that will be considered for placement by the CRUSH
algorithm. Increasing ``pg_num`` splits the placement groups, but data
will not be migrated to the newer placement groups until the number of
placement groups for placement, i.e. ``pgp_num``, is increased.
``pgp_num`` should be equal to ``pg_num``. To increase the number of
placement groups for placement, execute the following::
488
489 ceph osd pool set {pool-name} pgp_num {pgp_num}
490
491 When decreasing the number of PGs, ``pgp_num`` is adjusted
492 automatically for you.
493
494 Get the Number of Placement Groups
495 ==================================
496
497 To get the number of placement groups in a pool, execute the following::
498
499 ceph osd pool get {pool-name} pg_num
500
501
502 Get a Cluster's PG Statistics
503 =============================
504
505 To get the statistics for the placement groups in your cluster, execute the following::
506
507 ceph pg dump [--format {format}]
508
509 Valid formats are ``plain`` (default) and ``json``.
510
511
512 Get Statistics for Stuck PGs
513 ============================
514
515 To get the statistics for all placement groups stuck in a specified state,
516 execute the following::
517
518 ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>]
519
520 **Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD
521 with the most up-to-date data to come up and in.
522
523 **Unclean** Placement groups contain objects that are not replicated the desired number
524 of times. They should be recovering.
525
526 **Stale** Placement groups are in an unknown state - the OSDs that host them have not
527 reported to the monitor cluster in a while (configured by ``mon_osd_report_timeout``).
528
529 Valid formats are ``plain`` (default) and ``json``. The threshold defines the minimum number
530 of seconds the placement group is stuck before including it in the returned statistics
531 (default 300 seconds).
532
533
534 Get a PG Map
535 ============
536
537 To get the placement group map for a particular placement group, execute the following::
538
539 ceph pg map {pg-id}
540
541 For example::
542
543 ceph pg map 1.6c
544
545 Ceph will return the placement group map, the placement group, and the OSD status::
546
547 osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]
548
549
Get a PG's Statistics
=====================
552
553 To retrieve statistics for a particular placement group, execute the following::
554
555 ceph pg {pg-id} query
556
557
558 Scrub a Placement Group
559 =======================
560
561 To scrub a placement group, execute the following::
562
563 ceph pg scrub {pg-id}
564
565 Ceph checks the primary and any replica nodes, generates a catalog of all objects
566 in the placement group and compares them to ensure that no objects are missing
or mismatched and that their contents are consistent. Assuming the replicas all
568 match, a final semantic sweep ensures that all of the snapshot-related object
569 metadata is consistent. Errors are reported via logs.
570
571 To scrub all placement groups from a specific pool, execute the following::
572
573 ceph osd pool scrub {pool-name}
574
Prioritize backfill/recovery of Placement Group(s)
====================================================
577
You may run into a situation where a number of placement groups require
recovery and/or backfill, and some particular groups hold data more important
than others (for example, some PGs may hold data for images used by running
machines while other PGs may be used by inactive machines or hold less
relevant data). In that case, you may want to prioritize recovery of those
groups so that performance and/or availability of the data stored on them is
restored earlier. To do this (mark particular placement group(s) as
prioritized during backfill or recovery), execute the following::
586
587 ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
588 ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]
589
This will cause Ceph to perform recovery or backfill on the specified placement
groups first, before other placement groups. This does not interrupt currently
ongoing backfills or recovery, but causes the specified PGs to be processed
as soon as possible. If you change your mind or prioritize the wrong groups,
use::
595
596 ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
597 ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]
598
This will remove the "force" flag from those PGs, and they will be processed
in the default order. Again, this doesn't affect currently processed placement
groups, only those that are still queued.

The "force" flag is cleared automatically after recovery or backfill of the
group is done.
605
606 Similarly, you may use the following commands to force Ceph to perform recovery
607 or backfill on all placement groups from a specified pool first::
608
609 ceph osd pool force-recovery {pool-name}
610 ceph osd pool force-backfill {pool-name}
611
612 or::
613
614 ceph osd pool cancel-force-recovery {pool-name}
615 ceph osd pool cancel-force-backfill {pool-name}
616
to restore the default recovery or backfill priority if you change your mind.
618
Note that these commands could possibly break the ordering of Ceph's internal
priority computations, so use them with caution!
In particular, if you have multiple pools that are currently sharing the same
underlying OSDs, and some particular pools hold data more important than others,
we recommend you use the following command to re-arrange the recovery/backfill
priority of all pools into a better order::
625
626 ceph osd pool set {pool-name} recovery_priority {value}
627
For example, if you have 10 pools, you could make the most important one priority 10,
the next 9, and so on. Or you could leave most pools alone and give, say, 3 important
pools priority 1, or priorities 3, 2, and 1 respectively.
631
632 Revert Lost
633 ===========
634
635 If the cluster has lost one or more objects, and you have decided to
636 abandon the search for the lost data, you must mark the unfound objects
637 as ``lost``.
638
639 If all possible locations have been queried and objects are still
640 lost, you may have to give up on the lost objects. This is
641 possible given unusual combinations of failures that allow the cluster
642 to learn about writes that were performed before the writes themselves
643 are recovered.
644
645 Currently the only supported option is "revert", which will either roll back to
646 a previous version of the object or (if it was a new object) forget about it
647 entirely. To mark the "unfound" objects as "lost", execute the following::
648
649 ceph pg {pg-id} mark_unfound_lost revert|delete
650
651 .. important:: Use this feature with caution, because it may confuse
652 applications that expect the object(s) to exist.
653
654
655 .. toctree::
656 :hidden:
657
658 pg-states
659 pg-concepts
660
661
662 .. _Create a Pool: ../pools#createpool
663 .. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
664 .. _pgcalc: http://ceph.com/pgcalc/