==================
 Placement Groups
==================

.. _pg-autoscaler:

Autoscaling placement groups
============================

Placement groups (PGs) are an internal implementation detail of how
Ceph distributes data. You can allow the cluster to either make
recommendations or automatically tune PGs based on how the cluster is
used by enabling *pg-autoscaling*.

Each pool in the system has a ``pg_autoscale_mode`` property that can be set to ``off``, ``on``, or ``warn``.

* ``off``: Disable autoscaling for this pool. It is up to the administrator to choose an appropriate PG number for each pool. Please refer to :ref:`choosing-number-of-placement-groups` for more information.
* ``on``: Enable automated adjustments of the PG count for the given pool.
* ``warn``: Raise health alerts when the PG count should be adjusted.

To set the autoscaling mode for existing pools::

  ceph osd pool set <pool-name> pg_autoscale_mode <mode>

For example, to enable autoscaling on pool ``foo``::

  ceph osd pool set foo pg_autoscale_mode on

You can also configure the default ``pg_autoscale_mode`` that is
applied to any pools that are created in the future with::

  ceph config set global osd_pool_default_pg_autoscale_mode <mode>

Viewing PG scaling recommendations
----------------------------------

You can view each pool, its relative utilization, and any suggested changes to
the PG count with this command::

  ceph osd pool autoscale-status

Output will be something like::

  POOL    SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  PROFILE
  a     12900M               3.0         82431M  0.4695                                           8         128  warn       scale-up
  c          0               3.0         82431M  0.0000        0.2000           0.9884   1.0      1          64  warn       scale-down
  b          0       953.6M  3.0         82431M  0.0347                                           8              warn       scale-down

**SIZE** is the amount of data stored in the pool. **TARGET SIZE**, if
present, is the amount of data the administrator has specified that
they expect to eventually be stored in this pool. The system uses
the larger of the two values for its calculation.

**RATE** is the multiplier for the pool that determines how much raw
storage capacity is consumed. For example, a 3-replica pool will
have a rate of 3.0, while a k=4,m=2 erasure-coded pool will have a
rate of 1.5.

**RAW CAPACITY** is the total amount of raw storage capacity on the
OSDs that are responsible for storing this pool's (and perhaps other
pools') data. **RATIO** is the fraction of that total capacity that
this pool is consuming (i.e., ratio = size * rate / raw capacity).

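As a quick check, the numbers in the sample output above can be reproduced
by hand. Below is a minimal sketch (plain Python, with the sample values for
pool ``a`` hard-coded) of how **RATIO** is derived::

  # Values taken from pool "a" in the sample autoscale-status output above.
  size_bytes = 12900 * 1024**2      # SIZE: 12900M of stored data
  rate = 3.0                        # RATE: 3-replica pool
  raw_capacity = 82431 * 1024**2    # RAW CAPACITY of the OSDs backing the pool

  ratio = size_bytes * rate / raw_capacity
  print(round(ratio, 4))            # -> 0.4695, matching the RATIO column
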
**TARGET RATIO**, if present, is the ratio of storage that the
administrator has specified that they expect this pool to consume
relative to other pools with target ratios set.
If both target size bytes and ratio are specified, the
ratio takes precedence.

**EFFECTIVE RATIO** is the target ratio after adjusting in two ways:

1. subtracting any capacity expected to be used by pools with target size set
2. normalizing the target ratios among pools with target ratio set so
   they collectively target the rest of the space. For example, 4
   pools with target_ratio 1.0 would have an effective ratio of 0.25.

The system uses the larger of the actual ratio and the effective ratio
for its calculation.

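To make the normalization step concrete, here is a small sketch (plain
Python, hypothetical pool names) of how target ratios are turned into
effective ratios when no pools have a target size set::

  # Four pools that each request target_ratio 1.0.
  target_ratios = {"pool-a": 1.0, "pool-b": 1.0, "pool-c": 1.0, "pool-d": 1.0}

  # With no capacity reserved by pools that set target_size_bytes, the
  # ratios are simply normalized so that they sum to 1.0.
  total = sum(target_ratios.values())
  effective = {pool: r / total for pool, r in target_ratios.items()}
  print(effective)   # each pool ends up with an effective ratio of 0.25
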
**BIAS** is used as a multiplier to manually adjust a pool's PG count based
on prior information about how many PGs a specific pool is expected
to have.

**PG_NUM** is the current number of PGs for the pool (or the current
number of PGs that the pool is working towards, if a ``pg_num``
change is in progress). **NEW PG_NUM**, if present, is what the
system believes the pool's ``pg_num`` should be changed to. It is
always a power of 2, and will only be present if the "ideal" value
varies from the current value by more than a factor of 3.

**AUTOSCALE** is the pool's ``pg_autoscale_mode``
and will be either ``on``, ``off``, or ``warn``.

The final column, **PROFILE**, shows the autoscale profile
used by each pool. ``scale-up`` and ``scale-down`` are the
currently available profiles.


Automated scaling
-----------------

Allowing the cluster to automatically scale PGs based on usage is the
simplest approach. Ceph will look at the total available storage and
target number of PGs for the whole system, look at how much data is
stored in each pool, and try to apportion the PGs accordingly. The
system is relatively conservative with its approach, only making
changes to a pool when the current number of PGs (``pg_num``) is more
than 3 times off from what it thinks it should be.

The target number of PGs per OSD is based on the
``mon_target_pg_per_osd`` configurable (default: 100), which can be
adjusted with::

  ceph config set global mon_target_pg_per_osd 100

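The shape of that calculation can be illustrated with a small sketch (plain
Python). This is a deliberately simplified model of the autoscaler for a
single replicated pool on a single CRUSH root, not its actual implementation::

  import math

  def suggested_pg_num(pool_bytes, total_bytes, num_osds, pool_size,
                       current_pg_num, target_pg_per_osd=100):
      # Fraction of the cluster this pool is expected to occupy.
      capacity_ratio = pool_bytes / total_bytes if total_bytes else 0.0

      # Share of the cluster-wide PG budget; divide by the replica count
      # because every PG occupies pool_size OSDs.
      target = capacity_ratio * num_osds * target_pg_per_osd / pool_size

      # Round to the nearest power of two, with a floor of 1 PG.
      pg_num = 2 ** round(math.log2(target)) if target >= 1 else 1

      # Be conservative: only suggest a change when the current value is
      # off by more than a factor of 3.
      if max(pg_num, current_pg_num) <= 3 * min(pg_num, current_pg_num):
          return current_pg_num
      return pg_num

  # A pool holding 12 TiB of a 100 TiB cluster with 50 OSDs and 3 replicas.
  print(suggested_pg_num(12 * 2**40, 100 * 2**40, 50, 3, current_pg_num=32))  # -> 256
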
The autoscaler analyzes pools and adjusts on a per-subtree basis.
Because each pool may map to a different CRUSH rule, and each rule may
distribute data across different devices, Ceph will consider the
utilization of each subtree of the hierarchy independently. For
example, a pool that maps to OSDs of class `ssd` and a pool that maps
to OSDs of class `hdd` will each have optimal PG counts that depend on
the number of those respective device types.

The autoscaler uses the `scale-up` profile by default,
where it starts out each pool with minimal PGs and scales
up PGs when there is more usage in each pool. However, it also has
a `scale-down` profile, where each pool starts out with a full complement
of PGs and only scales down when the usage ratio across the pools is not even.

With only the `scale-down` profile, the autoscaler identifies
any overlapping roots and prevents the pools with such roots
from scaling because overlapping roots can cause problems
with the scaling process.

To use the `scale-down` profile::

  ceph osd pool set autoscale-profile scale-down

To switch back to the default `scale-up` profile::

  ceph osd pool set autoscale-profile scale-up

Existing clusters will continue to use the `scale-up` profile.
To use the `scale-down` profile, users will need to set the autoscale
profile to `scale-down` after upgrading to a version of Ceph that
provides the `scale-down` feature.

.. _specifying_pool_target_size:

Specifying expected pool size
-----------------------------

When a cluster or pool is first created, it will consume a small
fraction of the total cluster capacity and will appear to the system
as if it should only need a small number of placement groups.
However, in most cases cluster administrators have a good idea which
pools are expected to consume most of the system capacity over time.
By providing this information to Ceph, a more appropriate number of
PGs can be used from the beginning, preventing subsequent changes in
``pg_num`` and the overhead associated with moving data around when
those adjustments are made.

The *target size* of a pool can be specified in two ways: either in
terms of the absolute size of the pool (i.e., bytes), or as a weight
relative to other pools with a ``target_size_ratio`` set.

For example::

  ceph osd pool set mypool target_size_bytes 100T

will tell the system that `mypool` is expected to consume 100 TiB of
space. Alternatively::

  ceph osd pool set mypool target_size_ratio 1.0

will tell the system that `mypool` is expected to consume 1.0 relative
to the other pools with ``target_size_ratio`` set. If `mypool` is the
only pool in the cluster, this means an expected use of 100% of the
total capacity. If there is a second pool with ``target_size_ratio``
1.0, both pools would expect to use 50% of the cluster capacity.

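Here is a short sketch (plain Python, with hypothetical pool names and
capacities, and ignoring the replication overhead captured by **RATE**) of
how these two hints interact: capacity expected to be consumed by pools with
a target size is set aside first, and the ratio pools then share what is left::

  total_capacity = 100.0                     # TiB of usable capacity (example value)
  target_size_pools = {"rbd-images": 20.0}   # pools with target_size_bytes, in TiB

  remaining = total_capacity - sum(target_size_pools.values())

  ratio_pools = {"cephfs-data": 1.0, "rgw-data": 1.0}   # target_size_ratio values
  total_ratio = sum(ratio_pools.values())
  for pool, ratio in ratio_pools.items():
      expected = remaining * ratio / total_ratio
      print(f"{pool}: expected to use {expected:.0f} TiB")   # 40 TiB each
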
You can also set the target size of a pool at creation time with the optional ``--target-size-bytes <bytes>`` or ``--target-size-ratio <ratio>`` arguments to the ``ceph osd pool create`` command.

Note that if impossible target size values are specified (for example,
a capacity larger than the total cluster) then a health warning
(``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised.

If both ``target_size_ratio`` and ``target_size_bytes`` are specified
for a pool, only the ratio will be considered, and a health warning
(``POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO``) will be issued.

Specifying bounds on a pool's PGs
---------------------------------

It is also possible to specify a minimum number of PGs for a pool.
This is useful for establishing a lower bound on the amount of
parallelism clients will see when doing IO, even when a pool is mostly
empty. Setting the lower bound prevents Ceph from reducing (or
recommending you reduce) the PG number below the configured number.

You can set the minimum number of PGs for a pool with::

  ceph osd pool set <pool-name> pg_num_min <num>

You can also specify the minimum PG count at pool creation time with
the optional ``--pg-num-min <num>`` argument to the ``ceph osd pool
create`` command.

.. _preselection:

A preselection of pg_num
========================

When creating a new pool with::

  ceph osd pool create {pool-name} [pg_num]

it is optional to choose the value of ``pg_num``. If you do not
specify ``pg_num``, the cluster can (by default) automatically tune it
for you based on how much data is stored in the pool (see above, :ref:`pg-autoscaler`).

Alternatively, ``pg_num`` can be explicitly provided. However,
whether you specify a ``pg_num`` value or not does not affect whether
the value is automatically tuned by the cluster after the fact. To
enable or disable auto-tuning::

  ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn)

The "rule of thumb" for PGs per OSD has traditionally been 100. With
the addition of the balancer (which is also enabled by default), a
value of more like 50 PGs per OSD is probably reasonable. The
challenge (which the autoscaler normally handles for you; see the
sketch below) is to:

- have the PGs per pool proportional to the data in the pool, and
- end up with 50-100 PGs per OSD, after the replication or
  erasure-coding fan-out of each PG across OSDs is taken into
  consideration

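A minimal sketch of that two-part rule of thumb (plain Python, with
hypothetical pool sizes)::

  import math

  num_osds = 10
  pgs_per_osd = 100      # traditional target; ~50 is reasonable with the balancer
  pool_size = 3          # replica count, i.e. the fan-out of each PG across OSDs

  # Stored data per pool, used to apportion the PG budget proportionally.
  pool_data = {"cephfs-data": 800, "cephfs-metadata": 10, "rbd": 190}   # GiB
  total_data = sum(pool_data.values())

  pg_budget = num_osds * pgs_per_osd / pool_size    # total PGs across all pools
  for pool, data in pool_data.items():
      share = pg_budget * data / total_data
      pg_num = 2 ** round(math.log2(share)) if share >= 1 else 1
      print(pool, pg_num)   # 256, 4, and 64 -> roughly 97 PG replicas per OSD
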
How are Placement Groups used?
===============================

A placement group (PG) aggregates objects within a pool because
tracking object placement and object metadata on a per-object basis is
computationally expensive--i.e., a system with millions of objects
cannot realistically track placement on a per-object basis.

.. ditaa::

    /-----\  /-----\  /-----\  /-----\  /-----\
    | obj |  | obj |  | obj |  | obj |  | obj |
    \-----/  \-----/  \-----/  \-----/  \-----/
       |        |        |        |        |
       +--------+--------+        +---+----+
                |                     |
                v                     v
    +-----------------------+  +-----------------------+
    |  Placement Group #1   |  |  Placement Group #2   |
    |                       |  |                       |
    +-----------------------+  +-----------------------+
               |                           |
               +---------------------------+
                             |
                             v
                 +-----------------------+
                 |         Pool          |
                 |                       |
                 +-----------------------+

The Ceph client will calculate which placement group an object should
be in. It does this by hashing the object ID and applying an operation
based on the number of PGs in the defined pool and the ID of the pool.
See `Mapping PGs to OSDs`_ for details.

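The following is a drastically simplified model of that mapping (plain
Python; the real implementation uses Ceph's own hash function and placement
logic, so treat this only as an illustration of the idea, not as the actual
algorithm)::

  import hashlib

  def toy_pg_for_object(pool_id: int, object_name: str, pg_num: int) -> str:
      # Hash the object name, then fold the hash onto the pool's PG count.
      h = int.from_bytes(hashlib.md5(object_name.encode()).digest()[:4], "little")
      ps = h % pg_num                      # placement seed within the pool
      return f"{pool_id}.{ps:x}"           # PG ids are written as <pool>.<seed-hex>

  print(toy_pg_for_object(1, "rbd_data.abc123.0000000000000042", 128))
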
The object's contents within a placement group are stored in a set of
OSDs. For instance, in a replicated pool of size two, each placement
group will store objects on two OSDs, as shown below.

.. ditaa::

   +-----------------------+      +-----------------------+
   |  Placement Group #1   |      |  Placement Group #2   |
   |                       |      |                       |
   +-----------------------+      +-----------------------+
        |             |               |             |
        v             v               v             v
   /----------\  /----------\    /----------\  /----------\
   |          |  |          |    |          |  |          |
   |  OSD #1  |  |  OSD #2  |    |  OSD #2  |  |  OSD #3  |
   |          |  |          |    |          |  |          |
   \----------/  \----------/    \----------/  \----------/


Should OSD #2 fail, another will be assigned to Placement Group #1 and
will be filled with copies of all objects in OSD #1. If the pool size
is changed from two to three, an additional OSD will be assigned to
the placement group and will receive copies of all objects in the
placement group.

Placement groups do not own the OSD; they share it with other
placement groups from the same pool or even other pools. If OSD #2
fails, Placement Group #2 will also have to restore copies of its
objects, using OSD #3.

When the number of placement groups increases, the new placement
groups will be assigned OSDs. The result of the CRUSH function will
also change and some objects from the former placement groups will be
copied over to the new Placement Groups and removed from the old ones.

Placement Groups Tradeoffs
==========================

Data durability and even distribution among all OSDs call for more
placement groups, but their number should be kept to the minimum
required in order to save CPU and memory.

.. _data durability:

Data durability
---------------

After an OSD fails, the risk of data loss increases until the data it
contained is fully recovered. Let's imagine a scenario that causes
permanent data loss in a single placement group:

- The OSD fails and all copies of the object it contains are lost.
  For all objects within the placement group the number of replicas
  suddenly drops from three to two.

- Ceph starts recovery for this placement group by choosing a new OSD
  to re-create the third copy of all objects.

- Another OSD, within the same placement group, fails before the new
  OSD is fully populated with the third copy. Some objects will then
  only have one surviving copy.

- Ceph picks yet another OSD and keeps copying objects to restore the
  desired number of copies.

- A third OSD, within the same placement group, fails before recovery
  is complete. If this OSD contained the only remaining copy of an
  object, it is permanently lost.

In a cluster containing 10 OSDs with 512 placement groups in a three
replica pool, CRUSH will give each placement group three OSDs. In the
end, each OSD will end up hosting (512 * 3) / 10 = ~150 Placement
Groups. When the first OSD fails, the above scenario will therefore
start recovery for all 150 placement groups at the same time.

The 150 placement groups being recovered are likely to be
homogeneously spread over the 9 remaining OSDs. Each remaining OSD is
therefore likely to send copies of objects to all others and also
receive some new objects to be stored because they became part of a
new placement group.

The amount of time it takes for this recovery to complete entirely
depends on the architecture of the Ceph cluster. Let's say each OSD is
hosted by a 1TB SSD on a single machine and all of them are connected
to a 10Gb/s switch and the recovery for a single OSD completes within
M minutes. If there are two OSDs per machine using spinners with no
SSD journal and a 1Gb/s switch, it will at least be an order of
magnitude slower.

In a cluster of this size, the number of placement groups has almost
no influence on data durability. It could be 128 or 8192 and the
recovery would not be slower or faster.

However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs
is likely to speed up recovery and therefore improve data durability
significantly. Each OSD now participates in only ~75 placement groups
instead of ~150 when there were only 10 OSDs and it will still require
all 19 remaining OSDs to perform the same amount of object copies in
order to recover. But where 10 OSDs had to copy approximately 100GB
each, they now have to copy 50GB each instead. If the network was the
bottleneck, recovery will happen twice as fast. In other words,
recovery goes faster when the number of OSDs increases.

If this cluster grows to 40 OSDs, each of them will only host ~35
placement groups. If an OSD dies, recovery will keep going faster
unless it is blocked by another bottleneck. However, if this cluster
grows to 200 OSDs, each of them will only host ~7 placement groups. If
an OSD dies, recovery will happen between at most ~21 (7 * 3) OSDs
in these placement groups: recovery will take longer than when there
were 40 OSDs, meaning the number of placement groups should be
increased.

No matter how short the recovery time is, there is a chance for a
second OSD to fail while it is in progress. In the 10-OSD cluster
described above, if any of them fails, then ~17 placement groups
(i.e. ~150 / 9 placement groups being recovered) will only have one
surviving copy. And if any of the 8 remaining OSDs fails, the last
objects of two placement groups are likely to be lost (i.e. ~17 / 8
placement groups with only one remaining copy being recovered).

When the size of the cluster grows to 20 OSDs, the number of Placement
Groups damaged by the loss of three OSDs drops. The second OSD lost
will degrade ~4 (i.e. ~75 / 19 placement groups being recovered)
instead of ~17 and the third OSD lost will only lose data if it is one
of the four OSDs containing the surviving copy. In other words, if the
probability of losing one OSD is 0.0001% during the recovery time
frame, it goes from 17 * 10 * 0.0001% in the cluster with 10 OSDs to 4 * 20 *
0.0001% in the cluster with 20 OSDs.

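The arithmetic in this section is easy to reproduce. Here is a small sketch
(plain Python, using the same example numbers) of the per-OSD PG count and
the rough estimate of how many PGs are left with a single surviving copy
when a second OSD fails mid-recovery::

  def pgs_per_osd(num_pgs, pool_size, num_osds):
      return num_pgs * pool_size / num_osds

  # 10-OSD cluster, 512 PGs, 3 replicas.
  per_osd_10 = pgs_per_osd(512, 3, 10)      # ~150 PGs hosted per OSD
  single_copy_10 = per_osd_10 / (10 - 1)    # ~17 PGs down to one copy

  # The same pool on a 20-OSD cluster.
  per_osd_20 = pgs_per_osd(512, 3, 20)      # ~75 PGs hosted per OSD
  single_copy_20 = per_osd_20 / (20 - 1)    # ~4 PGs down to one copy

  print(round(per_osd_10), round(single_copy_10),
        round(per_osd_20), round(single_copy_20))   # 154 17 77 4
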
In a nutshell, more OSDs mean faster recovery and a lower risk of
cascading failures leading to the permanent loss of a Placement
Group. Having 512 or 4096 Placement Groups is roughly equivalent in a
cluster with fewer than 50 OSDs as far as data durability is concerned.

Note: It may take a long time for a new OSD added to the cluster to be
populated with placement groups that were assigned to it. However,
there is no degradation of any object and it has no impact on the
durability of the data contained in the cluster.

.. _object distribution:

Object distribution within a pool
---------------------------------

Ideally objects are evenly distributed in each placement group. Since
CRUSH computes the placement group for each object, but does not
actually know how much data is stored in each OSD within this
placement group, the ratio between the number of placement groups and
the number of OSDs may influence the distribution of the data
significantly.

For instance, if there was a single placement group for ten OSDs in a
three replica pool, only three OSDs would be used because CRUSH would
have no other choice. When more placement groups are available,
objects are more likely to be evenly spread among them. CRUSH also
makes every effort to evenly spread OSDs among all existing Placement
Groups.

As long as there are one or two orders of magnitude more Placement
Groups than OSDs, the distribution should be even. For instance, 256
placement groups for 3 OSDs, 512 or 1024 placement groups for 10 OSDs,
etc.

Uneven data distribution can be caused by factors other than the ratio
between OSDs and placement groups. Since CRUSH does not take into
account the size of the objects, a few very large objects may create
an imbalance. Let's say one million 4K objects totaling 4GB are evenly
spread among 1024 placement groups on 10 OSDs. They will use 4GB / 10
= 400MB on each OSD. If one 400MB object is added to the pool, the
three OSDs supporting the placement group in which the object has been
placed will be filled with 400MB + 400MB = 800MB while the seven
others will remain occupied with only 400MB.

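That imbalance is simple to quantify. A short sketch (plain Python, same
example numbers) of the per-OSD usage before and after the single 400MB
object is added::

  num_osds = 10
  small_objects_total = 4000    # MB: one million 4K objects, spread evenly
  per_osd = small_objects_total / num_osds          # 400 MB on every OSD

  large_object = 400            # MB: lands in a single PG with 3 replicas
  usage = [per_osd] * num_osds
  for osd in range(3):          # the three OSDs backing that PG
      usage[osd] += large_object

  print(usage)   # [800.0, 800.0, 800.0, 400.0, 400.0, ..., 400.0]
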
.. _resource usage:

Memory, CPU and network usage
-----------------------------

For each placement group, OSDs and MONs need memory, network and CPU
at all times and even more during recovery. Sharing this overhead by
clustering objects within a placement group is one of the main reasons
they exist.

Minimizing the number of placement groups saves significant amounts of
resources.

.. _choosing-number-of-placement-groups:

Choosing the number of Placement Groups
=======================================

.. note:: It is rarely necessary to do this math by hand. Instead, use the ``ceph osd pool autoscale-status`` command in combination with the ``target_size_bytes`` or ``target_size_ratio`` pool properties. See :ref:`pg-autoscaler` for more information.

If you have more than 50 OSDs, we recommend approximately 50-100
placement groups per OSD to balance out resource usage, data
durability and distribution. If you have fewer than 50 OSDs, choosing
among the `preselection`_ above is best. For a single pool of objects,
you can use the following formula to get a baseline:

  Total PGs = :math:`\frac{OSDs \times 100}{pool \: size}`

Where **pool size** is either the number of replicas for replicated
pools or the K+M sum for erasure coded pools (as returned by **ceph
osd erasure-code-profile get**).

You should then check if the result makes sense with the way you
designed your Ceph cluster to maximize `data durability`_,
`object distribution`_ and minimize `resource usage`_.

The result should always be **rounded up to the nearest power of two**.

Only a power of two will evenly balance the number of objects among
placement groups. Other values will result in an uneven distribution of
data across your OSDs. Their use should be limited to incrementally
stepping from one power of two to another.

As an example, for a cluster with 200 OSDs and a pool size of 3
replicas, you would estimate your number of PGs as follows:

  :math:`\frac{200 \times 100}{3} = 6667`. Nearest power of 2: 8192

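That calculation is easy to script. A minimal sketch (plain Python) of the
baseline formula and the round-up step::

  import math

  def baseline_pg_count(num_osds: int, pool_size: int, per_osd: int = 100) -> int:
      raw = num_osds * per_osd / pool_size
      return 2 ** math.ceil(math.log2(raw))   # round up to the nearest power of two

  print(baseline_pg_count(200, 3))   # 200 * 100 / 3 = 6667 -> 8192
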
When using multiple data pools for storing objects, you need to ensure
that you balance the number of placement groups per pool with the
number of placement groups per OSD so that you arrive at a reasonable
total number of placement groups that provides reasonably low variance
per OSD without taxing system resources or making the peering process
too slow.

For instance, a cluster of 10 pools, each with 512 placement groups on
ten OSDs, is a total of 5,120 placement groups spread over ten OSDs,
that is, 512 placement groups per OSD. That does not use too many
resources. However, if 1,000 pools were created with 512 placement
groups each, the OSDs will handle ~50,000 placement groups each and it
would require significantly more resources and time for peering.

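A quick sketch (plain Python) of that per-OSD check for the two layouts
described above::

  num_osds = 10

  def pgs_per_osd(num_pools, pgs_per_pool):
      # PGs per OSD, not counting the additional copies created by replication.
      return num_pools * pgs_per_pool / num_osds

  print(pgs_per_osd(10, 512))      # 512 PGs per OSD: workable
  print(pgs_per_osd(1000, 512))    # 51200 PGs per OSD: peering becomes very slow
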
You may find the `PGCalc`_ tool helpful.


.. _setting the number of placement groups:

Set the Number of Placement Groups
==================================

To set the number of placement groups in a pool, you must specify the
number of placement groups at the time you create the pool.
See `Create a Pool`_ for details. Even after a pool is created you can
also change the number of placement groups with::

  ceph osd pool set {pool-name} pg_num {pg_num}

After you increase the number of placement groups, you must also
increase the number of placement groups for placement (``pgp_num``)
before your cluster will rebalance. The ``pgp_num`` will be the number of
placement groups that will be considered for placement by the CRUSH
algorithm. Increasing ``pg_num`` splits the placement groups but data
will not be migrated to the newer placement groups until the number of
placement groups for placement, i.e. ``pgp_num``, is increased. The
``pgp_num`` should be equal to the ``pg_num``. To increase the number of
placement groups for placement, execute the following::

  ceph osd pool set {pool-name} pgp_num {pgp_num}

When decreasing the number of PGs, ``pgp_num`` is adjusted
automatically for you.

Get the Number of Placement Groups
==================================

To get the number of placement groups in a pool, execute the following::

  ceph osd pool get {pool-name} pg_num


Get a Cluster's PG Statistics
=============================

To get the statistics for the placement groups in your cluster, execute the following::

  ceph pg dump [--format {format}]

Valid formats are ``plain`` (default) and ``json``.


Get Statistics for Stuck PGs
============================

To get the statistics for all placement groups stuck in a specified state,
execute the following::

  ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>]

**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD
with the most up-to-date data to come up and in.

**Unclean** Placement groups contain objects that are not replicated the desired number
of times. They should be recovering.

**Stale** Placement groups are in an unknown state - the OSDs that host them have not
reported to the monitor cluster in a while (configured by ``mon_osd_report_timeout``).

Valid formats are ``plain`` (default) and ``json``. The threshold defines the minimum number
of seconds the placement group is stuck before it is included in the returned statistics
(default 300 seconds).


Get a PG Map
============

To get the placement group map for a particular placement group, execute the following::

  ceph pg map {pg-id}

For example::

  ceph pg map 1.6c

Ceph will return the placement group map, the placement group, and the OSD status::

  osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]


Get a PG's Statistics
======================

To retrieve statistics for a particular placement group, execute the following::

  ceph pg {pg-id} query


Scrub a Placement Group
=======================

To scrub a placement group, execute the following::

  ceph pg scrub {pg-id}

Ceph checks the primary and any replica nodes, generates a catalog of all objects
in the placement group and compares them to ensure that no objects are missing
or mismatched and that their contents are consistent. Assuming the replicas all
match, a final semantic sweep ensures that all of the snapshot-related object
metadata is consistent. Errors are reported via logs.

To scrub all placement groups from a specific pool, execute the following::

  ceph osd pool scrub {pool-name}

Prioritize backfill/recovery of Placement Group(s)
=====================================================

You may run into a situation where a number of placement groups require
recovery and/or backfill, and some particular groups hold data more important
than others (for example, those PGs may hold data for images used by running
machines while other PGs may be used by inactive machines or hold less
relevant data). In that case, you may want to prioritize recovery of those
groups so performance and/or availability of data stored on those groups is
restored earlier. To do this (mark particular placement group(s) as
prioritized during backfill or recovery), execute the following::

  ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
  ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

This will cause Ceph to perform recovery or backfill on the specified placement
groups first, before other placement groups. This does not interrupt currently
ongoing backfills or recovery, but causes specified PGs to be processed
as soon as possible. If you change your mind or prioritize the wrong groups,
use::

  ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
  ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

This will remove the "force" flag from those PGs and they will be processed
in the default order. Again, this doesn't affect placement groups currently
being processed, only those that are still queued.

The "force" flag is cleared automatically after recovery or backfill of the
group is done.

Similarly, you may use the following commands to force Ceph to perform recovery
or backfill on all placement groups from a specified pool first::

  ceph osd pool force-recovery {pool-name}
  ceph osd pool force-backfill {pool-name}

or::

  ceph osd pool cancel-force-recovery {pool-name}
  ceph osd pool cancel-force-backfill {pool-name}

to restore the default recovery or backfill priority if you change your mind.

Note that these commands could possibly break the ordering of Ceph's internal
priority computations, so use them with caution!
In particular, if you have multiple pools that are currently sharing the same
underlying OSDs, and some particular pools hold data more important than others,
we recommend you use the following command to re-arrange all pools'
recovery/backfill priority in a better order::

  ceph osd pool set {pool-name} recovery_priority {value}

For example, if you have 10 pools you could make the most important one
priority 10, the next 9, etc. Or you could leave most pools alone and have,
say, 3 important pools all at priority 1, or at priorities 3, 2, and 1
respectively.

Revert Lost
===========

If the cluster has lost one or more objects, and you have decided to
abandon the search for the lost data, you must mark the unfound objects
as ``lost``.

If all possible locations have been queried and objects are still
lost, you may have to give up on the lost objects. This is
possible given unusual combinations of failures that allow the cluster
to learn about writes that were performed before the writes themselves
were recovered.

Currently the only supported option is "revert", which will either roll back to
a previous version of the object or (if it was a new object) forget about it
entirely. To mark the "unfound" objects as "lost", execute the following::

  ceph pg {pg-id} mark_unfound_lost revert|delete

.. important:: Use this feature with caution, because it may confuse
   applications that expect the object(s) to exist.


.. toctree::
   :hidden:

   pg-states
   pg-concepts


.. _Create a Pool: ../pools#createpool
.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
.. _pgcalc: http://ceph.com/pgcalc/