1 .. _placement groups:
2
3 ==================
4 Placement Groups
5 ==================
6
7 .. _pg-autoscaler:
8
9 Autoscaling placement groups
10 ============================
11
12 Placement groups (PGs) are an internal implementation detail of how
13 Ceph distributes data. You may enable *pg-autoscaling* to allow the cluster to
14 make recommendations or automatically adjust the numbers of PGs (``pgp_num``)
15 for each pool based on expected cluster and pool utilization.
16
17 Each pool has a ``pg_autoscale_mode`` property that can be set to ``off``, ``on``, or ``warn``.
18
19 * ``off``: Disable autoscaling for this pool. It is up to the administrator to choose an appropriate ``pgp_num`` for each pool. Please refer to :ref:`choosing-number-of-placement-groups` for more information.
20 * ``on``: Enable automated adjustments of the PG count for the given pool.
* ``warn``: Raise health alerts when the PG count should be adjusted.
22
23 To set the autoscaling mode for an existing pool::
24
25 ceph osd pool set <pool-name> pg_autoscale_mode <mode>
26
27 For example to enable autoscaling on pool ``foo``::
28
29 ceph osd pool set foo pg_autoscale_mode on
30
31 You can also configure the default ``pg_autoscale_mode`` that is
32 set on any pools that are subsequently created::
33
34 ceph config set global osd_pool_default_pg_autoscale_mode <mode>
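
For example, to have autoscaling enabled by default on any pools created
afterwards (the choice of ``on`` here is only illustrative)::

    ceph config set global osd_pool_default_pg_autoscale_mode on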
35
You can disable or enable the autoscaler for all pools with
the ``noautoscale`` flag. By default this flag is set to ``off``,
but you can turn it ``on`` with the following command::
39
40 ceph osd pool set noautoscale
41
42 You can turn it ``off`` using the command::
43
44 ceph osd pool unset noautoscale
45
46 To ``get`` the value of the flag use the command::
47
48 ceph osd pool get noautoscale
49
50 Viewing PG scaling recommendations
51 ----------------------------------
52
53 You can view each pool, its relative utilization, and any suggested changes to
54 the PG count with this command::
55
56 ceph osd pool autoscale-status
57
58 Output will be something like::
59
60 POOL SIZE TARGET SIZE RATE RAW CAPACITY RATIO TARGET RATIO EFFECTIVE RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE BULK
61 a 12900M 3.0 82431M 0.4695 8 128 warn True
62 c 0 3.0 82431M 0.0000 0.2000 0.9884 1.0 1 64 warn True
63 b 0 953.6M 3.0 82431M 0.0347 8 warn False
64
65 **SIZE** is the amount of data stored in the pool. **TARGET SIZE**, if
66 present, is the amount of data the administrator has specified that
67 they expect to eventually be stored in this pool. The system uses
68 the larger of the two values for its calculation.
69
**RATE** is the multiplier for the pool that determines how much raw
storage capacity is consumed. For example, a 3-replica pool will
have a rate of 3.0, while a k=4, m=2 erasure-coded pool will have a
rate of 1.5.
74
75 **RAW CAPACITY** is the total amount of raw storage capacity on the
76 OSDs that are responsible for storing this pool's (and perhaps other
77 pools') data. **RATIO** is the ratio of that total capacity that
78 this pool is consuming (i.e., ratio = size * rate / raw capacity).
79
80 **TARGET RATIO**, if present, is the ratio of storage that the
81 administrator has specified that they expect this pool to consume
82 relative to other pools with target ratios set.
83 If both target size bytes and ratio are specified, the
84 ratio takes precedence.
85
86 **EFFECTIVE RATIO** is the target ratio after adjusting in two ways:
87
88 1. Subtracting any capacity expected to be used by pools with target size set
89 2. Normalizing the target ratios among pools with target ratio set so
90 they collectively target the rest of the space. For example, 4
91 pools with target_ratio 1.0 would have an effective ratio of 0.25.
92
93 The system uses the larger of the actual ratio and the effective ratio
94 for its calculation.
95
**BIAS** is used as a multiplier to manually adjust a pool's PG count based
on prior information about how many PGs a specific pool is expected
to have.
99
100 **PG_NUM** is the current number of PGs for the pool (or the current
101 number of PGs that the pool is working towards, if a ``pg_num``
102 change is in progress). **NEW PG_NUM**, if present, is what the
103 system believes the pool's ``pg_num`` should be changed to. It is
104 always a power of 2, and will only be present if the "ideal" value
differs from the current value by more than a factor of 3 by default.
This factor can be adjusted with::
107
108 ceph osd pool set threshold 2.0
109
**AUTOSCALE** is the pool's ``pg_autoscale_mode``
and will be either ``on``, ``off``, or ``warn``.
112
The final column, **BULK**, indicates whether the pool is ``bulk``
and will be either ``True`` or ``False``. A ``bulk`` pool is
expected to be large and should therefore start out with a large
number of PGs for performance reasons. By contrast, pools without
the ``bulk`` flag are expected to be small, e.g., the ``.mgr`` pool
or metadata pools.
119
120
121 Automated scaling
122 -----------------
123
124 Allowing the cluster to automatically scale ``pgp_num`` based on usage is the
125 simplest approach. Ceph will look at the total available storage and
126 target number of PGs for the whole system, look at how much data is
127 stored in each pool, and try to apportion PGs accordingly. The
128 system is relatively conservative with its approach, only making
129 changes to a pool when the current number of PGs (``pg_num``) is more
130 than a factor of 3 off from what it thinks it should be.
131
132 The target number of PGs per OSD is based on the
133 ``mon_target_pg_per_osd`` configurable (default: 100), which can be
134 adjusted with::
135
136 ceph config set global mon_target_pg_per_osd 100
137
138 The autoscaler analyzes pools and adjusts on a per-subtree basis.
139 Because each pool may map to a different CRUSH rule, and each rule may
140 distribute data across different devices, Ceph will consider
141 utilization of each subtree of the hierarchy independently. For
142 example, a pool that maps to OSDs of class `ssd` and a pool that maps
143 to OSDs of class `hdd` will each have optimal PG counts that depend on
144 the number of those respective device types.
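
For example, a pool can be pinned to the ``ssd`` device class by giving it
its own CRUSH rule; the rule and pool names below are illustrative, and
``fastpool`` is assumed to already exist::

    ceph osd crush rule create-replicated fast-rule default host ssd
    ceph osd pool set fastpool crush_rule fast-rule

The autoscaler will then size ``fastpool`` against the raw capacity of the
``ssd`` OSDs only, independently of pools that map to ``hdd`` OSDs.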
145
The autoscaler uses the `bulk` flag to determine which pools
should start out with a full complement of PGs; such pools are
scaled down only when usage across the pool is uneven.
However, if a pool does not have the `bulk` flag, it will
start out with a minimal number of PGs and will be given more
PGs only when usage in the pool increases.
151
152 The autoscaler identifies any overlapping roots and prevents the pools
153 with such roots from scaling because overlapping roots can cause problems
154 with the scaling process.
155
To create a pool with the `bulk` flag::
157
158 ceph osd pool create <pool-name> --bulk
159
To set or unset the `bulk` flag of an existing pool::
161
162 ceph osd pool set <pool-name> bulk <true/false/1/0>
163
To get the `bulk` flag of an existing pool::
165
166 ceph osd pool get <pool-name> bulk
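
For example, to create a hypothetical pool named ``data_bulk`` with the
`bulk` flag set, and later clear the flag again::

    ceph osd pool create data_bulk --bulk
    ceph osd pool set data_bulk bulk false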
167
168 .. _specifying_pool_target_size:
169
170 Specifying expected pool size
171 -----------------------------
172
173 When a cluster or pool is first created, it will consume a small
174 fraction of the total cluster capacity and will appear to the system
175 as if it should only need a small number of placement groups.
176 However, in most cases cluster administrators have a good idea which
177 pools are expected to consume most of the system capacity over time.
178 By providing this information to Ceph, a more appropriate number of
179 PGs can be used from the beginning, preventing subsequent changes in
180 ``pg_num`` and the overhead associated with moving data around when
181 those adjustments are made.
182
183 The *target size* of a pool can be specified in two ways: either in
184 terms of the absolute size of the pool (i.e., bytes), or as a weight
185 relative to other pools with a ``target_size_ratio`` set.
186
187 For example::
188
189 ceph osd pool set mypool target_size_bytes 100T
190
191 will tell the system that `mypool` is expected to consume 100 TiB of
192 space. Alternatively::
193
194 ceph osd pool set mypool target_size_ratio 1.0
195
196 will tell the system that `mypool` is expected to consume 1.0 relative
197 to the other pools with ``target_size_ratio`` set. If `mypool` is the
198 only pool in the cluster, this means an expected use of 100% of the
199 total capacity. If there is a second pool with ``target_size_ratio``
200 1.0, both pools would expect to use 50% of the cluster capacity.
201
202 You can also set the target size of a pool at creation time with the optional ``--target-size-bytes <bytes>`` or ``--target-size-ratio <ratio>`` arguments to the ``ceph osd pool create`` command.
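
For example, the following would create a hypothetical pool ``mypool``
whose expected eventual size is 100 TiB (the same value used in the
example above)::

    ceph osd pool create mypool --target-size-bytes 100T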
203
204 Note that if impossible target size values are specified (for example,
205 a capacity larger than the total cluster) then a health warning
206 (``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised.
207
208 If both ``target_size_ratio`` and ``target_size_bytes`` are specified
209 for a pool, only the ratio will be considered, and a health warning
210 (``POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO``) will be issued.
211
212 Specifying bounds on a pool's PGs
213 ---------------------------------
214
215 It is also possible to specify a minimum number of PGs for a pool.
216 This is useful for establishing a lower bound on the amount of
parallelism clients will see when doing IO, even when a pool is mostly
218 empty. Setting the lower bound prevents Ceph from reducing (or
219 recommending you reduce) the PG number below the configured number.
220
221 You can set the minimum or maximum number of PGs for a pool with::
222
223 ceph osd pool set <pool-name> pg_num_min <num>
224 ceph osd pool set <pool-name> pg_num_max <num>
225
226 You can also specify the minimum or maximum PG count at pool creation
227 time with the optional ``--pg-num-min <num>`` or ``--pg-num-max
228 <num>`` arguments to the ``ceph osd pool create`` command.
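
For example, to create a hypothetical pool that should always have at
least 32 PGs but never more than 256::

    ceph osd pool create mypool --pg-num-min 32 --pg-num-max 256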
229
230 .. _preselection:
231
232 A preselection of pg_num
233 ========================
234
235 When creating a new pool with::
236
237 ceph osd pool create {pool-name} [pg_num]
238
239 it is optional to choose the value of ``pg_num``. If you do not
240 specify ``pg_num``, the cluster can (by default) automatically tune it
241 for you based on how much data is stored in the pool (see above, :ref:`pg-autoscaler`).
242
243 Alternatively, ``pg_num`` can be explicitly provided. However,
244 whether you specify a ``pg_num`` value or not does not affect whether
245 the value is automatically tuned by the cluster after the fact. To
246 enable or disable auto-tuning::
247
248 ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn)
249
The "rule of thumb" for PGs per OSD has traditionally been 100. With
the addition of the balancer (which is also enabled by default), a
value of more like 50 PGs per OSD is probably reasonable. The
challenge (which the autoscaler normally handles for you) is to:

- have the PGs per pool proportional to the data in the pool, and
- end up with 50-100 PGs per OSD, after the replication or
  erasure-coding fan-out of each PG across OSDs is taken into
  consideration (see the worked example after this list)
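
For example, with 10 OSDs, a target of 100 PGs per OSD, and a single
3-replica pool, the total PG budget is roughly
:math:`\frac{10 \times 100}{3} \approx 333`, which rounds up to the
nearest power of two, 512. With several pools, that budget is instead
split among them in proportion to their expected data.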
259
How are Placement Groups used?
261 ===============================
262
263 A placement group (PG) aggregates objects within a pool because
264 tracking object placement and object metadata on a per-object basis is
265 computationally expensive--i.e., a system with millions of objects
266 cannot realistically track placement on a per-object basis.
267
268 .. ditaa::
269 /-----\ /-----\ /-----\ /-----\ /-----\
270 | obj | | obj | | obj | | obj | | obj |
271 \-----/ \-----/ \-----/ \-----/ \-----/
272 | | | | |
273 +--------+--------+ +---+----+
274 | |
275 v v
276 +-----------------------+ +-----------------------+
277 | Placement Group #1 | | Placement Group #2 |
278 | | | |
279 +-----------------------+ +-----------------------+
280 | |
281 +------------------------------+
282 |
283 v
284 +-----------------------+
285 | Pool |
286 | |
287 +-----------------------+
288
289 The Ceph client will calculate which placement group an object should
290 be in. It does this by hashing the object ID and applying an operation
291 based on the number of PGs in the defined pool and the ID of the pool.
292 See `Mapping PGs to OSDs`_ for details.
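
You can ask the cluster to show this calculation for any object name with
the ``ceph osd map`` command (the pool and object names here are
illustrative; the object does not need to exist for the mapping to be
calculated)::

    ceph osd map mypool myobject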
293
294 The object's contents within a placement group are stored in a set of
295 OSDs. For instance, in a replicated pool of size two, each placement
296 group will store objects on two OSDs, as shown below.
297
298 .. ditaa::
299 +-----------------------+ +-----------------------+
300 | Placement Group #1 | | Placement Group #2 |
301 | | | |
302 +-----------------------+ +-----------------------+
303 | | | |
304 v v v v
305 /----------\ /----------\ /----------\ /----------\
306 | | | | | | | |
307 | OSD #1 | | OSD #2 | | OSD #2 | | OSD #3 |
308 | | | | | | | |
309 \----------/ \----------/ \----------/ \----------/
310
311
312 Should OSD #2 fail, another will be assigned to Placement Group #1 and
313 will be filled with copies of all objects in OSD #1. If the pool size
314 is changed from two to three, an additional OSD will be assigned to
315 the placement group and will receive copies of all objects in the
316 placement group.
317
318 Placement groups do not own the OSD; they share it with other
319 placement groups from the same pool or even other pools. If OSD #2
fails, Placement Group #2 will also have to restore copies of
321 objects, using OSD #3.
322
323 When the number of placement groups increases, the new placement
324 groups will be assigned OSDs. The result of the CRUSH function will
325 also change and some objects from the former placement groups will be
326 copied over to the new Placement Groups and removed from the old ones.
327
328 Placement Groups Tradeoffs
329 ==========================
330
Data durability and even distribution among all OSDs call for more
placement groups, but their number should be kept to the minimum
necessary in order to save CPU and memory.
334
335 .. _data durability:
336
337 Data durability
338 ---------------
339
340 After an OSD fails, the risk of data loss increases until the data it
341 contained is fully recovered. Let's imagine a scenario that causes
342 permanent data loss in a single placement group:
343
- An OSD fails, and the copies of the objects it held are lost.
  For all objects within the placement group, the number of replicas
  suddenly drops from three to two.
347
348 - Ceph starts recovery for this placement group by choosing a new OSD
349 to re-create the third copy of all objects.
350
351 - Another OSD, within the same placement group, fails before the new
352 OSD is fully populated with the third copy. Some objects will then
only have one surviving copy.
354
355 - Ceph picks yet another OSD and keeps copying objects to restore the
356 desired number of copies.
357
358 - A third OSD, within the same placement group, fails before recovery
359 is complete. If this OSD contained the only remaining copy of an
360 object, it is permanently lost.
361
In a cluster containing 10 OSDs with 512 placement groups in a three
replica pool, CRUSH will give each placement group three OSDs. In the
end, each OSD will end up hosting (512 * 3) / 10 = ~150 placement
groups. When the first OSD fails, the above scenario will therefore
start recovery for all 150 placement groups at the same time.
367
368 The 150 placement groups being recovered are likely to be
369 homogeneously spread over the 9 remaining OSDs. Each remaining OSD is
therefore likely to send copies of objects to all the others and also
to receive some new objects to be stored because it has become part of
new placement groups.
373
374 The amount of time it takes for this recovery to complete entirely
depends on the architecture of the Ceph cluster. Let's say each OSD is
hosted by a 1TB SSD on a single machine, all of them are connected
to a 10Gb/s switch, and the recovery for a single OSD completes within
M minutes. If there are two OSDs per machine using spinners with no
379 SSD journal and a 1Gb/s switch, it will at least be an order of
380 magnitude slower.
381
382 In a cluster of this size, the number of placement groups has almost
383 no influence on data durability. It could be 128 or 8192 and the
384 recovery would not be slower or faster.
385
386 However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs
387 is likely to speed up recovery and therefore improve data durability
388 significantly. Each OSD now participates in only ~75 placement groups
389 instead of ~150 when there were only 10 OSDs and it will still require
390 all 19 remaining OSDs to perform the same amount of object copies in
391 order to recover. But where 10 OSDs had to copy approximately 100GB
392 each, they now have to copy 50GB each instead. If the network was the
393 bottleneck, recovery will happen twice as fast. In other words,
394 recovery goes faster when the number of OSDs increases.
395
396 If this cluster grows to 40 OSDs, each of them will only host ~35
397 placement groups. If an OSD dies, recovery will keep going faster
398 unless it is blocked by another bottleneck. However, if this cluster
399 grows to 200 OSDs, each of them will only host ~7 placement groups. If
an OSD dies, recovery will involve at most ~21 (7 * 3) OSDs
401 in these placement groups: recovery will take longer than when there
402 were 40 OSDs, meaning the number of placement groups should be
403 increased.
404
405 No matter how short the recovery time is, there is a chance for a
second OSD to fail while recovery is in progress. In the 10-OSD cluster
described above, if any of them fails, then ~17 placement groups
(i.e. ~150 / 9 placement groups being recovered) will only have one
surviving copy. And if any of the 8 remaining OSDs fails, the last
objects of ~2 placement groups are likely to be lost (i.e. ~17 / 8
placement groups with only one remaining copy being recovered).
412
413 When the size of the cluster grows to 20 OSDs, the number of Placement
Groups damaged by the loss of three OSDs drops. The loss of a second
OSD will degrade only ~4 placement groups (i.e. ~75 / 19 placement
groups being recovered) instead of ~17, and the loss of a third OSD
will only lose data if it is one
417 of the four OSDs containing the surviving copy. In other words, if the
418 probability of losing one OSD is 0.0001% during the recovery time
419 frame, it goes from 17 * 10 * 0.0001% in the cluster with 10 OSDs to 4 * 20 *
420 0.0001% in the cluster with 20 OSDs.
421
422 In a nutshell, more OSDs mean faster recovery and a lower risk of
423 cascading failures leading to the permanent loss of a Placement
424 Group. Having 512 or 4096 Placement Groups is roughly equivalent in a
425 cluster with less than 50 OSDs as far as data durability is concerned.
426
427 Note: It may take a long time for a new OSD added to the cluster to be
populated with the placement groups that were assigned to it. However,
no object is degraded during this process, and it has no impact on the
durability of the data contained in the cluster.
431
432 .. _object distribution:
433
434 Object distribution within a pool
435 ---------------------------------
436
437 Ideally objects are evenly distributed in each placement group. Since
438 CRUSH computes the placement group for each object, but does not
439 actually know how much data is stored in each OSD within this
440 placement group, the ratio between the number of placement groups and
441 the number of OSDs may influence the distribution of the data
442 significantly.
443
For instance, if there were a single placement group for ten OSDs in a
three-replica pool, only three OSDs would be used because CRUSH would
446 have no other choice. When more placement groups are available,
447 objects are more likely to be evenly spread among them. CRUSH also
448 makes every effort to evenly spread OSDs among all existing Placement
449 Groups.
450
451 As long as there are one or two orders of magnitude more Placement
452 Groups than OSDs, the distribution should be even. For instance, 256
453 placement groups for 3 OSDs, 512 or 1024 placement groups for 10 OSDs
454 etc.
455
456 Uneven data distribution can be caused by factors other than the ratio
457 between OSDs and placement groups. Since CRUSH does not take into
458 account the size of the objects, a few very large objects may create
an imbalance. Let's say one million 4K objects totaling 4GB are evenly
460 spread among 1024 placement groups on 10 OSDs. They will use 4GB / 10
461 = 400MB on each OSD. If one 400MB object is added to the pool, the
462 three OSDs supporting the placement group in which the object has been
463 placed will be filled with 400MB + 400MB = 800MB while the seven
464 others will remain occupied with only 400MB.
465
466 .. _resource usage:
467
468 Memory, CPU and network usage
469 -----------------------------
470
471 For each placement group, OSDs and MONs need memory, network and CPU
472 at all times and even more during recovery. Sharing this overhead by
473 clustering objects within a placement group is one of the main reasons
474 they exist.
475
476 Minimizing the number of placement groups saves significant amounts of
477 resources.
478
479 .. _choosing-number-of-placement-groups:
480
481 Choosing the number of Placement Groups
482 =======================================
483
.. note:: It is rarely necessary to do this math by hand. Instead, use the ``ceph osd pool autoscale-status`` command in combination with the ``target_size_bytes`` or ``target_size_ratio`` pool properties. See :ref:`pg-autoscaler` for more information.
485
486 If you have more than 50 OSDs, we recommend approximately 50-100
487 placement groups per OSD to balance out resource usage, data
durability and distribution. If you have fewer than 50 OSDs, choosing
among the `preselection`_ above is best. For a single pool of objects,
you can use the following formula to get a baseline:
491
492 Total PGs = :math:`\frac{OSDs \times 100}{pool \: size}`
493
494 Where **pool size** is either the number of replicas for replicated
495 pools or the K+M sum for erasure coded pools (as returned by **ceph
496 osd erasure-code-profile get**).
497
498 You should then check if the result makes sense with the way you
499 designed your Ceph cluster to maximize `data durability`_,
500 `object distribution`_ and minimize `resource usage`_.
501
502 The result should always be **rounded up to the nearest power of two**.
503
504 Only a power of two will evenly balance the number of objects among
505 placement groups. Other values will result in an uneven distribution of
506 data across your OSDs. Their use should be limited to incrementally
507 stepping from one power of two to another.
508
509 As an example, for a cluster with 200 OSDs and a pool size of 3
replicas, you would estimate your number of PGs as follows:
511
512 :math:`\frac{200 \times 100}{3} = 6667`. Nearest power of 2: 8192
513
514 When using multiple data pools for storing objects, you need to ensure
515 that you balance the number of placement groups per pool with the
516 number of placement groups per OSD so that you arrive at a reasonable
517 total number of placement groups that provides reasonably low variance
518 per OSD without taxing system resources or making the peering process
519 too slow.
520
For instance, a cluster of 10 pools, each with 512 placement groups, on
ten OSDs has a total of 5,120 placement groups spread over ten OSDs,
that is, 512 placement groups per OSD. That does not use too many
resources. However, if 1,000 pools were created with 512 placement
groups each, the OSDs would handle ~50,000 placement groups each, and
this would require significantly more resources and time for peering.
527
528 You may find the `PGCalc`_ tool helpful.
529
530
531 .. _setting the number of placement groups:
532
533 Set the Number of Placement Groups
534 ==================================
535
536 To set the number of placement groups in a pool, you must specify the
537 number of placement groups at the time you create the pool.
See `Create a Pool`_ for details. Even after a pool is created, you can change the number of placement groups with::
539
540 ceph osd pool set {pool-name} pg_num {pg_num}
541
542 After you increase the number of placement groups, you must also
543 increase the number of placement groups for placement (``pgp_num``)
544 before your cluster will rebalance. The ``pgp_num`` will be the number of
545 placement groups that will be considered for placement by the CRUSH
algorithm. Increasing ``pg_num`` splits the placement groups, but data
will not be migrated to the newer placement groups until the number of
placement groups for placement, i.e. ``pgp_num``, is increased. The
``pgp_num`` should be equal to the ``pg_num``. To increase the number of
550 placement groups for placement, execute the following::
551
552 ceph osd pool set {pool-name} pgp_num {pgp_num}
553
554 When decreasing the number of PGs, ``pgp_num`` is adjusted
555 automatically for you.
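
For example, to grow a hypothetical pool named ``mypool`` to 128
placement groups and have the data rebalance into them::

    ceph osd pool set mypool pg_num 128
    ceph osd pool set mypool pgp_num 128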
556
557 Get the Number of Placement Groups
558 ==================================
559
560 To get the number of placement groups in a pool, execute the following::
561
562 ceph osd pool get {pool-name} pg_num
563
564
565 Get a Cluster's PG Statistics
566 =============================
567
568 To get the statistics for the placement groups in your cluster, execute the following::
569
570 ceph pg dump [--format {format}]
571
572 Valid formats are ``plain`` (default) and ``json``.
573
574
575 Get Statistics for Stuck PGs
576 ============================
577
578 To get the statistics for all placement groups stuck in a specified state,
579 execute the following::
580
581 ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>]
582
583 **Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD
584 with the most up-to-date data to come up and in.
585
586 **Unclean** Placement groups contain objects that are not replicated the desired number
587 of times. They should be recovering.
588
589 **Stale** Placement groups are in an unknown state - the OSDs that host them have not
590 reported to the monitor cluster in a while (configured by ``mon_osd_report_timeout``).
591
592 Valid formats are ``plain`` (default) and ``json``. The threshold defines the minimum number
593 of seconds the placement group is stuck before including it in the returned statistics
594 (default 300 seconds).
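
For example, to list placement groups that have been stuck in the
``stale`` state for at least ten minutes, formatted as JSON::

    ceph pg dump_stuck stale --format json --threshold 600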
595
596
597 Get a PG Map
598 ============
599
600 To get the placement group map for a particular placement group, execute the following::
601
602 ceph pg map {pg-id}
603
604 For example::
605
606 ceph pg map 1.6c
607
608 Ceph will return the placement group map, the placement group, and the OSD status::
609
610 osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]
611
612
Get a PG's Statistics
=====================
615
616 To retrieve statistics for a particular placement group, execute the following::
617
618 ceph pg {pg-id} query
619
620
621 Scrub a Placement Group
622 =======================
623
624 To scrub a placement group, execute the following::
625
626 ceph pg scrub {pg-id}
627
628 Ceph checks the primary and any replica nodes, generates a catalog of all objects
629 in the placement group and compares them to ensure that no objects are missing
or mismatched and that their contents are consistent. Assuming the replicas all
631 match, a final semantic sweep ensures that all of the snapshot-related object
632 metadata is consistent. Errors are reported via logs.
633
634 To scrub all placement groups from a specific pool, execute the following::
635
636 ceph osd pool scrub {pool-name}
637
Prioritize backfill/recovery of Placement Group(s)
639 ====================================================
640
You may run into a situation where a number of placement groups require
recovery and/or backfill, and some particular groups hold data more important
than others (for example, some PGs may hold data for images used by running
machines while other PGs may be used by inactive machines or hold less
relevant data). In that case, you may want to prioritize recovery of those
groups so that performance and/or availability of the data stored on them is
restored earlier. To do this (that is, to mark particular placement groups as
prioritized during backfill or recovery), execute the following::
649
650 ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
651 ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]
652
This will cause Ceph to perform recovery or backfill on the specified placement
groups first, before other placement groups. This does not interrupt currently
ongoing backfills or recovery, but causes the specified PGs to be processed
as soon as possible. If you change your mind or realize that you prioritized
the wrong groups, use::
658
659 ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
660 ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]
661
This will remove the "force" flag from those PGs, and they will be processed
in the default order. Again, this doesn't affect placement groups that are
currently being processed, only those that are still queued.
665
The "force" flag is cleared automatically after recovery or backfill of the
group is done.
668
669 Similarly, you may use the following commands to force Ceph to perform recovery
670 or backfill on all placement groups from a specified pool first::
671
672 ceph osd pool force-recovery {pool-name}
673 ceph osd pool force-backfill {pool-name}
674
675 or::
676
677 ceph osd pool cancel-force-recovery {pool-name}
678 ceph osd pool cancel-force-backfill {pool-name}
679
680 to restore to the default recovery or backfill priority if you change your mind.
681
682 Note that these commands could possibly break the ordering of Ceph's internal
683 priority computations, so use them with caution!
In particular, if you have multiple pools that are currently sharing the same
underlying OSDs, and some particular pools hold data more important than others,
we recommend that you use the following command to re-arrange all pools'
recovery/backfill priorities in a better order::
688
689 ceph osd pool set {pool-name} recovery_priority {value}
690
For example, if you have 10 pools, you could make the most important one
priority 10, the next 9, and so on. Or you could leave most pools alone and
give, say, 3 important pools priorities of 3, 2, and 1 respectively (or give
all three a priority of 1).
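
For example, with three hypothetical pools, the most important one could be
given the highest priority::

    ceph osd pool set images recovery_priority 3
    ceph osd pool set vms recovery_priority 2
    ceph osd pool set scratch recovery_priority 1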
694
695 Revert Lost
696 ===========
697
698 If the cluster has lost one or more objects, and you have decided to
699 abandon the search for the lost data, you must mark the unfound objects
700 as ``lost``.
701
702 If all possible locations have been queried and objects are still
703 lost, you may have to give up on the lost objects. This is
704 possible given unusual combinations of failures that allow the cluster
to learn about writes that were performed before the writes themselves
were recovered.
707
708 Currently the only supported option is "revert", which will either roll back to
709 a previous version of the object or (if it was a new object) forget about it
710 entirely. To mark the "unfound" objects as "lost", execute the following::
711
712 ceph pg {pg-id} mark_unfound_lost revert|delete
713
714 .. important:: Use this feature with caution, because it may confuse
715 applications that expect the object(s) to exist.
716
717
718 .. toctree::
719 :hidden:
720
721 pg-states
722 pg-concepts
723
724
725 .. _Create a Pool: ../pools#createpool
726 .. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
727 .. _pgcalc: https://old.ceph.com/pgcalc/