1 .. _placement groups:
2
3 ==================
4 Placement Groups
5 ==================
6
7 .. _pg-autoscaler:
8
9 Autoscaling placement groups
10 ============================
11
12 Placement groups (PGs) are an internal implementation detail of how Ceph
13 distributes data. Autoscaling provides a way to manage PGs, and especially to
14 manage the number of PGs present in different pools. When *pg-autoscaling* is
15 enabled, the cluster is allowed to make recommendations or automatic
16 adjustments with respect to the number of PGs for each pool (``pgp_num``) in
17 accordance with expected cluster utilization and expected pool utilization.
18
19 Each pool has a ``pg_autoscale_mode`` property that can be set to ``off``,
20 ``on``, or ``warn``:
21
22 * ``off``: Disable autoscaling for this pool. It is up to the administrator to
23 choose an appropriate ``pgp_num`` for each pool. For more information, see
24 :ref:`choosing-number-of-placement-groups`.
25 * ``on``: Enable automated adjustments of the PG count for the given pool.
26 * ``warn``: Raise health checks when the PG count is in need of adjustment.
27
28 To set the autoscaling mode for an existing pool, run a command of the
29 following form:
30
31 .. prompt:: bash #
32
33 ceph osd pool set <pool-name> pg_autoscale_mode <mode>
34
35 For example, to enable autoscaling on pool ``foo``, run the following command:
36
37 .. prompt:: bash #
38
39 ceph osd pool set foo pg_autoscale_mode on
40
There is also a default ``pg_autoscale_mode`` setting that applies to any pools
created after the initial setup of the cluster. To change this setting, run a
command of the following form:
44
45 .. prompt:: bash #
46
47 ceph config set global osd_pool_default_pg_autoscale_mode <mode>
48
49 You can disable or enable the autoscaler for all pools with the ``noautoscale``
50 flag. By default, this flag is set to ``off``, but you can set it to ``on`` by
51 running the following command:
52
53 .. prompt:: bash #
54
55 ceph osd pool set noautoscale
56
57 To set the ``noautoscale`` flag to ``off``, run the following command:
58
59 .. prompt:: bash #
60
61 ceph osd pool unset noautoscale
62
63 To get the value of the flag, run the following command:
64
65 .. prompt:: bash #
66
67 ceph osd pool get noautoscale
68
69 Viewing PG scaling recommendations
70 ----------------------------------
71
72 To view each pool, its relative utilization, and any recommended changes to the
73 PG count, run the following command:
74
75 .. prompt:: bash #
76
77 ceph osd pool autoscale-status
78
79 The output will resemble the following::
80
   POOL  SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
   a     12900M               3.0   82431M        0.4695                                       8       128         warn       True
   c     0                    3.0   82431M        0.0000  0.2000        0.9884           1.0   1       64          warn       True
   b     0       953.6M       3.0   82431M        0.0347                                       8                   warn       False
85
86 - **POOL** is the name of the pool.
87
88 - **SIZE** is the amount of data stored in the pool.
89
90 - **TARGET SIZE** (if present) is the amount of data that is expected to be
91 stored in the pool, as specified by the administrator. The system uses the
92 greater of the two values for its calculation.
93
94 - **RATE** is the multiplier for the pool that determines how much raw storage
95 capacity is consumed. For example, a three-replica pool will have a ratio of
96 3.0, and a ``k=4 m=2`` erasure-coded pool will have a ratio of 1.5.
97
98 - **RAW CAPACITY** is the total amount of raw storage capacity on the specific
99 OSDs that are responsible for storing the data of the pool (and perhaps the
100 data of other pools).
101
- **RATIO** is the ratio of (1) the storage consumed by the pool to (2) the
  total raw storage capacity. In other words, RATIO is defined as
  (SIZE * RATE) / RAW CAPACITY. For example, pool ``a`` in the output above has
  a ratio of (12900M * 3.0) / 82431M = 0.4695.
105
106 - **TARGET RATIO** (if present) is the ratio of the expected storage of this
107 pool (that is, the amount of storage that this pool is expected to consume,
108 as specified by the administrator) to the expected storage of all other pools
109 that have target ratios set. If both ``target_size_bytes`` and
110 ``target_size_ratio`` are specified, then ``target_size_ratio`` takes
111 precedence.
112
113 - **EFFECTIVE RATIO** is the result of making two adjustments to the target
114 ratio:
115
116 #. Subtracting any capacity expected to be used by pools that have target
117 size set.
118
119 #. Normalizing the target ratios among pools that have target ratio set so
120 that collectively they target cluster capacity. For example, four pools
121 with target_ratio 1.0 would have an effective ratio of 0.25.
122
123 The system's calculations use whichever of these two ratios (that is, the
124 target ratio and the effective ratio) is greater.
125
- **BIAS** is used as a multiplier to manually adjust a pool's PG count in
  accordance with prior information about how many PGs a specific pool is
  expected to have.
129
130 - **PG_NUM** is either the current number of PGs associated with the pool or,
131 if a ``pg_num`` change is in progress, the current number of PGs that the
132 pool is working towards.
133
134 - **NEW PG_NUM** (if present) is the value that the system is recommending the
135 ``pg_num`` of the pool to be changed to. It is always a power of 2, and it is
136 present only if the recommended value varies from the current value by more
137 than the default factor of ``3``. To adjust this factor (in the following
138 example, it is changed to ``2``), run the following command:
139
140 .. prompt:: bash #
141
     ceph config set mgr mgr/pg_autoscaler/threshold 2.0
143
144 - **AUTOSCALE** is the pool's ``pg_autoscale_mode`` and is set to ``on``,
145 ``off``, or ``warn``.
146
- **BULK** determines whether the pool is ``bulk``. It has a value of ``True``
  or ``False``. A ``bulk`` pool is expected to be large and should initially
  have a large number of PGs so that performance does not suffer. On the other
  hand, a pool that is not ``bulk`` is expected to be small (for example, a
  ``.mgr`` pool or a metadata pool).
152
153 .. note::
154
155 If the ``ceph osd pool autoscale-status`` command returns no output at all,
156 there is probably at least one pool that spans multiple CRUSH roots. This
157 'spanning pool' issue can happen in scenarios like the following:
158 when a new deployment auto-creates the ``.mgr`` pool on the ``default``
159 CRUSH root, subsequent pools are created with rules that constrain them to a
160 specific shadow CRUSH tree. For example, if you create an RBD metadata pool
161 that is constrained to ``deviceclass = ssd`` and an RBD data pool that is
162 constrained to ``deviceclass = hdd``, you will encounter this issue. To
163 remedy this issue, constrain the spanning pool to only one device class. In
164 the above scenario, there is likely to be a ``replicated-ssd`` CRUSH rule in
165 effect, and the ``.mgr`` pool can be constrained to ``ssd`` devices by
   running the following command:
167
168 .. prompt:: bash #
169
      ceph osd pool set .mgr crush_rule replicated-ssd
172
173 This intervention will result in a small amount of backfill, but
174 typically this traffic completes quickly.
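
   One way to identify a spanning pool is to check which CRUSH rule each pool
   currently uses; for example:

   .. prompt:: bash #

      ceph osd pool ls detail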
175
176
177 Automated scaling
178 -----------------
179
180 In the simplest approach to automated scaling, the cluster is allowed to
181 automatically scale ``pgp_num`` in accordance with usage. Ceph considers the
182 total available storage and the target number of PGs for the whole system,
183 considers how much data is stored in each pool, and apportions PGs accordingly.
184 The system is conservative with its approach, making changes to a pool only
185 when the current number of PGs (``pg_num``) varies by more than a factor of 3
186 from the recommended number.
187
188 The target number of PGs per OSD is determined by the ``mon_target_pg_per_osd``
189 parameter (default: 100), which can be adjusted by running the following
190 command:
191
192 .. prompt:: bash #
193
194 ceph config set global mon_target_pg_per_osd 100
195
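To confirm the value that is currently in effect for the autoscaler, you can
query the configuration database; for example:

.. prompt:: bash #

   ceph config get mgr mon_target_pg_per_osd
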
196 The autoscaler analyzes pools and adjusts on a per-subtree basis. Because each
197 pool might map to a different CRUSH rule, and each rule might distribute data
198 across different devices, Ceph will consider the utilization of each subtree of
199 the hierarchy independently. For example, a pool that maps to OSDs of class
200 ``ssd`` and a pool that maps to OSDs of class ``hdd`` will each have optimal PG
201 counts that are determined by how many of these two different device types
202 there are.
203
204 If a pool uses OSDs under two or more CRUSH roots (for example, shadow trees
205 with both ``ssd`` and ``hdd`` devices), the autoscaler issues a warning to the
206 user in the manager log. The warning states the name of the pool and the set of
207 roots that overlap each other. The autoscaler does not scale any pools with
208 overlapping roots because this condition can cause problems with the scaling
209 process. We recommend constraining each pool so that it belongs to only one
210 root (that is, one OSD class) to silence the warning and ensure a successful
211 scaling process.
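
For example, to constrain a pool to a single device class, you can create a
replicated CRUSH rule for that class and assign it to the pool. The rule name
``replicated-hdd`` and the pool name ``mypool`` below are only illustrative:

.. prompt:: bash #

   ceph osd crush rule create-replicated replicated-hdd default host hdd
   ceph osd pool set mypool crush_rule replicated-hdd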
212
213 .. _managing_bulk_flagged_pools:
214
215 Managing pools that are flagged with ``bulk``
216 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
217
218 If a pool is flagged ``bulk``, then the autoscaler starts the pool with a full
219 complement of PGs and then scales down the number of PGs only if the usage
220 ratio across the pool is uneven. However, if a pool is not flagged ``bulk``,
221 then the autoscaler starts the pool with minimal PGs and creates additional PGs
222 only if there is more usage in the pool.
223
224 To create a pool that will be flagged ``bulk``, run the following command:
225
226 .. prompt:: bash #
227
228 ceph osd pool create <pool-name> --bulk
229
230 To set or unset the ``bulk`` flag of an existing pool, run the following
231 command:
232
233 .. prompt:: bash #
234
235 ceph osd pool set <pool-name> bulk <true/false/1/0>
236
237 To get the ``bulk`` flag of an existing pool, run the following command:
238
239 .. prompt:: bash #
240
241 ceph osd pool get <pool-name> bulk
242
243 .. _specifying_pool_target_size:
244
245 Specifying expected pool size
246 -----------------------------
247
248 When a cluster or pool is first created, it consumes only a small fraction of
249 the total cluster capacity and appears to the system as if it should need only
250 a small number of PGs. However, in some cases, cluster administrators know
251 which pools are likely to consume most of the system capacity in the long run.
252 When Ceph is provided with this information, a more appropriate number of PGs
253 can be used from the beginning, obviating subsequent changes in ``pg_num`` and
254 the associated overhead cost of relocating data.
255
256 The *target size* of a pool can be specified in two ways: either in relation to
257 the absolute size (in bytes) of the pool, or as a weight relative to all other
258 pools that have ``target_size_ratio`` set.
259
260 For example, to tell the system that ``mypool`` is expected to consume 100 TB,
261 run the following command:
262
263 .. prompt:: bash #
264
265 ceph osd pool set mypool target_size_bytes 100T
266
267 Alternatively, to tell the system that ``mypool`` is expected to consume a
268 ratio of 1.0 relative to other pools that have ``target_size_ratio`` set,
adjust the ``target_size_ratio`` setting of ``mypool`` by running the
270 following command:
271
272 .. prompt:: bash #
273
274 ceph osd pool set mypool target_size_ratio 1.0
275
If ``mypool`` is the only pool in the cluster, then it is expected to use 100% of
277 the total cluster capacity. However, if the cluster contains a second pool that
278 has ``target_size_ratio`` set to 1.0, then both pools are expected to use 50%
279 of the total cluster capacity.
280
281 The ``ceph osd pool create`` command has two command-line options that can be
282 used to set the target size of a pool at creation time: ``--target-size-bytes
283 <bytes>`` and ``--target-size-ratio <ratio>``.
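
For example, the target size ratio can also be set at pool-creation time (the
pool name ``mypool`` here is only illustrative):

.. prompt:: bash #

   ceph osd pool create mypool --target-size-ratio 1.0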
284
Note that if the target-size values that have been specified are impossible
(for example, a capacity larger than the total cluster capacity), then a health
check (``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised.
288
289 If both ``target_size_ratio`` and ``target_size_bytes`` are specified for a
290 pool, then the latter will be ignored, the former will be used in system
291 calculations, and a health check (``POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO``)
292 will be raised.
293
294 Specifying bounds on a pool's PGs
295 ---------------------------------
296
297 It is possible to specify both the minimum number and the maximum number of PGs
298 for a pool.
299
300 Setting a Minimum Number of PGs and a Maximum Number of PGs
301 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
302
303 If a minimum is set, then Ceph will not itself reduce (nor recommend that you
304 reduce) the number of PGs to a value below the configured value. Setting a
305 minimum serves to establish a lower bound on the amount of parallelism enjoyed
306 by a client during I/O, even if a pool is mostly empty.
307
308 If a maximum is set, then Ceph will not itself increase (or recommend that you
309 increase) the number of PGs to a value above the configured value.
310
311 To set the minimum number of PGs for a pool, run a command of the following
312 form:
313
314 .. prompt:: bash #
315
316 ceph osd pool set <pool-name> pg_num_min <num>
317
318 To set the maximum number of PGs for a pool, run a command of the following
319 form:
320
321 .. prompt:: bash #
322
323 ceph osd pool set <pool-name> pg_num_max <num>
324
325 In addition, the ``ceph osd pool create`` command has two command-line options
326 that can be used to specify the minimum or maximum PG count of a pool at
327 creation time: ``--pg-num-min <num>`` and ``--pg-num-max <num>``.
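
For example, a pool can be created with a lower bound of 32 PGs (the pool name
``mypool`` and the value ``32`` are only illustrative):

.. prompt:: bash #

   ceph osd pool create mypool --pg-num-min 32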
328
329 .. _preselection:
330
331 Preselecting pg_num
332 ===================
333
334 When creating a pool with the following command, you have the option to
335 preselect the value of the ``pg_num`` parameter:
336
337 .. prompt:: bash #
338
339 ceph osd pool create {pool-name} [pg_num]
340
341 If you opt not to specify ``pg_num`` in this command, the cluster uses the PG
342 autoscaler to automatically configure the parameter in accordance with the
343 amount of data that is stored in the pool (see :ref:`pg-autoscaler` above).
344
345 However, your decision of whether or not to specify ``pg_num`` at creation time
346 has no effect on whether the parameter will be automatically tuned by the
347 cluster afterwards. As seen above, autoscaling of PGs is enabled or disabled by
348 running a command of the following form:
349
350 .. prompt:: bash #
351
352 ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn)
353
354 Without the balancer, the suggested target is approximately 100 PG replicas on
355 each OSD. With the balancer, an initial target of 50 PG replicas on each OSD is
356 reasonable.
357
358 The autoscaler attempts to satisfy the following conditions:
359
360 - the number of PGs per OSD should be proportional to the amount of data in the
361 pool
362 - there should be 50-100 PGs per pool, taking into account the replication
363 overhead or erasure-coding fan-out of each PG's replicas across OSDs
364
365 Use of Placement Groups
366 =======================
367
368 A placement group aggregates objects within a pool. The tracking of RADOS
369 object placement and object metadata on a per-object basis is computationally
370 expensive. It would be infeasible for a system with millions of RADOS
371 objects to efficiently track placement on a per-object basis.
372
373 .. ditaa::
374 /-----\ /-----\ /-----\ /-----\ /-----\
375 | obj | | obj | | obj | | obj | | obj |
376 \-----/ \-----/ \-----/ \-----/ \-----/
377 | | | | |
378 +--------+--------+ +---+----+
379 | |
380 v v
381 +-----------------------+ +-----------------------+
382 | Placement Group #1 | | Placement Group #2 |
383 | | | |
384 +-----------------------+ +-----------------------+
385 | |
386 +------------------------------+
387 |
388 v
389 +-----------------------+
390 | Pool |
391 | |
392 +-----------------------+
393
394 The Ceph client calculates which PG a RADOS object should be in. As part of
395 this calculation, the client hashes the object ID and performs an operation
396 involving both the number of PGs in the specified pool and the pool ID. For
397 details, see `Mapping PGs to OSDs`_.
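
To see the result of this calculation for a particular object, you can ask the
cluster directly. For example (``mypool`` and ``myobject`` are illustrative
names, and the object does not need to exist):

.. prompt:: bash #

   ceph osd map mypool myobject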
398
399 The contents of a RADOS object belonging to a PG are stored in a set of OSDs.
400 For example, in a replicated pool of size two, each PG will store objects on
401 two OSDs, as shown below:
402
403 .. ditaa::
404 +-----------------------+ +-----------------------+
405 | Placement Group #1 | | Placement Group #2 |
406 | | | |
407 +-----------------------+ +-----------------------+
408 | | | |
409 v v v v
410 /----------\ /----------\ /----------\ /----------\
411 | | | | | | | |
412 | OSD #1 | | OSD #2 | | OSD #2 | | OSD #3 |
413 | | | | | | | |
414 \----------/ \----------/ \----------/ \----------/
415
416
417 If OSD #2 fails, another OSD will be assigned to Placement Group #1 and then
418 filled with copies of all objects in OSD #1. If the pool size is changed from
419 two to three, an additional OSD will be assigned to the PG and will receive
420 copies of all objects in the PG.
421
422 An OSD assigned to a PG is not owned exclusively by that PG; rather, the OSD is
423 shared with other PGs either from the same pool or from other pools. In our
424 example, OSD #2 is shared by Placement Group #1 and Placement Group #2. If OSD
425 #2 fails, then Placement Group #2 must restore copies of objects (by making use
426 of OSD #3).
427
428 When the number of PGs increases, several consequences ensue. The new PGs are
429 assigned OSDs. The result of the CRUSH function changes, which means that some
430 objects from the already-existing PGs are copied to the new PGs and removed
431 from the old ones.
432
433 Factors Relevant To Specifying pg_num
434 =====================================
435
436 On the one hand, the criteria of data durability and even distribution across
437 OSDs weigh in favor of a high number of PGs. On the other hand, the criteria of
438 saving CPU resources and minimizing memory usage weigh in favor of a low number
439 of PGs.
440
441 .. _data durability:
442
443 Data durability
444 ---------------
445
446 When an OSD fails, the risk of data loss is increased until replication of the
447 data it hosted is restored to the configured level. To illustrate this point,
448 let's imagine a scenario that results in permanent data loss in a single PG:
449
450 #. The OSD fails and all copies of the object that it contains are lost. For
451 each object within the PG, the number of its replicas suddenly drops from
452 three to two.
453
454 #. Ceph starts recovery for this PG by choosing a new OSD on which to re-create
455 the third copy of each object.
456
457 #. Another OSD within the same PG fails before the new OSD is fully populated
458 with the third copy. Some objects will then only have one surviving copy.
459
460 #. Ceph selects yet another OSD and continues copying objects in order to
461 restore the desired number of copies.
462
463 #. A third OSD within the same PG fails before recovery is complete. If this
464 OSD happened to contain the only remaining copy of an object, the object is
465 permanently lost.
466
In a cluster containing 10 OSDs with 512 PGs in a three-replica pool, CRUSH
will give each PG three OSDs. Ultimately, each OSD hosts
:math:`\frac{512 \times 3}{10} \approx 150` PGs. So when the first OSD fails in
the above scenario, recovery will begin for all 150 PGs at the same time.
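
To check how many PGs each OSD actually hosts, consult the ``PGS`` column in
the output of the following command:

.. prompt:: bash #

   ceph osd df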
471
472 The 150 PGs that are being recovered are likely to be homogeneously distributed
473 across the 9 remaining OSDs. Each remaining OSD is therefore likely to send
474 copies of objects to all other OSDs and also likely to receive some new objects
475 to be stored because it has become part of a new PG.
476
477 The amount of time it takes for this recovery to complete depends on the
478 architecture of the Ceph cluster. Compare two setups: (1) Each OSD is hosted by
479 a 1 TB SSD on a single machine, all of the OSDs are connected to a 10 Gb/s
480 switch, and the recovery of a single OSD completes within a certain number of
481 minutes. (2) There are two OSDs per machine using HDDs with no SSD WAL+DB and
482 a 1 Gb/s switch. In the second setup, recovery will be at least one order of
483 magnitude slower.
484
485 In such a cluster, the number of PGs has almost no effect on data durability.
486 Whether there are 128 PGs per OSD or 8192 PGs per OSD, the recovery will be no
487 slower or faster.
488
489 However, an increase in the number of OSDs can increase the speed of recovery.
490 Suppose our Ceph cluster is expanded from 10 OSDs to 20 OSDs. Each OSD now
491 participates in only ~75 PGs rather than ~150 PGs. All 19 remaining OSDs will
492 still be required to replicate the same number of objects in order to recover.
493 But instead of there being only 10 OSDs that have to copy ~100 GB each, there
494 are now 20 OSDs that have to copy only 50 GB each. If the network had
495 previously been a bottleneck, recovery now happens twice as fast.
496
Similarly, suppose that our cluster grows to 40 OSDs. Each OSD will host only
~38 PGs. And if an OSD dies, recovery will take place faster than before unless
it is blocked by another bottleneck. Now, however, suppose that our cluster
grows to 200 OSDs. Each OSD will host only ~7 PGs. And if an OSD dies, recovery
will happen across at most :math:`7 \times 3 = 21` OSDs associated with these
PGs. This means that recovery will take longer than when there were only 40
OSDs. For this reason, the number of PGs should be increased.
505
No matter how brief the recovery time is, there is always a chance that an
additional OSD will fail while recovery is in progress. Consider the cluster
with 10 OSDs described above: if any of the 9 remaining OSDs fails while the
first failure is still being recovered, then :math:`\approx 17` (approximately
150 divided by 9) PGs will have only one remaining copy. And if any of the 8
remaining OSDs then fails, then 2 (approximately 17 divided by 8) PGs are
likely to lose their remaining objects. This is one reason why setting
``size=2`` is risky.
513
When the number of OSDs in the cluster increases to 20, the number of PGs that
would be damaged by the loss of three OSDs significantly decreases. The loss of
a second OSD degrades only approximately :math:`4` (that is,
:math:`\frac{75}{19}`) PGs rather than :math:`\approx 17` PGs, and the loss of
a third OSD results in data loss only if it is one of the 4 OSDs that contains
the remaining copy. This means -- assuming that the probability of losing one
OSD during recovery is 0.0001% -- that the probability of data loss when three
OSDs are lost is :math:`\approx 17 \times 10 \times 0.0001\%` in the cluster
with 10 OSDs, and only :math:`\approx 4 \times 20 \times 0.0001\%` in the
cluster with 20 OSDs.
523
524 In summary, the greater the number of OSDs, the faster the recovery and the
525 lower the risk of permanently losing a PG due to cascading failures. As far as
526 data durability is concerned, in a cluster with fewer than 50 OSDs, it doesn't
527 much matter whether there are 512 or 4096 PGs.
528
529 .. note:: It can take a long time for an OSD that has been recently added to
530 the cluster to be populated with the PGs assigned to it. However, no object
531 degradation or impact on data durability will result from the slowness of
532 this process since Ceph populates data into the new PGs before removing it
533 from the old PGs.
534
535 .. _object distribution:
536
537 Object distribution within a pool
538 ---------------------------------
539
540 Under ideal conditions, objects are evenly distributed across PGs. Because
541 CRUSH computes the PG for each object but does not know how much data is stored
542 in each OSD associated with the PG, the ratio between the number of PGs and the
543 number of OSDs can have a significant influence on data distribution.
544
545 For example, suppose that there is only a single PG for ten OSDs in a
546 three-replica pool. In that case, only three OSDs would be used because CRUSH
547 would have no other option. However, if more PGs are available, RADOS objects are
548 more likely to be evenly distributed across OSDs. CRUSH makes every effort to
549 distribute OSDs evenly across all existing PGs.
550
551 As long as there are one or two orders of magnitude more PGs than OSDs, the
552 distribution is likely to be even. For example: 256 PGs for 3 OSDs, 512 PGs for
553 10 OSDs, or 1024 PGs for 10 OSDs.
554
555 However, uneven data distribution can emerge due to factors other than the
556 ratio of PGs to OSDs. For example, since CRUSH does not take into account the
557 size of the RADOS objects, the presence of a few very large RADOS objects can
558 create an imbalance. Suppose that one million 4 KB RADOS objects totaling 4 GB
559 are evenly distributed among 1024 PGs on 10 OSDs. These RADOS objects will
560 consume 4 GB / 10 = 400 MB on each OSD. If a single 400 MB RADOS object is then
561 added to the pool, the three OSDs supporting the PG in which the RADOS object
562 has been placed will each be filled with 400 MB + 400 MB = 800 MB but the seven
563 other OSDs will still contain only 400 MB.
564
565 .. _resource usage:
566
567 Memory, CPU and network usage
568 -----------------------------
569
570 Every PG in the cluster imposes memory, network, and CPU demands upon OSDs and
571 MONs. These needs must be met at all times and are increased during recovery.
572 Indeed, one of the main reasons PGs were developed was to share this overhead
573 by clustering objects together.
574
575 For this reason, minimizing the number of PGs saves significant resources.
576
577 .. _choosing-number-of-placement-groups:
578
579 Choosing the Number of PGs
580 ==========================
581
.. note:: It is rarely necessary to do the math in this section by hand.
583 Instead, use the ``ceph osd pool autoscale-status`` command in combination
584 with the ``target_size_bytes`` or ``target_size_ratio`` pool properties. For
585 more information, see :ref:`pg-autoscaler`.
586
587 If you have more than 50 OSDs, we recommend approximately 50-100 PGs per OSD in
588 order to balance resource usage, data durability, and data distribution. If you
589 have fewer than 50 OSDs, follow the guidance in the `preselection`_ section.
590 For a single pool, use the following formula to get a baseline value:
591
592 Total PGs = :math:`\frac{OSDs \times 100}{pool \: size}`
593
Here **pool size** is either the number of replicas for replicated pools or the
K+M sum for erasure-coded pools. To retrieve this sum for a given profile, run
a command of the form ``ceph osd erasure-code-profile get <profile-name>``.
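
For example, to list the available erasure-code profiles and then inspect the K
and M values of the ``default`` profile:

.. prompt:: bash #

   ceph osd erasure-code-profile ls
   ceph osd erasure-code-profile get default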
597
598 Next, check whether the resulting baseline value is consistent with the way you
599 designed your Ceph cluster to maximize `data durability`_ and `object
600 distribution`_ and to minimize `resource usage`_.
601
602 This value should be **rounded up to the nearest power of two**.
603
604 Each pool's ``pg_num`` should be a power of two. Other values are likely to
605 result in uneven distribution of data across OSDs. It is best to increase
606 ``pg_num`` for a pool only when it is feasible and desirable to set the next
607 highest power of two. Note that this power of two rule is per-pool; it is
608 neither necessary nor easy to align the sum of all pools' ``pg_num`` to a power
609 of two.
610
611 For example, if you have a cluster with 200 OSDs and a single pool with a size
612 of 3 replicas, estimate the number of PGs as follows:
613
614 :math:`\frac{200 \times 100}{3} = 6667`. Rounded up to the nearest power of 2: 8192.
615
616 When using multiple data pools to store objects, make sure that you balance the
617 number of PGs per pool against the number of PGs per OSD so that you arrive at
618 a reasonable total number of PGs. It is important to find a number that
619 provides reasonably low variance per OSD without taxing system resources or
620 making the peering process too slow.
621
622 For example, suppose you have a cluster of 10 pools, each with 512 PGs on 10
623 OSDs. That amounts to 5,120 PGs distributed across 10 OSDs, or 512 PGs per OSD.
624 This cluster will not use too many resources. However, in a cluster of 1,000
625 pools, each with 512 PGs on 10 OSDs, the OSDs will have to handle ~50,000 PGs
626 each. This cluster will require significantly more resources and significantly
627 more time for peering.
628
629 For determining the optimal number of PGs per OSD, we recommend the `PGCalc`_
630 tool.
631
632
633 .. _setting the number of placement groups:
634
635 Setting the Number of PGs
636 =========================
637
638 Setting the initial number of PGs in a pool must be done at the time you create
639 the pool. See `Create a Pool`_ for details.
640
641 However, even after a pool is created, if the ``pg_autoscaler`` is not being
642 used to manage ``pg_num`` values, you can change the number of PGs by running a
643 command of the following form:
644
645 .. prompt:: bash #
646
647 ceph osd pool set {pool-name} pg_num {pg_num}
648
649 If you increase the number of PGs, your cluster will not rebalance until you
650 increase the number of PGs for placement (``pgp_num``). The ``pgp_num``
651 parameter specifies the number of PGs that are to be considered for placement
652 by the CRUSH algorithm. Increasing ``pg_num`` splits the PGs in your cluster,
653 but data will not be migrated to the newer PGs until ``pgp_num`` is increased.
654 The ``pgp_num`` parameter should be equal to the ``pg_num`` parameter. To
655 increase the number of PGs for placement, run a command of the following form:
656
657 .. prompt:: bash #
658
659 ceph osd pool set {pool-name} pgp_num {pgp_num}
660
If you decrease the number of PGs, then ``pgp_num`` is adjusted automatically.
In Nautilus and later releases, when the ``pg_autoscaler`` is not used,
``pgp_num`` is automatically stepped to match ``pg_num``. This process
manifests as periods of PG remapping and backfill, which is expected and
normal.
666
667 .. _rados_ops_pgs_get_pg_num:
668
669 Get the Number of PGs
670 =====================
671
672 To get the number of PGs in a pool, run a command of the following form:
673
674 .. prompt:: bash #
675
676 ceph osd pool get {pool-name} pg_num
677
678
679 Get a Cluster's PG Statistics
680 =============================
681
682 To see the details of the PGs in your cluster, run a command of the following
683 form:
684
685 .. prompt:: bash #
686
687 ceph pg dump [--format {format}]
688
689 Valid formats are ``plain`` (default) and ``json``.
690
691
692 Get Statistics for Stuck PGs
693 ============================
694
695 To see the statistics for all PGs that are stuck in a specified state, run a
696 command of the following form:
697
698 .. prompt:: bash #
699
700 ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>]
701
702 - **Inactive** PGs cannot process reads or writes because they are waiting for
703 enough OSDs with the most up-to-date data to come ``up`` and ``in``.
704
705 - **Undersized** PGs contain objects that have not been replicated the desired
706 number of times. Under normal conditions, it can be assumed that these PGs
707 are recovering.
708
709 - **Stale** PGs are in an unknown state -- the OSDs that host them have not
710 reported to the monitor cluster for a certain period of time (determined by
711 ``mon_osd_report_timeout``).
712
713 Valid formats are ``plain`` (default) and ``json``. The threshold defines the
714 minimum number of seconds the PG is stuck before it is included in the returned
715 statistics (default: 300).
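
For example, to list PGs that have been stuck in the ``inactive`` state for at
least ten minutes, in JSON format:

.. prompt:: bash #

   ceph pg dump_stuck inactive --format json --threshold 600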
716
717
718 Get a PG Map
719 ============
720
721 To get the PG map for a particular PG, run a command of the following form:
722
723 .. prompt:: bash #
724
725 ceph pg map {pg-id}
726
727 For example:
728
729 .. prompt:: bash #
730
731 ceph pg map 1.6c
732
Ceph will return the PG map, the PG, and the OSD status. The output resembles
the following::

   osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]
739
740
741 Get a PG's Statistics
742 =====================
743
744 To see statistics for a particular PG, run a command of the following form:
745
746 .. prompt:: bash #
747
748 ceph pg {pg-id} query
749
750
751 Scrub a PG
752 ==========
753
754 To scrub a PG, run a command of the following form:
755
756 .. prompt:: bash #
757
758 ceph pg scrub {pg-id}
759
760 Ceph checks the primary and replica OSDs, generates a catalog of all objects in
761 the PG, and compares the objects against each other in order to ensure that no
762 objects are missing or mismatched and that their contents are consistent. If
763 the replicas all match, then a final semantic sweep takes place to ensure that
764 all snapshot-related object metadata is consistent. Errors are reported in
765 logs.
766
767 To scrub all PGs from a specific pool, run a command of the following form:
768
769 .. prompt:: bash #
770
771 ceph osd pool scrub {pool-name}
772
773
774 Prioritize backfill/recovery of PG(s)
775 =====================================
776
777 You might encounter a situation in which multiple PGs require recovery or
778 backfill, but the data in some PGs is more important than the data in others
779 (for example, some PGs hold data for images that are used by running machines
780 and other PGs are used by inactive machines and hold data that is less
781 relevant). In that case, you might want to prioritize recovery or backfill of
782 the PGs with especially important data so that the performance of the cluster
783 and the availability of their data are restored sooner. To designate specific
784 PG(s) as prioritized during recovery, run a command of the following form:
785
786 .. prompt:: bash #
787
788 ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
789
790 To mark specific PG(s) as prioritized during backfill, run a command of the
791 following form:
792
793 .. prompt:: bash #
794
795 ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]
796
797 These commands instruct Ceph to perform recovery or backfill on the specified
798 PGs before processing the other PGs. Prioritization does not interrupt current
799 backfills or recovery, but places the specified PGs at the top of the queue so
800 that they will be acted upon next. If you change your mind or realize that you
801 have prioritized the wrong PGs, run one or both of the following commands:
802
803 .. prompt:: bash #
804
805 ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
806 ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]
807
808 These commands remove the ``force`` flag from the specified PGs, so that the
809 PGs will be processed in their usual order. As in the case of adding the
810 ``force`` flag, this affects only those PGs that are still queued but does not
811 affect PGs currently undergoing recovery.
812
813 The ``force`` flag is cleared automatically after recovery or backfill of the
814 PGs is complete.
815
816 Similarly, to instruct Ceph to prioritize all PGs from a specified pool (that
817 is, to perform recovery or backfill on those PGs first), run one or both of the
818 following commands:
819
820 .. prompt:: bash #
821
822 ceph osd pool force-recovery {pool-name}
823 ceph osd pool force-backfill {pool-name}
824
825 These commands can also be cancelled. To revert to the default order, run one
826 or both of the following commands:
827
828 .. prompt:: bash #
829
830 ceph osd pool cancel-force-recovery {pool-name}
831 ceph osd pool cancel-force-backfill {pool-name}
832
833 .. warning:: These commands can break the order of Ceph's internal priority
834 computations, so use them with caution! If you have multiple pools that are
835 currently sharing the same underlying OSDs, and if the data held by certain
836 pools is more important than the data held by other pools, then we recommend
837 that you run a command of the following form to arrange a custom
838 recovery/backfill priority for all pools:
839
840 .. prompt:: bash #
841
842 ceph osd pool set {pool-name} recovery_priority {value}
843
844 For example, if you have twenty pools, you could make the most important pool
845 priority ``20``, and the next most important pool priority ``19``, and so on.
846
847 Another option is to set the recovery/backfill priority for only a proper
848 subset of pools. In such a scenario, three important pools might (all) be
849 assigned priority ``1`` and all other pools would be left without an assigned
850 recovery/backfill priority. Another possibility is to select three important
851 pools and set their recovery/backfill priorities to ``3``, ``2``, and ``1``
852 respectively.
853
854 .. important:: Numbers of greater value have higher priority than numbers of
855 lesser value when using ``ceph osd pool set {pool-name} recovery_priority
856 {value}`` to set their recovery/backfill priority. For example, a pool with
857 the recovery/backfill priority ``30`` has a higher priority than a pool with
858 the recovery/backfill priority ``15``.
859
860 Reverting Lost RADOS Objects
861 ============================
862
863 If the cluster has lost one or more RADOS objects and you have decided to
864 abandon the search for the lost data, you must mark the unfound objects
865 ``lost``.
866
867 If every possible location has been queried and all OSDs are ``up`` and ``in``,
868 but certain RADOS objects are still lost, you might have to give up on those
869 objects. This situation can arise when rare and unusual combinations of
870 failures allow the cluster to learn about writes that were performed before the
871 writes themselves were recovered.
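
To review which objects in a PG are still unfound before giving up on them, you
can run a command of the following form (``2.5`` is only an example PG ID):

.. prompt:: bash #

   ceph pg 2.5 list_unfound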
872
873 The command to mark a RADOS object ``lost`` has only one supported option:
874 ``revert``. The ``revert`` option will either roll back to a previous version
875 of the RADOS object (if it is old enough to have a previous version) or forget
876 about it entirely (if it is too new to have a previous version). To mark the
877 "unfound" objects ``lost``, run a command of the following form:
878
879
880 .. prompt:: bash #
881
882 ceph pg {pg-id} mark_unfound_lost revert|delete
883
884 .. important:: Use this feature with caution. It might confuse applications
885 that expect the object(s) to exist.
886
887
888 .. toctree::
889 :hidden:
890
891 pg-states
892 pg-concepts
893
894
895 .. _Create a Pool: ../pools#createpool
896 .. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
897 .. _pgcalc: https://old.ceph.com/pgcalc/