.. _placement groups:

==================
 Placement Groups
==================

.. _pg-autoscaler:

Autoscaling placement groups
============================

Placement groups (PGs) are an internal implementation detail of how Ceph
distributes data. Autoscaling provides a way to manage PGs, and especially to
manage the number of PGs present in different pools. When *pg-autoscaling* is
enabled, the cluster is allowed to make recommendations or automatic
adjustments with respect to the number of PGs for each pool (``pgp_num``) in
accordance with expected cluster utilization and expected pool utilization.

Each pool has a ``pg_autoscale_mode`` property that can be set to ``off``,
``on``, or ``warn``:

* ``off``: Disable autoscaling for this pool. It is up to the administrator to
  choose an appropriate ``pgp_num`` for each pool. For more information, see
  :ref:`choosing-number-of-placement-groups`.
* ``on``: Enable automated adjustments of the PG count for the given pool.
* ``warn``: Raise health checks when the PG count is in need of adjustment.

To set the autoscaling mode for an existing pool, run a command of the
following form:

.. prompt:: bash #

   ceph osd pool set <pool-name> pg_autoscale_mode <mode>

For example, to enable autoscaling on pool ``foo``, run the following command:

.. prompt:: bash #

   ceph osd pool set foo pg_autoscale_mode on

You can also set the default ``pg_autoscale_mode`` that is applied to any
pools created after the initial setup of the cluster. To change this setting,
run a command of the following form:

.. prompt:: bash #

   ceph config set global osd_pool_default_pg_autoscale_mode <mode>

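For example, to have every pool created from this point onward default to
``warn`` mode, you might run:

.. prompt:: bash #

   ceph config set global osd_pool_default_pg_autoscale_mode warn
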
You can disable or enable the autoscaler for all pools with the ``noautoscale``
flag. By default, this flag is set to ``off``, but you can set it to ``on`` by
running the following command:

.. prompt:: bash #

   ceph osd pool set noautoscale

To set the ``noautoscale`` flag to ``off``, run the following command:

.. prompt:: bash #

   ceph osd pool unset noautoscale

To get the value of the flag, run the following command:

.. prompt:: bash #

   ceph osd pool get noautoscale

Viewing PG scaling recommendations
----------------------------------

To view each pool, its relative utilization, and any recommended changes to the
PG count, run the following command:

.. prompt:: bash #

   ceph osd pool autoscale-status

The output will resemble the following::

   POOL  SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
   a     12900M               3.0   82431M        0.4695                                 8     128                 warn       True
   c     0                    3.0   82431M        0.0000  0.2000        0.9884           1.0   1       64          warn       True
   b     0       953.6M       3.0   82431M        0.0347                                 8                         warn       False

- **POOL** is the name of the pool.

- **SIZE** is the amount of data stored in the pool.

- **TARGET SIZE** (if present) is the amount of data that is expected to be
  stored in the pool, as specified by the administrator. The system uses the
  greater of the two values for its calculation.

- **RATE** is the multiplier for the pool that determines how much raw storage
  capacity is consumed. For example, a three-replica pool will have a ratio of
  3.0, and a ``k=4 m=2`` erasure-coded pool will have a ratio of 1.5.

- **RAW CAPACITY** is the total amount of raw storage capacity on the specific
  OSDs that are responsible for storing the data of the pool (and perhaps the
  data of other pools).

- **RATIO** is the ratio of (1) the storage consumed by the pool to (2) the
  total raw storage capacity. In other words, RATIO is defined as
  (SIZE * RATE) / RAW CAPACITY. (See the worked example that follows this
  list.)

- **TARGET RATIO** (if present) is the ratio of the expected storage of this
  pool (that is, the amount of storage that this pool is expected to consume,
  as specified by the administrator) to the expected storage of all other pools
  that have target ratios set. If both ``target_size_bytes`` and
  ``target_size_ratio`` are specified, then ``target_size_ratio`` takes
  precedence.

- **EFFECTIVE RATIO** is the result of making two adjustments to the target
  ratio:

  #. Subtracting any capacity expected to be used by pools that have target
     size set.

  #. Normalizing the target ratios among pools that have target ratio set so
     that collectively they target cluster capacity. For example, four pools
     with ``target_size_ratio`` 1.0 would have an effective ratio of 0.25.

  The system's calculations use whichever of these two ratios (that is, the
  target ratio and the effective ratio) is greater.

- **BIAS** is used as a multiplier to manually adjust a pool's PG count in
  accordance with prior information about how many PGs a specific pool is
  expected to have.

- **PG_NUM** is either the current number of PGs associated with the pool or,
  if a ``pg_num`` change is in progress, the current number of PGs that the
  pool is working towards.

- **NEW PG_NUM** (if present) is the value that the system recommends that the
  ``pg_num`` of the pool be changed to. It is always a power of 2, and it is
  present only if the recommended value varies from the current value by more
  than the default factor of ``3``. To adjust this factor (in the following
  example, it is changed to ``2``), run the following command:

  .. prompt:: bash #

     ceph config set mgr mgr/pg_autoscaler/threshold 2.0

- **AUTOSCALE** is the pool's ``pg_autoscale_mode`` and is set to ``on``,
  ``off``, or ``warn``.

- **BULK** determines whether the pool is ``bulk``. It has a value of ``True``
  or ``False``. A ``bulk`` pool is expected to be large and should initially
  have a large number of PGs so that performance does not suffer. On the other
  hand, a pool that is not ``bulk`` is expected to be small (for example, a
  ``.mgr`` pool or a meta pool).

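As a worked example of the RATIO calculation, take pool ``a`` from the sample
output above: its SIZE is 12900M, its RATE is 3.0, and the RAW CAPACITY of its
OSDs is 82431M, so :math:`(12900 \times 3.0) / 82431 \approx 0.4695`, which
matches the value reported in the RATIO column.
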
.. note::

   If the ``ceph osd pool autoscale-status`` command returns no output at all,
   there is probably at least one pool that spans multiple CRUSH roots. This
   'spanning pool' issue can happen in scenarios like the following:
   when a new deployment auto-creates the ``.mgr`` pool on the ``default``
   CRUSH root, subsequent pools are created with rules that constrain them to a
   specific shadow CRUSH tree. For example, if you create an RBD metadata pool
   that is constrained to ``deviceclass = ssd`` and an RBD data pool that is
   constrained to ``deviceclass = hdd``, you will encounter this issue. To
   remedy this issue, constrain the spanning pool to only one device class. In
   the above scenario, there is likely to be a ``replicated-ssd`` CRUSH rule in
   effect, and the ``.mgr`` pool can be constrained to ``ssd`` devices by
   running the following command:

   .. prompt:: bash #

      ceph osd pool set .mgr crush_rule replicated-ssd

   This intervention will result in a small amount of backfill, but
   typically this traffic completes quickly.


Automated scaling
-----------------

In the simplest approach to automated scaling, the cluster is allowed to
automatically scale ``pgp_num`` in accordance with usage. Ceph considers the
total available storage and the target number of PGs for the whole system,
considers how much data is stored in each pool, and apportions PGs accordingly.
The system is conservative with its approach, making changes to a pool only
when the current number of PGs (``pg_num``) varies by more than a factor of 3
from the recommended number.

The target number of PGs per OSD is determined by the ``mon_target_pg_per_osd``
parameter (default: 100), which can be adjusted by running the following
command:

.. prompt:: bash #

   ceph config set global mon_target_pg_per_osd 100

The autoscaler analyzes pools and adjusts on a per-subtree basis. Because each
pool might map to a different CRUSH rule, and each rule might distribute data
across different devices, Ceph will consider the utilization of each subtree of
the hierarchy independently. For example, a pool that maps to OSDs of class
``ssd`` and a pool that maps to OSDs of class ``hdd`` will each have optimal PG
counts that are determined by how many of these two different device types
there are.

If a pool uses OSDs under two or more CRUSH roots (for example, shadow trees
with both ``ssd`` and ``hdd`` devices), the autoscaler issues a warning to the
user in the manager log. The warning states the name of the pool and the set of
roots that overlap each other. The autoscaler does not scale any pools with
overlapping roots because this condition can cause problems with the scaling
process. We recommend constraining each pool so that it belongs to only one
root (that is, one OSD class) to silence the warning and ensure a successful
scaling process.

.. _managing_bulk_flagged_pools:

Managing pools that are flagged with ``bulk``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If a pool is flagged ``bulk``, then the autoscaler starts the pool with a full
complement of PGs and then scales down the number of PGs only if the usage
ratio across the pool is uneven. However, if a pool is not flagged ``bulk``,
then the autoscaler starts the pool with minimal PGs and creates additional PGs
only if there is more usage in the pool.

To create a pool that will be flagged ``bulk``, run the following command:

.. prompt:: bash #

   ceph osd pool create <pool-name> --bulk

To set or unset the ``bulk`` flag of an existing pool, run the following
command:

.. prompt:: bash #

   ceph osd pool set <pool-name> bulk <true/false/1/0>

To get the ``bulk`` flag of an existing pool, run the following command:

.. prompt:: bash #

   ceph osd pool get <pool-name> bulk
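
For example, to flag a hypothetical RGW data pool named
``default.rgw.buckets.data`` (a pool expected to hold most of the cluster's
data) as ``bulk``, you might run:

.. prompt:: bash #

   ceph osd pool set default.rgw.buckets.data bulk true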

.. _specifying_pool_target_size:

Specifying expected pool size
-----------------------------

When a cluster or pool is first created, it consumes only a small fraction of
the total cluster capacity and appears to the system as if it should need only
a small number of PGs. However, in some cases, cluster administrators know
which pools are likely to consume most of the system capacity in the long run.
When Ceph is provided with this information, a more appropriate number of PGs
can be used from the beginning, obviating subsequent changes in ``pg_num`` and
the associated overhead cost of relocating data.

The *target size* of a pool can be specified in two ways: either in relation to
the absolute size (in bytes) of the pool, or as a weight relative to all other
pools that have ``target_size_ratio`` set.

For example, to tell the system that ``mypool`` is expected to consume 100 TB,
run the following command:

.. prompt:: bash #

   ceph osd pool set mypool target_size_bytes 100T

Alternatively, to tell the system that ``mypool`` is expected to consume a
ratio of 1.0 relative to other pools that have ``target_size_ratio`` set,
adjust the ``target_size_ratio`` setting of ``mypool`` by running the
following command:

.. prompt:: bash #

   ceph osd pool set mypool target_size_ratio 1.0

If ``mypool`` is the only pool in the cluster, then it is expected to use 100%
of the total cluster capacity. However, if the cluster contains a second pool
that has ``target_size_ratio`` set to 1.0, then both pools are expected to use
50% of the total cluster capacity.

The ``ceph osd pool create`` command has two command-line options that can be
used to set the target size of a pool at creation time: ``--target-size-bytes
<bytes>`` and ``--target-size-ratio <ratio>``.

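For example, carrying over the assumption above that ``mypool`` will eventually
hold about 100 TB, the target could instead be declared when the pool is
created:

.. prompt:: bash #

   ceph osd pool create mypool --target-size-bytes 100T
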
Note that if the target-size values that have been specified are impossible
(for example, a capacity larger than the total cluster), then a health check
(``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised.

If both ``target_size_ratio`` and ``target_size_bytes`` are specified for a
pool, then the latter will be ignored, the former will be used in system
calculations, and a health check (``POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO``)
will be raised.

Specifying bounds on a pool's PGs
---------------------------------

It is possible to specify both the minimum number and the maximum number of PGs
for a pool.

Setting a Minimum Number of PGs and a Maximum Number of PGs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If a minimum is set, then Ceph will not itself reduce (nor recommend that you
reduce) the number of PGs to a value below the configured value. Setting a
minimum serves to establish a lower bound on the amount of parallelism enjoyed
by a client during I/O, even if a pool is mostly empty.

If a maximum is set, then Ceph will not itself increase (or recommend that you
increase) the number of PGs to a value above the configured value.

To set the minimum number of PGs for a pool, run a command of the following
form:

.. prompt:: bash #

   ceph osd pool set <pool-name> pg_num_min <num>

To set the maximum number of PGs for a pool, run a command of the following
form:

.. prompt:: bash #

   ceph osd pool set <pool-name> pg_num_max <num>

In addition, the ``ceph osd pool create`` command has two command-line options
that can be used to specify the minimum or maximum PG count of a pool at
creation time: ``--pg-num-min <num>`` and ``--pg-num-max <num>``.

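For example, a hypothetical pool could be given both bounds at creation time
(the values ``32`` and ``1024`` are illustrative only):

.. prompt:: bash #

   ceph osd pool create mypool --pg-num-min 32 --pg-num-max 1024
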
.. _preselection:

Preselecting pg_num
===================

When creating a pool with the following command, you have the option to
preselect the value of the ``pg_num`` parameter:

.. prompt:: bash #

   ceph osd pool create {pool-name} [pg_num]

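For example, to create a hypothetical pool named ``testpool`` with 128 PGs
preselected (see below for guidance on choosing such a value), you might run:

.. prompt:: bash #

   ceph osd pool create testpool 128
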
If you opt not to specify ``pg_num`` in this command, the cluster uses the PG
autoscaler to automatically configure the parameter in accordance with the
amount of data that is stored in the pool (see :ref:`pg-autoscaler` above).

However, your decision of whether or not to specify ``pg_num`` at creation time
has no effect on whether the parameter will be automatically tuned by the
cluster afterwards. As seen above, autoscaling of PGs is enabled or disabled by
running a command of the following form:

.. prompt:: bash #

   ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn)

Without the balancer, the suggested target is approximately 100 PG replicas on
each OSD. With the balancer, an initial target of 50 PG replicas on each OSD is
reasonable.

The autoscaler attempts to satisfy the following conditions:

- the number of PGs per OSD should be proportional to the amount of data in the
  pool
- there should be 50-100 PGs per OSD, taking into account the replication
  overhead or erasure-coding fan-out of each PG's replicas across OSDs

Use of Placement Groups
=======================

A placement group aggregates objects within a pool. The tracking of RADOS
object placement and object metadata on a per-object basis is computationally
expensive. It would be infeasible for a system with millions of RADOS
objects to efficiently track placement on a per-object basis.

.. ditaa::

           /-----\  /-----\  /-----\  /-----\  /-----\
           | obj |  | obj |  | obj |  | obj |  | obj |
           \-----/  \-----/  \-----/  \-----/  \-----/
              |        |        |        |        |
              +--------+--------+   +----+---+
                       |              |
                       v              v
   +-----------------------+  +-----------------------+
   |  Placement Group #1   |  |  Placement Group #2   |
   |                       |  |                       |
   +-----------------------+  +-----------------------+
               |                           |
               +-------------+-------------+
                             |
                             v
                  +-----------------------+
                  |         Pool          |
                  |                       |
                  +-----------------------+

The Ceph client calculates which PG a RADOS object should be in. As part of
this calculation, the client hashes the object ID and performs an operation
involving both the number of PGs in the specified pool and the pool ID. For
details, see `Mapping PGs to OSDs`_.

The contents of a RADOS object belonging to a PG are stored in a set of OSDs.
For example, in a replicated pool of size two, each PG will store objects on
two OSDs, as shown below:

.. ditaa::

   +-----------------------+  +-----------------------+
   |  Placement Group #1   |  |  Placement Group #2   |
   |                       |  |                       |
   +-----------------------+  +-----------------------+
        |             |            |             |
        v             v            v             v
   /----------\  /----------\  /----------\  /----------\
   |          |  |          |  |          |  |          |
   |  OSD #1  |  |  OSD #2  |  |  OSD #2  |  |  OSD #3  |
   |          |  |          |  |          |  |          |
   \----------/  \----------/  \----------/  \----------/

If OSD #2 fails, another OSD will be assigned to Placement Group #1 and then
filled with copies of all objects in OSD #1. If the pool size is changed from
two to three, an additional OSD will be assigned to the PG and will receive
copies of all objects in the PG.

An OSD assigned to a PG is not owned exclusively by that PG; rather, the OSD is
shared with other PGs either from the same pool or from other pools. In our
example, OSD #2 is shared by Placement Group #1 and Placement Group #2. If OSD
#2 fails, then Placement Group #2 must restore copies of objects (by making use
of OSD #3).

When the number of PGs increases, several consequences ensue. The new PGs are
assigned OSDs. The result of the CRUSH function changes, which means that some
objects from the already-existing PGs are copied to the new PGs and removed
from the old ones.

Factors Relevant To Specifying pg_num
=====================================

On the one hand, the criteria of data durability and even distribution across
OSDs weigh in favor of a high number of PGs. On the other hand, the criteria of
saving CPU resources and minimizing memory usage weigh in favor of a low number
of PGs.

.. _data durability:

Data durability
---------------

When an OSD fails, the risk of data loss is increased until replication of the
data it hosted is restored to the configured level. To illustrate this point,
let's imagine a scenario that results in permanent data loss in a single PG:

#. The OSD fails and all copies of the object that it contains are lost. For
   each object within the PG, the number of its replicas suddenly drops from
   three to two.

#. Ceph starts recovery for this PG by choosing a new OSD on which to re-create
   the third copy of each object.

#. Another OSD within the same PG fails before the new OSD is fully populated
   with the third copy. Some objects will then only have one surviving copy.

#. Ceph selects yet another OSD and continues copying objects in order to
   restore the desired number of copies.

#. A third OSD within the same PG fails before recovery is complete. If this
   OSD happened to contain the only remaining copy of an object, the object is
   permanently lost.

In a cluster containing 10 OSDs with 512 PGs in a three-replica pool, CRUSH
will give each PG three OSDs. Ultimately, each OSD hosts
:math:`\frac{(512 \times 3)}{10} \approx 150` PGs. So when the first OSD fails
in the above scenario, recovery will begin for all 150 PGs at the same time.

The 150 PGs that are being recovered are likely to be homogeneously distributed
across the 9 remaining OSDs. Each remaining OSD is therefore likely to send
copies of objects to all other OSDs and also likely to receive some new objects
to be stored because it has become part of a new PG.

The amount of time it takes for this recovery to complete depends on the
architecture of the Ceph cluster. Compare two setups: (1) Each OSD is hosted by
a 1 TB SSD on a single machine, all of the OSDs are connected to a 10 Gb/s
switch, and the recovery of a single OSD completes within a certain number of
minutes. (2) There are two OSDs per machine using HDDs with no SSD WAL+DB and
a 1 Gb/s switch. In the second setup, recovery will be at least one order of
magnitude slower.

In such a cluster, the number of PGs has almost no effect on data durability.
Whether there are 128 PGs per OSD or 8192 PGs per OSD, the recovery will be no
slower or faster.

However, an increase in the number of OSDs can increase the speed of recovery.
Suppose our Ceph cluster is expanded from 10 OSDs to 20 OSDs. Each OSD now
participates in only ~75 PGs rather than ~150 PGs. All 19 remaining OSDs will
still be required to replicate the same number of objects in order to recover.
But instead of there being only 10 OSDs that have to copy ~100 GB each, there
are now 20 OSDs that have to copy only 50 GB each. If the network had
previously been a bottleneck, recovery now happens twice as fast.

Similarly, suppose that our cluster grows to 40 OSDs. Each OSD will host only
~38 PGs. And if an OSD dies, recovery will take place faster than before unless
it is blocked by another bottleneck. Now, however, suppose that our cluster
grows to 200 OSDs. Each OSD will host only ~7 PGs. And if an OSD dies, recovery
will happen across at most :math:`\approx 21 = (7 \times 3)` OSDs
associated with these PGs. This means that recovery will take longer than when
there were only 40 OSDs. For this reason, the number of PGs should be
increased.

No matter how brief the recovery time is, there is always a chance that an
additional OSD will fail while recovery is in progress. Consider the cluster
with 10 OSDs described above: if any of the OSDs fail, then :math:`\approx 17`
(approximately 150 divided by 9) PGs will have only one remaining copy. And if
any of the 8 remaining OSDs fail, then 2 (approximately 17 divided by 8) PGs
are likely to lose their remaining objects. This is one reason why setting
``size=2`` is risky.

When the number of OSDs in the cluster increases to 20, the number of PGs that
would be damaged by the loss of three OSDs significantly decreases. The loss of
a second OSD degrades only approximately :math:`4` (that is,
:math:`\frac{75}{19}`) PGs rather than :math:`\approx 17` PGs, and the loss of
a third OSD results in data loss only if it is one of the 4 OSDs that contains
the remaining copy. This means -- assuming that the probability of losing one
OSD during recovery is 0.0001% -- that the probability of data loss when three
OSDs are lost is :math:`\approx 17 \times 10 \times 0.0001\%` in the cluster
with 10 OSDs, and only :math:`\approx 4 \times 20 \times 0.0001\%` in the
cluster with 20 OSDs.

In summary, the greater the number of OSDs, the faster the recovery and the
lower the risk of permanently losing a PG due to cascading failures. As far as
data durability is concerned, in a cluster with fewer than 50 OSDs, it doesn't
much matter whether there are 512 or 4096 PGs.

.. note:: It can take a long time for an OSD that has been recently added to
   the cluster to be populated with the PGs assigned to it. However, no object
   degradation or impact on data durability will result from the slowness of
   this process since Ceph populates data into the new PGs before removing it
   from the old PGs.


.. _object distribution:

Object distribution within a pool
---------------------------------

Under ideal conditions, objects are evenly distributed across PGs. Because
CRUSH computes the PG for each object but does not know how much data is stored
in each OSD associated with the PG, the ratio between the number of PGs and the
number of OSDs can have a significant influence on data distribution.

For example, suppose that there is only a single PG for ten OSDs in a
three-replica pool. In that case, only three OSDs would be used because CRUSH
would have no other option. However, if more PGs are available, RADOS objects
are more likely to be evenly distributed across OSDs. CRUSH makes every effort
to distribute OSDs evenly across all existing PGs.

As long as there are one or two orders of magnitude more PGs than OSDs, the
distribution is likely to be even. For example: 256 PGs for 3 OSDs, 512 PGs for
10 OSDs, or 1024 PGs for 10 OSDs.

However, uneven data distribution can emerge due to factors other than the
ratio of PGs to OSDs. For example, since CRUSH does not take into account the
size of the RADOS objects, the presence of a few very large RADOS objects can
create an imbalance. Suppose that one million 4 KB RADOS objects totaling 4 GB
are evenly distributed among 1024 PGs on 10 OSDs. These RADOS objects will
consume 4 GB / 10 = 400 MB on each OSD. If a single 400 MB RADOS object is then
added to the pool, the three OSDs supporting the PG in which the RADOS object
has been placed will each be filled with 400 MB + 400 MB = 800 MB but the seven
other OSDs will still contain only 400 MB.

.. _resource usage:

Memory, CPU and network usage
-----------------------------

Every PG in the cluster imposes memory, network, and CPU demands upon OSDs and
MONs. These needs must be met at all times and are increased during recovery.
Indeed, one of the main reasons PGs were developed was to share this overhead
by clustering objects together.

For this reason, minimizing the number of PGs saves significant resources.

.. _choosing-number-of-placement-groups:

Choosing the Number of PGs
==========================

.. note:: It is rarely necessary to do the math in this section by hand.
   Instead, use the ``ceph osd pool autoscale-status`` command in combination
   with the ``target_size_bytes`` or ``target_size_ratio`` pool properties. For
   more information, see :ref:`pg-autoscaler`.

If you have more than 50 OSDs, we recommend approximately 50-100 PGs per OSD in
order to balance resource usage, data durability, and data distribution. If you
have fewer than 50 OSDs, follow the guidance in the `preselection`_ section.
For a single pool, use the following formula to get a baseline value:

   Total PGs = :math:`\frac{OSDs \times 100}{pool \: size}`

Here **pool size** is either the number of replicas for replicated pools or the
K+M sum for erasure-coded pools. To retrieve this sum, run the command ``ceph
osd erasure-code-profile get``.

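For example, to read ``k`` and ``m`` from the profile used by an erasure-coded
pool, you can query the profile by name (``{profile-name}`` stands for whatever
profile the pool was created with):

.. prompt:: bash #

   ceph osd erasure-code-profile get {profile-name}

For a profile with ``k=4`` and ``m=2`` (as in the RATE example earlier on this
page), the **pool size** to use in the formula above is :math:`4 + 2 = 6`.
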
Next, check whether the resulting baseline value is consistent with the way you
designed your Ceph cluster to maximize `data durability`_ and `object
distribution`_ and to minimize `resource usage`_.

This value should be **rounded up to the nearest power of two**.

Each pool's ``pg_num`` should be a power of two. Other values are likely to
result in uneven distribution of data across OSDs. It is best to increase
``pg_num`` for a pool only when it is feasible and desirable to set the next
highest power of two. Note that this power of two rule is per-pool; it is
neither necessary nor easy to align the sum of all pools' ``pg_num`` to a power
of two.

For example, if you have a cluster with 200 OSDs and a single pool with a size
of 3 replicas, estimate the number of PGs as follows:

   :math:`\frac{200 \times 100}{3} = 6667`. Rounded up to the nearest power of 2: 8192.

When using multiple data pools to store objects, make sure that you balance the
number of PGs per pool against the number of PGs per OSD so that you arrive at
a reasonable total number of PGs. It is important to find a number that
provides reasonably low variance per OSD without taxing system resources or
making the peering process too slow.

For example, suppose you have a cluster of 10 pools, each with 512 PGs on 10
OSDs. That amounts to 5,120 PGs distributed across 10 OSDs, or 512 PGs per OSD.
This cluster will not use too many resources. However, in a cluster of 1,000
pools, each with 512 PGs on 10 OSDs, the OSDs will have to handle ~50,000 PGs
each. This cluster will require significantly more resources and significantly
more time for peering.

For determining the optimal number of PGs per OSD, we recommend the `PGCalc`_
tool.


.. _setting the number of placement groups:

Setting the Number of PGs
=========================

Setting the initial number of PGs in a pool must be done at the time you create
the pool. See `Create a Pool`_ for details.

However, even after a pool is created, if the ``pg_autoscaler`` is not being
used to manage ``pg_num`` values, you can change the number of PGs by running a
command of the following form:

.. prompt:: bash #

   ceph osd pool set {pool-name} pg_num {pg_num}

If you increase the number of PGs, your cluster will not rebalance until you
increase the number of PGs for placement (``pgp_num``). The ``pgp_num``
parameter specifies the number of PGs that are to be considered for placement
by the CRUSH algorithm. Increasing ``pg_num`` splits the PGs in your cluster,
but data will not be migrated to the newer PGs until ``pgp_num`` is increased.
The ``pgp_num`` parameter should be equal to the ``pg_num`` parameter. To
increase the number of PGs for placement, run a command of the following form:

.. prompt:: bash #

   ceph osd pool set {pool-name} pgp_num {pgp_num}

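For example, to split a hypothetical pool named ``mypool`` into 128 PGs and
then allow data to migrate into the new PGs, you might run:

.. prompt:: bash #

   ceph osd pool set mypool pg_num 128
   ceph osd pool set mypool pgp_num 128
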
If you decrease the number of PGs, then ``pgp_num`` is adjusted automatically.
In Nautilus and later releases, when the ``pg_autoscaler`` is not used,
``pgp_num`` is automatically stepped to match ``pg_num``. This process
manifests as periods of remapping of PGs and of backfill, which is expected
and normal behavior.

.. _rados_ops_pgs_get_pg_num:

Get the Number of PGs
=====================

To get the number of PGs in a pool, run a command of the following form:

.. prompt:: bash #

   ceph osd pool get {pool-name} pg_num


Get a Cluster's PG Statistics
=============================

To see the details of the PGs in your cluster, run a command of the following
form:

.. prompt:: bash #

   ceph pg dump [--format {format}]

Valid formats are ``plain`` (default) and ``json``.
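
For example, to capture the full PG table in JSON for further processing, you
might run:

.. prompt:: bash #

   ceph pg dump --format json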


Get Statistics for Stuck PGs
============================

To see the statistics for all PGs that are stuck in a specified state, run a
command of the following form:

.. prompt:: bash #

   ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>]

- **Inactive** PGs cannot process reads or writes because they are waiting for
  enough OSDs with the most up-to-date data to come ``up`` and ``in``.

- **Undersized** PGs contain objects that have not been replicated the desired
  number of times. Under normal conditions, it can be assumed that these PGs
  are recovering.

- **Stale** PGs are in an unknown state -- the OSDs that host them have not
  reported to the monitor cluster for a certain period of time (determined by
  ``mon_osd_report_timeout``).

Valid formats are ``plain`` (default) and ``json``. The threshold defines the
minimum number of seconds the PG is stuck before it is included in the returned
statistics (default: 300).
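
For example, to list PGs that have been ``stale`` for at least ten minutes and
format the result as JSON, a command of the following form might be used:

.. prompt:: bash #

   ceph pg dump_stuck stale --format json --threshold 600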


Get a PG Map
============

To get the PG map for a particular PG, run a command of the following form:

.. prompt:: bash #

   ceph pg map {pg-id}

For example:

.. prompt:: bash #

   ceph pg map 1.6c

Ceph will return the PG map, the PG, and the OSD status. The output resembles
the following::

   osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]


Get a PG's Statistics
=====================

To see statistics for a particular PG, run a command of the following form:

.. prompt:: bash #

   ceph pg {pg-id} query

Scrub a PG
==========

To scrub a PG, run a command of the following form:

.. prompt:: bash #

   ceph pg scrub {pg-id}

Ceph checks the primary and replica OSDs, generates a catalog of all objects in
the PG, and compares the objects against each other in order to ensure that no
objects are missing or mismatched and that their contents are consistent. If
the replicas all match, then a final semantic sweep takes place to ensure that
all snapshot-related object metadata is consistent. Errors are reported in
logs.

To scrub all PGs from a specific pool, run a command of the following form:

.. prompt:: bash #

   ceph osd pool scrub {pool-name}


Prioritize backfill/recovery of PG(s)
=====================================

You might encounter a situation in which multiple PGs require recovery or
backfill, but the data in some PGs is more important than the data in others
(for example, some PGs hold data for images that are used by running machines
and other PGs are used by inactive machines and hold data that is less
relevant). In that case, you might want to prioritize recovery or backfill of
the PGs with especially important data so that the performance of the cluster
and the availability of their data are restored sooner. To designate specific
PG(s) as prioritized during recovery, run a command of the following form:

.. prompt:: bash #

   ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]

To mark specific PG(s) as prioritized during backfill, run a command of the
following form:

.. prompt:: bash #

   ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

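For example, to prioritize recovery of two hypothetical PGs, ``2.14`` and
``2.1f``, you might run:

.. prompt:: bash #

   ceph pg force-recovery 2.14 2.1f
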
These commands instruct Ceph to perform recovery or backfill on the specified
PGs before processing the other PGs. Prioritization does not interrupt current
backfills or recovery, but places the specified PGs at the top of the queue so
that they will be acted upon next. If you change your mind or realize that you
have prioritized the wrong PGs, run one or both of the following commands:

.. prompt:: bash #

   ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
   ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

These commands remove the ``force`` flag from the specified PGs, so that the
PGs will be processed in their usual order. As in the case of adding the
``force`` flag, this affects only those PGs that are still queued but does not
affect PGs currently undergoing recovery.

The ``force`` flag is cleared automatically after recovery or backfill of the
PGs is complete.

Similarly, to instruct Ceph to prioritize all PGs from a specified pool (that
is, to perform recovery or backfill on those PGs first), run one or both of the
following commands:

.. prompt:: bash #

   ceph osd pool force-recovery {pool-name}
   ceph osd pool force-backfill {pool-name}

These commands can also be cancelled. To revert to the default order, run one
or both of the following commands:

.. prompt:: bash #

   ceph osd pool cancel-force-recovery {pool-name}
   ceph osd pool cancel-force-backfill {pool-name}

.. warning:: These commands can break the order of Ceph's internal priority
   computations, so use them with caution! If you have multiple pools that are
   currently sharing the same underlying OSDs, and if the data held by certain
   pools is more important than the data held by other pools, then we recommend
   that you run a command of the following form to arrange a custom
   recovery/backfill priority for all pools:

.. prompt:: bash #

   ceph osd pool set {pool-name} recovery_priority {value}

For example, if you have twenty pools, you could give the most important pool a
priority of ``20``, the next most important pool a priority of ``19``, and so
on.

Another option is to set the recovery/backfill priority for only a proper
subset of pools. In such a scenario, three important pools might (all) be
assigned priority ``1`` and all other pools would be left without an assigned
recovery/backfill priority. Another possibility is to select three important
pools and set their recovery/backfill priorities to ``3``, ``2``, and ``1``
respectively.

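A minimal sketch of the latter scheme, assuming three hypothetical pools named
``vms``, ``volumes``, and ``backups``:

.. prompt:: bash #

   ceph osd pool set vms recovery_priority 3
   ceph osd pool set volumes recovery_priority 2
   ceph osd pool set backups recovery_priority 1
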
.. important:: Numbers of greater value have higher priority than numbers of
   lesser value when using ``ceph osd pool set {pool-name} recovery_priority
   {value}`` to set a pool's recovery/backfill priority. For example, a pool
   with the recovery/backfill priority ``30`` has a higher priority than a pool
   with the recovery/backfill priority ``15``.

Reverting Lost RADOS Objects
============================

If the cluster has lost one or more RADOS objects and you have decided to
abandon the search for the lost data, you must mark the unfound objects
``lost``.

If every possible location has been queried and all OSDs are ``up`` and ``in``,
but certain RADOS objects are still lost, you might have to give up on those
objects. This situation can arise when rare and unusual combinations of
failures allow the cluster to learn about writes that were performed before the
writes themselves were recovered.

The command to mark a RADOS object ``lost`` takes one of two options:
``revert`` or ``delete``. The ``revert`` option will either roll back to a
previous version of the RADOS object (if it is old enough to have a previous
version) or forget about it entirely (if it is too new to have a previous
version). To mark the "unfound" objects ``lost``, run a command of the
following form:

.. prompt:: bash #

   ceph pg {pg-id} mark_unfound_lost revert|delete

.. important:: Use this feature with caution. It might confuse applications
   that expect the object(s) to exist.


.. toctree::
   :hidden:

   pg-states
   pg-concepts


.. _Create a Pool: ../pools#createpool
.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
.. _pgcalc: https://old.ceph.com/pgcalc/