==================
 Placement Groups
==================

.. _preselection:

A preselection of pg_num
========================

When creating a new pool with::

    ceph osd pool create {pool-name} pg_num

it is mandatory to choose the value of ``pg_num`` because it cannot be
calculated automatically. Here are a few values commonly used:

- Less than 5 OSDs: set ``pg_num`` to 128

- Between 5 and 10 OSDs: set ``pg_num`` to 512

- Between 10 and 50 OSDs: set ``pg_num`` to 1024

- If you have more than 50 OSDs, you need to understand the tradeoffs
  and how to calculate the ``pg_num`` value by yourself

- To calculate the ``pg_num`` value by yourself, the `pgcalc`_ tool
  can help

As the number of OSDs increases, choosing the right value for
``pg_num`` becomes more important because it has a significant
influence on the behavior of the cluster as well as the durability of
the data when something goes wrong (i.e. the probability that a
catastrophic event leads to data loss).
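
As a rough sanity check, the preselection above can be written as a
small lookup. The following Python sketch is an illustration only; it
is not part of Ceph, and the exact boundary handling is an
assumption::

    def preselect_pg_num(osd_count):
        """Commonly used starting pg_num values for small clusters."""
        if osd_count < 5:
            return 128
        if osd_count <= 10:
            return 512
        if osd_count <= 50:
            return 1024
        # Beyond 50 OSDs, calculate pg_num yourself (see the pgcalc tool).
        raise ValueError("more than 50 OSDs: calculate pg_num explicitly")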

How are Placement Groups used?
==============================

A placement group (PG) aggregates objects within a pool because
tracking object placement and object metadata on a per-object basis is
computationally expensive: a system with millions of objects cannot
realistically track placement on a per-object basis.

.. ditaa::
           /-----\  /-----\  /-----\  /-----\  /-----\
           | obj |  | obj |  | obj |  | obj |  | obj |
           \-----/  \-----/  \-----/  \-----/  \-----/
              |        |        |        |        |
              +--------+--------+        +---+----+
              |                              |
              v                              v
   +-----------------------+      +-----------------------+
   |  Placement Group #1   |      |  Placement Group #2   |
   |                       |      |                       |
   +-----------------------+      +-----------------------+
               |                              |
               +------------------------------+
                             |
                             v
                  +-----------------------+
                  |        Pool           |
                  |                       |
                  +-----------------------+

The Ceph client will calculate which placement group an object should
be in. It does this by hashing the object ID and applying an operation
based on the number of PGs in the defined pool and the ID of the pool.
See `Mapping PGs to OSDs`_ for details.
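
In simplified form, that mapping can be sketched in Python. This is an
illustration only: the real implementation uses Ceph's own hash
function rather than Python's built-in ``hash()``, and a "stable"
modulo that behaves well when ``pg_num`` is not a power of two::

    def object_to_pg(pool_id, object_id, pg_num):
        # Stand-in for Ceph's hash of the object name.
        h = hash(object_id) & 0xffffffff
        # Fold the hash onto one of the pool's placement groups.
        pg = h % pg_num
        # PG IDs are conventionally written {pool-id}.{pg-id-in-hex},
        # e.g. "1.6c".
        return "{}.{:x}".format(pool_id, pg)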

The object's contents within a placement group are stored in a set of
OSDs. For instance, in a replicated pool of size two, each placement
group will store objects on two OSDs, as shown below.

.. ditaa::

   +-----------------------+      +-----------------------+
   |  Placement Group #1   |      |  Placement Group #2   |
   |                       |      |                       |
   +-----------------------+      +-----------------------+
        |             |               |             |
        v             v               v             v
   /----------\  /----------\    /----------\  /----------\
   |          |  |          |    |          |  |          |
   |  OSD #1  |  |  OSD #2  |    |  OSD #2  |  |  OSD #3  |
   |          |  |          |    |          |  |          |
   \----------/  \----------/    \----------/  \----------/

Should OSD #2 fail, another OSD will be assigned to Placement Group #1
and will be filled with copies of all objects in OSD #1. If the pool
size is changed from two to three, an additional OSD will be assigned
to the placement group and will receive copies of all objects in the
placement group.

Placement groups do not own the OSD; they share it with other
placement groups from the same pool or even other pools. If OSD #2
fails, Placement Group #2 will also have to restore copies of its
objects, using OSD #3.

When the number of placement groups increases, the new placement
groups will be assigned OSDs. The result of the CRUSH function will
also change, and some objects from the former placement groups will be
copied over to the new placement groups and removed from the old ones.

Placement Groups Tradeoffs
==========================

Data durability and an even distribution among all OSDs call for more
placement groups, but their number should be kept to the minimum
necessary to save CPU and memory.

.. _data durability:

Data durability
---------------

After an OSD fails, the risk of data loss increases until the data it
contained is fully recovered. Let's imagine a scenario that causes
permanent data loss in a single placement group:

- The OSD fails and all copies of the objects it contains are lost.
  For all objects within the placement group, the number of replicas
  suddenly drops from three to two.

- Ceph starts recovery for this placement group by choosing a new OSD
  to re-create the third copy of all objects.

- Another OSD, within the same placement group, fails before the new
  OSD is fully populated with the third copy. Some objects will then
  only have one surviving copy.

- Ceph picks yet another OSD and keeps copying objects to restore the
  desired number of copies.

- A third OSD, within the same placement group, fails before recovery
  is complete. If this OSD contained the only remaining copy of an
  object, it is permanently lost.

In a cluster containing 10 OSDs with 512 placement groups in a three
replica pool, CRUSH will give each placement group three OSDs. In the
end, each OSD will end up hosting (512 * 3) / 10 = ~150 placement
groups. When the first OSD fails, the above scenario will therefore
start recovery for all ~150 placement groups at the same time.
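
The arithmetic generalizes; a minimal Python sketch (the function name
is illustrative)::

    def pg_copies_per_osd(pg_num, pool_size, osd_count):
        """Average number of PG copies hosted by each OSD."""
        return pg_num * pool_size / osd_count

    pg_copies_per_osd(512, 3, 10)   # ~153, i.e. ~150 placement groups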

The ~150 placement groups being recovered are likely to be
homogeneously spread over the 9 remaining OSDs. Each remaining OSD is
therefore likely to send copies of objects to all the others, and also
to receive some new objects to store, because it has become part of a
new placement group.

The amount of time it takes for this recovery to complete entirely
depends on the architecture of the Ceph cluster. Let's say each OSD is
hosted by a 1TB SSD on a single machine, all of them are connected to
a 10Gb/s switch, and the recovery for a single OSD completes within M
minutes. If there are two OSDs per machine using spinners with no SSD
journal and a 1Gb/s switch, it will be at least an order of magnitude
slower.

In a cluster of this size, the number of placement groups has almost
no influence on data durability. It could be 128 or 8192 and the
recovery would not be slower or faster.

However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs
is likely to speed up recovery and therefore improve data durability
significantly. Each OSD now participates in only ~75 placement groups
instead of ~150 when there were only 10 OSDs, and recovery will still
require all 19 remaining OSDs to perform the same amount of object
copies. But where 10 OSDs had to copy approximately 100GB each, they
now have to copy 50GB each instead. If the network was the bottleneck,
recovery will happen twice as fast. In other words, recovery goes
faster when the number of OSDs increases.

If this cluster grows to 40 OSDs, each of them will only host ~35
placement groups. If an OSD dies, recovery will keep getting faster
unless it is blocked by another bottleneck. However, if this cluster
grows to 200 OSDs, each of them will only host ~7 placement groups. If
an OSD dies, recovery will happen between at most ~21 (7 * 3) OSDs in
these placement groups: recovery will take longer than when there were
40 OSDs, meaning the number of placement groups should be increased.

No matter how short the recovery time is, there is a chance for a
second OSD to fail while it is in progress. In the 10 OSD cluster
described above, if any of them fails, then ~17 placement groups
(i.e. ~150 / 9 placement groups being recovered) will only have one
surviving copy. And if any of the 8 remaining OSDs fails, the last
objects of two placement groups are likely to be lost (i.e. ~17 / 8
placement groups with only one remaining copy being recovered).

When the size of the cluster grows to 20 OSDs, the number of placement
groups damaged by the loss of three OSDs drops. The second OSD lost
will degrade ~4 placement groups (i.e. ~75 / 19 placement groups being
recovered) instead of ~17, and the third OSD lost will only lose data
if it is one of the four OSDs containing a surviving copy. In other
words, if the probability of losing one OSD during the recovery time
frame is 0.0001%, it goes from 17 * 10 * 0.0001% in the cluster with
10 OSDs to 4 * 20 * 0.0001% in the cluster with 20 OSDs.
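
Plugging in the numbers above (with the 0.0001% per-OSD failure
probability taken as a given, purely for illustration)::

    p_osd_loss = 0.0001 / 100     # 0.0001% chance of losing an OSD
                                  # during the recovery window

    # 10-OSD cluster: ~17 PGs are down to one copy after the 2nd failure
    risk_10 = 17 * 10 * p_osd_loss    # 1.7e-04

    # 20-OSD cluster: only ~4 PGs are down to one copy
    risk_20 = 4 * 20 * p_osd_loss     # 8.0e-05, less than half the risk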

In a nutshell, more OSDs mean faster recovery and a lower risk of
cascading failures leading to the permanent loss of a placement
group. Having 512 or 4096 placement groups is roughly equivalent in a
cluster with fewer than 50 OSDs as far as data durability is
concerned.

.. note:: It may take a long time for a new OSD added to the cluster
   to be populated with the placement groups assigned to it. However,
   no object degrades during this process and it has no impact on the
   durability of the data contained in the cluster.

.. _object distribution:

Object distribution within a pool
---------------------------------

Ideally objects are evenly distributed in each placement group. Since
CRUSH computes the placement group for each object, but does not
actually know how much data is stored in each OSD within this
placement group, the ratio between the number of placement groups and
the number of OSDs may influence the distribution of the data
significantly.

For instance, if there were a single placement group for ten OSDs in a
three replica pool, only three OSDs would be used because CRUSH would
have no other choice. When more placement groups are available,
objects are more likely to be evenly spread among them. CRUSH also
makes every effort to evenly spread OSDs among all existing placement
groups.

As long as there are one or two orders of magnitude more placement
groups than OSDs, the distribution should be even. For instance, 300
placement groups for 3 OSDs, 1000 placement groups for 10 OSDs, etc.

Uneven data distribution can be caused by factors other than the ratio
between OSDs and placement groups. Since CRUSH does not take into
account the size of the objects, a few very large objects may create
an imbalance. Let's say one million 4K objects totaling 4GB are evenly
spread among 1000 placement groups on 10 OSDs. They will use 4GB / 10
= 400MB on each OSD. If one 400MB object is added to the pool, the
three OSDs supporting the placement group in which the object has been
placed will be filled with 400MB + 400MB = 800MB while the seven
others will remain occupied with only 400MB.
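
The same arithmetic in a short sketch (sizes as in the example above)::

    objects = 1000000             # one million 4K objects
    object_size_kb = 4
    osd_count = 10

    # Evenly spread, each OSD stores about 400MB
    per_osd_mb = objects * object_size_kb / 1024 / osd_count   # ~400

    # A single 400MB object lands in one placement group, i.e. on the
    # three OSDs serving that PG (pool size 3): those three OSDs now
    # hold ~800MB while the other seven stay at ~400MB.
    hot_osd_mb = per_osd_mb + 400                              # ~800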

.. _resource usage:

Memory, CPU and network usage
-----------------------------

For each placement group, OSDs and MONs need memory, network and CPU
at all times, and even more during recovery. Sharing this overhead by
clustering objects within a placement group is one of the main reasons
placement groups exist.

Minimizing the number of placement groups saves significant amounts of
resources.

Choosing the number of Placement Groups
=======================================

If you have more than 50 OSDs, we recommend approximately 50-100
placement groups per OSD to balance out resource usage, data
durability and distribution. If you have fewer than 50 OSDs, choosing
among the `preselection`_ values above is best. For a single pool of
objects, you can use the following formula to get a baseline::

                 (OSDs * 100)
    Total PGs =  ------------
                  pool size

Where **pool size** is either the number of replicas for replicated
pools or the K+M sum for erasure coded pools (as returned by **ceph
osd erasure-code-profile get**).

You should then check if the result makes sense with the way you
designed your Ceph cluster to maximize `data durability`_ and
`object distribution`_ and to minimize `resource usage`_.

The result should be **rounded up to the nearest power of two.**
Rounding up is optional, but recommended for CRUSH to evenly balance
the number of objects among placement groups.

As an example, for a cluster with 200 OSDs and a pool size of 3
replicas, you would estimate your number of PGs as follows::

    (200 * 100)
    ----------- = 6667. Nearest power of 2: 8192
         3
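
The same computation, including the rounding, can be sketched in
Python (the helper name is illustrative)::

    def suggested_pg_num(osd_count, pool_size, pgs_per_osd=100):
        """Baseline (OSDs * 100) / pool size, rounded up to a power of two."""
        raw = osd_count * pgs_per_osd / float(pool_size)
        pg_num = 1
        while pg_num < raw:
            pg_num *= 2
        return pg_num

    suggested_pg_num(200, 3)    # (200 * 100) / 3 = ~6667 -> 8192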

When using multiple data pools for storing objects, you need to ensure
that you balance the number of placement groups per pool with the
number of placement groups per OSD, so that you arrive at a reasonable
total number of placement groups that provides reasonably low variance
per OSD without taxing system resources or making the peering process
too slow.

For instance, a cluster of 10 pools each with 512 placement groups on
ten OSDs is a total of 5,120 placement groups spread over ten OSDs,
that is 512 placement groups per OSD. That does not use too many
resources. However, if 1,000 pools were created with 512 placement
groups each, the OSDs will handle ~50,000 placement groups each and it
would require significantly more resources and time for peering.
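
A quick way to check such a layout before creating the pools (a purely
illustrative helper)::

    def pgs_per_osd(pool_pg_nums, osd_count):
        """Total PGs across pools divided by the number of OSDs.

        Replica copies multiply the per-OSD load further, as in the
        durability example above.
        """
        return sum(pool_pg_nums) / osd_count

    pgs_per_osd([512] * 10, 10)      # 512 PGs per OSD: reasonable
    pgs_per_osd([512] * 1000, 10)    # 51200 PGs per OSD: far too many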

You may find the `PGCalc`_ tool helpful.


.. _setting the number of placement groups:

Set the Number of Placement Groups
==================================

To set the number of placement groups in a pool, you must specify the
number of placement groups at the time you create the pool.
See `Create a Pool`_ for details. Once you've set placement groups for
a pool, you may increase the number of placement groups (but you
cannot decrease it). To increase the number of placement groups,
execute the following::

    ceph osd pool set {pool-name} pg_num {pg_num}

Once you increase the number of placement groups, you must also
increase the number of placement groups for placement (``pgp_num``)
before your cluster will rebalance. The ``pgp_num`` is the number of
placement groups that will be considered for placement by the CRUSH
algorithm. Increasing ``pg_num`` splits the placement groups, but data
will not be migrated to the newer placement groups until the number of
placement groups for placement, i.e. ``pgp_num``, is increased. The
``pgp_num`` should be equal to the ``pg_num``. To increase the number
of placement groups for placement, execute the following::

    ceph osd pool set {pool-name} pgp_num {pgp_num}
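
For example, to grow a hypothetical pool named ``data`` (the pool name
and target value are for illustration only) to 128 placement groups::

    ceph osd pool set data pg_num 128
    ceph osd pool set data pgp_num 128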


Get the Number of Placement Groups
==================================

To get the number of placement groups in a pool, execute the following::

    ceph osd pool get {pool-name} pg_num


Get a Cluster's PG Statistics
=============================

To get the statistics for the placement groups in your cluster, execute the following::

    ceph pg dump [--format {format}]

Valid formats are ``plain`` (default) and ``json``.


Get Statistics for Stuck PGs
============================

To get the statistics for all placement groups stuck in a specified
state, execute the following::

    ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>]

**Inactive** Placement groups cannot process reads or writes because
they are waiting for an OSD with the most up-to-date data to come up
and in.

**Unclean** Placement groups contain objects that are not replicated
the desired number of times. They should be recovering.

**Stale** Placement groups are in an unknown state: the OSDs that host
them have not reported to the monitor cluster in a while (configured
by ``mon_osd_report_timeout``).

Valid formats are ``plain`` (default) and ``json``. The threshold
defines the minimum number of seconds the placement group must be
stuck before it is included in the returned statistics (default 300
seconds).


Get a PG Map
============

To get the placement group map for a particular placement group, execute the following::

    ceph pg map {pg-id}

For example::

    ceph pg map 1.6c

Ceph will return the placement group map, the placement group, and the OSD status::

    osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]


Get a PG's Statistics
=====================

To retrieve statistics for a particular placement group, execute the following::

    ceph pg {pg-id} query


Scrub a Placement Group
=======================

To scrub a placement group, execute the following::

    ceph pg scrub {pg-id}

Ceph checks the primary and any replica nodes, generates a catalog of
all objects in the placement group, and compares them to ensure that
no objects are missing or mismatched and that their contents are
consistent. Assuming the replicas all match, a final semantic sweep
ensures that all snapshot-related object metadata is consistent.
Errors are reported via logs.


Revert Lost
===========

If the cluster has lost one or more objects, and you have decided to
abandon the search for the lost data, you must mark the unfound
objects as ``lost``.

If all possible locations have been queried and the objects are still
lost, you may have to give up on them. This can happen with unusual
combinations of failures that allow the cluster to learn about writes
that were performed before the writes themselves are recovered.

Currently the only supported option is "revert", which will either
roll back to a previous version of the object or (if it was a new
object) forget about it entirely. To mark the "unfound" objects as
"lost", execute the following::

    ceph pg {pg-id} mark_unfound_lost revert|delete

.. important:: Use this feature with caution, because it may confuse
   applications that expect the object(s) to exist.


.. toctree::
   :hidden:

   pg-states
   pg-concepts


.. _Create a Pool: ../pools#createpool
.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
.. _pgcalc: http://ceph.com/pgcalc/