==================
 Placement Groups
==================

.. _preselection:

A preselection of pg_num
========================

When creating a new pool with::

    ceph osd pool create {pool-name} pg_num

it is mandatory to choose the value of ``pg_num`` because it cannot be
calculated automatically. Here are a few values commonly used:

- Fewer than 5 OSDs: set ``pg_num`` to 128

- Between 5 and 10 OSDs: set ``pg_num`` to 512

- Between 10 and 50 OSDs: set ``pg_num`` to 1024

- If you have more than 50 OSDs, you need to understand the tradeoffs
  and how to calculate the ``pg_num`` value by yourself

- To calculate the ``pg_num`` value yourself, use the `pgcalc`_ tool
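The preselection list can be written as a small lookup. This is only a sketch: the function name ``preselect_pg_num`` is hypothetical, and because the list's ranges overlap at exactly 10 OSDs, the sketch assumes that 10 OSDs falls in the 5-10 bracket.

```python
def preselect_pg_num(num_osds: int) -> int:
    """Return the commonly used pg_num for a small cluster.

    Assumption: a cluster with exactly 10 OSDs is treated as part of
    the "between 5 and 10" bracket, since the ranges overlap there.
    """
    if num_osds < 1:
        raise ValueError("a cluster needs at least one OSD")
    if num_osds < 5:
        return 128
    if num_osds <= 10:
        return 512
    if num_osds <= 50:
        return 1024
    # Above 50 OSDs there is no preselected value.
    raise ValueError("more than 50 OSDs: calculate pg_num yourself")


for osds in (3, 8, 40):
    print(f"{osds} OSDs -> pg_num {preselect_pg_num(osds)}")
```

Raising an error above 50 OSDs, rather than guessing a value, mirrors the advice that larger clusters need a deliberate calculation.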

As the number of OSDs increases, choosing the right value for ``pg_num``
becomes more important because it has a significant influence on the
behavior of the cluster as well as the durability of the data when
something goes wrong (i.e. the probability that a catastrophic event
leads to data loss).

How are Placement Groups used?
==============================

A placement group (PG) aggregates objects within a pool because
tracking object placement and object metadata on a per-object basis is
computationally expensive: a system with millions of objects
cannot realistically track placement on a per-object basis.

.. ditaa::

           /-----\  /-----\  /-----\  /-----\  /-----\
           | obj |  | obj |  | obj |  | obj |  | obj |
           \-----/  \-----/  \-----/  \-----/  \-----/
              |        |        |        |        |
              +--------+--------+        +---+----+
              |                              |
              v                              v
   +-----------------------+      +-----------------------+
   |  Placement Group #1   |      |  Placement Group #2   |
   |                       |      |                       |
   +-----------------------+      +-----------------------+
               |                              |
               +------------------------------+
                              |
                              v
                  +-----------------------+
                  |        Pool           |
                  |                       |
                  +-----------------------+

The Ceph client will calculate which placement group an object should
be in. It does this by hashing the object ID and applying an operation
based on the number of PGs in the defined pool and the ID of the pool.
See `Mapping PGs to OSDs`_ for details.
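The idea of hashing an object ID into a placement group can be illustrated with a deliberately simplified sketch. It is not Ceph's actual algorithm: ``zlib.crc32`` and the plain modulo are stand-ins chosen for illustration, and ``object_to_pg`` is a hypothetical name. Only the ``{pool-id}.{hex-pg}`` shape of the result follows the PG IDs shown elsewhere in this document (for example ``1.6c``).

```python
import zlib


def object_to_pg(pool_id: int, object_id: str, pg_num: int) -> str:
    """Map an object ID to a PG ID string (illustrative only)."""
    if pg_num < 1:
        raise ValueError("pg_num must be positive")
    # Stand-in hash (assumption): Ceph uses its own hash function.
    hashed = zlib.crc32(object_id.encode("utf-8"))
    # Simplified reduction (assumption): a plain modulo over pg_num.
    pg_index = hashed % pg_num
    return f"{pool_id}.{pg_index:x}"


# The same object ID always lands in the same placement group.
print(object_to_pg(1, "my-object", 128))
print(object_to_pg(1, "my-object", 128))
```

The point of the sketch is that placement is computed rather than looked up, so a client needs only the object ID, the pool ID and the pool's PG count.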

The object's contents within a placement group are stored in a set of
OSDs. For instance, in a replicated pool of size two, each placement
group will store objects on two OSDs, as shown below.

.. ditaa::

   +-----------------------+      +-----------------------+
   |  Placement Group #1   |      |  Placement Group #2   |
   |                       |      |                       |
   +-----------------------+      +-----------------------+
        |             |               |             |
        v             v               v             v
   /----------\  /----------\    /----------\  /----------\
   |          |  |          |    |          |  |          |
   |  OSD #1  |  |  OSD #2  |    |  OSD #2  |  |  OSD #3  |
   |          |  |          |    |          |  |          |
   \----------/  \----------/    \----------/  \----------/


Should OSD #2 fail, another will be assigned to Placement Group #1 and
will be filled with copies of all objects in OSD #1. If the pool size
is changed from two to three, an additional OSD will be assigned to
the placement group and will receive copies of all objects in the
placement group.

Placement groups do not own an OSD; they share it with other
placement groups from the same pool or even other pools. If OSD #2
fails, Placement Group #2 will also have to restore copies of
objects, using OSD #3.

When the number of placement groups increases, the new placement
groups will be assigned OSDs. The result of the CRUSH function will
also change and some objects from the former placement groups will be
copied over to the new placement groups and removed from the old ones.

Placement Groups Tradeoffs
==========================

Data durability and even distribution among all OSDs call for more
placement groups but their number should be reduced to the minimum to
save CPU and memory.

.. _data durability:

Data durability
---------------

After an OSD fails, the risk of data loss increases until the data it
contained is fully recovered. Let's imagine a scenario that causes
permanent data loss in a single placement group:

- The OSD fails and all copies of the objects it contains are lost.
  For all objects within the placement group the number of replicas
  suddenly drops from three to two.

- Ceph starts recovery for this placement group by choosing a new OSD
  to re-create the third copy of all objects.

- Another OSD, within the same placement group, fails before the new
  OSD is fully populated with the third copy. Some objects will then
  only have one surviving copy.

- Ceph picks yet another OSD and keeps copying objects to restore the
  desired number of copies.

- A third OSD, within the same placement group, fails before recovery
  is complete. If this OSD contained the only remaining copy of an
  object, it is permanently lost.

In a cluster containing 10 OSDs with 512 placement groups in a three
replica pool, CRUSH will give each placement group three OSDs. In the
end, each OSD will end up hosting (512 * 3) / 10 = ~150 placement
groups. When the first OSD fails, the above scenario will therefore
start recovery for all 150 placement groups at the same time.
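The per-OSD figure above is simple arithmetic and can be checked with a few lines. The function name ``pg_copies_per_osd`` is hypothetical; the formula is the one quoted in the text, and the sketch assumes CRUSH spreads placement groups perfectly evenly.

```python
def pg_copies_per_osd(pg_num: int, pool_size: int, num_osds: int) -> float:
    """Average number of placement group copies hosted by each OSD."""
    return pg_num * pool_size / num_osds


# 512 PGs in a three replica pool, for the cluster sizes discussed here.
for num_osds in (10, 20, 40, 200):
    average = pg_copies_per_osd(512, 3, num_osds)
    print(f"{num_osds} OSDs: {average:.1f} PGs per OSD")
```

The exact value for 10 OSDs is 153.6, which the text rounds to ~150.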

The 150 placement groups being recovered are likely to be
homogeneously spread over the 9 remaining OSDs. Each remaining OSD is
therefore likely to send copies of objects to all others and also
receive some new objects to be stored because they became part of a
new placement group.

The amount of time it takes for this recovery to complete depends
entirely on the architecture of the Ceph cluster. Let's say each OSD is
hosted by a 1TB SSD on a single machine, all of them are connected
to a 10Gb/s switch, and the recovery for a single OSD completes within
M minutes. If there are two OSDs per machine using spinners with no
SSD journal and a 1Gb/s switch, recovery will be at least an order of
magnitude slower.

In a cluster of this size, the number of placement groups has almost
no influence on data durability. It could be 128 or 8192 and the
recovery would not be slower or faster.

However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs
is likely to speed up recovery and therefore improve data durability
significantly. Each OSD now participates in only ~75 placement groups
instead of ~150 when there were only 10 OSDs and it will still require
all 19 remaining OSDs to perform the same amount of object copies in
order to recover. But where 10 OSDs had to copy approximately 100GB
each, they now have to copy 50GB each instead. If the network was the
bottleneck, recovery will happen twice as fast. In other words,
recovery goes faster when the number of OSDs increases.
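A rough sketch of the 100GB versus 50GB comparison follows. It makes several assumptions that go beyond the text: the failed OSD held a full 1TB (taken as 1000GB), the copy work is divided evenly over 10 or 20 OSDs as in the rounded figures above, the network is the only bottleneck, and protocol overhead is ignored. Both function names are hypothetical.

```python
def recovery_share_gb(failed_osd_data_gb: float, num_osds: int) -> float:
    """Data each OSD copies if recovery work is split evenly."""
    return failed_osd_data_gb / num_osds


def transfer_seconds(data_gb: float, link_gbit_per_s: float) -> float:
    """Idealized transfer time: gigabytes to gigabits, divided by link speed."""
    return data_gb * 8 / link_gbit_per_s


for num_osds in (10, 20):
    share = recovery_share_gb(1000, num_osds)
    seconds = transfer_seconds(share, 10)
    print(f"{num_osds} OSDs: {share:.0f} GB each, about {seconds:.0f} s at 10 Gb/s")
```

Under these assumptions, doubling the number of OSDs halves both the per-OSD share and the idealized transfer time, which is the "twice as fast" claim in the text.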

If this cluster grows to 40 OSDs, each of them will only host ~35
placement groups. If an OSD dies, recovery will keep going faster
unless it is blocked by another bottleneck. However, if this cluster
grows to 200 OSDs, each of them will only host ~7 placement groups. If
an OSD dies, recovery will happen between at most ~21 (7 * 3) OSDs
in these placement groups: recovery will take longer than when there
were 40 OSDs, meaning the number of placement groups should be
increased.

No matter how short the recovery time is, there is a chance for a
second OSD to fail while it is in progress. In the 10-OSD cluster
described above, if any of the remaining OSDs fails, then ~17 placement
groups (i.e. ~150 / 9 placement groups being recovered) will only have
one surviving copy. And if any of the 8 remaining OSDs fails, the last
objects of two placement groups are likely to be lost (i.e. ~17 / 8
placement groups with only one remaining copy being recovered).

When the size of the cluster grows to 20 OSDs, the number of placement
groups damaged by the loss of three OSDs drops. The second OSD lost
will degrade ~4 (i.e. ~75 / 19 placement groups being recovered)
instead of ~17 and the third OSD lost will only lose data if it is one
of the four OSDs containing the surviving copy. In other words, if the
probability of losing one OSD is 0.0001% during the recovery time
frame, the risk goes from 17 * 10 * 0.0001% in the cluster with 10 OSDs
to 4 * 20 * 0.0001% in the cluster with 20 OSDs.
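The figures in the last two paragraphs can be reproduced as follows. This sketch simply evaluates the expressions given in the text and is not a rigorous durability model. The function names are hypothetical, and 0.0001% is written as ``1e-6``.

```python
def pgs_down_to_one_copy(pgs_per_osd: float, remaining_osds: int) -> int:
    """PGs left with a single copy after a second OSD fails (rounded)."""
    return round(pgs_per_osd / remaining_osds)


def risk_estimate(pgs_at_risk: int, num_osds: int, p_osd_failure: float) -> float:
    """The text's rough estimate: PGs at risk * OSD count * failure probability."""
    return pgs_at_risk * num_osds * p_osd_failure


P_FAILURE = 1e-6  # 0.0001% expressed as a fraction

at_risk_10 = pgs_down_to_one_copy(150, 9)   # ~150 / 9
at_risk_20 = pgs_down_to_one_copy(75, 19)   # ~75 / 19
risk_10 = risk_estimate(at_risk_10, 10, P_FAILURE)
risk_20 = risk_estimate(at_risk_20, 20, P_FAILURE)
print(at_risk_10, at_risk_20)
print(f"{risk_10:.2e} versus {risk_20:.2e}")
```

With these inputs the estimate for 20 OSDs is a little under half of the estimate for 10 OSDs.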

In a nutshell, more OSDs mean faster recovery and a lower risk of
cascading failures leading to the permanent loss of a placement
group. Having 512 or 4096 placement groups is roughly equivalent in a
cluster with fewer than 50 OSDs as far as data durability is concerned.

Note: It may take a long time for a new OSD added to the cluster to be
populated with placement groups that were assigned to it. However,
there is no degradation of any object and it has no impact on the
durability of the data contained in the cluster.

.. _object distribution:

Object distribution within a pool
---------------------------------

Ideally objects are evenly distributed in each placement group. Since
CRUSH computes the placement group for each object, but does not
actually know how much data is stored in each OSD within this
placement group, the ratio between the number of placement groups and
the number of OSDs may influence the distribution of the data
significantly.

For instance, if there were a single placement group for ten OSDs in a
three replica pool, only three OSDs would be used because CRUSH would
have no other choice. When more placement groups are available,
objects are more likely to be evenly spread among them. CRUSH also
makes every effort to evenly spread OSDs among all existing placement
groups.

As long as there are one or two orders of magnitude more placement
groups than OSDs, the distribution should be even. For instance, 300
placement groups for 3 OSDs, 1000 placement groups for 10 OSDs etc.

Uneven data distribution can be caused by factors other than the ratio
between OSDs and placement groups. Since CRUSH does not take into
account the size of the objects, a few very large objects may create
an imbalance. Let's say one million 4K objects totaling 4GB are evenly
spread among 1000 placement groups on 10 OSDs. They will use 4GB / 10
= 400MB on each OSD. If one 400MB object is added to the pool, the
three OSDs supporting the placement group in which the object has been
placed will be filled with 400MB + 400MB = 800MB while the seven
others will remain occupied with only 400MB.

.. _resource usage:

Memory, CPU and network usage
-----------------------------

For each placement group, OSDs and MONs need memory, network and CPU
at all times and even more during recovery. Sharing this overhead by
clustering objects within a placement group is one of the main reasons
they exist.

Minimizing the number of placement groups saves significant amounts of
resources.

Choosing the number of Placement Groups
=======================================

If you have more than 50 OSDs, we recommend approximately 50-100
placement groups per OSD to balance out resource usage, data
durability and distribution. If you have fewer than 50 OSDs, choosing
among the `preselection`_ above is best. For a single pool of objects,
you can use the following formula to get a baseline::

                (OSDs * 100)
   Total PGs =  ------------
                 pool size

Where **pool size** is either the number of replicas for replicated
pools or the K+M sum for erasure coded pools (as returned by **ceph
osd erasure-code-profile get**).

You should then check if the result makes sense with the way you
designed your Ceph cluster to maximize `data durability`_,
`object distribution`_ and minimize `resource usage`_.

The result should be **rounded up to the nearest power of two.**
Rounding up is optional, but recommended for CRUSH to evenly balance
the number of objects among placement groups.

As an example, for a cluster with 200 OSDs and a pool size of 3
replicas, you would estimate your number of PGs as follows::

   (200 * 100)
   ----------- = 6667. Nearest power of 2: 8192
        3
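The baseline formula and the rounding step can be sketched in a few lines. The function names are hypothetical, and the default of 100 PGs per OSD is the factor used in the formula above.

```python
import math


def next_power_of_two(value: int) -> int:
    """Smallest power of two that is greater than or equal to value."""
    if value < 1:
        raise ValueError("value must be positive")
    return 1 << (value - 1).bit_length()


def baseline_total_pgs(num_osds: int, pool_size: int, pgs_per_osd: int = 100) -> int:
    """(OSDs * 100) / pool size, rounded up to a power of two."""
    raw = math.ceil(num_osds * pgs_per_osd / pool_size)
    return next_power_of_two(raw)


# The worked example from the text: 200 OSDs, 3 replicas.
print(baseline_total_pgs(200, 3))
# An erasure coded pool with K=4 and M=2 has a pool size of 6.
print(baseline_total_pgs(200, 4 + 2))
```

The first call reproduces the 8192 of the worked example. The erasure coded call is an additional illustration with made-up K and M values.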

When using multiple data pools for storing objects, you need to ensure
that you balance the number of placement groups per pool with the
number of placement groups per OSD so that you arrive at a reasonable
total number of placement groups that provides reasonably low variance
per OSD without taxing system resources or making the peering process
too slow.

For instance, a cluster of 10 pools each with 512 placement groups on
ten OSDs is a total of 5,120 placement groups spread over ten OSDs,
that is 512 placement groups per OSD. That does not use too many
resources. However, if 1,000 pools were created with 512 placement
groups each, the OSDs would handle ~50,000 placement groups each and it
would require significantly more resources and time for peering.
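The multi-pool arithmetic can be checked the same way. As in the text, this sketch counts placement groups only and ignores replica copies; the function name is hypothetical.

```python
def pgs_per_osd_across_pools(num_pools: int, pg_num_per_pool: int, num_osds: int) -> float:
    """Average placement groups per OSD when every pool has the same pg_num."""
    return num_pools * pg_num_per_pool / num_osds


print(pgs_per_osd_across_pools(10, 512, 10))
print(pgs_per_osd_across_pools(1000, 512, 10))
```

The second case works out to 51,200 placement groups per OSD, which the text rounds to ~50,000.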

You may find the `PGCalc`_ tool helpful.


.. _setting the number of placement groups:

Set the Number of Placement Groups
==================================

To set the number of placement groups in a pool, you must specify the
number of placement groups at the time you create the pool.
See `Create a Pool`_ for details. Once you've set placement groups for a
pool, you may increase the number of placement groups (but you cannot
decrease the number of placement groups). To increase the number of
placement groups, execute the following::

    ceph osd pool set {pool-name} pg_num {pg_num}

Once you increase the number of placement groups, you must also
increase the number of placement groups for placement (``pgp_num``)
before your cluster will rebalance. The ``pgp_num`` will be the number of
placement groups that will be considered for placement by the CRUSH
algorithm. Increasing ``pg_num`` splits the placement groups but data
will not be migrated to the newer placement groups until placement
groups for placement, i.e. ``pgp_num``, is increased. The ``pgp_num``
should be equal to the ``pg_num``. To increase the number of
placement groups for placement, execute the following::

    ceph osd pool set {pool-name} pgp_num {pgp_num}
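Because ``pg_num`` and ``pgp_num`` should end up equal, the two commands are usually issued as a pair. The sketch below only builds the command lines shown above and prints them; it does not contact a cluster. The helper name ``resize_commands`` and the pool name ``mypool`` are hypothetical.

```python
import shlex


def resize_commands(pool_name: str, pg_num: int) -> list:
    """Build the two commands, in order: pg_num first, then pgp_num."""
    return [
        ["ceph", "osd", "pool", "set", pool_name, "pg_num", str(pg_num)],
        ["ceph", "osd", "pool", "set", pool_name, "pgp_num", str(pg_num)],
    ]


# Print the commands for review instead of running them.
for command in resize_commands("mypool", 1024):
    print(shlex.join(command))
```

Building argument lists rather than one shell string keeps the pool name from being interpreted by a shell if the commands are later passed to ``subprocess.run``.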


Get the Number of Placement Groups
==================================

To get the number of placement groups in a pool, execute the following::

    ceph osd pool get {pool-name} pg_num


Get a Cluster's PG Statistics
=============================

To get the statistics for the placement groups in your cluster, execute the following::

    ceph pg dump [--format {format}]

Valid formats are ``plain`` (default) and ``json``.


Get Statistics for Stuck PGs
============================

To get the statistics for all placement groups stuck in a specified state,
execute the following::

    ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>]

**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD
with the most up-to-date data to come up and in.

**Unclean** Placement groups contain objects that are not replicated the desired number
of times. They should be recovering.

**Stale** Placement groups are in an unknown state - the OSDs that host them have not
reported to the monitor cluster in a while (configured by ``mon_osd_report_timeout``).

Valid formats are ``plain`` (default) and ``json``. The threshold defines the minimum number
of seconds the placement group is stuck before including it in the returned statistics
(default 300 seconds).


Get a PG Map
============

To get the placement group map for a particular placement group, execute the following::

    ceph pg map {pg-id}

For example::

    ceph pg map 1.6c

Ceph will return the placement group map, the placement group, and the OSD status::

    osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]
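If this output needs to be consumed by a script, it can be split into fields with a regular expression. The sketch below is based solely on the single sample line above, so the pattern is an assumption about the format and may not match other Ceph versions. The names ``PG_MAP_PATTERN`` and ``parse_pg_map`` are hypothetical.

```python
import re

# Pattern derived from the sample line only (assumption).
PG_MAP_PATTERN = re.compile(
    r"osdmap e(?P<epoch>\d+) pg (?P<pgid>\S+) \((?P<raw>[^)]+)\) -> "
    r"up \[(?P<up>[\d,]*)\] acting \[(?P<acting>[\d,]*)\]"
)


def parse_pg_map(line: str) -> dict:
    """Extract the osdmap epoch, PG ID, and up/acting OSD lists."""
    match = PG_MAP_PATTERN.search(line)
    if match is None:
        raise ValueError(f"unrecognized pg map output: {line!r}")

    def to_ids(text: str) -> list:
        return [int(part) for part in text.split(",") if part]

    return {
        "epoch": int(match["epoch"]),
        "pgid": match["pgid"],
        "up": to_ids(match["up"]),
        "acting": to_ids(match["acting"]),
    }


print(parse_pg_map("osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]"))
```

In the sample, the up set and the acting set both contain OSDs 1 and 0.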


Get a PG's Statistics
=====================

To retrieve statistics for a particular placement group, execute the following::

    ceph pg {pg-id} query


Scrub a Placement Group
=======================

To scrub a placement group, execute the following::

    ceph pg scrub {pg-id}

Ceph checks the primary and any replica nodes, generates a catalog of all objects
in the placement group and compares them to ensure that no objects are missing
or mismatched, and their contents are consistent. Assuming the replicas all
match, a final semantic sweep ensures that all of the snapshot-related object
metadata is consistent. Errors are reported via logs.


Revert Lost
===========

If the cluster has lost one or more objects, and you have decided to
abandon the search for the lost data, you must mark the unfound objects
as ``lost``.

If all possible locations have been queried and objects are still
lost, you may have to give up on the lost objects. This is
possible given unusual combinations of failures that allow the cluster
to learn about writes that were performed before the writes themselves
are recovered.

The ``revert`` option will either roll back to a previous version of the
object or (if it was a new object) forget about it entirely. The ``delete``
option will forget about the object entirely. To mark the "unfound"
objects as "lost", execute the following::

    ceph pg {pg-id} mark_unfound_lost revert|delete

.. important:: Use this feature with caution, because it may confuse
   applications that expect the object(s) to exist.


.. toctree::
   :hidden:

   pg-states
   pg-concepts


.. _Create a Pool: ../pools#createpool
.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
.. _pgcalc: http://ceph.com/pgcalc/
430.. toctree::
431 :hidden:
432
433 pg-states
434 pg-concepts
435
436
437.. _Create a Pool: ../pools#createpool
438.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
439.. _pgcalc: http://ceph.com/pgcalc/