==================
 Placement Groups
==================

.. _preselection:

A preselection of pg_num
========================

When creating a new pool with::

    ceph osd pool create {pool-name} pg_num

it is mandatory to choose the value of ``pg_num`` because it cannot be
calculated automatically. Here are a few values commonly used:

- Less than 5 OSDs: set ``pg_num`` to 128

- Between 5 and 10 OSDs: set ``pg_num`` to 512

- Between 10 and 50 OSDs: set ``pg_num`` to 1024

- If you have more than 50 OSDs, you need to understand the tradeoffs
  and how to calculate the ``pg_num`` value by yourself

- To calculate the ``pg_num`` value by yourself, the `pgcalc`_ tool
  can help

As the number of OSDs increases, choosing the right value for
``pg_num`` becomes more important because it has a significant
influence on the behavior of the cluster as well as the durability of
the data when something goes wrong (i.e. the probability that a
catastrophic event leads to data loss).
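
As a rough sanity check, the preselection above can be written as a
small lookup. The following Python sketch is an illustration only; it
is not part of Ceph, and the exact boundary handling is an
assumption::

    def preselect_pg_num(osd_count):
        """Commonly used starting pg_num values for small clusters."""
        if osd_count < 5:
            return 128
        if osd_count <= 10:
            return 512
        if osd_count <= 50:
            return 1024
        # Beyond 50 OSDs, calculate pg_num yourself (see the pgcalc tool).
        raise ValueError("more than 50 OSDs: calculate pg_num explicitly")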

How are Placement Groups used?
==============================

A placement group (PG) aggregates objects within a pool because
tracking object placement and object metadata on a per-object basis is
computationally expensive: a system with millions of objects cannot
realistically track placement on a per-object basis.

.. ditaa::
           /-----\  /-----\  /-----\  /-----\  /-----\
           | obj |  | obj |  | obj |  | obj |  | obj |
           \-----/  \-----/  \-----/  \-----/  \-----/
              |        |        |        |        |
              +--------+--------+        +---+----+
              |                              |
              v                              v
   +-----------------------+      +-----------------------+
   |  Placement Group #1   |      |  Placement Group #2   |
   |                       |      |                       |
   +-----------------------+      +-----------------------+
               |                              |
               +------------------------------+
                             |
                             v
                  +-----------------------+
                  |        Pool           |
                  |                       |
                  +-----------------------+

The Ceph client will calculate which placement group an object should
be in. It does this by hashing the object ID and applying an operation
based on the number of PGs in the defined pool and the ID of the pool.
See `Mapping PGs to OSDs`_ for details.
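
In simplified form, that mapping can be sketched in Python. This is an
illustration only: the real implementation uses Ceph's own hash
function rather than Python's built-in ``hash()``, and a "stable"
modulo that behaves well when ``pg_num`` is not a power of two::

    def object_to_pg(pool_id, object_id, pg_num):
        # Stand-in for Ceph's hash of the object name.
        h = hash(object_id) & 0xffffffff
        # Fold the hash onto one of the pool's placement groups.
        pg = h % pg_num
        # PG IDs are conventionally written {pool-id}.{pg-id-in-hex},
        # e.g. "1.6c".
        return "{}.{:x}".format(pool_id, pg)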

The object's contents within a placement group are stored in a set of
OSDs. For instance, in a replicated pool of size two, each placement
group will store objects on two OSDs, as shown below.

.. ditaa::

   +-----------------------+      +-----------------------+
   |  Placement Group #1   |      |  Placement Group #2   |
   |                       |      |                       |
   +-----------------------+      +-----------------------+
        |             |               |             |
        v             v               v             v
   /----------\  /----------\    /----------\  /----------\
   |          |  |          |    |          |  |          |
   |  OSD #1  |  |  OSD #2  |    |  OSD #2  |  |  OSD #3  |
   |          |  |          |    |          |  |          |
   \----------/  \----------/    \----------/  \----------/

Should OSD #2 fail, another OSD will be assigned to Placement Group #1
and will be filled with copies of all objects in OSD #1. If the pool
size is changed from two to three, an additional OSD will be assigned
to the placement group and will receive copies of all objects in the
placement group.

Placement groups do not own the OSD; they share it with other
placement groups from the same pool or even other pools. If OSD #2
fails, Placement Group #2 will also have to restore copies of its
objects, using OSD #3.

When the number of placement groups increases, the new placement
groups will be assigned OSDs. The result of the CRUSH function will
also change, and some objects from the former placement groups will be
copied over to the new placement groups and removed from the old ones.

Placement Groups Tradeoffs
==========================

Data durability and an even distribution among all OSDs call for more
placement groups, but their number should be kept to the minimum
necessary to save CPU and memory.

.. _data durability:

Data durability
---------------

After an OSD fails, the risk of data loss increases until the data it
contained is fully recovered. Let's imagine a scenario that causes
permanent data loss in a single placement group:

- The OSD fails and all copies of the objects it contains are lost.
  For all objects within the placement group, the number of replicas
  suddenly drops from three to two.

- Ceph starts recovery for this placement group by choosing a new OSD
  to re-create the third copy of all objects.

- Another OSD, within the same placement group, fails before the new
  OSD is fully populated with the third copy. Some objects will then
  only have one surviving copy.

- Ceph picks yet another OSD and keeps copying objects to restore the
  desired number of copies.

- A third OSD, within the same placement group, fails before recovery
  is complete. If this OSD contained the only remaining copy of an
  object, it is permanently lost.

In a cluster containing 10 OSDs with 512 placement groups in a three
replica pool, CRUSH will give each placement group three OSDs. In the
end, each OSD will end up hosting (512 * 3) / 10 = ~150 placement
groups. When the first OSD fails, the above scenario will therefore
start recovery for all ~150 placement groups at the same time.
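
The arithmetic generalizes; a minimal Python sketch (the function name
is illustrative)::

    def pg_copies_per_osd(pg_num, pool_size, osd_count):
        """Average number of PG copies hosted by each OSD."""
        return pg_num * pool_size / osd_count

    pg_copies_per_osd(512, 3, 10)   # ~153, i.e. ~150 placement groups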

The ~150 placement groups being recovered are likely to be
homogeneously spread over the 9 remaining OSDs. Each remaining OSD is
therefore likely to send copies of objects to all the others, and also
to receive some new objects to store, because it has become part of a
new placement group.

The amount of time it takes for this recovery to complete entirely
depends on the architecture of the Ceph cluster. Let's say each OSD is
hosted by a 1TB SSD on a single machine, all of them are connected to
a 10Gb/s switch, and the recovery for a single OSD completes within M
minutes. If there are two OSDs per machine using spinners with no SSD
journal and a 1Gb/s switch, it will be at least an order of magnitude
slower.

In a cluster of this size, the number of placement groups has almost
no influence on data durability. It could be 128 or 8192 and the
recovery would not be slower or faster.

However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs
is likely to speed up recovery and therefore improve data durability
significantly. Each OSD now participates in only ~75 placement groups
instead of ~150 when there were only 10 OSDs, and recovery will still
require all 19 remaining OSDs to perform the same amount of object
copies. But where 10 OSDs had to copy approximately 100GB each, they
now have to copy 50GB each instead. If the network was the bottleneck,
recovery will happen twice as fast. In other words, recovery goes
faster when the number of OSDs increases.

If this cluster grows to 40 OSDs, each of them will only host ~35
placement groups. If an OSD dies, recovery will keep getting faster
unless it is blocked by another bottleneck. However, if this cluster
grows to 200 OSDs, each of them will only host ~7 placement groups. If
an OSD dies, recovery will happen between at most ~21 (7 * 3) OSDs in
these placement groups: recovery will take longer than when there were
40 OSDs, meaning the number of placement groups should be increased.

No matter how short the recovery time is, there is a chance for a
second OSD to fail while it is in progress. In the 10 OSD cluster
described above, if any of them fails, then ~17 placement groups
(i.e. ~150 / 9 placement groups being recovered) will only have one
surviving copy. And if any of the 8 remaining OSDs fails, the last
objects of two placement groups are likely to be lost (i.e. ~17 / 8
placement groups with only one remaining copy being recovered).

When the size of the cluster grows to 20 OSDs, the number of placement
groups damaged by the loss of three OSDs drops. The second OSD lost
will degrade ~4 placement groups (i.e. ~75 / 19 placement groups being
recovered) instead of ~17, and the third OSD lost will only lose data
if it is one of the four OSDs containing a surviving copy. In other
words, if the probability of losing one OSD during the recovery time
frame is 0.0001%, it goes from 17 * 10 * 0.0001% in the cluster with
10 OSDs to 4 * 20 * 0.0001% in the cluster with 20 OSDs.
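
Plugging in the numbers above (with the 0.0001% per-OSD failure
probability taken as a given, purely for illustration)::

    p_osd_loss = 0.0001 / 100     # 0.0001% chance of losing an OSD
                                  # during the recovery window

    # 10-OSD cluster: ~17 PGs are down to one copy after the 2nd failure
    risk_10 = 17 * 10 * p_osd_loss    # 1.7e-04

    # 20-OSD cluster: only ~4 PGs are down to one copy
    risk_20 = 4 * 20 * p_osd_loss     # 8.0e-05, less than half the risk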

In a nutshell, more OSDs mean faster recovery and a lower risk of
cascading failures leading to the permanent loss of a placement
group. Having 512 or 4096 placement groups is roughly equivalent in a
cluster with fewer than 50 OSDs as far as data durability is
concerned.

.. note:: It may take a long time for a new OSD added to the cluster
   to be populated with the placement groups assigned to it. However,
   no object degrades during this process and it has no impact on the
   durability of the data contained in the cluster.

.. _object distribution:

Object distribution within a pool
---------------------------------

Ideally objects are evenly distributed in each placement group. Since
CRUSH computes the placement group for each object, but does not
actually know how much data is stored in each OSD within this
placement group, the ratio between the number of placement groups and
the number of OSDs may influence the distribution of the data
significantly.

For instance, if there were a single placement group for ten OSDs in a
three replica pool, only three OSDs would be used because CRUSH would
have no other choice. When more placement groups are available,
objects are more likely to be evenly spread among them. CRUSH also
makes every effort to evenly spread OSDs among all existing placement
groups.

As long as there are one or two orders of magnitude more placement
groups than OSDs, the distribution should be even. For instance, 300
placement groups for 3 OSDs, 1000 placement groups for 10 OSDs, etc.

Uneven data distribution can be caused by factors other than the ratio
between OSDs and placement groups. Since CRUSH does not take into
account the size of the objects, a few very large objects may create
an imbalance. Let's say one million 4K objects totaling 4GB are evenly
spread among 1000 placement groups on 10 OSDs. They will use 4GB / 10
= 400MB on each OSD. If one 400MB object is added to the pool, the
three OSDs supporting the placement group in which the object has been
placed will be filled with 400MB + 400MB = 800MB while the seven
others will remain occupied with only 400MB.
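
The same arithmetic in a short sketch (sizes as in the example above)::

    objects = 1000000             # one million 4K objects
    object_size_kb = 4
    osd_count = 10

    # Evenly spread, each OSD stores about 400MB
    per_osd_mb = objects * object_size_kb / 1024 / osd_count   # ~400

    # A single 400MB object lands in one placement group, i.e. on the
    # three OSDs serving that PG (pool size 3): those three OSDs now
    # hold ~800MB while the other seven stay at ~400MB.
    hot_osd_mb = per_osd_mb + 400                              # ~800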

.. _resource usage:

Memory, CPU and network usage
-----------------------------

For each placement group, OSDs and MONs need memory, network and CPU
at all times, and even more during recovery. Sharing this overhead by
clustering objects within a placement group is one of the main reasons
placement groups exist.

Minimizing the number of placement groups saves significant amounts of
resources.

Choosing the number of Placement Groups
=======================================

If you have more than 50 OSDs, we recommend approximately 50-100
placement groups per OSD to balance out resource usage, data
durability and distribution. If you have fewer than 50 OSDs, choosing
among the `preselection`_ values above is best. For a single pool of
objects, you can use the following formula to get a baseline::

                 (OSDs * 100)
    Total PGs =  ------------
                  pool size

Where **pool size** is either the number of replicas for replicated
pools or the K+M sum for erasure coded pools (as returned by **ceph
osd erasure-code-profile get**).

You should then check if the result makes sense with the way you
designed your Ceph cluster to maximize `data durability`_ and
`object distribution`_ and to minimize `resource usage`_.

The result should be **rounded up to the nearest power of two.**
Rounding up is optional, but recommended for CRUSH to evenly balance
the number of objects among placement groups.

As an example, for a cluster with 200 OSDs and a pool size of 3
replicas, you would estimate your number of PGs as follows::

    (200 * 100)
    ----------- = 6667. Nearest power of 2: 8192
         3
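
The same computation, including the rounding, can be sketched in
Python (the helper name is illustrative)::

    def suggested_pg_num(osd_count, pool_size, pgs_per_osd=100):
        """Baseline (OSDs * 100) / pool size, rounded up to a power of two."""
        raw = osd_count * pgs_per_osd / float(pool_size)
        pg_num = 1
        while pg_num < raw:
            pg_num *= 2
        return pg_num

    suggested_pg_num(200, 3)    # (200 * 100) / 3 = ~6667 -> 8192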

When using multiple data pools for storing objects, you need to ensure
that you balance the number of placement groups per pool with the
number of placement groups per OSD, so that you arrive at a reasonable
total number of placement groups that provides reasonably low variance
per OSD without taxing system resources or making the peering process
too slow.

For instance, a cluster of 10 pools each with 512 placement groups on
ten OSDs is a total of 5,120 placement groups spread over ten OSDs,
that is 512 placement groups per OSD. That does not use too many
resources. However, if 1,000 pools were created with 512 placement
groups each, the OSDs will handle ~50,000 placement groups each and it
would require significantly more resources and time for peering.
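
A quick way to check such a layout before creating the pools (a purely
illustrative helper)::

    def pgs_per_osd(pool_pg_nums, osd_count):
        """Total PGs across pools divided by the number of OSDs.

        Replica copies multiply the per-OSD load further, as in the
        durability example above.
        """
        return sum(pool_pg_nums) / osd_count

    pgs_per_osd([512] * 10, 10)      # 512 PGs per OSD: reasonable
    pgs_per_osd([512] * 1000, 10)    # 51200 PGs per OSD: far too many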

You may find the `PGCalc`_ tool helpful.


.. _setting the number of placement groups:

Set the Number of Placement Groups
==================================

To set the number of placement groups in a pool, you must specify the
number of placement groups at the time you create the pool.
See `Create a Pool`_ for details. Once you've set placement groups for
a pool, you may increase the number of placement groups (but you
cannot decrease it). To increase the number of placement groups,
execute the following::

    ceph osd pool set {pool-name} pg_num {pg_num}

Once you increase the number of placement groups, you must also
increase the number of placement groups for placement (``pgp_num``)
before your cluster will rebalance. The ``pgp_num`` is the number of
placement groups that will be considered for placement by the CRUSH
algorithm. Increasing ``pg_num`` splits the placement groups, but data
will not be migrated to the newer placement groups until the number of
placement groups for placement, i.e. ``pgp_num``, is increased. The
``pgp_num`` should be equal to the ``pg_num``. To increase the number
of placement groups for placement, execute the following::

    ceph osd pool set {pool-name} pgp_num {pgp_num}
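
For example, to grow a hypothetical pool named ``data`` (the pool name
and target value are for illustration only) to 128 placement groups::

    ceph osd pool set data pg_num 128
    ceph osd pool set data pgp_num 128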


Get the Number of Placement Groups
==================================

To get the number of placement groups in a pool, execute the following::

    ceph osd pool get {pool-name} pg_num


Get a Cluster's PG Statistics
=============================

To get the statistics for the placement groups in your cluster, execute the following::

    ceph pg dump [--format {format}]

Valid formats are ``plain`` (default) and ``json``.


Get Statistics for Stuck PGs
============================

To get the statistics for all placement groups stuck in a specified
state, execute the following::

    ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>]

**Inactive** Placement groups cannot process reads or writes because
they are waiting for an OSD with the most up-to-date data to come up
and in.

**Unclean** Placement groups contain objects that are not replicated
the desired number of times. They should be recovering.

**Stale** Placement groups are in an unknown state: the OSDs that host
them have not reported to the monitor cluster in a while (configured
by ``mon_osd_report_timeout``).

Valid formats are ``plain`` (default) and ``json``. The threshold
defines the minimum number of seconds the placement group must be
stuck before it is included in the returned statistics (default 300
seconds).


Get a PG Map
============

To get the placement group map for a particular placement group, execute the following::

    ceph pg map {pg-id}

For example::

    ceph pg map 1.6c

Ceph will return the placement group map, the placement group, and the OSD status::

    osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]


Get a PG's Statistics
=====================

To retrieve statistics for a particular placement group, execute the following::

    ceph pg {pg-id} query


Scrub a Placement Group
=======================

To scrub a placement group, execute the following::

    ceph pg scrub {pg-id}

Ceph checks the primary and any replica nodes, generates a catalog of
all objects in the placement group, and compares them to ensure that
no objects are missing or mismatched and that their contents are
consistent. Assuming the replicas all match, a final semantic sweep
ensures that all snapshot-related object metadata is consistent.
Errors are reported via logs.


Revert Lost
===========

If the cluster has lost one or more objects, and you have decided to
abandon the search for the lost data, you must mark the unfound
objects as ``lost``.

If all possible locations have been queried and the objects are still
lost, you may have to give up on them. This can happen with unusual
combinations of failures that allow the cluster to learn about writes
that were performed before the writes themselves are recovered.

Currently the only supported option is "revert", which will either
roll back to a previous version of the object or (if it was a new
object) forget about it entirely. To mark the "unfound" objects as
"lost", execute the following::

    ceph pg {pg-id} mark_unfound_lost revert|delete

.. important:: Use this feature with caution, because it may confuse
   applications that expect the object(s) to exist.


.. toctree::
   :hidden:

   pg-states
   pg-concepts


.. _Create a Pool: ../pools#createpool
.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
.. _pgcalc: http://ceph.com/pgcalc/