==================
 Placement Groups
==================

.. _preselection:

A preselection of pg_num
========================

When creating a new pool with::

        ceph osd pool create {pool-name} {pg_num}

it is mandatory to choose the value of ``pg_num`` because it cannot be
calculated automatically. Here are a few commonly used values:

- With fewer than 5 OSDs, set ``pg_num`` to 128

- With between 5 and 10 OSDs, set ``pg_num`` to 512

- With between 10 and 50 OSDs, set ``pg_num`` to 1024

- If you have more than 50 OSDs, you need to understand the tradeoffs
  and how to calculate the ``pg_num`` value by yourself

- To calculate the ``pg_num`` value yourself, use the `pgcalc`_ tool

As the number of OSDs increases, choosing the right value for ``pg_num``
becomes more important because it has a significant influence on the
behavior of the cluster as well as the durability of the data when
something goes wrong (i.e. the probability that a catastrophic event
leads to data loss).
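
The commonly used values listed above can also be read as a lookup
table. The following is a minimal sketch and not part of Ceph: the
function name ``preselect_pg_num`` is hypothetical, and because the list
leaves the range boundaries open (10 OSDs appears in two ranges),
treating each upper bound as inclusive is an assumption made here.

.. code-block:: python

   from typing import Optional


   def preselect_pg_num(osd_count: int) -> Optional[int]:
       """Return a commonly used pg_num for a small cluster, or None."""
       if osd_count < 1:
           raise ValueError("osd_count must be at least 1")
       if osd_count < 5:
           return 128
       if osd_count <= 10:  # assumption: 10 OSDs belongs to the 5-10 range
           return 512
       if osd_count <= 50:  # assumption: 50 OSDs belongs to the 10-50 range
           return 1024
       return None  # more than 50 OSDs: calculate pg_num yourself


   print(preselect_pg_num(8))  # 512

A ``None`` result signals that the cluster is large enough that the
value should be calculated rather than picked from the list.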

How are Placement Groups used?
==============================

A placement group (PG) aggregates objects within a pool because
tracking object placement and object metadata on a per-object basis is
computationally expensive: a system with millions of objects cannot
realistically track placement on a per-object basis.

.. ditaa::
           /-----\  /-----\  /-----\  /-----\  /-----\
           | obj |  | obj |  | obj |  | obj |  | obj |
           \-----/  \-----/  \-----/  \-----/  \-----/
              |        |        |        |        |
              +--------+--------+        +---+----+
              |                              |
              v                              v
   +-----------------------+      +-----------------------+
   |  Placement Group #1   |      |  Placement Group #2   |
   |                       |      |                       |
   +-----------------------+      +-----------------------+
               |                              |
               +------------------------------+
                             |
                             v
                  +-----------------------+
                  |        Pool           |
                  |                       |
                  +-----------------------+

The Ceph client calculates which placement group an object should
be in. It does this by hashing the object ID and applying an operation
based on the number of PGs in the defined pool and the ID of the pool.
See `Mapping PGs to OSDs`_ for details.
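
The calculation described above can be pictured with a toy function.
This is a minimal sketch and not Ceph's implementation: ``zlib.crc32``
and the plain modulo are stand-ins chosen for illustration (Ceph uses
its own hash and reduction step), and the function name
``toy_object_to_pg`` is hypothetical. The ``{pool-id}.{hex}`` shape of
the result mirrors PG IDs such as ``1.6c`` used later in this document.

.. code-block:: python

   import zlib


   def toy_object_to_pg(pool_id: int, object_name: str, pg_num: int) -> str:
       """Map an object name to a PG ID string (illustration only)."""
       if pg_num < 1:
           raise ValueError("pg_num must be at least 1")
       # Stand-in hash: real Ceph clients do not use crc32 here.
       object_hash = zlib.crc32(object_name.encode("utf-8"))
       # Reduce the hash to one of the pool's pg_num placement groups.
       pg_index = object_hash % pg_num
       return f"{pool_id}.{pg_index:x}"


   print(toy_object_to_pg(1, "my-object", 128))

Any client that runs the same calculation with the same inputs arrives
at the same placement group, so no per-object lookup table is needed.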

The contents of the objects within a placement group are stored in a
set of OSDs. For instance, in a replicated pool of size two, each
placement group will store objects on two OSDs, as shown below.

.. ditaa::

   +-----------------------+      +-----------------------+
   |  Placement Group #1   |      |  Placement Group #2   |
   |                       |      |                       |
   +-----------------------+      +-----------------------+
        |             |               |             |
        v             v               v             v
   /----------\  /----------\    /----------\  /----------\
   |          |  |          |    |          |  |          |
   |  OSD #1  |  |  OSD #2  |    |  OSD #2  |  |  OSD #3  |
   |          |  |          |    |          |  |          |
   \----------/  \----------/    \----------/  \----------/


Should OSD #2 fail, another OSD will be assigned to Placement Group #1
and will be filled with copies of all objects in OSD #1. If the pool
size is changed from two to three, an additional OSD will be assigned
to the placement group and will receive copies of all objects in the
placement group.

Placement groups do not own the OSD; they share it with other
placement groups from the same pool or even other pools. If OSD #2
fails, Placement Group #2 will also have to restore copies of
objects, using OSD #3.

When the number of placement groups increases, the new placement
groups will be assigned OSDs. The result of the CRUSH function will
also change and some objects from the former placement groups will be
copied over to the new placement groups and removed from the old ones.

Placement Groups Tradeoffs
==========================

Data durability and even distribution among all OSDs call for more
placement groups, but their number should be reduced to the minimum to
save CPU and memory.

.. _data durability:

Data durability
---------------

After an OSD fails, the risk of data loss increases until the data it
contained is fully recovered. Let's imagine a scenario that causes
permanent data loss in a single placement group:

- The OSD fails and all copies of the objects it contains are lost.
  For all objects within the placement group, the number of replicas
  suddenly drops from three to two.

- Ceph starts recovery for this placement group by choosing a new OSD
  to re-create the third copy of all objects.

- Another OSD, within the same placement group, fails before the new
  OSD is fully populated with the third copy. Some objects will then
  only have one surviving copy.

- Ceph picks yet another OSD and keeps copying objects to restore the
  desired number of copies.

- A third OSD, within the same placement group, fails before recovery
  is complete. If this OSD contained the only remaining copy of an
  object, it is permanently lost.

In a cluster containing 10 OSDs with 512 placement groups in a three
replica pool, CRUSH will give each placement group three OSDs. In the
end, each OSD will end up hosting (512 * 3) / 10 = ~150 placement
groups. When the first OSD fails, the above scenario will therefore
start recovery for all 150 placement groups at the same time.
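
The per-OSD figure above is a single division. As a small sketch (the
function name ``pgs_per_osd`` is hypothetical and the result is an
average, since CRUSH does not place exactly the same number of
placement groups on every OSD):

.. code-block:: python

   def pgs_per_osd(pg_num: int, pool_size: int, osd_count: int) -> float:
       """Average number of placement group copies hosted by each OSD."""
       if osd_count < 1:
           raise ValueError("osd_count must be at least 1")
       return pg_num * pool_size / osd_count


   # 512 PGs, three replicas, 10 OSDs: 153.6, which the text rounds to ~150.
   print(pgs_per_osd(512, 3, 10))

The same function can be reused with other OSD counts to see how the
load per OSD changes as a cluster grows.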

The 150 placement groups being recovered are likely to be
homogeneously spread over the 9 remaining OSDs. Each remaining OSD is
therefore likely to send copies of objects to all others and also
receive some new objects to be stored because they became part of a
new placement group.

The amount of time it takes for this recovery to complete depends
entirely on the architecture of the Ceph cluster. Let's say each OSD is
hosted by a 1TB SSD on a single machine, all of them are connected
to a 10Gb/s switch, and the recovery for a single OSD completes within
M minutes. If there are two OSDs per machine using spinners with no
SSD journal and a 1Gb/s switch, it will be at least an order of
magnitude slower.

In a cluster of this size, the number of placement groups has almost
no influence on data durability. It could be 128 or 8192 and the
recovery would not be slower or faster.

However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs
is likely to speed up recovery and therefore improve data durability
significantly. Each OSD now participates in only ~75 placement groups
instead of ~150 when there were only 10 OSDs, and it will still require
all 19 remaining OSDs to perform the same amount of object copies in
order to recover. But where 10 OSDs had to copy approximately 100GB
each, 20 OSDs now have to copy 50GB each instead. If the network was
the bottleneck, recovery will happen twice as fast. In other words,
recovery goes faster when the number of OSDs increases.

If this cluster grows to 40 OSDs, each of them will only host ~35
placement groups. If an OSD dies, recovery will keep going faster
unless it is blocked by another bottleneck. However, if this cluster
grows to 200 OSDs, each of them will only host ~7 placement groups. If
an OSD dies, recovery will happen between at most ~21 (7 * 3) OSDs
in these placement groups: recovery will take longer than when there
were 40 OSDs, meaning the number of placement groups should be
increased.

No matter how short the recovery time is, there is a chance for a
second OSD to fail while it is in progress. In the 10 OSD cluster
described above, if any of the 9 remaining OSDs fails, then ~17
placement groups (i.e. ~150 / 9 placement groups being recovered) will
only have one surviving copy. And if any of the 8 remaining OSDs fails,
the last objects of two placement groups are likely to be lost
(i.e. ~17 / 8 placement groups with only one remaining copy being
recovered).

When the size of the cluster grows to 20 OSDs, the number of placement
groups damaged by the loss of three OSDs drops. The second OSD lost
will degrade ~4 (i.e. ~75 / 19 placement groups being recovered)
instead of ~17 and the third OSD lost will only lose data if it is one
of the four OSDs containing the surviving copy. In other words, if the
probability of losing one OSD is 0.0001% during the recovery time
frame, it goes from 17 * 10 * 0.0001% in the cluster with 10 OSDs to
4 * 20 * 0.0001% in the cluster with 20 OSDs.
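
The figures in the two paragraphs above can be reproduced with a few
lines of arithmetic. This is only a sketch of the text's own
back-of-the-envelope reasoning: it reuses the rounded inputs (~150 and
~75 placement groups per OSD), assumes the placement groups of a failed
OSD are spread evenly over the remaining OSDs, and uses hypothetical
function names.

.. code-block:: python

   def pgs_left_with_one_copy(pgs_per_osd: float, osd_count: int) -> int:
       """PGs down to a single copy if a second OSD fails during recovery."""
       # The first failed OSD's PGs are shared by the osd_count - 1 survivors.
       return round(pgs_per_osd / (osd_count - 1))


   def rough_loss_risk(pgs_per_osd: float, osd_count: int,
                       p_osd_failure: float) -> float:
       """The text's rough figure: at-risk PGs * OSDs * failure probability."""
       at_risk = pgs_left_with_one_copy(pgs_per_osd, osd_count)
       return at_risk * osd_count * p_osd_failure


   P_FAIL = 0.0001 / 100  # 0.0001% written as a fraction
   for osds, per_osd in ((10, 150), (20, 75)):
       print(osds, pgs_left_with_one_copy(per_osd, osds),
             rough_loss_risk(per_osd, osds, P_FAIL))

With these inputs the at-risk count drops from 17 to 4 placement groups
and the rough risk figure is roughly halved when the cluster doubles in
size.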

In a nutshell, more OSDs mean faster recovery and a lower risk of
cascading failures leading to the permanent loss of a placement
group. Having 512 or 4096 placement groups is roughly equivalent in a
cluster with fewer than 50 OSDs as far as data durability is concerned.

Note: It may take a long time for a new OSD added to the cluster to be
populated with placement groups that were assigned to it. However,
there is no degradation of any object and it has no impact on the
durability of the data contained in the cluster.

.. _object distribution:

Object distribution within a pool
---------------------------------

Ideally, objects are evenly distributed in each placement group. Since
CRUSH computes the placement group for each object, but does not
actually know how much data is stored in each OSD within this
placement group, the ratio between the number of placement groups and
the number of OSDs may influence the distribution of the data
significantly.

For instance, if there were a single placement group for ten OSDs in a
three replica pool, only three OSDs would be used because CRUSH would
have no other choice. When more placement groups are available,
objects are more likely to be evenly spread among them. CRUSH also
makes every effort to evenly spread OSDs among all existing placement
groups.

As long as there are one or two orders of magnitude more placement
groups than OSDs, the distribution should be even. For instance, 300
placement groups for 3 OSDs, 1000 placement groups for 10 OSDs, etc.
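
The effect of the ratio between placement groups and OSDs can be
illustrated with a small simulation. This is a sketch under a strong
assumption: ``random.sample`` stands in for CRUSH and simply picks
distinct OSDs uniformly at random, which ignores weights and failure
domains. The function name ``simulate_pg_spread`` is hypothetical.

.. code-block:: python

   import random
   from collections import Counter


   def simulate_pg_spread(pg_num, osd_count, pool_size, seed=0):
       """Count placement group copies per OSD under random placement."""
       rng = random.Random(seed)
       counts = Counter({osd: 0 for osd in range(osd_count)})
       for _ in range(pg_num):
           # Stand-in for CRUSH: choose pool_size distinct OSDs for this PG.
           for osd in rng.sample(range(osd_count), pool_size):
               counts[osd] += 1
       return counts


   for pg_num in (1, 10, 1000):
       counts = simulate_pg_spread(pg_num, osd_count=10, pool_size=3)
       print(pg_num, min(counts.values()), max(counts.values()))

With a single placement group, seven of the ten OSDs hold nothing. As
the number of placement groups grows, the gap between the least and
most loaded OSD shrinks relative to the average load.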

Uneven data distribution can be caused by factors other than the ratio
between OSDs and placement groups. Since CRUSH does not take into
account the size of the objects, a few very large objects may create
an imbalance. Let's say one million 4K objects totaling 4GB are evenly
spread among 1000 placement groups on 10 OSDs. They will use 4GB / 10
= 400MB on each OSD. If one 400MB object is added to the pool, the
three OSDs supporting the placement group in which the object has been
placed will be filled with 400MB + 400MB = 800MB while the seven
others will remain occupied with only 400MB.

.. _resource usage:

Memory, CPU and network usage
-----------------------------

For each placement group, OSDs and MONs need memory, network and CPU
at all times and even more during recovery. Sharing this overhead by
clustering objects within a placement group is one of the main reasons
they exist.

Minimizing the number of placement groups saves significant amounts of
resources.

Choosing the number of Placement Groups
=======================================

If you have more than 50 OSDs, we recommend approximately 50-100
placement groups per OSD to balance out resource usage, data
durability and distribution. If you have fewer than 50 OSDs, choosing
among the `preselection`_ above is best. For a single pool of objects,
you can use the following formula to get a baseline::

                (OSDs * 100)
   Total PGs =  ------------
                 pool size

Where **pool size** is either the number of replicas for replicated
pools or the K+M sum for erasure coded pools (as returned by **ceph
osd erasure-code-profile get**).

You should then check if the result makes sense with the way you
designed your Ceph cluster to maximize `data durability`_ and
`object distribution`_ and to minimize `resource usage`_.

The result should be **rounded up to the nearest power of two.**
Rounding up is optional, but recommended for CRUSH to evenly balance
the number of objects among placement groups.

As an example, for a cluster with 200 OSDs and a pool size of 3
replicas, you would estimate your number of PGs as follows::

   (200 * 100)
   ----------- = 6667. Nearest power of 2: 8192
        3
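
The formula and the rounding step can be combined in a short helper.
This is a sketch rather than an official tool: the function name
``suggest_total_pgs`` and its ``target_per_osd`` parameter are
hypothetical, with the default of 100 taken from the formula above.

.. code-block:: python

   def suggest_total_pgs(osd_count: int, pool_size: int,
                         target_per_osd: int = 100) -> int:
       """Baseline PG count for one pool, rounded up to a power of two."""
       if osd_count < 1 or pool_size < 1:
           raise ValueError("osd_count and pool_size must be at least 1")
       # Integer ceiling of (OSDs * target) / pool size.
       baseline = -(-(osd_count * target_per_osd) // pool_size)
       # Smallest power of two that is >= baseline.
       return 1 << (baseline - 1).bit_length()


   print(suggest_total_pgs(200, 3))  # 8192

The result is only a baseline and should still be checked against the
tradeoffs described earlier.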

When using multiple data pools for storing objects, you need to ensure
that you balance the number of placement groups per pool with the
number of placement groups per OSD so that you arrive at a reasonable
total number of placement groups that provides reasonably low variance
per OSD without taxing system resources or making the peering process
too slow.

For instance, a cluster of 10 pools, each with 512 placement groups, on
ten OSDs has a total of 5,120 placement groups spread over ten OSDs,
that is, 512 placement groups per OSD. That does not use too many
resources. However, if 1,000 pools were created with 512 placement
groups each, the OSDs would handle ~50,000 placement groups each and it
would require significantly more resources and time for peering.
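
The two cases above can be checked with the same division. This sketch
follows the text's simplification of counting each placement group once
(the replica count is ignored), and the function name
``pgs_per_osd_across_pools`` is hypothetical.

.. code-block:: python

   def pgs_per_osd_across_pools(pool_count: int, pg_num_per_pool: int,
                                osd_count: int) -> float:
       """PGs per OSD when every pool has the same pg_num."""
       if osd_count < 1:
           raise ValueError("osd_count must be at least 1")
       return pool_count * pg_num_per_pool / osd_count


   print(pgs_per_osd_across_pools(10, 512, 10))    # 512.0
   print(pgs_per_osd_across_pools(1000, 512, 10))  # 51200.0

The second result is the figure the text rounds to ~50,000.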

.. _setting the number of placement groups:

Set the Number of Placement Groups
==================================

To set the number of placement groups in a pool, you must specify the
number of placement groups at the time you create the pool.
See `Create a Pool`_ for details. Once you've set placement groups for a
pool, you may increase the number of placement groups (but you cannot
decrease the number of placement groups). To increase the number of
placement groups, execute the following::

        ceph osd pool set {pool-name} pg_num {pg_num}

Once you increase the number of placement groups, you must also
increase the number of placement groups for placement (``pgp_num``)
before your cluster will rebalance. The ``pgp_num`` is the number of
placement groups that will be considered for placement by the CRUSH
algorithm. Increasing ``pg_num`` splits the placement groups, but data
will not be migrated to the newer placement groups until the number of
placement groups for placement, i.e. ``pgp_num``, is increased. The
``pgp_num`` should be equal to the ``pg_num``. To increase the number of
placement groups for placement, execute the following::

        ceph osd pool set {pool-name} pgp_num {pgp_num}
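
Because the two settings are meant to end up equal, it can help to
generate both commands from a single value. The following sketch only
builds the argument lists for the two commands shown above and does not
run them. The function name ``resize_pg_commands`` and the pool name
``mypool`` are hypothetical.

.. code-block:: python

   def resize_pg_commands(pool_name: str, new_pg_num: int) -> list:
       """Return the pg_num and pgp_num commands, in the order to run them."""
       if new_pg_num < 1:
           raise ValueError("new_pg_num must be at least 1")
       value = str(new_pg_num)
       return [
           ["ceph", "osd", "pool", "set", pool_name, "pg_num", value],
           ["ceph", "osd", "pool", "set", pool_name, "pgp_num", value],
       ]


   for argv in resize_pg_commands("mypool", 256):
       print(" ".join(argv))

Each list can be handed to ``subprocess.run(argv, check=True)`` on a
host with access to the cluster.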


Get the Number of Placement Groups
==================================

To get the number of placement groups in a pool, execute the following::

        ceph osd pool get {pool-name} pg_num


Get a Cluster's PG Statistics
=============================

To get the statistics for the placement groups in your cluster, execute the following::

        ceph pg dump [--format {format}]

Valid formats are ``plain`` (default) and ``json``.


Get Statistics for Stuck PGs
============================

To get the statistics for all placement groups stuck in a specified state,
execute the following::

        ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>]

**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD
with the most up-to-date data to come up and in.

**Unclean** Placement groups contain objects that are not replicated the desired number
of times. They should be recovering.

**Stale** Placement groups are in an unknown state - the OSDs that host them have not
reported to the monitor cluster in a while (configured by ``mon_osd_report_timeout``).

Valid formats are ``plain`` (default) and ``json``. The threshold defines the minimum number
of seconds the placement group is stuck before including it in the returned statistics
(default 300 seconds).


Get a PG Map
============

To get the placement group map for a particular placement group, execute the following::

        ceph pg map {pg-id}

For example::

        ceph pg map 1.6c

Ceph will return the placement group map, the placement group, and the OSD status::

        osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]
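
If the result needs to be used in a script, the sample line above can
be picked apart with a regular expression. This is a sketch that
assumes the output has exactly the shape shown in the example. Other
Ceph versions may print a different format, and the function name
``parse_pg_map`` is hypothetical.

.. code-block:: python

   import re

   PG_MAP_RE = re.compile(
       r"osdmap e(?P<epoch>\d+) pg (?P<pgid>\S+) \([^)]*\) "
       r"-> up \[(?P<up>[\d,]*)\] acting \[(?P<acting>[\d,]*)\]"
   )


   def parse_pg_map(line: str) -> dict:
       """Extract the epoch, PG ID, up set and acting set from one line."""
       match = PG_MAP_RE.search(line)
       if match is None:
           raise ValueError("unrecognized pg map output: %r" % line)

       def to_ids(text):
           return [int(item) for item in text.split(",") if item]

       return {
           "epoch": int(match.group("epoch")),
           "pgid": match.group("pgid"),
           "up": to_ids(match.group("up")),
           "acting": to_ids(match.group("acting")),
       }


   print(parse_pg_map("osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]"))

For the example line, the up set and the acting set both contain OSDs 1
and 0.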


Get a PG's Statistics
=====================

To retrieve statistics for a particular placement group, execute the following::

        ceph pg {pg-id} query


Scrub a Placement Group
=======================

To scrub a placement group, execute the following::

        ceph pg scrub {pg-id}

Ceph checks the primary and any replica nodes, generates a catalog of all objects
in the placement group and compares them to ensure that no objects are missing
or mismatched, and their contents are consistent.  Assuming the replicas all
match, a final semantic sweep ensures that all of the snapshot-related object
metadata is consistent. Errors are reported via logs.


Revert Lost
===========

If the cluster has lost one or more objects, and you have decided to
abandon the search for the lost data, you must mark the unfound objects
as ``lost``.

If all possible locations have been queried and objects are still
lost, you may have to give up on the lost objects. This is
possible given unusual combinations of failures that allow the cluster
to learn about writes that were performed before the writes themselves
are recovered.

The "revert" option will either roll back to a previous version of the
object or (if it was a new object) forget about it entirely. The "delete"
option will forget about the unfound objects entirely. To mark the
"unfound" objects as "lost", execute the following::

        ceph pg {pg-id} mark_unfound_lost revert|delete

.. important:: Use this feature with caution, because it may confuse
   applications that expect the object(s) to exist.


.. toctree::
        :hidden:

        pg-states
        pg-concepts


.. _Create a Pool: ../pools#createpool
.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
.. _pgcalc: http://ceph.com/pgcalc/