============================
 PG (Placement Group) notes
============================

Miscellaneous copy-pastes from emails; when this gets cleaned up it
should move out of /dev.

Overview
========

PG = "placement group". When placing data in the cluster, objects are
mapped into PGs, and those PGs are mapped onto OSDs. We use this
indirection so that we can group objects, which reduces the amount of
per-object metadata we need to keep track of and the number of
processes we need to run (it would be prohibitively expensive to
track, e.g., the placement history on a per-object basis). Increasing
the number of PGs can reduce the variance in per-OSD load across your
cluster, but each PG requires a bit more CPU and memory on the OSDs
that store it. We try to ballpark it at 100 PGs/OSD, although it can
vary widely without ill effects depending on your cluster. (The
original emailer had hit a bug in how we calculate the initial PG
number from a cluster description.)
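
The 100 PGs/OSD ballpark above can be turned into a quick pg_num
estimate. The sketch below is illustrative only: dividing by the
replica count and rounding up to a power of two are common
conventions, and all numbers are made-up assumptions, not official
sizing guidance::

  # Rough ballpark: ~100 PGs per OSD, divided by the replica count
  # (each PG occupies 'size' OSDs), rounded up to a power of two.
  num_osds = 12            # assumption: a 12-OSD cluster
  target_pgs_per_osd = 100
  size = 3                 # assumption: 3-way replication

  raw = num_osds * target_pgs_per_osd / size   # 400.0
  pg_num = 1
  while pg_num < raw:
      pg_num *= 2          # pg_num ends up as 512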

There are a couple of different categories of PGs; the 6 that exist
(in the original emailer's ``ceph -s`` output) are "local" PGs which
are tied to a specific OSD. However, those aren't actually used in a
standard Ceph configuration.


Mapping algorithm (simplified)
==============================

| > How does the Object->PG mapping look like, do you map more than one object on
| > one PG, or do you sometimes map an object to more than one PG? How about the
| > mapping of PGs to OSDs, does one PG belong to exactly one OSD?
| >
| > Does one PG represent a fixed amount of storage space?

Many objects map to one PG.

Each object maps to exactly one PG.

One PG maps to a single list of OSDs, where the first one in the list
is the primary and the rest are replicas.

Many PGs can map to one OSD.

A PG represents nothing but a grouping of objects; you configure the
number of PGs you want (number of OSDs * 100 is a good starting
point), and all of your stored objects are pseudo-randomly and evenly
distributed across the PGs. So a PG explicitly does NOT represent a
fixed amount of storage; it represents 1/pg_num'th of the storage you
happen to have on your OSDs.

Ignoring the finer points of CRUSH and custom placement, it goes
something like this in pseudocode::

  locator = object_name
  obj_hash = hash(locator)
  pg = obj_hash % num_pg
  osds_for_pg = crush(pg)  # returns a list of OSDs
  primary = osds_for_pg[0]
  replicas = osds_for_pg[1:]

If you want to understand the crush() part in the above, imagine a
perfectly spherical datacenter in a vacuum ;) that is, if all OSDs
have weight 1.0, and there is no topology to the data center (all OSDs
are on the top level), and you use defaults, etc, it simplifies to
consistent hashing; you can think of it as::

  def crush(pg):
      all_osds = ['osd.0', 'osd.1', 'osd.2', ...]
      result = []
      # size is the number of copies; primary+replicas
      attempt = 0
      while len(result) < size:
          # vary the hash input per attempt, so that a
          # collision cannot loop forever
          r = hash((pg, attempt))
          attempt += 1
          chosen = all_osds[r % len(all_osds)]
          if chosen in result:
              # an OSD can be picked only once
              continue
          result.append(chosen)
      return result
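
The pieces above can be combined into a small, runnable sketch of the
whole simplified mapping. This is purely illustrative: it substitutes
Python's ``hashlib`` for Ceph's actual rjenkins hashing and CRUSH
implementation, and the OSD names, ``num_pg``, and ``size`` values are
made-up assumptions::

  import hashlib

  def stable_hash(s):
      # deterministic stand-in for Ceph's hash function
      return int(hashlib.md5(s.encode()).hexdigest(), 16)

  def crush(pg, all_osds, size):
      # toy stand-in for CRUSH: consistent hashing, with a per-attempt
      # salt so a collision cannot loop forever
      result = []
      attempt = 0
      while len(result) < size:
          r = stable_hash('%d.%d' % (pg, attempt))
          attempt += 1
          chosen = all_osds[r % len(all_osds)]
          if chosen not in result:  # an OSD can be picked only once
              result.append(chosen)
      return result

  # map one object to its PG, then the PG to its OSDs
  num_pg = 8
  all_osds = ['osd.0', 'osd.1', 'osd.2', 'osd.3']
  pg = stable_hash('my_object') % num_pg
  osds_for_pg = crush(pg, all_osds, size=2)
  primary, replicas = osds_for_pg[0], osds_for_pg[1:]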

User-visible PG States
======================

.. todo:: diagram of states and how they can overlap

*creating*
  the PG is still being created

*active*
  requests to the PG will be processed

*clean*
  all objects in the PG are replicated the correct number of times

*down*
  a replica with necessary data is down, so the PG is offline

*replay*
  the PG is waiting for clients to replay operations after an OSD crashed

*splitting*
  the PG is being split into multiple PGs (not functional as of 2012-02)

*scrubbing*
  the PG is being checked for inconsistencies

*degraded*
  some objects in the PG are not replicated enough times yet

*inconsistent*
  replicas of the PG are not consistent (e.g. objects are
  the wrong size, objects are missing from one replica *after* recovery
  finished, etc.)

*peering*
  the PG is undergoing the :doc:`/dev/peering` process

*repair*
  the PG is being checked and any inconsistencies found will be repaired (if possible)

*recovering*
  objects are being migrated/synchronized with replicas

*recovery_wait*
  the PG is waiting for the local/remote recovery reservations

*backfilling*
  a special case of recovery, in which the entire contents of
  the PG are scanned and synchronized, instead of inferring what
  needs to be transferred from the PG logs of recent operations

*backfill_wait*
  the PG is waiting in line to start backfill

*backfill_toofull*
  a backfill reservation was rejected because the OSD is too full

*incomplete*
  a PG is missing a necessary period of history from its
  log. If you see this state, report a bug, and try to start any
  failed OSDs that may contain the needed information.

*stale*
  the PG is in an unknown state - the monitors have not received
  an update for it since the PG mapping changed

*remapped*
  the PG is temporarily mapped to a different set of OSDs from what
  CRUSH specified