======================
Peering
======================

Concepts
--------

*Peering*
   the process of bringing all of the OSDs that store
   a Placement Group (PG) into agreement about the state
   of all of the objects (and their metadata) in that PG.
   Note that agreeing on the state does not mean that
   they all have the latest contents.

*Acting set*
   the ordered list of OSDs who are (or were as of some epoch)
   responsible for a particular PG.

*Up set*
   the ordered list of OSDs responsible for a particular PG for
   a particular epoch according to CRUSH. Normally this
   is the same as the *acting set*, except when the *acting set* has been
   explicitly overridden via *PG temp* in the OSDMap.

*PG temp*
   a temporary placement group acting set used while backfilling the
   primary OSD. Say acting is [0,1,2] and we are
   active+clean. Something happens and acting is now [3,1,2]. osd.3 is
   empty and can't serve reads although it is the primary. osd.3 will
   see that and ask the monitors for a *PG temp* of [1,2,3] using a
   MOSDPGTemp message so that osd.1 temporarily becomes the
   primary. It will select osd.3 as a backfill peer and continue to
   serve reads and writes while osd.3 is backfilled. When backfilling
   is complete, *PG temp* is discarded and the acting set changes back
   to [3,1,2] and osd.3 becomes the primary (see the sketch at the end
   of this section).

*current interval* or *past interval*
   a sequence of OSD map epochs during which the *acting set* and *up
   set* for a particular PG do not change.

*primary*
   the (by convention first) member of the *acting set*,
   who is responsible for coordinating peering, and is
   the only OSD that will accept client-initiated
   writes to objects in a placement group.

*replica*
   a non-primary OSD in the *acting set* for a placement group
   (and who has been recognized as such and *activated* by the primary).

*stray*
   an OSD who is not a member of the current *acting set*, but
   has not yet been told that it can delete its copies of a
   particular placement group.

*recovery*
   ensuring that copies of all of the objects in a PG
   are on all of the OSDs in the *acting set*. Once
   *peering* has been performed, the primary can start
   accepting write operations, and *recovery* can proceed
   in the background.

*PG info*
   basic metadata about the PG's creation epoch, the version
   for the most recent write to the PG, *last epoch started*, *last
   epoch clean*, and the beginning of the *current interval*. Any
   inter-OSD communication about PGs includes the *PG info*, such that
   any OSD that knows a PG exists (or once existed) also has a lower
   bound on *last epoch clean* or *last epoch started*.

*PG log*
   a list of recent updates made to objects in a PG.
   Note that these logs can be truncated after all OSDs
   in the *acting set* have acknowledged up to a certain
   point.

*missing set*
   Each OSD notes update log entries and, if they imply updates to
   the contents of an object, adds that object to a list of needed
   updates. This list is called the *missing set* for that <OSD,PG>.

*Authoritative History*
   a complete and fully ordered set of operations that, if
   performed, would bring an OSD's copy of a Placement Group
   up to date.

*epoch*
   a (monotonically increasing) OSD map version number

*last epoch started*
   the last epoch at which all nodes in the *acting set*
   for a particular placement group agreed on an
   *authoritative history*. At this point, *peering* is
   deemed to have been successful.

*up_thru*
   before a primary can successfully complete the *peering* process,
   it must inform a monitor that it is alive through the current
   OSD map epoch by having the monitor set its *up_thru* in the osd
   map. This helps peering ignore previous *acting sets* for which
   peering never completed after certain sequences of failures, such as
   the second interval below:

   - *acting set* = [A,B]
   - *acting set* = [A]
   - *acting set* = [] very shortly after (e.g., simultaneous failure, but staggered detection)
   - *acting set* = [B] (B restarts, A does not)

*last epoch clean*
   the last epoch at which all nodes in the *acting set*
   for a particular placement group were completely
   up to date (both PG logs and object contents).
   At this point, *recovery* is deemed to have been
   completed.
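
To make the *up set* / *acting set* / *PG temp* relationship concrete, here is a
minimal sketch of the scenario described under *PG temp* above. The variable
names and structures are invented for illustration; this is not Ceph code (the
real logic lives in the C++ OSD and monitor code).

.. code-block:: python

    # Hypothetical illustration of the *PG temp* scenario above.
    up_set = [3, 1, 2]        # what CRUSH currently maps the PG to; osd.3 is empty
    pg_temp = [1, 2, 3]       # temporary override requested via MOSDPGTemp

    # The acting set is the up set unless a pg_temp override is in force.
    acting_set = pg_temp if pg_temp else up_set
    print("primary:", acting_set[0])   # osd.1 serves I/O while osd.3 backfills

    # Once backfill completes, the override is dropped and the acting set
    # reverts to the up set, making osd.3 the primary.
    pg_temp = []
    acting_set = pg_temp if pg_temp else up_set
    print("primary:", acting_set[0])   # osd.3
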

Description of the Peering Process
----------------------------------

The *Golden Rule* is that no write operation to any PG
is acknowledged to a client until it has been persisted
by all members of the *acting set* for that PG. This means
that if we can communicate with at least one member of
each *acting set* since the last successful *peering*, someone
will have a record of every (acknowledged) operation
since the last successful *peering*.
It follows that it should be possible for the current
primary to construct and disseminate a new *authoritative history*.
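
The following is a minimal, hypothetical sketch of the bookkeeping implied by
the *Golden Rule*: the primary replies to the client only once every member of
the *acting set* has persisted the write. The class and names are invented for
this example; Ceph's actual replication path is implemented in C++ inside the OSD.

.. code-block:: python

    # Hypothetical sketch: the primary acknowledges a client write only after
    # every member of the acting set (itself included) has persisted it.

    class InFlightWrite:
        def __init__(self, op_id, acting_set):
            self.op_id = op_id
            self.waiting_on = set(acting_set)   # every member must persist it

        def persisted_on(self, osd):
            """Call when OSD `osd` reports the write as durable."""
            self.waiting_on.discard(osd)
            return not self.waiting_on          # True => safe to ack the client

    # usage sketch
    w = InFlightWrite("op-42", acting_set=[1, 2, 3])
    assert not w.persisted_on(1)   # primary persisted; still waiting on replicas
    assert not w.persisted_on(2)
    assert w.persisted_on(3)       # all members durable: acknowledge the client now
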

It is also important to appreciate the role of the OSD map
(list of all known OSDs and their states, as well as some
information about the placement groups) in the *peering*
process:

   When OSDs go up or down (or get added or removed)
   this has the potential to affect the *acting sets*
   of many placement groups.

   Before a primary successfully completes the *peering*
   process, the OSD map must reflect that the OSD was alive
   and well as of the first epoch in the *current interval*.

   Changes can only be made after successful *peering*.
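
The second point above is the *up_thru* requirement from the Concepts section;
a trivially small sketch of the check (hypothetical names, not the Ceph
implementation) might look like:

.. code-block:: python

    # Hypothetical check: peering cannot complete until the OSD map records
    # the primary as alive (up_thru) through the start of the current interval.

    def needs_up_thru(up_thru_in_osdmap: int, interval_start_epoch: int) -> bool:
        """True if the primary must first ask the monitors to bump its up_thru."""
        return up_thru_in_osdmap < interval_start_epoch
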

Thus, a new primary can use the latest OSD map along with a recent
history of past maps to generate a set of *past intervals* to
determine which OSDs must be consulted before we can successfully
*peer*. The set of past intervals is bounded by *last epoch started*,
the most recent *past interval* for which we know *peering* completed.
The process by which an OSD discovers a PG exists in the first place is
by exchanging *PG info* messages, so the OSD always has some lower
bound on *last epoch started*.
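
As a rough illustration, the sketch below computes *past intervals* from a
simplified, made-up map history (a chronological list of per-epoch records) and
keeps only those intervals in which *peering* could have completed, i.e. those
whose primary had *up_thru* at least the interval's first epoch. This is not
Ceph's implementation (the real computation is done in C++ by the OSD's
*PastIntervals* machinery); the record layout here is invented for the example.

.. code-block:: python

    # Sketch under simplifying assumptions: each map record is a dict with
    # "epoch", "acting", "up", and "up_thru" (a per-OSD map of up_thru epochs).

    def interesting_past_intervals(maps, last_epoch_started):
        """Group consecutive epochs with identical acting/up sets into past
        intervals, then keep only those in which peering could have completed."""
        intervals = []
        start = maps[0]
        for prev, cur in zip(maps, maps[1:]):
            if (cur["acting"], cur["up"]) != (prev["acting"], prev["up"]):
                intervals.append((start, prev))   # closes interval [start, prev]
                start = cur
        # Note: the final run of epochs is the *current interval*, not a past one.

        interesting = []
        for first, last in intervals:
            if last["epoch"] < last_epoch_started or not first["acting"]:
                continue                          # too old, or no one could have peered
            primary = first["acting"][0]
            # Peering could only have completed if, by the interval's last map,
            # the primary's up_thru had reached the interval's first epoch.
            if last["up_thru"].get(primary, 0) >= first["epoch"]:
                interesting.append((first["epoch"], last["epoch"], first["acting"]))
        return interesting
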

The high-level process is for the current PG primary to:

1. get a recent OSD map (to identify the members of all the
   interesting *acting sets*, and confirm that we are still the
   primary).

#. generate a list of *past intervals* since *last epoch started*.
   Consider the subset of those for which *up_thru* was greater than
   the first interval epoch according to the last interval epoch's OSD map;
   that is, the subset for which *peering* could have completed before the
   *acting set* changed to another set of OSDs.

   Successful *peering* will require that we be able to contact at
   least one OSD from each *past interval*'s *acting set*.

#. ask every node in that list for its *PG info*, which includes the most
   recent write made to the PG, and a value for *last epoch started*. If
   we learn about a *last epoch started* that is newer than our own, we can
   prune older *past intervals* and reduce the peer OSDs we need to contact.

#. if anyone else has (in its PG log) operations that I do not have,
   instruct them to send me the missing log entries so that the primary's
   *PG log* is up to date (includes the newest write).

#. for each member of the current *acting set*:

   a. ask it for copies of all PG log entries since *last epoch started*
      so that I can verify that they agree with mine (or know what
      objects I will be telling it to delete).

      If the cluster failed before an operation was persisted by all
      members of the *acting set*, and the subsequent *peering* did not
      remember that operation, and a node that did remember that
      operation later rejoined, its logs would record a different
      (divergent) history than the *authoritative history* that was
      reconstructed in the *peering* after the failure.

      Since the *divergent* events were not recorded in other logs
      from that *acting set*, they were not acknowledged to the client,
      and there is no harm in discarding them (so that all OSDs agree
      on the *authoritative history*). But we will have to instruct
      any OSD that stores data from a divergent update to delete the
      affected (and now deemed to be apocryphal) objects.

   #. ask it for its *missing set* (object updates recorded
      in its PG log, but for which it does not have the new data).
      This is the list of objects that must be fully replicated
      before we can accept writes. A sketch of this log comparison
      appears after this list.

#. at this point, the primary's PG log contains an *authoritative history* of
   the placement group, and the OSD now has sufficient
   information to bring any other OSD in the *acting set* up to date.

#. if the primary's *up_thru* value in the current OSD map is not greater than
   or equal to the first epoch in the *current interval*, send a request to the
   monitor to update it, and wait until it receives an updated OSD map that
   reflects the change.

#. for each member of the current *acting set*:

   a. send them log updates to bring their PG logs into agreement with
      my own (*authoritative history*) ... which may involve deciding
      to delete divergent objects.

   #. await acknowledgment that they have persisted the PG log entries.

#. at this point all OSDs in the *acting set* agree on all of the metadata,
   and would (in any future *peering*) return identical accounts of all
   updates.

   a. start accepting client write operations (because we have unanimous
      agreement on the state of the objects into which those updates are
      being accepted). Note, however, that if a client tries to write to an
      object it will be promoted to the front of the recovery queue, and the
      write will be applied after it is fully replicated to the current *acting set*.

   #. update the *last epoch started* value in our local *PG info*, and instruct
      other *acting set* OSDs to do the same.

   #. start pulling object data updates that other OSDs have, but I do not. We may
      need to query OSDs from additional *past intervals* prior to *last epoch started*
      (the last time *peering* completed) and following *last epoch clean* (the last epoch that
      recovery completed) in order to find copies of all objects.

   #. start pushing object data updates to other OSDs that do not yet have them.

      We push these updates from the primary (rather than having the replicas
      pull them) because this allows the primary to ensure that a replica has
      the current contents before sending it an update write. It also makes
      it possible for a single read (from the primary) to be used to write
      the data to multiple replicas. If each replica did its own pulls,
      the data might have to be read multiple times.

#. once all replicas store copies of all objects (that
   existed prior to the start of this epoch) we can update *last
   epoch clean* in the *PG info*, and we can dismiss all of the
   *stray* replicas, allowing them to delete their copies of objects
   for which they are no longer in the *acting set*.

   We could not dismiss the *strays* prior to this because it was possible
   that one of those *strays* might hold the sole surviving copy of an
   old object (all of whose copies disappeared before they could be
   replicated on members of the current *acting set*).
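
As referenced in step 5 above, the sketch below shows, with invented data
structures, how comparing a peer's *PG log* against the *authoritative history*
yields both the divergent entries to discard and that peer's *missing set*. It
is a simplified illustration only; the real logic lives in Ceph's C++ PG log
code (see src/osd/PGLog.cc in the source tree).

.. code-block:: python

    # Invented structures for illustration: a log is an ordered list of
    # (version, object) pairs; `peer_have` maps object -> version on disk.

    def compare_logs(authoritative, peer_log, peer_have):
        auth_versions = {version for version, _obj in authoritative}

        # Entries in the peer's log that never made it into the authoritative
        # history are divergent: they were never acknowledged to any client,
        # so the objects they touched must be reverted or deleted on that peer.
        divergent = [(v, obj) for v, obj in peer_log if v not in auth_versions]

        # The peer's missing set: objects whose newest authoritative version
        # is ahead of what the peer actually stores.
        missing = {}
        for version, obj in authoritative:
            if peer_have.get(obj, 0) < version:
                missing[obj] = version          # newest needed version wins
        return divergent, missing
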

Generate a State Model
----------------------

Use the `gen_state_diagram.py <https://github.com/ceph/ceph/blob/master/doc/scripts/gen_state_diagram.py>`_ script to generate a copy of the latest peering state model::

   $ git clone https://github.com/ceph/ceph.git
   $ cd ceph
   $ cat src/osd/PeeringState.h src/osd/PeeringState.cc | doc/scripts/gen_state_diagram.py > doc/dev/peering_graph.generated.dot
   $ sed -i 's/7,7/1080,1080/' doc/dev/peering_graph.generated.dot
   $ dot -Tsvg doc/dev/peering_graph.generated.dot > doc/dev/peering_graph.generated.svg

Sample state model:

.. graphviz:: peering_graph.generated.dot