======================
Peering
======================

Concepts
--------

*Peering*
   the process of bringing all of the OSDs that store
   a Placement Group (PG) into agreement about the state
   of all of the objects (and their metadata) in that PG.
   Note that agreeing on the state does not mean that
   they all have the latest contents.

*Acting set*
   the ordered list of OSDs who are (or were as of some epoch)
   responsible for a particular PG.

*Up set*
   the ordered list of OSDs responsible for a particular PG for
   a particular epoch according to CRUSH. Normally this
   is the same as the *acting set*, except when the *acting set* has been
   explicitly overridden via *PG temp* in the OSDMap.

*PG temp*
   a temporary placement group acting set used while backfilling the
   primary OSD. Say the acting set is [0,1,2] and we are
   active+clean. Something happens and the acting set is now [3,1,2].
   osd.3 is empty and can't serve reads although it is the primary.
   osd.3 will see that and request a *PG temp* of [1,2,3] from the
   monitors using a MOSDPGTemp message, so that osd.1 temporarily
   becomes the primary. osd.1 will select osd.3 as a backfill peer and
   continue to serve reads and writes while osd.3 is backfilled. When
   backfilling is complete, *PG temp* is discarded and the acting set
   changes back to [3,1,2], with osd.3 as the primary.

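How a *PG temp* override interacts with the CRUSH-computed *up set* can be
sketched as follows. This is an illustrative model only; the function and
variable names (``effective_acting``, ``pg_temp``) are hypothetical and do not
reflect the actual OSDMap implementation.

.. code-block:: python

   # Illustrative sketch (not the real OSDMap code): a *PG temp* entry
   # overrides the CRUSH-computed up set to produce the acting set.
   def effective_acting(up_set, pg_temp, pgid):
       """Return the acting set for a PG: the PG temp override if one is
       registered, otherwise the CRUSH-computed up set."""
       return pg_temp.get(pgid, up_set)

   # The scenario from the text: CRUSH now maps the PG to [3, 1, 2], but
   # osd.3 is empty, so a PG temp of [1, 2, 3] is requested and osd.1
   # acts as primary (the first element) while osd.3 is backfilled.
   up_set = [3, 1, 2]
   pg_temp = {"1.7": [1, 2, 3]}   # "1.7" is a made-up PG id

   acting = effective_acting(up_set, pg_temp, "1.7")
   print("acting:", acting, "primary:", acting[0])  # acting: [1, 2, 3] primary: 1

   # Once backfill of osd.3 completes, the override is dropped and the
   # acting set reverts to the up set, with osd.3 as primary.
   del pg_temp["1.7"]
   acting = effective_acting(up_set, pg_temp, "1.7")
   print("acting:", acting, "primary:", acting[0])  # acting: [3, 1, 2] primary: 3
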
*current interval* or *past interval*
   a sequence of OSD map epochs during which the *acting set* and *up
   set* for a particular PG do not change.

*primary*
   the (by convention first) member of the *acting set*,
   who is responsible for coordinating peering, and is
   the only OSD that will accept client-initiated
   writes to objects in a placement group.

*replica*
   a non-primary OSD in the *acting set* for a placement group
   (and who has been recognized as such and *activated* by the primary).

*stray*
   an OSD who is not a member of the current *acting set*, but
   has not yet been told that it can delete its copies of a
   particular placement group.

*recovery*
   ensuring that copies of all of the objects in a PG
   are on all of the OSDs in the *acting set*. Once
   *peering* has been performed, the primary can start
   accepting write operations, and *recovery* can proceed
   in the background.

*PG info*
   basic metadata about the PG's creation epoch, the version
   for the most recent write to the PG, *last epoch started*, *last
   epoch clean*, and the beginning of the *current interval*. Any
   inter-OSD communication about PGs includes the *PG info*, such that
   any OSD that knows a PG exists (or once existed) also has a lower
   bound on *last epoch clean* or *last epoch started*.

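A minimal sketch of the metadata described above, with illustrative field
names (the actual structure maintained by the OSDs carries more fields and a
different encoding):

.. code-block:: python

   # Illustrative sketch of the *PG info* metadata; field names are
   # chosen for readability, not taken from the OSD code.
   from dataclasses import dataclass

   @dataclass
   class PGInfo:
       created_epoch: int        # OSD map epoch in which the PG was created
       last_update: tuple        # (epoch, version) of the most recent write
       last_epoch_started: int   # last epoch at which peering completed
       last_epoch_clean: int     # last epoch at which recovery completed
       same_interval_since: int  # first epoch of the *current interval*

   # Because every inter-OSD message about a PG carries this info, any OSD
   # that has ever heard of the PG has at least a lower bound on
   # last_epoch_started and last_epoch_clean.
   info = PGInfo(created_epoch=10, last_update=(42, 105),
                 last_epoch_started=40, last_epoch_clean=38,
                 same_interval_since=41)
   print(info)
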
*PG log*
   a list of recent updates made to objects in a PG.
   Note that these logs can be truncated after all OSDs
   in the *acting set* have acknowledged up to a certain
   point.

*missing set*
   Each OSD notes update log entries and, if they imply updates to
   the contents of an object, adds that object to a list of needed
   updates. This list is called the *missing set* for that <OSD,PG>.

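As a sketch, the *missing set* can be thought of as the result of replaying
the *PG log* against the object versions actually present on disk. The helper
below is illustrative pseudo-logic, not the OSD's real recovery code:

.. code-block:: python

   # Illustrative sketch: derive a *missing set* by comparing PG log
   # entries against the object versions this OSD has on disk.
   def build_missing_set(pg_log, local_versions):
       """pg_log: list of (object_name, version) updates, oldest first.
       local_versions: dict of object_name -> version stored locally.
       Returns {object_name: needed_version} for objects whose newest
       logged update is not yet reflected on disk."""
       missing = {}
       for obj, version in pg_log:
           if local_versions.get(obj, 0) < version:
               missing[obj] = version     # need this (or a newer) copy
           else:
               missing.pop(obj, None)     # already have it; not missing
       return missing

   log = [("foo", 3), ("bar", 7), ("foo", 4)]
   have = {"foo": 4, "bar": 5}
   print(build_missing_set(log, have))    # {'bar': 7}
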
*Authoritative History*
   a complete and fully ordered set of operations that, if
   performed, would bring an OSD's copy of a Placement Group
   up to date.

*epoch*
   a (monotonically increasing) OSD map version number.

*last epoch started*
   the last epoch at which all nodes in the *acting set*
   for a particular placement group agreed on an
   *authoritative history*. At this point, *peering* is
   deemed to have been successful.

*up_thru*
   before a primary can successfully complete the *peering* process,
   it must inform a monitor that it is alive through the current
   OSD map epoch by having the monitor set its *up_thru* in the OSD
   map. This helps peering ignore previous *acting sets* for which
   peering never completed after certain sequences of failures, such as
   the second interval below:

   - *acting set* = [A,B]
   - *acting set* = [A]
   - *acting set* = [] very shortly after (e.g., simultaneous failure, but staggered detection)
   - *acting set* = [B] (B restarts, A does not)

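The role of *up_thru* in deciding whether a past interval could have completed
*peering* (and therefore might hold acknowledged writes) can be sketched like
this; the check is simplified relative to the real OSD logic:

.. code-block:: python

   # Illustrative sketch: could peering have completed (and writes have
   # been served) during a past interval?  If the primary's up_thru, as
   # recorded in the OSD map at the end of the interval, never reached
   # the interval's first epoch, peering cannot have finished there, so
   # that interval's acting set can be ignored.
   def interval_may_have_served_writes(first_epoch, primary_up_thru, acting):
       if not acting:                  # nobody was acting, e.g. the [] interval
           return False
       return primary_up_thru >= first_epoch

   # In the failure sequence above, the interval with acting set [A]
   # never had its up_thru advanced, so it can be skipped.
   print(interval_may_have_served_writes(20, 18, ["A"]))       # False
   print(interval_may_have_served_writes(15, 19, ["A", "B"]))  # True
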
*last epoch clean*
   the last epoch at which all nodes in the *acting set*
   for a particular placement group were completely
   up to date (both PG logs and object contents).
   At this point, *recovery* is deemed to have been
   completed.

Description of the Peering Process
----------------------------------

The *Golden Rule* is that no write operation to any PG
is acknowledged to a client until it has been persisted
by all members of the *acting set* for that PG. This means
that if we can communicate with at least one member of
each *acting set* since the last successful *peering*, someone
will have a record of every (acknowledged) operation
since the last successful *peering*. It should therefore be
possible for the current primary to construct and disseminate
a new *authoritative history*.

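A schematic sketch of the *Golden Rule* (illustrative only, not the OSD
replication code): a write is acknowledged only once every member of the
*acting set* reports it persisted, so any single surviving member's log is
enough to reconstruct the acknowledged history.

.. code-block:: python

   # Illustrative sketch of the Golden Rule: acknowledge a write only
   # after every member of the acting set has persisted it.
   def replicate_write(acting_set, persist, op):
       """persist(osd, op) -> True once that OSD has durably stored op."""
       persisted_on = [osd for osd in acting_set if persist(osd, op)]
       if len(persisted_on) == len(acting_set):
           return "ack"       # safe: every acting-set member has the op
       return "no ack"        # the client never sees this write as committed

   acting = [0, 1, 2]
   store = {osd: [] for osd in acting}
   def persist(osd, op):
       store[osd].append(op)
       return True

   # Because the op is on every member before the ack, any one survivor
   # is enough for a later primary to recover it.
   print(replicate_write(acting, persist, ("write", "foo", 4)))   # ack
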
It is also important to appreciate the role of the OSD map
(the list of all known OSDs and their states, as well as some
information about the placement groups) in the *peering*
process:

  When OSDs go up or down (or get added or removed)
  this has the potential to affect the *acting sets*
  of many placement groups.

  Before a primary successfully completes the *peering*
  process, the OSD map must reflect that the OSD was alive
  and well as of the first epoch in the *current interval*.

  Changes can only be made after successful *peering*.

Thus, a new primary can use the latest OSD map along with a recent
history of past maps to generate a set of *past intervals* to
determine which OSDs must be consulted before we can successfully
*peer*. The set of past intervals is bounded by *last epoch started*,
the most recent *past interval* for which we know *peering* completed.
The process by which an OSD discovers a PG exists in the first place is
by exchanging *PG info* messages, so the OSD always has some lower
bound on *last epoch started*.

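Generating the set of *past intervals* from a sequence of historical OSD maps
can be sketched as follows. This is an illustrative reconstruction from the
definitions above (with made-up epochs and helper names), not the actual
past-interval code in the OSD:

.. code-block:: python

   # Illustrative sketch: walk the OSD map history and start a new
   # interval whenever the acting set or up set for the PG changes.
   # The walk is bounded below by last_epoch_started, the most recent
   # epoch at which peering is known to have completed.
   def past_intervals(maps, last_epoch_started, current_epoch):
       """maps: dict of epoch -> (acting, up) for one PG.
       Returns (first_epoch, last_epoch, (acting, up)) tuples; the final
       tuple is the *current interval*."""
       intervals = []
       start = last_epoch_started
       for e in range(last_epoch_started + 1, current_epoch + 1):
           if maps[e] != maps[start]:   # membership changed: close the interval
               intervals.append((start, e - 1, maps[start]))
               start = e
       intervals.append((start, current_epoch, maps[start]))
       return intervals

   # Toy history: the acting/up sets change at epochs 43 and 45.
   maps = {40: ([0, 1, 2], [0, 1, 2]), 41: ([0, 1, 2], [0, 1, 2]),
           42: ([0, 1, 2], [0, 1, 2]), 43: ([3, 1, 2], [3, 1, 2]),
           44: ([3, 1, 2], [3, 1, 2]), 45: ([1, 2, 3], [3, 1, 2])}
   for first, last, (acting, up) in past_intervals(maps, 40, 45):
       print(f"epochs {first}-{last}: acting={acting} up={up}")
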
The high level process (sketched in code after this list) is for the
current PG primary to:

1. get a recent OSD map (to identify the members of all the
   interesting *acting sets*, and confirm that we are still the
   primary).

#. generate a list of *past intervals* since *last epoch started*.
   Consider the subset of those for which *up_thru* was greater than
   the first interval epoch, according to the last interval epoch's OSD
   map; that is, the subset for which *peering* could have completed
   before the *acting set* changed to another set of OSDs.

   Successful *peering* will require that we be able to contact at
   least one OSD from each such *past interval*'s *acting set*.

#. ask every node in that list for its *PG info*, which includes the most
   recent write made to the PG, and a value for *last epoch started*. If
   we learn about a *last epoch started* that is newer than our own, we can
   prune older *past intervals* and reduce the peer OSDs we need to contact.

#. if anyone else has (in its PG log) operations that I do not have,
   instruct them to send me the missing log entries so that the primary's
   *PG log* is up to date (includes the newest write).

#. for each member of the current *acting set*:

   a. ask it for copies of all PG log entries since *last epoch started*
      so that I can verify that they agree with mine (or know what
      objects I will be telling it to delete).

      If the cluster failed before an operation was persisted by all
      members of the *acting set*, and the subsequent *peering* did not
      remember that operation, and a node that did remember that
      operation later rejoined, its logs would record a different
      (divergent) history than the *authoritative history* that was
      reconstructed in the *peering* after the failure.

      Since the *divergent* events were not recorded in other logs
      from that *acting set*, they were not acknowledged to the client,
      and there is no harm in discarding them (so that all OSDs agree
      on the *authoritative history*). But we will have to instruct
      any OSD that stores data from a divergent update to delete the
      affected (and now deemed to be apocryphal) objects.

   #. ask it for its *missing set* (object updates recorded
      in its PG log, but for which it does not have the new data).
      This is the list of objects that must be fully replicated
      before we can accept writes.

#. at this point, the primary's PG log contains an *authoritative history* of
   the placement group, and the OSD now has sufficient
   information to bring any other OSD in the *acting set* up to date.

#. if the primary's *up_thru* value in the current OSD map is not greater than
   or equal to the first epoch in the *current interval*, send a request to the
   monitor to update it, and wait until we receive an updated OSD map that
   reflects the change.

#. for each member of the current *acting set*:

   a. send them log updates to bring their PG logs into agreement with
      my own (the *authoritative history*), which may involve deciding
      to delete divergent objects.

   #. await acknowledgment that they have persisted the PG log entries.

#. at this point all OSDs in the *acting set* agree on all of the metadata,
   and would (in any future *peering*) return identical accounts of all
   updates.

   a. start accepting client write operations (because we have unanimous
      agreement on the state of the objects into which those updates are
      being accepted). Note, however, that if a client tries to write to an
      object that has not yet been recovered, it will be promoted to the front
      of the recovery queue, and the write will be applied after it is fully
      replicated to the current *acting set*.

   #. update the *last epoch started* value in our local *PG info*, and instruct
      other *acting set* OSDs to do the same.

#. start pulling object data updates that other OSDs have, but I do not. We may
   need to query OSDs from additional *past intervals* prior to *last epoch started*
   (the last time *peering* completed) and following *last epoch clean* (the last
   epoch at which recovery completed) in order to find copies of all objects.

#. start pushing object data updates to other OSDs that do not yet have them.

   We push these updates from the primary (rather than having the replicas
   pull them) because this allows the primary to ensure that a replica has
   the current contents before sending it an update write. It also makes
   it possible for a single read (from the primary) to be used to write
   the data to multiple replicas. If each replica did its own pulls,
   the data might have to be read multiple times.

#. once all replicas store copies of all objects (that
   existed prior to the start of this epoch) we can update *last
   epoch clean* in the *PG info*, and we can dismiss all of the
   *stray* replicas, allowing them to delete their copies of objects
   for which they are no longer in the *acting set*.

   We could not dismiss the *strays* prior to this because it was possible
   that one of those *strays* might hold the sole surviving copy of an
   old object (all of whose copies disappeared before they could be
   replicated on members of the current *acting set*).

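The flow above can be condensed into a schematic driver. This is a highly
simplified illustration of the sequence of steps; every method name on the
stub below is a stand-in, not part of the actual OSD peering state machine.

.. code-block:: python

   # Schematic sketch of the peering/recovery sequence described above.
   class PGStub:
       acting_set = [1, 2, 3]
       def __getattr__(self, name):
           # Stand-in for each step: just record that it ran.
           return lambda *args, **kwargs: print(f"{name}{args}")

   def peer_and_recover(pg):
       osdmap = pg.get_recent_osdmap()                 # step 1: fresh map, confirm we are primary
       intervals = pg.generate_past_intervals(osdmap)  # step 2: bounded by last epoch started
       infos = pg.collect_pg_info(intervals)           # step 3: PG info from prior acting sets
       pg.pull_missing_log_entries(infos)              # step 4: bring the primary's PG log up to date
       for osd in pg.acting_set:                       # step 5: peer logs and missing sets
           pg.fetch_log_and_missing(osd)
       # step 6: the primary's PG log is now the authoritative history
       pg.ensure_up_thru(osdmap)                       # step 7: have the monitor record our up_thru
       for osd in pg.acting_set:                       # step 8: disseminate the history, await acks
           pg.send_log_updates(osd)
       pg.wait_for_log_acks()
       pg.start_accepting_writes()                     # step 9: metadata agreement; last epoch started
       pg.update_last_epoch_started()
       pg.pull_missing_objects()                       # step 10: recover objects the primary lacks
       pg.push_missing_objects()                       # step 11: push objects the replicas lack
       pg.update_last_epoch_clean()                    # step 12: recovery complete; dismiss strays
       pg.dismiss_strays()

   peer_and_recover(PGStub())
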
State Model
-----------

.. graphviz:: peering_graph.generated.dot