======================
Peering
======================

Concepts
--------

*Peering*
   the process of bringing all of the OSDs that store
   a Placement Group (PG) into agreement about the state
   of all of the objects (and their metadata) in that PG.
   Note that agreeing on the state does not mean that
   they all have the latest contents.

*Acting set*
   the ordered list of OSDs who are (or were as of some epoch)
   responsible for a particular PG.

*Up set*
   the ordered list of OSDs responsible for a particular PG for
   a particular epoch according to CRUSH. Normally this
   is the same as the *acting set*, except when the *acting set* has been
   explicitly overridden via *PG temp* in the OSDMap.

*PG temp*
   a temporary placement group acting set used while backfilling the
   primary OSD. Let's say the acting set is [0,1,2] and the PG is
   active+clean. Something happens and the acting set is now [3,1,2].
   osd.3 is empty and can't serve reads although it is the primary.
   osd.3 will see that and request a *PG temp* of [1,2,3] from the
   monitors using a MOSDPGTemp message so that osd.1 temporarily
   becomes the primary. osd.1 will select osd.3 as a backfill peer and
   continue to serve reads and writes while osd.3 is backfilled. When
   backfilling is complete, *PG temp* is discarded, the acting set
   changes back to [3,1,2], and osd.3 becomes the primary.

*current interval* or *past interval*
   a sequence of OSD map epochs during which the *acting set* and *up
   set* for a particular PG do not change.

*primary*
   the (by convention first) member of the *acting set*,
   who is responsible for coordinating peering, and is
   the only OSD that will accept client-initiated
   writes to objects in a placement group.

*replica*
   a non-primary OSD in the *acting set* for a placement group
   (and who has been recognized as such and *activated* by the primary).

*stray*
   an OSD who is not a member of the current *acting set*, but
   has not yet been told that it can delete its copies of a
   particular placement group.

*recovery*
   ensuring that copies of all of the objects in a PG
   are on all of the OSDs in the *acting set*. Once
   *peering* has been performed, the primary can start
   accepting write operations, and *recovery* can proceed
   in the background.

*PG info*
   basic metadata about the PG's creation epoch, the version
   for the most recent write to the PG, *last epoch started*, *last
   epoch clean*, and the beginning of the *current interval*. Any
   inter-OSD communication about PGs includes the *PG info*, such that
   any OSD that knows a PG exists (or once existed) also has a lower
   bound on *last epoch clean* or *last epoch started*.

*PG log*
   a list of recent updates made to objects in a PG.
   Note that these logs can be truncated after all OSDs
   in the *acting set* have acknowledged up to a certain
   point.

*missing set*
   Each OSD notes update log entries and, if they imply updates to
   the contents of an object, adds that object to a list of needed
   updates. This list is called the *missing set* for that <OSD,PG>.

*Authoritative History*
   a complete and fully ordered set of operations that, if
   performed, would bring an OSD's copy of a Placement Group
   up to date.

*epoch*
   a (monotonically increasing) OSD map version number.

*last epoch started*
   the last epoch at which all nodes in the *acting set*
   for a particular placement group agreed on an
   *authoritative history*. At this point, *peering* is
   deemed to have been successful.

*up_thru*
   before a primary can successfully complete the *peering* process,
   it must inform a monitor that it is alive through the current
   OSD map epoch by having the monitor set its *up_thru* in the OSD
   map. This helps peering ignore previous *acting sets* for which
   peering never completed after certain sequences of failures, such as
   the second interval below:

   - *acting set* = [A,B]
   - *acting set* = [A]
   - *acting set* = [] very shortly after (e.g., simultaneous failure, but staggered detection)
   - *acting set* = [B] (B restarts, A does not)

*last epoch clean*
   the last epoch at which all nodes in the *acting set*
   for a particular placement group were completely
   up to date (both PG logs and object contents).
   At this point, *recovery* is deemed to have been
   completed.
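
As an illustration of how the *up set*, *PG temp*, and *acting set* relate,
the following Python sketch replays the *PG temp* example from the
definitions above. The helper names ``acting_set`` and ``primary`` are
hypothetical and are not part of Ceph; the model deliberately ignores
everything except the override itself.

```python
from typing import List, Optional


def acting_set(up: List[int], pg_temp: Optional[List[int]]) -> List[int]:
    """Return the PG temp override if one is set, otherwise the up set."""
    return list(pg_temp) if pg_temp else list(up)


def primary(acting: List[int]) -> Optional[int]:
    """By convention the primary is the first member of the acting set."""
    return acting[0] if acting else None


# CRUSH now maps the PG to [3,1,2], but osd.3 is empty.
up = [3, 1, 2]

# osd.3 requested a PG temp of [1,2,3], so osd.1 acts as primary for now.
during_backfill = acting_set(up, [1, 2, 3])

# Once backfill completes the PG temp is discarded.
after_backfill = acting_set(up, None)

print(primary(during_backfill), primary(after_backfill))
```

The *up set* is the same in both cases; only the override changes which OSD
is treated as the primary.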

Description of the Peering Process
----------------------------------

The *Golden Rule* is that no write operation to any PG
is acknowledged to a client until it has been persisted
by all members of the *acting set* for that PG. This means
that if we can communicate with at least one member of
each *acting set* since the last successful *peering*, someone
will have a record of every (acknowledged) operation
since the last successful *peering*.
It should therefore be possible for the current
primary to construct and disseminate a new *authoritative history*.
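
The reasoning behind the *Golden Rule* can be sketched in a few lines of
Python. This is a toy model, not Ceph code: operations are plain integers,
``persisted`` is a hypothetical map from OSD name to the operations that OSD
has written to stable storage, and ``acknowledged_ops`` is a made-up helper.

```python
from typing import Dict, List, Set


def acknowledged_ops(acting: List[str], persisted: Dict[str, Set[int]]) -> Set[int]:
    """Ops that could have been acknowledged: persisted by every acting-set member."""
    sets = [persisted[osd] for osd in acting]
    return set.intersection(*sets) if sets else set()


acting = ["osd.0", "osd.1", "osd.2"]
persisted = {
    "osd.0": {1, 2, 3},  # op 3 has not reached osd.2 yet
    "osd.1": {1, 2, 3},
    "osd.2": {1, 2},
}

acked = acknowledged_ops(acting, persisted)

# Whichever single member survives, it holds every acknowledged op.
any_survivor_suffices = all(acked <= persisted[osd] for osd in acting)
print(sorted(acked), any_survivor_suffices)
```

Operation 3 was never acknowledged, so it does not matter whether the
surviving OSD remembers it.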

It is also important to appreciate the role of the OSD map
(list of all known OSDs and their states, as well as some
information about the placement groups) in the *peering*
process:

   When OSDs go up or down (or get added or removed)
   this has the potential to affect the *acting sets*
   of many placement groups.

   Before a primary successfully completes the *peering*
   process, the OSD map must reflect that the OSD was alive
   and well as of the first epoch in the *current interval*.

   Changes can only be made after successful *peering*.

Thus, a new primary can use the latest OSD map along with a recent
history of past maps to generate a set of *past intervals* to
determine which OSDs must be consulted before we can successfully
*peer*. The set of past intervals is bounded by *last epoch started*,
the most recent *past interval* for which we know *peering* completed.
The process by which an OSD discovers a PG exists in the first place is
by exchanging *PG info* messages, so the OSD always has some lower
bound on *last epoch started*.
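
A minimal Python sketch of deriving *past intervals* from a history of OSD
maps, using the failure sequence from the *up_thru* definition. Several
things here are assumptions made for illustration: ``OSDMap``, ``Interval``,
and ``build_intervals`` are hypothetical names, an interval is keyed on the
*acting set* alone (the *up set* is omitted for brevity), and an interval is
treated as one in which *peering* could have completed when the primary's
*up_thru*, as recorded in the interval's last map, is at least the
interval's first epoch.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class OSDMap:
    epoch: int
    acting: Tuple[str, ...]
    up_thru: Dict[str, int]  # per-OSD up_thru as recorded in this map


@dataclass
class Interval:
    first: int
    last: int
    acting: Tuple[str, ...]
    maybe_went_rw: bool = False


def build_intervals(maps: List[OSDMap]) -> List[Interval]:
    """Group consecutive epochs with the same acting set into intervals."""
    intervals: List[Interval] = []
    for m in maps:
        if intervals and intervals[-1].acting == m.acting:
            intervals[-1].last = m.epoch
        else:
            intervals.append(Interval(m.epoch, m.epoch, m.acting))
    by_epoch = {m.epoch: m for m in maps}
    for iv in intervals:
        if iv.acting:  # an empty acting set has no primary
            last_map = by_epoch[iv.last]
            iv.maybe_went_rw = last_map.up_thru.get(iv.acting[0], 0) >= iv.first
    return intervals


maps = [
    OSDMap(10, ("A", "B"), {}),
    OSDMap(11, ("A", "B"), {"A": 10}),  # A's up_thru was recorded
    OSDMap(12, ("A",), {"A": 10}),      # B marked down
    OSDMap(13, (), {"A": 10}),          # A marked down before updating up_thru
    OSDMap(14, ("B",), {"A": 10}),      # B restarts
]

*past, current = build_intervals(maps)
must_consult = [iv for iv in past if iv.maybe_went_rw]
print([iv.acting for iv in must_consult])
```

In this model the interval with *acting set* [A] is skipped, so B only needs
a member of [A,B], which it can satisfy itself.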

The high level process is for the current PG primary to:

  1. get a recent OSD map (to identify the members of all the
     interesting *acting sets*, and confirm that we are still the
     primary).

  #. generate a list of *past intervals* since *last epoch started*.
     Consider the subset of those for which the primary's *up_thru*,
     as recorded in the OSD map of the interval's last epoch, was
     greater than or equal to the interval's first epoch; that is,
     the subset for which *peering* could have completed before the *acting
     set* changed to another set of OSDs.

     Successful *peering* will require that we be able to contact at
     least one OSD from each *past interval*'s *acting set*.

  #. ask every node in that list for its *PG info*, which includes the most
     recent write made to the PG, and a value for *last epoch started*. If
     we learn about a *last epoch started* that is newer than our own, we can
     prune older *past intervals* and reduce the peer OSDs we need to contact.

  #. if any other OSD has (in its PG log) operations that the primary does
     not have, instruct it to send the missing log entries so that the
     primary's *PG log* is up to date (includes the newest write).

  #. for each member of the current *acting set*:

     a. ask it for copies of all PG log entries since *last epoch started*
        so that I can verify that they agree with mine (or know what
        objects I will be telling it to delete).

        If the cluster failed before an operation was persisted by all
        members of the *acting set*, and the subsequent *peering* did not
        remember that operation, and a node that did remember that
        operation later rejoined, its logs would record a different
        (divergent) history than the *authoritative history* that was
        reconstructed in the *peering* after the failure.

        Since the *divergent* events were not recorded in other logs
        from that *acting set*, they were not acknowledged to the client,
        and there is no harm in discarding them (so that all OSDs agree
        on the *authoritative history*). But we will have to instruct
        any OSD that stores data from a divergent update to delete the
        affected (and now deemed to be apocryphal) objects.

     #. ask it for its *missing set* (object updates recorded
        in its PG log, but for which it does not have the new data).
        This is the list of objects that must be fully replicated
        before we can accept writes.

  #. at this point, the primary's PG log contains an *authoritative history* of
     the placement group, and the OSD now has sufficient
     information to bring any other OSD in the *acting set* up to date.

  #. if the primary's *up_thru* value in the current OSD map is not greater than
     or equal to the first epoch in the *current interval*, send a request to the
     monitor to update it, and wait until we receive an updated OSD map that
     reflects the change.

  #. for each member of the current *acting set*:

     a. send them log updates to bring their PG logs into agreement with
        my own (*authoritative history*), which may involve deciding
        to delete divergent objects.

     #. await acknowledgment that they have persisted the PG log entries.

  #. at this point all OSDs in the *acting set* agree on all of the metadata,
     and would (in any future *peering*) return identical accounts of all
     updates.

     a. start accepting client write operations (because we have unanimous
        agreement on the state of the objects into which those updates are
        being accepted). Note, however, that if a client tries to write to an
        object that is still awaiting recovery, that object will be promoted
        to the front of the recovery queue, and the write will be applied
        after the object is fully replicated to the current *acting set*.

     #. update the *last epoch started* value in our local *PG info*, and instruct
        other *acting set* OSDs to do the same.

     #. start pulling object data updates that other OSDs have, but I do not. We may
        need to query OSDs from additional *past intervals* prior to *last epoch started*
        (the last time *peering* completed) and following *last epoch clean* (the last
        epoch at which recovery completed) in order to find copies of all objects.

     #. start pushing object data updates to other OSDs that do not yet have them.

        We push these updates from the primary (rather than having the replicas
        pull them) because this allows the primary to ensure that a replica has
        the current contents before sending it an update write. It also makes
        it possible for a single read (from the primary) to be used to write
        the data to multiple replicas. If each replica did its own pulls,
        the data might have to be read multiple times.

  #. once all replicas store copies of all objects (that
     existed prior to the start of this epoch) we can update *last
     epoch clean* in the *PG info*, and we can dismiss all of the
     *stray* replicas, allowing them to delete their copies of objects
     for which they are no longer in the *acting set*.

     We could not dismiss the *strays* prior to this because it was possible
     that one of those *strays* might hold the sole surviving copy of an
     old object (all of whose copies disappeared before they could be
     replicated on members of the current *acting set*).
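
The log comparison that the primary performs for each member of the
*acting set* can be sketched as follows. This is a simplified model under
stated assumptions, not the actual Ceph implementation: a log entry is a
hypothetical ``(version, object name)`` pair, logs are compared as plain
lists, and ``compare_logs`` and ``objects_of`` are made-up helper names.
Entries after the shared prefix of the replica's log are treated as
divergent, and entries after the shared prefix of the authoritative log are
the ones the replica still needs.

```python
from typing import List, Set, Tuple

Entry = Tuple[int, str]  # (version, object name); a simplified log entry


def compare_logs(
    authoritative: List[Entry], replica: List[Entry]
) -> Tuple[List[Entry], List[Entry]]:
    """Return (divergent entries on the replica, entries to send to it)."""
    shared = 0
    for auth_entry, replica_entry in zip(authoritative, replica):
        if auth_entry != replica_entry:
            break
        shared += 1
    return replica[shared:], authoritative[shared:]


def objects_of(entries: List[Entry]) -> Set[str]:
    """Names of the objects touched by a list of log entries."""
    return {name for _, name in entries}


authoritative = [(1, "a"), (2, "b"), (3, "c"), (4, "a")]
# This replica recorded a write to "x" that was never acknowledged.
replica = [(1, "a"), (2, "b"), (3, "x")]

divergent, to_send = compare_logs(authoritative, replica)
print(sorted(objects_of(divergent)), sorted(objects_of(to_send)))
```

In this toy model the objects touched by ``divergent`` are the ones the
replica would be told to discard, and the objects touched by ``to_send``
would appear in its *missing set* until recovery copies the data.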

State Model
-----------

.. graphviz:: peering_graph.generated.dot