======================
Peering
======================

Concepts
--------

*Peering*
  the process of bringing all of the OSDs that store a Placement Group (PG)
  into agreement about the state of all of the objects in that PG and all of
  the metadata associated with those objects. Two OSDs can agree on the state
  of the objects in the placement group yet may not necessarily have the
  latest contents.

*Acting set*
  the ordered list of OSDs that are (or were as of some epoch) responsible for
  a particular PG.

*Up set*
  the ordered list of OSDs responsible for a particular PG for a particular
  epoch, according to CRUSH. This is the same as the *acting set* except when
  the *acting set* has been explicitly overridden via *PG temp* in the OSDMap.

*PG temp*
  a temporary placement group acting set that is used while backfilling the
  primary OSD. Assume that the acting set is ``[0,1,2]`` and we are
  ``active+clean``. Now assume that something happens and the acting set
  becomes ``[3,1,2]``. Under these circumstances, OSD ``3`` is empty and can't
  serve reads even though it is the primary. ``osd.3`` will respond by
  requesting a *PG temp* of ``[1,2,3]`` to the monitors using a ``MOSDPGTemp``
  message, and ``osd.1`` will become the primary temporarily. ``osd.1`` will
  select ``osd.3`` as a backfill peer and will continue to serve reads and
  writes while ``osd.3`` is backfilled. When backfilling is complete, *PG
  temp* is discarded. The acting set changes back to ``[3,1,2]`` and ``osd.3``
  becomes the primary.
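
  The relationship between the CRUSH-computed *up set* and a *PG temp*
  override can be pictured with a minimal sketch (illustrative Python, not
  Ceph code; the helper name is made up)::

    def acting_set(up_set, pg_temp=None):
        """Return the acting set for a PG: the *PG temp* override when one
        is registered in the OSDMap, otherwise the CRUSH-computed up set."""
        return list(pg_temp) if pg_temp else list(up_set)

    up = [3, 1, 2]                     # CRUSH now puts the empty osd.3 first
    acting_set(up)                     # -> [3, 1, 2]: osd.3 would be primary
    acting_set(up, pg_temp=[1, 2, 3])  # -> [1, 2, 3]: osd.1 serves I/O
                                       #    while osd.3 is backfilled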

*current interval* or *past interval*
  a sequence of OSD map epochs during which the *acting set* and the *up
  set* for a particular PG do not change.

*primary*
  the member of the *acting set* that is responsible for coordinating peering.
  It is the only OSD that accepts client-initiated writes to the objects in a
  placement group. By convention, the primary is the first member of the
  *acting set*.

*replica*
  a non-primary OSD in the *acting set* of a placement group. A replica has
  been recognized as a non-primary OSD and has been *activated* by the
  primary.

*stray*
  an OSD that is not a member of the current *acting set* and has not yet been
  told to delete its copies of a particular placement group.

*recovery*
  the process of ensuring that copies of all of the objects in a PG are on all
  of the OSDs in the *acting set*. After *peering* has been performed, the
  primary can begin accepting write operations and *recovery* can proceed in
  the background.

*PG info*
  basic metadata about the PG's creation epoch, the version for the most
  recent write to the PG, the *last epoch started*, the *last epoch clean*,
  and the beginning of the *current interval*. Any inter-OSD communication
  about PGs includes the *PG info*, such that any OSD that knows a PG exists
  (or once existed) also has a lower bound on *last epoch clean* or *last
  epoch started*.

*PG log*
  a list of recent updates made to objects in a PG. These logs can be
  truncated after all OSDs in the *acting set* have acknowledged the changes.

*missing set*
  the set of all objects that have not yet had their contents updated to match
  the log entries. The missing set is collated by each OSD. Missing sets are
  kept track of on an ``<OSD,PG>`` basis.

*Authoritative History*
  a complete and fully-ordered set of operations that bring an OSD's copy of a
  Placement Group up to date.

*epoch*
  a (monotonically increasing) OSD map version number.

*last epoch start*
  the last epoch at which all nodes in the *acting set* for a given placement
  group agreed on an *authoritative history*. As of that epoch, *peering* is
  deemed to have been successful.

*up_thru*
  before a primary can successfully complete the *peering* process,
  it must inform a monitor that it is alive through the current
  OSD map epoch by having the monitor set its *up_thru* in the OSD
  map. This helps peering ignore previous *acting sets* for which
  peering never completed after certain sequences of failures, such as
  the second interval below:

  - *acting set* = [A,B]
  - *acting set* = [A]
  - *acting set* = [] very shortly after (e.g., simultaneous failure, but staggered detection)
  - *acting set* = [B] (B restarts, A does not)
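
  How *up_thru* lets the new primary disregard that second interval can be
  pictured with a minimal sketch (illustrative Python, not Ceph code; the
  interval records and field names are made up)::

    # Each past interval records its acting set and its first OSD map epoch.
    intervals = [
        {"first": 10, "acting": ["A", "B"]},   # peering completed here
        {"first": 20, "acting": ["A"]},        # A failed before up_thru was set
        {"first": 21, "acting": []},           # nobody up
        {"first": 22, "acting": ["B"]},        # current interval
    ]
    up_thru = {"A": 10, "B": 22}               # as recorded in the OSD map

    def maybe_went_rw(interval):
        # Writes could only have been accepted if the interval's primary had
        # its up_thru advanced to at least the interval's first epoch.
        if not interval["acting"]:
            return False
        primary = interval["acting"][0]
        return up_thru.get(primary, 0) >= interval["first"]

    [i["first"] for i in intervals[:-1] if maybe_went_rw(i)]
    # -> [10]: only the [A,B] interval must be consulted; the [A] interval
    #    could never have acknowledged a write, so B can peer without A.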

*last epoch clean*
  the last epoch at which all nodes in the *acting set* for a given placement
  group were completely up to date (this includes both the PG's logs and the
  PG's object contents). At this point, *recovery* is deemed to have been
  completed.

Description of the Peering Process
----------------------------------

The *Golden Rule* is that no write operation to any PG
is acknowledged to a client until it has been persisted
by all members of the *acting set* for that PG. This means
that if we can communicate with at least one member of
each *acting set* since the last successful *peering*, someone
will have a record of every (acknowledged) operation
since the last successful *peering*.
It should therefore be possible for the current
primary to construct and disseminate a new *authoritative history*.

It is also important to appreciate the role of the OSD map
(list of all known OSDs and their states, as well as some
information about the placement groups) in the *peering*
process:

  When OSDs go up or down (or get added or removed)
  this has the potential to affect the *acting sets*
  of many placement groups.

  Before a primary successfully completes the *peering*
  process, the OSD map must reflect that the OSD was alive
  and well as of the first epoch in the *current interval*.

  Changes can only be made after successful *peering*.

Thus, a new primary can use the latest OSD map along with a recent
history of past maps to generate a set of *past intervals* to
determine which OSDs must be consulted before we can successfully
*peer*. The set of past intervals is bounded by *last epoch started*,
the most recent *past interval* for which we know *peering* completed.
The process by which an OSD discovers a PG exists in the first place is
by exchanging *PG info* messages, so the OSD always has some lower
bound on *last epoch started*.
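
How the set of *past intervals* might be derived from a recent history of
OSD maps can be pictured with a minimal sketch (illustrative Python, not
Ceph code; the map records and field names are made up)::

    def past_intervals(maps, pg, last_epoch_started):
        """Group consecutive OSD map epochs into intervals for one PG.

        A new interval starts whenever the PG's acting set or up set
        changes.  Only epochs since last_epoch_started (the last time
        peering is known to have completed) are of interest.
        """
        intervals = []
        current = None
        for m in maps:                        # maps sorted by epoch
            if m["epoch"] < last_epoch_started:
                continue
            sets = (m["acting"][pg], m["up"][pg])
            if current is None or sets != current["sets"]:
                current = {"first": m["epoch"], "last": m["epoch"],
                           "sets": sets}
                intervals.append(current)
            else:
                current["last"] = m["epoch"]
        return intervals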

The high level process is for the current PG primary to:

  1. get a recent OSD map (to identify the members of all the
     interesting *acting sets*, and confirm that we are still the
     primary).

  #. generate a list of *past intervals* since *last epoch started*.
     Consider the subset of those for which *up_thru* was greater than
     the first interval epoch, as recorded in the last interval epoch's
     OSD map; that is, the subset for which *peering* could have completed
     before the *acting set* changed to another set of OSDs.

     Successful *peering* will require that we be able to contact at
     least one OSD from each *past interval*'s *acting set*.

  #. ask every node in that list for its *PG info*, which includes the most
     recent write made to the PG, and a value for *last epoch started*. If
     we learn about a *last epoch started* that is newer than our own, we can
     prune older *past intervals* and reduce the peer OSDs we need to contact.

  #. if anyone else has (in its PG log) operations that I do not have,
     instruct them to send me the missing log entries so that the primary's
     *PG log* is up to date (includes the newest write).

  #. for each member of the current *acting set*:

     a. ask it for copies of all PG log entries since *last epoch start*
        so that I can verify that they agree with mine (or know what
        objects I will be telling it to delete).

        If the cluster failed before an operation was persisted by all
        members of the *acting set*, and the subsequent *peering* did not
        remember that operation, and a node that did remember that
        operation later rejoined, its logs would record a different
        (divergent) history than the *authoritative history* that was
        reconstructed in the *peering* after the failure.

        Since the *divergent* events were not recorded in other logs
        from that *acting set*, they were not acknowledged to the client,
        and there is no harm in discarding them (so that all OSDs agree
        on the *authoritative history*). But, we will have to instruct
        any OSD that stores data from a divergent update to delete the
        affected (and now deemed to be apocryphal) objects.

     #. ask it for its *missing set* (object updates recorded
        in its PG log, but for which it does not have the new data).
        This is the list of objects that must be fully replicated
        before we can accept writes (see the sketch following this list).

  #. at this point, the primary's PG log contains an *authoritative history* of
     the placement group, and the OSD now has sufficient
     information to bring any other OSD in the *acting set* up to date.

  #. if the primary's *up_thru* value in the current OSD map is not greater than
     or equal to the first epoch in the *current interval*, send a request to the
     monitor to update it, and wait until we receive an updated OSD map that
     reflects the change.

  #. for each member of the current *acting set*:

     a. send them log updates to bring their PG logs into agreement with
        my own (*authoritative history*) ... which may involve deciding
        to delete divergent objects.

     #. await acknowledgment that they have persisted the PG log entries.

  #. at this point all OSDs in the *acting set* agree on all of the meta-data,
     and would (in any future *peering*) return identical accounts of all
     updates.

     a. start accepting client write operations (because we have unanimous
        agreement on the state of the objects into which those updates are
        being accepted). Note, however, that if a client tries to write to an
        object it will be promoted to the front of the recovery queue, and the
        write will be applied after it is fully replicated to the current
        *acting set*.

  #. update the *last epoch started* value in our local *PG info*, and instruct
     other *acting set* OSDs to do the same.

  #. start pulling object data updates that other OSDs have, but I do not. We may
     need to query OSDs from additional *past intervals* prior to *last epoch
     started* (the last time *peering* completed) and following *last epoch
     clean* (the last epoch that recovery completed) in order to find copies of
     all objects.

  #. start pushing object data updates to other OSDs that do not yet have them.

     We push these updates from the primary (rather than having the replicas
     pull them) because this allows the primary to ensure that a replica has
     the current contents before sending it an update write. It also makes
     it possible for a single read (from the primary) to be used to write
     the data to multiple replicas. If each replica did its own pulls,
     the data might have to be read multiple times.

  #. once all replicas store copies of all objects (that
     existed prior to the start of this epoch) we can update *last
     epoch clean* in the *PG info*, and we can dismiss all of the
     *stray* replicas, allowing them to delete their copies of objects
     for which they are no longer in the *acting set*.

     We could not dismiss the *strays* prior to this because it was possible
     that one of those *strays* might hold the sole surviving copy of an
     old object (all of whose copies disappeared before they could be
     replicated on members of the current *acting set*).
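
The log comparison referred to in step 5 above can be pictured with a minimal
sketch (illustrative Python, not Ceph code; the log-entry layout and helper
name are made up). A peer's log entries that are absent from the
*authoritative history* are divergent and will be discarded (and their objects
deleted), while authoritative entries the peer has not yet applied form that
peer's *missing set*::

    def compare_logs(authoritative, peer_log):
        """Compare a peer's PG log against the authoritative history.

        Both logs are lists of (version, object_name) tuples in version
        order.  Returns (divergent, missing):
          divergent -- peer entries not in the authoritative history,
                       whose objects must be deleted
          missing   -- authoritative entries the peer has not yet applied,
                       i.e. that peer's contribution to the missing set
        """
        auth = set(authoritative)
        peer = set(peer_log)
        divergent = [e for e in peer_log if e not in auth]
        missing = [e for e in authoritative if e not in peer]
        return divergent, missing

    # The write at version 7 was never acknowledged (it reached only the
    # peer before a failure), so it is divergent and will be discarded.
    authoritative = [(5, "obj_a"), (6, "obj_b"), (8, "obj_c")]
    peer_log      = [(5, "obj_a"), (6, "obj_b"), (7, "obj_x")]
    compare_logs(authoritative, peer_log)
    # -> ([(7, 'obj_x')], [(8, 'obj_c')])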

Generate a State Model
----------------------

Use the `gen_state_diagram.py <https://github.com/ceph/ceph/blob/master/doc/scripts/gen_state_diagram.py>`_ script to generate a copy of the latest peering state model::

  $ git clone https://github.com/ceph/ceph.git
  $ cd ceph
  $ cat src/osd/PeeringState.h src/osd/PeeringState.cc | doc/scripts/gen_state_diagram.py > doc/dev/peering_graph.generated.dot
  $ sed -i 's/7,7/1080,1080/' doc/dev/peering_graph.generated.dot
  $ dot -Tsvg doc/dev/peering_graph.generated.dot > doc/dev/peering_graph.generated.svg

Sample state model:

.. graphviz:: peering_graph.generated.dot