======================
Peering
======================

Concepts
--------

*Peering*
  the process of bringing all of the OSDs that store a Placement Group (PG)
  into agreement about the state of all of the objects in that PG and all of
  the metadata associated with those objects. Two OSDs can agree on the state
  of the objects in the placement group yet still may not necessarily have the
  latest contents.

*Acting set*
  the ordered list of OSDs that are (or were as of some epoch) responsible for
  a particular PG.

*Up set*
  the ordered list of OSDs responsible for a particular PG for a particular
  epoch, according to CRUSH. This is the same as the *acting set* except when
  the *acting set* has been explicitly overridden via *PG temp* in the OSDMap.

*PG temp*
  a temporary placement group acting set that is used while backfilling the
  primary OSD. Assume that the acting set is ``[0,1,2]`` and we are
  ``active+clean``. Now assume that something happens and the acting set
  becomes ``[3,1,2]``. Under these circumstances, ``osd.3`` is empty and can't
  serve reads even though it is the primary. ``osd.3`` will respond by
  requesting a *PG temp* of ``[1,2,3]`` from the monitors using a ``MOSDPGTemp``
  message, and ``osd.1`` will become the primary temporarily. ``osd.1`` will
  select ``osd.3`` as a backfill peer and will continue to serve reads and
  writes while ``osd.3`` is backfilled. When backfilling is complete, *PG temp*
  is discarded. The acting set changes back to ``[3,1,2]`` and ``osd.3``
  becomes the primary.
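
  A minimal, self-contained Python sketch of the reordering described above
  (``choose_pg_temp`` and its arguments are illustrative names, not Ceph
  code)::

    def choose_pg_temp(acting, has_data):
        """Return a temporary acting order when the CRUSH-chosen primary is empty."""
        primary = acting[0]
        if primary in has_data:
            return None                       # no override needed
        # Put OSDs that already hold the data first; keep the empty OSD in the
        # set (at the end) so that it can be backfilled.
        return ([o for o in acting if o in has_data] +
                [o for o in acting if o not in has_data])

    # The example from the text: acting = [3, 1, 2]; only osd.1 and osd.2 have data.
    print(choose_pg_temp([3, 1, 2], {1, 2}))  # -> [1, 2, 3]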

*current interval* or *past interval*
  a sequence of OSD map epochs during which the *acting set* and the *up
  set* for a particular PG do not change.

*primary*
  the member of the *acting set* that is responsible for coordinating peering.
  The only OSD that accepts client-initiated writes to the objects in a
  placement group. By convention, the primary is the first member of the
  *acting set*.

*replica*
  a non-primary OSD in the *acting set* of a placement group. A replica has
  been recognized as a non-primary OSD and has been *activated* by the
  primary.

*stray*
  an OSD that is not a member of the current *acting set* and has not yet been
  told to delete its copies of a particular placement group.

*recovery*
  the process of ensuring that copies of all of the objects in a PG are on all
  of the OSDs in the *acting set*. After *peering* has been performed, the
  primary can begin accepting write operations and *recovery* can proceed in
  the background.

*PG info*
  basic metadata about the PG's creation epoch, the version of the most
  recent write to the PG, the *last epoch started*, the *last epoch clean*,
  and the beginning of the *current interval*. Any inter-OSD communication
  about PGs includes the *PG info*, such that any OSD that knows a PG exists
  (or once existed) also has a lower bound on *last epoch clean* or *last
  epoch started*.
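
  A rough Python sketch of the fields described above (the names are
  illustrative, not the actual Ceph structure)::

    from dataclasses import dataclass

    @dataclass
    class PGInfo:
        created_epoch: int         # epoch in which the PG was created
        last_update: tuple         # (epoch, version) of the most recent write
        last_epoch_started: int    # lower bound on when peering last completed
        last_epoch_clean: int      # lower bound on when recovery last completed
        interval_start_epoch: int  # first epoch of the current interval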

*PG log*
  a list of recent updates made to objects in a PG. These logs can be
  truncated after all OSDs in the *acting set* have acknowledged the changes.

*missing set*
  the set of all objects that have not yet had their contents updated to match
  the log entries. The missing set is collated by each OSD. Missing sets are
  kept track of on an ``<OSD,PG>`` basis.
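
  A rough sketch of that bookkeeping in Python (the layout and names are
  illustrative only)::

    # For each <OSD, PG> pair, record the objects whose PG log entries are
    # known but whose contents are not yet present, and the version needed.
    missing = {
        (3, "1.2f"): {"rbd_data.1000": 1042, "rbd_data.1001": 1057},
    }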

*Authoritative History*
  a complete and fully-ordered set of operations that bring an OSD's copy of a
  Placement Group up to date.

*epoch*
  a (monotonically increasing) OSD map version number.

*last epoch started*
  the last epoch at which all nodes in the *acting set* for a given placement
  group agreed on an *authoritative history*. As of the start of that epoch,
  *peering* is deemed to have been successful.

*up_thru*
  before a primary can successfully complete the *peering* process,
  it must inform a monitor that it is alive through the current
  OSD map epoch by having the monitor set its *up_thru* in the OSD
  map. This helps peering ignore previous *acting sets* for which
  peering never completed after certain sequences of failures, such as
  the second interval below:

  - *acting set* = [A,B]
  - *acting set* = [A]
  - *acting set* = [] very shortly after (e.g., simultaneous failure, but staggered detection)
  - *acting set* = [B] (B restarts, A does not)
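
  A rough Python sketch of the check this enables (illustrative names, not
  Ceph's implementation): an interval whose primary never had its *up_thru*
  advanced into that interval cannot have completed peering or served writes,
  so its OSDs need not be consulted::

    def interval_may_have_served_writes(first_epoch, acting, up_thru):
        """Could this past interval have completed peering and gone read/write?"""
        if not acting:
            return False                      # no OSDs were up at all
        primary = acting[0]
        # up_thru maps osd id -> last epoch the monitors recorded it alive through.
        return up_thru.get(primary, 0) >= first_epoch

    # The second interval above (acting set [A]) began at epoch 11, but A's
    # up_thru never reached 11, so that interval can safely be ignored.
    print(interval_may_have_served_writes(11, ["A"], {"A": 10, "B": 10}))  # -> False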

*last epoch clean*
  the last epoch at which all nodes in the *acting set* for a given placement
  group were completely up to date (this includes both the PG's logs and the
  PG's object contents). At this point, *recovery* is deemed to have been
  completed.

Description of the Peering Process
----------------------------------

The *Golden Rule* is that no write operation to any PG
is acknowledged to a client until it has been persisted
by all members of the *acting set* for that PG. This means
that if we can communicate with at least one member of
each *acting set* since the last successful *peering*, someone
will have a record of every (acknowledged) operation
since the last successful *peering*.
It follows that it should be possible for the current
primary to construct and disseminate a new *authoritative history*.

It is also important to appreciate the role of the OSD map
(list of all known OSDs and their states, as well as some
information about the placement groups) in the *peering*
process:

When OSDs go up or down (or get added or removed)
this has the potential to affect the *acting sets*
of many placement groups.

Before a primary successfully completes the *peering*
process, the OSD map must reflect that the OSD was alive
and well as of the first epoch in the *current interval*.

Changes can only be made after successful *peering*.

Thus, a new primary can use the latest OSD map along with a recent
history of past maps to generate a set of *past intervals* to
determine which OSDs must be consulted before we can successfully
*peer*. The set of past intervals is bounded by *last epoch started*,
the most recent *past interval* for which we know *peering* completed.
The process by which an OSD discovers a PG exists in the first place is
by exchanging *PG info* messages, so the OSD always has some lower
bound on *last epoch started*.
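
A rough, self-contained Python sketch of that interval generation (the map
representation and names are assumptions for illustration, not Ceph's API)::

  def past_intervals(osd_maps, last_epoch_started):
      """Group consecutive OSD map epochs whose acting/up sets for one PG are
      unchanged, starting from last_epoch_started."""
      intervals = []
      current = None
      for epoch in sorted(e for e in osd_maps if e >= last_epoch_started):
          sets = (tuple(osd_maps[epoch]["acting"]), tuple(osd_maps[epoch]["up"]))
          if current and sets == current["sets"]:
              current["last"] = epoch                 # same interval continues
          else:
              current = {"first": epoch, "last": epoch, "sets": sets}
              intervals.append(current)
      return intervals

  # Example: acting/up change at epoch 12, giving two intervals, 10-11 and 12-13.
  maps = {10: {"acting": [0, 1, 2], "up": [0, 1, 2]},
          11: {"acting": [0, 1, 2], "up": [0, 1, 2]},
          12: {"acting": [1, 2],    "up": [1, 2]},
          13: {"acting": [1, 2],    "up": [1, 2]}}
  print(past_intervals(maps, 10))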

The high level process is for the current PG primary to
(a condensed sketch of this flow appears after the list):

1. get a recent OSD map (to identify the members of all the
   interesting *acting sets*, and confirm that we are still the
   primary).

#. generate a list of *past intervals* since *last epoch started*.
   Consider the subset of those for which *up_thru* was greater than
   the first interval epoch, according to the last interval epoch's OSD map;
   that is, the subset for which *peering* could have completed before the
   *acting set* changed to another set of OSDs.

   Successful *peering* will require that we be able to contact at
   least one OSD from each *past interval*'s *acting set*.

#. ask every node in that list for its *PG info*, which includes the most
   recent write made to the PG, and a value for *last epoch started*. If
   we learn about a *last epoch started* that is newer than our own, we can
   prune older *past intervals* and reduce the peer OSDs we need to contact.

#. if anyone else has (in its PG log) operations that I do not have,
   instruct them to send me the missing log entries so that the primary's
   *PG log* is up to date (i.e., it includes the newest write).

#. for each member of the current *acting set*:

   a. ask it for copies of all PG log entries since *last epoch started*
      so that I can verify that they agree with mine (or know what
      objects I will be telling it to delete).

      If the cluster failed before an operation was persisted by all
      members of the *acting set*, and the subsequent *peering* did not
      remember that operation, and a node that did remember that
      operation later rejoined, its logs would record a different
      (divergent) history than the *authoritative history* that was
      reconstructed in the *peering* after the failure.

      Since the *divergent* events were not recorded in other logs
      from that *acting set*, they were not acknowledged to the client,
      and there is no harm in discarding them (so that all OSDs agree
      on the *authoritative history*). But, we will have to instruct
      any OSD that stores data from a divergent update to delete the
      affected (and now deemed to be apocryphal) objects.

   #. ask it for its *missing set* (object updates recorded
      in its PG log, but for which it does not have the new data).
      This is the list of objects that must be fully replicated
      before we can accept writes.

#. at this point, the primary's PG log contains an *authoritative history* of
   the placement group, and the OSD now has sufficient
   information to bring any other OSD in the *acting set* up to date.

#. if the primary's *up_thru* value in the current OSD map is not greater than
   or equal to the first epoch in the *current interval*, send a request to the
   monitor to update it, and wait until an updated OSD map that reflects
   the change is received.

#. for each member of the current *acting set*:

   a. send them log updates to bring their PG logs into agreement with
      my own (*authoritative history*) ... which may involve deciding
      to delete divergent objects.

   #. await acknowledgment that they have persisted the PG log entries.

#. at this point all OSDs in the *acting set* agree on all of the metadata,
   and would (in any future *peering*) return identical accounts of all
   updates.

   a. start accepting client write operations (because we have unanimous
      agreement on the state of the objects into which those updates are
      being accepted). Note, however, that if a client tries to write to an
      object it will be promoted to the front of the recovery queue, and the
      write will only be applied after the object is fully replicated to the
      current *acting set*.

#. update the *last epoch started* value in our local *PG info*, and instruct
   other *acting set* OSDs to do the same.

#. start pulling object data updates that other OSDs have, but I do not. We may
   need to query OSDs from additional *past intervals* prior to *last epoch started*
   (the last time *peering* completed) and following *last epoch clean* (the last epoch that
   recovery completed) in order to find copies of all objects.

#. start pushing object data updates to other OSDs that do not yet have them.

   We push these updates from the primary (rather than having the replicas
   pull them) because this allows the primary to ensure that a replica has
   the current contents before sending it an update write. It also makes
   it possible for a single read (from the primary) to be used to write
   the data to multiple replicas. If each replica did its own pulls,
   the data might have to be read multiple times.

#. once all replicas store copies of all objects (that
   existed prior to the start of this epoch) we can update *last
   epoch clean* in the *PG info*, and we can dismiss all of the
   *stray* replicas, allowing them to delete their copies of objects
   for which they are no longer in the *acting set*.

   We could not dismiss the *strays* prior to this because it was possible
   that one of those *strays* might hold the sole surviving copy of an
   old object (all of whose copies disappeared before they could be
   replicated on members of the current *acting set*).
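
The whole flow above can be condensed into a rough, non-authoritative Python
sketch; every helper on ``pg`` below is an assumed, illustrative name rather
than Ceph's internal API::

  def peer_and_recover(pg):
      """Condensed sketch of the primary's peering and recovery flow."""
      osdmap = pg.get_recent_osd_map()
      if not osdmap.is_primary(pg.whoami, pg.pgid):
          return                                    # no longer primary; stop

      # Work out which past acting sets must be consulted.
      intervals = pg.generate_past_intervals(since=pg.info.last_epoch_started)
      peers = pg.pick_osds_to_contact(intervals)    # >= 1 OSD per past acting set

      infos = pg.collect_pg_info(peers)             # newest write, last epoch started
      pg.fetch_missing_log_entries(infos)           # make the primary's PG log complete

      for osd in pg.acting[1:]:
          pg.compare_logs(osd)                      # find divergent entries to discard
          pg.collect_missing_set(osd)               # objects the replica still needs

      pg.ensure_up_thru(osdmap)                     # have the monitors record up_thru

      for osd in pg.acting[1:]:
          pg.send_log_updates(osd)                  # disseminate authoritative history
      pg.wait_for_log_acks()

      pg.accept_client_writes()                     # metadata now agrees everywhere
      pg.update_last_epoch_started()

      pg.pull_missing_objects()                     # background recovery: pull, then push
      pg.push_missing_objects()

      pg.update_last_epoch_clean()
      pg.dismiss_strays()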

Generate a State Model
----------------------

Use the `gen_state_diagram.py <https://github.com/ceph/ceph/blob/master/doc/scripts/gen_state_diagram.py>`_ script to generate a copy of the latest peering state model::

  $ git clone https://github.com/ceph/ceph.git
  $ cd ceph
  $ cat src/osd/PeeringState.h src/osd/PeeringState.cc | doc/scripts/gen_state_diagram.py > doc/dev/peering_graph.generated.dot
  $ sed -i 's/7,7/1080,1080/' doc/dev/peering_graph.generated.dot
  $ dot -Tsvg doc/dev/peering_graph.generated.dot > doc/dev/peering_graph.generated.svg

Sample state model:

.. graphviz:: peering_graph.generated.dot