..
      Licensed under the Apache License, Version 2.0 (the "License"); you may
      not use this file except in compliance with the License. You may obtain
      a copy of the License at

          http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
      License for the specific language governing permissions and limitations
      under the License.

      Convention for heading levels in Open vSwitch documentation:

      =======  Heading 0 (reserved for the title in a document)
      -------  Heading 1
      ~~~~~~~  Heading 2
      +++++++  Heading 3
      '''''''  Heading 4

      Avoid deeper levels because they do not render well.

==================================
OVN Gateway High Availability Plan
==================================

::

    OVN Gateway

         +---------------------------+
         |                           |
         |      External Network     |
         |                           |
         +-------------^-------------+
                       |
                       |
                 +-----------+
                 |           |
                 |  Gateway  |
                 |           |
                 +-----------+
                       ^
                       |
                       |
         +-------------v-------------+
         |                           |
         |    OVN Virtual Network    |
         |                           |
         +---------------------------+

The OVN gateway is responsible for shuffling traffic between the tunneled
overlay network (governed by ovn-northd) and the legacy physical network. In
a naive implementation, the gateway is a single x86 server, or hardware VTEP.
For most deployments, a single system has enough forwarding capacity to
service the entire virtualized network; however, it introduces a single point
of failure. If this system dies, the entire OVN deployment becomes
unavailable. To mitigate this risk, an HA solution is critical -- by
spreading responsibility across multiple systems, no single server failure
can take down the network.

An HA solution is both critical to the manageability of the system and
extremely difficult to get right. The purpose of this document is to propose
a plan for OVN Gateway High Availability which takes into account our past
experience building similar systems. It should be considered a fluid,
changing proposal, not a set-in-stone decree.

Basic Architecture
------------------

In an OVN deployment, the set of hypervisors and network elements operating
under the guidance of ovn-northd are in what's called "logical space". These
servers use VXLAN, STT, or Geneve to communicate, oblivious to the details of
the underlying physical network. When these systems need to communicate with
legacy networks, traffic must be routed through a Gateway which translates
from OVN controlled tunnel traffic to raw physical network traffic.

Since the gateway is typically the only system with a connection to the
physical network, all traffic between logical space and the WAN must travel
through it. This makes it a critical single point of failure -- if the
gateway dies, communication with the WAN ceases for all systems in logical
space.

To mitigate this risk, multiple gateways should be run in a "High
Availability Cluster" or "HA Cluster". The HA cluster will be responsible
for performing the duties of a gateway, while being able to recover
gracefully from individual member failures.

::

    OVN Gateway HA Cluster

             +---------------------------+
             |                           |
             |      External Network     |
             |                           |
             +-------------^-------------+
                           |
                           |
    +----------------------v----------------------+
    |                                             |
    |          High Availability Cluster          |
    |                                             |
    | +-----------+  +-----------+  +-----------+ |
    | |           |  |           |  |           | |
    | |  Gateway  |  |  Gateway  |  |  Gateway  | |
    | |           |  |           |  |           | |
    | +-----------+  +-----------+  +-----------+ |
    +----------------------^----------------------+
                           |
                           |
             +-------------v-------------+
             |                           |
             |    OVN Virtual Network    |
             |                           |
             +---------------------------+

L2 vs L3 High Availability
~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to achieve this goal, there are two broad approaches one can take.
The HA cluster can appear to the network like a giant Layer 2 Ethernet
switch, or like a giant IP router. These approaches are called L2HA and
L3HA, respectively. L2HA allows Ethernet broadcast domains to extend into
logical space, a significant advantage, but this comes at a cost: the need
to avoid transient L2 loops during failover significantly complicates the
design. On the other hand, L3HA works for most use cases, is simpler, and
fails more gracefully. For these reasons, it is suggested that OVN support
an L3HA model, leaving L2HA for future work (or third party VTEP providers).
Both models are discussed further below.

L3HA
----

In this section, we'll work through a simple L3HA implementation, on top of
which we'll gradually build more sophisticated features, explaining their
motivations and implementations as we go.

Naive active-backup
~~~~~~~~~~~~~~~~~~~

Let's assume that there is a collection of logical routers which a tenant
has asked for; our task is to schedule these logical routers on one of N
gateways, and to gracefully redistribute the routers running on gateways
which have failed. The absolute simplest way to achieve this is what we'll
call "naive-active-backup".

::

    Naive Active Backup HA Implementation

    +----------------+   +----------------+
    |     Leader     |   |     Backup     |
    |                |   |                |
    |  A B C         |   |                |
    |                |   |                |
    +----+-+-+-+----++   +-+--------------+
         ^ ^ ^ ^    |      |
         | | | |    |      |
         | | | |  +-+------+---+
         + + + +  | ovn-northd |
          Traffic +------------+

In a naive active-backup, one of the Gateways is chosen (arbitrarily) as a
leader. All logical routers (A, B, C in the figure) are scheduled on this
leader gateway and all traffic flows through it. ovn-northd monitors this
gateway via OpenFlow echo requests (or some equivalent), and if the gateway
dies, it recreates the routers on one of the backups.

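In rough illustrative Python (the ``probe()`` and ``schedule_routers()``
helpers are hypothetical stand-ins for OpenFlow echo monitoring and router
creation, not real ovn-northd APIs), the control loop looks something like
this::

    # Sketch of naive active-backup failover logic, under the stated
    # assumptions; not actual ovn-northd code.
    import time

    def monitor(gateways, routers, probe, schedule_routers):
        leader = gateways[0]                # arbitrarily chosen leader
        schedule_routers(leader, routers)   # all routers on the leader
        while True:
            if not probe(leader):           # e.g. OpenFlow echo timed out
                live = [gw for gw in gateways if gw != leader and probe(gw)]
                if live:
                    leader = live[0]        # promote a backup to leader
                    schedule_routers(leader, routers)
            time.sleep(1)                   # probe interval
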
This approach basically works in most cases and should likely be the
starting point for OVN -- it's strictly better than no HA solution and is a
good foundation for more sophisticated solutions. That said, it's not
without its limitations. Specifically, this approach doesn't coordinate
with the physical network to minimize disruption during failures, it
tightly couples failover to ovn-northd (we'll discuss why this is bad in a
bit), and it wastes resources by leaving backup gateways completely
unutilized.

Router Failover
+++++++++++++++

When ovn-northd notices the leader has died and decides to migrate routers
to a backup gateway, the physical network has to be notified to direct
traffic to the new gateway. Otherwise, traffic could be blackholed for
longer than necessary, making failovers worse than they need to be.

For now, let's assume that OVN requires all gateways to be on the same IP
subnet on the physical network. If this isn't the case, gateways would need
to participate in routing protocols to orchestrate failovers, something
which is difficult and out of scope of this document.

Since all gateways are on the same IP subnet, we simply need to worry about
updating the MAC learning tables of the Ethernet switches on that subnet.
Presumably, they all have entries for each logical router pointing to the
old leader. If these entries aren't updated, all traffic will be sent to
the (now defunct) old leader, instead of the new one.

In order to mitigate this issue, it's recommended that the new gateway send
a Reverse ARP (RARP) onto the physical network for each logical router it
now controls. A Reverse ARP is a benign protocol used by many hypervisors
to update L2 forwarding tables when virtual machines migrate. In this case,
the Ethernet source address of the RARP is that of the logical router it
corresponds to, and its destination is the broadcast address. This causes
the RARP to travel to every L2 switch in the broadcast domain, updating
forwarding tables accordingly. This strategy is recommended in all failover
mechanisms discussed in this document -- when a router newly boots on a new
leader, it should RARP its MAC address.

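As a concrete illustration, such a frame could be emitted with a short
Python sketch along the following lines (the interface name and helper are
hypothetical; OVN would do the equivalent internally)::

    # Sketch: broadcast a Reverse ARP (EtherType 0x8035) sourced from a
    # logical router's MAC so switches update their forwarding tables.
    # Linux-only (AF_PACKET) and requires CAP_NET_RAW.
    import socket
    import struct

    def send_rarp(ifname, router_mac):
        eth = b"\xff" * 6 + router_mac + struct.pack("!H", 0x8035)
        # hw type 1 (Ethernet), proto 0x0800 (IPv4), hlen 6, plen 4,
        # opcode 3 (request reverse); the IP fields are left zeroed.
        rarp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 3)
        rarp += router_mac + b"\x00" * 4 + router_mac + b"\x00" * 4
        with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as s:
            s.bind((ifname, 0))
            s.send(eth + rarp)
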
Controller Independent Active-Backup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    Controller Independent Active-Backup Implementation

    +----------------+   +----------------+
    |     Leader     |   |     Backup     |
    |                |   |                |
    |  A B C         |   |                |
    |                |   |                |
    +----------------+   +----------------+
         ^ ^ ^ ^
         | | | |
         | | | |
         + + + +
          Traffic

The fundamental problem with naive active-backup is that it tightly couples
the failover solution to ovn-northd. This can significantly increase
downtime in the event of a failover, as the (often already busy) ovn-northd
controller has to recompute state for the new leader. Worse, if ovn-northd
goes down, we can't perform gateway failover at all. This violates the
principle that control plane outages should have no impact on dataplane
functionality.

In a controller independent active-backup configuration, ovn-northd is
responsible for initial configuration, while the HA cluster is responsible
for monitoring the leader and failing over to a backup if necessary.
ovn-northd sets HA policy, but doesn't actively participate when failovers
occur.

Of course, in this model, ovn-northd is not without some responsibility.
Its role is to pre-plan what should happen in the event of a failure,
leaving it to the individual switches to execute this plan. It does this by
assigning each gateway a unique leadership priority. Once assigned, it
communicates this priority to each node it controls. Nodes use the
leadership priority to determine which gateway in the cluster is the active
leader by using a simple metric: the leader is the healthy gateway with the
highest priority. If that gateway goes down, leadership falls to the one
with the next highest priority; conversely, if a new gateway comes up with
a higher priority, it takes over leadership.

Thus, in this model, leadership of the HA cluster is determined simply by
the status of its members. Therefore, if we can communicate the status of
each gateway to each transport node, they can individually figure out which
is the leader, and direct traffic accordingly.

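Put differently, each node independently evaluates something like the
following (an illustrative Python sketch; the gateway list and liveness
inputs are placeholders)::

    # Sketch: derive the leader from static priorities plus liveness.
    def elect_leader(gateways, is_alive):
        """gateways: iterable of (name, priority); is_alive: name -> bool."""
        live = [(prio, name) for name, prio in gateways if is_alive(name)]
        return max(live)[1] if live else None  # highest-priority live gateway
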
Tunnel Monitoring
+++++++++++++++++

Since leadership in this model is determined exclusively by the health
status of member gateways, a key problem is how to communicate this
information to the relevant transport nodes. Luckily, we can do this fairly
cheaply using tunnel monitoring protocols like BFD.

The basic idea is pretty straightforward. Each transport node maintains a
tunnel to every gateway in the HA cluster (not just the leader). These
tunnels are monitored using the BFD protocol to see which are alive. Given
this information, hypervisors can trivially compute the highest priority
live gateway, and thus the leader.

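For example, BFD could be enabled on each tunnel interface with something
like the following (the interface name is illustrative)::

    $ ovs-vsctl set Interface ovn-gw-0 bfd:enable=true
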
In practice, this leadership computation can be performed trivially using
the bundle or group action. Rather than using OpenFlow to simply output to
the leader, all gateways could be listed in an active-backup bundle action
ordered by their priority. The bundle action will automatically take into
account the tunnel monitoring status to output the packet to the highest
priority live gateway.

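A hypervisor flow could therefore look something like this (the OpenFlow
port numbers for the gateway tunnels are illustrative)::

    # Tunnels to three gateways on ports 10, 11, 12, listed in
    # leadership priority order; active_backup picks the first live one.
    actions=bundle(eth_src,0,active_backup,ofport,members:10,11,12)
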
Inter-Gateway Monitoring
++++++++++++++++++++++++

One somewhat subtle aspect of this model is that failovers are not globally
atomic. When a failover occurs, it will take some time for all hypervisors
to notice and adjust accordingly. Similarly, if a new high priority gateway
comes up, it may take some time for all hypervisors to switch over to the
new leader. In order to avoid confusing the physical network, under these
circumstances it's important for the backup gateways to drop traffic
they've received erroneously. In order to do this, each gateway must know
whether or not it is, in fact, active. This can be achieved by creating a
mesh of tunnels between gateways. Each gateway monitors the other gateways
in its cluster to determine which are alive, and therefore whether or not
it is itself the leader. If leading, the gateway forwards traffic normally;
otherwise it drops all traffic.

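In terms of the earlier sketch, a gateway simply checks whether it would
elect itself (again illustrative; ``elect_leader`` is the hypothetical
helper shown above)::

    # Sketch: forward only if this gateway believes it is the leader,
    # based on its own BFD view of the other cluster members.
    def should_forward(self_name, gateways, is_alive):
        return elect_leader(gateways, is_alive) == self_name
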
We should note that this method works well only under the assumption that
there are no inter-gateway connectivity failures; when such failures occur,
it can fail to elect a single master. The simplest example is two gateways
which stop seeing each other but can still reach the hypervisors. Protocols
like VRRP or CARP have the same issue. This type of failure mode could be
mitigated by having all network elements (hypervisors and gateways)
periodically share their link status with the other endpoints.

Gateway Leadership Resignation
++++++++++++++++++++++++++++++

Sometimes a gateway may be healthy, but still may not be suitable to lead
the HA cluster. This could happen for several reasons, including:

* The physical network is unreachable.

* BFD (or ping) has detected that the next hop router is unreachable.

* The Gateway recently booted and isn't fully configured.

In this case, the Gateway should resign leadership by holding its tunnels
down using the ``other_config:cpath_down`` flag. This indicates to
participating hypervisors and Gateways that this gateway should be treated
as if it's down, even though its tunnels are still healthy.

Router Specific Active-Backup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    Router Specific Active-Backup

    +----------------+   +----------------+
    |                |   |                |
    |   A C          |   |   B D E        |
    |                |   |                |
    +----------------+   +----------------+
         ^ ^ ^ ^
         | | | |
         | | | |
         + + + +
          Traffic

Controller independent active-backup is a great advance over naive
active-backup, but it still has one glaring problem -- it under-utilizes
the backup gateways. In an ideal scenario, all traffic would split evenly
among the live set of gateways. Getting all the way there is somewhat
tricky, but as a step in that direction, one could use the "Router Specific
Active-Backup" algorithm. This algorithm looks a lot like active-backup on
a per logical router basis, with one twist: it chooses a different active
Gateway for each logical router. Thus, in situations where there are
several logical routers, all with somewhat balanced load, this algorithm
performs better.

Implementation of this strategy is quite straightforward if built on top of
basic controller independent active-backup. On a per logical router basis,
the algorithm is the same: leadership is determined by the liveness of the
gateways. The key difference here is that the gateways must have a
different leadership priority for each logical router. These leadership
priorities can be computed by ovn-northd just as they had been in the
controller independent active-backup model.

Once we have these per logical router priorities, they simply need to be
communicated to the members of the gateway cluster and the hypervisors. The
hypervisors, in particular, simply need an active-backup bundle action (or
group action) per logical router, listing the gateways in priority order
for *that router*, rather than having a single bundle action shared by all
the routers.

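For instance, the hypervisor flow table could contain one bundle action per
router, each with its own gateway ordering (the matches and port numbers
here are purely illustrative)::

    # Router A prefers the gateway on port 10; router B the one on 11.
    reg0=0x1,actions=bundle(eth_src,0,active_backup,ofport,members:10,11,12)
    reg0=0x2,actions=bundle(eth_src,0,active_backup,ofport,members:11,12,10)
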
Additionally, the gateways need to be updated to take individual router
priorities into account. Specifically, each gateway should drop traffic for
the routers it backs, and forward traffic for the routers it leads, instead
of simply dropping or forwarding everything. This should likely be done by
having ovn-controller recompute OpenFlow for the gateway, though other
options exist.

The final complication is that ovn-northd's logic must be updated to choose
these per logical router leadership priorities in a more sophisticated
manner. It doesn't matter much exactly what algorithm it uses, beyond that
it should provide good balancing in the common case, i.e., each logical
router's priorities should be different enough that routers balance to
different gateways even when failures occur.

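One simple scheme with this property is to rotate the gateway list per
router (a hedged sketch; any algorithm with similar spreading behavior
would do)::

    # Sketch: give each router a rotated gateway preference list so that
    # both active routers and their failover targets spread across
    # gateways; higher numbers mean higher priority.
    def per_router_priorities(routers, gateways):
        plans = {}
        for i, router in enumerate(sorted(routers)):
            k = i % len(gateways)
            order = gateways[k:] + gateways[:k]
            plans[router] = {gw: len(order) - rank
                             for rank, gw in enumerate(order)}
        return plans
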
Preemption
++++++++++

In an active-backup setup, one issue that users will run into is that of
gateway leader preemption. If a new Gateway is added to a cluster, or for
some reason an existing gateway is rebooted, we could end up in a situation
where the newly activated gateway has a higher priority than any other in
the HA cluster. In this case, as soon as that gateway appears, it will
preempt leadership from the currently active leader, causing an unnecessary
failover. Since failover can be quite expensive, this preemption may be
undesirable.

The controller can optionally avoid preemption by cleverly tweaking the
leadership priorities. For each router, new gateways should be assigned
priorities that put them second in line or later when they eventually come
up. Furthermore, if a gateway goes down for a significant period of time,
its old leadership priorities should be revoked and new ones should be
assigned as if it were a brand new gateway. Note that this should only
happen if a gateway has been down for a while (several minutes); otherwise
a flapping gateway could have wide-ranging, unpredictable consequences.

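A sketch of such an assignment policy (illustrative only)::

    # Sketch: a gateway (re)joining the cluster gets a priority strictly
    # below every existing member's, so it cannot preempt the leader.
    def nonpreemptive_priority(existing_priorities):
        return min(existing_priorities, default=100) - 1
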
Note that preemption avoidance should be optional, depending on the
deployment. One necessarily sacrifices optimal load balancing to satisfy
these requirements, as new gateways will get no traffic on boot. Thus, this
feature represents a trade-off which must be made on a per installation
basis.

Fully Active-Active HA
~~~~~~~~~~~~~~~~~~~~~~

::

    Fully Active-Active HA

    +----------------+   +----------------+
    |                |   |                |
    |   A B C D E    |   |   A B C D E    |
    |                |   |                |
    +----------------+   +----------------+
         ^ ^ ^ ^
         | | | |
         | | | |
         + + + +
          Traffic

The final step in L3HA is to have true active-active HA. In this scenario,
each router has an instance on each Gateway, and a mechanism similar to
ECMP is used to distribute traffic evenly among all instances. This
mechanism would require Gateways to participate in routing protocols with
the physical network to attract traffic and to signal failures. It is out
of scope of this document, but may eventually be necessary.

L2HA
----

L2HA is very difficult to get right. Unlike L3HA, where the consequences of
problems are minor, in L2HA, if two gateways are both transiently active,
an L2 loop triggers and a broadcast storm results. In practice, to get
around this, gateways end up implementing an overly conservative "when in
doubt drop all traffic" policy, or they implement something like MLAG.

MLAG has multiple gateways work together to pretend to be a single L2
switch with a large LACP bond. In principle, it's the right approach, as it
solves the broadcast storm problem and has been deployed successfully in
other contexts. That said, it's difficult to get right and not recommended.