..
      Licensed under the Apache License, Version 2.0 (the "License"); you may
      not use this file except in compliance with the License.  You may obtain
      a copy of the License at

          http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
      License for the specific language governing permissions and limitations
      under the License.

      Convention for heading levels in Open vSwitch documentation:

      =======  Heading 0 (reserved for the title in a document)
      -------  Heading 1
      ~~~~~~~  Heading 2
      +++++++  Heading 3
      '''''''  Heading 4

      Avoid deeper levels because they do not render well.

==================================
OVN Gateway High Availability Plan
==================================

::

    OVN Gateway

    +---------------------------+
    |                           |
    |     External Network      |
    |                           |
    +-------------^-------------+
                  |
                  |
            +-----------+
            |           |
            |  Gateway  |
            |           |
            +-----------+
                  ^
                  |
                  |
    +-------------v-------------+
    |                           |
    |    OVN Virtual Network    |
    |                           |
    +---------------------------+

The OVN gateway is responsible for shuffling traffic between the tunneled
overlay network (governed by ovn-northd), and the legacy physical network.  In
a naive implementation, the gateway is a single x86 server, or hardware VTEP.
For most deployments, a single system has enough forwarding capacity to
service the entire virtualized network; however, it introduces a single point
of failure.  If this system dies, the entire OVN deployment becomes
unavailable.  To mitigate this risk, an HA solution is critical -- by
spreading responsibility across multiple systems, no single server failure
can take down the network.

An HA solution is both critical to the manageability of the system and
extremely difficult to get right.  The purpose of this document is to propose
a plan for OVN Gateway High Availability which takes into account our past
experience building similar systems.  It should be considered a fluid,
changing proposal, not a set-in-stone decree.

.. note::
   This document describes a range of options OVN could take to provide
   high availability for gateways.  The current implementation provides L3
   gateway high availability by the "Router Specific Active/Backup"
   approach described in this document.

Basic Architecture
------------------

In an OVN deployment, the set of hypervisors and network elements operating
under the guidance of ovn-northd are in what's called "logical space".  These
servers use VXLAN, STT, or Geneve to communicate, oblivious to the details of
the underlying physical network.  When these systems need to communicate with
legacy networks, traffic must be routed through a Gateway which translates
from OVN controlled tunnel traffic to raw physical network traffic.

Since the gateway is typically the only system with a connection to the
physical network, all traffic between logical space and the WAN must travel
through it.  This makes it a critical single point of failure -- if the
gateway dies, communication with the WAN ceases for all systems in logical
space.

To mitigate this risk, multiple gateways should be run in a "High
Availability Cluster" or "HA Cluster".  The HA cluster will be responsible
for performing the duties of a gateway, while being able to recover
gracefully from individual member failures.

::

    OVN Gateway HA Cluster

             +---------------------------+
             |                           |
             |     External Network      |
             |                           |
             +-------------^-------------+
                           |
                           |
    +----------------------v----------------------+
    |                                             |
    |          High Availability Cluster          |
    |                                             |
    | +-----------+  +-----------+  +-----------+ |
    | |           |  |           |  |           | |
    | |  Gateway  |  |  Gateway  |  |  Gateway  | |
    | |           |  |           |  |           | |
    | +-----------+  +-----------+  +-----------+ |
    +----------------------^----------------------+
                           |
                           |
             +-------------v-------------+
             |                           |
             |    OVN Virtual Network    |
             |                           |
             +---------------------------+

L2 vs L3 High Availability
~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to achieve this goal, there are two broad approaches one can take.
The HA cluster can appear to the network like a giant Layer 2 Ethernet
Switch, or like a giant IP Router.  These approaches are called L2HA and
L3HA, respectively.  L2HA allows ethernet broadcast domains to extend into
logical space, a significant advantage, but this comes at a cost.  The need
to avoid transient L2 loops during failover significantly complicates their
design.  On the other hand, L3HA works for most use cases, is simpler, and
fails more gracefully.  For these reasons, it is suggested that OVN support
an L3HA model, leaving L2HA for future work (or third party VTEP providers).
Both models are discussed further below.

L3HA
----

In this section, we'll work through a basic L3HA implementation, on top of
which we'll gradually build more sophisticated features, explaining their
motivations and implementations as we go.

Naive active-backup
~~~~~~~~~~~~~~~~~~~

Let's assume that there is a collection of logical routers which a tenant has
asked for.  Our task is to schedule these logical routers on one of N
gateways, and gracefully redistribute the routers from gateways which have
failed.  The absolute simplest way to achieve this is what we'll call
"naive-active-backup".

::

    Naive Active Backup HA Implementation

    +----------------+   +----------------+
    |     Leader     |   |     Backup     |
    |                |   |                |
    |    A B C       |   |                |
    |                |   |                |
    +----+-+-+-+----++   +-+--------------+
         ^ ^ ^ ^    |      |
         | | | |    |      |
         | | | |  +-+------+---+
         + + + +  | ovn-northd |
          Traffic +------------+

In a naive active-backup, one of the Gateways is chosen (arbitrarily) as a
leader.  All logical routers (A, B, C in the figure) are scheduled on this
leader gateway and all traffic flows through it.  ovn-northd monitors this
gateway via OpenFlow echo requests (or some equivalent), and if the gateway
dies, it recreates the routers on one of the backups.

This approach basically works in most cases and should likely be the starting
point for OVN -- it's strictly better than no HA solution and is a good
foundation for more sophisticated solutions.  That said, it's not without its
limitations.  Specifically, this approach doesn't coordinate with the
physical network to minimize disruption during failures, it tightly couples
failover to ovn-northd (we'll discuss why this is bad in a bit), and it
wastes resources by leaving backup gateways completely unutilized.

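The monitor-and-recreate loop above can be sketched in a few lines.  This is
only an illustration of the control flow, not ovn-northd's actual code: the
``echo_ok`` health check and the ``placement`` structure are hypothetical
stand-ins for the real OpenFlow echo monitoring and router scheduling.

```python
def elect_leader(gateways, echo_ok):
    """Return the first gateway that answers its health probe."""
    for gw in gateways:
        if echo_ok(gw):
            return gw
    raise RuntimeError("no live gateway available")

def failover_step(gateways, routers, placement, echo_ok):
    """One iteration of the naive monitor loop: if the current leader is
    dead (or none has been chosen), recreate every logical router on the
    first live backup."""
    leader = placement.get("leader")
    if leader is None or not echo_ok(leader):
        leader = elect_leader(gateways, echo_ok)
        placement["leader"] = leader
        placement["routers"] = {r: leader for r in routers}
    return placement
```

Note how every router moves at once: the coarseness of this failover is
exactly the under-utilization problem discussed later in this document.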
Router Failover
+++++++++++++++

When ovn-northd notices the leader has died and decides to migrate routers to
a backup gateway, the physical network has to be notified to direct traffic
to the new gateway.  Otherwise, traffic could be blackholed for longer than
necessary, making failovers worse than they need to be.

For now, let's assume that OVN requires all gateways to be on the same IP
subnet on the physical network.  If this isn't the case, gateways would need
to participate in routing protocols to orchestrate failovers, something which
is difficult and out of scope of this document.

Since all gateways are on the same IP subnet, we simply need to worry about
updating the MAC learning tables of the Ethernet switches on that subnet.
Presumably, they all have entries for each logical router pointing to the old
leader.  If these entries aren't updated, all traffic will be sent to the
(now defunct) old leader, instead of the new one.

In order to mitigate this issue, it's recommended that the new gateway sends
a Reverse ARP (RARP) onto the physical network for each logical router it now
controls.  A Reverse ARP is a benign protocol used by many hypervisors to
update L2 forwarding tables when virtual machines migrate.  In this case, the
ethernet source address of the RARP is that of the logical router it
corresponds to, and its destination is the broadcast address.  This causes
the RARP to travel to every L2 switch in the broadcast domain, updating
forwarding tables accordingly.  This strategy is recommended in all failover
mechanisms discussed in this document -- when a router newly boots on a new
leader, it should RARP its MAC address.

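To make the frame layout concrete, here is a minimal sketch that builds such
a RARP by hand with the standard library (EtherType 0x8035, opcode 3
"request reverse" per RFC 903).  The router MAC is a made-up example, and how
the frame is actually injected (raw socket, OpenFlow packet-out, ...) is left
out.

```python
import struct

def build_rarp(router_mac: bytes) -> bytes:
    """Return an Ethernet frame carrying a RARP whose source is the logical
    router's MAC and whose destination is the broadcast address."""
    assert len(router_mac) == 6
    # Ethernet header: broadcast dst, router src, EtherType 0x8035 (RARP).
    eth = b"\xff" * 6 + router_mac + struct.pack("!H", 0x8035)
    # RARP payload: htype=1 (Ethernet), ptype=0x0800 (IPv4), hlen=6, plen=4,
    # op=3 (request reverse); sender/target hardware address are both the
    # router's MAC, protocol addresses left zeroed.
    rarp = (struct.pack("!HHBBH", 1, 0x0800, 6, 4, 3)
            + router_mac + b"\x00" * 4
            + router_mac + b"\x00" * 4)
    return eth + rarp

frame = build_rarp(bytes.fromhex("0a0000000001"))
```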
Controller Independent Active-backup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    Controller Independent Active-Backup Implementation

    +----------------+   +----------------+
    |     Leader     |   |     Backup     |
    |                |   |                |
    |    A B C       |   |                |
    |                |   |                |
    +----------------+   +----------------+
         ^ ^ ^ ^
         | | | |
         | | | |
         + + + +
          Traffic

The fundamental problem with naive active-backup is that it tightly couples
the failover solution to ovn-northd.  This can significantly increase
downtime in the event of a failover, as the (often already busy) ovn-northd
controller has to recompute state for the new leader.  Worse, if ovn-northd
goes down, we can't perform gateway failover at all.  This violates the
principle that control plane outages should have no impact on dataplane
functionality.

In a controller independent active-backup configuration, ovn-northd is
responsible for initial configuration while the HA cluster is responsible for
monitoring the leader, and failing over to a backup if necessary.
ovn-northd sets HA policy, but doesn't actively participate when failovers
occur.

Of course, in this model, ovn-northd is not without some responsibility.  Its
role is to pre-plan what should happen in the event of a failure, leaving it
to the individual switches to execute this plan.  It does this by assigning
each gateway a unique leadership priority.  Once assigned, it communicates
this priority to each node it controls.  Nodes use the leadership priority to
determine which gateway in the cluster is the active leader by using a simple
metric: the leader is the healthy gateway with the highest priority.  If that
gateway goes down, leadership falls to the next highest priority, and
conversely, if a new gateway comes up with a higher priority, it takes over
leadership.

Thus, in this model, leadership of the HA cluster is determined simply by the
status of its members.  Therefore, if we can communicate the status of each
gateway to each transport node, they can individually figure out which is the
leader, and direct traffic accordingly.

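The leadership metric is simple enough to state as code.  In this sketch the
function name and inputs are hypothetical: in OVN the priority table would
come from ovn-northd and the liveness set from tunnel monitoring, but the
decision each node makes independently is just this:

```python
def current_leader(priorities, live):
    """Pick the leader as seen from one transport node.

    priorities: dict mapping gateway name -> integer priority (higher wins).
    live: set of gateways whose tunnels currently appear up.
    Returns the healthy gateway with the highest priority, or None.
    """
    candidates = [gw for gw in priorities if gw in live]
    if not candidates:
        return None
    return max(candidates, key=lambda gw: priorities[gw])
```

Because every node runs the same deterministic computation over the same
inputs, they converge on the same leader without any coordination protocol.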
Tunnel Monitoring
+++++++++++++++++

Since, in this model, leadership is determined exclusively by the health
status of member gateways, a key problem is how to communicate this
information to the relevant transport nodes.  Luckily, we can do this fairly
cheaply using tunnel monitoring protocols like BFD.

The basic idea is pretty straightforward.  Each transport node maintains a
tunnel to every gateway in the HA cluster (not just the leader).  These
tunnels are monitored using the BFD protocol to see which are alive.  Given
this information, hypervisors can trivially compute the highest priority live
gateway, and thus the leader.

In practice, this leadership computation can be performed trivially using the
bundle or group action.  Rather than using OpenFlow to simply output to the
leader, all gateways could be listed in an active-backup bundle action,
ordered by their priority.  The bundle action will automatically take into
account the tunnel monitoring status to output the packet to the highest
priority live gateway.

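As a concrete illustration, a hypervisor flow of this kind might look as
follows.  The bridge name, flow match, and tunnel port numbers 10-12 are
hypothetical, and the member list is ordered by leadership priority
(``members:`` is spelled ``slaves:`` in older Open vSwitch releases)::

    ovs-ofctl add-flow br-int \
        "priority=100,metadata=0x1,
         actions=bundle(eth_src,0,active_backup,ofport,members:10,11,12)"

With the ``active_backup`` algorithm, the bundle action outputs to the first
listed member whose tunnel is up, so failover happens in the datapath without
any controller involvement.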
Inter-Gateway Monitoring
++++++++++++++++++++++++

One somewhat subtle aspect of this model is that failovers are not globally
atomic.  When a failover occurs, it will take some time for all hypervisors
to notice and adjust accordingly.  Similarly, if a new high priority Gateway
comes up, it may take some time for all hypervisors to switch over to the new
leader.  In order to avoid confusing the physical network, under these
circumstances it's important for the backup gateways to drop traffic they've
received erroneously.  In order to do this, each Gateway must know whether or
not it is, in fact, active.  This can be achieved by creating a mesh of
tunnels between gateways.  Each gateway monitors the other gateways in its
cluster to determine which are alive, and therefore whether or not it is
itself the leader.  If leading, the gateway forwards traffic normally;
otherwise, it drops all traffic.

We should note that this method works well under the assumption that there
are no inter-gateway connectivity failures; in such a case, this method would
fail to elect a single master.  The simplest example is two gateways which
stop seeing each other but can still reach the hypervisors.  Protocols like
VRRP or CARP have the same issue.  A mitigation for this type of failure mode
could be achieved by having all network elements (hypervisors and gateways)
periodically share their link status with other endpoints.

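The drop-unless-leading rule each gateway applies can be sketched as follows.
The function name and inputs are hypothetical; in practice the peer liveness
set would come from BFD over the inter-gateway tunnel mesh:

```python
def should_forward(self_gw, priorities, live_peers):
    """Return True if self_gw believes it is the leader, i.e. it sees no
    live peer whose leadership priority exceeds its own.  If it is not the
    leader, it must drop traffic it receives erroneously."""
    mine = priorities[self_gw]
    return all(priorities[peer] <= mine for peer in live_peers)
```

Note the split-brain hazard described above is visible here: two gateways
that cannot see each other both observe an empty (or incomplete) peer set and
both conclude they should forward.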
Gateway Leadership Resignation
++++++++++++++++++++++++++++++

Sometimes a gateway may be healthy, but still may not be suitable to lead the
HA cluster.  This could happen for several reasons including:

* The physical network is unreachable.

* BFD (or ping) has detected the next hop router is unreachable.

* The Gateway recently booted and isn't fully configured.

In this case, the Gateway should resign leadership by holding its tunnels
down using the ``other_config:cpath_down`` flag.  This indicates to
participating hypervisors and Gateways that this gateway should be treated as
if it's down, even though its tunnels are still healthy.

Router Specific Active-Backup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    Router Specific Active-Backup

    +----------------+   +----------------+
    |                |   |                |
    |      A C       |   |     B D E      |
    |                |   |                |
    +----------------+   +----------------+
         ^ ^ ^ ^
         | | | |
         | | | |
         + + + +
          Traffic

Controller independent active-backup is a great advance over naive
active-backup, but it still has one glaring problem -- it under-utilizes the
backup gateways.  In an ideal scenario, all traffic would split evenly among
the live set of gateways.  Getting all the way there is somewhat tricky, but
as a step in that direction, one could use the "Router Specific
Active-Backup" algorithm.  This algorithm looks a lot like active-backup on a
per logical router basis, with one twist: it chooses a different active
Gateway for each logical router.  Thus, in situations where there are several
logical routers, all with somewhat balanced load, this algorithm performs
better.

Implementation of this strategy is quite straightforward if built on top of
basic controller independent active-backup.  On a per logical router basis,
the algorithm is the same: leadership is determined by the liveness of the
gateways.  The key difference here is that the gateways must have a different
leadership priority for each logical router.  These leadership priorities can
be computed by ovn-northd just as they had been in the controller independent
active-backup model.

Once we have these per logical router priorities, they simply need to be
communicated to the members of the gateway cluster and the hypervisors.  The
hypervisors, in particular, simply need to have an active-backup bundle
action (or group action) per logical router, listing the gateways in priority
order for *that router*, rather than having a single bundle action shared for
all the routers.

Additionally, the gateways need to be updated to take into account individual
router priorities.  Specifically, each gateway should drop traffic of backup
routers it's running, and forward traffic of active routers, instead of
simply dropping or forwarding everything.  This should likely be done by
having ovn-controller recompute OpenFlow for the gateway, though other
options exist.

The final complication is that ovn-northd's logic must be updated to choose
these per logical router leadership priorities in a more sophisticated
manner.  It doesn't matter much exactly what algorithm it chooses to do this,
beyond that it should provide good balancing in the common case, i.e., each
logical router's priorities should be different enough that routers balance
to different gateways even when failures occur.

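Since the document leaves the priority-assignment algorithm open, here is one
simple candidate, sketched purely as an assumption: rotate the gateway list
by a stable hash of the router's name, so that different routers prefer
different gateways and remain spread out even after a failure.

```python
import zlib

def router_priorities(router, gateways):
    """Return a dict mapping gateway -> priority (higher preferred) for one
    logical router.  The rotation point is a stable hash of the router name,
    so the same inputs always yield the same preference order."""
    n = len(gateways)
    start = zlib.crc32(router.encode()) % n
    order = gateways[start:] + gateways[:start]   # rotated preference list
    return {gw: n - i for i, gw in enumerate(order)}
```

Any scheme with these two properties (deterministic, and well spread across
routers) would do; this one is merely easy to state.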
Preemption
++++++++++

In an active-backup setup, one issue that users will run into is that of
gateway leader preemption.  If a new Gateway is added to a cluster, or for
some reason an existing gateway is rebooted, we could end up in a situation
where the newly activated gateway has a higher priority than any other in
the HA cluster.  In this case, as soon as that gateway appears, it will
preempt leadership from the currently active leader, causing an unnecessary
failover.  Since failover can be quite expensive, this preemption may be
undesirable.

The controller can optionally avoid preemption by cleverly tweaking the
leadership priorities.  For each router, new gateways should be assigned
priorities that put them second in line or later when they eventually come
up.  Furthermore, if a gateway goes down for a significant period of time,
its old leadership priorities should be revoked and new ones should be
assigned as if it's a brand new gateway.  Note that this should only happen
if a gateway has been down for a while (several minutes); otherwise, a
flapping gateway could have wide-ranging, unpredictable consequences.

Note that preemption avoidance should be optional, depending on the
deployment.  One necessarily sacrifices optimal load balancing to satisfy
these requirements, as new gateways will get no traffic on boot.  Thus, this
feature represents a trade-off which must be made on a per installation
basis.

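One conservative way to implement the rule above, assuming integer
priorities: a gateway that is joining (or rejoining after a long outage) is
assigned a priority strictly below every live gateway, so it can never
preempt.  The function and its inputs are illustrative only:

```python
def joining_priority(current_priorities, live):
    """Pick a priority for a newly joining gateway that guarantees it does
    not preempt the current leader: strictly less than every live
    gateway's priority."""
    live_prios = [p for gw, p in current_priorities.items() if gw in live]
    if not live_prios:
        return 0            # first gateway up; any value works
    return min(live_prios) - 1
```

This is deliberately stricter than "second in line": slotting the newcomer
anywhere below the current leader would also avoid preemption, at the cost
of a slightly more involved assignment.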
Fully Active-Active HA
~~~~~~~~~~~~~~~~~~~~~~

::

    Fully Active-Active HA

    +----------------+   +----------------+
    |                |   |                |
    |   A B C D E    |   |   A B C D E    |
    |                |   |                |
    +----------------+   +----------------+
         ^ ^ ^ ^
         | | | |
         | | | |
         + + + +
          Traffic

The final step in L3HA is to have true active-active HA.  In this scenario,
each router has an instance on each Gateway, and a mechanism similar to ECMP
is used to distribute traffic evenly among all instances.  This mechanism
would require Gateways to participate in routing protocols with the physical
network to attract traffic and alert of failures.  It is out of scope of this
document, but may eventually be necessary.

L2HA
----

L2HA is very difficult to get right.  Unlike L3HA, where the consequences of
problems are minor, in L2HA, if two gateways are both transiently active, an
L2 loop triggers and a broadcast storm results.  In practice, to get around
this, gateways end up implementing an overly conservative "when in doubt,
drop all traffic" policy, or they implement something like MLAG.

MLAG has multiple gateways work together to pretend to be a single L2 switch
with a large LACP bond.  In principle, it's the right solution to the problem
as it solves the broadcast storm problem, and has been deployed successfully
in other contexts.  That said, it's difficult to get right and not
recommended.