1 ..
2 Licensed under the Apache License, Version 2.0 (the "License"); you may
3 not use this file except in compliance with the License. You may obtain
4 a copy of the License at
5
6 http://www.apache.org/licenses/LICENSE-2.0
7
8 Unless required by applicable law or agreed to in writing, software
9 distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
10 WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
11 License for the specific language governing permissions and limitations
12 under the License.
13
14 Convention for heading levels in Open vSwitch documentation:
15
16 ======= Heading 0 (reserved for the title in a document)
17 ------- Heading 1
18 ~~~~~~~ Heading 2
19 +++++++ Heading 3
20 ''''''' Heading 4
21
22 Avoid deeper levels because they do not render well.
23
24 ==================================
25 OVN Gateway High Availability Plan
26 ==================================
27
28 ::
29
30 OVN Gateway
31
32 +---------------------------+
33 | |
34 | External Network |
35 | |
36 +-------------^-------------+
37 |
38 |
39 +-----------+
40 | |
41 | Gateway |
42 | |
43 +-----------+
44 ^
45 |
46 |
47 +-------------v-------------+
48 | |
49 | OVN Virtual Network |
50 | |
51 +---------------------------+
52
The OVN gateway is responsible for shuffling traffic between the tunneled
overlay network (governed by ovn-northd) and the legacy physical network. In
a naive implementation, the gateway is a single x86 server or hardware VTEP.
For most deployments, a single system has enough forwarding capacity to
service the entire virtualized network; however, it introduces a single point
of failure. If this system dies, the entire OVN deployment becomes
unavailable. To mitigate this risk, an HA solution is critical -- by
spreading responsibility across multiple systems, no single server failure
can take down the network.
62
An HA solution is both critical to the manageability of the system and
extremely difficult to get right. The purpose of this document is to propose
a plan for OVN Gateway High Availability which takes into account our past
experience building similar systems. It should be considered a fluid,
changing proposal, not a set-in-stone decree.
68
69 .. note::
70 This document describes a range of options OVN could take to provide
71 high availability for gateways. The current implementation provides L3
72 gateway high availability by the "Router Specific Active/Backup"
73 approach described in this document.
74
75 Basic Architecture
76 ------------------
77
78 In an OVN deployment, the set of hypervisors and network elements operating
79 under the guidance of ovn-northd are in what's called "logical space". These
80 servers use VXLAN, STT, or Geneve to communicate, oblivious to the details of
81 the underlying physical network. When these systems need to communicate with
82 legacy networks, traffic must be routed through a Gateway which translates from
83 OVN controlled tunnel traffic, to raw physical network traffic.
84
Since the gateway is typically the only system with a connection to the
physical network, all traffic between logical space and the WAN must travel
through it. This makes it a critical single point of failure -- if the
gateway dies, communication with the WAN ceases for all systems in logical
space.
89
To mitigate this risk, multiple gateways should be run in a "High
Availability Cluster" or "HA Cluster". The HA cluster will be responsible for
performing the duties of a gateway, while being able to recover gracefully
from individual member failures.
94
95 ::
96
97 OVN Gateway HA Cluster
98
99 +---------------------------+
100 | |
101 | External Network |
102 | |
103 +-------------^-------------+
104 |
105 |
106 +----------------------v----------------------+
107 | |
108 | High Availability Cluster |
109 | |
110 | +-----------+ +-----------+ +-----------+ |
111 | | | | | | | |
112 | | Gateway | | Gateway | | Gateway | |
113 | | | | | | | |
114 | +-----------+ +-----------+ +-----------+ |
115 +----------------------^----------------------+
116 |
117 |
118 +-------------v-------------+
119 | |
120 | OVN Virtual Network |
121 | |
122 +---------------------------+
123
124 L2 vs L3 High Availability
125 ~~~~~~~~~~~~~~~~~~~~~~~~~~
126
In order to achieve this goal, there are two broad approaches one can take.
The HA cluster can appear to the network like a giant Layer 2 Ethernet
switch, or like a giant IP router. These approaches are called L2HA and
L3HA, respectively. L2HA allows Ethernet broadcast domains to extend into
logical space, a significant advantage, but this comes at a cost: the need
to avoid transient L2 loops during failover significantly complicates the
design. On the other hand, L3HA works for most use cases, is simpler, and
fails more gracefully. For these reasons, it is suggested that OVN support
an L3HA model, leaving L2HA for future work (or third party VTEP providers).
Both models are discussed further below.
137
138 L3HA
139 ----
140
In this section, we'll work through a basic L3HA implementation, on top of
which we'll gradually build more sophisticated features, explaining their
motivations and implementations as we go.
144
145 Naive active-backup
146 ~~~~~~~~~~~~~~~~~~~
147
Let's assume that there is a collection of logical routers which a tenant
has asked for. Our task is to schedule these logical routers on one of N
gateways, and to gracefully redistribute the routers from gateways which
have failed. The absolute simplest way to achieve this is what we'll call
"naive-active-backup".
152
153 ::
154
155 Naive Active Backup HA Implementation
156
157 +----------------+ +----------------+
158 | Leader | | Backup |
159 | | | |
160 | A B C | | |
161 | | | |
162 +----+-+-+-+----++ +-+--------------+
163 ^ ^ ^ ^ | |
164 | | | | | |
165 | | | | +-+------+---+
166 + + + + | ovn-northd |
167 Traffic +------------+
168
In a naive active-backup, one of the Gateways is chosen (arbitrarily) as a
leader. All logical routers (A, B, C in the figure) are scheduled on this
leader gateway and all traffic flows through it. ovn-northd monitors this
gateway via OpenFlow echo requests (or some equivalent), and if the gateway
dies, it recreates the routers on one of the backups.
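
A minimal sketch of this control loop, with hypothetical ``probe_gateway()``
and ``schedule_routers()`` helpers standing in for the real OpenFlow echo
handling and database updates ovn-northd would perform::

    import time

    PROBE_INTERVAL = 5      # seconds between liveness probes
    MAX_MISSED_PROBES = 3   # declare the leader dead after this many misses

    def monitor_gateways(gateways, probe_gateway, schedule_routers):
        """Naive active-backup: keep all routers on the first live gateway."""
        leader, missed = gateways[0], 0
        schedule_routers(leader)
        while True:
            time.sleep(PROBE_INTERVAL)
            if probe_gateway(leader):       # e.g. an OpenFlow echo round trip
                missed = 0
                continue
            missed += 1
            if missed >= MAX_MISSED_PROBES:
                # Leader presumed dead: recreate every router on a backup.
                backups = [gw for gw in gateways if gw != leader]
                leader, missed = backups[0], 0
                schedule_routers(leader)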
174
This approach basically works in most cases and should likely be the
starting point for OVN -- it's strictly better than no HA solution and is a
good foundation for more sophisticated solutions. That said, it's not
without its limitations. Specifically, this approach doesn't coordinate with
the physical network to minimize disruption during failures, it tightly
couples failover to ovn-northd (we'll discuss why this is bad in a bit), and
it wastes resources by leaving backup gateways completely unutilized.
182
183 Router Failover
184 +++++++++++++++
185
When ovn-northd notices the leader has died and decides to migrate routers
to a backup gateway, the physical network has to be notified to direct
traffic to the new gateway. Otherwise, traffic could be blackholed for
longer than necessary, making failovers worse than they need to be.
190
191 For now, let's assume that OVN requires all gateways to be on the same IP
192 subnet on the physical network. If this isn't the case, gateways would need to
193 participate in routing protocols to orchestrate failovers, something which is
194 difficult and out of scope of this document.
195
196 Since all gateways are on the same IP subnet, we simply need to worry about
197 updating the MAC learning tables of the Ethernet switches on that subnet.
198 Presumably, they all have entries for each logical router pointing to the old
199 leader. If these entries aren't updated, all traffic will be sent to the (now
200 defunct) old leader, instead of the new one.
201
In order to mitigate this issue, it's recommended that the new gateway send
a Reverse ARP (RARP) onto the physical network for each logical router it
now controls. A Reverse ARP is a benign protocol used by many hypervisors to
update L2 forwarding tables when virtual machines migrate. In this case, the
Ethernet source address of the RARP is that of the logical router it
corresponds to, and its destination is the broadcast address. This causes
the RARP to travel to every L2 switch in the broadcast domain, updating
forwarding tables accordingly. This strategy is recommended in all failover
mechanisms discussed in this document -- whenever a router becomes active on
a new leader, that gateway should RARP the router's MAC address.
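
As a concrete illustration, the sketch below builds such a RARP frame with
only the Python standard library and writes it to a NIC through a raw
socket. The interface name and router MAC are made-up examples, and a real
gateway would emit this from ovn-controller rather than a script::

    import socket
    import struct

    def send_rarp(ifname, router_mac):
        """Broadcast a RARP 'request reverse' sourced from router_mac so the
        physical switches relearn which port the router now sits behind."""
        mac = bytes.fromhex(router_mac.replace(":", ""))
        eth = b"\xff" * 6 + mac + struct.pack("!H", 0x8035)  # dst, src, EtherType=RARP
        rarp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 3)     # Ethernet/IPv4, opcode 3
        rarp += mac + b"\x00" * 4                            # sender MAC, IP unused
        rarp += mac + b"\x00" * 4                            # target MAC, IP unused
        with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as sock:
            sock.bind((ifname, 0))
            sock.send(eth + rarp)

    if __name__ == "__main__":
        # Hypothetical example: announce logical router A's MAC on eth1.
        send_rarp("eth1", "0a:00:00:00:00:01")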
212
213 Controller Independent Active-backup
214 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
215
216 ::
217
218 Controller Independent Active-Backup Implementation
219
220 +----------------+ +----------------+
221 | Leader | | Backup |
222 | | | |
223 | A B C | | |
224 | | | |
225 +----------------+ +----------------+
226 ^ ^ ^ ^
227 | | | |
228 | | | |
229 + + + +
230 Traffic
231
The fundamental problem with naive active-backup is that it tightly couples
the failover solution to ovn-northd. This can significantly increase
downtime in the event of a failover, as the (often already busy) ovn-northd
controller has to recompute state for the new leader. Worse, if ovn-northd
goes down, we can't perform gateway failover at all. This violates the
principle that control plane outages should have no impact on dataplane
functionality.
238
239 In a controller independent active-backup configuration, ovn-northd is
240 responsible for initial configuration while the HA cluster is responsible for
241 monitoring the leader, and failing over to a backup if necessary. ovn-northd
242 sets HA policy, but doesn't actively participate when failovers occur.
243
Of course, in this model, ovn-northd is not without some responsibility.
Its role is to pre-plan what should happen in the event of a failure,
leaving it to the individual switches to execute this plan. It does this by
assigning each gateway a unique leadership priority. Once assigned, it
communicates this priority to each node it controls. Nodes use the
leadership priority to determine which gateway in the cluster is the active
leader using a simple metric: the leader is the healthy gateway with the
highest priority. If that gateway goes down, leadership falls to the next
highest priority; conversely, if a new gateway comes up with a higher
priority, it takes over leadership.
254
Thus, in this model, leadership of the HA cluster is determined simply by
the status of its members. Therefore, if we can communicate the status of
each gateway to each transport node, they can individually figure out which
is the leader and direct traffic accordingly.
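
The per-node computation is tiny. Here is a sketch, assuming each node
already knows every gateway's ovn-northd-assigned priority and its current
liveness (how liveness is learned is the subject of the next section)::

    def elect_leader(gateways):
        """Pick the live gateway with the highest leadership priority.

        ``gateways`` maps gateway name -> (priority, is_alive); returns
        None if no gateway is currently reachable.
        """
        live = {name: prio for name, (prio, alive) in gateways.items() if alive}
        return max(live, key=live.get) if live else None

    # Example: gw2 leads; if it also died, leadership would fall to gw1.
    print(elect_leader({"gw1": (10, True),
                        "gw2": (20, True),
                        "gw3": (30, False)}))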
259
260 Tunnel Monitoring
261 +++++++++++++++++
262
Since in this model leadership is determined exclusively by the health
status of member gateways, a key problem is how to communicate this
information to the relevant transport nodes. Luckily, we can do this fairly
cheaply using tunnel monitoring protocols like BFD.
267
268 The basic idea is pretty straightforward. Each transport node maintains a
269 tunnel to every gateway in the HA cluster (not just the leader). These tunnels
270 are monitored using the BFD protocol to see which are alive. Given this
271 information, hypervisors can trivially compute the highest priority live
272 gateway, and thus the leader.
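
For example, assuming standard OVS tunnel interfaces, BFD can be enabled per
tunnel and its verdict read back from the ``bfd_status`` column. The sketch
below shells out to ``ovs-vsctl`` and treats any error or a non-forwarding
status as "dead"; the tunnel interface names are illustrative::

    import subprocess

    def enable_bfd(tunnel_iface):
        # Ask OVS to run BFD over this tunnel.
        subprocess.check_call(
            ["ovs-vsctl", "set", "Interface", tunnel_iface, "bfd:enable=true"])

    def tunnel_is_live(tunnel_iface):
        """Return True if BFD currently reports the tunnel as forwarding."""
        try:
            out = subprocess.check_output(
                ["ovs-vsctl", "get", "Interface", tunnel_iface,
                 "bfd_status:forwarding"], text=True)
        except subprocess.CalledProcessError:
            return False        # no BFD status yet; treat the tunnel as dead
        return out.strip().strip('"') == "true"

    # One tunnel per gateway in the HA cluster (illustrative names).
    for tun in ("tun-gw1", "tun-gw2", "tun-gw3"):
        enable_bfd(tun)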
273
In practice, this leadership computation can be performed trivially using
the bundle or group action. Rather than using OpenFlow to simply output to
the leader, all gateways could be listed in an active-backup bundle action,
ordered by their priority. The bundle action will automatically take the
tunnel monitoring status into account and output the packet to the highest
priority live gateway.
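
A sketch of how a hypervisor's flow could be generated, assuming the tunnels
to the gateways are OpenFlow ports 10, 11 and 12 in descending priority
order. The match, port numbers and subnet are illustrative; the ``bundle``
action and its ``active_backup`` algorithm are documented in ``ovs-actions``,
where the port list keyword has historically been spelled ``slaves:``::

    def active_backup_bundle(ports_by_priority):
        """Build an OpenFlow bundle action that outputs to the first live
        port in the given priority-ordered list."""
        members = ",".join(str(p) for p in ports_by_priority)
        return "bundle(eth_src,0,active_backup,ofport,slaves:%s)" % members

    # Traffic bound for the gateway cluster, gateways in priority order.
    flow = ("priority=100,ip,nw_dst=203.0.113.0/24,actions="
            + active_backup_bundle([10, 11, 12]))
    print(flow)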
280
281 Inter-Gateway Monitoring
282 ++++++++++++++++++++++++
283
One somewhat subtle aspect of this model is that failovers are not globally
atomic. When a failover occurs, it will take some time for all hypervisors
to notice and adjust accordingly. Similarly, if a new high priority Gateway
comes up, it may take some time for all hypervisors to switch over to the
new leader. In order to avoid confusing the physical network, under these
circumstances it's important for the backup gateways to drop traffic they've
received erroneously. In order to do this, each Gateway must know whether or
not it is, in fact, active. This can be achieved by creating a mesh of
tunnels between gateways. Each gateway monitors the other gateways in its
cluster to determine which are alive, and therefore whether or not it itself
happens to be the leader. If leading, the gateway forwards traffic normally;
otherwise it drops all traffic.
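
From a gateway's point of view the decision mirrors the hypervisors' leader
election, with the twist that a gateway always counts itself as alive. A
sketch, reusing the priority/liveness data gathered over the inter-gateway
tunnel mesh::

    def should_forward(self_name, cluster):
        """Forward only if this gateway believes it is the current leader.

        ``cluster`` maps gateway name -> (priority, is_alive), where
        liveness comes from BFD on the inter-gateway tunnels.
        """
        live = {name: prio for name, (prio, alive) in cluster.items()
                if alive or name == self_name}
        return max(live, key=live.get) == self_name

    # gw2 sees higher-priority gw3 as down, so it forwards; gw1 drops.
    cluster = {"gw1": (10, True), "gw2": (20, True), "gw3": (30, False)}
    print(should_forward("gw2", cluster))   # True
    print(should_forward("gw1", cluster))   # False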
296
We should note that this method works well only under the assumption that
there are no inter-gateway connectivity failures; when such failures occur,
it can fail to elect a single master. The simplest example is two gateways
which stop seeing each other but can still reach the hypervisors. Protocols
like VRRP or CARP have the same issue. This type of failure mode could be
mitigated by having all network elements (hypervisors and gateways)
periodically share their link status with other endpoints.
304
305 Gateway Leadership Resignation
306 ++++++++++++++++++++++++++++++
307
Sometimes a gateway may be healthy, but still may not be suitable to lead
the HA cluster. This could happen for several reasons, including:
310
311 * The physical network is unreachable
312
313 * BFD (or ping) has detected the next hop router is unreachable
314
315 * The Gateway recently booted and isn't fully configured
316
317 In this case, the Gateway should resign leadership by holding its tunnels down
318 using the ``other_config:cpath_down`` flag. This indicates to participating
319 hypervisors and Gateways that this gateway should be treated as if it's down,
320 even though its tunnels are still healthy.
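
A sketch of how a gateway might resign, taking the document's
``other_config:cpath_down`` flag at face value and toggling it on every
tunnel interface the gateway terminates (the interface names are
illustrative)::

    import subprocess

    def set_resigned(tunnel_ifaces, resigned):
        """Mark this gateway's tunnels down-for-leadership (or restore them)."""
        value = "true" if resigned else "false"
        for iface in tunnel_ifaces:
            subprocess.check_call(
                ["ovs-vsctl", "set", "Interface", iface,
                 "other_config:cpath_down=%s" % value])

    # e.g. resign while the physical next hop is unreachable.
    set_resigned(["tun-hv1", "tun-hv2", "tun-gw2"], True)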
321
322 Router Specific Active-Backup
323 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
324
325 ::
326
327 Router Specific Active-Backup
328
329 +----------------+ +----------------+
330 | | | |
331 | A C | | B D E |
332 | | | |
333 +----------------+ +----------------+
334 ^ ^ ^ ^
335 | | | |
336 | | | |
337 + + + +
338 Traffic
339
Controller independent active-backup is a great advance over naive
active-backup, but it still has one glaring problem -- it under-utilizes the
backup gateways. In an ideal scenario, all traffic would be split evenly
among the live set of gateways. Getting all the way there is somewhat
tricky, but as a step in that direction, one could use the "Router Specific
Active-Backup" algorithm. This algorithm looks a lot like active-backup on a
per logical router basis, with one twist: it chooses a different active
Gateway for each logical router. Thus, in situations where there are several
logical routers, all with somewhat balanced load, this algorithm performs
better.
349
Implementation of this strategy is quite straightforward if built on top of
basic controller independent active-backup. On a per logical router basis,
the algorithm is the same: leadership is determined by the liveness of the
gateways. The key difference here is that the gateways must have a different
leadership priority for each logical router. These leadership priorities can
be computed by ovn-northd just as they had been in the controller
independent active-backup model.
357
Once we have these per logical router priorities, they simply need to be
communicated to the members of the gateway cluster and the hypervisors. The
hypervisors, in particular, simply need an active-backup bundle action (or
group action) per logical router listing the gateways in priority order for
*that router*, rather than having a single bundle action shared for all the
routers.
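
Extending the earlier hypervisor sketch, the only change is that the
bundle's port ordering now comes from a per-router priority table. The
router MACs, port numbers and flow match below are illustrative::

    def per_router_bundles(router_priorities):
        """Yield one active-backup bundle flow per logical router.

        ``router_priorities`` maps a router's MAC to its gateway tunnel
        ports listed from highest to lowest priority for *that* router.
        """
        for mac, ports in router_priorities.items():
            members = ",".join(str(p) for p in ports)
            yield ("priority=100,dl_dst=%s,actions="
                   "bundle(eth_src,0,active_backup,ofport,slaves:%s)"
                   % (mac, members))

    # Routers A and B prefer different gateways, so load spreads over both.
    for flow in per_router_bundles({"0a:00:00:00:00:01": [10, 11],
                                    "0a:00:00:00:00:02": [11, 10]}):
        print(flow)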
364
Additionally, the gateways need to be updated to take individual router
priorities into account. Specifically, each gateway should drop traffic for
the routers it is merely backing up, and forward traffic for the routers it
is active for, instead of simply dropping or forwarding everything. This
should likely be done by having ovn-controller recompute OpenFlow for the
gateway, though other options exist.
370
The final complication is that ovn-northd's logic must be updated to choose
these per logical router leadership priorities in a more sophisticated
manner. It doesn't matter much exactly what algorithm it chooses, beyond
that it should provide good balancing in the common case, i.e. each logical
router's priorities should be different enough that routers balance to
different gateways even when failures occur.
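
One adequate scheme, sketched below, is to rotate a fixed gateway ordering
by a stable per-router hash, so that different routers prefer different
gateways and also fail over to different backups::

    import hashlib

    def priorities_for_router(router_name, gateways):
        """Return the gateways in leadership-priority order for one router."""
        digest = hashlib.sha256(router_name.encode()).digest()
        offset = digest[0] % len(gateways)      # stable per-router rotation
        return gateways[offset:] + gateways[:offset]

    gateways = ["gw1", "gw2", "gw3"]
    for router in ("A", "B", "C", "D", "E"):
        print(router, priorities_for_router(router, gateways))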
377
378 Preemption
379 ++++++++++
380
In an active-backup setup, one issue that users will run into is that of
gateway leader preemption. If a new Gateway is added to a cluster, or for
some reason an existing gateway is rebooted, we could end up in a situation
where the newly activated gateway has a higher priority than any other in
the HA cluster. In this case, as soon as that gateway appears, it will
preempt leadership from the currently active leader, causing an unnecessary
failover. Since failover can be quite expensive, this preemption may be
undesirable.
388
The controller can optionally avoid preemption by cleverly tweaking the
leadership priorities. For each router, new gateways should be assigned
priorities that put them second in line or later when they eventually come
up. Furthermore, if a gateway goes down for a significant period of time,
its old leadership priorities should be revoked and new ones assigned as if
it were a brand new gateway. Note that this should only happen if a gateway
has been down for a while (several minutes); otherwise a flapping gateway
could have wide-ranging, unpredictable consequences.
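
A sketch of such a non-preemptive assignment: a gateway that is new, or that
has been down long enough to lose its old priorities, is slotted in just
below the current leader rather than above it. The gateway names are
arbitrary::

    def insert_without_preempting(priority_order, current_leader, new_gateway):
        """Place new_gateway second in line behind the current leader.

        ``priority_order`` lists gateways from highest to lowest priority;
        the returned ordering lets the incoming gateway join without
        preempting the incumbent.
        """
        order = [gw for gw in priority_order if gw != new_gateway]
        leader_idx = order.index(current_leader)
        order.insert(leader_idx + 1, new_gateway)
        return order

    # gw3 rejoins after a long outage; gw1 keeps leading, gw3 is next in line.
    print(insert_without_preempting(["gw3", "gw1", "gw2"], "gw1", "gw3"))
    # -> ['gw1', 'gw3', 'gw2']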
397
Note that preemption avoidance should be optional, depending on the
deployment. One necessarily sacrifices optimal load balancing to satisfy
these requirements, as new gateways will get no traffic on boot. Thus, this
feature represents a trade-off which must be made on a per installation
basis.
402
403 Fully Active-Active HA
404 ~~~~~~~~~~~~~~~~~~~~~~
405
406 ::
407
408 Fully Active-Active HA
409
410 +----------------+ +----------------+
411 | | | |
412 | A B C D E | | A B C D E |
413 | | | |
414 +----------------+ +----------------+
415 ^ ^ ^ ^
416 | | | |
417 | | | |
418 + + + +
419 Traffic
420
The final step in L3HA is to have true active-active HA. In this scenario,
each router has an instance on each Gateway, and a mechanism similar to ECMP
is used to distribute traffic evenly among all instances. This mechanism
would require Gateways to participate in routing protocols with the physical
network to attract traffic and to signal failures. It is out of scope of
this document, but may eventually be necessary.
427
428 L2HA
429 ----
430
L2HA is very difficult to get right. Unlike L3HA, where the consequences of
problems are minor, in L2HA two transiently active gateways trigger an L2
loop and a broadcast storm results. In practice, to get around this,
gateways end up implementing an overly conservative "when in doubt drop all
traffic" policy, or they implement something like MLAG.
436
MLAG has multiple gateways work together to pretend to be a single L2
switch with a large LACP bond. In principle, it's the right solution to the
problem as it solves the broadcast storm problem and has been deployed
successfully in other contexts. That said, it's difficult to get right and
is not recommended.