..
      Licensed under the Apache License, Version 2.0 (the "License"); you may
      not use this file except in compliance with the License. You may obtain
      a copy of the License at

          http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
      License for the specific language governing permissions and limitations
      under the License.

      Convention for heading levels in Open vSwitch documentation:

      ======= Heading 0 (reserved for the title in a document)
      ------- Heading 1
      ~~~~~~~ Heading 2
      +++++++ Heading 3
      ''''''' Heading 4

      Avoid deeper levels because they do not render well.

==================================
OVN Gateway High Availability Plan
==================================

::

    OVN Gateway

         +---------------------------+
         |                           |
         |     External Network      |
         |                           |
         +-------------^-------------+
                       |
                       |
                 +-----------+
                 |           |
                 |  Gateway  |
                 |           |
                 +-----------+
                       ^
                       |
                       |
         +-------------v-------------+
         |                           |
         |    OVN Virtual Network    |
         |                           |
         +---------------------------+

The OVN gateway is responsible for shuffling traffic between the tunneled
overlay network (governed by ovn-northd) and the legacy physical network. In
a naive implementation, the gateway is a single x86 server or hardware VTEP.
For most deployments, a single system has enough forwarding capacity to service
the entire virtualized network; however, it introduces a single point of
failure. If this system dies, the entire OVN deployment becomes unavailable.
To mitigate this risk, an HA solution is critical -- by spreading
responsibility across multiple systems, no single server failure can take down
the network.

An HA solution is both critical to the manageability of the system and
extremely difficult to get right. The purpose of this document is to propose
a plan for OVN Gateway High Availability which takes into account our past
experience building similar systems. It should be considered a fluid, changing
proposal, not a set-in-stone decree.

.. note::
   This document describes a range of options OVN could take to provide
   high availability for gateways. The current implementation provides L3
   gateway high availability using the "Router Specific Active/Backup"
   approach described in this document.

Basic Architecture
------------------

In an OVN deployment, the set of hypervisors and network elements operating
under the guidance of ovn-northd is in what's called "logical space". These
servers use VXLAN, STT, or Geneve to communicate, oblivious to the details of
the underlying physical network. When these systems need to communicate with
legacy networks, traffic must be routed through a Gateway which translates
OVN-controlled tunnel traffic to raw physical network traffic.

Since the gateway is typically the only system with a connection to the
physical network, all traffic between logical space and the WAN must travel
through it. This makes it a critical single point of failure -- if the gateway
dies, communication with the WAN ceases for all systems in logical space.

To mitigate this risk, multiple gateways should be run in a "High Availability
Cluster" or "HA Cluster". The HA cluster will be responsible for performing
the duties of a gateway, while being able to recover gracefully from
individual member failures.

::

    OVN Gateway HA Cluster

             +---------------------------+
             |                           |
             |     External Network      |
             |                           |
             +-------------^-------------+
                           |
                           |
    +----------------------v----------------------+
    |                                              |
    |          High Availability Cluster           |
    |                                              |
    | +-----------+  +-----------+  +-----------+  |
    | |           |  |           |  |           |  |
    | |  Gateway  |  |  Gateway  |  |  Gateway  |  |
    | |           |  |           |  |           |  |
    | +-----------+  +-----------+  +-----------+  |
    +----------------------^----------------------+
                           |
                           |
             +-------------v-------------+
             |                           |
             |    OVN Virtual Network    |
             |                           |
             +---------------------------+

L2 vs L3 High Availability
~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to achieve this goal, there are two broad approaches one can take.
The HA cluster can appear to the network like a giant Layer 2 Ethernet Switch,
or like a giant IP Router. These approaches are called L2HA and L3HA,
respectively. L2HA allows Ethernet broadcast domains to extend into logical
space, a significant advantage, but this comes at a cost. The need to avoid
transient L2 loops during failover significantly complicates its design. On
the other hand, L3HA works for most use cases, is simpler, and fails more
gracefully. For these reasons, it is suggested that OVN support an L3HA
model, leaving L2HA for future work (or third party VTEP providers). Both
models are discussed further below.

L3HA
----

In this section, we'll work through a simple L3HA implementation, on top of
which we'll gradually build more sophisticated features, explaining their
motivations and implementations as we go.

Naive active-backup
~~~~~~~~~~~~~~~~~~~

Let's assume that there is a collection of logical routers which a tenant has
asked for. Our task is to schedule these logical routers on one of N gateways,
and gracefully redistribute the routers from gateways which have failed. The
absolute simplest way to achieve this is what we'll call "naive-active-backup".

::

    Naive Active Backup HA Implementation

    +----------------+   +----------------+
    |     Leader     |   |     Backup     |
    |                |   |                |
    |     A B C      |   |                |
    |                |   |                |
    +----+-+-+-+----++   +-+--------------+
         ^ ^ ^ ^    |      |
         | | | |    |      |
         | | | |  +-+------+---+
         + + + +  | ovn-northd |
          Traffic +------------+

In a naive active-backup, one of the Gateways is chosen (arbitrarily) as a
leader. All logical routers (A, B, C in the figure) are scheduled on this
leader gateway and all traffic flows through it. ovn-northd monitors this
gateway via OpenFlow echo requests (or some equivalent), and if the gateway
dies, it recreates the routers on one of the backups.

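As a rough illustration of the idea (this is a sketch only, not ovn-northd
code; the ``is_alive`` and ``schedule_routers`` callbacks are hypothetical
stand-ins for the echo probe and the router scheduling logic), the control
loop could look like::

    import time

    def monitor_and_failover(gateways, routers, is_alive, schedule_routers):
        """Keep all routers scheduled on the first healthy gateway listed."""
        leader = gateways[0]
        schedule_routers(leader, routers)
        while True:
            if not is_alive(leader):
                # Leader died: recreate every router on the first live backup.
                backups = [gw for gw in gateways
                           if gw != leader and is_alive(gw)]
                if backups:
                    leader = backups[0]
                    schedule_routers(leader, routers)
            time.sleep(1)
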
This approach basically works in most cases and should likely be the starting
point for OVN -- it's strictly better than no HA solution and is a good
foundation for more sophisticated solutions. That said, it's not without its
limitations. Specifically, this approach doesn't coordinate with the physical
network to minimize disruption during failures, it tightly couples failover
to ovn-northd (we'll discuss why this is bad in a bit), and it wastes resources
by leaving backup gateways completely unutilized.

Router Failover
+++++++++++++++

When ovn-northd notices the leader has died and decides to migrate routers to a
backup gateway, the physical network has to be notified to direct traffic to
the new gateway. Otherwise, traffic could be blackholed for longer than
necessary, making failovers worse than they need to be.

For now, let's assume that OVN requires all gateways to be on the same IP
subnet on the physical network. If this isn't the case, gateways would need to
participate in routing protocols to orchestrate failovers, something which is
difficult and out of scope of this document.

Since all gateways are on the same IP subnet, we simply need to worry about
updating the MAC learning tables of the Ethernet switches on that subnet.
Presumably, they all have entries for each logical router pointing to the old
leader. If these entries aren't updated, all traffic will be sent to the (now
defunct) old leader instead of the new one.

In order to mitigate this issue, it's recommended that the new gateway send a
Reverse ARP (RARP) onto the physical network for each logical router it now
controls. A Reverse ARP is a benign protocol used by many hypervisors when
virtual machines migrate to update L2 forwarding tables. In this case, the
Ethernet source address of the RARP is that of the logical router it
corresponds to, and its destination is the broadcast address. This causes the
RARP to travel to every L2 switch in the broadcast domain, updating forwarding
tables accordingly. This strategy is recommended in all failover mechanisms
discussed in this document -- when a router newly boots on a new leader, it
should RARP its MAC address.

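For illustration, a RARP announcement of this kind can be generated with a few
lines of Python over a raw socket (a sketch only, not part of OVN; the
interface name and MAC address below are placeholders, and CAP_NET_RAW is
required)::

    import socket
    import struct

    def send_rarp(ifname, router_mac):
        """Broadcast a RARP frame whose Ethernet source is the router's MAC."""
        ETH_P_RARP = 0x8035
        broadcast = b"\xff" * 6
        # Ethernet header: destination, source, EtherType.
        eth = broadcast + router_mac + struct.pack("!H", ETH_P_RARP)
        # RARP payload: htype=1 (Ethernet), ptype=IPv4, hlen=6, plen=4,
        # op=3 ("request reverse"); hardware addresses are the router's MAC.
        rarp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 3)
        rarp += router_mac + b"\x00" * 4 + router_mac + b"\x00" * 4
        sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW)
        sock.bind((ifname, 0))
        sock.send(eth + rarp)
        sock.close()

    send_rarp("eth0", bytes.fromhex("0a0027000001"))
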
Controller Independent Active-backup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    Controller Independent Active-Backup Implementation

    +----------------+   +----------------+
    |     Leader     |   |     Backup     |
    |                |   |                |
    |     A B C      |   |                |
    |                |   |                |
    +----------------+   +----------------+
         ^ ^ ^ ^
         | | | |
         | | | |
         + + + +
          Traffic

The fundamental problem with naive active-backup is that it tightly couples the
failover solution to ovn-northd. This can significantly increase downtime in
the event of a failover as the (often already busy) ovn-northd controller has
to recompute state for the new leader. Worse, if ovn-northd goes down, we can't
perform gateway failover at all. This violates the principle that control
plane outages should have no impact on dataplane functionality.

In a controller independent active-backup configuration, ovn-northd is
responsible for initial configuration while the HA cluster is responsible for
monitoring the leader and failing over to a backup if necessary. ovn-northd
sets HA policy, but doesn't actively participate when failovers occur.

Of course, in this model, ovn-northd is not without some responsibility. Its
role is to pre-plan what should happen in the event of a failure, leaving it to
the individual switches to execute this plan. It does this by assigning each
gateway a unique leadership priority. Once assigned, it communicates this
priority to each node it controls. Nodes use the leadership priority to
determine which gateway in the cluster is the active leader by using a simple
metric: the leader is the healthy gateway with the highest priority. If that
gateway goes down, leadership falls to the next highest priority, and
conversely, if a new gateway comes up with a higher priority, it takes over
leadership.

Thus, in this model, leadership of the HA cluster is determined simply by the
status of its members. Therefore, if we can communicate the status of each
gateway to each transport node, they can individually figure out which is the
leader, and direct traffic accordingly.

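The election rule itself is trivial. As a sketch (illustrative only; the
gateway names and priority values are made up, and the real state lives in
OpenFlow rather than in Python)::

    def elect_leader(priorities, alive):
        """Return the healthy gateway with the highest leadership priority.

        priorities: dict mapping gateway name -> integer priority
        alive: set of gateways whose tunnels are currently up (e.g. per BFD)
        """
        candidates = [gw for gw in priorities if gw in alive]
        if not candidates:
            return None  # No healthy gateway; traffic is dropped.
        return max(candidates, key=lambda gw: priorities[gw])

    priorities = {"gw1": 10, "gw2": 20, "gw3": 5}
    # gw2 leads while healthy; if it fails, leadership falls back to gw1.
    assert elect_leader(priorities, {"gw1", "gw2", "gw3"}) == "gw2"
    assert elect_leader(priorities, {"gw1", "gw3"}) == "gw1"
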
Tunnel Monitoring
+++++++++++++++++

Since in this model leadership is determined exclusively by the health status
of member gateways, a key problem is how to communicate this information to
the relevant transport nodes. Luckily, we can do this fairly cheaply using
tunnel monitoring protocols like BFD.

The basic idea is pretty straightforward. Each transport node maintains a
tunnel to every gateway in the HA cluster (not just the leader). These tunnels
are monitored using the BFD protocol to see which are alive. Given this
information, hypervisors can trivially compute the highest priority live
gateway, and thus the leader.

In practice, this leadership computation can be performed trivially using the
bundle or group action. Rather than using OpenFlow to simply output to the
leader, all gateways could be listed in an active-backup bundle action ordered
by their priority. The bundle action will automatically take into account the
tunnel monitoring status to output the packet to the highest priority live
gateway.

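For example, a flow along the following lines sends packets to the first live
tunnel in a priority-ordered list (a sketch only: the bridge, match, and
tunnel ofport numbers 10, 11, and 12 are placeholders, and older OVS releases
spell ``members`` as ``slaves``)::

    ovs-ofctl add-flow br-int \
        "priority=100,ip,nw_dst=172.16.0.0/16,actions=bundle(eth_src,0,active_backup,ofport,members:10,11,12)"
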
Inter-Gateway Monitoring
++++++++++++++++++++++++

One somewhat subtle aspect of this model is that failovers are not globally
atomic. When a failover occurs, it will take some time for all hypervisors to
notice and adjust accordingly. Similarly, if a new high priority Gateway comes
up, it may take some time for all hypervisors to switch over to the new leader.
In order to avoid confusing the physical network, under these circumstances
it's important for the backup gateways to drop traffic they've received
erroneously. In order to do this, each Gateway must know whether or not it is,
in fact, active. This can be achieved by creating a mesh of tunnels between
gateways. Each gateway monitors the other gateways in its cluster to determine
which are alive, and therefore whether or not it is itself the leader. If
leading, the gateway forwards traffic normally; otherwise, it drops all
traffic.

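The decision each gateway has to make reduces to the sketch below (purely
illustrative; in practice the priorities and reachability come from the
tunnel mesh, not from Python data structures)::

    def should_forward(my_priority, peer_priorities, reachable_peers):
        """Forward only if no reachable peer outranks this gateway."""
        return all(peer_priorities[p] < my_priority for p in reachable_peers)
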
We should note that this method works well under the assumption that there
are no inter-gateway connectivity failures; in such a case, this method would
fail to elect a single master. The simplest example is two gateways which stop
seeing each other but can still reach the hypervisors. Protocols like VRRP or
CARP have the same issue. A mitigation for this type of failure mode could be
achieved by having all network elements (hypervisors and gateways) periodically
share their link status with other endpoints.

Gateway Leadership Resignation
++++++++++++++++++++++++++++++

Sometimes a gateway may be healthy, but still may not be suitable to lead the
HA cluster. This could happen for several reasons, including:

* The physical network is unreachable

* BFD (or ping) has detected the next hop router is unreachable

* The Gateway recently booted and isn't fully configured

In this case, the Gateway should resign leadership by holding its tunnels down
using the ``other_config:cpath_down`` flag. This indicates to participating
hypervisors and Gateways that this gateway should be treated as if it's down,
even though its tunnels are still healthy.

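As a sketch, the flag can be toggled on each of the resigning gateway's tunnel
interfaces with something like the following (``ovn-gw-tun0`` is a placeholder
interface name)::

    ovs-vsctl set Interface ovn-gw-tun0 other_config:cpath_down=true
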
Router Specific Active-Backup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    Router Specific Active-Backup

    +----------------+   +----------------+
    |                |   |                |
    |      A C       |   |     B D E      |
    |                |   |                |
    +----------------+   +----------------+
                  ^ ^ ^ ^
                  | | | |
                  | | | |
                  + + + +
                   Traffic

Controller independent active-backup is a great advance over naive
active-backup, but it still has one glaring problem -- it under-utilizes the
backup gateways. In an ideal scenario, all traffic would split evenly among the
live set of gateways. Getting all the way there is somewhat tricky, but as a
step in that direction, one could use the "Router Specific Active-Backup"
algorithm. This algorithm looks a lot like active-backup on a per logical
router basis, with one twist. It chooses a different active Gateway for each
logical router. Thus, in situations where there are several logical routers,
all with somewhat balanced load, this algorithm performs better.

Implementation of this strategy is quite straightforward if built on top of
basic controller independent active-backup. On a per logical router basis, the
algorithm is the same: leadership is determined by the liveness of the
gateways. The key difference here is that the gateways must have a different
leadership priority for each logical router. These leadership priorities can
be computed by ovn-northd just as they had been in the controller independent
active-backup model.

Once we have these per logical router priorities, they simply need to be
communicated to the members of the gateway cluster and the hypervisors. The
hypervisors, in particular, simply need an active-backup bundle action (or
group action) per logical router listing the gateways in priority order for
*that router*, rather than having a single bundle action shared for all the
routers.

Additionally, the gateways need to be updated to take into account individual
router priorities. Specifically, each gateway should drop traffic of routers
for which it is a backup, and forward traffic of routers for which it is
active, instead of simply dropping or forwarding everything. This should
likely be done by having ovn-controller recompute OpenFlow for the gateway,
though other options exist.

The final complication is that ovn-northd's logic must be updated to choose
these per logical router leadership priorities in a more sophisticated manner.
It doesn't matter much exactly what algorithm it chooses to do this, beyond
that it should provide good balancing in the common case. That is, each
logical router's priorities should be different enough that routers balance to
different gateways even when failures occur.

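One simple way to get such a spread (shown purely as an illustration; this is
not the algorithm ovn-northd uses) is to rotate a single gateway list by a
different offset for each logical router::

    def per_router_priorities(routers, gateways):
        """Return {router: [gateways ordered most- to least-preferred]}."""
        orders = {}
        for i, router in enumerate(sorted(routers)):
            offset = i % len(gateways)
            # Rotate the gateway list by a router-specific offset so that
            # routers prefer (and fail over to) different gateways.
            orders[router] = gateways[offset:] + gateways[:offset]
        return orders

    # Example: A prefers gw1, B prefers gw2, C prefers gw3, D wraps to gw1.
    print(per_router_priorities(["A", "B", "C", "D"], ["gw1", "gw2", "gw3"]))
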
Preemption
++++++++++

In an active-backup setup, one issue that users will run into is that of
gateway leader preemption. If a new Gateway is added to a cluster, or for some
reason an existing gateway is rebooted, we could end up in a situation where
the newly activated gateway has a higher priority than any other in the HA
cluster. In this case, as soon as that gateway appears, it will preempt
leadership from the currently active leader, causing an unnecessary failover.
Since failover can be quite expensive, this preemption may be undesirable.

The controller can optionally avoid preemption by cleverly tweaking the
leadership priorities. For each router, new gateways should be assigned
priorities that put them second in line or later when they eventually come up.
Furthermore, if a gateway goes down for a significant period of time, its old
leadership priorities should be revoked and new ones should be assigned as if
it's a brand new gateway. Note that this should only happen if a gateway has
been down for a while (several minutes); otherwise, a flapping gateway could
have wide-ranging, unpredictable consequences.

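A minimal sketch of such an assignment (illustrative only; real priorities are
chosen by the controller, and the helper below simply appends the newcomer to
the back of the line for every router)::

    def assign_non_preempting_priority(priorities, new_gateway):
        """Give new_gateway the lowest priority for every router, so that it
        never preempts a router's current leader when it comes up."""
        for router_prios in priorities.values():
            lowest = min(router_prios.values(), default=0)
            router_prios[new_gateway] = lowest - 1
        return priorities

    # Example: gw3 joins without displacing either router's current leader.
    prios = {"A": {"gw1": 10, "gw2": 5}, "B": {"gw1": 5, "gw2": 10}}
    assign_non_preempting_priority(prios, "gw3")
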
Note that preemption avoidance should be optional depending on the deployment.
One necessarily sacrifices optimal load balancing to satisfy these requirements
as new gateways will get no traffic on boot. Thus, this feature represents a
trade-off which must be made on a per-installation basis.

Fully Active-Active HA
~~~~~~~~~~~~~~~~~~~~~~

::

    Fully Active-Active HA

    +----------------+   +----------------+
    |                |   |                |
    |   A B C D E    |   |   A B C D E    |
    |                |   |                |
    +----------------+   +----------------+
                  ^ ^ ^ ^
                  | | | |
                  | | | |
                  + + + +
                   Traffic

The final step in L3HA is to have true active-active HA. In this scenario,
each router has an instance on each Gateway, and a mechanism similar to ECMP
is used to distribute traffic evenly among all instances. This mechanism would
require Gateways to participate in routing protocols with the physical network
to attract traffic and signal failures. It is out of scope of this document,
but may eventually be necessary.

L2HA
----

L2HA is very difficult to get right. Unlike L3HA, where the consequences of
problems are minor, in L2HA if two gateways are both transiently active, an L2
loop triggers and a broadcast storm results. In practice, to get around this,
gateways end up implementing an overly conservative "when in doubt, drop all
traffic" policy, or they implement something like MLAG.

MLAG has multiple gateways work together to pretend to be a single L2 switch
with a large LACP bond. In principle, it's the right solution to the problem,
as it prevents broadcast storms and has been deployed successfully in
other contexts. That said, it's difficult to get right and not recommended.