..
      Licensed under the Apache License, Version 2.0 (the "License"); you may
      not use this file except in compliance with the License. You may obtain
      a copy of the License at

          http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
      License for the specific language governing permissions and limitations
      under the License.

      Convention for heading levels in Open vSwitch documentation:

      =======  Heading 0 (reserved for the title in a document)
      -------  Heading 1
      ~~~~~~~  Heading 2
      +++++++  Heading 3
      '''''''  Heading 4

      Avoid deeper levels because they do not render well.

==================================
OVN Gateway High Availability Plan
==================================

::

    OVN Gateway

    +---------------------------+
    |                           |
    |     External Network      |
    |                           |
    +-------------^-------------+
                  |
                  |
            +-----------+
            |           |
            |  Gateway  |
            |           |
            +-----------+
                  ^
                  |
                  |
    +-------------v-------------+
    |                           |
    |    OVN Virtual Network    |
    |                           |
    +---------------------------+

The OVN gateway is responsible for shuffling traffic between the tunneled
overlay network (governed by ovn-northd) and the legacy physical network. In
a naive implementation, the gateway is a single x86 server or hardware VTEP.
For most deployments, a single system has enough forwarding capacity to
service the entire virtualized network; however, it introduces a single point
of failure. If this system dies, the entire OVN deployment becomes
unavailable. To mitigate this risk, an HA solution is critical -- by spreading
responsibility across multiple systems, no single server failure can take down
the network.

An HA solution is both critical to the manageability of the system and
extremely difficult to get right. The purpose of this document is to propose
a plan for OVN Gateway High Availability which takes into account our past
experience building similar systems. It should be considered a fluid,
evolving proposal, not a set-in-stone decree.

Basic Architecture
------------------

In an OVN deployment, the set of hypervisors and network elements operating
under the guidance of ovn-northd is said to be in "logical space". These
servers use VXLAN, STT, or Geneve to communicate, oblivious to the details of
the underlying physical network. When these systems need to communicate with
legacy networks, traffic must be routed through a Gateway which translates
OVN-controlled tunnel traffic into raw physical network traffic.

Since the gateway is typically the only system with a connection to the
physical network, all traffic between logical space and the WAN must travel
through it. This makes it a critical single point of failure -- if the gateway
dies, communication with the WAN ceases for all systems in logical space.

To mitigate this risk, multiple gateways should be run in a "High Availability
Cluster" or "HA Cluster". The HA cluster will be responsible for performing
the duties of a gateway, while being able to recover gracefully from
individual member failures.

::

    OVN Gateway HA Cluster

             +---------------------------+
             |                           |
             |     External Network      |
             |                           |
             +-------------^-------------+
                           |
                           |
    +----------------------v----------------------+
    |                                             |
    |          High Availability Cluster          |
    |                                             |
    | +-----------+  +-----------+  +-----------+ |
    | |           |  |           |  |           | |
    | |  Gateway  |  |  Gateway  |  |  Gateway  | |
    | |           |  |           |  |           | |
    | +-----------+  +-----------+  +-----------+ |
    +----------------------^----------------------+
                           |
                           |
             +-------------v-------------+
             |                           |
             |    OVN Virtual Network    |
             |                           |
             +---------------------------+

L2 vs L3 High Availability
~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to achieve this goal, there are two broad approaches one can take.
The HA cluster can appear to the network like a giant Layer 2 Ethernet switch,
or like a giant IP router. These approaches are called L2HA and L3HA,
respectively. L2HA allows Ethernet broadcast domains to extend into logical
space, a significant advantage, but this comes at a cost. The need to avoid
transient L2 loops during failover significantly complicates their design. On
the other hand, L3HA works for most use cases, is simpler, and fails more
gracefully. For these reasons, it is suggested that OVN support an L3HA
model, leaving L2HA for future work (or third-party VTEP providers). Both
models are discussed further below.

L3HA
----

In this section, we'll work through a basic L3HA implementation, on top of
which we'll gradually build more sophisticated features, explaining their
motivations and implementations as we go.

Naive active-backup
~~~~~~~~~~~~~~~~~~~

Let's assume that a tenant has asked for a collection of logical routers. Our
task is to schedule these logical routers on one of N gateways, and to
gracefully redistribute the routers hosted on gateways which have failed. The
absolute simplest way to achieve this is what we'll call "naive active-backup".

::

    Naive Active Backup HA Implementation

    +----------------+   +----------------+
    | Leader         |   | Backup         |
    |                |   |                |
    |     A B C      |   |                |
    |                |   |                |
    +----+-+-+-+----++   +-+--------------+
         ^ ^ ^ ^     |     |
         | | | |     |     |
         | | | |   +-+-----+----+
         + + + +   | ovn-northd |
          Traffic  +------------+

In a naive active-backup, one of the gateways is chosen (arbitrarily) as a
leader. All logical routers (A, B, C in the figure) are scheduled on this
leader gateway and all traffic flows through it. ovn-northd monitors this
gateway via OpenFlow echo requests (or some equivalent), and if the gateway
dies, it recreates the routers on one of the backups.

This approach basically works in most cases and should likely be the starting
point for OVN -- it's strictly better than no HA solution and is a good
foundation for more sophisticated solutions. That said, it's not without its
limitations. Specifically, this approach doesn't coordinate with the physical
network to minimize disruption during failures, it tightly couples failover to
ovn-northd (we'll discuss why this is bad in a bit), and it wastes resources
by leaving backup gateways completely unutilized.

Router Failover
+++++++++++++++

When ovn-northd notices the leader has died and decides to migrate routers to a
backup gateway, the physical network has to be notified to direct traffic to
the new gateway. Otherwise, traffic could be blackholed for longer than
necessary, making failovers worse than they need to be.

For now, let's assume that OVN requires all gateways to be on the same IP
subnet on the physical network. If this isn't the case, gateways would need to
participate in routing protocols to orchestrate failovers, something which is
difficult and out of scope of this document.

Since all gateways are on the same IP subnet, we simply need to worry about
updating the MAC learning tables of the Ethernet switches on that subnet.
Presumably, they all have entries for each logical router pointing to the old
leader. If these entries aren't updated, all traffic will be sent to the (now
defunct) old leader, instead of the new one.

In order to mitigate this issue, it's recommended that the new gateway send a
Reverse ARP (RARP) onto the physical network for each logical router it now
controls. A Reverse ARP is a benign protocol used by many hypervisors to
update L2 forwarding tables when virtual machines migrate. In this case, the
Ethernet source address of the RARP is that of the logical router it
corresponds to, and its destination is the broadcast address. This causes the
RARP to travel to every L2 switch in the broadcast domain, updating forwarding
tables accordingly. This strategy is recommended in all failover mechanisms
discussed in this document -- when a router newly boots on a new leader, it
should RARP its MAC address.

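As an illustration only, the following sketch shows what such an announcement
could look like using the Scapy library; the helper and the interface name are
hypothetical, not part of OVN::

    # Hypothetical helper: announce a logical router's MAC after failover by
    # broadcasting a RARP frame, as described above.  Assumes Scapy is
    # installed; "eth0" is an illustrative physical-network interface.
    from scapy.all import ARP, Ether, sendp

    def rarp_announce(router_mac, iface="eth0"):
        """Broadcast a RARP frame sourced from router_mac so that L2 switches
        re-learn the port that now leads to the new leader."""
        frame = (
            Ether(src=router_mac, dst="ff:ff:ff:ff:ff:ff", type=0x8035)
            / ARP(op=3,  # "reverse request", as used in migration announcements
                  hwsrc=router_mac, hwdst=router_mac,
                  psrc="0.0.0.0", pdst="0.0.0.0")
        )
        sendp(frame, iface=iface, verbose=False)

    # e.g. rarp_announce("0a:00:00:00:00:01") for each logical router that
    # just failed over to this gateway.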

Controller Independent Active-backup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    Controller Independent Active-Backup Implementation

    +----------------+   +----------------+
    | Leader         |   | Backup         |
    |                |   |                |
    |     A B C      |   |                |
    |                |   |                |
    +----------------+   +----------------+
         ^ ^ ^ ^
         | | | |
         | | | |
         + + + +
          Traffic

The fundamental problem with naive active-backup is that it tightly couples the
failover solution to ovn-northd. This can significantly increase downtime in
the event of a failover, as the (often already busy) ovn-northd controller has
to recompute state for the new leader. Worse, if ovn-northd goes down, we can't
perform gateway failover at all. This violates the principle that control
plane outages should have no impact on dataplane functionality.

In a controller independent active-backup configuration, ovn-northd is
responsible for initial configuration, while the HA cluster is responsible for
monitoring the leader and failing over to a backup if necessary. ovn-northd
sets HA policy, but doesn't actively participate when failovers occur.

Of course, in this model, ovn-northd is not without some responsibility. Its
role is to pre-plan what should happen in the event of a failure, leaving it to
the individual switches to execute this plan. It does this by assigning each
gateway a unique leadership priority. Once assigned, it communicates this
priority to each node it controls. Nodes use the leadership priority to
determine which gateway in the cluster is the active leader by using a simple
metric: the leader is the healthy gateway with the highest priority. If that
gateway goes down, leadership falls to the gateway with the next highest
priority, and conversely, if a new gateway comes up with a higher priority, it
takes over leadership.

Thus, in this model, leadership of the HA cluster is determined simply by the
status of its members. Therefore, if we can communicate the status of each
gateway to each transport node, they can individually figure out which is the
leader and direct traffic accordingly.

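To make the metric concrete, here is a minimal sketch (purely illustrative,
not OVN code) of how a node could compute the leader from the configured
priorities and its local view of gateway health::

    # Illustration of the leadership metric described above.  The
    # "priorities" and "healthy" inputs stand in for whatever state a node
    # really has (e.g. configuration from ovn-northd and tunnel liveness).
    def elect_leader(priorities, healthy):
        """Return the healthy gateway with the highest leadership priority,
        or None if every gateway is down."""
        live = [gw for gw in priorities if healthy.get(gw, False)]
        if not live:
            return None
        return max(live, key=lambda gw: priorities[gw])

    # Every node evaluates the same function over the same inputs, so all
    # nodes converge on the same leader without consulting ovn-northd at
    # failover time.
    elect_leader({"gw1": 30, "gw2": 20, "gw3": 10},
                 {"gw1": False, "gw2": True, "gw3": True})   # -> "gw2"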

Tunnel Monitoring
+++++++++++++++++

Since in this model leadership is determined exclusively by the health status
of member gateways, a key problem is how to communicate this information to
the relevant transport nodes. Luckily, we can do this fairly cheaply using
tunnel monitoring protocols like BFD.

The basic idea is pretty straightforward. Each transport node maintains a
tunnel to every gateway in the HA cluster (not just the leader). These tunnels
are monitored using the BFD protocol to see which are alive. Given this
information, hypervisors can trivially compute the highest-priority live
gateway, and thus the leader.

In practice, this leadership computation can be performed trivially using the
bundle or group action. Rather than using OpenFlow to simply output to the
leader, all gateways could be listed in an active-backup bundle action ordered
by their priority. The bundle action will automatically take the tunnel
monitoring status into account and output the packet to the highest-priority
live gateway.

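As a sketch of what this could look like (illustrative only -- in a real
deployment ovn-controller programs these flows, and the bridge name, match,
and port numbers below are made up)::

    # Install a flow on a hypervisor that sends a logical router's outbound
    # traffic to the highest-priority live gateway tunnel via an OpenFlow
    # "bundle" action.
    import subprocess

    # ofport numbers of the tunnels to the gateways, highest priority first.
    gateway_ports = [10, 11, 12]

    flow = (
        "table=0,priority=100,metadata=0x1,"
        "actions=bundle(eth_src,0,active_backup,ofport,slaves:{})"
        .format(",".join(str(p) for p in gateway_ports))
    )
    # (Recent OVS releases spell the member list "members:" instead of
    # "slaves:".)  The active_backup algorithm outputs to the first listed
    # port whose tunnel is up according to BFD, i.e. the highest-priority
    # live gateway.
    subprocess.run(["ovs-ofctl", "add-flow", "br-int", flow], check=True)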

Inter-Gateway Monitoring
++++++++++++++++++++++++

One somewhat subtle aspect of this model is that failovers are not globally
atomic. When a failover occurs, it will take some time for all hypervisors to
notice and adjust accordingly. Similarly, if a new high-priority gateway comes
up, it may take some time for all hypervisors to switch over to the new leader.
In order to avoid confusing the physical network, under these circumstances
it's important for the backup gateways to drop traffic they've received
erroneously. In order to do this, each gateway must know whether or not it is,
in fact, active. This can be achieved by creating a mesh of tunnels between
gateways. Each gateway monitors the other gateways in its cluster to determine
which are alive, and therefore whether or not it is itself the leader. If
leading, the gateway forwards traffic normally; otherwise, it drops all
traffic.

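Continuing the earlier sketch, a gateway could reuse the same election logic
over its gateway-to-gateway tunnel mesh to decide whether to forward or drop
(again, purely illustrative)::

    def should_forward(self_name, priorities, peer_healthy):
        """True if this gateway believes it is the leader and so should
        forward traffic.  peer_healthy comes from BFD on the inter-gateway
        tunnel mesh; a gateway always counts itself as healthy."""
        healthy = dict(peer_healthy)
        healthy[self_name] = True
        return elect_leader(priorities, healthy) == self_name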

We should note that this method works well only under the assumption that
there are no inter-gateway connectivity failures; if such failures occur, this
method may fail to elect a single master. The simplest example is two gateways
which stop seeing each other but can still reach the hypervisors. Protocols
like VRRP or CARP have the same issue. A mitigation for this type of failure
mode could be achieved by having all network elements (hypervisors and
gateways) periodically share their link status with other endpoints.

Gateway Leadership Resignation
++++++++++++++++++++++++++++++

Sometimes a gateway may be healthy, yet still not be suitable to lead the HA
cluster. This could happen for several reasons, including:

* The physical network is unreachable

* BFD (or ping) has detected that the next-hop router is unreachable

* The gateway recently booted and isn't fully configured

In this case, the gateway should resign leadership by holding its tunnels down
using the ``other_config:cpath_down`` flag. This indicates to participating
hypervisors and gateways that this gateway should be treated as if it's down,
even though its tunnels are still healthy.

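A rough sketch of what resignation could look like follows; the interface name
is illustrative, and the exact schema location of the flag can vary between
OVS versions (some releases expose it as ``bfd:cpath_down`` on the Interface
record), so treat this as an assumption rather than a recipe::

    # Mark each of this gateway's tunnel interfaces with the cpath_down flag
    # so that peers treat the gateway as down and elect another leader.
    import subprocess

    def resign_leadership(tunnel_ifaces):
        for iface in tunnel_ifaces:
            subprocess.run(
                ["ovs-vsctl", "set", "Interface", iface,
                 "other_config:cpath_down=true"],
                check=True)

    resign_leadership(["ovn-gw-tun0"])  # hypothetical tunnel interface name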

Router Specific Active-Backup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    Router Specific Active-Backup

    +----------------+   +----------------+
    |                |   |                |
    |      A C       |   |     B D E      |
    |                |   |                |
    +----------------+   +----------------+
         ^ ^ ^ ^
         | | | |
         | | | |
         + + + +
          Traffic

Controller independent active-backup is a great advance over naive
active-backup, but it still has one glaring problem -- it under-utilizes the
backup gateways. In an ideal scenario, all traffic would split evenly among
the live set of gateways. Getting all the way there is somewhat tricky, but
as a step in that direction, one could use the "Router Specific Active-Backup"
algorithm. This algorithm looks a lot like active-backup on a
per-logical-router basis, with one twist. It chooses a different active
gateway for each logical router. Thus, in situations where there are several
logical routers, all with somewhat balanced load, this algorithm performs
better.

Implementation of this strategy is quite straightforward if built on top of
basic controller independent active-backup. On a per-logical-router basis, the
algorithm is the same: leadership is determined by the liveness of the
gateways. The key difference here is that the gateways must have a different
leadership priority for each logical router. These leadership priorities can
be computed by ovn-northd just as they had been in the controller independent
active-backup model.

Once we have these per-logical-router priorities, they simply need to be
communicated to the members of the gateway cluster and the hypervisors. The
hypervisors, in particular, simply need an active-backup bundle action (or
group action) per logical router listing the gateways in priority order for
*that router*, rather than having a single bundle action shared for all the
routers.

Additionally, the gateways need to be updated to take into account individual
router priorities. Specifically, each gateway should drop traffic for routers
for which it is a backup, and forward traffic for routers for which it is the
active gateway, instead of simply dropping or forwarding everything. This
should likely be done by having ovn-controller recompute OpenFlow for the
gateway, though other options exist.

The final complication is that ovn-northd's logic must be updated to choose
these per-logical-router leadership priorities in a more sophisticated manner.
It doesn't matter much exactly what algorithm it chooses to do this, beyond
that it should provide good balancing in the common case, i.e. each logical
router's priorities should be different enough that routers balance to
different gateways even when failures occur.

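One simple (purely illustrative, not prescribed) way to pick such priorities
is to rotate a base ordering of gateways per router::

    # Assign per-router leadership priorities so routers spread across
    # gateways, both initially and after failures.  Names and the rotation
    # scheme are assumptions for illustration only.
    def assign_priorities(routers, gateways):
        """For each router, return a gateway -> priority map (higher wins).
        Router i prefers gateways[i % len(gateways)], and the rest of the
        ordering is rotated so backups also differ per router."""
        n = len(gateways)
        result = {}
        for i, router in enumerate(routers):
            order = [gateways[(i + j) % n] for j in range(n)]
            result[router] = {gw: n - rank for rank, gw in enumerate(order)}
        return result

    # Routers A..E over three gateways: A prefers gw1, B prefers gw2, and so
    # on, so load (and failover load) spreads across the cluster.
    assign_priorities(["A", "B", "C", "D", "E"], ["gw1", "gw2", "gw3"])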

Preemption
++++++++++

In an active-backup setup, one issue that users will run into is that of
gateway leader preemption. If a new gateway is added to a cluster, or for some
reason an existing gateway is rebooted, we could end up in a situation where
the newly activated gateway has a higher priority than any other in the HA
cluster. In this case, as soon as that gateway appears, it will preempt
leadership from the currently active leader, causing an unnecessary failover.
Since failover can be quite expensive, this preemption may be undesirable.

The controller can optionally avoid preemption by cleverly tweaking the
leadership priorities. For each router, new gateways should be assigned
priorities that put them second in line or later when they eventually come up.
Furthermore, if a gateway goes down for a significant period of time, its old
leadership priorities should be revoked and new ones should be assigned as if
it were a brand-new gateway. Note that this should only happen if a gateway
has been down for a while (several minutes); otherwise, a flapping gateway
could have wide-ranging, unpredictable consequences.

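Continuing the illustrative sketches above, preemption avoidance only changes
how priorities are assigned when a gateway joins (the helper below and its
inputs are hypothetical)::

    # When a gateway (re)joins, give it, for every router, a priority strictly
    # below all existing ones so it cannot preempt the current leader.
    def add_gateway_without_preemption(priorities, new_gw):
        """priorities: router -> {gateway: priority}; modified in place."""
        for prio in priorities.values():
            prio[new_gw] = min(prio.values(), default=0) - 1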

Note that preemption avoidance should be optional, depending on the deployment.
One necessarily sacrifices optimal load balancing to satisfy these
requirements, as new gateways will get no traffic on boot. Thus, this feature
represents a trade-off which must be made on a per-installation basis.

Fully Active-Active HA
~~~~~~~~~~~~~~~~~~~~~~

::

    Fully Active-Active HA

    +----------------+   +----------------+
    |                |   |                |
    |   A B C D E    |   |   A B C D E    |
    |                |   |                |
    +----------------+   +----------------+
         ^ ^ ^ ^
         | | | |
         | | | |
         + + + +
          Traffic

The final step in L3HA is to have true active-active HA. In this scenario,
each router has an instance on each gateway, and a mechanism similar to ECMP
is used to distribute traffic evenly among all instances. This mechanism would
require gateways to participate in routing protocols with the physical network
to attract traffic and to signal failures. It is out of scope of this
document, but may eventually be necessary.

L2HA
----

L2HA is very difficult to get right. Unlike L3HA, where the consequences of
problems are minor, in L2HA if two gateways are both transiently active, an L2
loop is triggered and a broadcast storm results. In practice, to get around
this, gateways end up implementing an overly conservative "when in doubt, drop
all traffic" policy, or they implement something like MLAG.

MLAG has multiple gateways work together to pretend to be a single L2 switch
with a large LACP bond. In principle, it's the right approach, as it solves
the broadcast storm problem and has been deployed successfully in other
contexts. That said, it's difficult to get right and is not recommended.