OVN Gateway High Availability Plan
==================================
```
     +---------------------------+
     |                           |
     |     External Network      |
     |                           |
     +-------------^-------------+
                   |
                   |
             +-----------+
             |           |
             |  Gateway  |
             |           |
             +-----------+
                   ^
                   |
                   |
     +-------------v-------------+
     |                           |
     |    OVN Virtual Network    |
     |                           |
     +---------------------------+

OVN Gateway
```

The OVN gateway is responsible for shuffling traffic between the tunneled
overlay network (governed by ovn-northd) and the legacy physical network. In a
naive implementation, the gateway is a single x86 server or hardware VTEP. For
most deployments, a single system has enough forwarding capacity to service the
entire virtualized network; however, it introduces a single point of failure.
If this system dies, the entire OVN deployment becomes unavailable. To mitigate
this risk, an HA solution is critical -- by spreading responsibility across
multiple systems, no single server failure can take down the network.

An HA solution is both critical to the manageability of the system and
extremely difficult to get right. The purpose of this document is to propose a
plan for OVN Gateway High Availability which takes into account our past
experience building similar systems. It should be considered a fluid, evolving
proposal, not a set-in-stone decree.

Basic Architecture
------------------
In an OVN deployment, the set of hypervisors and network elements operating
under the guidance of ovn-northd is in what's called "logical space". These
servers use VXLAN, STT, or Geneve to communicate, oblivious to the details of
the underlying physical network. When these systems need to communicate with
legacy networks, traffic must be routed through a Gateway which translates
from OVN-controlled tunnel traffic to raw physical network traffic.

Since the gateway is typically the only system with a connection to the
physical network, all traffic between logical space and the WAN must travel
through it. This makes it a critical single point of failure -- if the gateway
dies, communication with the WAN ceases for all systems in logical space.

To mitigate this risk, multiple gateways should be run in a "High Availability
Cluster" or "HA Cluster". The HA cluster will be responsible for performing
the duties of a gateway, while being able to recover gracefully from
individual member failures.

```
         +---------------------------+
         |                           |
         |     External Network      |
         |                           |
         +-------------^-------------+
                       |
                       |
+----------------------v----------------------+
|                                             |
|          High Availability Cluster          |
|                                             |
| +-----------+  +-----------+  +-----------+ |
| |           |  |           |  |           | |
| |  Gateway  |  |  Gateway  |  |  Gateway  | |
| |           |  |           |  |           | |
| +-----------+  +-----------+  +-----------+ |
+----------------------^----------------------+
                       |
                       |
         +-------------v-------------+
         |                           |
         |    OVN Virtual Network    |
         |                           |
         +---------------------------+

OVN Gateway HA Cluster
```

##### L2 vs L3 High Availability
In order to achieve this goal, there are two broad approaches one can take.
The HA cluster can appear to the network like a giant Layer 2 Ethernet switch,
or like a giant IP router. These approaches are called L2HA and L3HA,
respectively. L2HA allows Ethernet broadcast domains to extend into logical
space, a significant advantage, but this comes at a cost: the need to avoid
transient L2 loops during failover significantly complicates the design. On
the other hand, L3HA works for most use cases, is simpler, and fails more
gracefully. For these reasons, it is suggested that OVN support an L3HA model,
leaving L2HA for future work (or third-party VTEP providers). Both models are
discussed further below.

L3HA
----
In this section, we'll work through a simple L3HA implementation, on top of
which we'll gradually build more sophisticated features, explaining their
motivations and implementations as we go.

### Naive Active-Backup
Let's assume that there is a collection of logical routers which a tenant has
asked for. Our task is to schedule these logical routers on one of N gateways,
and to gracefully redistribute the routers hosted on gateways which have
failed. The absolute simplest way to achieve this is what we'll call
"naive active-backup".

```
+----------------+   +----------------+
|     Leader     |   |     Backup     |
|                |   |                |
|    A B C       |   |                |
|                |   |                |
+----+-+-+-+----++   +-+--------------+
     ^ ^ ^ ^    |      |
     | | | |    |      |
     | | | |  +-+------+---+
     + + + +  | ovn-northd |
     Traffic  +------------+

Naive Active Backup HA Implementation
```

In a naive active-backup, one of the Gateways is chosen (arbitrarily) as a
leader. All logical routers (A, B, C in the figure) are scheduled on this
leader gateway and all traffic flows through it. ovn-northd monitors this
gateway via OpenFlow echo requests (or some equivalent), and if the gateway
dies, it recreates the routers on one of the backups.
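
As a rough illustration of this loop (not OVN code), the sketch below uses
hypothetical helpers gateway_is_alive() and schedule_router() in place of the
real OpenFlow echo monitoring and southbound database updates:

```
import time

def naive_active_backup(gateways, routers, gateway_is_alive, schedule_router):
    """Keep every logical router scheduled on a single healthy gateway."""
    leader = None
    while True:
        if leader is None or not gateway_is_alive(leader):
            # The leader died (or never existed); pick any healthy gateway
            # and recreate all of the routers on it.
            healthy = [gw for gw in gateways if gateway_is_alive(gw)]
            if healthy:
                leader = healthy[0]
                for router in routers:
                    schedule_router(router, leader)
        time.sleep(1)  # Polling for brevity; the real system is event-driven.
```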

This approach basically works in most cases and should likely be the starting
point for OVN -- it's strictly better than no HA solution and is a good
foundation for more sophisticated solutions. That said, it's not without its
limitations. Specifically, this approach doesn't coordinate with the physical
network to minimize disruption during failures, it tightly couples failover to
ovn-northd (we'll discuss why this is bad in a bit), and it wastes resources
by leaving backup gateways completely unutilized.

##### Router Failover
When ovn-northd notices the leader has died and decides to migrate routers to
a backup gateway, the physical network has to be notified to direct traffic to
the new gateway. Otherwise, traffic could be blackholed for longer than
necessary, making failovers worse than they need to be.

For now, let's assume that OVN requires all gateways to be on the same IP
subnet on the physical network. If this isn't the case, gateways would need to
participate in routing protocols to orchestrate failovers, something which is
difficult and out of scope of this document.

Since all gateways are on the same IP subnet, we simply need to worry about
updating the MAC learning tables of the Ethernet switches on that subnet.
Presumably, they all have entries for each logical router pointing to the old
leader. If these entries aren't updated, all traffic will be sent to the (now
defunct) old leader instead of the new one.

In order to mitigate this issue, it's recommended that the new gateway send a
Reverse ARP (RARP) onto the physical network for each logical router it now
controls. A Reverse ARP is a benign protocol used by many hypervisors when
virtual machines migrate to update L2 forwarding tables. In this case, the
Ethernet source address of the RARP is that of the logical router it
corresponds to, and its destination is the broadcast address. This causes the
RARP to travel to every L2 switch in the broadcast domain, updating forwarding
tables accordingly. This strategy is recommended in all failover mechanisms
discussed in this document -- whenever a router starts up on a new leader,
that gateway should RARP the router's MAC address.
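
For illustration only, the following sketch builds such a RARP announcement by
hand on Linux. The interface name and MAC address are made up, root privileges
and an AF_PACKET raw socket are assumed, and a real gateway would of course
emit this from its datapath rather than from Python:

```
import socket
import struct

def send_rarp(ifname, router_mac):
    """Broadcast a RARP frame whose source MAC is the logical router's MAC."""
    mac = bytes.fromhex(router_mac.replace(":", ""))
    broadcast = b"\xff" * 6
    ethertype = struct.pack("!H", 0x8035)  # EtherType for RARP.
    # RARP body: Ethernet/IPv4, 6-byte MACs, 4-byte IPs, opcode 3
    # ("request reverse"), with zeroed protocol addresses.
    body = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 3)
    body += mac + b"\x00" * 4 + mac + b"\x00" * 4
    frame = broadcast + mac + ethertype + body
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as s:
        s.bind((ifname, 0))
        s.send(frame)

# Example (hypothetical values): send_rarp("eth1", "02:00:00:00:00:01")
```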

### Controller Independent Active-Backup
```
+----------------+   +----------------+
|     Leader     |   |     Backup     |
|                |   |                |
|    A B C       |   |                |
|                |   |                |
+----------------+   +----------------+
     ^ ^ ^ ^
     | | | |
     | | | |
     + + + +
      Traffic

Controller Independent Active-Backup Implementation
```

The fundamental problem with naive active-backup is that it tightly couples
the failover solution to ovn-northd. This can significantly increase downtime
in the event of a failover, as the (often already busy) ovn-northd controller
has to recompute state for the new leader. Worse, if ovn-northd goes down, we
can't perform gateway failover at all. This violates the principle that
control plane outages should have no impact on dataplane functionality.

In a controller independent active-backup configuration, ovn-northd is
responsible for initial configuration while the HA cluster is responsible for
monitoring the leader and failing over to a backup if necessary. ovn-northd
sets HA policy but doesn't actively participate when failovers occur.

Of course, in this model, ovn-northd is not without some responsibility. Its
role is to pre-plan what should happen in the event of a failure, leaving it
to the individual switches to execute this plan. It does this by assigning
each gateway a unique leadership priority. Once assigned, it communicates this
priority to each node it controls. Nodes use the leadership priority to
determine which gateway in the cluster is the active leader using a simple
metric: the leader is the healthy gateway with the highest priority. If that
gateway goes down, leadership falls to the gateway with the next highest
priority; conversely, if a new gateway comes up with a higher priority, it
takes over leadership.
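
The rule is simple enough to state as code. This is only a sketch of the
election logic; the gateway names and priorities are hypothetical, and in
practice the health information comes from tunnel monitoring (discussed
below):

```
def elect_leader(priorities, healthy):
    """priorities: gateway name -> priority; healthy: set of live gateways."""
    candidates = [gw for gw in priorities if gw in healthy]
    if not candidates:
        return None  # No healthy gateway: traffic has nowhere to go.
    # The leader is the healthy gateway with the highest priority.
    return max(candidates, key=lambda gw: priorities[gw])

# elect_leader({"gw1": 30, "gw2": 20, "gw3": 10}, {"gw2", "gw3"}) == "gw2"
```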

Thus, in this model, leadership of the HA cluster is determined simply by the
status of its members. Therefore, if we can communicate the status of each
gateway to each transport node, each node can individually figure out which
gateway is the leader and direct traffic accordingly.

##### Tunnel Monitoring
Since in this model leadership is determined exclusively by the health status
of member gateways, a key problem is how to communicate this information to
the relevant transport nodes. Luckily, we can do this fairly cheaply using
tunnel monitoring protocols like BFD.

The basic idea is pretty straightforward. Each transport node maintains a
tunnel to every gateway in the HA cluster (not just the leader). These tunnels
are monitored using the BFD protocol to see which are alive. Given this
information, hypervisors can trivially compute the highest-priority live
gateway, and thus the leader.

In practice, this leadership computation can be performed directly using the
OpenFlow bundle or group action. Rather than using OpenFlow to simply output
to the leader, all gateways could be listed in an active-backup bundle action
ordered by their priority. The bundle action will automatically take the
tunnel monitoring status into account and output the packet to the
highest-priority live gateway.
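
As a concrete (and hypothetical) illustration of what this could look like if
configured by hand -- in a real deployment ovn-controller would program the
equivalent flows itself -- assume br-int has tunnels to three gateways on
OpenFlow ports 10, 11, and 12, listed in leadership-priority order:

```
# Monitor each gateway tunnel with BFD.
ovs-vsctl set Interface tun-gw1 bfd:enable=true
ovs-vsctl set Interface tun-gw2 bfd:enable=true
ovs-vsctl set Interface tun-gw3 bfd:enable=true

# Send gateway-bound traffic to the first live tunnel in the list; the
# "active_backup" bundle algorithm consults BFD liveness automatically.
# (The match here is a placeholder for "traffic destined to the gateway".)
ovs-ofctl add-flow br-int \
  "ip,nw_dst=203.0.113.0/24,actions=bundle(eth_src,0,active_backup,ofport,slaves:10,11,12)"
```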

##### Inter-Gateway Monitoring
One somewhat subtle aspect of this model is that failovers are not globally
atomic. When a failover occurs, it will take some time for all hypervisors to
notice and adjust accordingly. Similarly, if a new high-priority gateway comes
up, it may take some time for all hypervisors to switch over to the new leader.
In order to avoid confusing the physical network, under these circumstances
it's important for the backup gateways to drop traffic they've received
erroneously. In order to do this, each gateway must know whether or not it is,
in fact, active. This can be achieved by creating a mesh of tunnels between
gateways. Each gateway monitors the other gateways in its cluster to determine
which are alive, and therefore whether or not it is itself the leader. If
leading, the gateway forwards traffic normally; otherwise it drops all
traffic.
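
The gateway-side check is the mirror image of the hypervisor-side election
sketched earlier. Again purely illustrative, with liveness assumed to come
from BFD on the gateway-to-gateway tunnels:

```
def should_forward(my_name, priorities, live_peers):
    """Forward only while this gateway is the highest-priority live member."""
    live = set(live_peers) | {my_name}  # A gateway always counts itself.
    leader = max(live, key=lambda gw: priorities[gw])
    return leader == my_name  # If not leading, drop the traffic instead.
```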

##### Gateway Leadership Resignation
Sometimes a gateway may be healthy, but still may not be suitable to lead the
HA cluster. This could happen for several reasons, including:

* The physical network is unreachable.
* BFD (or ping) has detected that the next hop router is unreachable.
* The gateway recently booted and isn't fully configured.

In this case, the gateway should resign leadership by holding its tunnels down
using the other_config:cpath_down flag. This indicates to participating
hypervisors and gateways that this gateway should be treated as if it's down,
even though its tunnels are still healthy.
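
As a rough sketch of what resignation could look like from the command line
(the interface names are hypothetical, and current Open vSwitch exposes this
knob as the cpath_down key of the Interface table's bfd column):

```
# Tell peers "treat me as down" on every tunnel this gateway terminates.
ovs-vsctl set Interface tun-hv1 bfd:cpath_down=true
ovs-vsctl set Interface tun-hv2 bfd:cpath_down=true
```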

### Router Specific Active-Backup
```
+----------------+   +----------------+
|                |   |                |
|    A C         |   |    B D E       |
|                |   |                |
+----------------+   +----------------+
     ^ ^ ^ ^
     | | | |
     | | | |
     + + + +
      Traffic

Router Specific Active-Backup
```

Controller independent active-backup is a great advance over naive
active-backup, but it still has one glaring problem -- it under-utilizes the
backup gateways. In an ideal scenario, all traffic would split evenly among
the live set of gateways. Getting all the way there is somewhat tricky, but as
a step in that direction, one could use the "Router Specific Active-Backup"
algorithm. This algorithm looks a lot like active-backup on a per logical
router basis, with one twist: it chooses a different active gateway for each
logical router. Thus, in situations where there are several logical routers,
all with somewhat balanced load, this algorithm performs better.

Implementation of this strategy is quite straightforward if built on top of
basic controller independent active-backup. On a per logical router basis, the
algorithm is the same: leadership is determined by the liveness of the
gateways. The key difference here is that the gateways must have a different
leadership priority for each logical router. These leadership priorities can
be computed by ovn-northd just as they had been in the controller independent
active-backup model.

Once we have these per logical router priorities, they simply need to be
communicated to the members of the gateway cluster and the hypervisors. The
hypervisors, in particular, simply need an active-backup bundle action (or
group action) per logical router listing the gateways in priority order for
*that router*, rather than a single bundle action shared by all the routers.

Additionally, the gateways need to be updated to take individual router
priorities into account. Specifically, each gateway should drop traffic for
the routers it is backing up, and forward traffic for the routers it is active
for, instead of simply dropping or forwarding everything. This should likely
be done by having ovn-controller recompute OpenFlow for the gateway, though
other options exist.

The final complication is that ovn-northd's logic must be updated to choose
these per logical router leadership priorities in a more sophisticated manner.
It doesn't matter much exactly what algorithm it uses, beyond that it should
provide good balancing in the common case: each logical router's priorities
should be different enough that routers balance to different gateways even
when failures occur.
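
One simple scheme (purely a sketch, not a proposal for the real algorithm)
would be to rotate the gateway list by a stable hash of the router's name, so
that routers spread across leaders and fail over to different backups:

```
import zlib

def per_router_priorities(routers, gateways):
    """Return {router: [gateways in leadership-priority order]}."""
    orderings = {}
    for router in routers:
        start = zlib.crc32(router.encode()) % len(gateways)
        orderings[router] = gateways[start:] + gateways[:start]
    return orderings

# per_router_priorities(["A", "B", "C"], ["gw1", "gw2", "gw3"]) maps each
# router to some rotation of ["gw1", "gw2", "gw3"].
```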

##### Preemption
In an active-backup setup, one issue that users will run into is that of
gateway leader preemption. If a new gateway is added to a cluster, or for some
reason an existing gateway is rebooted, we could end up in a situation where
the newly activated gateway has a higher priority than any other in the HA
cluster. In this case, as soon as that gateway appears, it will preempt
leadership from the currently active leader, causing an unnecessary failover.
Since failover can be quite expensive, this preemption may be undesirable.

The controller can optionally avoid preemption by cleverly tweaking the
leadership priorities. For each router, new gateways should be assigned
priorities that put them second in line or later when they eventually come up.
Furthermore, if a gateway goes down for a significant period of time, its old
leadership priorities should be revoked and new ones should be assigned as if
it were a brand new gateway. Note that this should only happen if a gateway
has been down for a while (several minutes); otherwise a flapping gateway
could have wide-ranging, unpredictable consequences.
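
For one router, the "second in line or later" rule can be as simple as the
following sketch (the concrete numbers are, of course, up to the controller):

```
def non_preempting_priority(current_priorities):
    """Pick a priority for a joining gateway that cannot preempt the leader."""
    if not current_priorities:
        return 100  # First gateway in the cluster: arbitrary base value.
    # Join at the back of the line, i.e. strictly below every existing member.
    return min(current_priorities.values()) - 1
```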

Note that preemption avoidance should be optional depending on the deployment.
One necessarily sacrifices optimal load balancing to satisfy these
requirements, as new gateways will get no traffic on boot. Thus, this feature
represents a trade-off which must be made on a per installation basis.

### Fully Active-Active HA
```
+----------------+   +----------------+
|                |   |                |
|   A B C D E    |   |   A B C D E    |
|                |   |                |
+----------------+   +----------------+
     ^ ^ ^ ^
     | | | |
     | | | |
     + + + +
      Traffic
```

The final step in L3HA is to have true active-active HA. In this scenario,
each router has an instance on each gateway, and a mechanism similar to ECMP
is used to distribute traffic evenly among all instances. This mechanism would
require gateways to participate in routing protocols with the physical network
to attract traffic and signal failures. It is out of scope of this document,
but may eventually be necessary.

L2HA
----
L2HA is very difficult to get right. Unlike L3HA, where the consequences of
problems are minor, in L2HA, if two gateways are both transiently active, an
L2 loop triggers and a broadcast storm results. In practice, to get around
this, gateways end up implementing an overly conservative "when in doubt drop
all traffic" policy, or they implement something like MLAG.

With MLAG, multiple gateways work together to pretend to be a single L2 switch
with a large LACP bond. In principle, it's the right solution to the problem,
as it prevents broadcast storms and has been deployed successfully in other
contexts. That said, it's difficult to get right and is not recommended.