..
      Licensed under the Apache License, Version 2.0 (the "License"); you may
      not use this file except in compliance with the License. You may obtain
      a copy of the License at

          http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
      License for the specific language governing permissions and limitations
      under the License.

      Convention for heading levels in Open vSwitch documentation:

      =======  Heading 0 (reserved for the title in a document)
      -------  Heading 1
      ~~~~~~~  Heading 2
      +++++++  Heading 3
      '''''''  Heading 4

      Avoid deeper levels because they do not render well.

==================================
OVN Gateway High Availability Plan
==================================

::

    OVN Gateway

         +---------------------------+
         |                           |
         |      External Network     |
         |                           |
         +-------------^-------------+
                       |
                       |
                 +-----------+
                 |           |
                 |  Gateway  |
                 |           |
                 +-----------+
                       ^
                       |
                       |
         +-------------v-------------+
         |                           |
         |    OVN Virtual Network    |
         |                           |
         +---------------------------+

The OVN gateway is responsible for shuffling traffic between the tunneled
overlay network (governed by ovn-northd) and the legacy physical network. In
a naive implementation, the gateway is a single x86 server, or hardware VTEP.
For most deployments, a single system has enough forwarding capacity to
service the entire virtualized network; however, it introduces a single point
of failure. If this system dies, the entire OVN deployment becomes
unavailable. To mitigate this risk, an HA solution is critical -- by
spreading responsibility across multiple systems, no single server failure
can take down the network.

An HA solution is both critical to the manageability of the system and
extremely difficult to get right. The purpose of this document is to propose
a plan for OVN Gateway High Availability which takes into account our past
experience building similar systems. It should be considered a fluid,
changing proposal, not a set-in-stone decree.

Basic Architecture
------------------

In an OVN deployment, the set of hypervisors and network elements operating
under the guidance of ovn-northd are in what's called "logical space". These
servers use VXLAN, STT, or Geneve to communicate, oblivious to the details of
the underlying physical network. When these systems need to communicate with
legacy networks, traffic must be routed through a Gateway which translates
from OVN controlled tunnel traffic to raw physical network traffic.

Since the gateway is typically the only system with a connection to the
physical network, all traffic between logical space and the WAN must travel
through it. This makes it a critical single point of failure -- if the
gateway dies, communication with the WAN ceases for all systems in logical
space.

To mitigate this risk, multiple gateways should be run in a "High
Availability Cluster" or "HA Cluster". The HA cluster will be responsible
for performing the duties of a gateway, while being able to recover
gracefully from individual member failures.

::

    OVN Gateway HA Cluster

             +---------------------------+
             |                           |
             |      External Network     |
             |                           |
             +-------------^-------------+
                           |
                           |
    +----------------------v----------------------+
    |                                             |
    |          High Availability Cluster          |
    |                                             |
    | +-----------+  +-----------+  +-----------+ |
    | |           |  |           |  |           | |
    | |  Gateway  |  |  Gateway  |  |  Gateway  | |
    | |           |  |           |  |           | |
    | +-----------+  +-----------+  +-----------+ |
    +----------------------^----------------------+
                           |
                           |
             +-------------v-------------+
             |                           |
             |    OVN Virtual Network    |
             |                           |
             +---------------------------+

L2 vs L3 High Availability
~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to achieve this goal, there are two broad approaches one can take.
The HA cluster can appear to the network like a giant Layer 2 Ethernet
switch, or like a giant IP router. These approaches are called L2HA and
L3HA, respectively. L2HA allows Ethernet broadcast domains to extend into
logical space, a significant advantage, but this comes at a cost: the need
to avoid transient L2 loops during failover significantly complicates the
design. On the other hand, L3HA works for most use cases, is simpler, and
fails more gracefully. For these reasons, it is suggested that OVN support
an L3HA model, leaving L2HA for future work (or third party VTEP providers).
Both models are discussed further below.

L3HA
----

In this section, we'll work through a simple L3HA implementation, on top of
which we'll gradually build more sophisticated features, explaining their
motivations and implementations as we go.

Naive active-backup
~~~~~~~~~~~~~~~~~~~

Let's assume that there is a collection of logical routers which a tenant
has asked for; our task is to schedule these logical routers on one of N
gateways, and to gracefully redistribute the routers running on gateways
which have failed. The absolute simplest way to achieve this is what we'll
call "naive-active-backup".

::

    Naive Active Backup HA Implementation

    +----------------+   +----------------+
    |     Leader     |   |     Backup     |
    |                |   |                |
    |  A B C         |   |                |
    |                |   |                |
    +----+-+-+-+----++   +-+--------------+
         ^ ^ ^ ^    |      |
         | | | |    |      |
         | | | |  +-+------+---+
         + + + +  | ovn-northd |
          Traffic +------------+

In a naive active-backup, one of the Gateways is chosen (arbitrarily) as a
leader. All logical routers (A, B, C in the figure) are scheduled on this
leader gateway and all traffic flows through it. ovn-northd monitors this
gateway via OpenFlow echo requests (or some equivalent), and if the gateway
dies, it recreates the routers on one of the backups.

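In rough illustrative Python (the ``probe()`` and ``schedule_routers()``
helpers are hypothetical stand-ins for OpenFlow echo monitoring and router
creation, not real ovn-northd APIs), the control loop looks something like
this::

    # Sketch of naive active-backup failover logic, under the stated
    # assumptions; not actual ovn-northd code.
    import time

    def monitor(gateways, routers, probe, schedule_routers):
        leader = gateways[0]                # arbitrarily chosen leader
        schedule_routers(leader, routers)   # all routers on the leader
        while True:
            if not probe(leader):           # e.g. OpenFlow echo timed out
                live = [gw for gw in gateways if gw != leader and probe(gw)]
                if live:
                    leader = live[0]        # promote a backup to leader
                    schedule_routers(leader, routers)
            time.sleep(1)                   # probe interval
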
This approach basically works in most cases and should likely be the
starting point for OVN -- it's strictly better than no HA solution and is a
good foundation for more sophisticated solutions. That said, it's not
without its limitations. Specifically, this approach doesn't coordinate
with the physical network to minimize disruption during failures, it
tightly couples failover to ovn-northd (we'll discuss why this is bad in a
bit), and it wastes resources by leaving backup gateways completely
unutilized.

Router Failover
+++++++++++++++

When ovn-northd notices the leader has died and decides to migrate routers
to a backup gateway, the physical network has to be notified to direct
traffic to the new gateway. Otherwise, traffic could be blackholed for
longer than necessary, making failovers worse than they need to be.

For now, let's assume that OVN requires all gateways to be on the same IP
subnet on the physical network. If this isn't the case, gateways would need
to participate in routing protocols to orchestrate failovers, something
which is difficult and out of scope of this document.

Since all gateways are on the same IP subnet, we simply need to worry about
updating the MAC learning tables of the Ethernet switches on that subnet.
Presumably, they all have entries for each logical router pointing to the
old leader. If these entries aren't updated, all traffic will be sent to
the (now defunct) old leader, instead of the new one.

In order to mitigate this issue, it's recommended that the new gateway send
a Reverse ARP (RARP) onto the physical network for each logical router it
now controls. A Reverse ARP is a benign protocol used by many hypervisors
to update L2 forwarding tables when virtual machines migrate. In this case,
the Ethernet source address of the RARP is that of the logical router it
corresponds to, and its destination is the broadcast address. This causes
the RARP to travel to every L2 switch in the broadcast domain, updating
forwarding tables accordingly. This strategy is recommended in all failover
mechanisms discussed in this document -- when a router newly boots on a new
leader, it should RARP its MAC address.

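As a concrete illustration, such a frame could be emitted with a short
Python sketch along the following lines (the interface name and helper are
hypothetical; OVN would do the equivalent internally)::

    # Sketch: broadcast a Reverse ARP (EtherType 0x8035) sourced from a
    # logical router's MAC so switches update their forwarding tables.
    # Linux-only (AF_PACKET) and requires CAP_NET_RAW.
    import socket
    import struct

    def send_rarp(ifname, router_mac):
        eth = b"\xff" * 6 + router_mac + struct.pack("!H", 0x8035)
        # hw type 1 (Ethernet), proto 0x0800 (IPv4), hlen 6, plen 4,
        # opcode 3 (request reverse); the IP fields are left zeroed.
        rarp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 3)
        rarp += router_mac + b"\x00" * 4 + router_mac + b"\x00" * 4
        with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as s:
            s.bind((ifname, 0))
            s.send(eth + rarp)
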
Controller Independent Active-Backup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    Controller Independent Active-Backup Implementation

    +----------------+   +----------------+
    |     Leader     |   |     Backup     |
    |                |   |                |
    |  A B C         |   |                |
    |                |   |                |
    +----------------+   +----------------+
         ^ ^ ^ ^
         | | | |
         | | | |
         + + + +
          Traffic

The fundamental problem with naive active-backup is that it tightly couples
the failover solution to ovn-northd. This can significantly increase
downtime in the event of a failover, as the (often already busy) ovn-northd
controller has to recompute state for the new leader. Worse, if ovn-northd
goes down, we can't perform gateway failover at all. This violates the
principle that control plane outages should have no impact on dataplane
functionality.

In a controller independent active-backup configuration, ovn-northd is
responsible for initial configuration, while the HA cluster is responsible
for monitoring the leader and failing over to a backup if necessary.
ovn-northd sets HA policy, but doesn't actively participate when failovers
occur.

Of course, in this model, ovn-northd is not without some responsibility.
Its role is to pre-plan what should happen in the event of a failure,
leaving it to the individual switches to execute this plan. It does this by
assigning each gateway a unique leadership priority. Once assigned, it
communicates this priority to each node it controls. Nodes use the
leadership priority to determine which gateway in the cluster is the active
leader by using a simple metric: the leader is the healthy gateway with the
highest priority. If that gateway goes down, leadership falls to the one
with the next highest priority; conversely, if a new gateway comes up with
a higher priority, it takes over leadership.

Thus, in this model, leadership of the HA cluster is determined simply by
the status of its members. Therefore, if we can communicate the status of
each gateway to each transport node, they can individually figure out which
is the leader, and direct traffic accordingly.

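Put differently, each node independently evaluates something like the
following (an illustrative Python sketch; the gateway list and liveness
inputs are placeholders)::

    # Sketch: derive the leader from static priorities plus liveness.
    def elect_leader(gateways, is_alive):
        """gateways: iterable of (name, priority); is_alive: name -> bool."""
        live = [(prio, name) for name, prio in gateways if is_alive(name)]
        return max(live)[1] if live else None  # highest-priority live gateway
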
Tunnel Monitoring
+++++++++++++++++

Since leadership in this model is determined exclusively by the health
status of member gateways, a key problem is how to communicate this
information to the relevant transport nodes. Luckily, we can do this fairly
cheaply using tunnel monitoring protocols like BFD.

The basic idea is pretty straightforward. Each transport node maintains a
tunnel to every gateway in the HA cluster (not just the leader). These
tunnels are monitored using the BFD protocol to see which are alive. Given
this information, hypervisors can trivially compute the highest priority
live gateway, and thus the leader.

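For example, BFD could be enabled on each tunnel interface with something
like the following (the interface name is illustrative)::

    $ ovs-vsctl set Interface ovn-gw-0 bfd:enable=true
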
In practice, this leadership computation can be performed trivially using
the bundle or group action. Rather than using OpenFlow to simply output to
the leader, all gateways could be listed in an active-backup bundle action
ordered by their priority. The bundle action will automatically take into
account the tunnel monitoring status to output the packet to the highest
priority live gateway.

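A hypervisor flow could therefore look something like this (the OpenFlow
port numbers for the gateway tunnels are illustrative)::

    # Tunnels to three gateways on ports 10, 11, 12, listed in
    # leadership priority order; active_backup picks the first live one.
    actions=bundle(eth_src,0,active_backup,ofport,members:10,11,12)
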
Inter-Gateway Monitoring
++++++++++++++++++++++++

One somewhat subtle aspect of this model is that failovers are not globally
atomic. When a failover occurs, it will take some time for all hypervisors
to notice and adjust accordingly. Similarly, if a new high priority gateway
comes up, it may take some time for all hypervisors to switch over to the
new leader. In order to avoid confusing the physical network, under these
circumstances it's important for the backup gateways to drop traffic
they've received erroneously. In order to do this, each gateway must know
whether or not it is, in fact, active. This can be achieved by creating a
mesh of tunnels between gateways. Each gateway monitors the other gateways
in its cluster to determine which are alive, and therefore whether or not
it is itself the leader. If leading, the gateway forwards traffic normally;
otherwise it drops all traffic.

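In terms of the earlier sketch, a gateway simply checks whether it would
elect itself (again illustrative; ``elect_leader`` is the hypothetical
helper shown above)::

    # Sketch: forward only if this gateway believes it is the leader,
    # based on its own BFD view of the other cluster members.
    def should_forward(self_name, gateways, is_alive):
        return elect_leader(gateways, is_alive) == self_name
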
We should note that this method works well only under the assumption that
there are no inter-gateway connectivity failures; when such failures occur,
it can fail to elect a single master. The simplest example is two gateways
which stop seeing each other but can still reach the hypervisors. Protocols
like VRRP or CARP have the same issue. This type of failure mode could be
mitigated by having all network elements (hypervisors and gateways)
periodically share their link status with the other endpoints.

Gateway Leadership Resignation
++++++++++++++++++++++++++++++

Sometimes a gateway may be healthy, but still may not be suitable to lead
the HA cluster. This could happen for several reasons, including:

* The physical network is unreachable.

* BFD (or ping) has detected that the next hop router is unreachable.

* The Gateway recently booted and isn't fully configured.

In this case, the Gateway should resign leadership by holding its tunnels
down using the ``other_config:cpath_down`` flag. This indicates to
participating hypervisors and Gateways that this gateway should be treated
as if it's down, even though its tunnels are still healthy.

Router Specific Active-Backup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    Router Specific Active-Backup

    +----------------+   +----------------+
    |                |   |                |
    |   A C          |   |   B D E        |
    |                |   |                |
    +----------------+   +----------------+
         ^ ^ ^ ^
         | | | |
         | | | |
         + + + +
          Traffic

Controller independent active-backup is a great advance over naive
active-backup, but it still has one glaring problem -- it under-utilizes
the backup gateways. In an ideal scenario, all traffic would split evenly
among the live set of gateways. Getting all the way there is somewhat
tricky, but as a step in that direction, one could use the "Router Specific
Active-Backup" algorithm. This algorithm looks a lot like active-backup on
a per logical router basis, with one twist: it chooses a different active
Gateway for each logical router. Thus, in situations where there are
several logical routers, all with somewhat balanced load, this algorithm
performs better.

Implementation of this strategy is quite straightforward if built on top of
basic controller independent active-backup. On a per logical router basis,
the algorithm is the same: leadership is determined by the liveness of the
gateways. The key difference here is that the gateways must have a
different leadership priority for each logical router. These leadership
priorities can be computed by ovn-northd just as they had been in the
controller independent active-backup model.

Once we have these per logical router priorities, they simply need to be
communicated to the members of the gateway cluster and the hypervisors. The
hypervisors, in particular, simply need an active-backup bundle action (or
group action) per logical router, listing the gateways in priority order
for *that router*, rather than having a single bundle action shared by all
the routers.

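For instance, the hypervisor flow table could contain one bundle action per
router, each with its own gateway ordering (the matches and port numbers
here are purely illustrative)::

    # Router A prefers the gateway on port 10; router B the one on 11.
    reg0=0x1,actions=bundle(eth_src,0,active_backup,ofport,members:10,11,12)
    reg0=0x2,actions=bundle(eth_src,0,active_backup,ofport,members:11,12,10)
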
Additionally, the gateways need to be updated to take individual router
priorities into account. Specifically, each gateway should drop traffic for
the routers it backs, and forward traffic for the routers it leads, instead
of simply dropping or forwarding everything. This should likely be done by
having ovn-controller recompute OpenFlow for the gateway, though other
options exist.

The final complication is that ovn-northd's logic must be updated to choose
these per logical router leadership priorities in a more sophisticated
manner. It doesn't matter much exactly what algorithm it uses, beyond that
it should provide good balancing in the common case, i.e., each logical
router's priorities should be different enough that routers balance to
different gateways even when failures occur.

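One simple scheme with this property is to rotate the gateway list per
router (a hedged sketch; any algorithm with similar spreading behavior
would do)::

    # Sketch: give each router a rotated gateway preference list so that
    # both active routers and their failover targets spread across
    # gateways; higher numbers mean higher priority.
    def per_router_priorities(routers, gateways):
        plans = {}
        for i, router in enumerate(sorted(routers)):
            k = i % len(gateways)
            order = gateways[k:] + gateways[:k]
            plans[router] = {gw: len(order) - rank
                             for rank, gw in enumerate(order)}
        return plans
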
Preemption
++++++++++

In an active-backup setup, one issue that users will run into is that of
gateway leader preemption. If a new Gateway is added to a cluster, or for
some reason an existing gateway is rebooted, we could end up in a situation
where the newly activated gateway has a higher priority than any other in
the HA cluster. In this case, as soon as that gateway appears, it will
preempt leadership from the currently active leader, causing an unnecessary
failover. Since failover can be quite expensive, this preemption may be
undesirable.

The controller can optionally avoid preemption by cleverly tweaking the
leadership priorities. For each router, new gateways should be assigned
priorities that put them second in line or later when they eventually come
up. Furthermore, if a gateway goes down for a significant period of time,
its old leadership priorities should be revoked and new ones should be
assigned as if it were a brand new gateway. Note that this should only
happen if a gateway has been down for a while (several minutes); otherwise
a flapping gateway could have wide-ranging, unpredictable consequences.

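A sketch of such an assignment policy (illustrative only)::

    # Sketch: a gateway (re)joining the cluster gets a priority strictly
    # below every existing member's, so it cannot preempt the leader.
    def nonpreemptive_priority(existing_priorities):
        return min(existing_priorities, default=100) - 1
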
Note that preemption avoidance should be optional, depending on the
deployment. One necessarily sacrifices optimal load balancing to satisfy
these requirements, as new gateways will get no traffic on boot. Thus, this
feature represents a trade-off which must be made on a per installation
basis.

Fully Active-Active HA
~~~~~~~~~~~~~~~~~~~~~~

::

    Fully Active-Active HA

    +----------------+   +----------------+
    |                |   |                |
    |   A B C D E    |   |   A B C D E    |
    |                |   |                |
    +----------------+   +----------------+
         ^ ^ ^ ^
         | | | |
         | | | |
         + + + +
          Traffic

The final step in L3HA is to have true active-active HA. In this scenario,
each router has an instance on each Gateway, and a mechanism similar to
ECMP is used to distribute traffic evenly among all instances. This
mechanism would require Gateways to participate in routing protocols with
the physical network to attract traffic and to signal failures. It is out
of scope of this document, but may eventually be necessary.

L2HA
----

L2HA is very difficult to get right. Unlike L3HA, where the consequences of
problems are minor, in L2HA, if two gateways are both transiently active,
an L2 loop triggers and a broadcast storm results. In practice, to get
around this, gateways end up implementing an overly conservative "when in
doubt drop all traffic" policy, or they implement something like MLAG.

MLAG has multiple gateways work together to pretend to be a single L2
switch with a large LACP bond. In principle, it's the right approach, as it
solves the broadcast storm problem and has been deployed successfully in
other contexts. That said, it's difficult to get right and not recommended.