..
      Licensed under the Apache License, Version 2.0 (the "License"); you may
      not use this file except in compliance with the License.  You may obtain
      a copy of the License at

          http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
      License for the specific language governing permissions and limitations
      under the License.

      Convention for heading levels in Open vSwitch documentation:

      =======  Heading 0 (reserved for the title in a document)
      -------  Heading 1
      ~~~~~~~  Heading 2
      +++++++  Heading 3
      '''''''  Heading 4

      Avoid deeper levels because they do not render well.

==================================
OVN Gateway High Availability Plan
==================================

::

    OVN Gateway

    +---------------------------+
    |                           |
    |     External Network      |
    |                           |
    +-------------^-------------+
                  |
                  |
            +-----------+
            |           |
            |  Gateway  |
            |           |
            +-----------+
                  ^
                  |
                  |
    +-------------v-------------+
    |                           |
    |    OVN Virtual Network    |
    |                           |
    +---------------------------+

The OVN gateway is responsible for shuffling traffic between the tunneled
overlay network (governed by ovn-northd), and the legacy physical network.  In
a naive implementation, the gateway is a single x86 server, or hardware VTEP.
For most deployments, a single system has enough forwarding capacity to
service the entire virtualized network; however, it introduces a single point
of failure.  If this system dies, the entire OVN deployment becomes
unavailable.  To mitigate this risk, an HA solution is critical -- by
spreading responsibility across multiple systems, no single server failure
can take down the network.

An HA solution is both critical to the manageability of the system and
extremely difficult to get right.  The purpose of this document is to propose
a plan for OVN Gateway High Availability which takes into account our past
experience building similar systems.  It should be considered a fluid,
changing proposal, not a set-in-stone decree.

.. note::
   This document describes a range of options OVN could take to provide
   high availability for gateways.  The current implementation provides L3
   gateway high availability by the "Router Specific Active/Backup"
   approach described in this document.

Basic Architecture
------------------

In an OVN deployment, the set of hypervisors and network elements operating
under the guidance of ovn-northd are in what's called "logical space".  These
servers use VXLAN, STT, or Geneve to communicate, oblivious to the details of
the underlying physical network.  When these systems need to communicate with
legacy networks, traffic must be routed through a Gateway which translates
from OVN controlled tunnel traffic to raw physical network traffic.

Since the gateway is typically the only system with a connection to the
physical network, all traffic between logical space and the WAN must travel
through it.  This makes it a critical single point of failure -- if the
gateway dies, communication with the WAN ceases for all systems in logical
space.

To mitigate this risk, multiple gateways should be run in a "High
Availability Cluster" or "HA Cluster".  The HA cluster will be responsible
for performing the duties of a gateway, while being able to recover
gracefully from individual member failures.

::

    OVN Gateway HA Cluster

             +---------------------------+
             |                           |
             |     External Network      |
             |                           |
             +-------------^-------------+
                           |
                           |
    +----------------------v----------------------+
    |                                             |
    |          High Availability Cluster          |
    |                                             |
    | +-----------+  +-----------+  +-----------+ |
    | |           |  |           |  |           | |
    | |  Gateway  |  |  Gateway  |  |  Gateway  | |
    | |           |  |           |  |           | |
    | +-----------+  +-----------+  +-----------+ |
    +----------------------^----------------------+
                           |
                           |
             +-------------v-------------+
             |                           |
             |    OVN Virtual Network    |
             |                           |
             +---------------------------+

L2 vs L3 High Availability
~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to achieve this goal, there are two broad approaches one can take.
The HA cluster can appear to the network like a giant Layer 2 Ethernet
Switch, or like a giant IP Router.  These approaches are called L2HA and
L3HA, respectively.  L2HA allows ethernet broadcast domains to extend into
logical space, a significant advantage, but this comes at a cost.  The need
to avoid transient L2 loops during failover significantly complicates their
design.  On the other hand, L3HA works for most use cases, is simpler, and
fails more gracefully.  For these reasons, it is suggested that OVN support
an L3HA model, leaving L2HA for future work (or third party VTEP providers).
Both models are discussed further below.

L3HA
----

In this section, we'll work through a basic L3HA implementation, on top of
which we'll gradually build more sophisticated features, explaining their
motivations and implementations as we go.

Naive active-backup
~~~~~~~~~~~~~~~~~~~

Let's assume that there is a collection of logical routers which a tenant has
asked for.  Our task is to schedule these logical routers on one of N
gateways, and gracefully redistribute the routers from gateways which have
failed.  The absolute simplest way to achieve this is what we'll call
"naive-active-backup".

::

    Naive Active Backup HA Implementation

    +----------------+   +----------------+
    |     Leader     |   |     Backup     |
    |                |   |                |
    |    A B C       |   |                |
    |                |   |                |
    +----+-+-+-+----++   +-+--------------+
         ^ ^ ^ ^    |      |
         | | | |    |      |
         | | | |  +-+------+---+
         + + + +  | ovn-northd |
          Traffic +------------+

In a naive active-backup, one of the Gateways is chosen (arbitrarily) as a
leader.  All logical routers (A, B, C in the figure) are scheduled on this
leader gateway and all traffic flows through it.  ovn-northd monitors this
gateway via OpenFlow echo requests (or some equivalent), and if the gateway
dies, it recreates the routers on one of the backups.

This approach basically works in most cases and should likely be the starting
point for OVN -- it's strictly better than no HA solution and is a good
foundation for more sophisticated solutions.  That said, it's not without its
limitations.  Specifically, this approach doesn't coordinate with the
physical network to minimize disruption during failures, it tightly couples
failover to ovn-northd (we'll discuss why this is bad in a bit), and it
wastes resources by leaving backup gateways completely unutilized.

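The monitor-and-recreate loop above can be sketched in a few lines.  This is
only an illustration of the control flow, not ovn-northd's actual code: the
``echo_ok`` health check and the ``placement`` structure are hypothetical
stand-ins for the real OpenFlow echo monitoring and router scheduling.

```python
def elect_leader(gateways, echo_ok):
    """Return the first gateway that answers its health probe."""
    for gw in gateways:
        if echo_ok(gw):
            return gw
    raise RuntimeError("no live gateway available")

def failover_step(gateways, routers, placement, echo_ok):
    """One iteration of the naive monitor loop: if the current leader is
    dead (or none has been chosen), recreate every logical router on the
    first live backup."""
    leader = placement.get("leader")
    if leader is None or not echo_ok(leader):
        leader = elect_leader(gateways, echo_ok)
        placement["leader"] = leader
        placement["routers"] = {r: leader for r in routers}
    return placement
```

Note how every router moves at once: the coarseness of this failover is
exactly the under-utilization problem discussed later in this document.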
Router Failover
+++++++++++++++

When ovn-northd notices the leader has died and decides to migrate routers to
a backup gateway, the physical network has to be notified to direct traffic
to the new gateway.  Otherwise, traffic could be blackholed for longer than
necessary, making failovers worse than they need to be.

For now, let's assume that OVN requires all gateways to be on the same IP
subnet on the physical network.  If this isn't the case, gateways would need
to participate in routing protocols to orchestrate failovers, something which
is difficult and out of scope of this document.

Since all gateways are on the same IP subnet, we simply need to worry about
updating the MAC learning tables of the Ethernet switches on that subnet.
Presumably, they all have entries for each logical router pointing to the old
leader.  If these entries aren't updated, all traffic will be sent to the
(now defunct) old leader, instead of the new one.

In order to mitigate this issue, it's recommended that the new gateway sends
a Reverse ARP (RARP) onto the physical network for each logical router it now
controls.  A Reverse ARP is a benign protocol used by many hypervisors to
update L2 forwarding tables when virtual machines migrate.  In this case, the
ethernet source address of the RARP is that of the logical router it
corresponds to, and its destination is the broadcast address.  This causes
the RARP to travel to every L2 switch in the broadcast domain, updating
forwarding tables accordingly.  This strategy is recommended in all failover
mechanisms discussed in this document -- when a router newly boots on a new
leader, it should RARP its MAC address.

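To make the frame layout concrete, here is a minimal sketch that builds such
a RARP by hand with the standard library (EtherType 0x8035, opcode 3
"request reverse" per RFC 903).  The router MAC is a made-up example, and how
the frame is actually injected (raw socket, OpenFlow packet-out, ...) is left
out.

```python
import struct

def build_rarp(router_mac: bytes) -> bytes:
    """Return an Ethernet frame carrying a RARP whose source is the logical
    router's MAC and whose destination is the broadcast address."""
    assert len(router_mac) == 6
    # Ethernet header: broadcast dst, router src, EtherType 0x8035 (RARP).
    eth = b"\xff" * 6 + router_mac + struct.pack("!H", 0x8035)
    # RARP payload: htype=1 (Ethernet), ptype=0x0800 (IPv4), hlen=6, plen=4,
    # op=3 (request reverse); sender/target hardware address are both the
    # router's MAC, protocol addresses left zeroed.
    rarp = (struct.pack("!HHBBH", 1, 0x0800, 6, 4, 3)
            + router_mac + b"\x00" * 4
            + router_mac + b"\x00" * 4)
    return eth + rarp

frame = build_rarp(bytes.fromhex("0a0000000001"))
```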
Controller Independent Active-backup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    Controller Independent Active-Backup Implementation

    +----------------+   +----------------+
    |     Leader     |   |     Backup     |
    |                |   |                |
    |    A B C       |   |                |
    |                |   |                |
    +----------------+   +----------------+
         ^ ^ ^ ^
         | | | |
         | | | |
         + + + +
          Traffic

The fundamental problem with naive active-backup is that it tightly couples
the failover solution to ovn-northd.  This can significantly increase
downtime in the event of a failover, as the (often already busy) ovn-northd
controller has to recompute state for the new leader.  Worse, if ovn-northd
goes down, we can't perform gateway failover at all.  This violates the
principle that control plane outages should have no impact on dataplane
functionality.

In a controller independent active-backup configuration, ovn-northd is
responsible for initial configuration while the HA cluster is responsible for
monitoring the leader, and failing over to a backup if necessary.
ovn-northd sets HA policy, but doesn't actively participate when failovers
occur.

Of course, in this model, ovn-northd is not without some responsibility.  Its
role is to pre-plan what should happen in the event of a failure, leaving it
to the individual switches to execute this plan.  It does this by assigning
each gateway a unique leadership priority.  Once assigned, it communicates
this priority to each node it controls.  Nodes use the leadership priority to
determine which gateway in the cluster is the active leader by using a simple
metric: the leader is the healthy gateway with the highest priority.  If that
gateway goes down, leadership falls to the next highest priority, and
conversely, if a new gateway comes up with a higher priority, it takes over
leadership.

Thus, in this model, leadership of the HA cluster is determined simply by the
status of its members.  Therefore, if we can communicate the status of each
gateway to each transport node, they can individually figure out which is the
leader, and direct traffic accordingly.

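The leadership metric is simple enough to state as code.  In this sketch the
function name and inputs are hypothetical: in OVN the priority table would
come from ovn-northd and the liveness set from tunnel monitoring, but the
decision each node makes independently is just this:

```python
def current_leader(priorities, live):
    """Pick the leader as seen from one transport node.

    priorities: dict mapping gateway name -> integer priority (higher wins).
    live: set of gateways whose tunnels currently appear up.
    Returns the healthy gateway with the highest priority, or None.
    """
    candidates = [gw for gw in priorities if gw in live]
    if not candidates:
        return None
    return max(candidates, key=lambda gw: priorities[gw])
```

Because every node runs the same deterministic computation over the same
inputs, they converge on the same leader without any coordination protocol.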
Tunnel Monitoring
+++++++++++++++++

Since, in this model, leadership is determined exclusively by the health
status of member gateways, a key problem is how to communicate this
information to the relevant transport nodes.  Luckily, we can do this fairly
cheaply using tunnel monitoring protocols like BFD.

The basic idea is pretty straightforward.  Each transport node maintains a
tunnel to every gateway in the HA cluster (not just the leader).  These
tunnels are monitored using the BFD protocol to see which are alive.  Given
this information, hypervisors can trivially compute the highest priority live
gateway, and thus the leader.

In practice, this leadership computation can be performed trivially using the
bundle or group action.  Rather than using OpenFlow to simply output to the
leader, all gateways could be listed in an active-backup bundle action,
ordered by their priority.  The bundle action will automatically take into
account the tunnel monitoring status to output the packet to the highest
priority live gateway.

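As a concrete illustration, a hypervisor flow of this kind might look as
follows.  The bridge name, flow match, and tunnel port numbers 10-12 are
hypothetical, and the member list is ordered by leadership priority
(``members:`` is spelled ``slaves:`` in older Open vSwitch releases)::

    ovs-ofctl add-flow br-int \
        "priority=100,metadata=0x1,
         actions=bundle(eth_src,0,active_backup,ofport,members:10,11,12)"

With the ``active_backup`` algorithm, the bundle action outputs to the first
listed member whose tunnel is up, so failover happens in the datapath without
any controller involvement.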
Inter-Gateway Monitoring
++++++++++++++++++++++++

One somewhat subtle aspect of this model is that failovers are not globally
atomic.  When a failover occurs, it will take some time for all hypervisors
to notice and adjust accordingly.  Similarly, if a new high priority Gateway
comes up, it may take some time for all hypervisors to switch over to the new
leader.  In order to avoid confusing the physical network, under these
circumstances it's important for the backup gateways to drop traffic they've
received erroneously.  In order to do this, each Gateway must know whether or
not it is, in fact, active.  This can be achieved by creating a mesh of
tunnels between gateways.  Each gateway monitors the other gateways in its
cluster to determine which are alive, and therefore whether or not it is
itself the leader.  If leading, the gateway forwards traffic normally;
otherwise, it drops all traffic.

We should note that this method works well under the assumption that there
are no inter-gateway connectivity failures; in such a case, this method would
fail to elect a single master.  The simplest example is two gateways which
stop seeing each other but can still reach the hypervisors.  Protocols like
VRRP or CARP have the same issue.  A mitigation for this type of failure mode
could be achieved by having all network elements (hypervisors and gateways)
periodically share their link status with other endpoints.

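The drop-unless-leading rule each gateway applies can be sketched as follows.
The function name and inputs are hypothetical; in practice the peer liveness
set would come from BFD over the inter-gateway tunnel mesh:

```python
def should_forward(self_gw, priorities, live_peers):
    """Return True if self_gw believes it is the leader, i.e. it sees no
    live peer whose leadership priority exceeds its own.  If it is not the
    leader, it must drop traffic it receives erroneously."""
    mine = priorities[self_gw]
    return all(priorities[peer] <= mine for peer in live_peers)
```

Note the split-brain hazard described above is visible here: two gateways
that cannot see each other both observe an empty (or incomplete) peer set and
both conclude they should forward.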
Gateway Leadership Resignation
++++++++++++++++++++++++++++++

Sometimes a gateway may be healthy, but still may not be suitable to lead the
HA cluster.  This could happen for several reasons including:

* The physical network is unreachable.

* BFD (or ping) has detected the next hop router is unreachable.

* The Gateway recently booted and isn't fully configured.

In this case, the Gateway should resign leadership by holding its tunnels
down using the ``other_config:cpath_down`` flag.  This indicates to
participating hypervisors and Gateways that this gateway should be treated as
if it's down, even though its tunnels are still healthy.

Router Specific Active-Backup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    Router Specific Active-Backup

    +----------------+   +----------------+
    |                |   |                |
    |      A C       |   |     B D E      |
    |                |   |                |
    +----------------+   +----------------+
         ^ ^ ^ ^
         | | | |
         | | | |
         + + + +
          Traffic

Controller independent active-backup is a great advance over naive
active-backup, but it still has one glaring problem -- it under-utilizes the
backup gateways.  In an ideal scenario, all traffic would split evenly among
the live set of gateways.  Getting all the way there is somewhat tricky, but
as a step in that direction, one could use the "Router Specific
Active-Backup" algorithm.  This algorithm looks a lot like active-backup on a
per logical router basis, with one twist: it chooses a different active
Gateway for each logical router.  Thus, in situations where there are several
logical routers, all with somewhat balanced load, this algorithm performs
better.

Implementation of this strategy is quite straightforward if built on top of
basic controller independent active-backup.  On a per logical router basis,
the algorithm is the same: leadership is determined by the liveness of the
gateways.  The key difference here is that the gateways must have a different
leadership priority for each logical router.  These leadership priorities can
be computed by ovn-northd just as they had been in the controller independent
active-backup model.

Once we have these per logical router priorities, they simply need to be
communicated to the members of the gateway cluster and the hypervisors.  The
hypervisors, in particular, simply need to have an active-backup bundle
action (or group action) per logical router, listing the gateways in priority
order for *that router*, rather than having a single bundle action shared for
all the routers.

Additionally, the gateways need to be updated to take into account individual
router priorities.  Specifically, each gateway should drop traffic of backup
routers it's running, and forward traffic of active routers, instead of
simply dropping or forwarding everything.  This should likely be done by
having ovn-controller recompute OpenFlow for the gateway, though other
options exist.

The final complication is that ovn-northd's logic must be updated to choose
these per logical router leadership priorities in a more sophisticated
manner.  It doesn't matter much exactly what algorithm it chooses to do this,
beyond that it should provide good balancing in the common case, i.e., each
logical router's priorities should be different enough that routers balance
to different gateways even when failures occur.

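Since the document leaves the priority-assignment algorithm open, here is one
simple candidate, sketched purely as an assumption: rotate the gateway list
by a stable hash of the router's name, so that different routers prefer
different gateways and remain spread out even after a failure.

```python
import zlib

def router_priorities(router, gateways):
    """Return a dict mapping gateway -> priority (higher preferred) for one
    logical router.  The rotation point is a stable hash of the router name,
    so the same inputs always yield the same preference order."""
    n = len(gateways)
    start = zlib.crc32(router.encode()) % n
    order = gateways[start:] + gateways[:start]   # rotated preference list
    return {gw: n - i for i, gw in enumerate(order)}
```

Any scheme with these two properties (deterministic, and well spread across
routers) would do; this one is merely easy to state.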
Preemption
++++++++++

In an active-backup setup, one issue that users will run into is that of
gateway leader preemption.  If a new Gateway is added to a cluster, or for
some reason an existing gateway is rebooted, we could end up in a situation
where the newly activated gateway has a higher priority than any other in
the HA cluster.  In this case, as soon as that gateway appears, it will
preempt leadership from the currently active leader, causing an unnecessary
failover.  Since failover can be quite expensive, this preemption may be
undesirable.

The controller can optionally avoid preemption by cleverly tweaking the
leadership priorities.  For each router, new gateways should be assigned
priorities that put them second in line or later when they eventually come
up.  Furthermore, if a gateway goes down for a significant period of time,
its old leadership priorities should be revoked and new ones should be
assigned as if it's a brand new gateway.  Note that this should only happen
if a gateway has been down for a while (several minutes); otherwise, a
flapping gateway could have wide-ranging, unpredictable consequences.

Note that preemption avoidance should be optional, depending on the
deployment.  One necessarily sacrifices optimal load balancing to satisfy
these requirements, as new gateways will get no traffic on boot.  Thus, this
feature represents a trade-off which must be made on a per installation
basis.

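One conservative way to implement the rule above, assuming integer
priorities: a gateway that is joining (or rejoining after a long outage) is
assigned a priority strictly below every live gateway, so it can never
preempt.  The function and its inputs are illustrative only:

```python
def joining_priority(current_priorities, live):
    """Pick a priority for a newly joining gateway that guarantees it does
    not preempt the current leader: strictly less than every live
    gateway's priority."""
    live_prios = [p for gw, p in current_priorities.items() if gw in live]
    if not live_prios:
        return 0            # first gateway up; any value works
    return min(live_prios) - 1
```

This is deliberately stricter than "second in line": slotting the newcomer
anywhere below the current leader would also avoid preemption, at the cost
of a slightly more involved assignment.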
Fully Active-Active HA
~~~~~~~~~~~~~~~~~~~~~~

::

    Fully Active-Active HA

    +----------------+   +----------------+
    |                |   |                |
    |   A B C D E    |   |   A B C D E    |
    |                |   |                |
    +----------------+   +----------------+
         ^ ^ ^ ^
         | | | |
         | | | |
         + + + +
          Traffic

The final step in L3HA is to have true active-active HA.  In this scenario,
each router has an instance on each Gateway, and a mechanism similar to ECMP
is used to distribute traffic evenly among all instances.  This mechanism
would require Gateways to participate in routing protocols with the physical
network to attract traffic and alert of failures.  It is out of scope of this
document, but may eventually be necessary.

L2HA
----

L2HA is very difficult to get right.  Unlike L3HA, where the consequences of
problems are minor, in L2HA, if two gateways are both transiently active, an
L2 loop triggers and a broadcast storm results.  In practice, to get around
this, gateways end up implementing an overly conservative "when in doubt,
drop all traffic" policy, or they implement something like MLAG.

MLAG has multiple gateways work together to pretend to be a single L2 switch
with a large LACP bond.  In principle, it's the right solution to the problem
as it solves the broadcast storm problem, and has been deployed successfully
in other contexts.  That said, it's difficult to get right and not
recommended.