1 ..
2 Licensed under the Apache License, Version 2.0 (the "License"); you may
3 not use this file except in compliance with the License. You may obtain
4 a copy of the License at
5
6 http://www.apache.org/licenses/LICENSE-2.0
7
8 Unless required by applicable law or agreed to in writing, software
9 distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
10 WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
11 License for the specific language governing permissions and limitations
12 under the License.
13
14 Convention for heading levels in Open vSwitch documentation:
15
16 ======= Heading 0 (reserved for the title in a document)
17 ------- Heading 1
18 ~~~~~~~ Heading 2
19 +++++++ Heading 3
20 ''''''' Heading 4
21
22 Avoid deeper levels because they do not render well.
23
24 ==================================
25 OVN Gateway High Availability Plan
26 ==================================
27
28 ::
29
30 OVN Gateway
31
32 +---------------------------+
33 | |
34 | External Network |
35 | |
36 +-------------^-------------+
37 |
38 |
39 +-----------+
40 | |
41 | Gateway |
42 | |
43 +-----------+
44 ^
45 |
46 |
47 +-------------v-------------+
48 | |
49 | OVN Virtual Network |
50 | |
51 +---------------------------+
52
The OVN gateway is responsible for shuffling traffic between the tunneled
overlay network (governed by ovn-northd) and the legacy physical network. In
a naive implementation, the gateway is a single x86 server or hardware VTEP.
For most deployments, a single system has enough forwarding capacity to
service the entire virtualized network; however, it introduces a single point
of failure. If this system dies, the entire OVN deployment becomes
unavailable. To mitigate this risk, an HA solution is critical -- by
spreading responsibility across multiple systems, no single server failure
can take down the network.
62
An HA solution is both critical to the manageability of the system and
extremely difficult to get right. The purpose of this document is to propose
a plan for OVN Gateway High Availability which takes into account our past
experience building similar systems. It should be considered a fluid,
changing proposal, not a set-in-stone decree.
68
69 .. note::
70 This document describes a range of options OVN could take to provide
71 high availability for gateways. The current implementation provides L3
72 gateway high availability by the "Router Specific Active/Backup"
73 approach described in this document.
74
75 Basic Architecture
76 ------------------
77
78 In an OVN deployment, the set of hypervisors and network elements operating
79 under the guidance of ovn-northd are in what's called "logical space". These
80 servers use VXLAN, STT, or Geneve to communicate, oblivious to the details of
81 the underlying physical network. When these systems need to communicate with
82 legacy networks, traffic must be routed through a Gateway which translates from
83 OVN controlled tunnel traffic, to raw physical network traffic.
84
Since the gateway is typically the only system with a connection to the
physical network, all traffic between logical space and the WAN must travel
through it. This makes it a critical single point of failure -- if the
gateway dies, communication with the WAN ceases for all systems in logical
space.
89
To mitigate this risk, multiple gateways should be run in a "High
Availability Cluster" or "HA Cluster". The HA cluster will be responsible for
performing the duties of a gateway, while being able to recover gracefully
from individual member failures.
94
95 ::
96
97 OVN Gateway HA Cluster
98
99 +---------------------------+
100 | |
101 | External Network |
102 | |
103 +-------------^-------------+
104 |
105 |
106 +----------------------v----------------------+
107 | |
108 | High Availability Cluster |
109 | |
110 | +-----------+ +-----------+ +-----------+ |
111 | | | | | | | |
112 | | Gateway | | Gateway | | Gateway | |
113 | | | | | | | |
114 | +-----------+ +-----------+ +-----------+ |
115 +----------------------^----------------------+
116 |
117 |
118 +-------------v-------------+
119 | |
120 | OVN Virtual Network |
121 | |
122 +---------------------------+
123
124 L2 vs L3 High Availability
125 ~~~~~~~~~~~~~~~~~~~~~~~~~~
126
In order to achieve this goal, there are two broad approaches one can take.
The HA cluster can appear to the network like a giant Layer 2 Ethernet
switch, or like a giant IP router. These approaches are called L2HA and
L3HA, respectively. L2HA allows Ethernet broadcast domains to extend into
logical space, a significant advantage, but this comes at a cost: the need
to avoid transient L2 loops during failover significantly complicates the
design. On the other hand, L3HA works for most use cases, is simpler, and
fails more gracefully. For these reasons, it is suggested that OVN support
an L3HA model, leaving L2HA for future work (or third party VTEP providers).
Both models are discussed further below.
137
138 L3HA
139 ----
140
In this section, we'll work through a basic L3HA implementation, on top of
which we'll gradually build more sophisticated features, explaining their
motivations and implementations as we go.
144
145 Naive active-backup
146 ~~~~~~~~~~~~~~~~~~~
147
Let's assume that there is a collection of logical routers which a tenant
has asked for. Our task is to schedule these logical routers on one of N
gateways, and to gracefully redistribute the routers from gateways which
have failed. The absolute simplest way to achieve this is what we'll call
"naive-active-backup".
152
153 ::
154
155 Naive Active Backup HA Implementation
156
157 +----------------+ +----------------+
158 | Leader | | Backup |
159 | | | |
160 | A B C | | |
161 | | | |
162 +----+-+-+-+----++ +-+--------------+
163 ^ ^ ^ ^ | |
164 | | | | | |
165 | | | | +-+------+---+
166 + + + + | ovn-northd |
167 Traffic +------------+
168
In a naive active-backup, one of the Gateways is chosen (arbitrarily) as a
leader. All logical routers (A, B, C in the figure) are scheduled on this
leader gateway and all traffic flows through it. ovn-northd monitors this
gateway via OpenFlow echo requests (or some equivalent), and if the gateway
dies, it recreates the routers on one of the backups.
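
A minimal sketch of this control loop, with hypothetical ``probe_gateway()``
and ``schedule_routers()`` helpers standing in for the real OpenFlow echo
handling and database updates ovn-northd would perform::

    import time

    PROBE_INTERVAL = 5      # seconds between liveness probes
    MAX_MISSED_PROBES = 3   # declare the leader dead after this many misses

    def monitor_gateways(gateways, probe_gateway, schedule_routers):
        """Naive active-backup: keep all routers on the first live gateway."""
        leader, missed = gateways[0], 0
        schedule_routers(leader)
        while True:
            time.sleep(PROBE_INTERVAL)
            if probe_gateway(leader):       # e.g. an OpenFlow echo round trip
                missed = 0
                continue
            missed += 1
            if missed >= MAX_MISSED_PROBES:
                # Leader presumed dead: recreate every router on a backup.
                backups = [gw for gw in gateways if gw != leader]
                leader, missed = backups[0], 0
                schedule_routers(leader)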
174
This approach basically works in most cases and should likely be the
starting point for OVN -- it's strictly better than no HA solution and is a
good foundation for more sophisticated solutions. That said, it's not
without its limitations. Specifically, this approach doesn't coordinate with
the physical network to minimize disruption during failures, it tightly
couples failover to ovn-northd (we'll discuss why this is bad in a bit), and
it wastes resources by leaving backup gateways completely unutilized.
182
183 Router Failover
184 +++++++++++++++
185
When ovn-northd notices the leader has died and decides to migrate routers
to a backup gateway, the physical network has to be notified to direct
traffic to the new gateway. Otherwise, traffic could be blackholed for
longer than necessary, making failovers worse than they need to be.
190
191 For now, let's assume that OVN requires all gateways to be on the same IP
192 subnet on the physical network. If this isn't the case, gateways would need to
193 participate in routing protocols to orchestrate failovers, something which is
194 difficult and out of scope of this document.
195
196 Since all gateways are on the same IP subnet, we simply need to worry about
197 updating the MAC learning tables of the Ethernet switches on that subnet.
198 Presumably, they all have entries for each logical router pointing to the old
199 leader. If these entries aren't updated, all traffic will be sent to the (now
200 defunct) old leader, instead of the new one.
201
In order to mitigate this issue, it's recommended that the new gateway send
a Reverse ARP (RARP) onto the physical network for each logical router it
now controls. A Reverse ARP is a benign protocol used by many hypervisors to
update L2 forwarding tables when virtual machines migrate. In this case, the
Ethernet source address of the RARP is that of the logical router it
corresponds to, and its destination is the broadcast address. This causes
the RARP to travel to every L2 switch in the broadcast domain, updating
forwarding tables accordingly. This strategy is recommended in all failover
mechanisms discussed in this document -- whenever a router becomes active on
a new leader, that gateway should RARP the router's MAC address.
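
As a concrete illustration, the sketch below builds such a RARP frame with
only the Python standard library and writes it to a NIC through a raw
socket. The interface name and router MAC are made-up examples, and a real
gateway would emit this from ovn-controller rather than a script::

    import socket
    import struct

    def send_rarp(ifname, router_mac):
        """Broadcast a RARP 'request reverse' sourced from router_mac so the
        physical switches relearn which port the router now sits behind."""
        mac = bytes.fromhex(router_mac.replace(":", ""))
        eth = b"\xff" * 6 + mac + struct.pack("!H", 0x8035)  # dst, src, EtherType=RARP
        rarp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 3)     # Ethernet/IPv4, opcode 3
        rarp += mac + b"\x00" * 4                            # sender MAC, IP unused
        rarp += mac + b"\x00" * 4                            # target MAC, IP unused
        with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as sock:
            sock.bind((ifname, 0))
            sock.send(eth + rarp)

    if __name__ == "__main__":
        # Hypothetical example: announce logical router A's MAC on eth1.
        send_rarp("eth1", "0a:00:00:00:00:01")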
212
213 Controller Independent Active-backup
214 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
215
216 ::
217
218 Controller Independent Active-Backup Implementation
219
220 +----------------+ +----------------+
221 | Leader | | Backup |
222 | | | |
223 | A B C | | |
224 | | | |
225 +----------------+ +----------------+
226 ^ ^ ^ ^
227 | | | |
228 | | | |
229 + + + +
230 Traffic
231
The fundamental problem with naive active-backup is that it tightly couples
the failover solution to ovn-northd. This can significantly increase
downtime in the event of a failover, as the (often already busy) ovn-northd
controller has to recompute state for the new leader. Worse, if ovn-northd
goes down, we can't perform gateway failover at all. This violates the
principle that control plane outages should have no impact on dataplane
functionality.
238
239 In a controller independent active-backup configuration, ovn-northd is
240 responsible for initial configuration while the HA cluster is responsible for
241 monitoring the leader, and failing over to a backup if necessary. ovn-northd
242 sets HA policy, but doesn't actively participate when failovers occur.
243
Of course, in this model, ovn-northd is not without some responsibility.
Its role is to pre-plan what should happen in the event of a failure,
leaving it to the individual switches to execute this plan. It does this by
assigning each gateway a unique leadership priority. Once assigned, it
communicates this priority to each node it controls. Nodes use the
leadership priority to determine which gateway in the cluster is the active
leader using a simple metric: the leader is the healthy gateway with the
highest priority. If that gateway goes down, leadership falls to the next
highest priority; conversely, if a new gateway comes up with a higher
priority, it takes over leadership.
254
Thus, in this model, leadership of the HA cluster is determined simply by
the status of its members. Therefore, if we can communicate the status of
each gateway to each transport node, they can individually figure out which
is the leader and direct traffic accordingly.
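
The per-node computation is tiny. Here is a sketch, assuming each node
already knows every gateway's ovn-northd-assigned priority and its current
liveness (how liveness is learned is the subject of the next section)::

    def elect_leader(gateways):
        """Pick the live gateway with the highest leadership priority.

        ``gateways`` maps gateway name -> (priority, is_alive); returns
        None if no gateway is currently reachable.
        """
        live = {name: prio for name, (prio, alive) in gateways.items() if alive}
        return max(live, key=live.get) if live else None

    # Example: gw2 leads; if it also died, leadership would fall to gw1.
    print(elect_leader({"gw1": (10, True),
                        "gw2": (20, True),
                        "gw3": (30, False)}))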
259
260 Tunnel Monitoring
261 +++++++++++++++++
262
Since in this model leadership is determined exclusively by the health
status of member gateways, a key problem is how to communicate this
information to the relevant transport nodes. Luckily, we can do this fairly
cheaply using tunnel monitoring protocols like BFD.
267
268 The basic idea is pretty straightforward. Each transport node maintains a
269 tunnel to every gateway in the HA cluster (not just the leader). These tunnels
270 are monitored using the BFD protocol to see which are alive. Given this
271 information, hypervisors can trivially compute the highest priority live
272 gateway, and thus the leader.
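
For example, assuming standard OVS tunnel interfaces, BFD can be enabled per
tunnel and its verdict read back from the ``bfd_status`` column. The sketch
below shells out to ``ovs-vsctl`` and treats any error or a non-forwarding
status as "dead"; the tunnel interface names are illustrative::

    import subprocess

    def enable_bfd(tunnel_iface):
        # Ask OVS to run BFD over this tunnel.
        subprocess.check_call(
            ["ovs-vsctl", "set", "Interface", tunnel_iface, "bfd:enable=true"])

    def tunnel_is_live(tunnel_iface):
        """Return True if BFD currently reports the tunnel as forwarding."""
        try:
            out = subprocess.check_output(
                ["ovs-vsctl", "get", "Interface", tunnel_iface,
                 "bfd_status:forwarding"], text=True)
        except subprocess.CalledProcessError:
            return False        # no BFD status yet; treat the tunnel as dead
        return out.strip().strip('"') == "true"

    # One tunnel per gateway in the HA cluster (illustrative names).
    for tun in ("tun-gw1", "tun-gw2", "tun-gw3"):
        enable_bfd(tun)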
273
In practice, this leadership computation can be performed trivially using
the bundle or group action. Rather than using OpenFlow to simply output to
the leader, all gateways could be listed in an active-backup bundle action,
ordered by their priority. The bundle action will automatically take the
tunnel monitoring status into account and output the packet to the highest
priority live gateway.
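
A sketch of how a hypervisor's flow could be generated, assuming the tunnels
to the gateways are OpenFlow ports 10, 11 and 12 in descending priority
order. The match, port numbers and subnet are illustrative; the ``bundle``
action and its ``active_backup`` algorithm are documented in ``ovs-actions``,
where the port list keyword has historically been spelled ``slaves:``::

    def active_backup_bundle(ports_by_priority):
        """Build an OpenFlow bundle action that outputs to the first live
        port in the given priority-ordered list."""
        members = ",".join(str(p) for p in ports_by_priority)
        return "bundle(eth_src,0,active_backup,ofport,slaves:%s)" % members

    # Traffic bound for the gateway cluster, gateways in priority order.
    flow = ("priority=100,ip,nw_dst=203.0.113.0/24,actions="
            + active_backup_bundle([10, 11, 12]))
    print(flow)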
280
281 Inter-Gateway Monitoring
282 ++++++++++++++++++++++++
283
One somewhat subtle aspect of this model is that failovers are not globally
atomic. When a failover occurs, it will take some time for all hypervisors
to notice and adjust accordingly. Similarly, if a new high priority Gateway
comes up, it may take some time for all hypervisors to switch over to the
new leader. In order to avoid confusing the physical network, under these
circumstances it's important for the backup gateways to drop traffic they've
received erroneously. In order to do this, each Gateway must know whether or
not it is, in fact, active. This can be achieved by creating a mesh of
tunnels between gateways. Each gateway monitors the other gateways in its
cluster to determine which are alive, and therefore whether or not it itself
happens to be the leader. If leading, the gateway forwards traffic normally;
otherwise it drops all traffic.
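
From a gateway's point of view the decision mirrors the hypervisors' leader
election, with the twist that a gateway always counts itself as alive. A
sketch, reusing the priority/liveness data gathered over the inter-gateway
tunnel mesh::

    def should_forward(self_name, cluster):
        """Forward only if this gateway believes it is the current leader.

        ``cluster`` maps gateway name -> (priority, is_alive), where
        liveness comes from BFD on the inter-gateway tunnels.
        """
        live = {name: prio for name, (prio, alive) in cluster.items()
                if alive or name == self_name}
        return max(live, key=live.get) == self_name

    # gw2 sees higher-priority gw3 as down, so it forwards; gw1 drops.
    cluster = {"gw1": (10, True), "gw2": (20, True), "gw3": (30, False)}
    print(should_forward("gw2", cluster))   # True
    print(should_forward("gw1", cluster))   # False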
296
We should note that this method works well only under the assumption that
there are no inter-gateway connectivity failures; when such failures occur,
it can fail to elect a single master. The simplest example is two gateways
which stop seeing each other but can still reach the hypervisors. Protocols
like VRRP or CARP have the same issue. This type of failure mode could be
mitigated by having all network elements (hypervisors and gateways)
periodically share their link status with other endpoints.
304
305 Gateway Leadership Resignation
306 ++++++++++++++++++++++++++++++
307
Sometimes a gateway may be healthy, but still may not be suitable to lead
the HA cluster. This could happen for several reasons, including:
310
311 * The physical network is unreachable
312
313 * BFD (or ping) has detected the next hop router is unreachable
314
315 * The Gateway recently booted and isn't fully configured
316
317 In this case, the Gateway should resign leadership by holding its tunnels down
318 using the ``other_config:cpath_down`` flag. This indicates to participating
319 hypervisors and Gateways that this gateway should be treated as if it's down,
320 even though its tunnels are still healthy.
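
A sketch of how a gateway might resign, taking the document's
``other_config:cpath_down`` flag at face value and toggling it on every
tunnel interface the gateway terminates (the interface names are
illustrative)::

    import subprocess

    def set_resigned(tunnel_ifaces, resigned):
        """Mark this gateway's tunnels down-for-leadership (or restore them)."""
        value = "true" if resigned else "false"
        for iface in tunnel_ifaces:
            subprocess.check_call(
                ["ovs-vsctl", "set", "Interface", iface,
                 "other_config:cpath_down=%s" % value])

    # e.g. resign while the physical next hop is unreachable.
    set_resigned(["tun-hv1", "tun-hv2", "tun-gw2"], True)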
321
322 Router Specific Active-Backup
323 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
324
325 ::
326
327 Router Specific Active-Backup
328
329 +----------------+ +----------------+
330 | | | |
331 | A C | | B D E |
332 | | | |
333 +----------------+ +----------------+
334 ^ ^ ^ ^
335 | | | |
336 | | | |
337 + + + +
338 Traffic
339
Controller independent active-backup is a great advance over naive
active-backup, but it still has one glaring problem -- it under-utilizes the
backup gateways. In an ideal scenario, all traffic would be split evenly
among the live set of gateways. Getting all the way there is somewhat
tricky, but as a step in that direction, one could use the "Router Specific
Active-Backup" algorithm. This algorithm looks a lot like active-backup on a
per logical router basis, with one twist: it chooses a different active
Gateway for each logical router. Thus, in situations where there are several
logical routers, all with somewhat balanced load, this algorithm performs
better.
349
Implementation of this strategy is quite straightforward if built on top of
basic controller independent active-backup. On a per logical router basis,
the algorithm is the same: leadership is determined by the liveness of the
gateways. The key difference here is that the gateways must have a different
leadership priority for each logical router. These leadership priorities can
be computed by ovn-northd just as they had been in the controller
independent active-backup model.
357
Once we have these per logical router priorities, they simply need to be
communicated to the members of the gateway cluster and the hypervisors. The
hypervisors, in particular, simply need an active-backup bundle action (or
group action) per logical router listing the gateways in priority order for
*that router*, rather than having a single bundle action shared for all the
routers.
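
Extending the earlier hypervisor sketch, the only change is that the
bundle's port ordering now comes from a per-router priority table. The
router MACs, port numbers and flow match below are illustrative::

    def per_router_bundles(router_priorities):
        """Yield one active-backup bundle flow per logical router.

        ``router_priorities`` maps a router's MAC to its gateway tunnel
        ports listed from highest to lowest priority for *that* router.
        """
        for mac, ports in router_priorities.items():
            members = ",".join(str(p) for p in ports)
            yield ("priority=100,dl_dst=%s,actions="
                   "bundle(eth_src,0,active_backup,ofport,slaves:%s)"
                   % (mac, members))

    # Routers A and B prefer different gateways, so load spreads over both.
    for flow in per_router_bundles({"0a:00:00:00:00:01": [10, 11],
                                    "0a:00:00:00:00:02": [11, 10]}):
        print(flow)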
364
Additionally, the gateways need to be updated to take individual router
priorities into account. Specifically, each gateway should drop traffic for
the routers it is merely backing up, and forward traffic for the routers it
is active for, instead of simply dropping or forwarding everything. This
should likely be done by having ovn-controller recompute OpenFlow for the
gateway, though other options exist.
370
The final complication is that ovn-northd's logic must be updated to choose
these per logical router leadership priorities in a more sophisticated
manner. It doesn't matter much exactly what algorithm it chooses, beyond
that it should provide good balancing in the common case, i.e. each logical
router's priorities should be different enough that routers balance to
different gateways even when failures occur.
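
One adequate scheme, sketched below, is to rotate a fixed gateway ordering
by a stable per-router hash, so that different routers prefer different
gateways and also fail over to different backups::

    import hashlib

    def priorities_for_router(router_name, gateways):
        """Return the gateways in leadership-priority order for one router."""
        digest = hashlib.sha256(router_name.encode()).digest()
        offset = digest[0] % len(gateways)      # stable per-router rotation
        return gateways[offset:] + gateways[:offset]

    gateways = ["gw1", "gw2", "gw3"]
    for router in ("A", "B", "C", "D", "E"):
        print(router, priorities_for_router(router, gateways))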
377
378 Preemption
379 ++++++++++
380
In an active-backup setup, one issue that users will run into is that of
gateway leader preemption. If a new Gateway is added to a cluster, or for
some reason an existing gateway is rebooted, we could end up in a situation
where the newly activated gateway has a higher priority than any other in
the HA cluster. In this case, as soon as that gateway appears, it will
preempt leadership from the currently active leader, causing an unnecessary
failover. Since failover can be quite expensive, this preemption may be
undesirable.
388
The controller can optionally avoid preemption by cleverly tweaking the
leadership priorities. For each router, new gateways should be assigned
priorities that put them second in line or later when they eventually come
up. Furthermore, if a gateway goes down for a significant period of time,
its old leadership priorities should be revoked and new ones assigned as if
it were a brand new gateway. Note that this should only happen if a gateway
has been down for a while (several minutes); otherwise a flapping gateway
could have wide-ranging, unpredictable consequences.
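
A sketch of such a non-preemptive assignment: a gateway that is new, or that
has been down long enough to lose its old priorities, is slotted in just
below the current leader rather than above it. The gateway names are
arbitrary::

    def insert_without_preempting(priority_order, current_leader, new_gateway):
        """Place new_gateway second in line behind the current leader.

        ``priority_order`` lists gateways from highest to lowest priority;
        the returned ordering lets the incoming gateway join without
        preempting the incumbent.
        """
        order = [gw for gw in priority_order if gw != new_gateway]
        leader_idx = order.index(current_leader)
        order.insert(leader_idx + 1, new_gateway)
        return order

    # gw3 rejoins after a long outage; gw1 keeps leading, gw3 is next in line.
    print(insert_without_preempting(["gw3", "gw1", "gw2"], "gw1", "gw3"))
    # -> ['gw1', 'gw3', 'gw2']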
397
Note that preemption avoidance should be optional, depending on the
deployment. One necessarily sacrifices optimal load balancing to satisfy
these requirements, as new gateways will get no traffic on boot. Thus, this
feature represents a trade-off which must be made on a per installation
basis.
402
403 Fully Active-Active HA
404 ~~~~~~~~~~~~~~~~~~~~~~
405
406 ::
407
408 Fully Active-Active HA
409
410 +----------------+ +----------------+
411 | | | |
412 | A B C D E | | A B C D E |
413 | | | |
414 +----------------+ +----------------+
415 ^ ^ ^ ^
416 | | | |
417 | | | |
418 + + + +
419 Traffic
420
The final step in L3HA is to have true active-active HA. In this scenario,
each router has an instance on each Gateway, and a mechanism similar to ECMP
is used to distribute traffic evenly among all instances. This mechanism
would require Gateways to participate in routing protocols with the physical
network to attract traffic and to signal failures. It is out of scope of
this document, but may eventually be necessary.
427
428 L2HA
429 ----
430
L2HA is very difficult to get right. Unlike L3HA, where the consequences of
problems are minor, in L2HA two transiently active gateways trigger an L2
loop and a broadcast storm results. In practice, to get around this,
gateways end up implementing an overly conservative "when in doubt drop all
traffic" policy, or they implement something like MLAG.
436
MLAG has multiple gateways work together to pretend to be a single L2
switch with a large LACP bond. In principle, it's the right solution to the
problem as it solves the broadcast storm problem and has been deployed
successfully in other contexts. That said, it's difficult to get right and
is not recommended.