1 <?xml version="1.0" encoding="utf-8"?>
2 <manpage program="ovn-architecture" section="7" title="OVN Architecture">
3 <h1>Name</h1>
4 <p>ovn-architecture -- Open Virtual Network architecture</p>
5
6 <h1>Description</h1>
7
8 <p>
9 OVN, the Open Virtual Network, is a system to support virtual network
10 abstraction. OVN complements the existing capabilities of OVS to add
11 native support for virtual network abstractions, such as virtual L2 and L3
12 overlays and security groups. Services such as DHCP are also desirable
13 features. Just like OVS, OVN's design goal is to have a production-quality
14 implementation that can operate at significant scale.
15 </p>
16
17 <p>
18 An OVN deployment consists of several components:
19 </p>
20
21 <ul>
22 <li>
23 <p>
24 A <dfn>Cloud Management System</dfn> (<dfn>CMS</dfn>), which is
25 OVN's ultimate client (via its users and administrators). OVN
26 integration requires installing a CMS-specific plugin and
27 related software (see below). OVN initially targets OpenStack
28 as CMS.
29 </p>
30
31 <p>
32 We generally speak of ``the'' CMS, but one can imagine scenarios in
33 which multiple CMSes manage different parts of an OVN deployment.
34 </p>
35 </li>
36
37 <li>
38 An OVN Database physical or virtual node (or, eventually, cluster)
39 installed in a central location.
40 </li>
41
42 <li>
43 One or more (usually many) <dfn>hypervisors</dfn>. Hypervisors must run
44 Open vSwitch and implement the interface described in
45 <code>IntegrationGuide.rst</code> in the OVS source tree. Any hypervisor
46 platform supported by Open vSwitch is acceptable.
47 </li>
48
49 <li>
50 <p>
51 Zero or more <dfn>gateways</dfn>. A gateway extends a tunnel-based
52 logical network into a physical network by bidirectionally forwarding
53 packets between tunnels and a physical Ethernet port. This allows
54 non-virtualized machines to participate in logical networks. A gateway
55 may be a physical host, a virtual machine, or an ASIC-based hardware
56 switch that supports the <code>vtep</code>(5) schema.
57 </p>
58
59 <p>
60         Hypervisors and gateways are together called <dfn>transport nodes</dfn>
61         or <dfn>chassis</dfn>.
62 </p>
63 </li>
64 </ul>
65
66 <p>
67 The diagram below shows how the major components of OVN and related
68 software interact. Starting at the top of the diagram, we have:
69 </p>
70
71 <ul>
72 <li>
73 The Cloud Management System, as defined above.
74 </li>
75
76 <li>
77 <p>
78 The <dfn>OVN/CMS Plugin</dfn> is the component of the CMS that
79 interfaces to OVN. In OpenStack, this is a Neutron plugin.
80 The plugin's main purpose is to translate the CMS's notion of logical
81 network configuration, stored in the CMS's configuration database in a
82 CMS-specific format, into an intermediate representation understood by
83 OVN.
84 </p>
85
86 <p>
87 This component is necessarily CMS-specific, so a new plugin needs to be
88 developed for each CMS that is integrated with OVN. All of the
89 components below this one in the diagram are CMS-independent.
90 </p>
91 </li>
92
93 <li>
94 <p>
95 The <dfn>OVN Northbound Database</dfn> receives the intermediate
96 representation of logical network configuration passed down by the
97 OVN/CMS Plugin. The database schema is meant to be ``impedance
98 matched'' with the concepts used in a CMS, so that it directly supports
99 notions of logical switches, routers, ACLs, and so on. See
100 <code>ovn-nb</code>(5) for details.
101 </p>
102
103 <p>
104 The OVN Northbound Database has only two clients: the OVN/CMS Plugin
105 above it and <code>ovn-northd</code> below it.
106 </p>
107 </li>
108
109 <li>
110 <code>ovn-northd</code>(8) connects to the OVN Northbound Database
111 above it and the OVN Southbound Database below it. It translates the
112 logical network configuration in terms of conventional network
113 concepts, taken from the OVN Northbound Database, into logical
114 datapath flows in the OVN Southbound Database below it.
115 </li>
116
117 <li>
118 <p>
119 The <dfn>OVN Southbound Database</dfn> is the center of the system.
120 Its clients are <code>ovn-northd</code>(8) above it and
121 <code>ovn-controller</code>(8) on every transport node below it.
122 </p>
123
124 <p>
125 The OVN Southbound Database contains three kinds of data: <dfn>Physical
126 Network</dfn> (PN) tables that specify how to reach hypervisor and
127 other nodes, <dfn>Logical Network</dfn> (LN) tables that describe the
128 logical network in terms of ``logical datapath flows,'' and
129 <dfn>Binding</dfn> tables that link logical network components'
130 locations to the physical network. The hypervisors populate the PN and
131 Port_Binding tables, whereas <code>ovn-northd</code>(8) populates the
132 LN tables.
133 </p>
134
135 <p>
136 OVN Southbound Database performance must scale with the number of
137 transport nodes. This will likely require some work on
138 <code>ovsdb-server</code>(1) as we encounter bottlenecks.
139 Clustering for availability may be needed.
140 </p>
141 </li>
142 </ul>
143
144 <p>
145 The remaining components are replicated onto each hypervisor:
146 </p>
147
148 <ul>
149 <li>
150 <code>ovn-controller</code>(8) is OVN's agent on each hypervisor and
151 software gateway. Northbound, it connects to the OVN Southbound
152 Database to learn about OVN configuration and status and to
153       populate the PN table and the <code>Chassis</code> column in the
154       <code>Binding</code> table with the hypervisor's status.
155 Southbound, it connects to <code>ovs-vswitchd</code>(8) as an
156 OpenFlow controller, for control over network traffic, and to the
157 local <code>ovsdb-server</code>(1) to allow it to monitor and
158 control Open vSwitch configuration.
159 </li>
160
161 <li>
162 <code>ovs-vswitchd</code>(8) and <code>ovsdb-server</code>(1) are
163 conventional components of Open vSwitch.
164 </li>
165 </ul>
166
167 <pre fixed="yes">
168                                   CMS
169                                    |
170                                    |
171                        +-----------|-----------+
172                        |           |           |
173                        |     OVN/CMS Plugin    |
174                        |           |           |
175                        |           |           |
176                        |   OVN Northbound DB   |
177                        |           |           |
178                        |           |           |
179                        |       ovn-northd      |
180                        |           |           |
181                        +-----------|-----------+
182                                    |
183                                    |
184                          +-------------------+
185                          | OVN Southbound DB |
186                          +-------------------+
187                                    |
188                                    |
189                 +------------------+------------------+
190                 |                  |                  |
191               HV 1                 |                HV n
192 +---------------|---------------+  .  +---------------|---------------+
193 |               |               |  .  |               |               |
194 |        ovn-controller         |  .  |        ovn-controller         |
195 |         |          |          |  .  |         |          |          |
196 |         |          |          |     |         |          |          |
197 |  ovs-vswitchd   ovsdb-server  |     |  ovs-vswitchd   ovsdb-server  |
198 |                               |     |                               |
199 +-------------------------------+     +-------------------------------+
200 </pre>
201
202 <h2>Information Flow in OVN</h2>
203
204 <p>
205 Configuration data in OVN flows from north to south. The CMS, through its
206 OVN/CMS plugin, passes the logical network configuration to
207 <code>ovn-northd</code> via the northbound database. In turn,
208 <code>ovn-northd</code> compiles the configuration into a lower-level form
209 and passes it to all of the chassis via the southbound database.
210 </p>
211
212 <p>
213 Status information in OVN flows from south to north. OVN currently
214 provides only a few forms of status information. First,
215 <code>ovn-northd</code> populates the <code>up</code> column in the
216 northbound <code>Logical_Switch_Port</code> table: if a logical port's
217 <code>chassis</code> column in the southbound <code>Port_Binding</code>
218 table is nonempty, it sets <code>up</code> to <code>true</code>, otherwise
219 to <code>false</code>. This allows the CMS to detect when a VM's
220 networking has come up.
221 </p>
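  <p>
    As a minimal sketch of how this status can be consumed, an observer with
    access to the northbound database may read the column back directly
    (the port name <code>vif-id</code> below is illustrative):
  </p>

  <pre fixed="yes">
$ ovn-nbctl --bare --columns=up list Logical_Switch_Port vif-id
  </pre>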
222
223 <p>
224 Second, OVN provides feedback to the CMS on the realization of its
225 configuration, that is, whether the configuration provided by the CMS has
226 taken effect. This feature requires the CMS to participate in a sequence
227 number protocol, which works the following way:
228 </p>
229
230 <ol>
231 <li>
232 When the CMS updates the configuration in the northbound database, as
233 part of the same transaction, it increments the value of the
234 <code>nb_cfg</code> column in the <code>NB_Global</code> table. (This is
235 only necessary if the CMS wants to know when the configuration has been
236 realized.)
237 </li>
238
239 <li>
240 When <code>ovn-northd</code> updates the southbound database based on a
241 given snapshot of the northbound database, it copies <code>nb_cfg</code>
242 from northbound <code>NB_Global</code> into the southbound database
243 <code>SB_Global</code> table, as part of the same transaction. (Thus, an
244 observer monitoring both databases can determine when the southbound
245 database is caught up with the northbound.)
246 </li>
247
248 <li>
249 After <code>ovn-northd</code> receives confirmation from the southbound
250 database server that its changes have committed, it updates
251 <code>sb_cfg</code> in the northbound <code>NB_Global</code> table to the
252 <code>nb_cfg</code> version that was pushed down. (Thus, the CMS or
253 another observer can determine when the southbound database is caught up
254 without a connection to the southbound database.)
255 </li>
256
257 <li>
258 The <code>ovn-controller</code> process on each chassis receives the
259 updated southbound database, with the updated <code>nb_cfg</code>. This
260 process in turn updates the physical flows installed in the chassis's
261 Open vSwitch instances. When it receives confirmation from Open vSwitch
262 that the physical flows have been updated, it updates <code>nb_cfg</code>
263 in its own <code>Chassis</code> record in the southbound database.
264 </li>
265
266 <li>
267 <code>ovn-northd</code> monitors the <code>nb_cfg</code> column in all of
268 the <code>Chassis</code> records in the southbound database. It keeps
269 track of the minimum value among all the records and copies it into the
270 <code>hv_cfg</code> column in the northbound <code>NB_Global</code>
271 table. (Thus, the CMS or another observer can determine when all of the
272 hypervisors have caught up to the northbound configuration.)
273 </li>
274 </ol>
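  <p>
    As a sketch of how a CMS or another observer might use this protocol, the
    three counters can be read back together; the configuration submitted with
    a given <code>nb_cfg</code> value has been realized on every hypervisor
    once <code>hv_cfg</code> reaches that value:
  </p>

  <pre fixed="yes">
$ ovn-nbctl --bare --columns=nb_cfg,sb_cfg,hv_cfg list NB_Global
  </pre>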
275
276 <h2>Chassis Setup</h2>
277
278 <p>
279 Each chassis in an OVN deployment must be configured with an Open vSwitch
280 bridge dedicated for OVN's use, called the <dfn>integration bridge</dfn>.
281 System startup scripts may create this bridge prior to starting
282 <code>ovn-controller</code> if desired. If this bridge does not exist when
283 ovn-controller starts, it will be created automatically with the default
284 configuration suggested below. The ports on the integration bridge include:
285 </p>
286
287 <ul>
288 <li>
289 On any chassis, tunnel ports that OVN uses to maintain logical network
290 connectivity. <code>ovn-controller</code> adds, updates, and removes
291 these tunnel ports.
292 </li>
293
294 <li>
295 On a hypervisor, any VIFs that are to be attached to logical networks.
296 The hypervisor itself, or the integration between Open vSwitch and the
297 hypervisor (described in <code>IntegrationGuide.rst</code>) takes care of
298 this. (This is not part of OVN or new to OVN; this is pre-existing
299 integration work that has already been done on hypervisors that support
300 OVS.)
301 </li>
302
303 <li>
304 On a gateway, the physical port used for logical network connectivity.
305 System startup scripts add this port to the bridge prior to starting
306 <code>ovn-controller</code>. This can be a patch port to another bridge,
307 instead of a physical port, in more sophisticated setups.
308 </li>
309 </ul>
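  <p>
    The tunnel ports in the first bullet above are created by
    <code>ovn-controller</code> based on a few <code>external-ids</code> keys
    in the local Open vSwitch database, described in
    <code>ovn-controller</code>(8) and typically set when the chassis is
    provisioned.  A minimal sketch (the addresses are illustrative):
  </p>

  <pre fixed="yes">
$ ovs-vsctl set Open_vSwitch . \
    external-ids:ovn-remote="tcp:192.0.2.10:6642" \
    external-ids:ovn-encap-type=geneve \
    external-ids:ovn-encap-ip=192.0.2.21
  </pre>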
310
311 <p>
312 Other ports should not be attached to the integration bridge. In
313 particular, physical ports attached to the underlay network (as opposed to
314 gateway ports, which are physical ports attached to logical networks) must
315 not be attached to the integration bridge. Underlay physical ports should
316 instead be attached to a separate Open vSwitch bridge (they need not be
317 attached to any bridge at all, in fact).
318 </p>
319
320 <p>
321 The integration bridge should be configured as described below.
322 The effect of each of these settings is documented in
323 <code>ovs-vswitchd.conf.db</code>(5):
324 </p>
325
326 <!-- Keep the following in sync with create_br_int() in
327 ovn/controller/ovn-controller.c. -->
328 <dl>
329 <dt><code>fail-mode=secure</code></dt>
330 <dd>
331 Avoids switching packets between isolated logical networks before
332 <code>ovn-controller</code> starts up. See <code>Controller Failure
333 Settings</code> in <code>ovs-vsctl</code>(8) for more information.
334 </dd>
335
336 <dt><code>other-config:disable-in-band=true</code></dt>
337 <dd>
338 Suppresses in-band control flows for the integration bridge. It would be
339 unusual for such flows to show up anyway, because OVN uses a local
340 controller (over a Unix domain socket) instead of a remote controller.
341 It's possible, however, for some other bridge in the same system to have
342 an in-band remote controller, and in that case this suppresses the flows
343 that in-band control would ordinarily set up. Refer to the documentation
344 for more information.
345 </dd>
346 </dl>
347
348 <p>
349 The customary name for the integration bridge is <code>br-int</code>, but
350 another name may be used.
351 </p>
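  <p>
    For instance, a startup script that chooses to create the integration
    bridge itself, rather than letting <code>ovn-controller</code> create it,
    might apply the settings above with a command along these lines:
  </p>

  <pre fixed="yes">
$ ovs-vsctl --may-exist add-br br-int \
    -- set Bridge br-int fail-mode=secure other-config:disable-in-band=true
  </pre>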
352
353 <h2>Logical Networks</h2>
354
355 <p>
356     A <dfn>logical network</dfn> implements the same concepts as a physical
357     network, but it is insulated from the physical network by tunnels or
358     other encapsulations. This allows a logical network to have IP and
359     other address spaces that overlap, without conflict, with those used by
360     physical networks. Logical network topologies can be arranged without
361     regard for the topologies of the physical networks on which they run.
362 </p>
363
364 <p>
365 Logical network concepts in OVN include:
366 </p>
367
368 <ul>
369 <li>
370 <dfn>Logical switches</dfn>, the logical version of Ethernet switches.
371 </li>
372
373 <li>
374 <dfn>Logical routers</dfn>, the logical version of IP routers. Logical
375 switches and routers can be connected into sophisticated topologies.
376 </li>
377
378 <li>
379 <dfn>Logical datapaths</dfn> are the logical version of an OpenFlow
380 switch. Logical switches and routers are both implemented as logical
381 datapaths.
382 </li>
383
384 <li>
385 <p>
386 <dfn>Logical ports</dfn> represent the points of connectivity in and
387 out of logical switches and logical routers. Some common types of
388 logical ports are:
389 </p>
390
391 <ul>
392 <li>
393 Logical ports representing VIFs.
394 </li>
395
396 <li>
397 <dfn>Localnet ports</dfn> represent the points of connectivity
398 between logical switches and the physical network. They are
399 implemented as OVS patch ports between the integration bridge
400 and the separate Open vSwitch bridge that underlay physical
401 ports attach to.
402 </li>
403
404 <li>
405 <dfn>Logical patch ports</dfn> represent the points of
406 connectivity between logical switches and logical routers, and
407 in some cases between peer logical routers. There is a pair of
408 logical patch ports at each such point of connectivity, one on
409 each side.
410 </li>
411 <li>
412 <dfn>Localport ports</dfn> represent the points of local
413 connectivity between logical switches and VIFs. These ports are
414 present in every chassis (not bound to any particular one) and
415           traffic from them will never go through a tunnel. A
416           <code>localport</code> is expected to generate only traffic destined
417           for a local destination, typically in response to a request it
418           received.
419           One use case is how OpenStack Neutron uses a <code>localport</code>
420           port to serve metadata to VMs residing on every hypervisor. A
421           metadata proxy process is attached to this port on every host, and all
422           VMs within the same network reach it at the same IP/MAC address
423           without any traffic being sent over a tunnel. Further details are
424           available at https://docs.openstack.org/developer/networking-ovn/design/metadata_api.html.
425 </li>
426 </ul>
427 </li>
428 </ul>
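  <p>
    As an illustration of the localnet concept above, a sketch of how such a
    port might be configured with <code>ovn-nbctl</code> and
    <code>ovs-vsctl</code> (the switch, network, and bridge names are
    illustrative):
  </p>

  <pre fixed="yes">
$ ovn-nbctl lsp-add ls0 ls0-localnet
$ ovn-nbctl lsp-set-type ls0-localnet localnet
$ ovn-nbctl lsp-set-addresses ls0-localnet unknown
$ ovn-nbctl lsp-set-options ls0-localnet network_name=physnet1
$ ovs-vsctl set Open_vSwitch . \
    external-ids:ovn-bridge-mappings=physnet1:br-phys
  </pre>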
429
430 <h2>Life Cycle of a VIF</h2>
431
432 <p>
433 Tables and their schemas presented in isolation are difficult to
434 understand. Here's an example.
435 </p>
436
437 <p>
438     A VIF on a hypervisor is a virtual network interface attached either
439     to a VM or to a container running directly on that hypervisor (as
440     opposed to the interface of a container running inside a VM).
441 </p>
442
443 <p>
444 The steps in this example refer often to details of the OVN and OVN
445 Northbound database schemas. Please see <code>ovn-sb</code>(5) and
446 <code>ovn-nb</code>(5), respectively, for the full story on these
447 databases.
448 </p>
449
450 <ol>
451 <li>
452 A VIF's life cycle begins when a CMS administrator creates a new VIF
453 using the CMS user interface or API and adds it to a switch (one
454 implemented by OVN as a logical switch). The CMS updates its own
455       configuration. This includes associating a unique, persistent identifier
456       <var>vif-id</var> and an Ethernet address <var>mac</var> with the VIF.
457 </li>
458
459 <li>
460 The CMS plugin updates the OVN Northbound database to include the new
461 VIF, by adding a row to the <code>Logical_Switch_Port</code>
462 table. In the new row, <code>name</code> is <var>vif-id</var>,
463 <code>mac</code> is <var>mac</var>, <code>switch</code> points to
464 the OVN logical switch's Logical_Switch record, and other columns
465 are initialized appropriately.
466 </li>
467
468 <li>
469 <code>ovn-northd</code> receives the OVN Northbound database update. In
470 turn, it makes the corresponding updates to the OVN Southbound database,
471 by adding rows to the OVN Southbound database <code>Logical_Flow</code>
472 table to reflect the new port, e.g. add a flow to recognize that packets
473 destined to the new port's MAC address should be delivered to it, and
474 update the flow that delivers broadcast and multicast packets to include
475 the new port. It also creates a record in the <code>Binding</code> table
476 and populates all its columns except the column that identifies the
477 <code>chassis</code>.
478 </li>
479
480 <li>
481 On every hypervisor, <code>ovn-controller</code> receives the
482 <code>Logical_Flow</code> table updates that <code>ovn-northd</code> made
483 in the previous step. As long as the VM that owns the VIF is powered
484 off, <code>ovn-controller</code> cannot do much; it cannot, for example,
485 arrange to send packets to or receive packets from the VIF, because the
486 VIF does not actually exist anywhere.
487 </li>
488
489 <li>
490 Eventually, a user powers on the VM that owns the VIF. On the hypervisor
491 where the VM is powered on, the integration between the hypervisor and
492 Open vSwitch (described in <code>IntegrationGuide.rst</code>) adds the VIF
493 to the OVN integration bridge and stores <var>vif-id</var> in
494 <code>external_ids</code>:<code>iface-id</code> to indicate that the
495 interface is an instantiation of the new VIF. (None of this code is new
496 in OVN; this is pre-existing integration work that has already been done
497 on hypervisors that support OVS.)
498 </li>
499
500 <li>
501 On the hypervisor where the VM is powered on, <code>ovn-controller</code>
502 notices <code>external_ids</code>:<code>iface-id</code> in the new
503 Interface. In response, in the OVN Southbound DB, it updates the
504 <code>Binding</code> table's <code>chassis</code> column for the
505 row that links the logical port from <code>external_ids</code>:<code>
506 iface-id</code> to the hypervisor. Afterward, <code>ovn-controller</code>
507 updates the local hypervisor's OpenFlow tables so that packets to and from
508 the VIF are properly handled.
509 </li>
510
511 <li>
512 Some CMS systems, including OpenStack, fully start a VM only when its
513 networking is ready. To support this, <code>ovn-northd</code> notices
514 the <code>chassis</code> column updated for the row in
515 <code>Binding</code> table and pushes this upward by updating the
516 <ref column="up" table="Logical_Switch_Port" db="OVN_NB"/> column
517 in the OVN Northbound database's <ref table="Logical_Switch_Port"
518 db="OVN_NB"/> table to indicate that the VIF is now up. The CMS,
519 if it uses this feature, can then react by allowing the VM's
520 execution to proceed.
521 </li>
522
523 <li>
524 On every hypervisor but the one where the VIF resides,
525 <code>ovn-controller</code> notices the completely populated row in the
526 <code>Binding</code> table. This provides <code>ovn-controller</code>
527 the physical location of the logical port, so each instance updates the
528 OpenFlow tables of its switch (based on logical datapath flows in the OVN
529 DB <code>Logical_Flow</code> table) so that packets to and from the VIF
530 can be properly handled via tunnels.
531 </li>
532
533 <li>
534 Eventually, a user powers off the VM that owns the VIF. On the
535 hypervisor where the VM was powered off, the VIF is deleted from the OVN
536 integration bridge.
537 </li>
538
539 <li>
540 On the hypervisor where the VM was powered off,
541 <code>ovn-controller</code> notices that the VIF was deleted. In
542 response, it removes the <code>Chassis</code> column content in the
543 <code>Binding</code> table for the logical port.
544 </li>
545
546 <li>
547 On every hypervisor, <code>ovn-controller</code> notices the empty
548 <code>Chassis</code> column in the <code>Binding</code> table's row
549 for the logical port. This means that <code>ovn-controller</code> no
550 longer knows the physical location of the logical port, so each instance
551 updates its OpenFlow table to reflect that.
552 </li>
553
554 <li>
555 Eventually, when the VIF (or its entire VM) is no longer needed by
556 anyone, an administrator deletes the VIF using the CMS user interface or
557 API. The CMS updates its own configuration.
558 </li>
559
560 <li>
561 The CMS plugin removes the VIF from the OVN Northbound database,
562 by deleting its row in the <code>Logical_Switch_Port</code> table.
563 </li>
564
565 <li>
566 <code>ovn-northd</code> receives the OVN Northbound update and in turn
567 updates the OVN Southbound database accordingly, by removing or updating
568 the rows from the OVN Southbound database <code>Logical_Flow</code> table
569 and <code>Binding</code> table that were related to the now-destroyed
570 VIF.
571 </li>
572
573 <li>
574 On every hypervisor, <code>ovn-controller</code> receives the
575 <code>Logical_Flow</code> table updates that <code>ovn-northd</code> made
576 in the previous step. <code>ovn-controller</code> updates OpenFlow
577 tables to reflect the update, although there may not be much to do, since
578 the VIF had already become unreachable when it was removed from the
579 <code>Binding</code> table in a previous step.
580 </li>
581 </ol>
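  <p>
    In terms of command-line tools, steps 2 and 5 above correspond roughly to
    the following sketch, in which the switch name, interface name, and
    addresses are illustrative:
  </p>

  <pre fixed="yes">
# Step 2: the CMS plugin adds the VIF to the northbound database.
$ ovn-nbctl lsp-add ls0 vif-id
$ ovn-nbctl lsp-set-addresses vif-id "00:00:00:00:00:01 192.168.0.10"

# Step 5: the hypervisor integration ties the local OVS interface to the VIF.
$ ovs-vsctl set Interface tap-vif external-ids:iface-id=vif-id
  </pre>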
582
583 <h2>Life Cycle of a Container Interface Inside a VM</h2>
584
585 <p>
586 OVN provides virtual network abstractions by converting information
587 written in OVN_NB database to OpenFlow flows in each hypervisor. Secure
588 virtual networking for multi-tenants can only be provided if OVN controller
589 is the only entity that can modify flows in Open vSwitch. When the
590 Open vSwitch integration bridge resides in the hypervisor, it is a
591 fair assumption to make that tenant workloads running inside VMs cannot
592 make any changes to Open vSwitch flows.
593 </p>
594
595 <p>
596 If the infrastructure provider trusts the applications inside the
597 containers not to break out and modify the Open vSwitch flows, then
598 containers can be run in hypervisors. This is also the case when
599 containers are run inside the VMs and Open vSwitch integration bridge
600 with flows added by OVN controller resides in the same VM. For both
601 the above cases, the workflow is the same as explained with an example
602 in the previous section ("Life Cycle of a VIF").
603 </p>
604
605 <p>
606 This section talks about the life cycle of a container interface (CIF)
607 when containers are created in the VMs and the Open vSwitch integration
608 bridge resides inside the hypervisor. In this case, even if a container
609 application breaks out, other tenants are not affected because the
610 containers running inside the VMs cannot modify the flows in the
611 Open vSwitch integration bridge.
612 </p>
613
614 <p>
615 When multiple containers are created inside a VM, there are multiple
616 CIFs associated with them. The network traffic associated with these
617 CIFs need to reach the Open vSwitch integration bridge running in the
618 hypervisor for OVN to support virtual network abstractions. OVN should
619 also be able to distinguish network traffic coming from different CIFs.
620 There are two ways to distinguish network traffic of CIFs.
621 </p>
622
623 <p>
624 One way is to provide one VIF for every CIF (1:1 model). This means that
625 there could be a lot of network devices in the hypervisor. This would slow
626 down OVS because of all the additional CPU cycles needed for the management
627 of all the VIFs. It would also mean that the entity creating the
628 containers in a VM should also be able to create the corresponding VIFs in
629 the hypervisor.
630 </p>
631
632 <p>
633 The second way is to provide a single VIF for all the CIFs (1:many model).
634 OVN could then distinguish network traffic coming from different CIFs via
635 a tag written in every packet. OVN uses this mechanism and uses VLAN as
636 the tagging mechanism.
637 </p>
638
639 <ol>
640 <li>
641       A CIF's life cycle begins when a container is spawned inside a VM by
642       either the same CMS that created the VM, a tenant that owns that VM,
643       or even a container orchestration system different from the CMS
644       that initially created the VM. Whatever the entity is, it needs to
645       know the <var>vif-id</var> associated with the network interface
646       of the VM through which the container interface's network traffic is
647       expected to go. The entity that creates the container interface
648       will also need to choose an unused VLAN inside that VM.
649 </li>
650
651 <li>
652 The container spawning entity (either directly or through the CMS that
653 manages the underlying infrastructure) updates the OVN Northbound
654 database to include the new CIF, by adding a row to the
655 <code>Logical_Switch_Port</code> table. In the new row,
656 <code>name</code> is any unique identifier,
657       <code>parent_name</code> is the <var>vif-id</var> of the VM
658       through which the CIF's network traffic is expected to go,
659       and <code>tag</code> is the VLAN tag that identifies the
660       network traffic of that CIF (see the example following this list).
661 </li>
662
663 <li>
664 <code>ovn-northd</code> receives the OVN Northbound database update. In
665 turn, it makes the corresponding updates to the OVN Southbound database,
666 by adding rows to the OVN Southbound database's <code>Logical_Flow</code>
667 table to reflect the new port and also by creating a new row in the
668 <code>Binding</code> table and populating all its columns except the
669 column that identifies the <code>chassis</code>.
670 </li>
671
672 <li>
673 On every hypervisor, <code>ovn-controller</code> subscribes to the
674 changes in the <code>Binding</code> table. When a new row is created
675 by <code>ovn-northd</code> that includes a value in
676 <code>parent_port</code> column of <code>Binding</code> table, the
677 <code>ovn-controller</code> in the hypervisor whose OVN integration bridge
678 has that same value in <var>vif-id</var> in
679 <code>external_ids</code>:<code>iface-id</code>
680 updates the local hypervisor's OpenFlow tables so that packets to and
681 from the VIF with the particular VLAN <code>tag</code> are properly
682 handled. Afterward it updates the <code>chassis</code> column of
683 the <code>Binding</code> to reflect the physical location.
684 </li>
685
686 <li>
687 One can only start the application inside the container after the
688 underlying network is ready. To support this, <code>ovn-northd</code>
689 notices the updated <code>chassis</code> column in <code>Binding</code>
690 table and updates the <ref column="up" table="Logical_Switch_Port"
691 db="OVN_NB"/> column in the OVN Northbound database's
692 <ref table="Logical_Switch_Port" db="OVN_NB"/> table to indicate that the
693       CIF is now up. The entity responsible for starting the container
694       application queries this value and starts the application.
695 </li>
696
697 <li>
698       Eventually the entity that created and started the container stops it.
699       The entity, through the CMS (or directly), deletes its row in the
700       <code>Logical_Switch_Port</code> table.
701 </li>
702
703 <li>
704 <code>ovn-northd</code> receives the OVN Northbound update and in turn
705 updates the OVN Southbound database accordingly, by removing or updating
706 the rows from the OVN Southbound database <code>Logical_Flow</code> table
707 that were related to the now-destroyed CIF. It also deletes the row in
708 the <code>Binding</code> table for that CIF.
709 </li>
710
711 <li>
712 On every hypervisor, <code>ovn-controller</code> receives the
713 <code>Logical_Flow</code> table updates that <code>ovn-northd</code> made
714 in the previous step. <code>ovn-controller</code> updates OpenFlow
715 tables to reflect the update.
716 </li>
717 </ol>
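  <p>
    The northbound update in step 2 can be sketched with
    <code>ovn-nbctl</code>, which accepts a parent port and a tag when adding
    a port (the port names and VLAN tag are illustrative):
  </p>

  <pre fixed="yes">
$ ovn-nbctl lsp-add ls0 cif0 vif-id 42
  </pre>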
718
719 <h2>Architectural Physical Life Cycle of a Packet</h2>
720
721 <p>
722 This section describes how a packet travels from one virtual machine or
723 container to another through OVN. This description focuses on the physical
724 treatment of a packet; for a description of the logical life cycle of a
725 packet, please refer to the <code>Logical_Flow</code> table in
726 <code>ovn-sb</code>(5).
727 </p>
728
729 <p>
730 This section mentions several data and metadata fields, for clarity
731 summarized here:
732 </p>
733
734 <dl>
735 <dt>tunnel key</dt>
736 <dd>
737 When OVN encapsulates a packet in Geneve or another tunnel, it attaches
738 extra data to it to allow the receiving OVN instance to process it
739 correctly. This takes different forms depending on the particular
740 encapsulation, but in each case we refer to it here as the ``tunnel
741 key.'' See <code>Tunnel Encapsulations</code>, below, for details.
742 </dd>
743
744 <dt>logical datapath field</dt>
745 <dd>
746 A field that denotes the logical datapath through which a packet is being
747 processed.
748 <!-- Keep the following in sync with MFF_LOG_DATAPATH in
749 ovn/lib/logical-fields.h. -->
750 OVN uses the field that OpenFlow 1.1+ simply (and confusingly) calls
751 ``metadata'' to store the logical datapath. (This field is passed across
752 tunnels as part of the tunnel key.)
753 </dd>
754
755 <dt>logical input port field</dt>
756 <dd>
757 <p>
758 A field that denotes the logical port from which the packet
759 entered the logical datapath.
760 <!-- Keep the following in sync with MFF_LOG_INPORT in
761 ovn/lib/logical-fields.h. -->
762 OVN stores this in Open vSwitch extension register number 14.
763 </p>
764
765 <p>
766 Geneve and STT tunnels pass this field as part of the tunnel key.
767 Although VXLAN tunnels do not explicitly carry a logical input port,
768 OVN only uses VXLAN to communicate with gateways that from OVN's
769 perspective consist of only a single logical port, so that OVN can set
770 the logical input port field to this one on ingress to the OVN logical
771 pipeline.
772 </p>
773 </dd>
774
775 <dt>logical output port field</dt>
776 <dd>
777 <p>
778 A field that denotes the logical port from which the packet will
779 leave the logical datapath. This is initialized to 0 at the
780 beginning of the logical ingress pipeline.
781 <!-- Keep the following in sync with MFF_LOG_OUTPORT in
782 ovn/lib/logical-fields.h. -->
783 OVN stores this in Open vSwitch extension register number 15.
784 </p>
785
786 <p>
787 Geneve and STT tunnels pass this field as part of the tunnel key.
788 VXLAN tunnels do not transmit the logical output port field.
789 Since VXLAN tunnels do not carry a logical output port field in
790 the tunnel key, when a packet is received from VXLAN tunnel by
791 an OVN hypervisor, the packet is resubmitted to table 8 to
792 determine the output port(s); when the packet reaches table 32,
793 these packets are resubmitted to table 33 for local delivery by
794 checking a MLF_RCV_FROM_VXLAN flag, which is set when the packet
795 arrives from a VXLAN tunnel.
796 </p>
797 </dd>
798
799 <dt>conntrack zone field for logical ports</dt>
800 <dd>
801 A field that denotes the connection tracking zone for logical ports.
802 The value only has local significance and is not meaningful between
803 chassis. This is initialized to 0 at the beginning of the logical
804 <!-- Keep the following in sync with MFF_LOG_CT_ZONE in
805 ovn/lib/logical-fields.h. -->
806 ingress pipeline. OVN stores this in Open vSwitch extension register
807 number 13.
808 </dd>
809
810 <dt>conntrack zone fields for routers</dt>
811 <dd>
812 Fields that denote the connection tracking zones for routers. These
813 values only have local significance and are not meaningful between
814 chassis. OVN stores the zone information for DNATting in Open vSwitch
815 <!-- Keep the following in sync with MFF_LOG_DNAT_ZONE and
816 MFF_LOG_SNAT_ZONE in ovn/lib/logical-fields.h. -->
817 extension register number 11 and zone information for SNATing in
818 Open vSwitch extension register number 12.
819 </dd>
820
821 <dt>logical flow flags</dt>
822 <dd>
823 The logical flags are intended to handle keeping context between
824 tables in order to decide which rules in subsequent tables are
825 matched. These values only have local significance and are not
826 meaningful between chassis. OVN stores the logical flags in
827 <!-- Keep the following in sync with MFF_LOG_FLAGS in
828 ovn/lib/logical-fields.h. -->
829 Open vSwitch extension register number 10.
830 </dd>
831
832 <dt>VLAN ID</dt>
833 <dd>
834 The VLAN ID is used as an interface between OVN and containers nested
835 inside a VM (see <code>Life Cycle of a container interface inside a
836 VM</code>, above, for more information).
837 </dd>
838 </dl>
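  <p>
    The flows that manipulate these fields can be inspected on any hypervisor,
    both at the logical level and as the OpenFlow flows that
    <code>ovn-controller</code> generates from them.  A brief sketch:
  </p>

  <pre fixed="yes">
$ ovn-sbctl lflow-list           # logical flows in the southbound database
$ ovs-ofctl dump-flows br-int    # the OpenFlow flows generated from them
  </pre>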
839
840 <p>
841 Initially, a VM or container on the ingress hypervisor sends a packet on a
842 port attached to the OVN integration bridge. Then:
843 </p>
844
845 <ol>
846 <li>
847 <p>
848 OpenFlow table 0 performs physical-to-logical translation. It matches
849 the packet's ingress port. Its actions annotate the packet with
850 logical metadata, by setting the logical datapath field to identify the
851 logical datapath that the packet is traversing and the logical input
852 port field to identify the ingress port. Then it resubmits to table 8
853 to enter the logical ingress pipeline.
854 </p>
855
856 <p>
857 Packets that originate from a container nested within a VM are treated
858 in a slightly different way. The originating container can be
859 distinguished based on the VIF-specific VLAN ID, so the
860 physical-to-logical translation flows additionally match on VLAN ID and
861 the actions strip the VLAN header. Following this step, OVN treats
862 packets from containers just like any other packets.
863 </p>
864
865 <p>
866 Table 0 also processes packets that arrive from other chassis. It
867 distinguishes them from other packets by ingress port, which is a
868 tunnel. As with packets just entering the OVN pipeline, the actions
869 annotate these packets with logical datapath and logical ingress port
870 metadata. In addition, the actions set the logical output port field,
871 which is available because in OVN tunneling occurs after the logical
872 output port is known. These three pieces of information are obtained
873 from the tunnel encapsulation metadata (see <code>Tunnel
874 Encapsulations</code> for encoding details). Then the actions resubmit
875 to table 33 to enter the logical egress pipeline.
876 </p>
877 </li>
878
879 <li>
880 <p>
881 OpenFlow tables 8 through 31 execute the logical ingress pipeline from
882 the <code>Logical_Flow</code> table in the OVN Southbound database.
883 These tables are expressed entirely in terms of logical concepts like
884 logical ports and logical datapaths. A big part of
885 <code>ovn-controller</code>'s job is to translate them into equivalent
886 OpenFlow (in particular it translates the table numbers:
887 <code>Logical_Flow</code> tables 0 through 23 become OpenFlow tables 8
888 through 31).
889 </p>
890
891 <p>
892 Each logical flow maps to one or more OpenFlow flows. An actual packet
893 ordinarily matches only one of these, although in some cases it can
894 match more than one of these flows (which is not a problem because all
895 of them have the same actions). <code>ovn-controller</code> uses the
896         first 32 bits of the logical flow's UUID as the cookie for its OpenFlow
897         flow or flows. (This is not necessarily unique, since the first 32 bits of
898         a logical flow's UUID are not necessarily unique; see the example following this list.)
899 </p>
900
901 <p>
902 Some logical flows can map to the Open vSwitch ``conjunctive match''
903 extension (see <code>ovs-fields</code>(7)). Flows with a
904 <code>conjunction</code> action use an OpenFlow cookie of 0, because
905 they can correspond to multiple logical flows. The OpenFlow flow for a
906 conjunctive match includes a match on <code>conj_id</code>.
907 </p>
908
909 <p>
910 Some logical flows may not be represented in the OpenFlow tables on a
911 given hypervisor, if they could not be used on that hypervisor. For
912 example, if no VIF in a logical switch resides on a given hypervisor,
913 and the logical switch is not otherwise reachable on that hypervisor
914 (e.g. over a series of hops through logical switches and routers
915 starting from a VIF on the hypervisor), then the logical flow may not
916 be represented there.
917 </p>
918
919 <p>
920 Most OVN actions have fairly obvious implementations in OpenFlow (with
921 OVS extensions), e.g. <code>next;</code> is implemented as
922 <code>resubmit</code>, <code><var>field</var> =
923 <var>constant</var>;</code> as <code>set_field</code>. A few are worth
924 describing in more detail:
925 </p>
926
927 <dl>
928 <dt><code>output:</code></dt>
929 <dd>
930 Implemented by resubmitting the packet to table 32. If the pipeline
931 executes more than one <code>output</code> action, then each one is
932 separately resubmitted to table 32. This can be used to send
933 multiple copies of the packet to multiple ports. (If the packet was
934 not modified between the <code>output</code> actions, and some of the
935 copies are destined to the same hypervisor, then using a logical
936 multicast output port would save bandwidth between hypervisors.)
937 </dd>
938
939 <dt><code>get_arp(<var>P</var>, <var>A</var>);</code></dt>
940 <dt><code>get_nd(<var>P</var>, <var>A</var>);</code></dt>
941 <dd>
942 <p>
943 Implemented by storing arguments into OpenFlow fields, then
944 resubmitting to table 66, which <code>ovn-controller</code>
945 populates with flows generated from the <code>MAC_Binding</code>
946 table in the OVN Southbound database. If there is a match in table
947 66, then its actions store the bound MAC in the Ethernet
948 destination address field.
949 </p>
950
951 <p>
952 (The OpenFlow actions save and restore the OpenFlow fields used for
953 the arguments, so that the OVN actions do not have to be aware of
954 this temporary use.)
955 </p>
956 </dd>
957
958 <dt><code>put_arp(<var>P</var>, <var>A</var>, <var>E</var>);</code></dt>
959 <dt><code>put_nd(<var>P</var>, <var>A</var>, <var>E</var>);</code></dt>
960 <dd>
961 <p>
962 Implemented by storing the arguments into OpenFlow fields, then
963 outputting a packet to <code>ovn-controller</code>, which updates
964 the <code>MAC_Binding</code> table.
965 </p>
966
967 <p>
968 (The OpenFlow actions save and restore the OpenFlow fields used for
969 the arguments, so that the OVN actions do not have to be aware of
970 this temporary use.)
971 </p>
972 </dd>
973 </dl>
974 </li>
975
976 <li>
977 <p>
978 OpenFlow tables 32 through 47 implement the <code>output</code> action
979 in the logical ingress pipeline. Specifically, table 32 handles
980 packets to remote hypervisors, table 33 handles packets to the local
981 hypervisor, and table 34 checks whether packets whose logical ingress
982 and egress port are the same should be discarded.
983 </p>
984
985 <p>
986 Logical patch ports are a special case. Logical patch ports do not
987 have a physical location and effectively reside on every hypervisor.
988 Thus, flow table 33, for output to ports on the local hypervisor,
989 naturally implements output to unicast logical patch ports too.
990 However, applying the same logic to a logical patch port that is part
991 of a logical multicast group yields packet duplication, because each
992 hypervisor that contains a logical port in the multicast group will
993 also output the packet to the logical patch port. Thus, multicast
994 groups implement output to logical patch ports in table 32.
995 </p>
996
997 <p>
998 Each flow in table 32 matches on a logical output port for unicast or
999 multicast logical ports that include a logical port on a remote
1000 hypervisor. Each flow's actions implement sending a packet to the port
1001 it matches. For unicast logical output ports on remote hypervisors,
1002 the actions set the tunnel key to the correct value, then send the
1003 packet on the tunnel port to the correct hypervisor. (When the remote
1004 hypervisor receives the packet, table 0 there will recognize it as a
1005 tunneled packet and pass it along to table 33.) For multicast logical
1006 output ports, the actions send one copy of the packet to each remote
1007 hypervisor, in the same way as for unicast destinations. If a
1008 multicast group includes a logical port or ports on the local
1009 hypervisor, then its actions also resubmit to table 33. Table 32 also
1010 includes:
1011 </p>
1012
1013 <ul>
1014 <li>
1015 A higher-priority rule to match packets received from VXLAN tunnels,
1016 based on flag MLF_RCV_FROM_VXLAN, and resubmit these packets to table
1017 33 for local delivery. Packets received from VXLAN tunnels reach
1018          here because the tunnel key lacks a logical output port field, and
1019          thus these packets had to be resubmitted to table 8 to
1020          determine the output port.
1021 </li>
1022 <li>
1023 A higher-priority rule to match packets received from ports of type
1024 <code>localport</code>, based on the logical input port, and resubmit
1025 these packets to table 33 for local delivery. Ports of type
1026 <code>localport</code> exist on every hypervisor and by definition
1027 their traffic should never go out through a tunnel.
1028 </li>
1029 <li>
1030 A fallback flow that resubmits to table 33 if there is no other
1031 match.
1032 </li>
1033 </ul>
1034
1035 <p>
1036 Flows in table 33 resemble those in table 32 but for logical ports that
1037 reside locally rather than remotely. For unicast logical output ports
1038 on the local hypervisor, the actions just resubmit to table 34. For
1039 multicast output ports that include one or more logical ports on the
1040 local hypervisor, for each such logical port <var>P</var>, the actions
1041 change the logical output port to <var>P</var>, then resubmit to table
1042 34.
1043 </p>
1044
1045 <p>
1046 A special case is that when a localnet port exists on the datapath,
1047        a remote port is reached by switching through the localnet port. In this
1048        case, instead of adding a flow in table 32 to reach the remote port, a
1049        flow is added in table 33 to switch the logical output port to the
1050        localnet port and resubmit to table 33, as if the packet were unicast
1051        to a logical port on the local hypervisor.
1052 </p>
1053
1054 <p>
1055 Table 34 matches and drops packets for which the logical input and
1056 output ports are the same and the MLF_ALLOW_LOOPBACK flag is not
1057 set. It resubmits other packets to table 40.
1058 </p>
1059 </li>
1060
1061 <li>
1062 <p>
1063 OpenFlow tables 40 through 63 execute the logical egress pipeline from
1064 the <code>Logical_Flow</code> table in the OVN Southbound database.
1065 The egress pipeline can perform a final stage of validation before
1066 packet delivery. Eventually, it may execute an <code>output</code>
1067 action, which <code>ovn-controller</code> implements by resubmitting to
1068 table 64. A packet for which the pipeline never executes
1069 <code>output</code> is effectively dropped (although it may have been
1070 transmitted through a tunnel across a physical network).
1071 </p>
1072
1073 <p>
1074 The egress pipeline cannot change the logical output port or cause
1075 further tunneling.
1076 </p>
1077 </li>
1078
1079 <li>
1080 <p>
1081 Table 64 bypasses OpenFlow loopback when MLF_ALLOW_LOOPBACK is set.
1082 Logical loopback was handled in table 34, but OpenFlow by default also
1083 prevents loopback to the OpenFlow ingress port. Thus, when
1084 MLF_ALLOW_LOOPBACK is set, OpenFlow table 64 saves the OpenFlow ingress
1085 port, sets it to zero, resubmits to table 65 for logical-to-physical
1086 transformation, and then restores the OpenFlow ingress port,
1087        effectively disabling OpenFlow loopback prevention. When
1088        MLF_ALLOW_LOOPBACK is unset, the table 64 flow simply resubmits to
1089        table 65.
1090 </p>
1091 </li>
1092
1093 <li>
1094 <p>
1095 OpenFlow table 65 performs logical-to-physical translation, the
1096 opposite of table 0. It matches the packet's logical egress port. Its
1097 actions output the packet to the port attached to the OVN integration
1098 bridge that represents that logical port. If the logical egress port
1099        is a container nested within a VM, then before sending the packet the
1100 actions push on a VLAN header with an appropriate VLAN ID.
1101 </p>
1102 </li>
1103 </ol>
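  <p>
    As mentioned in the description of the ingress pipeline above, the
    OpenFlow cookie offers one way to correlate a logical flow with the
    OpenFlow flows generated from it.  A sketch, using an illustrative UUID
    prefix:
  </p>

  <pre fixed="yes">
$ ovn-sbctl list Logical_Flow      # note a flow's UUID, e.g. 1234abcd-...
$ ovs-ofctl dump-flows br-int cookie=0x1234abcd/-1
  </pre>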
1104
1105 <h2>Logical Routers and Logical Patch Ports</h2>
1106
1107 <p>
1108 Typically logical routers and logical patch ports do not have a
1109 physical location and effectively reside on every hypervisor. This is
1110 the case for logical patch ports between logical routers and logical
1111 switches behind those logical routers, to which VMs (and VIFs) attach.
1112 </p>
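  <p>
    For reference, such a logical router and its connection to a logical
    switch might be created along the following lines (all names and addresses
    are illustrative):
  </p>

  <pre fixed="yes">
$ ovn-nbctl lr-add lr0
$ ovn-nbctl lrp-add lr0 lrp0 00:00:00:00:ff:01 192.168.0.1/24
$ ovn-nbctl lsp-add ls0 ls0-to-lr0
$ ovn-nbctl lsp-set-type ls0-to-lr0 router
$ ovn-nbctl lsp-set-addresses ls0-to-lr0 router
$ ovn-nbctl lsp-set-options ls0-to-lr0 router-port=lrp0
  </pre>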
1113
1114 <p>
1115 Consider a packet sent from one virtual machine or container to another
1116 VM or container that resides on a different subnet. The packet will
1117 traverse tables 0 to 65 as described in the previous section
1118 <code>Architectural Physical Life Cycle of a Packet</code>, using the
1119 logical datapath representing the logical switch that the sender is
1120 attached to. At table 32, the packet will use the fallback flow that
1121 resubmits locally to table 33 on the same hypervisor. In this case,
1122 all of the processing from table 0 to table 65 occurs on the hypervisor
1123 where the sender resides.
1124 </p>
1125
1126 <p>
1127 When the packet reaches table 65, the logical egress port is a logical
1128 patch port. The implementation in table 65 differs depending on the OVS
1129 version, although the observed behavior is meant to be the same:
1130 </p>
1131
1132 <ul>
1133 <li>
1134 In OVS versions 2.6 and earlier, table 65 outputs to an OVS patch
1135 port that represents the logical patch port. The packet re-enters
1136 the OpenFlow flow table from the OVS patch port's peer in table 0,
1137 which identifies the logical datapath and logical input port based
1138 on the OVS patch port's OpenFlow port number.
1139 </li>
1140
1141 <li>
1142 In OVS versions 2.7 and later, the packet is cloned and resubmitted
1143 directly to the first OpenFlow flow table in the ingress pipeline,
1144 setting the logical ingress port to the peer logical patch port, and
1145 using the peer logical patch port's logical datapath (that
1146 represents the logical router).
1147 </li>
1148 </ul>
1149
1150 <p>
1151 The packet re-enters the ingress pipeline in order to traverse tables
1152 8 to 65 again, this time using the logical datapath representing the
1153 logical router. The processing continues as described in the previous
1154 section <code>Architectural Physical Life Cycle of a Packet</code>.
1155      When the packet reaches table 65, the logical egress port will once
1156 again be a logical patch port. In the same manner as described above,
1157 this logical patch port will cause the packet to be resubmitted to
1158 OpenFlow tables 8 to 65, this time using the logical datapath
1159 representing the logical switch that the destination VM or container
1160 is attached to.
1161 </p>
1162
1163 <p>
1164 The packet traverses tables 8 to 65 a third and final time. If the
1165 destination VM or container resides on a remote hypervisor, then table
1166 32 will send the packet on a tunnel port from the sender's hypervisor
1167 to the remote hypervisor. Finally table 65 will output the packet
1168 directly to the destination VM or container.
1169 </p>
1170
1171 <p>
1172 The following sections describe two exceptions, where logical routers
1173 and/or logical patch ports are associated with a physical location.
1174 </p>
1175
1176 <h3>Gateway Routers</h3>
1177
1178 <p>
1179 A <dfn>gateway router</dfn> is a logical router that is bound to a
1180 physical location. This includes all of the logical patch ports of
1181 the logical router, as well as all of the peer logical patch ports on
1182 logical switches. In the OVN Southbound database, the
1183 <code>Port_Binding</code> entries for these logical patch ports use
1184 the type <code>l3gateway</code> rather than <code>patch</code>, in
1185 order to distinguish that these logical patch ports are bound to a
1186 chassis.
1187 </p>
1188
1189 <p>
1190 When a hypervisor processes a packet on a logical datapath
1191 representing a logical switch, and the logical egress port is a
1192 <code>l3gateway</code> port representing connectivity to a gateway
1193 router, the packet will match a flow in table 32 that sends the
1194 packet on a tunnel port to the chassis where the gateway router
1195 resides. This processing in table 32 is done in the same manner as
1196 for VIFs.
1197 </p>
1198
1199 <p>
1200 Gateway routers are typically used in between distributed logical
1201 routers and physical networks. The distributed logical router and
1202 the logical switches behind it, to which VMs and containers attach,
1203 effectively reside on each hypervisor. The distributed router and
1204 the gateway router are connected by another logical switch, sometimes
1205 referred to as a <code>join</code> logical switch. On the other
1206 side, the gateway router connects to another logical switch that has
1207 a localnet port connecting to the physical network.
1208 </p>
1209
1210 <p>
1211 When using gateway routers, DNAT and SNAT rules are associated with
1212 the gateway router, which provides a central location that can handle
1213 one-to-many SNAT (aka IP masquerading).
1214 </p>
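  <p>
    One way to express this in the northbound database (see
    <code>ovn-nb</code>(5)) is to bind the gateway router to a chassis through
    its <code>options:chassis</code> setting and then attach NAT rules to it.
    A minimal sketch (the chassis name and addresses are illustrative):
  </p>

  <pre fixed="yes">
$ ovn-nbctl lr-add gw0
$ ovn-nbctl set Logical_Router gw0 options:chassis=chassis-1
$ ovn-nbctl lr-nat-add gw0 snat 203.0.113.1 192.168.0.0/24
  </pre>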
1215
1216 <h3>Distributed Gateway Ports</h3>
1217
1218 <p>
1219 <dfn>Distributed gateway ports</dfn> are logical router patch ports
1220 that directly connect distributed logical routers to logical
1221 switches with localnet ports.
1222 </p>
1223
1224 <p>
1225 The primary design goal of distributed gateway ports is to allow as
1226 much traffic as possible to be handled locally on the hypervisor
1227 where a VM or container resides. Whenever possible, packets from
1228 the VM or container to the outside world should be processed
1229 completely on that VM's or container's hypervisor, eventually
1230 traversing a localnet port instance on that hypervisor to the
1231 physical network. Whenever possible, packets from the outside
1232 world to a VM or container should be directed through the physical
1233 network directly to the VM's or container's hypervisor, where the
1234 packet will enter the integration bridge through a localnet port.
1235 </p>
1236
1237 <p>
1238 In order to allow for the distributed processing of packets
1239 described in the paragraph above, distributed gateway ports need to
1240 be logical patch ports that effectively reside on every hypervisor,
1241 rather than <code>l3gateway</code> ports that are bound to a
1242 particular chassis. However, the flows associated with distributed
1243 gateway ports often need to be associated with physical locations,
1244 for the following reasons:
1245 </p>
1246
1247 <ul>
1248 <li>
1249 <p>
1250 The physical network that the localnet port is attached to
1251 typically uses L2 learning. Any Ethernet address used over the
1252 distributed gateway port must be restricted to a single physical
1253 location so that upstream L2 learning is not confused. Traffic
1254 sent out the distributed gateway port towards the localnet port
1255 with a specific Ethernet address must be sent out one specific
1256 instance of the distributed gateway port on one specific
1257 chassis. Traffic received from the localnet port (or from a VIF
1258 on the same logical switch as the localnet port) with a specific
1259 Ethernet address must be directed to the logical switch's patch
1260 port instance on that specific chassis.
1261 </p>
1262
1263 <p>
1264 Due to the implications of L2 learning, the Ethernet address and
1265 IP address of the distributed gateway port need to be restricted
1266 to a single physical location. For this reason, the user must
1267 specify one chassis associated with the distributed gateway
1268 port. Note that traffic traversing the distributed gateway port
1269 using other Ethernet addresses and IP addresses (e.g. one-to-one
1270 NAT) is not restricted to this chassis.
1271 </p>
1272
1273 <p>
1274 Replies to ARP and ND requests must be restricted to a single
1275 physical location, where the Ethernet address in the reply
1276 resides. This includes ARP and ND replies for the IP address
1277 of the distributed gateway port, which are restricted to the
1278 chassis that the user associated with the distributed gateway
1279 port.
1280 </p>
1281 </li>
1282
1283 <li>
1284 In order to support one-to-many SNAT (aka IP masquerading), where
1285 multiple logical IP addresses spread across multiple chassis are
1286 mapped to a single external IP address, it will be necessary to
1287 handle some of the logical router processing on a specific chassis
1288 in a centralized manner. Since the SNAT external IP address is
1289 typically the distributed gateway port IP address, and for
1290 simplicity, the same chassis associated with the distributed
1291 gateway port is used.
1292 </li>
1293 </ul>
1294
1295 <p>
1296 The details of flow restrictions to specific chassis are described
1297 in the <code>ovn-northd</code> documentation.
1298 </p>
1299
1300 <p>
1301 While most of the physical location dependent aspects of distributed
1302 gateway ports can be handled by restricting some flows to specific
1303 chassis, one additional mechanism is required. When a packet
1304 leaves the ingress pipeline and the logical egress port is the
1305 distributed gateway port, one of two different sets of actions is
1306 required at table 32:
1307 </p>
1308
1309 <ul>
1310 <li>
1311 If the packet can be handled locally on the sender's hypervisor
1312 (e.g. one-to-one NAT traffic), then the packet should just be
1313 resubmitted locally to table 33, in the normal manner for
1314 distributed logical patch ports.
1315 </li>
1316
1317 <li>
1318 However, if the packet needs to be handled on the chassis
1319 associated with the distributed gateway port (e.g. one-to-many
1320 SNAT traffic or non-NAT traffic), then table 32 must send the
1321 packet on a tunnel port to that chassis.
1322 </li>
1323 </ul>
1324
1325 <p>
1326 In order to trigger the second set of actions, the
1327 <code>chassisredirect</code> type of southbound
1328 <code>Port_Binding</code> has been added. Setting the logical
1329 egress port to the type <code>chassisredirect</code> logical port is
1330 simply a way to indicate that although the packet is destined for
1331 the distributed gateway port, it needs to be redirected to a
1332 different chassis. At table 32, packets with this logical egress
1333 port are sent to a specific chassis, in the same way that table 32
1334 directs packets whose logical egress port is a VIF or a type
1335 <code>l3gateway</code> port to different chassis. Once the packet
1336 arrives at that chassis, table 33 resets the logical egress port to
1337 the value representing the distributed gateway port. For each
1338 distributed gateway port, there is one type
1339 <code>chassisredirect</code> port, in addition to the distributed
1340 logical patch port representing the distributed gateway port.
1341 </p>
1342
1343 <h3>High Availability for Distributed Gateway Ports</h3>
1344
1345 <p>
1346 OVN allows you to specify a prioritized list of chassis for a distributed
1347 gateway port. This is done by associating multiple
1348 <code>Gateway_Chassis</code> rows with a <code>Logical_Router_Port</code>
1349 in the <code>OVN_Northbound</code> database.
1350 </p>
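
<p>
For example (a sketch only; the port and chassis names are hypothetical),
the <code>ovn-nbctl</code> gateway-chassis commands can be used to build
this prioritized list, e.g. "<code>ovn-nbctl lrp-set-gateway-chassis lrp0
chassis-1 20</code>" followed by "<code>ovn-nbctl lrp-set-gateway-chassis
lrp0 chassis-2 10</code>", which makes <code>chassis-1</code> the
preferred (higher priority) gateway chassis for logical router port
<code>lrp0</code>.
</p>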
1351
1352 <p>
1353 When multiple chassis have been specified for a gateway, all chassis that
1354 may send packets to that gateway will enable BFD on tunnels to all
1355 configured gateway chassis. The current master chassis for the gateway
1356 is the highest priority gateway chassis that is currently viewed as
1357 active based on BFD status.
1358 </p>
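
<p>
The BFD sessions used for this purpose run on the tunnel interfaces and
can be inspected on any chassis with "<code>ovs-appctl bfd/show</code>"
(a generic Open vSwitch command, mentioned here only as a debugging aid).
</p>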
1359
1360 <p>
1361 For more information on L3 gateway high availability, please refer to
1362 http://docs.openvswitch.org/en/latest/topics/high-availability.
1363 </p>
1364
1365 <h2>Life Cycle of a VTEP gateway</h2>
1366
1367 <p>
1368 A gateway is a chassis that forwards traffic between the OVN-managed
1369 part of a logical network and a physical VLAN, extending a
1370 tunnel-based logical network into a physical network.
1371 </p>
1372
1373 <p>
1374 The steps below refer often to details of the OVN and VTEP database
1375 schemas. Please see <code>ovn-sb</code>(5), <code>ovn-nb</code>(5)
1376 and <code>vtep</code>(5), respectively, for the full story on these
1377 databases.
1378 </p>
1379
1380 <ol>
1381 <li>
1382 A VTEP gateway's life cycle begins with the administrator registering
1383 the VTEP gateway as a <code>Physical_Switch</code> table entry in the
1384 <code>VTEP</code> database. The <code>ovn-controller-vtep</code>
1385 connected to this VTEP database will recognize the new VTEP gateway
1386 and create a new <code>Chassis</code> table entry for it in the
1387 <code>OVN_Southbound</code> database.
1388 </li>
1389
1390 <li>
1391 The administrator can then create a new <code>Logical_Switch</code>
1392 table entry, and bind a particular VLAN on a VTEP gateway's port to
1393 any VTEP logical switch (see the sketch following this list). Once
1394 a VTEP logical switch is bound to a VTEP gateway, the
1395 <code>ovn-controller-vtep</code> will detect it and add its name to
1396 the <var>vtep_logical_switches</var> column of the
1397 <code>Chassis</code> table in the <code>OVN_Southbound</code>
1398 database. Note that the <var>tunnel_key</var> column of the VTEP
1399 logical switch is not filled at creation; <code>ovn-controller-vtep</code>
1400 sets it when the corresponding VTEP logical switch is bound to an OVN logical network.
1401 </li>
1402
1403 <li>
1404 Now, the administrator can use the CMS to add a VTEP logical switch
1405 to the OVN logical network. To do that, the CMS must first create a
1406 new <code>Logical_Switch_Port</code> table entry in the <code>
1407 OVN_Northbound</code> database. Then, the <var>type</var> column
1408 of this entry must be set to "vtep". Next, the <var>
1409 vtep-logical-switch</var> and <var>vtep-physical-switch</var> keys
1410 in the <var>options</var> column must also be specified, since
1411 multiple VTEP gateways can attach to the same VTEP logical switch.
1412 </li>
1413
1414 <li>
1415 The newly created logical port in the <code>OVN_Northbound</code>
1416 database and its configuration will be passed down to the <code>
1417 OVN_Southbound</code> database as a new <code>Port_Binding</code>
1418 table entry. The <code>ovn-controller-vtep</code> will recognize the
1419 change and bind the logical port to the corresponding VTEP gateway
1420 chassis. Binding the same VTEP logical switch to more than one OVN
1421 logical network is not allowed, and a warning will be generated in
1422 the log.
1423 </li>
1424
1425 <li>
1426 Besides binding to the VTEP gateway chassis, the <code>
1427 ovn-controller-vtep</code> will update the <var>tunnel_key</var>
1428 column of the VTEP logical switch to the corresponding <code>
1429 Datapath_Binding</code> table entry's <var>tunnel_key</var> for the
1430 bound OVN logical network.
1431 </li>
1432
1433 <li>
1434 Next, the <code>ovn-controller-vtep</code> will keep reacting to
1435 configuration changes in the <code>Port_Binding</code> table in the
1436 <code>OVN_Southbound</code> database, updating the
1437 <code>Ucast_Macs_Remote</code> table in the <code>VTEP</code> database.
1438 This allows the VTEP gateway to understand where to forward the unicast
1439 traffic coming from the extended external network.
1440 </li>
1441
1442 <li>
1443 Eventually, the VTEP gateway's life cycle ends when the administrator
1444 unregisters the VTEP gateway from the <code>VTEP</code> database.
1445 The <code>ovn-controller-vtep</code> will recognize the event and
1446 remove all related configurations (<code>Chassis</code> table entry
1447 and port bindings) in the <code>OVN_Southbound</code> database.
1448 </li>
1449
1450 <li>
1451 When the <code>ovn-controller-vtep</code> is terminated, all related
1452 configurations in the <code>OVN_Southbound</code> database and
1453 the <code>VTEP</code> database will be cleaned up, including
1454 <code>Chassis</code> table entries for all registered VTEP gateways
1455 and their port bindings, and all <code>Ucast_Macs_Remote</code> table
1456 entries and the <code>Logical_Switch</code> tunnel keys.
1457 </li>
1458 </ol>
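
<p>
The following sketch summarizes the administrator- and CMS-facing steps
above using hypothetical names (a VTEP physical switch
<code>br-vtep</code> with port <code>p0</code>, a VTEP logical switch
<code>ls0</code>, and an OVN logical switch <code>sw0</code>); the exact
commands depend on the deployment and the VTEP implementation in use:
</p>

<ul>
<li>
Bind VLAN 100 on the gateway port to the VTEP logical switch:
"<code>vtep-ctl add-ls ls0</code>" followed by
"<code>vtep-ctl bind-ls br-vtep p0 100 ls0</code>".
</li>
<li>
Attach the VTEP logical switch to the OVN logical network:
"<code>ovn-nbctl lsp-add sw0 sw0-vtep</code>",
"<code>ovn-nbctl lsp-set-type sw0-vtep vtep</code>", and
"<code>ovn-nbctl lsp-set-options sw0-vtep vtep-physical-switch=br-vtep
vtep-logical-switch=ls0</code>".
</li>
</ul>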
1459
1460 <h1>Security</h1>
1461
1462 <h2>Role-Based Access Controls for the Southbound DB</h2>
1463 <p>
1464 In order to provide additional security against the possibility of an OVN
1465 chassis becoming compromised in such a way as to allow rogue software to
1466 make arbitrary modifications to the southbound database state and thus
1467 disrupt the OVN network, role-based access controls (see
1468 <code>ovsdb-server(1)</code> for additional details) are provided for the
1469 southbound database.
1470 </p>
1471
1472 <p>
1473 The implementation of role-based access controls (RBAC) requires the
1474 addition of two tables to an OVSDB schema: the <code>RBAC_Role</code>
1475 table, which is indexed by role name and maps the names of the various
1476 tables that may be modifiable for a given role to individual rows in a
1477 permissions table containing detailed permission information for that role,
1478 and the permission table itself, which consists of rows containing the
1479 following information:
1480 </p>
1481 <dl>
1482 <dt><code>Table Name</code></dt>
1483 <dd>
1484 The name of the associated table. This column exists primarily as an
1485 aid for humans reading the contents of this table.
1486 </dd>
1487
1488 <dt><code>Auth Criteria</code></dt>
1489 <dd>
1490 A set of strings containing the names of columns (or column:key pairs
1491 for columns containing string:string maps). The contents of at least
1492 one of the columns or column:key values in a row to be modified,
1493 inserted, or deleted must be equal to the ID of the client attempting
1494 to act on the row in order for the authorization check to pass. If the
1495 authorization criteria are empty, authorization checking is disabled and
1496 all clients for the role will be treated as authorized.
1497 </dd>
1498
1499 <dt><code>Insert/Delete</code></dt>
1500 <dd>
1501 Row insertion/deletion permission; boolean value indicating whether
1502 insertion and deletion of rows is allowed for the associated table.
1503 If true, insertion and deletion of rows is allowed for authorized
1504 clients.
1505 </dd>
1506
1507 <dt><code>Updatable Columns</code></dt>
1508 <dd>
1509 A set of strings containing the names of columns or column:key pairs
1510 that may be updated or mutated by authorized clients. Modifications to
1511 columns within a row are only permitted when the authorization check
1512 for the client passes and all columns to be modified are included in
1513 this set of modifiable columns.
1514 </dd>
1515 </dl>
1516
1517 <p>
1518 RBAC configuration for the OVN southbound database is maintained by
1519 ovn-northd. With RBAC enabled, modifications are only permitted for the
1520 <code>Chassis</code>, <code>Encap</code>, <code>Port_Binding</code>, and
1521 <code>MAC_Binding</code> tables, and are restricted as follows:
1522 </p>
1523 <dl>
1524 <dt><code>Chassis</code></dt>
1525 <dd>
1526 <p>
1527 <code>Authorization</code>: client ID must match the chassis name.
1528 </p>
1529 <p>
1530 <code>Insert/Delete</code>: authorized row insertion and deletion
1531 are permitted.
1532 </p>
1533 <p>
1534 <code>Update</code>: The columns <code>nb_cfg</code>,
1535 <code>external_ids</code>, <code>encaps</code>, and
1536 <code>vtep_logical_switches</code> may be modified when authorized.
1537 </p>
1538 </dd>
1539
1540 <dt><code>Encap</code></dt>
1541 <dd>
1542 <p>
1543 <code>Authorization</code>: disabled (all clients are considered
1544 to be authorized). Future: add a "creating chassis name" column to
1545 this table and use it for authorization checking.
1546 </p>
1547 <p>
1548 <code>Insert/Delete</code>: row insertion and row deletion
1549 are permitted.
1550 </p>
1551 <p>
1552 <code>Update</code>: The columns <code>type</code>,
1553 <code>options</code>, and <code>ip</code> can be modified.
1554 </p>
1555 </dd>
1556
1557 <dt><code>Port_Binding</code></dt>
1558 <dd>
1559 <p>
1560 <code>Authorization</code>: disabled (all clients are considered
1561 authorized). A future enhancement may add columns (or keys to
1562 <code>external_ids</code>) in order to control which chassis are
1563 allowed to bind each port.
1564 </p>
1565 <p>
1566 <code>Insert/Delete</code>: row insertion/deletion are not permitted
1567 (ovn-northd maintains the rows in this table).
1568 </p>
1569 <p>
1570 <code>Update</code>: Only modifications to the <code>chassis</code>
1571 column are permitted.
1572 </p>
1573 </dd>
1574
1575 <dt><code>MAC_Binding</code></dt>
1576 <dd>
1577 <p>
1578 <code>Authorization</code>: disabled (all clients are considered
1579 to be authorized).
1580 </p>
1581 <p>
1582 <code>Insert/Delete</code>: row insertion/deletion are permitted.
1583 </p>
1584 <p>
1585 <code>Update</code>: The columns <code>logical_port</code>,
1586 <code>ip</code>, <code>mac</code>, and <code>datapath</code> may be
1587 modified by ovn-controller.
1588 </p>
1589 </dd>
1590 </dl>
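
<p>
The roles and permissions described above are stored in the southbound
database itself, so they can be examined with the generic database
commands, for example "<code>ovn-sbctl list RBAC_Role</code>" and
"<code>ovn-sbctl list RBAC_Permission</code>". (The table names here
follow the RBAC schema additions described earlier; consult
<code>ovn-sb</code>(5) for the authoritative schema.)
</p>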
1591
1592 <p>
1593 Enabling RBAC for ovn-controller connections to the southbound database
1594 requires the following steps:
1595 </p>
1596
1597 <ol>
1598 <li>
1599 Creating SSL certificates for each chassis with the certificate CN field
1600 set to the chassis name (e.g. for a chassis with
1601 <code>external-ids:system-id=chassis-1</code>, via the command
1602 "<code>ovs-pki -B 1024 -u req+sign chassis-1 switch</code>").
1603 </li>
1604 <li>
1605 Configuring each ovn-controller to use SSL when connecting to the
1606 southbound database (e.g. via "<code>ovs-vsctl set open .
1607 external-ids:ovn-remote=ssl:x.x.x.x:6642</code>").
1608 </li>
1609 <li>
1610 Configuring a southbound database SSL remote with "ovn-controller" role
1611 (e.g. via "<code>ovn-sbctl set-connection role=ovn-controller
1612 pssl:6642</code>").
1613 </li>
1614 </ol>
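
<p>
After these steps, the configured connections, including their roles, can
be double-checked with "<code>ovn-sbctl get-connection</code>" (mentioned
here only as a verification hint; the exact output format may vary
between versions).
</p>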
1615
1616 <h1>Design Decisions</h1>
1617
1618 <h2>Tunnel Encapsulations</h2>
1619
1620 <p>
1621 OVN annotates logical network packets that it sends from one hypervisor to
1622 another with the following three pieces of metadata, which are encoded in
1623 an encapsulation-specific fashion:
1624 </p>
1625
1626 <ul>
1627 <li>
1628 24-bit logical datapath identifier, from the <code>tunnel_key</code>
1629 column in the OVN Southbound <code>Datapath_Binding</code> table.
1630 </li>
1631
1632 <li>
1633 15-bit logical ingress port identifier. ID 0 is reserved for internal
1634 use within OVN. IDs 1 through 32767, inclusive, may be assigned to
1635 logical ports (see the <code>tunnel_key</code> column in the OVN
1636 Southbound <code>Port_Binding</code> table).
1637 </li>
1638
1639 <li>
1640 16-bit logical egress port identifier. IDs 0 through 32767 have the same
1641 meaning as for logical ingress ports. IDs 32768 through 65535,
1642 inclusive, may be assigned to logical multicast groups (see the
1643 <code>tunnel_key</code> column in the OVN Southbound
1644 <code>Multicast_Group</code> table).
1645 </li>
1646 </ul>
1647
1648 <p>
1649 For hypervisor-to-hypervisor traffic, OVN supports only Geneve and STT
1650 encapsulations, for the following reasons:
1651 </p>
1652
1653 <ul>
1654 <li>
1655 Only STT and Geneve support the large amounts of metadata (over 32 bits
1656 per packet) that OVN uses (as described above).
1657 </li>
1658
1659 <li>
1660 STT and Geneve use randomized UDP or TCP source ports, allowing
1661 efficient distribution among multiple paths in environments that use ECMP
1662 in their underlay.
1663 </li>
1664
1665 <li>
1666 NICs are available to offload STT and Geneve encapsulation and
1667 decapsulation.
1668 </li>
1669 </ul>
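
<p>
The encapsulation that a chassis offers is configured locally through
<code>ovn-controller</code>'s integration with the local Open vSwitch
database, e.g. "<code>ovs-vsctl set open .
external-ids:ovn-encap-type=geneve external-ids:ovn-encap-ip=x.x.x.x</code>"
(a sketch; see <code>ovn-controller</code>(8) for the authoritative
description of these keys).
</p>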
1670
1671 <p>
1672 Due to its flexibility, the preferred encapsulation between hypervisors is
1673 Geneve. For Geneve encapsulation, OVN transmits the logical datapath
1674 identifier in the Geneve VNI.
1675
1676 <!-- Keep the following in sync with ovn/controller/physical.h. -->
1677 OVN transmits the logical ingress and logical egress ports in a TLV with
1678 class 0x0102, type 0x80, and a 32-bit value encoded as follows, from MSB to
1679 LSB:
1680 </p>
1681
1682 <diagram>
1683 <header name="">
1684 <bits name="rsv" above="1" below="0" width=".25"/>
1685 <bits name="ingress port" above="15" width=".75"/>
1686 <bits name="egress port" above="16" width=".75"/>
1687 </header>
1688 </diagram>
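
<p>
For example, a packet with logical ingress port <code>tunnel_key</code> 5
and logical egress port <code>tunnel_key</code> 9 would carry the 32-bit
option value <code>(5 &lt;&lt; 16) | 9</code>, i.e.
<code>0x00050009</code> (a purely illustrative calculation based on the
layout above).
</p>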
1689
1690 <p>
1691 Environments whose NICs lack Geneve offload may prefer STT encapsulation
1692 for performance reasons. For STT encapsulation, OVN encodes all three
1693 pieces of logical metadata in the STT 64-bit tunnel ID as follows, from MSB
1694 to LSB:
1695 </p>
1696
1697 <diagram>
1698 <header name="">
1699 <bits name="reserved" above="9" below="0" width=".5"/>
1700 <bits name="ingress port" above="15" width=".75"/>
1701 <bits name="egress port" above="16" width=".75"/>
1702 <bits name="datapath" above="24" width="1.25"/>
1703 </header>
1704 </diagram>
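
<p>
Following the same illustrative example, with logical datapath
<code>tunnel_key</code> 3, logical ingress port 5, and logical egress
port 9, the STT tunnel ID would be <code>(5 &lt;&lt; 40) | (9 &lt;&lt;
24) | 3</code>, i.e. <code>0x0000050009000003</code>.
</p>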
1705
1706 <p>
1707 For connecting to gateways, in addition to Geneve and STT, OVN supports
1708 VXLAN, because only VXLAN support is common on top-of-rack (ToR) switches.
1709 Currently, gateways have a feature set that matches the capabilities as
1710 defined by the VTEP schema, so fewer bits of metadata are necessary. In
1711 the future, gateways that do not support encapsulations with large amounts
1712 of metadata may continue to have a reduced feature set.
1713 </p>
1714 </manpage>