1<?xml version="1.0" encoding="utf-8"?>
2<manpage program="ovn-architecture" section="7" title="OVN Architecture">
3 <h1>Name</h1>
4 <p>ovn-architecture -- Open Virtual Network architecture</p>
5
6 <h1>Description</h1>
7
8 <p>
9 OVN, the Open Virtual Network, is a system to support virtual network
10 abstraction. OVN complements the existing capabilities of OVS to add
11 native support for virtual network abstractions, such as virtual L2 and L3
12 overlays and security groups. Services such as DHCP are also desirable
13 features. Just like OVS, OVN's design goal is to have a production-quality
14 implementation that can operate at significant scale.
15 </p>
16
17 <p>
18 An OVN deployment consists of several components:
19 </p>
20
21 <ul>
22 <li>
23 <p>
24 A <dfn>Cloud Management System</dfn> (<dfn>CMS</dfn>), which is
25 OVN's ultimate client (via its users and administrators). OVN
26 integration requires installing a CMS-specific plugin and
27 related software (see below). OVN initially targets OpenStack
28 as CMS.
29 </p>
30
31 <p>
32 We generally speak of ``the'' CMS, but one can imagine scenarios in
33 which multiple CMSes manage different parts of an OVN deployment.
34 </p>
35 </li>
36
37 <li>
38 An OVN Database physical or virtual node (or, eventually, cluster)
39 installed in a central location.
40 </li>
41
42 <li>
43 One or more (usually many) <dfn>hypervisors</dfn>. Hypervisors must run
44 Open vSwitch and implement the interface described in
45 <code>IntegrationGuide.rst</code> in the OVS source tree. Any hypervisor
46 platform supported by Open vSwitch is acceptable.
47 </li>
48
49 <li>
50 <p>
51 Zero or more <dfn>gateways</dfn>. A gateway extends a tunnel-based
52 logical network into a physical network by bidirectionally forwarding
53 packets between tunnels and a physical Ethernet port. This allows
54 non-virtualized machines to participate in logical networks. A gateway
55 may be a physical host, a virtual machine, or an ASIC-based hardware
56 switch that supports the <code>vtep</code>(5) schema.
57 </p>
58
59 <p>
60 Hypervisors and gateways are together called <dfn>transport nodes</dfn>
61 or <dfn>chassis</dfn>.
62 </p>
63 </li>
64 </ul>
65
66 <p>
67 The diagram below shows how the major components of OVN and related
68 software interact. Starting at the top of the diagram, we have:
69 </p>
70
71 <ul>
72 <li>
73 The Cloud Management System, as defined above.
74 </li>
75
76 <li>
77 <p>
78 The <dfn>OVN/CMS Plugin</dfn> is the component of the CMS that
79 interfaces to OVN. In OpenStack, this is a Neutron plugin.
80 The plugin's main purpose is to translate the CMS's notion of logical
81 network configuration, stored in the CMS's configuration database in a
82 CMS-specific format, into an intermediate representation understood by
83 OVN.
84 </p>
85
86 <p>
87 This component is necessarily CMS-specific, so a new plugin needs to be
88 developed for each CMS that is integrated with OVN. All of the
89 components below this one in the diagram are CMS-independent.
90 </p>
91 </li>
92
93 <li>
94 <p>
95 The <dfn>OVN Northbound Database</dfn> receives the intermediate
96 representation of logical network configuration passed down by the
97 OVN/CMS Plugin. The database schema is meant to be ``impedance
98 matched'' with the concepts used in a CMS, so that it directly supports
99 notions of logical switches, routers, ACLs, and so on. See
100 <code>ovn-nb</code>(5) for details.
101 </p>
102
103 <p>
104 The OVN Northbound Database has only two clients: the OVN/CMS Plugin
105 above it and <code>ovn-northd</code> below it.
106 </p>
107 </li>
108
109 <li>
110 <code>ovn-northd</code>(8) connects to the OVN Northbound Database
111 above it and the OVN Southbound Database below it. It translates the
112 logical network configuration in terms of conventional network
113 concepts, taken from the OVN Northbound Database, into logical
114 datapath flows in the OVN Southbound Database below it.
115 </li>
116
117 <li>
118 <p>
119 The <dfn>OVN Southbound Database</dfn> is the center of the system.
120 Its clients are <code>ovn-northd</code>(8) above it and
121 <code>ovn-controller</code>(8) on every transport node below it.
122 </p>
123
124 <p>
125 The OVN Southbound Database contains three kinds of data: <dfn>Physical
126 Network</dfn> (PN) tables that specify how to reach hypervisor and
127 other nodes, <dfn>Logical Network</dfn> (LN) tables that describe the
128 logical network in terms of ``logical datapath flows,'' and
129 <dfn>Binding</dfn> tables that link logical network components'
130 locations to the physical network. The hypervisors populate the PN and
131 Port_Binding tables, whereas <code>ovn-northd</code>(8) populates the
132 LN tables.
133 </p>
134
135 <p>
136 OVN Southbound Database performance must scale with the number of
137 transport nodes. This will likely require some work on
138 <code>ovsdb-server</code>(1) as we encounter bottlenecks.
139 Clustering for availability may be needed.
140 </p>
141 </li>
142 </ul>
143
144 <p>
145 The remaining components are replicated onto each hypervisor:
146 </p>
147
148 <ul>
149 <li>
150 <code>ovn-controller</code>(8) is OVN's agent on each hypervisor and
151 software gateway. Northbound, it connects to the OVN Southbound
152 Database to learn about OVN configuration and status and to
153 populate the PN table and the <code>Chassis</code> column in
154 the <code>Binding</code> table with the hypervisor's status.
155 Southbound, it connects to <code>ovs-vswitchd</code>(8) as an
156 OpenFlow controller, for control over network traffic, and to the
157 local <code>ovsdb-server</code>(1) to allow it to monitor and
158 control Open vSwitch configuration.
159 </li>
160
161 <li>
162 <code>ovs-vswitchd</code>(8) and <code>ovsdb-server</code>(1) are
163 conventional components of Open vSwitch.
164 </li>
165 </ul>
166
167 <pre fixed="yes">
168 CMS
169 |
170 |
171 +-----------|-----------+
172 | | |
173 | OVN/CMS Plugin |
174 | | |
175 | | |
176 | OVN Northbound DB |
177 | | |
178 | | |
179 | ovn-northd |
180 | | |
181 +-----------|-----------+
182 |
183 |
184 +-------------------+
185 | OVN Southbound DB |
186 +-------------------+
187 |
188 |
189 +------------------+------------------+
190 | | |
191 HV 1 | | HV n |
192+---------------|---------------+ . +---------------|---------------+
193| | | . | | |
194| ovn-controller | . | ovn-controller |
195| | | | . | | | |
196| | | | | | | |
197| ovs-vswitchd ovsdb-server | | ovs-vswitchd ovsdb-server |
198| | | |
199+-------------------------------+ +-------------------------------+
200 </pre>
201
202 <h2>Information Flow in OVN</h2>
203
204 <p>
205 Configuration data in OVN flows from north to south. The CMS, through its
206 OVN/CMS plugin, passes the logical network configuration to
207 <code>ovn-northd</code> via the northbound database. In turn,
208 <code>ovn-northd</code> compiles the configuration into a lower-level form
209 and passes it to all of the chassis via the southbound database.
210 </p>
211
212 <p>
213 Status information in OVN flows from south to north. OVN currently
214 provides only a few forms of status information. First,
215 <code>ovn-northd</code> populates the <code>up</code> column in the
216 northbound <code>Logical_Switch_Port</code> table: if a logical port's
217 <code>chassis</code> column in the southbound <code>Port_Binding</code>
218 table is nonempty, it sets <code>up</code> to <code>true</code>, otherwise
219 to <code>false</code>. This allows the CMS to detect when a VM's
220 networking has come up.
221 </p>
222
223 <p>
224 Second, OVN provides feedback to the CMS on the realization of its
225 configuration, that is, whether the configuration provided by the CMS has
226 taken effect. This feature requires the CMS to participate in a sequence
227 number protocol, which works the following way:
228 </p>
229
230 <ol>
231 <li>
232 When the CMS updates the configuration in the northbound database, as
233 part of the same transaction, it increments the value of the
234 <code>nb_cfg</code> column in the <code>NB_Global</code> table. (This is
235 only necessary if the CMS wants to know when the configuration has been
236 realized.)
237 </li>
238
239 <li>
240 When <code>ovn-northd</code> updates the southbound database based on a
241 given snapshot of the northbound database, it copies <code>nb_cfg</code>
242 from northbound <code>NB_Global</code> into the southbound database
243 <code>SB_Global</code> table, as part of the same transaction. (Thus, an
244 observer monitoring both databases can determine when the southbound
245 database is caught up with the northbound.)
246 </li>
247
248 <li>
249 After <code>ovn-northd</code> receives confirmation from the southbound
250 database server that its changes have committed, it updates
251 <code>sb_cfg</code> in the northbound <code>NB_Global</code> table to the
252 <code>nb_cfg</code> version that was pushed down. (Thus, the CMS or
253 another observer can determine when the southbound database is caught up
254 without a connection to the southbound database.)
255 </li>
256
257 <li>
258 The <code>ovn-controller</code> process on each chassis receives the
259 updated southbound database, with the updated <code>nb_cfg</code>. This
260 process in turn updates the physical flows installed in the chassis's
261 Open vSwitch instances. When it receives confirmation from Open vSwitch
262 that the physical flows have been updated, it updates <code>nb_cfg</code>
263 in its own <code>Chassis</code> record in the southbound database.
264 </li>
265
266 <li>
267 <code>ovn-northd</code> monitors the <code>nb_cfg</code> column in all of
268 the <code>Chassis</code> records in the southbound database. It keeps
269 track of the minimum value among all the records and copies it into the
270 <code>hv_cfg</code> column in the northbound <code>NB_Global</code>
271 table. (Thus, the CMS or another observer can determine when all of the
272 hypervisors have caught up to the northbound configuration.)
273 </li>
274 </ol>
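  <p>
    As a rough illustration (not part of the protocol itself, and assuming a
    reasonably recent <code>ovn-nbctl</code>), an administrator or script can
    exercise and observe this mechanism directly; the column names are the
    ones described above:
  </p>

  <pre fixed="yes">
# Bump nb_cfg and wait until every chassis reports that it has caught up.
ovn-nbctl --wait=hv sync

# Inspect the raw counters: sb_cfg and hv_cfg lag behind nb_cfg until the
# southbound database and the hypervisors, respectively, catch up.
ovn-nbctl get NB_Global . nb_cfg sb_cfg hv_cfg
  </pre>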
275
276 <h2>Chassis Setup</h2>
277
278 <p>
279 Each chassis in an OVN deployment must be configured with an Open vSwitch
280 bridge dedicated for OVN's use, called the <dfn>integration bridge</dfn>.
281 System startup scripts may create this bridge prior to starting
282 <code>ovn-controller</code> if desired. If this bridge does not exist when
283 ovn-controller starts, it will be created automatically with the default
284 configuration suggested below. The ports on the integration bridge include:
285 </p>
286
287 <ul>
288 <li>
289 On any chassis, tunnel ports that OVN uses to maintain logical network
290 connectivity. <code>ovn-controller</code> adds, updates, and removes
291 these tunnel ports.
292 </li>
293
294 <li>
295 On a hypervisor, any VIFs that are to be attached to logical networks.
296 The hypervisor itself, or the integration between Open vSwitch and the
297 hypervisor (described in <code>IntegrationGuide.rst</code>) takes care of
298 this. (This is not part of OVN or new to OVN; this is pre-existing
299 integration work that has already been done on hypervisors that support
300 OVS.)
301 </li>
302
303 <li>
304 On a gateway, the physical port used for logical network connectivity.
305 System startup scripts add this port to the bridge prior to starting
306 <code>ovn-controller</code>. This can be a patch port to another bridge,
307 instead of a physical port, in more sophisticated setups.
308 </li>
309 </ul>
310
311 <p>
312 Other ports should not be attached to the integration bridge. In
313 particular, physical ports attached to the underlay network (as opposed to
314 gateway ports, which are physical ports attached to logical networks) must
315 not be attached to the integration bridge. Underlay physical ports should
316 instead be attached to a separate Open vSwitch bridge (they need not be
317 attached to any bridge at all, in fact).
318 </p>
319
320 <p>
321 The integration bridge should be configured as described below.
322 The effect of each of these settings is documented in
323 <code>ovs-vswitchd.conf.db</code>(5):
324 </p>
325
326 <!-- Keep the following in sync with create_br_int() in
327 ovn/controller/ovn-controller.c. -->
328 <dl>
329 <dt><code>fail-mode=secure</code></dt>
330 <dd>
331 Avoids switching packets between isolated logical networks before
332 <code>ovn-controller</code> starts up. See <code>Controller Failure
333 Settings</code> in <code>ovs-vsctl</code>(8) for more information.
334 </dd>
335
336 <dt><code>other-config:disable-in-band=true</code></dt>
337 <dd>
338 Suppresses in-band control flows for the integration bridge. It would be
339 unusual for such flows to show up anyway, because OVN uses a local
340 controller (over a Unix domain socket) instead of a remote controller.
341 It's possible, however, for some other bridge in the same system to have
342 an in-band remote controller, and in that case this suppresses the flows
343 that in-band control would ordinarily set up. Refer to the documentation
344 for more information.
345 </dd>
346 </dl>
347
348 <p>
349 The customary name for the integration bridge is <code>br-int</code>, but
350 another name may be used.
351 </p>
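  <p>
    As a sketch (remember that <code>ovn-controller</code> can create the
    bridge itself, as noted above), an integration bridge with these settings
    might be created by hand with:
  </p>

  <pre fixed="yes">
ovs-vsctl add-br br-int \
    -- set-fail-mode br-int secure \
    -- set Bridge br-int other_config:disable-in-band=true
  </pre>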
352
353 <h2>Logical Networks</h2>
354
355 <p>
356 A <dfn>logical network</dfn> implements the same concepts as physical
357 networks, but it is insulated from the physical network with tunnels or
358 other encapsulations. This allows logical networks to have separate IP and
359 other address spaces that overlap, without conflicting, with those used for
360 physical networks. Logical network topologies can be arranged without
361 regard for the topologies of the physical networks on which they run.
362 </p>
363
364 <p>
365 Logical network concepts in OVN include:
366 </p>
367
368 <ul>
369 <li>
370 <dfn>Logical switches</dfn>, the logical version of Ethernet switches.
371 </li>
372
373 <li>
374 <dfn>Logical routers</dfn>, the logical version of IP routers. Logical
375 switches and routers can be connected into sophisticated topologies.
376 </li>
377
378 <li>
379 <dfn>Logical datapaths</dfn> are the logical version of an OpenFlow
380 switch. Logical switches and routers are both implemented as logical
381 datapaths.
382 </li>
383
384 <li>
385 <p>
386 <dfn>Logical ports</dfn> represent the points of connectivity in and
387 out of logical switches and logical routers. Some common types of
388 logical ports are:
389 </p>
390
391 <ul>
392 <li>
393 Logical ports representing VIFs.
394 </li>
395
396 <li>
397 <dfn>Localnet ports</dfn> represent the points of connectivity
398 between logical switches and the physical network. They are
399 implemented as OVS patch ports between the integration bridge
400 and the separate Open vSwitch bridge that underlay physical
401 ports attach to.
402 </li>
403
404 <li>
405 <dfn>Logical patch ports</dfn> represent the points of
406 connectivity between logical switches and logical routers, and
407 in some cases between peer logical routers. There is a pair of
408 logical patch ports at each such point of connectivity, one on
409 each side.
410 </li>
411 <li>
412 <dfn>Localport ports</dfn> represent the points of local
413 connectivity between logical switches and VIFs. These ports are
414 present in every chassis (not bound to any particular one) and
415 traffic from them will never go through a tunnel. A
416 <code>localport</code> is expected to only generate traffic destined
417 for a local destination, typically in response to a request it
418 received.
419 One use case is how OpenStack Neutron uses a <code>localport</code>
420 port for serving metadata to VMs residing on every hypervisor. A
421 metadata proxy process is attached to this port on every host and all
422 VMs within the same network will reach it at the same IP/MAC address
423 without any traffic being sent over a tunnel. Further details can be
424 seen at https://docs.openstack.org/developer/networking-ovn/design/metadata_api.html.
425 </li>
426 </ul>
427 </li>
428 </ul>
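  <p>
    As a brief sketch of how these concepts appear in the northbound database
    (the names <code>sw0</code>, <code>lr0</code>, and <code>lrp0</code> are
    hypothetical), a logical switch, a logical router, and the logical patch
    ports connecting them might be created with:
  </p>

  <pre fixed="yes">
ovn-nbctl ls-add sw0                      # a logical switch
ovn-nbctl lr-add lr0                      # a logical router
ovn-nbctl lrp-add lr0 lrp0 00:00:00:00:ff:01 192.168.0.1/24
ovn-nbctl lsp-add sw0 sw0-lrp0            # switch side of the patch
ovn-nbctl lsp-set-type sw0-lrp0 router
ovn-nbctl lsp-set-options sw0-lrp0 router-port=lrp0
ovn-nbctl lsp-set-addresses sw0-lrp0 router
  </pre>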
429
430 <h2>Life Cycle of a VIF</h2>
431
432 <p>
433 Tables and their schemas presented in isolation are difficult to
434 understand. Here's an example.
435 </p>
436
437 <p>
438 A VIF on a hypervisor is a virtual network interface attached either
439 to a VM or a container running directly on that hypervisor (this is
440 different from the interface of a container running inside a VM).
441 </p>
442
443 <p>
444 The steps in this example refer often to details of the OVN and OVN
445 Northbound database schemas. Please see <code>ovn-sb</code>(5) and
446 <code>ovn-nb</code>(5), respectively, for the full story on these
447 databases.
448 </p>
449
450 <ol>
451 <li>
452 A VIF's life cycle begins when a CMS administrator creates a new VIF
453 using the CMS user interface or API and adds it to a switch (one
454 implemented by OVN as a logical switch). The CMS updates its own
455 configuration. This includes associating unique, persistent identifier
456 <var>vif-id</var> and Ethernet address <var>mac</var> with the VIF.
457 </li>
458
459 <li>
460 The CMS plugin updates the OVN Northbound database to include the new
461 VIF, by adding a row to the <code>Logical_Switch_Port</code>
462 table. In the new row, <code>name</code> is <var>vif-id</var>,
463 <code>mac</code> is <var>mac</var>, <code>switch</code> points to
464 the OVN logical switch's Logical_Switch record, and other columns
465 are initialized appropriately.
466 </li>
467
468 <li>
469 <code>ovn-northd</code> receives the OVN Northbound database update. In
470 turn, it makes the corresponding updates to the OVN Southbound database,
471 by adding rows to the OVN Southbound database <code>Logical_Flow</code>
472 table to reflect the new port, e.g. add a flow to recognize that packets
473 destined to the new port's MAC address should be delivered to it, and
474 update the flow that delivers broadcast and multicast packets to include
475 the new port. It also creates a record in the <code>Binding</code> table
476 and populates all its columns except the column that identifies the
477 <code>chassis</code>.
478 </li>
479
480 <li>
481 On every hypervisor, <code>ovn-controller</code> receives the
482 <code>Logical_Flow</code> table updates that <code>ovn-northd</code> made
483 in the previous step. As long as the VM that owns the VIF is powered
484 off, <code>ovn-controller</code> cannot do much; it cannot, for example,
485 arrange to send packets to or receive packets from the VIF, because the
486 VIF does not actually exist anywhere.
487 </li>
488
489 <li>
490 Eventually, a user powers on the VM that owns the VIF. On the hypervisor
491 where the VM is powered on, the integration between the hypervisor and
492 Open vSwitch (described in <code>IntegrationGuide.rst</code>) adds the VIF
493 to the OVN integration bridge and stores <var>vif-id</var> in
494 <code>external_ids</code>:<code>iface-id</code> to indicate that the
495 interface is an instantiation of the new VIF. (None of this code is new
496 in OVN; this is pre-existing integration work that has already been done
497 on hypervisors that support OVS.)
498 </li>
499
500 <li>
501 On the hypervisor where the VM is powered on, <code>ovn-controller</code>
502 notices <code>external_ids</code>:<code>iface-id</code> in the new
503 Interface. In response, in the OVN Southbound DB, it updates the
504 <code>Binding</code> table's <code>chassis</code> column for the
505 row that links the logical port from <code>external_ids</code>:<code>
506 iface-id</code> to the hypervisor. Afterward, <code>ovn-controller</code>
507 updates the local hypervisor's OpenFlow tables so that packets to and from
508 the VIF are properly handled.
509 </li>
510
511 <li>
512 Some CMS systems, including OpenStack, fully start a VM only when its
513 networking is ready. To support this, <code>ovn-northd</code> notices
514 the <code>chassis</code> column updated for the row in
515 the <code>Binding</code> table and pushes this upward by updating the
516 <ref column="up" table="Logical_Switch_Port" db="OVN_NB"/> column
517 in the OVN Northbound database's <ref table="Logical_Switch_Port"
518 db="OVN_NB"/> table to indicate that the VIF is now up. The CMS,
519 if it uses this feature, can then react by allowing the VM's
520 execution to proceed.
521 </li>
522
523 <li>
524 On every hypervisor but the one where the VIF resides,
9fb4636f 525 <code>ovn-controller</code> notices the completely populated row in the
e387e3e8 526 <code>Binding</code> table. This provides <code>ovn-controller</code>
527 the physical location of the logical port, so each instance updates the
528 OpenFlow tables of its switch (based on logical datapath flows in the OVN
529 DB <code>Logical_Flow</code> table) so that packets to and from the VIF
530 can be properly handled via tunnels.
531 </li>
532
533 <li>
534 Eventually, a user powers off the VM that owns the VIF. On the
535 hypervisor where the VM was powered off, the VIF is deleted from the OVN
536 integration bridge.
537 </li>
538
539 <li>
540 On the hypervisor where the VM was powered off,
541 <code>ovn-controller</code> notices that the VIF was deleted. In
542 response, it removes the <code>Chassis</code> column content in the
543 <code>Binding</code> table for the logical port.
544 </li>
545
546 <li>
547 On every hypervisor, <code>ovn-controller</code> notices the empty
548 <code>Chassis</code> column in the <code>Binding</code> table's row
549 for the logical port. This means that <code>ovn-controller</code> no
550 longer knows the physical location of the logical port, so each instance
551 updates its OpenFlow table to reflect that.
552 </li>
553
554 <li>
555 Eventually, when the VIF (or its entire VM) is no longer needed by
556 anyone, an administrator deletes the VIF using the CMS user interface or
557 API. The CMS updates its own configuration.
558 </li>
559
560 <li>
561 The CMS plugin removes the VIF from the OVN Northbound database,
562 by deleting its row in the <code>Logical_Switch_Port</code> table.
563 </li>
564
565 <li>
566 <code>ovn-northd</code> receives the OVN Northbound update and in turn
567 updates the OVN Southbound database accordingly, by removing or updating
568 the rows from the OVN Southbound database <code>Logical_Flow</code> table
569 and <code>Binding</code> table that were related to the now-destroyed
570 VIF.
571 </li>
572
573 <li>
574 On every hypervisor, <code>ovn-controller</code> receives the
575 <code>Logical_Flow</code> table updates that <code>ovn-northd</code> made
576 in the previous step. <code>ovn-controller</code> updates OpenFlow
577 tables to reflect the update, although there may not be much to do, since
578 the VIF had already become unreachable when it was removed from the
579 <code>Binding</code> table in a previous step.
580 </li>
581 </ol>
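  <p>
    To make steps 2 and 5 above concrete, the following sketch shows roughly
    equivalent manual commands, using hypothetical names (<code>sw0</code>,
    <code>vif1</code>) and a made-up MAC address. A real CMS plugin performs
    the northbound part over the OVSDB protocol rather than with
    <code>ovn-nbctl</code>, and the hypervisor integration normally adds the
    port to the integration bridge:
  </p>

  <pre fixed="yes">
# Step 2: the CMS plugin's northbound update, approximately.
ovn-nbctl lsp-add sw0 vif1
ovn-nbctl lsp-set-addresses vif1 "00:00:00:00:00:01"

# Step 5: what the hypervisor integration does when the VM powers on.
ovs-vsctl add-port br-int vif1 \
    -- set Interface vif1 external_ids:iface-id=vif1
  </pre>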
582
583 <h2>Life Cycle of a Container Interface Inside a VM</h2>
584
585 <p>
586 OVN provides virtual network abstractions by converting information
587 written in OVN_NB database to OpenFlow flows in each hypervisor. Secure
588 virtual networking for multi-tenants can only be provided if OVN controller
589 is the only entity that can modify flows in Open vSwitch. When the
590 Open vSwitch integration bridge resides in the hypervisor, it is a
591 fair assumption to make that tenant workloads running inside VMs cannot
592 make any changes to Open vSwitch flows.
593 </p>
594
595 <p>
596 If the infrastructure provider trusts the applications inside the
597 containers not to break out and modify the Open vSwitch flows, then
598 containers can be run in hypervisors. This is also the case when
599 containers are run inside the VMs and Open vSwitch integration bridge
600 with flows added by OVN controller resides in the same VM. For both
601 the above cases, the workflow is the same as explained with an example
602 in the previous section ("Life Cycle of a VIF").
603 </p>
604
605 <p>
606 This section talks about the life cycle of a container interface (CIF)
607 when containers are created in the VMs and the Open vSwitch integration
608 bridge resides inside the hypervisor. In this case, even if a container
609 application breaks out, other tenants are not affected because the
610 containers running inside the VMs cannot modify the flows in the
611 Open vSwitch integration bridge.
612 </p>
613
614 <p>
615 When multiple containers are created inside a VM, there are multiple
616 CIFs associated with them. The network traffic associated with these
617 CIFs needs to reach the Open vSwitch integration bridge running in the
618 hypervisor for OVN to support virtual network abstractions. OVN should
619 also be able to distinguish network traffic coming from different CIFs.
620 There are two ways to distinguish network traffic of CIFs.
621 </p>
622
623 <p>
624 One way is to provide one VIF for every CIF (1:1 model). This means that
625 there could be a lot of network devices in the hypervisor. This would slow
626 down OVS because of all the additional CPU cycles needed for the management
627 of all the VIFs. It would also mean that the entity creating the
628 containers in a VM should also be able to create the corresponding VIFs in
629 the hypervisor.
630 </p>
631
632 <p>
633 The second way is to provide a single VIF for all the CIFs (1:many model).
634 OVN could then distinguish network traffic coming from different CIFs via
635 a tag written in every packet. OVN uses this mechanism and uses VLAN as
636 the tagging mechanism.
637 </p>
638
639 <ol>
640 <li>
641 A CIF's life cycle begins when a container is spawned inside a VM by
642 either the same CMS that created the VM, a tenant that owns that VM,
643 or even a container orchestration system that is different from the CMS
644 that initially created the VM. Whoever the entity is, it will need to
645 know the <var>vif-id</var> that is associated with the network interface
646 of the VM through which the container interface's network traffic is
647 expected to go. The entity that creates the container interface
648 will also need to choose an unused VLAN inside that VM.
649 </li>
650
651 <li>
652 The container spawning entity (either directly or through the CMS that
653 manages the underlying infrastructure) updates the OVN Northbound
654 database to include the new CIF, by adding a row to the
655 <code>Logical_Switch_Port</code> table. In the new row,
656 <code>name</code> is any unique identifier,
657 <code>parent_name</code> is the <var>vif-id</var> of the VM
658 through which the CIF's network traffic is expected to go,
659 and the <code>tag</code> is the VLAN tag that identifies the
660 network traffic of that CIF.
661 </li>
662
663 <li>
664 <code>ovn-northd</code> receives the OVN Northbound database update. In
665 turn, it makes the corresponding updates to the OVN Southbound database,
666 by adding rows to the OVN Southbound database's <code>Logical_Flow</code>
667 table to reflect the new port and also by creating a new row in the
668 <code>Binding</code> table and populating all its columns except the
669 column that identifies the <code>chassis</code>.
670 </li>
671
672 <li>
673 On every hypervisor, <code>ovn-controller</code> subscribes to the
674 changes in the <code>Binding</code> table. When a new row is created
675 by <code>ovn-northd</code> that includes a value in the
676 <code>parent_port</code> column of the <code>Binding</code> table, the
677 <code>ovn-controller</code> in the hypervisor whose OVN integration bridge
678 has that same value in <var>vif-id</var> in
2f4962f1 679 <code>external_ids</code>:<code>iface-id</code>
680 updates the local hypervisor's OpenFlow tables so that packets to and
681 from the VIF with the particular VLAN <code>tag</code> are properly
682 handled. Afterward it updates the <code>chassis</code> column of
683 the <code>Binding</code> table to reflect the physical location.
684 </li>
685
686 <li>
687 One can only start the application inside the container after the
688 underlying network is ready. To support this, <code>ovn-northd</code>
689 notices the updated <code>chassis</code> column in the <code>Binding</code>
690 table and updates the <ref column="up" table="Logical_Switch_Port"
691 db="OVN_NB"/> column in the OVN Northbound database's
692 <ref table="Logical_Switch_Port" db="OVN_NB"/> table to indicate that the
693 CIF is now up. The entity responsible for starting the container application
694 queries this value and starts the application.
695 </li>
696
697 <li>
698 Eventually the entity that created and started the container stops it.
699 The entity, through the CMS (or directly), deletes its row in the
700 <code>Logical_Switch_Port</code> table.
701 </li>
702
703 <li>
704 <code>ovn-northd</code> receives the OVN Northbound update and in turn
705 updates the OVN Southbound database accordingly, by removing or updating
706 the rows from the OVN Southbound database <code>Logical_Flow</code> table
707 that were related to the now-destroyed CIF. It also deletes the row in
708 the <code>Binding</code> table for that CIF.
709 </li>
710
711 <li>
712 On every hypervisor, <code>ovn-controller</code> receives the
713 <code>Logical_Flow</code> table updates that <code>ovn-northd</code> made
714 in the previous step. <code>ovn-controller</code> updates OpenFlow
715 tables to reflect the update.
716 </li>
717 </ol>
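  <p>
    The northbound update in step 2 above corresponds roughly to the
    following sketch, with hypothetical names (logical switch
    <code>sw0</code>, parent VIF <code>vif1</code>, CIF port
    <code>cif1</code>) and VLAN tag 42:
  </p>

  <pre fixed="yes">
ovn-nbctl lsp-add sw0 cif1 vif1 42
ovn-nbctl lsp-set-addresses cif1 "00:00:00:00:00:02"
  </pre>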
718
719 <h2>Architectural Physical Life Cycle of a Packet</h2>
720
721 <p>
722 This section describes how a packet travels from one virtual machine or
723 container to another through OVN. This description focuses on the physical
724 treatment of a packet; for a description of the logical life cycle of a
725 packet, please refer to the <code>Logical_Flow</code> table in
726 <code>ovn-sb</code>(5).
727 </p>
728
729 <p>
730 This section mentions several data and metadata fields, for clarity
731 summarized here:
732 </p>
733
734 <dl>
735 <dt>tunnel key</dt>
736 <dd>
737 When OVN encapsulates a packet in Geneve or another tunnel, it attaches
738 extra data to it to allow the receiving OVN instance to process it
739 correctly. This takes different forms depending on the particular
740 encapsulation, but in each case we refer to it here as the ``tunnel
741 key.'' See <code>Tunnel Encapsulations</code>, below, for details.
742 </dd>
743
744 <dt>logical datapath field</dt>
745 <dd>
746 A field that denotes the logical datapath through which a packet is being
747 processed.
748 <!-- Keep the following in sync with MFF_LOG_DATAPATH in
749 ovn/lib/logical-fields.h. -->
750 OVN uses the field that OpenFlow 1.1+ simply (and confusingly) calls
751 ``metadata'' to store the logical datapath. (This field is passed across
752 tunnels as part of the tunnel key.)
753 </dd>
754
755 <dt>logical input port field</dt>
756 <dd>
757 <p>
758 A field that denotes the logical port from which the packet
759 entered the logical datapath.
760 <!-- Keep the following in sync with MFF_LOG_INPORT in
761 ovn/lib/logical-fields.h. -->
762 OVN stores this in Open vSwitch extension register number 14.
763 </p>
764
765 <p>
766 Geneve and STT tunnels pass this field as part of the tunnel key.
767 Although VXLAN tunnels do not explicitly carry a logical input port,
768 OVN only uses VXLAN to communicate with gateways that from OVN's
769 perspective consist of only a single logical port, so that OVN can set
770 the logical input port field to this one on ingress to the OVN logical
771 pipeline.
772 </p>
773 </dd>
774
775 <dt>logical output port field</dt>
776 <dd>
777 <p>
778 A field that denotes the logical port from which the packet will
779 leave the logical datapath. This is initialized to 0 at the
780 beginning of the logical ingress pipeline.
781 <!-- Keep the following in sync with MFF_LOG_OUTPORT in
782 ovn/lib/logical-fields.h. -->
783 OVN stores this in Open vSwitch extension register number 15.
784 </p>
785
786 <p>
787 Geneve and STT tunnels pass this field as part of the tunnel key.
788 VXLAN tunnels do not transmit the logical output port field.
789 Since VXLAN tunnels do not carry a logical output port field in
790 the tunnel key, when a packet is received from VXLAN tunnel by
791 an OVN hypervisor, the packet is resubmitted to table 8 to
792 determine the output port(s); when the packet reaches table 32,
793 these packets are resubmitted to table 33 for local delivery by
794 checking a MLF_RCV_FROM_VXLAN flag, which is set when the packet
795 arrives from a VXLAN tunnel.
796 </p>
797 </dd>
798
799 <dt>conntrack zone field for logical ports</dt>
800 <dd>
801 A field that denotes the connection tracking zone for logical ports.
802 The value only has local significance and is not meaningful between
803 chassis. This is initialized to 0 at the beginning of the logical
804 <!-- Keep the following in sync with MFF_LOG_CT_ZONE in
805 ovn/lib/logical-fields.h. -->
806 ingress pipeline. OVN stores this in Open vSwitch extension register
807 number 13.
808 </dd>
809
810 <dt>conntrack zone fields for routers</dt>
811 <dd>
812 Fields that denote the connection tracking zones for routers. These
813 values only have local significance and are not meaningful between
814 chassis. OVN stores the zone information for DNATting in Open vSwitch
815 <!-- Keep the following in sync with MFF_LOG_DNAT_ZONE and
816 MFF_LOG_SNAT_ZONE in ovn/lib/logical-fields.h. -->
817 extension register number 11 and zone information for SNATing in
818 Open vSwitch extension register number 12.
819 </dd>
820
821 <dt>logical flow flags</dt>
822 <dd>
823 The logical flags are intended to handle keeping context between
824 tables in order to decide which rules in subsequent tables are
825 matched. These values only have local significance and are not
826 meaningful between chassis. OVN stores the logical flags in
827 <!-- Keep the following in sync with MFF_LOG_FLAGS in
828 ovn/lib/logical-fields.h. -->
829 Open vSwitch extension register number 10.
830 </dd>
831
832 <dt>VLAN ID</dt>
833 <dd>
834 The VLAN ID is used as an interface between OVN and containers nested
835 inside a VM (see <code>Life Cycle of a container interface inside a
836 VM</code>, above, for more information).
837 </dd>
838 </dl>
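  <p>
    On a hypervisor, these fields can be observed in the OpenFlow flows that
    <code>ovn-controller</code> installs on the integration bridge (a sketch;
    the exact output varies by version and configuration):
  </p>

  <pre fixed="yes">
# The logical datapath appears as "metadata" and the logical input and
# output ports as reg14 and reg15 in the installed flows.
ovs-ofctl -O OpenFlow13 dump-flows br-int | head
  </pre>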
839
840 <p>
841 Initially, a VM or container on the ingress hypervisor sends a packet on a
842 port attached to the OVN integration bridge. Then:
843 </p>
844
845 <ol>
846 <li>
847 <p>
848 OpenFlow table 0 performs physical-to-logical translation. It matches
849 the packet's ingress port. Its actions annotate the packet with
850 logical metadata, by setting the logical datapath field to identify the
851 logical datapath that the packet is traversing and the logical input
00c875d0 852 port field to identify the ingress port. Then it resubmits to table 8
853 to enter the logical ingress pipeline.
854 </p>
855
856 <p>
857 Packets that originate from a container nested within a VM are treated
858 in a slightly different way. The originating container can be
859 distinguished based on the VIF-specific VLAN ID, so the
860 physical-to-logical translation flows additionally match on VLAN ID and
861 the actions strip the VLAN header. Following this step, OVN treats
862 packets from containers just like any other packets.
863 </p>
864
865 <p>
866 Table 0 also processes packets that arrive from other chassis. It
867 distinguishes them from other packets by ingress port, which is a
868 tunnel. As with packets just entering the OVN pipeline, the actions
869 annotate these packets with logical datapath and logical ingress port
870 metadata. In addition, the actions set the logical output port field,
871 which is available because in OVN tunneling occurs after the logical
872 output port is known. These three pieces of information are obtained
873 from the tunnel encapsulation metadata (see <code>Tunnel
874 Encapsulations</code> for encoding details). Then the actions resubmit
875 to table 33 to enter the logical egress pipeline.
876 </p>
877 </li>
878
879 <li>
880 <p>
881 OpenFlow tables 8 through 31 execute the logical ingress pipeline from
882 the <code>Logical_Flow</code> table in the OVN Southbound database.
883 These tables are expressed entirely in terms of logical concepts like
884 logical ports and logical datapaths. A big part of
885 <code>ovn-controller</code>'s job is to translate them into equivalent
886 OpenFlow (in particular it translates the table numbers:
887 <code>Logical_Flow</code> tables 0 through 23 become OpenFlow tables 8
888 through 31).
889 </p>
890
891 <p>
892 Each logical flow maps to one or more OpenFlow flows. An actual packet
893 ordinarily matches only one of these, although in some cases it can
894 match more than one of these flows (which is not a problem because all
895 of them have the same actions). <code>ovn-controller</code> uses the
896 first 32 bits of the logical flow's UUID as the cookie for its OpenFlow
897 flow or flows. (This is not necessarily unique, since the first 32
898 bits of a logical flow's UUID is not necessarily unique.)
899 </p>
900
901 <p>
902 Some logical flows can map to the Open vSwitch ``conjunctive match''
903 extension (see <code>ovs-fields</code>(7)). Flows with a
904 <code>conjunction</code> action use an OpenFlow cookie of 0, because
905 they can correspond to multiple logical flows. The OpenFlow flow for a
906 conjunctive match includes a match on <code>conj_id</code>.
907 </p>
908
909 <p>
910 Some logical flows may not be represented in the OpenFlow tables on a
911 given hypervisor, if they could not be used on that hypervisor. For
912 example, if no VIF in a logical switch resides on a given hypervisor,
913 and the logical switch is not otherwise reachable on that hypervisor
914 (e.g. over a series of hops through logical switches and routers
915 starting from a VIF on the hypervisor), then the logical flow may not
916 be represented there.
917 </p>
918
919 <p>
920 Most OVN actions have fairly obvious implementations in OpenFlow (with
921 OVS extensions), e.g. <code>next;</code> is implemented as
922 <code>resubmit</code>, <code><var>field</var> =
923 <var>constant</var>;</code> as <code>set_field</code>. A few are worth
924 describing in more detail:
925 </p>
926
927 <dl>
928 <dt><code>output:</code></dt>
929 <dd>
930 Implemented by resubmitting the packet to table 32. If the pipeline
931 executes more than one <code>output</code> action, then each one is
932 separately resubmitted to table 32. This can be used to send
933 multiple copies of the packet to multiple ports. (If the packet was
934 not modified between the <code>output</code> actions, and some of the
935 copies are destined to the same hypervisor, then using a logical
936 multicast output port would save bandwidth between hypervisors.)
937 </dd>
938
939 <dt><code>get_arp(<var>P</var>, <var>A</var>);</code></dt>
940 <dt><code>get_nd(<var>P</var>, <var>A</var>);</code></dt>
941 <dd>
942 <p>
943 Implemented by storing arguments into OpenFlow fields, then
944 resubmitting to table 66, which <code>ovn-controller</code>
945 populates with flows generated from the <code>MAC_Binding</code>
946 table in the OVN Southbound database. If there is a match in table
947 66, then its actions store the bound MAC in the Ethernet
948 destination address field.
949 </p>
950
951 <p>
952 (The OpenFlow actions save and restore the OpenFlow fields used for
953 the arguments, so that the OVN actions do not have to be aware of
954 this temporary use.)
955 </p>
956 </dd>
957
958 <dt><code>put_arp(<var>P</var>, <var>A</var>, <var>E</var>);</code></dt>
959 <dt><code>put_nd(<var>P</var>, <var>A</var>, <var>E</var>);</code></dt>
960 <dd>
961 <p>
962 Implemented by storing the arguments into OpenFlow fields, then
963 outputting a packet to <code>ovn-controller</code>, which updates
964 the <code>MAC_Binding</code> table.
965 </p>
966
967 <p>
968 (The OpenFlow actions save and restore the OpenFlow fields used for
969 the arguments, so that the OVN actions do not have to be aware of
970 this temporary use.)
971 </p>
972 </dd>
973 </dl>
974 </li>
975
976 <li>
977 <p>
978 OpenFlow tables 32 through 47 implement the <code>output</code> action
979 in the logical ingress pipeline. Specifically, table 32 handles
980 packets to remote hypervisors, table 33 handles packets to the local
981 hypervisor, and table 34 checks whether packets whose logical ingress
982 and egress port are the same should be discarded.
983 </p>
984
985 <p>
986 Logical patch ports are a special case. Logical patch ports do not
987 have a physical location and effectively reside on every hypervisor.
988 Thus, flow table 33, for output to ports on the local hypervisor,
989 naturally implements output to unicast logical patch ports too.
990 However, applying the same logic to a logical patch port that is part
991 of a logical multicast group yields packet duplication, because each
992 hypervisor that contains a logical port in the multicast group will
993 also output the packet to the logical patch port. Thus, multicast
994 groups implement output to logical patch ports in table 32.
995 </p>
996
997 <p>
998 Each flow in table 32 matches on a logical output port for unicast or
999 multicast logical ports that include a logical port on a remote
1000 hypervisor. Each flow's actions implement sending a packet to the port
1001 it matches. For unicast logical output ports on remote hypervisors,
1002 the actions set the tunnel key to the correct value, then send the
1003 packet on the tunnel port to the correct hypervisor. (When the remote
1004 hypervisor receives the packet, table 0 there will recognize it as a
1005 tunneled packet and pass it along to table 33.) For multicast logical
1006 output ports, the actions send one copy of the packet to each remote
1007 hypervisor, in the same way as for unicast destinations. If a
1008 multicast group includes a logical port or ports on the local
1009 hypervisor, then its actions also resubmit to table 33. Table 32 also
1010 includes:
1011 </p>
1012
1013 <ul>
1014 <li>
1015 A higher-priority rule to match packets received from VXLAN tunnels,
1016 based on flag MLF_RCV_FROM_VXLAN, and resubmit these packets to table
1017 33 for local delivery. Packets received from VXLAN tunnels reach
1018 here because of a lack of logical output port field in the tunnel key
1019 and thus these packets need to be resubmitted to table 8 to
1020 determine the output port.
1021 </li>
1022 <li>
1023 A higher-priority rule to match packets received from ports of type
1024 <code>localport</code>, based on the logical input port, and resubmit
1025 these packets to table 33 for local delivery. Ports of type
1026 <code>localport</code> exist on every hypervisor and by definition
1027 their traffic should never go out through a tunnel.
1028 </li>
1029 <li>
1030 A higher-priority rule to match packets that have the MLF_LOCAL_ONLY
1031 logical flow flag set, and whose destination is a multicast address.
1032 This flag indicates that the packet should not be delivered to remote
1033 hypervisors, even if the multicast destination includes ports on
1034 remote hypervisors. This flag is used when
1035 <code>ovn-controller</code> is the originator of the multicast packet.
1036 Since each <code>ovn-controller</code> instance is originating these
1037 packets, the packets only need to be delivered to local ports.
1038 </li>
1039 <li>
1040 A fallback flow that resubmits to table 33 if there is no other
1041 match.
1042 </li>
1043 </ul>
1044
1045 <p>
1046 Flows in table 33 resemble those in table 32 but for logical ports that
1047 reside locally rather than remotely. For unicast logical output ports
1048 on the local hypervisor, the actions just resubmit to table 34. For
1049 multicast output ports that include one or more logical ports on the
1050 local hypervisor, for each such logical port <var>P</var>, the actions
1051 change the logical output port to <var>P</var>, then resubmit to table
1052 34.
1053 </p>
1054
1055 <p>
1056 A special case is that when a localnet port exists on the datapath,
1057 the remote port is reached by switching through the localnet port. In this
1058 case, instead of adding a flow in table 32 to reach the remote port, a
1059 flow is added in table 33 to switch the logical outport to the localnet
1060 port, and resubmit to table 33 as if it were unicasted to a logical
1061 port on the local hypervisor.
1062 </p>
1063
1064 <p>
1065 Table 34 matches and drops packets for which the logical input and
1066 output ports are the same and the MLF_ALLOW_LOOPBACK flag is not
1067 set. It resubmits other packets to table 40.
1068 </p>
1069 </li>
1070
1071 <li>
1072 <p>
1073 OpenFlow tables 40 through 63 execute the logical egress pipeline from
1074 the <code>Logical_Flow</code> table in the OVN Southbound database.
1075 The egress pipeline can perform a final stage of validation before
1076 packet delivery. Eventually, it may execute an <code>output</code>
1077 action, which <code>ovn-controller</code> implements by resubmitting to
1078 table 64. A packet for which the pipeline never executes
1079 <code>output</code> is effectively dropped (although it may have been
1080 transmitted through a tunnel across a physical network).
1081 </p>
1082
1083 <p>
1084 The egress pipeline cannot change the logical output port or cause
1085 further tunneling.
1086 </p>
1087 </li>
1088
1089 <li>
1090 <p>
1091 Table 64 bypasses OpenFlow loopback when MLF_ALLOW_LOOPBACK is set.
1092 Logical loopback was handled in table 34, but OpenFlow by default also
1093 prevents loopback to the OpenFlow ingress port. Thus, when
1094 MLF_ALLOW_LOOPBACK is set, OpenFlow table 64 saves the OpenFlow ingress
1095 port, sets it to zero, resubmits to table 65 for logical-to-physical
1096 transformation, and then restores the OpenFlow ingress port,
1097 effectively disabling OpenFlow loopback prevention. When
1098 MLF_ALLOW_LOOPBACK is unset, table 64 flow simply resubmits to table
1099 65.
1100 </p>
1101 </li>
1102
1103 <li>
1104 <p>
1105 OpenFlow table 65 performs logical-to-physical translation, the
1106 opposite of table 0. It matches the packet's logical egress port. Its
1107 actions output the packet to the port attached to the OVN integration
1108 bridge that represents that logical port. If the logical egress port
1109 is a container nested with a VM, then before sending the packet the
1110 actions push on a VLAN header with an appropriate VLAN ID.
1111 </p>
1112 </li>
1113 </ol>
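  <p>
    The two halves of this translation can be inspected side by side (a
    sketch; both commands produce voluminous output):
  </p>

  <pre fixed="yes">
ovn-sbctl lflow-list                       # logical flows, per pipeline and table
ovs-ofctl -O OpenFlow13 dump-flows br-int  # the OpenFlow flows they become
  </pre>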
1114
1115 <h2>Logical Routers and Logical Patch Ports</h2>
1116
1117 <p>
1118 Typically logical routers and logical patch ports do not have a
1119 physical location and effectively reside on every hypervisor. This is
1120 the case for logical patch ports between logical routers and logical
1121 switches behind those logical routers, to which VMs (and VIFs) attach.
1122 </p>
1123
1124 <p>
1125 Consider a packet sent from one virtual machine or container to another
1126 VM or container that resides on a different subnet. The packet will
1127 traverse tables 0 to 65 as described in the previous section
1128 <code>Architectural Physical Life Cycle of a Packet</code>, using the
1129 logical datapath representing the logical switch that the sender is
1130 attached to. At table 32, the packet will use the fallback flow that
1131 resubmits locally to table 33 on the same hypervisor. In this case,
1132 all of the processing from table 0 to table 65 occurs on the hypervisor
1133 where the sender resides.
1134 </p>
1135
1136 <p>
1137 When the packet reaches table 65, the logical egress port is a logical
1138 patch port. The implementation in table 65 differs depending on the OVS
1139 version, although the observed behavior is meant to be the same:
1140 </p>
1141
1142 <ul>
1143 <li>
1144 In OVS versions 2.6 and earlier, table 65 outputs to an OVS patch
1145 port that represents the logical patch port. The packet re-enters
1146 the OpenFlow flow table from the OVS patch port's peer in table 0,
1147 which identifies the logical datapath and logical input port based
1148 on the OVS patch port's OpenFlow port number.
1149 </li>
1150
1151 <li>
1152 In OVS versions 2.7 and later, the packet is cloned and resubmitted
1153 directly to the first OpenFlow flow table in the ingress pipeline,
1154 setting the logical ingress port to the peer logical patch port, and
1155 using the peer logical patch port's logical datapath (that
1156 represents the logical router).
1157 </li>
1158 </ul>
1159
1160 <p>
1161 The packet re-enters the ingress pipeline in order to traverse tables
1162 8 to 65 again, this time using the logical datapath representing the
1163 logical router. The processing continues as described in the previous
1164 section <code>Architectural Physical Life Cycle of a Packet</code>.
1165 When the packet reaches table 65, the logical egress port will once
1166 again be a logical patch port. In the same manner as described above,
1167 this logical patch port will cause the packet to be resubmitted to
1168 OpenFlow tables 8 to 65, this time using the logical datapath
1169 representing the logical switch that the destination VM or container
1170 is attached to.
1171 </p>
1172
1173 <p>
1174 The packet traverses tables 8 to 65 a third and final time. If the
1175 destination VM or container resides on a remote hypervisor, then table
1176 32 will send the packet on a tunnel port from the sender's hypervisor
1177 to the remote hypervisor. Finally table 65 will output the packet
1178 directly to the destination VM or container.
1179 </p>
1180
1181 <p>
1182 The following sections describe two exceptions, where logical routers
1183 and/or logical patch ports are associated with a physical location.
1184 </p>
1185
1186 <h3>Gateway Routers</h3>
1187
1188 <p>
1189 A <dfn>gateway router</dfn> is a logical router that is bound to a
1190 physical location. This includes all of the logical patch ports of
1191 the logical router, as well as all of the peer logical patch ports on
1192 logical switches. In the OVN Southbound database, the
1193 <code>Port_Binding</code> entries for these logical patch ports use
1194 the type <code>l3gateway</code> rather than <code>patch</code>, in
1195 order to distinguish that these logical patch ports are bound to a
1196 chassis.
1197 </p>
1198
1199 <p>
1200 When a hypervisor processes a packet on a logical datapath
1201 representing a logical switch, and the logical egress port is a
1202 <code>l3gateway</code> port representing connectivity to a gateway
1203 router, the packet will match a flow in table 32 that sends the
1204 packet on a tunnel port to the chassis where the gateway router
1205 resides. This processing in table 32 is done in the same manner as
1206 for VIFs.
1207 </p>
1208
1209 <p>
1210 Gateway routers are typically used in between distributed logical
1211 routers and physical networks. The distributed logical router and
1212 the logical switches behind it, to which VMs and containers attach,
1213 effectively reside on each hypervisor. The distributed router and
1214 the gateway router are connected by another logical switch, sometimes
1215 referred to as a <code>join</code> logical switch. On the other
1216 side, the gateway router connects to another logical switch that has
1217 a localnet port connecting to the physical network.
1218 </p>
1219
1220 <p>
1221 When using gateway routers, DNAT and SNAT rules are associated with
1222 the gateway router, which provides a central location that can handle
1223 one-to-many SNAT (aka IP masquerading).
1224 </p>
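  <p>
    As a sketch (with a hypothetical router <code>lr0</code> and chassis name
    <code>hv1</code>), a logical router is made a gateway router by binding
    it to a chassis in the northbound database:
  </p>

  <pre fixed="yes">
ovn-nbctl set Logical_Router lr0 options:chassis=hv1
  </pre>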
1225
1226 <h3>Distributed Gateway Ports</h3>
1227
1228 <p>
1229 <dfn>Distributed gateway ports</dfn> are logical router patch ports
1230 that directly connect distributed logical routers to logical
1231 switches with localnet ports.
1232 </p>
1233
1234 <p>
1235 The primary design goal of distributed gateway ports is to allow as
1236 much traffic as possible to be handled locally on the hypervisor
1237 where a VM or container resides. Whenever possible, packets from
1238 the VM or container to the outside world should be processed
1239 completely on that VM's or container's hypervisor, eventually
1240 traversing a localnet port instance on that hypervisor to the
1241 physical network. Whenever possible, packets from the outside
1242 world to a VM or container should be directed through the physical
1243 network directly to the VM's or container's hypervisor, where the
1244 packet will enter the integration bridge through a localnet port.
1245 </p>
1246
1247 <p>
1248 In order to allow for the distributed processing of packets
1249 described in the paragraph above, distributed gateway ports need to
1250 be logical patch ports that effectively reside on every hypervisor,
1251 rather than <code>l3gateway</code> ports that are bound to a
1252 particular chassis. However, the flows associated with distributed
1253 gateway ports often need to be associated with physical locations,
1254 for the following reasons:
1255 </p>
1256
1257 <ul>
1258 <li>
1259 <p>
1260 The physical network that the localnet port is attached to
1261 typically uses L2 learning. Any Ethernet address used over the
1262 distributed gateway port must be restricted to a single physical
1263 location so that upstream L2 learning is not confused. Traffic
1264 sent out the distributed gateway port towards the localnet port
1265 with a specific Ethernet address must be sent out one specific
1266 instance of the distributed gateway port on one specific
1267 chassis. Traffic received from the localnet port (or from a VIF
1268 on the same logical switch as the localnet port) with a specific
1269 Ethernet address must be directed to the logical switch's patch
1270 port instance on that specific chassis.
1271 </p>
1272
1273 <p>
1274 Due to the implications of L2 learning, the Ethernet address and
1275 IP address of the distributed gateway port need to be restricted
1276 to a single physical location. For this reason, the user must
1277 specify one chassis associated with the distributed gateway
1278 port. Note that traffic traversing the distributed gateway port
1279 using other Ethernet addresses and IP addresses (e.g. one-to-one
1280 NAT) is not restricted to this chassis.
1281 </p>
1282
1283 <p>
1284 Replies to ARP and ND requests must be restricted to a single
1285 physical location, where the Ethernet address in the reply
1286 resides. This includes ARP and ND replies for the IP address
1287 of the distributed gateway port, which are restricted to the
1288 chassis that the user associated with the distributed gateway
1289 port.
1290 </p>
1291 </li>
1292
1293 <li>
1294 In order to support one-to-many SNAT (aka IP masquerading), where
1295 multiple logical IP addresses spread across multiple chassis are
1296 mapped to a single external IP address, it will be necessary to
1297 handle some of the logical router processing on a specific chassis
1298          in a centralized manner.  Since the SNAT external IP address is
1299          typically the distributed gateway port's IP address, for
1300          simplicity the same chassis that is associated with the
1301          distributed gateway port is used.
1302 </li>
1303 </ul>
1304
1305 <p>
1306 The details of flow restrictions to specific chassis are described
1307 in the <code>ovn-northd</code> documentation.
1308 </p>
1309
1310 <p>
1311 While most of the physical location dependent aspects of distributed
1312 gateway ports can be handled by restricting some flows to specific
1313 chassis, one additional mechanism is required. When a packet
1314 leaves the ingress pipeline and the logical egress port is the
1315 distributed gateway port, one of two different sets of actions is
1316 required at table 32:
1317 </p>
1318
1319 <ul>
1320 <li>
1321 If the packet can be handled locally on the sender's hypervisor
1322 (e.g. one-to-one NAT traffic), then the packet should just be
1323 resubmitted locally to table 33, in the normal manner for
1324 distributed logical patch ports.
1325 </li>
1326
1327 <li>
1328 However, if the packet needs to be handled on the chassis
1329 associated with the distributed gateway port (e.g. one-to-many
1330 SNAT traffic or non-NAT traffic), then table 32 must send the
1331 packet on a tunnel port to that chassis.
1332 </li>
1333 </ul>
1334
1335 <p>
1336 In order to trigger the second set of actions, the
1337 <code>chassisredirect</code> type of southbound
1338 <code>Port_Binding</code> has been added. Setting the logical
1339 egress port to the type <code>chassisredirect</code> logical port is
1340 simply a way to indicate that although the packet is destined for
1341 the distributed gateway port, it needs to be redirected to a
1342 different chassis. At table 32, packets with this logical egress
1343 port are sent to a specific chassis, in the same way that table 32
1344 directs packets whose logical egress port is a VIF or a type
1345 <code>l3gateway</code> port to different chassis. Once the packet
1346 arrives at that chassis, table 33 resets the logical egress port to
1347 the value representing the distributed gateway port. For each
1348 distributed gateway port, there is one type
1349 <code>chassisredirect</code> port, in addition to the distributed
1350 logical patch port representing the distributed gateway port.
1351 </p>
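
      <p>
        In a running system, these automatically created ports can be seen in
        the southbound database, for example with <code>ovn-sbctl find
        Port_Binding type=chassisredirect</code> (the output will vary with
        the deployment).
      </p>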
1352
1353 <h3>High Availability for Distributed Gateway Ports</h3>
1354
1355 <p>
1356 OVN allows you to specify a prioritized list of chassis for a distributed
1357 gateway port. This is done by associating multiple
1358 <code>Gateway_Chassis</code> rows with a <code>Logical_Router_Port</code>
1359 in the <code>OVN_Northbound</code> database.
1360 </p>
1361
1362 <p>
1363 When multiple chassis have been specified for a gateway, all chassis that
1364 may send packets to that gateway will enable BFD on tunnels to all
1365 configured gateway chassis. The current master chassis for the gateway
1366 is the highest priority gateway chassis that is currently viewed as
1367 active based on BFD status.
1368 </p>
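
      <p>
        For example (the port and chassis names are hypothetical),
        <code>ovn-nbctl lrp-set-gateway-chassis dr-outside chassis-1
        20</code> followed by <code>ovn-nbctl lrp-set-gateway-chassis
        dr-outside chassis-2 10</code> creates two
        <code>Gateway_Chassis</code> rows for the distributed gateway port
        <code>dr-outside</code>, with <code>chassis-1</code> preferred
        because it has the higher priority.
      </p>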
1369
1370 <p>
1371 For more information on L3 gateway high availability, please refer to
1372 http://docs.openvswitch.org/en/latest/topics/high-availability.
1373 </p>
1374
1375  <h2>Life Cycle of a VTEP Gateway</h2>
1376
1377 <p>
1378 A gateway is a chassis that forwards traffic between the OVN-managed
1379 part of a logical network and a physical VLAN, extending a
1380 tunnel-based logical network into a physical network.
1381 </p>
1382
1383 <p>
1384 The steps below refer often to details of the OVN and VTEP database
1385 schemas. Please see <code>ovn-sb</code>(5), <code>ovn-nb</code>(5)
1386 and <code>vtep</code>(5), respectively, for the full story on these
1387 databases.
1388 </p>
1389
1390 <ol>
1391 <li>
1392 A VTEP gateway's life cycle begins with the administrator registering
1393 the VTEP gateway as a <code>Physical_Switch</code> table entry in the
1394 <code>VTEP</code> database. The <code>ovn-controller-vtep</code>
1395        connected to this VTEP database will recognize the new VTEP gateway
1396 and create a new <code>Chassis</code> table entry for it in the
1397 <code>OVN_Southbound</code> database.
1398 </li>
1399
1400 <li>
1401 The administrator can then create a new <code>Logical_Switch</code>
1402        table entry, and bind a particular VLAN on a VTEP gateway's port to
1403        any VTEP logical switch.  Once a VTEP logical switch is bound to
1404        a VTEP gateway, the <code>ovn-controller-vtep</code> will detect
1405        it and add its name to the <var>vtep_logical_switches</var>
1406        column of the <code>Chassis</code> table in the <code>
1407        OVN_Southbound</code> database.  Note that the <var>tunnel_key</var>
1408        column of the VTEP logical switch is not filled at creation.  The
1409        <code>ovn-controller-vtep</code> will set the column when the
1410        corresponding VTEP logical switch is bound to an OVN logical network.
            (A combined example of these configuration steps appears after this
            list.)
1411      </li>
1412
1413 <li>
1414 Now, the administrator can use the CMS to add a VTEP logical switch
1415 to the OVN logical network. To do that, the CMS must first create a
1416        new <code>Logical_Switch_Port</code> table entry in the <code>
1417 OVN_Northbound</code> database. Then, the <var>type</var> column
1418 of this entry must be set to "vtep". Next, the <var>
1419 vtep-logical-switch</var> and <var>vtep-physical-switch</var> keys
1420 in the <var>options</var> column must also be specified, since
1421 multiple VTEP gateways can attach to the same VTEP logical switch.
1422 </li>
1423
1424 <li>
1425 The newly created logical port in the <code>OVN_Northbound</code>
1426 database and its configuration will be passed down to the <code>
1427 OVN_Southbound</code> database as a new <code>Port_Binding</code>
1428 table entry. The <code>ovn-controller-vtep</code> will recognize the
1429 change and bind the logical port to the corresponding VTEP gateway
1430        chassis.  Binding the same VTEP logical switch to a different
1431        OVN logical network is not allowed, and a warning will be
1432        generated in the log.
1433 </li>
1434
1435 <li>
1436        Besides binding to the VTEP gateway chassis, the <code>
1437 ovn-controller-vtep</code> will update the <var>tunnel_key</var>
1438 column of the VTEP logical switch to the corresponding <code>
1439 Datapath_Binding</code> table entry's <var>tunnel_key</var> for the
1440 bound OVN logical network.
1441 </li>
1442
1443 <li>
1444        Next, the <code>ovn-controller-vtep</code> will keep reacting to
1445        configuration changes in the <code>Port_Binding</code> table in the
1446        <code>OVN_Southbound</code> database, and updating the
1447        <code>Ucast_Macs_Remote</code> table in the <code>VTEP</code> database.
1448 This allows the VTEP gateway to understand where to forward the unicast
1449 traffic coming from the extended external network.
1450 </li>
1451
1452 <li>
1453 Eventually, the VTEP gateway's life cycle ends when the administrator
1454 unregisters the VTEP gateway from the <code>VTEP</code> database.
1455 The <code>ovn-controller-vtep</code> will recognize the event and
1456 remove all related configurations (<code>Chassis</code> table entry
1457 and port bindings) in the <code>OVN_Southbound</code> database.
1458 </li>
1459
1460 <li>
1461 When the <code>ovn-controller-vtep</code> is terminated, all related
1462 configurations in the <code>OVN_Southbound</code> database and
1463        the <code>VTEP</code> database will be cleaned up, including
1464 <code>Chassis</code> table entries for all registered VTEP gateways
1465 and their port bindings, and all <code>Ucast_Macs_Remote</code> table
1466 entries and the <code>Logical_Switch</code> tunnel keys.
1467 </li>
1468 </ol>
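
    <p>
      The following is a combined sketch of the configuration steps above,
      using hypothetical names (physical switch <code>ps1</code> with port
      <code>p1</code>, VTEP logical switch <code>ls1</code>, OVN logical
      switch <code>sw0</code>, VLAN 100).  The <code>vtep-ctl</code> commands
      operate on the <code>VTEP</code> database (with a hardware VTEP, the
      <code>Physical_Switch</code> and port entries are normally created by
      the switch itself), and the <code>ovn-nbctl</code> command operates on
      the <code>OVN_Northbound</code> database:
    </p>

    <ul>
      <li>
        <code>vtep-ctl add-ps ps1</code> and <code>vtep-ctl add-port ps1
        p1</code>, then <code>vtep-ctl add-ls ls1</code> and <code>vtep-ctl
        bind-ls ps1 p1 100 ls1</code>, which register the gateway and bind
        VLAN 100 on port <code>p1</code> to the VTEP logical switch.
      </li>
      <li>
        <code>ovn-nbctl lsp-add sw0 sw0-vtep -- lsp-set-type sw0-vtep vtep --
        lsp-set-options sw0-vtep vtep-physical-switch=ps1
        vtep-logical-switch=ls1</code>, which attaches the VTEP logical
        switch to the OVN logical network.
      </li>
    </ul>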
1469
1470 <h1>Security</h1>
1471
1472  <h2>Role-Based Access Controls for the Southbound DB</h2>
1473 <p>
1474 In order to provide additional security against the possibility of an OVN
1475 chassis becoming compromised in such a way as to allow rogue software to
1476 make arbitrary modifications to the southbound database state and thus
1477 disrupt the OVN network, role-based access controls (see
1478 <code>ovsdb-server(1)</code> for additional details) are provided for the
1479 southbound database.
1480 </p>
1481
1482 <p>
1483 The implementation of role-based access controls (RBAC) requires the
1484 addition of two tables to an OVSDB schema: the <code>RBAC_Role</code>
1485    table, which is indexed by role name and maps the names of the various
1486    tables that may be modifiable for a given role to individual rows in a
1487    permissions table containing detailed permission information for that
1488    role, and the permission table itself, which consists of rows containing
1489    the following information:
1490 </p>
1491 <dl>
1492 <dt><code>Table Name</code></dt>
1493 <dd>
1494 The name of the associated table. This column exists primarily as an
1495 aid for humans reading the contents of this table.
1496 </dd>
1497
1498 <dt><code>Auth Criteria</code></dt>
1499 <dd>
1500 A set of strings containing the names of columns (or column:key pairs
1501 for columns containing string:string maps). The contents of at least
1502 one of the columns or column:key values in a row to be modified,
1503 inserted, or deleted must be equal to the ID of the client attempting
1504 to act on the row in order for the authorization check to pass. If the
1505      authorization criteria are empty, authorization checking is disabled and
1506 all clients for the role will be treated as authorized.
1507 </dd>
1508
1509 <dt><code>Insert/Delete</code></dt>
1510 <dd>
1511 Row insertion/deletion permission; boolean value indicating whether
1512 insertion and deletion of rows is allowed for the associated table.
1513 If true, insertion and deletion of rows is allowed for authorized
1514 clients.
1515 </dd>
1516
1517 <dt><code>Updatable Columns</code></dt>
1518 <dd>
1519 A set of strings containing the names of columns or column:key pairs
1520 that may be updated or mutated by authorized clients. Modifications to
1521 columns within a row are only permitted when the authorization check
1522 for the client passes and all columns to be modified are included in
1523 this set of modifiable columns.
1524 </dd>
1525 </dl>
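
  <p>
    In the OVN southbound schema these are the <code>RBAC_Role</code> and
    <code>RBAC_Permission</code> tables; their contents can be inspected with
    the generic database commands, e.g. <code>ovn-sbctl list RBAC_Role</code>
    and <code>ovn-sbctl list RBAC_Permission</code>.
  </p>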
1526
1527 <p>
1528 RBAC configuration for the OVN southbound database is maintained by
1529    <code>ovn-northd</code>.  With RBAC enabled, modifications are only permitted for the
1530    <code>Chassis</code>, <code>Encap</code>, <code>Port_Binding</code>, and
1531    <code>MAC_Binding</code> tables, and are restricted as follows:
1532 </p>
1533 <dl>
1534 <dt><code>Chassis</code></dt>
1535 <dd>
1536 <p>
1537 <code>Authorization</code>: client ID must match the chassis name.
1538 </p>
1539 <p>
1540 <code>Insert/Delete</code>: authorized row insertion and deletion
1541 are permitted.
1542 </p>
1543 <p>
1544 <code>Update</code>: The columns <code>nb_cfg</code>,
1545 <code>external_ids</code>, <code>encaps</code>, and
1546 <code>vtep_logical_switches</code> may be modified when authorized.
1547 </p>
1548 </dd>
1549
1550 <dt><code>Encap</code></dt>
1551 <dd>
1552 <p>
1553        <code>Authorization</code>: client ID must match the chassis name.
1554 </p>
1555 <p>
1556 <code>Insert/Delete</code>: row insertion and row deletion
1557 are permitted.
1558 </p>
1559 <p>
1560 <code>Update</code>: The columns <code>type</code>,
1561 <code>options</code>, and <code>ip</code> can be modified.
1562 </p>
1563 </dd>
1564
1565 <dt><code>Port_Binding</code></dt>
1566 <dd>
1567 <p>
1568        <code>Authorization</code>: disabled (all clients are considered
1569        authorized).  A future enhancement may add columns (or keys to
1570        <code>external_ids</code>) in order to control which chassis are
1571        allowed to bind each port.
1572 </p>
1573 <p>
1574        <code>Insert/Delete</code>: row insertion and deletion are not
1575        permitted (<code>ovn-northd</code> maintains the rows in this table).
1576 </p>
1577 <p>
1578 <code>Update</code>: Only modifications to the <code>chassis</code>
1579 column are permitted.
1580 </p>
1581 </dd>
1582
1583 <dt><code>MAC_Binding</code></dt>
1584 <dd>
1585 <p>
1586 <code>Authorization</code>: disabled (all clients are considered
1587 to be authorized).
1588 </p>
1589 <p>
1590 <code>Insert/Delete</code>: row insertion/deletion are permitted.
1591 </p>
1592 <p>
1593 <code>Update</code>: The columns <code>logical_port</code>,
1594 <code>ip</code>, <code>mac</code>, and <code>datapath</code> may be
1595 modified by ovn-controller.
1596 </p>
1597 </dd>
1598 </dl>
1599
1600 <p>
1601 Enabling RBAC for ovn-controller connections to the southbound database
1602 requires the following steps:
1603 </p>
1604
1605 <ol>
1606 <li>
1607 Creating SSL certificates for each chassis with the certificate CN field
1608 set to the chassis name (e.g. for a chassis with
1609 <code>external-ids:system-id=chassis-1</code>, via the command
1610      "<code>ovs-pki -u req+sign chassis-1 switch</code>").
1611 </li>
1612 <li>
1613 Configuring each ovn-controller to use SSL when connecting to the
1614 southbound database (e.g. via "<code>ovs-vsctl set open .
1615 external-ids:ovn-remote=ssl:x.x.x.x:6642</code>").
1616 </li>
1617 <li>
1618 Configuring a southbound database SSL remote with "ovn-controller" role
1619 (e.g. via "<code>ovn-sbctl set-connection role=ovn-controller
1620 pssl:6642</code>").
1621 </li>
1622 </ol>
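
  <p>
    In addition, each chassis needs the private key, certificate, and CA
    certificate from step 1 made available to <code>ovn-controller</code>,
    for example via its <code>--private-key</code>,
    <code>--certificate</code>, and <code>--ca-cert</code> options.
    Depending on the version and packaging, SSL configuration taken from the
    local Open vSwitch database, set with "<code>ovs-vsctl set-ssl
    /etc/openvswitch/chassis-1-privkey.pem
    /etc/openvswitch/chassis-1-cert.pem /etc/openvswitch/cacert.pem</code>",
    may be used instead (the paths here are illustrative).
  </p>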
1623
1624 <h1>Design Decisions</h1>
1625
1626 <h2>Tunnel Encapsulations</h2>
1627
1628 <p>
1629 OVN annotates logical network packets that it sends from one hypervisor to
1630 another with the following three pieces of metadata, which are encoded in
1631 an encapsulation-specific fashion:
1632 </p>
1633
1634 <ul>
1635 <li>
1636 24-bit logical datapath identifier, from the <code>tunnel_key</code>
1637 column in the OVN Southbound <code>Datapath_Binding</code> table.
1638 </li>
1639
1640 <li>
1641 15-bit logical ingress port identifier. ID 0 is reserved for internal
1642 use within OVN. IDs 1 through 32767, inclusive, may be assigned to
1643 logical ports (see the <code>tunnel_key</code> column in the OVN
1644 Southbound <code>Port_Binding</code> table).
1645 </li>
1646
1647 <li>
1648 16-bit logical egress port identifier. IDs 0 through 32767 have the same
1649 meaning as for logical ingress ports. IDs 32768 through 65535,
1650 inclusive, may be assigned to logical multicast groups (see the
1651 <code>tunnel_key</code> column in the OVN Southbound
1652 <code>Multicast_Group</code> table).
1653 </li>
1654 </ul>
1655
1656 <p>
1657 For hypervisor-to-hypervisor traffic, OVN supports only Geneve and STT
1658 encapsulations, for the following reasons:
1659 </p>
1660
1661 <ul>
1662 <li>
1663 Only STT and Geneve support the large amounts of metadata (over 32 bits
1664 per packet) that OVN uses (as described above).
1665 </li>
1666
1667 <li>
1668      STT and Geneve use randomized UDP or TCP source ports, allowing
1669 efficient distribution among multiple paths in environments that use ECMP
1670 in their underlay.
1671 </li>
1672
1673 <li>
1674 NICs are available to offload STT and Geneve encapsulation and
1675 decapsulation.
1676 </li>
1677 </ul>
1678
1679 <p>
1680 Due to its flexibility, the preferred encapsulation between hypervisors is
1681 Geneve. For Geneve encapsulation, OVN transmits the logical datapath
1682 identifier in the Geneve VNI.
1683
1684 <!-- Keep the following in sync with ovn/controller/physical.h. -->
1685 OVN transmits the logical ingress and logical egress ports in a TLV with
1686    class 0x0102, type 0x80, and a 32-bit value encoded as follows, from MSB to
1687 LSB:
1688 </p>
1689
1690 <diagram>
1691 <header name="">
1692 <bits name="rsv" above="1" below="0" width=".25"/>
1693 <bits name="ingress port" above="15" width=".75"/>
1694 <bits name="egress port" above="16" width=".75"/>
1695 </header>
1696 </diagram>
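
  <p>
    For example, for a packet on the logical datapath with tunnel key 5, with
    logical ingress port 2 and logical egress port 3, the Geneve VNI is 5 and
    the option value is 0x00020003: the ingress port occupies bits 30 to 16
    and the egress port bits 15 to 0, with the reserved bit clear.
  </p>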
1697
1698 <p>
1699 Environments whose NICs lack Geneve offload may prefer STT encapsulation
1700 for performance reasons. For STT encapsulation, OVN encodes all three
1701 pieces of logical metadata in the STT 64-bit tunnel ID as follows, from MSB
1702 to LSB:
1703 </p>
1704
1705 <diagram>
1706 <header name="">
1707 <bits name="reserved" above="9" below="0" width=".5"/>
1708 <bits name="ingress port" above="15" width=".75"/>
1709 <bits name="egress port" above="16" width=".75"/>
1710 <bits name="datapath" above="24" width="1.25"/>
1711 </header>
1712 </diagram>
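
  <p>
    Continuing the example above (datapath 5, ingress port 2, egress port 3),
    the STT tunnel ID is 0x0000020003000005: the datapath occupies the low 24
    bits, the egress port the next 16 bits, and the ingress port the next 15
    bits, with the reserved bits clear.
  </p>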
1713
1714  <p>
1715 For connecting to gateways, in addition to Geneve and STT, OVN supports
1716 VXLAN, because only VXLAN support is common on top-of-rack (ToR) switches.
1717 Currently, gateways have a feature set that matches the capabilities as
1718 defined by the VTEP schema, so fewer bits of metadata are necessary. In
1719 the future, gateways that do not support encapsulations with large amounts
1720 of metadata may continue to have a reduced feature set.
1721  </p>
1722 </manpage>