1<?xml version="1.0" encoding="utf-8"?>
2<manpage program="ovn-architecture" section="7" title="OVN Architecture">
3 <h1>Name</h1>
4 <p>ovn-architecture -- Open Virtual Network architecture</p>
5
6 <h1>Description</h1>
7
8 <p>
9 OVN, the Open Virtual Network, is a system to support virtual network
10 abstraction. OVN complements the existing capabilities of OVS to add
11 native support for virtual network abstractions, such as virtual L2 and L3
12 overlays and security groups. Services such as DHCP are also desirable
13 features. Just like OVS, OVN's design goal is to have a production-quality
14 implementation that can operate at significant scale.
15 </p>
16
17 <p>
18 An OVN deployment consists of several components:
19 </p>
20
21 <ul>
22 <li>
23 <p>
24 A <dfn>Cloud Management System</dfn> (<dfn>CMS</dfn>), which is
25 OVN's ultimate client (via its users and administrators). OVN
26 integration requires installing a CMS-specific plugin and
27 related software (see below). OVN initially targets OpenStack
28 as CMS.
29 </p>
30
31 <p>
32 We generally speak of ``the'' CMS, but one can imagine scenarios in
33 which multiple CMSes manage different parts of an OVN deployment.
34 </p>
35 </li>
36
37 <li>
38 An OVN Database physical or virtual node (or, eventually, cluster)
39 installed in a central location.
40 </li>
41
42 <li>
43 One or more (usually many) <dfn>hypervisors</dfn>. Hypervisors must run
44 Open vSwitch and implement the interface described in
 45 <code>IntegrationGuide.rst</code> in the OVS source tree. Any hypervisor
46 platform supported by Open vSwitch is acceptable.
47 </li>
48
49 <li>
50 <p>
51 Zero or more <dfn>gateways</dfn>. A gateway extends a tunnel-based
52 logical network into a physical network by bidirectionally forwarding
53 packets between tunnels and a physical Ethernet port. This allows
54 non-virtualized machines to participate in logical networks. A gateway
55 may be a physical host, a virtual machine, or an ASIC-based hardware
56 switch that supports the <code>vtep</code>(5) schema. (Support for the
 57 latter will come later in the OVN implementation.)
58 </p>
59
60 <p>
 61 Hypervisors and gateways are together called <dfn>transport nodes</dfn>
 62 or <dfn>chassis</dfn>.
63 </p>
64 </li>
65 </ul>
66
67 <p>
68 The diagram below shows how the major components of OVN and related
69 software interact. Starting at the top of the diagram, we have:
70 </p>
71
72 <ul>
73 <li>
74 The Cloud Management System, as defined above.
75 </li>
76
77 <li>
78 <p>
79 The <dfn>OVN/CMS Plugin</dfn> is the component of the CMS that
80 interfaces to OVN. In OpenStack, this is a Neutron plugin.
81 The plugin's main purpose is to translate the CMS's notion of logical
82 network configuration, stored in the CMS's configuration database in a
83 CMS-specific format, into an intermediate representation understood by
84 OVN.
85 </p>
86
87 <p>
88 This component is necessarily CMS-specific, so a new plugin needs to be
89 developed for each CMS that is integrated with OVN. All of the
90 components below this one in the diagram are CMS-independent.
91 </p>
92 </li>
93
94 <li>
95 <p>
96 The <dfn>OVN Northbound Database</dfn> receives the intermediate
97 representation of logical network configuration passed down by the
98 OVN/CMS Plugin. The database schema is meant to be ``impedance
99 matched'' with the concepts used in a CMS, so that it directly supports
100 notions of logical switches, routers, ACLs, and so on. See
 101 <code>ovn-nb</code>(5) for details.
102 </p>
103
104 <p>
105 The OVN Northbound Database has only two clients: the OVN/CMS Plugin
106 above it and <code>ovn-northd</code> below it.
107 </p>
108 </li>
109
110 <li>
111 <code>ovn-northd</code>(8) connects to the OVN Northbound Database
112 above it and the OVN Southbound Database below it. It translates the
113 logical network configuration in terms of conventional network
114 concepts, taken from the OVN Northbound Database, into logical
115 datapath flows in the OVN Southbound Database below it.
116 </li>
117
118 <li>
119 <p>
 120 The <dfn>OVN Southbound Database</dfn> is the center of the system.
 121 Its clients are <code>ovn-northd</code>(8) above it and
 122 <code>ovn-controller</code>(8) on every transport node below it.
123 </p>
124
125 <p>
126 The OVN Southbound Database contains three kinds of data: <dfn>Physical
127 Network</dfn> (PN) tables that specify how to reach hypervisor and
128 other nodes, <dfn>Logical Network</dfn> (LN) tables that describe the
129 logical network in terms of ``logical datapath flows,'' and
130 <dfn>Binding</dfn> tables that link logical network components'
131 locations to the physical network. The hypervisors populate the PN and
132 Port_Binding tables, whereas <code>ovn-northd</code>(8) populates the
133 LN tables.
134 </p>
135
136 <p>
137 OVN Southbound Database performance must scale with the number of
138 transport nodes. This will likely require some work on
139 <code>ovsdb-server</code>(1) as we encounter bottlenecks.
140 Clustering for availability may be needed.
141 </p>
142 </li>
143 </ul>
144
145 <p>
146 The remaining components are replicated onto each hypervisor:
147 </p>
148
149 <ul>
150 <li>
151 <code>ovn-controller</code>(8) is OVN's agent on each hypervisor and
152 software gateway. Northbound, it connects to the OVN Southbound
153 Database to learn about OVN configuration and status and to
154 populate the PN table and the <code>Chassis</code> column in
 155 the <code>Binding</code> table with the hypervisor's status.
156 Southbound, it connects to <code>ovs-vswitchd</code>(8) as an
157 OpenFlow controller, for control over network traffic, and to the
158 local <code>ovsdb-server</code>(1) to allow it to monitor and
159 control Open vSwitch configuration.
160 </li>
161
162 <li>
163 <code>ovs-vswitchd</code>(8) and <code>ovsdb-server</code>(1) are
164 conventional components of Open vSwitch.
165 </li>
166 </ul>
167
168 <pre fixed="yes">
169 CMS
170 |
171 |
172 +-----------|-----------+
173 | | |
174 | OVN/CMS Plugin |
175 | | |
176 | | |
177 | OVN Northbound DB |
178 | | |
179 | | |
 180 | ovn-northd |
181 | | |
182 +-----------|-----------+
183 |
184 |
185 +-------------------+
186 | OVN Southbound DB |
187 +-------------------+
188 |
189 |
190 +------------------+------------------+
191 | | |
 192 HV 1 | | HV n |
193+---------------|---------------+ . +---------------|---------------+
194| | | . | | |
195| ovn-controller | . | ovn-controller |
196| | | | . | | | |
197| | | | | | | |
198| ovs-vswitchd ovsdb-server | | ovs-vswitchd ovsdb-server |
199| | | |
200+-------------------------------+ +-------------------------------+
201 </pre>
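  <p>
    For illustration, <code>ovn-controller</code> learns how to reach the OVN
    Southbound Database, and which encapsulation to use, from
    <code>external-ids</code> keys in the local Open vSwitch database, as
    documented in <code>ovn-controller</code>(8).  A minimal sketch of this
    per-chassis configuration, with placeholder addresses and chassis name,
    might look like:
  </p>

  <pre fixed="yes">
# All values below are examples only; substitute site-specific ones.
$ ovs-vsctl set Open_vSwitch . \
    external-ids:system-id=chassis-1 \
    external-ids:ovn-remote=tcp:192.0.2.10:6642 \
    external-ids:ovn-encap-type=geneve \
    external-ids:ovn-encap-ip=192.0.2.11
  </pre>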
202
203 <h2>Information Flow in OVN</h2>
204
205 <p>
206 Configuration data in OVN flows from north to south. The CMS, through its
207 OVN/CMS plugin, passes the logical network configuration to
208 <code>ovn-northd</code> via the northbound database. In turn,
209 <code>ovn-northd</code> compiles the configuration into a lower-level form
210 and passes it to all of the chassis via the southbound database.
211 </p>
212
213 <p>
214 Status information in OVN flows from south to north. OVN currently
215 provides only a few forms of status information. First,
216 <code>ovn-northd</code> populates the <code>up</code> column in the
217 northbound <code>Logical_Switch_Port</code> table: if a logical port's
218 <code>chassis</code> column in the southbound <code>Port_Binding</code>
219 table is nonempty, it sets <code>up</code> to <code>true</code>, otherwise
220 to <code>false</code>. This allows the CMS to detect when a VM's
221 networking has come up.
222 </p>
223
224 <p>
225 Second, OVN provides feedback to the CMS on the realization of its
226 configuration, that is, whether the configuration provided by the CMS has
227 taken effect. This feature requires the CMS to participate in a sequence
228 number protocol, which works the following way:
229 </p>
230
231 <ol>
232 <li>
233 When the CMS updates the configuration in the northbound database, as
234 part of the same transaction, it increments the value of the
235 <code>nb_cfg</code> column in the <code>NB_Global</code> table. (This is
236 only necessary if the CMS wants to know when the configuration has been
237 realized.)
238 </li>
239
240 <li>
241 When <code>ovn-northd</code> updates the southbound database based on a
242 given snapshot of the northbound database, it copies <code>nb_cfg</code>
243 from northbound <code>NB_Global</code> into the southbound database
244 <code>SB_Global</code> table, as part of the same transaction. (Thus, an
245 observer monitoring both databases can determine when the southbound
246 database is caught up with the northbound.)
247 </li>
248
249 <li>
250 After <code>ovn-northd</code> receives confirmation from the southbound
251 database server that its changes have committed, it updates
252 <code>sb_cfg</code> in the northbound <code>NB_Global</code> table to the
253 <code>nb_cfg</code> version that was pushed down. (Thus, the CMS or
254 another observer can determine when the southbound database is caught up
255 without a connection to the southbound database.)
256 </li>
257
258 <li>
259 The <code>ovn-controller</code> process on each chassis receives the
260 updated southbound database, with the updated <code>nb_cfg</code>. This
261 process in turn updates the physical flows installed in the chassis's
262 Open vSwitch instances. When it receives confirmation from Open vSwitch
263 that the physical flows have been updated, it updates <code>nb_cfg</code>
264 in its own <code>Chassis</code> record in the southbound database.
265 </li>
266
267 <li>
268 <code>ovn-northd</code> monitors the <code>nb_cfg</code> column in all of
269 the <code>Chassis</code> records in the southbound database. It keeps
270 track of the minimum value among all the records and copies it into the
271 <code>hv_cfg</code> column in the northbound <code>NB_Global</code>
272 table. (Thus, the CMS or another observer can determine when all of the
273 hypervisors have caught up to the northbound configuration.)
274 </li>
275 </ol>
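  <p>
    For example, an observer can follow this sequence number protocol by
    reading the counters directly with the database command-line tools (a
    sketch; the values shown will of course vary by deployment):
  </p>

  <pre fixed="yes">
# nb_cfg, sb_cfg, and hv_cfg all appear in the single NB_Global row:
$ ovn-nbctl list NB_Global
# Each chassis's own progress appears in its southbound Chassis row:
$ ovn-sbctl list Chassis
  </pre>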
276
277 <h2>Chassis Setup</h2>
278
279 <p>
280 Each chassis in an OVN deployment must be configured with an Open vSwitch
281 bridge dedicated for OVN's use, called the <dfn>integration bridge</dfn>.
282 System startup scripts may create this bridge prior to starting
283 <code>ovn-controller</code> if desired. If this bridge does not exist when
 284 <code>ovn-controller</code> starts, it will be created automatically with the default
285 configuration suggested below. The ports on the integration bridge include:
286 </p>
287
288 <ul>
289 <li>
290 On any chassis, tunnel ports that OVN uses to maintain logical network
291 connectivity. <code>ovn-controller</code> adds, updates, and removes
292 these tunnel ports.
293 </li>
294
295 <li>
296 On a hypervisor, any VIFs that are to be attached to logical networks.
297 The hypervisor itself, or the integration between Open vSwitch and the
 298 hypervisor (described in <code>IntegrationGuide.rst</code>) takes care of
299 this. (This is not part of OVN or new to OVN; this is pre-existing
300 integration work that has already been done on hypervisors that support
301 OVS.)
302 </li>
303
304 <li>
305 On a gateway, the physical port used for logical network connectivity.
306 System startup scripts add this port to the bridge prior to starting
307 <code>ovn-controller</code>. This can be a patch port to another bridge,
308 instead of a physical port, in more sophisticated setups.
309 </li>
310 </ul>
311
312 <p>
313 Other ports should not be attached to the integration bridge. In
314 particular, physical ports attached to the underlay network (as opposed to
315 gateway ports, which are physical ports attached to logical networks) must
316 not be attached to the integration bridge. Underlay physical ports should
317 instead be attached to a separate Open vSwitch bridge (they need not be
318 attached to any bridge at all, in fact).
319 </p>
320
321 <p>
322 The integration bridge should be configured as described below.
323 The effect of each of these settings is documented in
324 <code>ovs-vswitchd.conf.db</code>(5):
325 </p>
326
327 <!-- Keep the following in sync with create_br_int() in
328 ovn/controller/ovn-controller.c. -->
329 <dl>
330 <dt><code>fail-mode=secure</code></dt>
331 <dd>
332 Avoids switching packets between isolated logical networks before
333 <code>ovn-controller</code> starts up. See <code>Controller Failure
334 Settings</code> in <code>ovs-vsctl</code>(8) for more information.
335 </dd>
336
337 <dt><code>other-config:disable-in-band=true</code></dt>
338 <dd>
339 Suppresses in-band control flows for the integration bridge. It would be
340 unusual for such flows to show up anyway, because OVN uses a local
341 controller (over a Unix domain socket) instead of a remote controller.
342 It's possible, however, for some other bridge in the same system to have
343 an in-band remote controller, and in that case this suppresses the flows
344 that in-band control would ordinarily set up. Refer to the documentation
345 for more information.
346 </dd>
347 </dl>
348
349 <p>
350 The customary name for the integration bridge is <code>br-int</code>, but
351 another name may be used.
352 </p>
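  <p>
    For reference, a startup script that chooses to create the integration
    bridge itself, rather than letting <code>ovn-controller</code> do so,
    might apply the settings described above with a sketch like this:
  </p>

  <pre fixed="yes">
$ ovs-vsctl --may-exist add-br br-int \
    -- set Bridge br-int fail-mode=secure \
       other-config:disable-in-band=true
  </pre>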
353
354 <h2>Logical Networks</h2>
355
356 <p>
 357 <dfn>Logical networks</dfn> implement the same concepts as physical
 358 networks, but they are insulated from the physical network with tunnels or
359 other encapsulations. This allows logical networks to have separate IP and
360 other address spaces that overlap, without conflicting, with those used for
361 physical networks. Logical network topologies can be arranged without
362 regard for the topologies of the physical networks on which they run.
363 </p>
364
365 <p>
366 Logical network concepts in OVN include:
367 </p>
368
369 <ul>
370 <li>
371 <dfn>Logical switches</dfn>, the logical version of Ethernet switches.
372 </li>
373
374 <li>
375 <dfn>Logical routers</dfn>, the logical version of IP routers. Logical
376 switches and routers can be connected into sophisticated topologies.
377 </li>
378
379 <li>
380 <dfn>Logical datapaths</dfn> are the logical version of an OpenFlow
381 switch. Logical switches and routers are both implemented as logical
382 datapaths.
383 </li>
384
385 <li>
386 <p>
387 <dfn>Logical ports</dfn> represent the points of connectivity in and
388 out of logical switches and logical routers. Some common types of
389 logical ports are:
390 </p>
391
392 <ul>
393 <li>
394 Logical ports representing VIFs.
395 </li>
396
397 <li>
398 <dfn>Localnet ports</dfn> represent the points of connectivity
399 between logical switches and the physical network. They are
400 implemented as OVS patch ports between the integration bridge
401 and the separate Open vSwitch bridge that underlay physical
402 ports attach to.
403 </li>
404
405 <li>
406 <dfn>Logical patch ports</dfn> represent the points of
407 connectivity between logical switches and logical routers, and
408 in some cases between peer logical routers. There is a pair of
409 logical patch ports at each such point of connectivity, one on
410 each side.
411 </li>
412 <li>
413 <dfn>Localport ports</dfn> represent the points of local
414 connectivity between logical switches and VIFs. These ports are
415 present in every chassis (not bound to any particular one) and
416 traffic from them will never go through a tunnel. A
417 <code>localport</code> is expected to only generate traffic destined
418 for a local destination, typically in response to a request it
419 received.
 420 One use case is how OpenStack Neutron uses a <code>localport</code>
 421 port for serving metadata to VMs residing on every hypervisor. A
 422 metadata proxy process is attached to this port on every host and all
 423 VMs within the same network will reach it at the same IP/MAC address
 424 without any traffic being sent over a tunnel. Further details can be
 425 seen at https://docs.openstack.org/developer/networking-ovn/design/metadata_api.html.
426 </li>
427 </ul>
428 </li>
429 </ul>
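  <p>
    For illustration, a small topology with one logical switch, two VIF
    logical ports, and a logical router (connected to the switch through a
    pair of logical patch ports) can be created with
    <code>ovn-nbctl</code>.  All names and addresses below are placeholders:
  </p>

  <pre fixed="yes">
$ ovn-nbctl ls-add sw0
$ ovn-nbctl lsp-add sw0 sw0-vif1
$ ovn-nbctl lsp-set-addresses sw0-vif1 "00:00:00:00:00:01 10.0.0.11"
$ ovn-nbctl lsp-add sw0 sw0-vif2
$ ovn-nbctl lsp-set-addresses sw0-vif2 "00:00:00:00:00:02 10.0.0.12"
$ ovn-nbctl lr-add lr0
$ ovn-nbctl lrp-add lr0 lrp0 00:00:00:00:00:ff 10.0.0.1/24
$ ovn-nbctl lsp-add sw0 sw0-lr0
$ ovn-nbctl lsp-set-type sw0-lr0 router
$ ovn-nbctl lsp-set-addresses sw0-lr0 router
$ ovn-nbctl lsp-set-options sw0-lr0 router-port=lrp0
  </pre>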
430
 431 <h2>Life Cycle of a VIF</h2>
432
433 <p>
434 Tables and their schemas presented in isolation are difficult to
435 understand. Here's an example.
436 </p>
437
438 <p>
439 A VIF on a hypervisor is a virtual network interface attached either
 440 to a VM or a container running directly on that hypervisor (this is
 441 different from the interface of a container running inside a VM).
442 </p>
443
444 <p>
 445 The steps in this example refer often to details of the OVN Southbound
 446 and OVN Northbound database schemas. Please see <code>ovn-sb</code>(5) and
447 <code>ovn-nb</code>(5), respectively, for the full story on these
448 databases.
449 </p>
450
451 <ol>
452 <li>
453 A VIF's life cycle begins when a CMS administrator creates a new VIF
454 using the CMS user interface or API and adds it to a switch (one
455 implemented by OVN as a logical switch). The CMS updates its own
 456 configuration. This includes associating a unique, persistent identifier
457 <var>vif-id</var> and Ethernet address <var>mac</var> with the VIF.
458 </li>
459
460 <li>
461 The CMS plugin updates the OVN Northbound database to include the new
462 VIF, by adding a row to the <code>Logical_Switch_Port</code>
463 table. In the new row, <code>name</code> is <var>vif-id</var>,
464 <code>mac</code> is <var>mac</var>, <code>switch</code> points to
465 the OVN logical switch's Logical_Switch record, and other columns
466 are initialized appropriately.
467 </li>
468
469 <li>
470 <code>ovn-northd</code> receives the OVN Northbound database update. In
471 turn, it makes the corresponding updates to the OVN Southbound database,
472 by adding rows to the OVN Southbound database <code>Logical_Flow</code>
473 table to reflect the new port, e.g. add a flow to recognize that packets
474 destined to the new port's MAC address should be delivered to it, and
475 update the flow that delivers broadcast and multicast packets to include
476 the new port. It also creates a record in the <code>Binding</code> table
477 and populates all its columns except the column that identifies the
 478 <code>chassis</code>.
479 </li>
480
481 <li>
482 On every hypervisor, <code>ovn-controller</code> receives the
 483 <code>Logical_Flow</code> table updates that <code>ovn-northd</code> made
484 in the previous step. As long as the VM that owns the VIF is powered
485 off, <code>ovn-controller</code> cannot do much; it cannot, for example,
486 arrange to send packets to or receive packets from the VIF, because the
487 VIF does not actually exist anywhere.
488 </li>
489
490 <li>
491 Eventually, a user powers on the VM that owns the VIF. On the hypervisor
492 where the VM is powered on, the integration between the hypervisor and
 493 Open vSwitch (described in <code>IntegrationGuide.rst</code>) adds the VIF
 494 to the OVN integration bridge and stores <var>vif-id</var> in
 495 <code>external_ids</code>:<code>iface-id</code> to indicate that the
496 interface is an instantiation of the new VIF. (None of this code is new
497 in OVN; this is pre-existing integration work that has already been done
498 on hypervisors that support OVS.)
499 </li>
500
501 <li>
502 On the hypervisor where the VM is powered on, <code>ovn-controller</code>
 503 notices <code>external_ids</code>:<code>iface-id</code> in the new
 504 Interface. In response, in the OVN Southbound DB, it updates the
 505 <code>Binding</code> table's <code>chassis</code> column for the
 506 row that links the logical port from <code>external_ids</code>:<code>
507 iface-id</code> to the hypervisor. Afterward, <code>ovn-controller</code>
508 updates the local hypervisor's OpenFlow tables so that packets to and from
509 the VIF are properly handled.
510 </li>
511
512 <li>
513 Some CMS systems, including OpenStack, fully start a VM only when its
514 networking is ready. To support this, <code>ovn-northd</code> notices
515 the <code>chassis</code> column updated for the row in
 516 the <code>Binding</code> table and pushes this upward by updating the
517 <ref column="up" table="Logical_Switch_Port" db="OVN_NB"/> column
518 in the OVN Northbound database's <ref table="Logical_Switch_Port"
519 db="OVN_NB"/> table to indicate that the VIF is now up. The CMS,
520 if it uses this feature, can then react by allowing the VM's
521 execution to proceed.
522 </li>
523
524 <li>
525 On every hypervisor but the one where the VIF resides,
 526 <code>ovn-controller</code> notices the completely populated row in the
 527 <code>Binding</code> table. This provides <code>ovn-controller</code>
528 the physical location of the logical port, so each instance updates the
529 OpenFlow tables of its switch (based on logical datapath flows in the OVN
530 DB <code>Logical_Flow</code> table) so that packets to and from the VIF
531 can be properly handled via tunnels.
532 </li>
533
534 <li>
535 Eventually, a user powers off the VM that owns the VIF. On the
 536 hypervisor where the VM was powered off, the VIF is deleted from the OVN
537 integration bridge.
538 </li>
539
540 <li>
 541 On the hypervisor where the VM was powered off,
 542 <code>ovn-controller</code> notices that the VIF was deleted. In
 543 response, it removes the <code>chassis</code> column content in the
 544 <code>Binding</code> table for the logical port.
545 </li>
546
547 <li>
 548 On every hypervisor, <code>ovn-controller</code> notices the empty
 549 <code>chassis</code> column in the <code>Binding</code> table's row
550 for the logical port. This means that <code>ovn-controller</code> no
551 longer knows the physical location of the logical port, so each instance
552 updates its OpenFlow table to reflect that.
553 </li>
554
555 <li>
556 Eventually, when the VIF (or its entire VM) is no longer needed by
557 anyone, an administrator deletes the VIF using the CMS user interface or
558 API. The CMS updates its own configuration.
559 </li>
560
561 <li>
562 The CMS plugin removes the VIF from the OVN Northbound database,
 563 by deleting its row in the <code>Logical_Switch_Port</code> table.
564 </li>
565
566 <li>
 567 <code>ovn-northd</code> receives the OVN Northbound update and in turn
568 updates the OVN Southbound database accordingly, by removing or updating
569 the rows from the OVN Southbound database <code>Logical_Flow</code> table
570 and <code>Binding</code> table that were related to the now-destroyed
571 VIF.
572 </li>
573
574 <li>
575 On every hypervisor, <code>ovn-controller</code> receives the
 576 <code>Logical_Flow</code> table updates that <code>ovn-northd</code> made
577 in the previous step. <code>ovn-controller</code> updates OpenFlow
578 tables to reflect the update, although there may not be much to do, since
579 the VIF had already become unreachable when it was removed from the
 580 <code>Binding</code> table in a previous step.
581 </li>
582 </ol>
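  <p>
    The CMS-facing and hypervisor-facing halves of the steps above can also
    be reproduced by hand, which is often useful when experimenting.  A
    minimal sketch, using placeholder names (<code>sw0</code>, a
    <var>vif-id</var> of <code>vif1</code>, and a local device
    <code>tap-vif1</code>):
  </p>

  <pre fixed="yes">
# Step 2, normally done by the CMS plugin:
$ ovn-nbctl lsp-add sw0 vif1
$ ovn-nbctl lsp-set-addresses vif1 "00:00:00:00:00:01"
# Step 5, normally done by the hypervisor integration:
$ ovs-vsctl add-port br-int tap-vif1 \
    -- set Interface tap-vif1 external_ids:iface-id=vif1
# Afterward, the southbound database shows the port bound to this chassis:
$ ovn-sbctl show
  </pre>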
583
 584 <h2>Life Cycle of a Container Interface Inside a VM</h2>
585
586 <p>
 587 OVN provides virtual network abstractions by converting information
 588 written in the OVN_NB database to OpenFlow flows in each hypervisor. Secure
 589 virtual networking for multiple tenants can only be provided if the OVN
 590 controller is the only entity that can modify flows in Open vSwitch. When the
591 Open vSwitch integration bridge resides in the hypervisor, it is a
592 fair assumption to make that tenant workloads running inside VMs cannot
593 make any changes to Open vSwitch flows.
594 </p>
595
596 <p>
597 If the infrastructure provider trusts the applications inside the
598 containers not to break out and modify the Open vSwitch flows, then
599 containers can be run in hypervisors. This is also the case when
 600 containers are run inside the VMs and the Open vSwitch integration bridge
 601 with flows added by the OVN controller resides in the same VM. For both
602 the above cases, the workflow is the same as explained with an example
603 in the previous section ("Life Cycle of a VIF").
604 </p>
605
606 <p>
607 This section talks about the life cycle of a container interface (CIF)
608 when containers are created in the VMs and the Open vSwitch integration
609 bridge resides inside the hypervisor. In this case, even if a container
610 application breaks out, other tenants are not affected because the
611 containers running inside the VMs cannot modify the flows in the
612 Open vSwitch integration bridge.
613 </p>
614
615 <p>
616 When multiple containers are created inside a VM, there are multiple
617 CIFs associated with them. The network traffic associated with these
 618 CIFs needs to reach the Open vSwitch integration bridge running in the
619 hypervisor for OVN to support virtual network abstractions. OVN should
620 also be able to distinguish network traffic coming from different CIFs.
621 There are two ways to distinguish network traffic of CIFs.
622 </p>
623
624 <p>
625 One way is to provide one VIF for every CIF (1:1 model). This means that
626 there could be a lot of network devices in the hypervisor. This would slow
627 down OVS because of all the additional CPU cycles needed for the management
628 of all the VIFs. It would also mean that the entity creating the
629 containers in a VM should also be able to create the corresponding VIFs in
630 the hypervisor.
631 </p>
632
633 <p>
634 The second way is to provide a single VIF for all the CIFs (1:many model).
635 OVN could then distinguish network traffic coming from different CIFs via
636 a tag written in every packet. OVN uses this mechanism and uses VLAN as
637 the tagging mechanism.
638 </p>
639
640 <ol>
641 <li>
 642 A CIF's life cycle begins when a container is spawned inside a VM by
 643 either the same CMS that created the VM, a tenant that owns that VM,
 644 or even a container orchestration system that is different from the CMS
 645 that initially created the VM. Whoever the entity is, it will need to
 646 know the <var>vif-id</var> that is associated with the network interface
 647 of the VM through which the container interface's network traffic is
 648 expected to go. The entity that creates the container interface
649 will also need to choose an unused VLAN inside that VM.
650 </li>
651
652 <li>
653 The container spawning entity (either directly or through the CMS that
654 manages the underlying infrastructure) updates the OVN Northbound
655 database to include the new CIF, by adding a row to the
656 <code>Logical_Switch_Port</code> table. In the new row,
657 <code>name</code> is any unique identifier,
658 <code>parent_name</code> is the <var>vif-id</var> of the VM
 659 through which the CIF's network traffic is expected to go,
660 and the <code>tag</code> is the VLAN tag that identifies the
661 network traffic of that CIF.
662 </li>
663
664 <li>
665 <code>ovn-northd</code> receives the OVN Northbound database update. In
666 turn, it makes the corresponding updates to the OVN Southbound database,
667 by adding rows to the OVN Southbound database's <code>Logical_Flow</code>
668 table to reflect the new port and also by creating a new row in the
669 <code>Binding</code> table and populating all its columns except the
670 column that identifies the <code>chassis</code>.
671 </li>
672
673 <li>
674 On every hypervisor, <code>ovn-controller</code> subscribes to the
 675 changes in the <code>Binding</code> table. When a new row is created
 676 by <code>ovn-northd</code> that includes a value in the
 677 <code>parent_port</code> column of the <code>Binding</code> table, the
678 <code>ovn-controller</code> in the hypervisor whose OVN integration bridge
679 has that same value in <var>vif-id</var> in
 680 <code>external_ids</code>:<code>iface-id</code>
681 updates the local hypervisor's OpenFlow tables so that packets to and
682 from the VIF with the particular VLAN <code>tag</code> are properly
683 handled. Afterward it updates the <code>chassis</code> column of
 684 the <code>Binding</code> table to reflect the physical location.
685 </li>
686
687 <li>
688 One can only start the application inside the container after the
 689 underlying network is ready. To support this, <code>ovn-northd</code>
 690 notices the updated <code>chassis</code> column in the <code>Binding</code>
 691 table and updates the <ref column="up" table="Logical_Switch_Port"
 692 db="OVN_NB"/> column in the OVN Northbound database's
 693 <ref table="Logical_Switch_Port" db="OVN_NB"/> table to indicate that the
 694 CIF is now up. The entity responsible for starting the container application
695 queries this value and starts the application.
696 </li>
697
698 <li>
 699 Eventually the entity that created and started the container stops it.
 700 The entity, through the CMS (or directly), deletes its row in the
 701 <code>Logical_Switch_Port</code> table.
702 </li>
703
704 <li>
 705 <code>ovn-northd</code> receives the OVN Northbound update and in turn
706 updates the OVN Southbound database accordingly, by removing or updating
707 the rows from the OVN Southbound database <code>Logical_Flow</code> table
708 that were related to the now-destroyed CIF. It also deletes the row in
709 the <code>Binding</code> table for that CIF.
710 </li>
711
712 <li>
713 On every hypervisor, <code>ovn-controller</code> receives the
714 <code>Logical_Flow</code> table updates that <code>ovn-northd</code> made
715 in the previous step. <code>ovn-controller</code> updates OpenFlow
716 tables to reflect the update.
717 </li>
718 </ol>
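  <p>
    As a sketch of step 2 above, the container spawning entity could create
    the CIF's row with <code>ovn-nbctl</code>, passing the parent VIF and a
    VLAN tag as extra arguments to <code>lsp-add</code> (the names and tag
    below are placeholders):
  </p>

  <pre fixed="yes">
# lsp-add SWITCH PORT PARENT TAG
$ ovn-nbctl lsp-add sw0 cif1 vif1 42
$ ovn-nbctl lsp-set-addresses cif1 "00:00:00:00:00:10"
  </pre>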
 719
 720 <h2>Architectural Physical Life Cycle of a Packet</h2>
 721
 722 <p>
723 This section describes how a packet travels from one virtual machine or
724 container to another through OVN. This description focuses on the physical
725 treatment of a packet; for a description of the logical life cycle of a
726 packet, please refer to the <code>Logical_Flow</code> table in
727 <code>ovn-sb</code>(5).
728 </p>
729
730 <p>
731 This section mentions several data and metadata fields, for clarity
732 summarized here:
733 </p>
734
735 <dl>
736 <dt>tunnel key</dt>
737 <dd>
738 When OVN encapsulates a packet in Geneve or another tunnel, it attaches
739 extra data to it to allow the receiving OVN instance to process it
740 correctly. This takes different forms depending on the particular
741 encapsulation, but in each case we refer to it here as the ``tunnel
742 key.'' See <code>Tunnel Encapsulations</code>, below, for details.
743 </dd>
744
745 <dt>logical datapath field</dt>
746 <dd>
747 A field that denotes the logical datapath through which a packet is being
748 processed.
749 <!-- Keep the following in sync with MFF_LOG_DATAPATH in
 750 ovn/lib/logical-fields.h. -->
751 OVN uses the field that OpenFlow 1.1+ simply (and confusingly) calls
752 ``metadata'' to store the logical datapath. (This field is passed across
753 tunnels as part of the tunnel key.)
754 </dd>
755
756 <dt>logical input port field</dt>
757 <dd>
758 <p>
759 A field that denotes the logical port from which the packet
760 entered the logical datapath.
761 <!-- Keep the following in sync with MFF_LOG_INPORT in
 762 ovn/lib/logical-fields.h. -->
 763 OVN stores this in Open vSwitch extension register number 14.
764 </p>
765
766 <p>
767 Geneve and STT tunnels pass this field as part of the tunnel key.
768 Although VXLAN tunnels do not explicitly carry a logical input port,
769 OVN only uses VXLAN to communicate with gateways that from OVN's
770 perspective consist of only a single logical port, so that OVN can set
771 the logical input port field to this one on ingress to the OVN logical
772 pipeline.
773 </p>
774 </dd>
775
776 <dt>logical output port field</dt>
777 <dd>
778 <p>
779 A field that denotes the logical port from which the packet will
780 leave the logical datapath. This is initialized to 0 at the
781 beginning of the logical ingress pipeline.
782 <!-- Keep the following in sync with MFF_LOG_OUTPORT in
 783 ovn/lib/logical-fields.h. -->
 784 OVN stores this in Open vSwitch extension register number 15.
785 </p>
786
787 <p>
788 Geneve and STT tunnels pass this field as part of the tunnel key.
789 VXLAN tunnels do not transmit the logical output port field.
790 Since VXLAN tunnels do not carry a logical output port field in
 791 the tunnel key, when a packet is received from a VXLAN tunnel by
 792 an OVN hypervisor, the packet is resubmitted to table 8 to
 793 determine the output port(s); when the packet reaches table 32, it
 794 is resubmitted to table 33 for local delivery by checking the
 795 MLF_RCV_FROM_VXLAN flag, which is set when the packet
 796 arrives from a VXLAN tunnel.
 797 </p>
798 </dd>
799
 800 <dt>conntrack zone field for logical ports</dt>
 801 <dd>
802 A field that denotes the connection tracking zone for logical ports.
803 The value only has local significance and is not meaningful between
804 chassis. This is initialized to 0 at the beginning of the logical
805 <!-- Keep the following in sync with MFF_LOG_CT_ZONE in
806 ovn/lib/logical-fields.h. -->
 807 ingress pipeline. OVN stores this in Open vSwitch extension register
 808 number 13.
809 </dd>
810
 811 <dt>conntrack zone fields for routers</dt>
 812 <dd>
813 Fields that denote the connection tracking zones for routers. These
814 values only have local significance and are not meaningful between
 815 chassis. OVN stores the zone information for DNATting in Open vSwitch
816 <!-- Keep the following in sync with MFF_LOG_DNAT_ZONE and
817 MFF_LOG_SNAT_ZONE in ovn/lib/logical-fields.h. -->
818 extension register number 11 and zone information for SNATing in
819 Open vSwitch extension register number 12.
820 </dd>
821
822 <dt>logical flow flags</dt>
823 <dd>
824 The logical flags are intended to handle keeping context between
825 tables in order to decide which rules in subsequent tables are
826 matched. These values only have local significance and are not
827 meaningful between chassis. OVN stores the logical flags in
828 <!-- Keep the following in sync with MFF_LOG_FLAGS in
829 ovn/lib/logical-fields.h. -->
 830 Open vSwitch extension register number 10.
831 </dd>
832
833 <dt>VLAN ID</dt>
834 <dd>
835 The VLAN ID is used as an interface between OVN and containers nested
836 inside a VM (see <code>Life Cycle of a container interface inside a
837 VM</code>, above, for more information).
838 </dd>
839 </dl>
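  <p>
    When debugging, these fields are visible in the OpenFlow tables under
    their generic Open vSwitch names: the logical datapath appears as
    <code>metadata</code> and the extension registers as <code>reg10</code>
    through <code>reg15</code>.  A sketch of how one might inspect them (the
    datapath and port values are placeholders):
  </p>

  <pre fixed="yes">
# Ingress pipeline flows for logical datapath 0x1, logical input port 0x3:
$ ovs-ofctl dump-flows br-int table=8,metadata=0x1,reg14=0x3
  </pre>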
840
841 <p>
842 Initially, a VM or container on the ingress hypervisor sends a packet on a
843 port attached to the OVN integration bridge. Then:
844 </p>
845
846 <ol>
847 <li>
848 <p>
849 OpenFlow table 0 performs physical-to-logical translation. It matches
850 the packet's ingress port. Its actions annotate the packet with
851 logical metadata, by setting the logical datapath field to identify the
852 logical datapath that the packet is traversing and the logical input
 853 port field to identify the ingress port. Then it resubmits to table 8
854 to enter the logical ingress pipeline.
855 </p>
856
857 <p>
858 Packets that originate from a container nested within a VM are treated
859 in a slightly different way. The originating container can be
860 distinguished based on the VIF-specific VLAN ID, so the
861 physical-to-logical translation flows additionally match on VLAN ID and
862 the actions strip the VLAN header. Following this step, OVN treats
863 packets from containers just like any other packets.
864 </p>
865
866 <p>
867 Table 0 also processes packets that arrive from other chassis. It
868 distinguishes them from other packets by ingress port, which is a
869 tunnel. As with packets just entering the OVN pipeline, the actions
870 annotate these packets with logical datapath and logical ingress port
871 metadata. In addition, the actions set the logical output port field,
872 which is available because in OVN tunneling occurs after the logical
873 output port is known. These three pieces of information are obtained
874 from the tunnel encapsulation metadata (see <code>Tunnel
875 Encapsulations</code> for encoding details). Then the actions resubmit
876 to table 33 to enter the logical egress pipeline.
877 </p>
878 </li>
879
880 <li>
881 <p>
 882 OpenFlow tables 8 through 31 execute the logical ingress pipeline from
883 the <code>Logical_Flow</code> table in the OVN Southbound database.
884 These tables are expressed entirely in terms of logical concepts like
885 logical ports and logical datapaths. A big part of
886 <code>ovn-controller</code>'s job is to translate them into equivalent
887 OpenFlow (in particular it translates the table numbers:
 888 <code>Logical_Flow</code> tables 0 through 23 become OpenFlow tables 8
 889 through 31).
 890 </p>
 891
892 <p>
893 Each logical flow maps to one or more OpenFlow flows. An actual packet
894 ordinarily matches only one of these, although in some cases it can
895 match more than one of these flows (which is not a problem because all
896 of them have the same actions). <code>ovn-controller</code> uses the
897 first 32 bits of the logical flow's UUID as the cookie for its OpenFlow
898 flow or flows. (This is not necessarily unique, since the first 32
 899 bits of a logical flow's UUID are not necessarily unique.)
900 </p>
901
902 <p>
903 Some logical flows can map to the Open vSwitch ``conjunctive match''
 904 extension (see <code>ovs-fields</code>(7)). Flows with a
905 <code>conjunction</code> action use an OpenFlow cookie of 0, because
906 they can correspond to multiple logical flows. The OpenFlow flow for a
907 conjunctive match includes a match on <code>conj_id</code>.
908 </p>
909
910 <p>
911 Some logical flows may not be represented in the OpenFlow tables on a
912 given hypervisor, if they could not be used on that hypervisor. For
913 example, if no VIF in a logical switch resides on a given hypervisor,
914 and the logical switch is not otherwise reachable on that hypervisor
915 (e.g. over a series of hops through logical switches and routers
916 starting from a VIF on the hypervisor), then the logical flow may not
917 be represented there.
918 </p>
919
920 <p>
921 Most OVN actions have fairly obvious implementations in OpenFlow (with
922 OVS extensions), e.g. <code>next;</code> is implemented as
923 <code>resubmit</code>, <code><var>field</var> =
924 <var>constant</var>;</code> as <code>set_field</code>. A few are worth
925 describing in more detail:
926 </p>
927
928 <dl>
929 <dt><code>output:</code></dt>
930 <dd>
931 Implemented by resubmitting the packet to table 32. If the pipeline
932 executes more than one <code>output</code> action, then each one is
933 separately resubmitted to table 32. This can be used to send
934 multiple copies of the packet to multiple ports. (If the packet was
935 not modified between the <code>output</code> actions, and some of the
936 copies are destined to the same hypervisor, then using a logical
937 multicast output port would save bandwidth between hypervisors.)
938 </dd>
939
940 <dt><code>get_arp(<var>P</var>, <var>A</var>);</code></dt>
 941 <dt><code>get_nd(<var>P</var>, <var>A</var>);</code></dt>
942 <dd>
943 <p>
944 Implemented by storing arguments into OpenFlow fields, then
 945 resubmitting to table 66, which <code>ovn-controller</code>
946 populates with flows generated from the <code>MAC_Binding</code>
947 table in the OVN Southbound database. If there is a match in table
 948 66, then its actions store the bound MAC in the Ethernet
949 destination address field.
950 </p>
951
952 <p>
953 (The OpenFlow actions save and restore the OpenFlow fields used for
954 the arguments, so that the OVN actions do not have to be aware of
955 this temporary use.)
956 </p>
957 </dd>
958
959 <dt><code>put_arp(<var>P</var>, <var>A</var>, <var>E</var>);</code></dt>
 960 <dt><code>put_nd(<var>P</var>, <var>A</var>, <var>E</var>);</code></dt>
961 <dd>
962 <p>
963 Implemented by storing the arguments into OpenFlow fields, then
964 outputting a packet to <code>ovn-controller</code>, which updates
965 the <code>MAC_Binding</code> table.
966 </p>
967
968 <p>
969 (The OpenFlow actions save and restore the OpenFlow fields used for
970 the arguments, so that the OVN actions do not have to be aware of
971 this temporary use.)
972 </p>
973 </dd>
974 </dl>
975 </li>
976
977 <li>
978 <p>
979 OpenFlow tables 32 through 47 implement the <code>output</code> action
980 in the logical ingress pipeline. Specifically, table 32 handles
981 packets to remote hypervisors, table 33 handles packets to the local
982 hypervisor, and table 34 checks whether packets whose logical ingress
983 and egress port are the same should be discarded.
984 </p>
985
986 <p>
987 Logical patch ports are a special case. Logical patch ports do not
988 have a physical location and effectively reside on every hypervisor.
989 Thus, flow table 33, for output to ports on the local hypervisor,
990 naturally implements output to unicast logical patch ports too.
991 However, applying the same logic to a logical patch port that is part
992 of a logical multicast group yields packet duplication, because each
993 hypervisor that contains a logical port in the multicast group will
994 also output the packet to the logical patch port. Thus, multicast
995 groups implement output to logical patch ports in table 32.
996 </p>
997
998 <p>
999 Each flow in table 32 matches on a logical output port for unicast or
1000 multicast logical ports that include a logical port on a remote
1001 hypervisor. Each flow's actions implement sending a packet to the port
1002 it matches. For unicast logical output ports on remote hypervisors,
1003 the actions set the tunnel key to the correct value, then send the
1004 packet on the tunnel port to the correct hypervisor. (When the remote
1005 hypervisor receives the packet, table 0 there will recognize it as a
1006 tunneled packet and pass it along to table 33.) For multicast logical
1007 output ports, the actions send one copy of the packet to each remote
1008 hypervisor, in the same way as for unicast destinations. If a
1009 multicast group includes a logical port or ports on the local
1010 hypervisor, then its actions also resubmit to table 33. Table 32 also
 1011 includes:
1012 </p>
1013
1014 <ul>
1015 <li>
1016 A higher-priority rule to match packets received from VXLAN tunnels,
1017 based on flag MLF_RCV_FROM_VXLAN, and resubmit these packets to table
1018 33 for local delivery. Packets received from VXLAN tunnels reach
1019 here because of a lack of logical output port field in the tunnel key
 1020 and thus these packets needed to be submitted to table 8 to
1021 determine the output port.
1022 </li>
1023 <li>
1024 A higher-priority rule to match packets received from ports of type
1025 <code>localport</code>, based on the logical input port, and resubmit
1026 these packets to table 33 for local delivery. Ports of type
1027 <code>localport</code> exist on every hypervisor and by definition
1028 their traffic should never go out through a tunnel.
1029 </li>
1030 <li>
1031 A fallback flow that resubmits to table 33 if there is no other
1032 match.
1033 </li>
1034 </ul>
1035
1036 <p>
1037 Flows in table 33 resemble those in table 32 but for logical ports that
 1038 reside locally rather than remotely. For unicast logical output ports
1039 on the local hypervisor, the actions just resubmit to table 34. For
1040 multicast output ports that include one or more logical ports on the
1041 local hypervisor, for each such logical port <var>P</var>, the actions
1042 change the logical output port to <var>P</var>, then resubmit to table
1043 34.
1044 </p>
1045
1046 <p>
1047 A special case is that when a localnet port exists on the datapath,
 1048 a remote port is reached by switching through the localnet port. In this
1049 case, instead of adding a flow in table 32 to reach the remote port, a
1050 flow is added in table 33 to switch the logical outport to the localnet
1051 port, and resubmit to table 33 as if it were unicasted to a logical
1052 port on the local hypervisor.
1053 </p>
1054
1055 <p>
1056 Table 34 matches and drops packets for which the logical input and
 1057 output ports are the same and the MLF_ALLOW_LOOPBACK flag is not
 1058 set. It resubmits other packets to table 40.
1059 </p>
1060 </li>
1061
1062 <li>
1063 <p>
 1064 OpenFlow tables 40 through 63 execute the logical egress pipeline from
1065 the <code>Logical_Flow</code> table in the OVN Southbound database.
1066 The egress pipeline can perform a final stage of validation before
1067 packet delivery. Eventually, it may execute an <code>output</code>
1068 action, which <code>ovn-controller</code> implements by resubmitting to
1069 table 64. A packet for which the pipeline never executes
1070 <code>output</code> is effectively dropped (although it may have been
1071 transmitted through a tunnel across a physical network).
1072 </p>
1073
1074 <p>
1075 The egress pipeline cannot change the logical output port or cause
1076 further tunneling.
1077 </p>
1078 </li>
1079
1080 <li>
1081 <p>
1082 Table 64 bypasses OpenFlow loopback when MLF_ALLOW_LOOPBACK is set.
1083 Logical loopback was handled in table 34, but OpenFlow by default also
1084 prevents loopback to the OpenFlow ingress port. Thus, when
1085 MLF_ALLOW_LOOPBACK is set, OpenFlow table 64 saves the OpenFlow ingress
1086 port, sets it to zero, resubmits to table 65 for logical-to-physical
1087 transformation, and then restores the OpenFlow ingress port,
 1088 effectively disabling OpenFlow loopback prevention. When
 1089 MLF_ALLOW_LOOPBACK is unset, the table 64 flow simply resubmits to table
1090 65.
1091 </p>
1092 </li>
1093
1094 <li>
1095 <p>
 1096 OpenFlow table 65 performs logical-to-physical translation, the
1097 opposite of table 0. It matches the packet's logical egress port. Its
1098 actions output the packet to the port attached to the OVN integration
1099 bridge that represents that logical port. If the logical egress port
 1100 is a container nested within a VM, then before sending the packet the
1101 actions push on a VLAN header with an appropriate VLAN ID.
1102 </p>
1103 </li>
1104 </ol>
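  <p>
    The cookie correspondence described above gives a practical way to map a
    physical flow back to its logical flow while troubleshooting.  A sketch,
    with a placeholder value for the first 32 bits of a logical flow's UUID:
  </p>

  <pre fixed="yes">
# List the logical flows, including their UUIDs:
$ ovn-sbctl lflow-list
# Then find the OpenFlow flows generated from one of them by matching the
# first 32 bits of its UUID against the OpenFlow cookie:
$ ovs-ofctl dump-flows br-int cookie=0x12345678/-1
  </pre>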
1105
1106 <h2>Logical Routers and Logical Patch Ports</h2>
1107
1108 <p>
1109 Typically logical routers and logical patch ports do not have a
1110 physical location and effectively reside on every hypervisor. This is
1111 the case for logical patch ports between logical routers and logical
1112 switches behind those logical routers, to which VMs (and VIFs) attach.
1113 </p>
1114
1115 <p>
1116 Consider a packet sent from one virtual machine or container to another
1117 VM or container that resides on a different subnet. The packet will
1118 traverse tables 0 to 65 as described in the previous section
1119 <code>Architectural Physical Life Cycle of a Packet</code>, using the
1120 logical datapath representing the logical switch that the sender is
1121 attached to. At table 32, the packet will use the fallback flow that
1122 resubmits locally to table 33 on the same hypervisor. In this case,
1123 all of the processing from table 0 to table 65 occurs on the hypervisor
1124 where the sender resides.
1125 </p>
1126
1127 <p>
1128 When the packet reaches table 65, the logical egress port is a logical
1129 patch port. The implementation in table 65 differs depending on the OVS
1130 version, although the observed behavior is meant to be the same:
1131 </p>
1132
1133 <ul>
1134 <li>
1135 In OVS versions 2.6 and earlier, table 65 outputs to an OVS patch
1136 port that represents the logical patch port. The packet re-enters
1137 the OpenFlow flow table from the OVS patch port's peer in table 0,
1138 which identifies the logical datapath and logical input port based
1139 on the OVS patch port's OpenFlow port number.
1140 </li>
1141
1142 <li>
1143 In OVS versions 2.7 and later, the packet is cloned and resubmitted
1144 directly to the first OpenFlow flow table in the ingress pipeline,
1145 setting the logical ingress port to the peer logical patch port, and
1146 using the peer logical patch port's logical datapath (that
1147 represents the logical router).
1148 </li>
1149 </ul>
1150
1151 <p>
1152 The packet re-enters the ingress pipeline in order to traverse tables
 1153 8 to 65 again, this time using the logical datapath representing the
1154 logical router. The processing continues as described in the previous
1155 section <code>Architectural Physical Life Cycle of a Packet</code>.
 1156 When the packet reaches table 65, the logical egress port will once
1157 again be a logical patch port. In the same manner as described above,
1158 this logical patch port will cause the packet to be resubmitted to
 1159 OpenFlow tables 8 to 65, this time using the logical datapath
1160 representing the logical switch that the destination VM or container
1161 is attached to.
1162 </p>
1163
1164 <p>
 1165 The packet traverses tables 8 to 65 a third and final time. If the
1166 destination VM or container resides on a remote hypervisor, then table
1167 32 will send the packet on a tunnel port from the sender's hypervisor
1168 to the remote hypervisor. Finally table 65 will output the packet
1169 directly to the destination VM or container.
1170 </p>
1171
1172 <p>
1173 The following sections describe two exceptions, where logical routers
1174 and/or logical patch ports are associated with a physical location.
1175 </p>
1176
1177 <h3>Gateway Routers</h3>
1178
1179 <p>
1180 A <dfn>gateway router</dfn> is a logical router that is bound to a
1181 physical location. This includes all of the logical patch ports of
1182 the logical router, as well as all of the peer logical patch ports on
1183 logical switches. In the OVN Southbound database, the
1184 <code>Port_Binding</code> entries for these logical patch ports use
1185 the type <code>l3gateway</code> rather than <code>patch</code>, in
1186 order to distinguish that these logical patch ports are bound to a
1187 chassis.
1188 </p>
1189
1190 <p>
1191 When a hypervisor processes a packet on a logical datapath
1192 representing a logical switch, and the logical egress port is a
1193 <code>l3gateway</code> port representing connectivity to a gateway
1194 router, the packet will match a flow in table 32 that sends the
1195 packet on a tunnel port to the chassis where the gateway router
1196 resides. This processing in table 32 is done in the same manner as
1197 for VIFs.
1198 </p>
1199
1200 <p>
1201 Gateway routers are typically used in between distributed logical
1202 routers and physical networks. The distributed logical router and
1203 the logical switches behind it, to which VMs and containers attach,
1204 effectively reside on each hypervisor. The distributed router and
1205 the gateway router are connected by another logical switch, sometimes
1206 referred to as a <code>join</code> logical switch. On the other
1207 side, the gateway router connects to another logical switch that has
1208 a localnet port connecting to the physical network.
1209 </p>
1210
1211 <p>
1212 When using gateway routers, DNAT and SNAT rules are associated with
1213 the gateway router, which provides a central location that can handle
1214 one-to-many SNAT (aka IP masquerading).
1215 </p>
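  <p>
    For illustration, a logical router becomes a gateway router when it is
    pinned to a particular chassis in the northbound database; the router and
    chassis names below are placeholders:
  </p>

  <pre fixed="yes">
$ ovn-nbctl lr-add gw0
$ ovn-nbctl set Logical_Router gw0 options:chassis=chassis-1
  </pre>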
1216
1217 <h3>Distributed Gateway Ports</h3>
1218
1219 <p>
1220 <dfn>Distributed gateway ports</dfn> are logical router patch ports
1221 that directly connect distributed logical routers to logical
1222 switches with localnet ports.
1223 </p>
1224
1225 <p>
1226 The primary design goal of distributed gateway ports is to allow as
1227 much traffic as possible to be handled locally on the hypervisor
1228 where a VM or container resides. Whenever possible, packets from
1229 the VM or container to the outside world should be processed
1230 completely on that VM's or container's hypervisor, eventually
1231 traversing a localnet port instance on that hypervisor to the
1232 physical network. Whenever possible, packets from the outside
1233 world to a VM or container should be directed through the physical
1234 network directly to the VM's or container's hypervisor, where the
1235 packet will enter the integration bridge through a localnet port.
1236 </p>
1237
1238 <p>
1239 In order to allow for the distributed processing of packets
1240 described in the paragraph above, distributed gateway ports need to
1241 be logical patch ports that effectively reside on every hypervisor,
1242 rather than <code>l3gateway</code> ports that are bound to a
1243 particular chassis. However, the flows associated with distributed
1244 gateway ports often need to be associated with physical locations,
1245 for the following reasons:
1246 </p>
1247
1248 <ul>
1249 <li>
1250 <p>
1251 The physical network that the localnet port is attached to
1252 typically uses L2 learning. Any Ethernet address used over the
1253 distributed gateway port must be restricted to a single physical
1254 location so that upstream L2 learning is not confused. Traffic
1255 sent out the distributed gateway port towards the localnet port
1256 with a specific Ethernet address must be sent out one specific
1257 instance of the distributed gateway port on one specific
1258 chassis. Traffic received from the localnet port (or from a VIF
1259 on the same logical switch as the localnet port) with a specific
1260 Ethernet address must be directed to the logical switch's patch
1261 port instance on that specific chassis.
1262 </p>
1263
1264 <p>
1265 Due to the implications of L2 learning, the Ethernet address and
1266 IP address of the distributed gateway port need to be restricted
1267 to a single physical location. For this reason, the user must
1268 specify one chassis associated with the distributed gateway
1269 port. Note that traffic traversing the distributed gateway port
1270 using other Ethernet addresses and IP addresses (e.g. one-to-one
1271 NAT) is not restricted to this chassis.
1272 </p>
1273
1274 <p>
1275 Replies to ARP and ND requests must be restricted to a single
1276 physical location, where the Ethernet address in the reply
1277 resides. This includes ARP and ND replies for the IP address
1278 of the distributed gateway port, which are restricted to the
1279 chassis that the user associated with the distributed gateway
1280 port.
1281 </p>
1282 </li>
1283
1284 <li>
1285 In order to support one-to-many SNAT (aka IP masquerading), where
1286 multiple logical IP addresses spread across multiple chassis are
1287 mapped to a single external IP address, it will be necessary to
1288 handle some of the logical router processing on a specific chassis
        in a centralized manner.  Since the SNAT external IP address is
        typically the distributed gateway port's IP address, for simplicity
        the same chassis that is associated with the distributed gateway
        port is used.
1293 </li>
1294 </ul>
1295
1296 <p>
1297 The details of flow restrictions to specific chassis are described
1298 in the <code>ovn-northd</code> documentation.
1299 </p>
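  <p>
    As a sketch of how the chassis association described above might be
    configured (the router, port, and chassis names are hypothetical, and
    the <code>redirect-chassis</code> option key is assumed here; the
    mechanism may differ between OVN versions): "<code>ovn-nbctl lrp-add
    lr0 lr0-public 00:00:20:20:12:13 192.0.2.1/24</code>" creates the
    logical router port, and "<code>ovn-nbctl set Logical_Router_Port
    lr0-public options:redirect-chassis=chassis-1</code>" names the single
    chassis, turning the port into a distributed gateway port.
  </p>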
1300
1301 <p>
1302 While most of the physical location dependent aspects of distributed
1303 gateway ports can be handled by restricting some flows to specific
1304 chassis, one additional mechanism is required. When a packet
1305 leaves the ingress pipeline and the logical egress port is the
1306 distributed gateway port, one of two different sets of actions is
1307 required at table 32:
1308 </p>
1309
1310 <ul>
1311 <li>
1312 If the packet can be handled locally on the sender's hypervisor
1313 (e.g. one-to-one NAT traffic), then the packet should just be
1314 resubmitted locally to table 33, in the normal manner for
1315 distributed logical patch ports.
1316 </li>
1317
1318 <li>
1319 However, if the packet needs to be handled on the chassis
1320 associated with the distributed gateway port (e.g. one-to-many
1321 SNAT traffic or non-NAT traffic), then table 32 must send the
1322 packet on a tunnel port to that chassis.
1323 </li>
1324 </ul>
1325
1326 <p>
1327 In order to trigger the second set of actions, the
1328 <code>chassisredirect</code> type of southbound
    <code>Port_Binding</code> has been added.  Setting the logical
    egress port to a logical port of type <code>chassisredirect</code> is
    simply a way to indicate that although the packet is destined for
1332 the distributed gateway port, it needs to be redirected to a
1333 different chassis. At table 32, packets with this logical egress
1334 port are sent to a specific chassis, in the same way that table 32
1335 directs packets whose logical egress port is a VIF or a type
1336 <code>l3gateway</code> port to different chassis. Once the packet
1337 arrives at that chassis, table 33 resets the logical egress port to
1338 the value representing the distributed gateway port. For each
1339 distributed gateway port, there is one type
1340 <code>chassisredirect</code> port, in addition to the distributed
1341 logical patch port representing the distributed gateway port.
1342 </p>
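  <p>
    The resulting state can be inspected from the southbound database and
    the integration bridge (a sketch; output formats vary by version):
    "<code>ovn-sbctl find Port_Binding type=chassisredirect</code>" lists
    the <code>chassisredirect</code> port bindings together with the
    chassis each one is bound to, and "<code>ovs-ofctl dump-flows br-int
    table=32</code>" on a hypervisor shows the flows that tunnel such
    packets to that chassis (assuming the integration bridge is named
    <code>br-int</code>).
  </p>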
1343
  <h2>Life Cycle of a VTEP Gateway</h2>
1345
1346 <p>
1347 A gateway is a chassis that forwards traffic between the OVN-managed
1348 part of a logical network and a physical VLAN, extending a
1349 tunnel-based logical network into a physical network.
1350 </p>
1351
1352 <p>
    The steps below often refer to details of the OVN and VTEP database
    schemas.  Please see <code>ovn-sb</code>(5), <code>ovn-nb</code>(5),
    and <code>vtep</code>(5) for the full story on these databases.
1357 </p>
1358
1359 <ol>
1360 <li>
1361 A VTEP gateway's life cycle begins with the administrator registering
1362 the VTEP gateway as a <code>Physical_Switch</code> table entry in the
1363 <code>VTEP</code> database. The <code>ovn-controller-vtep</code>
      connected to this VTEP database will recognize the new VTEP gateway
1365 and create a new <code>Chassis</code> table entry for it in the
1366 <code>OVN_Southbound</code> database.
1367 </li>
1368
1369 <li>
1370 The administrator can then create a new <code>Logical_Switch</code>
      table entry, and bind a particular VLAN on a VTEP gateway's port to
1372 any VTEP logical switch. Once a VTEP logical switch is bound to
1373 a VTEP gateway, the <code>ovn-controller-vtep</code> will detect
1374 it and add its name to the <var>vtep_logical_switches</var>
1375 column of the <code>Chassis</code> table in the <code>
      OVN_Southbound</code> database.  Note that the <var>tunnel_key</var>
      column of the VTEP logical switch is not filled at creation; the
      <code>ovn-controller-vtep</code> will set the column when the
      corresponding VTEP logical switch is bound to an OVN logical network.
      (The first few steps of this procedure are sketched in the example
      following this list.)
1380 </li>
1381
1382 <li>
1383 Now, the administrator can use the CMS to add a VTEP logical switch
1384 to the OVN logical network. To do that, the CMS must first create a
      new <code>Logical_Switch_Port</code> table entry in the <code>
1386 OVN_Northbound</code> database. Then, the <var>type</var> column
1387 of this entry must be set to "vtep". Next, the <var>
1388 vtep-logical-switch</var> and <var>vtep-physical-switch</var> keys
1389 in the <var>options</var> column must also be specified, since
1390 multiple VTEP gateways can attach to the same VTEP logical switch.
1391 </li>
1392
1393 <li>
1394 The newly created logical port in the <code>OVN_Northbound</code>
1395 database and its configuration will be passed down to the <code>
1396 OVN_Southbound</code> database as a new <code>Port_Binding</code>
1397 table entry. The <code>ovn-controller-vtep</code> will recognize the
1398 change and bind the logical port to the corresponding VTEP gateway
      chassis.  Binding the same VTEP logical switch to different OVN
      logical networks is not allowed, and a warning will be generated in
      the log.
1402 </li>
1403
1404 <li>
      Besides binding to the VTEP gateway chassis, the <code>
1406 ovn-controller-vtep</code> will update the <var>tunnel_key</var>
1407 column of the VTEP logical switch to the corresponding <code>
1408 Datapath_Binding</code> table entry's <var>tunnel_key</var> for the
1409 bound OVN logical network.
1410 </li>
1411
1412 <li>
      Next, the <code>ovn-controller-vtep</code> will keep reacting to
      configuration changes to the <code>Port_Binding</code> table in the
      <code>OVN_Southbound</code> database, updating the
      <code>Ucast_Macs_Remote</code> table in the <code>VTEP</code>
      database accordingly.
1417 This allows the VTEP gateway to understand where to forward the unicast
1418 traffic coming from the extended external network.
1419 </li>
1420
1421 <li>
1422 Eventually, the VTEP gateway's life cycle ends when the administrator
1423 unregisters the VTEP gateway from the <code>VTEP</code> database.
1424 The <code>ovn-controller-vtep</code> will recognize the event and
1425 remove all related configurations (<code>Chassis</code> table entry
1426 and port bindings) in the <code>OVN_Southbound</code> database.
1427 </li>
1428
1429 <li>
1430 When the <code>ovn-controller-vtep</code> is terminated, all related
1431 configurations in the <code>OVN_Southbound</code> database and
      the <code>VTEP</code> database will be cleaned up, including
1433 <code>Chassis</code> table entries for all registered VTEP gateways
1434 and their port bindings, and all <code>Ucast_Macs_Remote</code> table
1435 entries and the <code>Logical_Switch</code> tunnel keys.
1436 </li>
1437 </ol>
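  <p>
    As a rough sketch of the first few steps above (all switch, port,
    VLAN, and logical network names here are hypothetical):
    "<code>vtep-ctl add-ps tor1</code>" and "<code>vtep-ctl add-port tor1
    p0</code>" register the gateway and one of its ports in the
    <code>VTEP</code> database, "<code>vtep-ctl add-ls vtep-ls0</code>"
    and "<code>vtep-ctl bind-ls tor1 p0 100 vtep-ls0</code>" bind VLAN 100
    on that port to a VTEP logical switch, and "<code>ovn-nbctl lsp-add
    sw0 sw0-vtep</code>", "<code>ovn-nbctl lsp-set-type sw0-vtep
    vtep</code>", and "<code>ovn-nbctl lsp-set-options sw0-vtep
    vtep-physical-switch=tor1 vtep-logical-switch=vtep-ls0</code>" attach
    that VTEP logical switch to an OVN logical network.
  </p>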
1438
1439 <h1>Security</h1>
1440
  <h2>Role-Based Access Controls for the Southbound DB</h2>
1442 <p>
    To provide additional security in case an OVN chassis is compromised in
    a way that allows rogue software to make arbitrary modifications to
    southbound database state, and thus to disrupt the OVN network,
    role-based access controls (see <code>ovsdb-server</code>(1) for
    additional details) are provided for the southbound database.
1449 </p>
1450
1451 <p>
1452 The implementation of role-based access controls (RBAC) requires the
    addition of two tables to an OVSDB schema: the <code>RBAC_Role</code>
    table, which is indexed by role name and maps the names of the various
    tables that may be modifiable for a given role to individual rows in a
    permissions table containing detailed permission information for that
    role, and the permissions table itself, which consists of rows
    containing the following information:
1459 </p>
1460 <dl>
1461 <dt><code>Table Name</code></dt>
1462 <dd>
1463 The name of the associated table. This column exists primarily as an
1464 aid for humans reading the contents of this table.
1465 </dd>
1466
1467 <dt><code>Auth Criteria</code></dt>
1468 <dd>
1469 A set of strings containing the names of columns (or column:key pairs
1470 for columns containing string:string maps). The contents of at least
1471 one of the columns or column:key values in a row to be modified,
1472 inserted, or deleted must be equal to the ID of the client attempting
1473 to act on the row in order for the authorization check to pass. If the
      authorization criteria are empty, authorization checking is disabled and
1475 all clients for the role will be treated as authorized.
1476 </dd>
1477
1478 <dt><code>Insert/Delete</code></dt>
1479 <dd>
1480 Row insertion/deletion permission; boolean value indicating whether
1481 insertion and deletion of rows is allowed for the associated table.
1482 If true, insertion and deletion of rows is allowed for authorized
1483 clients.
1484 </dd>
1485
1486 <dt><code>Updatable Columns</code></dt>
1487 <dd>
1488 A set of strings containing the names of columns or column:key pairs
1489 that may be updated or mutated by authorized clients. Modifications to
1490 columns within a row are only permitted when the authorization check
1491 for the client passes and all columns to be modified are included in
1492 this set of modifiable columns.
1493 </dd>
1494 </dl>
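  <p>
    In the OVN southbound schema these appear as the
    <code>RBAC_Role</code> and <code>RBAC_Permission</code> tables, so the
    configured permissions can be examined with the generic database
    commands, e.g. "<code>ovn-sbctl list RBAC_Role</code>" and
    "<code>ovn-sbctl list RBAC_Permission</code>" (shown here as an
    illustration; see <code>ovn-sb</code>(5) for the authoritative
    definitions).
  </p>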
1495
1496 <p>
1497 RBAC configuration for the OVN southbound database is maintained by
1498 ovn-northd. With RBAC enabled, modifications are only permitted for the
1499 <code>Chassis</code>, <code>Encap</code>, <code>Port_Binding</code>, and
    <code>MAC_Binding</code> tables, and are restricted as follows:
1501 </p>
1502 <dl>
1503 <dt><code>Chassis</code></dt>
1504 <dd>
1505 <p>
1506 <code>Authorization</code>: client ID must match the chassis name.
1507 </p>
1508 <p>
1509 <code>Insert/Delete</code>: authorized row insertion and deletion
1510 are permitted.
1511 </p>
1512 <p>
1513 <code>Update</code>: The columns <code>nb_cfg</code>,
1514 <code>external_ids</code>, <code>encaps</code>, and
1515 <code>vtep_logical_switches</code> may be modified when authorized.
1516 </p>
1517 </dd>
1518
1519 <dt><code>Encap</code></dt>
1520 <dd>
1521 <p>
        <code>Authorization</code>: disabled (all clients are considered
        to be authorized).  A possible future enhancement is to add a
        "creating chassis name" column to this table and use it for
        authorization checking.
1525 </p>
1526 <p>
1527 <code>Insert/Delete</code>: row insertion and row deletion
1528 are permitted.
1529 </p>
1530 <p>
1531 <code>Update</code>: The columns <code>type</code>,
1532 <code>options</code>, and <code>ip</code> can be modified.
1533 </p>
1534 </dd>
1535
1536 <dt><code>Port_Binding</code></dt>
1537 <dd>
1538 <p>
        <code>Authorization</code>: disabled (all clients are considered
        authorized).  A future enhancement may add columns (or keys to
        <code>external_ids</code>) in order to control which chassis are
        allowed to bind each port.
1543 </p>
1544 <p>
1545 <code>Insert/Delete</code>: row insertion/deletion are not permitted
        (ovn-northd maintains the rows in this table).
1547 </p>
1548 <p>
1549 <code>Update</code>: Only modifications to the <code>chassis</code>
1550 column are permitted.
1551 </p>
1552 </dd>
1553
1554 <dt><code>MAC_Binding</code></dt>
1555 <dd>
1556 <p>
1557 <code>Authorization</code>: disabled (all clients are considered
1558 to be authorized).
1559 </p>
1560 <p>
1561 <code>Insert/Delete</code>: row insertion/deletion are permitted.
1562 </p>
1563 <p>
1564 <code>Update</code>: The columns <code>logical_port</code>,
1565 <code>ip</code>, <code>mac</code>, and <code>datapath</code> may be
1566 modified by ovn-controller.
1567 </p>
1568 </dd>
1569 </dl>
1570
1571 <p>
1572 Enabling RBAC for ovn-controller connections to the southbound database
1573 requires the following steps:
1574 </p>
1575
1576 <ol>
1577 <li>
1578 Creating SSL certificates for each chassis with the certificate CN field
1579 set to the chassis name (e.g. for a chassis with
1580 <code>external-ids:system-id=chassis-1</code>, via the command
1581 "<code>ovs-pki -B 1024 -u req+sign chassis-1 switch</code>").
1582 </li>
1583 <li>
1584 Configuring each ovn-controller to use SSL when connecting to the
1585 southbound database (e.g. via "<code>ovs-vsctl set open .
1586 external-ids:ovn-remote=ssl:x.x.x.x:6642</code>").
1587 </li>
1588 <li>
1589 Configuring a southbound database SSL remote with "ovn-controller" role
1590 (e.g. via "<code>ovn-sbctl set-connection role=ovn-controller
1591 pssl:6642</code>").
1592 </li>
1593 </ol>
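  <p>
    As a quick check (a sketch; output details vary by version),
    "<code>ovn-sbctl get-connection</code>" should report the
    <code>ovn-controller</code> role on the <code>pssl:6642</code>
    connection, and a chassis whose certificate CN does not match its
    <code>external-ids:system-id</code> should find its southbound
    transactions rejected rather than applied.
  </p>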
1594
1595 <h1>Design Decisions</h1>
1596
1597 <h2>Tunnel Encapsulations</h2>
1598
1599 <p>
1600 OVN annotates logical network packets that it sends from one hypervisor to
1601 another with the following three pieces of metadata, which are encoded in
1602 an encapsulation-specific fashion:
1603 </p>
1604
1605 <ul>
1606 <li>
1607 24-bit logical datapath identifier, from the <code>tunnel_key</code>
1608 column in the OVN Southbound <code>Datapath_Binding</code> table.
1609 </li>
1610
1611 <li>
1612 15-bit logical ingress port identifier. ID 0 is reserved for internal
1613 use within OVN. IDs 1 through 32767, inclusive, may be assigned to
1614 logical ports (see the <code>tunnel_key</code> column in the OVN
1615 Southbound <code>Port_Binding</code> table).
1616 </li>
1617
1618 <li>
1619 16-bit logical egress port identifier. IDs 0 through 32767 have the same
1620 meaning as for logical ingress ports. IDs 32768 through 65535,
1621 inclusive, may be assigned to logical multicast groups (see the
1622 <code>tunnel_key</code> column in the OVN Southbound
1623 <code>Multicast_Group</code> table).
1624 </li>
1625 </ul>
1626
1627 <p>
1628 For hypervisor-to-hypervisor traffic, OVN supports only Geneve and STT
1629 encapsulations, for the following reasons:
1630 </p>
1631
1632 <ul>
1633 <li>
1634 Only STT and Geneve support the large amounts of metadata (over 32 bits
1635 per packet) that OVN uses (as described above).
1636 </li>
1637
1638 <li>
      STT and Geneve use randomized UDP or TCP source ports, allowing
1640 efficient distribution among multiple paths in environments that use ECMP
1641 in their underlay.
1642 </li>
1643
1644 <li>
1645 NICs are available to offload STT and Geneve encapsulation and
1646 decapsulation.
1647 </li>
1648 </ul>
1649
1650 <p>
1651 Due to its flexibility, the preferred encapsulation between hypervisors is
1652 Geneve. For Geneve encapsulation, OVN transmits the logical datapath
1653 identifier in the Geneve VNI.
1654
1655 <!-- Keep the following in sync with ovn/controller/physical.h. -->
1656 OVN transmits the logical ingress and logical egress ports in a TLV with
    class 0x0102, type 0x80, and a 32-bit value encoded as follows, from MSB to
1658 LSB:
1659 </p>
1660
1661 <diagram>
1662 <header name="">
1663 <bits name="rsv" above="1" below="0" width=".25"/>
1664 <bits name="ingress port" above="15" width=".75"/>
1665 <bits name="egress port" above="16" width=".75"/>
1666 </header>
1667 </diagram>
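  <p>
    For example, a packet on a logical datapath whose
    <code>tunnel_key</code> is 7, with logical ingress port 5 and logical
    egress port 9, would be sent with Geneve VNI 7 and the option value
    0x00050009 (ingress port in bits 16 through 30, egress port in the low
    16 bits).
  </p>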
1668
1669 <p>
1670 Environments whose NICs lack Geneve offload may prefer STT encapsulation
1671 for performance reasons. For STT encapsulation, OVN encodes all three
1672 pieces of logical metadata in the STT 64-bit tunnel ID as follows, from MSB
1673 to LSB:
1674 </p>
1675
1676 <diagram>
1677 <header name="">
1678 <bits name="reserved" above="9" below="0" width=".5"/>
1679 <bits name="ingress port" above="15" width=".75"/>
1680 <bits name="egress port" above="16" width=".75"/>
1681 <bits name="datapath" above="24" width="1.25"/>
1682 </header>
1683 </diagram>
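  <p>
    Continuing the same example, with STT the 64-bit tunnel ID would be
    0x0000050009000007: ingress port 5 in bits 40 through 54, egress port
    9 in bits 24 through 39, and logical datapath 7 in the low 24 bits.
  </p>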
1684
  <p>
1686 For connecting to gateways, in addition to Geneve and STT, OVN supports
1687 VXLAN, because only VXLAN support is common on top-of-rack (ToR) switches.
    Currently, gateways have a feature set that matches the capabilities
    defined by the VTEP schema, so fewer bits of metadata are necessary.  In
1690 the future, gateways that do not support encapsulations with large amounts
1691 of metadata may continue to have a reduced feature set.
  </p>
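  <p>
    The encapsulations that a given chassis offers are configured through
    <code>ovn-controller</code>'s settings in the local Open vSwitch
    database, e.g. "<code>ovs-vsctl set open .
    external-ids:ovn-encap-type=geneve,vxlan
    external-ids:ovn-encap-ip=192.0.2.10</code>" (the address is
    illustrative); see <code>ovn-controller</code>(8) for details.
  </p>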
</manpage>